Conversation

haitwang-cloud
Contributor

@haitwang-cloud haitwang-cloud commented Jan 16, 2025

What type of PR is this?

/kind design
What this PR does / why we need it:
This pull request improves logging, refactors code for readability, and expands test coverage. The most important changes are migrating logging statements to structured key/value logging, refactoring code for clarity, and adding new test cases.

Logging Improvements:

Code Refactoring:

Test Enhancements:

These changes collectively enhance the robustness, readability, and maintainability of the codebase while providing more detailed and structured logging for easier debugging and monitoring.
Which issue(s) this PR fixes:
Fixes # Avoid the client init error by adding unit tests & a singleton pattern

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

@haitwang-cloud
Contributor Author

Attaching the scheduler logs after this PR's fix:

I0121 02:17:54.850403       1 util.go:303] checklist is [map[DCU:hami.io/dcu-devices-allocated Iluvatar:hami.io/iluvatar-vgpu-devices-allocated MLU:hami.io/cambricon-mlu-devices-allocated Metax:hami.io/metax-gpu-devices-allocated Mthreads:hami.io/mthreads-vgpu-devices-allocated NVIDIA:hami.io/vgpu-devices-allocated]], annos is [map[cni.projectcalico.org/containerID:9a6f90a87396138de769ffbe49360d904261236356fccf19c79726a65888d16b cni.projectcalico.org/podIP:100.96.0.157/32 cni.projectcalico.org/podIPs:100.96.0.157/32 hami.io/bind-phase:success hami.io/bind-time:1737425866 hami.io/vgpu-devices-allocated:GPU-59f8a413-1c74-960f-18e3-d32f52626071,NVIDIA,3000,25:; hami.io/vgpu-devices-to-allocate:; hami.io/vgpu-node:shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp hami.io/vgpu-time:1737425866 istio.io/rev:default kubectl.kubernetes.io/default-container:nanogpt kubectl.kubernetes.io/default-logs-container:nanogpt prometheus.io/path:/stats/prometheus prometheus.io/port:15020 prometheus.io/scrape:true sidecar.istio.io/status:{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["workload-socket","credential-socket","workload-certs","istio-envoy","istio-data","istio-podinfo","istio-token","istiod-ca-cert"],"imagePullSecrets":null,"revision":"default"}]]
I0121 02:17:54.850531       1 util.go:278] Start to decode container device GPU-59f8a413-1c74-960f-18e3-d32f52626071,NVIDIA,3000,25:
I0121 02:17:54.850550       1 util.go:298] Finished decoding container devices. Total devices: 1
I0121 02:17:54.850585       1 util.go:325] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-59f8a413-1c74-960f-18e3-d32f52626071","Type":"NVIDIA","Usedmem":3000,"Usedcores":25}]]}
I0121 02:17:54.850635       1 pods.go:78] "Pod devices updated" pod="kbf-i742968/nanogpt-0" namespace="kbf-i742968" name="nanogpt-0" devices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-59f8a413-1c74-960f-18e3-d32f52626071","Type":"NVIDIA","Usedmem":3000,"Usedcores":25}]]}
I0121 02:17:55.420883       1 scheduler.go:99] "Pod added" pod="nanogpt-0" namespace="kbf-i742968"
I0121 02:17:55.420920       1 util.go:303] checklist is [map[DCU:hami.io/dcu-devices-allocated Iluvatar:hami.io/iluvatar-vgpu-devices-allocated MLU:hami.io/cambricon-mlu-devices-allocated Metax:hami.io/metax-gpu-devices-allocated Mthreads:hami.io/mthreads-vgpu-devices-allocated NVIDIA:hami.io/vgpu-devices-allocated]], annos is [map[cni.projectcalico.org/containerID:9a6f90a87396138de769ffbe49360d904261236356fccf19c79726a65888d16b cni.projectcalico.org/podIP:100.96.0.157/32 cni.projectcalico.org/podIPs:100.96.0.157/32 hami.io/bind-phase:success hami.io/bind-time:1737425866 hami.io/vgpu-devices-allocated:GPU-59f8a413-1c74-960f-18e3-d32f52626071,NVIDIA,3000,25:; hami.io/vgpu-devices-to-allocate:; hami.io/vgpu-node:shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp hami.io/vgpu-time:1737425866 istio.io/rev:default kubectl.kubernetes.io/default-container:nanogpt kubectl.kubernetes.io/default-logs-container:nanogpt prometheus.io/path:/stats/prometheus prometheus.io/port:15020 prometheus.io/scrape:true sidecar.istio.io/status:{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["workload-socket","credential-socket","workload-certs","istio-envoy","istio-data","istio-podinfo","istio-token","istiod-ca-cert"],"imagePullSecrets":null,"revision":"default"}]]
I0121 02:17:55.420978       1 util.go:278] Start to decode container device GPU-59f8a413-1c74-960f-18e3-d32f52626071,NVIDIA,3000,25:
I0121 02:17:55.420993       1 util.go:298] Finished decoding container devices. Total devices: 1
I0121 02:17:55.421023       1 util.go:325] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-59f8a413-1c74-960f-18e3-d32f52626071","Type":"NVIDIA","Usedmem":3000,"Usedcores":25}]]}
I0121 02:17:55.421051       1 pods.go:78] "Pod devices updated" pod="kbf-i742968/nanogpt-0" namespace="kbf-i742968" name="nanogpt-0" devices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-59f8a413-1c74-960f-18e3-d32f52626071","Type":"NVIDIA","Usedmem":3000,"Usedcores":25}]]}
I0121 02:17:55.613563       1 scheduler.go:99] "Pod added" pod="language-detection-postgresql-0" namespace="language-detection"
I0121 02:17:55.819818       1 metrics.go:65] Starting to collect metrics for scheduler
I0121 02:17:55.820109       1 pods.go:153] "Retrieved scheduled pods" podCount=6
I0121 02:17:55.820145       1 metrics.go:192] "Collecting metrics" namespace="kbf-i744932" podName="pyspark-img-v12-0" deviceUUID="GPU-c5d16691-bb12-0aad-bb52-9238619d1ad6" usedCores=25 usedMem=3000 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820184       1 metrics.go:225] "Total memory for device" deviceUUID="GPU-c5d16691-bb12-0aad-bb52-9238619d1ad6" totalMemory=16384 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820228       1 metrics.go:192] "Collecting metrics" namespace="sir-service" podName="sir-core-service-66b8c9f8c8-zzzbn" deviceUUID="GPU-c5d16691-bb12-0aad-bb52-9238619d1ad6" usedCores=25 usedMem=3000 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820259       1 metrics.go:225] "Total memory for device" deviceUUID="GPU-c5d16691-bb12-0aad-bb52-9238619d1ad6" totalMemory=16384 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820304       1 metrics.go:192] "Collecting metrics" namespace="dask" podName="worker-777959db8f-gfls9" deviceUUID="GPU-59f8a413-1c74-960f-18e3-d32f52626071" usedCores=25 usedMem=8000 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820338       1 metrics.go:225] "Total memory for device" deviceUUID="GPU-59f8a413-1c74-960f-18e3-d32f52626071" totalMemory=16384 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820376       1 metrics.go:192] "Collecting metrics" namespace="ism" podName="deploy-ism-embedding-8b49866d-hcfz4" deviceUUID="GPU-a346704d-b53d-8ef3-eb65-113a9d671130" usedCores=25 usedMem=3000 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820406       1 metrics.go:225] "Total memory for device" deviceUUID="GPU-a346704d-b53d-8ef3-eb65-113a9d671130" totalMemory=16384 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820448       1 metrics.go:192] "Collecting metrics" namespace="ism" podName="dev-deploy-ism-embedding-849978f877-dvzf8" deviceUUID="GPU-a346704d-b53d-8ef3-eb65-113a9d671130" usedCores=25 usedMem=3000 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820473       1 metrics.go:225] "Total memory for device" deviceUUID="GPU-a346704d-b53d-8ef3-eb65-113a9d671130" totalMemory=16384 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820507       1 metrics.go:192] "Collecting metrics" namespace="kbf-i742968" podName="nanogpt-0" deviceUUID="GPU-59f8a413-1c74-960f-18e3-d32f52626071" usedCores=25 usedMem=3000 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"
I0121 02:17:55.820534       1 metrics.go:225] "Total memory for device" deviceUUID="GPU-59f8a413-1c74-960f-18e3-d32f52626071" totalMemory=16384 nodeID="shoot--ait--datalake-nonprod-c32m384-v100g3-z1-977d9-ctcpp"

@archlitchi
Member

Thanks:) /lgtm

@archlitchi archlitchi merged commit 7fe6183 into Project-HAMi:master Jan 21, 2025
11 checks passed
@haitwang-cloud haitwang-cloud deleted the refine-metrics-logs branch January 21, 2025 06:31