Conversation

AllenXu93
Contributor

Fix multiprocess_utilization_watcher;
In the previous version, summonitor was the sum of all processes' memory in one container, which caused the reported memory of every process except the first to be wrong.

Contributor

@chaunceyjiang chaunceyjiang left a comment


> In the previous version, summonitor was the sum of all processes' memory in one container, which caused the reported memory of every process except the first to be wrong.

Could you remind me how to reproduce this, so I can test this PR locally?

Or can you provide a comparison of the behavior before and after this fix?

@AllenXu93
Contributor Author

> In the previous version, summonitor was the sum of all processes' memory in one container, which caused the reported memory of every process except the first to be wrong.

> Could you remind me how to reproduce this, so I can test this PR locally?
>
> Or can you provide a comparison of the behavior before and after this fix?

There are some problems when running multiple processes in one container.
For example, I ran one process that allocates 100 MiB of memory (about 324 MiB in total, including the CUDA context). In the debug log, we can see the monitor record the memory:

```
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141423 0 339738624->339738624
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141423 0 339738624->339738624
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141423 0 339738624->339738624
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141423 0 339738624->339738624
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141423 0 339738624->339738624
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141423 0 339738624->339738624
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141423 0 339738624->339738624
```

Then I ran another process in the same container, and the log is:

```
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141488 0 339738624->339738624
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141423 0 339738624->679477248
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141488 0 339738624->339738624
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141423 0 339738624->679477248
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141488 0 339738624->339738624
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141423 0 339738624->679477248
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141488 0 339738624->339738624
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141423 0 339738624->679477248
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141488 0 339738624->339738624
[HAMI-core Info(2204:140316929261568:multiprocess_memory_limit.c:311)]: set_gpu_device_memory_monitor:3141423 0 339738624->679477248
```

The problem is that in the NVML result, the second process gets both processes' memory recorded as its own allocated memory, because the code reports summonitor, which is the sum of the memory of all processes in this container:

```c
    sum += processes_sample[i].smUtil;
    summonitor += infos[i].usedGpuMemory;
    //LOG_WARN("monitorused=%lld %d %d %d",infos[i].usedGpuMemory,proc->hostpid,proc->pid,pidfound);
    //LOG_WARN("smutil=%d %d %lu %u %u %u\n",virtual_map[devi],devi,summonitor,processes_sample[i].smUtil,processes_sample[i].encUtil,processes_sample[i].decUtil);
    //}
}
set_gpu_device_memory_monitor(processes_sample[i].pid,cudadev,summonitor);
set_gpu_device_sm_utilization(processes_sample[i].pid,cudadev,processes_sample[i].smUtil);
```

Secondly, if nvmlDeviceGetProcessUtilization and nvmlDeviceGetComputeRunningProcesses return the processes in different orders, the monitor will also be wrong, because the code uses the same index i to look up both the memory (infos[i]) and the utilization (processes_sample[i]).
@archlitchi archlitchi merged commit f40e94f into Project-HAMi:main Jan 24, 2025
3 checks passed
