# HAMi-core —— Hook library for CUDA Environments

English | [中文](README_CN.md)
## Introduction

HAMi-core is the in-container GPU resource controller. It has been adopted by [HAMi](https://github.com/Project-HAMi/HAMi) and [volcano](https://github.com/volcano-sh/devices).

<img src="./docs/images/hami-arch.png" width="600"/>
## Features

HAMi-core has the following features:

1. Virtualize device memory
2. Limit device utilization by self-implemented time slicing
3. Real-time device utilization monitoring

![image](docs/images/sample_nvidia-smi.png)
## Design

HAMi-core operates by hijacking the API calls between CUDA-Runtime(libcudart.so) and CUDA-Driver(libcuda.so), as shown in the figure below:

<img src="./docs/images/hami-core-position.png" width="400"/>
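The interposition mechanism behind this can be seen in miniature with any shared library. The sketch below is a toy shim, not HAMi-core code: it overrides `getpid` via `LD_PRELOAD`, the same way `libvgpu.so` places itself in front of `libcuda.so` without any change to the application:

```shell
# Build a tiny shim library that intercepts getpid()
cat > shim.c <<'EOF'
#include <sys/types.h>
/* Interposed symbol: with LD_PRELOAD set, the dynamic linker
   resolves the application's getpid() call here instead of libc. */
pid_t getpid(void) { return 4242; }
EOF
cc -shared -fPIC -o shim.so shim.c

# An unmodified "application" that calls getpid() normally
cat > app.c <<'EOF'
#include <stdio.h>
#include <unistd.h>
int main(void) { printf("%d\n", (int)getpid()); return 0; }
EOF
cc -o app app.c

./app                         # prints the real pid
LD_PRELOAD=./shim.so ./app    # prints 4242: the shim answered the call
```

HAMi-core applies the same trick to the CUDA driver entry points, which is why no application or framework changes are needed inside the container.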
## Build in Docker

```bash
make build-in-docker
```

## Usage

_CUDA_DEVICE_MEMORY_LIMIT_ indicates the upper limit of device memory (eg 1g,1024m,1048576k)

_CUDA_DEVICE_SM_LIMIT_ indicates the sm utility percentage of each device
```bash
# Add 1GiB memory limit and set max SM utility to 50% for all devices
export LD_PRELOAD=./libvgpu.so
export CUDA_DEVICE_MEMORY_LIMIT=1g
export CUDA_DEVICE_SM_LIMIT=50
```
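The memory limit accepts plain bytes or the binary suffixes `k`, `m` and `g`. The helper below is illustrative only (not part of HAMi-core) and just shows what each suffix expands to:

```shell
# Hypothetical converter mirroring the k/m/g suffixes accepted by
# CUDA_DEVICE_MEMORY_LIMIT (binary units: 1g = 1073741824 bytes)
to_bytes() {
  case "$1" in
    *g) echo $(( ${1%g} * 1073741824 )) ;;
    *m) echo $(( ${1%m} * 1048576 )) ;;
    *k) echo $(( ${1%k} * 1024 )) ;;
    *)  echo "$1" ;;   # no suffix: already in bytes
  esac
}

to_bytes 1g     # -> 1073741824
to_bytes 1024m  # -> 1073741824
```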
## Docker Images

```bash
# Build docker image
docker build . -f=dockerfiles/Dockerfile -t cuda_vmem:tf1.8-cu90

# Configure GPU device and library mounts for the container
export DEVICE_MOUNTS="--device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidiactl:/dev/nvidiactl"
export LIBRARY_MOUNTS="-v /usr/cuda_files:/usr/cuda_files -v $(which nvidia-smi):/bin/nvidia-smi"

# Run the container and check the nvidia-smi output
docker run ${LIBRARY_MOUNTS} ${DEVICE_MOUNTS} -it \
    -e CUDA_DEVICE_MEMORY_LIMIT=2g \
    -e LD_PRELOAD=/libvgpu/build/libvgpu.so \
    cuda_vmem:tf1.8-cu90 \
    nvidia-smi
```

After running, you will see nvidia-smi output similar to the following, showing memory limited to 2GiB:

```
...
[HAMI-core Msg(1:140235494377280:libvgpu.c:836)]: Initializing.....
Mon Dec  2 04:38:12 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:03:00.0 Off |                  N/A |
| 30%   36C    P8              7W /  170W |       0MiB /   2048MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
[HAMI-core Msg(1:140235494377280:multiprocess_memory_limit.c:497)]: Calling exit handler 1
```
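The mounts above list each device node by hand. On hosts with several GPUs, a small helper can assemble the `--device` flags from whatever NVIDIA nodes actually exist; the function below is a hypothetical sketch, not part of HAMi-core:

```shell
# Collect --device flags for every NVIDIA device node under a /dev tree.
# The function name and the optional root argument are illustrative only.
device_mounts() {
  devroot="${1:-/dev}"
  flags=""
  # nvidia0, nvidia1, ... plus the control and UVM nodes
  for dev in "$devroot"/nvidia[0-9]* "$devroot"/nvidiactl "$devroot"/nvidia-uvm; do
    [ -e "$dev" ] && flags="$flags --device $dev:$dev"
  done
  printf '%s\n' "$flags"
}

# Usage: DEVICE_MOUNTS=$(device_mounts) before the docker run above
```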
## Log

Use environment variable LIBCUDA_LOG_LEVEL to set the visibility of logs

| LIBCUDA_LOG_LEVEL | messages shown |
| ----------------- | -------------- |
| 3 | infos,errors,warnings,messages |
| 4 | debugs,errors,warnings,messages |
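Since the level is just another environment variable on the hooked process, it can be raised for a single run. A minimal sketch, assuming `libvgpu.so` was built under `./build` (the path is an assumption, adjust to your build location):

```shell
# Most verbose hook logging (level 4) for one invocation;
# requires an NVIDIA driver, so this is a config fragment, not a self-test
export LIBCUDA_LOG_LEVEL=4
export LD_PRELOAD=./build/libvgpu.so
nvidia-smi   # HAMI-core debug messages appear alongside normal output
```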
## Test Raw APIs