腾讯gpu-manager原创

# 关于叫法

gpu-manager？？ vCUDA？？ GaiaGPU？？

# 基本原理

vCUDA通过劫持CUDA的显存申请和释放请求，为每个容器管理它的显存使用量，进而实现了显存隔离。唯一需要注意的是申请context并不通过malloc函数，因此无法知道进程在context使用了多少显存。因此vcuda每次都去向GPU查询当前的显存使用量。在算力隔离方面，使用者可以指定容器的GPU利用率。vCUDA将会监控利用率，并在超出限制利用率时做一些处理。此处可以支持硬隔离和软隔离。两者的不同点是，如果有资源空闲，软隔离允许任务超过设置，而硬隔离不允许。

由于使用的是监控调节的方案，因此无法在短时间内限制算力，只能保证长时间的效率公平。所以不适合推理等任务时间极短的场景。

显存隔离是属于硬隔离，容器实际使用量不能超出限制值；算力隔离属于软隔离，其实际使用量会在限制值上下波动，但是平均值基本满足限制条件。

# 缺陷

不适合推理等任务时间极短的场景由于该方案是依赖cuda库函数，对少部分cuda版本支持不足

似乎不怎么维护了，issue较多没什么回应

# 优点

不需要修改默认runc运行时同时支持碎片和整卡调度，提高GPU资源利用率支持同一张卡上容器间GPU和显存的使用隔离

# 参数

tencent.com/vcuda-core 和tencent.com/vcuda-memory 是新增的针对单卡共享的一个资源标记，core对应的是使用率，单张卡有100个core，memory是显存，每个单位是256MB的显存。
如果申请的资源为50%利用率，7680MB显存。tencent.com/vcuda-core 填写50，tencent.com/vcuda-memory 填写成30。
当然我们也同样支持原来的独占卡的方式，只需要在core的地方填写100的整数倍，memory值填写大于0的任意值即可。

# 部署

kubectl label node master01 nvidia-device-enable=enable

kubectl apply -f gpu-manager-svc.yaml gpu-manager.yaml

# kubectl describe node master01 | grep tencent
  tencent.com/vcuda-core:    100
  tencent.com/vcuda-memory:  32
  tencent.com/vcuda-core:    100
  tencent.com/vcuda-memory:  32
  tencent.com/vcuda-core    0            0
  tencent.com/vcuda-memory  0            0

1
2
3
4
5
6
7

# 验证

apiVersion: v1
kind: Pod
metadata:
  name: test-gpu
  annotations:
    tencent.com/vcuda-core-limit: "50"
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda
      image: nvidia/cuda:10.0-base
      imagePullPolicy: IfNotPresent
      tty: true
      resources:
        requests:
          tencent.com/vcuda-core: 30
          tencent.com/vcuda-memory: 10
        limits:
          tencent.com/vcuda-core: 30
          tencent.com/vcuda-memory: 10

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

# 问题： Unable to set Type=notify in systemd service file(TODO) ; Function Not Found (TODO);

rebuild ldcache
65
launch gpu manager
66
E0729 10:17:58.085641   14167 server.go:131] Unable to set Type=notify in systemd service file?
67
E0729 10:17:59.109882   14167 tree.go:333] No topology level found at 0

1
2
3
4
5
6
7

https://github.com/tkestack/gpu-manager/issues/7

root@test-gpu:/# nvidia-smi 
Fri Jul 29 04:40:47 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57       Driver Version: 515.57       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   38C    P8    13W / 200W | Function Not Found   |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

怀疑跟cuda版本有关

#gpu

← 第四范式vgpu 阿里gpu-share→