Monitoring Nvidia GPU devices

Installing the collection

Prerequisites

If not all nodes in your cluster have GPU devices attached, label the nodes that do:

kubectl label nodes <gpu-node-name> hardware-type=NVIDIAGPU

The DaemonSet that we use below relies on this label.

Nvidia-SMI DaemonSet

We use the nvidia-smi tool to collect metrics from GPU devices. You can find documentation for this tool at https://developer.nvidia.com/nvidia-system-management-interface. We also use a set of annotations to convert the tool's console output into an easily parseable CSV format, which simplifies field extraction in Splunk.
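To make the transformation concrete, here is a small Python sketch of the replace chain that the annotations define, applied to hypothetical dmon-style output lines (the sample lines are illustrative, not captured from a real GPU):

```python
import re

# Ordered (search, replacement) pairs mirroring the collectord.io/logs-replace.N
# annotations: drop headers, trim, collapse whitespace to commas, blank out '-'.
RULES = [
    (r'^#.*$', ''),          # 1: remove header lines
    (r'(^\s+)|(\s+$)', ''),  # 2: trim spaces from both sides
    (r'\s+', ','),           # 3: turn the console columns into CSV
    (r'-', ''),              # 4: replace '-' placeholders with empty values
]

def to_csv(line: str) -> str:
    for search, repl in RULES:
        line = re.sub(search, repl, line)
    return line

# Hypothetical dmon output (column layout assumed for illustration).
for line in ['# gpu   pwr  temp    sm   mem',
             '    0    25    30     -     1']:
    print(repr(to_csv(line)))
```

Header lines collapse to empty strings, and a data row such as `0 25 30 - 1` becomes `0,25,30,,1`, which Splunk can then split on commas.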

Create a file named nvidia-smi.yaml and save it with the following content.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: collectorforkubernetes-nvidia-smi
  namespace: collectorforkubernetes
  labels:
    app: collectorforkubernetes-nvidia-smi
spec:
  updateStrategy:
    type: RollingUpdate

  selector:
    matchLabels:
      daemon: collectorforkubernetes-nvidia-smi

  template:
    metadata:
      name: collectorforkubernetes-nvidia-smi
      labels:
        daemon: collectorforkubernetes-nvidia-smi
      annotations:
        collectord.io/logs-joinpartial: 'false'
        collectord.io/logs-joinmultiline: 'false'
        # remove headers
        collectord.io/logs-replace.1-search: '^#.*$'
        collectord.io/logs-replace.1-val: ''
        # trim spaces from both sides
        collectord.io/logs-replace.2-search: '(^\s+)|(\s+$)'
        collectord.io/logs-replace.2-val: ''
        # make a CSV from the console-formatted line
        collectord.io/logs-replace.3-search: '\s+'
        collectord.io/logs-replace.3-val: ','
        # replace '-' placeholders with empty values
        collectord.io/logs-replace.4-search: '-'
        collectord.io/logs-replace.4-val: ''
        # nothing to report from pmon - just ignore the line
        collectord.io/pmon--logs-replace.0-search: '^\s+\d+(\s+-)+\s*$'
        collectord.io/pmon--logs-replace.0-val: ''
        # set log source types
        collectord.io/pmon--logs-type: kubernetes_gpu_nvidia_pmon
        collectord.io/dmon--logs-type: kubernetes_gpu_nvidia_dmon
    spec:
      # Make sure to attach the matching label to the GPU node:
      # $ kubectl label nodes <gpu-node-name> hardware-type=NVIDIAGPU
      nodeSelector:
        hardware-type: NVIDIAGPU
      hostPID: true
      containers:
      - name: pmon
        image: nvidia/cuda:latest
        args:
          - "bash"
          - "-c"
          - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi pmon -s um --count 1 --id {}'; sleep 30 ;done"
      - name: dmon
        image: nvidia/cuda:latest
        args:
          - "bash"
          - "-c"
          - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi dmon -s pucvmet --count 1 --id {}'; sleep 30 ;done"
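The shell pipeline in the container args extracts each GPU UUID from the nvidia-smi --list-gpus output and passes them to nvidia-smi in comma-joined batches of four via --id. A Python sketch of that grouping logic, using made-up device names and UUIDs for illustration:

```python
# Mimic the shell pipeline in the container args:
#   nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g'

def extract_uuid(line: str) -> str:
    # cut -d':' -f 3 takes the third colon-separated field; cut -c2-41 keeps
    # characters 2-41, i.e. the 40-character "GPU-..." UUID after the space.
    return line.split(':')[2][1:41]

def batch_ids(uuids, size=4):
    # xargs -L4 echo | sed 's/ /,/g': comma-join the UUIDs in groups of four.
    return [','.join(uuids[i:i + size]) for i in range(0, len(uuids), size)]

# Hypothetical --list-gpus output (device names and UUIDs are made up).
lines = [
    'GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-00000000-0000-0000-0000-000000000000)',
    'GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-11111111-1111-1111-1111-111111111111)',
]
print(batch_ids([extract_uuid(l) for l in lines]))
```

Each resulting batch becomes one `nvidia-smi pmon`/`dmon` invocation, so a node with many GPUs is queried four devices at a time.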

Apply this DaemonSet to your cluster with:

kubectl apply -f nvidia-smi.yaml

Starting with version 5.11, you should see the data on the NVIDIA (GPU) dashboard.

About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications that give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing easy-to-use and easy-to-deploy solutions for Linux and Windows containers. We deliver applications that help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution to keep all the metrics and logs in one place, allowing you to quickly address complex questions on container performance.
