Monitoring GPU

Monitoring Nvidia GPU devices

Installing the collection

Prerequisites

If not all nodes in your cluster have GPU devices attached, label the GPU nodes, for example:

oc label nodes <gpu-node-name> hardware-type=NVIDIAGPU

The DaemonSet below can use this label through a nodeSelector (commented out in the example) so that it is scheduled only on GPU nodes.

Nvidia-SMI DaemonSet

We use the nvidia-smi tool to collect metrics from the GPU devices. You can find documentation for this tool at https://developer.nvidia.com/nvidia-system-management-interface. We also use a set of annotations to convert the output from this tool into an easily parsable CSV format, which helps us configure field extraction with Splunk.

Create a file nvidia-smi.yaml and save it with the following content:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: collectorforopenshift-nvidia-smi
  namespace: collectorforopenshift
  labels:
    app: collectorforopenshift-nvidia-smi
spec:
  updateStrategy:
    type: RollingUpdate

  selector:
    matchLabels:
      daemon: collectorforopenshift-nvidia-smi

  template:
    metadata:
      name: collectorforopenshift-nvidia-smi
      labels:
        daemon: collectorforopenshift-nvidia-smi
      annotations:
        collectord.io/logs-joinpartial: 'false'
        collectord.io/logs-joinmultiline: 'false'
        # remove header lines
        collectord.io/logs-replace.1-search: '^#.*$'
        collectord.io/logs-replace.1-val: ''
        # trim spaces from both sides
        collectord.io/logs-replace.2-search: '(^\s+)|(\s+$)'
        collectord.io/logs-replace.2-val: ''
        # turn the whitespace-aligned console output into CSV
        collectord.io/logs-replace.3-search: '\s+'
        collectord.io/logs-replace.3-val: ','
        # replace '-' placeholders with empty values
        collectord.io/logs-replace.4-search: '-'
        collectord.io/logs-replace.4-val: ''
        # pmon lines with nothing to report (all '-') - ignore them
        collectord.io/pmon--logs-replace.0-search: '^\s+\d+(\s+-)+\s*$'
        collectord.io/pmon--logs-replace.0-val: ''
        # set log source types
        collectord.io/pmon--logs-type: openshift_gpu_nvidia_pmon
        collectord.io/dmon--logs-type: openshift_gpu_nvidia_dmon
    spec:
      # Make sure to attach the matching label to the GPU nodes:
      # $ oc label nodes <gpu-node-name> hardware-type=NVIDIAGPU
      # nodeSelector:
      #   hardware-type: NVIDIAGPU
      hostPID: true
      containers:
      - name: pmon
        image: nvidia/cuda:latest
        args:
          - "bash"
          - "-c"
          - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi pmon -s um --count 1 --id {}'; sleep 30; done"
      - name: dmon
        image: nvidia/cuda:latest
        args:
          - "bash"
          - "-c"
          - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi dmon -s pucvmet --count 1 --id {}'; sleep 30; done"
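The shell pipeline in the container args extracts the GPU UUIDs from nvidia-smi --list-gpus, groups them four per invocation (pmon and dmon monitor at most four devices at a time), and joins each group with commas for the --id flag. A minimal local sketch of the same transformation, using fabricated UUIDs:

```shell
# Fabricated `nvidia-smi --list-gpus` output (UUIDs are illustrative)
list_gpus='GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-11111111-2222-3333-4444-555555555555)
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)'

# Field 3 after ':' holds the UUID; `cut -c2-41` drops the leading
# space and the trailing ')'; `xargs -L4` groups up to four UUIDs
# per line; `sed` joins each group with commas for `--id`.
out=$(echo "$list_gpus" | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g')
echo "$out"
# -> GPU-11111111-2222-3333-4444-555555555555,GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
```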

Apply this DaemonSet to your cluster with

oc apply -f nvidia-smi.yaml
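The chain of collectord.io/logs-replace annotations in the DaemonSet above can be previewed locally with sed; the sample dmon data line below is fabricated, and [[:space:]] stands in for the \s used in the annotations:

```shell
# Fabricated `nvidia-smi dmon` data line; '-' marks a value the
# tool could not report.
line='    0    43    35     -     0     0     0     0  5000   810'

# Replace rules 1-4 from the annotations: drop '#' header lines,
# trim both ends, collapse whitespace runs into commas, and strip
# the '-' placeholders (leaving an empty CSV field).
csv=$(printf '%s\n' "$line" \
  | sed -E 's/^#.*$//; s/^[[:space:]]+//; s/[[:space:]]+$//; s/[[:space:]]+/,/g; s/-//g')
echo "$csv"
# -> 0,43,35,,0,0,0,0,5000,810
```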

With version 5.11 of the application and above, you should see the data in the NVIDIA (GPU) dashboard.

About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications that give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing easy-to-use and easy-to-deploy solutions for Linux and Windows containers. We deliver applications that help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution to help you keep all the metrics and logs in one place, allowing you to quickly address complex questions on container performance.
