Monitoring Nvidia GPU devices
Installing the collection
Prerequisites
If not all nodes in your cluster have GPU devices attached, label them similarly to:
kubectl label nodes <gpu-node-name> hardware-type=NVIDIAGPU
The DaemonSet that we use below relies on this label.
Nvidia-SMI DaemonSet
We use the nvidia-smi tool to collect metrics from GPU devices. You can find documentation for this tool at
https://developer.nvidia.com/nvidia-system-management-interface.
We also use a set of annotations to convert the output from this tool into an easily parseable CSV format, which helps us
configure field extraction with Splunk.
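To illustrate what these replace rules do, here is a sketch using `sed` (Collectord applies the annotation regexes with its own engine; the `sed` expressions below are equivalent approximations, and the sample row is made up):

```shell
# A sample data row from `nvidia-smi dmon` (values are made up).
row='    0    43    35     -    0    0    -    -'

# Apply the four logs-replace rules from the annotations in order:
# drop '#' header lines, trim both sides, collapse whitespace runs
# into commas, and turn '-' placeholders into empty values.
printf '%s\n' "$row" | sed \
  -e 's/^#.*$//' \
  -e 's/^[[:space:]]*//' \
  -e 's/[[:space:]]*$//' \
  -e 's/[[:space:]]\{1,\}/,/g' \
  -e 's/-//g'
# prints: 0,43,35,,0,0,,
```

A header line such as `# gpu   pwr  gtemp` is reduced to an empty string by the first rule, so only data rows reach Splunk as CSV.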
Create a file named nvidia-smi.yaml with the following content.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: collectorforkubernetes-nvidia-smi
  namespace: collectorforkubernetes
  labels:
    app: collectorforkubernetes-nvidia-smi
spec:
  updateStrategy:
    type: RollingUpdate

  selector:
    matchLabels:
      daemon: collectorforkubernetes-nvidia-smi

  template:
    metadata:
      name: collectorforkubernetes-nvidia-smi
      labels:
        daemon: collectorforkubernetes-nvidia-smi
      annotations:
        collectord.io/logs-joinpartial: 'false'
        collectord.io/logs-joinmultiline: 'false'
        # remove headers
        collectord.io/logs-replace.1-search: '^#.*$'
        collectord.io/logs-replace.1-val: ''
        # trim spaces from both sides
        collectord.io/logs-replace.2-search: '(^\s+)|(\s+$)'
        collectord.io/logs-replace.2-val: ''
        # convert the console-formatted line into CSV
        collectord.io/logs-replace.3-search: '\s+'
        collectord.io/logs-replace.3-val: ','
        # replace '-' placeholders with empty values
        collectord.io/logs-replace.4-search: '-'
        collectord.io/logs-replace.4-val: ''
        # nothing to report from pmon - ignore the line
        collectord.io/pmon--logs-replace.0-search: '^\s+\d+(\s+-)+\s*$'
        collectord.io/pmon--logs-replace.0-val: ''
        # set log source types
        collectord.io/pmon--logs-type: kubernetes_gpu_nvidia_pmon
        collectord.io/dmon--logs-type: kubernetes_gpu_nvidia_dmon
    spec:
      # Make sure to attach the matching label to the GPU node:
      # $ kubectl label nodes <gpu-node-name> hardware-type=NVIDIAGPU
      nodeSelector:
        hardware-type: NVIDIAGPU
      hostPID: true
      containers:
      - name: pmon
        image: nvidia/cuda:latest
        args:
          - "bash"
          - "-c"
          - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi pmon -s um --count 1 --id {}'; sleep 30 ;done"
      - name: dmon
        image: nvidia/cuda:latest
        args:
          - "bash"
          - "-c"
          - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi dmon -s pucvmet --count 1 --id {}'; sleep 30 ;done"
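The shell loop in each container's `args` batches GPU UUIDs into comma-separated groups of up to four before invoking `nvidia-smi` with `--id`. Here is a sketch of that pipeline with simulated `--list-gpus` output (the device names and UUIDs are placeholders):

```shell
# Simulate `nvidia-smi --list-gpus`; the real command prints one line
# per device in the "GPU N: <name> (UUID: GPU-...)" format.
list_gpus() {
  echo 'GPU 0: Tesla T4 (UUID: GPU-11111111-1111-1111-1111-111111111111)'
  echo 'GPU 1: Tesla T4 (UUID: GPU-22222222-2222-2222-2222-222222222222)'
}

# cut -d':' -f 3  -> " GPU-...)" (the UUID field after the second colon)
# cut -c2-41      -> the 40-character UUID itself, no space or paren
# xargs -L4 echo  -> up to four UUIDs per line, space-separated
# sed 's/ /,/g'   -> comma-separated, ready for `nvidia-smi --id`
list_gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g'
# prints: GPU-11111111-1111-1111-1111-111111111111,GPU-22222222-2222-2222-2222-222222222222
```

Batching up to four UUIDs per invocation matches the `pmon`/`dmon` limit of four devices per call, so clusters with more GPUs per node still get covered.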
Apply this DaemonSet to your cluster with:
kubectl apply -f nvidia-smi.yaml
Starting with version 5.11, you should see the data in the dashboard.
