Monitoring Nvidia GPU devices
Installing the collection
Prerequisites
If not all nodes in your cluster have GPU devices attached, label the GPU nodes, for example
kubectl label nodes <gpu-node-name> hardware-type=NVIDIAGPU
The DaemonSet we use below relies on this label.
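You can verify which nodes carry the label before proceeding (a quick check, assuming kubectl is configured for your cluster):
kubectl get nodes -l hardware-type=NVIDIAGPU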
Nvidia-SMI DaemonSet
We use the nvidia-smi
tool to collect metrics from the GPU devices. You can find documentation for this tool at
https://developer.nvidia.com/nvidia-system-management-interface.
We also use a set of annotations to convert the output from this tool into an easily parsable CSV format, which helps us
configure field extraction with Splunk.
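As an illustration only (a sketch, not part of the DaemonSet), the same chain of replacements can be reproduced with sed on a sample dmon output line; the sample values below are made up:
# headers starting with '#' are dropped, spaces are trimmed from both sides,
# runs of whitespace become commas, and '-' placeholders are emptied
echo '    0    25    33     -     0     0     0     0   405   544' |
  sed -E -e '/^#/d' -e 's/(^ +)|( +$)//g' -e 's/ +/,/g' -e 's/-//g'
# prints: 0,25,33,,0,0,0,0,405,544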
Create a file named nvidia-smi.yaml
and save it with the following content.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: collectorforkubernetes-nvidia-smi
  namespace: collectorforkubernetes
  labels:
    app: collectorforkubernetes-nvidia-smi
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      daemon: collectorforkubernetes-nvidia-smi
  template:
    metadata:
      name: collectorforkubernetes-nvidia-smi
      labels:
        daemon: collectorforkubernetes-nvidia-smi
      annotations:
        collectord.io/logs-joinpartial: 'false'
        collectord.io/logs-joinmultiline: 'false'
        # remove the header lines
        collectord.io/logs-replace.1-search: '^#.*$'
        collectord.io/logs-replace.1-val: ''
        # trim spaces from both sides
        collectord.io/logs-replace.2-search: '(^\s+)|(\s+$)'
        collectord.io/logs-replace.2-val: ''
        # turn the console-formatted line into CSV
        collectord.io/logs-replace.3-search: '\s+'
        collectord.io/logs-replace.3-val: ','
        # replace the '-' placeholders with empty values
        collectord.io/logs-replace.4-search: '-'
        collectord.io/logs-replace.4-val: ''
        # nothing to report from pmon - just ignore the line
        collectord.io/pmon--logs-replace.0-search: '^\s+\d+(\s+-)+\s*$'
        collectord.io/pmon--logs-replace.0-val: ''
        # set log source types
        collectord.io/pmon--logs-type: kubernetes_gpu_nvidia_pmon
        collectord.io/dmon--logs-type: kubernetes_gpu_nvidia_dmon
    spec:
      # Make sure to attach the matching label to the GPU nodes
      # $ kubectl label nodes <gpu-node-name> hardware-type=NVIDIAGPU
      nodeSelector:
        hardware-type: NVIDIAGPU
      hostPID: true
      containers:
        - name: pmon
          image: nvidia/cuda:latest
          args:
            - "bash"
            - "-c"
            - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi pmon -s um --count 1 --id {}'; sleep 30 ;done"
        - name: dmon
          image: nvidia/cuda:latest
          args:
            - "bash"
            - "-c"
            - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi dmon -s pucvmet --count 1 --id {}'; sleep 30 ;done"
Apply this DaemonSet to your cluster with
kubectl apply -f nvidia-smi.yaml
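To confirm the Pods were scheduled on the labeled GPU nodes, you can list them by the daemon label used in the spec (a quick check, assuming the namespace from the manifest above):
kubectl get pods -n collectorforkubernetes -l daemon=collectorforkubernetes-nvidia-smi -o wide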
Starting with version 5.11, you should see the data in the dashboard.
