Monitoring Kubernetes

GPU monitoring

GPU utilization, memory pressure, and per-process consumption are not exposed by the standard Kubernetes metrics pipeline, which makes it painful to troubleshoot an underperforming training job or an idle inference pod. This page walks you through deploying a small DaemonSet that runs nvidia-smi on each GPU node, formats the output as CSV, and lets Collectord forward it to Splunk, where the Monitoring Kubernetes app builds the GPU dashboard.

Installing collection

Prerequisites

If only some nodes in your cluster have GPUs attached, label those nodes so the DaemonSet only schedules where it can actually run:

bash
kubectl label nodes <gpu-node-name> hardware-type=NVIDIAGPU

The DaemonSet below targets that label via nodeSelector.
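
To confirm the label is in place before deploying, you can list the nodes that carry it (a quick sanity check, nothing more):

bash
kubectl get nodes -l hardware-type=NVIDIAGPU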

nvidia-smi DaemonSet

The DaemonSet runs two containers per GPU node — one for nvidia-smi pmon (per-process) and one for nvidia-smi dmon (device-level) — calling each every 30 seconds. Both tools emit human-readable, space-padded text by default; the Collectord annotations on the pod template strip the headers, collapse whitespace into commas, and drop empty - placeholders so the result is clean CSV that’s trivial to extract fields from in Splunk. The pmon-- and dmon-- annotation prefixes scope per-container overrides — separate sourcetypes plus a regex that drops the “nothing to report” line pmon emits when no processes are using a GPU.
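
As a rough illustration of what that replace chain does, here is an equivalent set of sed substitutions applied to a made-up, space-padded line (POSIX character classes stand in for \s, and the sample columns are invented, not exact pmon output):

bash
# Made-up sample line; real pmon/dmon columns vary by driver version
line='    0   2345     C     12     3     -     -   python'
echo "$line" | sed -E \
  -e 's/^#.*$//' \
  -e 's/(^[[:space:]]+)|([[:space:]]+$)//g' \
  -e 's/[[:space:]]+/,/g' \
  -e 's/-//g'
# prints: 0,2345,C,12,3,,,python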

For background on nvidia-smi, see NVIDIA System Management Interface.

Save the manifest below as nvidia-smi.yaml:

yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: collectorforkubernetes-nvidia-smi
  namespace: collectorforkubernetes
  labels:
    app: collectorforkubernetes-nvidia-smi
spec:
  updateStrategy:
    type: RollingUpdate

  selector:
    matchLabels:
      daemon: collectorforkubernetes-nvidia-smi

  template:
    metadata:
      name: collectorforkubernetes-nvidia-smi
      labels:
        daemon: collectorforkubernetes-nvidia-smi
      annotations:
        collectord.io/logs-joinpartial: 'false'
        collectord.io/logs-joinmultiline: 'false'
        # remove headers
        collectord.io/logs-replace.1-search: '^#.*$'
        collectord.io/logs-replace.1-val: ''
        # trim spaces from both sides
        collectord.io/logs-replace.2-search: '(^\s+)|(\s+$)'
        collectord.io/logs-replace.2-val: ''
        # collapse whitespace into commas to make a CSV line
        collectord.io/logs-replace.3-search: '\s+'
        collectord.io/logs-replace.3-val: ','
        # replace '-' placeholders with empty values
        collectord.io/logs-replace.4-search: '-'
        collectord.io/logs-replace.4-val: ''
        # nothing to report from pmon - just ignore the line
        collectord.io/pmon--logs-replace.0-search: '^\s+\d+(\s+-)+\s*$'
        collectord.io/pmon--logs-replace.0-val: ''
        # set log source types
        collectord.io/pmon--logs-type: kubernetes_gpu_nvidia_pmon
        collectord.io/dmon--logs-type: kubernetes_gpu_nvidia_dmon
    spec:
      # Make sure to attach the matching label to the GPU nodes
      # $ kubectl label nodes <gpu-node-name> hardware-type=NVIDIAGPU
      nodeSelector:
        hardware-type: NVIDIAGPU
      hostPID: true
      containers:
      # Per-process stats: list GPU UUIDs, batch them 4 per invocation,
      # take a single pmon sample, then sleep 30 seconds
      - name: pmon
        image: nvidia/cuda:13.2.0-base-ubuntu22.04
        args:
          - "bash"
          - "-c"
          - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi pmon -s um --count 1 --id {}'; sleep 30 ;done"
      # Device-level stats: same UUID batching, a single dmon sample every 30 seconds
      - name: dmon
        image: nvidia/cuda:13.2.0-base-ubuntu22.04
        args:
          - "bash"
          - "-c"
          - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi dmon -s pucvmet --count 1 --id {}'; sleep 30 ;done"

Apply it:

bash
kubectl apply -f nvidia-smi.yaml
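
To verify the rollout, check that a pod is running on each labeled node and, if you like, peek at the raw output of one container before Collectord rewrites it (the namespace and label below come from the manifest above):

bash
kubectl -n collectorforkubernetes get pods -l daemon=collectorforkubernetes-nvidia-smi -o wide
kubectl -n collectorforkubernetes logs ds/collectorforkubernetes-nvidia-smi -c dmon --tail=20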

On version 5.11 or later, the GPU dashboard starts populating within a minute or two.

NVIDIA (GPU)