Monitoring GPU

Monitoring Nvidia GPU devices

Installing the collection

Prerequisites

If not all nodes in your cluster have GPU devices attached, label the GPU nodes, for example:

oc label nodes <gpu-node-name> hardware-type=NVIDIAGPU

The DaemonSet below can use this label through a nodeSelector (commented out in the example) so that it is scheduled only on GPU nodes.

Nvidia-SMI DaemonSet

We use the nvidia-smi tool to collect metrics from the GPU devices. You can find documentation for this tool at https://developer.nvidia.com/nvidia-system-management-interface. We also use a set of annotations to convert the output from this tool into an easily parsable CSV format, which helps us configure field extraction with Splunk.

Create a file nvidia-smi.yaml and save it with the following content:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: collectorforopenshift-nvidia-smi
  namespace: collectorforopenshift
  labels:
    app: collectorforopenshift-nvidia-smi
spec:
  updateStrategy:
    type: RollingUpdate

  selector:
    matchLabels:
      daemon: collectorforopenshift-nvidia-smi

  template:
    metadata:
      name: collectorforopenshift-nvidia-smi
      labels:
        daemon: collectorforopenshift-nvidia-smi
      annotations:
        collectord.io/logs-joinpartial: 'false'
        collectord.io/logs-joinmultiline: 'false'
        # remove header lines
        collectord.io/logs-replace.1-search: '^#.*$'
        collectord.io/logs-replace.1-val: ''
        # trim spaces from both sides
        collectord.io/logs-replace.2-search: '(^\s+)|(\s+$)'
        collectord.io/logs-replace.2-val: ''
        # turn the whitespace-aligned console output into CSV
        collectord.io/logs-replace.3-search: '\s+'
        collectord.io/logs-replace.3-val: ','
        # replace '-' placeholders with empty values
        collectord.io/logs-replace.4-search: '-'
        collectord.io/logs-replace.4-val: ''
        # pmon lines with nothing to report (all '-') - ignore them
        collectord.io/pmon--logs-replace.0-search: '^\s+\d+(\s+-)+\s*$'
        collectord.io/pmon--logs-replace.0-val: ''
        # set log source types
        collectord.io/pmon--logs-type: openshift_gpu_nvidia_pmon
        collectord.io/dmon--logs-type: openshift_gpu_nvidia_dmon
    spec:
      # Make sure to attach the matching label to the GPU nodes:
      # $ oc label nodes <gpu-node-name> hardware-type=NVIDIAGPU
      # nodeSelector:
      #   hardware-type: NVIDIAGPU
      hostPID: true
      containers:
      - name: pmon
        image: nvidia/cuda:latest
        args:
          - "bash"
          - "-c"
          - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi pmon -s um --count 1 --id {}'; sleep 30; done"
      - name: dmon
        image: nvidia/cuda:latest
        args:
          - "bash"
          - "-c"
          - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi dmon -s pucvmet --count 1 --id {}'; sleep 30; done"
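The shell pipeline in the container args extracts the GPU UUIDs from nvidia-smi --list-gpus, groups them four per invocation (pmon and dmon monitor at most four devices at a time), and joins each group with commas for the --id flag. A minimal local sketch of the same transformation, using fabricated UUIDs:

```shell
# Fabricated `nvidia-smi --list-gpus` output (UUIDs are illustrative)
list_gpus='GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-11111111-2222-3333-4444-555555555555)
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)'

# Field 3 after ':' holds the UUID; `cut -c2-41` drops the leading
# space and the trailing ')'; `xargs -L4` groups up to four UUIDs
# per line; `sed` joins each group with commas for `--id`.
out=$(echo "$list_gpus" | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g')
echo "$out"
# -> GPU-11111111-2222-3333-4444-555555555555,GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
```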

Apply this DaemonSet to your cluster with

oc apply -f nvidia-smi.yaml
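The chain of collectord.io/logs-replace annotations in the DaemonSet above can be previewed locally with sed; the sample dmon data line below is fabricated, and [[:space:]] stands in for the \s used in the annotations:

```shell
# Fabricated `nvidia-smi dmon` data line; '-' marks a value the
# tool could not report.
line='    0    43    35     -     0     0     0     0  5000   810'

# Replace rules 1-4 from the annotations: drop '#' header lines,
# trim both ends, collapse whitespace runs into commas, and strip
# the '-' placeholders (leaving an empty CSV field).
csv=$(printf '%s\n' "$line" \
  | sed -E 's/^#.*$//; s/^[[:space:]]+//; s/[[:space:]]+$//; s/[[:space:]]+/,/g; s/-//g')
echo "$csv"
# -> 0,43,35,,0,0,0,0,5000,810
```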

With version 5.11 of the application and above, you should see the data in the NVIDIA (GPU) dashboard.

About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications that give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing easy-to-use and easy-to-deploy solutions for Linux and Windows containers. We deliver applications that help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution to help you keep all the metrics and logs in one place, allowing you to quickly address complex questions on container performance.
