If you are using NVIDIA GPU devices for your workloads, including machine learning (ML), high performance computing (HPC), financial analytics, and video transcoding, you want to be able to monitor how efficiently you are using these devices.
We provide a solution, based on the
nvidia-smi tool, that will allow you to monitor GPU attached devices to
your Kubernetes and OpenShift nodes, to review CPU/Memory utilization, Power consumption and more.
Currently it is in the beta mode, and you will need to add the required dashboards to configurations manually. With the next versions we will include these dashboards as part of our application.
Please review documentations on installation
We are using
nvidia-smi tool to collect the data, which allows us to install the collection part on any Kubernetes or
OpenShift version. The official NVIDIA monitoring tool relies on Kubernetes 1.13+,
which is a sugnificant limitation, considering that you can't run it on the most popular OpenShift version 3.11 (which is based on Kubernetes 1.11). If you prefer to use
NVIDIA/gpu-monitoring-tools you can easily use our Prometheus annotations
to collect these metrics and forward them to Splunk.