If you run workloads on NVIDIA GPU devices, such as machine learning (ML), high performance computing (HPC), financial analytics, or video transcoding, you want to monitor how efficiently you are using these devices.

We provide a solution, based on the nvidia-smi tool, that lets you monitor the GPU devices attached to your Kubernetes and OpenShift nodes and review GPU and memory utilization, power consumption, and more. It is currently in beta, and you need to add the required dashboards to the configuration manually. In future versions we will include these dashboards as part of our application.

Please review the installation documentation.

We use the nvidia-smi tool to collect the data, which allows us to install the collection component on any Kubernetes or OpenShift version. The official NVIDIA monitoring tool requires Kubernetes 1.13+, which is a significant limitation: you can’t run it on the most popular OpenShift version, 3.11 (which is based on Kubernetes 1.11). If you prefer to use NVIDIA/gpu-monitoring-tools, you can use our Prometheus annotations to collect these metrics and forward them to Splunk.
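To give a rough idea of what collecting data with nvidia-smi can look like, here is a minimal Python sketch. It is not the code of our collector; it only illustrates the general approach of calling nvidia-smi with its CSV query options and turning the output into structured records. The field selection and the function names parse_gpu_csv and query_gpus are assumptions made for this illustration.

```python
import csv
import io
import subprocess

# Properties accepted by `nvidia-smi --query-gpu`; this particular selection
# is an assumption for the example, not the list used by any real collector.
FIELDS = [
    "index",
    "name",
    "utilization.gpu",
    "utilization.memory",
    "memory.used",
    "memory.total",
    "power.draw",
    "temperature.gpu",
]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for record in reader:
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        row = {}
        for field, value in zip(FIELDS, record):
            value = value.strip()
            if field == "name":
                row[field] = value
                continue
            try:
                row[field] = float(value)
            except ValueError:
                # Values the GPU does not report are not numeric, e.g. "[N/A]".
                row[field] = None
        rows.append(row)
    return rows


def query_gpus():
    """Run nvidia-smi on the local node and return the parsed metrics."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

On a node with the NVIDIA driver installed, calling query_gpus() returns one dictionary per GPU, which could then be serialized and forwarded to a log or metrics pipeline. Because the sketch depends only on the nvidia-smi binary, it does not depend on the Kubernetes version.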

About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications that give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing easy-to-use and easy-to-deploy solutions for Linux and Windows containers. We deliver applications that help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution that keeps all your metrics and logs in one place, allowing you to quickly answer complex questions about container performance.