Outcold Solutions LLC

Monitoring Kubernetes and OpenShift - Monitoring GPU (beta)

August 6, 2019

If you are using NVIDIA GPU devices for your workloads, including machine learning (ML), high performance computing (HPC), financial analytics, and video transcoding, you want to be able to monitor how efficiently you are using these devices.

We provide a solution, based on the nvidia-smi tool, that lets you monitor the GPU devices attached to your Kubernetes and OpenShift nodes and review GPU and memory utilization, power consumption, and more. It is currently in beta, and you will need to add the required dashboards to your configuration manually. In upcoming versions we will include these dashboards as part of our application.

Please review the installation documentation.

NVIDIA (GPU)

We use the nvidia-smi tool to collect the data, which allows us to install the collection part on any Kubernetes or OpenShift version. The official NVIDIA monitoring tool relies on Kubernetes 1.13+, which is a significant limitation, considering that you can't run it on the most popular OpenShift version, 3.11 (which is based on Kubernetes 1.11). If you prefer to use NVIDIA/gpu-monitoring-tools, you can use our Prometheus annotations to collect these metrics and forward them to Splunk.
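To give a feel for the kind of data nvidia-smi exposes, here is a minimal Python sketch that queries a few per-GPU properties in CSV form and parses them into dictionaries. This is only an illustration, not how our collector is implemented: the function names `parse_gpu_csv` and `collect` are hypothetical, and the choice of fields is an assumption. It relies only on the standard `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi.

```python
import csv
import io
import subprocess

# Properties requested from nvidia-smi via --query-gpu (illustrative selection).
FIELDS = [
    "index", "name", "utilization.gpu", "utilization.memory",
    "memory.used", "memory.total", "power.draw", "temperature.gpu",
]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        row = dict(zip(FIELDS, (value.strip() for value in record)))
        row["index"] = int(row["index"])
        for key in FIELDS[2:]:
            try:
                row[key] = float(row[key])
            except ValueError:
                # Some GPUs report values such as "[Not Supported]".
                row[key] = None
        rows.append(row)
    return rows


def collect():
    """Run nvidia-smi on the current node and return parsed metrics."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `collect()` requires a node with the NVIDIA driver installed. `parse_gpu_csv` can be tried on its own with a hand-written line in the same comma-separated layout, which makes it easy to see how unsupported values end up as `None` instead of breaking the parse.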

kubernetes, openshift, splunk, nvidia, gpu

About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications that give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing solutions for Linux and Windows containers that are easy to use and deploy. We deliver applications that help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution that keeps all your metrics and logs in one place, allowing you to quickly answer complex questions about container performance.