Outcold Solutions LLC

Monitoring OpenShift - Version 5

Monitoring GPU (beta)

Monitoring Nvidia GPU devices

Installing collection

Pre-requirements

If only some of the nodes in your cluster have GPU devices attached, label those nodes, for example:

oc label nodes <gpu-node-name> hardware-type=NVIDIAGPU

The DaemonSet that we use below can rely on this label: uncomment the nodeSelector in the manifest to schedule it only on the labeled GPU nodes.

Nvidia-SMI DaemonSet

We use the nvidia-smi tool to collect metrics from the GPU devices. You can find the documentation for this tool at https://developer.nvidia.com/nvidia-system-management-interface. We also use a set of annotations to convert the output of this tool into an easily parsable CSV format, which helps us configure field extraction in Splunk.
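
To make the effect of those annotations concrete, here is a minimal Python sketch that applies the same search and replace patterns to one line of output. It is a mental model only, not Collectord's implementation: it assumes the rules run in ascending name order on one line at a time, and the sample values are invented.

```python
import re

# Patterns copied from the collectord.io/logs-replace.N annotations.
# Assumption: rules run in ascending name order, one line at a time.
COMMON_RULES = [
    (r"^#.*$", ""),          # 1: remove header lines
    (r"(^\s+)|(\s+$)", ""),  # 2: trim spaces from both sides
    (r"\s+", ","),           # 3: whitespace runs become commas
    (r"-", ""),              # 4: '-' placeholders become empty values
]
# Rule 0 is defined only for the pmon container: blank out idle lines.
PMON_RULES = [(r"^\s+\d+(\s+-)+\s*$", "")] + COMMON_RULES


def to_csv(line, rules=COMMON_RULES):
    """Apply each (search, val) pair in turn, like chained replace rules."""
    for search, val in rules:
        line = re.sub(search, val, line)
    return line


# Invented sample values in the column layout of `nvidia-smi dmon -s pucvmet`.
dmon_line = (
    "    0    43    35     -     0     0     0     0  2505  1075"
    "     0     0     0     2     -     -     0     0     0"
)
# An idle pmon line: a GPU index followed only by '-' placeholders.
idle_pmon_line = "    0     -     -     -     -     -     -     -    -"

print(to_csv(dmon_line))
print(repr(to_csv(idle_pmon_line, PMON_RULES)))
```

Under this model the dmon line becomes 19 comma-separated values, with empty strings where nvidia-smi printed '-'. Note that rule 4 removes every hyphen on the line, so a hyphenated process name in pmon output would lose its hyphens as well.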

Create a file named nvidia-smi.yaml with the following content.

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: collectorforopenshift-nvidia-smi
  namespace: collectorforopenshift
spec:
  template:
    metadata:
      name: collectorforopenshift-nvidia-smi
      labels:
        app: collectorforopenshift-nvidia-smi
      annotations:
        collectord.io/logs-joinpartial: 'false'
        collectord.io/logs-joinmultiline: 'false'
        # remove headers
        collectord.io/logs-replace.1-search: '^#.*$'
        collectord.io/logs-replace.1-val: ''
        # trim spaces from both sides
        collectord.io/logs-replace.2-search: '(^\s+)|(\s+$)'
        collectord.io/logs-replace.2-val: ''
        # make a CSV from console presented line
        collectord.io/logs-replace.3-search: '\s+'
        collectord.io/logs-replace.3-val: ','
        # empty values '-' replace with empty values
        collectord.io/logs-replace.4-search: '-'
        collectord.io/logs-replace.4-val: ''
        # nothing to report from pmon - just ignore the line
        collectord.io/pmon--logs-replace.0-search: '^\s+\d+(\s+-)+\s*$'
        collectord.io/pmon--logs-replace.0-val: ''
        # set log source types
        collectord.io/pmon--logs-type: openshift_gpu_nvidia_pmon
        collectord.io/dmon--logs-type: openshift_gpu_nvidia_dmon
    spec:
      # Make sure to attach matching label to the GPU node
      # $ oc label nodes <gpu-node-name> hardware-type=NVIDIAGPU
      # nodeSelector:
      #   hardware-type: NVIDIAGPU  
      hostPID: true
      containers:
      - name: pmon
        image: nvidia/cuda:latest
        args:
          - "bash"
          - "-c"
          - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi pmon -s um --count 1 --id {}'; sleep 30 ;done"
      - name: dmon
        image: nvidia/cuda:latest
        args:
          - "bash"
          - "-c"
          - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi dmon -s pucvmet --count 1 --id {}'; sleep 30 ;done"

Apply this DaemonSet to your cluster with

oc apply -f nvidia-smi.yaml
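
Each container in the DaemonSet runs a shell loop that lists the GPUs, extracts their UUIDs, groups them four at a time, and passes each group to nvidia-smi as a comma-separated --id value. The Python sketch below mirrors that text processing step by step. The function name, the product name, and the UUIDs are invented for illustration, and the input only imitates the layout of nvidia-smi --list-gpus output.

```python
def id_batches(list_gpus_output, batch_size=4):
    """Mimic: cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g'."""
    uuids = []
    for line in list_gpus_output.splitlines():
        if not line.strip():
            continue
        third_field = line.split(":")[2]   # cut -d':' -f 3
        uuids.append(third_field[1:41])    # cut -c2-41 ("GPU-" + 36 characters)
    # xargs -L4 groups four lines per call; sed joins each group with commas.
    return [
        ",".join(uuids[i:i + batch_size])
        for i in range(0, len(uuids), batch_size)
    ]


# Five made-up GPUs in the layout "GPU <n>: <name> (UUID: GPU-<uuid>)".
sample = "\n".join(
    f"GPU {i}: Tesla V100 (UUID: GPU-00000000-0000-0000-0000-{i:012d})"
    for i in range(5)
)

for batch in id_batches(sample):
    print(batch)
```

With five GPUs the sketch produces two batches, so nvidia-smi would be invoked twice per 30-second cycle: once with four IDs and once with one.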

Splunk configuration

Configuration

In Splunk, add the following to props.conf. Feel free to add it to $SPLUNK_HOME/etc/apps/monitoringopenshift/local/props.conf, as we will ship it with the application in the next update.

# nvidia gpu
[openshift_gpu_nvidia_pmon]
SEGMENTATION = none
REPORT-extract_openshift_gpu_nvidia_pmon = openshift_gpu_nvidia_pmon_fields

[openshift_gpu_nvidia_dmon]
SEGMENTATION = none
REPORT-extract_openshift_gpu_nvidia_dmon = openshift_gpu_nvidia_dmon_fields

In Splunk, add the following to transforms.conf. Feel free to add it to $SPLUNK_HOME/etc/apps/monitoringopenshift/local/transforms.conf, as we will ship it with the application in the next update.

[openshift_gpu_nvidia_dmon_fields]
DELIMS = ","
FIELDS = gpu_idx, pwr_w, gtemp_c, mtemp_c, sm_percent, mem_percent, enc_percent, dec_percent, mclk_mhz, pclk_mhz, pviol_percent, tviol_bool, fb_mb, bar1_mb, sbecc_errs, dbecc_errs, pci_errs, rxpci_mbps, txpci_mbps

[openshift_gpu_nvidia_pmon_fields]
DELIMS = ","
FIELDS = gpu_idx, pid, type_c_g, sm_percent, mem_percent, enc_percent, dec_percent, fb_mb, command_name
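
As a rough illustration of what these stanzas ask Splunk to do, the Python sketch below pairs the comma-separated positions of one event with the names listed in FIELDS. It is only a model of delimiter-based extraction, not Splunk's implementation. The event values and the function name are invented, and the sample avoids empty values because how Splunk treats them is not covered here.

```python
# Field list copied from the [openshift_gpu_nvidia_pmon_fields] stanza.
PMON_FIELDS = (
    "gpu_idx, pid, type_c_g, sm_percent, mem_percent, "
    "enc_percent, dec_percent, fb_mb, command_name"
)


def extract(event, fields=PMON_FIELDS, delim=","):
    """Pair each delimited value with the field name at the same position."""
    names = [name.strip() for name in fields.split(",")]
    values = event.split(delim)
    return dict(zip(names, values))


# Invented pmon event after the CSV conversion done by the annotations.
event = "0,4211,C,87,41,0,0,10240,python"

record = extract(event)
print(record["pid"], record["sm_percent"], record["command_name"])
```

The dashboard searches rely on these names: for example, the process table groups by gpu_idx and pid and reports the latest sm_percent and mem_percent.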

Dashboard

Create a dashboard named nvidia_gpu.xml, add the following content, and save it.

<form hideEdit="true">
  <label>Nvidia (GPU)</label>
  <init>
    <set token="gpu_idx_search"></set>
    <set token="host_search"></set>
  </init>
  <fieldset submitButton="false">
    <input type="time" token="period">
      <label>Period</label>
      <default>
        <earliest>-60m@m</earliest>
        <latest>now</latest>
      </default>
    </input>
    <input type="dropdown" token="refresh">
      <label>Auto-Refresh</label>
      <choice value="0">Off</choice>
      <choice value="30">30s</choice>
      <choice value="60">1m</choice>
      <choice value="300">5m</choice>
      <choice value="600">10m</choice>
      <choice value="1800">30m</choice>
      <default>0</default>
    </input>
    <input type="multiselect" token="openshift_cluster_eval">
      <label>Cluster</label>
      <choice value="*">All</choice>
      <fieldForLabel>openshift_cluster_eval</fieldForLabel>
      <fieldForValue>openshift_cluster_eval</fieldForValue>
      <default>*</default>
      <search>
        <query>`macro_openshift_logs` sourcetype=openshift_gpu_nvidia_pmon* | stats count by openshift_cluster_eval | sort openshift_cluster_eval</query>
        <earliest>$period.earliest$</earliest>
        <latest>$period.latest$</latest>
        <sampleRatio>1</sampleRatio>
        <refresh>$form.refresh$</refresh>
      </search>
      <allowCustomValues>true</allowCustomValues>
      <delimiter> OR </delimiter>
      <valuePrefix>openshift_cluster_eval="</valuePrefix>
      <valueSuffix>"</valueSuffix>
      <change>
        <condition label="All">
          <set token="openshift_cluster_eval"></set>
        </condition>
      </change>
    </input>
    <input type="multiselect" token="openshift_node_label">
      <label>Node Label</label>
      <choice value="*">All</choice>
      <fieldForLabel>openshift_node_labels</fieldForLabel>
      <fieldForValue>openshift_node_labels</fieldForValue>
      <default>*</default>
      <search>
        <query>`macro_openshift_logs` sourcetype=openshift_gpu_nvidia_pmon* | stats values(openshift_node_labels) as openshift_node_labels | mvexpand openshift_node_labels | sort openshift_node_labels</query>
        <earliest>$period.earliest$</earliest>
        <latest>$period.latest$</latest>
        <sampleRatio>1</sampleRatio>
        <refresh>$form.refresh$</refresh>
      </search>
      <allowCustomValues>true</allowCustomValues>
      <delimiter> OR </delimiter>
      <valuePrefix>openshift_node_labels="</valuePrefix>
      <valueSuffix>"</valueSuffix>
      <change>
        <condition label="All">
          <set token="openshift_node_label"></set>
        </condition>
      </change>
    </input>
  </fieldset>
  <row>
    <panel>
      <html>
        Theme: <a href="?theme=light">Light</a> | <a href="?theme=dark">Dark</a>
      </html>
    </panel>
  </row>
  <row>
    <panel>
      <title>GPU Monitoring</title>
      <table>
        <search>
          <query>`macro_openshift_logs` sourcetype="openshift_gpu_nvidia_dmon" $openshift_cluster_eval$ $openshift_node_label$ |
stats
  latest(pwr_w) as "Power Usage",
  latest(gtemp_c) as "Temperature",
  latest(sm_percent) as "SM",
  latest(mem_percent) as "Memory",
  latest(enc_percent) as "Encoder",
  latest(dec_percent) as "Decoder",
  latest(mclk_mhz) as "Mem Clocks",
  latest(pclk_mhz) as "Proc Clocks",
  latest(pviol_percent) as "Power Violations",
  latest(tviol_bool) as "Thermal Violations",
  latest(fb_mb) as "Frame Buffer",
  latest(bar1_mb) as "Bar1",
  latest(sbecc_errs) as "ECC errors (single bit)",
  latest(dbecc_errs) as "ECC errors (double bit)",
  latest(pci_errs) as "PCIe Replay errors",
  latest(rxpci_mbps) as "PCIe Rx Throughput",
  latest(txpci_mbps) as "PCIe Tx Throughput",
  latest(eval(floor((now()-_time)/60))) as "Last Seen"
by openshift_cluster_eval, host, gpu_idx |
eval "GPU Idx"=gpu_idx |
rename openshift_cluster_eval as "Cluster" |
sort "Last Seen" asc</query>
          <earliest>$period.earliest$</earliest>
          <latest>$period.latest$</latest>
          <sampleRatio>1</sampleRatio>
          <refresh>$form.refresh$</refresh>
          <refreshType>delay</refreshType>
        </search>
        <fields>["Last Seen","Cluster","host","GPU Idx","Power Usage","Temperature","SM","Memory","Encoder","Decoder", "Mem Clocks", "Proc Clocks", "Power Violations", "Thermal Violations", "Frame Buffer", "Bar1", "ECC errors (single bit)", "ECC errors (double bit)", "PCIe Replay errors", "PCIe Rx Throughput", "PCIe Tx Throughput"]</fields>
        <option name="count">50</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">row</option>
        <option name="percentagesRow">false</option>
        <option name="refresh.display">progressbar</option>
        <option name="rowNumbers">false</option>
        <option name="totalsRow">false</option>
        <option name="wrap">true</option>
        <format type="number" field="Power Usage">
          <option name="precision">0</option>
          <option name="unit">Watts</option>
        </format>
        <format type="number" field="Temperature">
          <option name="precision">0</option>
          <option name="unit">C</option>
        </format>
        <format type="number" field="SM">
          <option name="precision">0</option>
          <option name="unit">%</option>
        </format>
        <format type="number" field="Memory">
          <option name="precision">0</option>
          <option name="unit">%</option>
        </format>
        <format type="number" field="Encoder">
          <option name="precision">0</option>
          <option name="unit">%</option>
        </format>
        <format type="number" field="Decoder">
          <option name="precision">0</option>
          <option name="unit">%</option>
        </format>
        <format type="number" field="Mem Clocks">
          <option name="precision">0</option>
          <option name="unit">MHz</option>
        </format>
        <format type="number" field="Proc Clocks">
          <option name="precision">0</option>
          <option name="unit">MHz</option>
        </format>
        <format type="number" field="Power Violations">
          <option name="precision">0</option>
          <option name="unit">%</option>
        </format>
        <format type="number" field="Thermal Violations">
          <option name="precision">0</option>
        </format>
        <format type="number" field="Frame Buffer">
          <option name="precision">0</option>
          <option name="unit">MB</option>
        </format>
        <format type="number" field="Bar1">
          <option name="precision">0</option>
          <option name="unit">MB</option>
        </format>
        <format type="number" field="PCIe Rx Throughput">
          <option name="precision">0</option>
          <option name="unit">MB/s</option>
        </format>
        <format type="number" field="PCIe Tx Throughput">
          <option name="precision">0</option>
          <option name="unit">MB/s</option>
        </format>
        <format type="color" field="SM">
          <colorPalette type="list">[#53A051,#F8BE34,#F1813F,#DC4E41]</colorPalette>
          <scale type="threshold">80,90,100</scale>
        </format>
        <format type="color" field="Memory">
          <colorPalette type="list">[#53A051,#F8BE34,#F1813F,#DC4E41]</colorPalette>
          <scale type="threshold">80,90,100</scale>
        </format>
        <format type="color" field="Encoder">
          <colorPalette type="list">[#53A051,#F8BE34,#F1813F,#DC4E41]</colorPalette>
          <scale type="threshold">80,90,100</scale>
        </format>
        <format type="color" field="Decoder">
          <colorPalette type="list">[#53A051,#F8BE34,#F1813F,#DC4E41]</colorPalette>
          <scale type="threshold">80,90,100</scale>
        </format>
        <format type="color" field="Temperature">
          <colorPalette type="list">[#53A051,#F8BE34,#F1813F,#DC4E41]</colorPalette>
          <scale type="threshold">80,90,105</scale>
        </format>
        <format type="color" field="Last Seen">
          <colorPalette type="minMidMax" maxColor="#006D9C" minColor="#FFFFFF"></colorPalette>
          <scale type="minMidMax" maxValue="60" midValue="5" minValue="0"></scale>
        </format>
        <format type="number" field="Last Seen">
          <option name="precision">0</option>
          <option name="unit">min ago</option>
          <option name="useThousandSeparators">false</option>
        </format>
        <drilldown>
          <set token="gpu_idx">$row.gpu_idx$</set>
          <set token="gpu_idx_search">gpu_idx=$row.gpu_idx$</set>
          <set token="host">$row.host$</set>
          <set token="host_search">host=$row.host$</set>
        </drilldown>
      </table>
    </panel>
  </row>
  <row>
    <panel>
      <chart>
        <title>SM (%)</title>
        <search>
          <query>`macro_openshift_logs` sourcetype="openshift_gpu_nvidia_dmon" |
eval gpu_idx=(openshift_cluster_eval + " - " + host + " - " + gpu_idx) |
timechart avg(sm_percent) by gpu_idx limit=200</query>
          <earliest>$period.earliest$</earliest>
          <latest>$period.latest$</latest>
          <sampleRatio>1</sampleRatio>
          <refresh>$form.refresh$</refresh>
          <refreshType>delay</refreshType>
        </search>
        <option name="charting.axisLabelsX.majorLabelStyle.overflowMode">ellipsisNone</option>
        <option name="charting.axisLabelsX.majorLabelStyle.rotation">0</option>
        <option name="charting.axisTitleX.visibility">collapsed</option>
        <option name="charting.axisTitleY.visibility">collapsed</option>
        <option name="charting.axisTitleY2.visibility">visible</option>
        <option name="charting.axisX.abbreviation">none</option>
        <option name="charting.axisX.scale">linear</option>
        <option name="charting.axisY.abbreviation">none</option>
        <option name="charting.axisY.maximumNumber">100</option>
        <option name="charting.axisY.minimumNumber">0</option>
        <option name="charting.axisY.scale">linear</option>
        <option name="charting.axisY2.abbreviation">none</option>
        <option name="charting.axisY2.enabled">0</option>
        <option name="charting.axisY2.scale">inherit</option>
        <option name="charting.chart">line</option>
        <option name="charting.chart.bubbleMaximumSize">50</option>
        <option name="charting.chart.bubbleMinimumSize">10</option>
        <option name="charting.chart.bubbleSizeBy">area</option>
        <option name="charting.chart.nullValueMode">gaps</option>
        <option name="charting.chart.showDataLabels">none</option>
        <option name="charting.chart.sliceCollapsingThreshold">0.01</option>
        <option name="charting.chart.stackMode">default</option>
        <option name="charting.chart.style">shiny</option>
        <option name="charting.drilldown">none</option>
        <option name="charting.layout.splitSeries">0</option>
        <option name="charting.layout.splitSeries.allowIndependentYRanges">0</option>
        <option name="charting.legend.labelStyle.overflowMode">ellipsisMiddle</option>
        <option name="charting.legend.mode">standard</option>
        <option name="charting.legend.placement">right</option>
        <option name="charting.lineWidth">2</option>
        <option name="refresh.display">progressbar</option>
        <option name="trellis.enabled">0</option>
        <option name="trellis.scales.shared">1</option>
        <option name="trellis.size">medium</option>
      </chart>
    </panel>
    <panel>
      <chart>
        <title>Memory (%)</title>
        <search>
          <query>`macro_openshift_logs` sourcetype="openshift_gpu_nvidia_dmon"  |
eval gpu_idx=(openshift_cluster_eval + " - " + host + " - " + gpu_idx) |
timechart avg(mem_percent) by gpu_idx limit=200</query>
          <earliest>$period.earliest$</earliest>
          <latest>$period.latest$</latest>
          <sampleRatio>1</sampleRatio>
          <refresh>$form.refresh$</refresh>
          <refreshType>delay</refreshType>
        </search>
        <option name="charting.axisLabelsX.majorLabelStyle.overflowMode">ellipsisNone</option>
        <option name="charting.axisLabelsX.majorLabelStyle.rotation">0</option>
        <option name="charting.axisTitleX.visibility">collapsed</option>
        <option name="charting.axisTitleY.visibility">collapsed</option>
        <option name="charting.axisTitleY2.visibility">visible</option>
        <option name="charting.axisX.abbreviation">none</option>
        <option name="charting.axisX.scale">linear</option>
        <option name="charting.axisY.abbreviation">none</option>
        <option name="charting.axisY.maximumNumber">100</option>
        <option name="charting.axisY.minimumNumber">0</option>
        <option name="charting.axisY.scale">linear</option>
        <option name="charting.axisY2.abbreviation">none</option>
        <option name="charting.axisY2.enabled">0</option>
        <option name="charting.axisY2.scale">inherit</option>
        <option name="charting.chart">line</option>
        <option name="charting.chart.bubbleMaximumSize">50</option>
        <option name="charting.chart.bubbleMinimumSize">10</option>
        <option name="charting.chart.bubbleSizeBy">area</option>
        <option name="charting.chart.nullValueMode">gaps</option>
        <option name="charting.chart.showDataLabels">none</option>
        <option name="charting.chart.sliceCollapsingThreshold">0.01</option>
        <option name="charting.chart.stackMode">default</option>
        <option name="charting.chart.style">shiny</option>
        <option name="charting.drilldown">none</option>
        <option name="charting.layout.splitSeries">0</option>
        <option name="charting.layout.splitSeries.allowIndependentYRanges">0</option>
        <option name="charting.legend.labelStyle.overflowMode">ellipsisMiddle</option>
        <option name="charting.legend.mode">standard</option>
        <option name="charting.legend.placement">right</option>
        <option name="charting.lineWidth">2</option>
        <option name="refresh.display">progressbar</option>
        <option name="trellis.enabled">0</option>
        <option name="trellis.scales.shared">1</option>
        <option name="trellis.size">medium</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <title>Process Monitoring</title>
      <table>
        <search>
          <query>`macro_openshift_logs` sourcetype="openshift_gpu_nvidia_pmon" $openshift_cluster_eval$ $openshift_node_label$  |
stats
  latest(type_c_g) as "Type (C/G)",
  latest(sm_percent) as "SM",
  latest(mem_percent) as "Memory",
  latest(enc_percent) as "Encoder",
  latest(dec_percent) as "Decoder",
  latest(command_name) as "Command Name",
  latest(eval(floor((now()-_time)/60))) as "Last Seen",
  latest(_time) as _time
by openshift_cluster_eval, host, gpu_idx, pid |
rename gpu_idx as "GPU Idx" |
rename openshift_cluster_eval as "Cluster" |
sort "Last Seen" asc |
join type=left host pid [
  search `macro_openshift_proc_stats` $openshift_cluster_eval$ $openshift_node_label$ openshift_pod_name::*
  [search `macro_openshift_logs` sourcetype="openshift_gpu_nvidia_pmon" $openshift_cluster_eval$ $openshift_node_label$ | stats count by host | fields host] |
  stats
    latest(eval(_time - proc__p__uptime_seconds)) as starttime,
    latest(eval(_time + 60)) as endtime,
    latest(openshift_pod_name) as Pod,
    latest(openshift_namespace) as Namespace,
    latest(openshift_container_name) as Container,
    latest(proc__p__stat__pid) as pid
  by openshift_proc_stats_unique_id, host
] |
where _time&gt;starttime and _time&lt;endtime |
table "Last Seen", "Cluster", host, "GPU Idx", Namespace, Pod, Container, pid, "Command Name", "Type (C/G)", "SM", "Memory", "Encoder",  "Decoder"</query>
          <earliest>$period.earliest$</earliest>
          <latest>$period.latest$</latest>
          <sampleRatio>1</sampleRatio>
          <refresh>$form.refresh$</refresh>
          <refreshType>delay</refreshType>
        </search>
        <option name="count">50</option>
        <option name="dataOverlayMode">none</option>
        <option name="drilldown">none</option>
        <option name="percentagesRow">false</option>
        <option name="refresh.display">progressbar</option>
        <option name="rowNumbers">false</option>
        <option name="totalsRow">false</option>
        <option name="wrap">true</option>
        <format type="number" field="SM">
          <option name="precision">0</option>
          <option name="unit">%</option>
        </format>
        <format type="number" field="Memory">
          <option name="precision">0</option>
          <option name="unit">%</option>
        </format>
        <format type="number" field="Encoder">
          <option name="precision">0</option>
          <option name="unit">%</option>
        </format>
        <format type="number" field="Decoder">
          <option name="precision">0</option>
          <option name="unit">%</option>
        </format>
        <format type="color" field="SM">
          <colorPalette type="list">[#53A051,#F8BE34,#F1813F,#DC4E41]</colorPalette>
          <scale type="threshold">80,90,100</scale>
        </format>
        <format type="color" field="Memory">
          <colorPalette type="list">[#53A051,#F8BE34,#F1813F,#DC4E41]</colorPalette>
          <scale type="threshold">80,90,100</scale>
        </format>
        <format type="color" field="Encoder">
          <colorPalette type="list">[#53A051,#F8BE34,#F1813F,#DC4E41]</colorPalette>
          <scale type="threshold">80,90,100</scale>
        </format>
        <format type="color" field="Decoder">
          <colorPalette type="list">[#53A051,#F8BE34,#F1813F,#DC4E41]</colorPalette>
          <scale type="threshold">80,90,100</scale>
        </format>
        <format type="color" field="Last Seen">
          <colorPalette type="minMidMax" maxColor="#006D9C" minColor="#FFFFFF"></colorPalette>
          <scale type="minMidMax" maxValue="60" midValue="5" minValue="0"></scale>
        </format>
        <format type="number" field="Last Seen">
          <option name="precision">0</option>
          <option name="unit">min ago</option>
          <option name="useThousandSeparators">false</option>
        </format>
        <fields>["Last Seen","Cluster","host","GPU Idx", "Namespace", "Pod", "Container", "pid","Type (C/G)","SM","Memory","Encoder","Decoder","Command Name"]</fields>
      </table>
    </panel>
  </row>
  <row depends="$gpu_idx$,$host$">
    <panel>
      <html>
      <h3>Processes Running on host=$host$ and gpu=$gpu_idx$</h3>
    </html>
    </panel>
  </row>
  <row depends="$gpu_idx$,$host$">
    <panel>
      <title>SM (%) per process for host=$host$ gpu_idx=$gpu_idx$</title>
      <chart>
        <search>
          <query>`macro_openshift_logs` sourcetype="openshift_gpu_nvidia_pmon"  $host_search$ | where $gpu_idx_search$ |
eval process=(openshift_cluster_eval + " - " + host + " - " + gpu_idx + " - " + pid + " - " + command_name) |
timechart avg(sm_percent) as "SM" by process limit=200</query>
          <earliest>$period.earliest$</earliest>
          <latest>$period.latest$</latest>
          <sampleRatio>1</sampleRatio>
          <refresh>$form.refresh$</refresh>
          <refreshType>delay</refreshType>
        </search>
        <option name="charting.axisLabelsX.majorLabelStyle.overflowMode">ellipsisNone</option>
        <option name="charting.axisLabelsX.majorLabelStyle.rotation">0</option>
        <option name="charting.axisTitleX.visibility">collapsed</option>
        <option name="charting.axisTitleY.visibility">collapsed</option>
        <option name="charting.axisTitleY2.visibility">visible</option>
        <option name="charting.axisX.abbreviation">none</option>
        <option name="charting.axisX.scale">linear</option>
        <option name="charting.axisY.abbreviation">none</option>
        <option name="charting.axisY.maximumNumber">100</option>
        <option name="charting.axisY.minimumNumber">0</option>
        <option name="charting.axisY.scale">linear</option>
        <option name="charting.axisY2.abbreviation">none</option>
        <option name="charting.axisY2.enabled">0</option>
        <option name="charting.axisY2.scale">inherit</option>
        <option name="charting.chart">line</option>
        <option name="charting.chart.bubbleMaximumSize">50</option>
        <option name="charting.chart.bubbleMinimumSize">10</option>
        <option name="charting.chart.bubbleSizeBy">area</option>
        <option name="charting.chart.nullValueMode">gaps</option>
        <option name="charting.chart.showDataLabels">none</option>
        <option name="charting.chart.sliceCollapsingThreshold">0.01</option>
        <option name="charting.chart.stackMode">default</option>
        <option name="charting.chart.style">shiny</option>
        <option name="charting.drilldown">none</option>
        <option name="charting.layout.splitSeries">0</option>
        <option name="charting.layout.splitSeries.allowIndependentYRanges">0</option>
        <option name="charting.legend.labelStyle.overflowMode">ellipsisMiddle</option>
        <option name="charting.legend.mode">standard</option>
        <option name="charting.legend.placement">right</option>
        <option name="charting.lineWidth">2</option>
        <option name="refresh.display">progressbar</option>
        <option name="trellis.enabled">0</option>
        <option name="trellis.scales.shared">1</option>
        <option name="trellis.size">medium</option>
      </chart>
    </panel>
  </row>
  <row depends="$gpu_idx$,$host$">
    <panel>
      <title>Memory (%) per process for host=$host$ gpu_idx=$gpu_idx$</title>
      <chart>
        <search>
          <query>`macro_openshift_logs` sourcetype="openshift_gpu_nvidia_pmon"  $host_search$ | where $gpu_idx_search$ |
eval process=(openshift_cluster_eval + " - " + host + " - " + gpu_idx + " - " + pid + " - " + command_name) |
timechart avg(mem_percent) as "Mem" by process limit=200</query>
          <earliest>$period.earliest$</earliest>
          <latest>$period.latest$</latest>
          <sampleRatio>1</sampleRatio>
          <refresh>$form.refresh$</refresh>
          <refreshType>delay</refreshType>
        </search>
        <option name="charting.axisLabelsX.majorLabelStyle.overflowMode">ellipsisNone</option>
        <option name="charting.axisLabelsX.majorLabelStyle.rotation">0</option>
        <option name="charting.axisTitleX.visibility">collapsed</option>
        <option name="charting.axisTitleY.visibility">collapsed</option>
        <option name="charting.axisTitleY2.visibility">visible</option>
        <option name="charting.axisX.abbreviation">none</option>
        <option name="charting.axisX.scale">linear</option>
        <option name="charting.axisY.abbreviation">none</option>
        <option name="charting.axisY.maximumNumber">100</option>
        <option name="charting.axisY.minimumNumber">0</option>
        <option name="charting.axisY.scale">linear</option>
        <option name="charting.axisY2.abbreviation">none</option>
        <option name="charting.axisY2.enabled">0</option>
        <option name="charting.axisY2.scale">inherit</option>
        <option name="charting.chart">line</option>
        <option name="charting.chart.bubbleMaximumSize">50</option>
        <option name="charting.chart.bubbleMinimumSize">10</option>
        <option name="charting.chart.bubbleSizeBy">area</option>
        <option name="charting.chart.nullValueMode">gaps</option>
        <option name="charting.chart.showDataLabels">none</option>
        <option name="charting.chart.sliceCollapsingThreshold">0.01</option>
        <option name="charting.chart.stackMode">default</option>
        <option name="charting.chart.style">shiny</option>
        <option name="charting.drilldown">none</option>
        <option name="charting.layout.splitSeries">0</option>
        <option name="charting.layout.splitSeries.allowIndependentYRanges">0</option>
        <option name="charting.legend.labelStyle.overflowMode">ellipsisMiddle</option>
        <option name="charting.legend.mode">standard</option>
        <option name="charting.legend.placement">right</option>
        <option name="charting.lineWidth">2</option>
        <option name="refresh.display">progressbar</option>
        <option name="trellis.enabled">0</option>
        <option name="trellis.scales.shared">1</option>
        <option name="trellis.size">medium</option>
      </chart>
    </panel>
  </row>
</form>

If you have all the macros and data collection configured, you should see the data within the dashboard.
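
The Process Monitoring panel matches each GPU process to a container by joining on host and pid, and then keeps the row only when the GPU sample falls inside the lifetime of that process. The start time is derived from the reported uptime, and the end time is the last process sample plus 60 seconds. A minimal Python sketch of that time-window check, with a hypothetical function name and invented timestamps, looks like this:

```python
def in_lifetime(pmon_time, proc_seen_time, uptime_seconds, grace_seconds=60):
    """Mirror the dashboard filter: _time > starttime and _time < endtime."""
    starttime = proc_seen_time - uptime_seconds   # _time - proc__p__uptime_seconds
    endtime = proc_seen_time + grace_seconds      # _time + 60
    return starttime < pmon_time < endtime


# Invented epoch-style timestamps, in seconds.
# The process was last seen at t=1010 with 300s of uptime, so it started at t=710.
print(in_lifetime(pmon_time=1000, proc_seen_time=1010, uptime_seconds=300))
print(in_lifetime(pmon_time=500, proc_seen_time=1010, uptime_seconds=300))
```

The window guards against pid reuse: a GPU sample taken before the container process started, or well after it was last seen, is not attributed to that pod.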

NVIDIA (GPU)


About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications, which give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing solutions for Linux and Windows containers that are easy to use and deploy. We deliver applications that help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution to help you keep all the metrics and logs in one place, allowing you to quickly address complex questions on container performance.