Outcold Solutions LLC

Monitoring OpenShift - Version 5

Alerts

Predefined alerts

Available since version 5.2

The Monitoring OpenShift application includes predefined alerts that help you monitor the health of your clusters and the performance of your containers.

Alerts

Monitoring OpenShift: Collector License Expiration (less than 14 days)

One or more collectors use a license that expires in less than 14 days.

Monitoring OpenShift: Collector Failed License Checks

One or more collectors are constantly failing to check the license.

Monitoring OpenShift: Collector outdated

One or more collectors are outdated.

Monitoring OpenShift: Collector license overuse

You are exceeding the number of running collectors allowed by your license. Contact sales@outcoldsolutions.com.

Monitoring OpenShift: Cluster Critical: Kubernetes API is down

Collector has not published metrics for one of the Kubernetes API Servers. Possible missing Kubernetes API Server.

Monitoring OpenShift: Cluster Critical: Controller Manager is down

Collector has not published metrics for one of the Controller Managers. Possible missing Controller Manager on Master nodes.

Monitoring OpenShift: Cluster Critical: Kubelet is down

Collector has not published metrics for one of the Kubelets. Possible missing Node in the cluster.

Monitoring OpenShift: Cluster Critical: etcd member is down

Collector has not published metrics for one of the etcd members. Possible missing etcd member in the cluster.

Monitoring OpenShift: Events: Constant Warning

Cluster reports the same warnings more than 3 times

Monitoring OpenShift: Cluster Info: mismatched versions

Mismatched build versions for the server components.

Monitoring OpenShift: Cluster Info: mismatched kubelet versions

Mismatched build versions for the kubelets.

Monitoring OpenShift: Cluster Warning: high number of errors to Kubernetes API

Kubelet experiences a high number of errors (more than 1%) in requests to the API Server.

Monitoring OpenShift: Cluster Warning: pods capacity on node

Node has too many pods (above 90% of capacity).

Monitoring OpenShift: Cluster Warning: Kubernetes API Latency

The API Server has a 99th percentile latency above 1 second

Monitoring OpenShift: Cluster Critical: Kubernetes API High Number of 5xx

The API Server returned more than 5% of errors (5xx)

Monitoring OpenShift: Cluster Warning: Kubernetes API certificate expires

Kubernetes API certificate expires in less than 7 days.

Monitoring OpenShift: Cluster Critical: etcd does not have a leader

etcd cluster does not have a leader.

Monitoring OpenShift: Cluster Warning: etcd frequent leader change

etcd changed leader more than 3 times in the last hour.

Monitoring OpenShift: Cluster Warning: high amount of GRPC errors

High number of gRPC errors in the etcd cluster.

Monitoring OpenShift: Cluster Warning: etcd member communication is slow

etcd instance member communication is slow

Monitoring OpenShift: Cluster Warning: etcd hight number of failed proposals

etcd has a high number of failed proposals.

Monitoring OpenShift: Cluster Warning: etcd member fsync is slow

etcd member fsync is slow

Monitoring OpenShift: Cluster Warning: etcd member commit durations are slow

etcd member commit durations are slow

Monitoring OpenShift: Cluster Warning: etcd member fd usage is high

etcd member uses more than 80% of the maximum number of file descriptors.

Monitoring OpenShift: Cluster Warning: unhealthy nodes

Controller reports one or more unhealthy nodes.

Monitoring OpenShift: Cluster Warning: kubelet runtime disk space is low

Node has less than 20% of available space for kubelet runtime

Monitoring OpenShift: Cluster Warning: Persistent Volume Claim space is low

Persistent Volume Claim has less than 20% of available space

Monitoring OpenShift: Cluster Warning: high host memory usage

Host memory usage is above 85%.

Monitoring OpenShift: Cluster Warning: high host CPU usage

OpenShift host uses more than 90% of CPU on average for the last 5 minutes

Monitoring OpenShift: Cluster Warning: high container memory usage

Container uses more than 85% of memory limit

Monitoring OpenShift: Cluster Warning: container cpu is throttled

Container is getting throttled for more than 20% of cpu

Monitoring OpenShift: Warning: collectord reports errors in one or more pipelines

Collectord reports errors in one or more pipelines

Monitoring OpenShift: Warning: collectord has WARN or ERROR logs

Collectord reports warnings or errors

Monitoring OpenShift: Warning: Increasing lag between event time and indexing time in container logs

Increasing lag between event time and indexing time in container logs

Monitoring OpenShift: Warning: Node reservation of memory is above 90 percent

Node reservation of memory is above 90 percent

Monitoring OpenShift: Warning: Node reservation of cpu is above 90 percent

Node reservation of cpu is above 90 percent

Monitoring OpenShift: Collectord diagnostics

Monitors Collectord logs and triggers when one or more ALARMs are ON. These ALARMs are raised by the diagnostics:: checks enabled in the configuration.

Alert triggers

By default, triggered alerts are shown in a table at the very top of the Overview page. We populate this table using the REST call /alerts/fired_alerts/.
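To read the same data outside the dashboard, the endpoint can be queried directly. The following is a minimal sketch, not the application's own code. It assumes the endpoint is reachable under /services/alerts/fired_alerts on the Splunk management port, that output_mode=json is requested, and that each entry carries a triggered_alert_count field. The host name and the sample response are hypothetical.

```python
import json
from urllib.parse import urlencode


def fired_alerts_url(base_url: str) -> str:
    """Build the URL for the fired-alerts listing, asking for JSON output."""
    query = urlencode({"output_mode": "json"})
    return f"{base_url.rstrip('/')}/services/alerts/fired_alerts?{query}"


def summarize_fired_alerts(payload: str) -> dict:
    """Map each alert name to its triggered count from a JSON response body."""
    summary = {}
    for entry in json.loads(payload).get("entry", []):
        # Assumption: an entry named "-" aggregates all fired alerts; skip it.
        if entry.get("name") == "-":
            continue
        content = entry.get("content", {})
        summary[entry["name"]] = int(content.get("triggered_alert_count", 0))
    return summary


# Hypothetical response body, trimmed to the fields used above.
sample = json.dumps({
    "entry": [
        {"name": "-", "content": {"triggered_alert_count": 3}},
        {"name": "Monitoring OpenShift: Cluster Critical: Kubelet is down",
         "content": {"triggered_alert_count": 2}},
        {"name": "Monitoring OpenShift: Collector outdated",
         "content": {"triggered_alert_count": 1}},
    ]
})

print(fired_alerts_url("https://splunk.example.com:8089"))
print(summarize_fired_alerts(sample))
```

The sketch only builds the URL and parses a response. Sending the request requires authentication against your Splunk instance, which is left out here.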


Other triggers

You can find various alert actions on Splunkbase that integrate Splunk with messaging applications and incident management services.

After installing a new alert action, you can modify existing alerts to add more triggers.
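Alerts are usually edited in the Splunk UI, but the same change can be scripted through the saved searches REST endpoint. The sketch below only prepares the request and does not send it. It assumes the alerts are saved searches owned by nobody in an app whose folder name is monitoringopenshift (a hypothetical value, check your installation), and it uses the email action with a placeholder address as the example.

```python
from urllib.parse import quote, urlencode


def alert_action_request(base_url, app, alert_name, action, params):
    """Return (url, body) for a POST that enables one alert action on a saved search."""
    # Alert names contain spaces and colons, so percent-encode the path segment.
    url = (f"{base_url.rstrip('/')}/servicesNS/nobody/{quote(app, safe='')}"
           f"/saved/searches/{quote(alert_name, safe='')}")
    fields = {"actions": action}
    # Action-specific settings follow the "action.<name>.<setting>" naming.
    for key, value in params.items():
        fields[f"action.{action}.{key}"] = value
    return url, urlencode(fields)


url, body = alert_action_request(
    "https://splunk.example.com:8089",           # hypothetical management URL
    "monitoringopenshift",                       # assumed app folder name
    "Monitoring OpenShift: Collector outdated",  # alert name from the list above
    "email",
    {"to": "oncall@example.com"},                # placeholder recipient
)
print(url)
print(body)
```

Note that the actions field holds the full comma-separated list of enabled actions, so a real script should first read the current value and append to it rather than overwrite it.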


About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications that give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing solutions for Linux and Windows containers that are easy to use and deploy. We deliver applications that help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution that keeps all the metrics and logs in one place, allowing you to quickly address complex questions on container performance.