Outcold Solutions LLC

Monitoring Kubernetes - Version 5

Alerts

Predefined alerts

Available since version 5.2

The Monitoring Kubernetes application ships with predefined alerts that help you monitor the health of your clusters and the performance of your containers.

Alerts

Monitoring Kubernetes: Collector License Expiration (less than 14 days)

One or more collectors use a license that expires in less than 14 days.

Monitoring Kubernetes: Collector Failed License Checks

One or more collectors are constantly failing to check the license.

Monitoring Kubernetes: Collector outdated

One or more collectors are outdated.

Monitoring Kubernetes: Collector license overuse

You are exceeding the number of running collectors allowed by your license. Contact sales@outcoldsolutions.com.

Monitoring Kubernetes: Cluster Critical: Kubernetes API is down

Collector has not published metrics for one of the Kubernetes API Servers. Possible missing Kubernetes API Server.

Monitoring Kubernetes: Cluster Critical: Controller Manager is down

Collector has not published metrics for one of the Controller Managers. Possible missing Controller Manager on Master nodes.

Monitoring Kubernetes: Cluster Critical: Kubelet is down

Collector has not published metrics for one of the Kubelets. Possible missing Node in the cluster.

Monitoring Kubernetes: Cluster Critical: etcd member is down

Collector has not published metrics for one of the etcd members. Possible missing etcd member in the cluster.

Monitoring Kubernetes: Events: Constant Warning

The cluster reports the same warning more than 3 times.

Monitoring Kubernetes: Cluster Info: mismatched versions

Mismatched build versions for the server components.

Monitoring Kubernetes: Cluster Info: mismatched kubelet versions

Mismatched build versions for the kubelets.

Monitoring Kubernetes: Cluster Warning: high number of errors to Kubernetes API

Kubelet experiences a high number of errors (more than 1%) when calling the API Server.

Monitoring Kubernetes: Cluster Warning: pods capacity on node

Node has too many pods: above 90% of capacity.

Monitoring Kubernetes: Cluster Warning: Kubernetes API Latency

The API Server has a 99th percentile latency above 1 second

Monitoring Kubernetes: Cluster Critical: Kubernetes API High Number of 5xx

The API Server returned errors (5xx) for more than 5% of requests.

Monitoring Kubernetes: Cluster Warning: Kubernetes API certificate expires

Kubernetes API certificate expires in less than 7 days.

Monitoring Kubernetes: Cluster Critical: etcd does not have a leader

etcd cluster does not have a leader.

Monitoring Kubernetes: Cluster Warning: etcd frequent leader change

etcd changed leader more than 3 times in the last hour.

Monitoring Kubernetes: Cluster Warning: high amount of GRPC errors

High amount of GRPC errors in etcd cluster

Monitoring Kubernetes: Cluster Warning: etcd member communication is slow

etcd instance member communication is slow

Monitoring Kubernetes: Cluster Warning: etcd hight number of failed proposals

etcd has a high number of failed proposals.

Monitoring Kubernetes: Cluster Warning: etcd member fsync is slow

etcd member fsync is slow

Monitoring Kubernetes: Cluster Warning: etcd member commit durations are slow

etcd member commit durations are slow

Monitoring Kubernetes: Cluster Warning: etcd member fd usage is high

etcd member uses more than 80% of max fds

Monitoring Kubernetes: Cluster Warning: unhealthy nodes

Controller reports one or more unhealthy nodes.

Monitoring Kubernetes: Cluster Warning: kubelet runtime disk space is low

Node has less than 20% of available space for kubelet runtime

Monitoring Kubernetes: Cluster Warning: Persistent Volume Claim space is low

Persistent Volume Claim has less than 20% of available space

Monitoring Kubernetes: Cluster Warning: high host memory usage

High host memory usage: above 85%.

Monitoring Kubernetes: Cluster Warning: high host CPU usage

Kubernetes host uses more than 90% of CPU on average for the last 5 minutes

Monitoring Kubernetes: Cluster Warning: high container memory usage

Container uses more than 85% of memory limit

Monitoring Kubernetes: Cluster Warning: container cpu is throttled

Container is getting throttled for more than 20% of cpu

Monitoring Kubernetes: Warning: collectord reports errors in one or more pipelines

Collectord reports errors in one or more pipelines

Monitoring Kubernetes: Warning: collectord has WARN or ERROR logs

Collectord reports warnings or errors

Monitoring Kubernetes: Warning: Increasing lag between event time and indexing time in container logs

Increasing lag between event time and indexing time in container logs

Monitoring Kubernetes: Warning: Node reservation of memory is above 90 percent

Node reservation of memory is above 90 percent

Monitoring Kubernetes: Warning: Node reservation of cpu is above 90 percent

Node reservation of cpu is above 90 percent

Monitoring Kubernetes: Collectord diagnostics

Monitors Collectord logs and triggers when one or more ALARMs are ON. These alarms are raised by the diagnostics:: checks that are enabled in the configuration.

Alert triggers

By default, we show triggered alerts in a table at the very top of the Overview page. We populate this table using the REST call /alerts/fired_alerts/.

Alerts Example
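If you want the same list outside of the Overview page, the fired alerts can also be read from the Splunk REST API. The sketch below is a minimal illustration, not part of the application: the host name, the token, and the helper names are hypothetical, and the response fields (`entry`, `name`, `content.triggered_alert_count`) are assumed to match the JSON that Splunk returns for `/services/alerts/fired_alerts`. Verify them against your Splunk version.

```python
import json
import urllib.request
from urllib.parse import urlencode

# All predefined alerts on this page share this name prefix.
ALERT_PREFIX = "Monitoring Kubernetes:"


def fired_alerts_url(base_url, count=30):
    """Build the URL that lists fired alerts as JSON (management port assumed)."""
    query = urlencode({"output_mode": "json", "count": count})
    return f"{base_url.rstrip('/')}/services/alerts/fired_alerts?{query}"


def summarize_fired_alerts(payload, prefix=ALERT_PREFIX):
    """Return (alert name, triggered count) pairs for alerts with the given prefix."""
    data = json.loads(payload)
    return [
        (entry["name"], entry.get("content", {}).get("triggered_alert_count", 0))
        for entry in data.get("entry", [])
        if entry.get("name", "").startswith(prefix)
    ]


def fetch_fired_alerts(base_url, token):
    """Call the endpoint with a bearer token (hypothetical credentials)."""
    request = urllib.request.Request(
        fired_alerts_url(base_url),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return summarize_fired_alerts(response.read().decode("utf-8"))
```

For example, `fetch_fired_alerts("https://splunk.example.com:8089", token)` would return only the alerts whose names start with "Monitoring Kubernetes:", which is the naming convention used by every alert listed above.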

Other triggers

You can find various alert actions on Splunkbase that integrate Splunk with messaging applications and incident management services.

After installing a new alert action, you can modify the existing alerts to add more triggers.
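Alerts are usually edited in Splunk Web, but the same change can be scripted. The following sketch only builds the request for Splunk's `saved/searches` REST endpoint and does not send it. It rests on several assumptions: the app folder name `monitoringkubernetes`, the `nobody` owner, and the built-in `email` action standing in for whichever alert action you installed. The helper names are hypothetical, and the action parameter keys must be taken from the documentation of the alert action you use.

```python
from urllib.parse import quote, urlencode

# Assumed app folder name; check the name of the app in your Splunk installation.
APP = "monitoringkubernetes"


def saved_search_url(base_url, alert_name, owner="nobody", app=APP):
    """Build the saved/searches URL for one alert, escaping its name."""
    name = quote(alert_name, safe="")
    return f"{base_url.rstrip('/')}/servicesNS/{owner}/{app}/saved/searches/{name}"


def add_action_body(existing_actions, new_action, params=None):
    """Build a form-encoded POST body that appends an action to an alert.

    existing_actions is the alert's current comma-separated "actions" value;
    params holds full parameter keys defined by the alert action itself.
    """
    actions = [a.strip() for a in existing_actions.split(",") if a.strip()]
    if new_action not in actions:
        actions.append(new_action)
    fields = {"actions": ",".join(actions)}
    fields.update(params or {})
    return urlencode(fields)
```

A body built with `add_action_body("webhook", "email", {"action.email.to": "ops@example.com"})` keeps the existing `webhook` action and adds `email`. Posting it to `saved_search_url(...)` with valid credentials would update the alert.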

