Alerts

Predefined alerts

Available since version 5.2

The Monitoring OpenShift application includes predefined alerts that help you monitor the health of your clusters and the performance of your containers.

Alerts

Monitoring OpenShift: Collector License Expiration (less than 14 days)

One or more collectors use a license that expires in less than 14 days.

Monitoring OpenShift: Collector Failed License Checks

One or more collectors are constantly failing to check the license.

Monitoring OpenShift: Collector outdated

One or more collectors are outdated.

Monitoring OpenShift: Collector license overuse

You are exceeding the number of running collectors allowed by the license. Contact sales@outcoldsolutions.com.

Monitoring OpenShift: Cluster Critical: Kubernetes API is down

Collector has not published metrics for one of the Kubernetes API Servers. Possible missing Kubernetes API Server.

Monitoring OpenShift: Cluster Critical: Controller Manager is down

Collector has not published metrics for one of the Controller Managers. Possible missing Controller Manager on Master nodes.

Monitoring OpenShift: Cluster Critical: Kubelet is down

Collector has not published metrics for one of the Kubelets. Possible missing Node in the cluster.

Monitoring OpenShift: Cluster Critical: etcd member is down

Collector has not published metrics for one of the etcd members. Possible missing etcd member in the cluster.

Monitoring OpenShift: Events: Constant Warning

Cluster reports the same warning more than 3 times.

Monitoring OpenShift: Cluster Info: mismatched versions

Mismatched build versions for the server components.

Monitoring OpenShift: Cluster Info: mismatched kubelet versions

Mismatched build versions for the kubelets.

Monitoring OpenShift: Cluster Warning: high number of errors to Kubernetes API

Kubelet experiences a high number of errors (more than 1%) on requests to the API Server.

Monitoring OpenShift: Cluster Warning: pods capacity on node

Node has too many pods: above 90% of its capacity.

Monitoring OpenShift: Cluster Warning: Kubernetes API Latency

The API Server has a 99th percentile latency above 1 second

Monitoring OpenShift: Cluster Critical: Kubernetes API High Number of 5xx

More than 5% of the API Server responses are errors (5xx).
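
To make the percentage threshold concrete, here is a minimal sketch of the arithmetic behind a "more than 5% of 5xx" check. It is not the search shipped with the application; the function names and the sample counts are hypothetical and used only for illustration.

```python
def five_xx_ratio(status_counts):
    """Return the share of 5xx responses in a {status_code: count} mapping."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0  # no traffic, nothing to alert on
    errors = sum(
        count for code, count in status_counts.items()
        if 500 <= int(code) <= 599
    )
    return errors / total


def exceeds_threshold(status_counts, threshold=0.05):
    """True when the 5xx share is strictly above the threshold (5% by default)."""
    return five_xx_ratio(status_counts) > threshold


# Hypothetical request counts per HTTP status code.
sample = {200: 900, 404: 40, 500: 35, 503: 25}
print(five_xx_ratio(sample), exceeds_threshold(sample))
```

The comparison is strict, so a ratio of exactly 5% does not trigger in this sketch, which matches the "more than" wording of the alert description.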

Monitoring OpenShift: Cluster Warning: Kubernetes API certificate expires

Kubernetes API certificate expires in less than 7 days.

Monitoring OpenShift: Cluster Critical: etcd does not have a leader

etcd cluster does not have a leader.

Monitoring OpenShift: Cluster Warning: etcd frequent leader change

etcd changed its leader more than 3 times in the last hour.

Monitoring OpenShift: Cluster Warning: high amount of GRPC errors

High number of gRPC errors in the etcd cluster.

Monitoring OpenShift: Cluster Warning: etcd member communication is slow

etcd instance member communication is slow

Monitoring OpenShift: Cluster Warning: etcd hight number of failed proposals

etcd has a high number of failed proposals.

Monitoring OpenShift: Cluster Warning: etcd member fsync is slow

etcd member fsync is slow

Monitoring OpenShift: Cluster Warning: etcd member commit durations are slow

etcd member commit durations are slow

Monitoring OpenShift: Cluster Warning: etcd member fd usage is high

etcd member uses more than 80% of max fds

Monitoring OpenShift: Cluster Warning: unhealthy nodes

Controller reports one or more unhealthy nodes.

Monitoring OpenShift: Cluster Warning: kubelet runtime disk space is low

Node has less than 20% of available space for kubelet runtime

Monitoring OpenShift: Cluster Warning: Persistent Volume Claim space is low

Persistent Volume Claim has less than 20% of available space

Monitoring OpenShift: Cluster Warning: high host memory usage

Host memory usage is above 85%.

Monitoring OpenShift: Cluster Warning: high host CPU usage

OpenShift host uses more than 90% of CPU on average for the last 5 minutes

Monitoring OpenShift: Cluster Warning: high container memory usage

Container uses more than 85% of memory limit

Monitoring OpenShift: Cluster Warning: container cpu is throttled

Container is being throttled for more than 20% of CPU.

Monitoring OpenShift: Warning: collectord reports errors in one or more pipelines

Collectord reports errors in one or more pipelines

Monitoring OpenShift: Warning: collectord has WARN or ERROR logs

Collectord reports warnings or errors

Monitoring OpenShift: Warning: Increasing lag between event time and indexing time in container logs

Increasing lag between event time and indexing time in container logs

Monitoring OpenShift: Warning: Node reservation of memory is above 90 percent

Node reservation of memory is above 90 percent

Monitoring OpenShift: Warning: Node reservation of cpu is above 90 percent

Node reservation of cpu is above 90 percent

Monitoring OpenShift: Collectord diagnostics

Monitors Collectord logs and triggers when one or more ALARMs are ON. The alarms are raised by the diagnostics:: enabled in the configuration.

Alert triggers

By default, triggered alerts are shown at the very top of the Overview page. We populate this table using the REST call /alerts/fired_alerts/.
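
If you want to read the same data outside of the dashboard, the sketch below shows one way to query the fired alerts endpoint of the Splunk REST API and keep only the alerts of this application. It rests on several assumptions that the text above does not confirm: the management URL and token are placeholders, the `/services/alerts/fired_alerts` path and the `triggered_alert_count` field follow the Splunk REST API reference, and the `Monitoring OpenShift:` prefix filter is simply based on the alert names listed above.

```python
import json
import urllib.parse
import urllib.request

ALERT_PREFIX = "Monitoring OpenShift:"  # prefix shared by the predefined alerts


def fired_alerts_url(base_url):
    """Build the fired alerts URL, asking Splunk for a JSON response."""
    query = urllib.parse.urlencode({"output_mode": "json"})
    return f"{base_url.rstrip('/')}/services/alerts/fired_alerts?{query}"


def summarize(payload, prefix=ALERT_PREFIX):
    """Map alert name -> triggered count for entries matching the prefix."""
    summary = {}
    for entry in payload.get("entry", []):
        name = entry.get("name", "")
        if name.startswith(prefix):
            content = entry.get("content", {})
            summary[name] = int(content.get("triggered_alert_count", 0))
    return summary


def fetch_fired_alerts(base_url, token):
    """Call the endpoint with a bearer token (assumes token auth is enabled)."""
    request = urllib.request.Request(
        fired_alerts_url(base_url),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call could look like `summarize(fetch_fired_alerts("https://splunk.example.com:8089", token))`, where the host name is a placeholder for your own Splunk management endpoint.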

Alerts Example

Other triggers

You can find various alert actions on Splunkbase that integrate Splunk with messaging applications and incident management services.

After installing a new alert action, you can modify the existing alerts to add more triggers.
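
Alerts can be edited in the Splunk UI, but the same change can also be scripted. The sketch below only builds the request for the Splunk saved searches REST endpoint and does not send it. Treat the details as assumptions: the app name `monitoringopenshift`, the `nobody` owner and the e-mail address are hypothetical, while the `actions` and `action.<name>.<setting>` parameters follow the Splunk saved search settings.

```python
import urllib.parse


def merge_actions(existing, new_action):
    """Return a comma-separated action list with new_action appended once."""
    actions = [a.strip() for a in existing.split(",") if a.strip()]
    if new_action not in actions:
        actions.append(new_action)
    return ",".join(actions)


def build_alert_update(base_url, app, alert_name, existing_actions, action, settings):
    """Return (url, form_body) for a POST that adds an action to an alert."""
    path = "/servicesNS/nobody/{}/saved/searches/{}".format(
        urllib.parse.quote(app, safe=""),
        urllib.parse.quote(alert_name, safe=""),  # names contain ':' and spaces
    )
    fields = {"actions": merge_actions(existing_actions, action)}
    for key, value in settings.items():
        fields[f"action.{action}.{key}"] = value
    return base_url.rstrip("/") + path, urllib.parse.urlencode(fields)


url, body = build_alert_update(
    "https://splunk.example.com:8089",         # placeholder management URL
    "monitoringopenshift",                     # hypothetical app name
    "Monitoring OpenShift: Collector outdated",
    "",                                        # no actions configured yet
    "email",
    {"to": "oncall@example.com"},              # hypothetical recipient
)
print(url)
print(body)
```

Merging with the existing action list matters because the `actions` setting holds the full list of enabled actions, so posting only the new one would replace whatever was configured before.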

About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications that give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing solutions for Linux and Windows containers that are easy to use and deploy. Our applications help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution that keeps all the metrics and logs in one place, allowing you to quickly address complex questions on container performance.

Monitoring OpenShift - Version 5
