Predefined alerts

Available since version 5.2

The Monitoring Kubernetes application has predefined alerts that help to monitor the health of your clusters and performance of containers.

Monitoring Kubernetes: Collector License Expiration (less than 14 days)

One or more collectors use a license that expires in less than 14 days.

Monitoring Kubernetes: Collector Failed License Checks

One or more collectors are constantly failing to check the license.

Monitoring Kubernetes: Collector outdated

One or more collectors are outdated.

Monitoring Kubernetes: Collector license overuse

You are running more collectors than your license allows. Contact sales@outcoldsolutions.com.

Monitoring Kubernetes: Cluster Critical: Kubernetes API is down

Collector has not published metrics for one of the Kubernetes API Servers. Possible missing Kubernetes API Server.

Monitoring Kubernetes: Cluster Critical: Controller Manager is down

Collector has not published metrics for one of the Controller Managers. Possible missing Controller Manager on Master nodes.

Monitoring Kubernetes: Cluster Critical: Kubelet is down

Collector has not published metrics for one of the Kubelets. Possible missing node in the cluster.

Monitoring Kubernetes: Cluster Critical: etcd member is down

Collector has not published metrics for one of the etcd members. Possible missing etcd member in the cluster.

Monitoring Kubernetes: Events: Constant Warning

Cluster reports the same warnings more than 3 times.

Monitoring Kubernetes: Cluster Info: mismatched versions

Mismatched build versions for the server components.

Monitoring Kubernetes: Cluster Info: mismatched kubelet versions

Mismatched build versions for the kubelets.

Monitoring Kubernetes: Cluster Warning: high number of errors to Kubernetes API

Kubelet experiences a high number of errors (more than 1%) to API Server.

Monitoring Kubernetes: Cluster Warning: pods capacity on node

Node has too many pods (above 90% of capacity).

Monitoring Kubernetes: Cluster Warning: Kubernetes API Latency

The API Server has a 99th percentile latency above 1 second.

Monitoring Kubernetes: Cluster Critical: Kubernetes API High Number of 5xx

The API Server returned more than 5% of errors (5xx).

Monitoring Kubernetes: Cluster Warning: Kubernetes API certificate expires

Kubernetes API certificate expires in less than 7 days.

Monitoring Kubernetes: Cluster Critical: etcd does not have a leader

etcd cluster does not have a leader.

Monitoring Kubernetes: Cluster Warning: etcd frequent leader change

etcd changed leader more than 3 times in the last hour.

Monitoring Kubernetes: Cluster Warning: high amount of GRPC errors

High amount of GRPC errors in etcd cluster.

Monitoring Kubernetes: Cluster Warning: etcd member communication is slow

etcd instance member communication is slow.

Monitoring Kubernetes: Cluster Warning: etcd high number of failed proposals

etcd high number of failed proposals.

Monitoring Kubernetes: Cluster Warning: etcd member fsync is slow

etcd member fsync is slow.

Monitoring Kubernetes: Cluster Warning: etcd member commit durations are slow

etcd member commit durations are slow.

Monitoring Kubernetes: Cluster Warning: etcd member fd usage is high

etcd member uses more than 80% of max fds.

Monitoring Kubernetes: Cluster Warning: unhealthy nodes

Controller reports one or more unhealthy nodes.

Monitoring Kubernetes: Cluster Warning: kubelet runtime disk space is low

Node has less than 20% of available space for kubelet runtime.

Monitoring Kubernetes: Cluster Warning: Persistent Volume Claim space is low

Persistent Volume Claim has less than 20% of available space.

Monitoring Kubernetes: Cluster Warning: high host memory usage

Host memory usage is high (above 85%).

Monitoring Kubernetes: Cluster Warning: high host CPU usage

Kubernetes host uses more than 90% of CPU on average for the last 5 minutes.

Monitoring Kubernetes: Cluster Warning: high container memory usage

Container uses more than 85% of memory limit.

Monitoring Kubernetes: Cluster Warning: container cpu is throttled

Container is getting throttled for more than 20% of CPU.

Monitoring Kubernetes: Warning: collectord reports errors in one or more pipelines

Collectord reports errors in one or more pipelines.

Monitoring Kubernetes: Warning: collectord has WARN or ERROR logs

Collectord reports warnings or errors.

Monitoring Kubernetes: Warning: Increasing lag between event time and indexing time in container logs

Increasing lag between event time and indexing time in container logs.

Monitoring Kubernetes: Warning: Node reservation of memory is above 90 percent

Node reservation of memory is above 90 percent.

Monitoring Kubernetes: Warning: Node reservation of cpu is above 90 percent

Node reservation of CPU is above 90 percent.

Monitoring Kubernetes: Collectord diagnostics

Monitors Collectord logs and triggers when one or more ALARMs are ON. These alarms are raised by the diagnostics (diagnostics::) enabled in the Collectord configuration.

Alert triggers

By default, triggered alerts are shown at the very top of the Overview page. We populate this table using the REST call /alerts/fired_alerts/.
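If you want to read the same data outside of the application, the sketch below shows one way to do it in Python. It is an illustration, not the application's own code. It assumes the Splunk management endpoint /services/alerts/fired_alerts, the output_mode=json parameter, and a triggered_alert_count field in each entry. The helper names and the host name used in the usage note are hypothetical.

```python
import json
from urllib.parse import urlencode


def fired_alerts_url(base_url: str) -> str:
    """Build the fired-alerts listing URL, asking Splunk for JSON output.

    Assumption: base_url points at the Splunk management port.
    """
    query = urlencode({"output_mode": "json"})
    return f"{base_url.rstrip('/')}/services/alerts/fired_alerts?{query}"


def summarize_fired_alerts(payload: str, prefix: str = "Monitoring Kubernetes:") -> dict:
    """Return {alert name: triggered count} for alerts whose name starts with prefix.

    Assumption: each entry carries its count in content.triggered_alert_count.
    """
    summary = {}
    for entry in json.loads(payload).get("entry", []):
        name = entry.get("name", "")
        if name.startswith(prefix):
            content = entry.get("content", {})
            summary[name] = int(content.get("triggered_alert_count", 0))
    return summary
```

To use it, send an authenticated GET request to the URL returned by fired_alerts_url("https://splunk.example.com:8089") and pass the response body to summarize_fired_alerts. The prefix filter keeps only the alerts shipped with this application.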

Other triggers

You can find various alert actions on Splunkbase that integrate Splunk with messaging applications and incident management services.

After installing a new alert action, you can modify existing alerts to add more triggers.
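Alerts are usually modified in the Splunk Web UI, but the same change can be scripted. The sketch below only builds the request for Splunk's saved/searches REST endpoint and does not send it. The app name monitoringkubernetes, the nobody owner, the helper name, and the email settings are assumptions made for illustration. Check the names in your own installation before using it.

```python
from urllib.parse import quote, urlencode


def alert_update_request(base_url, app, alert_name, actions, settings):
    """Return (url, body) for a POST that sets the actions of a saved alert.

    Assumptions: the alert is shared at the app level (owner "nobody") and
    Splunk accepts a comma-separated "actions" field plus "action.*" settings.
    """
    url = "{}/servicesNS/nobody/{}/saved/searches/{}".format(
        base_url.rstrip("/"),
        quote(app, safe=""),
        quote(alert_name, safe=""),  # alert names contain spaces and colons
    )
    fields = {"actions": ",".join(actions)}
    fields.update(settings)
    return url, urlencode(fields).encode("utf-8")


url, body = alert_update_request(
    "https://splunk.example.com:8089",          # hypothetical host
    "monitoringkubernetes",                     # assumed app name
    "Monitoring Kubernetes: Collector outdated",
    ["email"],
    {"action.email.to": "oncall@example.com"},  # hypothetical recipient
)
```

Send the body with an authenticated POST to the returned URL. Other alert actions installed from Splunkbase define their own action names and settings, so consult each action's documentation for the exact fields.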
