The five commands (Command reference) are ordinary SPL, so anything you can express in a search you can run, save, schedule, and alert on. This page collects searches by goal and shows how to turn them into alerts that watch your clusters live. The bundled dashboards package many of these as ready-made panels - this page is for building your own.
Incident triage
When something is wrong, these find it fast.
Pods that aren’t Running (excluding finished jobs):
| k8s kind=pods namespace=*
| spath path=status.phase output=phase
| where phase!="Running" AND phase!="Succeeded"
| table namespace name phase
Containers in CrashLoopBackOff:
| k8s kind=pods namespace=*
| spath path=status.containerStatuses{}.state.waiting.reason output=reason
| search reason=CrashLoopBackOff
| table namespace name reason
Image pull failures:
| k8sevents namespace=* type=Warning reason=Failed
| search message="*ImagePull*" OR message="*ErrImagePull*"
| table involved_namespace involved_name message
Recent warnings, grouped by reason:
| k8sevents namespace=* type=Warning
| stats count latest(message) as message by reason involved_kind
| sort - count
Health and readiness
Nodes that aren’t Ready:
| k8s kind=nodes
| spath path=status.conditions{} output=conditions
| mvexpand conditions
| spath input=conditions
| where type="Ready" AND status!="True"
| table name status reason
Pods stuck Pending (usually unschedulable):
| k8sevents namespace=* reason=FailedScheduling
| table involved_namespace involved_name message count
Deployments below their desired replica count:
| k8s kind=deployments namespace=*
| spath path=spec.replicas output=desired
| spath path=status.availableReplicas output=available
| eval available=coalesce(available, 0)
| where available < desired
| table namespace name desired available
Capacity and hygiene
Pods by total restart count, worst first:
| k8s kind=pods namespace=*
| spath path=status.containerStatuses{}.restartCount output=restarts
| mvexpand restarts
| stats sum(restarts) as restarts by namespace name
| where restarts > 10
| sort - restarts
The Resource Hygiene dashboard covers the rest of this ground - missing requests and limits, absent probes, and stale images - as ready-made panels.
Security and governance
Privileged containers:
| k8s kind=pods namespace=*
| spath path=spec.containers{}.securityContext.privileged output=privileged
| search privileged=true
| table namespace name
Pods mounting host paths:
| k8s kind=pods namespace=*
| spath path=spec.volumes{}.hostPath.path output=hostpaths
| where isnotnull(hostpaths)
| table namespace name hostpaths
For who-ran-what auditing of Kubernetes Search itself, see Access control - Auditing usage.
Across clusters
Fan out with context=* and aggregate by cluster.
Namespace and pod inventory per cluster:
| k8s kind=pods namespace=* context=* view=metadata
| stats dc(namespace) as namespaces count as pods by cluster
Kubernetes version skew across the fleet:
| k8s kind=nodes context=*
| spath path=status.nodeInfo.kubeletVersion output=kubelet
| stats values(kubelet) as versions by cluster
Alerting on live cluster state
Turn any of these into an alert: build the search, choose Save As > Alert, set a schedule, and trigger on the result count (for example, “number of results is greater than 0”). Because each run queries the API live, the alert evaluates the cluster’s current state every time it runs, which suits “something is wrong right now” conditions like a node going NotReady, pods stuck Pending, or a workload below its desired replicas.
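A search is alert-ready when it returns rows only while the problem exists. As a minimal sketch for the “pods stuck Pending” case, the following filters on status.phase the same way the triage searches above do. It is an illustration assembled from those searches, not a bundled alert.

```spl
| k8s kind=pods namespace=*
| spath path=status.phase output=phase
| where phase="Pending"
| table namespace name phase
```

Saved with the trigger “number of results is greater than 0”, this stays silent while no pod is Pending. Pods pass through Pending briefly during normal scheduling, so a short-lived hit is not necessarily a fault.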
Two things follow from how scheduled searches run:
- Each run is a snapshot. An alert sees the cluster at the moment it runs, not a window of history. You can alert on “a pod is CrashLooping now,” but not on “the error rate over the last hour” - that second kind is what ingestion (Monitoring Kubernetes) is for. The two complement each other; see the comparison.
- Use a shared cluster credential, not impersonation. A scheduled search runs without an interactive user, so it can’t impersonate one (see Access control). Point an alert at a cluster configured with a shared, least-privilege, read-only credential.
Mind the cache relative to your schedule: a result cached for 30 seconds won’t reflect a change from 10 seconds ago. For time-sensitive alerts add cache=0 so each run hits the API fresh, and weigh the load that adds on busy clusters (see Performance).
Example - alert when any node is not Ready. Use the node-readiness search above, save it as an alert, schedule it every few minutes, and trigger when the result count is greater than zero. The triggered alert lists exactly which nodes are affected, so the notification is actionable on its own.
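As a sketch, the alert search for this example might look like the following. It reuses the node-readiness search above unchanged and adds cache=0 so that every scheduled run reads the API fresh. Placing cache=0 next to kind=nodes as a k8s argument is an assumption here; check the Command reference for the exact syntax.

```spl
| k8s kind=nodes cache=0
| spath path=status.conditions{} output=conditions
| mvexpand conditions
| spath input=conditions
| where type="Ready" AND status!="True"
| table name status reason
```

A healthy cluster returns no rows, so the “greater than zero” trigger stays quiet. When a node drops out of Ready, its name, status, and reason appear in the triggered results.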