Kubernetes Search

Use cases

The five commands (see Command reference) are ordinary SPL commands, so any search you build with them can be run, saved, scheduled, and alerted on like any other. This page collects searches by goal and shows how to turn them into alerts that watch your clusters live. The bundled dashboards package many of these as ready-made panels - this page is for building your own.

Incident triage

When something is wrong, these searches find it fast.

Pods that aren’t Running (excluding finished jobs):

| k8s kind=pods namespace=*
| spath path=status.phase output=phase
| where phase!="Running" AND phase!="Succeeded"
| table namespace name phase

Containers in CrashLoopBackOff:

| k8s kind=pods namespace=*
| spath path=status.containerStatuses{}.state.waiting.reason output=reason
| search reason=CrashLoopBackOff
| table namespace name reason
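
To see which container is crash-looping rather than only which pod, the expand-then-parse pattern can be applied to the container statuses. This is a sketch, assuming the objects follow the standard Kubernetes pod status schema; the field names cs and container are illustrative, not part of the app:

```spl
| k8s kind=pods namespace=*
| spath path=status.containerStatuses{} output=cs
| mvexpand cs
| spath input=cs path=name output=container
| spath input=cs path=state.waiting.reason output=reason
| where reason="CrashLoopBackOff"
| table namespace name container reason
```

Each result row is one container, so a pod with two crash-looping containers appears twice.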

Image pull failures:

| k8sevents namespace=* type=Warning reason=Failed
| search message="*ImagePull*" OR message="*ErrImagePull*"
| table involved_namespace involved_name message

Recent warnings, grouped by reason:

| k8sevents namespace=* type=Warning
| stats count latest(message) as message by reason involved_kind
| sort - count

Health and readiness

Nodes that aren’t Ready:

| k8s kind=nodes
| spath path=status.conditions{} output=conditions
| mvexpand conditions
| spath input=conditions
| where type="Ready" AND status!="True"
| table name status reason

Pods stuck Pending (usually unschedulable):

| k8sevents namespace=* reason=FailedScheduling
| table involved_namespace involved_name message count

Deployments below their desired replica count:

| k8s kind=deployments namespace=*
| spath path=spec.replicas output=desired
| spath path=status.availableReplicas output=available
| eval available=coalesce(available, 0)
| where available < desired
| table namespace name desired available

Capacity and hygiene

Pods with more than 10 total restarts, worst first:

| k8s kind=pods namespace=*
| spath path=status.containerStatuses{}.restartCount output=restarts
| mvexpand restarts
| stats sum(restarts) as restarts by namespace name
| where restarts > 10
| sort - restarts

The Resource Hygiene dashboard covers the rest of this ground - missing requests and limits, absent probes, and stale images - as ready-made panels.

Security and governance

Privileged containers:

| k8s kind=pods namespace=*
| spath path=spec.containers{}.securityContext.privileged output=privileged
| search privileged=true
| table namespace name

Pods mounting host paths:

| k8s kind=pods namespace=*
| spath path=spec.volumes{}.hostPath.path output=hostpaths
| where isnotnull(hostpaths)
| table namespace name hostpaths

For who-ran-what auditing of Kubernetes Search itself, see Access control - Auditing usage.

Across clusters

Fan out with context=* and aggregate by cluster.

Namespace and pod inventory per cluster:

| k8s kind=pods namespace=* context=* view=metadata
| stats dc(namespace) as namespaces count as pods by cluster
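
The search above counts namespaces and pods. For node counts per cluster, a similar search over kind=nodes should work. This sketch assumes view=metadata is accepted for nodes the same way it is for pods:

```spl
| k8s kind=nodes context=* view=metadata
| stats count as nodes by cluster
```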

Kubernetes version skew across the fleet:

| k8s kind=nodes context=*
| spath path=status.nodeInfo.kubeletVersion output=kubelet
| stats values(kubelet) as versions by cluster
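
To list only the clusters whose nodes disagree with each other, a distinct count can be added and filtered on. This is a sketch built on the search above; distinct_versions is an illustrative field name:

```spl
| k8s kind=nodes context=*
| spath path=status.nodeInfo.kubeletVersion output=kubelet
| stats dc(kubelet) as distinct_versions values(kubelet) as versions by cluster
| where distinct_versions > 1
```

This flags skew within a cluster. Differences between clusters still need the unfiltered list above.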

Alerting on live cluster state

Turn any of these into an alert: build the search, choose Save As > Alert, set a schedule, and trigger on the result count (for example, “number of results is greater than 0”). Because each run queries the API live, the alert evaluates the cluster’s current state every time it runs - ideal for “something is wrong right now” conditions like a node going NotReady, pods stuck Pending, or a workload below its desired replicas.

Two things follow from how scheduled searches run:

  • Each run is a snapshot. An alert sees the cluster at the moment it runs, not a window of history. You can alert on “a pod is CrashLooping now,” but not on “the error rate over the last hour” - that second kind is what ingestion (Monitoring Kubernetes) is for. The two complement each other; see the comparison.
  • Use a shared cluster credential, not impersonation. A scheduled search runs without an interactive user, so it can’t impersonate one (see Access control). Point an alert at a cluster configured with a shared, least-privilege, read-only credential.

Mind the cache relative to your schedule: a result cached for 30 seconds won’t reflect a change from 10 seconds ago. For time-sensitive alerts add cache=0 so each run hits the API fresh, and weigh the load that adds on busy clusters (see Performance).

Example - alert when any node is not Ready. Use the node-readiness search above, save it as an alert, schedule it every few minutes, and trigger when the result count is greater than zero. The triggered alert lists exactly which nodes are affected, so the notification is actionable on its own.
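
The alert search for this example might look like the following. It is the node-readiness search with cache=0 added to the k8s command, on the assumption that this is where the option belongs; drop cache=0 if the extra API load outweighs freshness for your schedule:

```spl
| k8s kind=nodes cache=0
| spath path=status.conditions{} output=conditions
| mvexpand conditions
| spath input=conditions
| where type="Ready" AND status!="True"
| table name status reason
```

Saved with a trigger of “number of results is greater than 0”, it returns no rows while every node is Ready, so the alert stays quiet until a node reports a Ready status of False or Unknown.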