Monitoring OpenShift

Troubleshooting

Verify configuration

Available since Collectord version 5.2

The first thing to do when something looks off is to run collectord verify from inside a Collectord pod. It checks the configuration end-to-end — license, Splunk output, container runtime, file inputs, Prometheus endpoints — and reports each item as OK or FAILED.

Start by listing the Collectord pods:

bash
$ oc get pods -n collectorforopenshift
NAME                                           READY     STATUS    RESTARTS   AGE
collectorforopenshift-addon-857fccb8b9-t9qgq   1/1       Running   1          1h
collectorforopenshift-master-bwmwr             1/1       Running   0          1h
collectorforopenshift-xbnaa                    1/1       Running   0          1h

Collectord runs as three workloads — a DaemonSet on master nodes (collectorforopenshift-master), a DaemonSet on the rest of the nodes (collectorforopenshift), and a single Deployment add-on (collectorforopenshift-addon). Run verify against one pod from each so every code path is exercised:

bash
$ oc exec -n collectorforopenshift collectorforopenshift-addon-857fccb8b9-t9qgq -- /collectord verify
$ oc exec -n collectorforopenshift collectorforopenshift-master-bwmwr -- /collectord verify
$ oc exec -n collectorforopenshift collectorforopenshift-xbnaa -- /collectord verify

Each command produces output similar to:

text
Version = 5.2.176
Build date = 181012
Environment = openshift


  General:
  + conf: OK
  + db: OK
  + db-meta: OK
  + instanceID: OK
    instanceID = 2LEKCFD4KT4MUBIAQSUG7GRSAG
  + license load: OK
    trial
  + license expiration: OK
    license expires 2018-11-12 15:51:18.200772266 -0500 EST
  + license connection: OK

  Splunk output:
  + OPTIONS(url=https://10.0.2.2:8088/services/collector/event/1.0): OK
  + POST(url=https://10.0.2.2:8088/services/collector/event/1.0, index=): OK

  Kubernetes configuration:
  + api: OK
  + pod cgroup: OK
    pods = 18
  + container cgroup: OK
    containers = 39
  x volumes root: FAILED
    failed to find any volumes under /rootfs/var/lib/minishift/openshift.local.volumes/
  + runtime: OK
    docker

  Docker configuration:
  + connect: OK
    containers = 43
  + path: OK
  + cgroup: OK
    containers = 40
  + files: OK

  CRI-O configuration:
  - ignored: OK
    kubernetes uses other container runtime

  File Inputs:
  x input(syslog): FAILED
    no matches
  + input(logs): OK
    path /rootfs/var/log/
  x input(audit-logs): FAILED
    cannot access /rootfs/var/lib/origin/openpaas-oscp-audit/ (err = stat /rootfs/var/lib/origin/openpaas-oscp-audit/: no such file or directory)

  System Input:
  + path cgroup: OK
  + path proc: OK

  Network stats Input:
  + path proc: OK

  Network socket table Input:
  + path proc: OK

  Proc Input:
  + path proc: OK

  Mount Input:
  + stats: OK

  Prometheus input:
  + input(kubernetes-api): OK
  + input(webconsole): OK
  x input(etcd): FAILED
    failed to load metrics from specified endpoints [https://:2379/metrics]
  x input(controller): FAILED
    failed to load metrics from specified endpoints [https://127.0.0.1:8444/metrics]
  + input(kubelet): OK

Errors: 5

The total number of errors appears at the bottom. Not every failure is a real problem — some are expected on smaller or non-standard clusters. The example above is from minishift, where these failures are benign:

  • volumes root: FAILED — this version of minishift mounts the runtime under /var/lib/minishift/base/openshift.local.volumes, so /rootfs/var/lib/minishift/openshift.local.volumes/ is empty.
  • input(syslog) — minishift doesn’t persist syslog to disk, so those logs aren’t available.
  • input(audit-logs) — audit logs aren’t enabled on this cluster, so there’s nothing to forward.
  • input(etcd) — etcd is embedded in the origin image and doesn’t expose a separate endpoint here.
  • input(controller) — same story for the controller.
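
When you compare reports from several pods, it can help to reduce each one to just its failed checks. The sketch below assumes the report layout shown above: section headings that end in a colon, and failed checks that start with x and end in ": FAILED". The function name failed_checks is hypothetical and is not part of Collectord.

```shell
# failed_checks: hypothetical helper, not part of Collectord.
# Reads `collectord verify` output on stdin and prints one line per failed
# check as "<section>: <check>". Assumes the report layout shown above.
failed_checks() {
  awk '
    # A failed check looks like "  x input(syslog): FAILED".
    /: FAILED$/ {
      sub(/^[ \t]*x /, "")
      sub(/: FAILED$/, "")
      print section ": " $0
      next
    }
    # A section heading looks like "  File Inputs:".
    /:$/ {
      sub(/^[ \t]+/, "")
      sub(/:$/, "")
      section = $0
    }
  '
}

# Usage (requires cluster access):
#   oc exec -n collectorforopenshift collectorforopenshift-xbnaa -- /collectord verify | failed_checks
```

The output still needs a human decision: the function only lists what failed, it does not know which failures are benign on your cluster.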

If you fix a real configuration error — say, a wrong Splunk URL — oc apply -f ./collectorforopenshift.yaml won’t restart the running pods. Delete them so the workloads recreate them with the new config: oc delete pods --all -n collectorforopenshift.

Describe command

Available since Collectord version 5.12

When annotations are applied at several levels (namespace, workload, Configuration resources, and pods), it can be hard to track which of them end up applied to a given Pod or Container. The collectord describe command shows which annotations are in effect for a specific Pod. You can run it from any Collectord Pod on the cluster:

bash
oc exec -n collectorforopenshift collectorforopenshift-master-4gjmc -- /collectord describe --namespace default --pod postgres-pod --container postgres

Starting with version 26.04, the describe command also tags each resolved field with its origin in square brackets:

  • [pod] — the value comes from a pod annotation
  • [namespace] — the value comes from a namespace annotation
  • [configuration:<name>] — the value comes from a Collectord CRD Configuration resource (the <name> matches the resource name)

This makes it easy to trace which level of the configuration hierarchy is winning when the same annotation is defined at multiple levels — for example, when a CRD-level default is being overridden by a pod-level annotation, or when a namespace annotation is unexpectedly routing logs to a different output:

bash
$ oc exec -n collectorforopenshift collectorforopenshift-fqhmv -- /collectord describe --namespace webportal --pod audit-logger-774675c89c-rpfwx | grep '\['
logs-type [pod] = audit_logs
volume.1-logs-name [pod] = data
volume.1-logs-glob [pod] = *.log

This is especially useful when debugging why a pod is routing to an unexpected output, using the wrong sourcetype, or picking up a field extraction you didn’t expect.
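
For pods with many resolved fields, a per-origin count gives a quick picture of where the configuration comes from. This is a minimal sketch that assumes every tagged line carries its origin in the first pair of square brackets, as in the output above. The function name origin_summary is hypothetical.

```shell
# origin_summary: hypothetical helper, not part of Collectord.
# Reads `collectord describe` output on stdin and counts resolved fields per
# origin tag ([pod], [namespace], [configuration:<name>]).
# Assumption: the first [...] on a line is the origin tag.
origin_summary() {
  awk '
    match($0, /\[[^]]+\]/) {
      # Strip the surrounding brackets from the matched tag.
      count[substr($0, RSTART + 1, RLENGTH - 2)]++
    }
    END {
      for (origin in count) print origin, count[origin]
    }
  ' | sort
}

# Usage (requires cluster access):
#   oc exec -n collectorforopenshift collectorforopenshift-fqhmv -- /collectord describe --namespace webportal --pod audit-logger-774675c89c-rpfwx | origin_summary
```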

Collect diagnostic information

When you open a support case, attach a diagnostic bundle so we can reproduce the issue without a back-and-forth. The bundle includes performance profiles, memory and telemetry metrics, host Linux information, and the Collectord configuration — Splunk URL and HEC token are stripped out.

Run all four steps below.

1. Collect internal diag information from a Collectord instance

Available since Collectord version 5.2

Pick any Collectord pod and run collectord diag. The command takes a few minutes:

bash
oc exec -n collectorforopenshift collectorforopenshift-master-bwmwr -- /collectord diag --stream 1>diag.tar.gz

You can extract the archive yourself to see exactly what’s in it — performance and memory profiles, basic telemetry metrics, host Linux info, and license metadata.

Since 5.20.400, performance profiles aren’t collected by default. Add --include-performance-profiles if you need them.

Since 5.24, two more flags are available: --quiet suppresses stdout output, and --keep writes the diag file to Collectord’s data directory instead of streaming it.

If you’re running oc on a Windows or macOS host, streaming the archive directly back to your machine sometimes corrupts the tar. Use --keep instead (Collectord 5.24+):

bash
oc exec -n collectorforopenshift collectorforopenshift-master-bwmwr -- /collectord diag --keep

The command prints the path of the archive at the end — something like collected diag data/diag-1745363135.tar.gz. Copy that file off the node where the pod was running; it lives under /var/lib/collectorforopenshift/.

2. Collect logs

bash
oc logs -n collectorforopenshift --timestamps collectorforopenshift-master-bwmwr 1>collectorforopenshift.log 2>&1

3. Run verify

Available since Collectord version 5.2

bash
oc exec -n collectorforopenshift collectorforopenshift-master-bwmwr -- /collectord verify > verify.log

4. Prepare tar archive

bash
tar -czvf collectorforopenshift-$(date +%s).tar.gz verify.log collectorforopenshift.log diag*.tar.gz
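
If you collect bundles often, the four steps can be wrapped in one function. This is only a sketch: collect_support_bundle is a hypothetical name, it uses the streaming form of diag (so the earlier caveat about corrupted archives still applies), and it writes its files into the current directory.

```shell
# collect_support_bundle: hypothetical wrapper around the four steps above.
# Usage: collect_support_bundle <pod-name> [namespace]
# Writes diag.tar.gz, collectorforopenshift.log and verify.log into the
# current directory, then prints the name of the final archive.
collect_support_bundle() {
  pod="$1"
  namespace="${2:-collectorforopenshift}"

  # 1. Internal diag information, streamed to a local archive.
  oc exec -n "$namespace" "$pod" -- /collectord diag --stream 1>diag.tar.gz || return 1
  # 2. Pod logs with timestamps (stdout and stderr).
  oc logs -n "$namespace" --timestamps "$pod" 1>collectorforopenshift.log 2>&1 || return 1
  # 3. Verify report. Its exit status is not checked here because the
  #    report is useful even when some checks fail.
  oc exec -n "$namespace" "$pod" -- /collectord verify > verify.log
  # 4. One archive with everything, named with the current Unix time.
  bundle="collectorforopenshift-$(date +%s).tar.gz"
  tar -czf "$bundle" verify.log collectorforopenshift.log diag.tar.gz || return 1
  echo "$bundle"
}

# Usage (requires cluster access):
#   collect_support_bundle collectorforopenshift-master-bwmwr
```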

Pod is not getting scheduled

If Collectord pods never appear, the DaemonSets aren’t placing them on any node. Check the desired/current counts:

bash
oc get daemonset --namespace collectorforopenshift

Zeros under DESIRED, CURRENT, READY, or UP-TO-DATE mean the controller couldn’t create a single pod:

text
NAME                           DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
collectorforopenshift          0         0         0         0            0           <none>          1m
collectorforopenshift-master   0         0         0         0            0           <none>          1m
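
The same check can be scripted. The sketch below reads the table printed by oc get daemonset and lists every DaemonSet whose DESIRED count is 0. The function name unscheduled_daemonsets is hypothetical, and the parsing assumes the default table output with a header row.

```shell
# unscheduled_daemonsets: hypothetical helper.
# Reads `oc get daemonset` table output on stdin and prints the name of every
# DaemonSet whose DESIRED count is 0. Column positions are looked up from the
# header row, so an extra NAMESPACE column does not matter.
unscheduled_daemonsets() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") name = i
        if ($i == "DESIRED") desired = i
      }
      next
    }
    name && desired && $desired == 0 { print $name }
  '
}

# Usage (requires cluster access):
#   oc get daemonset --namespace collectorforopenshift | unscheduled_daemonsets
```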

Describe the DaemonSets to see the underlying reason:

bash
$ oc describe daemonsets --namespace collectorforopenshift

Both DaemonSets are listed; the Events section at the bottom of each tells you what failed:

text
...
Events:
  FirstSeen	LastSeen	Count	From		SubObjectPath	Type		Reason		Message
  ---------	--------	-----	----		-------------	--------	------		-------
  2m		43s		15	daemon-set			Warning		FailedCreate	Error creating: pods "collectorforopenshift-" is forbidden: unable to validate against any security context constraint: [provider anyuid: .spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed provider anyuid: .spec.containers[0].securityContext.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider anyuid: .spec.containers[0].securityContext.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider anyuid: .spec.containers[0].securityContext.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider anyuid: .spec.containers[0].securityContext.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used securityContext.runAsUser: Invalid value: 0: UID on container collectorforopenshift does not match required range.  Found 0, required min: 1000000000 max: 1000009999 provider restricted: .spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed provider restricted: .spec.containers[0].securityContext.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]

That error means the collectorforopenshift service account isn’t bound to the privileged Security Context Constraint — Collectord needs a privileged SCC to mount host paths and run as root. Add it:

bash
$ oc adm policy add-scc-to-user privileged system:serviceaccount:collectorforopenshift:collectorforopenshift

Wait a minute or two and run describe again:

bash
$ oc describe daemonsets --namespace collectorforopenshift

The old FailedCreate event will still be there, but you should also see a new SuccessfulCreate:

text
Events:
  FirstSeen	LastSeen	Count	From		SubObjectPath	Type		Reason		Message
  ---------	--------	-----	----		-------------	--------	------		-------
  ...
  1m		1m		1	daemon-set			Normal		SuccessfulCreate	Created pod: collectorforopenshift-55t61

Failed to pull the image

If the DaemonSet shows pods but READY is below DESIRED, the kubelet is probably failing to pull the image:

bash
$ oc get daemonsets --namespace collectorforopenshift

text
NAMESPACE   NAME                    DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
default     collectorforopenshift   1         1         0         1            0           <none>          6m

List the pods to confirm:

bash
$ oc get pods --namespace collectorforopenshift

A status of ImagePullBackOff confirms that the kubelet can’t pull the image:

text
NAMESPACE   NAME                            READY     STATUS             RESTARTS   AGE
default     collectorforopenshift-55t61     0/1       ImagePullBackOff   0          2m

That usually means the cluster doesn’t have network access to hub.docker.com or registry.connect.redhat.com, depending on which image you picked in the Configuration Reference. Describe the pods to see the kubelet’s actual error:

bash
$ oc describe pods --namespace collectorforopenshift

The Events section shows the pull attempts:

text
Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath				Type		Reason		Message
  ---------	--------	-----	----			-------------				--------	------		-------
  3m		2m		4	kubelet, localhost	spec.containers{collectorforopenshift}	Normal		Pulling		pulling image "registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:26.04.1"
  3m		2m		4	kubelet, localhost	spec.containers{collectorforopenshift}	Warning		Failed		Failed to pull image "registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:26.04.1": rpc error: code = 2 desc = unexpected http code: 500, URL: https://registry.connect.redhat.com/auth/realms/rhc4tp/protocol/docker-v2/auth?scope=repository%3Aoutcoldsolutions%2Fcollectorforopenshift%3Apull&service=docker-registry
  3m		1m		6	kubelet, localhost	spec.containers{collectorforopenshift}	Normal		BackOff		Back-off pulling image "registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:26.04.1"
  3m		1m		11	kubelet, localhost						Warning		FailedSync	Error syncing pod
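
On a cluster with many nodes, you can filter the pod listing from earlier in this section down to the pods that are stuck on the image pull. The sketch below is one way to do that. The function name image_pull_failures is hypothetical, and it also matches ErrImagePull, the status Kubernetes reports before it starts backing off.

```shell
# image_pull_failures: hypothetical helper.
# Reads `oc get pods` table output on stdin and prints the pods whose STATUS
# is ImagePullBackOff or ErrImagePull. Column positions come from the header
# row, so the output works with or without a NAMESPACE column.
image_pull_failures() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") name = i
        if ($i == "STATUS") status = i
      }
      next
    }
    name && status && ($status == "ImagePullBackOff" || $status == "ErrImagePull") {
      print $name
    }
  '
}

# Usage (requires cluster access):
#   oc get pods --namespace collectorforopenshift | image_pull_failures
```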

Failed to pull image from registry.connect.redhat.com

The Red Hat Container Catalog lists images under two registries: registry.access.redhat.com and registry.connect.redhat.com. Originally everything (Red Hat and partner images) lived in registry.access.redhat.com, but starting in early 2018 partner images moved to registry.connect.redhat.com. OpenShift Container Platform has solid built-in support for registry.access.redhat.com, while documentation and out-of-the-box support for registry.connect.redhat.com are still thinner.

If the pod events show that OpenShift can’t authenticate against registry.connect.redhat.com, you have two options: fall back to the image on hub.docker.com, or store credentials for the Red Hat registry as a pull secret. See the Configuration Reference for how to authenticate.

Blocked access to external registries

If your cluster can’t reach hub.docker.com or registry.connect.redhat.com for security reasons, mirror the image to an internal registry from a host that does have outbound access.

Copying image from hub.docker.com to your own registry

Pull the image:

bash
$ docker pull outcoldsolutions/collectorforopenshift:26.04.1

Re-tag it under your registry:

bash
docker tag outcoldsolutions/collectorforopenshift:26.04.1 [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:26.04.1

Push it:

bash
docker push [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:26.04.1

Then update the manifest to point at your registry:

yaml
image: [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:26.04.1

If you need to move the image between hosts that can’t talk to each other, save it to a tar:

bash
$ docker image save outcoldsolutions/collectorforopenshift:26.04.1 > collectorforopenshift.tar

And load it on the other host:

bash
$ cat collectorforopenshift.tar | docker image load

Copying image from registry.connect.redhat.com to your own registry

Log in to registry.connect.redhat.com with your Red Hat account:

bash
$ docker login registry.connect.redhat.com
Username: [redhat-username]
Password: [redhat-user-password]
Login Succeeded

Use your Red Hat username, not your email. Both forms succeed at the login step, but logging in with email leaves you unable to pull images.

Pull the image:

bash
$ docker pull registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:26.04.1

Re-tag it under your registry:

bash
docker tag registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:26.04.1 [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:26.04.1

Push it:

bash
docker push [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:26.04.1

Then update the manifest to point at your registry:

yaml
image: [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:26.04.1

If you need to move the image between hosts, save it to a tar:

bash
$ docker image save registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:26.04.1 > collectorforopenshift.tar

And load it on the other host:

bash
$ cat collectorforopenshift.tar | docker image load
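
Both procedures follow the same pull, tag, push sequence, so they can share one helper. The sketch below is an illustration only: mirror_target and mirror_image are hypothetical names, registry.example.internal:5000 stands in for [YOUR_REGISTRY], and the rule for detecting a registry host (a first path component containing "." or ":") is a simplification of what Docker itself does. For registry.connect.redhat.com you still need to run docker login first.

```shell
# mirror_target: hypothetical helper.
# Usage: mirror_target <your-registry> <source-image>
# Prints the image reference to use under your registry. A leading path
# component that contains "." or ":" is treated as a registry host and dropped.
mirror_target() {
  registry="$1"
  image="$2"
  first="${image%%/*}"
  # Only strip when there is a "/" and the first component looks like a host.
  if [ "$first" != "$image" ]; then
    case "$first" in
      *.*|*:*) image="${image#*/}" ;;
    esac
  fi
  printf '%s/%s\n' "$registry" "$image"
}

# mirror_image: hypothetical helper.
# Usage: mirror_image <your-registry> <source-image>
# Pulls the source image, re-tags it under your registry, and pushes it.
mirror_image() {
  source_image="$2"
  target_image="$(mirror_target "$1" "$2")"
  docker pull "$source_image" &&
    docker tag "$source_image" "$target_image" &&
    docker push "$target_image"
}

# Usage (requires Docker and push access to your registry):
#   mirror_image registry.example.internal:5000 outcoldsolutions/collectorforopenshift:26.04.1
```

After the push, the image line in the manifest should point at the reference printed by mirror_target.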

Pod is crashing or running, but you don’t see any data

Get the Pod information

When a pod is crash-looping, the most useful thing in its YAML is lastState — it tells you why the previous container exited. Replace the pod name with the one that’s crashing:

bash
oc get pod -n collectorforopenshift -o yaml collectorforopenshift-master-mshxd

If lastState looks like this:

yaml
lastState:
  terminated:
    containerID: docker://8e9086aaf65b86d6d070f98ef4c5c59d9c838401a1f40765dd997723144d65db
    exitCode: 128
    finishedAt: "2022-10-16T05:58:13Z"
    message: path / is mounted on / but it is not a shared or slave mount
    reason: ContainerCannotRun
    startedAt: "2022-10-16T05:58:13Z"

The host’s root filesystem isn’t a shared or slave mount, so mounts created on the host can’t be propagated into the Collectord container. In collectorforopenshift.yaml, find every mountPropagation: HostToContainer and comment it out. The only feature you’ll lose is Collectord’s auto-discovery of volumes containing application logs.

Email us at support@outcoldsolutions.com and we’ll help you configure it properly.
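
If you only want the termination details rather than the whole pod YAML, a JSONPath query can pull them out directly. This is a minimal sketch: last_termination is a hypothetical name, and the output is empty for containers that have not been restarted.

```shell
# last_termination: hypothetical helper.
# Usage: last_termination <pod-name> [namespace]
# Prints "<container>\t<reason>\t<message>" for each container of the pod,
# taken from lastState.terminated in the pod status.
last_termination() {
  pod="$1"
  namespace="${2:-collectorforopenshift}"
  oc get pod -n "$namespace" "$pod" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.message}{"\n"}{end}'
}

# Usage (requires cluster access):
#   last_termination collectorforopenshift-master-mshxd
```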

Check Collectord logs

The Collectord logs themselves usually tell you what’s going wrong. A healthy startup looks like this:

text
$ oc logs -f collectorforopenshift-gvhgw --namespace collectorforopenshift
INFO 2018/01/24 02:40:17.547485 main.go:213: Build date = 180116, version = 2.1.65


You are running trial version of this software.
Trial version valid for 30 days.

Contact sales@outcoldsolutions.com to purchase the license or extend trial.

See details on https://www.outcoldsolutions.com

INFO 2018/01/24 02:40:17.553805 main.go:207: InstanceID = 2K69F0F36DFT7E1RDBL9MSNROC, created = 2018-01-24 00:29:18.635604451 +0000 UTC
INFO 2018/01/24 02:40:17.681765 watcher.go:95: watching /rootfs/var/lib/docker/containers//(glob = */*-json.log*, match = )
INFO 2018/01/24 02:40:17.681798 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^(syslog|messages)(.\d+)?$)
INFO 2018/01/24 02:40:17.681803 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^[\w]+\.log(.\d+)?$)
INFO 2018/01/24 02:40:17.682663 watcher.go:150: added file /rootfs/var/lib/docker/containers/054e899d52626c2806400ec10f53df29dfa002ca28d08765facf404848967069/054e899d52626c2806400ec10f53df29dfa002ca28d08765facf404848967069-json.log
INFO 2018/01/24 02:40:17.682854 watcher.go:150: added file /rootfs/var/lib/docker/containers/0acb2dc45e1a180379f4e8c4604f4c73d76572957bce4a36cef65eadc927813d/0acb2dc45e1a180379f4e8c4604f4c73d76572957bce4a36cef65eadc927813d-json.log
INFO 2018/01/24 02:40:17.683300 watcher.go:150: added file /rootfs/var/log/userdata.log
INFO 2018/01/24 02:40:17.683357 watcher.go:150: added file /rootfs/var/log/yum.log
INFO 2018/01/24 02:40:17.683406 watcher.go:150: added file /rootfs/var/lib/docker/containers/14fe43366ab9305ecd486146ab2464377c59fe20592091739d8f51a323d2fb18/14fe43366ab9305ecd486146ab2464377c59fe20592091739d8f51a323d2fb18-json.log
INFO 2018/01/24 02:40:17.683860 watcher.go:150: added file /rootfs/var/lib/docker/containers/3ea123d8b5b21d04b6a2b6089a681744cd9d2829229e9f586b3ed1ac96b3ec02/3ea123d8b5b21d04b6a2b6089a681744cd9d2829229e9f586b3ed1ac96b3ec02-json.log
INFO 2018/01/24 02:40:17.683994 watcher.go:150: added file /rootfs/var/lib/docker/containers/4d6c5b7728ea14423f2039361da3c242362acceea7dd4a3209333a9f47d62f4f/4d6c5b7728ea14423f2039361da3c242362acceea7dd4a3209333a9f47d62f4f-json.log
INFO 2018/01/24 02:40:17.684166 watcher.go:150: added file /rootfs/var/lib/docker/containers/5781cb8252f2fe5bdd71d62415a7e2339a102f51c196701314e62a1cd6a5dd3f/5781cb8252f2fe5bdd71d62415a7e2339a102f51c196701314e62a1cd6a5dd3f-json.log
INFO 2018/01/24 02:40:17.685787 watcher.go:150: added file /rootfs/var/lib/docker/containers/6e3eacd5c86a33261e1d5ce76152d81c33cc08ec33ab316a2a27fff8e69a5b77/6e3eacd5c86a33261e1d5ce76152d81c33cc08ec33ab316a2a27fff8e69a5b77-json.log
INFO 2018/01/24 02:40:17.686062 watcher.go:150: added file /rootfs/var/lib/docker/containers/7151d7ce1342d84ceb8e563cbb164732e23d79baf71fce36d42d8de70b86da0f/7151d7ce1342d84ceb8e563cbb164732e23d79baf71fce36d42d8de70b86da0f-json.log
INFO 2018/01/24 02:40:17.687023 watcher.go:150: added file /rootfs/var/lib/docker/containers/d65e4efb5b3d84705daf342ae1a3640f6872e9195b770498a47e2a2d10b925e3/d65e4efb5b3d84705daf342ae1a3640f6872e9195b770498a47e2a2d10b925e3-json.log
INFO 2018/01/24 02:40:17.944910 license_check_pipe.go:102: license-check openshift  1 1519345758 2K69F0F36DFT7E1RDBL9MSNROC 1516753758 1516761617 2.1.65 1516060800 true true 0

If you forget to set url and token for the Splunk output, Collectord refuses to start:

text
INFO 2018/01/24 05:08:14.254306 main.go:213: Build date = 180116, version = 2.1.65
Configuration validation failed
[output.splunk]/url is required

If the license server is unreachable, the logs say so. For air-gapped clusters, contact us for a license that doesn’t require internet access.

If Splunk itself is unreachable, the logs say that too.

If you don’t see any *-json.log files mentioned but you do have containers running, the node is probably using the journald logging driver instead of json-file. See Monitoring OpenShift Installation for the supported logging drivers. The startup log will look like this — note the missing *-json.log files:

text
INFO 2018/01/25 02:51:21.749190 main.go:213: Build date = 180116, version = 2.1.65
You are running trial version of this software.
Trial version valid for 30 days.
Contact sales@outcoldsolutions.com to purchase the license or extend trial.
See details on https://www.outcoldsolutions.com
INFO 2018/01/25 02:51:21.756258 main.go:207: InstanceID = 2K6ERLN622EBISIITVQE34PHA4, created = 2018-01-25 02:51:21.755847967 +0000 UTC m=+0.010852259
INFO 2018/01/25 02:51:21.910598 watcher.go:95: watching /rootfs/var/lib/docker/containers//(glob = */*-json.log*, match = )
INFO 2018/01/25 02:51:21.910909 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^(syslog|messages)(.\d+)?$)
INFO 2018/01/25 02:51:21.910915 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^[\w]+\.log(.\d+)?$)
INFO 2018/01/25 02:51:21.914101 watcher.go:150: added file /rootfs/var/log/userdata.log
INFO 2018/01/25 02:51:21.914354 watcher.go:150: added file /rootfs/var/log/yum.log
INFO 2018/01/25 02:51:22.468489 license_check_pipe.go:102: license-check openshift  1 1519440681 2K6ERLN622EBISIITVQE34PHA4 1516848681 1516848681 2.1.65 1516060800 true true 0
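
Rather than scanning a long startup log by eye, you can count the container log files Collectord picked up. The sketch below assumes the "added file" log line format shown in the two examples above. The function name count_json_logs is hypothetical.

```shell
# count_json_logs: hypothetical helper.
# Reads Collectord logs on stdin and prints how many "added file" lines refer
# to Docker json-file logs (*-json.log).
count_json_logs() {
  # grep -c exits with status 1 when the count is 0, so the status is ignored.
  grep -c 'added file .*-json\.log' || true
}

# Usage (requires cluster access):
#   oc logs collectorforopenshift-gvhgw --namespace collectorforopenshift | count_json_logs
```

A result of 0 on a node that is running containers points at the logging driver, as described above.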

If the logs are clean but the Monitoring OpenShift app is empty, the most common cause is that your HEC token writes to a non-default index that the Splunk role can’t search. Two fixes: add the index as a default index for the role, or update the app’s macros to scope them to your index. The macros live in Splunk Web UI under Settings > Advanced Search > Search Macros. For example, change macro_openshift_logs from (sourcetype=openshift_logs) to (index=your_index sourcetype=openshift_logs) — every dashboard is built on these macros, so the change takes effect immediately.