Verify configuration
Available since Collectord version 5.2

The first thing to do when something looks off is to run collectord verify from inside a Collectord pod. It checks the configuration end to end (license, Splunk output, container runtime, file inputs, Prometheus endpoints) and reports each item as OK or FAILED.
Start by listing the Collectord pods:
$ oc get pods -n collectorforopenshift
NAME                                           READY     STATUS    RESTARTS   AGE
collectorforopenshift-addon-857fccb8b9-t9qgq   1/1       Running   1          1h
collectorforopenshift-master-bwmwr             1/1       Running   0          1h
collectorforopenshift-xbnaa                    1/1       Running   0          1h

Collectord runs as three workloads: a DaemonSet on master nodes (collectorforopenshift-master), a DaemonSet on the rest of the nodes (collectorforopenshift), and a single Deployment add-on (collectorforopenshift-addon). Run verify against one pod from each so every code path is exercised:
$ oc exec -n collectorforopenshift collectorforopenshift-addon-857fccb8b9-t9qgq -- /collectord verify
$ oc exec -n collectorforopenshift collectorforopenshift-master-bwmwr -- /collectord verify
$ oc exec -n collectorforopenshift collectorforopenshift-xbnaa -- /collectord verify

Each command produces output similar to:
Version = 5.2.176
Build date = 181012
Environment = openshift


  General:
  + conf: OK
  + db: OK
  + db-meta: OK
  + instanceID: OK
    instanceID = 2LEKCFD4KT4MUBIAQSUG7GRSAG
  + license load: OK
    trial
  + license expiration: OK
    license expires 2018-11-12 15:51:18.200772266 -0500 EST
  + license connection: OK

  Splunk output:
  + OPTIONS(url=https://10.0.2.2:8088/services/collector/event/1.0): OK
  + POST(url=https://10.0.2.2:8088/services/collector/event/1.0, index=): OK

  Kubernetes configuration:
  + api: OK
  + pod cgroup: OK
    pods = 18
  + container cgroup: OK
    containers = 39
  x volumes root: FAILED
    failed to find any volumes under /rootfs/var/lib/minishift/openshift.local.volumes/
  + runtime: OK
    docker

  Docker configuration:
  + connect: OK
    containers = 43
  + path: OK
  + cgroup: OK
    containers = 40
  + files: OK

  CRI-O configuration:
  - ignored: OK
    kubernetes uses other container runtime

  File Inputs:
  x input(syslog): FAILED
    no matches
  + input(logs): OK
    path /rootfs/var/log/
  x input(audit-logs): FAILED
    cannot access /rootfs/var/lib/origin/openpaas-oscp-audit/ (err = stat /rootfs/var/lib/origin/openpaas-oscp-audit/: no such file or directory)

  System Input:
  + path cgroup: OK
  + path proc: OK

  Network stats Input:
  + path proc: OK

  Network socket table Input:
  + path proc: OK

  Proc Input:
  + path proc: OK

  Mount Input:
  + stats: OK

  Prometheus input:
  + input(kubernetes-api): OK
  + input(webconsole): OK
  x input(etcd): FAILED
    failed to load metrics from specified endpoints [https://:2379/metrics]
  x input(controller): FAILED
    failed to load metrics from specified endpoints [https://127.0.0.1:8444/metrics]
  + input(kubelet): OK

Errors: 5

The total number of errors appears at the bottom. Not every failure is a real problem; some are expected on smaller or non-standard clusters. The example above is from minishift, where these failures are benign:
volumes root: this version of minishift mounts the runtime under /var/lib/minishift/base/openshift.local.volumes, so /rootfs/var/lib/minishift/openshift.local.volumes/ is empty.
input(syslog): minishift doesn’t persist syslog to disk, so those logs aren’t available.
input(audit-logs): audit logs aren’t enabled on this cluster, so there’s nothing to forward.
input(etcd): etcd is embedded in the origin image and doesn’t expose a separate endpoint here.
input(controller): same story for the controller.
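When the same benign failures show up on every run, it can help to filter them out and look only at what is left. The following is a minimal sketch and not part of Collectord: real_failures and the BENIGN variable are hypothetical names, the default list simply mirrors the minishift example above, and the pattern assumes verify prints failed checks as "x <name>: FAILED", as in the sample output.

```shell
#!/bin/sh
# Checks to ignore, separated by "|". This default mirrors the minishift
# example and is only an assumption; adjust it for your own cluster.
BENIGN="${BENIGN:-volumes root|input(syslog)|input(audit-logs)|input(etcd)|input(controller)}"

# Read `collectord verify` output on stdin and print the name of every
# FAILED check that is not listed in BENIGN.
real_failures() {
  # Failed checks look like "x <name>: FAILED"; keep only <name>.
  sed -n 's/^[[:space:]]*x \(.*\): FAILED[[:space:]]*$/\1/p' |
    while IFS= read -r check; do
      case "|$BENIGN|" in
        *"|$check|"*) ;;                  # known-benign, skip it
        *) printf '%s\n' "$check" ;;      # anything else needs a look
      esac
    done
}
```

A possible invocation, using a pod name from the listing above: oc exec -n collectorforopenshift collectorforopenshift-xbnaa -- /collectord verify | real_failures. Empty output means every failure was on the benign list.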
If you fix a real configuration error (say, a wrong Splunk URL), oc apply -f ./collectorforopenshift.yaml won’t restart the running pods. Delete them so the workloads recreate them with the new config:
oc delete pods --all -n collectorforopenshift
Describe command
Available since Collectord version 5.12

When annotations are applied at several levels (namespace, workload, Configuration resources, and pods), it can be hard to track which of them end up applied to a given Pod or Container. The collectord describe command shows which annotations are in effect for a specific Pod. You can run it from any Collectord Pod in the cluster:
oc exec -n collectorforopenshift collectorforopenshift-master-4gjmc -- /collectord describe --namespace default --pod postgres-pod --container postgres

Starting with version 26.04, the describe command also tags each resolved field with its origin in square brackets:
[pod]: the value comes from a pod annotation
[namespace]: the value comes from a namespace annotation
[configuration:<name>]: the value comes from a Collectord CRD Configuration resource (the <name> matches the resource name)
This makes it easy to trace which level of the configuration hierarchy is winning when the same annotation is defined at multiple levels — for example, when a CRD-level default is being overridden by a pod-level annotation, or when a namespace annotation is unexpectedly routing logs to a different output:
$ oc exec -n collectorforopenshift collectorforopenshift-fqhmv -- /collectord describe --namespace webportal --pod audit-logger-774675c89c-rpfwx | grep '\['
logs-type [pod] = audit_logs
volume.1-logs-name [pod] = data
volume.1-logs-glob [pod] = *.log

This is especially useful when debugging why a pod is routing to an unexpected output, using the wrong sourcetype, or picking up a field extraction you didn’t expect.
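To look at one level of the hierarchy at a time, the tagged output can be filtered by origin. This is a small sketch under the assumption that every tagged line has the form "name [origin] = value", as in the example above; fields_from is a hypothetical helper, not a Collectord command.

```shell
#!/bin/sh
# Read `collectord describe` output on stdin and print "name = value"
# for every field whose origin tag equals $1
# (for example: pod, namespace, or configuration:<name>).
fields_from() {
  awk -v tag="[$1]" '
    {
      # Literal (non-regex) search for " [origin] = ".
      i = index($0, " " tag " = ")
      if (i > 0) {
        # Name is everything before the match; value is everything after it.
        print substr($0, 1, i - 1) " = " substr($0, i + length(tag) + 4)
      }
    }'
}
```

For example, piping the describe output through fields_from namespace would list only the values that come from namespace annotations, which makes an unexpected namespace-level override easy to spot.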
Collect diagnostic information
When you open a support case, attach a diagnostic bundle so we can reproduce the issue without a back-and-forth. The bundle includes performance profiles, memory and telemetry metrics, host Linux information, and the Collectord configuration — Splunk URL and HEC token are stripped out.
Run all four steps below.
1. Collect internal diagnostic information from a Collectord instance
Available since Collectord version 5.2

Pick any Collectord pod and run collectord diag. The command takes a few minutes:
oc exec -n collectorforopenshift collectorforopenshift-master-bwmwr -- /collectord diag --stream 1>diag.tar.gz

You can extract the archive yourself to see exactly what’s in it: performance and memory profiles, basic telemetry metrics, host Linux info, and license metadata.
Since 5.20.400, performance profiles aren’t collected by default. Add --include-performance-profiles if you need them.
Since 5.24, two more flags are available: --quiet suppresses stdout output, and --keep writes the diag file to Collectord’s data directory instead of streaming it.
If you’re running oc on a Windows or macOS host, streaming the archive directly back to your machine sometimes corrupts the tar. Use --keep instead (Collectord 5.24+):
oc exec -n collectorforopenshift collectorforopenshift-master-bwmwr -- /collectord diag --keep

The command prints the path of the archive at the end, something like collected diag data/diag-1745363135.tar.gz. Copy that file off the node where the pod was running; it lives under /var/lib/collectorforopenshift/.
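Because a streamed archive can arrive corrupted, it is worth checking the file before attaching it to a case. The sketch below only uses standard tar; check_diag_archive is a hypothetical helper name, and it assumes the diag archive is a gzip-compressed tar, as its .tar.gz name suggests.

```shell
#!/bin/sh
# Return 0 and print "ok: <file>" if the gzip-compressed tar at $1
# exists, is non-empty, and can be listed from start to finish.
check_diag_archive() {
  archive="$1"
  if [ ! -s "$archive" ]; then
    echo "missing or empty: $archive" >&2
    return 1
  fi
  # Listing the contents makes tar read and decompress the whole file,
  # so a truncated or mangled stream shows up as an error here.
  if ! tar -tzf "$archive" >/dev/null 2>&1; then
    echo "corrupt archive: $archive" >&2
    return 1
  fi
  echo "ok: $archive"
}
```

Calling check_diag_archive diag.tar.gz after the streaming command gives a quick signal for whether to fall back to --keep.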
2. Collect logs
oc logs -n collectorforopenshift --timestamps collectorforopenshift-master-bwmwr 1>collectorforopenshift.log 2>&1

3. Run verify
Available since Collectord version 5.2
oc exec -n collectorforopenshift collectorforopenshift-master-bwmwr -- /collectord verify > verify.log

4. Prepare tar archive
tar -czvf collectorforopenshift-$(date +%s).tar.gz verify.log collectorforopenshift.log diag*.tar.gz

Pod is not getting scheduled
If Collectord pods never appear, the DaemonSets aren’t placing them on any node. Check the desired/current counts:
oc get daemonset --namespace collectorforopenshift

Zeros under DESIRED, CURRENT, READY, or UP-TO-DATE mean the controller couldn’t create a single pod:
NAME                           DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
collectorforopenshift          0         0         0         0            0           <none>          1m
collectorforopenshift-master   0         0         0         0            0           <none>          1m

Describe the DaemonSets to see the underlying reason:
$ oc describe daemonsets --namespace collectorforopenshift

Both DaemonSets are listed; the Events section at the bottom of each tells you what failed:
...
Events:
  FirstSeen  LastSeen  Count  From        SubObjectPath  Type      Reason        Message
  ---------  --------  -----  ----        -------------  --------  ------        -------
  2m         43s       15     daemon-set                 Warning   FailedCreate  Error creating: pods "collectorforopenshift-" is forbidden: unable to validate against any security context constraint: [provider anyuid: .spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed provider anyuid: .spec.containers[0].securityContext.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider anyuid: .spec.containers[0].securityContext.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider anyuid: .spec.containers[0].securityContext.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider anyuid: .spec.containers[0].securityContext.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used securityContext.runAsUser: Invalid value: 0: UID on container collectorforopenshift does not match required range. Found 0, required min: 1000000000 max: 1000009999 provider restricted: .spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed provider restricted: .spec.containers[0].securityContext.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]

That error means the collectorforopenshift service account isn’t bound to the privileged Security Context Constraint. Collectord needs a privileged SCC to mount host paths and run as root. Add it:
$ oc adm policy add-scc-to-user privileged system:serviceaccount:collectorforopenshift:collectorforopenshift

Wait a minute or two and run describe again:
$ oc describe daemonsets --namespace collectorforopenshift

The old FailedCreate event will still be there, but you should also see a new SuccessfulCreate:
Events:
  FirstSeen  LastSeen  Count  From        SubObjectPath  Type      Reason            Message
  ---------  --------  -----  ----        -------------  --------  ------            -------
  ...
  1m         1m        1      daemon-set                 Normal    SuccessfulCreate  Created pod: collectorforopenshift-55t61

Failed to pull the image
If the DaemonSet shows pods but READY is below DESIRED, the kubelet is probably failing to pull the image:
$ oc get daemonsets --namespace collectorforopenshift
NAMESPACE   NAME                    DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
default     collectorforopenshift   1         1         0         1            0           <none>          6m

List the pods to confirm:
$ oc get pods --namespace collectorforopenshift

A status of ImagePullBackOff confirms the kubelet can’t pull the image:
NAMESPACE   NAME                          READY     STATUS             RESTARTS   AGE
default     collectorforopenshift-55t61   0/1       ImagePullBackOff   0          2m

The usual cause is that the cluster doesn’t have network access to hub.docker.com or registry.connect.redhat.com, depending on which image you picked in the Configuration Reference. Describe the pods to see the kubelet’s actual error:
$ oc describe pods --namespace collectorforopenshift

The Events section shows the pull attempts:
Events:
  FirstSeen  LastSeen  Count  From                SubObjectPath                           Type      Reason      Message
  ---------  --------  -----  ----                -------------                           --------  ------      -------
  3m         2m        4      kubelet, localhost  spec.containers{collectorforopenshift}  Normal    Pulling     pulling image "registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:26.04.1"
  3m         2m        4      kubelet, localhost  spec.containers{collectorforopenshift}  Warning   Failed      Failed to pull image "registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:26.04.1": rpc error: code = 2 desc = unexpected http code: 500, URL: https://registry.connect.redhat.com/auth/realms/rhc4tp/protocol/docker-v2/auth?scope=repository%3Aoutcoldsolutions%2Fcollectorforopenshift%3Apull&service=docker-registry
  3m         1m        6      kubelet, localhost  spec.containers{collectorforopenshift}  Normal    BackOff     Back-off pulling image "registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:26.04.1"
  3m         1m        11     kubelet, localhost                                          Warning   FailedSync  Error syncing pod

Failed to pull image from registry.connect.redhat.com
The Red Hat Container Catalog lists images under two registries: registry.access.redhat.com and registry.connect.redhat.com. Originally everything (Red Hat and partner images) lived in registry.access.redhat.com, but starting in early 2018 partner images moved to registry.connect.redhat.com. OpenShift Container Platform has solid built-in support for registry.access.redhat.com, while documentation and out-of-the-box support for registry.connect.redhat.com are still thinner.
If the pod events show that OpenShift can’t authenticate against registry.connect.redhat.com, you have two options: fall back to the image on hub.docker.com, or store credentials for the Red Hat registry as a pull secret. See the Configuration Reference for how to authenticate.
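As a rough illustration of the pull-secret option, the sketch below creates a registry secret and links it to the service account for image pulls. The Configuration Reference remains the authoritative procedure. The secret name rh-connect-pull, the RH_USERNAME and RH_PASSWORD variables, and the function name are assumptions made for this example; the service account name collectorforopenshift is taken from the SCC command earlier on this page. Older clients may also require a --docker-email value.

```shell
#!/bin/sh
NAMESPACE="${NAMESPACE:-collectorforopenshift}"
SECRET_NAME="${SECRET_NAME:-rh-connect-pull}"   # hypothetical secret name

# Create a pull secret for registry.connect.redhat.com and let the
# Collectord service account use it when pulling images.
create_pull_secret() {
  # Credentials are read from the environment (RH_USERNAME, RH_PASSWORD)
  # so they do not need to be typed into the command itself.
  oc create secret docker-registry "$SECRET_NAME" \
    --docker-server=registry.connect.redhat.com \
    --docker-username="${RH_USERNAME:?set RH_USERNAME}" \
    --docker-password="${RH_PASSWORD:?set RH_PASSWORD}" \
    -n "$NAMESPACE" || return 1
  # Attach the secret to the service account for pulls only.
  oc secrets link collectorforopenshift "$SECRET_NAME" --for=pull -n "$NAMESPACE"
}
```

After running create_pull_secret, delete the pods stuck in ImagePullBackOff so they are recreated and retry the pull with the new credentials.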
Blocked access to external registries
If your cluster can’t reach hub.docker.com or registry.connect.redhat.com for security reasons, mirror the image to an internal registry from a host that does have outbound access.
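The pull, re-tag, and push sequence described in the two subsections that follow is the same for both source registries, so it can be wrapped in one helper. This is only a sketch: retarget_image and mirror_image are hypothetical names, registry.example.com stands in for your own registry, and the rule for spotting a registry host (a first path component containing "." or ":", or equal to localhost) is the common Docker convention rather than anything specific to Collectord.

```shell
#!/bin/sh
# Print image reference $1 re-homed under registry $2.
# A leading path component that contains "." or ":" (or is "localhost")
# is treated as the source registry host and dropped.
retarget_image() {
  image="$1"
  registry="$2"
  first="${image%%/*}"
  case "$image" in
    */*)
      case "$first" in
        *.*|*:*|localhost) image="${image#*/}" ;;
      esac
      ;;
  esac
  printf '%s/%s\n' "$registry" "$image"
}

# Pull $1, tag it for registry $2, and push it there.
mirror_image() {
  target="$(retarget_image "$1" "$2")"
  docker pull "$1" && docker tag "$1" "$target" && docker push "$target"
}
```

For example, mirror_image registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:26.04.1 registry.example.com would push registry.example.com/outcoldsolutions/collectorforopenshift:26.04.1, which is the value to put in the manifest's image field.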
Copying image from hub.docker.com to your own registry
Pull the image:
$ docker pull outcoldsolutions/collectorforopenshift:26.04.1

Re-tag it under your registry:
docker tag outcoldsolutions/collectorforopenshift:26.04.1 [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:26.04.1

Push it:
docker push [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:26.04.1

Then update the manifest to point at your registry:
image: [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:26.04.1

If you need to move the image between hosts that can’t talk to each other, save it to a tar:
$ docker image save outcoldsolutions/collectorforopenshift:26.04.1 > collectorforopenshift.tar

And load it on the other host:
$ cat collectorforopenshift.tar | docker image load

Copying image from registry.connect.redhat.com to your own registry
Log in to registry.connect.redhat.com with your Red Hat account:
$ docker login registry.connect.redhat.com
Username: [redhat-username]
Password: [redhat-user-password]
Login Succeeded

Use your Red Hat username, not your email. Both forms succeed at the login step, but logging in with email leaves you unable to pull images.
Pull the image:
$ docker pull registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:26.04.1

Re-tag it under your registry:
docker tag registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:26.04.1 [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:26.04.1

Push it:
docker push [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:26.04.1

Then update the manifest to point at your registry:
image: [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:26.04.1

If you need to move the image between hosts, save it to a tar:
$ docker image save registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:26.04.1 > collectorforopenshift.tar

And load it on the other host:
$ cat collectorforopenshift.tar | docker image load

Pod is crashing or running, but you don’t see any data
Get the Pod information
When a pod is crash-looping, the most useful thing in its YAML is lastState — it tells you why the previous container exited. Replace the pod name with the one that’s crashing:
oc get pod -n collectorforopenshift -o yaml collectorforopenshift-master-mshxd

If lastState looks like this:
lastState:
  terminated:
    containerID: docker://8e9086aaf65b86d6d070f98ef4c5c59d9c838401a1f40765dd997723144d65db
    exitCode: 128
    finishedAt: "2022-10-16T05:58:13Z"
    message: path / is mounted on / but it is not a shared or slave mount
    reason: ContainerCannotRun
    startedAt: "2022-10-16T05:58:13Z"

then the host’s root filesystem isn’t a shared or slave mount, so the runtime can’t start a container that asks for HostToContainer mount propagation. In collectorforopenshift.yaml, find every mountPropagation: HostToContainer and comment it out. The only feature you’ll lose is Collectord’s auto-discovery of volumes containing application logs.
Email us at support@outcoldsolutions.com and we’ll help you configure it properly.
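If the manifest has many mounts, the mountPropagation lines can be commented out mechanically. The sketch below is one way to do that with sed; disable_mount_propagation is a hypothetical helper name, and it writes the result to stdout so the original file stays untouched for comparison.

```shell
#!/bin/sh
# Read a manifest on stdin and write it to stdout with every
# "mountPropagation: HostToContainer" line commented out.
# Indentation is preserved so the YAML structure does not change.
disable_mount_propagation() {
  sed 's/^\([[:space:]]*\)\(mountPropagation:[[:space:]]*HostToContainer\)/\1# \2/'
}
```

A possible use is disable_mount_propagation < collectorforopenshift.yaml > collectorforopenshift.patched.yaml, followed by a diff of the two files before applying the patched one.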
Check Collectord logs
The Collectord logs themselves usually tell you what’s going wrong. A healthy startup looks like this:
$ oc logs -f collectorforopenshift-gvhgw --namespace collectorforopenshift
INFO 2018/01/24 02:40:17.547485 main.go:213: Build date = 180116, version = 2.1.65


You are running trial version of this software.
Trial version valid for 30 days.

Contact sales@outcoldsolutions.com to purchase the license or extend trial.

See details on https://www.outcoldsolutions.com

INFO 2018/01/24 02:40:17.553805 main.go:207: InstanceID = 2K69F0F36DFT7E1RDBL9MSNROC, created = 2018-01-24 00:29:18.635604451 +0000 UTC
INFO 2018/01/24 02:40:17.681765 watcher.go:95: watching /rootfs/var/lib/docker/containers//(glob = */*-json.log*, match = )
INFO 2018/01/24 02:40:17.681798 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^(syslog|messages)(.\d+)?$)
INFO 2018/01/24 02:40:17.681803 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^[\w]+\.log(.\d+)?$)
INFO 2018/01/24 02:40:17.682663 watcher.go:150: added file /rootfs/var/lib/docker/containers/054e899d52626c2806400ec10f53df29dfa002ca28d08765facf404848967069/054e899d52626c2806400ec10f53df29dfa002ca28d08765facf404848967069-json.log
INFO 2018/01/24 02:40:17.682854 watcher.go:150: added file /rootfs/var/lib/docker/containers/0acb2dc45e1a180379f4e8c4604f4c73d76572957bce4a36cef65eadc927813d/0acb2dc45e1a180379f4e8c4604f4c73d76572957bce4a36cef65eadc927813d-json.log
INFO 2018/01/24 02:40:17.683300 watcher.go:150: added file /rootfs/var/log/userdata.log
INFO 2018/01/24 02:40:17.683357 watcher.go:150: added file /rootfs/var/log/yum.log
INFO 2018/01/24 02:40:17.683406 watcher.go:150: added file /rootfs/var/lib/docker/containers/14fe43366ab9305ecd486146ab2464377c59fe20592091739d8f51a323d2fb18/14fe43366ab9305ecd486146ab2464377c59fe20592091739d8f51a323d2fb18-json.log
INFO 2018/01/24 02:40:17.683860 watcher.go:150: added file /rootfs/var/lib/docker/containers/3ea123d8b5b21d04b6a2b6089a681744cd9d2829229e9f586b3ed1ac96b3ec02/3ea123d8b5b21d04b6a2b6089a681744cd9d2829229e9f586b3ed1ac96b3ec02-json.log
INFO 2018/01/24 02:40:17.683994 watcher.go:150: added file /rootfs/var/lib/docker/containers/4d6c5b7728ea14423f2039361da3c242362acceea7dd4a3209333a9f47d62f4f/4d6c5b7728ea14423f2039361da3c242362acceea7dd4a3209333a9f47d62f4f-json.log
INFO 2018/01/24 02:40:17.684166 watcher.go:150: added file /rootfs/var/lib/docker/containers/5781cb8252f2fe5bdd71d62415a7e2339a102f51c196701314e62a1cd6a5dd3f/5781cb8252f2fe5bdd71d62415a7e2339a102f51c196701314e62a1cd6a5dd3f-json.log
INFO 2018/01/24 02:40:17.685787 watcher.go:150: added file /rootfs/var/lib/docker/containers/6e3eacd5c86a33261e1d5ce76152d81c33cc08ec33ab316a2a27fff8e69a5b77/6e3eacd5c86a33261e1d5ce76152d81c33cc08ec33ab316a2a27fff8e69a5b77-json.log
INFO 2018/01/24 02:40:17.686062 watcher.go:150: added file /rootfs/var/lib/docker/containers/7151d7ce1342d84ceb8e563cbb164732e23d79baf71fce36d42d8de70b86da0f/7151d7ce1342d84ceb8e563cbb164732e23d79baf71fce36d42d8de70b86da0f-json.log
INFO 2018/01/24 02:40:17.687023 watcher.go:150: added file /rootfs/var/lib/docker/containers/d65e4efb5b3d84705daf342ae1a3640f6872e9195b770498a47e2a2d10b925e3/d65e4efb5b3d84705daf342ae1a3640f6872e9195b770498a47e2a2d10b925e3-json.log
INFO 2018/01/24 02:40:17.944910 license_check_pipe.go:102: license-check openshift 1 1519345758 2K69F0F36DFT7E1RDBL9MSNROC 1516753758 1516761617 2.1.65 1516060800 true true 0

If you forget to set url and token for the Splunk output, Collectord refuses to start:
INFO 2018/01/24 05:08:14.254306 main.go:213: Build date = 180116, version = 2.1.65
Configuration validation failed
[output.splunk]/url is required

If the license server is unreachable, the logs say so. For air-gapped clusters, contact us for a license that doesn’t require internet access.
If Splunk itself is unreachable, the logs say that too.
If you don’t see any *-json.log files mentioned but you do have containers running, the node is probably using the journald logging driver instead of json-file. See Monitoring OpenShift Installation for the supported logging drivers. The startup log will look like this — note the missing *-json.log files:
INFO 2018/01/25 02:51:21.749190 main.go:213: Build date = 180116, version = 2.1.65
You are running trial version of this software.
Trial version valid for 30 days.
Contact sales@outcoldsolutions.com to purchase the license or extend trial.
See details on https://www.outcoldsolutions.com
INFO 2018/01/25 02:51:21.756258 main.go:207: InstanceID = 2K6ERLN622EBISIITVQE34PHA4, created = 2018-01-25 02:51:21.755847967 +0000 UTC m=+0.010852259
INFO 2018/01/25 02:51:21.910598 watcher.go:95: watching /rootfs/var/lib/docker/containers//(glob = */*-json.log*, match = )
INFO 2018/01/25 02:51:21.910909 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^(syslog|messages)(.\d+)?$)
INFO 2018/01/25 02:51:21.910915 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^[\w]+\.log(.\d+)?$)
INFO 2018/01/25 02:51:21.914101 watcher.go:150: added file /rootfs/var/log/userdata.log
INFO 2018/01/25 02:51:21.914354 watcher.go:150: added file /rootfs/var/log/yum.log
INFO 2018/01/25 02:51:22.468489 license_check_pipe.go:102: license-check openshift 1 1519440681 2K6ERLN622EBISIITVQE34PHA4 1516848681 1516848681 2.1.65 1516060800 true true 0

If the logs are clean but the Monitoring OpenShift app is empty, the most common cause is that your HEC token writes to a non-default index that the Splunk role can’t search. Two fixes: add the index as a default index for the role, or update the app’s macros to scope them to your index. The macros live in Splunk Web UI under Settings > Advanced Search > Search Macros. For example, change macro_openshift_logs from (sourcetype=openshift_logs) to (index=your_index sourcetype=openshift_logs). Every dashboard is built on these macros, so the change takes effect immediately.
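Returning to the journald case described earlier in this section, the symptom can be checked without reading the whole log by counting the container log files Collectord reports picking up. This is a minimal sketch that assumes the "added file ...-json.log" message format shown in the sample logs; count_json_logs is a hypothetical helper name.

```shell
#!/bin/sh
# Read a Collectord log on stdin and print how many container
# "*-json.log" files it reports as added to the watcher.
count_json_logs() {
  # grep -c prints the count but exits non-zero when the count is 0,
  # so normalize the exit status for use in scripts.
  grep -c 'added file .*-json\.log' || true
}
```

For example, oc logs collectorforopenshift-gvhgw --namespace collectorforopenshift | count_json_logs printing 0 on a node that is running containers points at a logging driver other than json-file.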