Verify configuration
Available since Collectord version 5.2

The first thing to do when something looks off is to run `collectord verify` from inside a Collectord pod. It checks the configuration end-to-end — license, Splunk output, container runtime, file inputs, Prometheus endpoints — and reports each item as OK or FAILED.
Start by listing the Collectord pods:
```
$ kubectl get pods -n collectorforkubernetes
NAME                                            READY     STATUS    RESTARTS   AGE
collectorforkubernetes-addon-857fccb8b9-t9qgq   1/1       Running   1          1h
collectorforkubernetes-master-bwmwr             1/1       Running   0          1h
collectorforkubernetes-xbnaa                    1/1       Running   0          1h
```

Collectord runs as three workloads — a DaemonSet on master nodes (collectorforkubernetes-master), a DaemonSet on the rest of the nodes (collectorforkubernetes), and a single Deployment add-on (collectorforkubernetes-addon). Run verify against one pod from each so every code path is exercised:
```bash
$ kubectl exec -n collectorforkubernetes collectorforkubernetes-addon-857fccb8b9-t9qgq -- /collectord verify
$ kubectl exec -n collectorforkubernetes collectorforkubernetes-master-bwmwr -- /collectord verify
$ kubectl exec -n collectorforkubernetes collectorforkubernetes-xbnaa -- /collectord verify
```

Each command produces output similar to:
```
Version = 5.2.176
Build date = 181012
Environment = kubernetes


General:
  + conf: OK
  + db: OK
  + db-meta: OK
  + instanceID: OK
      instanceID = 2LEKCFD4KT4MUBIAQSUG7GRSAG
  + license load: OK
      trial
  + license expiration: OK
      license expires 2018-11-12 15:51:18.200772266 -0500 EST
  + license connection: OK

Splunk output:
  + OPTIONS(url=https://10.0.2.2:8088/services/collector/event/1.0): OK
  + POST(url=https://10.0.2.2:8088/services/collector/event/1.0, index=): OK

Kubernetes configuration:
  + api: OK
  + pod cgroup: OK
      pods = 18
  + container cgroup: OK
      containers = 39
  + volumes root: OK
  + runtime: OK
      docker

Docker configuration:
  + connect: OK
      containers = 43
  + path: OK
  + cgroup: OK
      containers = 40
  + files: OK

CRI-O configuration:
  - ignored: OK
      kubernetes uses other container runtime

File Inputs:
  x input(syslog): FAILED
      no matches
  + input(logs): OK
      path /rootfs/var/log/

System Input:
  + path cgroup: OK
  + path proc: OK

Network stats Input:
  + path proc: OK

Network socket table Input:
  + path proc: OK

Proc Input:
  + path proc: OK

Mount Input:
  + stats: OK

Prometheus input:
  + input(kubernetes-api): OK
  x input(etcd): FAILED
      failed to load metrics from specified endpoints [https://:2379/metrics]
  x input(controller): FAILED
      failed to load metrics from specified endpoints [https://127.0.0.1:8444/metrics]
  + input(kubelet): OK

Errors: 5
```

The total number of errors appears at the bottom. Not every failure is a real problem — some are expected on smaller or non-standard clusters. The example above is from minikube, where these failures are benign:
- `input(syslog)` — minikube doesn’t persist syslog to disk, so those logs aren’t available.
- `input(etcd)` — etcd isn’t reachable on this minikube instance.
- `input(controller)` — the controller endpoint isn’t reachable on this minikube instance.
If you fix a real configuration error — say, a wrong Splunk URL — `kubectl apply -f ./collectorforkubernetes.yaml` won’t restart the running pods. Delete them so the workloads recreate them with the new config: `kubectl delete pods --all -n collectorforkubernetes`.
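After the pods come back, it is worth re-running verify across all of them in one pass. A minimal sketch, assuming the standard namespace and relying on the error count being the last line of the verify output, as in the example above:

```bash
# Re-run verify on every Collectord pod and print each pod's error count
for pod in $(kubectl get pods -n collectorforkubernetes -o name); do
  echo "== ${pod} =="
  kubectl exec -n collectorforkubernetes "${pod}" -- /collectord verify | tail -n 1
done
```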
Describe command
Available since Collectord version 5.12

When you apply annotations through namespaces, workloads, configurations, and pods, it can be hard to track which annotations end up applied to a Pod or Container. The describe command of collectord reports which annotations are in effect for a specific Pod. You can run it from any collectord Pod in the cluster:
```bash
kubectl exec -n collectorforkubernetes collectorforkubernetes-master-4gjmc -- /collectord describe --namespace default --pod postgres-pod --container postgres
```

Starting with version 26.04, the describe command also tags each resolved field with its origin in square brackets:
- `[pod]` — the value comes from a pod annotation
- `[namespace]` — the value comes from a namespace annotation
- `[configuration:<name>]` — the value comes from a Collectord CRD Configuration resource (the `<name>` matches the resource name)
This makes it easy to trace which level of the configuration hierarchy is winning when the same annotation is defined at multiple levels — for example, when a CRD-level default is being overridden by a pod-level annotation, or when a namespace annotation is unexpectedly routing logs to a different output:
```
$ kubectl exec -n collectorforkubernetes collectorforkubernetes-fqhmv -- /collectord describe --namespace webportal --pod audit-logger-774675c89c-rpfwx | grep '\['
logs-type [pod] = audit_logs
volume.1-logs-name [pod] = data
volume.1-logs-glob [pod] = *.log
```

This is especially useful when debugging why a pod is routing to an unexpected output, using the wrong sourcetype, or picking up a field extraction you didn’t expect.
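To see the precedence resolution in action, the sketch below sets the same annotation at both the namespace and pod level, then asks describe which one won. The annotation key `collectord.io/logs-type` is inferred from the `logs-type` field shown above; treat the exact key and the precedence order as assumptions to confirm against your own cluster:

```bash
# Hypothetical walk-through: define the same annotation at two levels,
# then let describe report which level wins (--overwrite replaces any
# existing value; pod-level edits are transient on managed pods).
kubectl annotate --overwrite namespace webportal collectord.io/logs-type=namespace_level
kubectl annotate --overwrite pod audit-logger-774675c89c-rpfwx -n webportal collectord.io/logs-type=pod_level

# If pod annotations take precedence, describe reports the pod origin:
kubectl exec -n collectorforkubernetes collectorforkubernetes-fqhmv -- \
  /collectord describe --namespace webportal --pod audit-logger-774675c89c-rpfwx \
  | grep 'logs-type'
# Expected: logs-type [pod] = pod_level
```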
Collect diagnostic information
When you open a support case, attach a diagnostic bundle so we can reproduce the issue without a back-and-forth. The bundle includes performance profiles, memory and telemetry metrics, host Linux information, and the Collectord configuration — Splunk URL and HEC token are stripped out.
Run all four steps below.
1. Collect internal diag information from the Collectord instance
Available since Collectord version 5.2

Pick any Collectord pod and run `collectord diag`. The command takes a few minutes:
```bash
kubectl exec -n collectorforkubernetes collectorforkubernetes-master-bwmwr -- /collectord diag --stream 1>diag.tar.gz
```

You can extract the archive yourself to see exactly what’s in it — performance and memory profiles, basic telemetry metrics, host Linux info, and license metadata.
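If you want to check the contents before attaching the bundle to a case, you can list the archive without unpacking it; a small sketch, assuming the diag.tar.gz produced by the command above:

```bash
# List the files inside the diag bundle without extracting it
tar -tzf diag.tar.gz
```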
Since 5.20.400, performance profiles aren’t collected by default. Add `--include-performance-profiles` if you need them.
Since 5.24, two more flags are available: `--quiet` suppresses stdout output, and `--keep` writes the diag file to Collectord’s data directory instead of streaming it.
If you’re running kubectl on a Windows or macOS host, streaming the archive directly back to your machine sometimes corrupts the tar. Use `--keep` instead (Collectord 5.24+):
```bash
kubectl exec -n collectorforkubernetes collectorforkubernetes-master-bwmwr -- /collectord diag --keep
```

The command prints the path of the archive at the end — something like `collected diag data/diag-1745363135.tar.gz`. Copy that file off the node where the pod was running; it lives under /var/lib/collectorforkubernetes/.
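One way to retrieve it without SSH access to the node is `kubectl cp` from the pod, assuming the data directory is mounted into the pod at /data/collectorforkubernetes; both that path and the file name below are illustrative, so check the volumeMounts in your manifest and the path printed by the command:

```bash
# Copy the diag archive out of the pod's data directory (paths are assumptions)
kubectl cp collectorforkubernetes/collectorforkubernetes-master-bwmwr:/data/collectorforkubernetes/diag-1745363135.tar.gz \
  ./diag-1745363135.tar.gz
```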
2. Collect logs
```bash
kubectl logs -n collectorforkubernetes --timestamps collectorforkubernetes-master-bwmwr 1>collectorforkubernetes.log 2>&1
```

3. Run verify
Available since Collectord version 5.2

```bash
kubectl exec -n collectorforkubernetes collectorforkubernetes-master-bwmwr -- /collectord verify > verify.log
```

4. Prepare tar archive
```bash
tar -czvf collectorforkubernetes-$(date +%s).tar.gz verify.log collectorforkubernetes.log diag*.tar.gz
```
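If you collect bundles often, the four steps can be wrapped in one small script; a sketch, assuming the pod name is passed as the first argument and the defaults from this page (namespace and file names) apply:

```bash
#!/bin/sh
# Usage: ./collect-support-bundle.sh <collectord-pod-name>
# Gathers diag, logs, and verify output into one archive for support.
set -e
NS=collectorforkubernetes
POD="$1"

kubectl exec -n "$NS" "$POD" -- /collectord diag --stream 1>diag.tar.gz
kubectl logs -n "$NS" --timestamps "$POD" 1>collectorforkubernetes.log 2>&1
# keep going even if verify reports failures
kubectl exec -n "$NS" "$POD" -- /collectord verify > verify.log || true
tar -czvf "collectorforkubernetes-$(date +%s).tar.gz" verify.log collectorforkubernetes.log diag*.tar.gz
```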
Pod is not getting scheduled

If Collectord pods never appear, the DaemonSets aren’t placing them on any node. Check the desired/current counts:
```bash
kubectl get daemonset --namespace collectorforkubernetes
```

Zeros under DESIRED, CURRENT, READY, or UP-TO-DATE mean the controller couldn’t create a single pod:
```
NAME                            DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
collectorforkubernetes          0         0         0         0            0           <none>          1m
collectorforkubernetes-master   0         0         0         0            0           <none>          1m
```

Describe the DaemonSets to see the underlying reason:
```bash
$ kubectl describe daemonsets --namespace collectorforkubernetes
```

Both DaemonSets are listed; the Events section at the bottom of each tells you what failed:
```
...
Events:
  Type     Reason        Age   From                  Message
  ----     ------        ----  ----                  -------
  Warning  FailedCreate  31m   daemonset-controller  Error creating: pods "collectorforkubernetes-" is forbidden: SecurityContext.RunAsUser is forbidden
```

That particular error means Pod Security Policies are enforced on the cluster and the Collectord ClusterRole isn’t bound to the privileged PSP. Add the use permission for the privileged PSP to the Collectord ClusterRole:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app: collectorforkubernetes
  name: collectorforkubernetes
rules:
- apiGroups: ['extensions']
  resources: ['podsecuritypolicies']
  verbs: ['use']
  resourceNames:
  - privileged
- apiGroups:
  ...
```
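You can check the binding before re-applying the manifest; a quick sketch using kubectl’s built-in authorization check, assuming the service account is named collectorforkubernetes in the collectorforkubernetes namespace (adjust to your manifest):

```bash
# Should print "yes" once the ClusterRole grants use of the privileged PSP
kubectl auth can-i use podsecuritypolicy/privileged \
  --as=system:serviceaccount:collectorforkubernetes:collectorforkubernetes
```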
Failed to pull the image

If the DaemonSet shows pods but READY is below DESIRED, the kubelet is probably failing to pull the image:

```bash
$ kubectl get daemonsets --namespace collectorforkubernetes
```

```
NAMESPACE   NAME                     DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
default     collectorforkubernetes   1         1         0         1            0           <none>          6m
```

List the pods to confirm:
```bash
$ kubectl get pods --namespace collectorforkubernetes
```

A status of ImagePullBackOff confirms the kubelet can’t reach the registry:

```
NAMESPACE   NAME                           READY     STATUS             RESTARTS   AGE
default     collectorforkubernetes-55t61   0/1       ImagePullBackOff   0          2m
```

That means the cluster doesn’t have network access to hub.docker.com. Describe the pods to see the kubelet’s actual error:
```bash
$ kubectl describe pods --namespace collectorforkubernetes
```

The Events section shows the pull attempts:
```
Events:
  FirstSeen  LastSeen  Count  From                SubObjectPath                             Type     Reason      Message
  ---------  --------  -----  ----                -------------                             -------  ------      -------
  3m         2m        4      kubelet, localhost  spec.containers{collectorforkubernetes}  Normal   Pulling     pulling image "hub.docker.com/outcoldsolutions/collectorforkubernetes:26.04.1"
  3m         1m        6      kubelet, localhost  spec.containers{collectorforkubernetes}  Normal   BackOff     Back-off pulling image "hub.docker.com/outcoldsolutions/collectorforkubernetes:26.04.1"
  3m         1m        11     kubelet, localhost                                            Warning  FailedSync  Error syncing pod
```

Blocked access to external registries
If your cluster can’t reach hub.docker.com for security reasons, mirror the image to an internal registry from a host that does have outbound access.
Copying image from hub.docker.com to your own registry
Pull the image:
```bash
$ docker pull outcoldsolutions/collectorforkubernetes:26.04.1
```

Re-tag it under your registry:

```bash
docker tag outcoldsolutions/collectorforkubernetes:26.04.1 [YOUR_REGISTRY]/outcoldsolutions/collectorforkubernetes:26.04.1
```

Push it:

```bash
docker push [YOUR_REGISTRY]/outcoldsolutions/collectorforkubernetes:26.04.1
```

Then update the manifest to point at your registry:

```yaml
image: [YOUR_REGISTRY]/outcoldsolutions/collectorforkubernetes:26.04.1
```

If you need to move the image between hosts that can’t talk to each other, save it to a tar:

```bash
$ docker image save outcoldsolutions/collectorforkubernetes:26.04.1 > collectorforkubernetes.tar
```

And load it on the other host:

```bash
$ cat collectorforkubernetes.tar | docker image load
```
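If the two hosts can reach each other over SSH, the save and load steps can be combined into a single pipe; a sketch, assuming SSH access to the target host (the user and hostname are illustrative):

```bash
# Stream the image straight to the remote Docker daemon over SSH
docker image save outcoldsolutions/collectorforkubernetes:26.04.1 | ssh user@target-host docker image load
```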
Pod is crashing or running, but you don’t see any data

Get the Pod information
When a pod is crash-looping, the most useful thing in its YAML is lastState — it tells you why the previous container exited. Replace the pod name with the one that’s crashing:
```bash
kubectl get pod -n collectorforkubernetes -o yaml collectorforkubernetes-master-mshxd
```

If lastState looks like this:
```yaml
lastState:
  terminated:
    containerID: docker://8e9086aaf65b86d6d070f98ef4c5c59d9c838401a1f40765dd997723144d65db
    exitCode: 128
    finishedAt: "2022-10-16T05:58:13Z"
    message: path / is mounted on / but it is not a shared or slave mount
    reason: ContainerCannotRun
    startedAt: "2022-10-16T05:58:13Z"
```

The host’s root filesystem isn’t a shared or slave mount, so Collectord can’t propagate mounts back. In collectorforkubernetes.yaml, find every `mountPropagation: HostToContainer` and comment it out. The only feature you’ll lose is Containerd’s auto-discovery of volumes containing application logs.
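Two quick checks before you edit: confirm the propagation mode of the host’s root mount, and find the lines to comment out. A sketch, assuming the manifest file name used on this page:

```bash
# On the affected node: show the propagation mode of the root mount
# (shared or slave is what mount propagation needs)
findmnt -o TARGET,PROPAGATION /

# In the manifest: list every occurrence that needs to be commented out
grep -n 'mountPropagation: HostToContainer' collectorforkubernetes.yaml
```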
Email us at support@outcoldsolutions.com and we’ll help you configure it properly.
Check Collectord logs
The Collectord logs themselves usually tell you what’s going wrong. A healthy startup looks like this:
```
$ kubectl logs -f collectorforkubernetes-gvhgw --namespace collectorforkubernetes
INFO 2018/01/24 02:40:17.547485 main.go:213: Build date = 180116, version = 2.1.65


You are running trial version of this software.
Trial version valid for 30 days.

Contact sales@outcoldsolutions.com to purchase the license or extend trial.

See details on https://www.outcoldsolutions.com

INFO 2018/01/24 02:40:17.553805 main.go:207: InstanceID = 2K69F0F36DFT7E1RDBL9MSNROC, created = 2018-01-24 00:29:18.635604451 +0000 UTC
INFO 2018/01/24 02:40:17.681765 watcher.go:95: watching /rootfs/var/lib/docker/containers//(glob = */*-json.log*, match = )
INFO 2018/01/24 02:40:17.681798 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^(syslog|messages)(.\d+)?$)
INFO 2018/01/24 02:40:17.681803 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^[\w]+\.log(.\d+)?$)
INFO 2018/01/24 02:40:17.682663 watcher.go:150: added file /rootfs/var/lib/docker/containers/054e899d52626c2806400ec10f53df29dfa002ca28d08765facf404848967069/054e899d52626c2806400ec10f53df29dfa002ca28d08765facf404848967069-json.log
INFO 2018/01/24 02:40:17.682854 watcher.go:150: added file /rootfs/var/lib/docker/containers/0acb2dc45e1a180379f4e8c4604f4c73d76572957bce4a36cef65eadc927813d/0acb2dc45e1a180379f4e8c4604f4c73d76572957bce4a36cef65eadc927813d-json.log
INFO 2018/01/24 02:40:17.683300 watcher.go:150: added file /rootfs/var/log/userdata.log
INFO 2018/01/24 02:40:17.683357 watcher.go:150: added file /rootfs/var/log/yum.log
INFO 2018/01/24 02:40:17.683406 watcher.go:150: added file /rootfs/var/lib/docker/containers/14fe43366ab9305ecd486146ab2464377c59fe20592091739d8f51a323d2fb18/14fe43366ab9305ecd486146ab2464377c59fe20592091739d8f51a323d2fb18-json.log
INFO 2018/01/24 02:40:17.683860 watcher.go:150: added file /rootfs/var/lib/docker/containers/3ea123d8b5b21d04b6a2b6089a681744cd9d2829229e9f586b3ed1ac96b3ec02/3ea123d8b5b21d04b6a2b6089a681744cd9d2829229e9f586b3ed1ac96b3ec02-json.log
INFO 2018/01/24 02:40:17.683994 watcher.go:150: added file /rootfs/var/lib/docker/containers/4d6c5b7728ea14423f2039361da3c242362acceea7dd4a3209333a9f47d62f4f/4d6c5b7728ea14423f2039361da3c242362acceea7dd4a3209333a9f47d62f4f-json.log
INFO 2018/01/24 02:40:17.684166 watcher.go:150: added file /rootfs/var/lib/docker/containers/5781cb8252f2fe5bdd71d62415a7e2339a102f51c196701314e62a1cd6a5dd3f/5781cb8252f2fe5bdd71d62415a7e2339a102f51c196701314e62a1cd6a5dd3f-json.log
INFO 2018/01/24 02:40:17.685787 watcher.go:150: added file /rootfs/var/lib/docker/containers/6e3eacd5c86a33261e1d5ce76152d81c33cc08ec33ab316a2a27fff8e69a5b77/6e3eacd5c86a33261e1d5ce76152d81c33cc08ec33ab316a2a27fff8e69a5b77-json.log
INFO 2018/01/24 02:40:17.686062 watcher.go:150: added file /rootfs/var/lib/docker/containers/7151d7ce1342d84ceb8e563cbb164732e23d79baf71fce36d42d8de70b86da0f/7151d7ce1342d84ceb8e563cbb164732e23d79baf71fce36d42d8de70b86da0f-json.log
INFO 2018/01/24 02:40:17.687023 watcher.go:150: added file /rootfs/var/lib/docker/containers/d65e4efb5b3d84705daf342ae1a3640f6872e9195b770498a47e2a2d10b925e3/d65e4efb5b3d84705daf342ae1a3640f6872e9195b770498a47e2a2d10b925e3-json.log
INFO 2018/01/24 02:40:17.944910 license_check_pipe.go:102: license-check kubernetes 1 1519345758 2K69F0F36DFT7E1RDBL9MSNROC 1516753758 1516761617 2.1.65 1516060800 true true 0
```

If you forget to set url and token for the Splunk output, Collectord refuses to start:
```
INFO 2018/01/24 05:08:14.254306 main.go:213: Build date = 180116, version = 2.1.65
Configuration validation failed
[output.splunk]/url is required
```

If the license server is unreachable, the logs say so. For air-gapped clusters, contact us for a license that doesn’t require internet access.
If Splunk itself is unreachable, the logs say that too.
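To rule out basic connectivity, you can hit the HEC health endpoint yourself; a sketch using the example URL from the verify output above (substitute your own HEC URL, and run it from any host that can reach Splunk if the Collectord image lacks curl):

```bash
# A healthy HEC endpoint answers with {"text":"HEC is healthy","code":17}
curl -sk https://10.0.2.2:8088/services/collector/health
```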
If you don’t see any *-json.log files mentioned but you do have containers running, the node is probably using the journald logging driver instead of json-file. The startup log will look like this — note the missing *-json.log files:
```
INFO 2018/01/25 02:51:21.749190 main.go:213: Build date = 180116, version = 2.1.65
You are running trial version of this software.
Trial version valid for 30 days.
Contact sales@outcoldsolutions.com to purchase the license or extend trial.
See details on https://www.outcoldsolutions.com
INFO 2018/01/25 02:51:21.756258 main.go:207: InstanceID = 2K6ERLN622EBISIITVQE34PHA4, created = 2018-01-25 02:51:21.755847967 +0000 UTC m=+0.010852259
INFO 2018/01/25 02:51:21.910598 watcher.go:95: watching /rootfs/var/lib/docker/containers//(glob = */*-json.log*, match = )
INFO 2018/01/25 02:51:21.910909 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^(syslog|messages)(.\d+)?$)
INFO 2018/01/25 02:51:21.910915 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^[\w]+\.log(.\d+)?$)
INFO 2018/01/25 02:51:21.914101 watcher.go:150: added file /rootfs/var/log/userdata.log
INFO 2018/01/25 02:51:21.914354 watcher.go:150: added file /rootfs/var/log/yum.log
INFO 2018/01/25 02:51:22.468489 license_check_pipe.go:102: license-check kubernetes 1 1519440681 2K6ERLN622EBISIITVQE34PHA4 1516848681 1516848681 2.1.65 1516060800 true true 0
```

If the logs are clean but the Monitoring Kubernetes app is empty, the most common cause is that your HEC token writes to a non-default index that the Splunk role can’t search. Two fixes: add the index as a default index for the role, or update the app’s macros to scope them to your index. The macros live in Splunk Web UI under Settings > Advanced Search > Search Macros. For example, change `macro_kubernetes_logs` from `(sourcetype=kubernetes_logs)` to `(index=your_index sourcetype=kubernetes_logs)` — every dashboard is built on these macros, so the change takes effect immediately.
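If you prefer to script the macro change rather than click through Splunk Web, the same edit can be made through Splunk’s REST configuration endpoint; a sketch, assuming admin credentials and that the app directory is named monitoringkubernetes (verify both against your installation):

```bash
# Update macro_kubernetes_logs to scope every dashboard to your index;
# the host, app name, and credentials below are assumptions, not fixed values.
curl -k -u admin https://your-splunk-host:8089/servicesNS/nobody/monitoringkubernetes/configs/conf-macros/macro_kubernetes_logs \
  -d definition='(index=your_index sourcetype=kubernetes_logs)'
```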