Verify configuration
Available since Collectord version 5.2

The first thing to do when something looks off is to run `collectord verify` from inside a Collectord pod. It checks the configuration end-to-end — license, Splunk output, container runtime, file inputs, Prometheus endpoints — and reports each item as OK or FAILED.
Start by listing the Collectord pods:
```
$ kubectl get pods -n collectorforkubernetes
NAME                                            READY     STATUS    RESTARTS   AGE
collectorforkubernetes-addon-857fccb8b9-t9qgq   1/1       Running   1          1h
collectorforkubernetes-master-bwmwr             1/1       Running   0          1h
collectorforkubernetes-xbnaa                    1/1       Running   0          1h
```

Collectord runs as three workloads — a DaemonSet on master nodes (collectorforkubernetes-master), a DaemonSet on the rest of the nodes (collectorforkubernetes), and a single Deployment add-on (collectorforkubernetes-addon). Run verify against one pod from each so every code path is exercised:
```bash
$ kubectl exec -n collectorforkubernetes collectorforkubernetes-addon-857fccb8b9-t9qgq -- /collectord verify
$ kubectl exec -n collectorforkubernetes collectorforkubernetes-master-bwmwr -- /collectord verify
$ kubectl exec -n collectorforkubernetes collectorforkubernetes-xbnaa -- /collectord verify
```

Each command produces output similar to:
```
Version = 5.2.176
Build date = 181012
Environment = kubernetes


General:
  + conf: OK
  + db: OK
  + db-meta: OK
  + instanceID: OK
      instanceID = 2LEKCFD4KT4MUBIAQSUG7GRSAG
  + license load: OK
      trial
  + license expiration: OK
      license expires 2018-11-12 15:51:18.200772266 -0500 EST
  + license connection: OK

Splunk output:
  + OPTIONS(url=https://10.0.2.2:8088/services/collector/event/1.0): OK
  + POST(url=https://10.0.2.2:8088/services/collector/event/1.0, index=): OK

Kubernetes configuration:
  + api: OK
  + pod cgroup: OK
      pods = 18
  + container cgroup: OK
      containers = 39
  + volumes root: OK
  + runtime: OK
      docker

Docker configuration:
  + connect: OK
      containers = 43
  + path: OK
  + cgroup: OK
      containers = 40
  + files: OK

CRI-O configuration:
  - ignored: OK
      kubernetes uses other container runtime

File Inputs:
  x input(syslog): FAILED
      no matches
  + input(logs): OK
      path /rootfs/var/log/

System Input:
  + path cgroup: OK
  + path proc: OK

Network stats Input:
  + path proc: OK

Network socket table Input:
  + path proc: OK

Proc Input:
  + path proc: OK

Mount Input:
  + stats: OK

Prometheus input:
  + input(kubernetes-api): OK
  x input(etcd): FAILED
      failed to load metrics from specified endpoints [https://:2379/metrics]
  x input(controller): FAILED
      failed to load metrics from specified endpoints [https://127.0.0.1:8444/metrics]
  + input(kubelet): OK

Errors: 5
```

The total number of errors appears at the bottom. Not every failure is a real problem — some are expected on smaller or non-standard clusters. The example above is from minikube, where these failures are benign:
- `input(syslog)` — minikube doesn’t persist syslog to disk, so those logs aren’t available.
- `input(etcd)` — etcd isn’t reachable on this minikube instance.
- `input(controller)` — the controller endpoint isn’t reachable on this minikube instance.
If you fix a real configuration error — say, a wrong Splunk URL — `kubectl apply -f ./collectorforkubernetes.yaml` won’t restart the running pods. Delete them so the workloads recreate them with the new config: `kubectl delete pods --all -n collectorforkubernetes`.
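After the pods come back, it is worth re-running verify across all of them in one pass. A minimal sketch, assuming the standard namespace and relying on the error count being the last line of the verify output, as in the example above:

```bash
# Re-run verify on every Collectord pod and print each pod's error count
for pod in $(kubectl get pods -n collectorforkubernetes -o name); do
  echo "== ${pod} =="
  kubectl exec -n collectorforkubernetes "${pod}" -- /collectord verify | tail -n 1
done
```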
Describe command
Available since Collectord version 5.12

When you apply annotations through namespaces, workloads, configurations, and pods, it can be hard to track which annotations end up applied to a Pod or Container. The describe command of collectord reports which annotations are in effect for a specific Pod. You can run it from any collectord Pod in the cluster:
```bash
kubectl exec -n collectorforkubernetes collectorforkubernetes-master-4gjmc -- /collectord describe --namespace default --pod postgres-pod --container postgres
```

Starting with version 26.04, the describe command also tags each resolved field with its origin in square brackets:
- `[pod]` — the value comes from a pod annotation
- `[namespace]` — the value comes from a namespace annotation
- `[configuration:<name>]` — the value comes from a Collectord CRD Configuration resource (the `<name>` matches the resource name)
This makes it easy to trace which level of the configuration hierarchy is winning when the same annotation is defined at multiple levels — for example, when a CRD-level default is being overridden by a pod-level annotation, or when a namespace annotation is unexpectedly routing logs to a different output:
```
$ kubectl exec -n collectorforkubernetes collectorforkubernetes-fqhmv -- /collectord describe --namespace webportal --pod audit-logger-774675c89c-rpfwx | grep '\['
logs-type [pod] = audit_logs
volume.1-logs-name [pod] = data
volume.1-logs-glob [pod] = *.log
```

This is especially useful when debugging why a pod is routing to an unexpected output, using the wrong sourcetype, or picking up a field extraction you didn’t expect.
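To see the precedence resolution in action, the sketch below sets the same annotation at both the namespace and pod level, then asks describe which one won. The annotation key `collectord.io/logs-type` is inferred from the `logs-type` field shown above; treat the exact key and the precedence order as assumptions to confirm against your own cluster:

```bash
# Hypothetical walk-through: define the same annotation at two levels,
# then let describe report which level wins (--overwrite replaces any
# existing value; pod-level edits are transient on managed pods).
kubectl annotate --overwrite namespace webportal collectord.io/logs-type=namespace_level
kubectl annotate --overwrite pod audit-logger-774675c89c-rpfwx -n webportal collectord.io/logs-type=pod_level

# If pod annotations take precedence, describe reports the pod origin:
kubectl exec -n collectorforkubernetes collectorforkubernetes-fqhmv -- \
  /collectord describe --namespace webportal --pod audit-logger-774675c89c-rpfwx \
  | grep 'logs-type'
# Expected: logs-type [pod] = pod_level
```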
Collect diagnostic information
When you open a support case, attach a diagnostic bundle so we can reproduce the issue without a back-and-forth. The bundle includes performance profiles, memory and telemetry metrics, host Linux information, and the Collectord configuration — Splunk URL and HEC token are stripped out.
Run all four steps below.
1. Collect internal diag information from the Collectord instance
Available since Collectord version 5.2

Pick any Collectord pod and run `collectord diag`. The command takes a few minutes:
```bash
kubectl exec -n collectorforkubernetes collectorforkubernetes-master-bwmwr -- /collectord diag --stream 1>diag.tar.gz
```

You can extract the archive yourself to see exactly what’s in it — performance and memory profiles, basic telemetry metrics, host Linux info, and license metadata.
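If you want to check the contents before attaching the bundle to a case, you can list the archive without unpacking it; a small sketch, assuming the diag.tar.gz produced by the command above:

```bash
# List the files inside the diag bundle without extracting it
tar -tzf diag.tar.gz
```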
Since 5.20.400, performance profiles aren’t collected by default. Add `--include-performance-profiles` if you need them.
Since 5.24, two more flags are available: `--quiet` suppresses stdout output, and `--keep` writes the diag file to Collectord’s data directory instead of streaming it.
If you’re running kubectl on a Windows or macOS host, streaming the archive directly back to your machine sometimes corrupts the tar. Use `--keep` instead (Collectord 5.24+):
```bash
kubectl exec -n collectorforkubernetes collectorforkubernetes-master-bwmwr -- /collectord diag --keep
```

The command prints the path of the archive at the end — something like `collected diag data/diag-1745363135.tar.gz`. Copy that file off the node where the pod was running; it lives under /var/lib/collectorforkubernetes/.
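One way to retrieve it without SSH access to the node is `kubectl cp` from the pod, assuming the data directory is mounted into the pod at /data/collectorforkubernetes; both that path and the file name below are illustrative, so check the volumeMounts in your manifest and the path printed by the command:

```bash
# Copy the diag archive out of the pod's data directory (paths are assumptions)
kubectl cp collectorforkubernetes/collectorforkubernetes-master-bwmwr:/data/collectorforkubernetes/diag-1745363135.tar.gz \
  ./diag-1745363135.tar.gz
```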
2. Collect logs
```bash
kubectl logs -n collectorforkubernetes --timestamps collectorforkubernetes-master-bwmwr 1>collectorforkubernetes.log 2>&1
```

3. Run verify
Available since Collectord version 5.2

```bash
kubectl exec -n collectorforkubernetes collectorforkubernetes-master-bwmwr -- /collectord verify > verify.log
```

4. Prepare tar archive
```bash
tar -czvf collectorforkubernetes-$(date +%s).tar.gz verify.log collectorforkubernetes.log diag*.tar.gz
```
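If you collect bundles often, the four steps can be wrapped in one small script; a sketch, assuming the pod name is passed as the first argument and the defaults from this page (namespace and file names) apply:

```bash
#!/bin/sh
# Usage: ./collect-support-bundle.sh <collectord-pod-name>
# Gathers diag, logs, and verify output into one archive for support.
set -e
NS=collectorforkubernetes
POD="$1"

kubectl exec -n "$NS" "$POD" -- /collectord diag --stream 1>diag.tar.gz
kubectl logs -n "$NS" --timestamps "$POD" 1>collectorforkubernetes.log 2>&1
# keep going even if verify reports failures
kubectl exec -n "$NS" "$POD" -- /collectord verify > verify.log || true
tar -czvf "collectorforkubernetes-$(date +%s).tar.gz" verify.log collectorforkubernetes.log diag*.tar.gz
```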
Pod is not getting scheduled

If Collectord pods never appear, the DaemonSets aren’t placing them on any node. Check the desired/current counts:
```bash
kubectl get daemonset --namespace collectorforkubernetes
```

Zeros under DESIRED, CURRENT, READY, or UP-TO-DATE mean the controller couldn’t create a single pod:
```
NAME                            DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
collectorforkubernetes          0         0         0         0            0           <none>          1m
collectorforkubernetes-master   0         0         0         0            0           <none>          1m
```

Describe the DaemonSets to see the underlying reason:
```bash
$ kubectl describe daemonsets --namespace collectorforkubernetes
```

Both DaemonSets are listed; the Events section at the bottom of each tells you what failed:
```
...
Events:
  Type     Reason        Age   From                  Message
  ----     ------        ----  ----                  -------
  Warning  FailedCreate  31m   daemonset-controller  Error creating: pods "collectorforkubernetes-" is forbidden: SecurityContext.RunAsUser is forbidden
```

That particular error means Pod Security Policies are enforced on the cluster and the Collectord ClusterRole isn’t bound to the privileged PSP. Add the use permission for the privileged PSP to the Collectord ClusterRole:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app: collectorforkubernetes
  name: collectorforkubernetes
rules:
- apiGroups: ['extensions']
  resources: ['podsecuritypolicies']
  verbs: ['use']
  resourceNames:
  - privileged
- apiGroups:
  ...
```
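You can check the binding before re-applying the manifest; a quick sketch using kubectl’s built-in authorization check, assuming the service account is named collectorforkubernetes in the collectorforkubernetes namespace (adjust to your manifest):

```bash
# Should print "yes" once the ClusterRole grants use of the privileged PSP
kubectl auth can-i use podsecuritypolicy/privileged \
  --as=system:serviceaccount:collectorforkubernetes:collectorforkubernetes
```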
Failed to pull the image

If the DaemonSet shows pods but READY is below DESIRED, the kubelet is probably failing to pull the image:

```bash
$ kubectl get daemonsets --namespace collectorforkubernetes
```

```
NAMESPACE   NAME                     DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
default     collectorforkubernetes   1         1         0         1            0           <none>          6m
```

List the pods to confirm:
```bash
$ kubectl get pods --namespace collectorforkubernetes
```

A status of ImagePullBackOff confirms the kubelet can’t reach the registry:

```
NAMESPACE   NAME                           READY     STATUS             RESTARTS   AGE
default     collectorforkubernetes-55t61   0/1       ImagePullBackOff   0          2m
```

That means the cluster doesn’t have network access to hub.docker.com. Describe the pods to see the kubelet’s actual error:
```bash
$ kubectl describe pods --namespace collectorforkubernetes
```

The Events section shows the pull attempts:
```
Events:
  FirstSeen  LastSeen  Count  From                SubObjectPath                             Type     Reason      Message
  ---------  --------  -----  ----                -------------                             -------  ------      -------
  3m         2m        4      kubelet, localhost  spec.containers{collectorforkubernetes}  Normal   Pulling     pulling image "hub.docker.com/outcoldsolutions/collectorforkubernetes:26.04.1"
  3m         1m        6      kubelet, localhost  spec.containers{collectorforkubernetes}  Normal   BackOff     Back-off pulling image "hub.docker.com/outcoldsolutions/collectorforkubernetes:26.04.1"
  3m         1m        11     kubelet, localhost                                            Warning  FailedSync  Error syncing pod
```

Blocked access to external registries
If your cluster can’t reach hub.docker.com for security reasons, mirror the image to an internal registry from a host that does have outbound access.
Copying image from hub.docker.com to your own registry
Pull the image:
```bash
$ docker pull outcoldsolutions/collectorforkubernetes:26.04.1
```

Re-tag it under your registry:

```bash
docker tag outcoldsolutions/collectorforkubernetes:26.04.1 [YOUR_REGISTRY]/outcoldsolutions/collectorforkubernetes:26.04.1
```

Push it:

```bash
docker push [YOUR_REGISTRY]/outcoldsolutions/collectorforkubernetes:26.04.1
```

Then update the manifest to point at your registry:

```yaml
image: [YOUR_REGISTRY]/outcoldsolutions/collectorforkubernetes:26.04.1
```

If you need to move the image between hosts that can’t talk to each other, save it to a tar:

```bash
$ docker image save outcoldsolutions/collectorforkubernetes:26.04.1 > collectorforkubernetes.tar
```

And load it on the other host:

```bash
$ cat collectorforkubernetes.tar | docker image load
```
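If the two hosts can reach each other over SSH, the save and load steps can be combined into a single pipe; a sketch, assuming SSH access to the target host (the user and hostname are illustrative):

```bash
# Stream the image straight to the remote Docker daemon over SSH
docker image save outcoldsolutions/collectorforkubernetes:26.04.1 | ssh user@target-host docker image load
```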
Pod is crashing or running, but you don’t see any data

Get the Pod information
When a pod is crash-looping, the most useful thing in its YAML is lastState — it tells you why the previous container exited. Replace the pod name with the one that’s crashing:
```bash
kubectl get pod -n collectorforkubernetes -o yaml collectorforkubernetes-master-mshxd
```

If lastState looks like this:
```yaml
lastState:
  terminated:
    containerID: docker://8e9086aaf65b86d6d070f98ef4c5c59d9c838401a1f40765dd997723144d65db
    exitCode: 128
    finishedAt: "2022-10-16T05:58:13Z"
    message: path / is mounted on / but it is not a shared or slave mount
    reason: ContainerCannotRun
    startedAt: "2022-10-16T05:58:13Z"
```

The host’s root filesystem isn’t a shared or slave mount, so Collectord can’t propagate mounts back. In collectorforkubernetes.yaml, find every `mountPropagation: HostToContainer` and comment it out. The only feature you’ll lose is Containerd’s auto-discovery of volumes containing application logs.
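Two quick checks before you edit: confirm the propagation mode of the host’s root mount, and find the lines to comment out. A sketch, assuming the manifest file name used on this page:

```bash
# On the affected node: show the propagation mode of the root mount
# (shared or slave is what mount propagation needs)
findmnt -o TARGET,PROPAGATION /

# In the manifest: list every occurrence that needs to be commented out
grep -n 'mountPropagation: HostToContainer' collectorforkubernetes.yaml
```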
Email us at support@outcoldsolutions.com and we’ll help you configure it properly.
Check Collectord logs
The Collectord logs themselves usually tell you what’s going wrong. A healthy startup looks like this:
```
$ kubectl logs -f collectorforkubernetes-gvhgw --namespace collectorforkubernetes
INFO 2018/01/24 02:40:17.547485 main.go:213: Build date = 180116, version = 2.1.65


You are running trial version of this software.
Trial version valid for 30 days.

Contact sales@outcoldsolutions.com to purchase the license or extend trial.

See details on https://www.outcoldsolutions.com

INFO 2018/01/24 02:40:17.553805 main.go:207: InstanceID = 2K69F0F36DFT7E1RDBL9MSNROC, created = 2018-01-24 00:29:18.635604451 +0000 UTC
INFO 2018/01/24 02:40:17.681765 watcher.go:95: watching /rootfs/var/lib/docker/containers//(glob = */*-json.log*, match = )
INFO 2018/01/24 02:40:17.681798 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^(syslog|messages)(.\d+)?$)
INFO 2018/01/24 02:40:17.681803 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^[\w]+\.log(.\d+)?$)
INFO 2018/01/24 02:40:17.682663 watcher.go:150: added file /rootfs/var/lib/docker/containers/054e899d52626c2806400ec10f53df29dfa002ca28d08765facf404848967069/054e899d52626c2806400ec10f53df29dfa002ca28d08765facf404848967069-json.log
INFO 2018/01/24 02:40:17.682854 watcher.go:150: added file /rootfs/var/lib/docker/containers/0acb2dc45e1a180379f4e8c4604f4c73d76572957bce4a36cef65eadc927813d/0acb2dc45e1a180379f4e8c4604f4c73d76572957bce4a36cef65eadc927813d-json.log
INFO 2018/01/24 02:40:17.683300 watcher.go:150: added file /rootfs/var/log/userdata.log
INFO 2018/01/24 02:40:17.683357 watcher.go:150: added file /rootfs/var/log/yum.log
INFO 2018/01/24 02:40:17.683406 watcher.go:150: added file /rootfs/var/lib/docker/containers/14fe43366ab9305ecd486146ab2464377c59fe20592091739d8f51a323d2fb18/14fe43366ab9305ecd486146ab2464377c59fe20592091739d8f51a323d2fb18-json.log
INFO 2018/01/24 02:40:17.683860 watcher.go:150: added file /rootfs/var/lib/docker/containers/3ea123d8b5b21d04b6a2b6089a681744cd9d2829229e9f586b3ed1ac96b3ec02/3ea123d8b5b21d04b6a2b6089a681744cd9d2829229e9f586b3ed1ac96b3ec02-json.log
INFO 2018/01/24 02:40:17.683994 watcher.go:150: added file /rootfs/var/lib/docker/containers/4d6c5b7728ea14423f2039361da3c242362acceea7dd4a3209333a9f47d62f4f/4d6c5b7728ea14423f2039361da3c242362acceea7dd4a3209333a9f47d62f4f-json.log
INFO 2018/01/24 02:40:17.684166 watcher.go:150: added file /rootfs/var/lib/docker/containers/5781cb8252f2fe5bdd71d62415a7e2339a102f51c196701314e62a1cd6a5dd3f/5781cb8252f2fe5bdd71d62415a7e2339a102f51c196701314e62a1cd6a5dd3f-json.log
INFO 2018/01/24 02:40:17.685787 watcher.go:150: added file /rootfs/var/lib/docker/containers/6e3eacd5c86a33261e1d5ce76152d81c33cc08ec33ab316a2a27fff8e69a5b77/6e3eacd5c86a33261e1d5ce76152d81c33cc08ec33ab316a2a27fff8e69a5b77-json.log
INFO 2018/01/24 02:40:17.686062 watcher.go:150: added file /rootfs/var/lib/docker/containers/7151d7ce1342d84ceb8e563cbb164732e23d79baf71fce36d42d8de70b86da0f/7151d7ce1342d84ceb8e563cbb164732e23d79baf71fce36d42d8de70b86da0f-json.log
INFO 2018/01/24 02:40:17.687023 watcher.go:150: added file /rootfs/var/lib/docker/containers/d65e4efb5b3d84705daf342ae1a3640f6872e9195b770498a47e2a2d10b925e3/d65e4efb5b3d84705daf342ae1a3640f6872e9195b770498a47e2a2d10b925e3-json.log
INFO 2018/01/24 02:40:17.944910 license_check_pipe.go:102: license-check kubernetes 1 1519345758 2K69F0F36DFT7E1RDBL9MSNROC 1516753758 1516761617 2.1.65 1516060800 true true 0
```

If you forget to set url and token for the Splunk output, Collectord refuses to start:
```
INFO 2018/01/24 05:08:14.254306 main.go:213: Build date = 180116, version = 2.1.65
Configuration validation failed
[output.splunk]/url is required
```

If the license server is unreachable, the logs say so. For air-gapped clusters, contact us for a license that doesn’t require internet access.
If Splunk itself is unreachable, the logs say that too.
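To rule out basic connectivity, you can hit the HEC health endpoint yourself; a sketch using the example URL from the verify output above (substitute your own HEC URL, and run it from any host that can reach Splunk if the Collectord image lacks curl):

```bash
# A healthy HEC endpoint answers with {"text":"HEC is healthy","code":17}
curl -sk https://10.0.2.2:8088/services/collector/health
```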
If you don’t see any *-json.log files mentioned but you do have containers running, the node is probably using the journald logging driver instead of json-file. The startup log will look like this — note the missing *-json.log files:
```
INFO 2018/01/25 02:51:21.749190 main.go:213: Build date = 180116, version = 2.1.65
You are running trial version of this software.
Trial version valid for 30 days.
Contact sales@outcoldsolutions.com to purchase the license or extend trial.
See details on https://www.outcoldsolutions.com
INFO 2018/01/25 02:51:21.756258 main.go:207: InstanceID = 2K6ERLN622EBISIITVQE34PHA4, created = 2018-01-25 02:51:21.755847967 +0000 UTC m=+0.010852259
INFO 2018/01/25 02:51:21.910598 watcher.go:95: watching /rootfs/var/lib/docker/containers//(glob = */*-json.log*, match = )
INFO 2018/01/25 02:51:21.910909 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^(syslog|messages)(.\d+)?$)
INFO 2018/01/25 02:51:21.910915 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^[\w]+\.log(.\d+)?$)
INFO 2018/01/25 02:51:21.914101 watcher.go:150: added file /rootfs/var/log/userdata.log
INFO 2018/01/25 02:51:21.914354 watcher.go:150: added file /rootfs/var/log/yum.log
INFO 2018/01/25 02:51:22.468489 license_check_pipe.go:102: license-check kubernetes 1 1519440681 2K6ERLN622EBISIITVQE34PHA4 1516848681 1516848681 2.1.65 1516060800 true true 0
```

If the logs are clean but the Monitoring Kubernetes app is empty, the most common cause is that your HEC token writes to a non-default index that the Splunk role can’t search. Two fixes: add the index as a default index for the role, or update the app’s macros to scope them to your index. The macros live in Splunk Web UI under Settings > Advanced Search > Search Macros. For example, change `macro_kubernetes_logs` from `(sourcetype=kubernetes_logs)` to `(index=your_index sourcetype=kubernetes_logs)` — every dashboard is built on these macros, so the change takes effect immediately.
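If you prefer to script the macro change rather than click through Splunk Web, the same edit can be made through Splunk’s REST configuration endpoint; a sketch, assuming admin credentials and that the app directory is named monitoringkubernetes (verify both against your installation):

```bash
# Update macro_kubernetes_logs to scope every dashboard to your index;
# the host, app name, and credentials below are assumptions, not fixed values.
curl -k -u admin https://your-splunk-host:8089/servicesNS/nobody/monitoringkubernetes/configs/conf-macros/macro_kubernetes_logs \
  -d definition='(index=your_index sourcetype=kubernetes_logs)'
```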