Outcold Solutions LLC

Monitoring Kubernetes - Version 5

Troubleshooting

Verify configuration

Available since collectorforkubernetes v5.2

Get the list of pods

$ kubectl get pods -n collectorforkubernetes
NAME                                           READY     STATUS    RESTARTS   AGE
collectorforkubernetes-addon-857fccb8b9-t9qgq   1/1       Running   1          1h
collectorforkubernetes-master-bwmwr             1/1       Running   0          1h
collectorforkubernetes-xbnaa                    1/1       Running   0          1h

We have three deployment types: the DaemonSet deployed on masters (collectorforkubernetes-master), the DaemonSet deployed on non-master nodes (collectorforkubernetes), and the addon Deployment (collectorforkubernetes-addon). Verify one pod from each of them (in the example below, change the pod names to the ones running on your cluster).

$ kubectl exec -n collectorforkubernetes collectorforkubernetes-addon-857fccb8b9-t9qgq -- /collectord verify
$ kubectl exec -n collectorforkubernetes collectorforkubernetes-master-bwmwr -- /collectord verify
$ kubectl exec -n collectorforkubernetes collectorforkubernetes-xbnaa -- /collectord verify

For each command you will see output similar to

Version = 5.2.176
Build date = 181012
Environment = kubernetes


  General:
  + conf: OK
  + db: OK
  + db-meta: OK
  + instanceID: OK
    instanceID = 2LEKCFD4KT4MUBIAQSUG7GRSAG
  + license load: OK
    trial
  + license expiration: OK
    license expires 2018-11-12 15:51:18.200772266 -0500 EST
  + license connection: OK

  Splunk output:
  + OPTIONS(url=https://10.0.2.2:8088/services/collector/event/1.0): OK
  + POST(url=https://10.0.2.2:8088/services/collector/event/1.0, index=): OK

  Kubernetes configuration:
  + api: OK
  + pod cgroup: OK
    pods = 18
  + container cgroup: OK
    containers = 39
  + volumes root: OK
  + runtime: OK
    docker

  Docker configuration:
  + connect: OK
    containers = 43
  + path: OK
  + cgroup: OK
    containers = 40
  + files: OK

  CRI-O configuration:
  - ignored: OK
    kubernetes uses other container runtime

  File Inputs:
  x input(syslog): FAILED
    no matches
  + input(logs): OK
    path /rootfs/var/log/

  System Input:
  + path cgroup: OK
  + path proc: OK

  Network stats Input:
  + path proc: OK

  Network socket table Input:
  + path proc: OK

  Proc Input:
  + path proc: OK

  Mount Input:
  + stats: OK

  Prometheus input:
  + input(kubernetes-api): OK
  x input(etcd): FAILED
    failed to load metrics from specified endpoints [https://:2379/metrics]
  x input(controller): FAILED
    failed to load metrics from specified endpoints [https://127.0.0.1:8444/metrics]
  + input(kubelet): OK

Errors: 5

The number of errors is reported at the end. In our example we show output from minikube, where some inputs fail because of the environment, for example

  • input(syslog) - minikube does not persist syslog output to disk, so we will not be able to see these logs in the application
  • input(etcd) - etcd is not available on this minikube instance
  • input(controller) - controller is not available on this minikube instance

If you find an error in the configuration, such as an incorrect Splunk URL, apply the change and recreate the pods so they pick up the new configuration. The easiest way is to delete all pods in our namespace; the workloads will recreate them, as shown below.
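
$ kubectl apply -f ./collectorforkubernetes.yaml
$ kubectl delete pods --all -n collectorforkubernetes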

Describe command

Available since collectorforkubernetes v5.12

When you apply annotations through namespaces, workloads, configurations, and pods, it can be hard to track which annotations end up applied to a specific Pod or Container. You can use the collectord describe command to see which annotations are used for a specific Pod. You can run this command from any collectord Pod in the cluster

kubectl exec -n collectorforkubernetes collectorforkubernetes-master-4gjmc -- /collectord describe --namespace default --pod postgres-pod --container postgres

Collect diagnostic information

If you need to open a support case, you can collect diagnostic information, including performance data, metrics, and configuration (excluding the Splunk URL and Token).

Please run all 4 steps to collect diagnostic information.

1. Collect internal diag information from the Collectord instance by running the following command

Available since collectorforkubernetes v5.2

Choose the pod from which you want to collect diagnostic information.

The following command takes several minutes.

kubectl exec -n collectorforkubernetes collectorforkubernetes-master-bwmwr -- /collectord diag --stream 1>diag.tar.gz

You can extract the tar archive to verify the information that we collect. We include information about performance, memory usage, basic telemetry metrics, an information file with the host Linux version, and basic information about the license.

Since 5.20.400 performance information is not collected by default; include the flag --include-performance-profiles in the command to collect it.
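
For example, a sketch of collecting the diag archive together with performance profiles and listing the archive contents (the flag placement and output file name are up to you):

kubectl exec -n collectorforkubernetes collectorforkubernetes-master-bwmwr -- /collectord diag --stream --include-performance-profiles 1>diag.tar.gz
tar -tzf diag.tar.gz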

2. Collect logs

kubectl logs -n collectorforkubernetes --timestamps collectorforkubernetes-master-bwmwr  1>collectorforkubernetes.log 2>&1

3. Run verify

Available since collectorforkubernetes v5.2

kubectl exec -n collectorforkubernetes collectorforkubernetes-master-bwmwr -- /collectord verify > verify.log

4. Prepare tar archive

tar -czvf collectorforkubernetes-$(date +%s).tar.gz verify.log collectorforkubernetes.log diag.tar.gz

Pod is not getting scheduled

Verify that the DaemonSets have scheduled pods on the nodes

kubectl get daemonset --namespace collectorforkubernetes

If in the output the numbers under DESIRED, CURRENT, READY or UP-TO-DATE are 0, something may be wrong with the configuration

NAME                            DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
collectorforkubernetes          0         0         0         0            0           <none>          1m
collectorforkubernetes-master   0         0         0         0            0           <none>          1m

You can run the following command to describe the current state of the DaemonSets

$ kubectl describe daemonsets --namespace collectorforkubernetes

In the output there will be two DaemonSets. For each of them, the last lines contain the events reported for that DaemonSet, for example

...
Events:
  Type     Reason            Age                From                  Message
  ----     ------            ----               ----                  -------
  Warning  FailedCreate      31m                daemonset-controller  Error creating: pods "collectorforkubernetes-" is forbidden: SecurityContext.RunAsUser is forbidden

This error means that you are using Pod Security Policies. In that case you need to add our Cluster Role to the privileged Pod Security Policy, with

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app: collectorforkubernetes
  name: collectorforkubernetes
rules:
- apiGroups: ['extensions']
  resources: ['podsecuritypolicies']
  verbs:     ['use']
  resourceNames:
  - privileged
- apiGroups:
  ...
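
One way to check that the rule took effect is to ask the API server whether the collector's service account may use the privileged policy (the service account name below is an assumption; use the one defined in your manifest):

$ kubectl auth can-i use podsecuritypolicies/privileged --as=system:serviceaccount:collectorforkubernetes:collectorforkubernetes
yes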

Failed to pull the image

When you run the command

$ kubectl get daemonsets --namespace collectorforkubernetes

You may find that the number under READY does not match DESIRED

NAMESPACE   NAME                     DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
default     collectorforkubernetes   1         1         0         1            0           <none>          6m

Try to find the pods that Kubernetes failed to start

$ kubectl get pods --namespace collectorforkubernetes

If you see that a collectorforkubernetes- pod has the error ImagePullBackOff, as in the example below

NAMESPACE   NAME                             READY     STATUS             RESTARTS   AGE
default     collectorforkubernetes-55t61     0/1       ImagePullBackOff   0          2m

In that case you need to verify that your Kubernetes cluster has access to the hub.docker.com registry.

You can run the command

$ kubectl describe pods --namespace collectorforkubernetes

This should show output for each pod, including the events raised for every pod

Events:
  FirstSeen LastSeen    Count   From            SubObjectPath               Type        Reason      Message
  --------- --------    -----   ----            -------------               --------    ------      -------
  3m        2m      4   kubelet, localhost  spec.containers{collectorforkubernetes} Normal      Pulling     pulling image "hub.docker.com/outcoldsolutions/collectorforkubernetes:5.22.420"
  3m        1m      6   kubelet, localhost  spec.containers{collectorforkubernetes} Normal      BackOff     Back-off pulling image "hub.docker.com/outcoldsolutions/collectorforkubernetes:5.22.420"
  3m        1m      11  kubelet, localhost                      Warning     FailedSync  Error syncing pod
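
To confirm that a node can actually reach the registry, you can also try pulling the image manually on that node (a quick check, assuming the node runs Docker):

$ docker pull outcoldsolutions/collectorforkubernetes:5.22.420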

Blocked access to external registries

If you block external registries (hub.docker.com) for security reasons, you can copy the image from the external registry to your own registry using one host that has access to the external registry.

To copy the image from hub.docker.com to your own registry, first pull it

$ docker pull outcoldsolutions/collectorforkubernetes:5.22.420

After that you can re-tag it by prefixing with your own registry

docker tag  outcoldsolutions/collectorforkubernetes:5.22.420 [YOUR_REGISTRY]/outcoldsolutions/collectorforkubernetes:5.22.420

And push it to your registry

docker push [YOUR_REGISTRY]/outcoldsolutions/collectorforkubernetes:5.22.420

After that you will need to change your configuration yaml file to specify that you want to use the image from a different location

image: [YOUR_REGISTRY]/outcoldsolutions/collectorforkubernetes:5.22.420
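
For context, this image field lives under the container spec of each workload in the manifest; a sketch (other fields omitted):

containers:
- name: collectorforkubernetes
  image: [YOUR_REGISTRY]/outcoldsolutions/collectorforkubernetes:5.22.420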

If you need to move the image between computers, you can export it to a tar file

$ docker image save outcoldsolutions/collectorforkubernetes:5.22.420 > collectorforkubernetes.tar

And load it on a different Docker host

$ cat collectorforkubernetes.tar | docker image load

Pod is crashing or running, but you don't see any data

Get the Pod information

First get information about the Pod (replace pod name with the one that is crashing)

kubectl get pod -n collectorforkubernetes -o yaml collectorforkubernetes-master-mshxd

If in the lastState you see something similar to

lastState:
  terminated:
    containerID: docker://8e9086aaf65b86d6d070f98ef4c5c59d9c838401a1f40765dd997723144d65db
    exitCode: 128
    finishedAt: "2022-10-16T05:58:13Z"
    message: path / is mounted on / but it is not a shared or slave mount
    reason: ContainerCannotRun
    startedAt: "2022-10-16T05:58:13Z"

You will need to modify how rootfs is mounted inside the Pod. In the collectorforkubernetes.yaml file, find all occurrences of mountPropagation: HostToContainer and comment them out. The only feature that will not work is the ability for Containerd to auto-discover volumes with the Application Logs.
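
A sketch of the change (the volume and mount names below are assumptions; match the ones used in your manifest):

volumeMounts:
- name: rootfs
  mountPath: /rootfs
  readOnly: true
  # mountPropagation: HostToContainer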

Please email us at support@outcoldsolutions.com to help us to configure it properly.

Check collectord logs

Start by looking at the logs of the collector; this is how normal output looks

$ kubectl logs -f collectorforkubernetes-gvhgw --namespace collectorforkubernetes
INFO 2018/01/24 02:40:17.547485 main.go:213: Build date = 180116, version = 2.1.65


You are running trial version of this software.
Trial version valid for 30 days.

Contact sales@outcoldsolutions.com to purchase the license or extend trial.

See details on https://www.outcoldsolutions.com

INFO 2018/01/24 02:40:17.553805 main.go:207: InstanceID = 2K69F0F36DFT7E1RDBL9MSNROC, created = 2018-01-24 00:29:18.635604451 +0000 UTC
INFO 2018/01/24 02:40:17.681765 watcher.go:95: watching /rootfs/var/lib/docker/containers//(glob = */*-json.log*, match = )
INFO 2018/01/24 02:40:17.681798 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^(syslog|messages)(.\d+)?$)
INFO 2018/01/24 02:40:17.681803 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^[\w]+\.log(.\d+)?$)
INFO 2018/01/24 02:40:17.682663 watcher.go:150: added file /rootfs/var/lib/docker/containers/054e899d52626c2806400ec10f53df29dfa002ca28d08765facf404848967069/054e899d52626c2806400ec10f53df29dfa002ca28d08765facf404848967069-json.log
INFO 2018/01/24 02:40:17.682854 watcher.go:150: added file /rootfs/var/lib/docker/containers/0acb2dc45e1a180379f4e8c4604f4c73d76572957bce4a36cef65eadc927813d/0acb2dc45e1a180379f4e8c4604f4c73d76572957bce4a36cef65eadc927813d-json.log
INFO 2018/01/24 02:40:17.683300 watcher.go:150: added file /rootfs/var/log/userdata.log
INFO 2018/01/24 02:40:17.683357 watcher.go:150: added file /rootfs/var/log/yum.log
INFO 2018/01/24 02:40:17.683406 watcher.go:150: added file /rootfs/var/lib/docker/containers/14fe43366ab9305ecd486146ab2464377c59fe20592091739d8f51a323d2fb18/14fe43366ab9305ecd486146ab2464377c59fe20592091739d8f51a323d2fb18-json.log
INFO 2018/01/24 02:40:17.683860 watcher.go:150: added file /rootfs/var/lib/docker/containers/3ea123d8b5b21d04b6a2b6089a681744cd9d2829229e9f586b3ed1ac96b3ec02/3ea123d8b5b21d04b6a2b6089a681744cd9d2829229e9f586b3ed1ac96b3ec02-json.log
INFO 2018/01/24 02:40:17.683994 watcher.go:150: added file /rootfs/var/lib/docker/containers/4d6c5b7728ea14423f2039361da3c242362acceea7dd4a3209333a9f47d62f4f/4d6c5b7728ea14423f2039361da3c242362acceea7dd4a3209333a9f47d62f4f-json.log
INFO 2018/01/24 02:40:17.684166 watcher.go:150: added file /rootfs/var/lib/docker/containers/5781cb8252f2fe5bdd71d62415a7e2339a102f51c196701314e62a1cd6a5dd3f/5781cb8252f2fe5bdd71d62415a7e2339a102f51c196701314e62a1cd6a5dd3f-json.log
INFO 2018/01/24 02:40:17.685787 watcher.go:150: added file /rootfs/var/lib/docker/containers/6e3eacd5c86a33261e1d5ce76152d81c33cc08ec33ab316a2a27fff8e69a5b77/6e3eacd5c86a33261e1d5ce76152d81c33cc08ec33ab316a2a27fff8e69a5b77-json.log
INFO 2018/01/24 02:40:17.686062 watcher.go:150: added file /rootfs/var/lib/docker/containers/7151d7ce1342d84ceb8e563cbb164732e23d79baf71fce36d42d8de70b86da0f/7151d7ce1342d84ceb8e563cbb164732e23d79baf71fce36d42d8de70b86da0f-json.log
INFO 2018/01/24 02:40:17.687023 watcher.go:150: added file /rootfs/var/lib/docker/containers/d65e4efb5b3d84705daf342ae1a3640f6872e9195b770498a47e2a2d10b925e3/d65e4efb5b3d84705daf342ae1a3640f6872e9195b770498a47e2a2d10b925e3-json.log
INFO 2018/01/24 02:40:17.944910 license_check_pipe.go:102: license-check kubernetes  1 1519345758 2K69F0F36DFT7E1RDBL9MSNROC 1516753758 1516761617 2.1.65 1516060800 true true 0 

If you forget to set the url and token for the Splunk output, you will see

INFO 2018/01/24 05:08:14.254306 main.go:213: Build date = 180116, version = 2.1.65
Configuration validation failed
[output.splunk]/url is required
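
A minimal sketch of the relevant section in the collector configuration (the values below are placeholders; the full set of options depends on your deployment):

[output.splunk]
url = https://splunk-hec.example.com:8088/services/collector/event/1.0
token = 00000000-0000-0000-0000-000000000000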

If the connection to our license server fails, you will see that in the logs. If your containers and hosts do not have access to the internet, please contact us for a license which does not require internet access.

If the connection to your Splunk instances fails, you will see that in the logs as well.

If you don't see any *-json.log files mentioned, but you have containers running, you possibly have the journald logging driver enabled instead of json-file. As an example

INFO 2018/01/25 02:51:21.749190 main.go:213: Build date = 180116, version = 2.1.65
You are running trial version of this software.
Trial version valid for 30 days.
Contact sales@outcoldsolutions.com to purchase the license or extend trial.
See details on https://www.outcoldsolutions.com
INFO 2018/01/25 02:51:21.756258 main.go:207: InstanceID = 2K6ERLN622EBISIITVQE34PHA4, created = 2018-01-25 02:51:21.755847967 +0000 UTC m=+0.010852259
INFO 2018/01/25 02:51:21.910598 watcher.go:95: watching /rootfs/var/lib/docker/containers//(glob = */*-json.log*, match = )
INFO 2018/01/25 02:51:21.910909 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^(syslog|messages)(.\d+)?$)
INFO 2018/01/25 02:51:21.910915 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^[\w]+\.log(.\d+)?$)
INFO 2018/01/25 02:51:21.914101 watcher.go:150: added file /rootfs/var/log/userdata.log
INFO 2018/01/25 02:51:21.914354 watcher.go:150: added file /rootfs/var/log/yum.log
INFO 2018/01/25 02:51:22.468489 license_check_pipe.go:102: license-check kubernetes  1 1519440681 2K6ERLN622EBISIITVQE34PHA4 1516848681 1516848681 2.1.65 1516060800 true true 0 
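
To check which logging driver Docker uses on the node, you can run the following on the host (the collector discovers *-json.log files, which only exist with the json-file driver):

$ docker info --format '{{.LoggingDriver}}'
json-file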

If you don't see any errors, but you also don't see any data in the Monitoring Kubernetes application, it is possible that the Splunk HTTP Event Collector token you use is configured with an index other than main. In that case you can add this index as a default index for the Splunk role you are using, or change our macros in the application to prefix them with index=your_index. You can find the macros in the Splunk Web UI under Settings, Advanced search, Search macros. As an example, for the macro macro_kubernetes_logs you will need to change the value from (sourcetype=kubernetes_logs) to (index=your_index sourcetype=kubernetes_logs). All our dashboards are built on top of these macros, so changing them takes effect on the application immediately.
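
To quickly check whether log data reaches Splunk at all, you can run a search similar to the following in Splunk (adjust the index name to the one configured for your HTTP Event Collector token):

index=your_index sourcetype=kubernetes_logs | head 10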


About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications, which give you insights across all container environments. We are helping businesses reduce complexity related to logging and monitoring by providing easy-to-use and easy-to-deploy solutions for Linux and Windows containers. We deliver applications that help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution to help you keep all the metrics and logs in one place, allowing you to quickly address complex questions on container performance.