
Monitoring OpenShift - Version 5

Troubleshooting Collector for OpenShift

Pod is not getting scheduled

Verify that the daemonsets have scheduled pods on the nodes

$ oc get daemonset --namespace collectorforopenshift

If the numbers under DESIRED, CURRENT, READY or UP-TO-DATE in the output are 0, something may be wrong with the configuration

NAME                           DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
collectorforopenshift          0         0         0         0            0           <none>          1m
collectorforopenshift-master   0         0         0         0            0           <none>          1m

You can run the following command to describe the current state of the daemonsets

$ oc describe daemonsets --namespace collectorforopenshift

The output will contain two daemonsets. In the last lines of each one you can find the events reported for that daemonset, for example

...
Events:
  FirstSeen LastSeen    Count   From        SubObjectPath   Type        Reason      Message
  --------- --------    -----   ----        -------------   --------    ------      -------
  2m        43s     15  daemon-set          Warning     FailedCreate    Error creating: pods "collectorforopenshift-" is forbidden: unable to validate against any security context constraint: [provider anyuid: .spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed provider anyuid: .spec.containers[0].securityContext.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider anyuid: .spec.containers[0].securityContext.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider anyuid: .spec.containers[0].securityContext.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider anyuid: .spec.containers[0].securityContext.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider anyuid: .spec.containers[0].securityContext.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used securityContext.runAsUser: Invalid value: 0: UID on container collectorforopenshift does not match required range.  Found 0, required min: 1000000000 max: 1000009999 provider restricted: .spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed provider restricted: .spec.containers[0].securityContext.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]

The error above means that you forgot to add the collectorforopenshift service account to the privileged security context constraint. Run the command

$ oc adm policy add-scc-to-user privileged system:serviceaccount:collectorforopenshift:collectorforopenshift
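To confirm that the service account has been added, you can list the users assigned to the privileged security context constraint; you should see system:serviceaccount:collectorforopenshift:collectorforopenshift in the output

$ oc get scc privileged -o jsonpath='{.users}'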

Run describe again after a few moments (it can take up to a few minutes)

$ oc describe daemonsets --namespace collectorforopenshift

In the output you may still see the old events, but you should also see a new SuccessfulCreate event

Events:
  FirstSeen LastSeen    Count   From        SubObjectPath   Type        Reason      Message
  --------- --------    -----   ----        -------------   --------    ------      -------
  ...
  1m        1m      1   daemon-set          Normal      SuccessfulCreate    Created pod: collectorforopenshift-55t61
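You can also verify that the collector pods are scheduled and running on every node (and the collectorforopenshift-master pods on the master nodes)

$ oc get pods --namespace collectorforopenshift -o wide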

Failed to pull the image

When you run the command

$ oc get daemonsets --namespace collectorforopenshift

you may find that the number under READY does not match DESIRED

NAMESPACE   NAME                    DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
default     collectorforopenshift   1         1         0         1            0           <none>          6m

Find the pods that OpenShift failed to start

$ oc get pods --namespace collectorforopenshift

You may see that the collectorforopenshift- pod has the ImagePullBackOff error, as in the example below

NAMESPACE   NAME                            READY     STATUS             RESTARTS   AGE
default     collectorforopenshift-55t61     0/1       ImagePullBackOff   0          2m

In that case you need to verify that your OpenShift cluster has access to the hub.docker.com registry or to registry.connect.redhat.com, depending on which Configuration Reference you use.

You can run the command

$ oc describe pods --namespace collectorforopenshift

which shows output for each pod, including the events raised for every pod

Events:
  FirstSeen LastSeen    Count   From            SubObjectPath               Type        Reason      Message
  --------- --------    -----   ----            -------------               --------    ------      -------
  3m        2m      4   kubelet, localhost  spec.containers{collectorforopenshift}  Normal      Pulling     pulling image "registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:4.0.174.180821"
  3m        2m      4   kubelet, localhost  spec.containers{collectorforopenshift}  Warning     Failed      Failed to pull image "registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:4.0.174.180821": rpc error: code = 2 desc = unexpected http code: 500, URL: https://registry.connect.redhat.com/auth/realms/rhc4tp/protocol/docker-v2/auth?scope=repository%3Aoutcoldsolutions%2Fcollectorforopenshift%3Apull&service=docker-registry
  3m        1m      6   kubelet, localhost  spec.containers{collectorforopenshift}  Normal      BackOff     Back-off pulling image "registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:4.0.174.180821"
  3m        1m      11  kubelet, localhost                      Warning     FailedSync  Error syncing pod
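To see the registry error directly, you can also try to pull the image manually on one of the nodes

$ docker pull registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:4.0.174.180821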

Failed to pull image from registry.connect.redhat.com

Images in the Red Hat Container Catalog are listed with two registries, registry.access.redhat.com and registry.connect.redhat.com. Originally all images (both those provided by Red Hat and those provided by partners) were in registry.access.redhat.com, but since the beginning of 2018 partner images are being moved to registry.connect.redhat.com. OpenShift Container Platform has very good support for the registry.access.redhat.com registry, but registry.connect.redhat.com currently lacks good documentation and good out-of-the-box support in OpenShift Container Platform.

If the pod events show that OpenShift failed to download the image from registry.connect.redhat.com because of authorization issues, you can fall back to the image from hub.docker.com, or you can authenticate with this registry and save the credentials as a secret.

See the Configuration Reference page for how to authenticate with registry.connect.redhat.com.
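As a sketch of the second option (the Configuration Reference is the authoritative source; the secret name redhat-connect-registry below is only an example), you can create a pull secret with your Red Hat credentials and link it to the collectorforopenshift service account for image pulls

$ oc create secret docker-registry redhat-connect-registry \
    --docker-server=registry.connect.redhat.com \
    --docker-username=[redhat-username] \
    --docker-password=[redhat-user-password] \
    --namespace collectorforopenshift
$ oc secrets link collectorforopenshift redhat-connect-registry --for=pull --namespace collectorforopenshift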

Blocked access to external registries

If you block external registries (hub.docker.com or registry.connect.redhat.com) for security reasons, you can copy the image from the external registry to your own registry using one host that has access to the external registry.

Copying image from hub.docker.com to your own registry

$ docker pull outcoldsolutions/collectorforopenshift:4.0.174.180821

After that you can re-tag it by prefixing it with your own registry

$ docker tag outcoldsolutions/collectorforopenshift:4.0.174.180821 [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:4.0.174.180821

And push it to your registry

$ docker push [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:4.0.174.180821

After that you will need to change your configuration YAML file to specify that you want to use the image from the new location

image: [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:4.0.174.180821
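After updating the image in the configuration file, apply it again (assuming the configuration is saved as collectorforopenshift.yaml; use whatever file name you used during installation). Depending on the daemonset update strategy, you may also need to delete the existing pods so they are recreated with the new image.

$ oc apply -f collectorforopenshift.yaml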

If you need to move the image between hosts, you can export it to a tar file

$ docker image save outcoldsolutions/collectorforopenshift:4.0.174.180821 > collectorforopenshift.tar

And load it on a different Docker host

$ cat collectorforopenshift.tar | docker image load

Copying image from registry.connect.redhat.com to your own registry

Log in to registry.connect.redhat.com using docker login and your Red Hat account

$ docker login registry.connect.redhat.com
Username: [redhat-username]
Password: [redhat-user-password]
Login Succeeded

Make sure to use your username and not your email when you log in to this registry. Both allow you to log in, but if you logged in with the email, you will not be able to download the image.

$ docker pull registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:4.0.174.180821

After that you can re-tag it by prefixing it with your own registry

$ docker tag registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:4.0.174.180821 [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:4.0.174.180821

And push it to your registry

$ docker push [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:4.0.174.180821

After that you will need to change your configuration YAML file to specify that you want to use the image from the new location

image: [YOUR_REGISTRY]/outcoldsolutions/collectorforopenshift:4.0.174.180821

If you need to move the image between hosts, you can export it to a tar file

$ docker image save registry.connect.redhat.com/outcoldsolutions/collectorforopenshift:4.0.174.180821 > collectorforopenshift.tar

And load it on a different Docker host

$ cat collectorforopenshift.tar | docker image load

Pod is crashing or running, but you don't see any data

Start by looking at the collector logs. This is how normal output looks

$ oc logs -f collectorforopenshift-gvhgw --namespace collectorforopenshift
INFO 2018/01/24 02:40:17.547485 main.go:213: Build date = 180116, version = 2.1.65


You are running trial version of this software.
Trial version valid for 30 days.

Contact sales@outcoldsolutions.com to purchase the license or extend trial.

See details on https://www.outcoldsolutions.com

INFO 2018/01/24 02:40:17.553805 main.go:207: InstanceID = 2K69F0F36DFT7E1RDBL9MSNROC, created = 2018-01-24 00:29:18.635604451 +0000 UTC
INFO 2018/01/24 02:40:17.681765 watcher.go:95: watching /rootfs/var/lib/docker/containers//(glob = */*-json.log*, match = )
INFO 2018/01/24 02:40:17.681798 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^(syslog|messages)(.\d+)?$)
INFO 2018/01/24 02:40:17.681803 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^[\w]+\.log(.\d+)?$)
INFO 2018/01/24 02:40:17.682663 watcher.go:150: added file /rootfs/var/lib/docker/containers/054e899d52626c2806400ec10f53df29dfa002ca28d08765facf404848967069/054e899d52626c2806400ec10f53df29dfa002ca28d08765facf404848967069-json.log
INFO 2018/01/24 02:40:17.682854 watcher.go:150: added file /rootfs/var/lib/docker/containers/0acb2dc45e1a180379f4e8c4604f4c73d76572957bce4a36cef65eadc927813d/0acb2dc45e1a180379f4e8c4604f4c73d76572957bce4a36cef65eadc927813d-json.log
INFO 2018/01/24 02:40:17.683300 watcher.go:150: added file /rootfs/var/log/userdata.log
INFO 2018/01/24 02:40:17.683357 watcher.go:150: added file /rootfs/var/log/yum.log
INFO 2018/01/24 02:40:17.683406 watcher.go:150: added file /rootfs/var/lib/docker/containers/14fe43366ab9305ecd486146ab2464377c59fe20592091739d8f51a323d2fb18/14fe43366ab9305ecd486146ab2464377c59fe20592091739d8f51a323d2fb18-json.log
INFO 2018/01/24 02:40:17.683860 watcher.go:150: added file /rootfs/var/lib/docker/containers/3ea123d8b5b21d04b6a2b6089a681744cd9d2829229e9f586b3ed1ac96b3ec02/3ea123d8b5b21d04b6a2b6089a681744cd9d2829229e9f586b3ed1ac96b3ec02-json.log
INFO 2018/01/24 02:40:17.683994 watcher.go:150: added file /rootfs/var/lib/docker/containers/4d6c5b7728ea14423f2039361da3c242362acceea7dd4a3209333a9f47d62f4f/4d6c5b7728ea14423f2039361da3c242362acceea7dd4a3209333a9f47d62f4f-json.log
INFO 2018/01/24 02:40:17.684166 watcher.go:150: added file /rootfs/var/lib/docker/containers/5781cb8252f2fe5bdd71d62415a7e2339a102f51c196701314e62a1cd6a5dd3f/5781cb8252f2fe5bdd71d62415a7e2339a102f51c196701314e62a1cd6a5dd3f-json.log
INFO 2018/01/24 02:40:17.685787 watcher.go:150: added file /rootfs/var/lib/docker/containers/6e3eacd5c86a33261e1d5ce76152d81c33cc08ec33ab316a2a27fff8e69a5b77/6e3eacd5c86a33261e1d5ce76152d81c33cc08ec33ab316a2a27fff8e69a5b77-json.log
INFO 2018/01/24 02:40:17.686062 watcher.go:150: added file /rootfs/var/lib/docker/containers/7151d7ce1342d84ceb8e563cbb164732e23d79baf71fce36d42d8de70b86da0f/7151d7ce1342d84ceb8e563cbb164732e23d79baf71fce36d42d8de70b86da0f-json.log
INFO 2018/01/24 02:40:17.687023 watcher.go:150: added file /rootfs/var/lib/docker/containers/d65e4efb5b3d84705daf342ae1a3640f6872e9195b770498a47e2a2d10b925e3/d65e4efb5b3d84705daf342ae1a3640f6872e9195b770498a47e2a2d10b925e3-json.log
INFO 2018/01/24 02:40:17.944910 license_check_pipe.go:102: license-check openshift  1 1519345758 2K69F0F36DFT7E1RDBL9MSNROC 1516753758 1516761617 2.1.65 1516060800 true true 0 

If you forget to set the url and token for the Splunk output, you will see

INFO 2018/01/24 05:08:14.254306 main.go:213: Build date = 180116, version = 2.1.65
Configuration validation failed
[output.splunk]/url is required
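To fix it, set both url and token in the [output.splunk] section of the collector configuration in your configuration YAML file. A minimal sketch with placeholder values (your actual Splunk HTTP Event Collector endpoint and token will differ)

[output.splunk]
url = https://[splunk-hec-host]:8088/services/collector/event/1.0
token = [splunk-hec-token]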

If the connection to our license server fails, you will see that in the logs. If your containers and hosts do not have access to the internet, please contact us for a license that does not require internet access.

If the connection to your Splunk instances fails, you will see that in the logs as well.

If you don't see any *-json.log files mentioned, but you have containers running, you possibly have the journald logging driver enabled instead of json-file. Please review our installation steps in Monitoring OpenShift Installation. As an example

INFO 2018/01/25 02:51:21.749190 main.go:213: Build date = 180116, version = 2.1.65
You are running trial version of this software.
Trial version valid for 30 days.
Contact sales@outcoldsolutions.com to purchase the license or extend trial.
See details on https://www.outcoldsolutions.com
INFO 2018/01/25 02:51:21.756258 main.go:207: InstanceID = 2K6ERLN622EBISIITVQE34PHA4, created = 2018-01-25 02:51:21.755847967 +0000 UTC m=+0.010852259
INFO 2018/01/25 02:51:21.910598 watcher.go:95: watching /rootfs/var/lib/docker/containers//(glob = */*-json.log*, match = )
INFO 2018/01/25 02:51:21.910909 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^(syslog|messages)(.\d+)?$)
INFO 2018/01/25 02:51:21.910915 watcher.go:95: watching /rootfs/var/log//(glob = , match = ^[\w]+\.log(.\d+)?$)
INFO 2018/01/25 02:51:21.914101 watcher.go:150: added file /rootfs/var/log/userdata.log
INFO 2018/01/25 02:51:21.914354 watcher.go:150: added file /rootfs/var/log/yum.log
INFO 2018/01/25 02:51:22.468489 license_check_pipe.go:102: license-check openshift  1 1519440681 2K6ERLN622EBISIITVQE34PHA4 1516848681 1516848681 2.1.65 1516060800 true true 0 
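To check which logging driver Docker uses, run on one of the nodes

$ docker info --format '{{.LoggingDriver}}'

If it reports journald instead of json-file, switch the logging driver as described in the installation instructions.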

If you don't see any errors, but you also don't see any data in the Monitoring OpenShift application, it is possible that you specified an index other than main for the Splunk HTTP Event Collector token you use. In that case you can add this index as a default index for the Splunk role you are using, or change our macros in the application to prefix them with index=your_index. You can find the macros in the Splunk Web UI, under Settings, Advanced Search, Search Macros. For example, for the macro macro_openshift_logs you will need to change the value from (sourcetype=openshift_logs) to (index=your_index sourcetype=openshift_logs). All our dashboards are built on top of these macros, so changing them has an immediate effect on the application.
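To quickly verify in Splunk that the data is reaching the expected index, you can run a search like the following (replace your_index with the index configured for your HTTP Event Collector token)

index=your_index sourcetype=openshift_logs | head 10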


About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications, which give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing easy-to-use, easy-to-deploy solutions for Linux and Windows containers. We deliver applications that help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution to help you keep all the metrics and logs in one place, allowing you to quickly address complex questions on container performance.