
Collectord update - thruput and time correction

Today we shipped an updated version of Collectord (version 5.10.252) that brings two new features: throughput configuration and time correction.

If you have been running your OpenShift, Kubernetes, or Docker clusters for a while, you may have accumulated a lot of logs on the nodes. When you first deploy Collectord, it forwards those logs as fast as it can (a testament to its performance), which can put significant load on your Splunk deployment. To help you control this initial load, we are providing two new features:

  • Throughput - configure throughput at the global level (Collectord instance) or specifically for container or host logs.
  • Time correction - configure the time range in which you want to forward the logs, for example, define that you want to forward logs only in the time range (-48 hours, +1 hour). All events that are outside of this time range will be ignored.

Throughput

First, you can configure the global throughput in the Collectord configuration. Under the [general] section you can find thruputPerSecond, which you can set, for example, to 256Kb. Collectord applies this throughput to all the logs it ships from the node. Important note: metrics shipped from the node do not count against this throughput; we do not want to throttle metrics delivery and trigger unwanted alerts.
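A minimal sketch of the global setting (the rest of the [general] section is omitted, and the 256Kb value is just an example):

[general]
...

# limit log forwarding from this node to 256Kb per second
# (metrics are not counted against this limit)
thruputPerSecond = 256Kb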

For each container, you can configure throughput independently, and for host logs, you can configure throughput per set of files.

For example, if you configure thruputPerSecond under [input.files::logs], Collectord applies a single shared throughput limit to all the files matched by the [input.files::logs] configuration.
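A sketch of what that looks like in the configuration (the 128Kb value is illustrative):

[input.files::logs]
...

# all files matched by this input share a single 128Kb per second limit
thruputPerSecond = 128Kb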

If you configure thruputPerSecond under [input.files] (container logs), each container will have its own throughput. For example, if the node has two containers, one sending 100Kb per second and another 50Kb per second, and you have set thruputPerSecond to 80Kb, only the first container will be throttled to 80Kb because the second produces less than 80Kb per second.
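A sketch of the per-container limit from the example above (each container gets its own 80Kb per second budget):

[input.files]
...

# each container's log stream is limited to 80Kb per second
thruputPerSecond = 80Kb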

For container logs, you can also override this configuration with annotations by applying collectord.io/logs-ThruputPerSecond: 50Kb.
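For example, on a Pod definition the override sits under the annotations (the Pod name here is purely illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  annotations:
    collectord.io/logs-ThruputPerSecond: 50Kb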

Alerts for throttled logs

We are providing two different alerts. The first one tells you when Collectord containers produce WARN messages; the message looks similar to:

WARN 2019/07/24 18:53:00.815293 outcoldsolutions.com/collector/pipeline/pipes/thruput/pipe.go:70: pipeline is getting throttled - /rootfs/var/lib/docker/containers/b2aa6678086cbe2cd4ca374743a25e89225279db26ec34c7f4af8434b43b9b38 - maximum throughput = 10240 bytes per second

We produce this WARN message at most once a minute.

You can see these WARN messages with the alert Collectord reports warnings or errors in Splunk.

You will also know if logs are getting throttled with the alert Warning: Increasing lag between event time and indexing time in container logs, where we compare the _time of the event to the _indextime of the event and see if the lag is growing.

Time correction

Similar to throughput, you can configure which events you consider too old or too new to be forwarded to Splunk. Under the [general] section in the configuration, you can find two keys, tooOldEvents and tooNewEvents, which you can set to durations. For example:

[general]
...

# 168h = 7 days
tooOldEvents = 168h

# anything newer than 1 hour ahead is getting dropped
tooNewEvents = 1h

You can also configure these keys independently for container logs and host logs. In the case of container logs, you can override these values with annotations:

annotations:
    collectord.io/logs-TooOldEvents: 24h
    collectord.io/logs-TooNewEvents: 30m
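For host logs, the same keys go under the corresponding input section, for example (the values are illustrative):

[input.files::logs]
...

# drop events older than 24 hours
tooOldEvents = 24h

# drop events more than 30 minutes in the future
tooNewEvents = 30m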

Alerts for time correction

If Collectord finds events that are too new or too old, it will raise a WARN message:

WARN 2019/07/24 18:28:15.516115 outcoldsolutions.com/collector/pipeline/pipes/timecorrection/pipe.go:88: skipping too old or too new events - /rootfs/var/lib/docker/containers/7bef94bc58965ff059f7989ad9ae7db0b123b9e60615ffb28055884b85664cd3 - events should be in the scope (-7h, +30m)

We produce this WARN message at most once a minute.

You can see these WARN messages with the alert Collectord reports warnings or errors in Splunk.

Upgrade

If you are on version 5.10, just upgrade the image to version 5.10.252. If you are on a previous version, please look at our upgrade instructions.


About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift, and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications that give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing easy-to-use and easy-to-deploy solutions for Linux and Windows containers. We deliver applications that help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution to keep all your metrics and logs in one place, allowing you to quickly answer complex questions about container performance.