Collectord update - throughput and time correction
Today we shipped an updated version of Collectord (version 5.10.252) that brings two features: throughput configuration and time correction.
If you have been running your OpenShift, Kubernetes, or Docker clusters for a while, you may have accumulated a lot of logs on the nodes. When you deploy Collectord, it will run as fast as it can (proving its outstanding performance), which can put a lot of load on your Splunk deployment. To let you control how this existing data is preloaded, we are providing two new features:
- Throughput - configure throughput at the global level (Collectord instance) or specifically for container or host logs.
- Time correction - configure the time range in which you want to forward the logs, for example, define that you want to forward logs only in the time range (-48 hours, +1 hour). All events that are outside of this time range will be ignored.
Throughput
First, you can configure the global throughput in the Collectord configuration. Under the [general] section you can find thruputPerSecond, which you can set, for example, to 256Kb. Collectord will apply this throughput to all the logs it ships from this node. Important note: metrics shipped from this node do not count toward the throughput, because we do not want to throttle metrics delivery and trigger unwanted alerts.
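To get a feel for what a global limit means for an existing backlog, here is a rough back-of-the-envelope sketch in Python. It assumes that Kb is a 1024-based multiplier (the WARN example later in this post reports limits in bytes per second) and that no new logs arrive while the backlog is shipped. The helpers parse_size and drain_seconds are hypothetical names used only for illustration; they are not part of Collectord.

```python
import re

# Assumption: "Kb", "Mb" and "Gb" are 1024-based multipliers. Check the
# Collectord documentation for the exact units it accepts.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_size(text):
    """Convert a size string such as '256Kb' into bytes (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", text)
    if not match:
        raise ValueError(f"cannot parse size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.lower()]


def drain_seconds(backlog_bytes, thruput_per_second):
    """Seconds needed to ship a fixed backlog at a constant per-second limit."""
    return backlog_bytes / parse_size(thruput_per_second)


if __name__ == "__main__":
    # 1 GiB of old logs on a node, limited to 256Kb per second.
    seconds = drain_seconds(1024 ** 3, "256Kb")
    print(f"{seconds / 3600:.2f} hours")
```

Under these assumptions, 1 GiB of old logs at 256Kb per second takes 4096 seconds, a little over an hour. In practice the node keeps producing new logs, so the real time will be longer.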
For each container, you can configure throughput independently, and for host logs, you can configure throughput per set of files.
For example, if you configure thruputPerSecond under [input.files::logs], Collectord will apply that throughput to the set of files matched by the [input.files::logs] configuration.
If you configure thruputPerSecond under [input.files] (container logs), each container will have its own throughput. For example, if the node has two containers, one sending 100Kb per second and another 50Kb per second, and you have set thruputPerSecond to 80Kb, only the first container will be throttled to 80Kb, because the second produces less than 80Kb per second.
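The two-container example can be written out as a small Python sketch. It only models the arithmetic described above (each container is capped independently at the configured limit); it does not reflect how Collectord implements throttling internally, and the function and container names are made up for illustration.

```python
def effective_rates(produced_kb_per_second, limit_kb_per_second):
    """Forwarded rate per container when every container has its own limit.

    Both arguments use Kb per second. The limit is applied to each
    container separately; it is not shared between containers.
    """
    return {
        name: min(rate, limit_kb_per_second)
        for name, rate in produced_kb_per_second.items()
    }


if __name__ == "__main__":
    produced = {"container-a": 100, "container-b": 50}
    # thruputPerSecond = 80Kb under [input.files]
    print(effective_rates(produced, 80))
    # prints {'container-a': 80, 'container-b': 50}
```

The node as a whole still forwards 130Kb per second here, because the limit under [input.files] applies per container rather than to the sum.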
For container logs, you can also override this configuration with annotations by applying collectord.io/logs-ThruputPerSecond: 50Kb.
Alerts for throttled logs
We are providing two different alerts. The first one will tell you if Collectord containers are producing WARN messages, and the message will look similar to:
WARN 2019/07/24 18:53:00.815293 outcoldsolutions.com/collector/pipeline/pipes/thruput/pipe.go:70: pipeline is getting throttled - /rootfs/var/lib/docker/containers/b2aa6678086cbe2cd4ca374743a25e89225279db26ec34c7f4af8434b43b9b38 - maximum throughput = 10240 bytes per second
We produce this WARN message at most once a minute.
You can see these WARN messages with the alert "Collectord reports warnings or errors" in Splunk.
You will also know if logs are getting throttled from the alert "Warning: Increasing lag between event time and indexing time in container logs", where we compare the _time of the event to the _indextime of the event and see if the lag is growing.
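The idea behind the lag check can be illustrated with a short Python sketch. This is not the Splunk search that the alert uses; it is a simplified assumption of what "the lag is growing" could mean, comparing the average lag in the first and second half of a sample of events. The function names and the min_increase parameter are hypothetical.

```python
from statistics import mean


def lags(events):
    """Lag in seconds for each (event_time, index_time) pair of epoch seconds."""
    return [index_time - event_time for event_time, index_time in events]


def lag_is_growing(events, min_increase=0.0):
    """Return True if the average lag of the later events exceeds the
    average lag of the earlier events by more than min_increase seconds.

    Events are ordered by index time before they are split in half.
    """
    ordered = sorted(events, key=lambda pair: pair[1])
    values = lags(ordered)
    if len(values) < 2:
        return False
    half = len(values) // 2
    return mean(values[half:]) - mean(values[:half]) > min_increase


if __name__ == "__main__":
    # (_time, _indextime) pairs; the lag grows from 5 to 30 seconds.
    throttled = [(0, 5), (10, 20), (20, 40), (30, 60)]
    # The lag stays at a constant 5 seconds.
    steady = [(0, 5), (10, 15), (20, 25), (30, 35)]
    print(lag_is_growing(throttled), lag_is_growing(steady))
```

A constant lag is normal (events always take some time to reach the indexer); a lag that keeps increasing suggests that logs are produced faster than the configured throughput allows them to be forwarded.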
Time correction
Similar to throughput, you can define which events are too old or too new to be forwarded to Splunk. Under the [general] section of the configuration, you can find two keys, tooOldEvents and tooNewEvents, which you can set to durations. For example:
[general]
...
# 168h = 7 days
tooOldEvents = 168h
# anything newer than 1 hour ahead is getting dropped
tooNewEvents = 1h
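To make the resulting window concrete, here is a minimal Python sketch of the check these two keys describe: an event is forwarded only if its timestamp falls between now minus tooOldEvents and now plus tooNewEvents. It is an illustration under assumptions, not Collectord's implementation. In particular, parse_duration is a hypothetical helper that accepts only a single number followed by s, m, or h, and treating the boundaries as inclusive is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Parse durations such as '168h' or '30m' (hypothetical, simplified)."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if not match:
        raise ValueError(f"cannot parse duration: {text!r}")
    value, unit = match.groups()
    return timedelta(seconds=int(value) * _UNIT_SECONDS[unit])


def should_forward(event_time, now, too_old="168h", too_new="1h"):
    """True if event_time lies within (now - too_old, now + too_new)."""
    oldest = now - parse_duration(too_old)
    newest = now + parse_duration(too_new)
    return oldest <= event_time <= newest


if __name__ == "__main__":
    now = datetime(2019, 7, 24, 18, 0, tzinfo=timezone.utc)
    print(should_forward(now - timedelta(hours=2), now))  # inside the window
    print(should_forward(now - timedelta(days=8), now))   # older than 168h
    print(should_forward(now + timedelta(hours=2), now))  # more than 1h ahead
```

With the values from the configuration above, an event from two hours ago is forwarded, while an event from eight days ago or one stamped two hours in the future is skipped.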
You can also configure these keys independently for container logs and host logs. In the case of container logs, you can override these values with annotations:
annotations:
collectord.io/logs-TooOldEvents: 24h
collectord.io/logs-tooNewEvents: 30m
Alerts for time correction
If Collectord finds events that are too new or too old, it will raise a WARN message:
WARN 2019/07/24 18:28:15.516115 outcoldsolutions.com/collector/pipeline/pipes/timecorrection/pipe.go:88: skipping too old or too new events - /rootfs/var/lib/docker/containers/7bef94bc58965ff059f7989ad9ae7db0b123b9e60615ffb28055884b85664cd3 - events should be in the scope (-7h, +30m)
We produce this WARN message at most once a minute.
You can see these WARN messages with the alert "Collectord reports warnings or errors" in Splunk.
Upgrade
If you are on version 5.10, just upgrade the image to version 5.10.252. If you are on a previous version, please look at our upgrade instructions.