
Outcold Solutions - Monitoring Kubernetes, OpenShift and Docker in Splunk

Monitoring Linux

FAQ

How to submit a support request?

If you have a support contract, send an email to support@outcoldsolutions.com. When you sign the contract, you nominate the people in your organization authorized to open requests - contact your procurement department to add more.

Include the following in every request:

- LicenseID (not the License Key). You can find it in our applications under Setup -> Collectord usage, or in the Collectord logs - for example, INFO 2023/05/24 05:36:48.834659 outcoldsolutions.com/collectord/license/license_check_pipe.go:158: license-check openshift BG5183Q89IE2G 0 0 ....
- The customer name, if you’re a partner or contractor opening a request on someone else’s behalf.
- A short incident description in the email subject.
- Incident details in the body. The more you include up front, the faster we can help - we won’t have to come back asking for more.

In 99% of cases we’ll ask for diagnostic information from the Collectord instance running on the node where the issue is happening; that archive contains most of what we need, including the Collectord version and basic usage. Run all 4 steps for your platform: Kubernetes, OpenShift, or Docker.

If you can’t share diagnostic information, describe the issue in as much detail as possible and include the Collectord version you’re running.

A few things that help us help you faster:

- Email is our primary support channel. We’ll follow up with a phone call or web meeting if it’s useful.
- You can keep your team in CC, but please respect the contract and keep that list small - we can’t respond to people who aren’t covered.
- One support request per email thread. If you have multiple issues, open multiple requests.

I do not see IO metrics on host dashboards

We source IO metrics from the blkio-controller, which isn’t always enabled.
As a workaround, sum the proc metrics - we collect those per process.

I received a license. How do I apply it?

Collectord reads the license key from its configuration file.

Collectord for Kubernetes and Collectord for OpenShift

Set the license in the ConfigMap shipped with our YAML manifests. Find the line license = and paste your key as the value (no spaces).

Starting with version 5.0, Collectord picks up the change within a few moments - no restart needed. On versions below 5.0 you need to restart the collectors. Editing a ConfigMap doesn’t trigger a Pod restart, so delete the running pods and let the scheduler recreate them.

OpenShift:

    oc delete --namespace collectorforopenshift pods --all

Kubernetes:

    kubectl delete --namespace collectorforkubernetes pods --all

If you’re on version 3 of our Monitoring OpenShift and Kubernetes solution, Collectord runs in the default namespace - delete the pods manually and the DaemonSet will reschedule them.

Collectord for Docker

Edit the collector.yaml configuration file if you already maintain your own, or set the license through an environment variable: --env "COLLECTOR__LICENSE=general__license=...".

How much data does your application generate?

In version 5.3 and above, the Splunk Usage dashboard under Setup shows current usage at a glance. Because we use indexed field extractions on the HTTP Event Collector, our tests show less than a 5% increase in Splunk licensing cost for logs.

You can also check usage per source type directly in Splunk:

    index=_internal source=*metrics.log | eval MB=kb/1024 | search group="per_sourcetype_thruput" | timechart span=1h sum(MB) by series

For Kubernetes nodes running 20–30 containers, we typically see around 200 MB per day across kubernetes_stats and kubernetes_proc_stats combined.
Here’s the licensing breakdown from our demo environment - 2 OpenShift clusters, 9 nodes total (1+3 masters, 1+4 workers), 62 pods:

Performance

For numbers on Collectord’s throughput, see our blog post: Forwarding 10,000 1k events generated by containers from a single host with ease.

Do you support a specific version of Kubernetes, OpenShift and Docker?

We continuously test against the most popular Kubernetes providers - Azure AKS, Amazon EKS, and Google Kubernetes Engine - and verify edge versions on self-provisioned clusters built with Kubeadm. Our OpenShift monitoring solution is Red Hat certified. For Docker, we test on a variety of Linux distributions and ship configurations for the common orchestrations, including Docker Swarm and Amazon ECS.

Why is it called Outcold Solutions?

Outcold Solutions takes its name from our founder’s recognized industry handle “outcoldman,” which became synonymous with expertise in Docker and Splunk integration during the early adoption phase of containerization. Our founder’s deep technical knowledge and early adoption of containerization made them a trusted voice in the enterprise monitoring community, establishing the technical foundation that drives our solutions today.

Our founder’s industry contributions include developing the original Docker images for Splunk that became the official Splunk Docker images, contributing to the Splunk logging driver for Docker, and serving as a technical authority through speaking engagements at Splunk Conferences. This proven track record of innovation and enterprise-grade solutions continues to guide our commitment to delivering reliable, scalable monitoring platforms for mission-critical environments.
Monitoring Linux - Installation

This guide walks you through installing Monitoring Linux end-to-end: configuring the Splunk app and HTTP Event Collector, then installing Collectord on your Linux host as a systemd service to forward host logs (syslog, journald), host metrics, and process metrics. A typical install takes under 10 minutes. If you don’t have a license yet, you can request a 30-day evaluation.

Install the Monitoring Linux application

Install Monitoring Linux from Splunkbase on your Search Heads only.

If you’re using a dedicated index that isn’t searchable by default, update the macro_linux_base macro to include it:

    macro_linux_base = (index=linux)

Enable HTTP Event Collector in Splunk

Collectord forwards data to Splunk over the HTTP Event Collector (HEC). If HEC isn’t enabled yet, follow Splunk’s HTTP Event Collector walkthrough.

The minimum requirement is Splunk Enterprise or Splunk Cloud 6.5. If you’re managing Splunk Clusters older than 6.5, see our FAQ on setting up a Heavy Weight Forwarder in between.

Once HEC is enabled, you need two pieces of information for the rest of this guide: the HEC endpoint URL and an HEC token. You can verify both with curl:

    $ curl -k https://hec.example.com:8088/services/collector/event/1.0 -H "Authorization: Splunk B5A79AAD-D822-46CC-80D1-819F80D7BFB0" -d '{"event": "hello world"}'
    {"text": "Success", "code": 0}

-k skips certificate validation; use it only for self-signed certificates. Splunk Cloud uses a different HEC URL than Splunk Web - see Send data to HTTP Event Collector on Splunk Cloud instances.

Install Collectord for Linux

Download collectorforlinux.tar.gz and extract it into /opt/collectorforlinux. The archive contains builds for both amd64 and aarch64 architectures.
    sudo curl -o /tmp/collectorforlinux.tar.gz https://www.outcoldsolutions.com/docs/monitoring-linux/builds/5.21.410/collectorforlinux.tar.gz
    sudo mkdir -p /opt/collectorforlinux
    sudo tar -xvf /tmp/collectorforlinux.tar.gz -C /opt/collectorforlinux

Open /opt/collectorforlinux/etc/002-user.conf with your editor:

    sudoedit /opt/collectorforlinux/etc/002-user.conf

This file holds your overrides for the Collectord defaults. The full default configuration lives in /opt/collectorforlinux/etc/001-general.conf - refer to it when you need to know what options exist.

In 002-user.conf, set the Splunk HEC URL and token, review and accept the license agreement, and paste in your license key (request an evaluation key with this automated form). Naming the cluster is optional but useful when you’re monitoring more than one host group and want to filter by cluster in the app.

002-user.conf

    [general]
    acceptLicense = true
    license = ...
    fields.linux_cluster = dev

    [output.splunk]
    url = https://hec.example.com:8088/services/collector/event/1.0
    token = B5A79AAD-D822-46CC-80D1-819F80D7BFB0
    insecure = true

You can run collectorforlinux directly from the terminal to confirm the configuration is valid:

    sudo /opt/collectorforlinux/bin/collectorforlinux

Install the collectorforlinux service with systemd

The package ships a systemd unit you can link with systemctl and run as a background daemon:

    sudo systemctl link /opt/collectorforlinux/bin/collectorforlinux.service
    sudo systemctl daemon-reload
    sudo systemctl enable collectorforlinux
    sudo systemctl start collectorforlinux

Tail the logs to confirm Collectord is running:

    sudo journalctl -fu collectorforlinux

Next steps

- Review the predefined alerts and enable the ones relevant to your environment.
- If something looks off, work through the troubleshooting checks.
- Configure log forwarding from custom locations beyond /var/log and journald.
Monitoring Linux - Configuration

Collectord’s defaults live in /opt/collectorforlinux/etc/001-general.conf. Don’t edit that file directly - instead, override the values you need in /opt/collectorforlinux/etc/002-user.conf. The reference below shows every section and key with inline comments.

001-general.conf

    # collectord configuration file
    #
    # Run collectord with flag -conf and specify location of the configuration file.
    #
    # You can override all the values using environment variables with the format like
    # COLLECTOR__<ANYNAME>=<section>__<key>=<value>
    # As an example you can set dataPath in [general] section as
    # COLLECTOR__DATAPATH=general__dataPath=C:\\some\\path\\data.db
    # This parameter can be configured using -env-override, set it to empty string to disable this feature

    [general]

    # (obsolete, use acceptLicense instead)
    # acceptEULA = false

    # Please review license https://www.outcoldsolutions.com/legal/license-agreement/
    # and accept license by changing the value to *true*
    acceptLicense = false

    # location for the database
    # is used to store position of the files and internal state
    dataPath = ../var/collectord

    # log level (trace, debug, info, warn, error, fatal)
    logLevel = info

    # http server gives access to two endpoints
    # /healthz
    # /metrics
    httpServerBinding =

    # telemetry report endpoint, set it to empty string to disable telemetry
    telemetryEndpoint = https://license.outcold.solutions/telemetry/

    # license check endpoint
    licenseEndpoint = https://license.outcold.solutions/license/

    # license server through proxy
    licenseServerProxyUrl =

    # authentication with basic authorization (user:password)
    licenseServerProxyBasicAuth =

    # license key
    license =

    # docker daemon hostname is used by default as hostname
    # use this configuration to override
    hostname = ${HOSTNAME}

    # Default output for events, logs and metrics
    # valid values: splunk and devnull
    # Use devnull by default if you don't want to redirect data
    defaultOutput = splunk

    # Default buffer size for file input
    fileInputBufferSize = 256b

    # Maximum size of one line the file reader can read
    fileInputLineMaxSize = 1mb

    # Include custom fields to attach to every event, in example below every event sent to Splunk will have
    # indexed field my_environment=dev.
    # Field names should match ^[a-z][_a-z0-9]*$
    # Better way to configure that is to specify labels for Docker Hosts.
    # ; fields.my_environment = dev
    fields.linux_cluster = -

    # Include EC2 Metadata (see list of possible fields https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html)
    # Should be in format ec2Metadata.{desired_field_name} = {url path to read the value}
    # ec2Metadata.ec2_instance_id = /latest/meta-data/instance-id
    # ec2Metadata.ec2_instance_type = /latest/meta-data/instance-type

    # subdomain for the annotations added to the pods, workloads, namespaces or containers, like splunk.collectord.io/..
    annotationsSubdomain =

    # Configure acknowledgement database.
    # - force fsync on every write to Write-Ahead-Log
    db.fsync = false
    # - maximum size of the Write-Ahead-Log
    db.compactAt = 1M

    # configure global thruput per second for forwarded logs (metrics are not included)
    # for example if you set `thruputPerSecond = 512Kb`, that will limit amount of logs forwarded
    # from the single Collectord instance to 512Kb per second.
    # You can configure thruput individually for the logs (including specific for container logs) below
    thruputPerSecond =

    # Configure events that are too old to be forwarded, for example 168h (7 days) - that will drop all events
    # older than 7 days
    tooOldEvents =

    # Configure events that are too new to be forwarded, for example 1h - that will drop all events that are 1h in future
    tooNewEvents =


    # cgroup input
    # sends stats for the host and cgroups (containers)
    [input.system_stats]

    # disable system level stats
    disabled.host = false

    # cgroups fs location
    pathCgroups = /sys/fs/cgroup

    # proc location
    pathProc = /proc

    # how often to collect cgroup stats
    statsInterval = 30s

    # override type
    type.host = linux_stats_v2_host

    # specify Splunk index
    index.host =

    # set output (splunk or devnull, default is [general]defaultOutput)
    output.host =


    # mount input (collects mount stats where docker runtime is stored)
    [input.mount_stats]

    # disable system level stats
    disabled = false

    # how often to collect mount stats
    statsInterval = 30s

    # override type
    type = linux_mount_stats

    # specify Splunk index
    index =

    # set output (splunk or devnull, default is [general]defaultOutput)
    output =


    # proc input
    [input.proc_stats]

    # disable proc level stats
    disabled = false

    # proc location
    pathProc = /proc

    # how often to collect proc stats
    statsInterval = 60s

    # override type
    type = linux_proc_stats_v2

    # specify Splunk index
    index.host =

    # proc filesystem includes by default system threads (there can be over 100 of them)
    # these stats do not help with the observability
    # excluding them can reduce the size of the index, performance of the searches and usage of the collector
    includeSystemThreads = false

    # set output (splunk or devnull, default is
    # [general]defaultOutput)
    output.host =


    # network stats
    [input.net_stats]

    # disable net stats
    disabled = false

    # proc path location
    pathProc = /proc

    # how often to collect net stats
    statsInterval = 30s

    # override type
    type = linux_net_stats_v2

    # specify Splunk index
    index =

    # set output (splunk or devnull, default is [general]defaultOutput)
    output =


    # network socket table
    [input.net_socket_table]

    # disable net stats
    disabled = false

    # proc path location
    pathProc = /proc

    # how often to collect net stats
    statsInterval = 30s

    # override type
    type = linux_net_socket_table

    # specify Splunk index
    index =

    # set output (splunk or devnull, default is [general]defaultOutput)
    output =

    # group connections by tcp_state, localAddr, remoteAddr (if localPort is not the port it is listening on)
    # that can significantly reduce the amount of events
    group = true


    # Input syslog(.\d+)?
    # files
    [input.files::syslog]

    # disable host level logs
    disabled = false

    # root location of log files
    path = /var/log/

    # regex matching pattern
    match = ^(syslog|messages)(.\d+)?$

    # limit search only on one level
    recursive = false

    # files are read using polling schema, when reach the EOF how often to check if files got updated
    pollingInterval = 250ms

    # how often to look for the new files under logs path
    walkingInterval = 5s

    # include verbose fields in events (file offset)
    verboseFields = false

    # override type
    type = linux_host_logs

    # specify Splunk index
    index =

    # field extraction
    extraction = ^(?P<timestamp>[A-Za-z]+\s+\d+\s\d+:\d+:\d+)\s(?P<syslog_hostname>[^\s]+)\s(?P<syslog_component>[^:\[]+)(\[(?P<syslog_pid>\d+)\])?: (.+)$

    # timestamp field
    timestampField = timestamp

    # format for timestamp
    # the layout defines the format by showing how the reference time, defined to be `Mon Jan 2 15:04:05 -0700 MST 2006`
    timestampFormat = Jan 2 15:04:05

    # Adjust date, if month/day aren't set in format
    timestampSetMonth = false
    timestampSetDay = false

    # timestamp location (if not defined by format)
    timestampLocation = Local

    # sample output (-1 does not sample, 20 - only 20% of the logs should be forwarded)
    samplingPercent = -1

    # sampling key for hash based sampling (should be regexp with the named match pattern `key`)
    samplingKey =

    # set output (splunk or devnull, default is [general]defaultOutput)
    output =

    # configure default thruput per second for each container log
    # for example if you set `thruputPerSecond = 128Kb`, that will limit amount of logs forwarded
    # from the single container to 128Kb per second.
    thruputPerSecond =

    # Configure events that are too old to be forwarded, for example 168h (7 days) - that will drop all events
    # older than 7 days
    tooOldEvents =

    # Configure events that are too new to be forwarded, for example 1h - that will drop all events that are 1h in future
    tooNewEvents =


    # Input all *.log(.\d+)? files
    [input.files::logs]

    # disable host level logs
    disabled = false

    # root location of log files
    path = /var/log/

    # regex matching pattern
    match = ^(([\w\-.]+\.log(.[\d\-]+)?)|(docker))$

    # files are read using polling schema, when reach the EOF how often to check if files got updated
    pollingInterval = 250ms

    # how often to look for the new files under logs path
    walkingInterval = 5s

    # include verbose fields in events (file offset)
    verboseFields = false

    # override type
    type = linux_host_logs

    # specify Splunk index
    index =

    # field extraction
    extraction =

    # timestamp field
    timestampField =

    # format for timestamp
    # the layout defines the format by showing how the reference time, defined to be `Mon Jan 2 15:04:05 -0700 MST 2006`
    timestampFormat =

    # timestamp location (if not defined by format)
    timestampLocation =

    # sample output (-1 does not sample, 20 - only 20% of the logs should be forwarded)
    samplingPercent = -1

    # sampling key (should be regexp with the named match pattern `key`)
    samplingKey =

    # set output (splunk or devnull, default is [general]defaultOutput)
    output =

    # configure default thruput per second for each container log
    # for example if you set `thruputPerSecond = 128Kb`, that will limit amount of logs forwarded
    # from the single container to 128Kb per second.
    thruputPerSecond =

    # Configure events that are too old to be forwarded, for example 168h (7 days) - that will drop all events
    # older than 7 days
    tooOldEvents =

    # Configure events that are too new to be forwarded, for example 1h - that will drop all events that are 1h in future
    tooNewEvents =


    [input.journald]

    # disable host level logs
    disabled = false

    # root location of log files
    path.persistent = /var/log/journal/
    path.volatile = /run/log/journal/

    # when reach end of journald, how often to pull
    pollingInterval = 250ms

    # override type
    type = linux_host_logs

    # specify Splunk index
    index =

    # sample output (-1 does not sample, 20 - only 20% of the logs should be forwarded)
    samplingPercent = -1

    # sampling key (should be regexp with the named match pattern `key`)
    samplingKey =

    # how often to reopen the journald to free old files
    reopenInterval = 1h

    # set output (splunk or devnull, default is [general]defaultOutput)
    output =

    # configure default thruput per second for each container log
    # for example if you set `thruputPerSecond = 128Kb`, that will limit amount of logs forwarded
    # from the single container to 128Kb per second.
    thruputPerSecond =

    # Configure events that are too old to be forwarded, for example 168h (7 days) - that will drop all events
    # older than 7 days
    tooOldEvents =

    # Configure events that are too new to be forwarded, for example 1h - that will drop all events that are 1h in future
    tooNewEvents =


    # Default configuration for join multi-lines
    [pipe.join]

    # Maximum interval of messages in pipeline
    maxInterval = 100ms

    # Maximum time to wait for the messages in pipeline
    maxWait = 1s

    # Maximum message size
    maxSize = 100K


    # Splunk output
    [output.splunk]

    # Splunk HTTP Event Collector url
    url =
    # You can specify multiple Splunk URLs with
    #
    # urls.0 = https://server1:8088/services/collector/event/1.0
    # urls.1 = https://server2:8088/services/collector/event/1.0
    # urls.2 = https://server3:8088/services/collector/event/1.0
    #
    # Limitations:
    # * The urls cannot have different path.

    # Specify how URL should be picked up (in case if multiple is used)
    # urlSelection = random|round-robin|random-with-round-robin
    # where:
    # * random - choose random url on first selection and after each failure (connection or HTTP status code >= 500)
    # * round-robin - choose url starting from first one and bump on each failure (connection or HTTP status code >= 500)
    # * random-with-round-robin - choose random url on first selection and after that in round-robin on each
    # failure (connection or HTTP status code >= 500)
    urlSelection = random-with-round-robin

    # Splunk HTTP Event Collector Token
    token =

    # Allow invalid SSL server certificate
    insecure = false

    # Path to CA certificate
    caPath =

    # CA Name to verify
    caName =

    # path for client certificate (if required)
    clientCertPath =

    # path for client key (if required)
    clientKeyPath =

    # Events are batched with the maximum size set by batchSize and staying
    # in pipeline for not longer
    # than set by frequency
    frequency = 5s
    batchSize = 768K
    # limit by the number of events (0 value has no limit on the number of events)
    events = 50

    # Splunk through proxy
    proxyUrl =

    # authentication with basic authorization (user:password)
    proxyBasicAuth =

    # Splunk acknowledgement url (.../services/collector/ack)
    ackUrl =
    # You can specify multiple Splunk URLs for ackUrl
    #
    # ackUrls.0 = https://server1:8088/services/collector/ack
    # ackUrls.1 = https://server2:8088/services/collector/ack
    # ackUrls.2 = https://server3:8088/services/collector/ack
    #
    # Make sure that they are in the same order as urls for url, to make sure that this Splunk instance will be
    # able to acknowledge the payload.
    #
    # Limitations:
    # * The urls cannot have different path.

    # Enable index acknowledgment
    ackEnabled = false

    # Index acknowledgment timeout
    ackTimeout = 3m

    # Timeout specifies a time limit for requests made by collectord.
    # The timeout includes connection time, any
    # redirects, and reading the response body.
    timeout = 30s

    # in case when pipeline can post to multiple indexes, we want to avoid possibility of blocking
    # all pipelines, because just some events have incorrect index
    dedicatedClientPerIndex = true

    # (obsolete) in case if some indexes aren't used anymore, how often to destroy the dedicated client
    # dedicatedClientCleanPeriod = 24h

    # possible values: RedirectToDefault, Drop, Retry
    incorrectIndexBehavior = RedirectToDefault

    # gzip compression level (nocompression, default, 1...9)
    compressionLevel = default

    # number of dedicated splunk output threads (to increase throughput above 4k events per second)
    threads = 1

Monitoring Linux - Log forwarding

Configuration

By default, collectorforlinux forwards logs from /var/log (including syslog files) and from journald. When your applications write logs somewhere else - /opt/<app>/logs/, /srv/<app>/, a mounted data volume - add an [input.files::<name>] section to 002-user.conf to pick them up.

The example below forwards *.log files from /opt/myapp/logs. Replace mylogs in the section name with a name that identifies the source (for example, webportal or payments), and point path at your application's log directory:

002-user.conf

    # Input *.log files from /opt/myapp/logs
    [input.files::mylogs]

    # disable host level logs
    disabled = false

    # root location of log files
    path = /opt/myapp/logs

    # glob pattern
    glob = *.log

    # regex matching pattern (use it instead of glob pattern if you need more complicated filtering)
    # match =

    # limit search only on one level
    recursive = false

    # files are read using polling schema, when reach the EOF how often to check if files got updated
    pollingInterval = 250ms

    # how often to look for the new files under logs path
    walkingInterval = 5s

    # include verbose fields in events (file offset)
    verboseFields = false

    # override type (source type)
    type = linux_host_logs

    # specify Splunk index
    index =

    # regexp to specify the beginning of the event line
    eventPattern =

    # regexp field extraction
    extraction =

    # timestamp field (if field extraction is used)
    timestampField =

    # format for timestamp
    # the layout defines the format by showing how the reference time, defined to be `Mon Jan 2 15:04:05 -0700 MST 2006`
    timestampFormat = Jan 2 15:04:05

    # Adjust date, if month/day aren't set in format
    timestampSetMonth = false
    timestampSetDay = false

    # timestamp location (if not defined by format)
    timestampLocation = Local

    # sample output (-1 does not sample, 20 - only 20% of the logs should be forwarded)
    samplingPercent = -1

    # sampling key for hash based sampling (should be regexp with the named match pattern `key`)
    samplingKey =

    # set output (splunk or devnull, default is [general]defaultOutput)
    output =

    # configure default thruput per second for each container log
    # for example if you set `thruputPerSecond = 128Kb`, that will limit amount of logs forwarded
    # from the single container to 128Kb per second.
    thruputPerSecond =

    # Configure events that are too old to be forwarded, for example 168h (7 days) - that will drop all events
    # older than 7 days
    tooOldEvents =

    # Configure events that are too new to be forwarded, for example 1h - that will drop all events that are 1h in future
    tooNewEvents =

Examples

Will be added in the future.

Monitoring Linux - Splunk HTTP Event Collector

Configure HTTP Event Collector secure connection

Splunk HEC ships with self-signed certificates by default. Collectord gives you a few configuration options for how to trust them.

Configure trusted SSL connection to the self-signed certificate

To trust Splunk’s self-signed certificate properly (instead of disabling validation with insecure = true), copy the server CA certificate from $SPLUNK_HOME/etc/auth/cacert.pem onto the host and point Collectord at it.

The configuration below accepts the license, points at your HEC URL, and tells Collectord to trust cacert.pem while verifying the server name SplunkServerDefaultCert (the name baked into Splunk’s default self-signed certificate):

    [general]
    acceptLicense = true

    [output.splunk]
    url = https://hec.example.com:8088/services/collector/event/1.0
    token = B5A79AAD-D822-46CC-80D1-819F80D7BFB0
    caPath = /opt/collectorforlinux/etc/cacert.pem
    caName = SplunkServerDefaultCert

Place cacert.pem at /opt/collectorforlinux/etc/cacert.pem and restart the collectorforlinux service.
HTTP Event Collector incorrect index behavior

When Collectord forwards an event to an index that the HEC token isn’t allowed to write to, HEC rejects the payload. The incorrectIndexBehavior setting controls how Collectord handles these rejections:

- RedirectToDefault - the default. Forwards rejected events to the HEC token’s default index.
- Drop - drops rejected events.
- Retry - keeps retrying. Be careful: a single rejected pipeline (process stats, for example) can stall every other pipeline on the host.

Set the behavior in the output configuration:

    [general]
    acceptLicense = true

    [output.splunk]
    url = https://hec.example.com:8088/services/collector/event/1.0
    token = B5A79AAD-D822-46CC-80D1-819F80D7BFB0
    incorrectIndexBehavior = Drop

Using a proxy for HTTP Event Collector

If Collectord has to reach HEC through a proxy, set proxyUrl. For an SSL connection through the proxy, also include the proxy’s CA certificate:

    [general]
    acceptLicense = true

    [output.splunk]
    url = https://hec.example.com:8088/services/collector/event/1.0
    token = B5A79AAD-D822-46CC-80D1-819F80D7BFB0
    proxyUrl = http://proxy.example:4321
    caPath = /opt/collectorforlinux/etc/proxy-ca.pem

Using multiple HTTP Event Collector endpoints for load balancing and failover

When you have several HEC endpoints fronting the same indexer cluster, Collectord can spread traffic across them and fail over automatically. Three URL-selection algorithms are available:

- random - pick a random URL on first selection and after each failure (connection error or HTTP status >= 500).
- round-robin - start with the first URL and advance on each failure.
- random-with-round-robin - pick a random URL on first selection, then advance round-robin on each failure.

random-with-round-robin is the default.
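To make the selection rules concrete, here is a minimal shell sketch of random-with-round-robin. It is not Collectord's implementation: the helper names pick_first, next_index and should_failover are hypothetical, and treating status 000 as a connection error is an assumption borrowed from how curl reports a request that got no HTTP response.

```shell
#!/bin/sh
# Illustrative sketch only; these helpers are hypothetical, not part of Collectord.

# pick_first COUNT: print a random starting index in the range [0, COUNT).
pick_first() {
    awk -v n="$1" 'BEGIN { srand(); print int(rand() * n) }'
}

# next_index CURRENT COUNT: advance round-robin, wrapping back to 0 at the end.
next_index() {
    echo $(( ($1 + 1) % $2 ))
}

# should_failover STATUS: succeed when the status counts as a failure, that is
# a connection error (assumed to be reported as 000) or an HTTP status >= 500.
should_failover() {
    [ "$1" = "000" ] || [ "$1" -ge 500 ]
}
```

A caller would run pick_first once with the number of configured URLs, then call next_index only when should_failover succeeds for the last response. In Collectord you select the strategy by name with urlSelection, as in the configuration that follows.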
```ini
[general]
acceptLicense = true

[output.splunk]
urls.0 = https://hec1.example.com:8088/services/collector/event/1.0
urls.1 = https://hec2.example.com:8088/services/collector/event/1.0
urls.2 = https://hec3.example.com:8088/services/collector/event/1.0

urlSelection = random-with-round-robin

token = B5A79AAD-D822-46CC-80D1-819F80D7BFB0
```

Enable indexer acknowledgement

HEC offers Indexer acknowledgment, which confirms not just that HEC accepted a payload but that the indexer wrote it. It costs throughput - sometimes a lot - so enable it only when you need delivery guarantees. You have to enable it on both the HEC token and in the Collectord output:

```ini
[general]
acceptLicense = true

[output.splunk]
url = https://hec.example.com:8088/services/collector/event/1.0
ackUrl = https://hec.example.com:8088/services/collector/ack
token = B5A79AAD-D822-46CC-80D1-819F80D7BFB0
ackEnabled = true
ackTimeout = 3m
```

Client certificates for collector

If your HEC endpoint requires mutual TLS, place the client certificate and key on the host and point Collectord at them:

```ini
[general]
acceptLicense = true

[output.splunk]
url = https://hec.example.com:8088/services/collector/event/1.0
token = B5A79AAD-D822-46CC-80D1-819F80D7BFB0
clientCertPath = /opt/collectorforlinux/etc/client-cert.pem
clientKeyPath = /opt/collectorforlinux/etc/client-cert.key
```

Support for multiple Splunk clusters

To forward data from the same host to more than one Splunk cluster, declare additional [output.splunk::<name>] sections. The example below adds a prod1 output:

```ini
[output.splunk::prod1]
url = https://prod1.hec.example.com:8088/services/collector/event/1.0
token = AF420832-F61B-480F-86B3-CCB5D37F7D0D
```

Anything not set on the named output is inherited from the default output.splunk block.
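The inheritance rule can be pictured as a simple overlay of two key-value sets. The Go sketch below is only a model of that rule; `resolveOutput` and `describe` are hypothetical helpers, and the `incorrectIndexBehavior = Drop` default is an assumed value used to show a key being inherited.

```go
package main

import (
	"fmt"
	"sort"
)

// resolveOutput (hypothetical helper) returns the effective settings for a
// named output: every key from the default block, overridden by any key that
// the named block sets itself.
func resolveOutput(defaults, named map[string]string) map[string]string {
	effective := make(map[string]string, len(defaults)+len(named))
	for k, v := range defaults {
		effective[k] = v
	}
	for k, v := range named {
		effective[k] = v
	}
	return effective
}

// describe renders settings as sorted "key = value" lines.
func describe(settings map[string]string) []string {
	keys := make([]string, 0, len(settings))
	for k := range settings {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	lines := make([]string, 0, len(keys))
	for _, k := range keys {
		lines = append(lines, fmt.Sprintf("%s = %s", k, settings[k]))
	}
	return lines
}

func main() {
	// Assumed contents of the default [output.splunk] block.
	defaults := map[string]string{
		"url":                    "https://hec.example.com:8088/services/collector/event/1.0",
		"token":                  "B5A79AAD-D822-46CC-80D1-819F80D7BFB0",
		"incorrectIndexBehavior": "Drop",
	}
	// Contents of [output.splunk::prod1] from the example above.
	prod1 := map[string]string{
		"url":   "https://prod1.hec.example.com:8088/services/collector/event/1.0",
		"token": "AF420832-F61B-480F-86B3-CCB5D37F7D0D",
	}
	for _, line := range describe(resolveOutput(defaults, prod1)) {
		fmt.Println(line)
	}
}
```

In this model prod1 keeps its own url and token and picks up incorrectIndexBehavior from the default block, so shared settings only need to be written once.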
Monitoring Linux - Alerts

Predefined alerts

Monitoring Linux ships a set of predefined alerts covering license health, collector health, and the most common host-level capacity issues. Review the list and enable the ones that fit your environment.

Monitoring Linux: Collector Failed License Checks
Fires when Collectord fails to reach the licensing server.

Monitoring Linux: Collector License Expiration (less than 14 days)
Fires when your license is within 14 days of expiring.

Monitoring Linux: Collector license overuse
Fires when the app sees more collectors reporting in than your license allows.

Monitoring Linux: Collector outdated
Fires when Collectord versions running on your hosts are older than the installed Splunk app expects.

Monitoring Linux: Warning: linux runtime disk space is low
A Linux host has less than 20% free disk space.

Monitoring Linux: Warning: high host memory usage
A Linux host is using more than 85% of its memory.

Monitoring Linux: Cluster Warning: high host CPU usage
A Linux host has averaged more than 90% CPU over the last 5 minutes.

Monitoring Linux: Warning: collectord has WARN or ERROR logs
Collectord itself is reporting warnings or errors.

Alert triggers

The Hosts page surfaces currently triggered alerts at the top, populated by the /alerts/fired_alerts/ REST call.

Other triggers

Splunkbase has a wide selection of alert actions for routing alerts into Slack, PagerDuty, email, ticketing systems, and other incident-management tools. Install the action you need, then edit the predefined alerts to wire it in.
Monitoring Linux - Troubleshooting

Verify configuration

When data isn’t showing up in Splunk, the first thing to run is collectord verify. It exercises every input and output declared in your configuration - license check, HEC connectivity, file paths, journald, proc, and cgroup access - and prints OK or the exact error for each.

```bash
sudo /opt/collectorforlinux/bin/collectord verify --environment linux --conf /opt/collectorforlinux/etc
```

A healthy run looks like this:

```text
...
Version = 5.12.270
Build date = 191031
Environment = linux


  General:
  + conf: OK
  + db: OK
  + db-meta: OK
  + instanceID: OK
    instanceID = 2N9ERP0D9SANAPL56IOQNBCJH0
  + license load: OK
  + license expiration: OK
  + license connection: OK

  Splunk output:
  + OPTIONS(url=https://127.0.0.1:8088/services/collector/event/1.0): OK
  + POST(url=https://127.0.0.1:8088/services/collector/event/1.0, index=): OK

  File Inputs:
  + input(syslog): OK
    path /var/log/
  + input(logs): OK
    path /var/log/

  System Input:
  + path cgroup: OK
  + path proc: OK

  Network stats Input:
  + path proc: OK

  Network socket table Input:
  + path proc: OK

  Proc Input:
  + path proc: OK

  Mount Input:
  + stats: OK

  Journald input:
  + input(journald): OK
```

Any line that’s not OK points at the problem - wrong HEC URL, missing token, unreadable path, expired license, blocked outbound traffic to the licensing endpoint. Fix it, rerun verify, and restart the service.
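If you run verify on many hosts, it can help to pick out the failing checks automatically. The Go sketch below assumes the output format shown in the healthy run: each check is a line that starts with `+` and ends with `: OK` on success. The failing line in the sample is made up for illustration, since the exact error text depends on the problem; `failedChecks` and `summary` are hypothetical helpers.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// failedChecks returns every check line (a line starting with "+") that does
// not end with ": OK". Detail lines such as "path /var/log/" are ignored.
func failedChecks(output string) []string {
	var failed []string
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if strings.HasPrefix(line, "+") && !strings.HasSuffix(line, ": OK") {
			failed = append(failed, line)
		}
	}
	return failed
}

// summary turns the list of failed checks into a one-line report.
func summary(failed []string) string {
	if len(failed) == 0 {
		return "all checks passed"
	}
	return fmt.Sprintf("%d failed check(s): %s", len(failed), strings.Join(failed, "; "))
}

func main() {
	// Sample output; the "license connection" error text is invented.
	sample := `
  General:
  + conf: OK
  + license connection: example error text (made up)

  File Inputs:
  + input(syslog): OK
    path /var/log/
`
	fmt.Println(summary(failedChecks(sample)))
}
```

To use it against a real run, pass the captured standard output of the verify command to `failedChecks` instead of the sample string.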
Monitoring Linux Monitoring Linux - Troubleshooting - Verify configuration [TOC] Monitoring Linux Verify configuration Monitoring Linux - Troubleshooting Troubleshooting Monitoring Linux Monitoring Linux - Release history Monitoring Linux Monitoring Linux - Release history Release history 5.21.411 - 2024-11-18 Supports collectorforlinux version 5.21.x and below Update application for Splunk Cloud compatibility 5.21.410 - 2023-10-16 Supports collectorforlinux version 5.21.x and below New dashboard for Collectord metrics Added version=1.1 to all dashboards for Splunk Cloud compatibility and AppInspector Collectord updates: Support global replace configurations to sanitize data before forwarding to Splunk When both volatile and persistent journald destinations exist, Collectord identifies which has the most recent data Send more precise timestamps to Splunk Send logs to multiple Splunk HEC endpoints simultaneously collectord diag skips performance profiles unless --include-performance-profiles is set Performance improvements for the acknowledgement database Acknowledgement database keeps state longer by refreshing entries when files still exist on disk Verify that only one Collectord instance can access the data folder where Collectord stores its state Send events with event_id, a unique identifier for messages generated from logs Splunk output supports maximumMessageLength to truncate messages exceeding this size Splunk output supports requireExplicitIndex to drop events that don’t carry an explicit index Weighted Splunk output algorithm when multiple threads are used Improved grace period for expired licenses - bootstrap new nodes for 14 days after expiration Report source and source type for events with an incorrect index Allow multiple values for blacklist and whitelist on host logs Support for licensing server Support query parameters in Prometheus URLs for metrics Support journald databases written by systemd library 247+ Support for CPU-based licenses Support 
for cgroupv2 Support for arm64/aarch64 architecture Upgrade Go runtime to 1.21.3 Upgrade sqlite3 library to 3.43.1 Improved DNS resolution for Splunk output FQDNs Export internal Collectord metrics in Prometheus format Forward internal Collectord metrics to Splunk Include all open file descriptors in collectord diag Filter host logs with blacklist and whitelist Blacklist and whitelist Prometheus metrics - significantly reduces indexing cost Support templates in index, source, and sourcetype Allow excluding indexed fields when forwarding to Splunk Bug fix: Collectord clogs the output with WARN messages about closed Splunk outputs Bug fix: parse commas in log timestamps Bug fix: Collectord can clog the output if cgroupv2 is used and blkio is not enabled Bug fix: Collectord crashed when the default output.splunk was missing - now reports the error instead Bug fix: real license key is no longer included in diag bundles Bug fix: Collectord reports high CPU usage for newly started hosts Bug fix: include the values of whitelists and blacklists in diag Bug fix: verify command does not respect glob patterns for Prometheus inputs (certs, tokens) Bug fix: trim spaces in token value for Prometheus inputs Bug fix: Prometheus metrics parser - empty fields could be filled with previous fields Bug fix: better handling of connections to metrics endpoints exported in Prometheus format Bug fix: HTTP connection improvements when Splunk is unresponsive Bug fix: verify command can show an incorrect error when verifying journald input Bug fix: when an event pattern is used to join multi-line events, errors raised by the pipeline input were swallowed Bug fix: reduce warnings about failing to get the new event in the pipeline 5.12.272 - 2019-11-08 Collectord updates: Bug fix: when rotated files reuse FileID/DevID, Collectord stops forwarding rotated files 5.12.271 - 2019-11-07 Collectord updates: Bug fix: when an event pattern is used to join multi-line events, errors raised by the 
pipeline input were swallowed Bug fix: reduce warnings about failing to get the new event in the pipeline Stability improvements 5.12.270 - 2019-10-22 Initial release Monitoring Linux Monitoring Linux - Release history - 5.21.411 - 2024-11-18 [TOC] Monitoring Linux 5.21.411 - 2024-11-18 Monitoring Linux - Release history Release history Monitoring Linux Monitoring Linux - Release history - 5.21.410 - 2023-10-16 [TOC] Monitoring Linux 5.21.410 - 2023-10-16 Monitoring Linux - Release history Release history Monitoring Linux Monitoring Linux - Release history - 5.12.272 - 2019-11-08 [TOC] Monitoring Linux 5.12.272 - 2019-11-08 Monitoring Linux - Release history Release history Monitoring Linux Monitoring Linux - Release history - 5.12.271 - 2019-11-07 [TOC] Monitoring Linux 5.12.271 - 2019-11-07 Monitoring Linux - Release history Release history Monitoring Linux Monitoring Linux - Release history - 5.12.270 - 2019-10-22 [TOC] Monitoring Linux 5.12.270 - 2019-10-22 Monitoring Linux - Release history Release history Monitoring Linux Monitoring Kubernetes Product Monitoring Kubernetes Product Monitoring OpenShift Product Monitoring OpenShift Product Monitoring Docker Product Monitoring Docker Product Monitoring Linux Product Monitoring Linux Product Monitoring Windows Containers Product Monitoring Windows Containers Product ElasticSearch and OpenSearch Product ElasticSearch and OpenSearch Product Syslog (QRadar) Product Syslog (QRadar) Product Blog - Check Splunk search logs, just in case Blog Blog - Check Splunk search logs, just in case Check Splunk search logs, just in case We have been working on an interesting case with one of our customers. Every role in Splunk has a defined disk limit, and by default the user role has only 100MB. We are always cautious about how much data we bring to Splunk Dashboards and limit everything to make sure our applications can handle large clusters in our applications. 
One search that was causing an issue was a search used to populate filters in various places of our “Monitoring OpenShift” application. Depending on the number of nodes, namespaces and labels, we expect this search to return many thousands of values, but should not take a lot of disk space. text Copy 1( `macro_openshift_stats_cgroup` OR `macro_openshift_logs`) | 2stats count by host, openshift_node_labels, openshift_namespace, openshift_cluster_eval This search generated ~1,000 rows on our test cluster, but took almost 10 MB of disk space. In the customer’s environment, we were dealing with hundreds of MBs on disk, which is very unusual. The simple change to a simplified search would bring the disk space to just several MBs, instead of hundreds. text Copy 1( `macro_openshift_stats_cgroup` OR `macro_openshift_logs`) | 2stats count by host, openshift_node_labels, openshift_namespace, openshift_cluster The difference between the first search and the second one is the usage of the openshift_cluster_eval field, which is a calculated field that looks first at an indexed field openshift_cluster and if there is no value there, it will look in openshift_node_labels for the cluster field (for backward compatibility). Considering that those searches return the same results, something was very odd about that. On our test cluster one search would take 8.26MB and another 196KB (the difference is 42 times) When you inspect the Job (Search), you can find a Job ID Using this SID, you can find a folder on the Search Head, that represents it. It will be under $SPLUNK_HOME/var/run/splunk/dispatch, so we looked into it As you can see the difference between those two searches is just the size of the remote_logs folder. After some digging in those logs, we saw many repeating INFO messages that some of the calculated fields will be ignored, which is expected, but we definitely did not expect that it would clog the search logs. 
The Splunk documentation has some information about Splunk logging; see Enable debug logging in the Troubleshooting Manual. The logs we are looking at are part of the search process, not the Splunk daemon, so the configuration lives in $SPLUNK_HOME/etc/log-searchprocess.cfg. In our case we have a Splunk cluster with a Search Head cluster and an Indexer cluster, and the searches are scheduled on the indexers, so to remove those INFO messages we need to modify the configuration on the indexers. The default configuration in $SPLUNK_HOME/etc/log-searchprocess.cfg contains:

```ini
rootCategory=INFO,searchprocessAppender
appender.searchprocessAppender=RollingFileAppender
appender.searchprocessAppender.fileName=${SPLUNK_DISPATCH_DIR}/search.log
appender.searchprocessAppender.maxFileSize=10000000 # default: 10MB (specified in bytes).
appender.searchprocessAppender.maxBackupIndex=3
```

This means that every category that is not overridden defaults to INFO. The maximum size of the logs would be 30 MB (3 files of at most 10 MB each). So if you have 10 indexers, those logs could grow up to 300 MB for each search.

There are two ways to fix this. First, we can override the level for a specific category. In our case it was CalcFieldProcessor, so we can create the file $SPLUNK_HOME/etc/log-searchprocess-local.cfg with this content:

```ini
category.CalcFieldProcessor=WARN
```

The second option is to override the default log level for all categories, using the same file $SPLUNK_HOME/etc/log-searchprocess-local.cfg with this content:

```ini
rootCategory=WARN,searchprocessAppender
```

After we applied those changes, the searches no longer took so much space on disk. One important detail: if you are using Splunk Cloud, you do not have access to the Splunk file system.
To find if you are affected by the same issue, you can run the search, go to the Job Inspector, scroll to the very bottom and expand Search Job Properties, scroll all the way down, and at the bottom of that page you can find log files from the indexers, so you can download them and see if there is anything that clogs the search logs. If you see that those logs take a lot of space, talk to Splunk Support and ask them to make the configurations on the indexers in your Splunk Cloud cluster. Blog Blog - Collectord update - thruput and time correction Blog Blog - Collectord update - thruput and time correction Collectord update - thruput and time correction Today we have shipped an updated version of Collectord (version 5.10.252) that brings two features: configuration for throughput and time correction. If you have been running your OpenShift, Kubernetes, or Docker clusters for a while, it is possible that you have gathered a lot of logs on the nodes. When you deploy Collectord, it will run as fast as it can (proving its outstanding performance), which may potentially bring a lot of load to your Splunk deployments. To be able to preload the data, we are providing two new features: Throughput - configure throughput at the global level (Collectord instance) or specifically for container or host logs. Time correction - configure the time range in which you want to forward the logs, for example, define that you want to forward logs only in the time range (-48 hours, +1 hour). All events that are outside of this time range will be ignored. Throughput First, you can configure the global throughput in the Collectord configuration. Under section [general] you can find thruputPerSecond, which you can set, for example, to 256Kb. Collectord will apply this throughput to all the logs it ships from this node. Important note: we do not count metrics that we ship from this node in the throughput, as we do not want to throttle metrics delivery, so we will not trigger unwanted alerts. 
For each container, you can configure throughput independently, and for host logs, you can configure it per set of files. For example, if you configure thruputPerSecond under [input.files::logs], Collectord applies that limit to the set of files matched by [input.files::logs]. If you configure thruputPerSecond under [input.files] (container logs), each container gets its own throughput limit. For example, if the node has two containers, one sending 100Kb per second and another 50Kb per second, and you have set thruputPerSecond to 80Kb, only the first container will be throttled to 80Kb, because the second produces less than 80Kb per second. For container logs, you can also override this configuration with annotations by applying collectord.io/logs-ThruputPerSecond: 50Kb.

Alerts for throttled logs

We are providing two different alerts. The first one tells you if Collectord containers are producing WARN messages. The message will look similar to:

```text
WARN 2019/07/24 18:53:00.815293 outcoldsolutions.com/collector/pipeline/pipes/thruput/pipe.go:70: pipeline is getting throttled - /rootfs/var/lib/docker/containers/b2aa6678086cbe2cd4ca374743a25e89225279db26ec34c7f4af8434b43b9b38 - maximum throughput = 10240 bytes per second
```

We produce this WARN message at most once a minute. You can see these WARN messages with the alert Collectord reports warnings or errors in Splunk. You will also know if logs are getting throttled from the alert Warning: Increasing lag between event time and indexing time in container logs, where we compare the _time of the event to the _indextime of the event and check whether the lag is growing.

Time correction

Similar to throughput, you can configure which events are too old or too new to be forwarded to Splunk. Under section [general] in the configuration, you can find two keys, tooOldEvents and tooNewEvents, which you can set to durations. For example:

```ini
[general]
...

# 168h = 7 days
tooOldEvents = 168h

# anything newer than 1 hour ahead is getting dropped
tooNewEvents = 1h
```

You can also configure these keys independently for container logs and host logs. In the case of container logs, you can override these values with annotations:

```yaml
annotations:
  collectord.io/logs-TooOldEvents: 24h
  collectord.io/logs-tooNewEvents: 30m
```

Alerts for time correction

If Collectord finds events that are too new or too old, it will raise a WARN message:

```text
WARN 2019/07/24 18:28:15.516115 outcoldsolutions.com/collector/pipeline/pipes/timecorrection/pipe.go:88: skipping too old or too new events - /rootfs/var/lib/docker/containers/7bef94bc58965ff059f7989ad9ae7db0b123b9e60615ffb28055884b85664cd3 - events should be in the scope (-7h, +30m)
```

We produce this WARN message at most once a minute. You can see these WARN messages with the alert Collectord reports warnings or errors in Splunk.

Upgrade

If you are on version 5.10, just upgrade the image to version 5.10.252. If you are on previous versions, please look at our upgrade instructions:

- Monitoring OpenShift v5 - Upgrade
- Monitoring Kubernetes v5 - Upgrade
- Monitoring Docker v5 - Upgrade

Complete guide for forwarding application logs from Kubernetes and OpenShift environments to Splunk

We have helped many of our customers forward various logs from their Kubernetes and OpenShift environments to Splunk. We have learned a lot, and that has helped us build many features in Collectord. We also understand that some of the features can be hard to discover, so we would like to share our guide on how to set up proper forwarding of application logs to Splunk.
In our documentation, we have an example of how to easily forward application logs from a PostgreSQL database running inside a container. This time we will look at the JIRA application. We assume that you already have Splunk and Kubernetes (or OpenShift) configured and have installed our solution for forwarding logs and metrics (if not, it takes 5 minutes, and you can request a trial license with our automated forms; please follow our documentation). And one more thing: no sidecar containers are required! Collectord is the container-native solution for forwarding logs from Docker, Kubernetes, and OpenShift environments. This guide looks pretty long. The reason for that is because we are going into a lot of detail and picked one of the most complicated examples. 1. Defining the logs The first step is simple: let’s find the logs that we want to forward. As we mentioned above, we will use a JIRA application running in a container. For simplicity, we will define it as a single Pod yaml Copy 1apiVersion: v1 2kind: Pod 3metadata: 4 name: jira 5spec: 6 containers: 7 - name: jira 8 image: atlassian/jira-software:8.14 9 volumeMounts: 10 - name: data 11 mountPath: /var/atlassian/application-data/jira 12 volumes: 13 - name: data 14 emptyDir: {} Let’s open a shell for this application and look at what the log files look like. bash Copy 1user@host# kubectl exec -it jira -- bash 2root@jira:/var/atlassian/application-data/jira# cd log 3root@jira:/var/atlassian/application-data/jira/log# ls -alh 4total 36K 5drwxr-x--- 2 jira jira 4.0K Dec 15 21:44 . 6drwxr-xr-x 9 jira jira 4.0K Dec 15 21:44 .. 7-rw-r----- 1 jira jira 27K Dec 15 21:44 atlassian-jira.log And we will tail atlassian-jira.log to see how the logs are structured: text Copy 12020-12-15 21:44:25,771+0000 JIRA-Bootstrap INFO [c.a.j.config.database.DatabaseConfigurationManagerImpl] The database is not yet configured. 
Enqueuing Database Checklist Launcher on post-database-configured-but-pre-database-activated queue 22020-12-15 21:44:25,771+0000 JIRA-Bootstrap INFO [c.a.j.config.database.DatabaseConfigurationManagerImpl] The database is not yet configured. Enqueuing Post database-configuration launchers on post-database-activated queue 32020-12-15 21:44:25,776+0000 JIRA-Bootstrap INFO [c.a.jira.startup.LauncherContextListener] Startup is complete. Jira is ready to serve. 42020-12-15 21:44:25,778+0000 JIRA-Bootstrap INFO [c.a.jira.startup.LauncherContextListener] Memory Usage: 5 --------------------------------------------------------------------------------- 6 Heap memory : Used: 102 MiB. Committed: 371 MiB. Max: 1980 MiB 7 Non-heap memory : Used: 71 MiB. Committed: 89 MiB. Max: 1536 MiB 8 --------------------------------------------------------------------------------- 9 TOTAL : Used: 173 MiB. Committed: 460 MiB. Max: 3516 MiB 10 --------------------------------------------------------------------------------- 2. Telling Collectord to forward logs The best scenario is when we can define a dedicated mount just for the path where the logs will be located. That will be the most performant way of setting up the forwarding pipeline. But considering that JIRA recommends mounting the data volume for all the data at /var/atlassian/application-data/jira, we can use that as well. You can tell collectord to match the logs by glob or match (regexp; we like to use regex101.com for testing - make sure to switch to Golang flavor). Glob is the easier and more performant way for matching logs, as we can split the glob pattern into parts of the path and be able to know how deep we should go inside the volume to match the logs. With match, it is a bit more complicated as .* can match any symbol in the path, including the path separator. So every time you are configuring the match with regexp, make sure that your volume does not have a really deep structure of folders inside. 
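The difference between the two matching styles is easy to see with Go's standard library. This is a minimal sketch that uses `path.Match` and the `regexp` package as stand-ins; Collectord's own matcher may behave differently in details. The nested path `log/archive/2020/old.log` and the helper names are made up for illustration.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
)

// globMatches reports whether relPath matches a glob pattern.
// In path.Match, "*" never matches the "/" separator.
func globMatches(pattern, relPath string) bool {
	ok, err := path.Match(pattern, relPath)
	return err == nil && ok
}

// regexpMatches reports whether relPath matches a regular expression.
// Here ".*" matches any character, including "/".
func regexpMatches(pattern, relPath string) bool {
	re, err := regexp.Compile(pattern)
	return err == nil && re.MatchString(relPath)
}

// compare formats the result of both matchers for one path.
func compare(glob, expr, relPath string) string {
	return fmt.Sprintf("%-26s glob=%-5t regexp=%t",
		relPath, globMatches(glob, relPath), regexpMatches(expr, relPath))
}

func main() {
	glob := "log/*.log*"
	expr := `^log/.*\.log`
	for _, p := range []string{
		"log/atlassian-jira.log",   // directly under log/
		"log/archive/2020/old.log", // hypothetical nested file
	} {
		fmt.Println(compare(glob, expr, p))
	}
}
```

The glob stops at one directory level, so the walker knows how deep to go. The regular expression also accepts the nested file, which is why a deep folder structure makes match more expensive.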
We always recommend starting with glob. If you specify both glob and match patterns, only match will be used.

The data volume is mounted at /var/atlassian/application-data/jira. We can test the glob pattern by opening a shell in the container, staying in the path of the mounted volume, and running the glob pattern with ls:

```bash
root@jira:/var/atlassian/application-data/jira# ls log/*.log*
log/atlassian-jira.log
```

OK, so now we know the glob pattern log/*.log*. We are going to annotate the Pod. These annotations tell Collectord to look at the data volume recursively and try to find the logs that match log/*.log*.

```bash
kubectl annotate pod jira \
  collectord.io/volume.1-logs-name=data \
  collectord.io/volume.1-logs-recursive=true \
  collectord.io/volume.1-logs-glob='log/*.log*'
```

After doing that, you can check the logs of the Collectord pod to see if the new logs were discovered. You should see something similar to:

```text
INFO 2020/12/15 21:59:29.359039 outcoldsolutions.com/collectord/pipeline/input/file/dir/watcher.go:76: watching /rootfs/var/lib/kubelet/pods/007be5c2-cd20-4d5e-8044-5e2399e28764/volumes/kubernetes.io~empty-dir/data/(glob = log/*.log*, match = )
INFO 2020/12/15 21:59:29.359651 outcoldsolutions.com/collectord/pipeline/input/file/dir/watcher.go:178: data - added file /rootfs/var/lib/kubelet/pods/007be5c2-cd20-4d5e-8044-5e2399e28764/volumes/kubernetes.io~empty-dir/data/log/atlassian-jira.log
```

If you see only the first line, Collectord recognized the volume but could not find any files matching the pattern. It is also possible that the configuration is incorrect, in which case you may need to run the troubleshooting steps to see whether Collectord can see the volumes.

At this point, we can go to Splunk and discover the logs in the Monitoring Kubernetes application.

3. Multiline events

By default, Collectord merges all the lines starting with spaces with the previous lines.
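The default merging rule can be sketched in Go as follows: a line that starts with a non-whitespace character (the pattern `^[^\s]`) begins a new event, and any other line is appended to the current one. This is a simplified model that ignores the timing settings; `joinEvents` and `describe` are hypothetical helpers, not Collectord's code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// defaultStart matches lines that begin a new event: the first character
// is not whitespace.
var defaultStart = regexp.MustCompile(`^[^\s]`)

// joinEvents splits text into lines and groups them into events. A line that
// matches start opens a new event; any other line is appended to the
// previous event, separated by a newline.
func joinEvents(text string, start *regexp.Regexp) []string {
	var events []string
	for _, line := range strings.Split(text, "\n") {
		if len(events) == 0 || start.MatchString(line) {
			events = append(events, line)
			continue
		}
		events[len(events)-1] += "\n" + line
	}
	return events
}

// describe reports how many lines each event ended up with.
func describe(events []string) []string {
	out := make([]string, 0, len(events))
	for i, e := range events {
		out = append(out, fmt.Sprintf("event %d: %d line(s)", i+1, strings.Count(e, "\n")+1))
	}
	return out
}

func main() {
	// Shortened lines modeled on the JIRA log shown earlier.
	text := strings.Join([]string{
		"2020-12-15 21:44:25,776+0000 JIRA-Bootstrap INFO Startup is complete.",
		"2020-12-15 21:44:25,778+0000 JIRA-Bootstrap INFO Memory Usage:",
		"    Heap memory : Used: 102 MiB.",
		"    Non-heap memory : Used: 71 MiB.",
	}, "\n")
	for _, line := range describe(joinEvents(text, defaultStart)) {
		fmt.Println(line)
	}
}
```

Passing a stricter pattern, for example one that requires a timestamp at the start of the line, changes only where events begin; the grouping logic stays the same.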
All the default configurations are under [input.app_logs] in the ConfigMap that you deploy with Collectord. Let's cover the most important of them.

- disabled = false - discovery of application logs is enabled by default. If there are no annotations telling Collectord to pick up logs from containers, nothing is forwarded.
- walkingInterval = 5s - how often Collectord walks the path to check for new files matching the pattern.
- glob = *.log* - the default glob pattern; in our example above we override it with log/*.log*.
- type = kubernetes_logs - the default source type for logs forwarded from containers.
- eventPatternRegex = ^[^\s] - the default pattern for how a new event should start (it should not start with a whitespace character). That is why some of the logs are already forwarded as multiline events.
- eventPatternMaxInterval = 100ms - we expect every line of a message to be written to the file within 100ms. When the interval between lines is larger, we assume they belong to different messages.
- eventPatternMaxWait = 1s - the maximum amount of time we wait for new lines in the pipeline. We never want to block the pipeline, so we wait at most 1s after the first line of the event before forwarding the event as-is to Splunk.

The default pattern for matching multiline events works great, but since we know exactly how a new event starts in these logs, we can define a dedicated pattern for this pod with the regexp ^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\+[^\s]+, which tells Collectord that every event should start with a timestamp like 2020-12-15 21:44:25,771+0000. Let's add one more annotation:

```bash
kubectl annotate pod jira \
  collectord.io/volume.1-logs-eventpattern='^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\+[^\s]+'
```

4. Extracting time

If you look at the events forwarded to Splunk, you will see that the timestamp of the event in Splunk does not match the timestamp in the log line. Including the timestamp in the logs also adds licensing cost for Splunk. For container logs, we recommend removing the timestamp from the log line completely, as the container runtime provides an accurate timestamp for every log line. See Timestamps in container logs.

We will extract the timestamp from the log lines and forward it as the timestamp of the event. In most cases this is easy to do, but with the current JIRA format it is a little trickier, so we will need some magic.

First, we need to extract the timestamp as a separate field. For this, we will use the already mentioned tool regex101.com. The regexp that I've built is ^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[\.,]\d{3}\+[^\s]+) ((?:.+|\n)+)$. On the Match Information tab, you can see that the whole event matches (this can be tricky with multiline events), the timestamp field is extracted, and the rest is an unnamed group. Collectord forwards the last unnamed group to Splunk as the message field.

A few notes about this regexp:

- In the middle of the timestamp, I don't match the subseconds separator with just a comma, but with a dot or a comma: [\.,]. The reason is shown below: we need a workaround, because Go versions before 1.17 cannot parse timestamps where subseconds are separated by a comma instead of a dot.
- (?:YOUR_REGEXP) - always use a non-capturing group when you need parentheses to structure the regexp but don't want to name the group. That way, you are not telling Collectord to treat it as another field.

Collectord is written in Go, so we use the Parse function from the Go time package to parse the time.
You can always experiment in the Go playground, and we prepared a template for you to work out the right parsing layout for your timestamps.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	t, err := time.Parse("2006-01-02 15:04:05,000-0700", "2020-12-15 21:44:25,771+0000")
	if err != nil {
		panic(err)
	}
	fmt.Println(t.String())
}
```

If you run this code with a Go version older than 1.17, you will see an error:

```text
panic: parsing time "2020-12-15 21:44:25,771+0000" as "2006-01-02 15:04:05,000-0700": cannot parse "771+0000" as ",000"
```

As I mentioned above, the reason is that those Go versions do not recognize milliseconds after a comma. With Collectord, we can replace the comma with a dot, and then our timestamp layout will be 2006-01-02 15:04:05.000-0700.

First, these annotations replace the comma with a dot:

```bash
kubectl annotate pod jira \
  collectord.io/volume.1-logs-replace.fixtime-search='^(?P<timestamp_start>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<timestamp_end>\d{3}\+[^\s]+)' \
  collectord.io/volume.1-logs-replace.fixtime-val='${timestamp_start}.${timestamp_end}'
```

After that, we can apply annotations to extract the timestamp as a field and parse it as the timestamp of the event:

```bash
kubectl annotate pod jira \
  collectord.io/volume.1-logs-extraction='^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[\.,]\d{3}\+[^\s]+) ((?:.+|\n)+)$' \
  collectord.io/volume.1-logs-timestampfield='timestamp' \
  collectord.io/volume.1-logs-timestampformat='2006-01-02 15:04:05.000-0700'
```

The complete example

After applying all the annotations, our pod definition should look similar to the example below.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jira
  annotations:
    collectord.io/volume.1-logs-name: 'data'
    collectord.io/volume.1-logs-recursive: 'true'
    collectord.io/volume.1-logs-glob: 'log/*.log*'
    collectord.io/volume.1-logs-eventpattern: '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\+[^\s]+'
    collectord.io/volume.1-logs-replace.fixtime-search: '^(?P<timestamp_start>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<timestamp_end>\d{3}\+[^\s]+)'
    collectord.io/volume.1-logs-replace.fixtime-val: '${timestamp_start}.${timestamp_end}'
    collectord.io/volume.1-logs-extraction: '^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[\.,]\d{3}\+[^\s]+) ((?:.+|\n)+)$'
    collectord.io/volume.1-logs-timestampfield: 'timestamp'
    collectord.io/volume.1-logs-timestampformat: '2006-01-02 15:04:05.000-0700'
spec:
  containers:
  - name: jira
    image: atlassian/jira-software:8.14
    volumeMounts:
    - name: data
      mountPath: /var/atlassian/application-data/jira
  volumes:
  - name: data
    emptyDir: {}
```

The logs in Splunk should now be well-formatted.

Links

Read more about available annotations that control the forwarding pipeline in the links below:

- Monitoring OpenShift - Annotations
- Monitoring Kubernetes - Annotations

Configuring Splunk HTTP Event Collector for performance

In this blog post, we will show you how to configure your ingesting pipeline with Splunk HTTP Event Collector to get the best performance out of your Splunk configuration. We will focus on which metrics to monitor and on suggestions about when you need to scale your Splunk deployments.

UPDATE: The amount of data forwarded by day was incorrect (we lost a multiplier of 60). It should be ~2.5TiB/day, ~5TiB/day, ~10TiB/day.
The official name is Splunk Heavy Forwarder (Splunk HF), not a Splunk Heavy-Weight Forwarder (Splunk HWF) This test will use Splunk Enterprise (the latest version at the current moment, 8.1.3) as a single Splunk instance that will perform as an indexer and search head. Additionally, in the beginning, we will install one Splunk Heavy Forwarder with Splunk HTTP Event Collector configured on this instance. Later we will show you how to scale the Splunk HF tier. We will perform all our tests on Amazon Web Services, EC2 instances running Amazon Linux 2 (~ CentOS 7). For ingesting data, we will use Collectord deployed on Docker. In this diagram, we show the architecture that we started working with. Test Run 1. Default Configurations We will start with the default configurations. To perform this test, we configured: Data receiving on Splunk Indexer Data forwarding without storing a copy of the data on Splunk Heavy Forwarder Splunk HTTP Event Collector on Splunk Heavy Forwarder Installed Docker version 20.10.4 on worker nodes Installed Collectord on Worker nodes with the default configurations To perform the test, we run on each Worker node ten copies of the following container. bash Copy 1docker run --rm -d docker.io/mffiedler/ocp-logtest:latest python ocp_logtest.py --line-length=1024 --num-lines=600000 --rate 60000 --fixed-line This test runs for about 10 minutes. A single container with this configuration generates 1,000 events per second with a size of 1KiB each. In total, we are forwarding approximately 10MiB/sec of data from a single worker. In total, Splunk receives 30MiB/sec of data (~2.47TiB/day). After the test finished, we looked at the Request Lag, where we found a delay in sending events. The lag is between 180-300 seconds (up to 5 minutes). This dashboard is a default dashboard that we provide with our Monitoring Solutions for Kubernetes, OpenShift and Docker. 
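The data volumes quoted for this test run can be checked with a few lines of arithmetic. The Go sketch below assumes three worker nodes (implied by 10MiB/sec per worker and 30MiB/sec in total); the function names are hypothetical. The exact figure comes out slightly below the rounded 30MiB/sec used in the text.

```go
package main

import "fmt"

const secondsPerDay = 24 * 60 * 60

// mibPerSec returns the aggregate rate in MiB/sec for the given number of
// workers, containers per worker, events per second per container, and
// event size in bytes.
func mibPerSec(workers, containers, eventsPerSec, eventBytes int) float64 {
	return float64(workers*containers*eventsPerSec*eventBytes) / (1024 * 1024)
}

// tibPerDay converts a sustained rate in MiB/sec into TiB/day.
func tibPerDay(rate float64) float64 {
	return rate * secondsPerDay / (1024 * 1024)
}

func main() {
	// --rate 60000 is lines per minute, so 1,000 events/sec of 1024 bytes each.
	rate := mibPerSec(3, 10, 1000, 1024)
	fmt.Printf("exact: %.1f MiB/sec, %.2f TiB/day\n", rate, tibPerDay(rate))
	fmt.Printf("rounded to 30 MiB/sec: %.2f TiB/day\n", tibPerDay(30))
}
```

The same conversion applies to any other sustained rate: pass the rate in MiB/sec to `tibPerDay`.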
When we looked at the Overview dashboard of our Monitoring Docker solution, we found three raised alerts. We can confirm that the Collectord container was throttled by looking at the detailed metrics from the Container dashboard. By looking at the Request Lag, we can confirm that the lag of container logs kept increasing.

Collectord reports request lag: the difference between the time of the event and the time it was sent to Splunk. If the difference is too high, events are getting delayed in the pipeline. The most common reason is that Splunk's HEC configuration cannot handle the amount of data forwarded by Collectord, or, as we saw above, that Collectord does not have enough resources. For the next test, we will increase the resources for Collectord to deal with the CPU throttling.

Test Run 2. Increasing CPU resources for Collectord

To deal with CPU throttling, we needed to increase CPU resources for the container. We raised the CPU limit to two CPU cores (--cpus=2) and changed CPU shares to 1024 (--cpu-shares=1024). We reran the same test and saw that Collectord was not throttled anymore, but the request lag was still there.

Test Run 3. Increasing number of Splunk client threads

Collectord provides the ability to increase the number of Splunk client threads that send data. Under [output.splunk] in the Collectord configuration, you can find the threads setting. For Collectord deployed on Docker, you can change it with an environment variable: --env "COLLECTOR__SPLUNK_THREADS=output.splunk__threads=20". By default, Collectord deployed in the Monitoring Docker solution uses one thread. We keep the default small because if the value is too high and you have too many workers sending data to too few Splunk HEC endpoints, this can result in a lot of connections from the Collectord side.
When you change this value, calculate the total number of connections established from the Collectord side by multiplying the number of threads by the number of workers. Considering that we have ten independent containers on each host, and each container produces a high number of events, Collectord can use a dedicated pipeline for each of them, from reading the log files to combining a batch of events and sending them to Splunk HEC.

After applying this change and rerunning the test, we saw an improvement, but some events still lagged by about 180 seconds.

Test Run 4. Recommended configurations for Splunk HTTP Event Collector

In the Splunk Monitoring Console, we looked at the CPU usage of the Splunk Indexer (highlighted period) and at the CPU usage of the Splunk Heavy Forwarder with Splunk HEC. In both graphs, we do not yet see Splunk using all the CPUs of its EC2 instance. But if you look at the Splunk configuration file inputs.conf under [http] (HEC configuration), you can find the following setting.

```ini
dedicatedIoThreads = <number>
* The number of dedicated input/output threads in the event collector input.
* Default: 0 (The input uses a single thread)
```

That tells us that by default, this input uses only one thread. We highly recommend reviewing the .conf2017 talk Measuring HEC Performance For Fun and Profit. The most important notes about configuring the server side of the data ingesting pipeline:

- Splunk Parallel Ingestion Pipelines - Recommendation: depends on event type, but typically 2 pipelines
- Splunk Dedicated IO Threads - Recommendation: set to roughly the number of CPU cores on the machine

Collectord already implements the client-side recommendations. We changed the configuration of the Splunk Heavy Forwarder to the recommended values:

- parallelIngestionPipelines = 2 in server.conf for [general]
- dedicatedIoThreads = 4 in inputs.conf for [http]

After rerunning the test, we found that the lag is now below 5 seconds.
If we look at the CPU usage of Splunk HF, we can see that it uses more CPU now. The colors of KVStore and splunkd_server are very close, but splunkd_server uses most of the CPU. With the CloudWatch metrics of the EC2 instance, it is easier to see that CPU usage went up.

Test Run 5. Doubling the amount of data - 5TiB/day

We decided not to stop there but to test how our Splunk configuration would work if we doubled the amount of data. In production environments, it is essential to support not only today's load, but also the load that you might receive tomorrow.

Considering that Splunk HF was using close to 80% of CPU, the first step was to increase the number of CPUs for this instance. We switched from c5d.xlarge to c5d.4xlarge, which gave us 32 cores in total (16*2). We increased the number of CPUs by a factor of 4 to get the instance ready for the next tests.

To run this test, we increased the CPU limit and shares for the Collectord container (--cpus=3 --cpu-shares=1024). Instead of running 10 containers on each worker node, we ran 20. That generated 60MiB/sec of data (5TiB/day).

With this load, even with the more powerful instance for Splunk HF, we noticed that the lag increased again. The Request Time between Collectord and Splunk HTTP Event Collector also increased (max from ~0.005sec to ~0.05sec).

After investigation, we found several issues. The easiest one: our worker nodes were running close to 100% of CPU capacity. Before, we had only looked at the CPU usage of Collectord, but the testing containers, the Docker daemon, and the system also used a good portion of CPU resources. To solve this issue, we changed the worker instances to c5d.2xlarge.

On the Indexer, we noticed that the ingesting pipeline was busy. When running tests, the indexerpipe thread was close to 100%, and the Pipeline Set was 100% busy. To deal with this issue we did two things.
- We increased the number of parallelIngestionPipelines to 2.
- Before this, we were sending all the data to just one index. Instead, we started forwarding data to two indexes, so half of the testing containers we ran had the annotation collectord.io/index=main2.

We noticed the next issue by looking at the EC2 instance metrics: the amount of data that the Indexer was receiving was around 8G/sec. The type of instances we were using had an Ethernet connection with up to 10G/sec, so that was a signal as well.

At this point, we had the option to split our single-instance Indexer into an Indexer Cluster, and in a production environment you should run a cluster rather than single-instance Indexers. But implementing an Indexer Cluster with three instances for durability would not change much: the amount of data traveling over the network would be pretty close.

Collectord was sending data to Splunk HEC in compressed format (gzip compression over HTTP), which resulted in about 300M/sec. So we decided to look into the possibility of forwarding compressed data between Splunk HF and the Splunk Indexer. In the outputs.conf file you can find the flag compressed = <boolean>, which is set to false by default. Setting it to true reduced the traffic from Splunk HF to the Splunk Indexer to around 300M/sec as well.

After doing all of that, we got the lag back below 5 seconds and the request time between Collectord and Splunk HEC below 0.005sec.

Test Run 6. Doubling the amount of data (Again!) - ~10TiB/day

After performing that test, we decided to double the amount of data again. This time we kept running 20 containers on each worker, but changed the amount of logs generated by each container from 1MiB/sec to 2MiB/sec. In total, we were generating 40MiB/sec from a single worker, and 120MiB/sec from three workers (~10TiB/day).

To run this test, we increased the CPU limit and shares for the Collectord container again (--cpus=4 --cpu-shares=2048).
After running the test, we again noticed increased lag and request time. At this point, we also started to notice some indexing lag, including filled queues, on the Splunk Indexer. But we were mostly curious about how Splunk HF could handle this load, as Collectord metrics showed that the request time and the lag between Collectord and Splunk HEC kept increasing.

We had already checked that Collectord was not throttled: it was using 3 CPUs at maximum, and we gave it 4 CPUs. All workers were using about 70% of CPU in total.

We used the Splunk Monitoring Console to look at which threads were busy on the Splunk HF running Splunk HTTP Event Collector, and noticed that the thread httpinputserverdatathread was close to 100% every time we ran the test. We tried to adjust dedicatedIoThreads for Splunk HEC while also increasing dedicatedIoThreads for the HTTP Server in server.conf, but none of the configurations would raise it from 1 to multiple threads.

At this point we saw that the Splunk Indexer could use around 5-6 CPUs in total, and could not scale more for our tests.

We then decided to scale the number of Splunk HF instances from one to three and put them behind a TCP Load Balancer. We used c5d.2xlarge instances for Splunk HF with Splunk HEC configured on them. So instead of one Splunk HF with 32 CPU cores (c5d.4xlarge), we used three with 16 CPU cores each (c5d.2xlarge).

That solved the issue with httpinputserverdatathread: each Splunk HF instance had it below 50%. But at this point, we started to see a lag in the indexing pipeline of the Splunk Indexer. We decided that it was time to finish our tests, and realized that the next step would be scaling our indexing tier and implementing an Indexer Cluster.

Lessons learned

By running those tests we have learned:

- Splunk is very configurable, and some default configurations might not work in all environments.
- Always test higher loads on your environments, to be prepared for tomorrow.
- Use all the tools available for troubleshooting performance. We used EC2 monitoring tools with CloudWatch, the Splunk Monitoring Console, and of course the dashboards provided by our solution Monitoring Docker (OpenShift and Kubernetes).
- We believe that the dashboard Splunk Monitoring Console - Indexing Performance: Advanced - Splunkd Thread Activity is essential for monitoring the indexing pipeline and Splunk HTTP Event Collector. Had we looked at it during the first runs, we would have seen much earlier how important it is to change dedicatedIoThreads for Splunk HTTP Event Collector.
- Monitoring network traffic is vital to estimate the load your network can handle.
- Request Time between the Splunk HEC client and the server is a good indicator that the pipeline behind Splunk HEC is lagging.

An important detail about our tests: please use them as a guide that can help you investigate performance issues, not as a guide for the Splunk environment you need to configure. Each workload, data format, and data frequency can be very different from our tests, and in your case, some other configurations might work better.

Create a secure administrator password in Docker for Splunk 7.1.0

tl;dr: Starting from Splunk 7.1, there is no more changeme password. Use --gen-and-print-passwd to generate a new password when starting Splunk for the first time.

```bash
docker run \
  --publish 8000:8000 \
  --env SPLUNK_START_ARGS="--accept-license --gen-and-print-passwd" \
  splunk/splunk:7.1.0
```

How to specify the password for admin user at start time?

All the examples below are based on the Splunk documentation Create a secure administrator password.

Option 1. Seed the password using arguments.
Using --seed-passwd as an option, you can specify which password you want to use if the admin user does not have any password yet. bash Copy 1$ docker run \ 2 --publish 8000:8000 \ 3 --env SPLUNK_START_ARGS="--accept-license --answer-yes --seed-passwd changeme" \ 4 splunk/splunk:7.1.0 The password will be set when it is a fresh Splunk installation. If you have set or changed the admin password before, this command does not change the existing password. It is safe to keep this argument all the time, the same way you keep --accept-license --answer-yes. With this configuration, you will not be asked to change the password when you access Splunk for the first time using Splunk Web. Make sure to change the password to something more secure in Settings - Access Controls, as this password will be visible to all users who have access to the Docker instance. Option 2. Set the password using stdin. If you are playing with Docker and Splunk, you can run it with -it, allowing you to interact with the tty bash Copy 1$ docker run \ 2 --publish 8000:8000 \ 3 --env SPLUNK_START_ARGS="--accept-license --answer-yes" \ 4 -it \ 5 splunk/splunk:7.1.0 6 7This appears to be your first time running this version of Splunk. 8 9An Admin password must be set before installation proceeds. 10Password must contain at least: 11 * 8 total printable ASCII character(s). 12Please enter a new password: 13Please confirm new password: 14... That way, your password will not be exposed to logs or anywhere else. Keeping it is safe. Option 3. Use autogenerated password You can use the --gen-and-print-passwd flag. In that way, you will get a new autogenerated password when you start Splunk for the first time. bash Copy 1$ docker run \ 2 --publish 8000:8000 \ 3 --env SPLUNK_START_ARGS="--accept-license --gen-and-print-passwd --answer-yes" \ 4 splunk/splunk:7.1.0 5 6This appears to be your first time running this version of Splunk. 7 8Randomly generated admin password: 9_,4G5Reu 10... 
Because the password is logged, make sure to change it after the first login. Option 4. Use user-seed.conf You can create user-seed.conf with the clear text password as ini Copy 1[user_info] 2USERNAME = admin 3PASSWORD = Your5ecureP@assw0wd It will be more secure to store a hashed version of the password instead. For that, you need to have a running Splunk instance. bash Copy 1$ splunk hash-passwd 'Your5ecureP@assw0wd' 2$6$1hfVCT0MACVOq.pd$hiflBxVd36YLeaThJY0x2RxVCYUD60iz3g72plrKeYPgm3fwXnC20k9XxznQDXpefy79dilaQvOJPBge0Zc3C1 You can use one of the options above to start Splunk in the container and access Splunk with docker exec -it [container_id] entrypoint.sh splunk-bash. Execute ./bin/splunk hash-passwd ... there. To use a hashed password instead of clear text, specify it in user-seed.conf with HASHED_PASSWORD. ini Copy 1[user_info] 2USERNAME = admin 3HASHED_PASSWORD = $6$1hfVCT0MACVOq.pd$hiflBxVd36YLeaThJY0x2RxVCYUD60iz3g72plrKeYPgm3fwXnC20k9XxznQDXpefy79dilaQvOJPBge0Zc3C1 Now you need to embed this file into the container. You can do it by mounting the file under /var/opt/splunk/etc. This folder is a backup directory for the default Splunk etc files. On first start (or upgrade), the container copies all files from this directory to /opt/splunk/etc. bash Copy 1docker run \ 2 --publish 8000:8000 \ 3 --env SPLUNK_START_ARGS="--accept-license --answer-yes" \ 4 --volume $(pwd)/user-seed.conf:/var/opt/splunk/etc/system/local/user-seed.conf \ 5 splunk/splunk:7.1.0 You can also build your own image on top of the Splunk image with a Dockerfile and just one command to place the user-seed.conf. text Copy 1FROM splunk/splunk:7.1.0 2COPY user-seed.conf /var/opt/splunk/etc/system/local/user-seed.conf Build the image with docker build -t example.com/splunk:7.1.0 . and run your image similarly to the example above. 
```bash
docker run \
  --publish 8000:8000 \
  --env SPLUNK_START_ARGS="--accept-license --answer-yes" \
  --volume $(pwd)/user-seed.conf:/var/opt/splunk/etc/system/local/user-seed.conf \
  example.com/splunk:7.1.0
```

If you keep the password in clear text in user-seed.conf, make sure to change it on first login.

Option 5. Use python to write the user-seed.conf on start.

A more advanced option: if you already have a hashed password, you can use the SPLUNK_BEFORE_START_CMD environment variable to invoke Python to write the content of user-seed.conf.

```bash
docker run \
  --publish 8000:8000 \
  --env SPLUNK_START_ARGS="--accept-license --answer-yes" \
  --env SPLUNK_BEFORE_START_CMD='cmd --accept-license python -c '"'"'open("/opt/splunk/etc/system/local/user-seed.conf", "w").write("[user_info]\nUSERNAME = admin\nHASHED_PASSWORD = $6$1hfVCT0MACVOq.pd$hiflBxVd36YLeaThJY0x2RxVCYUD60iz3g72plrKeYPgm3fwXnC20k9XxznQDXpefy79dilaQvOJPBge0Zc3C1")'"'"'' \
  splunk/splunk:7.1.0
```

Forwarding 10,000 1k events per second generated by containers from a single host with ease

It is good to know the limits of your infrastructure. We are continually testing Collectord in our labs. Today we want to share with you the results of the tests that we have performed on AWS EC2 instances, so you can use them as a reference for planning the capacity and the cost of your deployments. We will provide you with information on how we ran the tests and how we measured the performance.

[UPDATE (2018-11-04)] Up to 35% CPU performance improvements, 3 times less memory usage in upcoming version 5.3.

Tests performed on 2018-10-20

Environment

AWS

We used two EC2 instances in the same VPC and the same AZ, with default tenancy and AMI ami-0d1000aff9a9bad89 (Amazon Linux 2).
c5d.xlarge (4 vCPU, 8GiB, 100 NVMe SSD) for Splunk m5.large (2 vCPU, 8GiB, 20GB gp2 EBS) for testing environment (for most tests) c5d.xlarge (4 vCPU, 8GiB, 100 NVMe SSD) for 10,000 1k events per second test (we note below) Splunk We deployed Splunk inside of the container. We used version 7.2.0. One index for all events. Docker bash Copy 1docker version 2Client: 3 Version: 18.06.1-ce 4 API version: 1.38 5 Go version: go1.10.3 6 Git commit: e68fc7a215d7133c34aa18e3b72b4a21fd0c6136 7 Built: Wed Sep 26 23:00:19 2018 8 OS/Arch: linux/amd64 9 Experimental: false 10 11Server: 12 Engine: 13 Version: 18.06.1-ce 14 API version: 1.38 (minimum version 1.12) 15 Go version: go1.10.3 16 Git commit: e68fc7a/18.06.1-ce 17 Built: Wed Sep 26 23:01:44 2018 18 OS/Arch: linux/amd64 19 Experimental: false JSON logging driver configuration json Copy 1{ 2 "log-driver": "json-file", 3 "log-opts" : { 4 "max-size" : "100m", 5 "max-file" : "3" 6 } 7} Collectord for docker In our tests we used the latest released version of Collectord for Docker 5.2. We used two configurations. One that works out of the box (with gzip compression, SSL, join rules). The second configuration used HTTP connection for HEC, disabled gzip compression and no join rules. bash Copy 1... 2--env "COLLECTOR__SPLUNK_URL=output.splunk__url=http://splunk-example:8088/services/collector/event/1.0" \ 3--env "COLLECTOR__SPLUNK_GZIP=output.splunk__compressionLevel=nocompression" \ 4--env "COLLECTOR__JOIN=pipe.join__disabled=true" \ 5... Kubernetes We used a single instance cluster bootstrapped with kubeadm. 
bash Copy 1kubectl version 2Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:46:06Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"} 3Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:36:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"} Collectord for kubernetes In our tests we used the latest released version of Collectord for Kubernetes 5.2. Similarly to Docker, we used two configurations. One that works out of the box (with gzip compression, SSL, join rules). The second configuration used HTTP connection for HEC, disabled gzip compression and no join rules. ini Copy 1[output.splunk] 2url = http://splunk-example:8088/services/collector/event/1.0 3compressionLevel=nocompression 4 5[pipe.join] 6disabled = true Log generator We used ocp_logtest with the following configuration bash Copy 1python ocp_logtest.py --line-length=1024 --num-lines=300000 --rate 60000 --fixed-line That configuration generates close to 1,000 events from one container with the average size of 1024 bytes. To run it in Docker we used the command below. To forward 5,000 events we would run five of these containers in parallel. bash Copy 1docker run --rm \ 2 --label=test=testN \ 3 -d docker.io/mffiedler/ocp-logtest:latest \ 4 python ocp_logtest.py --line-length=1024 --num-lines=300000 --rate 60000 --fixed-line To run the log generator in Kubernetes we used Jobs. Each had the same definition as the one below. To forward 5,000 events we would run five of these Jobs in parallel (you need to change the name of the job). 
yaml Copy 1apiVersion: batch/v1 2kind: Job 3metadata: 4 name: logtestX 5 labels: 6 test: 'testN' 7spec: 8 template: 9 spec: 10 restartPolicy: Never 11 containers: 12 - name: c1 13 image: docker.io/mffiedler/ocp-logtest:latest 14 command: 15 - python 16 args: 17 - ocp_logtest.py 18 - --line-length=1024 19 - --num-lines=300000 20 - --rate=60000 21 - --fixed-line Tests In our test dashboards we show: The number of messages in total (to verify that we have not lost any messages). The number of messages per second (only from tested containers). Message len (len(_raw), only the size of the logline, excluding the metadata that we attach). Collectord CPU usage percent of a single core. Lag, the difference between _indextime and _time (... | timechart avg(eval(_indextime-_time))). We show max, avg and min. Collectord memory usage in MB. Network transmit from the Collectord container (in case of Kubernetes network transmit from the host, as it is running on the host network). Test 1. Docker environment. Default Configuration. Forwarding 1,000 1k events per second Test 2. Docker environment. No SSL, No Gzip, No Join. Forwarding 1,000 1k events per second Changing the default configuration can significantly impact the results. Because we do not use SSL and we do not use Gzip we reduced the CPU usage from 6-7% to 4-5%. Because we do not use Gzip we reduced memory usage from 60Mb to 20Mb, simply because our batchSize is still the same 768K, but now we are talking about not-compressed 768K. But the network usage grew from 50KB/s to almost 2MB/s. Test 3. Docker environment. Default Configuration. Forwarding 5,000 1k events per second Compared to Test 1, now we forward 5 times more events. Instead of 6-7% CPU usage we see 30-35% CPU usage of a single core, and increased memory usage from 60MB to around 110MB. Test 4. Docker environment. No SSL, No Gzip, No Join. Forwarding 5,000 1k events per second Disabling SSL, Gzip and Join rules can reduce CPU usage from 30-35% to 25%. 
But disabling Gzip compression increases the network traffic. When you do not pay for network traffic between your nodes and Splunk instance, and if you have enough bandwidth to support it, you can choose not to use Gzip compression. Test 5. Kubernetes environment. Default Configuration. Forwarding 1,000 1k events per second Similarly to the Docker environment, we tested the Kubernetes environment as well. CPU and Memory usage on Kubernetes clusters is slightly higher for several reasons. First, these Collectord instances are still performing a collection of all other data that we configure, including system metrics, Prometheus metrics, network metrics, host logs. Collectord for Kubernetes forwards more data by default, because of the more complex environment. The second reason is that we attach more metadata to logs and metrics, as we forward not only information about the Containers and Pods, but also information about the Workloads that created this Pod, and the Host information as well. Memory usage in this test is not very accurate, because we ran the 5,000 1k events test right before this test, which increased the memory usage of the Collectord and kept the memory reserved. Test 6. Kubernetes environment. No SSL, No Gzip, No Join. Forwarding 1,000 1k events per second Similar result to Docker environment. Not using Gzip compression can reduce CPU usage, and reduce Memory usage. From 9-10% to 5-6% CPU usage of a single core. Test 7. Kubernetes environment. Default Configuration. Forwarding 5,000 1k events per second Forwarding 5,000 events uses 40% of CPU. Compared to test 5, we see a 4x change. Test 8. Kubernetes environment. No SSL, No Gzip, No Join. Forwarding 5,000 1k events per second Disabling Gzip compression reduces CPU usage from 40% to 26% of a single core. Test 9. Docker environment. Default Configuration. 
Forwarding 10,000 1k events per second

To be able to forward more than 5,000 events per second, we reserved a c5d.xlarge instance for this test, to make sure that we would not be affected by the performance of the gp2 EBS volume. We changed the configuration of Collectord and increased the number of Splunk threads to 5. In our tests, we see that one Splunk client with the default configuration (SSL, Gzip compression, 768K batch size) can forward about 5,000 events per second. We recommend increasing this value if you have more than 4,000 events per second.

```bash
--env "COLLECTOR__SPLUNK_THREADS=output.splunk__threads=5"
```

Doing that allowed us to forward 10,000 events per second. Compared to Test 3 (30-35%), Collectord now uses 60% of a single core. The memory usage grew to 400MB because of the dedicated threads (and the buffers allocated for them). An important detail is that with this amount of events, dockerd uses around 25% of a single core. The Splunk process used 80% of a single core on its host.

Summary

If you want to reproduce these tests in your environment, we have shared all the steps we performed. If you find that some steps are missing, please let us know.

Forwarding up to 5,000 1k events per second does not require any changes to the configuration. To forward more than that, you need to change the number of threads.

These results are not our limit. We will keep working on improving performance and memory usage in the future.

Forwarding Kubernetes and OpenShift Log to QRadar (syslog) - Beta

We are thrilled to announce a beta version of the Collectord solutions that will help you forward Kubernetes and OpenShift logs to QRadar (Syslog). As always, we promise the most performant and easiest to use solution for forwarding logs from your Kubernetes clusters.
We already have several customers who use the early beta, and we are ready to expand the beta program. If you are interested, please contact us at contact@outcoldsolutions.com

Our plan is to release the GA version of this solution in Q3 2021.

Forwarding logs to ElasticSearch and OpenSearch with Collectord

Large teams might have different requirements for the log management system. Some teams might prefer to use Elasticsearch or OpenSearch for log management. In this version of Collectord, we have added support for sending logs to Elasticsearch and OpenSearch.

You can install Collectord with Elasticsearch or OpenSearch support and run it in the same cluster as Collectord for Splunk. In that case, you can configure Collectord to send logs to both Splunk and Elasticsearch or OpenSearch.

Collectord version 5.20 and later supports sending logs to Elasticsearch and OpenSearch.

Our installation instructions for Elasticsearch and OpenSearch provide dedicated configuration files for each. The main difference is the pre-configured mappings and templates for Elasticsearch and OpenSearch.

You can find installation instructions on our website: Forwarding logs to Elasticsearch and OpenSearch with Collectord

Preview of the Elasticsearch Observability Dashboard with logs ingested by Collectord

Collectord ingests logs with the Elastic Common Schema (ECS) format. The following screenshot shows the Elasticsearch Observability Dashboard with logs ingested by Collectord.

Preview of the OpenSearch Dashboards with logs ingested by Collectord

The following screenshot shows the OpenSearch Dashboards with logs ingested by Collectord.

Extracting fields from the logs and redirecting to custom data streams

With Collectord annotations you can configure field extractions and redirect logs to a different data stream.
In our example, we have an nginx pod running.

First, since we will extract some additional fields, we will create a new data stream called logs-nginx-web. To do that, we first download the default index template created by Collectord and add additional fields.

```bash
curl -k -u elastic:elastic https://localhost:9200/_index_template/logs-collectord-5.20.400 | jq '.index_templates[].index_template' > default.json
```

In the default.json file we will change the index_patterns to logs-nginx-web and add additional fields to the mappings.properties section.

```text
"request": {
  "properties": {
    "remote_addr": {"type": "ip"},
    "remote_user": {"ignore_above": 1024, "type": "keyword"},
    "method": {"ignore_above": 1024, "type": "keyword"},
    "path": {"ignore_above": 1024, "type": "keyword"},
    "http_referer": {"ignore_above": 1024, "type": "keyword"},
    "http_user_agent": {"ignore_above": 1024, "type": "keyword"}
  }
},
"response": {
  "properties": {
    "status": {"type": "long"},
    "body_bytes": {"type": "long"}
  }
}
```

For the Pod we will add the following annotations.

Important detail: __ is used to create nested fields in Elasticsearch, so the request__remote_addr will be converted to request.remote_addr in Elasticsearch.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  annotations:
    elasticsearch.collectord.io/stdout-logs-extraction: '^((?P<request__remote_addr>[\d.]+)\s+(?P<request__remote_user>-|\w+) -\s+\[(?P<timestamp>[^\]]+)\]\s+"(?P<request__method>[^\s]+)\s(?P<request__path>[^\s]+)\s(?P<request__type>[^"]+)"\s+(?P<response__status>\d+)\s+(?P<response__body_bytes>\d+)\s+"(?P<request__http_referer>[^"]*)"\s+"(?P<request__http_user_agent>[^"]*)" "-")$'
    elasticsearch.collectord.io/stdout-logs-timestampfield: timestamp
    elasticsearch.collectord.io/stdout-logs-timestampformat: '02/Jan/2006:15:04:05 -0700'
    elasticsearch.collectord.io/stdout-logs-index: 'logs-nginx-web'
# ...
```
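To make the `__` convention concrete, here is a minimal Python sketch of how named groups such as request__remote_addr could be expanded into nested fields. It illustrates the naming rule only and is not Collectord's implementation; the shortened pattern, the `nest` and `extract` helpers, and the sample log line are assumptions made for this example.

```python
import re

# Shortened, hypothetical variant of the extraction regex from the annotation above.
PATTERN = re.compile(
    r'^(?P<request__remote_addr>[\d.]+)\s+\S+\s+(?P<request__remote_user>\S+)\s+'
    r'\[(?P<timestamp>[^\]]+)\]\s+'
    r'"(?P<request__method>\S+)\s(?P<request__path>\S+)\s[^"]+"\s+'
    r'(?P<response__status>\d+)\s+(?P<response__body_bytes>\d+)'
)


def nest(flat):
    """Turn {'request__method': 'GET'} into {'request': {'method': 'GET'}}."""
    nested = {}
    for key, value in flat.items():
        *parents, leaf = key.split("__")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


def extract(line):
    """Return the nested fields for one access-log line, or None if it does not match."""
    match = PATTERN.match(line)
    return nest(match.groupdict()) if match else None


if __name__ == "__main__":
    sample = '127.0.0.1 - - [08/Apr/2023:11:53:16 +0000] "GET /index.html HTTP/1.1" 200 615 "-" "curl/8.0" "-"'
    print(extract(sample))
```

Note that every extracted value is still a string at this point; in the walkthrough above it is the index template mapping that decides whether a field such as response.status is stored as a long.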
After that we can review the logs in the Elasticsearch Dashboards.

If you define a mapping incorrectly, the events that could not be indexed will be redirected to the data stream defined under [output.elasticsearch] in the field dataStreamFailedEvents, and you will see WARN messages in the Collectord logs similar to these:

```text
WARN 2023/04/08 11:53:16.679396 outcoldsolutions.com/collectord/pipeline/output/elasticsearch/output.go:322: thread=1 datastream="logs-nginx-broken" first error from bulk insert: item create failed with status 400 (failed to parse field [request.remote_addr] of type [long] in document with id 'iwySYYcB8kxjWZpbYyHp'. Preview of field's value: '127.0.0.1')
WARN 2023/04/08 11:53:16.679426 outcoldsolutions.com/collectord/pipeline/output/elasticsearch/output.go:333: thread=1 datastream="logs-nginx-broken" response contains errors, 3 events failed to be indexed, posting to logs-collectord-failed-5.20.400
```

Forwarding logs from Persistent Volumes

Collectord can forward logs from Persistent Volumes without any additional deployments on the cluster. To do that, add a simple annotation to the Pod, elasticsearch.collectord.io/volume.1-logs-name: 'logs', where logs is the name of the volume.

In the example below we also use some existing features of Collectord to extract fields from the logs, in particular the proper timestamp. Additionally, we use some new features of Collectord: we match files by a glob pattern that uses the {{kubernetes.pod.name}} variable, and we store the acknowledgement database on the Persistent Volume, so that when the volume is attached to another host, the logs are forwarded from the last acknowledged position.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: postgres-pod0
  annotations:
    elasticsearch.collectord.io/volume.1-logs-name: 'logs'
    elasticsearch.collectord.io/volume.1-logs-glob: '{{kubernetes.pod.name}}/*.log'
    elasticsearch.collectord.io/volume.1-logs-extraction: '^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [^\s]+) (.+)$'
    elasticsearch.collectord.io/volume.1-logs-timestampfield: 'timestamp'
    elasticsearch.collectord.io/volume.1-logs-timestampformat: '2006-01-02 15:04:05.000 MST'
    elasticsearch.collectord.io/volume.1-logs-timestamplocation: 'Europe/Oslo'
    elasticsearch.collectord.io/volume.1-logs-onvolumedatabase: 'true'
spec:
  containers:
  - name: postgres
    image: postgres
    env:
    - name: POSTGRES_HOST_AUTH_METHOD
      value: trust
    command:
    - docker-entrypoint.sh
    args:
    - postgres
    - -c
    - logging_collector=on
    - -c
    - log_min_duration_statement=0
    - -c
    - log_directory=/var/log/postgresql/postgres-pod0/
    - -c
    - log_min_messages=INFO
    - -c
    - log_rotation_age=1d
    - -c
    - log_rotation_size=10MB
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data
    - name: logs
      mountPath: /var/log/postgresql/
  volumes:
  - name: data
    emptyDir: {}
  - name: logs
    persistentVolumeClaim:
      claimName: myclaim0
```

Links

Documentation
Installation Instructions

Forwarding pretty JSON logs to Splunk

One problem our collectord solves for our customers is support for multi-line messages, including lines generated by rendering pretty JSON messages.

When you can avoid it, I suggest you avoid it. As an example, if you want to see pretty JSON messages for development, you can keep a configuration flag (which can be an environment variable) that changes how messages are rendered. But if you are dealing with software you did not write, read below.
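The configuration flag mentioned above could look roughly like this in Python. This is only a sketch of the idea: the PRETTY_JSON variable name and the `render` helper are hypothetical and not part of collectord or any library.

```python
import json
import os


def render(event, pretty=None):
    """Serialize an event as JSON, pretty-printing only when asked to."""
    if pretty is None:
        # PRETTY_JSON is a hypothetical environment variable chosen for this sketch.
        pretty = os.environ.get("PRETTY_JSON", "0") == "1"
    # indent=None keeps the whole event on a single line, which is what a log forwarder prefers.
    return json.dumps(event, indent=2 if pretty else None)


if __name__ == "__main__":
    print(render({"level": "info", "message": "started"}))
```

With a switch like this, developers can set the variable locally to get readable output, while production containers keep emitting one JSON document per line.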
As an example, I took sample JSON messages from https://json.org/example.html and wrote a simple Python script to render them to stdout.

```python
import json

examples = [
    '{ "glossary": { "title": "example glossary", "GlossDiv": { "title": "S", "GlossList": { "GlossEntry": { "ID": "SGML", "SortAs": "SGML", "GlossTerm": "Standard Generalized Markup Language", "Acronym": "SGML", "Abbrev": "ISO 8879:1986", "GlossDef": { "para": "A meta-markup language, used to create markup languages such as DocBook.", "GlossSeeAlso": ["GML", "XML"] }, "GlossSee": "markup" } } } }}',
    '{"menu": { "id": "file", "value": "File", "popup": { "menuitem": [ {"value": "New", "onclick": "CreateNewDoc()"}, {"value": "Open", "onclick": "OpenDoc()"}, {"value": "Close", "onclick": "CloseDoc()"} ] }}}',
    '{"widget": { "debug": "on", "window": { "title": "Sample Konfabulator Widget", "name": "main_window", "width": 500, "height": 500 }, "image": { "src": "Images/Sun.png", "name": "sun1", "hOffset": 250, "vOffset": 250, "alignment": "center" }, "text": { "data": "Click Here", "size": 36, "style": "bold", "name": "text1", "hOffset": 250, "vOffset": 100, "alignment": "center", "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;" }}}',
    '{"web-app": { "servlet": [ { "servlet-name": "cofaxCDS", "servlet-class": "org.cofax.cds.CDSServlet", "init-param": { "configGlossary:installationAt": "Philadelphia, PA", "configGlossary:adminEmail": "ksm@pobox.com", "configGlossary:poweredBy": "Cofax", "configGlossary:poweredByIcon": "/images/cofax.gif", "configGlossary:staticPath": "/content/static", "templateProcessorClass": "org.cofax.WysiwygTemplate", "templateLoaderClass": "org.cofax.FilesTemplateLoader", "templatePath": "templates", "templateOverridePath": "", "defaultListTemplate": "listTemplate.htm", "defaultFileTemplate": "articleTemplate.htm", "useJSP": false, "jspListTemplate": "listTemplate.jsp", "jspFileTemplate": "articleTemplate.jsp", "cachePackageTagsTrack": 200, "cachePackageTagsStore": 200, "cachePackageTagsRefresh": 60, "cacheTemplatesTrack": 100, "cacheTemplatesStore": 50, "cacheTemplatesRefresh": 15, "cachePagesTrack": 200, "cachePagesStore": 100, "cachePagesRefresh": 10, "cachePagesDirtyRead": 10, "searchEngineListTemplate": "forSearchEnginesList.htm", "searchEngineFileTemplate": "forSearchEngines.htm", "searchEngineRobotsDb": "WEB-INF/robots.db", "useDataStore": true, "dataStoreClass": "org.cofax.SqlDataStore", "redirectionClass": "org.cofax.SqlRedirection", "dataStoreName": "cofax", "dataStoreDriver": "com.microsoft.jdbc.sqlserver.SQLServerDriver", "dataStoreUrl": "jdbc:microsoft:sqlserver://LOCALHOST:1433;DatabaseName=goon", "dataStoreUser": "sa", "dataStorePassword": "dataStoreTestQuery", "dataStoreTestQuery": "SET NOCOUNT ON;select test=\'test\';", "dataStoreLogFile": "/usr/local/tomcat/logs/datastore.log", "dataStoreInitConns": 10, "dataStoreMaxConns": 100, "dataStoreConnUsageLimit": 100, "dataStoreLogLevel": "debug", "maxUrlLength": 500}}, { "servlet-name": "cofaxEmail", "servlet-class": "org.cofax.cds.EmailServlet", "init-param": { "mailHost": "mail1", "mailHostOverride": "mail2"}}, { "servlet-name": "cofaxAdmin", "servlet-class": "org.cofax.cds.AdminServlet"}, { "servlet-name": "fileServlet", "servlet-class": "org.cofax.cds.FileServlet"}, { "servlet-name": "cofaxTools", "servlet-class": "org.cofax.cms.CofaxToolsServlet", "init-param": { "templatePath": "toolstemplates/", "log": 1, "logLocation": "/usr/local/tomcat/logs/CofaxTools.log", "logMaxSize": "", "dataLog": 1, "dataLogLocation": "/usr/local/tomcat/logs/dataLog.log", "dataLogMaxSize": "", "removePageCache": "/content/admin/remove?cache=pages&id=", "removeTemplateCache": "/content/admin/remove?cache=templates&id=", "fileTransferFolder": "/usr/local/tomcat/webapps/content/fileTransferFolder", "lookInContext": 1, "adminGroupID": 4, "betaServer": true}}], "servlet-mapping": { "cofaxCDS": "/", "cofaxEmail": "/cofaxutil/aemail/*", "cofaxAdmin": "/admin/*", "fileServlet": "/static/*", "cofaxTools": "/tools/*"}, "taglib": { "taglib-uri": "cofax.tld", "taglib-location": "/WEB-INF/tlds/cofax.tld"}}}',
    '{"menu": { "header": "SVG Viewer", "items": [ {"id": "Open"}, {"id": "OpenNew", "label": "Open New"}, null, {"id": "ZoomIn", "label": "Zoom In"}, {"id": "ZoomOut", "label": "Zoom Out"}, {"id": "OriginalView", "label": "Original View"}, null, {"id": "Quality"}, {"id": "Pause"}, {"id": "Mute"}, null, {"id": "Find", "label": "Find..."}, {"id": "FindAgain", "label": "Find Again"}, {"id": "Copy"}, {"id": "CopyAgain", "label": "Copy Again"}, {"id": "CopySVG", "label": "Copy SVG"}, {"id": "ViewSVG", "label": "View SVG"}, {"id": "ViewSource", "label": "View Source"}, {"id": "SaveAs", "label": "Save As"}, null, {"id": "Help"}, {"id": "About", "label": "About Adobe CVG Viewer..."} ]}}'
]

def main():
    for idx, j in enumerate(examples):
        # Every second message is rendered in the pretty (indented) format.
        indent = None
        if idx % 2 == 1:
            indent = ' '
        print(json.dumps(json.loads(j), indent=indent))
        print()

if __name__ == "__main__":
    main()
```

This script outputs every second JSON in a pretty format.

```text
{"glossary": {"title": "example glossary", "GlossDiv": {"title": "S", "GlossList": {"GlossEntry": {"ID": "SGML", "SortAs": "SGML", "GlossTerm": "Standard Generalized Markup Language", "Acronym": "SGML", "Abbrev": "ISO 8879:1986", "GlossDef": {"para": "A meta-markup language, used to create markup languages such as DocBook.", "GlossSeeAlso": ["GML", "XML"]}, "GlossSee": "markup"}}}}}
{
 "menu": {
  "id": "file",
  "value": "File",
  "popup": {
   "menuitem": [
    {
     "value": "New",
     "onclick": "CreateNewDoc()"
    },
    {
     "value": "Open",
     "onclick": "OpenDoc()"
    },
    {
     "value": "Close",
     "onclick": "CloseDoc()"
    }
   ]
  }
 }
}
...
```
The examples below are based on the Monitoring Docker application, but the solution can be applied to Kubernetes and OpenShift as well. You can find the relevant documentation (including how to get started and how to configure collectord) for Kubernetes and OpenShift under docs.

Let’s first see what to expect when we run collectord with the default configuration from our manual and run our script from above in docker.

```bash
docker run --name json_example --volume $PWD:/app -w /app python:3 python3 app.py
```

That does not result in what you want in Splunk: with the default configuration, the pretty JSON messages are not forwarded as single events.

To parse JSON messages as one message, you need to configure a Join Rule and specify it in the configuration file.

```ini
[pipe.join::json]
# Configure for which containers we want to apply this match rule. You can configure it for
# containers running with a specific image as below or with matchRegex.docker_container_name = ^json_example$
matchRegex.docker_container_image = ^python:3$
# We want to define that all messages should start with `{`
patternRegex = ^{
```

One way to configure it is by using environment variables with a format like COLLECTOR__{ANY_NAME}={section}__{key}={value}.
```bash
docker run -d \
    --name collectorfordocker \
    --volume /sys/fs/cgroup:/rootfs/sys/fs/cgroup:ro \
    --volume /proc:/rootfs/proc:ro \
    --volume /var/log:/rootfs/var/log:ro \
    --volume /var/lib/docker/containers/:/var/lib/docker/containers/:ro \
    --volume /var/run/docker.sock:/var/run/docker.sock:ro \
    --volume collector_data:/data/ \
    --cpus=1 \
    --cpu-shares=102 \
    --memory=256M \
    --restart=always \
    --env "COLLECTOR__SPLUNK_URL=output.splunk__url=https://hec.example.com:8088/services/collector/event/1.0" \
    --env "COLLECTOR__SPLUNK_TOKEN=output.splunk__token=B5A79AAD-D822-46CC-80D1-819F80D7BFB0" \
    --env "COLLECTOR__SPLUNK_INSECURE=output.splunk__insecure=true" \
    --env "COLLECTOR__ACCEPTLICENSE=general__acceptLicense=true" \
    --env "COLLECTOR__SPLUNK_JSON1=pipe.join::json__patternRegex=^{" \
    --env "COLLECTOR__SPLUNK_JSON2=pipe.join::json__matchRegex.docker_container_image=^python:3$" \
    --privileged \
    outcoldsolutions/collectorfordocker:3.0.86.180207
```

Let’s try to rerun our example.

```bash
docker run --name json_example --volume $PWD:/app -w /app python:3 python3 app.py
```

This time we see everything as expected in Splunk.

If you have a mix of JSON output messages with other types of messages, you can also configure the rule so that a new message is expected to start with a character that is neither } nor whitespace.

```ini
[pipe.join::json]
# Configure for which containers we want to apply this match rule. You can configure it for
# containers running with a specific image as below or with matchRegex.docker_container_name = ^json_example$
matchRegex.docker_container_image = ^python:3$
# We want to define that all messages should not start with `}` or any space characters.
patternRegex = ^[^}\s]
```

Join rules are a powerful tool. Read more about them in our documentation Join Rules.
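To build some intuition for what a start-of-message pattern does, the following Python sketch groups raw lines into events using the same regular expressions as the two rules above. It is an approximation of the idea for illustration, not collectord's implementation; the `join_lines` helper and the sample lines are made up for this example.

```python
import re


def join_lines(lines, pattern_regex):
    """Group lines into events; a line matching pattern_regex starts a new event."""
    start = re.compile(pattern_regex)
    events, current = [], []
    for line in lines:
        if start.search(line) and current:
            # A new message begins, so flush the lines collected so far.
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events


if __name__ == "__main__":
    raw = ['{"a": 1}', "{", ' "b": 2', "}", "plain text line"]
    # The second rule from above: a message starts with anything but `}` or whitespace.
    for event in join_lines(raw, r"^[^}\s]"):
        print(repr(event))
```

With the stricter first rule (written here as `^\{`), the plain text line would be appended to the preceding JSON event, which is why the second form is the better fit for mixed output.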
Getting started with Monitoring Kubernetes, OpenShift and Docker on your development box

This blog post will guide you through the process of setting up a development environment with Docker, Kubernetes, and OpenShift, and monitoring them using Splunk. The guide mostly references macOS as the development environment, but you can adjust it for other operating systems as well.

We provide configurations that work out of the box in most cases. The good thing is that most Kubernetes and OpenShift providers have very similar default configurations.

In this blog post I will guide you through the steps that you need to perform on a local development box to install all three of our main applications for Monitoring Docker, Kubernetes and OpenShift in Splunk.

If you find any issues or have trouble with this guide, please reach out to us at support@outcoldsolutions.com.

Install Splunk Enterprise using Docker

You can install Splunk as usual on your development box, or use the official Splunk Docker image to download and install Splunk Enterprise. Below is a basic configuration to get Splunk up and running; you can find more details about the available configuration options in the documentation.

Note for macOS users: in the Docker for Mac settings, under General, make sure to enable Use Rosetta for x86_64/amd64 emulation on Apple Silicon.

You can change the default password changeme to a more secure password and change the token to a unique GUID; just make sure to keep these changes in sync with the other commands below.
```bash
docker run -d --platform linux/amd64 \
    --name splunk \
    -p 8000:8000 \
    -p 8088:8088 \
    -e 'SPLUNK_GENERAL_TERMS=--accept-sgt-current-at-splunk-com' \
    -e 'SPLUNK_START_ARGS=--accept-license' \
    -e 'SPLUNK_PASSWORD=changeme' \
    -e 'SPLUNK_HEC_TOKEN=00000000-0000-0000-0000-000000000000' \
    -v splunk_etc:/opt/splunk/etc \
    -v splunk_var:/opt/splunk/var \
    splunk/splunk:latest
```

Give it a few minutes and check the logs from the container with docker logs splunk; at the end you should see the message Ansible playbook complete, will begin streaming var/log/splunk/splunkd_stderr.log.

Open Splunk Web in the browser at http://localhost:8000 and log in with the default user admin and password changeme.

Splunk comes with a trial license that allows 500MB of ingestion a day for 30 days. If you want to restart the trial license, you can stop the container, remove it, remove both volumes with docker volume rm splunk_etc splunk_var, and recreate it. If you want to run it for more than 30 days, you can get a developer license from here; you can renew it every 6 months, and it allows you to ingest up to 10GB a day.

Install applications for Monitoring Docker, OpenShift and Kubernetes

Depending on which platform you are planning to use, you can install one or more of our applications.

Assuming you are running Splunk locally, go to the page http://localhost:8000/en-US/manager/launcher/apps/local and click the Browse More Apps button in the top right corner. Then search for a specific application or just for outcold solutions.

When you click the install button, it will ask you for your Splunk credentials. Those are the credentials you use to log in to https://splunk.com, including https://splunkbase.splunk.com. If you don’t have an account, you will need to create one.
Another option is to download the applications directly from SplunkBase:

Monitoring Docker
Monitoring Kubernetes
Monitoring OpenShift

Then upload them to your Splunk instance using the same page, http://localhost:8000/en-US/manager/launcher/apps/local, with the button Install App From File.

Request development license for Collectord

You can request a development license for Collectord from here. The license will be delivered instantly to your inbox. You can renew the license every 180 days using the same form.

Install Collectord on Docker

You can follow our latest installation instructions to learn more about how to install Collectord. Below is an example of how to install Collectord on Docker so that it forwards metrics and logs to the Splunk instance that we just started. Just make sure to replace ... with the license key that you received from us.

```bash
docker run -d \
    --name collectorfordocker \
    --volume /:/rootfs/:ro \
    --volume collector_data:/data/ \
    --cpus=1 \
    --cpu-shares=204 \
    --memory=256M \
    --restart=always \
    --env "COLLECTOR__SPLUNK_URL=output.splunk__url=https://host.docker.internal:8088/services/collector/event/1.0" \
    --env "COLLECTOR__SPLUNK_TOKEN=output.splunk__token=00000000-0000-0000-0000-000000000000" \
    --env "COLLECTOR__SPLUNK_INSECURE=output.splunk__insecure=true" \
    --env "COLLECTOR__ACCEPTLICENSE=general__acceptLicense=true" \
    --env "COLLECTOR__LICENSE=general__license=..." \
    --env "COLLECTOR__CLUSTER=general__fields.docker_cluster=-" \
    --privileged \
    outcoldsolutions/collectorfordocker:26.04.3
```

Navigate to the Monitoring Docker application in Splunk: http://localhost:8000/en-US/manager/monitoringdocker/

After that, we highly recommend looking at the use cases showing how you can configure forwarding pipelines with annotations.

Kubernetes using minikube

The easiest way to start playing with Kubernetes is to use minikube. You can follow the official documentation to learn how to install it.
But if you already have Homebrew installed on your Mac, you can install minikube with brew install vfkit minikube kubernetes-cli (this command also installs the vfkit driver and kubectl).

Depending on your system, you might want to change the default configuration:

minikube config set cpus 4 - adjust CPU (depends on how much CPU your development box has)
minikube config set disk-size 60GB - give more disk space to the minikube VM (default is 20GB)
minikube config set memory 4096 - provide more memory (default is 2048)

On macOS, change the default driver to vfkit and the container runtime to containerd:

minikube config set driver vfkit
minikube config set container-runtime containerd

After that you can start a local instance with the command

```bash
minikube start
```

Install Collectord to start collecting and forwarding metrics and logs to the Splunk instance that we just started. For simplicity we will use helm to install Collectord.

Install helm with Homebrew

```bash
brew install helm
```

And after that install Collectord (make sure to replace ... with the license key that you received from us)

```bash
helm install collectorforkubernetes \
    --namespace collectorforkubernetes \
    --create-namespace \
    --set collectord.configuration.general.acceptLicense=true \
    --set collectord.configuration.general.license=... \
    --set collectord.configuration.outputs.splunk.default.url=https://host.minikube.internal:8088/services/collector/event/1.0 \
    --set collectord.configuration.outputs.splunk.default.token=00000000-0000-0000-0000-000000000000 \
    --set collectord.configuration.outputs.splunk.default.insecure=true \
    oci://registry-1.docker.io/outcoldsolutions/collectord-splunk-kubernetes
```

Navigate to the Monitoring Kubernetes application in Splunk: http://localhost:8000/en-US/app/monitoringkubernetes/.

You can try the use cases showing how to transform and forward additional data with the annotations for Collectord.
OpenShift using crc

Installing a local OpenShift using crc is pretty straightforward as well. But to use it with the official OpenShift distribution, you will need to create a RedHat Account. If for some reason you cannot create a RedHat account, you can still use crc to install the OKD distribution of OpenShift.

To use crc with okd, you can download it from here and set the preset to okd with crc config set preset okd. See Running OKD with CRC.

```bash
crc config set kubeadmin-password kubeadmin
crc config set memory 12288
crc config set disk-size 100
crc config set cpus 8
crc config set host-network-access true
```

With the OpenShift preset you will also need to set crc config set pull-secret-file ~/Downloads/pull-secret (where ~/Downloads/pull-secret is the path to the pull secret file that you download from your RedHat Account).

Now just set up and start crc:

```bash
crc setup
crc start
```

After that you can get access to the oc tool (which is just an OpenShift version of kubectl) and log in to the cluster:

```bash
eval $(crc oc-env)
oc login -u kubeadmin
```

If you have not done so already, install helm with Homebrew

```bash
brew install helm
```

And install Collectord on the cluster (make sure to replace ... with the license key that you received from us)

```bash
helm install collectorforopenshift \
    --namespace collectorforopenshift \
    --create-namespace \
    --set collectord.configuration.general.acceptLicense=true \
    --set collectord.configuration.general.license=... \
    --set collectord.configuration.outputs.splunk.default.url=https://host.crc.testing:8088/services/collector/event/1.0 \
    --set collectord.configuration.outputs.splunk.default.token=00000000-0000-0000-0000-000000000000 \
    --set collectord.configuration.outputs.splunk.default.insecure=true \
    oci://registry-1.docker.io/outcoldsolutions/collectord-splunk-openshift
```

Navigate to the Monitoring OpenShift application in Splunk: http://localhost:8000/en-US/app/monitoringopenshift/.
You can try the use cases showing how to transform and forward additional data with the annotations for Collectord.

Learning Kubernetes and OpenShift

If you have just started learning Kubernetes and OpenShift, take a look at the Kubernetes tutorials, the OKD documentation, and the Red Hat OpenShift Documentation.

Integrating OpenShift Web Console 4.x with Monitoring OpenShift application in Splunk

Compared to OpenShift 3.11, OpenShift 4.x looks completely different. During the first releases of OpenShift 4.x we suggested some missing features; one of them was integration with the Web Console. See github.com/openshift/console: Pod Log Links Extension.

For OpenShift 3.x, look at Monitoring OpenShift in Splunk: integration with Web Console.

OpenShift version 4.2 gained a feature for adding links to external logging solutions, so you can integrate the Web Console with our Monitoring OpenShift application in Splunk Enterprise or Splunk Cloud.

Let’s walk through the steps for the integration. The official OpenShift documentation is available at Defining a template for an external log link.

Navigate, as suggested, to the Custom Resource Definitions and find ConsoleExternalLogLink.

On the Instances tab, click the button Create Console External Log Link.

Define the YAML as in the example below (replace https://search.splunk.outcold.vmlocal:8000 with the URL of your Splunk Search Head). The Web Console expects only https links.
```yaml
apiVersion: console.openshift.io/v1
kind: ConsoleExternalLogLink
metadata:
  name: monitoring-openshift
spec:
  hrefTemplate: >-
    https://search.splunk.outcold.vmlocal:8000/en-US/app/monitoringopenshift/search?q=search%20%60macro_openshift_logs%60%20openshift_pod_id%3D%22${resourceUID}%22%20openshift_container_name%3D%22${containerName}%22%20openshift_namespace%3D%22${resourceNamespace}%22
  text: Monitoring OpenShift
```

After that, you can go to the Pod page in the OpenShift Web Console and open the logs in Splunk.

When you click on Monitoring OpenShift, a window opens with the logs in the Monitoring OpenShift application.

Layering Collectord annotations: pod, namespace, and Configuration CRD

Collectord lets app teams own how their data gets forwarded - without anyone touching a central config. The harder question is where each annotation should live: on the pod, on the workload, on the namespace, or in a cluster-level Configuration CRD. Each layer has a different audience, a different blast radius, and slightly different precedence rules.

This post is the long version of the annotations docs, aimed at platform teams running Collectord across many tenants. Examples are Kubernetes; everything carries over to OpenShift (oc instead of kubectl).

Why annotations exist in the first place

Before per-resource annotations, the only way to tell a forwarder “send the payments namespace to the payments Splunk index” was to edit a giant central config - a Splunk forwarder inputs.conf, an OpenTelemetry pipeline, a Fluentd <match> block. Every team that wanted a routing change filed a ticket with the platform team, and the platform team became the bottleneck for changes that should have taken five minutes.

Annotations move that decision back to the team that owns the workload.
The app team annotates their namespace, deployment, or pod; Collectord reads the annotation at pod startup and routes accordingly. The platform team configures Collectord once; everything beyond that is self-service.

But fully self-service is rarely what large organizations want either - compliance often needs the platform team to enforce a non-negotiable rule (mandatory PII masking, a required audit index). That’s where the Configuration CRD comes in: a layer that lets the platform team set policy across teams without editing the ConfigMap or visiting every namespace.

The layers, in order of precedence

When a pod starts, Collectord assembles its effective annotation set by walking five sources, from highest precedence to lowest:

1. Pod - annotations on the pod itself (or its template inside a Deployment / StatefulSet / DaemonSet).
2. Workload - annotations on the owning Deployment, StatefulSet, or DaemonSet.
3. Namespace - annotations on the namespace.
4. Configuration CRD - collectord.io/v1/Configuration resources whose spec regex matches the pod’s metadata.
5. ConfigMap defaults - what’s in 001-general.conf / 002-daemonset.conf / 004-addon.conf for the Collectord pods themselves.

The first layer to set a given annotation wins - pod beats workload beats namespace beats CRD beats ConfigMap. That matches the intuition: the closer to the data, the more authoritative the override.

Figure: Same annotation, four layers - Pod wins
Pod: collectord.io/logs-index: kubernetes_team_x
Workload: no annotation set
Namespace: collectord.io/logs-index: kubernetes_payments
Configuration CRD: collectord.io/logs-index: kubernetes_default
Resolved: logs-index = kubernetes_team_x [pod], which overrides kubernetes_payments [namespace] and kubernetes_default [configuration:cluster-default]

The one exception is force: true on a Configuration CRD, which lets the platform team flip that order for a specific rule. We’ll come back to it below.

When to use which layer

Where does this annotation belong?
- Pod: specific to one pod - container layout, log volume, mount path. Example: collectord.io/volume.1-logs-name: logs
- Workload: same for every replica of one Deployment / StatefulSet / DaemonSet. Example: collectord.io/output: splunk::prod1
- Namespace: default for an entire team - index routing, output, masking. Example: collectord.io/index: kubernetes_payments
- Configuration CRD: cluster-wide rule by metadata regex - not tied to one namespace. Example: spec.kubernetes_namespace: ".+-prod$"

A simple decision guide for a brand-new annotation:

1. Does this only apply to one specific pod? Put it on the Pod (or the Deployment template - same effect for replicas).
2. Does it apply to every replica of one workload? Put it on the Deployment / StatefulSet / DaemonSet.
3. Does it apply to everything in one team’s namespace? Put it on the Namespace. This is by far the most common spot - index routing, output selection, and per-team defaults belong here.
4. Does it apply to everything matching some metadata pattern, regardless of who owns the namespace? That’s a Configuration CRD job.

End-to-end examples for each:

Pod / workload - local quirks

A Tomcat pod writes its access logs and catalina.out to /usr/local/tomcat/logs/.
Pointing Collectord at that volume - and parsing the timestamp out of each line - only makes sense for this container layout, so the annotations belong on the Pod (or the Deployment template, which propagates to every replica):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tomcat
  annotations:
    # tell Collectord which volume holds the logs
    collectord.io/volume.1-logs-name: 'logs'
    collectord.io/volume.1-logs-type: 'tomcat_log'
    # parse the timestamp out of `02-Jan-2026 15:04:05.123 INFO ...` so _time matches the log line
    collectord.io/volume.1-logs-extraction: '^(?P<ts>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) (.+)$'
    collectord.io/volume.1-logs-timestampfield: 'ts'
    collectord.io/volume.1-logs-timestampformat: '02-Jan-2006 15:04:05.000'
spec:
  containers:
  - name: tomcat
    image: tomcat:9
    volumeMounts:
    - name: logs
      mountPath: /usr/local/tomcat/logs/
  volumes:
  - name: logs
    emptyDir: {}
```

This is the kind of configuration that has to live next to the workload - only this image writes to that path, only this format needs that timestamp regex. Pushing it up to a Namespace would force every other workload in the namespace to know about Tomcat’s quirks.

If every replica of a Deployment needs the same annotation, set it on the spec.template.metadata.annotations field of the Deployment - Collectord reads the resulting Pod’s annotations, which are identical for every replica.

Namespace - per-team defaults

The team that owns payments wants their data in their own Splunk index for chargeback and access control:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  annotations:
    collectord.io/index: kubernetes_payments
```

Every pod in payments - current and future - inherits this. New apps deploy and route correctly with zero per-pod work, and the team can still override anything they need at the pod level.
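The inherit-and-override behaviour described so far can be pictured as a first-match-wins lookup across layers. The following is only a sketch in Python, not Collectord's implementation: the `resolve` function, the layer keys, and the dictionaries are hypothetical names made up for illustration, and the `force: true` exception is deliberately left out.

```python
# Hypothetical sketch of first-match-wins resolution across layers.
# Layer order follows the post: pod > workload > namespace > Configuration CRD > ConfigMap.
LAYER_ORDER = ["pod", "workload", "namespace", "configuration", "configmap"]


def resolve(annotation, layers):
    """Return (value, origin) from the highest-precedence layer that sets `annotation`."""
    for name in LAYER_ORDER:
        value = layers.get(name, {}).get(annotation)
        if value is not None:
            return value, name
    return None, None


# Illustrative values only: the same annotation set at three of the layers.
layers = {
    "pod": {"collectord.io/logs-index": "kubernetes_team_x"},
    "workload": {},
    "namespace": {"collectord.io/logs-index": "kubernetes_payments"},
    "configuration": {"collectord.io/logs-index": "kubernetes_default"},
}

print(resolve("collectord.io/logs-index", layers))
```

Removing the pod entry from `layers` makes the namespace value the resolved one, which is the "inherit unless overridden" behaviour a namespace-level default relies on.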
Configuration CRD - when the rule isn’t tied to a namespace

What if the rule isn’t scoped to a namespace or a workload, but to a property - every namespace whose name ends in -prod, every pod with the tier=frontend label, every container running an nginx image? Repeating the same namespace-level annotation across dozens of unrelated namespaces doesn’t scale.

The platform team writes a Configuration resource that names the rule and the targets. Below, every namespace whose name ends in -prod routes its data to a shared kubernetes_prod index - no per-namespace annotation needed:

```yaml
apiVersion: "collectord.io/v1"
kind: Configuration
metadata:
  name: route-prod-namespaces
  annotations:
    collectord.io/index: kubernetes_prod
spec:
  kubernetes_namespace: ".+-prod$"
```

This is the same annotation you’d put on a Namespace - collectord.io/index - but applied via metadata regex instead of one namespace at a time. New *-prod namespaces start routing correctly the moment they appear.

Multi-container pods: different rules per container

A common Kubernetes pattern is multi-container pods - a primary container alongside one or more sidecars (auth proxies, audit forwarders, log shippers, service meshes). Each container in the same pod often produces wildly different logs: a web container emits high-volume access logs, an audit-logger sidecar emits low-volume but security-critical events, and an envoy proxy emits debug noise that’s already covered by its metrics.

The natural temptation is to treat the whole pod as one unit, but Collectord lets you scope every annotation to a single container by prefixing it with the container’s name and a double-dash:

- collectord.io/{annotation} - applies to every container in the pod.
- collectord.io/{container_name}--{annotation} - applies only to that named container.
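One way to picture the two annotation forms is as a small filter that keeps the unprefixed annotations plus those prefixed with the target container's name. This is a hedged Python sketch, not Collectord's code: `annotations_for` and the sample container names are hypothetical, and letting a container-scoped value override a pod-wide value of the same name is an assumption made here for illustration.

```python
PREFIX = "collectord.io/"


def annotations_for(container, annotations):
    """Return the annotations that apply to `container`, with the container prefix stripped."""
    effective = {}
    scoped = {}
    for key, value in annotations.items():
        if not key.startswith(PREFIX):
            continue  # not a Collectord annotation
        name = key[len(PREFIX):]
        if "--" in name:
            # {container_name}--{annotation}: keep it only for the named container
            target, _, rest = name.partition("--")
            if target == container:
                scoped[rest] = value
        else:
            # {annotation}: applies to every container in the pod
            effective[name] = value
    # Assumption: a container-scoped value wins over a pod-wide value with the same name.
    effective.update(scoped)
    return effective


# Hypothetical pod with two containers, "app" and "proxy".
sample = {
    "collectord.io/app--logs-index": "idx_app",
    "collectord.io/proxy--logs-disabled": "true",
    "collectord.io/logs-index": "idx_default",
    "collectord.io/userfields.cost_center": "CC-1",
}

print(annotations_for("app", sample))
print(annotations_for("proxy", sample))
```

A prefix that does not match any container name simply never shows up in a result, which is why a typo in the container name is easy to miss.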
Below, a webportal pod has three containers and we want very different things for each:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webportal
  annotations:
    # web container - access logs to a low-retention index, custom sourcetype
    collectord.io/web--logs-index: 'kubernetes_webportal_access'
    collectord.io/web--logs-type: 'nginx_access'

    # audit-logger sidecar - security index, mandatory PII masking, custom sourcetype
    collectord.io/audit-logger--logs-index: 'kubernetes_security_audit'
    collectord.io/audit-logger--logs-type: 'webportal_audit'
    collectord.io/audit-logger--logs-replace.1-search: '(\d{1,3}\.){3}\d{1,3}'
    collectord.io/audit-logger--logs-replace.1-val: 'X.X.X.X'

    # envoy proxy - drop logs entirely; we already have its metrics
    collectord.io/envoy--logs-disabled: 'true'

    # untagged annotation - applies to every container in the pod
    collectord.io/userfields.cost_center: 'CC-1234'
spec:
  containers:
  - name: web
    image: nginx
  - name: audit-logger
    image: myregistry.io/audit-logger:1.4
  - name: envoy
    image: envoyproxy/envoy:v1.28
```

Each container’s logs land in a different Splunk index with a different sourcetype, so downstream searches and dashboards see clean, datasource-tagged events. The audit container gets PII masking that the web container doesn’t need. The envoy container is silenced at the source. And the cost_center: 'CC-1234' user field - set without a container prefix - gets attached to every event from every container in this pod.

logs-disabled vs logs-output: devnull: both stop data from reaching Splunk, but they leave Collectord in different states. With logs-output: 'devnull', Collectord still reads the log files and advances its position tracker - it just acks the events without doing anything with them (no pipes, no forwarding).
If you switch the container back to splunk later, forwarding resumes from the moment of the switch - everything that happened during the devnull window is gone for good. With logs-disabled: 'true', Collectord doesn’t read the file at all and the position tracker doesn’t move; re-enabling later replays from wherever it last left off, which for a brand-new container means the beginning of the file. Pick devnull when you want to mute a chatty container now and not backfill if you re-enable. Pick disabled when you want to leave the door open to going back and forwarding everything from the start.

The container prefix is a Pod / Workload / Namespace / Configuration CRD concept - it works at every layer. A platform-team Configuration CRD can scope its annotations to one container too:

```yaml
apiVersion: "collectord.io/v1"
kind: Configuration
metadata:
  name: mask-ips-on-nginx
  annotations:
    # only the nginx container in any pod gets this masking
    collectord.io/nginx--logs-replace.1-search: '(\d{1,3}\.){3}\d{1,3}'
    collectord.io/nginx--logs-replace.1-val: 'X.X.X.X'
spec:
  kubernetes_container_name: "^nginx$"
```

Stdout vs stderr is separate from container scoping. Use stdout- and stderr- to split the two streams of one container; use the container prefix to split containers from each other. They compose: collectord.io/web--stderr-logs-type: 'nginx_error' is a valid annotation that targets the web container’s stderr stream specifically.

A tour of what annotations can do

Annotations control everything from where data lands in Splunk to whether it shows up at all. The sections above focus on where to put annotations; this section is a topical tour of what they can do. For the exhaustive list, see the Annotations reference.

Routing - index, source, sourcetype, host, output

The four big knobs are index, source, type (sourcetype), and host.
Each comes in a generic catch-all form (collectord.io/index) that applies to every datatype, and a per-datatype form for finer control:

Generic: collectord.io/index
Container logs: logs-index
Container stats: stats-index
Process / network stats: procstats-index, netstats-index, nettable-index
Events (namespace-only): events-index
App logs (volume): volume.{N}-logs-index
Prometheus: prometheus.{N}-index

source, type, host, and output follow the same pattern. Most clusters set a generic collectord.io/index at the namespace level for everything, then override one or two datatypes when retention or access control demands it - for example, keeping logs in kubernetes_payments and metrics in a smaller kubernetes_payments_metrics index with longer retention.

Splitting one stream into multiple sourcetypes

A single container often emits multiple log formats on the same stream - an nginx container writes both access logs (starting with an IP) and error logs (starting with a date). Override pipes split that stream at ingest time so each format gets its own sourcetype and source:

```yaml
collectord.io/logs-override.1-match: '^(\d{1,3}\.){3}\d{1,3}'
collectord.io/logs-override.1-source: '/kubernetes/nginx/access'
collectord.io/logs-override.1-type: 'nginx_access'

collectord.io/logs-override.2-match: '^\d{4}/\d{2}/\d{2}'
collectord.io/logs-override.2-source: '/kubernetes/nginx/error'
collectord.io/logs-override.2-type: 'nginx_error'
```

Lines matching the IP regex get the access-log routing; lines matching the date regex get the error-log routing; anything else keeps the container default.

Content transformation - replace, hashing, whitelist

Three pipes operate on log content before it reaches Splunk:

Replace - find a regex, substitute a value. Mask PII, drop noisy lines (replace with empty string), or rewrite. Pipes apply in numeric order - replace.1 runs before replace.2, so you can chain a “drop noise” pipe before a “mask PII” pipe.
```yaml
collectord.io/logs-replace.1-search: '(\d{1,3}\.){3}\d{1,3}'
collectord.io/logs-replace.1-val: 'X.X.X.X'
```

Use ${groupname} in the replacement to reference named capture groups: (?P<IPv4p1>\d{1,3})(\.\d{1,3}){3} with replacement ${IPv4p1}.X.X.X keeps the first octet and masks the rest.

Hashing - replace a regex match with a deterministic hash. Use this instead of replace when you need to correlate events on a sensitive value without sending the value itself:

```yaml
collectord.io/logs-hashing.1-match: '(\d{1,3}\.){3}\d{1,3}'
collectord.io/logs-hashing.1-function: 'fnv-1a-64'
```

Searching for the hash of a known IP finds every line that contained that IP - but the IP itself never reaches Splunk. fnv-1a-64 is the cheapest non-cryptographic option and is fine for correlation; use sha256 if you have a security requirement that demands a cryptographic hash.

Whitelist - only forward events matching a regex; drop everything else. Cheaper than chained replace calls when the keep-list is small:

```yaml
collectord.io/logs-whitelist: '((DELETE)|(POST))$'
```

Field extraction and timestamp parsing

Field extraction at ingest time pulls structured values out of unstructured log lines and indexes them as fields, so searches can use those fields rather than scanning _raw. On high-volume indexes the search-time gain can be substantial.

```yaml
collectord.io/logs-extraction: '^(?P<ip>[^\s]+) .* \[(?P<ts>[^\]]+)\] (.+)$'
collectord.io/logs-timestampfield: 'ts'
collectord.io/logs-timestampformat: '02/Jan/2006:15:04:05 -0700'
```

The first unnamed capture group becomes _raw (override with logs-extractionMessageField). When timestampfield is set, the parsed timestamp overrides ingest time as _time - important when log files are batched, replayed, or affected by clock skew. Collectord uses Go’s time parser, where a format is written by showing how the reference date Mon Jan 2 15:04:05 MST 2006 would appear in your logs. For unix epoch timestamps (5.24.440+), use the format string @unixtimestamp.
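To see what the extraction regex and timestamp format above do to a single line, here is a small Python sketch. It is an illustration only: Collectord is not written in Python, the `extract` helper and the sample log line are hypothetical, and the `strptime` format string is a hand translation of the Go layout '02/Jan/2006:15:04:05 -0700'. Dropping `ts` from the returned fields is also a choice made here, not documented behaviour.

```python
import re
from datetime import datetime

# Same pattern as the logs-extraction annotation above.
EXTRACTION = re.compile(r'^(?P<ip>[^\s]+) .* \[(?P<ts>[^\]]+)\] (.+)$')
# Hand-translated strptime equivalent of the Go layout '02/Jan/2006:15:04:05 -0700'.
STRPTIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def extract(line):
    """Return a dict with _time, _raw and the named fields, or None if the line does not match."""
    match = EXTRACTION.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The timestamp field is parsed and used as the event time.
    event_time = datetime.strptime(fields.pop("ts"), STRPTIME_FORMAT)
    # Group 3 is the first unnamed capture group; it plays the role of _raw.
    return {"_time": event_time, "_raw": match.group(3), **fields}


# Hypothetical access-log line.
line = '10.0.0.7 - - [02/Jan/2026:15:04:05 -0700] "GET /health HTTP/1.1" 200'
event = extract(line)
print(event["ip"], event["_time"].isoformat(), event["_raw"])
```

The point of the sketch is that `_time` comes from the text inside the brackets, not from the moment the line was read.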
Multiline events

The default logs-eventpattern is ^[^\s] - any line not starting with whitespace begins a new event, which handles most stack traces. Override per-container when continuation lines start in column 0:

```yaml
# Java/Elasticsearch logs where every event begins with `[`
collectord.io/logs-eventpattern: '^\['
```

Volume control - sampling and throttling

When Splunk costs are a concern or one chatty container would otherwise starve everyone else on the node, four annotations cap or reduce log volume:

- logs-sampling-percent - keep N% of lines randomly. Good for trend-only signals (error rates, latency distributions).
- logs-sampling-key - combined with sampling-percent, hash on a key (user ID, session, request ID) so all events sharing that key are kept-or-dropped together. Preserves per-user investigation that random sampling breaks.
- logs-ThruputPerSecond - hard rate cap (128Kb, 1MiB/s). Anything over the limit is dropped, not buffered.
- logs-TooOldEvents / logs-TooNewEvents - ignore events with timestamps outside a window around “now”. Prevents replaying weeks of old logs after a restart, or rejecting future-dated events from a misconfigured container clock.

Each has a volume.{N}-logs- variant for application logs from mounted volumes, and stdout-/stderr- variants for splitting per stream.

Custom indexed fields with userfields

Tag every event from a pod with indexed fields - useful for cost-center reporting, environment tags, or service IDs without modifying the application:

```yaml
collectord.io/userfields.cost_center: 'CC-1234'
collectord.io/userfields.environment: 'production'
collectord.io/userfields.service_id: 'webportal'
```

Each appears as an indexed field in Splunk you can | stats over. Per-datatype variants exist (logs-userfields.{name}, stats-userfields.{name}, volume.{N}-logs-userfields.{name}, events-userfields.{name}) when you want the field on logs but not metrics, or vice versa.
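Going back to the multiline rule earlier in this part of the post, the effect of an event pattern can be sketched in a few lines of Python. This is an illustration under assumptions, not Collectord's implementation: `group_events` is a hypothetical helper, Python's `re` stands in for Go's regexp engine, and joining continuation lines with a newline is a simplification.

```python
import re


def group_events(lines, event_pattern=r'^[^\s]'):
    """Group lines into events: a line matching `event_pattern` starts a new event."""
    start = re.compile(event_pattern)
    events = []
    for line in lines:
        if start.search(line) or not events:
            events.append(line)
        else:
            # Continuation line: glue it onto the event that is currently open.
            events[-1] += "\n" + line
    return events


# Default pattern: indented stack-trace lines fold into the preceding event.
trace = ["ERROR boom", "  at a.b(C.java:1)", "  at d.e(F.java:2)", "INFO ok"]
print(len(group_events(trace)))

# Custom pattern: only lines starting with `[` begin a new event,
# so a continuation line starting in column 0 stays with its event.
bracketed = ["[2026-01-02] start", "Caused by: x", "[2026-01-02] next"]
print(len(group_events(bracketed, r'^\[')))
```

With the default pattern the second sample would split into three events, because "Caused by: x" starts in column 0; that is the case the override exists for.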
Application logs from mounted volumes

When an app writes logs to a file rather than stdout - common for audit logs, GC logs, slow-query logs, anything that needs to survive a process restart - declare the volume with volume.{N}-logs-name and Collectord auto-discovers files on it (no sidecar required). Every container-log annotation has a volume.{N}-logs- analog: volume.1-logs-replace, volume.1-logs-extraction, volume.1-logs-sampling-percent, etc.

```yaml
collectord.io/volume.1-logs-name: 'audit-logs'
collectord.io/volume.1-logs-glob: '*.log' # files to match (default *.log*)
collectord.io/volume.1-logs-type: 'audit_log'
collectord.io/volume.1-logs-recursive: 'true' # walk subdirectories
```

A pod can declare multiple volumes (volume.1-, volume.2-, …). Collectord supports emptyDir, hostPath, and persistentVolumeClaim. For PVC-backed volumes that move between nodes, set volume.{N}-logs-onvolumedatabase: 'true' so the position-tracking database lives on the volume itself - otherwise the new node replays from the start.

Prometheus auto-discovery

Annotations make Collectord a per-pod Prometheus scraper - no central scrape config needed:

```yaml
collectord.io/prometheus.1-port: '9527'
collectord.io/prometheus.1-path: '/metrics'
collectord.io/prometheus.1-interval: '60s'
collectord.io/prometheus.1-whitelist: '^(http_requests|process_cpu)_.+'
```

A pod can expose multiple endpoints (prometheus.1-*, prometheus.2-*). For HTTPS, set scheme: 'https' plus either insecure: 'true' or caname for verification. For protected endpoints, set username/password (basic auth) or authorizationkey.

Annotations on Docker containers work the same way - both collectord.io/{annotation} and io.collectord.{annotation} label forms are accepted.
To send Prometheus metrics to a Splunk metrics-type index instead of the default events index:

```yaml
collectord.io/prometheus.1-output: 'splunk::metrics'
collectord.io/prometheus.1-index: 'kubernetes_metrics'
collectord.io/prometheus.1-indexType: 'metrics'
```

The HEC token behind splunk::metrics must have a metrics-type index as its default - a standard event-token rejects metrics writes.

Sending to multiple Splunk outputs at once

Sometimes the same log line needs to land in two places - a security index for SIEM and an apps index for developers. Comma-separate output names in collectord.io/logs-output:

```yaml
collectord.io/logs-output: 'splunk::apps[kubernetes_logs],splunk::security[kubernetes_security]'
```

Each event is sent to both endpoints. The square brackets override the index per output so each side gets the right index without you needing two different annotations.

User outputs - SplunkOutput CRD

In a multi-tenant cluster the platform team owns Collectord but app teams want to define their own Splunk destinations without filing a ticket to edit the central ConfigMap. The SplunkOutput CRD lets a team declare a destination in their own namespace and reference it from their workloads:

```yaml
apiVersion: "collectord.io/v1"
kind: SplunkOutput
metadata:
  namespace: payments
  name: payments-team-splunk
spec:
  url: https://splunk.payments.example.com:8088/services/collector/event/1.0
  token: 1a8b9c3e-7789-4353-821f-15b9662bac99 # or reference a Secret since 25.10
  insecure: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: payments
  name: payments-api
spec:
  template:
    metadata:
      annotations:
        collectord.io/output: 'splunk::user/payments/payments-team-splunk'
    spec:
      containers:
      - name: api
        image: myregistry.io/payments-api:2.4
```

The reference format is splunk::user/<namespace>/<name>. Since 25.10, tokens can be referenced from Secrets instead of inlined in the CRD - see the 25.10 release notes.
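The output values shown in this part of the post come in two shapes: a comma-separated list with an optional `[index]` suffix, and a `splunk::user/<namespace>/<name>` reference. A rough Python sketch of how such a value could be split apart follows; `parse_outputs` and its regex are hypothetical and say nothing about how Collectord actually parses the annotation.

```python
import re

# One output reference with an optional [index] override (hypothetical grammar).
OUTPUT_RE = re.compile(r'^(?P<output>[^\[\],]+)(?:\[(?P<index>[^\[\]]+)\])?$')


def parse_outputs(value):
    """Split an output annotation value into (output_name, index_override_or_None) pairs."""
    pairs = []
    for part in value.split(","):
        match = OUTPUT_RE.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognised output reference: {part!r}")
        pairs.append((match.group("output"), match.group("index")))
    return pairs


print(parse_outputs("splunk::apps[kubernetes_logs],splunk::security[kubernetes_security]"))
print(parse_outputs("splunk::user/payments/payments-team-splunk"))
```

Written out like this, the bracket suffix reads as a per-destination index override attached to one output name, while the user reference is just a longer output name with no override.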
Inside the Configuration CRD

A few details worth knowing before you write more than the trivial example.

Match fields and AND semantics

The spec is a flat map of metadata-field-name → regex pattern. Common fields you’ll match on:

- kubernetes_namespace
- kubernetes_pod_name
- kubernetes_pod_labels
- kubernetes_container_name
- kubernetes_container_image
- kubernetes_daemonset_name

You can match on any field Collectord forwards as event metadata. When you specify more than one, all must match - combinations are logical AND:

```yaml
spec:
  kubernetes_namespace: "^.+-prod$"
  kubernetes_container_image: "^myregistry\\.io/audit-logger:.+$"
```

This matches an audit-logger image only in production namespaces.

Regexes are unanchored - anchor them yourself

Collectord uses Go’s regexp.MatchString, which returns true on a substring match. kubernetes_container_image: "nginx" will also match nginx-ingress, bitnami/nginx-exporter, and anything else with nginx somewhere in the image string. Always anchor (^...$) when you mean an exact name - "^nginx(:.*)?$" matches the official nginx image with any tag, and nothing else.

Match by pod label

kubernetes_pod_labels is a multi-value field - every label on the pod becomes its own key=value entry, and the CRD regex is tested against each entry independently. To match pods carrying tier=frontend, write a regex that matches the full key=value string:

```yaml
spec:
  kubernetes_pod_labels: "(?:^|,)tier=frontend(?:,|$)"
```

A bare tier=frontend matches any entry containing that substring - tier=frontend-canary would slip through. The (?:^|,) / (?:,|$) boundaries pin the match to a complete entry; ^tier=frontend$ works just as well. Stick with the comma-tolerant form if you’d like the same regex to be safe against any future joined-string representation.
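The matching rules above (AND across fields, unanchored regexes, per-entry testing of labels) can be tried out with a short Python sketch. Python's `re.search` is used here as a stand-in for Go's `regexp.MatchString`, since both succeed on a substring match; the `matches` helper and the sample metadata are hypothetical and only mirror the behaviour described in the text.

```python
import re


def matches(spec, metadata):
    """AND across fields; a multi-value field matches if any single entry matches."""
    for field, pattern in spec.items():
        values = metadata.get(field, [])
        if isinstance(values, str):
            values = [values]
        if not any(re.search(pattern, value) for value in values):
            return False
    return True


# Hypothetical pod metadata.
pod = {
    "kubernetes_namespace": "payments-prod",
    "kubernetes_container_image": "myregistry.io/audit-logger:1.4",
    "kubernetes_pod_labels": ["app=web", "tier=frontend-canary"],
}

# Unanchored: the canary label slips through.
print(matches({"kubernetes_pod_labels": "tier=frontend"}, pod))
# Anchored to a complete entry: the canary label no longer matches.
print(matches({"kubernetes_pod_labels": "(?:^|,)tier=frontend(?:,|$)"}, pod))
# Two fields: both must match.
print(matches({
    "kubernetes_namespace": "^.+-prod$",
    "kubernetes_container_image": r"^myregistry\.io/audit-logger:.+$",
}, pod))
```

The same substring effect applies to image names: the pattern `nginx` finds a match inside `bitnami/nginx-exporter`, while `^nginx(:.*)?$` does not.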
Multiple CRDs matching the same pod

Nothing stops you from having ten Configuration resources that all match the same pod - that’s normal as your policy library grows (one CRD per concern: PII, retention, output routing, throttling). Collectord applies each matching CRD in turn; the first one to set a given annotation wins, and a later CRD only overrides if it uses force: true (next section). Different CRDs setting different annotations layer cleanly.

Cluster-scoped, watched live

Configuration is a cluster-scoped resource (no namespace). Collectord watches the CRD continuously, the same way it watches Pods - kubectl apply a new Configuration and Collectord reapplies the merged annotation set on the next event from the affected pods, no restart needed.

When the platform team needs to win: force: true

Available since Collectord version 5.19.390

The default specificity order is what most app teams want - they can override anything from above. But it’s the wrong default when the platform team is enforcing policy. If a Configuration says “audit logs from production namespaces always go to the kubernetes_audit_secured index,” an app team should not be able to flip that with a pod annotation.

Set force: true at the top level of the CRD (sibling to spec) and the CRD’s annotations beat anything below them:

```yaml
apiVersion: "collectord.io/v1"
kind: Configuration
metadata:
  name: mandatory-audit-index
  annotations:
    collectord.io/audit-logger--logs-index: 'kubernetes_audit_secured'
    collectord.io/audit-logger--logs-replace.1-search: '(\d{3}-\d{2}-\d{4})'
    collectord.io/audit-logger--logs-replace.1-val: 'XXX-XX-XXXX'
spec:
  kubernetes_container_name: "^audit-logger$"
force: true
```

The same CRD does two things at once: routes every audit-logger container’s logs to the secured audit index, and masks anything resembling a Social Security Number on the way through.
Even if a workload sets collectord.io/audit-logger--logs-index: my_team_index on its template, the forced CRD wins.

Specificity still beats force

There’s one subtlety: force: true makes a CRD beat the same annotation set lower down. It does not promote a generic annotation over a more specific one.

collectord.io/logs-index is more specific than collectord.io/index - index applies to every datatype, logs-index only to container logs. A pod-level collectord.io/logs-index: foo will still beat a forced Configuration setting collectord.io/index: bar, because the pod is targeting logs directly while the CRD is targeting all data.

The mechanics are worth knowing because they explain why this “leak” is harmless: the CRD’s index: bar is applied (force or not - the pod didn’t set index, so the merge accepts it), and that value still routes everything else from this pod - stats, events, process metrics - to bar. It just loses out on container logs, where the more-specific logs-index: foo resolves first. So the platform team’s intent (bar for everything not otherwise specified) and the app team’s override (foo for logs only) compose cleanly without one silently swallowing the other.

If you need to lock container logs down specifically, force the most specific form - logs-index, not index.

Debugging: where did this annotation come from?

By the time you have pod-level overrides, namespace defaults, and three or four Configuration CRDs in flight, “what’s actually applied to this pod?” gets hard to answer by reading manifests. collectord describe is the single source of truth - it asks Collectord to compute the merged annotation set for one pod and prints each one tagged with its origin.
Starting in 26.04, each resolved field carries a bracketed source tag:

```bash
kubectl exec -n collectorforkubernetes \
  collectorforkubernetes-fqhmv -- \
  /collectord describe \
  --namespace payments \
  --pod webportal-7c9f8d-xqz2t \
  --container nginx | grep '\['
```

```text
logs-index [namespace] = kubernetes_payments
logs-replace.1-search [configuration:mask-ips-on-nginx] = (\d{1,3}\.){3}\d{1,3}
logs-replace.1-val [configuration:mask-ips-on-nginx] = X.X.X.X
volume.1-logs-name [pod] = logs
```

That’s a layered config working exactly as designed: the team owns the index (namespace), the platform team enforces masking (CRD), and the app declares its log volume (pod).

Describe strips the container-name prefix once it’s resolved against the target container, so a CRD annotation like collectord.io/nginx--logs-replace.1-search shows up as logs-replace.1-search when you’re describing the nginx container.

The [configuration:<name>] tag was added in 26.04 - see the release notes and Troubleshooting → Describe.

Common gotchas

A short collection of things customers run into:

- The match regex isn’t anchored. kubernetes_container_image: "nginx" matches nginx-ingress too. Anchor with ^...$. The same applies to kubernetes_namespace, kubernetes_pod_name, kubernetes_container_name - substrings match by default.
- Container prefix doesn’t match the container name. collectord.io/web--logs-index: ... only applies if the container is named web. Typos in the container name silently drop the annotation - Collectord won’t warn you. Run collectord describe --container <name> to confirm.
- force: true on a generic annotation doesn’t beat a specific one. Use logs-index (specific) instead of index (generic) if container logs are what you want to lock down. Same for logs-output vs output, etc.
- logs-disabled and logs-output: devnull are not the same thing. Both stop data from reaching Splunk and neither runs the pipes - they differ in what happens to the file position tracker.
  devnull reads the file and advances the position, so switching back to splunk resumes from the moment of the switch (the muted window is gone). disabled doesn’t read the file and the position doesn’t move, so re-enabling replays from wherever it last left off - often the beginning. Pick devnull to silence a chatty container now without committing to a backfill later; pick disabled when you want the option to forward everything if you change your mind.
- Pod annotations are read from the live Watch stream. Edit a pod or workload annotation and Collectord picks it up almost immediately - no restart, no waiting. The same is true for Configuration CRDs.
- events-output only works at the namespace level. Kubernetes events are forwarded per namespace, not per pod, so collectord.io/events-output set on a pod has no effect.
- Pod-label regex needs anchors. kubernetes_pod_labels: "tier=frontend" will also match tier=frontend-canary because the regex is unanchored. Use ^tier=frontend$ or the comma-tolerant (?:^|,)tier=frontend(?:,|$).
- Container prefix wraps the stream prefix, not the other way around. When combining the two, the order is {container}--{stream}-{annotation} - for example, collectord.io/web--stderr-logs-type: 'nginx_error'. collectord.io/stderr--web-logs-type is not the same thing - Collectord would interpret it as targeting a container literally named stderr.
- Multi-tenant annotationsSubdomain. When you run more than one Collectord instance on the same cluster, set [general]annotationsSubdomain per instance. Annotations under <subdomain>.collectord.io/... only apply to the matching instance; collectord.collectord.io/... is shared. The same filtering applies to annotations on Configuration CRDs.

Wrap-up

Annotations are how Collectord lets app teams own their data routing without touching a central config - and the Configuration CRD is how the platform team takes back the keys when policy demands it.
Most clusters end up with a mix: namespace annotations for per-team defaults, pod and workload annotations for app-specific quirks, and a small library of Configuration CRDs for masking, mandatory indexes, and routing rules that don’t fit a single namespace. Whenever you’re unsure where a setting is coming from, run collectord describe and read the brackets.

For the full annotation list, see the Annotations reference. For OpenShift, the OpenShift annotations docs cover the same ideas with oc.

Monitoring Amazon EKS with Splunk Enterprise and Splunk Cloud

Congratulations to the AWS team for shipping such a great product. Based on the data provided by CNCF, more than half of all companies who run Kubernetes choose to do so on AWS. Managing the Control Plane is not the most straightforward task. EKS does that for you. The only thing that is up to you is to bootstrap worker nodes and run your applications.

Amazon Elastic Container Service for Kubernetes (Amazon EKS) is a managed service that makes it easy for you to run Kubernetes on AWS without needing to stand up or maintain your own Kubernetes control plane.

We are proud to announce that our solution for Monitoring Kubernetes works with Amazon EKS from day one. To get started, follow the Installation instructions and use the appropriate configuration for your version of Kubernetes. At this moment only Kubernetes version 1.10 can be deployed on EKS.

In our example, we used EKS and Splunk deployed in the same Region and the same VPC. But there are no special requirements for your Splunk Enterprise deployment. You can also use Splunk Cloud with our solution. The only requirement is to give the EKS cluster access to the Splunk HTTP Event Collector endpoint, which is usually deployed on port 8088.
After performing all the steps from the Installation instructions, you will see that the DaemonSet for worker nodes will schedule Pods with our collectord on every worker node, and one addon Pod will be deployed for collecting Kubernetes events. Because you don’t have access to the Master nodes, you can delete the DaemonSet for masters or safely ignore it.

With the default configuration, you will get metrics from the worker nodes. You will see detailed metrics for the nodes, pods, containers, and processes. Container and host logs will be automatically forwarded as well. From the control plane, you will be able to see the Kubelet metrics in the application.

You will be able to review network metrics and monitor PVC and instance storage usage.

We have over 30 alerts pre-built for you, which will highlight issues with your deployments and the workloads you are running.

All other Cluster information will be unavailable because you don’t have access to the metrics of the Scheduler, etcd, and controller. But you can still collect metrics from the API Server.

By default, in our configuration we expect every collector on master nodes to collect metrics from the Kubernetes API processes. But because in the case of EKS you don’t have access to the Master nodes, you can schedule collection of the Kubernetes API from the addon. In our configuration file, find the section of the ConfigMap with the file definition for the addon, 004-addon.conf, and add the [input.prometheus::kubernetes-api] section shown in the example below.

```yaml
  004-addon.conf: |
    [general]

    ...

    [input.prometheus::kubernetes-api]

    # disable prometheus kubernetes-api metrics
    disabled = false

    # override type
    type = prometheus

    # specify Splunk index
    index =

    # override host
    host = kubernetes-eks-api-server

    # override source
    source = kubernetes-api

    # how often to collect prometheus metrics
    interval = 60s

    # prometheus endpoint
    endpoint.kubeapi = https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/metrics

    # token for "Authorization: Bearer $(cat tokenPath)"
    tokenPath = /var/run/secrets/kubernetes.io/serviceaccount/token

    # server certificate for certificate validation
    certPath = /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

    # client certificate for authentication
    clientCertPath =

    # Allow invalid SSL server certificate
    insecure = true

    # include metrics help with the events
    includeHelp = false
```

After that, restart the addon pod. Find the pod id:

```bash
$ kubectl get pods --namespace collectorforkubernetes
NAME                                            READY   STATUS    RESTARTS   AGE
collectorforkubernetes-addon-546bd58878-4qk44   1/1     Running   0          48m
collectorforkubernetes-g2wbg                    1/1     Running   0          55m
collectorforkubernetes-gwdg5                    1/1     Running   0          55m
collectorforkubernetes-rsh44                    1/1     Running   0          55m
```

And delete the addon pod with:

```bash
$ kubectl delete pod collectorforkubernetes-addon-546bd58878-4qk44 --namespace collectorforkubernetes
pod "collectorforkubernetes-addon-546bd58878-4qk44" deleted
```

A new pod will be scheduled with the updated configuration. In a few minutes, you should be able to see Kubernetes API metrics in our application.

Links

- AWS News Blog - Amazon EKS – Now Generally Available.
- Getting Started with Amazon EKS

If you are getting errors when trying to access the API from the CLI, like error: the server doesn't have a resource type "cronjobs" or error: You must be logged in to the server (Unauthorized), check the article Common errors when setting up EKS for the first time.
You need to be sure that you are creating the EKS cluster with the same IAM identity that is going to access the API. In our case, we were using MFA for managing temporary sessions, which caused errors similar to those described above.

Monitoring Amazon Elastic Container Service (ECS) Clusters with Splunk Enterprise and Splunk Cloud

[UPDATE (2018-06-15)] Based on Amazon ECS Adds Daemon Scheduling, we updated our blog post to show how you can schedule our collectord on ECS by using the new Daemon Scheduling.

[UPDATE (2018-10-15)] Updated to Monitoring Docker v5.2

Amazon EC2 Container Service (ECS) is a highly scalable, high-performance container management service that supports Docker containers and allows you to easily run applications on a managed cluster of Amazon EC2 instances.

Because ECS runs Docker as a Container Engine, our solution for Monitoring Docker works out of the box with it as well.

In our example, we used ECS and Splunk deployed in the same Region and the same VPC. But there are no special requirements for your Splunk Enterprise deployment. You can also use Splunk Cloud with our solution. The only requirement is to give the ECS cluster access to the Splunk HTTP Event Collector endpoint, which is usually deployed on port 8088.

We expect that you have already finished the Splunk configuration from our manual Monitoring Docker Installation, installed our application Monitoring Docker, and enabled HTTP Event Collector in your Splunk environment.
First, you need to create a new Task Definition. At the very bottom, you can find a Configure JSON button.

Use the following template to create a Task Definition. At a minimum, you need to set the URL for your HTTP Event Collector, specify your HTTP Event Collector Token, and include the license key (request an evaluation license key with this automated form). You can also revisit the Memory and CPU limits for your load; you can adjust those later.

```json
{
  "containerDefinitions": [
    {
      "name": "collectorfordocker",
      "image": "outcoldsolutions/collectorfordocker:{{ collectorfordocker_version }}",
      "memory": "256",
      "cpu": "256",
      "essential": true,
      "portMappings": [],
      "environment": [
        {
          "name": "COLLECTOR__SPLUNK_URL",
          "value": "output.splunk__url=https://hec.example.com:8088/services/collector/event/1.0"
        },
        {
          "name": "COLLECTOR__SPLUNK_TOKEN",
          "value": "output.splunk__token=B5A79AAD-D822-46CC-80D1-819F80D7BFB0"
        },
        {
          "name": "COLLECTOR__SPLUNK_INSECURE",
          "value": "output.splunk__insecure=true"
        },
        {
          "name": "COLLECTOR__ACCEPTLICENSE",
          "value": "general__acceptLicense=true"
        },
        {
          "name": "COLLECTOR__LICENSE",
          "value": "general__license=..."
        },
        {
          "name": "COLLECTOR__EC2_INSTANCE_ID",
          "value": "general__ec2Metadata.ec2_instance_id=/latest/meta-data/instance-id"
        },
        {
          "name": "COLLECTOR__EC2_INSTANCE_TYPE",
          "value": "general__ec2Metadata.ec2_instance_type=/latest/meta-data/instance-type"
        }
      ],
      "mountPoints": [
        {
          "sourceVolume": "cgroup",
          "containerPath": "/rootfs/sys/fs/cgroup",
          "readOnly": true
        },
        {
          "sourceVolume": "proc",
          "containerPath": "/rootfs/proc",
          "readOnly": true
        },
        {
          "sourceVolume": "var_log",
          "containerPath": "/rootfs/var/log",
          "readOnly": true
        },
        {
          "sourceVolume": "var_lib_docker_containers",
          "containerPath": "/rootfs/var/lib/docker/",
          "readOnly": true
        },
        {
          "sourceVolume": "docker_socket",
          "containerPath": "/rootfs/var/run/docker.sock",
          "readOnly": true
        },
        {
          "sourceVolume": "collector_data",
          "containerPath": "/data",
          "readOnly": false
        }
      ],
      "volumesFrom": null,
      "hostname": null,
      "user": null,
      "workingDirectory": null,
      "privileged": true,
      "readonlyRootFilesystem": true,
      "extraHosts": null,
      "ulimits": null,
      "dockerLabels": null,
      "logConfiguration": {
        "logDriver": "json-file",
        "options": {
          "max-size": "1m",
          "max-file": "3"
        }
      }
    }
  ],
  "volumes": [
    {
      "name": "cgroup",
      "host": {
        "sourcePath": "/cgroup"
      }
    },
    {
      "name": "proc",
      "host": {
        "sourcePath": "/proc"
      }
    },
    {
      "name": "var_log",
      "host": {
        "sourcePath": "/var/log"
      }
    },
    {
      "name": "var_lib_docker_containers",
      "host": {
        "sourcePath": "/var/lib/docker/"
      }
    },
    {
      "name": "docker_socket",
      "host": {
        "sourcePath": "/var/run/docker.sock"
      }
    },
    {
      "name": "collector_data",
      "host": {
        "sourcePath": "/var/lib/collectorfordocker/data/"
      }
    }
  ],
  "networkMode": null,
  "memory": "256",
  "cpu": "0.5 vcpu",
  "placementConstraints": [],
  "family": "collectorfordocker",
  "taskRoleArn": ""
}
```

After saving it, you can use this Task Definition on your ECS clusters.

The next step is to schedule this Task on every host we have. Open your cluster and go to the Services tab. Create a new service with the following configuration:

Launch type - EC2
Task Definition - choose the just-created task definition collectorfordocker
Cluster - the name of your ECS cluster
Service name - collectorfordocker
[UPDATE (2018-06-15)] Service Type - DAEMON
Minimum healthy percent - keep defaults (we aren’t going to use them)

On the second step, choose:

Load balancer type - None

On the third step:

Service Auto Scaling - Do not adjust the service’s desired count

After creating the service, give it a minute to download the image and set up our collectord. If everything works as expected, you should see data in the Monitoring Docker application.

Known issues

By default, collectord picks up Docker daemon logs from /rootfs/var/log/docker. You need to update the macro macro_docker_host_logs_docker and change it to

```text
(`macro_docker_host_logs` AND source="*var/log/docker*")
```

Monitoring Docker Universal Control Plane (UCP) with Splunk Enterprise and Splunk Cloud

Docker UCP is the real king of orchestration: not only does it let you deploy workloads using docker-compose files, including Docker services and Docker stacks, it also runs the Kubernetes control plane and lets you deploy Kubernetes workloads.

[UPDATE (2018-11-14)] If you are using Docker UCP 3.1.0 or above, please follow the installation instructions from Installing Monitoring Kubernetes.

Docker Universal Control Plane (UCP) is the enterprise-grade cluster management solution from Docker.
You install it on-premises or in your virtual private cloud, and it helps you manage your Docker cluster and applications through a single interface. https://docs.docker.com/ee/ucp/

It can be very challenging to set up infrastructure that increases observability not only of your microservices but also of the supporting infrastructure. Outcold Solutions offers dedicated solutions for Monitoring Docker and Monitoring Kubernetes, but if you are running UCP, which solution should you choose?

Both solutions allow you to monitor all containers running on the cluster, including control plane containers and application containers. If you deploy mostly Kubernetes workloads on UCP, you should consider using the Monitoring Kubernetes solution. If most of your applications are deployed with docker-compose files, you should use Monitoring Docker, as Monitoring Kubernetes has additional concepts that do not apply to Docker (Pods, Workloads).

Below we walk through how you can install both solutions, so you will be able to compare them. In our scenarios, we used Docker EE with Universal Control Plane 3.0.5.

Installing Monitoring Kubernetes on UCP
Installing Monitoring Docker on UCP

For Docker UCP version 3.1.0 or above, use the Installing Monitoring Kubernetes instructions.

Installing Monitoring Kubernetes on UCP

A few details that you should be aware of regarding Kubernetes support on UCP:

- UCP 3.0.5 uses Kubernetes v1.8.11. In our example, we will use the configuration built for Kubernetes 1.8.
- UCP does not use Kubernetes RBAC Authorization. It uses its own User Management system. We will need to strip all RBAC-related configuration from our manifest and configure the service account with Docker UCP User Management.
- You cannot deploy DaemonSets on worker nodes outside of the kube-system namespace. For the UCP deployment we change the namespace from collectorforkubernetes to kube-system.
The first step is simple: install our application from SplunkBase and enable HTTP Event Collector. Please follow our official guide on how to configure Splunk in the Monitoring Kubernetes solution.

As for collectord for Kubernetes, the steps will be slightly different.

Grant collectorforkubernetes service account permissions to access Kubernetes API

First, you need to create a service account collectorforkubernetes using UCP. Go to the tab Service Accounts under Kubernetes and click the Create button. Change namespace to kube-system and paste

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: collectorforkubernetes
  name: collectorforkubernetes
  namespace: kube-system
```

After creating this service account, we need to give it view-only permissions for the Kubernetes API Service. You can do that with User Management by creating a new grant. Go to Grants under User Management and click the Create button.

In the wizard, on step 1, choose Service Account as a subject type, kube-system as a namespace, collectorforkubernetes as a Service Account and click Next.

On step 2, choose View Only as a Role Type and click Next.

On step 3, choose namespaces as a Type, enable the toggle Apply grant to all existing and new namespaces and click Create.

Installing collectorforkubernetes

Download collectorforkubernetes.yaml that we specifically prepared for UCP deployment.

Similarly to the general installation instructions, you need to accept the License, configure the Splunk URL and Token and include the license key (request an evaluation license key with this automated form).

```ini
[general]

acceptLicense = true

license = ...

...

# Splunk output
[output.splunk]

# Splunk HTTP Event Collector url
url = https://hec.example.com:8088/services/collector/event/1.0

# Splunk HTTP Event Collector Token
token = B5A79AAD-D822-46CC-80D1-819F80D7BFB0

# Allow invalid SSL server certificate
insecure = true
```

Copy the whole content of the YAML file, go to the UCP console, go to Controllers under Kubernetes, and click the Create button. Change namespace to kube-system, paste the whole content into the Object YAML section and click Create.

If everything is correct, you should start seeing data in a few moments in the Monitoring Kubernetes application in Splunk.

Within the application, when you navigate to a specific node, you will be able to see pods scheduled with Kubernetes.

And below you will be able to see all containers that have been scheduled with Kubernetes or Docker Services and Stacks.

Please read the Next Steps that we recommend after installation.

Installing Monitoring Docker on UCP

First, install our application from SplunkBase and enable HTTP Event Collector. Please follow our official guide on how to configure Splunk in the Monitoring Docker solution.

To install collectord on your Docker nodes we recommend using the CLI, as our configuration has a lot of mounts and it is easy to make a mistake by adding them manually. To get access to the CLI from UCP, you can find instructions on the main Dashboard if you scroll to the very bottom of the page.

After configuring the CLI, create a file collectorfordocker.yaml with the content as in the example below. Specify the correct Splunk URL and Token and accept the License.
```yaml
version: "3"
services:

  collectorfordocker:
    image: outcoldsolutions/collectorfordocker:5.2
    volumes:
      - /sys/fs/cgroup:/rootfs/sys/fs/cgroup:ro
      - /proc:/rootfs/proc:ro
      - /var/log:/rootfs/var/log:ro
      - /var/lib/docker/:/rootfs/var/lib/docker/:ro
      - /var/run/docker.sock:/rootfs/var/run/docker.sock:ro
      - collector_data:/data/
    environment:
      - COLLECTOR__SPLUNK_URL=output.splunk__url=https://hec.example.com:8088/services/collector/event/1.0
      - COLLECTOR__SPLUNK_TOKEN=output.splunk__token=B5A79AAD-D822-46CC-80D1-819F80D7BFB0
      - COLLECTOR__SPLUNK_INSECURE=output.splunk__insecure=true
      - COLLECTOR__ACCEPTLICENSE=general__acceptLicense=true
      - COLLECTOR__LICENSE=general__license=...
      - COLLECTOR__CGROUPS=general.docker__containersCgroupFilter=^(/([^/\s]+/)*(docker-|docker/|kubepods/.*)[0-9a-f]{64}(\.scope)?)$$
    deploy:
      mode: global
      restart_policy:
        condition: any
      resources:
        limits:
          cpus: '1'
          memory: 256M
        reservations:
          cpus: '0.1'
          memory: 64M

volumes:
  collector_data:
```

Create services with the Docker CLI

```bash
docker stack deploy --compose-file ./collectorfordocker.yaml collectorfordocker
```

Check that services have been deployed

```bash
docker stack services collectorfordocker
```

Give it a few moments, and you should see the data in the Monitoring Docker application.

Similarly to the Monitoring Kubernetes application, you will be able to see all containers running on your Docker UCP cluster.

Please read the Next Steps that we recommend after installation.

Summary

Both applications, Monitoring Docker and Monitoring Kubernetes, give you a way to monitor your clusters and see logs from the containers as well as from the hosts. Monitoring Kubernetes also provides dashboards dedicated to the Kubernetes Control Plane.

If you prefer to use both applications, it is possible to add aliases for the Monitoring Docker application to reuse the data that we forward for the Monitoring Kubernetes application.
Have a question? We are one email away.

Monitoring Docker, OpenShift and Kubernetes - v5.2 - bug postmortem (lookup with alerts causing replication activities on SHC)

We have released a patch 5.2.180 for version 5.2 of our applications for Splunk. This patch fixes an important issue in Search Head Cluster deployments: the lookup with alerts caused very frequent replication activity on the SHC.

If you are currently using version 5.2 with Splunk Search Head Clustering deployments, please upgrade ASAP.

Actions for Upgrade

- If you are using version 5.2, upgrade to 5.2.180.
- If you have already overridden some of our alerts, verify that you don’t have any triggers left in the local folder: cat $SPLUNK_ETC/apps/monitoring*/local/savedsearches.conf | grep 'action.lookup'. If you do, remove all of the lines starting with action.lookup.
- Remove the lookups for alerts (ls $SPLUNK_ETC/apps/monitoring*/lookups/*_alerts.csv) from our applications monitoringopenshift, monitoringdocker, monitoringkubernetes. The lookups folder also contains several static lookups, so remove only the *_alerts.csv files.

Postmortem

Since version 2.0 we have had alerts in our applications. But all of these alerts were only about license overuse and expiration, or for letting you know that you are using an incorrect version of Collectord.

With version 5.2 we brought many alerts to help you monitor the health of the system. We tested every alert separately to verify that it works, and that we have a dashboard in the application that can help you diagnose the alert further.

Triggered alerts aren’t visible in Splunk if you don’t know where to look. You need to specify custom triggers for them or check triggered alerts manually in Activity → Triggered Alerts.
We wanted to be able to tell customers right out of the box if something is wrong with their deployments or clusters. Among all the triggers that Splunk has available by default, we felt that writing to a CSV file was the best option. This option did not require any additional configuration, and we believed it would not affect the system in any way. We used this file to show the triggered alerts on the first page of our applications.

We knew that every new append to the CSV file would trigger a replication on the SHC. We did the math: 30 alerts, running at intervals from 5 minutes to 24 hours, could in the worst-case scenario trigger 30 SHC replications in 5 minutes. Even if the file grows to a few megabytes, that should be easy to replicate.

After we heard that one of our customers had an issue with this lookup, we started an investigation and set up a lab with a Search Head Cluster, where we tried to reproduce the worst-case scenario. We changed every alert so that it always triggers. In just 3 hours we observed that:

- The lookup file grew to over 1 MB.
- Time spent on actions went from 5 seconds to 30 seconds.
- This behavior caused over 600 replicated configuration operations every 5 minutes. Considering that we only had 30 saved searches writing to this lookup file, that felt too high.
- The lookup file got corrupted a few times (you could see text in several _time cells). A corrupted lookup file causes searches to fail. We are back to the same problem: the application wants to show you alerts, and you will not be able to see them.

It was clear that this was not an option for us anymore. We could not use CSV for storing fired alerts. Sadly, KVStore lookups are not available as a trigger action for alerts. KVStore would not cause any issues with the SHC replication.

Action 1. Remove alert lookups

The first action was simple.
We removed all lookup actions from our alerts (savedsearches.conf)

```text
action.lookup = 1
action.lookup.append = 1
action.lookup.filename = monitoring_docker_alerts.csv
```

And modified our searches. We do not need to normalize all the results to the same set of fields anymore. Instead, we can show more information with every saved search.

Action 2. Use REST API to show alerts on the first page

Instead of relying on the CSV lookups, we started to use the REST API directly to get the list of fired alerts.

```text
| rest splunk_server=local count=1024 /servicesNS/-/monitoringopenshift/alerts/fired_alerts/-
```

Alternatively, you can find fired alerts in the _audit index

```text
index=_audit action="alert_fired" ss_app=monitoringdocker
```

But for that you need to give your users access to the internal indexes, and most non-admin users do not have access to these indexes.

After the patch

After we changed the implementation of saved searches, we reran the test. We changed every alert so that it always triggers; after running this test for 12 hours, we got almost a thousand fired events. These events do not affect the SHC in any way: there are no unnecessary replications, and the time for actions stays the same.

Below is the dashboard of searches scheduled on the Search Head Cluster. About 120 searches per hour on average.

Conclusion

If you are using Search Head Cluster with version 5.2 of our applications, please make sure that you have upgraded to the latest patched version 5.2.180.

Note that using dynamic CSV lookups in SHC environments is not a great option, even if these lookups are small.

Monitoring Docker, OpenShift and Kubernetes - Version 5 (Application Logs and Annotations)

We are happy to announce the 5th version of our applications for Monitoring Docker, OpenShift, and Kubernetes.

First of all, we want to thank our customers!
Feedback and close work with our customers help us to build one of the best tools for monitoring container environments.

Now, let us share with you what we worked on this summer.

Application Logs

It is a best practice to forward logs to the standard output and standard error of the container. But that is not always achievable, for example with legacy software, or when you are containerizing something as complicated as a database, where having all the logs in one stream can reduce readability and observability.

Our solutions for monitoring Docker, OpenShift, and Kubernetes offer the simplest way to forward logs stored inside the container. There is no need to install any sidecar containers, map host paths, or change the configuration for the collectord. Just two things: define a volume with the local driver (Docker) or emptyDir (Kubernetes/OpenShift), and tell the collectord the name of this volume.

An example of forwarding application logs from a PostgreSQL container running with Docker.

```bash
docker run -d \
    --volume psql_data:/var/lib/postgresql/data \
    --volume psql_logs:/var/log/postgresql/ \
    --label 'collectord.io/volume.1-logs-name=psql_logs' \
    postgres:10.4 \
    docker-entrypoint.sh postgres -c logging_collector=on -c log_min_duration_statement=0 -c log_directory=/var/log/postgresql -c log_min_messages=INFO -c log_rotation_age=1d -c log_rotation_size=10MB
```

With that, you will get logs from the standard output and standard error, as well as the logs stored inside the volume psql_logs.

Please read more on the topic:

Monitoring OpenShift v5 - Annotations - Application Logs
Monitoring Kubernetes v5 - Annotations - Application Logs
Monitoring Docker v5 - Annotations - Application Logs

Annotations

We used annotations in our solutions for Monitoring OpenShift and Kubernetes for overriding indexes, sources, source types, and hosts for data we forward with the collectord.
With version 5, we are bringing annotations to Monitoring Docker and also adding more features that can be enabled with annotations. That includes:

- Extracting fields
- Extracting timestamps
- Hiding sensitive information from the logs
- Redirecting events to /dev/null on some pattern
- Stripping terminal colors from container logs
- Defining multi-line event patterns
- Defining application logs

An example: obfuscating all IP addresses in nginx container logs running in Kubernetes

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  annotations:
    collectord.io/logs-replace.1-search: (?P<IPv4p1>\d{1,3})(\.\d{1,3}){3}
    collectord.io/logs-replace.1-val: ${IPv4p1}.X.X.X
spec:
  containers:
  - name: nginx
    image: nginx
```

Instead of full IP addresses, the forwarded logs contain only the first octet followed by .X.X.X.

We have a lot of examples in our documentation.

Monitoring OpenShift v5 - Annotations
Monitoring Kubernetes v5 - Annotations
Monitoring Docker v5 - Annotations

Splunk Output

There are two significant improvements for configuring HTTP Event Collector with the collectord.

You can define multiple endpoints, which allows collectord to load balance forwarding between several Splunk HTTP Event Collectors if you don’t have a Load Balancer in front of them. Or you can use it for failover if your Load Balancer or DNS fails, so the collectord can switch to the next endpoint in the list.

The second improvement is handling invalid indexes with Splunk HTTP Event Collector. We learned that it is a very common issue to mistype an index name or forget to add the index to the list of indexes where HTTP Event Collector can write. The collectord can recognize error messages from the Splunk HTTP Event Collector and decide how to handle this error. With version 5, in case of an error, it redirects all the data to the default index. You can change this behavior and specify that it should Drop messages or Block the pipeline. Blocking the pipeline is similar to the behavior of the previous version, with one exception. Previously, the whole pipeline could be blocked.
Now only the data destined for this index is blocked, for example process stats and events, while Pod and Container stats and logs are not affected by the blockage.

You can read more about the configurations for Splunk Output with examples in our documentation.

Monitoring OpenShift v5 - Configurations for Splunk HTTP Event Collector
Monitoring Kubernetes v5 - Configurations for Splunk HTTP Event Collector
Monitoring Docker v5 - Configurations for Splunk HTTP Event Collector

Installation and Upgrade

As always, this upgrade is available for free for all our customers. And you can try it for free with the embedded trial license.

Release notes:

Monitoring OpenShift - Release notes
Monitoring Kubernetes - Release notes
Monitoring Docker - Release notes

Installation instructions:

Monitoring OpenShift - Installation
Monitoring Kubernetes - Installation
Monitoring Docker - Installation

Upgrade instructions:

Monitoring OpenShift v5 - Upgrade
Monitoring Kubernetes v5 - Upgrade
Monitoring Docker v5 - Upgrade

Monitoring Docker, OpenShift and Kubernetes - Version 5.1 (Network metrics and socket tables, prometheus autodiscovery)

We are bringing network metrics and socket tables with the minor update Monitoring Docker, OpenShift and Kubernetes Version 5.1. This release also includes visual and usability improvements in the application, performance and stability improvements in collectord, and new configurations to dynamically discover metrics from Pods exported in Prometheus format.

Network Metrics

Dashboards for Hosts, Host, Pods, Containers and Workloads include network metrics: received and transmitted MB, Packets, Errors and Drops.

Network socket table

Network metrics alone do not give you a lot of visibility into how your containers communicate with each other or the public network.
Version 5.1 brings network socket tables, which can give you information about current connections established with the Host, Pod or Container, and a list of the ports the current container is listening on.

Network Review

Network socket tables allowed us to bring another security review dashboard that allows you to find all the connections to the outside world, or between the containers, pods, hosts or services.

Forwarding Prometheus metrics from Pods

We are expanding annotations with support for collecting Prometheus metrics directly from the pods. If your Pod or container can export metrics in Prometheus format, and you want to see these metrics in Splunk, you can simply annotate them, and collectord will automatically forward these metrics to Splunk.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  annotations:
    collectord.io/prometheus.1-port: '9527'
    collectord.io/prometheus.1-path: '/metrics'
spec:
  containers:
  - name: nginx
    image: sophos/nginx-prometheus-metrics
```

To learn more about annotations and forwarding Prometheus metrics from Pods, please read our documentation

Monitoring OpenShift - Forwarding Prometheus metrics
Monitoring Kubernetes - Forwarding Prometheus metrics

Links

Upgrade instructions

Monitoring OpenShift v5 - Upgrade
Monitoring Kubernetes v5 - Upgrade
Monitoring Docker v5 - Upgrade

Release notes

Monitoring OpenShift - Release notes
Monitoring Kubernetes - Release notes
Monitoring Docker - Release notes

Installation instructions

Monitoring OpenShift - Installation
Monitoring Kubernetes - Installation
Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.10 - Security dashboards, multi-cluster monitoring

Minor update 5.10 is focused on usability improvements for the multi-cluster monitoring and security monitoring use cases.
Security and audit dashboards

With version 5.10 we improved existing security and audit monitoring and introduced a new set of dashboards grouped under the Security tab. If you have security use cases that we haven’t covered yet, feel free to send us a feature request at contact@outcoldsolutions.com.

Access dashboard

Review ssh sessions, super user sessions, exec sessions on your pods and forbidden requests to the API server.

Audit (users and projects)

In addition to the Audit dashboard, we introduced a new dashboard focused on access to the API server by users (excluding system accounts). Now you can easily find who worked on a specific project in the last 72 hours, the list of actions they executed, and the number of requests they have made.

Network (traffic)

Monitor the network traffic for your hosts and namespaces/projects.

Network (connections)

Review the connections from your nodes and namespaces, and track cross-namespace connections.

Objects (pods)

This dashboard is based on the streaming of Pod objects from the API server. Now you can review all pods that are running on the host network, pod age (if you have a policy on how often pods need to be updated), image pull policy and image versions, all pods that mount host paths, and pods that change their security context.

Improved Multi-cluster monitoring

In previous releases, we suggested using node labels (or docker engine labels) to identify the cluster names. Although that is an easy-to-use solution, it is not available when you don’t have control over your cluster. We introduced custom fields in earlier versions. Custom fields can be attached to any data that collectord forwards to Splunk. Now we use custom fields to help you identify the clusters in our applications. For backward compatibility we extract the cluster name from the node label cluster as well.
If you have a different node label that identifies your clusters, you will have to update the calc fields kubernetes_cluster_eval, openshift_cluster_eval or docker_cluster_eval accordingly. Or you can just switch to the custom fields as described in our installation instructions. In the case of OpenShift, just update the ConfigMap (make sure to restart the pods after the update)

```ini
[general]
fields.openshift_cluster = development
```

Most dashboards in the application have a Cluster filter that can help you filter the data from a specific cluster.

Monitoring OpenShift v5 - Installation
Monitoring Kubernetes v5 - Installation
Monitoring Docker v5 - Installation

Dashboard: Clusters (Allocations and usage)

For multi-cluster setups we also included a dashboard that can help you review allocations and usage for the clusters.

Improved support for custom sourcetypes

You can override sourcetypes for the container logs with the annotation collectord.io/logs-type=.... Before release 5.10, you would need to update the macro macro_openshift_logs (in the case of OpenShift) to be able to see these logs in our application. Starting from version 5.10, we identify logs by the sources /openshift/logs/..., /kubernetes/logs/... and /docker/logs/..., so you can change the sourcetype without worrying about updating macros.

Base macro for configuring indexes

With version 5.10 we introduced a base macro that lets you configure indexes by modifying only the base macros macro_docker_base, macro_openshift_base or macro_kubernetes_base.

Monitoring OpenShift v5 - Splunk Indexes
Monitoring Kubernetes v5 - Splunk Indexes
Monitoring Docker v5 - Splunk Indexes

Improved support for OpenShift 4.x

We’ve updated our configuration page with instructions on how to install Monitoring OpenShift for OpenShift 4.x following the public release. For OpenShift 4.x we also provide RHEL8 certified images with the prefix -ubi8, built from base images -ubi8-minimal.
You can find configuration for OpenShift 4.x that refers to -ubi8 images.

Support for volatile journald storage

With 5.10 we can automatically forward logs from the journald volatile storage. Please refer to our updated configurations. We include a mount for /run/log and allow you to configure multiple paths for input.journald.

Links

You can find more information about other minor updates by following the links below.

Release notes

Monitoring OpenShift - Release notes
Monitoring Kubernetes - Release notes
Monitoring Docker - Release notes

Upgrade instructions

Monitoring OpenShift v5 - Upgrade
Monitoring Kubernetes v5 - Upgrade
Monitoring Docker v5 - Upgrade

Installation instructions

Monitoring OpenShift - Installation
Monitoring Kubernetes - Installation
Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.11 - Support for PVC application logs

Minor update 5.11 focuses on stability improvements and adds support for PVC volumes for collecting application logs.

Collecting application logs from PVC volumes

Starting from Version 5.11, you can collect application logs from volumes with type persistentVolumeClaim. Similarly to our example with emptyDir volumes, you can use persistentVolumeClaim as the volume type.
For example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: postgres-pod
  annotations:
    collectord.io/volume.1-logs-name: 'logs'
spec:
  containers:
  - name: postgres
    image: postgres
    command:
    - docker-entrypoint.sh
    args:
    - postgres
    - -c
    - logging_collector=on
    - -c
    - log_min_duration_statement=0
    - -c
    - log_directory=/var/log/postgresql
    - -c
    - log_min_messages=INFO
    - -c
    - log_rotation_age=1d
    - -c
    - log_rotation_size=10MB
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data
    - name: logs
      mountPath: /var/log/postgresql/
  volumes:
  - name: data
    emptyDir: {}
  - name: logs
    persistentVolumeClaim:
      claimName: logs
```

Collectord will automatically discover logs stored in the volume logs mounted with the persistentVolumeClaim logs.

To take advantage of this feature, please make sure to update the YAML configuration and upgrade collectord to version 5.11.260.

Links

You can find more information about other minor updates by following the links below.

Release notes

Monitoring OpenShift - Release notes
Monitoring Kubernetes - Release notes
Monitoring Docker - Release notes

Upgrade instructions

Monitoring OpenShift v5 - Upgrade
Monitoring Kubernetes v5 - Upgrade
Monitoring Docker v5 - Upgrade

Installation instructions

Monitoring OpenShift - Installation
Monitoring Kubernetes - Installation
Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.12 - less storage, less license, more features

Version 5.12 of our applications and Collectord is available.

Reduced size for the metrics

In previous versions, we were sending raw metrics from the proc file system and doing calculations on the Splunk side to show you correct metric values, like CPU usage and Memory usage. Now, instead, we are pre-calculating these metrics on the Collectord side and forwarding already user-friendly metric values.
This provides a few improvements: less data to transfer over the network, less storage for the metrics, and faster searches, as Splunk does not need to perform these evaluations anymore.

Global configurations with Custom Resources

Collectord watches for collectord.io configuration Custom Resources and applies these annotations based on the selectors. As an example, you can apply the configuration to all Pods created with an image that contains nginx in the name as follows:

```yaml
apiVersion: "collectord.io/v1"
kind: Configuration
metadata:
  name: apply-to-all-nginx
  annotations:
    collectord.io/nginx--logs-replace.1-search: '^.+\"GET [^\s]+ HTTP/[^"]+" 200 .+$'
    collectord.io/nginx--logs-replace.1-val: ''
    collectord.io/nginx--logs-hashing.1-match: '(\d{1,3}\.){3}\d{1,3}'
    collectord.io/nginx--logs-hashing.1-function: 'fnv-1a-64'
spec:
  kubernetes_container_image: "^nginx(:.*)?$"
```

Watching for namespaces (projects) and workloads

Collectord watches for changes in the namespaces (OpenShift projects) and workloads. When it sees new or updated annotations, it automatically recreates pipelines for the Pods.

```ini
[general.kubernetes]
watch.namespaces = v1/namespace
watch.deployments = apps/v1/deployment
watch.configurations = apis/v1/collectord.io/configuration
```

Base macro for alerts

To enable alerts only for Production clusters, you can change the macro_openshift_alerts_base (macro_kubernetes_alerts_base or macro_docker_alerts_base) to macro_openshift_alerts_base = (openshift_cluster_eval=prod1 OR openshift_cluster_eval=prod2). This will generate alerts only for these two clusters.

Troubleshooting command

A new troubleshooting command has been introduced that can show you the list of all the annotations applied to a specific container or pod.
For example, you can see all the annotations applied to the pod postgres-pod running in the namespace default:

```bash
kubectl exec -n collectorforkubernetes collectorforkubernetes-master-4gjmc -- /collectord describe --namespace default --pod postgres-pod --container postgres
```

Beta: dynamic index names based on meta fields

You can apply dynamic index names in the configurations to forward logs or stats to a specific index based on the meta fields. For example, you can define an index as:

```ini
[input.files]

index = oc_{{openshift_namespace}}
```

In that case, all the logs from the namespace foo will be forwarded to the index oc_foo and all the logs from the namespace bar will be forwarded to the index oc_bar.

This feature is in beta. We will publish additional documentation on how to set it up.

Beta: diagnostic checks

With this release, we are bringing a new capability to Collectord that allows you to diagnose issues and fire alerts directly from Collectord. For now, we have implemented only one check, for Node Entropy, which verifies that entropy_avail is above the threshold 800. Next, we are planning to add more alerts directly to Collectord, so this data will not be forwarded to Splunk, but Collectord will generate alerts directly from the nodes. These alerts are written as part of the standard output of the Collectord containers. For example:

```text
ALARM-OFF "node-entropy"
ALARM-ON "node-entropy" entropy value 720 is below the threshold 800
```

This feature is in beta. It is disabled by default and can be enabled in the YAML configuration collectorforopenshift.yaml or collectorforkubernetes.yaml (under diagnostics::node-entropy).

Links

You can find more information about other minor updates by following the links below.
Release notes: Monitoring OpenShift - Release notes, Monitoring Kubernetes - Release notes, Monitoring Docker - Release notes
Upgrade instructions: Monitoring OpenShift v5 - Upgrade, Monitoring Kubernetes v5 - Upgrade, Monitoring Docker v5 - Upgrade
Installation instructions: Monitoring OpenShift - Installation, Monitoring Kubernetes - Installation, Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.14 - containerd, templates for indexes and sources

Version 5.14 of our applications and Collectord is available.

Placeholders in indexes and sources

You can apply dynamic index names in the configurations to forward logs or stats to a specific index based on the meta fields. For example, you can define an index as:

```ini
[input.files]

index = oc_{{openshift_namespace}}
```

Similarly, you can change the source of all the forwarded logs:

```ini
[input.files]

source = /{{openshift_namespace}}/{{::coalesce(openshift_daemonset_name, openshift_deployment_name, openshift_statefulset_name, openshift_cronjob_name, openshift_job_name, openshift_replicaset_name, openshift_pod_name)}}/{{openshift_pod_name}}/{{openshift_container_name}}
```

Support for containerd runtime

Collectord now supports Docker, CRI-O, and containerd runtimes for Kubernetes and OpenShift. Make sure to download the latest configuration for Kubernetes to be able to use the containerd runtime. New volumes have been added to reference the containerd unix socket.

Exclude fields from forwarded events

If you want to reduce the number of fields forwarded with every event, you can set which fields you want to ignore:

```ini
[output.splunk]

excludeFields.openshift_pod_ip = true
```

Logs dashboard improvement

All filters also affect drop-downs in other fields. For example, selecting a cluster will filter suggestions for Pods only from the selected cluster.
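To make the placeholder syntax from this release easier to picture, here is a minimal Python sketch of how such templates could be expanded. It is only an illustration, not Collectord's implementation: the `render` function is hypothetical, and the choice to render a missing field as an empty string is an assumption.

```python
import re

# Matches {{ ... }} placeholders; the inner expression is captured.
PLACEHOLDER = re.compile(r"\{\{\s*(.*?)\s*\}\}")
# Matches the ::coalesce(a, b, c) form shown in the source example.
COALESCE = re.compile(r"^::coalesce\((.*)\)$")


def render(template, fields):
    """Expand {{field}} and {{::coalesce(...)}} placeholders from meta fields."""

    def substitute(match):
        expr = match.group(1)
        coalesce = COALESCE.match(expr)
        if coalesce:
            names = [name.strip() for name in coalesce.group(1).split(",")]
            # First field that is present and non-empty wins.
            return next((fields[n] for n in names if fields.get(n)), "")
        # Assumption: unknown fields render as an empty string.
        return fields.get(expr, "")

    return PLACEHOLDER.sub(substitute, template)


fields = {
    "openshift_namespace": "foo",
    "openshift_deployment_name": "web",
    "openshift_pod_name": "web-abc",
    "openshift_container_name": "nginx",
}
index = render("oc_{{openshift_namespace}}", fields)
source = render(
    "/{{openshift_namespace}}/"
    "{{::coalesce(openshift_daemonset_name, openshift_deployment_name, openshift_pod_name)}}/"
    "{{openshift_pod_name}}/{{openshift_container_name}}",
    fields,
)
```

With the sample fields above, `index` becomes `oc_foo` and `source` becomes `/foo/web/web-abc/nginx`, because the Pod has no daemonset name and the deployment name is the first non-empty candidate.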
Links

You can find more information about other minor updates by following the links below.

Release notes: Monitoring OpenShift - Release notes, Monitoring Kubernetes - Release notes, Monitoring Docker - Release notes
Upgrade instructions: Monitoring OpenShift v5 - Upgrade, Monitoring Kubernetes v5 - Upgrade, Monitoring Docker v5 - Upgrade
Installation instructions: Monitoring OpenShift - Installation, Monitoring Kubernetes - Installation, Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.14.285 - whitelisting the messages

We have released a patch release for Collectord that includes a few new features along with performance and memory improvements.

Whitelisting the messages

With the replace annotations, you could configure which log messages you want to drop and not forward to Splunk. Now you can also configure a whitelist pattern that tells Collectord which messages should be forwarded to Splunk. For example, for an NGINX Pod you can specify that only log messages with the words POST and DELETE should be forwarded to Splunk:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  annotations:
    collectord.io/logs-whitelist: '((DELETE)|(POST))$'
spec:
  containers:
  - name: nginx
    image: nginx
```

Links

You can find more information about other minor updates by following the links below.
Release notes: Monitoring OpenShift - Release notes, Monitoring Kubernetes - Release notes, Monitoring Docker - Release notes
Upgrade instructions: Monitoring OpenShift v5 - Upgrade, Monitoring Kubernetes v5 - Upgrade, Monitoring Docker v5 - Upgrade
Installation instructions: Monitoring OpenShift - Installation, Monitoring Kubernetes - Installation, Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.15

This release includes a lot of minor improvements for stability and usability and several new features. We will highlight just a few of them. For the full list, please go to the release notes (links at the bottom of this page).

Whitelists and blacklists for Prometheus metrics

With the new feature allowing you to define whitelists and blacklists for metrics exported in Prometheus format, you can significantly reduce the number of metrics forwarded to Splunk. This allows you to reduce indexing cost, reduce storage, and improve overall performance. We have updated our configuration files to include only metrics that we use in our dashboards for the endpoints that we monitor out of the box. If you are forwarding metrics using annotations, you can also define a regexp pattern as a whitelist or blacklist to include only metrics that you are interested in.

New annotations for Collectord

Now you can configure any custom fields that you want to be forwarded with your events to Splunk. You can configure these annotations as collectord.io/userfields.team=team-a to forward all events from a specific namespace or pod with the pre-indexed field team and value team-a. You can configure these fields only for logs as collectord.io/logs-userfields.team=team-a.

Support for Kubernetes 1.18+ and OpenShift 4.5+

A lot of metrics have been deprecated with Kubernetes 1.18 (see github.com/kubernetes/kubernetes/pull/76496). We have updated our dashboards to support new metrics and to be able to use old metrics as well.
OpenShift 4.4 is built on top of Kubernetes 1.17, and we are guessing that 4.5 will be built on top of Kubernetes 1.18.

Old versions of deployment configs for Kubernetes and OpenShift are published on GitHub

You can find all versions of deployments starting from 5.0 in our GitHub repository github.com/outcoldsolutions/collectord-configurations. That way you can track all the changes and use old configurations if you need them.

Links

You can find more information about other minor updates by following the links below.

Release notes: Monitoring OpenShift - Release notes, Monitoring Kubernetes - Release notes, Monitoring Docker - Release notes
Upgrade instructions: Monitoring OpenShift v5 - Upgrade, Monitoring Kubernetes v5 - Upgrade, Monitoring Docker v5 - Upgrade
Installation instructions: Monitoring OpenShift - Installation, Monitoring Kubernetes - Installation, Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.16

The major feature of this release is self-monitoring of Collectord. With the metrics published to Splunk from Collectord, you can easily monitor the performance of the logging pipeline and Splunk HEC input. We have included many small bug fixes and usability improvements in this release as well.

Collectord Metrics

Collectord publishes metrics for connections to Splunk, how long the requests take, how large the lag is for the events sent in every batch, and many more. Now you can easily find out if your Splunk HEC is not performant enough to accept the number of logs sent from your clusters. To see data on this dashboard, make sure to update your configuration for OpenShift and Kubernetes and include the input input.collectord_metrics. These metrics can also be exported in Prometheus format. For that, you need to enable httpServerBinding under [general], and the metrics will be available under the path /metrics/prometheus.
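As a rough illustration of consuming the /metrics/prometheus endpoint mentioned above, the following Python sketch fetches and parses Prometheus text output. The base URL passed to `fetch_collectord_metrics` depends on the value you choose for httpServerBinding and is an assumption here, and the metric names in `SAMPLE` are made up for the example; they are not real Collectord metric names.

```python
from urllib.request import urlopen


def parse_prometheus_text(text):
    """Parse 'name{labels} value' lines into a dict, skipping comments.

    Simplification: assumes sample lines carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_collectord_metrics(base_url):
    """Fetch and parse metrics; base_url is a hypothetical host:port of Collectord."""
    with urlopen(base_url + "/metrics/prometheus", timeout=5) as response:
        return parse_prometheus_text(response.read().decode("utf-8"))


# Hypothetical payload used to exercise the parser without a running Collectord.
SAMPLE = """\
# HELP example_requests_total Hypothetical counter, not a real Collectord metric.
# TYPE example_requests_total counter
example_requests_total{output="splunk"} 42
example_lag_seconds 0.25
"""

parsed = parse_prometheus_text(SAMPLE)
```

In practice a Prometheus server would scrape this path directly; the parser is only meant to show what the exported text looks like when read by hand.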
More annotations for Prometheus inputs

With annotations for Prometheus metrics collection you can configure the caname of the certificate and include various Authorization headers.

New configurations

You can filter host file (input.files and input.journald) logs. Include the blacklist and whitelist patterns to reduce the number of logs from chatty hosts.

```ini
# Blacklisting and whitelisting the logs
# whitelist = ^regexp$
# blacklist = ^regexp$
```

Links

You can find more information about other minor updates by following the links below.

Release notes: Monitoring OpenShift - Release notes, Monitoring Kubernetes - Release notes, Monitoring Docker - Release notes
Upgrade instructions: Monitoring OpenShift v5 - Upgrade, Monitoring Kubernetes v5 - Upgrade, Monitoring Docker v5 - Upgrade
Installation instructions: Monitoring OpenShift - Installation, Monitoring Kubernetes - Installation, Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.16.361 (ARM support)

Yesterday we released a patch for Monitoring Kubernetes, OpenShift, and Docker, with version 5.16.361. This patch release focused on stability and compatibility improvements for the latest versions of Kubernetes and OpenShift.

Release notes: Monitoring OpenShift - Release notes, Monitoring Kubernetes - Release notes, Monitoring Docker - Release notes

With this release, we included ARM support for Monitoring Kubernetes and Monitoring Docker (RedHat OpenShift does not support ARM architecture at this moment). Now you can run Collectord on your ARM-powered nodes.

If you are hosting containers on AWS, we highly recommend looking at Graviton ARM-powered EC2 instances. You can find many articles and blog posts from AWS where they claim up to 50% price-performance benefits compared to x86_64 architecture. Most of the runtimes (Java, Go, Node, Python, Rust, etc.)
already support ARM, so the only thing you need to do is rebuild them for the new architecture.

We were also curious to test ARM-powered EC2 instances, so we ran our regular test of sending 10,000 events per second, 1KB each, to see how much benefit we could get from the ARM-powered EC2 instance. Collectord itself did not show much better performance (~5%). Still, we saw a 14% performance improvement overall and, most notably, a 20% reduction in cost (not taking into account that with the Graviton instance, we also more than doubled NVMe storage).

ARM images of Collectord are available with the suffix -arm64.

Links

Upgrade instructions: Monitoring OpenShift v5 - Upgrade, Monitoring Kubernetes v5 - Upgrade, Monitoring Docker v5 - Upgrade
Installation instructions: Monitoring OpenShift - Installation, Monitoring Kubernetes - Installation, Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.17

This release includes a lot of usability improvements for troubleshooting and configuring Collectord, and monitoring your clusters.

Using Cores/Milli-cores in the dashboards instead of percents

When we built the first versions of our dashboards, we showed the CPU usage of containers and pods the same way the Linux top command does. In that case, if one of the containers uses 2 CPU cores, our dashboards show 200%. However, over time we found that this isn't obvious for our customers, as they define limits and requests in milli-cores (or cores), and it takes some time to adjust to the percent values.

With version 5.17 we have changed the values of CPU usage from percent to cores. Instead of showing 199% (almost 2 CPU cores usage), we now show 1.990 (almost two cores are used). We keep percentages only when we show Usage relative to Requests or Limits.
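The arithmetic behind that change is simple, and can be sketched in a few lines of Python. The function names are hypothetical and only illustrate the conversion described above; they are not taken from the dashboards or from Collectord.

```python
def percent_to_cores(cpu_percent):
    """Convert top-style CPU percent (100% = one full core) to cores."""
    return cpu_percent / 100.0


def cores_to_millicores(cores):
    """Convert cores to milli-cores, the unit used for requests and limits."""
    return round(cores * 1000)


def usage_relative_to(cores_used, cores_allowed):
    """Usage as a percentage of a request or limit, both expressed in cores."""
    return cores_used / cores_allowed * 100.0


cores = percent_to_cores(199)             # almost two cores
label = f"{cores:.3f}"                    # three decimals, as in the example above
millicores = cores_to_millicores(cores)   # comparable with a limit such as 2000m
of_limit = usage_relative_to(1.0, 2.0)    # one core used out of a two-core limit
```

Here `label` is `1.990`, `millicores` is `1990`, and `of_limit` is `50.0`, which matches the idea that percentages remain meaningful only relative to a Request or Limit.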
Resource Quotas monitoring

We have updated the ConfigMap to watch the ResourceQuotas defined in the cluster and built a dashboard to monitor the usage. To use the Resource Quotas dashboard (under Review), make sure to configure watching those objects with the ConfigMap. Please use the latest ConfigMap from the installation instructions to see how it needs to be defined.

Update for the Clusters dashboard

We significantly updated the Clusters dashboard to include more information about the Limits, Requests, and Usage, so you can look over time at the maximum CPU Limit or Usage and plan your clusters' capacity.

License Server

Starting with version 5.17 you can point Collectord to a remote HTTP server to download the license key. That way, if you are managing more than one cluster, it takes only a change in one place to update the license on all of your clusters. You can use a custom HTTP server or an instance of Collectord to distribute the license. Please read how to configure the license server in your environments by following the links below.

License Server: Monitoring OpenShift v5 - License Server, Monitoring Kubernetes v5 - License Server, Monitoring Docker v5 - License Server

Links

You can find more information about other minor updates by following the links below.

Release notes: Monitoring OpenShift - Release notes, Monitoring Kubernetes - Release notes, Monitoring Docker - Release notes
Upgrade instructions: Monitoring OpenShift v5 - Upgrade, Monitoring Kubernetes v5 - Upgrade, Monitoring Docker v5 - Upgrade
Installation instructions: Monitoring OpenShift - Installation, Monitoring Kubernetes - Installation, Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.18

The main focus of this release was support for clusters using cgroupv2 and support for a different memory usage calculation.

Memory usage

There are a lot of ways to calculate memory usage.
Since the release of the first version of Collectord, we reported the true RSS memory, which means the actual memory that a Container or the Host uses. From the cgroupv1 definition:

"Only anonymous and swap cache memory is listed as part of 'rss' stat. This should not be confused with the true 'resident set size' or the amount of physical memory used by the cgroup. 'rss + mapped_file' will give you resident set size of cgroup."

But that might not match what you see in other tools, including OpenShift monitoring. They show you the memory that cannot be freed. The definition from the Metrics Server:

"In an ideal world, the 'working set' is the amount of memory in-use that cannot be freed under memory pressure. However, calculation of the working set varies by host OS, and generally makes heavy use of heuristics to produce an estimate. It includes all anonymous (non-file-backed) memory since Kubernetes does not support swap. The metric typically also includes some cached (file-backed) memory, because the host OS cannot always reclaim such pages."

With release 5.18, we have changed our calculation to show memory usage the same way Metrics Server collects it.

Cgroupv2

You will not find a lot of clusters deployed with the cgroupv2 controller. But Docker Desktop recently changed the default cgroup controller from v1 to v2. To support enthusiasts, we added support for the cgroupv2 controller as well. Collectord automatically detects if cgroupv2 is used instead of cgroupv1 and collects the metrics in the same manner as with cgroupv1.

Allow specifying the message group name for fields extraction

Collectord allows you to extract fields from Container logs. To specify what should be the message field, you had to keep the first group unnamed, and Collectord would automatically use it as the message for the event. We found that to be confusing for a lot of users, and it made some field extractions too complicated.
So we have added a simple annotation collectord.io/logs-extractionMessageField that specifies which field should be used as the message. As an example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  annotations:
    collectord.io/logs-extraction: '^(?P<ip_address>[^\s]+) .* \[(?P<timestamp>[^\]]+)\] (?P<message>.+)$'
    collectord.io/logs-timestampfield: timestamp
    collectord.io/logs-timestampformat: '02/Jan/2006:15:04:05 -0700'
    collectord.io/logs-extractionMessageField: message
spec:
  containers:
  - name: nginx
    image: nginx
```

arm64 support for OpenShift

OpenShift added support for arm64 nodes, and we have implemented images with arm64 architecture as well.

Links

You can find more information about other minor updates by following the links below.

Release notes: Monitoring OpenShift - Release notes, Monitoring Kubernetes - Release notes, Monitoring Docker - Release notes
Upgrade instructions: Monitoring OpenShift v5 - Upgrade, Monitoring Kubernetes v5 - Upgrade, Monitoring Docker v5 - Upgrade
Installation instructions: Monitoring OpenShift - Installation, Monitoring Kubernetes - Installation, Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.19

The main focus of this release was to implement feature requests that we received from our users and various configuration updates for the latest versions of Kubernetes, OpenShift and Docker.

(Kubernetes and OpenShift) Modifying objects streamed with Kubernetes Watch Input

Collectord can Watch and Stream any types of objects from the Kubernetes API. Both OpenShift and Kubernetes deployments have Kubernetes Watch inputs for Pods and ResourceQuotas enabled by default. And users always had the ability to add their own inputs for the types of objects they want: ConfigMaps, Deployments, and any other types of objects.
But there was an issue: if you wanted to also stream Secrets to Splunk, you did not want to expose secret values. With this release we have added the ability to remove some fields from the objects, or hash their values.

If you add secrets under resources in the ClusterRole collectorforkubernetes or collectorforopenshift to give Collectord access to those objects, you can add another input in 004-addon.conf:

```ini
[input.kubernetes_watch::secrets]
disabled = false
refresh = 10m
apiVersion = v1
kind = Secret
namespace =
type = kubernetes_objects
index =
output =
excludeManagedFields = true
# hash all fields before sending them to Splunk
modifyValues.object.data.* = hash:sha256
# remove annotations like last-applied-configuration not to expose values by accident
modifyValues.object.metadata.annotations.kubectl* = remove
```

One of the secrets that I had on my cluster was:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: bootstrap-token-5emitj
  namespace: kube-system
data:
  auth-extra-groups: c3lzdGVtOmJvb3RzdHJhcHBlcnM6a3ViZWFkbTpkZWZhdWx0LW5vZGUtdG9rZW4=
  expiration: MjAyMC0wOS0xM1QwNDozOToxMFo=
  token-id: NWVtaXRq
  token-secret: a3E0Z2lodnN6emduMXAwcg==
  usage-bootstrap-authentication: dHJ1ZQ==
  usage-bootstrap-signing: dHJ1ZQ==
immutable: true
```

Collectord forwarded this secret to Splunk and hashed all values under data.

The syntax of modifyValues. is simple: everything that follows the prefix is a path with a simple glob pattern, where * can be at the beginning or at the end of a path property. The value can be the function remove or hash:{hash_function}; the list of hash functions is the same as the one that can be applied with annotations.
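The effect of the two modifyValues rules above can be sketched in Python as follows. This is an illustration under assumptions, not Collectord's code: `modify_values` is a hypothetical helper, `fnmatchcase` accepts more glob syntax than the leading or trailing `*` described above, only `hash:sha256` is handled, and the hex encoding of the hash is an assumption about the output format.

```python
import hashlib
from fnmatch import fnmatchcase


def modify_values(obj, path_pattern, action):
    """Apply 'remove' or 'hash:sha256' to every value matching a dotted glob path."""
    _apply(obj, path_pattern.split("."), action)
    return obj


def _apply(node, segments, action):
    if not isinstance(node, dict):
        return
    head, rest = segments[0], segments[1:]
    for key in list(node):  # copy the keys, since entries may be deleted
        if not fnmatchcase(key, head):
            continue
        if rest:
            _apply(node[key], rest, action)
        elif action == "remove":
            del node[key]
        elif action == "hash:sha256":
            digest = hashlib.sha256(str(node[key]).encode("utf-8"))
            node[key] = digest.hexdigest()
        else:
            raise ValueError(f"unsupported action in this sketch: {action}")


event = {
    "object": {
        "metadata": {
            "name": "bootstrap-token-5emitj",
            "annotations": {
                "kubectl.kubernetes.io/last-applied-configuration": "{}",
                "team": "platform",
            },
        },
        "data": {"token-id": "NWVtaXRq", "usage-bootstrap-signing": "dHJ1ZQ=="},
    }
}

modify_values(event, "object.data.*", "hash:sha256")
modify_values(event, "object.metadata.annotations.kubectl*", "remove")
```

After both calls, every value under `data` is replaced by a 64-character hex digest, the `kubectl.kubernetes.io/last-applied-configuration` annotation is gone, and the unrelated `team` annotation is untouched. Matching is done per path segment against the keys at that level, which is why an annotation key that itself contains dots still matches `kubectl*`.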
You can read more about how to Stream and Query API Objects in:

Monitoring OpenShift: Streaming Kubernetes Objects from the API Server
Monitoring Kubernetes: Streaming Kubernetes Objects from the API Server

(OpenShift, Kubernetes) Allow overriding collectord.io annotations from Configurations

With the collectord.io annotations you can change how Collectord forwards events to Splunk HTTP Event Collector. In version 5.12 we also introduced Cluster Level Annotations, where you can define annotations for multiple Pods in your cluster by defining matching specs (for example, apply those annotations when the image name matches a regular expression pattern).

But if you already have an annotation, for example collectord.io/index=foo, defined on a Namespace, Deployment or Pod, and you try to apply this annotation from a Cluster Level Configuration as collectord.io/index=bar, the one from the objects takes priority. With this version we introduced a force modifier that overrides those annotations, even if you have them defined on the objects.

```yaml
apiVersion: "collectord.io/v1"
kind: Configuration
metadata:
  name: apply-to-all-nginx
  annotations:
    collectord.io/index: bar
spec:
  kubernetes_container_image: "^nginx(:.*)?$"
force: true
```

NOTE: if you have an annotation defined in the namespace as collectord.io/logs-index=foo, it will still take priority over index=bar, as logs-index=foo is type specific.

(Docker) Streaming system/df to get information about Docker Volumes

We have improved the Docker API input as well. Some API responses don't return arrays, but objects with properties containing arrays. One of them is system/df, which can return information about Volumes. By default, collectorfordocker now has input.docker_api::system enabled, which forwards information about Volumes.
The Monitoring Docker application now has a list of volumes under Review->Storage.

(OpenShift, Kubernetes, Docker) Monitoring if node needs to be rebooted

In version 5.12 we added the first diagnostics check, for node-entropy. In this release we have added a new one, [diagnostics::node-reboot-required], that monitors for the presence of files under /var/run/reboot-required* and writes ALARM-ON "node-reboot-required" in the logs. Applications now have an alert enabled that will notify you if some ALARMS are ON (entropy or reboot-required).

(Kubernetes, OpenShift) Improved work with Kubernetes API server, when watching Pods

Collectord was built from day one as a container-native logging solution. We take a different approach to collecting logs: we first watch for newly created containers, and only after that monitor container logs on the disk. When Collectord learns about a new Pod, it traverses the ownership tree to collect as much metadata as possible. That approach worked great for a while, but with the growing number of Operators, the ownership tree can be really large. That could cause 403 responses to requests from Collectord to the API Server, as there could be some resources that aren't allowed by the ClusterRole.

With this release Collectord has an API Gate that does not allow it to traverse the ownership tree through objects it does not have access to. Under [general.kubernetes] you just need to tell Collectord which clusterrole is used. For OpenShift that would be clusterrole = collectorforopenshift, for Kubernetes clusterrole = collectorforkubernetes. If this ClusterRole allows Collectord to read clusterroles, it will read it and use it to block any requests to the API Server that would not be allowed, so they do not cause 403 responses on the API Server.

Splunk output additional configurations

maximumMessageLength

You can configure maximumMessageLength to truncate messages before sending them to Splunk.
For example, if you define maximumMessageLength = 256K, Collectord truncates the message of every event whose length exceeds this size and adds the field collectord_errors=truncated to the event, allowing you to review truncated events.

requireExplicitIndex

This was a popular feature request. It adds an option to implement opt-out by default behavior for forwarding logs and metrics. If requireExplicitIndex is set to true, Collectord does not forward events (logs and metrics) that do not have an index explicitly configured with annotations or in the ConfigMap. By default, Collectord forwards those events with an empty index, and in the case of HTTP Event Collector, Splunk uses the default index set for the Token.

Links

You can find more information about other minor updates by following the links below.

Release notes: Monitoring OpenShift - Release notes, Monitoring Kubernetes - Release notes, Monitoring Docker - Release notes
Upgrade instructions: Monitoring OpenShift v5 - Upgrade, Monitoring Kubernetes v5 - Upgrade, Monitoring Docker v5 - Upgrade
Installation instructions: Monitoring OpenShift - Installation, Monitoring Kubernetes - Installation, Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.2 (Storage usage and alerts)

With version 5.2 we are increasing the observability of your clusters by providing information about Storage usage for the mounts where you run your runtime (docker or kubelet) and helping you react to issues faster with pre-built alerts. We updated several control plane dashboards to help you resolve issues raised by alerts, and improved performance for the Overview dashboard, so you will be able to find workloads and pods quicker.
On the server side, we also improved performance, added new annotations to help you reduce the amount of unnecessary data forwarded to Splunk, improved security by building our images on docker.io from scratch instead of alpine, and added self-diagnostic CLI commands that help you troubleshoot our deployments and verify configurations.

Application updates

Storage Dashboards

Under Review you will find a new dashboard, Storage, that shows information about used storage on the disk where you are running your runtime (docker, or kubelet in the case of the Monitoring Kubernetes and OpenShift applications).

The Monitoring OpenShift and Kubernetes applications show information for Persistent Volume Claims as well. These metrics are exported by the kubelet and have been available since Kubernetes 1.8 and OpenShift 3.9. Not all Persistent Volume providers support these metrics, but the majority do.

Alerts

We have added 26 alerts for the Monitoring OpenShift and Kubernetes applications, and 4 alerts for the Monitoring Docker application, that help you monitor the health of your clusters and the performance of your applications. Please review the predefined alerts for every application.

Alerts: Monitoring OpenShift v5 - Alerts, Monitoring Kubernetes v5 - Alerts, Monitoring Docker v5 - Alerts

These alerts are especially useful when you are managing your own control plane, as a lot of alerts help you monitor the health of the etcd cluster and communication between control plane components. We updated our control plane dashboards to help you react to these alerts. Below is an example of the etcd dashboard with information about communication between etcd members, and calls to the etcd cluster.

Collectord updates

Verify configuration

We are proud to have a solution that is easy to install and requires almost no changes in your existing infrastructure, but we still wanted to improve the user experience.
So we added a verify command that can help you test your existing configuration and tell you when something is wrong.

Additionally, we have added a diag command to collect diagnostic information. You can find more information about how to invoke the verify command in every environment:

Troubleshooting: Monitoring OpenShift v5 - Troubleshooting, Monitoring Kubernetes v5 - Troubleshooting, Monitoring Docker v5 - Troubleshooting

Security

The best way to deal with vulnerabilities in the image is not to have any components that can have vulnerabilities. We have switched our base image from alpine to scratch, which is a 0-size image that does not have anything in it. The images that we distribute through hub.docker.com/r/outcoldsolutions/ only have a base configuration, our statically compiled binary, root ca-certificates and the timezone database. You cannot run any shell scripts, install any additional software with package managers or perform any other actions. If you want to test connectivity to your Splunk instance, you can use the verify command.

New annotations

We added the ability to make forwarding of container logs opt-in, so that nothing is forwarded by default. That can be useful if you want to reduce licensing cost or the amount of unnecessary information forwarded to Splunk. For that, we added a second output, devnull, which you can set as the default output for any type of data, including container logs under input.files. With the annotation collectord.io/logs-output=splunk you can override the output for specific containers.

Change output destination: Monitoring OpenShift v5 - Annotations - Change output destination, Monitoring Kubernetes v5 - Annotations - Change output destination, Monitoring Docker v5 - Annotations - Change output destination

The second set of annotations is the override annotations, which can change the source, type and index of the events that match specific patterns.
Overriding index, source and type: Monitoring OpenShift v5 - Annotations - Overriding index source and type for specific events, Monitoring Kubernetes v5 - Annotations - Overriding index source and type for specific events, Monitoring Docker v5 - Annotations - Overriding index source and type for specific events

Links

You can find more information about other minor updates by following the links below.

Upgrade instructions: Monitoring OpenShift v5 - Upgrade, Monitoring Kubernetes v5 - Upgrade, Monitoring Docker v5 - Upgrade
Release notes: Monitoring OpenShift - Release notes, Monitoring Kubernetes - Release notes, Monitoring Docker - Release notes
Installation instructions: Monitoring OpenShift - Installation, Monitoring Kubernetes - Installation, Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.20

Version 5.20 of our applications, configurations, and Collectord is now available. In this blog post, we will cover some highlights of the release.

(OpenShift) Cluster Resource Quotas Dashboard

The Cluster Resource Quotas dashboard provides a high-level overview of the resource quotas for all namespaces in your cluster. The dashboard shows the current usage of CPU, memory, and storage for each namespace, as well as the total quota for each resource. The dashboard also shows the percentage of the quota that is currently in use. When selecting a quota, the dashboard shows a drill-down by namespace.

(Kubernetes and OpenShift) Pod conditions

In both applications, Monitoring Kubernetes and OpenShift, a new table is added to the Pod dashboard to show the Pod conditions.

(Kubernetes and OpenShift) Event duplication for multiple Splunk HTTP Event Collector endpoints

Collectord has supported multiple Splunk HTTP Event Collector endpoints since version 5.9, but each event could only be sent to a single endpoint. This was a limitation for users who wanted to send events to multiple Splunk HEC endpoints.
In this version, we have added support for sending events to multiple Splunk HEC endpoints at the same time. This version only supports forwarding container logs and volume logs to multiple Splunk HEC endpoints at the same time. With the annotation collectord.io/logs-output you can configure multiple Splunk HEC endpoints as a comma-separated list, for example splunk::apps and splunk::security: collectord.io/logs-output=splunk::apps,splunk::security. This assumes that you have them defined in the ConfigMap as [output.splunk::apps] and [output.splunk::security]. Additionally, you can configure indexes for the endpoints in square brackets, for example collectord.io/logs-output=splunk::apps[kubernetes_logs],splunk::security[kubernetes_security]. In that case, each event will be sent to both Splunk HEC endpoints.

(Kubernetes and OpenShift) Improvements for forwarding logs from Persistent Volumes

Collectord has supported forwarding logs from Persistent Volumes since version 5.11. The main purpose is to support the use case where you want to forward logs from applications running as Stateful Sets, like databases. The weak point of the current implementation is that Collectord stores acknowledgment information for each file on the host where the Persistent Volume is mounted. This means that if the Persistent Volume is remounted on a different host, Collectord will start sending logs from the beginning, as it will not be able to find the acknowledgment information. In this version, we have added support for storing acknowledgment information on the volume itself. You can add the annotation collectord.io/volume.{N}-logs-onvolumedatabase=true to the Pod to enable this feature. In that case Collectord creates a database file .collectord.db in the volume root that stores information about all the files already sent to Splunk.
So when the volume is attached to a different host, another Collectord instance will be able to find the database file and start forwarding logs from the last acknowledged position. One important detail about this feature: you need to mount the /rootfs directory in the Collectord container with write access. By default, the /rootfs directory is mounted as read-only.

(Kubernetes and OpenShift) Support for placeholders in the glob configuration for Persistent Volumes

If you are mounting the same volume to multiple Pods and you want to differentiate the logs, you can now specify placeholders in the glob configuration. For example, if you have a volume mounted to the Pod with the name my-pod and to the Pod with the name my-pod-2, you can specify the glob configuration like this: {{kubernetes_pod_name}}.log, so Collectord will be able to identify that the files my-pod.log and my-pod-2.log are coming from different Pods.

(Kubernetes, OpenShift and Docker) multi-architecture images

Collectord has been available for the amd64 and arm64 architectures since version 5.16.361, but the arm64 image was always published as a separate image with the prefix -arm64. Since version 5.20.400, we are providing multi-platform images for amd64 and arm64 architectures. This simplifies the installation process for users with multi-architecture clusters.

(Kubernetes and OpenShift) Support for sending logs to Elasticsearch and OpenSearch

Large teams might have different requirements for the log management system. Some teams might prefer to use Elasticsearch or OpenSearch for log management. In this version, we have added support for sending logs to Elasticsearch and OpenSearch. You can install Collectord with Elasticsearch or OpenSearch support and run it in the same cluster as Collectord for Splunk. In that case, you can configure Collectord to send logs to both Splunk and Elasticsearch or OpenSearch. Please read the blog post for more details.
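To make the glob placeholder feature for Persistent Volumes described above more concrete, here is a minimal Python sketch of the idea. It is not Collectord's implementation: the helper names expand_glob and files_for_pod are hypothetical, and it assumes the placeholder is substituted with the Pod name before the glob is matched against file names on the shared volume.

```python
import fnmatch

# Placeholder name taken from the example above.
PLACEHOLDER = "{{kubernetes_pod_name}}"


def expand_glob(template: str, pod_name: str) -> str:
    """Substitute the Pod name into the glob template (hypothetical helper)."""
    return template.replace(PLACEHOLDER, pod_name)


def files_for_pod(template: str, pod_name: str, files: list[str]) -> list[str]:
    """Return the files on a shared volume that belong to the given Pod."""
    pattern = expand_glob(template, pod_name)
    # fnmatchcase applies shell-style glob matching without case folding.
    return [name for name in files if fnmatch.fnmatchcase(name, pattern)]


if __name__ == "__main__":
    shared_volume = ["my-pod.log", "my-pod-2.log"]
    for pod in ("my-pod", "my-pod-2"):
        print(pod, files_for_pod("{{kubernetes_pod_name}}.log", pod, shared_volume))
```

Because each Pod resolves the template to a different pattern, the two files on the same volume are attributed to different Pods even though both Pods see both files.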
Links

You can find more information about other minor updates by following the links below.

Release notes
- Monitoring OpenShift - Release notes
- Monitoring Kubernetes - Release notes
- Monitoring Docker - Release notes

Upgrade instructions
- Monitoring OpenShift v5 - Upgrade
- Monitoring Kubernetes v5 - Upgrade
- Monitoring Docker v5 - Upgrade

Installation instructions
- Monitoring OpenShift - Installation
- Monitoring Kubernetes - Installation
- Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.3

We are happy to share with you a minor update of our solutions for Monitoring Docker, Kubernetes and OpenShift. This update brings improved capabilities for monitoring multiple clusters within one application, better observability for the state of the forwarded data, and also insights into Splunk usage.

New annotations

Hashing sensitive data

If you need to hide sensitive data (for example, to hide PII and be compliant with GDPR), we suggest using replace patterns, so that you can replace IP addresses with static values like X.X.X.X. But that can complicate observability if you want to follow a trace or see all the requests from a specific IP address. Now, by using hashing functions, you get the same hash for the same IP address, which helps you to identify matching values. With the annotation logs-hashing.1-match you can specify a match regexp.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  annotations:
    collectord.io/logs-hashing.1-match: '(\d{1,3}\.){3}\d{1,3}'
spec:
  containers:
  - name: nginx
    image: nginx
```

The default hashing function is sha256, so the resulting hash value can be larger than the source value.

```text
EsoXtJryKJQ28wPgFmAwoh5SXSZuIJJnQzgBqP1AcaA - - [18/Nov/2018:01:25:27 +0000] "GET /404 HTTP/1.1" 404 153 "-" "Wget" "-"
```

But you can specify the hash function.
For example, when we set collectord.io/logs-hashing.1-function: 'fnv-1a-64' to minimize the length of the hash result, we get a shorter hash value:

```text
qrr-cQTZFL4 - - [18/Nov/2018:01:27:17 +0000] "GET /404 HTTP/1.1" 404 153 "-" "Wget" "-"
```

- Monitoring OpenShift v5 - Annotations
- Monitoring Kubernetes v5 - Annotations
- Monitoring Docker v5 - Annotations

Annotations for specific container

Pods can have more than one container, but you cannot specify annotations on the container level. With version 5.3 we allow defining container-specific annotations with the format collectord.io/{container_name}--{annotation}: {annotation-value}. As an example, if you have an nginx container running alongside other containers in the same Pod, and you want to define different annotations for each of them, you can do that as in this example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  annotations:
    collectord.io/nginx--logs-hashing.1-match: '(\d{1,3}\.){3}\d{1,3}'
    collectord.io/get-trigger--logs-output: devnull
spec:
  containers:
  - name: nginx
    image: nginx
  - name: get-trigger
    image: busybox
    args: [/bin/sh, -c,
           'while true; do wget -qO- localhost:80; sleep 5; done']
```

In that example, the annotation logs-hashing.1-match is applied only to the nginx container, and logs-output only to the get-trigger container.

Other annotations

- collectord.io/logs-joinmultiline - disable multi-line joining for the Pod
- collectord.io/logs-disabled - completely disable log processing. The difference from logs-output=devnull is that with the devnull output Collectord still reads the logs, so if you change the output later, Collectord will start processing logs right from the moment when you changed the output. If you change disabled=true to false, Collectord will start forwarding logs from this container as if it were a new container, starting from the beginning of the log files.

Improved observability

We have added several alerts that can help you to troubleshoot issues with Collectord.
- An alert that fires when Collectord reports errors in the processing pipeline, for example when it fails to extract the fields.
- An alert that fires when Collectord reports Warning messages, which can identify issues with access to the API Server, or show that not all the requests to Splunk HEC can be delivered on the first attempt.
- An alert about the lag between the event time and the indexing time, which can identify issues with the performance of Collectord or the Splunk indexing pipeline.

Reducing Splunk Licensing cost for Network Socket Data and Events

We improved identification of the events that we have already sent to Splunk. That allows us to reduce the number of events Collectord forwards to Splunk. With a very high number of events, that can be a significant change. In version 5.3 Collectord groups network socket connections that share the same remote and local IP. For example, if a local container has two connections

```text
remote_addr | remote_port | local_addr | local_port | protocol | tcp_state | time
10.128.0.3  | 9090        | 10.128.0.1 | 55338      | tcp      | TIME_WAIT | 2018-11-17 16:53:03.668
10.128.0.3  | 9090        | 10.128.0.1 | 55432      | tcp      | TIME_WAIT | 2018-11-17 16:53:03.668
```

With version 5.3 Collectord groups them and adds an additional field, connections:

```text
remote_addr | remote_port | local_addr | local_port  | protocol | tcp_state | time                    | connections
10.128.0.3  | 9090        | 10.128.0.1 | 55338-55432 | tcp      | TIME_WAIT | 2018-11-17 16:53:03.668 | 2
```

We have found that this grouping can reduce the licensing cost of network socket table data by a factor of 4. You can also see how much licensing cost is taken by the application with the Splunk Usage dashboard.

Performance improvements

With version 5.3 we significantly improved memory usage and log processing performance. You can see the result in a separate blog post, Performance comparison between Collectord, Fluentd and Fluent-bit.

Links

You can find more information about other minor updates by following the links below.
Upgrade instructions
- Monitoring OpenShift v5 - Upgrade
- Monitoring Kubernetes v5 - Upgrade
- Monitoring Docker v5 - Upgrade

Release notes
- Monitoring OpenShift - Release notes
- Monitoring Kubernetes - Release notes
- Monitoring Docker - Release notes

Installation instructions
- Monitoring OpenShift - Installation
- Monitoring Kubernetes - Installation
- Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.4

We are excited to share with you a minor update of our solutions for Monitoring Docker, Kubernetes and OpenShift. This update brings bug fixes and improvements.

Attaching EC2 Metadata

You can attach EC2 Metadata to all the forwarded and collected data. Collectord reads this metadata from the Instance Metadata and User Data. As an example, if you are running Collectord for Docker, you can add additional configuration for EC2 Metadata, and specify that you want to include two fields, ec2_instance_type and ec2_instance_id, with

```bash
...
--env "COLLECTOR__EC2_INSTANCE_ID=general__ec2Metadata.ec2_instance_id=/latest/meta-data/instance-id" \
--env "COLLECTOR__EC2_INSTANCE_TYPE=general__ec2Metadata.ec2_instance_type=/latest/meta-data/instance-type" \
...
```

Please read more about this feature:

- Monitoring OpenShift v5 - Attaching EC2 Metadata
- Monitoring Kubernetes v5 - Attaching EC2 Metadata
- Monitoring Docker v5 - Attaching EC2 Metadata

Monitoring Kubernetes: CoreDNS dashboard

If you are using CoreDNS in Kubernetes, you can collect metrics exported in Prometheus format, and we have provided for you a Dashboard and Alerts for monitoring CoreDNS.
To start collecting metrics from CoreDNS, you need to annotate the CoreDNS deployment to let Collectord know that you want to collect these metrics:

```bash
kubectl annotate deployment/coredns --namespace kube-system 'collectord.io/prometheus.1-path=/metrics' 'collectord.io/prometheus.1-port=9153' 'collectord.io/prometheus.1-source=coredns' --overwrite
```

Read more about this at Monitoring Kubernetes v5 - CoreDNS.

Links

You can find more information about other minor updates by following the links below.

Upgrade instructions
- Monitoring OpenShift v5 - Upgrade
- Monitoring Kubernetes v5 - Upgrade
- Monitoring Docker v5 - Upgrade

Release notes
- Monitoring OpenShift - Release notes
- Monitoring Kubernetes - Release notes
- Monitoring Docker - Release notes

Installation instructions
- Monitoring OpenShift - Installation
- Monitoring Kubernetes - Installation
- Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.5

The first release of 2019 is out. We have added support for AWS ECS services and Docker Swarm Services, and also a dashboard for reviewing quotas and limits for the Kubernetes namespaces and OpenShift projects.

Monitoring Docker - ECS and Swarm Services

If you are using Docker Swarm or AWS ECS as orchestration tools, we have an update for you in Monitoring Docker v5.5. Under Services you can find two dashboards, one for each orchestration tool. With these dashboards you will be able to monitor containers running on multiple hosts under the same service.

- AWS ECS Services
- Docker Swarm

Monitoring Kubernetes Namespaces and OpenShift Projects

Under Review you can find a new dashboard that will help you to review requests and allocations for the OpenShift Projects or Kubernetes Namespaces, and also review limits and requests for Pods running under these projects and namespaces.

Links

You can find more information about other minor updates by following the links below.
Upgrade instructions
- Monitoring OpenShift v5 - Upgrade
- Monitoring Kubernetes v5 - Upgrade
- Monitoring Docker v5 - Upgrade

Release notes
- Monitoring OpenShift - Release notes
- Monitoring Kubernetes - Release notes
- Monitoring Docker - Release notes

Installation instructions
- Monitoring OpenShift - Installation
- Monitoring Kubernetes - Installation
- Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.6

Version 5.6 brings dark theme support, a refreshed Logs dashboard (free-text search and more control), support for auto-refreshing dashboards, and bug fixes. With Collectord 5.6 we included support for log sampling (random and hash-based), small improvements and bug fixes.

Auto-refresh

Every dashboard now has an option that allows you to specify how often you want to refresh the dashboard. This is useful for keeping the dashboard open for a long time to monitor the performance of your applications.

Dark Theme

You can find a link on every dashboard that will allow you to switch to a dark theme. There is a limitation in Splunk that does not allow us to keep the preference, so if you navigate with the navigation bar at the top, the theme configuration will be switched back to light. With the auto-refresh configuration you can keep the dashboards open on a large screen.

Updated Logs dashboard

We have updated the Logs dashboard, including the possibility to add a free-text search and to specify the limit for the number of logs you want to see. We also added a visualization to quickly identify the period of time of the log messages.

Logs sampling

- Monitoring OpenShift v5 - Annotations - Logs sampling
- Monitoring Kubernetes v5 - Annotations - Logs sampling
- Monitoring Docker v5 - Annotations - Logs sampling

Example 1.
Random-based sampling

When the application produces a high volume of logs, in some cases it could be enough to just look at a sampled portion of the logs to understand how many failed requests the application has, or how it behaves. You can add an annotation for the logs to specify the percentage of logs that should be forwarded to Splunk. In the following example, the application produces 300,000 log lines. Only about 60,000 log lines are going to be forwarded to Splunk.

```bash
docker run -d --rm \
  --label 'collectord.io/logs-sampling-percent=20' \
  docker.io/mffiedler/ocp-logtest:latest \
  python ocp_logtest.py --line-length=1024 --num-lines=300000 --rate 60000 --fixed-line
```

Example 2. Hash-based sampling

In situations where you want to look at the pattern for a specific user, you can specify that you want to sample logs based on a hash value, to ensure that if the same key is present in two different log lines, both of them will be forwarded to Splunk. In the following example, we define the key (it should be a named submatch pattern) as an IP address.

```bash
docker run -d --rm \
  --label 'collectord.io/logs-sampling-percent=20' \
  --label 'collectord.io/logs-sampling-key=^(?P<key>(\d+\.){3}\d+)' \
  nginx
```

To test it, we can run 500 containers with different IP addresses (make sure to change the IP address to that of the original container that we ran from the nginx image). Make sure that you have enough capacity to run 500 additional containers.

```bash
seq 500 | xargs -L 1 -P 10 docker run -d -it alpine:3.8 sh -c 'apk add bash curl 1>/dev/null 2>&1 && bash -c "(while true; do curl --silent 172.17.0.7:80; sleep 5; done;) 1>/dev/null 2>&1"'
```

From this example, we can see that only data from 69 containers was forwarded to Splunk (you will get closer to 20% with a much higher number of different values). But each forwarded IP address has more than one event in Splunk.
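The idea behind hash-based sampling can be sketched in a few lines of Python. This is an illustration only, not Collectord's actual algorithm: the function name should_forward, the use of SHA-256, the modulo-100 bucketing, and the choice to drop lines without a key are all assumptions. Only the regular expression and the 20% setting come from the example above.

```python
import hashlib
import re

# Regular expression from the example above; the named group `key`
# selects the value that the sampling decision is based on.
KEY_PATTERN = re.compile(r"^(?P<key>(\d+\.){3}\d+)")


def should_forward(line: str, percent: int, pattern: re.Pattern = KEY_PATTERN) -> bool:
    """Return True if the line falls into the sampled share of keys."""
    match = pattern.search(line)
    if match is None:
        # Assumption: lines without a key are dropped in this sketch.
        return False
    digest = hashlib.sha256(match.group("key").encode("utf-8")).digest()
    # Map the key to a stable bucket in [0, 100); the same key always
    # lands in the same bucket, so its lines are kept or dropped together.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    lines = [f"172.17.0.{i} - - \"GET / HTTP/1.1\" 200" for i in range(1, 255)]
    kept = [line for line in lines if should_forward(line, 20)]
    print(f"forwarded {len(kept)} of {len(lines)} distinct keys")
```

Because the decision depends only on the key, every log line from a sampled IP address is forwarded, which matches the observation above that each forwarded address has more than one event. With few distinct keys, the forwarded share can deviate noticeably from the configured percentage.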
Forward annotations

You can configure collectorforopenshift and collectorforkubernetes to include annotations, similarly to how we attach labels. Labels in Kubernetes usually carry data that identifies the pods or workloads. Annotations in most cases carry data that is useful for other tools and Kubernetes components, but sometimes they contain valuable information that you might want to forward to Splunk as well. As an example, if you want to forward the annotation openshift.io/display-name= for the OpenShift Projects, you can add this configuration to the ConfigMap:

```ini
[general.kubernetes]

includeAnnotations.displayName = openshift\.io\/display-name
```

After that you can find these annotations attached to the logs and stats.

Change in licensing

We have changed the way we generate a unique InstanceID for every Collectord instance, to take into account that you might need to run multiple Collectord instances per host/node in order to send data to different Splunk Clusters. Now every Collectord instance on a host has the same InstanceID, unique to that host, which allows us to count it as one host.

Links

You can find more information about other minor updates by following the links below.

Release notes
- Monitoring OpenShift - Release notes
- Monitoring Kubernetes - Release notes
- Monitoring Docker - Release notes

Upgrade instructions
- Monitoring OpenShift v5 - Upgrade
- Monitoring Kubernetes v5 - Upgrade
- Monitoring Docker v5 - Upgrade

Installation instructions
- Monitoring OpenShift - Installation
- Monitoring Kubernetes - Installation
- Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.7 - Journald input

Version 5.7 of our applications and Collectord includes bug fixes and a new input that allows you to forward logs directly from Journald.

Journald input

For OpenShift clusters, we previously recommended using rsyslog to forward messages from journald to /var/log/messages.
Now you can uninstall rsyslog if you don't need it anymore and forward messages directly from journald. You can find the reference for the journald input in the configurations for Docker, Kubernetes, and OpenShift:

- Docker
- OpenShift
- Kubernetes

As follows:

```ini
[input.journald]

# disable host level logs
disabled = false

# root location of log files
path = /rootfs/var/log/journal/

# when reach end of journald, how often to pull
pollingInterval = 250ms

# if you don't want to forward journald from the beginning,
# set the oldest event in relative value, like -14h or -30m or -30s (h/m/s supported)
startFromRel =

# override type
type = kubernetes_host_logs

# specify Splunk index
index =

# sample output (-1 does not sample, 20 - only 20% of the logs should be forwarded)
samplingPercent = -1

# sampling key (should be regexp with the named match pattern `key`)
samplingKey =

# set output (splunk or devnull, default is [general]defaultOutput)
output =
```

In the case of Kubernetes and OpenShift clusters, include it in your ConfigMap in the file 002-daemonset.conf. If you are upgrading from the previous version of the application, we recommend specifying

```ini
startFromRel = -1h
```

This tells Collectord to start reading journald from one hour back. Considering that you have already forwarded all the host logs from /var/log/messages, this will minimize the amount of journald logs forwarded on the first start and cause fewer duplicates in Splunk.

Links

You can find more information about other minor updates by following the links below.
Release notes
- Monitoring OpenShift - Release notes
- Monitoring Kubernetes - Release notes
- Monitoring Docker - Release notes

Upgrade instructions
- Monitoring OpenShift v5 - Upgrade
- Monitoring Kubernetes v5 - Upgrade
- Monitoring Docker v5 - Upgrade

Installation instructions
- Monitoring OpenShift - Installation
- Monitoring Kubernetes - Installation
- Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.8 - Usability improvements and bug fixes

With version 5.8 we include usability improvements for dashboards and various bug fixes.

Multiselection on Dashboards

With the multiselection and custom values you can easily filter data on dashboards by including multiple namespaces and wildcard filters. Most of the dashboards, including events and container logs, now allow you to easily select multiple filters and include wildcard-type filters.

Critical pods and PriorityClass for Kubernetes and OpenShift

We have updated our configuration and included the scheduler.alpha.kubernetes.io/critical-pod: '' annotation for Kubernetes versions below 1.14 and OpenShift versions below 3.11. For Kubernetes 1.14 and OpenShift 3.11 we added a PriorityClass with a value below cluster and node critical pods. That adds Guaranteed Scheduling for Collectord pods, to make sure that you can collect metrics and logs in any critical situation. For details, please review the configuration for OpenShift and Kubernetes:

- Monitoring OpenShift v5 - Configuration
- Monitoring Kubernetes v5 - Configuration

Links

You can find more information about other minor updates by following the links below.
Release notes
- Monitoring OpenShift - Release notes
- Monitoring Kubernetes - Release notes
- Monitoring Docker - Release notes

Upgrade instructions
- Monitoring OpenShift v5 - Upgrade
- Monitoring Kubernetes v5 - Upgrade
- Monitoring Docker v5 - Upgrade

Installation instructions
- Monitoring OpenShift - Installation
- Monitoring Kubernetes - Installation
- Monitoring Docker - Installation

Monitoring Docker, OpenShift and Kubernetes - Version 5.9 - Support for multiple Splunk Clusters, streaming API Objects

With this release we improved capabilities for streaming data to multiple Splunk Clusters and support for deploying multiple Collectord instances on the same node (in case you need to stream the same data to multiple clusters), and added a new capability to stream objects and changes from the API Server. This release also includes a journald input fix. We have found that in the previous version Collectord could hold the file descriptors of rotated journald files. If you are using the Journald input (enabled by default), please upgrade.

Streaming API Objects

Starting with version 5.9 you can stream all changes from the Kubernetes and Docker API servers to Splunk. That is useful if you want to monitor all changes to Workloads or ConfigMaps in Splunk, or if you want to recreate the Kubernetes Dashboard experience in Splunk. With the default configuration we don't forward any objects from the API Server except events. Please follow the updated documentation to set up streaming of the Kubernetes and Docker API Objects to Splunk.
- Monitoring OpenShift v5 - Streaming OpenShift Objects from the API Server
- Monitoring Kubernetes v5 - Streaming Kubernetes Objects from the API Server
- Monitoring Docker v5 - Streaming Docker Objects from the API

Support for multiple Splunk Clusters or Splunk Tokens

In case you want to use multiple HTTP Event Collector Tokens, or forward data from different namespaces to different Splunk Clusters, you can define more than one Splunk Output in the configuration. In the default ConfigMap (or configuration for Docker) we include only the default Splunk output under the stanza [output.splunk]. You can define additional outputs and name them, like

```ini
[output.splunk::prod1]
url = https://prod1.hec.example.com:8088/services/collector/event/1.0
token = AF420832-F61B-480F-86B3-CCB5D37F7D0D
```

See the details on how to configure the outputs:

- Monitoring OpenShift v5 - Configurations for Splunk HTTP Event Collector - Support for multiple Splunk clusters
- Monitoring Kubernetes v5 - Configurations for Splunk HTTP Event Collector - Support for multiple Splunk clusters
- Monitoring Docker v5 - Configurations for Splunk HTTP Event Collector - Support for multiple Splunk clusters

Using annotations you can override the default Splunk output and define the Splunk Cluster to which you want to redirect the data from a namespace, pod or container. In the example below, we forward all the data from a specific namespace to the Splunk output prod1:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prod1-namespace
  annotations:
    collectord.io/output: 'splunk::prod1'
```

- Monitoring OpenShift v5 - Annotations - Change output destination
- Monitoring Kubernetes v5 - Annotations - Change output destination
- Monitoring Docker v5 - Annotations - Change output destination

Improved support for multiple Collectord deployments

If you need to stream the same data to multiple Splunk deployments you can easily deploy more than one Collectord on one node.
Some configuration changes are required in order to ensure that the deployments will not conflict with each other, primarily the location of the database that stores acknowledgement data. Before 5.9, annotations would be applied to all the Collectord deployments. From version 5.9 you can define a subdomain for the annotations under [general] with the key annotationsSubdomain, for example

```ini
[general]
annotationsSubdomain = prod1
```

After that, for this specific deployment you can use annotations such as prod1.collectord.io/index=foo.

- Monitoring OpenShift v5 - Annotations
- Monitoring Kubernetes v5 - Annotations
- Monitoring Docker v5 - Annotations

Links

You can find more information about other minor updates by following the links below.

Release notes
- Monitoring OpenShift - Release notes
- Monitoring Kubernetes - Release notes
- Monitoring Docker - Release notes

Upgrade instructions
- Monitoring OpenShift v5 - Upgrade
- Monitoring Kubernetes v5 - Upgrade
- Monitoring Docker v5 - Upgrade

Installation instructions
- Monitoring OpenShift - Installation
- Monitoring Kubernetes - Installation
- Monitoring Docker - Installation

Monitoring Docker, OpenShift, Kubernetes - Version 5.24

Version 5.24 of our applications, configurations, and Collectord is now available. In this blog post, we will cover some highlights of the release.

Forward Prometheus metrics to Splunk Metrics Index

In this release, we have added the ability to forward Prometheus metrics to the Splunk Metrics Index. We suggest configuring an additional Splunk output that points to the metrics index (or multiple metrics indexes):

```ini
[output.splunk::metrics]
url = https://mysplunk.mydomain:8088/services/collector/event/1.0
token = 00000000-0000-0000-0000-000000000000
```

The token should be configured to write by default to the metrics index.
When configuring a Prometheus collection with annotations, you can set indexType=metrics in the annotation, and optionally you can configure the index and the output.

```yaml
collectord.io/prometheus.1-port: '9113'
collectord.io/prometheus.1-path: '/metrics'
collectord.io/prometheus.1-index: 'openshift_metrics'
collectord.io/prometheus.1-output: 'splunk::metrics'
collectord.io/prometheus.1-indexType: 'metrics'
```

After that you can use Analytics to search the metrics in the index.

Unix timestamps can be parsed from application logs

Now you can use the format @unixtimestamp when configuring application log parsing. For example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  annotations:
    collectord.io/logs-extraction: '^(?P<timestamp>\d+)\s$'
    collectord.io/logs-timestampfield: timestamp
    collectord.io/logs-timestampformat: '@unixtimestamp'
spec:
  containers:
  - name: nginx
    image: nginx
```

When you configure application logs, you can lock files to prevent multiple readers

When you configure application logs from PVC volumes, you can lock files to prevent multiple readers. If more than one instance of the application is running, and they all use the same PVC volume, you can use the annotation

```yaml
collectord.io/volume.1-logs-withlock: 'true'
```

And only one instance of Collectord will read the logs. For example, in this configuration, when just one Pod is running and annotations point to the same PVC volume, the logs will be read by only one instance of Collectord and will be forwarded only once.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-load-test-volume
  annotations:
    collectord.io/volume.1-logs-name: 'logs-volume-lock'
    collectord.io/volume.1-logs-withlock: 'true'
    collectord.io/volume.1-logs-type: 'lock-test-1'
    collectord.io/volume.1-logs-onvolumedatabase: 'true'
    collectord.io/volume.2-logs-name: 'logs-volume-lock'
    collectord.io/volume.2-logs-withlock: 'true'
    collectord.io/volume.2-logs-type: 'lock-test-2'
    collectord.io/volume.2-logs-onvolumedatabase: 'true'
spec:
  restartPolicy: Never
  volumes:
  - name: logs-volume-lock
    emptyDir: {}
  containers:
  ...
```

In this version, we significantly improved the performance of the acknowledgment database, including concurrent usage.

Other significant changes

- Included a new alert for Kubernetes and OpenShift based on Node conditions: "Cluster Warning: Node Condition".
- Added the ability to hide process command line arguments (with annotations or globally).
- Improved support for Rancher configuration; in cases where volumeRootDir or container logs point to a symlink, Collectord will resolve the symlink correctly.
- Various bug fixes and improvements can be found in the release notes.

Links

You can find more information about other minor updates by following the links below.

Release notes
- Monitoring OpenShift - Release notes
- Monitoring Kubernetes - Release notes
- Monitoring Docker - Release notes

Upgrade instructions
- Monitoring OpenShift v5 - Upgrade
- Monitoring Kubernetes v5 - Upgrade
- Monitoring Docker v5 - Upgrade

Installation instructions
- Monitoring OpenShift - Installation
- Monitoring Kubernetes - Installation
- Monitoring Docker - Installation

Monitoring Docker, OpenShift, Kubernetes and Linux - Version 5.21

Version 5.21 of our applications, configurations, and Collectord is now available. In this blog post, we will cover some highlights of the release.
CPU (Throttled, Limits, Requests) dashboard

We have added a new dashboard to the Review dashboards family. This dashboard shows the CPU usage of the containers in the cluster, including configured Limits and Requests and the throttled CPU usage. This dashboard will help you to properly configure the CPU limits and requests for your containers.

Global sanitization of forwarded logs and events

In this release, we have added a new feature to sanitize the logs and events before forwarding them to the backend. You can configure a global replacement pipe for all the host logs, container logs, and events that are forwarded to the backend. For example, you can search for all mentions of password= followed by a value and replace them with password=******** in all the logs and events.

```ini
[pipe.replace::passwords]
patternRegex = (password=)([^\s]+)
replace = $1********
```

Improvements for streaming objects from API Server

Collectord has allowed you to stream objects from the API Server for a long time. It was easy to configure it to forward the objects only from a specific namespace, but it was not simple to stream all namespaces except a few. This version brings filtering capabilities for streaming objects from the API server. For example, you can tell Collectord to stream all the pods except the ones from the namespace0 namespace, or stream only the pods from the namespace1 and namespace2 namespaces.

```ini
[input.kubernetes_watch::pods]
# You can exclude events by namespace with blacklist or whitelist only required namespaces
# blacklist.kubernetes_namespace = ^namespace0$
# whitelist.kubernetes_namespace = ^((namespace1)|(namespace2))$
```

Podman support

You can use our Monitoring Docker application and the collectorfordocker image to monitor your Podman containers. Currently, we only support journald as a logging driver. As the k8s-file logging driver does not keep rotated files, we do not suggest using it in production.
bash Copy 1podman run -d \ 2 --name collectorforpodman \ 3 --volume /:/rootfs:ro \ 4 --volume collector_data:/data/ \ 5 --cpus=2 \ 6 --cpu-shares=1024 \ 7 --memory=512M \ 8 --restart=always \ 9 --env "COLLECTOR__SPLUNK_URL=output.splunk__url=..." \ 10 --env "COLLECTOR__SPLUNK_TOKEN=output.splunk__token=..." \ 11 --env "COLLECTOR__SPLUNK_INSECURE=output.splunk__insecure=true" \ 12 --env "COLLECTOR__EULA=general__acceptLicense=true" \ 13 --env "COLLECTOR__LICENSE_KEY=general__license=..." \ 14 --env "COLLECTOR__GENERALPODMAN_URL=general.docker__url=unix:///rootfs/var/run/podman/podman.sock" \ 15 --env "COLLECTOR__GENERALPODMAN_STORAGE=general.docker__dockerRootFolder=/rootfs/var/lib/" \ 16 --ulimit nofile=1048576:1048576 \ 17 --privileged \ 18 outcoldsolutions/collectorfordocker:{{ collectorfordocker_version }} Other major changes Compatibility updates for the latest version of Kubernetes, OpenShift and Docker Allows you to configure time precision for events forwarded to Splunk, default is milliseconds, but you can change it to microseconds or nanoseconds Automatically refresh Kubernetes API Token if it is expired Upgrade libraries to debian:bookworm, Go runtime to 1.21.3, and SQLite to 3.43.1 Show UDP connections in network socket tables Monitoring Linux upgraded to the latest version of Collectord To review all the changes, you can follow one of the Release notes links below. Links You can find more information about other minor updates by following the links below. 
Release notes Monitoring OpenShift - Release notes Monitoring Kubernetes - Release notes Monitoring Docker - Release notes Upgrade instructions Monitoring OpenShift v5 - Upgrade Monitoring Kubernetes v5 - Upgrade Monitoring Docker v5 - Upgrade Installation instructions Monitoring OpenShift - Installation Monitoring Kubernetes - Installation Monitoring Docker - Installation Blog Blog - Release 5.22 Blog Blog - Release 5.22 Monitoring Docker, OpenShift, Kubernetes and Linux - Version 5.22 Version 5.22 of our applications, configurations, and Collectord is now available. In this blog post, we will cover some highlights of the release. Disk Stats Dashboard Under Review->Disk Stats you can find a new dashboard showing statistics of all the mounted disks on the host. User defined Splunk outputs We have received this feature request from a few customers, and we are happy to announce that it is now available. Users can define a Splunk output with the CustomResourceDefinition (CRD) SplunkOutput in their namespace. For example yaml Copy 1apiVersion: "collectord.io/v1" 2kind: SplunkOutput 3metadata: 4 name: splunk-user-output-for-deployment 5spec: 6 token: 1a8b9c3e-7789-4353-821f-15b9662bac99 7 url: https://splunk.example.com:8088/services/collector/event/1.0 8 insecure: true Similarly to how you can reference the default Splunk outputs defined in the ConfigMap, you can reference them with an annotation yaml Copy 1apiVersion: apps/v1 2kind: Deployment 3metadata: 4 name: long-running 5 annotations: 6 collectord.io/output: splunk::user/default/splunk-user-output-for-deployment 7spec: 8 ... You define it as splunk::user/<namespace>/<name>. To use this feature, you need to update your configuration file and include the definition of the CustomResourceDefinition SplunkOutput. Other significant changes Monitoring Kubernetes and OpenShift applications show Pod Ownership, PriorityClass and Pod Requests and Limits in the Workload dashboard. 
Added additional metrics CPU IOWait, Steal and Idle.
You can blacklist labels from forwarded metadata.
New diagnostic: CPU Vulnerabilities.

You can read all other changes and bug fixes in the release notes below.

Links

You can find more information about other minor updates by following the links below.

Release notes
Monitoring OpenShift - Release notes
Monitoring Kubernetes - Release notes
Monitoring Docker - Release notes

Upgrade instructions
Monitoring OpenShift v5 - Upgrade
Monitoring Kubernetes v5 - Upgrade
Monitoring Docker v5 - Upgrade

Installation instructions
Monitoring OpenShift - Installation
Monitoring Kubernetes - Installation
Monitoring Docker - Installation

Monitoring Docker, OpenShift, Kubernetes and Linux - Version 5.23

Version 5.23 of our applications, configurations, and Collectord is now available. In this blog post, we will cover some highlights of the release.

Support for large clusters

This release focused on improving the performance and scalability of our monitoring solutions, particularly for large Kubernetes and OpenShift clusters. We have optimized and rewritten the Watch requests implementation for the Kubernetes API Server. If you leave your existing configuration unchanged, Collectord keeps using the old implementation; to switch to the new one, set watchImplementation=2 under [general.kubernetes] (default value). To better support installations with a large number of nodes and containers, most dashboards now require pressing the Submit button after selecting filters.

Pod and Node Status Monitoring

Across dashboards, we have added statuses and conditions of Nodes and Pods. For example, the Overview dashboard shows a list of Not Ready Containers running in your cluster, so you can quickly identify issues.

Other significant changes

Allow defining outputs for Prometheus metrics defined with annotations.
Improvements for the Audit Dashboards to include user agents and slightly improve performance for new installations. You can read all other changes and bug fixes in the release notes below. Links You can find more information about other minor updates by following the links below. Release notes Monitoring OpenShift - Release notes Monitoring Kubernetes - Release notes Monitoring Docker - Release notes Upgrade instructions Monitoring OpenShift v5 - Upgrade Monitoring Kubernetes v5 - Upgrade Monitoring Docker v5 - Upgrade Installation instructions Monitoring OpenShift - Installation Monitoring Kubernetes - Installation Monitoring Docker - Installation Blog Blog - Monitoring Kubernetes and OpenShift - Monitoring GPU Blog Blog - Monitoring Kubernetes and OpenShift - Monitoring GPU Monitoring Kubernetes and OpenShift - Monitoring GPU If you are using NVIDIA GPU devices for your workloads, including machine learning (ML), high performance computing (HPC), financial analytics, and video transcoding, you want to be able to monitor how efficiently you are using these devices. We provide a solution, based on the nvidia-smi tool, that will allow you to monitor GPU devices attached to your Kubernetes and OpenShift nodes, to review GPU/Memory utilization, Power consumption and more. Currently it is in beta mode, and you will need to add the required dashboards to the configurations manually. In future versions we will include these dashboards as part of our application. Please review the documentation on installation Monitoring OpenShift v5 - Monitoring GPU Monitoring Kubernetes v5 - Monitoring GPU We are using the nvidia-smi tool to collect the data, which allows us to install the collection part on any Kubernetes or OpenShift version. The official NVIDIA monitoring tool relies on Kubernetes 1.13+, which is a significant limitation, considering that you can’t run it on the most popular OpenShift version 3.11 (which is based on Kubernetes 1.11). 
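To picture the nvidia-smi based collection described above, here is a minimal sketch that turns nvidia-smi CSV output into key=value lines. It is only an illustration and not how Collectord collects the data: the function name gpu_csv_to_kv and the output field names are hypothetical, and the column order is assumed to match the --query-gpu list shown in the comment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert nvidia-smi CSV rows (read from stdin) to key=value lines.
# Assumed input columns: index, utilization.gpu, memory.used, memory.total, power.draw
gpu_csv_to_kv() {
  awk -F', *' '{
    printf "gpu=%s gpu_util_pct=%s mem_used_mib=%s mem_total_mib=%s power_w=%s\n", $1, $2, $3, $4, $5
  }'
}

# On a node with the NVIDIA driver installed, the input would come from:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total,power.draw \
#              --format=csv,noheader,nounits | gpu_csv_to_kv
# A made-up sample row is used here so the sketch runs without a GPU.
printf '0, 35, 1024, 16160, 61.25\n' | gpu_csv_to_kv
```

With the noheader and nounits options, each row carries only numbers, which keeps the parsing to a single awk call.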
If you prefer to use NVIDIA/gpu-monitoring-tools you can easily use our Prometheus annotations to collect these metrics and forward them to Splunk. Monitoring OpenShift v5 - Annotations - Forwarding Prometheus Metrics Monitoring Kubernetes v5 - Annotations - Forwarding Prometheus Metrics Blog Blog - Monitoring Kubernetes on Mesosphere DC/OS with Splunk Enterprise and Splunk Cloud Blog Blog - Monitoring Kubernetes on Mesosphere DC/OS with Splunk Enterprise and Splunk Cloud Monitoring Kubernetes on Mesosphere DC/OS with Splunk Enterprise and Splunk Cloud If you are using Kubernetes on Mesosphere DC/OS you can find that our default configuration does not provide all the metrics and information out of the box. In this blog post we will guide you through all the configuration changes to get all the information you need to monitor the health of your clusters and performance of your applications. We used Quickstart guide for Kubernetes on DC/OS on AWS as an example. Fix for the cgroup filesystem If you run the troubleshooting command verify on one of the collectorforkubernetes Pods you can find that it fails to find the cgroups for the Pods and Containers. text Copy 1 Kubernetes configuration: 2 + api: OK 3 x pod cgroup: FAILED 4 pods = 0 (with cgroup filter = ^/([^/\s]+/)*kubepods(\.slice)?/((kubepods-)?(burstable|besteffort)(\.slice)?/)?([^/]*)pod([0-9a-f]{32}|[0-9a-f\-_]{36})(\.slice)?$) 5 x container cgroup: FAILED 6 containers = 0 (with cgroup filter = ^/([^/\s]+/)*kubepods(\.slice)?/((kubepods-)?(burstable|besteffort)(\.slice)?/)?([^/]*)pod([0-9a-f]{32}|[0-9a-f\-_]{36})(\.slice)?/(docker-|crio-)?[0-9a-f]{64}(\.scope)?(\/.+)?$) 7 + volumes root: OK 8 + runtime: OK 9 docker This is because we mount the cgroup filesystem under /rootfs/sys/fs/cgroup and if you look at the different types of the cgroups text Copy 1/rootfs/sys/fs/cgroup# ls -alh 2total 0 3drwxr-xr-x. 13 root root 340 Apr 3 22:48 . 4drwxr-xr-x. 7 root root 0 Apr 3 22:48 .. 5drwxr-xr-x. 
3 root root 0 Apr 3 22:48 blkio 6lrwxrwxrwx. 1 root root 26 Apr 3 22:48 cpu -> /sys/fs/cgroup/cpu,cpuacct 7drwxr-xr-x. 3 root root 0 Apr 3 22:48 cpu,cpuacct 8lrwxrwxrwx. 1 root root 26 Apr 3 22:48 cpuacct -> /sys/fs/cgroup/cpu,cpuacct you’ll realize that these links are broken. Cgroup cpu points to the /sys/fs/cgroup/cpu,cpuacct, when it should point to the /rootfs/sys/fs/cgroup/cpu,cpuacct (or better ./cpu,cpuacct). To fix that, you can mount cgroups inside the container in our configuration differently. In both DaemonSets collectorforkubernetes and collectorforkubernetes-master change the volumeMounts from yaml Copy 1 - name: cgroup 2 mountPath: /rootfs/sys/fs/cgroup 3 readOnly: true To yaml Copy 1 - name: cgroup-cpu 2 mountPath: /rootfs/sys/fs/cgroup/cpu 3 readOnly: true 4 - name: cgroup-cpu 5 mountPath: /rootfs/sys/fs/cgroup/cpuacct 6 readOnly: true 7 - name: cgroup-blkio 8 mountPath: /rootfs/sys/fs/cgroup/blkio 9 readOnly: true 10 - name: cgroup-memory 11 mountPath: /rootfs/sys/fs/cgroup/memory 12 readOnly: true And change the volumes from yaml Copy 1 - name: cgroup 2 hostPath: 3 path: /sys/fs/cgroup To yaml Copy 1 - name: cgroup-cpu 2 hostPath: 3 path: /sys/fs/cgroup/cpu,cpuacct 4 - name: cgroup-blkio 5 hostPath: 6 path: /sys/fs/cgroup/blkio 7 - name: cgroup-memory 8 hostPath: 9 path: /sys/fs/cgroup/memory After applying the change you can run the verify command again and should see that it fixed the problem text Copy 1 Kubernetes configuration: 2 + api: OK 3 + pod cgroup: OK 4 pods = 7 5 + container cgroup: OK 6 containers = 16 7 + volumes root: OK 8 + runtime: OK 9 docker Pods from DaemonSets collectorforkubernetes-master fail to start If you see that Pods from the DaemonSet collectorforkubernetes-master fail to start with CrashLoopBackOff look at the events for this Pod with bash Copy 1kubectl describe pod --namespace collectorforkubernetes collectorforkubernetes-master-wbv62 If you find something similar to text Copy 1Events: 2 Warning Failed 2m33s (x4 
over 3m20s) kubelet, kube-control-plane-0-instance.devkubernetes01.mesos Error: failed to start container "collectorforkubernetes": Error response from daemon: OCI runtime create failed: container_linux.go:337: starting container process caused "process_linux.go:403: container init caused \"process_linux.go:368: setting cgroup config for procHooks process caused \\\"failed to write 200000 to cpu.cfs_quota_us: write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod2889c500-5665-11e9-a692-a6728d2eb688/collectorforkubernetes/cpu.cfs_quota_us: invalid argument\\\"\"": unknown That means that the parent cgroup has a lower limit for the CPU. Change the limits for the collectorforkubernetes-master DaemonSet to 1000m or 500m. In our case we see that the parent cgroup for the master pods has a cpu.cfs_quota_us equal to 160000 (1600m) bash Copy 1cat 7d57d61b-4c0f-4133-b658-bdfa902f67b2/cpu.cfs_quota_us 2160000 After lowering the CPU, apply the configuration and you should now see that the Pods from the collectorforkubernetes-master are scheduled on the master nodes. CoreDNS metrics If you want to collect coredns metrics, just run the command to attach the annotation to tell Collectord to start forwarding metrics from coredns pods to Splunk bash Copy 1kubectl annotate deployment/coredns --namespace kube-system 'collectord.io/prometheus.1-path=/metrics' 'collectord.io/prometheus.1-port=9153' 'collectord.io/prometheus.1-source=coredns' --overwrite etcd metrics To be able to monitor the etcd cluster with our application Monitoring Kubernetes for Splunk Enterprise and Splunk Cloud you need to retrieve etcd certificates from the Kubernetes API pod, and modify the configuration of the collectorforkubernetes.yaml. 
To retrieve the certificates from the Kubernetes API, just find the name of one of the Pods with the kube-apiserver and copy 3 files ca-crt.pem, kube-apiserver-crt.pem and kube-apiserver-key.pem bash Copy 1kubectl cp --namespace kube-system kube-apiserver-kube-control-plane-0-instance.devkubernetes01.mesos:/data/ca-crt.pem . 2kubectl cp --namespace kube-system kube-apiserver-kube-control-plane-0-instance.devkubernetes01.mesos:/data/kube-apiserver-crt.pem . 3kubectl cp --namespace kube-system kube-apiserver-kube-control-plane-0-instance.devkubernetes01.mesos:/data/kube-apiserver-key.pem . Create a secret etcd-cert in the collectorforkubernetes namespace from the just-retrieved files bash Copy 1kubectl create secret generic --namespace collectorforkubernetes etcd-cert --from-file=./ca-crt.pem --from-file=./kube-apiserver-crt.pem --from-file=./kube-apiserver-key.pem Now you need to modify the collectorforkubernetes.yaml configuration. First find the stanza [input.prometheus::etcd] and disable it with disabled=true. We use this configuration when etcd is deployed on the master nodes. 
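The per-member stanzas added in the next step differ only in the member index, so they can be generated instead of copied by hand. The following is a minimal sketch under stated assumptions: the function name etcd_stanza is hypothetical, the members are assumed to be named etcd-0 to etcd-2 as in this example cluster, and using the member's own peer name as the host value is our assumption.

```shell
#!/usr/bin/env bash
# Hypothetical generator for per-member [input.prometheus::etcd-N] stanzas.
# Arguments: member index, cluster name (for example devkubernetes01).
etcd_stanza() {
  local i="$1" cluster="$2"
  cat <<EOF
[input.prometheus::etcd-${i}]
disabled = false
type = kubernetes_prometheus
index =
host = etcd-${i}-peer.${cluster}
source = etcd
interval = 60s
endpoint.https = https://etcd-${i}-peer.${cluster}.autoip.dcos.thisdcos.directory:2379/metrics
tokenPath =
certPath = /etcd-cert/ca-crt.pem
clientCertPath = /etcd-cert/kube-apiserver-crt.pem
clientKeyPath = /etcd-cert/kube-apiserver-key.pem
insecure = false
includeHelp = false
output =

EOF
}

# Emit stanzas for members 0..2; adjust the list to your cluster size.
for i in 0 1 2; do
  etcd_stanza "$i" devkubernetes01
done
```

Paste the output into 004-addon.conf with the indentation the ConfigMap requires.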
In the ConfigMap file 004-addon.conf, add the following configuration for each etcd cluster member:

```ini
    [input.prometheus::etcd-0]
    disabled = false
    type = kubernetes_prometheus
    index =
    host = etcd-0-peer.devkubernetes01
    source = etcd
    interval = 60s
    endpoint.https = https://etcd-0-peer.devkubernetes01.autoip.dcos.thisdcos.directory:2379/metrics
    tokenPath =
    certPath = /etcd-cert/ca-crt.pem
    clientCertPath = /etcd-cert/kube-apiserver-crt.pem
    clientKeyPath = /etcd-cert/kube-apiserver-key.pem
    insecure = false
    includeHelp = false
    output =

    [input.prometheus::etcd-1]
    disabled = false
    type = kubernetes_prometheus
    index =
    host = etcd-1-peer.devkubernetes01
    source = etcd
    interval = 60s
    endpoint.https = https://etcd-1-peer.devkubernetes01.autoip.dcos.thisdcos.directory:2379/metrics
    tokenPath =
    certPath = /etcd-cert/ca-crt.pem
    clientCertPath = /etcd-cert/kube-apiserver-crt.pem
    clientKeyPath = /etcd-cert/kube-apiserver-key.pem
    insecure = false
    includeHelp = false
    output =

    [input.prometheus::etcd-2]
    disabled = false
    type = kubernetes_prometheus
    index =
    host = etcd-2-peer.devkubernetes01
    source = etcd
    interval = 60s
    endpoint.https = https://etcd-2-peer.devkubernetes01.autoip.dcos.thisdcos.directory:2379/metrics
    tokenPath =
    certPath = /etcd-cert/ca-crt.pem
    clientCertPath = /etcd-cert/kube-apiserver-crt.pem
    clientKeyPath = /etcd-cert/kube-apiserver-key.pem
    insecure = false
    includeHelp = false
    output =
```

You can find the URLs of the etcd members in the configuration for the kube-apiserver:

```text
kubectl describe --namespace kube-system pod kube-apiserver-kube-control-plane-0-instance.devkubernetes01.mesos | grep etcd-servers
      --etcd-servers=https://etcd-0-peer.devkubernetes01.autoip.dcos.thisdcos.directory:2379
```

As the last step, mount the etcd-cert secret into the collectorforkubernetes-addon Deployment in collectorforkubernetes.yaml:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: collectorforkubernetes-addon
  ...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
      - name: collectorforkubernetes
        ...
        volumeMounts:
        ...
        - name: etcd-cert
          mountPath: /etcd-cert/
          readOnly: true
      volumes:
      ...
      - name: etcd-cert
        secret:
          secretName: etcd-cert
```

Now you have all the features of the Monitoring Kubernetes application to help you monitor the health of the Kubernetes cluster and the performance of your applications running on Kubernetes clusters deployed with Mesosphere DC/OS.

Monitoring Linux - Version 5.12 - new application

Today we are happy to announce a new addition to the family of our applications that help you monitor your infrastructure in Splunk Enterprise and Splunk Cloud. We have released the Monitoring Linux application together with the collectorforlinux package, built on top of the Collectord forwarder.

We have found that a lot of our customers are still using bare VMs for large workloads, databases, or legacy software. Migrating these applications to containers may be challenging, so we decided to help monitor these workloads as well.

As usual, the initial setup takes less than 10 minutes, and you get the application logs, metrics from the host and the processes running on it, and network activity. You can easily configure forwarding logs from any custom location on the host.

If you already have a license, you can start using Monitoring Linux right away. Each installation on a host counts as one licensed host, so make sure that you stay within your licensed capacity.

Links
SplunkBase - Monitoring Linux
Monitoring Linux - Installation

Monitoring OpenShift and Kubernetes - Version 4 (Audit Logs and Prometheus metrics)

At Red Hat Summit 2018, we presented our next version of the application Monitoring OpenShift in Splunk.
We are happy to announce the GA of Version 4 of Monitoring OpenShift and Kubernetes. These applications are now certified by Splunk, and they are available on SplunkBase. Version 4 brings two significant features: Audit logs and control plane monitoring (etcd clusters, Kubelets, controllers, and API servers). Our solutions are now the most complete suites for monitoring Kubernetes clusters, allowing developers to monitor their applications and operators to monitor the health of their clusters. With the power of Splunk, application developers can build more complex dashboards specific to their applications. And operators can diagnose the health of their clusters. Installation instructions Monitoring OpenShift v5 - Installation Monitoring Kubernetes v5 - Installation Upgrade instructions Monitoring OpenShift v5 - Upgrade instructions Monitoring Kubernetes v5 - Upgrade instructions Overview The most notable new features are Audit Logs and Prometheus metrics, but there are many small usability improvements and significant performance improvements. Monitoring OpenShift - Release History - 4.0.24 Monitoring Kubernetes - Release History - 4.0.24 Audit Logs By enabling advanced Audit Logs in Kubernetes or OpenShift, you will be able to use our dashboard, which will help you answer questions about when and who modified specified objects, who has access to view them, and from where. To learn more about how to enable advanced audit logs, follow these links Monitoring OpenShift v4 - Audit Logs Monitoring Kubernetes v4 - Audit Logs Control plane monitoring Version 4 of our collectord brings the capability of forwarding metrics from Prometheus format directly to Splunk. This allows us to monitor the control plane, including etcd clusters, Kubelets, API Servers, and controllers. 
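Prometheus endpoints expose metrics as plain text, one sample per line, with comment lines starting with #. As a rough illustration of what reading that format involves (this is not Collectord's implementation, and the function name prom_samples is hypothetical), the sketch below extracts metric and value pairs. It assumes samples carry no timestamps and no spaces inside label values; the sample input is made up for the demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Prometheus text exposition format from stdin and
# print "metric<TAB>value" for every sample, skipping comments and blank lines.
prom_samples() {
  awk '
    /^#/ || /^[[:space:]]*$/ { next }   # skip HELP/TYPE comments and blank lines
    {
      # Field 1 is the metric name with its labels, field 2 is the value.
      # Assumes no timestamps and no spaces inside label values.
      printf "%s\t%s\n", $1, $2
    }'
}

# Made-up sample input so the sketch runs without a live endpoint.
printf '%s\n' \
  '# HELP etcd_server_has_leader Sample help text.' \
  '# TYPE etcd_server_has_leader gauge' \
  'etcd_server_has_leader 1' \
  'grpc_server_handled_total{grpc_code="OK",grpc_method="Range"} 42' \
  | prom_samples
```

A production parser also has to handle escaped label values, optional timestamps, and histogram or summary families, which this sketch deliberately leaves out.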
Example of the dashboard for monitoring an etcd cluster in Monitoring OpenShift Example of the dashboard for monitoring Kubelets in Monitoring Kubernetes To learn more about how to enable Prometheus metrics, follow these links Monitoring OpenShift v4 - Prometheus metrics Monitoring Kubernetes v4 - Prometheus metrics Blog Blog - Monitoring OpenShift in Splunk: integration with Web Console Blog Blog - Monitoring OpenShift in Splunk: integration with Web Console Monitoring OpenShift in Splunk: integration with Web Console Today we are open-sourcing a Node.js application openshift-webconsole-integration. This application allows you to integrate OpenShift web console with Splunk. It embeds two links in the OpenShift Web Console. The first link gives you the ability to navigate to the Pod or Workload dashboard, where you can review the performance of the containers, review network activity and see the logs. The second link navigates you directly to search where you can start working with the logs from this specific workload or pod. [UPDATE (2018-11-02)] Switched to version v1.1.0, that uses nodejs:8 by default, to make this app compatible with OpenShift Console version below 3.11. [UPDATE (2019-08-07)] Switched to version v1.2.0, which allows you to override the application name from the default monitoringopenshift. [UPDATE (2021-02-10)] For OpenShift 4.x look at Integrating OpenShift Web Console 4.x with Monitoring OpenShift application in Splunk Two examples of how the navigation works. In this example we can navigate directly to the ReplicationController monitoring dashboard and see all the metrics, network activity and the logs from all the Pods scheduled with this ReplicationController. In the second example, we navigate directly to the Splunk search page, where we already have a predefined filter that shows us logs only from this replication controller. Installation You can find detailed installation instructions at README.md. 
We assume that you already installed Monitoring OpenShift application and collectorforopenshift. Switch to the collectorforopenshift project, if you want this application to run in the same project, like all our other workloads. bash Copy 1oc project collectorforopenshift For the next step, you need a Splunk Web URL, such as http://splunk.local.outcold.solutions:8000. Change this URL in the following command to the Splunk Web URL, where you have Monitoring OpenShift application installed. bash Copy 1oc new-app -f https://raw.githubusercontent.com/outcoldsolutions/openshift-webconsole-integration/v1.2.0/openshift/templates/outcoldsolutions-webconsole-integration.yaml \ 2 --param=SPLUNK_WEB_URL=http://splunk.local.outcold.solutions:8000 \ 3 --param=SOURCE_REPOSITORY_REF=v1.2.0 This template will install the application directly from the GitHub repository with version (tag) v1.2.0. For production environments we recommend that you fork this repository and move the sources of this application to your own location. With these parameters you can override the location of the openshift-webconsole-integration application. As an example bash Copy 1oc new-app -f https://git.local.outcold.solutions/outcoldsolutions/openshift-webconsole-integration/v1.2.0/openshift/templates/outcoldsolutions-webconsole-integration.yaml \ 2 --param=SPLUNK_WEB_URL=http://splunk.local.outcold.solutions:8000 \ 3 --param=SOURCE_REPOSITORY_REF=v1.2.0 \ 4 --param=SOURCE_REPOSITORY_URL=https://git.local.outcold.solutions/outcoldsolutions/openshift-webconsole-integration This template automatically creates the route and exposes script.js. 
You can find the direct link to this script with bash Copy 1echo "https://$(oc get route outcoldsolutions-webconsole-integration -o=jsonpath='{.spec.host}')/script.js" As an example, it can look similar to text Copy 1https://outcoldsolutions-webconsole-integration-collectorforopenshift.apps.local.outcold.solutions/script.js Open this script in the browser to verify that you can access it. It might take a few moments for the application to be built and deployed. You can check the status with oc status. Verify that you see an expected Splunk URL in the links. Now we can integrate this script into the Web Console. You can use CLI commands by following the guide Loading Extension Scripts and Stylesheets or you can do that from the Web Console as well. Open the Development Console in your browser and refresh the page, and verify that you can find the script.js file we just created If you don’t see the script, force OpenShift to recreate the Console with the new configuration by deleting Web Console Pods bash Copy 1oc delete pods -n openshift-web-console -l app=openshift-web-console After that you should see links in every Logs view in the OpenShift web console. We hope that this simple integration will boost your productivity! Feel free to fork, contribute and open issues at https://github.com/outcoldsolutions/openshift-webconsole-integration. Blog Blog - Monitoring Swarm Services with Monitoring Docker application Blog Blog - Monitoring Swarm Services with Monitoring Docker application Monitoring Swarm Services with Monitoring Docker application Monitoring Docker application is a very generic application that can help you to get started with various orchestration tools, which could be ECS or Docker Swarm or Docker UCP. We intentionally do not add Docker Swarm or ECS-specific information in the Monitoring Docker application, as we do not want to overload this application with the orchestration tool data that you don’t use. 
But the good news is that if you want to have a nice dashboard that shows the overview of your service running on your Docker Swarm cluster, it is possible. This information is already being collected by our Collectord. You need a few configuration changes in the application itself to make this possible. First, we need to extract docker_stack_namespace, docker_service_name and docker_service_id. For that, we can create a file $SPLUNK_ETC/apps/monitoringdocker/local/props.conf with the following content ini Copy 1[docker_logs] 2EVAL-docker_stack_namespace = substr(mvfilter(match(docker_container_labels, "^com\.docker\.stack\.namespace=")), 28) 3EVAL-docker_service_name = substr(mvfilter(match(docker_container_labels, "^com\.docker\.swarm\.service\.name=")), 31) 4EVAL-docker_service_id = substr(mvfilter(match(docker_container_labels, "^com\.docker\.swarm\.service\.id=")), 29) 5 6[docker_stats] 7EVAL-docker_stack_namespace = substr(mvfilter(match(docker_container_labels, "^com\.docker\.stack\.namespace=")), 28) 8EVAL-docker_service_name = substr(mvfilter(match(docker_container_labels, "^com\.docker\.swarm\.service\.name=")), 31) 9EVAL-docker_service_id = substr(mvfilter(match(docker_container_labels, "^com\.docker\.swarm\.service\.id=")), 29) 10 11[docker_proc_stats] 12EVAL-docker_stack_namespace = substr(mvfilter(match(docker_container_labels, "^com\.docker\.stack\.namespace=")), 28) 13EVAL-docker_service_name = substr(mvfilter(match(docker_container_labels, "^com\.docker\.swarm\.service\.name=")), 31) 14EVAL-docker_service_id = substr(mvfilter(match(docker_container_labels, "^com\.docker\.swarm\.service\.id=")), 29) 15 16[docker_net_stats] 17EVAL-docker_stack_namespace = substr(mvfilter(match(docker_container_labels, "^com\.docker\.stack\.namespace=")), 28) 18EVAL-docker_service_name = substr(mvfilter(match(docker_container_labels, "^com\.docker\.swarm\.service\.name=")), 31) 19EVAL-docker_service_id = substr(mvfilter(match(docker_container_labels, 
"^com\.docker\.swarm\.service\.id=")), 29) 20 21[docker_net_socket_table] 22EVAL-docker_stack_namespace = substr(mvfilter(match(docker_container_labels, "^com\.docker\.stack\.namespace=")), 28) 23EVAL-docker_service_name = substr(mvfilter(match(docker_container_labels, "^com\.docker\.swarm\.service\.name=")), 31) 24EVAL-docker_service_id = substr(mvfilter(match(docker_container_labels, "^com\.docker\.swarm\.service\.id=")), 29) Alternatively, you can add these fields to the default configuration under $SPLUNK_ETC/apps/monitoringdocker/default/props.conf, make sure to keep it after upgrades. After that, you can leverage these fields and start querying information specific to some stack namespace or service name. As an example, we also provide a dashboard swarm.xml that you can import to your Splunk Search Heads to be able to overview the whole service running on your Docker Swarm cluster. Download dashboard. Blog Blog - Monitoring Tectonic in Splunk (Enterprise Kubernetes by CoreOS) Blog Blog - Monitoring Tectonic in Splunk (Enterprise Kubernetes by CoreOS) Monitoring Tectonic in Splunk (Enterprise Kubernetes by CoreOS) tl;dr; To install the Monitoring Kubernetes solution on Tectonic Follow the instructions on Monitoring Kubernetes to install the collector and application. Install DaemonSet with rsyslog, which forwards journald logs to /var/log/syslog. bash Copy 1kubectl apply -f https://raw.githubusercontent.com/outcoldsolutions/docker-journald-to-syslog/master/journald-to-syslog.yaml Building Journald-To-Syslog image Tectonic is the secure, automated, and hybrid enterprise Kubernetes platform. Built by CoreOS. Our solutions for Monitoring Kubernetes work for most distributions of Kubernetes. And they work for Tectonic as well. They just require a small help to get the host logs. Tectonic is using CoreOS as the host Linux for containers. 
CoreOS is a very lightweight Linux that does not provide a package manager and requires you to install additional software with containers. By default, CoreOS writes all host logs to journald, which stores them in binary format. In most distributions, like RHEL, we recommend installing rsyslog, which automatically configures streaming of all logs from journald to /var/log/. For CoreOS, we should do the same. The only difference is that we need to install it inside a container.

For the image outcoldsolutions/journald-to-syslog, I chose debian:stretch as the base image for a simple reason: it has the right version of journalctl, built with the right set of libraries. If you ssh to one of your CoreOS boxes, you will find that journald was built with +LZ4, and it is probably using it as the default.

```bash
$ journalctl --version
systemd 233
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT -GNUTLS -ACL +XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD -IDN default-hierarchy=legacy
```

For example, the ubuntu image is built without LZ4 (-LZ4):

```bash
$ docker run --rm ubuntu journalctl --version
systemd 229
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN
```

But debian has it:

```bash
$ docker run --rm debian bash -c 'apt-get update && apt-get install -y systemd && journalctl --version'
systemd 232
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN
```

If you decide to build your own image to forward logs from journald to syslog, make sure to check the compatibility.
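The feature comparison above can also be scripted. A minimal sketch, assuming the feature list format shown in the journalctl --version outputs above; the function name has_journal_feature is hypothetical, and the two feature strings are shortened copies of those outputs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the `journalctl --version` text on stdin
# lists the given compile-time feature as enabled (for example +LZ4).
has_journal_feature() {
  local feature="$1"
  # Features are space-separated tokens prefixed with + (enabled) or - (disabled).
  tr ' ' '\n' | grep -qxF -- "+${feature}"
}

# Shortened feature lines taken from the outputs shown in this post.
coreos='+PAM +AUDIT +XZ +LZ4 +SECCOMP'
ubuntu='+PAM +AUDIT +XZ -LZ4 +SECCOMP'

for name in coreos ubuntu; do
  # ${!name} expands the variable whose name is stored in $name (bash only).
  if printf '%s\n' "${!name}" | has_journal_feature LZ4; then
    echo "$name: LZ4 enabled"
  else
    echo "$name: LZ4 missing"
  fi
done
```

On a real host or inside a candidate image, the same check would be `journalctl --version | has_journal_feature LZ4`.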
You can test it by running journalctl -f over the journal files from your host Linux:

```bash
docker run --rm -it \
    -v /var/log/journal/:/var/log/journal/:ro \
    -v /etc/machine-id:/etc/machine-id:ro \
    debian \
    bash -c 'apt-get update && apt-get install -y systemd && journalctl -f'
```

Inside Journald-To-Syslog image

rsyslog

I used rsyslog with imjournal to build this image. As an alternative, you can use syslog-ng with the systemd-journal source. The configuration I use for rsyslog:

```text
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat
module(load="imjournal" StateFile="/rootfs/var/log/syslog.statefile")
$outchannel log_rotation, /rootfs/var/log/syslog, 10485760, /usr/bin/logrotate.sh
*.* :omfile:$log_rotation
```

Where, line by line:

1. Set the default format for the syslog file output.
2. Load journal logs and use the state file /rootfs/var/log/syslog.statefile, so that the position is kept if this container restarts.
3. Create the output channel as the /rootfs/var/log/syslog file, and call /usr/bin/logrotate.sh when the file grows larger than 10 MiB (10485760 bytes).
4. Stream all data (in our case, only journal logs) to the channel created in (3).

log rotation

A good thing about rsyslog is that it decides when to call log rotation. We just need to configure logrotate. For that, I created a bash script:

```bash
#!/bin/bash

/usr/sbin/logrotate /etc/journald-to-syslog/logrotate.conf
```

And a default configuration that keeps five rotated files and rotates when the file size exceeds 10M (this should stay in sync with the size we used in rsyslog.conf).

```text
/rootfs/var/log/syslog {
    rotate 5
    create
    size 10M
}
```

entrypoint.sh

To make sure that rsyslog uses the right hostname and not the one autogenerated by Docker, I also created an entrypoint.sh that sets the right hostname before it starts rsyslogd in the foreground.
```bash
#!/bin/bash

# Prefer the Kubernetes node name when it is provided; otherwise use the host's hostname.
if [[ -z "${KUBERNETES_NODENAME}" ]]; then
    hostname "$(cat /etc/hostname)"
else
    hostname "${KUBERNETES_NODENAME}"
fi

# Run rsyslogd in the foreground (-n) with our configuration.
/usr/sbin/rsyslogd -f /etc/journald-to-syslog/rsyslog.conf -n
```

K8S configuration: journald-to-syslog.yaml

The last step is to schedule the journald-to-syslog container. We will schedule it on every Kubernetes node with the DaemonSet workload.

```yaml
# The original post used extensions/v1beta1, which was removed in Kubernetes 1.16.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: journald-to-syslog
  labels:
    app: journald-to-syslog
spec:
  updateStrategy:
    type: RollingUpdate

  selector:
    matchLabels:
      daemon: journald-to-syslog

  template:
    metadata:
      name: journald-to-syslog
      labels:
        daemon: journald-to-syslog
    spec:
      tolerations:
      - operator: "Exists"
        effect: "NoSchedule"
      - operator: "Exists"
        effect: "NoExecute"
      containers:
      - name: journald-to-syslog
        image: outcoldsolutions/journald-to-syslog
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: 1
            memory: 128Mi
          requests:
            cpu: 50m
            memory: 32Mi
        volumeMounts:
        - name: machine-id
          mountPath: /etc/machine-id
          readOnly: true
        - name: hostname
          mountPath: /etc/hostname
          readOnly: true
        - name: journal
          mountPath: /var/log/journal
          readOnly: true
        - name: var-log
          mountPath: /rootfs/var/log/
      volumes:
      - name: machine-id
        hostPath:
          path: /etc/machine-id
      - name: hostname
        hostPath:
          path: /etc/hostname
      - name: var-log
        hostPath:
          path: /var/log
      - name: journal
        hostPath:
          path: /var/log/journal
```

A few highlights:

- The NoSchedule toleration allows us to schedule this DaemonSet on masters as well.
- The machine-id mount gives journalctl the right machine ID, which tells it where to look for the journal logs.
- The hostname mount lets us know the hostname of the host Linux.

Splunk and Monitoring Kubernetes application

If you have not done that yet, you can install collectord and our application in Splunk by following our manual on How to get started with Kubernetes.
One small modification required for Monitoring Kubernetes v3.0 is to change macro_kubernetes_host_logs_kubelet from syslog_component::kubelet to syslog_component::kubelet*, because in the case of Tectonic the kubelet process is called kubelet-wrapper. You can do that in Settings → Advanced Search → Search macros, modifying macro_kubernetes_host_logs_kubelet to

```text
(`macro_kubernetes_host_logs` AND (syslog_component::kubelet* OR source="*/kubelet.log*"))
```

After that, you should be able to see host logs, including kubelet logs.

Outcold Solutions at Red Hat X Podcast Series

On this podcast, Denis Gladkikh, Head of Engineering at Outcold Solutions, introduces his company and discusses how Outcold Solutions is helping businesses monitor containerized environments in Splunk.

Links: rhc4tp.blog, iTunes

Outcold Solutions at Splunk .conf18

Are you attending Splunk .conf18 this year? Do you want to learn more about containers? Outcold Solutions is among the .conf18 sponsors this year. Please stop by booth M37 (map is attached below) to see the demos of our latest solutions for Monitoring OpenShift, Kubernetes, and Docker with Splunk Enterprise and Splunk Cloud. If you have questions about containers, Docker, Kubernetes, OpenShift or any topic related to this area, please stop by, ask us questions and pick up stickers!

Denis Gladkikh (Outcold Solutions), together with Amit Mookerjee (Splunk), is giving a talk on how to containerize your applications. To make it more interesting, they are going to present it using the Splunk Docker Image as an example. Whether you are a Docker beginner or an advanced Kubernetes user, this session will have tips, tricks and best practices for a broad audience.
You will learn about containers, Docker images and all the tiny details of the Splunk Docker Image. Space is limited; if you are planning to attend, please add it to your schedule now.

Topic: FN1035 - How to Containerize Your Application - An Example of the Splunk Docker Image
Link: https://conf.splunk.com/learn/session-catalog.html?search=FN1035#/
Time: Wednesday, Oct 03, 11:30 a.m. - 12:15 p.m.

Performance comparison between Collectord, Fluentd and Fluent-bit

The new version 5.3 of our solution for Monitoring Docker, Kubernetes and OpenShift in Splunk comes with an updated Collectord, our container-native software for discovering, transforming and forwarding logs, and for collecting metrics. We ran performance tests using the same tools we used a few weeks ago in the blog post Forwarding 10,000 1k events per second generated by containers from a single host with ease, and we are excited to share these results with you.

Forwarding 5,000 1KB events per second. Default configuration.

Compared to Version 5.2, CPU usage went down from 40% to 26%. Memory usage went down from 120MB to 38MB.

Comparing performance with Fluentd and Fluent-bit

We were asked a LOT how Collectord performs compared to Fluentd and Fluent-bit. This time we included both Fluentd and Fluent-bit in our tests. We used Splunk Connect for Kubernetes (v1.0.1, Apache License 2.0) as a Fluentd distribution, and Fluent-bit from fluent/fluent-bit-kubernetes-logging (v0.14.6, Apache License 2.0), with the output

```text
[OUTPUT]
    Name          splunk
    Match         *
    Host          ec2-52-89-58-42.us-west-2.compute.amazonaws.com
    Port          8088
    TLS           On
    TLS.Verify    Off
    Splunk_Token  fdc8aa7a-de1d-494a-8fef-821a9936a589
```

We disabled Gzip compression for Collectord, because neither Fluentd nor Fluent-bit uses gzip compression by default.
Although we recommend keeping Gzip compression enabled to reduce the volume of network traffic. We disabled metrics collection for Collectord to only compare the performance of log forwarding capabilities. Test 1. Forwarding 5,000 1KB events per second. Collectord. Without Gzip compression and metrics collection, CPU Usage went down to 20%, and memory usage to 32MB. And network transmit went up to around 400MB/s, compared to 4MB/s with gzip compression enabled. Log format. Collectord attaches metadata from the Pods and Owner workloads as pre-indexed fields to the logs, that allows you to search the logs by Pod name, Job name, Job labels and more. The format of the logs is exactly the same as container writes them to the standard output. Fluentd. Default configuration. Fluentd used 80% CPU and 120Mb of Memory. It started forwarding events only 20 seconds after Pods were started (as reflected by the Lag dashboard), but it could catch up and keep up with this volume of logs. In both CPU and Memory graphs, the orange line represents an average CPU and Memory usage of the Collectord for comparison. The number of attached pre-indexed fields is fewer compared to Collectord. Fluentd only attaches metadata from the Pod, but not from the Owner workload, that is the reason, why Fluentd uses less Network traffic. The format of the logs is exactly the same as container writes them to the standard output. Fluent-bit. Default configuration. Fluent-bit used 27% CPU and 26MB of Memory. We discovered some issues with keeping up with this volume of logs. As you can see, the lag (the difference between event time and index time) kept growing to 150 seconds, Fluent-bit needed a few minutes after Jobs completed to finish forwarding the logs. In both CPU and Memory graphs, the orange line represents an average CPU and Memory usage of the Collectord for comparison. Log format. By default Fluent-bit forwards logs embedded in JSON, which adds to licensing and storage costs. 
Additional configuration may be required to address that.

Conclusion

We believe that both Fluentd and Fluent-bit are great generic tools that can work in various environments, but with Collectord you can achieve the performance of Fluent-bit and the flexibility of Fluentd, because Collectord was built specifically for container environments and optimized for the Splunk HTTP Event Collector.

Collectord is container-native software and starts forwarding logs almost at the moment of container creation. Collectord has very high performance for parsing container logs and forwarding data to the Splunk HTTP Event Collector.

It is important to keep track of the lag between the indexing time and the time of the event. In Version 5.3 we have added an alert that will help you monitor that.

Reduce Splunk Licensing cost for container logs

Not all logs that are created are equal. Some are needed for debugging purposes, some for auditing and security, some for troubleshooting. Depending on the type of logs, different approaches can be used to reduce licensing cost. Let’s go over some of them. We will use OpenShift as an example in this blog post, but you can apply this to Kubernetes and Docker logs as well.

Timestamps in log messages

Most applications write a timestamp with every line. Let’s take a look at the guestbook example, which uses Redis deployments. If you estimate the cost of these timestamps in the messages, they take about 17% of the size of all messages. In our case we see that 73 MB is the total amount of logs, and 13 MB of them are timestamps.

Every log line written to disk also gets a timestamp generated by the Docker daemon when it reads the line from the standard output or standard error, so you end up with two timestamps for every log line.
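To make this kind of estimate reproducible on your own logs, the sketch below computes the share of bytes taken by timestamps in a set of log lines. It is only an illustration in Python: the regular expression assumes a Redis-style timestamp such as "09 Aug 21:53:42.792", and the sample lines are hypothetical, not taken from the guestbook deployment.

```python
import re

# Assumed timestamp shape: day, month abbreviation, time with milliseconds, trailing space.
TIMESTAMP = re.compile(r"\d{2} \w{3} \d{2}:\d{2}:\d{2}\.\d{3} ")


def timestamp_share(lines):
    """Return the fraction of bytes occupied by the first timestamp in each line."""
    total = sum(len(line.encode("utf-8")) for line in lines)
    stamped = 0
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:
            stamped += len(match.group(0).encode("utf-8"))
    return stamped / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "1:M 09 Aug 21:53:42.792 * Ready to accept connections",
        "1:M 09 Aug 21:53:43.101 * Background saving started",
    ]
    print(f"timestamps take {timestamp_share(sample):.1%} of the sample")
    # The same arithmetic for the rounded totals quoted above: 13 MB out of 73 MB.
    print(f"13 MB of 73 MB is {13 / 73:.1%}")
```

Short messages inflate the share, so measure on a representative sample of real logs rather than on a handful of lines.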
You can read more about that in our blog post about timestamps in container logs. To solve this issue you have several options.

Remove timestamps from the logs. Considering that the Docker daemon writes the timestamp, you already have it with every log line. If you cannot remove these timestamps at the source, you can use annotations for Collectord to remove timestamps from the messages.

```yaml
annotations:
  collectord.io/logs-replace.1-search: '^([^\s]+\s)(\d+\s\w+\s[^\s]+\s)(.*)$'
  collectord.io/logs-replace.1-val: '$1$3'
```

Additionally, you can extract timestamps from the messages to use them as event timestamps instead of the timestamps from the Docker logging driver. In the following example, we start by moving the timestamp to the first part of the message, and after that extract the timestamp as a field, keeping the rest as the raw message.

```yaml
annotations:
  collectord.io/logs-replace.1-search: '^([^\s]+\s)(\d+\s\w+\s[^\s]+\s)(.*)$'
  collectord.io/logs-replace.1-val: '$2$1$3'
  collectord.io/logs-extraction: '^(?P<timestamp>\d+\s\w+\s[^\s]+)\s(.*)$'
  collectord.io/logs-timestampfield: 'timestamp'
  collectord.io/logs-timestampformat: '02 Jan 15:04:05.999'
```

Drop verbose messages

In our example of using replace annotations, we show how you can reduce the amount of logs forwarded from the containers of the nginx pod, where we remove all access log messages with successful GET requests (see Example 2, dropping messages, in the Monitoring OpenShift annotations documentation).

Another great example is DEBUG and TRACE messages, which are usually required for debugging purposes and are mostly useful for a short period of time. We use them in the development of Collectord itself. When we configure logLevel to be more verbose than INFO, we don’t want to index these logs in Splunk, but still want to be able to look at them with the oc logs (kubectl logs) command.
To do that, we attach annotations

```yaml
annotations:
  collectord.io/logs-replace.1-search: '^(DEBUG|TRACE).*$'
  collectord.io/logs-replace.1-val: ''
```

That tells Collectord to drop all messages that start with DEBUG or TRACE.

Remove container logs entirely from Splunk

If you believe that you don’t need some log messages in Splunk at all, you can change the output from splunk to devnull with an annotation

```yaml
annotations:
  collectord.io/logs-output: 'devnull'
```

That will tell Collectord to ignore all logs from the containers of this Pod. This approach could be useful for some Pods that you just don’t want to see in Splunk, like containers that you know will never fail.

Use opt-out behavior by default for container logs

Some of our customers choose not to forward any Pod logs, unless they explicitly select them. In the Collectord configuration you can change the default output to devnull

```ini
[input.files]
output = devnull
```

which tells Collectord to ignore all container logs. After that, tell Collectord which container logs it should forward by overriding the output back to splunk

```yaml
annotations:
  collectord.io/logs-output: 'splunk'
```

Sampling for container logs

Most of the time you monitor services by tracking the accepted SLA of your service. For example, if you guarantee that 99.9999% of the time your service returns a successful result, and it is acceptable that your service fails less than 0.0001% of the time (because of timeouts or any other reason), this percentage can be calculated just as well from 1 billion requests (1k requests can fail) as from 100 million requests (only 100 requests can fail). In this case, you can sample them and forward only 10% of the log lines.

```yaml
annotations:
  collectord.io/logs-sampling-percent: '10'
```

You can also use hash-based sampling, where the hash key could be an account id or IP address. See Example 2. Hash-based sampling.
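The idea behind hash-based sampling is that the keep-or-drop decision depends only on the key, so all log lines for a sampled account or IP address are kept together. The snippet below is a minimal Python sketch of that idea under our own assumptions (SHA-256 as the hash and 100 buckets); it is not how Collectord implements it, and the function name keep is hypothetical.

```python
import hashlib


def keep(key, percent):
    """Return True when the key falls into the sampled share of 100 buckets."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer and map it to a bucket 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    # Hypothetical log lines keyed by client IP address.
    lines = [
        ("10.0.0.1", "GET /api/cart"),
        ("10.0.0.2", "GET /api/cart"),
        ("10.0.0.1", "POST /api/checkout"),
    ]
    for ip, message in lines:
        if keep(ip, 10):
            print(ip, message)
```

Because the decision is deterministic, either every line for a given key is forwarded or none is, which keeps per-account or per-client traces complete.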
Separating access to the OpenShift Projects and Kubernetes Namespaces using Splunk Roles

A very common request from our customers is to let application developers see the data only from the OpenShift Projects and Kubernetes Namespaces that they are working on. In this blog post, we will show you how you can do that with Splunk roles. We will use OpenShift as an example, but you can follow the same guidance to perform the same actions on Kubernetes namespaces.

There are two ways you can restrict access to the data using Splunk roles. One is by indexes, and the second one is by applying restrictions on search terms.

Restrictions with Indexes

Configuring Splunk

We will use a Project named sock-shop as an example. To be able to restrict access to the logs and metrics that are generated by the Pods running inside this Project, we first need to create a new index in Splunk.

As you can see, we currently have two indexes, openshift and openshift_sockshop. The first one is the default index, where we redirect all data, and the second index will be used for forwarding only data from the sock-shop Project. We recommend giving all indexes used for storing data forwarded from OpenShift a common prefix, like openshift_ or os_. That way, it will be much easier to modify the macros and saved searches to adjust which indexes you will be able to see. We will show you an example later.

For the second step, we need to allow the HTTP Event Collector Token to write to the newly created index.
Using the web interface (if you are using a Splunk single instance, Splunk Cloud, or a Splunk Deployment Server), you can modify the token and add this index to the list.

If you have a distributed Splunk HTTP Event Collector and you don’t use Splunk Deployment Server, you will need to modify inputs.conf and add the index openshift_sockshop to the list of indexes.

```ini
[http://openshift]
disabled = 0
index = openshift
indexes = openshift,openshift_sockshop
token = 00000000-0000-0000-0000-000000000003
```

Splunk is now ready to accept data to the openshift_sockshop index.

Configuring OpenShift

To tell Collectord to forward all the logs and metrics from the sock-shop namespace to this index, we need to annotate it with

```bash
oc annotate namespace sock-shop collectord.io/index=openshift_sockshop
```

In case of Kubernetes use kubectl annotate namespace sock-shop collectord.io/index=openshift_sockshop

If this is not a new Project, and you already have Pods running in it, you have several options: wait until Collectord reloads the metadata and acknowledges the new annotations; recreate all Pods in this Project; or restart Collectord itself with

```bash
oc delete pods --all --namespace collectorforopenshift
```

In case of Kubernetes use kubectl delete pods --all --namespace collectorforkubernetes

Making data visible

After making this change, you might not see the data in Splunk for this project. This is because our search macros look only in indexes searched by default. You have two options: make the newly created index searchable by default, or modify the search macros. See Monitoring OpenShift. Configuring Splunk Indexes for details. For Kubernetes, see Monitoring Kubernetes. Configuring Splunk Indexes.

To make the data visible for the current role, you can add this index to the list of indexes searched by default.

As an alternative, you can make these indexes visible in the application by modifying the search macros (in Advanced Search).
This is where the naming pattern for indexes can be handy; you can prefix all the base macros (macros that are used by all the searches and other macros) with index=openshift* as in the example below. You can modify the macros using the Splunk Web interface or by creating the macros.conf file in $SPLUNK_ETC/apps/monitoringopenshift/local/macros.conf

If you are using Version 5.9 or above, you just need to modify the base macro macro_openshift_base as definition = (index=openshift*), and that macro will be applied to all other macros as well.

```ini
[macro_openshift_events]
definition = (index=openshift* sourcetype=openshift_events NOT "\"type\":\"DELETED\"")

[macro_openshift_host_logs]
definition = (index=openshift* sourcetype=openshift_host_logs)

[macro_openshift_logs]
definition = (index=openshift* sourcetype=openshift_logs)

[macro_openshift_mount_stats]
definition = (index=openshift* sourcetype=openshift_mount_stats)

[macro_openshift_net_socket_table]
definition = (index=openshift* sourcetype=openshift_net_socket_table)

[macro_openshift_net_stats]
definition = (index=openshift* sourcetype=openshift_net_stats)

[macro_openshift_proc_stats]
definition = (index=openshift* sourcetype=openshift_proc_stats)

[macro_openshift_prometheus_metrics]
definition = (index=openshift* sourcetype=prometheus OR sourcetype=openshift_prometheus)

[macro_openshift_stats]
definition = (index=openshift* sourcetype=openshift_stats)
```

Creating the restricted role in Splunk

The next step is to create a role in Splunk that will be able to see only the sock-shop Project.

Restricting app developers only to their project with indexes

We will use the user role as an example. The only change we will make is to specify which indexes this role can see.

Now you can create a user with the role user.
When you log in with this user you will see only one Project with the Monitoring OpenShift application Restrictions with Search Terms Each log event and metric point from the Pods has metadata attached to it, which includes the field openshift_namespace. We can leverage this indexed field and the Splunk Search Restrictions to be able to show only one project to the Splunk user. Restricting app developers only to their project with search terms In this example we will let the user role see only the data from a guestbook Project. For that we will modify the user role and specify a restriction openshift_namespace::guestbook OR (sourcetype::openshift_events AND object.involvedObject.namespace=guestbook) (double colon means that this search will only be applied to the indexed fields) Starting from version 5.6 you only need to restrict by the indexed field openshift_namespace::guestbook We also need to be sure that this user can see the data from the OpenShift index, and this index is searchable by default. (Otherwise we will need to modify the macros; see the example above in the section Making data visible.) Now you can create a user with the role user. When you log in with this user you will see only one Project with the Monitoring OpenShift application A note about the security An important note about this method from Securing Splunk Enterprise > Add and edit roles with Splunk Web. Search term restrictions offer limited security. A user can override some search term restrictions if they create a calculated field that references a field name listed here as a restricted term. Be aware that this method is less secure. 
Still, we tried to break it by creating a calculated field:

```text
openshift_namespace=coalesce(if(openshift_namespace=="guestbook", "guestbook", null), "guestbook")
```

With this calculated field, you cannot see the logs outside of your project when you specify a Search Term Restriction like openshift_namespace::guestbook, because this term looks only at indexed fields. If you change it to openshift_namespace=guestbook, you will be able to see the logs from other namespaces by using the calculated field as in the example above.

Conclusion

Both methods allow you to restrict which projects a user can see in Splunk. Splunk warns that restriction with Search Terms is not very secure, but we could not find a way to escape the assigned Project when openshift_namespace is an indexed field.

Restricting by indexes gives you more flexibility. You can decide for each index how long you want to retain the data. If one team is logging much more than other teams, that will not affect performance for the other teams, as each team has its own index. Of course, managing one index per Project might require more configuration on both sides, OpenShift and Splunk.

Splunk Application Boilerplate (version 1.0). Developing Splunk Apps with Docker.

We develop applications for Splunk. We develop applications for Splunk that allow you to monitor containerized applications. We develop these applications because we love containers. Of course, we use containers in our development workflow. Today we are happy to share our development practices with you, the way we use Splunk Image and App Inspect to build our applications.
We are open sourcing two projects:

- Boilerplate of a Splunk app and developer scripts that allow you to leverage Docker for development. Feel free to use it to build and deliver Splunk Applications.
- Docker Image with Splunk AppInspect, which we will support moving forward and keep up to date.

The benefits of using our Boilerplate app:

- Splunk is distributed as a Docker Image. If you break something, tear down the environment and create a new one.
- Splunk can be preconfigured with any possible requirements you might have, which you can define with Splunk configuration files.
- Support for Splunk Licenses, if you have a Splunk Development License.
- Default configurations allow you to start developing right away. Check our Tips & Tricks section.

And it is easy to get started:

- Step 0. Bootstrap your application by following our step-by-step guidance
- Step 1. Run Splunk Up to bootstrap your Splunk environment
- Step 2. Run Splunk Logs to wait until Splunk is ready
- Step 3. Run Splunk Web to open your application in Splunk Web

Read more on GitHub. Enjoy!

Timestamps in container logs

It is essential to know when the application generates a specific log message. Timestamps help us to order the events. And they help us to build a correlation between systems. That is why logging libraries write timestamps with the log messages.

When you write log messages to systems like journald or Splunk, every event has a timestamp. These applications parse the timestamp from the message or generate it from the system time. When you write logs to plain text files, you lose the information about when the application created these messages. This is why you tell loggers to identify every line with a timestamp.

Let’s look at an example. A tiny Python snippet that generates a new log message when the response does not have a 200 status.
```python
import logging

logging.basicConfig(format="%(asctime)s - %(levelname)s - %(message)s")


def handle_request(request):
    if request.statusCode != 200:
        logging.warning('not OK response')

    # handle request
```

If we handle a response without an OK status, we should see a message like in the example below.

```text
2018-08-09 21:53:42,792 - WARNING - not OK response
```

What can this logging message tell us? That around 2018-08-09 21:53:42,792 we had one response that did not have a 200 status code. It is not the exact time of the response. It’s possible that the function handles the response a second after the actual response. It’s possible that the network was too slow, and the whole request took 10 seconds to complete.

Another important detail: when we write a log message, we do not inject the timestamp. The logger does it for us. Depending on the load of the system where this code is being executed, the timestamp can be a few nanoseconds or microseconds off from the actual message.

Logger libraries are very similar in how they work. Below we will look into how the Python logging module emits the log messages. If we look at the source code of the logging module, we can find where the timestamp is generated.

```python
def __init__(self, name, level, pathname, lineno,
             msg, args, exc_info, func=None, sinfo=None, **kwargs):
    """
    Initialize a logging record with interesting information.
    """
    ct = time.time()
    self.name = name
    self.msg = msg
```

Based on the format, the logger prints the timestamp in human-readable form.

Another important detail about all logging libraries: they all flush the stream after emitting the message. The Python logging module is no exception.

```python
msg = self.format(record)
stream = self.stream
stream.write(msg)
stream.write(self.terminator)
self.flush()
```

The flush call is very important for logging libraries. By default, operating systems create buffered standard output and error streams.
Based on the attached reader, the behavior could be different. Without flushing, your messages could be stuck in the buffer until the buffer is full or the system closes the stream. I recommend looking at the stdio and fflush man pages to learn more.

Let’s summarize what we have learned so far:

- Timestamps are essential in logs. They provide order and help correlate events.
- Timestamps might not represent the exact moment when application code writes the log message.
- Logging libraries avoid keeping records in buffers. All messages should be available to readers almost instantly.

It is time to look at how Docker handles container logs. In the example below, we use Docker, but other container runtimes like CRI-O have similar implementations.

Let’s generate Docker container logs. For that, we can run a docker run command.

```bash
$ docker run --name python_example python:3 python -c 'import logging; logging.basicConfig(format="%(asctime)s - %(levelname)s - %(message)s"); logging.warning("not OK response");'
2018-08-10 17:59:54,582 - WARNING - not OK response
```

It is a short-lived container, but we have not removed it after execution, so we can look at the logs.

```bash
$ docker logs python_example
2018-08-10 18:00:56,086 - WARNING - not OK response
```

We can use the --timestamps option with the docker logs command.

```bash
$ docker logs --timestamps python_example
2018-08-10T18:00:56.086989400Z 2018-08-10 18:00:56,086 - WARNING - not OK response
```

As you can see, every message in Docker logs also has a timestamp. If we round the timestamp generated by Docker, we can see that it differs by one millisecond (2018-08-10T18:00:56.086989400Z ~= 2018-08-10T18:00:56.087).

Docker generates this timestamp with the Copier.

```go
if msg.PLogMetaData == nil {
	msg.Timestamp = time.Now().UTC()
} else {
	msg.Timestamp = partialTS
}
```

Docker attaches the Copier to the container’s standard output and error pipes.
```go
copier := logger.NewCopier(map[string]io.Reader{"stdout": container.StdoutPipe(), "stderr": container.StderrPipe()}, l)
container.LogCopier = copier
copier.Run()
container.LogDriver = l
```

The Copier reads the streams and forwards messages to loggers.

Let’s summarize what we learned about the Docker logger system:

- Docker generates another timestamp for every log message.
- There could be a slight delay between when an application writes the message and the Docker timestamp. This delay depends on how busy your system is.
- We have not talked about this, but it is essential to know: if the Copier with the Logger isn’t able to keep up with the number of logs, it might block the standard output of the container. This can result in performance degradation of your application. You will not see spikes in CPU usage, but peaks in IO wait.
- None of the timestamps represent the actual moment when the event happened.
- Logger drivers use the timestamp generated by Docker.

Our suggestion: avoid generating timestamps with logging libraries. Docker always attaches timestamps to the logs. This will help you reduce the disk space used for storing the logs. And if you use logging systems where you pay for the amount of ingested data (like AWS CloudWatch or Splunk), this will help you reduce costs.

Using Splunk fields extractor to extract fields from container logs

Getting logs and metrics into your Splunk cluster is just the first step to managing your logs. The next step is to build your own custom dashboards to explore the data you forward to Splunk.

Collectord has a special format for the source of the container logs it forwards. We decided on purpose to replace the uninformative /var/lib/docker/containers/{container_id}/{container_id}-logs.json with something that we can leverage in Splunk.
So instead we send logs with a source format that includes the container name, image name and more, depending on the orchestrator used. You can find the format definition in the Docker, Kubernetes and OpenShift documentation.

With the following example, we show how you can leverage this knowledge to extract fields from your container logs and start building custom dashboards. We use Monitoring OpenShift as an example, but you can apply this to any of our applications.

First, you need to find the logs and define the source rule. Using wildcards, you can cover multiple sources, like all containers that were created from a specific image, as in the example below.

In our deployment, we have 6 pods with similar sources.

```text
/openshift/5d9ab541136dfd1a6a41efe25b481a432147e7452b4dd0755c0cb666e925cb79/nodejs-ex/docker-registry.default.svc:5000/nodejs/nodejs-ex:latest/nodejs-ex-3-bgqvk/nodejs.stdout
/openshift/9c078f0896fbad08df0a1b10c2c26d40a3166f940f4d6c3f731469f1a9152e11/nodejs-ex/docker-registry.default.svc:5000/nodejs/nodejs-ex:latest/nodejs-ex-3-dfbxg/nodejs.stdout
/openshift/05cebcacac1ccc6cb768169f3fb5544f606eeca1615ab339e5e01156aa6fa56a/nodejs-ex/docker-registry.default.svc:5000/nodejs/nodejs-ex:latest/nodejs-ex-3-gbqmf/nodejs.stdout
/openshift/185fa1c7e42658666fe704bc5b64366544c17d4d4edb83356c0ee2ebb1f2df6b/nodejs-ex/docker-registry.default.svc:5000/nodejs/nodejs-ex:latest/nodejs-ex-3-c79zr/nodejs.stdout
/openshift/7e12a1738ac448537c724b34b9451b1541ec60a5abf82f5de9912166c7497f76/nodejs-ex/docker-registry.default.svc:5000/nodejs/nodejs-ex:latest/nodejs-ex-3-hlz88/nodejs.stdout
/openshift/b14c3eb86db75e4f006cdc86bb9cbd08450b327ac9955f0299288e94f0e9053d/nodejs-ex/docker-registry.default.svc:5000/nodejs/nodejs-ex:latest/nodejs-ex-3-drvj7/nodejs.stdout
```

Based on the format for OpenShift container logs, we can define the source by applying wildcards to the container id and pod suffix.
The resulting source pattern:

    /openshift/*/nodejs-ex/docker-registry.default.svc:5000/nodejs/nodejs-ex:latest/nodejs-ex*/nodejs.stdout

Using this pattern, we can define field extraction in Splunk. For that, go to Settings, Fields, Field extractions and choose Open Field Extractor. Change the data type to source and paste the source defined above. Verify that you can see all the logs that you expect and define field extraction by following the wizard.

With this simple approach, you can easily extract fields from container logs and start building custom dashboards.

Welcome to our blog

We are happy to welcome you to the blog of our company, Outcold Solutions. We are a small company co-founded by two engineers. Our company focuses on building powerful Splunk applications for monitoring and log forwarding from Docker, Kubernetes, and OpenShift clusters. One of the co-founders worked at Splunk, and he is the original author of the Splunk Logging Driver and Splunk Docker Images. You can follow him: Denis Gladkikh (@outcoldman).

In this blog we will share not only news, the most interesting use cases, and best practices for our products, but also how we develop our applications for Splunk, the tools we use (Docker, of course), and how we automate our builds. We will share our knowledge about Splunk and other logging solutions, including ELK (Elasticsearch, Logstash, Kibana).

Some examples of the blog posts we are planning to write soon:

  • Splunk Docker image: past, current, and future. We will share the story of why the Splunk Docker image was built the way it was and what influenced it. We will also discuss and propose some ideas on how we can improve and develop the second version of the Splunk Docker image.
  • ELK vs. Splunk. A deep dive into the two solutions. We will compare storage engines, the stories behind the products, components, and offerings, and share our knowledge on how to keep both systems healthy.
  • Developing Splunk applications in Docker containers. We will share how we build our applications in Docker, what we have learned, and some scripts.
  • A series of blog posts on how to run Splunk clusters in Docker, OpenShift, and Kubernetes.
  • And of course, best practices for using our solutions, and the most common and interesting use cases.

Subscribe using your favorite RSS reader or by following us on Twitter.

Monitoring Linux

Forward Linux host logs and metrics to Splunk.

Run Collectord on each Linux server and forward syslog, journald, host metrics, and process metrics to Splunk - no containers, no orchestrator, just one binary plus a systemd unit.

  • Installation: Set up Splunk app, HEC, and install collectord
  • Configuration: Config file layout and override settings
  • Log forwarding: Forward logs from custom paths beyond /var/log and journald
  • Splunk HTTP Event Collector: Configure SSL and connection settings for Splunk HEC
  • Alerts: Predefined alerts for license and collector health
  • Troubleshooting: Run verify command and diagnose common issues
  • Release history: Changelog of all collectord and Splunk app releases
© 2017-2026 Outcold Solutions LLC. All rights reserved.

HQ in DeLand, FL, USA.
