Outcold Solutions LLC

Forwarding 10,000 1k events per second generated by containers from a single host with ease

October 22, 2018

[UPDATE (2018-11-04)] Up to 35% CPU performance improvement and 3 times lower memory usage in the upcoming version 5.3.

It is good to know the limits of your infrastructure. We continually test the collector in our labs, and today we want to share the results of tests we performed on AWS EC2 instances, so you can use them as a reference for planning the capacity and cost of your deployments. Below we describe how we ran the tests and how we measured the performance.

Tests performed on 2018-10-20

Environment

AWS

We used two EC2 instances at a time, in the same VPC and the same AZ, with default tenancy and AMI ami-0d1000aff9a9bad89 (Amazon Linux 2):

  • c5d.xlarge (4 vCPU, 8GiB, 100GB NVMe SSD) for Splunk
  • m5.large (2 vCPU, 8GiB, 20GB gp2 EBS) for the testing environment (for most tests)
  • c5d.xlarge (4 vCPU, 8GiB, 100GB NVMe SSD) for the 10,000 1k events per second test (noted below)

Splunk

We deployed Splunk inside a container, using version 7.2.0, with one index for all events.
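
For reference, a minimal sketch of how a standalone Splunk 7.2.0 container with HEC enabled could be started (the container name, admin password, and post-setup steps below are placeholders, not necessarily the exact values and steps we used):

docker run -d --name splunk-example \
  -p 8000:8000 -p 8088:8088 \
  -e SPLUNK_START_ARGS="--accept-license" \
  -e SPLUNK_PASSWORD="changeme-please" \
  splunk/splunk:7.2.0
# then enable HTTP Event Collector and create a token for the collector
# (Settings > Data Inputs > HTTP Event Collector in the Splunk UI)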

Docker

docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a215d7133c34aa18e3b72b4a21fd0c6136
 Built:             Wed Sep 26 23:00:19 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a/18.06.1-ce
  Built:            Wed Sep 26 23:01:44 2018
  OS/Arch:          linux/amd64
  Experimental:     false

JSON logging driver configuration

{
  "log-driver": "json-file",
  "log-opts" : {
    "max-size" : "100m",
    "max-file" : "3"
  }
}
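
One way to apply this configuration is to place it in /etc/docker/daemon.json and restart the Docker daemon:

sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "log-driver": "json-file",
  "log-opts" : {
    "max-size" : "100m",
    "max-file" : "3"
  }
}
EOF
sudo systemctl restart docker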

Collector for Docker

In our tests we used the latest released version of Collector for Docker, 5.2. We used two configurations: one that works out of the box (with gzip compression, SSL, and join rules), and a second that used a plain HTTP connection for HEC, with gzip compression disabled and no join rules.

...
--env "COLLECTOR__SPLUNK_URL=output.splunk__url=http://splunk-example:8088/services/collector/event/1.0" \
--env "COLLECTOR__SPLUNK_GZIP=output.splunk__compressionLevel=nocompression"  \
--env "COLLECTOR__JOIN=pipe.join__disabled=true" \
...
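
Each of these variables overrides one setting in the collector configuration and follows the pattern COLLECTOR__<ANY_NAME>=<section>__<key>=<value>. For example, the first override above corresponds to the following configuration entry:

[output.splunk]
url = http://splunk-example:8088/services/collector/event/1.0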

Kubernetes

A single-instance cluster, bootstrapped with kubeadm.

kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:46:06Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:36:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
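
A rough sketch of bootstrapping such a single-node cluster (the pod network add-on is our choice for illustration, not necessarily the one used here):

# initialize the control plane
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
# configure kubectl for the current user
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# install a pod network add-on (flannel shown as an example)
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
# allow workloads to schedule on the single (master) node
kubectl taint nodes --all node-role.kubernetes.io/master-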

Collector for Kubernetes

In our tests we used the latest released version of Collector for Kubernetes, 5.2. Similarly to Docker, we used two configurations: one that works out of the box (with gzip compression, SSL, and join rules), and a second that used a plain HTTP connection for HEC, with gzip compression disabled and no join rules.

[output.splunk]
url = http://splunk-example:8088/services/collector/event/1.0
compressionLevel = nocompression

[pipe.join]
disabled = true

Log generator

We used ocp_logtest with the following configuration

python ocp_logtest.py --line-length=1024 --num-lines=300000 --rate 60000 --fixed-line

That configuration generates close to 1,000 events per second from one container (the rate of 60,000 lines is per minute), with an average event size of 1,024 bytes; with 300,000 lines, each run lasts about 5 minutes.

To run it in Docker we used the command below. To forward 5,000 events per second we would run 5 of these containers in parallel (see the sketch after the command).

docker run --rm \
  --label=test=testN \
  -d docker.io/mffiedler/ocp-logtest:latest \
  python ocp_logtest.py --line-length=1024 --num-lines=300000 --rate 60000 --fixed-line
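
For example, a sketch of launching five of these generators in parallel, about 1,000 events per second each (testN is whatever label value the dashboard filters on):

for i in 1 2 3 4 5; do
  docker run --rm \
    --label=test=testN \
    -d docker.io/mffiedler/ocp-logtest:latest \
    python ocp_logtest.py --line-length=1024 --num-lines=300000 --rate 60000 --fixed-line
done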

To run the log generator in Kubernetes we used Jobs, each with the same definition as the one below. To forward 5,000 events per second we would run 5 of these Jobs in parallel, changing the name of each Job (see the sketch after the definition).

apiVersion: batch/v1
kind: Job
metadata:
  name: logtestX
  labels:
    test: 'testN'
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: c1
        image: docker.io/mffiedler/ocp-logtest:latest
        command:
          - python
        args:
          - ocp_logtest.py
          - --line-length=1024
          - --num-lines=300000
          - --rate=60000
          - --fixed-line
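
A sketch of starting five of these Jobs in parallel, assuming the definition above is saved as logtest.yaml (the file name is ours):

# create five copies of the Job, each with a unique name
for i in 1 2 3 4 5; do
  sed "s/name: logtestX/name: logtest$i/" logtest.yaml | kubectl apply -f -
done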

Tests

In our test dashboards we show:

  1. The number of messages in total (to verify that we have not lost any messages).
  2. The number of messages per second (only from tested containers).
  3. Message length (len(_raw)): only the size of the log line, excluding the metadata that we attach.
  4. Collector CPU usage percent of a single core.
  5. Lag, the difference between _indextime and _time (... | timechart avg(eval(_indextime-_time))). We show max, avg and min; an example search is shown after this list.
  6. Collector memory usage in MB.
  7. Network transmit from the collector container (in the case of Kubernetes, network transmit from the host, as the collector runs on the host network).
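
For reference, a sketch of the search behind the lag panel (the index name and the test filter are placeholders for whatever your deployment uses):

index=main test=testN
| timechart span=10s max(eval(_indextime-_time)) AS max_lag, avg(eval(_indextime-_time)) AS avg_lag, min(eval(_indextime-_time)) AS min_lag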

Test 1. Docker environment. Default Configuration. Forwarding 1,000 1k events per second

Test results

Test 2. Docker environment. No SSL, No Gzip, No Join. Forwarding 1,000 1k events per second

Changing the default configuration can significantly impact the results. Because we do not use SSL and we do not use Gzip, we reduced the CPU usage from 6-7% to 4-5%. Because we do not use Gzip, memory usage dropped from 60MB to 20MB: our batchSize is still the same 768K, but that 768K now holds uncompressed data (with Gzip, a 768K compressed batch corresponds to much more buffered raw data). The trade-off is that network usage grew from 50KB/s to almost 2MB/s.

Test results

Test 3. Docker environment. Default Configuration. Forwarding 5,000 1k events per second

Compared to Test 1, we now forward 5 times more events. Instead of 6-7% CPU usage we see 30-35% of a single core, and memory usage increased from 60MB to around 110MB.

Test results

Test 4. Docker environment. No SSL, No Gzip, No Join. Forwarding 5,000 1k events per second

Disabling SSL, Gzip, and join rules reduces CPU usage from 30-35% to 25%, but disabling Gzip compression increases the network traffic. If you do not pay for network traffic between your nodes and your Splunk instance, and you have enough bandwidth to support it, you can choose not to use Gzip compression.

Test results

Test 5. Kubernetes environment. Default Configuration. Forwarding 1,000 1k events per second

Similarly to the Docker environment, we tested the Kubernetes environment as well. CPU and memory usage on Kubernetes clusters is slightly higher for several reasons. First, the collector still performs collection of all the other data that we configure, including system metrics, Prometheus metrics, network metrics, and host logs; Collector for Kubernetes forwards more data by default because of the more complex environment. Second, we attach more metadata to logs and metrics, as we forward not only information about the Containers and Pods, but also information about the Workloads that created each Pod and about the Host.

Memory usage in this test is not very accurate, because we ran the 5,000 1k events test right before this test, which increased the memory usage of the collector and kept the memory reserved.

Test results

Test 6. Kubernetes environment. No SSL, No Gzip, No Join. Forwarding 1,000 1k events per second

Similar result to the Docker environment: not using Gzip compression reduces both CPU and memory usage. CPU usage drops from 9-10% to 5-6% of a single core.

Test results

Test 7. Kubernetes environment. Default Configuration. Forwarding 5,000 1k events per second

Forwarding 5,000 events per second uses 40% of a single core. Compared to Test 5, that is a 4x increase.

Test results

Test 8. Kubernetes environment. No SSL, No Gzip, No Join. Forwarding 5,000 1k events per second

Disabling Gzip compression reduces CPU usage from 40% to 26% of a single core.

Test results

Test 9. Docker environment. Default Configuration. Forwarding 10,000 1k events per second

To be able to forward more than 5,000 events per second, we used a c5d.xlarge instance for this test to make sure we would not be limited by the performance of the gp2 EBS volume.

We changed the collector configuration and increased the number of Splunk threads to 5. In our tests we see that one Splunk client with the default configuration (SSL, Gzip compression, 768K batch size) can forward about 5,000 events per second. We recommend increasing this value if you have more than 4,000 events per second.

--env "COLLECTOR__SPLUNK_THREADS=output.splunk__threads=5"  

Doing that allowed us to forward 10,000 events per second. Compared to Test 3, we use 60% of a single core. Memory usage grew to 400MB because of the dedicated threads (and the buffers allocated for them).

Test results

An important detail: with this amount of events, dockerd uses around 25% of a single core, and the Splunk process used 80% of a single core on its own host.

Test results

Summary

If you want to reproduce these tests in your environment, we have shared all the steps we performed. If you find that some steps are missing, please let us know.

Forwarding up to 5,000 1k events per second does not require any changes to the configuration. To forward more than that, you need to increase the number of threads.

These results are not our limit. We will keep working on improving performance and memory usage in the future.

performance, collector, ec2, aws

About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications that give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing easy-to-use and easy-to-deploy solutions for Linux and Windows containers. We deliver applications that help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution to help you keep all the metrics and logs in one place, allowing you to quickly address complex questions on container performance.