Outcold Solutions LLC

Monitoring Kubernetes - Version 5

Collecting Prometheus metrics

Most components of the Kubernetes control plane export metrics in Prometheus format. The collector can read these metrics and forward them to Splunk Enterprise or Splunk Cloud. Our installation includes default configurations for collecting metrics from the Kubernetes API Server, Scheduler, Controller Manager, Kubelets and the etcd cluster. With most Kubernetes providers, you don't need any additional configuration to see these metrics.

If your applications export metrics in Prometheus format, you can use our collector to forward these metrics to Splunk Enterprise or Splunk Cloud as well.

Forwarding metrics from Pods

Please read our documentation on annotations to learn how to define forwarding of metrics from Pods.

Defining Prometheus input

We deploy the collector in 3 different workloads. Depending on where you want to collect your metrics from, plan which configuration file should include your Prometheus input.

  • 002-daemonset.conf is installed on all nodes (masters and non-masters). Use this configuration if you need to collect metrics from local ports on all nodes. Kubelet metrics are one example.
  • 003-daemonset-master.conf is installed only on master nodes. Use this configuration to collect metrics from local ports on master nodes only. Examples are the control plane processes and etcd running on masters.
  • 004-addon.conf is installed as a deployment and runs only once in the whole cluster. Place your Prometheus configuration here if you want to collect metrics from endpoints or services. Examples are the controller manager and scheduler, which can be accessed only from an internal network and can be discovered with endpoints. Another example is an etcd cluster running outside of the Kubernetes cluster.

Default configuration

Kubelet

On every node, the collector reads and forwards kubelet metrics. We deploy this configuration in 002-daemonset.conf.

[input.prometheus::kubelet]

# disable prometheus kubelet metrics
disabled = false

# override type
type = kubernetes_prometheus

# specify Splunk index
index =

# override host (environment variables are supported, by default Kubernetes node name is used)
host = ${KUBERNETES_NODENAME}

# override source
source = kubelet

# how often to collect prometheus metrics
interval = 60s

# Prometheus endpoint, multiple values can be specified, collector tries them in order till finding the first
# working endpoint.
# At first trying to get it through proxy
endpoint.1proxy = https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/api/v1/nodes/${KUBERNETES_NODENAME}/proxy/metrics
# In case if cannot get it through proxy, trying localhost
endpoint.2http = http://127.0.0.1:10255/metrics

# token for "Authorization: Bearer $(cat tokenPath)"
tokenPath = /var/run/secrets/kubernetes.io/serviceaccount/token

# server certificate for certificate validation
certPath = /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

# client certificate for authentication
clientCertPath =

# Allow invalid SSL server certificate
insecure = true

# include metrics help with the events
includeHelp = false
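
The comment in the configuration above says the collector tries multiple endpoints in order until it finds the first working one. A minimal Python sketch of that selection logic follows. It assumes that "in order" means sorted by key name (endpoint.1proxy before endpoint.2http), which the key names suggest but the documentation does not state. The function name, the probe callback and the proxy address are hypothetical and are not part of the collector.

```python
from typing import Callable, Dict, Optional


def first_working_endpoint(
    endpoints: Dict[str, str],
    probe: Callable[[str], bool],
) -> Optional[str]:
    """Return the URL of the first endpoint for which probe() succeeds."""
    # Assumption: endpoints are tried in the sort order of their key names.
    for name in sorted(endpoints):
        url = endpoints[name]
        if probe(url):
            return url
    return None


# Hypothetical values; the proxy address stands in for the expanded
# ${KUBERNETES_SERVICE_HOST} and ${KUBERNETES_NODENAME} variables.
endpoints = {
    "endpoint.2http": "http://127.0.0.1:10255/metrics",
    "endpoint.1proxy": "https://10.0.0.1:443/api/v1/nodes/node-1/proxy/metrics",
}

# Simulate a node where the API proxy is unreachable and only localhost answers.
chosen = first_working_endpoint(
    endpoints, probe=lambda url: url.startswith("http://127.0.0.1")
)
print(chosen)
```

In this simulation the proxy endpoint is tried first, fails the probe, and the localhost endpoint is selected as the fallback.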

Kubernetes API Server

On master nodes, collectors read and forward metrics from the Kubernetes API server. We deploy this configuration using 003-daemonset-master.conf.

[input.prometheus::kubernetes-api]

# disable prometheus kubernetes-api metrics
disabled = false

# override type
type = kubernetes_prometheus

# specify Splunk index
index =

# override host (environment variables are supported, by default Kubernetes node name is used)
host = ${KUBERNETES_NODENAME}

# override source
source = kubernetes-api

# how often to collect prometheus metrics
interval = 60s

# prometheus endpoint
# at first trying to get it from localhost (avoiding load balancer, if multiple api servers)
endpoint.1localhost = https://127.0.0.1:6443/metrics
# as fallback using proxy
endpoint.2kubeapi = https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/metrics

# token for "Authorization: Bearer $(cat tokenPath)"
tokenPath = /var/run/secrets/kubernetes.io/serviceaccount/token

# server certificate for certificate validation
certPath = /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

# client certificate for authentication
clientCertPath =

# Allow invalid SSL server certificate
insecure = true

# include metrics help with the events
includeHelp = false

Scheduler

On master nodes, collectors read and forward metrics from the scheduler. We deploy this configuration using 003-daemonset-master.conf.

[input.prometheus::scheduler]

# disable prometheus scheduler metrics
disabled = false

# override type
type = kubernetes_prometheus

# specify Splunk index
index =

# override host
host = ${KUBERNETES_NODENAME}

# override source
source = scheduler

# how often to collect prometheus metrics
interval = 60s

# prometheus endpoint
endpoint = http://127.0.0.1:10251/metrics

# token for "Authorization: Bearer $(cat tokenPath)"
tokenPath =

# server certificate for certificate validation
certPath =

# client certificate for authentication
clientCertPath =

# Allow invalid SSL server certificate
insecure = true

# include metrics help with the events
includeHelp = false

Collecting metrics from scheduler using endpoint discovery

The collector can forward metrics from the scheduler only if the scheduler binds to localhost on the master nodes. If the scheduler binds only to the pod network, you need a different way of collecting its metrics. In 004-addon.conf you can find a commented-out section [input.prometheus::scheduler] that collects metrics from the scheduler using endpoint discovery.

You can comment out the section [input.prometheus::scheduler] in 003-daemonset-master.conf and uncomment it in 004-addon.conf.

# Example on how to get scheduler metrics with endpoint discovery
[input.prometheus::scheduler]
# disable prometheus scheduler
disabled = false
# override type
type = kubernetes_prometheus
# specify Splunk index
index =
# override host (using discovery from endpoint)
host =
# override source
source = scheduler
# how often to collect prometheus metrics
interval = 60s
# prometheus endpoint
endpoint = endpoint-http://kube-scheduler-collectorforkubernetes-discovery:10251/metrics
# token for "Authorization: Bearer $(cat tokenPath)"
tokenPath =
# server certificate for certificate validation
certPath =
# client certificate for authentication
clientCertPath =
# Allow invalid SSL server certificate
insecure = false
# include metrics help with the events
includeHelp = true

In this configuration, the collector uses the endpoint endpoint-http://kube-scheduler-collectorforkubernetes-discovery:10251/metrics. This syntax defines endpoint auto-discovery: the collector lists all endpoints with port 10251 defined under the name kube-scheduler-collectorforkubernetes-discovery and collects metrics from every one of them.
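
To illustrate how the endpoint- prefix can be read, the Python sketch below splits such a URL into the real scheme, the endpoint name, the port and the path, and expands it into one URL per discovered address. This is only an illustration of the syntax, not the collector's implementation. The function name and the IP addresses are hypothetical, and in a real cluster the addresses would come from the Kubernetes Endpoints object with the given name.

```python
from typing import List
from urllib.parse import urlsplit

DISCOVERY_PREFIX = "endpoint-"


def expand_discovery_url(url: str, addresses: List[str]) -> List[str]:
    """Expand an endpoint-<scheme>://<name>:<port>/<path> URL into plain URLs."""
    parts = urlsplit(url)
    if not parts.scheme.startswith(DISCOVERY_PREFIX):
        # Not a discovery URL, so it is used as-is.
        return [url]
    scheme = parts.scheme[len(DISCOVERY_PREFIX):]
    # parts.hostname holds the endpoint name; the caller resolves it to addresses.
    return [f"{scheme}://{ip}:{parts.port}{parts.path}" for ip in addresses]


# Hypothetical addresses of two scheduler instances behind the discovery endpoint.
targets = expand_discovery_url(
    "endpoint-http://kube-scheduler-collectorforkubernetes-discovery:10251/metrics",
    ["10.0.0.11", "10.0.0.12"],
)
for target in targets:
    print(target)
```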

The endpoint kube-scheduler-collectorforkubernetes-discovery is created by a Service defined in our configuration.

apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler-collectorforkubernetes-discovery
  labels:
    k8s-app: kube-scheduler
spec:
  selector:
    k8s-app: kube-scheduler
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP

Controller Manager

On master nodes, collectors read and forward metrics from the controller manager. We deploy this configuration using 003-daemonset-master.conf.

# This configuration works if controller-manager is bound to localhost:10252
[input.prometheus::controller-manager]

# disable prometheus controller-manager metrics
disabled = false

# override type
type = kubernetes_prometheus

# specify Splunk index
index =

# override host
host = ${KUBERNETES_NODENAME}

# override source
source = controller-manager

# how often to collect prometheus metrics
interval = 60s

# prometheus endpoint
endpoint = http://127.0.0.1:10252/metrics

# token for "Authorization: Bearer $(cat tokenPath)"
tokenPath =

# server certificate for certificate validation
certPath =

# client certificate for authentication
clientCertPath =

# Allow invalid SSL server certificate
insecure = false

# include metrics help with the events
includeHelp = false

Collecting metrics from controller manager using endpoint discovery

The collector can forward metrics from the controller manager only if the controller manager binds to localhost on the master nodes. If the controller manager binds only to the pod network, you need a different way of collecting its metrics. In 004-addon.conf you can find a commented-out section [input.prometheus::controller-manager] that collects metrics from the controller manager using endpoint discovery.

You can comment out the section [input.prometheus::controller-manager] in 003-daemonset-master.conf and uncomment it in 004-addon.conf.

# Example on how to get controller-manager metrics with endpoint discovery
[input.prometheus::controller-manager]
# disable prometheus controller-manager
disabled = false
# override type
type = kubernetes_prometheus
# specify Splunk index
index =
# override host (using discovery from endpoint)
host =
# override source
source = controller-manager
# how often to collect prometheus metrics
interval = 60s
# prometheus endpoint
endpoint = endpoint-http://kube-controller-manager-collectorforkubernetes-discovery:10252/metrics
# token for "Authorization: Bearer $(cat tokenPath)"
tokenPath =
# server certificate for certificate validation
certPath =
# client certificate for authentication
clientCertPath =
# Allow invalid SSL server certificate
insecure = false
# include metrics help with the events
includeHelp = true

In this configuration, the collector uses the endpoint endpoint-http://kube-controller-manager-collectorforkubernetes-discovery:10252/metrics. This syntax defines endpoint auto-discovery: the collector lists all endpoints with port 10252 defined under the name kube-controller-manager-collectorforkubernetes-discovery and collects metrics from every one of them.

The endpoint kube-controller-manager-collectorforkubernetes-discovery is created by a Service defined in our configuration.

apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager-collectorforkubernetes-discovery
  labels:
    k8s-app: kube-controller-manager
spec:
  selector:
    k8s-app: kube-controller-manager
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP

etcd

On master nodes, collectors read and forward metrics from etcd processes. We deploy this configuration using 003-daemonset-master.conf.

[input.prometheus::etcd]

# disable prometheus etcd metrics
disabled = false

# override type
type = kubernetes_prometheus

# specify Splunk index
index =

# override host
host = ${KUBERNETES_NODENAME}

# override source
source = etcd

# how often to collect prometheus metrics
interval = 36s

# prometheus endpoint
endpoint.http = http://:2379/metrics
endpoint.https = https://:2379/metrics

# token for "Authorization: Bearer $(cat tokenPath)"
tokenPath =

# server certificate for certificate validation
certPath = /rootfs/etc/kubernetes/pki/etcd/ca.pem

# client certificate for authentication
clientCertPath = /rootfs/etc/kubernetes/pki/etcd/client.pem
clientKeyPath = /rootfs/etc/kubernetes/pki/etcd/client-key.pem

# Allow invalid SSL server certificate
insecure = true

# include metrics help with the events
includeHelp = false

This configuration works when the etcd cluster runs on the master nodes. With this configuration, the collector first tries to collect metrics using the http scheme, and then https. For https, the collector uses certPath, clientCertPath and clientKeyPath, which are mounted from the host.

...
  volumeMounts:
  ...
  - name: k8s-certs
    mountPath: /rootfs/etc/kubernetes/pki/
    readOnly: true
...
volumes:
- name: k8s-certs
  hostPath:
    path: /etc/kubernetes/pki/

Verify that these certificates are available; if they are not, make the appropriate changes. Check the certificates used by the Kubernetes API Server, which are defined with 3 command line arguments:

--etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
--etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
--etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key

You can find these arguments by executing ps aux | grep apiserver on one of the master nodes, or by looking at the API Server definition under /etc/kubernetes/manifests/kube-apiserver.yaml.
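
As a rough aid for this check, the Python sketch below extracts the --etcd-* arguments from an API Server command line and prefixes the paths with /rootfs, matching the volume mount shown earlier. The function names and the sample command line are hypothetical. Mapping --etcd-cafile to certPath, --etcd-certfile to clientCertPath and --etcd-keyfile to clientKeyPath is an assumption based on the comments in the etcd configuration, so verify it against your own cluster.

```python
import shlex
from typing import Dict


def etcd_tls_flags(cmdline: str) -> Dict[str, str]:
    """Collect --etcd-*=value arguments from a kube-apiserver command line."""
    flags = {}
    for arg in shlex.split(cmdline):
        if arg.startswith("--etcd-") and "=" in arg:
            name, value = arg[2:].split("=", 1)
            flags[name] = value
    return flags


def to_collector_path(host_path: str) -> str:
    """Translate a host path into the path seen inside the collector container."""
    # The host directory /etc/kubernetes/pki/ is mounted under /rootfs.
    return "/rootfs" + host_path


# Hypothetical command line, as it might appear in the output of ps aux.
sample = (
    "kube-apiserver --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt "
    "--etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt "
    "--etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key "
    "--secure-port=6443"
)
flags = etcd_tls_flags(sample)

# Assumed mapping from API Server flags to collector settings.
collector_settings = {
    "certPath": to_collector_path(flags["etcd-cafile"]),
    "clientCertPath": to_collector_path(flags["etcd-certfile"]),
    "clientKeyPath": to_collector_path(flags["etcd-keyfile"]),
}
for key, value in collector_settings.items():
    print(f"{key} = {value}")
```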

If your etcd cluster runs on a dedicated set of nodes, you can define Prometheus collection in 004-addon.conf.

Metrics format

Prometheus defines several types of metrics.

Each metric value in Splunk has fields:

  • metric_type - one of the types from the Prometheus metric types.
  • metric_name - the name of the metric.
  • metric_help - the definition of this metric, present only if includeHelp is set to true.
  • metric_label_XXX - if the metric has labels, you will be able to see them attached to the metric values.
  • seed - unique value from the host for specific metric collection.

Based on the metric type you can find various values for the metrics.

  • counter
    • v - current counter value
    • d - the difference from the previous value
    • s - period for which this difference is calculated (in seconds)
    • p - (deprecated) period for which this difference is calculated (in nanoseconds)
  • summary and histogram
    • v - value
    • c - counter specified for this summary or histogram metric
  • All others
    • v - value
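
For counters, the d and s fields together give an average per-second rate over the collection period. A minimal Python sketch of that arithmetic is shown below. The event is a hypothetical example with made-up values, and the function name is illustrative only.

```python
from typing import Any, Mapping, Optional


def counter_rate(event: Mapping[str, Any]) -> Optional[float]:
    """Return the average per-second rate of a counter event, if it can be computed."""
    d = event.get("d")
    s = event.get("s")
    # The rate is undefined without a difference or with a missing or zero period.
    if d is None or not s:
        return None
    return d / s


# Hypothetical counter event: the value grew by 120 over a 60 second period.
event = {
    "metric_type": "counter",
    "metric_name": "example_requests_total",
    "v": 1500.0,
    "d": 120.0,
    "s": 60.0,
}
rate = counter_rate(event)
print(rate)
```

With these values the rate is 120 / 60, that is 2 per second.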

If you have chosen to include help with the metrics, you can explore all available metrics with the following search.

sourcetype="prometheus"
|  stats latest(_raw) by source, metric_type, metric_name, metric_help

About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications, which give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing solutions for Linux and Windows containers that are easy to use and deploy. We deliver applications that help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution to help you keep all the metrics and logs in one place, allowing you to quickly address complex questions on container performance.