Outcold Solutions LLC

Check Splunk search logs, just in case

June 24, 2021

We have been working on an interesting case with one of our customers. Every role in Splunk has a defined disk usage limit for search jobs, and by default the user role gets only 100MB.

Splunk Role Disk Limit

We are always cautious about how much data we bring to Splunk dashboards and limit everything to make sure our applications can handle large clusters.

One search that was causing an issue was the search we use to populate filters in various places of our "Monitoring OpenShift" application. Depending on the number of nodes, namespaces, and labels, we expect this search to return several thousand values, but it should not take much disk space.

( `macro_openshift_stats_cgroup` OR `macro_openshift_logs`) | 
stats count by host, openshift_node_labels, openshift_namespace, openshift_cluster_eval

This search generated ~1,000 rows on our test cluster but took almost 10MB of disk space. In the customer's environment, we were dealing with hundreds of MB on disk, which is very unusual.

Job Disk Size

A simple change to the search brought the disk usage down to just a few MB instead of hundreds.

( `macro_openshift_stats_cgroup` OR `macro_openshift_logs`) | 
stats count by host, openshift_node_labels, openshift_namespace, openshift_cluster

The difference between the first search and the second one is the use of the openshift_cluster_eval field, which is a calculated field that looks first at the indexed field openshift_cluster and, if there is no value there, falls back to the cluster label in openshift_node_labels (for backward compatibility).
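
For illustration, a calculated field of this kind can be defined in props.conf roughly as in the sketch below. The stanza name and the label extraction are hypothetical and only show the coalesce pattern; the actual definition ships with the application.

# props.conf (illustrative sketch, not the definition shipped with the application)
[kubernetes_stats]
EVAL-openshift_cluster_eval = coalesce(openshift_cluster, spath(openshift_node_labels, "cluster"))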

Considering that those searches return the same results, something was very odd. On our test cluster, one search took 8.26MB and the other 196KB, a difference of 42 times.

Job Disk Size Compare

When you inspect the job (search), you can find its Job ID (SID).

Job Inspect

Using this SID, you can find the folder on the Search Head that represents the job. It will be under $SPLUNK_HOME/var/run/splunk/dispatch, so we looked into it.

Job Inspect
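
If you have shell access to the Search Head, you can make the same comparison from the command line; the SIDs below are placeholders for the two jobs:

# Compare the on-disk size of the two jobs (replace the SIDs with your own)
cd $SPLUNK_HOME/var/run/splunk/dispatch
du -sh <sid_of_first_search> <sid_of_second_search>
# Break the larger job down by subfolder; in our case remote_logs stood out
du -sh <sid_of_first_search>/*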

As you can see, the difference between those two searches is just the size of the remote_logs folder. After some digging in those logs, we saw many repeating INFO messages that some of the calculated fields would be ignored, which is expected, but we definitely did not expect them to clog the search logs.

Search Process Log Explore
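
One quick way to see which component floods the logs is to count log lines per category inside the job's remote_logs folder. This is just a sketch that assumes the standard search.log line format (timestamp, level, component); use zgrep if the files are compressed:

# Count search.log lines per logging component, most frequent first
grep -rhoE ' (DEBUG|INFO|WARN|ERROR) +[A-Za-z0-9_]+ ' remote_logs/ | sort | uniq -c | sort -rn | head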

If you look at the Splunk documentation, you will find some information about Splunk logging (see "Enable debug logging" in the Troubleshooting Manual). The logs we are looking at belong to the search process, not the Splunk daemon, so the relevant configuration is $SPLUNK_HOME/etc/log-searchprocess.cfg. And considering that in our case we have a Splunk deployment with a Search Head Cluster and an Indexer Cluster, the distributed parts of the searches run on the indexers, so if we want to remove those INFO messages, we need to modify the configuration on the indexers.

If you look at the default configuration in $SPLUNK_HOME/etc/log-searchprocess.cfg, you will find:

rootCategory=INFO,searchprocessAppender
appender.searchprocessAppender=RollingFileAppender
appender.searchprocessAppender.fileName=${SPLUNK_DISPATCH_DIR}/search.log
appender.searchprocessAppender.maxFileSize=10000000 # default: 10MB (specified in bytes).
appender.searchprocessAppender.maxBackupIndex=3

This means that all categories that are not explicitly overridden default to INFO. The maximum size of the logs per search process would be 30MB (3 files of up to 10MB each), so if you have 10 indexers, those logs can grow up to 300MB for a single search.

There are two ways to fix that. First, we can override the level for a specific category (in our case it was CalcFieldProcessor) by creating a file $SPLUNK_HOME/etc/log-searchprocess-local.cfg with the content:

category.CalcFieldProcessor=WARN

The second option is to override the default log level for all categories by creating the same file, $SPLUNK_HOME/etc/log-searchprocess-local.cfg, with the content:

rootCategory=WARN,searchprocessAppender

After we applied those changes, the searches no longer took so much space on disk.

Jobs after fix

One important detail: if you are using Splunk Cloud, you do not have access to the Splunk file system. To find out whether you are affected by the same issue, run the search, open the Job Inspector, scroll to the very bottom, and expand Search Job Properties. At the bottom of that page you will find the log files from the indexers, which you can download to see whether anything is clogging the search logs.

Job Inspector
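
If you prefer a search-based check, you can also list recent jobs with their disk usage through the search jobs REST endpoint. This is just one possible approach; what you see depends on your role's permissions, and diskUsage is reported in bytes:

| rest /services/search/jobs
| table label, author, sid, diskUsage
| sort - diskUsage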

If you see that those logs take a lot of space, talk to Splunk Support and ask them to apply these configuration changes on the indexers in your Splunk Cloud cluster.

performance, splunk, configuration, distributed search

About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications, which give you insights across all container environments. We are helping businesses reduce complexity related to logging and monitoring by providing easy-to-use and easy-to-deploy solutions for Linux and Windows containers. We deliver applications that help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution to help you keep all the metrics and logs in one place, allowing you to quickly address complex questions on container performance.