31 May How to archive indexed data in Splunk Enterprise
In the last blog post, we covered how data ages in Splunk Enterprise through different stages. In the frozen stage, data is either archived or deleted after a set period of time. Archived data has no default location, so you can archive it to a directory of your choice. So how do you archive indexed data in Splunk Enterprise? Read on.
By default, the indexer deletes data once it reaches the frozen state; when that happens depends on how you’ve set your data retirement and archiving policy. If you want to archive the data instead, you can let the indexer archive it automatically or supply a custom archiving script for the indexer to run. You can do this in two ways in indexes.conf:
- By setting the coldToFrozenDir attribute, which specifies the location where the indexer will automatically archive the frozen data.
- By setting the coldToFrozenScript attribute, which specifies a user-supplied script that the indexer will run when the data is frozen.
You should set only one of these two attributes. If you set both, coldToFrozenDir takes precedence over coldToFrozenScript. If you set neither, the indexer runs a default script that simply writes the name of the bucket being erased to the log file $SPLUNK_HOME/var/log/splunk/splunkd_stdout.log, and then erases the bucket.
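As a minimal sketch of the first option, the stanza below sets coldToFrozenDir for a single index. The index name web_logs and the archive path are hypothetical placeholders, not values from this article, so substitute your own.

```ini
# indexes.conf
# "web_logs" and the archive path are hypothetical examples.
[web_logs]
coldToFrozenDir = /mnt/splunk_archive/web_logs
```

The user that runs the indexer needs write access to the directory you choose.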
How the indexer archives the data for you
How the indexer archives the frozen data depends on the release in which the data was originally indexed. For buckets created in version 4.2 and later, the indexer removes all files except the rawdata file. For buckets created in pre-4.2 versions, the indexer gzips all the .tsidx and .data files in the bucket.
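The two behaviours can be sketched in Python as follows. This is an illustration only, not the indexer's actual code: the function name trim_bucket is made up, and it assumes that a journal.gz file inside the rawdata directory marks a 4.2-or-later bucket.

```python
import gzip
import shutil
from pathlib import Path


def trim_bucket(bucket_dir: Path) -> str:
    """Trim a bucket directory in place and return the style that was detected."""
    rawdata = bucket_dir / "rawdata"
    if not rawdata.is_dir():
        raise ValueError(f"no rawdata directory in {bucket_dir}")

    # Snapshot the listing first, because files are created and removed below.
    entries = list(bucket_dir.iterdir())

    # Assumption: rawdata/journal.gz identifies a bucket created in 4.2 or later.
    if (rawdata / "journal.gz").is_file():
        # 4.2 and later: remove every top-level file, leaving only rawdata.
        for entry in entries:
            if entry.is_file():
                entry.unlink()
        return "4.2+"

    # Pre-4.2: gzip each .tsidx and .data file, then remove the original.
    for entry in entries:
        if entry.is_file() and entry.suffix in {".tsidx", ".data"}:
            gz_path = entry.with_name(entry.name + ".gz")
            with entry.open("rb") as src, gzip.open(gz_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            entry.unlink()
    return "pre-4.2"
```

Try a sketch like this only on a copy of a bucket, never on a live index directory.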
Customizing an archiving script
If you specify a script in the coldToFrozenScript attribute, the indexer runs it just before erasing data that has reached the frozen state, which gives the script the chance to archive the data first. You supply the script yourself and point the indexer to it in the index’s stanza in indexes.conf, in the form coldToFrozenScript = ["<path to program that runs script>"] "<path to script>" under [<index>], where:
- <index> indicates which index contains the data you want to archive.
- <path to script> indicates the path to the archiving script. The script must be in $SPLUNK_HOME/bin or one of its subdirectories.
- <path to program that runs script> is optional; however, you must set it if your script needs a program such as python to run it.
For example, if your script is located in $SPLUNK_HOME/bin and is called myColdToFrozen.py, set the attribute in the following manner:
coldToFrozenScript = "$SPLUNK_HOME/bin/python" "$SPLUNK_HOME/bin/myColdToFrozen.py"
The indexer also ships with an example archiving script, $SPLUNK_HOME/bin/coldToFrozenExample.py, that you can edit and customize. A word of caution: this is only an example script and should not be applied to a production instance without extensive testing in your environment. As described earlier, the example script archives frozen data in one of two ways, depending on whether the data was originally indexed in version 4.2 and later or in a pre-4.2 version. Make sure that any script you create completes quickly, so that the indexer is not left waiting for it to return.
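To make the shape of such a script concrete, here is a minimal sketch of what a file like myColdToFrozen.py might contain. It is not the shipped example script. It assumes the indexer passes the bucket directory as the first command-line argument, that the bucket path has the form <index>/colddb/<bucket>, and that a non-zero exit code is treated as a failed archive. ARCHIVE_ROOT is a hypothetical location.

```python
import shutil
import sys
from pathlib import Path

# Hypothetical archive location; change it to suit your environment.
ARCHIVE_ROOT = Path("/mnt/splunk_archive")


def archive_bucket(bucket_dir: Path, archive_root: Path) -> Path:
    """Copy a bucket's rawdata directory to archive_root/<index>/<bucket>."""
    rawdata = bucket_dir / "rawdata"
    if not rawdata.is_dir():
        raise ValueError(f"no rawdata directory in {bucket_dir}")
    # Assumption: the path looks like .../<index>/colddb/<bucket>,
    # so the index name sits two levels above the bucket.
    index_name = bucket_dir.parent.parent.name
    dest = archive_root / index_name / bucket_dir.name
    # copytree creates missing parents and fails if the target already exists.
    shutil.copytree(rawdata, dest / "rawdata")
    return dest


def main(argv: list) -> int:
    """Return 0 on success and 1 on failure."""
    try:
        archive_bucket(Path(argv[1]), ARCHIVE_ROOT)
    except (OSError, ValueError) as exc:
        print(f"archive failed: {exc}", file=sys.stderr)
        return 1
    return 0


# Run only when a bucket path was supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

The script does nothing beyond a single directory copy, which keeps it quick. Slower work such as compression or uploading to remote storage is probably better handled by a separate job that picks up from the archive directory.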
Data archiving and indexer clusters
A non-clustered indexer rolls its buckets to frozen based on its own set of configurations. In the same way, each peer node in an indexer cluster rolls its buckets to frozen based on its own set of configurations. All peers in a cluster should be configured identically, so all copies of a bucket should roll to frozen at about the same time.
Occasionally, the timing can vary, because the same index can grow at different rates on different peers. The cluster performs processing to ensure that buckets freeze smoothly across all peers. In particular, it does not initiate fix-up activities for a bucket that is frozen on one peer but not on another.
Archiving multiple copies
Because indexer clusters contain multiple copies of each bucket, archiving indexed data with the methods described above leaves you with multiple copies of the data, with the number depending on the replication factor. For instance, if a cluster has a replication factor of 3, the cluster stores three copies of the data across its set of peer nodes. You can’t solve this problem by archiving the data on just a single node, since there’s no certainty that a single node contains all the data in the cluster. The solution would seem to be to archive just one copy of each bucket on the cluster and delete the other copies, but in practice that is a complex process.
Besides, if you choose to archive multiple copies of clustered data, you may run into name collisions with identically named copies of buckets. This will create problems especially for deployments where contents of a directory are required to have unique names.
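As one illustration of how a custom archiving script could sidestep such collisions, the sketch below appends an identifier for the peer to each archived bucket name. This is a hypothetical approach, not a Splunk feature: the function name, the peer_id parameter, and the example values are all made up for illustration.

```python
from pathlib import Path


def archive_destination(archive_root: Path, index_name: str,
                        bucket_name: str, peer_id: str) -> Path:
    """Build an archive path whose final component is unique per peer."""
    if not peer_id:
        raise ValueError("peer_id must not be empty")
    # Suffixing the bucket name keeps copies from different peers apart,
    # even when they land in the same directory.
    return archive_root / index_name / f"{bucket_name}_{peer_id}"


# Hypothetical values: the same bucket name archived from two peers.
root = Path("/mnt/splunk_archive")
copy_a = archive_destination(root, "web_logs", "db_170_160_12", "peer-a")
copy_b = archive_destination(root, "web_logs", "db_170_160_12", "peer-b")
```

The peer identifier could be the host name or any other value that is unique to each peer. This only prevents name clashes; it does nothing to reduce the number of copies that get archived.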
There are several ways around this. However, the complexity can be baffling as well as time-consuming. We recommend consulting a Splunk professional to customize a solution for your business environment so you can focus on other important initiatives. Splunk professionals at Cyber Chasse are trained to take on these and other complexities and provide you with the best resolution to your archiving needs. Contact us today.