A Retention Complication

After going through the retention schedule exercise on our infrastructure log management system I ran into a bit of an interesting situation. First, some background.

Historically Graylog2 hasn't provided any capability to perform time-based retention, which is pretty dang lame. Instead all retention was performed based purely on message count. If you have good figures on your log generation rates this is no big deal. Let's assume you have a 6 month retention schedule and your environment has leveled out to about 25,000 events per minute (roughly 417 events per second). That gives us 36,000,000 events per day, kept for 182 days, which in Graylog2 is pretty easy to set up. Just change these lines in your /etc/graylog2.conf and restart the service.

elasticsearch_max_docs_per_index = 36000000
elasticsearch_max_number_of_indices = 182
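
The arithmetic behind those two values is worth writing down, since it has to be redone every time the event rate changes. Here is a rough sketch in Python, working from the 36,000,000 per day figure (which corresponds to 25,000 events per minute, or about 417 per second). The function name count_based_settings is made up for illustration and is not part of Graylog2.

```python
MINUTES_PER_DAY = 24 * 60


def count_based_settings(events_per_minute, retention_days):
    """Size each index to hold one day of events and keep one index per day.

    Hypothetical helper: it only does the arithmetic, it does not talk to Graylog2.
    """
    docs_per_day = events_per_minute * MINUTES_PER_DAY
    return {
        "elasticsearch_max_docs_per_index": docs_per_day,
        "elasticsearch_max_number_of_indices": retention_days,
    }


if __name__ == "__main__":
    # Print the values in the same "key = value" shape as the config file.
    for key, value in count_based_settings(25_000, 182).items():
        print(f"{key} = {value}")
```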

But, in reality, there's kind of a hitch. Most of us don't live in an environment that predictable. Most of us are actively building out our logging infrastructure and frequently adding new systems that have never been profiled and have no clearly understood volume. Or we're a dynamic shop with systems coming up and down automagically to meet demand. We might go from 200 EPS one day to 700 the next. Growth can be approximated over time, but it takes a significant amount of data before the approximation is meaningful.
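
To put a number on how far a count-based limit can drift, here is a small sketch that computes how many days of logs actually fit before the oldest index gets dropped. It assumes the 36,000,000 by 182 configuration from earlier and a steady event rate. The helper days_retained is hypothetical, not something Graylog2 provides.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Values from the count-based config shown earlier.
MAX_DOCS_PER_INDEX = 36_000_000
MAX_INDICES = 182


def days_retained(events_per_second):
    """Days of history kept at a steady rate under a fixed document budget."""
    total_capacity = MAX_DOCS_PER_INDEX * MAX_INDICES
    return total_capacity / (events_per_second * SECONDS_PER_DAY)


if __name__ == "__main__":
    for eps in (200, 700):
        print(f"{eps} EPS -> about {days_retained(eps):.0f} days retained")
```

At 200 EPS the same settings keep a bit over a year of logs. At 700 EPS they keep under four months, well short of the 6 month target.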

Fortunately for us, Graylog2 finally started shipping time-based retention. I was pretty stoked and jumped right in, especially since it was as easy as two config options.

elasticsearch_max_time_per_index = 1d
elasticsearch_max_number_of_indices = 20

Personally I didn't care how many indices there were just so long as the oldest one was no more than 6 months old. So I set elasticsearch_max_time_per_index to 182 days, thinking any index older than that would get deleted, and outright ignored elasticsearch_max_number_of_indices. Therein lies my fatal flaw. As it turns out I was a bit off. The option elasticsearch_max_time_per_index controls rotation (how often a new index is started), whereas elasticsearch_max_number_of_indices controls retention (how many indices are kept before the oldest is deleted). This is important.
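
Put another way, the retention window is roughly the rotation period multiplied by the number of indices kept. Here is a quick sketch of that relationship, assuming indices rotate exactly on schedule and ignoring the partially filled active index. The names retention_window and indices_needed are illustrative, not Graylog2 settings.

```python
import math
from datetime import timedelta


def retention_window(time_per_index, number_of_indices):
    """Approximate age of the oldest index just before it is deleted."""
    return time_per_index * number_of_indices


def indices_needed(time_per_index, target):
    """Smallest index count whose window covers the target retention period."""
    return math.ceil(target / time_per_index)


if __name__ == "__main__":
    daily = timedelta(days=1)
    six_months = timedelta(days=182)

    print(retention_window(daily, 20).days)       # the example config above
    print(retention_window(six_months, 20).days)  # 182 day rotation, count assumed to be 20
    print(indices_needed(daily, six_months))      # count needed for 6 months of daily indices
```

With daily rotation, 182 indices gives the 6 months I was after. A 182 day rotation period with 20 indices (the count from the example config, assumed here) would instead keep logs for nearly ten years.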

Retention Is Not Rotation

I mean, this does make sense. Retention is how long we keep something around and rotation is how often we fiddle with it. After a good bit of back and forth with the developers it occurred to me that I had mentally combined the two actions into a single concept. Take for example log files on a Linux system. A very common setup is to rotate files nightly and delete them weekly. I realized that I had internalized rotation strictly as a resource management problem. Files that contain logs from a single day are easier to read, generally smaller, and all around easier to work with.
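
The nightly rotate, weekly delete setup can be modeled as two independent steps, which makes the split easier to see. This is a toy sketch only: the file names and the rotate, prune, and simulate helpers are invented for illustration and do not touch any real log files.

```python
from collections import deque


def rotate(files, day):
    """Rotation: close out the current file and start a fresh one for the day."""
    files.append(f"messages.{day}")


def prune(files, keep):
    """Retention: delete the oldest files until only `keep` remain."""
    deleted = []
    while len(files) > keep:
        deleted.append(files.popleft())
    return deleted


def simulate(days, keep):
    """Run one rotation and one retention pass per day; return surviving files."""
    files = deque()
    for day in range(1, days + 1):
        rotate(files, day)
        prune(files, keep)
    return list(files)


if __name__ == "__main__":
    print(simulate(days=10, keep=7))
```

After ten days with keep=7, only the files for days 4 through 10 remain. Changing how often rotate runs changes the size of each file, and changing keep changes how far back the history goes. Neither setting implies the other.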

Retention, again in my head, is much more important, though. Here we're not really concerned about ease of use but rather importance. We keep files because they're important. Then, after a certain time, they're not, or we're told we can't keep them. Either way, we delete them because they're not important anymore[1].

By merging the two concepts I had relegated rotation to being just something we do as part of retention because it's easy, not because it has any actual merit. This is where I had failed, and I think it bears repeating. Rotation is important and has its own merits. Rotation is how we organize log files. Rotation is how we tier information by heat. Hell, rotation is how I organize the notebook on my desk.

Normally I find myself rethinking technical implementations as a result of policy/procedure/governance enlightenment. This incident is the opposite. Thanks to a moment of technical implementation enlightenment I've found myself rethinking some policies and procedures and where there might be missing pieces.

  1. Ok, fine, oftentimes we delete things not because they're not important but because we don't have the storage. Don't fool yourself. If something truly were that important, and an extra drive tray for the NAS was denied, we would still find a way.