My Thoughts on Retention

Sometimes I feel like Records Retention is the red-headed stepchild. It's obviously important: almost every regulation that covers us talks about it to one extent or another. There's a base assumption under the major frameworks that retention is happening[1]. Having a policy around records management is even a requirement under SOX, and for certain classifications of data under PCI-DSS and HIPAA. Despite all this, in my experience, retention schedules are often amongst the last policies to be built and the least likely to be enforced.

Personally, I find this somewhat backwards. A Records Retention policy is one of the easier policies to write. At its core a retention schedule is not much more than a list of data types and when they're to be disposed of. The trick is figuring out what those time frames should be. For each data type you need to take into account at least:

  • Liability: We cannot be held liable for any data that falls outside of our published retention schedules (unless the published schedule contradicts regulation or legislation).
  • Storage: Keeping stuff a long time means we have to pay to keep it.
  • Processing: The further back we keep data, particularly logs and alerts, the longer it generally takes to run reports and process the data.
  • Usability: We need to understand how we're going to use the information so we can figure out how long we actually need to keep it.
  • Compliance: Most regulations, laws, and standards define minimum storage times, and sometimes you'll see maximums. Whatever else we do, we need to make sure we follow these, or else get someone higher than us to sign off on ignoring the requirement.

Generally speaking, liability, storage, and processing lean towards keeping data for as short a time as possible. Usability and compliance can set specific maximums or minimums.
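To make that push and pull concrete, here is a minimal Python sketch of combining minimums and maximums into an allowed retention window. The `Constraint` class, the `retention_window` function, and every number in it are hypothetical names and values made up for illustration, not part of any framework or regulation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Constraint:
    """One stakeholder's requirement for a data type (hypothetical structure)."""
    source: str                      # e.g. "compliance", "usability", "storage"
    min_days: Optional[int] = None   # keep at least this long
    max_days: Optional[int] = None   # keep no longer than this


def retention_window(constraints: Iterable[Constraint]) -> Tuple[int, Optional[int]]:
    """Return (floor, ceiling) in days; ceiling is None when nobody set a maximum."""
    constraints = list(constraints)
    # The longest minimum wins: everyone's "at least" must be satisfied.
    floor = max((c.min_days for c in constraints if c.min_days is not None), default=0)
    # The shortest maximum wins: everyone's "no longer than" must be satisfied.
    ceilings = [c.max_days for c in constraints if c.max_days is not None]
    ceiling = min(ceilings) if ceilings else None
    if ceiling is not None and floor > ceiling:
        # Nothing satisfies everyone, so a person has to decide, not the code.
        raise ValueError(f"minimum of {floor} days exceeds maximum of {ceiling} days")
    return floor, ceiling


# Illustrative values only.
window = retention_window([
    Constraint("compliance", min_days=365),
    Constraint("usability", min_days=90, max_days=730),
])
print(window)
```

When the floor ends up above the ceiling, the sketch raises an error on purpose: that conflict is a negotiation between people, not something to resolve silently in code.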

This is when you need to actually start talking to people. From your compliance team you need to find out what kind of data needs to be collected and what time frames may be prescribed based on how your product or company is regulated. From the data custodians you need to know the technical limitations of how much can be stored and how. From the users[2] you need to find out what kinds of information are being used and how they're being used. For example,

Syslogs are primarily useful for system state monitoring, troubleshooting, and auditing. System state is very transient information that is rarely useful beyond a day or so. Problems generally make themselves apparent quickly and can be resolved with relatively short-lived history. It is very rare to go back more than 3-4 weeks during troubleshooting. Audit reports are run regularly against the logs and can be used without referring to the authoritative source.

This tells us we minimally need 4 weeks but can probably scale out to 3-6 months in order to cover abnormal situations. One year or longer is almost certainly overkill.

Or to beat that horse using a business side example,

Client engagements are charged on a per-encounter basis and billed monthly. Notes are kept for each engagement and are used to clarify potential discrepancies in the event of a billing dispute. The notes do not contain any legislated or contractually defined data but do contain private information regarding the client. Billing disputes are rare, are usually initiated within 2 months, and have never been resolved later than 9 months following the engagement.

In this situation there would be a business need to keep the notes at least 9 months; we can round that up to 12 to make a nice easy number. However, the actual billing data may need to be kept longer depending on any number of factors such as financial reporting or tax filings.

There are a lot of variables involved and there may be some political negotiation. If billing needs 12 months and storage has only allocated disk for 5, then someone will have to change something. That's not to say you have to get it all right at the outset either. Create a default schedule and add to it as you define the data elements. Just get something in place. Your technical team will be happy because their storage stops filling up and your legal team will be happy when they receive a subpoena for "all network traffic, emails, and telephone records related to John Snow since August 10th, 1995".
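As a rough illustration of "a default schedule you add to over time", here is a small Python sketch. The names (`SCHEDULE`, `DEFAULT_RETENTION_DAYS`, `retention_days`, `is_due_for_disposal`) are hypothetical. The syslog and engagement-note periods loosely follow the two examples above (about 6 months and 12 months), and the 365-day default is purely an assumption for the sake of the example.

```python
from datetime import date, timedelta

# Assumed fallback for any data type nobody has classified yet.
DEFAULT_RETENTION_DAYS = 365

# Start small and add entries as each data element gets defined.
SCHEDULE = {
    "syslog": 180,            # roughly the upper end of the 3-6 month range
    "engagement_notes": 365,  # 9 months of business need, rounded up to 12
}


def retention_days(data_type: str) -> int:
    """Look up the retention period, falling back to the default."""
    return SCHEDULE.get(data_type, DEFAULT_RETENTION_DAYS)


def is_due_for_disposal(data_type: str, created: date, today: date) -> bool:
    """True once a record is older than its retention period."""
    return today - created > timedelta(days=retention_days(data_type))


today = date(2024, 7, 1)
created = date(2024, 1, 1)
print(is_due_for_disposal("syslog", created, today))            # 182 days old vs 180 allowed
print(is_due_for_disposal("engagement_notes", created, today))  # 182 days old vs 365 allowed
```

A real disposal job would also need to respect legal holds and log what it deleted; the sketch leaves both out to stay short.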


  1. Under NIST SP 800-53r4 check out AU-11, SI-12, and CM-2(3). Or if you prefer something drier than NIST, go to ISO 15489-1:2001.

  2. Users is a pretty generic term here. For things like firewall, web, or auth logs it might be your systems administrators or security analysts. For things like client engagements and front desk visitor check-in it may be your billing or business intelligence folk.