Migrating Graylog Servers - Part 6 - Lessons Learned

This is the sixth and final post in a multi-part series where I explore the process of transforming an existing Graylog install into a resilient and scalable multi-site installation. Start here for Part 1.

Let me start by saying that the entire process was a huge learning experience for me. Over the course of my career I have dealt with a number of different logging systems with varying degrees of complexity. The simplest were straight syslog, syslog-ng, or rsyslog logging to text files, with analysis in awk/sed/wc/grep/perl. The most complex were commercial solutions like RSA enVision and NitroView, now McAfee ESM. The simple solutions tend to lack more advanced reporting features but are insanely simple to manage. The more complex will let you report or alert on almost any condition you can think of, but at the expense of requiring intense amounts of time to configure and manage.

All that being said, log management was never a primary focus of mine. In past jobs those duties did fall within my team and I was always closely involved in the process. As a result I have a strong appreciation of many of the complexities involved in developing a log management/analysis program and some knowledge of the players in the field. I've been aware of Graylog2 for some time, but this was certainly my first opportunity to really dig into it and also my first time working with almost all of the technologies it relies on.

I came into the entire process as close to a blank slate as possible. I had no opinion of the product. I've never worked with any of the developers[1]. I've never used any of the underlying technologies involved. I came to this purely as an engineer who wanted a system that didn't suck and would let him do incident response and auditing. This is why I'm hoping the following lessons learned and observations will be useful to someone else and not simply cathartic.

Architecture and Deployment

Graylog2 can get really goddamn complicated. I kept referring back to the Architecture diagram as an example of what to design against and think about. I wasn't trying to build something that legends would be told of, but I also didn't want to build something that would have to be ripped out and replaced in a year. So I spent a fair bit of time kidnapping a coworker and scribbling on whiteboards as we discussed my ideas, though honestly I probably talked at him more than with him.

Graylog2 Is Pretty Simple

The guys at TORCH were either smart or lucky when they split up their architecture the way they did. During my work the easiest component in the entire stack to scale out was the Graylog2 server itself. There was no magic involved in creating a new server instance and expanding the cluster. So long as all of the graylog2-server instances were configured against the same MongoDB server, they were clustered.
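
In practice that amounts to a few lines in each node's graylog2.conf. The following is only a minimal sketch of the relevant entries from that era; the hostnames are placeholders and exact parameter names may differ between versions:

    # /etc/graylog2.conf (excerpt) -- hostnames are placeholders
    # Exactly one node in the cluster should have is_master = true
    is_master = false
    # Every graylog2-server instance points at the same MongoDB
    mongodb_host = mongodb.example.com
    mongodb_port = 27017
    mongodb_database = graylog2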

Graylog2 itself doesn't really need to scale because the application is pretty standalone. There is the idea of a Graylog2 Cluster, but as near as I can tell what that really means is, "I see other graylog2-server instances that say they're part of me because they're listed as online in the database. That's neato." More than anything else it provides a way for more than one system to write to the same ElasticSearch cluster without it turning into a cluster.

Logs Is Logs Is Logs

Coming from an enterprise SIEM background I immediately jump to concepts like "log normalization" and "dynamic baselining". These are features that we don't really have here, and quite honestly I shouldn't complain. Graylog2 is primarily a log collection, storage, and search tool. Building out dashboards with automatic dynamic baselines for each dashboard query is pretty system intensive and a significant development effort to build and maintain. That's not really the market this product is going for.[2]

The other piece I desperately miss is log normalization. That is, every incoming log is run through a parser and the important pieces of information are pulled out and stored in a consistent format. This means we can do a search for "Authentication Logs for user Scott Pack" or "Show me all successful logins for Scott Pack summarized by number of events per device" and receive consistent results whether the logs came from Linux, Windows, or Apache. There is something called an "Extractor" which is a start, but we're not quite there yet. It's an exceptionally heavy burden to place on development staff, so there's always the possibility that normalization stays off the radar.
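
To make the idea concrete, here's roughly what normalization looks like as a hypothetical Python sketch. This is an illustration of the concept, not anything Graylog2 ships, and the field names are ones I made up:

    import re

    # Hypothetical normalizer: map a raw sshd log line onto a fixed schema
    # so "all successful logins for a user" is one query no matter the source.
    SSHD_ACCEPT = re.compile(
        r"Accepted (?P<method>\S+) for (?P<user>\S+) "
        r"from (?P<src_ip>\S+) port (?P<src_port>\d+)"
    )

    def normalize_sshd(line, hostname):
        match = SSHD_ACCEPT.search(line)
        if match is None:
            return None
        return {
            "event_type": "authentication",
            "outcome": "success",
            "user": match.group("user"),
            "src_ip": match.group("src_ip"),
            "device": hostname,
        }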

ElasticSearch is Magic

For all my complaints about MongoDB being hipster software, ElasticSearch does lots of things right. Clustering is trivial: one simply tells each node where the others are and they figure it out. My only problems came into play when initially spreading out the load and then later when I performed a major upgrade.
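
"Tells each node where the others are" boils down to a few lines in elasticsearch.yml. This is a sketch against the 0.90/1.x-era settings; the cluster name and hostnames are placeholders for my environment:

    # /etc/elasticsearch/elasticsearch.yml (excerpt) -- hostnames are placeholders
    cluster.name: graylog2
    node.name: es1
    # Skip multicast discovery and point each node at the others explicitly
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["es1.example.com", "es2.example.com"]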

Adding a second instance and setting the two up in a Master/Master configuration was trivial, but by default that only provides load sharing. I also had to change the replication settings in order for a copy of the data to exist on each node. That's no big deal, but replication isn't really set at the global level. Instead, every time an index is created the replication factor for that index is set based on the global setting. This means I had to retroactively change the replication on every index after the fact. ElasticSearch offers no single switch to make that change across existing indices, so I had to write a script that iterates through every index and updates the setting. Not terrible, but certainly not great.
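
The script itself is short. Here's a sketch of the approach against the REST API of that era; the host and the replica count of 1 are assumptions for my two-node setup:

    #!/usr/bin/env python
    # Retroactively set number_of_replicas on every existing index.
    import json
    import requests

    ES = "http://localhost:9200"

    # GET /_settings returns a document keyed by index name
    indices = requests.get(ES + "/_settings").json()

    for index in sorted(indices):
        resp = requests.put(
            "%s/%s/_settings" % (ES, index),
            data=json.dumps({"index": {"number_of_replicas": 1}}),
        )
        print("%s: %d" % (index, resp.status_code))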

The upgrade issue was a bit more interesting. Apparently there is a disk utilization threshold: if the partition that stores the ElasticSearch data is too full, no shards will be assigned to that node. When I upgraded ElasticSearch from 0.90 to 1.3.2, disk usage on my ElasticSearch data partition was around 90%, which resulted in the second cluster node connecting to the cluster but never getting its shards re-assigned.
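
In the 1.x line that threshold is exposed through the cluster settings API, so it can be loosened temporarily while you free up disk space. A sketch, assuming the disk-based allocation watermarks available in 1.3; the 95% figure is an illustration, not a recommendation:

    # Temporarily raise the low disk watermark so shards can allocate again.
    import json
    import requests

    requests.put(
        "http://localhost:9200/_cluster/settings",
        data=json.dumps({
            "transient": {
                "cluster.routing.allocation.disk.watermark.low": "95%"
            }
        }),
    )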

Management

Yes. That's Right. I Said Plural Databases.

Graylog2 uses two separate database systems. MongoDB stores configuration data and such from graylog2-web and graylog2-server, whereas ElasticSearch stores the actual log data. Both are these newfangled NoSQL, JSON-based, RESTful systems. I'll accept that the systems are actually designed for different purposes: MongoDB acts more like a traditional database, whereas ElasticSearch is intended more as a data search engine. Using them for their intended purposes makes sense, but MongoDB is actually used to store a significantly insignificant amount of information. It makes me wonder if, given the small amount of data involved, we couldn't simply use ElasticSearch instead.

As it stands we have two separate database server technologies that use two separate management subsystems, which means my team needs to become expert in two more databases that aren't used for any other service we manage. That problem lies solely on us, but it's still a bummer.

MongoDB Makes Me Sad

MongoDB provides some pretty amazing clustering and replication options. As I was splitting up services and building configs I really thought I could build out a Master/Master cluster and be happy. This is pretty much the anti-truth. In MongoDB a cluster is called a Replica Set, and using Replica Sets we can build the distributed model we hoped for, but with an extra hitch called quorum. The MongoDB Manual goes into more detail, but in short any Replica Set effectively requires a minimum of three voting members to maintain quorum. Personally, I find three separate servers a bit much to store less than 5GB worth of metadata.
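
The one consolation is that the third member doesn't have to be a full data-bearing node: an arbiter votes in elections but stores nothing. Here's a sketch of initiating such a set from pymongo; the hostnames, the set name, and the choice of driver are all assumptions for illustration:

    # Initiate a three-member replica set where the third member is an
    # arbiter: it participates in elections but holds no data.
    from pymongo import MongoClient

    config = {
        "_id": "graylog2",
        "members": [
            {"_id": 0, "host": "mongo1.example.com:27017"},
            {"_id": 1, "host": "mongo2.example.com:27017"},
            {"_id": 2, "host": "mongo3.example.com:27017", "arbiterOnly": True},
        ],
    }

    MongoClient("mongo1.example.com", 27017).admin.command("replSetInitiate", config)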

Infinitely Unbounded Spoolers Are The Devil

As of this writing there exists a bug in graylog2-server that results in a resource exhaustion condition. The input message spooler caches messages out to disk as they come in, which is cool. What's not cool is that the cached messages aren't flushed after they're processed. This means the filesystem on which /var/lib/graylog2-server/message-cache-spool resides will eventually fill up and graylog2-server will crash. Until the bug gets fixed, targeted for 0.92, my only recourse was to set up a cron job that stops the service, deletes the cache files, and restarts the service.
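
The workaround is ugly but small. Here's a sketch of the cron entry; the schedule, the service name, and the glob are assumptions for my environment, and the spool path is the one above:

    # /etc/cron.d/graylog2-spool-flush -- blunt workaround until 0.92
    0 3 * * 0 root service graylog2-server stop && rm -f /var/lib/graylog2-server/message-cache-spool/* && service graylog2-server start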

Support

In my experience, having a good relationship with your providers is a necessity. It gives you a voice to drive the product roadmap in a direction that helps your business. It gives you a sympathetic ear that may be able to help when things go wrong. It gives you the chance to tell someone who matters that you like what they're doing and hope they'll succeed. During the project I engaged with some of the developers several times, mostly to ask questions that came up, but also to take the opportunity to get involved and let them know what I wanted.

The developers I interacted with were all pleasant people who really seemed to want to make a good product and genuinely liked talking to people. I love this in a vendor. The best example is probably this conversation on Twitter after they announced one of my most pined-for features.


  1. Full disclosure: the lead developer served as a guest on a coworker's podcast. That happened prior to my coming on board and as such didn't involve me.

  2. That being said, if they want to start digging that hole it'd be pretty awesome.