Migrating Graylog Servers - Part 2

This is the second in a multi-part series where I explore the process of transforming an existing Graylog install into a resilient and scalable multi-site installation. Start here for Part 1.

Server Setup

At this point we have about 6 months of data in our existing Graylog2 system so I wanted the entire migration to happen without data loss and to be as seamless as possible. For the new environment I needed 4 new servers:

  1. East Coast Graylog2 Server (east-gray01)
  2. East Coast ElasticSearch node (east-es01)
  3. West Coast Graylog2 Server (west-gray01)
  4. West Coast ElasticSearch node (west-es01)

You'll notice a discrepancy between this setup and the diagram from before. For now I opted to leave the Graylog2 Web Interface running on the Graylog2 server nodes. That was mostly for ease of deployment and management more than anything else. I'm truly not concerned with load at this point, and we will be configuring the applications with no location assumptions. This means if we need to separate the two out then there will be minimal amounts of detangling involved.

The Architectural Considerations page gives us some pretty good data on sizing. Server nodes need gobs of CPU, data nodes need gobs of RAM and disk but not much CPU, web and mongo need almost pitifully small amounts of anything. As a general rule I would recommend the following for two deployment scenarios.

Graylog2 Servers

       Small          Medium
CPU    4 cores        8 cores
RAM    4 GB           8 GB
Disk   40 GB          40 GB
OS     Ubuntu 14.04   Ubuntu 14.04

ElasticSearch Servers

       Small          Medium
CPU    2 cores        2 cores
RAM    12 GB          24 GB
Disk   500 GB         1 TB
OS     Ubuntu 14.04   Ubuntu 14.04

Very likely the most up-in-the-air figure is going to be disk. If you are just now building a log collector it's hard to predict what the overall utilization will be. You'll need to take into account your retention schedule, the typical size of your logs, and how quickly they come in.
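To put some purely hypothetical numbers on it: at 1,000 messages per second averaging 500 bytes each you're ingesting roughly 43 GB of raw logs a day, so a 30-day retention window already wants well over 1 TB per copy of the data before indexing overhead and replication enter the picture.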

Actually building the servers is easy. These days we have fan-dancy tools like Foreman or Cobbler and Puppet or Chef to do all the actual work, leaving us with a base system to deal with. I opted to get started on the ElasticSearch nodes.

Since site independence was foremost in my mind I wanted to emulate a master-master replication setup. This was actually significantly easier than I thought, largely due to the automagic clustering built into ElasticSearch.

ElasticSearch Fiddly Bits

Installation

First we have to install the ElasticSearch application. Graylog2 has a hard dependency on version 0.90.10[1]. Fortunately for us ElasticSearch maintains their own package repositories per version branch. Because we need the older version we can adjust their instructions a tad to be

wget -qO - http://packages.elasticsearch.org/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb http://packages.elasticsearch.org/elasticsearch/0.90/debian stable main" | sudo tee /etc/apt/sources.list.d/elasticsearch.list
sudo apt-get update && sudo apt-get install elasticsearch

Keep in mind that this will install the newest version of ElasticSearch from the 0.90 tree. Since I do all the menial work with Puppet I fix the problem by doing this in a manifest

package { 'elasticsearch':
  ensure => '0.90.10',
}

This makes sure version 0.90.10 is always installed, no matter what the newest version in the repository is[2].

Build Out the Application Configuration

Once the application is installed we need to configure it to join our current cluster. Edit the file /etc/elasticsearch/elasticsearch.yml to look something like this.

cluster.name: graylog2
node.name: east-es01
node.master: true
node.data: true
index.number_of_shards: 5
index.number_of_replicas: 1
discovery.zen.ping.timeout: 5s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["east-es01.east.example.com","west-es01.west.example.com","graylog.example.com"]

Some of the ElasticSearch terminology threw me for a bit, particularly the node.master setting. Unlike Graylog2, where you define which node is the master, setting node.master to true simply indicates that this node is capable of being a master. The ElasticSearch cluster itself elects a master from the current member nodes. It's much like a NetBIOS Browser Election except not nearly as terrifyingly noisy.

  • node.data says whether this system will actually store anything or act solely as a worker node
  • discovery.zen.ping.timeout defines how long nodes wait on discovery pings during a master election. The default is 3s but since I'm dealing with a cross-continent haul I was getting failures. Bumping it up to 5s fixed things.
  • discovery.zen.ping.unicast.hosts is used to restrict cluster membership and should contain every other ES node in the cluster. I opted to include all of them since it doesn't cause problems and it lets me reuse the same config.
  • discovery.zen.ping.multicast.enabled is pretty important. By default ES uses multicast to determine other cluster members. I don't like that since it's noisy and lets anyone in the multicast domain gain access to your cluster.
  • index.number_of_replicas defines how many copies of the data exist and is the real magic here.

Determining the number of shards per index is fuzzy. You want more than one because the bigger a single shard gets the more resources it takes to write to and search, and the longer it takes to replicate. Presumably there's a corollary wherein an insane number of shards can blow up performance. In our environment we did some rough envelope math to make each index roughly 4GB and went with 5 shards because it was halfway to 10.

Setting replication to 1 means that every shard will have exactly 1 copy, i.e. 1 primary and 1 replica, so with 5 shards each index carries 10 shards in total. ElasticSearch will also distribute the replicas so that a copy never lives next to its primary. That is, if we have 2 ElasticSearch nodes and 1 replica the primary will be on east-es01 and the replica will be on west-es01. ElasticSearch will never allocate a replica to the same node as its primary; if you configure more replicas than there are other nodes to hold them, the extras simply stay unassigned.
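If you want to sanity-check what an index actually ended up with, the index settings API will show the shard and replica counts. A minimal sketch, assuming ElasticSearch is listening on the default port and that Graylog2 is using its default index naming (graylog2_0, graylog2_1, and so on):

# Show the shard and replica settings for a single index; swap in whatever
# index names your Graylog2 install actually created
curl -s 'http://localhost:9200/graylog2_0/_settings?pretty'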

If you make the right call and install from the repo there will be a service configuration file; for us it's /etc/default/elasticsearch. Change the value of ES_HEAP_SIZE to something much bigger. The usual guidance is to give ElasticSearch about half of the machine's total memory and leave the rest for the OS file cache.
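As a point of reference, on the 24 GB "medium" data node above that works out to something like the following (the file path assumes the Debian package layout):

# /etc/default/elasticsearch
# Give the JVM half of the 24 GB in the box, leaving the rest for the OS file cache
ES_HEAP_SIZE=12g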

Make It Happen

Initially we want data to start being replicated without the new boxes taking complete control. We can do this by copying the above config out to the new nodes but changing node.master to false, so the existing node keeps the master role for now. We'll also need the discovery.zen.ping.multicast.enabled and discovery.zen.ping.unicast.hosts settings in place everywhere so the new nodes can join the cluster. Once all the files are copied into place, start up all the services and magic should happen. You should start seeing shards get reallocated between nodes as the cluster tries to balance itself out.
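You can watch the rebalancing from any node with the cluster health API. A quick sketch, assuming the default HTTP port; relocating_shards and unassigned_shards are the numbers to keep an eye on:

# Cluster-wide summary; append &level=indices for a per-index breakdown
curl -s 'http://localhost:9200/_cluster/health?pretty'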

Something to keep in mind is that the number of shards per index is fixed at index creation time and cannot be changed. The number of replicas per shard, however, can be changed. While setting the value in the config files only applies to new indices we can use the API to retroactively adjust the replica values. The authoritative information on the how can be found in the Reference Guide.
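As a rough sketch of what that API call looks like (assuming the default port; note that this particular form touches every existing index at once):

# Retroactively give every existing index exactly one replica
curl -XPUT 'http://localhost:9200/_settings' -d '{
  "index": { "number_of_replicas": 1 }
}'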

There are a number of monitoring tools for ElasticSearch that will help you keep an eye on replication status. I used kopf but the Health and Performance Monitoring page lists a few packages to look at.
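For what it's worth, kopf installs as a site plugin. Something along these lines should work, assuming the Debian package's install paths and a kopf release that still supports the 0.90 series:

# Install the kopf web UI, then browse to http://east-es01:9200/_plugin/kopf/
sudo /usr/share/elasticsearch/bin/plugin -install lmenezes/elasticsearch-kopf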

Once all the indices are updated and the replication is finished we could remove the original ElasticSearch instance from the cluster right away. However, for cleanliness, I'm going to wait until the stack is built out before I start removing components.


  1. This is actually changing soon, for some definitions of soon. According to the release notes for Graylog2 v0.90 we have optional support for ElasticSearch v1.3.2 in Graylog2 v0.91-rc.1. Don't get confused by the numbers; suffice it to say we can upgrade soon.

  2. It's worth noting that if you run an upgrade through apt-get this won't prevent ElasticSearch from being updated. I'm sure apt has a way to exclude packages based on version, though I'm also sure the syntax is different from yum's and it's something I should sort out sooner rather than later.
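For the record, the simplest apt-side stopgap I know of is a package hold, which freezes whatever version is currently installed until you lift it (separate from, and complementary to, the Puppet pin above):

# Hold elasticsearch at its currently installed version
sudo apt-mark hold elasticsearch
# Later, when you're ready to upgrade
sudo apt-mark unhold elasticsearch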