CouchDB Replication Monitor

CouchDB supports replication, but replication has to be set up again after each server restart. This means you need to ensure that replication is restarted whenever the daemon restarts CouchDB. I have never seen replication stop working without a restart, but I prefer being safe to being sorry. To be perfectly honest, I do not fully trust that my replication initiation after a soft CouchDB restart works properly either, so I prefer to monitor replication and have a safety mechanism in place to restart it if needed.

There are several ways to monitor replication. You could fetch the status page of every server and restart replication on the servers where that page is empty, but that is a rather brute-force approach in my world. A better solution is to use replication itself to verify that replication works.
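For comparison, here is a minimal sketch of the brute-force check. It assumes the status page corresponds to CouchDB's `_active_tasks` endpoint and that replication tasks carry a `type` field of `replication`; older releases may report this differently. The method name and the sample response bodies are made up for illustration.

```ruby
require 'json'

# Returns true when an _active_tasks response body lists at least one
# replication task. The "type" field name is an assumption about the
# CouchDB version in use.
def replication_running?(active_tasks_json)
  tasks = JSON.parse(active_tasks_json)
  tasks.any? { |task| task['type'].to_s.downcase == 'replication' }
end

# Hypothetical response bodies, not captured from a real server.
idle = '[]'
busy = '[{"type": "replication", "source": "server_status", ' \
       '"target": "http://beta:5984/server_status/"}]'

puts replication_running?(idle) # prints false
puts replication_running?(busy) # prints true
```

A watchdog built this way has to poll every server separately, which is why the approach described next is preferable.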

Each server updates its own timestamp in CouchDB, and this is in turn replicated to the other servers. This gets us part of the way, but not all the way. The server you are checking might have received updates from all the other servers, but you don’t know whether it has pushed anything out to them. To solve this, each server also stores the timestamps it has seen from the other servers. This gives you a matrix of server replication status.

For each server, you will see the timestamp replicated from that server and a list of timestamps replicated to that server, the latter often being a generation older than the former. Cron can be used to update this data. The cron job reads all the server timestamps, updates this server's timestamp, and then stores the list of the other servers' timestamps.
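To make the matrix concrete, here is a small sketch of what the status documents might look like on one server and how the two kinds of timestamps can be read from them. The host names, timestamps, and helper names are invented; the `time` and `servers` fields are the ones the checker script uses.

```ruby
# Hypothetical status documents as seen on one server after replication.
DOCS = {
  'alpha' => { '_id' => 'alpha', 'time' => 1000,
               'servers' => { 'beta' => 940, 'gamma' => 945 } },
  'beta'  => { '_id' => 'beta', 'time' => 1005,
               'servers' => { 'alpha' => 940, 'gamma' => 945 } },
  'gamma' => { '_id' => 'gamma', 'time' => 1010,
               'servers' => { 'alpha' => 1000, 'beta' => 945 } }
}

# The timestamp that `host` wrote about itself and that was replicated here.
def replicated_from(docs, host)
  docs.fetch(host)['time']
end

# What each of the other servers last recorded for `host`.
def replicated_to(docs, host)
  docs.each_with_object({}) do |(name, doc), seen|
    next if name == host
    seen[name] = (doc['servers'] || {})[host]
  end
end

p replicated_from(DOCS, 'alpha')
p replicated_to(DOCS, 'alpha')
```

In this made-up data, beta still holds an older timestamp for alpha (940) than gamma does (1000), which is the "generation older" effect mentioned above.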

A map function that returns each server id together with its status document from the db.

map: function(doc) {
  emit(doc._id, doc);
}

Our monitoring database is called server_status. The design document containing the map function is called collections, and the view is called server_list.
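As a sketch of how these names fit together, the design document could be built like this before being saved to the server_status database. The constant names are only for illustration; the `_design/` prefix and the `views` layout follow CouchDB's design document conventions.

```ruby
require 'json'

# The map function is stored as a string, since CouchDB keeps view code
# as JavaScript source inside the design document.
MAP_FUNCTION = 'function(doc) { emit(doc._id, doc); }'

DESIGN_DOC = {
  '_id'      => '_design/collections',
  'language' => 'javascript',
  'views'    => {
    'server_list' => { 'map' => MAP_FUNCTION }
  }
}

# Path of the resulting view, relative to the CouchDB root.
VIEW_PATH = "/server_status/#{DESIGN_DOC['_id']}/_view/server_list"

puts JSON.pretty_generate(DESIGN_DOC)
puts VIEW_PATH
```

The last line prints the same path that the checker script assembles from its STATUS_DB, COLLECTIONS and SERVER_LIST constants.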

A Ruby database checker that can run on cron.

require 'rubygems'
require 'couchrest'
require 'json'
require 'open-uri'

STATUS_DB = 'http://localhost:5984/server_status'
COLLECTIONS = 'collections'
SERVER_LIST = 'server_list'

hostname = ARGV[0]

status_db = CouchRest.database!(STATUS_DB)
status_view = "#{STATUS_DB}/_design/#{COLLECTIONS}/_view/#{SERVER_LIST}"

# Get the current information about this server if available
server_status = begin
  status_db.get(hostname)
rescue RestClient::ResourceNotFound
  {'_id' => hostname}
end

server_status['time'] = Time.new.to_i
# Make sure the hash of peer timestamps exists on the first run
server_status['servers'] ||= {}
# Get the current times of the other servers and update this server's
# view of them (URI.open is the open-uri entry point on Ruby 2.5 and later)
JSON(URI.open(status_view).read)['rows'].map do |row|
  {'server' => row['id'], 'status' => row['value']}
end.each do |status|
  unless status['server'] == hostname
    server_status['servers'][status['server']] = status['status']['time']
  end
end
status_db.save_doc(server_status)

Now you need to determine when to trigger a replication restart. This can be handled in the watchdog cron job. If the age of the most recent timestamp received from another server is above a threshold, restart replication.

The final loop triggers a restart when the age is above a threshold. The init_replication method just posts a continuous replication trigger to the db:

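# THRESHOLD is the maximum accepted age in seconds and has to be defined
# earlier in the script (URI.open needs Ruby 2.5 or later)
JSON(URI.open(status_view).read)['rows'].map do |row|
  {'server' => row['id'], 'status' => row['value']}
end.each do |status|
  # Skip this server's own document; only the other servers are checked
  next if status['server'] == hostname
  if server_status['time'] - status['status']['time'] > THRESHOLD
    init_replication(status['server'])
  end
  server_status['servers'][status['server']] = status['status']['time']
end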

Rudimentary init_replication method.

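require 'net/http'

def init_replication(server)
  target = "http://#{server}:5984"
  databases = ['server_status']
  databases.each do |db|
    config = {
      'source' => db,
      'target' => "#{target}/#{db}",
      'continuous' => true
    }
    payload = JSON.generate(config)
    # Current CouchDB releases expect a JSON content type on this request
    result = Net::HTTP.new('127.0.0.1', 5984).post(
      '/_replicate', payload, {'Content-Type' => 'application/json'})
    # Net::HTTP reports the status code as a String, so comparing it with
    # the Integer 200 never matches. Accept any 2xx response instead, since
    # a continuous replication request is answered with 202 Accepted.
    unless result.is_a?(Net::HTTPSuccess)
      warn "replication to #{target}/#{db} failed with #{result.code}"
    end
  end
end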

We have a monitoring view of replication ages in our system. It shows the matrix of timestamps as age in seconds rather than the actual timestamp since the age is the important metric.
[Image: Server Status]
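The monitoring page itself is not reproduced here. As a rough sketch of how such an age matrix could be derived from the status documents, the following uses invented sample data and hypothetical helper names; it is not the code behind our view.

```ruby
# Turn status documents into a matrix of ages in seconds.
# Row = server whose document is read, column = server it reports on.
def age_matrix(docs, now)
  docs.each_with_object({}) do |(name, doc), matrix|
    row = { name => now - doc['time'] }
    (doc['servers'] || {}).each { |peer, time| row[peer] = now - time }
    matrix[name] = row
  end
end

# Render the matrix as plain text; '-' marks a value that is not known.
def render(matrix)
  names = matrix.keys.sort
  header = (['seen by'] + names).map { |n| n.ljust(8) }.join
  rows = names.map do |name|
    cells = names.map { |peer| (matrix[name][peer] || '-').to_s.ljust(8) }
    name.ljust(8) + cells.join
  end
  ([header] + rows).join("\n")
end

# Invented sample data.
SAMPLE = {
  'alpha' => { 'time' => 1000, 'servers' => { 'beta' => 940 } },
  'beta'  => { 'time' => 995,  'servers' => { 'alpha' => 940 } }
}

puts render(age_matrix(SAMPLE, 1000))
```

Showing ages instead of raw timestamps means a stale cell stands out as a large number without any mental arithmetic.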

A bonus of this replication monitoring system is that we can access the status page from a mobile phone and get an accurate picture of the replication status. This doesn’t worry me now, but it did when we first set it up. Now it’s just a part of our general monitoring view.
