Saturday night, 06:00:
I was torn from my sleep by the lovely voice of my colleague, uttering the words dreaded among the ops community: “We’re down, Chris.”
After opening the lid of my notebook, I started digging into the issue.
We’re running several high availability services with the help of keepalived. When a service goes down, keepalived switches over services to a standby instance. This is usually done by assigning a dedicated virtual IP address (VIP) to a service which can then be migrated on the fly between instances. This is achieved by keepalived using the Virtual Router Redundance Protocol (VRRP) as described in RFC5789, and has proved itself to be very reliable in the last years.
Not so today. The primary server somehow lost its VIP, without any notice in the log files. Just like that. I reassigned the VIP manually, verified that we’re back up and went back to sleep.
We then did a post-mortem on Monday. Obviously, we wanted to know what the heck had happened, and how we could prevent it from happening again in the future.