Saturday night, 06:00:
I was torn from my sleep by the lovely voice of my colleague, uttering the words dreaded among the ops community: “We’re down, Chris.”
After opening the lid of my notebook, I started digging into the issue.
We’re running several high-availability services with the help of keepalived. When a service goes down, keepalived switches it over to a standby instance. This is usually done by assigning a dedicated virtual IP address (VIP) to the service, which can then be migrated on the fly between instances. keepalived achieves this using the Virtual Router Redundancy Protocol (VRRP) as described in RFC 5798, and it has proved to be very reliable over the last few years.
Not so today. The primary server somehow lost its VIP, without any notice in the log files. Just like that. I reassigned the VIP manually, verified that we were back up and went back to sleep.
We then did a post-mortem on Monday. Obviously, we wanted to know what the heck had happened, and how we could prevent it from happening again in the future.
Our OpenStack hoster relies on DHCP to configure the attached network interfaces. I usually prefer statically configured interfaces on servers, and had eyed the configuration with some suspicion for a while. We’ve had some minor issues with static routes and DHCP in the past, but the environment was pretty stable, so we went with the hoster’s original base images and their configuration.
Keeping this in mind, we had recently made one bigger change to our infrastructure: we upgraded some instances (including the database node that had the outage) from Ubuntu 16.04 LTS (xenial) to Ubuntu 18.04 LTS (bionic). One of the major changes between the two releases is the switch of the default networking stack from ifupdown to netplan. Could this have caused the issue?
After digging around a bit more, I found out that calling netplan apply (to apply a network configuration) resulted in dropped VIPs.
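To make this reproducible, a small check like the following can be run before and after netplan apply. It is only a sketch: the helper name has_addr and the address 192.0.2.10 (a documentation-range placeholder) are assumptions, and ens3 is the primary interface name used later in this post.

```shell
#!/bin/sh
# has_addr ADDR: read `ip -o addr show` output on stdin and succeed
# if ADDR (without prefix length) is among the listed addresses.
# Hypothetical helper, not part of keepalived or netplan.
has_addr() {
    awk -v want="$1" '
        $3 == "inet" || $3 == "inet6" {
            split($4, parts, "/")      # "192.0.2.10/24" -> "192.0.2.10"
            if (parts[1] == want) found = 1
        }
        END { exit !found }
    '
}

# Intended use on an affected host (placeholder VIP, run as root):
#   ip -o addr show dev ens3 | has_addr 192.0.2.10 && echo "VIP present"
#   netplan apply
#   ip -o addr show dev ens3 | has_addr 192.0.2.10 || echo "VIP dropped"
```

The same helper could feed a cron job or monitoring check, so that a silently removed VIP at least raises an alert.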
Apparently, keepalived doesn’t monitor the VIP assigned to it, and neither initiates a failover, nor tries to reinstate the dropped VIP after it was removed from the interface. This is a known issue, and was fixed by this commit which was released with
Unfortunately, Ubuntu ships with keepalived-1.3.9, and the keepalived developers do not provide an official repository for more recent versions.
After considering building packages myself, I came to the conclusion that this wouldn’t address the actual problem, as keepalived would just notice the removed VIP and fail over to another machine. I wanted to fix the underlying problem itself instead of just coping with the symptoms.
The default renderer for netplan on Ubuntu is systemd-networkd, a systemd component. On every netplan apply, systemd-networkd is restarted and applies the new configuration. This apparently also happens when a DHCP lease is renewed, and it results in the VIPs being removed: systemd-networkd is unaware of them, as they are assigned by keepalived and are not configured in
This seems to be a feature, and I can see why: If you have a messy interface configuration and you connect to a network, it makes sense to make sure the interface is in a defined state.
But having downtime every time a DHCP lease is renewed is obviously not an option either.
So, how do I fix the problem permanently in a clean way? I was thinking about the following approaches:
- Migrating to a static IP configuration
As I was critical of using DHCP anyway, I considered migrating to static IPs. While this would fix the issues with DHCP lease renewals, we’d still have the same problem on other occasions of systemctl restart systemd-networkd (e.g. automatic security updates). I was reluctant to change that much of the networking stack, as I believe this should be done by the underlying hypervisor.
- Consider ditching netplan for the old ifupdown
We never had problems while using ifupdown with our configuration so far. I quickly decided against this though, as I really didn’t want to swap out a fundamental part of the network stack on every new instance.
The final, clean solution
Before actually messing around with the underlying network stack, I was fortunately pointed to RFC 1122. The Linux network stack defaults to the “Weak ES (End System) Model”, which lets it accept incoming packets for an IP address configured on a different interface than the one the packet arrived on.
This basically meant that we could create a dummy interface and let keepalived manage the VIP on that interface. Then systemd-networkd can configure the original interface at will, without touching the dedicated interface for the virtual IP.
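Whether a host really answers for an address that lives on another interface can be sanity-checked up front. As far as I understand the kernel's ip-sysctl documentation, the arp_ignore setting governs this: with the default of 0 the kernel replies to ARP requests for any local address, regardless of the interface it is configured on, and the larger of the "all" and per-interface values wins. The sketch below only reads those values; the helper names max and read_sysctl are made up for this post.

```shell
#!/bin/sh
# Interface the traffic arrives on; ens3 is the name used in this post.
IFACE="${IFACE:-ens3}"

# Print the larger of two integers; the kernel applies max(all, interface).
max() { if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi; }

# Print the content of a sysctl file. A missing file is treated as 0,
# the kernel default (an assumption made to keep the sketch simple).
read_sysctl() { if [ -r "$1" ]; then cat "$1"; else echo 0; fi; }

all=$(read_sysctl /proc/sys/net/ipv4/conf/all/arp_ignore)
dev=$(read_sysctl "/proc/sys/net/ipv4/conf/$IFACE/arp_ignore")
effective=$(max "$all" "$dev")

if [ "$effective" -eq 0 ]; then
    echo "arp_ignore=0 on $IFACE: ARP is answered for addresses on other interfaces"
else
    echo "arp_ignore=$effective on $IFACE: a VIP on a dummy interface may not be reachable"
fi
```

On a stock Ubuntu image I would expect the first message, but hardened images sometimes change these sysctls, so it is worth a look before relying on the dummy interface.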
Here’s an example: On our system, the primary network card is ens3, and we’ve configured keepalived to assign the VIP as an alias on the same interface in
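A configuration of that shape might look roughly like the following. This is a sketch, not our production file: the instance name VI_1, the router ID, the priority, the alias label ens3:1 and the address 192.0.2.10/24 (documentation range) are all placeholders. The snippet writes to a scratch directory so the result can be reviewed before anything is copied into place.

```shell
#!/bin/sh
# Scratch directory for the generated file (override with OUT=/some/dir).
OUT="${OUT:-$(mktemp -d)}"

cat > "$OUT/keepalived.conf" <<'EOF'
# Sketch only: every value below is a placeholder.
vrrp_instance VI_1 {
    state MASTER
    interface ens3              # VRRP adverts are sent on this interface
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        # VIP added as an alias on the primary interface
        192.0.2.10/24 dev ens3 label ens3:1
    }
}
EOF

echo "wrote $OUT/keepalived.conf"
```

With this layout the VIP shares ens3 with the DHCP-managed address, which is exactly what makes it vulnerable whenever the interface is reconfigured.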
I quickly hopped onto a staging instance, created a dummy interface keepalived0 and assigned the existing VIP with a
After that, I pinged the instance from another machine and then removed the original VIP:
The pings are still answered!
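Put together, the experiment boils down to a handful of ip commands. The sketch below prints them by default and only executes them when APPLY=1 is set (as root). The address 192.0.2.10 and the prefix lengths /24 and /32 are placeholders rather than our real values, and the run helper is made up for this post.

```shell
#!/bin/sh
VIP="${VIP:-192.0.2.10}"          # placeholder address (documentation range)
PRIMARY="${PRIMARY:-ens3}"        # interface that carried the VIP so far
DUMMY="${DUMMY:-keepalived0}"     # new dummy interface for the VIP

# Print the command by default; execute it only when APPLY=1 is set.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$*"; fi; }

# 1. Create the dummy interface and bring it up.
run ip link add "$DUMMY" type dummy
run ip link set "$DUMMY" up

# 2. Assign the VIP to the dummy interface (prefix length assumed).
run ip addr add "$VIP/32" dev "$DUMMY"

# 3. Start pinging $VIP from another machine, then remove the VIP
#    from the primary interface (prefix length assumed).
run ip addr del "$VIP/24" dev "$PRIMARY"
```

Interfaces created this way do not survive a reboot, which is fine for a test on a staging instance.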
Applying the configuration
This was the perfect solution! It was easily implemented without any downtime, got rid of the ancient IP aliasing (apparently deprecated since linux-2.4) and allowed us to keep the default network stack including netplan and systemd-networkd.
Bonus: The interface (incl. the master VIP) is now also displayed in Ubuntu’s default message-of-the-day (motd) upon login.
Edit: An earlier version of this blog post suggested persisting the keepalived0 dummy interface using netplan. This is not sufficient to solve the problem of the purged addresses on
Just add the keepalived0 interface by deploying the following file to
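A minimal systemd-networkd device definition for a dummy interface could look like this. The file name 10-keepalived0.netdev is an assumption on my part; local systemd-networkd configuration normally lives in /etc/systemd/network/, but the sketch writes to a scratch directory so the file can be checked before it is deployed.

```shell
#!/bin/sh
# Scratch directory for the generated file (override with OUT=/some/dir).
OUT="${OUT:-$(mktemp -d)}"

# Hypothetical file name; copy to the systemd-networkd configuration
# directory once reviewed.
cat > "$OUT/10-keepalived0.netdev" <<'EOF'
[NetDev]
Name=keepalived0
Kind=dummy
EOF

echo "wrote $OUT/10-keepalived0.netdev"
```

Because the file only declares the device and no addresses, systemd-networkd creates the interface but leaves the VIP itself to keepalived.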
Run systemctl restart systemd-networkd to apply the configuration.
One thing I noted with this configuration is that the keepalived0 interface is always in
This seems to be intended behaviour, as the dummy interface doesn’t implement many functions of a real interface.
So the last thing left was adjusting the keepalived configuration in
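With the dummy interface in place, the VIP entry only needs to name it as its device. The following is again a sketch with placeholder values (instance name VI_1, router ID, priority, 192.0.2.10/32), written to a scratch directory for review. It assumes that VRRP itself keeps running on ens3 and that only the address moves to keepalived0.

```shell
#!/bin/sh
# Scratch directory for the generated file (override with OUT=/some/dir).
OUT="${OUT:-$(mktemp -d)}"

cat > "$OUT/keepalived.conf" <<'EOF'
# Sketch only: every value below is a placeholder.
vrrp_instance VI_1 {
    state MASTER
    interface ens3              # VRRP adverts still use the real interface
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        # VIP now lives on the dummy interface, out of networkd's way
        192.0.2.10/32 dev keepalived0
    }
}
EOF

echo "wrote $OUT/keepalived.conf"
```

After deploying a file like this, keepalived needs a reload or restart before the VIP moves to the dummy interface.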
I hope this ensures I can sleep in on Saturdays.