To be clear, the cluster installed by this playbook isn't highly available.
On the hardware side of things, at least 2 major single points of failure remain:
- Single internet link
- Single power delivery (alleviated by a UPS)
Also, as it's a home setup, it's obviously not replicated across multiple sites.
On the software stack side, however, most of the components are configured to be "HA ready". By that I mean that the cluster I've built isn't sized to be HA, but making it so is "only" a matter of throwing more CPUs / RAM at it.
dnsmasq is used for local name resolution.
It's installed on the 3 master nodes, and all nodes in the cluster are configured to use them.
HighAvailability: dnsmasq tolerates the loss of 2 master nodes
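To make that concrete, a node's resolver config would look something like this (the addresses are made up):

```
# /etc/resolv.conf on every node (illustrative addresses)
nameserver 192.168.1.11   # master-1
nameserver 192.168.1.12   # master-2
nameserver 192.168.1.13   # master-3
```

The resolver simply moves on to the next nameserver when one stops answering, so losing 2 masters only costs some query latency.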
Kubernetes is installed with 3 master nodes, using the stacked etcd topology.
The apiserver is accessed through an HAProxy that distributes requests across the 3 master nodes.
HighAvailability: Kubernetes tolerates the loss of 1 master node
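The apiserver load balancing could look roughly like the haproxy.cfg excerpt below. Ports and addresses are illustrative, not taken from the playbook, and the frontend binds on a separate port since the apiserver already listens on 6443 on the masters:

```
frontend kube-apiserver
    bind *:8443
    mode tcp
    default_backend apiservers

backend apiservers
    mode tcp
    balance roundrobin
    option tcp-check
    server master-1 192.168.1.11:6443 check
    server master-2 192.168.1.12:6443 check
    server master-3 192.168.1.13:6443 check
```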
All persistent volumes use Longhorn.
The Longhorn provisioner is deployed on at least our 3 master nodes, and Longhorn volumes use 2 or 3 replicas.
HighAvailability: Longhorn tolerates the loss of at least 1 node
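As an illustration, the replica count is set on the StorageClass; the class name below is hypothetical, the parameters follow Longhorn's documented options:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated      # hypothetical name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"          # or "3" for the more critical volumes
  staleReplicaTimeout: "2880"
```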
Keepalived is used to provide a highly available virtual IP.
On each master node, an NGINX instance is configured to serve all the published services.
master-1 is set as the default holder of the virtual IP, with master-2 and master-3 as backups.
HighAvailability: Reverse proxy tolerates the loss of 2 master nodes
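A minimal keepalived sketch of that failover, with made-up interface, router id and virtual IP:

```
# /etc/keepalived/keepalived.conf on master-1 (illustrative values)
vrrp_instance NGINX_VIP {
    state MASTER            # BACKUP on master-2 and master-3
    interface eth0
    virtual_router_id 51
    priority 150            # lower on master-2 / master-3
    advert_int 1
    virtual_ipaddress {
        192.168.1.10/24
    }
}
```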
There are 2 limitations with this setup:
- master-1 is responsible for SSL cert renewal using certbot. If it stays down for a long period, the SSL certs will eventually expire.
- Traffic is not load balanced between the 3 NGINX instances: the node holding the virtual IP handles all of it.
Finally, on the applications side, it's a bit more hit-and-miss. But mostly miss.
The private registry runs 2 instances by default, which makes it mostly HA, and it could be scaled up further if required.
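Assuming the registry runs as a plain Deployment (names and image below are illustrative), scaling it is just a matter of bumping the replica count:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: registry            # hypothetical name
spec:
  replicas: 2               # bump this to scale out further
  selector:
    matchLabels:
      app: registry
  template:
    metadata:
      labels:
        app: registry
    spec:
      containers:
        - name: registry
          image: registry:2
          ports:
            - containerPort: 5000
```

For the 2 instances to actually serve the same images they need a shared backend (an RWX volume or object storage), a detail this sketch leaves out.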
DNS queries are logged to a PostgreSQL database, and a Redis database is used to share state between Blocky instances.
PostgreSQL and Redis each run a single instance. If either goes down, Blocky will still work, with the following limitations:
- DNS query logs fall back to stdout
- State is not shared between Blocky instances
This is the best compromise to still have DNS when a node goes down, while avoiding the resources it would take to build HA PostgreSQL and Redis clusters.
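For reference, the relevant bits of the Blocky config would look something like this; the key names follow Blocky's queryLog and redis options, but the service names and credentials are made up:

```yaml
# Excerpt of Blocky's config.yml (illustrative hostnames and DSN)
queryLog:
  type: postgresql
  target: postgres://blocky:secret@postgres.dns.svc.cluster.local:5432/blocky
redis:
  address: redis.dns.svc.cluster.local:6379
```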
