The 2-Node Myth
A common argument in the availability world is "You need at least three nodes for availability clustering".
This article aims to disprove that.
To understand this argument, we must first discuss two concepts in availability clustering: quorum and fencing (also called 'stonith').
Quorum
"Quorum" is a term used to define simple majority. Nodes in a cluster have a default value of '1'.
Said mathematically, quorum is > 50%. When a cluster is quorate, it is allowed to host highly available services.
In a 2-node cluster, the only way to be 'quorate' is for both nodes to be online. This is because, if a node is lost, the remaining node's vote is '1', and that is 50%. Quorum requires greater than 50%. So the node is inquorate and, thus, not allowed to host highly available services.
So in a 2-node cluster, quorum must be disabled.
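As a concrete, hedged illustration, in corosync 2+ this is expressed with the votequorum 'two_node' flag rather than by literally switching quorum off; the exact file layout will vary with your distribution and corosync version:

    # Fragment of /etc/corosync/corosync.conf (corosync 2+ / votequorum).
    # 'two_node: 1' tells votequorum this is a 2-node cluster, so a single
    # surviving node is allowed to keep hosting services.
    quorum {
        provider: corosync_votequorum
        two_node: 1
    }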
Fencing (aka Stonith)
"Fencing" is a term for a process where a node that has entered an unknown state is force into a known state.
Typically, a node that stops responding is fenced by forcing it to power off using some mechanism external to the node itself. This is most often done through the lost node's out-of-band management interface (IPMI), but it can also be done by cutting its power via network-connected power switches (PDUs), and so on. It is also possible to "fence" a node by severing its connection to the network, a process called "fabric fencing".
In any case, the goal is to ensure that no matter what might have caused the node to stop responding, the survivor node(s) know that the lost node will not try to offer or access shared resources any further.
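As a hedged example of what power fencing looks like in practice, an IPMI fence device per node is often defined in pacemaker roughly like this (the node names, BMC addresses and credentials are placeholders, and parameter names differ between fence-agents versions):

    # Hypothetical IPMI fence devices, one per node, managed by pacemaker.
    pcs stonith create fence_node1 fence_ipmilan \
        pcmk_host_list="node1" ip="node1-ipmi.example.com" \
        username="admin" password="secret" lanplus=1
    pcs stonith create fence_node2 fence_ipmilan \
        pcmk_host_list="node2" ip="node2-ipmi.example.com" \
        username="admin" password="secret" lanplus=1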
Confusion
Misunderstanding the roles of quorum and fencing is perhaps the most common source of confusion for admins who are new to availability clustering. In fact, many availability projects confuse the roles that these technologies play in protecting and coordinating services. Many believe that quorum is required and fencing is optional, when in fact it is exactly the opposite.
It is from this confusion that the myth that 3+ nodes are required for a "proper" availability cluster arises.
Scenarios
Let's look at a couple of scenarios, first from the perspective of a 3-node cluster with quorum but without fencing, then from the perspective of a 2-node cluster without quorum but with fencing.
Inter-node Communication Failure
In this scenario, a node hosting NFS and a virtual IP is online but loses network access on its cluster communications network.
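For concreteness, the NFS-plus-virtual-IP stack used in these scenarios might be defined in pacemaker along these lines (the resource names, device path, mount point and address are all hypothetical):

    # Hypothetical NFS service stack: shared filesystem, NFS server, floating IP.
    pcs resource create nfs_fs ocf:heartbeat:Filesystem \
        device="/dev/drbd0" directory="/exports" fstype="xfs"
    pcs resource create nfs_server ocf:heartbeat:nfsserver \
        nfs_shared_infodir="/exports/nfsinfo"
    pcs resource create nfs_ip ocf:heartbeat:IPaddr2 \
        ip="10.255.0.10" cidr_netmask="24"
    # A group keeps the three resources on the same node and starts them in order.
    pcs resource group add nfs_group nfs_fs nfs_server nfs_ip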
3-Node
The lost node can't reach either of the other nodes and becomes inquorate. It shuts down NFS and releases its public IP.
The two other nodes can still talk to each other and reform the cluster with 2 of 3 votes. This is >50%, so they are quorate and allowed to proceed with recovery. Node 2 was the backup node and node 3 was a quorum node, so NFS is started on node 2 and the virtual IP is brought up there.
2-Node
The lost node can't reach its peer. At this point, both nodes block and call a fence against the other.
Because node 1 is the normally active node, fencing was configured such that node 2 pauses for 15 seconds before fencing node 1, while node 1 sees no delay against node 2 and fences it immediately. Node 2, still in its pause, is fenced before it can act. Node 1 survives, confident that node 2 won't try to run NFS or take the public IP, and services resume normal operation.
In both cases, no split-brain occurred.
Node 1 Hangs
In this scenario, node 1 is again hosting NFS and a virtual IP. In this case, the fault is not with communications, but instead something causes node 1 to stop responding for a period of time. This hang will eventually clear and allow node 1 to resume operation.
3-Node
When node 1 stops responding, node 2 declares it lost, reforms a cluster with the quorum node, node 3, and is quorate. It begins recovery by mounting the filesystem under NFS, which replays journals and cleans up, then starts NFS and takes the virtual IP address.
Later, node 1 recovers from its hang. At the moment of recovery, it has no concept that time has passed, so it has no reason to check whether it is still quorate or whether its locks are still valid. It simply carries on with whatever it was doing at the moment it hung.
In the best-case scenario, you now have two machines claiming the same IP address. At worst, you have uncoordinated writes to shared storage and you corrupt your data.
2-Node
When node 1 stops responding, node 2 declares it lost and initiates a fence. It pauses for 15 seconds, then proceeds with the fence action. Once node 1 has been confirmed fenced, node 2 mounts the filesystem, which replays the journals, then starts NFS and takes over the virtual IP address.
Node 1 never "thaws" because it was rebooted. When it comes back online, it will be in a fresh state and will not start NFS or take the virtual IP address until it rejoins the cluster.
Fence Loop
Another concern with two-node clusters is a fence loop.
This is a scenario in which communication between the nodes breaks and the slower node is fenced and reboots. When it comes back online, it starts its cluster stack, fails to reach the peer, fences it, and comes online. The second node then reboots, fails to reach the peer, fences it, and comes online in turn. This loop continues until communication is restored.
In corosync version 2+, there is a "wait_for_all" setting that tells corosync to not become quorate until it is able to talk to the peer node. This way, when the first fenced node boots, it will sit there and do nothing until communications are repaired.
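A hedged sketch of what this looks like in corosync.conf (in corosync 2+, 'two_node: 1' enables 'wait_for_all' by default, so it is shown explicitly here only for clarity):

    # Fragment of /etc/corosync/corosync.conf.
    # 'wait_for_all: 1' keeps a freshly booted node inquorate until it has
    # seen its peer at least once, which breaks the fence loop described above.
    quorum {
        provider: corosync_votequorum
        two_node: 1
        wait_for_all: 1
    }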
In previous versions, the solution was simply to not start the cluster stack on boot. The logic goes: either an admin is doing maintenance and is there to start the cluster, or the node was fenced and should not rejoin until an admin has had a chance to investigate the source of the problem.
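In practice this was usually just a matter of disabling the cluster services at boot; for example (the service names depend on your stack and distribution):

    # Older init-based stacks (e.g. cman + rgmanager):
    chkconfig cman off
    chkconfig rgmanager off

    # systemd-based stacks running corosync + pacemaker:
    systemctl disable corosync pacemaker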
Thus, fence-loops in 2-node clusters are not a problem.
Dual Fence
Another concern sometimes raised about 2-node clusters is that, after a comms break, both nodes might fence each other, leaving both offline.
This is equivalent to an "old-west" style shoot-out where both people pull the trigger at precisely the same moment. Unlikely, but not impossible.
This can be avoided in a few ways:
First is to set a fence delay against one of the nodes. In an active/passive configuration, set the delay against the active node. In the case of a comms break, the passive node will look up how to fence the active node, see a delay, and sleep. The active node looks up how to fence the passive node, sees no delay, and shoots immediately. The passive node reboots before it exits its sleep, ensuring that the active node wins the fence.
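With pacemaker-managed fence devices, this is typically just the 'delay' parameter on the device that targets the active node; a hedged sketch, reusing the hypothetical device names from the IPMI example above:

    # 'fence_node1' is the device used to shoot node 1 (the active node).
    # The delay means node 2 must wait 15 seconds before it can fence node 1,
    # while node 1 can fence node 2 immediately and so wins any fence race.
    pcs stonith update fence_node1 delay="15"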
Another option is to use a fence device that only allows one login at a time, as many switched PDUs do. The first node to log in to the device will block the slower node, ensuring that only one node lives.
A special case is IPMI-based fencing. Here, you will want to disable acpid so that the node powers off immediately when it receives an ACPI power-button event, rather than attempting a graceful shutdown. This ensures that the faster node immediately terminates the slower one.
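On EL-style hosts this usually comes down to stopping and disabling the acpid service; a hedged example (the exact service name and tooling depend on the distribution):

    # Older init-based systems:
    chkconfig acpid off
    service acpid stop

    # systemd-based systems:
    systemctl disable --now acpid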
Summary
In short:
- "Quorum is a tool for when things are working predictably".
- "Fencing is a tool for when things go wrong".
Even shorter:
- Quorum is optional, fencing is not.
An Argument For Simplicity
Contrary to the argument that 3 or more nodes are required for proper availability, we would argue that 2-node clusters are superior. The logic behind this is simple:
"An availability cluster is not beautiful when there is nothing left to add. It is beautiful when there is nothing left to take away."
A 2-node cluster is the simplest configuration possible, and availability is a function of simplicity.
For this reason, we prefer 2-node clusters. Over five years of experience with 2-node cluster deployments has proven to us that exceptional uptime is possible with 2-node clusters.
Any questions, feedback, advice, complaints or meanderings are welcome.