Live Migration

From Alteeve Wiki
Revision as of 00:29, 8 September 2023 by Digimer

"Live migration" is the process of moving a server that is actively running on one Anvil! subnode to the peer subnode, without stopping or interrupting it. This can happen if, for example, Scancore detects that the active subnode is developing a hardware fault. To minimize the risk of a service interruption, the Anvil! will migrate the servers to the peer subnode as a preventative measure.

The way this works is that a "paused" (inactive), identical copy of the server is created on the peer subnode. The RAM in use by the server is then copied to the peer's paused copy while the server is still in use. Once most of the RAM has been copied, the server is frozen on the old host for a very brief moment, the last of the RAM is copied over, and the server is thawed (resumed) on the new host. This freeze -> finish copying -> thaw process generally completes in well under one second.
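The copy-then-freeze sequence above can be illustrated with a toy model. This is a minimal sketch, not how the Anvil! or QEMU implement it: it assumes the copy runs in rounds, that RAM the server rewrites during a round has to be sent again, and that the server is only frozen once the remainder fits within a downtime budget. The function name, the rates, and the 0.3 second budget are all hypothetical values chosen for illustration.

```python
def simulate_precopy(ram_mib, link_mib_s, dirty_mib_s,
                     max_downtime_s=0.3, max_rounds=30):
    """Toy model of a live RAM copy (all parameters are illustrative).

    Returns (live_copy_seconds, downtime_seconds, converged).
    """
    remaining = float(ram_mib)   # RAM still to be sent, in MiB
    live_copy_s = 0.0            # time spent copying while the server runs

    for _ in range(max_rounds):
        # If what is left can be sent within the downtime budget,
        # stop iterating and move on to freeze -> finish copying -> thaw.
        if remaining / link_mib_s <= max_downtime_s:
            break
        round_s = remaining / link_mib_s
        live_copy_s += round_s
        # RAM rewritten by the running server during this round
        # must be copied again in the next round.
        remaining = min(float(ram_mib), dirty_mib_s * round_s)

    # The final copy happens while the server is frozen.
    downtime_s = remaining / link_mib_s
    return live_copy_s, downtime_s, downtime_s <= max_downtime_s


if __name__ == "__main__":
    # Hypothetical server: 8 GiB of RAM, ~10 Gbps link, 100 MiB/s rewritten.
    live, frozen, ok = simulate_precopy(8192, 1192, 100)
    print(f"live copy: {live:.2f}s, frozen: {frozen:.3f}s, converged: {ok}")
```

In this model, the freeze stays short only when the link is much faster than the rate at which the server rewrites its RAM, which is one reason a fast, dedicated link helps.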

The Anvil! has a custom resource agent (RA) used by the Pacemaker high-availability software running behind the scenes on each Anvil! node. Anvil! systems built by Alteeve now have a dedicated "Migration Network", which is connected peer-to-peer, directly between the two subnodes. This allows for a high-speed, dedicated RAM copy, generally at 10 or 25 Gbps (roughly 1 to 3 GiB/sec).
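To see where the GiB/sec figures come from, the link speeds can be converted directly. The sketch below is idealized arithmetic only: it ignores protocol overhead and any RAM that has to be re-sent, and the helper names and the 16 GiB example server are hypothetical.

```python
def gbps_to_gib_per_sec(gbps):
    """Convert a link speed in gigabits/sec to gibibytes/sec (raw line rate)."""
    bytes_per_sec = gbps * 1e9 / 8      # 1 Gbps = 10^9 bits/sec, 8 bits/byte
    return bytes_per_sec / 2**30        # 1 GiB = 2^30 bytes


def ideal_copy_seconds(ram_gib, gbps):
    """Best-case time to send ram_gib of RAM once over the link."""
    return ram_gib / gbps_to_gib_per_sec(gbps)


if __name__ == "__main__":
    for speed in (10, 25):
        rate = gbps_to_gib_per_sec(speed)
        secs = ideal_copy_seconds(16, speed)   # hypothetical 16 GiB server
        print(f"{speed} Gbps = {rate:.2f} GiB/sec, 16 GiB in about {secs:.1f}s")
```

A 10 Gbps link works out to about 1.16 GiB/sec and a 25 Gbps link to about 2.91 GiB/sec, which matches the rounded range quoted above.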

In technical detail, the resource agent (RA) first checks whether the peer has a resolvable hostname that ends in ".mn". For example, if the server is being migrated from subnode "an-a01n01" to "an-a01n02", the RA checks whether "an-a01n02.mn" resolves to an IP address and, if so, whether that IP is usable. If it is, that network is used for the RAM copy. If not, the IP of the base hostname "an-a01n02", usually the Back-Channel Network IP, is used for the RAM copy as a fallback.
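The selection logic described above can be sketched in a few lines. This is an illustration only, not the actual resource agent code: the function name is hypothetical, the page does not say how "usable" is determined (so that check is left as a caller-supplied function), and the IP addresses in the demo are invented.

```python
import socket


def choose_migration_target(peer, resolve=socket.gethostbyname,
                            is_usable=lambda ip: True):
    """Pick the (hostname, ip) to use for the RAM copy to `peer`.

    Prefers "<peer>.mn" when it resolves and the IP passes `is_usable`
    (a placeholder check), otherwise falls back to the base hostname.
    """
    mn_name = f"{peer}.mn"
    try:
        mn_ip = resolve(mn_name)
    except OSError:              # socket.gaierror is a subclass of OSError
        mn_ip = None

    if mn_ip is not None and is_usable(mn_ip):
        return mn_name, mn_ip

    # Fallback: the base hostname, usually the Back-Channel Network IP.
    return peer, resolve(peer)


if __name__ == "__main__":
    # Stand-in resolver with invented addresses, so the demo needs no DNS.
    hosts = {"an-a01n02.mn": "10.199.10.2", "an-a01n02": "10.201.10.2"}

    def demo_resolve(name):
        if name not in hosts:
            raise OSError(f"cannot resolve {name}")
        return hosts[name]

    print(choose_migration_target("an-a01n02", resolve=demo_resolve))
```

Passing the resolver and the usability check in as arguments keeps the sketch testable without a real network. A real check might confirm that the address is reachable over the Migration Network.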

Background details are available in Red Hat's KVM/QEMU documentation.

 

Any questions, feedback, advice, complaints or meanderings are welcome.
Us: Alteeve's Niche! Support: Mailing List IRC: #clusterlabs on Libera Chat
© Alteeve's Niche! Inc. 1997-2023   Anvil! "Intelligent Availability™" Platform
legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.