Storage Replication

In the Anvil! cluster, everything has to be fully redundant, and all components need to be electrically and mechanically isolated. This is required so that anything can be taken offline without a maintenance window.

In traditional clusters, shared storage is provided by a SAN. These are devices that export raw storage (as LUNs) to provide the backing storage for hosted servers. SANs typically have thorough redundancy: dual controllers, dual power supplies, dual Fibre Channel links, and so on. However, they're generally a single unit with a common backplane, so servicing them generally requires a maintenance window, and certain failure modes can take them offline entirely. Of course, some SANs can be mirrored for full redundancy, and when done this is an excellent option, but the cost of such a setup is generally prohibitive.

The Anvil! solves this by keeping data mirrored between the subnodes in each Anvil! node. This also contributes to compartmentalisation, a key component of Intelligent Availability.

The way this works in Anvil! nodes is somewhat similar to a RAID 1 array: the data is copied synchronously between the two subnodes over a dedicated replication link. We'll cover the details shortly, but the overview is that data being written to disk is confirmed to be written on both subnodes before the guest is told that the write is complete. This way, if an unpredictable, catastrophic loss of the host subnode occurs while writes are in flight, the server will reboot on the surviving subnode, able to replay its journals and write-ahead logs, just as if it had lost power and rebooted. No data is lost.

So how do we do it?

The storage built into each subnode has a small amount set aside for the Anvil!'s OS and software, about 80 GiB, and the rest of the storage is made available for use by servers. When a server is provisioned with, say, 100 GiB of storage, a 100 GiB logical volume is created on each subnode. These become the mirrored storage for the new server. These two LVs are then used to back a new virtual device, using DRBD. This is where the magic happens.
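
To make the space accounting concrete, here is a minimal Python sketch of the provisioning described above: a fixed amount is reserved for the OS, and each new server consumes a matching logical volume on both subnodes. This is an illustration only, not the Anvil! tooling; the subnode names, class and function names, and sizes are assumptions made for the example.

 # Illustrative model of per-server storage allocation on an Anvil! node.
 # Names and sizes are assumptions for this example, not Anvil! internals.
 OS_RESERVED_GIB = 80  # space set aside on each subnode for the OS and software
 
 class Subnode:
     def __init__(self, name, total_gib):
         self.name = name
         self.free_gib = total_gib - OS_RESERVED_GIB  # the rest backs servers
         self.logical_volumes = {}
 
     def create_lv(self, lv_name, size_gib):
         """Model creating a logical volume that backs one half of the mirror."""
         if size_gib > self.free_gib:
             raise RuntimeError(f"{self.name}: not enough free space for {lv_name}")
         self.logical_volumes[lv_name] = size_gib
         self.free_gib -= size_gib
 
 def provision_server(server_name, size_gib, subnode1, subnode2):
     """Create matching LVs on both subnodes; together they back one DRBD resource."""
     for subnode in (subnode1, subnode2):
         subnode.create_lv(f"{server_name}_0", size_gib)
 
 node1a = Subnode("subnode1", 1024)  # 1 TiB of raw storage in each subnode
 node1b = Subnode("subnode2", 1024)
 provision_server("srv01-example", 100, node1a, node1b)
 print(node1a.free_gib, node1b.free_gib)  # both drop by 100 GiB: 844 844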

DRBD creates a new, virtual storage device that can be seen and used from both subnodes in the Anvil! node. This virtual device acts just like any other storage device, with all of the replication handled behind the scenes. When a write comes down from your server, DRBD makes a copy and sends it over the dedicated storage network to the peer. The data is written to both subnodes' storage, and then the server is told the write is complete.
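
The ordering just described is synchronous replication, which DRBD calls protocol C: the local write and the peer's write both complete before the guest receives its acknowledgement. Here is a small conceptual Python model of that ordering; the classes and method names are invented for the example, and this is not DRBD's actual implementation.

 # Conceptual model of a synchronous (DRBD "protocol C" style) replicated write.
 # Class and method names are illustrative, not DRBD's actual code.
 class Disk:
     """Trivial in-memory stand-in for a backing logical volume."""
     def __init__(self):
         self.blocks = {}
     def write(self, offset, data):
         self.blocks[offset] = data
 
 class PeerLink:
     """Stand-in for the dedicated storage network; the peer's ack is synchronous."""
     def __init__(self, peer_disk):
         self.peer_disk = peer_disk
     def send_and_wait(self, offset, data):
         self.peer_disk.write(offset, data)  # peer persists its copy, then confirms
 
 class ReplicatedDevice:
     """Model of the virtual device DRBD exposes to the server."""
     def __init__(self, local_disk, peer_link):
         self.local_disk = local_disk
         self.peer_link = peer_link
     def write(self, offset, data):
         # Synchronous ordering: local write, peer write, and only then is
         # the guest told the write is complete.
         self.local_disk.write(offset, data)
         self.peer_link.send_and_wait(offset, data)
         return True
 
 local, remote = Disk(), Disk()
 device = ReplicatedDevice(local, PeerLink(remote))
 device.write(0, b"journal entry")
 assert local.blocks[0] == remote.blocks[0]  # both copies exist before the ack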

The storage network is usually back-to-back and runs at whatever speed you need for your workloads. Generally this is a 10 or 25 Gbps link, and being back-to-back, the latency is very low. If you need even lower latency, RDMA interconnects are also supported. For uses where high-speed DR is required, the storage network can go through switches to allow 3-way replication, though that's not required, as DR can use any network.

One benefit of DRBD over traditional RAID level 1 is its bitmap. With normal RAID, if a "disk" disconnects, the storage needs to be fully resynced when reconnected. With DRBD, a little bit of space at the end of the backing storage is used for metadata (~32 MiB is reserved for each 1 TiB of data). If a subnode or DR host disconnects, then when a write happens, the block with the new data has its corresponding bit flipped. This marks the block as "dirty". When the peer eventually reconnects, only those "dirty" blocks need to be copied over. This means that peers can come and go with minimal interruption.
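
To illustrate the idea, here is a rough Python sketch of a dirty-block bitmap. The one-bit-per-4-KiB granularity assumed here is consistent with the ~32 MiB per 1 TiB figure above, but the classes and methods are invented for the example and do not reflect DRBD's on-disk metadata format.

 # Illustrative dirty-block bitmap; not DRBD's actual metadata format.
 # Assuming one bit per 4 KiB block, a 1 TiB device needs
 # (1 TiB / 4 KiB) bits = 2**28 bits = 2**25 bytes = 32 MiB of bitmap.
 BLOCK_SIZE = 4096  # bytes covered by each bit (assumed granularity)
 
 class DirtyBitmap:
     def __init__(self, device_size_bytes):
         self.total_blocks = device_size_bytes // BLOCK_SIZE
         self.dirty = set()  # blocks written while the peer was disconnected
 
     def record_write(self, offset, length, peer_connected):
         """Flip the bit for every block a write touches while the peer is away."""
         if peer_connected:
             return  # the write was replicated normally; nothing to remember
         first = offset // BLOCK_SIZE
         last = (offset + length - 1) // BLOCK_SIZE
         self.dirty.update(range(first, last + 1))
 
     def resync(self, copy_block):
         """On reconnect, copy only the dirty blocks, then clear the bitmap."""
         for block in sorted(self.dirty):
             copy_block(block)
         self.dirty.clear()
 
 bitmap = DirtyBitmap(100 * 1024**3)                    # 100 GiB virtual device
 bitmap.record_write(8192, 10000, peer_connected=False) # write while peer is away
 print(sorted(bitmap.dirty))                            # [2, 3, 4]: only these resync
 bitmap.resync(lambda block: None)                      # stand-in for the real copy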

 
