Selecting Hardware For Your Anvil!


This guide's goal is to provide a high-level overview of how to match hardware to the anticipated loads on your Anvil! platform.

System Requirements

To provide full stack redundancy, there are minimum system requirements. These are not performance requirements, but rather the minimum features and capabilities needed.

Foundation Pack

The foundation pack consists of redundant power and redundant networking which the Anvil! nodes will sit on top of. They play an integral role in providing full stack redundancy.

The foundation pack's ethernet switches determine the maximum sequential write speed and the network bandwidth between the hosted servers and the outside world. As such, care must be taken when selecting equipment to ensure the required performance is provided.

Finally, a foundation pack can host from two to five Anvil! pairs. Deciding how many pairs you expect to use will determine the capacity of the foundation pack equipment.

UPSes

Two network-connected UPSes are required so that ScanCore can monitor the incoming power state, the estimated run time remaining during a power outage and the charge percentage during recovery, and can alert on distorted input power.

Currently, any UPS that supports the APC (Schneider) AP9630 or AP9631 network cards is supported.

APC SmartUPS 1500 RM2U 120vAC UPS. Photo by APC.
APC SmartUPS 1500 Pedestal 120vAC UPS. Photo by APC.

If you plan to use a different make or model, please verify that the input voltage and frequency, battery run time and battery charge percentage can be retrieved over the network. If so, please contact us and we'll be happy to work to add support.
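
If you are scripting such a check yourself, the sketch below shows one way to poll these values over SNMP. It is a minimal, hedged example: the address and community string are placeholders, and while the OIDs shown are from APC's published PowerNet MIB for the AP9630/AP9631 cards, you should verify them against your card's documentation. It assumes the net-snmp snmpget tool is installed.

  #!/usr/bin/env python3
  # Minimal sketch: poll an APC AP9630/AP9631 management card over SNMP.
  # OIDs are from APC's published PowerNet MIB; verify them against your
  # card's firmware. Address and community string are placeholders.
  import subprocess

  UPS_HOST  = "10.20.3.1"   # hypothetical BCN address of the UPS
  COMMUNITY = "public"      # hypothetical read community string

  OIDS = {
      "input_voltage":   ".1.3.6.1.4.1.318.1.1.1.3.2.1.0",  # upsAdvInputLineVoltage
      "input_frequency": ".1.3.6.1.4.1.318.1.1.1.3.2.4.0",  # upsAdvInputFrequency
      "battery_charge":  ".1.3.6.1.4.1.318.1.1.1.2.2.1.0",  # upsAdvBatteryCapacity (%)
      "runtime_left":    ".1.3.6.1.4.1.318.1.1.1.2.2.3.0",  # upsAdvBatteryRunTimeRemaining
  }

  def snmp_get(oid):
      """Return the value portion of an snmpget reply ('-Ovq' prints value only)."""
      out = subprocess.check_output(
          ["snmpget", "-v1", "-c", COMMUNITY, "-Ovq", UPS_HOST, oid], text=True)
      return out.strip()

  if __name__ == "__main__":
      for name, oid in OIDS.items():
          print(f"{name}: {snmp_get(oid)}")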

Network Managed UPSes Are Worth It

We have found that a surprising number of issues that affect service availability are power related. A network-connected smart UPS allows you to monitor the power coming from the building mains. Thanks to this, we've been able to detect far more than simple "lost power" events: failing transformers and regulators, over and under voltage events and so on. These are events that, caught ahead of time, let you avoid full power outages. The UPS also protects the rest of your gear that isn't behind one.

So strictly speaking, you don't need network managed UPSes. However, we have found them to be worth their weight in gold. We will, of course, be using them in this tutorial.

Switched PDUs

Two network-switched PDUs are used to provide backup fencing for the Anvil! nodes.

Currently, any APC (Schneider) switched PDU, like the AP7900, is supported. Raritan-brand PDUs, like the PX2-5260A4R, are also supported.

APC AP7900 8-Outlet 1U 120vAC PDU. Photo by APC.
APC AP7931 16-Outlet 0U 120vAC PDU. Photo by APC.

Other brands may work fine, but may not be ideal due to slow switching times (TrippLite switched PDUs are an example of PDUs that work, but are very slow to switch and confirm states).

If you plan to use a different make or model, please verify that you can turn an outlet on and off, and that you can verify the outlet's current state. If so, please contact us and we'll be happy to work to add support.
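
As an illustration of that verification, the sketch below reads and switches a single outlet over SNMP using the sPDUOutletCtl OID from APC's PowerNet MIB (1 = on, 2 = off, 3 = reboot). The address, community string and outlet number are placeholders; confirm the OID against your PDU's MIB before trusting it, and never run this against an outlet feeding live equipment.

  #!/usr/bin/env python3
  # Minimal sketch: read and switch one outlet on an APC switched PDU over
  # SNMP. The OID is sPDUOutletCtl from APC's PowerNet MIB; verify it against
  # your PDU's MIB. Requires the net-snmp tools and write access to the PDU.
  import subprocess

  PDU_HOST  = "10.20.2.1"   # hypothetical BCN address of the PDU
  COMMUNITY = "private"     # hypothetical read/write community string
  OUTLET    = 8             # hypothetical outlet under test

  def outlet_oid(outlet):
      return f".1.3.6.1.4.1.318.1.1.4.4.2.1.3.{outlet}"

  def get_state(outlet):
      """Read the outlet's current state (e.g. 'outletOn' / 'outletOff')."""
      out = subprocess.check_output(
          ["snmpget", "-v1", "-c", COMMUNITY, "-Ovq", PDU_HOST, outlet_oid(outlet)],
          text=True)
      return out.strip()

  def set_state(outlet, value):
      """Set the outlet state: 1 = on, 2 = off, 3 = reboot."""
      subprocess.check_call(
          ["snmpset", "-v1", "-c", COMMUNITY, PDU_HOST,
           outlet_oid(outlet), "i", str(value)])

  if __name__ == "__main__":
      print("before:", get_state(OUTLET))
      set_state(OUTLET, 2)                  # cut power to the outlet...
      print("after: ", get_state(OUTLET))   # ...and confirm the new state
      set_state(OUTLET, 1)                  # restore power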

Why Switched PDUs?

When a node stops responding, we can not simply assume that it is dead. To do so would be to risk a "split-brain" condition which can lead to data divergence, data loss and data corruption.

To deal with this, we need a mechanism for putting a node that is in an unknown state into a known state, a process called "fencing" (or "stonith"). Many people who build HA platforms use the IPMI interface for this purpose, as will we. The idea here is that, when a node stops responding, the surviving node connects to the lost node's IPMI interface and forces the machine to power off. The IPMI BMC is, effectively, a little computer inside the main computer, so it will work regardless of what state the node itself is in.

Once the node has been confirmed to be off, the services that had been running on it can be restarted on the remaining good node, safe in knowing that the lost peer is not also hosting these services. In our case, these "services" are the shared storage and the virtual servers.

There is a problem with this though. Actually, two.

  1. The IPMI draws its power from the same power source as the server itself. If the host node loses power entirely, IPMI goes down with the host.
  2. The IPMI BMC has a single network interface plugged into a single switch and it is a single device.

If we relied on IPMI-based fencing alone, we'd have a single point of failure. If the surviving node can not put the lost node into a known state, it will intentionally hang. The logic being that a hung cluster is better than risking corruption or a split-brain. This means that, with IPMI-based fencing alone, the loss of power to a single node would not be automatically recoverable.

That is not allowed under Intelligent Availability.

To make fencing redundant, we will use switched PDUs. Think of these as network-connected power bars. Imagine now that one of the nodes blew itself up. The surviving node would try to connect to its IPMI interface and, of course, get no response. It would then log into both PDUs (one feeding each of the lost node's redundant power supplies) and cut the power going to the node. By doing this, we always have a way of putting a lost node into a known state.
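
To make the order of operations concrete, here is an illustrative sketch of the escalation logic just described. The helper functions are placeholders only; in a real Anvil!, dedicated fence agents (such as fence_ipmilan and fence_apc_snmp) do this work under the cluster stack's control.

  # Illustrative sketch of the fence escalation order: try the lost node's
  # IPMI BMC first, and fall back to cutting both PDU-fed power rails if the
  # BMC is unreachable. Function bodies are placeholders, not real fencing.

  def fence_via_ipmi(node):
      """Ask the lost node's BMC to power it off, then confirm it is off.
      Placeholder; fence_ipmilan implements this for real."""
      return False  # placeholder

  def fence_via_pdus(node):
      """Switch off the lost node's outlets on BOTH PDUs and confirm. Even
      with a dead BMC, this puts the node into a known (off) state.
      Placeholder; fence_apc_snmp implements this for real."""
      return False  # placeholder

  def fence(node):
      # Primary method: the BMC works regardless of the host's software state.
      if fence_via_ipmi(node):
          return True
      # The BMC shares the node's power, though. If the node lost power
      # entirely, escalate to the switched PDUs behind each power supply.
      if fence_via_pdus(node):
          return True
      # If both methods fail, the survivor must block rather than risk a
      # split-brain; a hung cluster is safer than corrupted data.
      return False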

So now, no matter how badly things go wrong, we can always recover!

Ethernet Switches

Selecting ethernet switches is easier.

The most fundamental requirement is that the switches support VLANs. We strongly recommend stacked switches with 'hitless failover'. That is to say, the switch must allow for members of the stack to leave and be replaced without any interruption to the flow of network traffic.

Brocade ICX6610-48 8x SFP+, 48x 1Gbps RJ45, 160Gbit stacked switch. Photo by Brocade.
Brocade ICX6450-48 4x SFP+, 24x 1Gbps RJ45, 40Gbit stacked switch. Photo by Brocade.

We recommend (and have the most experience with) Brocade ICX-series switches, but any vendor (Cisco, Dell, D-Link, etc.) should be fine, provided the features above are present.

Serial Access

Antaira UTS-1110A single-port USB to RS-232 serial adapter. Photo by Antaira.

You will need a serial port to access and configure the foundation pack devices. Most modern laptops and desktops no longer ship with a serial port. If you need one, we have found the Antaira UTS-1110A to be a very good USB option that works flawlessly under Linux.

Striker Dashboards

Striker dashboards serve two purposes;

  • Host the Striker WebUI
  • Host the ScanCore database

The first job requires very little in the way of system resources. The second one, hosting the database, does require more careful consideration.

Generally speaking, you will want to have 8 GiB of RAM or more, an SSD to help with the random database access, and an Intel Core i5 or equivalent. The Striker dashboard requires two ethernet connections, one to the IFN and one to the BCN. If you wish, four interfaces can be used; if available, Striker's installer will detect the four interfaces and automatically configure active-passive bonding for the two connections.

A server? An appliance!

The Striker dashboard runs like your home router; it has a web-interface that allows you to create, manage and access highly-available servers, manage nodes and monitor the foundation pack hardware.


Striker dashboards are an integral part of the Anvil! platform and are critical in the delivery of Intelligent Availability. As such, you must have two of them. If the machines you use do not have redundant power and/or networking, be sure to connect the first dashboard to the first power rail and ethernet switch and the second dashboard to the second power rail and ethernet switch.

Anvil! Nodes

The more fault-tolerant, the better!

The Anvil! nodes host your highly-available servers, but the servers themselves are totally decoupled from the hardware. You can move your servers back and forth between these nodes without any interruption. If a node catastrophically fails without warning, the survivor will reboot your servers within seconds, minimizing service interruption (typical recovery time, from node crash to the server sitting at its login prompt, is 30 to 90 seconds).

The capable Fujitsu Primergy RX2540 M2. Photo by Fujitsu.
The powerfully small Fujitsu Primergy TX1320 M2. Photo by Fujitsu.

The requirements are two servers with the following;

Beyond these requirements, the rest is up to you; your performance requirements, your budget and your desire for as much fault-tolerance as possible.

Note: If you have a bit of time, you should really read the section discussing hardware considerations from the main tutorial before purchasing hardware for this project. It is very much not a case of "buy the most expensive and you're good".

The previous section covered the bare-minimum system requirements for following this tutorial. If you are looking to build an Anvil! for production, we need to discuss important considerations for selecting hardware.

More Consideration - Storage

There is probably no single consideration more important than choosing the storage you will use.

In our years of building Anvil! HA platforms, we've found no single issue more important than storage latency. This is true for all virtualized environments, in fact.

The problem is this:

Multiple servers on shared storage generate highly random storage access. Traditional hard drives store data on spinning platters, with mechanical read/write heads on the ends of arms that sweep back and forth across the platter surfaces. These platters are broken up into "tracks", and each track is itself cut up into "sectors". When a server needs to read or write data, the hard drive has to sweep the arm over the track it wants, then wait for the sector it wants to pass underneath.

This time taken to get the read/write head onto the track and then wait for the sector to pass underneath is called "seek latency". How long this latency actually is depends on a few things:

  • How fast are the platters rotating? The faster the platter speed, the less time it takes for a sector to pass under the read/write head (a worked example follows this list).
  • How fast can the read/write arms move, and how far do they have to travel between tracks? Highly random read/write requests can cause a lot of head travel and increase seek time.
  • How many read/write requests (IOPS) can your storage handle? If your storage can not process the incoming read/write requests fast enough, your storage can slow down or stall entirely.
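
To put numbers on the first point: on average, the head waits half a revolution for the right sector, so the average rotational latency is simply half the revolution time.

  # Average rotational latency is the time for half a platter revolution:
  # (60 seconds / RPM) / 2.
  for rpm in (5400, 7200, 10000, 15000):
      avg_ms = (60.0 / rpm) / 2 * 1000
      print(f"{rpm:>6} rpm: ~{avg_ms:.2f} ms average rotational latency")
  # ->  7200 rpm: ~4.17 ms ... 15000 rpm: ~2.00 ms

Seek time (moving the arm between tracks) comes on top of this, which is why workload randomness matters as much as raw platter speed.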

When many people think about hard drives, they generally worry about maximum write speeds. For environments with many virtual servers, this is actually far less important than it might seem. Reducing latency to ensure that read/write requests don't back up is far more important. This is measured as the storage's IOPS performance. If too many requests back up in the cache, storage performance can collapse or stall out entirely.

This is particularly problematic when multiple servers try to boot at the same time. If, for example, a node with multiple servers dies, the surviving node will try to start the lost servers at nearly the same time. This causes a sudden dramatic rise in read requests and can cause all servers to hang entirely, a condition called a "boot storm".

Thankfully, this latency problem can be easily dealt with in one of three ways;

  1. Use solid-state drives. These have no moving parts, so there is less penalty for highly random read/write requests.
  2. Use fast platter drives and proper RAID controllers with write-back caching.
  3. Isolate each server onto dedicated platter drives.

Each of these solutions has benefits and downsides;

Fast drives + write-back caching

  Pro: 15,000rpm SAS drives are extremely reliable, and the high rotation speed minimizes the latency spent waiting for sectors to pass under the read/write heads. Using multiple drives in RAID level 5 or level 6 breaks reads and writes up into smaller pieces, allowing requests to be serviced quickly and helping keep the read/write buffer empty. Write-back caching allows RAM-like write speeds and the ability to re-order disk access to minimize head movement.
  Con: The main con is the number of disks needed to get effective performance gains from striping. Alteeve always uses a minimum of six disks, but many entry-level servers support a maximum of four drives. You need to account for the number of disks you plan to use when selecting your hardware.

SSDs

  Pro: They have no moving parts, so read and write requests do not have to wait for mechanical movements to happen, drastically reducing latency. The minimum number of drives for an SSD-based configuration is two.
  Con: Solid state drives use NAND flash, which can only be written to a finite number of times. All drives in an Anvil! will be written to roughly the same amount, so hitting this write limit could mean that all drives in both nodes fail at nearly the same time. Avoiding this requires careful monitoring of the drives and replacing them before their write limits are hit.

Note: Enterprise grade SSDs are designed to handle highly random, multi-threaded workloads and come at a significant cost. Consumer-grade SSDs are designed principally for single-threaded, large accesses and do not offer the same benefits.

Isolated storage

  Pro: Dedicating hard drives to virtual servers avoids the highly random read/write issue found when multiple servers share the same storage. This allows for the safe use of inexpensive hard drives, and it means that dedicated hardware RAID controllers with battery-backed cache are not needed, saving a good amount of money in the hardware design.
  Con: The obvious down-side to isolated storage is that you significantly limit the number of servers you can host on your Anvil!. If you only need to support one or two servers, this should not be an issue.

The last piece to consider is the interface of the drives used, be they SSDs or traditional HDDs. The two common interface types are SATA and SAS.

  • SATA HDDs generally have a platter speed of 7,200rpm. The SATA interface has a limited instruction set and provides minimal health reporting. These are "consumer" grade devices that are far less expensive, and far less reliable, than SAS drives.
  • SAS drives are generally aimed at the enterprise environment and are built to much higher quality standards. SAS HDDs have rotational speeds of up to 15,000rpm and can handle far more read/write operations per second. Enterprise SSDs using the SAS interface are also much more reliable than their consumer counterparts. The main downside to SAS drives is their cost.

In all production environments, we strongly, strongly recommend SAS-connected drives. For non-production environments, SATA drives are fine.

More Consideration - Storage Security

If security is a particular concern of yours, then you can look at using self-encrypting hard drives along with LSI's SafeStore option, or similar options from other vendors. An example hard drive, which we've tested and validated, would be the Seagate ST1800MM0038 drives. In general, if the drive advertises "SED" support, it should work fine.

This provides the ability to:

  • Encrypt all data with AES-256 grade encryption without a performance hit.
  • Require a pass phrase on boot to decrypt the server's data.
  • Protect the contents of the drives while "at rest" (ie: while being shipped somewhere).
  • Execute a self-destruct sequence.

Obviously, most users won't need this, but it might be useful to some users in sensitive environments like embassies in less than friendly host countries.

More Consideration - RAM

RAM is a far simpler topic than storage, thankfully. Here, all you need to do is add up how much RAM you plan to assign to servers, add at least 4 GiB for the host, and then install that much memory (or more) in both of your nodes.
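
As a worked example (the server names and sizes below are hypothetical):

  # Sum the RAM assigned to each planned server, then add a host reserve of
  # at least 4 GiB. Install at least this much RAM in BOTH nodes.
  servers_gib = {
      "srv01-web": 4,
      "srv02-db": 16,
      "srv03-app": 8,
  }
  host_reserve_gib = 4

  needed_gib = sum(servers_gib.values()) + host_reserve_gib
  print(f"Each node needs at least {needed_gib} GiB of RAM.")  # -> 32 GiB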

In production, there are two technologies you will want to consider;

  • ECC (error-correcting code) provides the ability for RAM to recover from single-bit errors. If you are familiar with how parity in RAID arrays works, ECC in RAM is the same idea. This is often included in server-class hardware by default. It is highly recommended.
  • Memory Mirroring is, continuing our storage comparison, RAID level 1 for RAM. All writes to memory go to two different chips. Should one fail, the contents of the RAM can still be read from the surviving module.

Storage Over-Provisioning

"Over-provisioning", also called "thin provisioning" is a concept made popular in many "cloud" technologies. It is a concept that has almost no place in HA environments and is precluded by Intelligent Availability.

A common example is creating virtual disks of a given apparent size, but which only pull space from real storage as needed. So if you created a "thin" virtual disk that was 80 GiB large, but only 20 GiB worth of data was used, only 20 GiB from the real storage would be used.

In essence; Over-provisioning is where you allocate more resources to servers than the nodes can actually provide, banking on the hope that most servers will not use all of the resources allocated to them. The danger here, and the reason it has almost no place in HA, is that if the servers collectively use more resources than the nodes can provide, something is going to crash.
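
A worked example with hypothetical numbers shows how the danger creeps in:

  # Thin-provisioned virtual disks let allocations exceed real capacity.
  # All figures here are hypothetical.
  real_capacity_gib = 500
  thin_disks_gib    = [80, 200, 300]   # apparent sizes as seen by the servers
  used_gib          = [20, 90, 110]    # space actually written so far

  print(f"allocated: {sum(thin_disks_gib)} GiB")  # -> 580 GiB "promised"
  print(f"used:      {sum(used_gib)} GiB")        # -> 220 GiB really consumed
  # The 80 GiB shortfall only bites when the servers fill their disks, and
  # then something crashes. This is why the Anvil! fully allocates up front.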

More Consideration - CPUs And CPU Over-Provisioning

Over-provisioning of RAM and storage is never acceptable in an HA environment, as mentioned. Over-allocating CPU cores can be acceptable, though, if done carefully.

When selecting which CPUs to use in your nodes, the number of cores and the speed of the cores will determine how much computational horse-power you have to allocate to your servers. The main considerations are:

  • Core speed; Any given "thread" can be processed by a single CPU core at a time. The faster the given core is, the faster it can process any given request. Many applications do not support multithreading, meaning that the only way to improve performance is to use faster cores, not more cores.
  • Core count; Some applications support breaking up jobs into many threads, and passing them to multiple CPU cores at the same time for simultaneous processing. This way, the application feels faster to users because each CPU has to do less work to get a job done. Another benefit of multiple cores is that if one application consumes the processing power of a single core, other cores remain available for other applications, preventing processor congestion.

In processing, each CPU "core" can handle one program "thread" at a time. Since the earliest days of multitasking, operating systems have been able to handle threads waiting for a CPU resource to free up. So the risk of over-provisioning CPUs is restricted to performance issues only.

If you're building an Anvil! to support multiple servers and it's important that, no matter how busy the other servers are, the performance of each server can not degrade, then you need to be sure you have as many real CPU cores as you plan to assign to servers.

So for example, if you plan to have three servers and you plan to allocate each server four virtual CPU cores, you need a minimum of 13 real CPU cores (3 servers x 4 cores each, plus at least one core for the node). In this scenario, you will want to choose servers with dual 8-core CPUs, for a total of 16 available real CPU cores. You may choose to buy two 6-core CPUs, for a total of 12 real cores, but you still risk congestion: if all three servers fully utilize their four cores at the same time, the host OS will be left with no available core for its software, which manages the HA stack.
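
The same arithmetic as a small script, counting physical cores only (see the note on hyper-threading below):

  # Three servers with four vCPUs each, plus at least one physical core
  # reserved for the host's HA stack. Count physical cores only: per the
  # hyper-threading note below, treat a "4c/8t" part as 4 cores here.
  servers = 3
  vcpus_per_server = 4
  host_reserve = 1

  required = servers * vcpus_per_server + host_reserve
  print(f"Minimum real cores: {required}")  # -> 13; dual 8-core CPUs (16) fit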

In many cases, however, risking a performance loss under periods of high CPU load is acceptable. In these cases, allocating more virtual cores than you have real cores is fine. Should the load of the servers climb to a point where all real cores are under 100% utilization, then some applications will slow down as they wait for their turn in the CPU.

In the end, the decision whether to over-provision CPU cores or not, and if so by how much, is up to you, the reader. Remember to consider balancing out faster cores with the number of cores. If your expected load will be short bursts of computationally intense jobs, then few-but-faster cores may be the best solution.

A Note on Hyper-Threading

Intel's hyper-threading technology can make a CPU appear to the OS to have twice as many cores as it actually has. For example, a CPU listed as "4c/8t" (four cores, eight threads) will appear to the node as an 8-core CPU. In fact, you only have four real cores; the additional four are emulated in an attempt to make more efficient use of each core's processing time.

Simply put, the idea behind this technology is to "slip in" a second thread when the CPU core would otherwise be idle. For example, if the core has to wait for memory to be fetched for the currently active thread, it can work on the second thread instead of sitting idle.

How much benefit this gives you in the real world is debatable and highly dependent on your applications. For the purposes of HA, it's recommended not to count the "HT cores" as real cores. That is to say, when calculating load, treat a "4c/8t" CPU as a 4-core CPU.

More Consideration - Network Interfaces (Six of them? Seriously?)

Yes, seriously.

Obviously, you can put everything on a single network card and your HA software will work, but it would not be advised.

We will go into the network configuration at length later on. For now, here's an overview:

  • Each network needs two links in order to be fault-tolerant. One link will go to the first network switch and the second link will go to the second network switch. This way, the failure of a network cable, port or switch will not interrupt traffic.
  • There are three main networks in an Anvil!;
    • Back-Channel Network; This is used by the cluster stack and is sensitive to latency. Delaying traffic on this network can cause the nodes to "partition", breaking the cluster stack.
    • Storage Network; All disk writes will travel over this network. As such, it is easy to saturate this network. Sharing this traffic with other services would mean that it's very possible to significantly impact network performance under high disk write loads. For this reason, it is isolated.
    • Internet-Facing Network; This network carries traffic to and from your servers. By isolating this network, users of your servers will never experience performance loss during storage or cluster high loads. Likewise, if your users place a high load on this network, it will not impact the ability of the Anvil! to function properly. It also isolates untrusted network traffic.

So, three networks, each using two links for redundancy, means that we need six network interfaces. It is strongly recommended that you use three separate dual-port network cards. Using a single network card, as we will discuss in detail later, leaves you vulnerable to losing entire networks should the controller fail.
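
As a sketch of the resulting layout (the interface and bond names here are illustrative, not prescriptive), note how each network gets an active-passive bond of two links, with one link per switch, and ideally the two ports in a bond come from different physical cards:

  # Illustrative six-link layout: three networks, each an active-passive
  # bond of two ports. Names are hypothetical, not a required convention.
  bonds = {
      "bcn_bond1": ("bcn_link1", "bcn_link2"),  # Back-Channel Network
      "sn_bond1":  ("sn_link1",  "sn_link2"),   # Storage Network
      "ifn_bond1": ("ifn_link1", "ifn_link2"),  # Internet-Facing Network
  }
  # All link1 ports cable to switch 1; all link2 ports cable to switch 2,
  # so no single cable, port, switch or controller failure downs a network.
  for bond, (primary, backup) in bonds.items():
      print(f"{bond}: {primary} (switch 1) + {backup} (switch 2)")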

A Note on Dedicated IPMI Interfaces

Some server manufacturers provide access to IPMI using the same physical interface as one of the on-board network cards. Usually these companies provide optional upgrades to break the IPMI connection out to a dedicated network connector.

Whenever possible, it is recommended that you go with a dedicated IPMI connection.

We've found that it is rarely, if ever, possible for a node to talk to its own IPMI interface when it shares a physical port with the host. This is not strictly a problem, but testing and diagnostics are certainly easier when the node can ping and query its own IPMI interface over the network.
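
For example, a quick way to test that a node can reach a BMC over the network is the standard ipmitool utility; a minimal wrapper might look like this (the address and credentials are placeholders):

  #!/usr/bin/env python3
  # Minimal sketch: query a BMC's power state over the network with the
  # standard ipmitool CLI. Address and credentials are placeholders.
  import subprocess

  BMC_HOST = "10.20.51.1"   # hypothetical BCN address of the BMC
  BMC_USER = "admin"        # hypothetical credentials
  BMC_PASS = "secret"

  def power_status():
      out = subprocess.check_output(
          ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
           "-U", BMC_USER, "-P", BMC_PASS, "chassis", "power", "status"],
          text=True)
      return out.strip()    # e.g. "Chassis Power is on"

  if __name__ == "__main__":
      print(power_status())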

 
