Intelligent Availability

From Alteeve Wiki

Intelligent Availability™ is the successor to High Availability. For a system to be defined as "IA", it must meet the following requirements and design focus:

1. Where HA is reactive, IA is proactive.

2. Complete stack redundancy with no single point of failure.

3. IA must survive both failure and recovery without interruption.

4. Over-provisioning/thin-provisioning is not allowed.

5. Performance must remain consistent in a degraded state.

6. Human interaction must be reduced as much as possible.

7. Compartmentalise.

Expanding on these:

Intelligent Availability Is Proactive

Where traditional high-availability would detect a fault and react to it, intelligent availability actively scans its environment, internal components and software states looking for and adapting to changing threat models.

For example, under normal operation, the biggest threat to operation is component failure. Thus, maintaining full redundancy is the top priority. Should one node's health degrade (e.g. a RAID array loses a disk and enters a "Partially Degraded" or "Degraded" state), hosted servers will be live-migrated to the healthier peer.

If, as another example, environmental cooling is lost, or input power to both UPSes is lost, then the threat to availability is no longer component failure. With this new threat model, shedding load to either reduce thermal output or reduce current draw on the UPS batteries becomes the priority. A sacrificial node will be selected (on multiple criteria), servers will be consolidated onto its peer if needed, and the sacrificial node will be powered off. When the room cools or power returns (and the UPSes sufficiently charge), the powered-off node will be started again and full redundancy will be restored.

In the Anvil! system, the criteria for determining threats and how to react and recover are handled by ScanCore. This is an agent-based "decision engine" we created for the above capability. Being agent-based, it is designed to be easily expanded over time, as the core itself neither knows nor cares how external sensor data, software states and so on are collected. It simply gathers the aggregate data from the various agents and uses that data to make decisions. Similarly, agents themselves are allowed to make localized decisions (and often attempt recovery) autonomously. This allows these simple stand-alone scan agents to take corrective actions as appropriate.
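The decision flow described in this section can be sketched in a few lines of Python. This is an illustration only and not ScanCore code: the `NodeReport` fields, the `Threat` names and the action strings are all hypothetical, and a single `health` score stands in for the multiple criteria used to pick a sacrificial node.

```python
from dataclasses import dataclass
from enum import Enum


class Threat(Enum):
    COMPONENT_FAILURE = "component failure"  # the normal-operation threat model
    THERMAL = "loss of cooling"
    POWER = "loss of input power"


@dataclass
class NodeReport:
    """Aggregate of what hypothetical scan agents reported for one node."""
    name: str
    health: int         # hypothetical score; higher is healthier
    ambient_ok: bool    # False if temperature sensors are out of range
    ups_on_mains: bool  # False if this node's UPS is running on battery


def assess_threat(reports):
    """Pick the dominant threat model from the aggregate agent data."""
    if all(not r.ups_on_mains for r in reports):
        return Threat.POWER
    if any(not r.ambient_ok for r in reports):
        return Threat.THERMAL
    return Threat.COMPONENT_FAILURE


def decide(reports):
    """Return an (action, node_name) tuple for the current threat model."""
    threat = assess_threat(reports)
    healthiest = max(reports, key=lambda r: r.health)
    weakest = min(reports, key=lambda r: r.health)
    if threat is Threat.COMPONENT_FAILURE:
        # Redundancy is the priority: move servers to the healthier peer.
        if weakest.health < healthiest.health:
            return ("migrate_servers_to", healthiest.name)
        return ("no_action", None)
    # Thermal or power threat: shedding load is the priority.
    return ("consolidate_and_power_off", weakest.name)


reports = [NodeReport("node1", 10, True, True), NodeReport("node2", 5, True, True)]
print(decide(reports))  # node2 is degraded, so servers move to node1
```

The point of the sketch is that the same input data leads to different actions depending on the active threat model; the core only needs the aggregated reports, not knowledge of how each agent collected them.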

Component failure is not always predictable, so reactive recovery is still important, but IA must strive to expand predictive capabilities.

Complete Stack Redundancy

Traditional HA solutions, by definition, provide some level of redundancy but almost always lack full redundancy. The basic HA setup is a shared storage (SAN) solution coupled to two or more nodes. There are often single points of failure (SPoF) in the power, network or other components. For a solution to be "IA", you must be able to walk up to *any* component, rip it out ungracefully and have the system continue to operate.

Survive Failure and Recovery

Many HA solutions rely on components that can survive a failure but not necessarily recovery. Classic examples are the use of SANs or blade chassis. These often include high levels of redundancy: duplicate backplanes, etc. However, if a backplane fails, often the chassis must be de-energized to effect repairs. This requires scheduled downtime.

In an IA environment, all components must be electrically and mechanically isolated. This is why, among other reasons, the Anvil! platform does not use a SAN. As redundant as a SAN might be, it is one box and one logical array. By using replicated storage, we can totally remove a node (and its internal storage) and not affect availability in any way. With this, we can perform firmware and software updates, repair components or outright replace an entire node without impacting availability. Ask past customers how they feel when they have to update the firmware on their SANs and you will quickly understand the value of our approach. :)

Never Over-Provision

One of the key differentiators of "availability" solutions relative to "cloud" solutions is the importance of resource utilization efficiency (RUE). In cloud environments, which chase RUE, thin-provisioning is employed in storage (as an example). Cloud operators know that a given hosted VM rarely uses more than, say, 75% of its allocated disk space. Thus, a cloud operator might well over-allocate storage by 25% (in this example). The risk, of course, is that if too many hosted guests use more than 75% of their allocated storage, you suddenly run out of available disk space despite guests believing they still have available storage.

In IA, nothing is more important than resource availability so over-provisioning of resources is not allowed. In IA, all infrastructure design decisions are dictated, first, by availability, then secondly by required performance and only third is utilization efficiency considered.
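The arithmetic behind the thin-provisioning risk can be worked through with a short sketch. The function names and the figures (1000 GiB of physical storage, ten guests of 125 GiB each, which is a 25% over-allocation) are assumptions chosen to match the example above, not values from any real deployment.

```python
def allocation_allowed(physical_gib, allocations_gib):
    """IA rule: the sum of all allocations may never exceed physical capacity."""
    return sum(allocations_gib) <= physical_gib


def shortfall_gib(physical_gib, allocations_gib, usage_percent):
    """GiB by which real usage would exceed physical capacity (0 if it fits).

    Assumes, for simplicity, that every guest uses the same percentage
    of its allocation.
    """
    used = sum(a * usage_percent for a in allocations_gib) / 100
    return max(0.0, used - physical_gib)


physical = 1000        # GiB of real storage (hypothetical)
guests = [125] * 10    # 1250 GiB allocated, i.e. over-allocated by 25%

print(allocation_allowed(physical, guests))  # False: not permitted under IA
print(shortfall_gib(physical, guests, 75))   # 0.0: fits while guests stay at 75%
print(shortfall_gib(physical, guests, 90))   # 125.0: the storage has run out
```

With the same physical storage, an IA design would allocate at most 1000 GiB in total, so the shortfall is zero even if every guest fills its disk.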

Consistent Performance In A Degraded State

As an extension of number 4, IA dictates that, in addition to resiliency being the top design consideration, performance in a degraded state must remain consistent. For example, if you aggregated the bandwidth of paired network links, you might be able to get 2 Gbps or 20 Gbps. In a degraded state though, the bandwidth would be reduced to 1 Gbps or 10 Gbps. If the user's workload exceeded what 1 Gbps or 10 Gbps could provide, they wouldn't know until a link was lost. So despite there still being some level of availability, performance in a degraded state would no longer be sufficient to maintain required workloads. Thus, the Anvil! system bonds in active/passive mode. So despite there being 2x 1 Gbps (or 2x 10 Gbps) links, the available performance remains at 1 Gbps (or 10 Gbps), and stays there in a failed state.
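The difference between the two bonding approaches can be shown with a small calculation. This is a sketch with hypothetical function and mode names, and it models only nominal link speed, not real-world throughput.

```python
def bond_bandwidth_gbps(link_gbps, links_up, mode):
    """Nominal bandwidth of a bond given how many member links are up."""
    if links_up == 0:
        return 0
    if mode == "aggregate":
        return link_gbps * links_up   # all working links carry traffic
    if mode == "active-passive":
        return link_gbps              # only one link carries traffic at a time
    raise ValueError(f"unknown mode: {mode}")


def performance_is_consistent(link_gbps, mode):
    """True if a two-link bond performs the same with one link lost."""
    healthy = bond_bandwidth_gbps(link_gbps, 2, mode)
    degraded = bond_bandwidth_gbps(link_gbps, 1, mode)
    return healthy == degraded


for mode in ("aggregate", "active-passive"):
    print(mode,
          bond_bandwidth_gbps(1, 2, mode),       # healthy: 2 vs 1 Gbps
          bond_bandwidth_gbps(1, 1, mode),       # degraded: 1 vs 1 Gbps
          performance_is_consistent(1, mode))    # False vs True
```

The active/passive bond gives up peak bandwidth in exchange for a figure that does not change when a link fails, which is the property IA requires.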

Similarly, despite having two nodes, all servers will normally run on just one host. The concern is the same as in the network example: if the CPU resources on one node are insufficient to handle the growing workload, we want to know as soon as possible, not many months later when a node is lost and the hosted servers can no longer perform as required.

With this approach, if a client's growth exceeds design expectations, we will know as soon as possible and can address the performance issues with planned upgrades.
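The reasoning for running all servers on one host can be made concrete with a small comparison. The names and core counts below are hypothetical, and CPU demand is simplified to a count of busy cores.

```python
def per_node_load(server_cores, spread_across_both_nodes):
    """CPU cores demanded from each active node during normal operation."""
    total = sum(server_cores)
    return total / 2 if spread_across_both_nodes else total


def problem_visible_today(node_cores, server_cores, spread_across_both_nodes):
    """True if the shortfall already shows up while both nodes are healthy."""
    return per_node_load(server_cores, spread_across_both_nodes) > node_cores


def problem_after_node_loss(node_cores, server_cores):
    """True if the surviving node cannot carry the full workload."""
    return sum(server_cores) > node_cores


node_cores = 16        # cores per node (hypothetical)
servers = [6, 6, 6]    # the workload has grown to 18 cores in total

print(problem_visible_today(node_cores, servers, spread_across_both_nodes=True))   # False
print(problem_visible_today(node_cores, servers, spread_across_both_nodes=False))  # True
print(problem_after_node_loss(node_cores, servers))                                # True
```

When the servers are spread across both nodes, the shortfall stays hidden until a node is lost. When they all run on one host, it is visible as soon as the workload outgrows a single node.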

Remove The Human

Understanding that a good IA system is one that will "fade into the background", and understanding that people can make mistakes, Intelligent Availability requires that human interaction be minimized as much as possible and that all user interfaces strive to be as simple as possible. This is not for the benefit of the user, but for the protection of the system. In the above examples, the autonomous proactive live migration, load shedding and restoration of redundancy post-event are all reflections of this requirement. Beyond taking proactive actions faster than humans can, this also removes opportunities for human error.

Compartmentalise

This is an extension of Never Over-Provision. There is a temptation to use very large UPSes, switched PDUs and ethernet switches in order to run the most node pairs per foundation pack.

The risk here is that a serious, multi-component failure will cause many more IA node pairs to fail when the underlying foundation pack fails. This increases the scope of an outage and extends the mean time to recovery, simply because of the wider scale of services impacted.

Consider this approach to be similar to how some ships and submarines have dual hulls and floodable compartments. The dual hull provides redundancy, analogous to the full-stack redundancy in IA described above. This protects against all normal failures via standard redundancy. However, accepting that everything can go wrong, that alone is not sufficient.

Continuing the analogy, compartmentalisation in IA is akin to the floodable compartments. If "both hulls" fail, your last line of defence is to minimise the affected area in the hopes of keeping the ship, your organisation, afloat.

We've seen rare cases, such as a short-circuit in an automatic transfer system tripping the breakers on both UPSes, that took out a foundation pack and any hosted node pairs. In a compartmentalised environment where no more than two or three node pairs are hosted per foundation pack, this would mean only those immediately hosted systems would be lost. Had a larger foundation pack been implemented, the outage would be much more widespread, possibly "sinking your ship".
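The trade-off can be put into numbers with a short sketch. The function names and the figure of twelve node pairs are assumptions for illustration; the sketch simply compares how much is lost when a single foundation pack fails under different pack sizes.

```python
import math


def packs_needed(total_pairs, pairs_per_pack):
    """Foundation packs required to host all node pairs."""
    return math.ceil(total_pairs / pairs_per_pack)


def pairs_lost_on_pack_failure(total_pairs, pairs_per_pack):
    """Worst-case number of node pairs lost when one foundation pack fails."""
    return min(total_pairs, pairs_per_pack)


total_pairs = 12  # hypothetical site with twelve node pairs

for pairs_per_pack in (3, 12):
    lost = pairs_lost_on_pack_failure(total_pairs, pairs_per_pack)
    print(f"{pairs_per_pack} pairs per pack: "
          f"{packs_needed(total_pairs, pairs_per_pack)} packs, "
          f"{lost} pairs lost ({100 * lost // total_pairs}% of the site)")
```

With three node pairs per foundation pack, the same fault takes out a quarter of this hypothetical site; with one large pack, it takes out all of it. The cost of the smaller blast radius is more foundation packs to buy and maintain.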

 

Any questions, feedback, advice, complaints or meanderings are welcome.
Us: Alteeve's Niche! Support: Mailing List IRC: #clusterlabs on Libera Chat
© Alteeve's Niche! Inc. 1997-2023   Anvil! "Intelligent Availability™" Platform
legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.