When critical workloads can’t afford downtime, Red Hat High Availability Clusters step in to keep services running, ensure data stays consistent, and eliminate single points of failure. Built on the solid foundation of the High Availability Add-On, these clusters use a mix of resource orchestration, fault detection, and fencing mechanisms to deliver enterprise-grade uptime.
Whether you’re a Linux engineer, system architect, or platform owner evaluating RHEL clustering, this deep dive walks you through its architecture, components, and strategies for maintaining availability and integrity.
🔧 What Makes a Cluster “Highly Available”?
At the heart of RHEL HA is the High Availability Add-On, which transforms a group of RHEL systems (called nodes) into a cohesive cluster. This cluster continuously monitors each member, takes over services when failures occur, and ensures clients never know something went wrong.
Clusters built with the High Availability Add-On:
- Avoid single points of failure
- Automatically fail over services
- Maintain data integrity during transitions
Key tools in the stack include:
- Pacemaker: The brain of the cluster that manages resources
- Corosync: Handles messaging, quorum, and membership
- STONITH (Fencing): Ensures failed nodes are completely cut off
- GFS2 and lvmlockd: Enable active-active shared storage access
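To make these pieces concrete, here is a minimal sketch of bringing up a two-node cluster with pcs; the hostnames and cluster name are placeholders, and the syntax shown is the RHEL 8+ form:

```shell
# Authenticate the nodes to each other as the hacluster user
pcs host auth node1.example.com node2.example.com -u hacluster

# Create and start a two-node cluster (RHEL 8+ pcs syntax)
pcs cluster setup my_cluster node1.example.com node2.example.com
pcs cluster start --all

# Check membership and resource status
pcs status
```

On RHEL 7 the first two commands differ (`pcs cluster auth` and `pcs cluster setup --name`), so check the documentation for your release.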
🧠 Core Components of RHEL High Availability
1. Pacemaker: Resource Management Engine
Pacemaker is the cluster’s resource orchestrator, comprising several daemons:
- CIB (Cluster Information Base): Holds configuration/status in XML, synced across all nodes
- CRMd (Cluster Resource Management daemon): Schedules actions like start/stop/move for resources
- LRMd (Local Resource Management daemon): Interfaces with local agents to execute actions and monitor state
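You can inspect the CIB that Pacemaker replicates across nodes directly; both of these are standard read-only commands on a running cluster:

```shell
# Dump the live CIB (XML) that Pacemaker keeps in sync on every node
pcs cluster cib

# Equivalent low-level query via the cibadmin tool
cibadmin --query
```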
2. Corosync: Messaging Backbone
Corosync ensures all nodes talk to each other reliably. It manages:
- Membership and quorum determination
- Messaging and state sync via kronosnet
- Redundant links and failover networking
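Two standard corosync utilities let you verify link health and membership from any node, which is useful when diagnosing redundant kronosnet links:

```shell
# Show the status of each kronosnet link from this node's perspective
corosync-cfgtool -s

# Show the current membership list as corosync sees it
corosync-cmapctl | grep members
```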
3. Fencing (STONITH): Last Line of Defense
If a node stops responding, how do you guarantee it won’t corrupt data? Enter fencing.
- STONITH (“Shoot The Other Node In The Head”) cuts power or access to failed nodes
- Prevents dual writes and split-brain scenarios
- Required (stonith-enabled=true) for production clusters
Examples:
- Redundant power fencing ensures both power supplies of a node are killed
- Use fencing delays (pcmk_delay_base, priority-fencing-delay) to avoid race conditions
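A fencing setup along these lines can be sketched as follows; the device name, IPMI address, and credentials are placeholders for your hardware:

```shell
# IPMI-based fence device for node1; a base delay helps avoid fence races
pcs stonith create fence_node1 fence_ipmilan \
    ip=10.0.0.11 username=admin password=secret \
    pcmk_host_list=node1.example.com pcmk_delay_base=5

# Fencing must stay enabled for production clusters
pcs property set stonith-enabled=true
```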
🧩 Ensuring Quorum and Preventing Split-Brain
A cluster needs quorum (majority vote) to make decisions. Without it, Pacemaker halts all resources to protect data.
- votequorum service tracks voting nodes
- no-quorum-policy options:
  - stop (default): Stops all services
  - freeze: Useful for GFS2, where file systems need quorum to shut down safely
- Quorum devices (net-based) help even-node clusters survive more failures
- Quorum device algorithms: ffsplit, lms
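For a two-node cluster, an external quorum device tips the vote; the host name below is a placeholder for a machine running the qdevice daemon:

```shell
# Point the cluster at an external quorum device using the ffsplit algorithm
pcs quorum device add model net host=qdevice.example.com algorithm=ffsplit

# For GFS2 workloads, freeze rather than stop when quorum is lost
pcs property set no-quorum-policy=freeze

# Verify vote counts and quorum state
pcs quorum status
```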
💾 Storage Strategies for Data Consistency
1. Shared Storage
Failover only works if the new node can access the same data. Supported mediums include:
- iSCSI
- Fibre Channel
- Shared block devices
2. LVM in Clusters
- HA-LVM: Active/passive, single-node access at a time
- lvmlockd: Enables active/active access, works with GFS2
3. GFS2: The Cluster File System
- Allows multiple nodes to mount and write to the same file system simultaneously
- Requires Pacemaker, Corosync, DLM, and lvmlockd
- Supports encrypted file systems (RHEL 8.4+)
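The supporting stack for GFS2 is itself managed as cluster resources; a common pattern, per Red Hat's GFS2 setup flow, is a cloned group running DLM and lvmlockd on every node (resource and group names here are placeholders):

```shell
# DLM and lvmlockd must run on every node that will mount GFS2
pcs resource create dlm ocf:pacemaker:controld --group locking
pcs resource create lvmlockd ocf:heartbeat:lvmlockd --group locking

# Clone the group so it starts cluster-wide
pcs resource clone locking interleave=true
```

Shared volume group and file-system resources would then be layered on top of this locking clone.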
⚙️ Resource Management Tactics
Resources in Pacemaker are abstracted via agents. They can be grouped, ordered, colocated, and monitored with high precision.
Key controls:
- Groups: Start in order, stop in reverse
- Constraints:
- Location (where)
- Ordering (when)
- Colocation (with whom)
- Health checks: Automatic monitoring with customizable failure policies
- migration-threshold: Move resource after N failures
- start-failure-is-fatal: Bans the resource from a node after a failed start
- multiple-active: What to do if resource runs on >1 node
- shutdown-lock: Prevents unnecessary failovers during planned maintenance
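The group and meta-attribute controls above can be sketched as follows; the IP address and resource names are placeholders, and putting resources in one group implies ordering and colocation automatically:

```shell
# A group starts in order (vip first, then www) and keeps both on one node
pcs resource create vip ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 cidr_netmask=24 --group web
pcs resource create www ocf:heartbeat:apache --group web

# Move the web server away from a node after three local failures
pcs resource meta www migration-threshold=3
```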
🌐 Multi-Site Clustering & Remote Nodes
1. Booth Ticket Manager
Manages split-brain in geo-distributed clusters. Tickets control which site holds resource ownership.
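Day-to-day Booth operation is driven through pcs; the ticket name below is a placeholder for whatever tickets your configuration defines:

```shell
# Grant the ticket that carries resource ownership to this site
pcs booth ticket grant apacheticket

# Check which site currently holds each ticket
pcs booth status
```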
2. pacemaker_remote
Lets you add nodes that don’t run Corosync (e.g., VMs) into your cluster:
- Extend cluster size beyond 32 nodes
- Useful for managing cloud VMs or containers
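Integrating such a host is a single step once the pacemaker_remote service is running on it; the hostname is a placeholder:

```shell
# Add a host running pacemaker_remote as a remote node (RHEL 8+ syntax)
pcs cluster node add-remote vm1.example.com
```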
🛠️ Configuration Tools
Red Hat provides two main tools to manage the cluster:
- pcs (CLI)
- pcsd (Web UI)
Tasks made simple:
- Cluster creation
- Adding/removing nodes
- Config changes (live)
- Viewing status and logs
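For the status and configuration tasks in particular, two read-only commands cover most day-to-day checks:

```shell
pcs status --full   # nodes, resources, and fencing history in detail
pcs config          # the full cluster configuration in readable form
```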
✅ Summary: Why RHEL HA Matters
If your workloads can’t go down—and your data can’t risk corruption—RHEL HA offers:
- Mature, enterprise-tested components
- Consistent handling of failovers and fencing
- Flexibility for active/active and geo-distributed clusters
- Integrated tooling for automation and visibility
Start with two nodes. Plan your fencing. Decide quorum policies. Add shared storage. Then scale.
When uptime matters, the RHEL High Availability Add-On delivers.
Have questions or want a deeper walkthrough? Contact us at OmOps or explore more Linux and infrastructure insights on our blog.