Tag: Red Hat

  • Understanding Pacemaker clusters and their resources configuration for high availability

    Pacemaker clusters (the basis of Red Hat Enterprise Linux High Availability, or RHEL HA) and their resources are configured and managed to provide reliability, scalability, and availability to critical production services. This is done by eliminating single points of failure and facilitating failover. The Red Hat High Availability Add-On uses Pacemaker as its cluster resource manager.

    Before jumping into how Pacemaker clusters and resources are configured, let's review some core concepts of Red Hat High Availability.

    Core Concepts and Components

    Cluster Definition: A cluster consists of two or more computers, known as nodes or members. High availability clusters ensure service availability by moving services from an inoperative node to another cluster node.

    Pacemaker’s Role: Pacemaker is the cluster resource manager that ensures maximum availability for cluster services and resources. This is done by using cluster infrastructure’s messaging and membership capabilities to detect and recover from node and resource-level failures.

    Key Components:

    Cluster Infrastructure: Provides functions such as configuration file management, membership management, lock management, and fencing.

    High Availability Service Management: Manages failover of services when a node becomes inoperative. This is primarily handled by Pacemaker.

    Cluster Administration Tools: Configuration and management capabilities for setting up, configuring, and managing the HA tools, including infrastructure, service management, and storage components.

    Cluster Information Base (CIB): An XML-based representation of both the cluster’s configuration and the current state of all resources. The CIB daemon distributes and synchronises this information across all cluster nodes from the Designated Coordinator (DC). Direct editing of the cib.xml file is not recommended; instead, use pcs or pcsd.

    Cluster Resource Management Daemon (CRMd): Routes Pacemaker cluster resource actions, allowing resources to be queried, moved, instantiated, and changed.

    Local Resource Manager Daemon (LRMd): Acts as an interface between CRMd and resources, passing commands (e.g., start, stop) to agents and relaying status information.

    corosync: The daemon that provides core membership and communication needs for high availability clusters, managing quorum rules and messaging between cluster members.

    Shoot the Other Node in the Head (STONITH): Pacemaker’s fencing implementation, which acts as a cluster resource to process fence requests, forcefully shutting down nodes to ensure data integrity.

    Configuration and Management Tools

    Red Hat provides two primary tools for configuring and managing Pacemaker clusters:

    pcs (Command-Line Interface): This tool controls and configures Pacemaker and the corosync heartbeat daemon. It can perform tasks such as creating and configuring clusters, modifying running cluster configurations, and remotely managing cluster status.

    pcsd Web UI (Graphical User Interface): Offers a graphical interface for creating and configuring Pacemaker/Corosync clusters, with the same capabilities as the pcs CLI. It can be accessed via https://nodename:2224. For pcsd to function, TCP port 2224 must be open on all nodes for the pcsd Web UI and node-to-node communication.
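    As a sketch, bootstrapping a two-node cluster with pcs might look like the following (RHEL 8-style syntax; node names, cluster name, and the password are placeholders):

    ```shell
    # Authenticate pcsd on all nodes as the hacluster user
    pcs host auth node1.example.com node2.example.com -u hacluster

    # Create the cluster, then start and enable it on all nodes
    pcs cluster setup mycluster node1.example.com node2.example.com
    pcs cluster start --all
    pcs cluster enable --all
    ```

    On RHEL 7 the setup command takes the form pcs cluster setup --name mycluster ... instead.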

    Essential Cluster Concepts

    Fencing: If communication with a node fails, other nodes must be able to restrict or release access to shared resources. This is achieved via an external method called fencing, using a fence device (also known as a STONITH device). STONITH ensures data safety by guaranteeing a node is truly offline before shared data is accessed by another node, and it forces a node offline when its services cannot be stopped. Red Hat only supports clusters with STONITH enabled (stonith-enabled=true). Fencing can be configured with multiple devices using fencing levels.
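    A fence device is configured as a STONITH resource. A minimal sketch using an IPMI-based fence agent (the device name, address, and credentials below are placeholders; available parameters depend on the agent):

    ```shell
    # Discover available fence agents and their parameters
    pcs stonith list
    pcs stonith describe fence_ipmilan

    # Create an IPMI fence device covering node1 (values are placeholders)
    pcs stonith create fence-node1 fence_ipmilan \
        pcmk_host_list="node1.example.com" \
        ip="10.0.0.101" username="admin" password="secret"

    # Red Hat only supports clusters with STONITH enabled
    pcs property set stonith-enabled=true
    ```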

    Quorum: Cluster systems use quorum to prevent data corruption and loss by ensuring that more than half of the cluster nodes are online. Pacemaker, by default, stops all resources if quorum is lost. Quorum is established via a voting system. The votequorum service, along with fencing, prevents “split-brain” scenarios, in which parts of the cluster act independently of each other and can corrupt data. For GFS2 clusters, no-quorum-policy must be set to freeze to prevent fencing and allow the cluster to wait for quorum to be regained.

    Quorum Devices: A separate quorum device can be configured to allow a cluster to sustain more node failures than standard quorum rules, especially recommended for clusters with an even number of nodes (e.g., two-node clusters).
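    Assuming the net quorum device model and a hypothetical qdevice host running corosync-qnetd, adding a quorum device to an even-node cluster might look like:

    ```shell
    # From a cluster node, add the external quorum device
    # (qdevice.example.com is a placeholder; ffsplit suits two-node clusters)
    pcs quorum device add model net host=qdevice.example.com algorithm=ffsplit

    # Verify the quorum configuration and current votes
    pcs quorum status
    ```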

    Cluster Resource Configuration

    A cluster resource is an instance of a program, data, or application managed by the cluster service, abstracted by agents that provide a standard interface.

    Resource Creation: Resources are created using the pcs resource create command. They can be of various classes, including OCF, LSB, systemd, and STONITH.
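    For example, creating a floating IP resource from the OCF class (the IP address and netmask are placeholders):

    ```shell
    # Create a virtual IP managed by the cluster, with a 30-second health check
    pcs resource create VirtualIP ocf:heartbeat:IPaddr2 \
        ip=192.168.0.120 cidr_netmask=24 op monitor interval=30s

    # Discover available agents and inspect an agent's parameters
    pcs resource list ocf:heartbeat
    pcs resource describe ocf:heartbeat:IPaddr2
    ```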

    Meta Options: Resource behavior is controlled via meta-options, such as priority (for resource preference), target-role (desired state), is-managed (cluster control), resource-stickiness (preference to stay on current node), requires (conditions for starting), migration-threshold (failures before migration), and multiple-active (behavior if resource is active on multiple nodes).
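    Meta options can be set at creation time or on an existing resource; a sketch with illustrative values (VirtualIP is a hypothetical resource name):

    ```shell
    # Prefer staying on the current node; move after three failures
    pcs resource meta VirtualIP resource-stickiness=100 migration-threshold=3

    # Temporarily take the resource out of cluster control
    pcs resource meta VirtualIP is-managed=false
    ```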

    Monitoring Operations: All resources can have monitoring operations defined to ensure their health. If not specified, a default monitoring operation is added. Multiple monitoring operations can be configured with different check levels and intervals.
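    A sketch of layered monitors on a hypothetical WebSite resource (each monitor needs a distinct interval; whether OCF_CHECK_LEVEL deepens the check depends on the resource agent):

    ```shell
    # A quick shallow check plus a less frequent, deeper check
    pcs resource op add WebSite monitor interval=60s OCF_CHECK_LEVEL=10
    pcs resource op add WebSite monitor interval=10min OCF_CHECK_LEVEL=20
    ```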

    Resource Groups: A common configuration involves grouping resources that need to be located together, start sequentially, and stop in reverse order. Constraints can be applied to the group as a whole.
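    For instance, grouping a virtual IP with a web server (both names hypothetical) so they start in order, stop in reverse, and stay together:

    ```shell
    # VirtualIP starts first, WebSite second; both run on the same node
    pcs resource group add webgroup VirtualIP WebSite
    ```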

    Constraints: Determine resource behavior within the cluster.

    Location Constraints: Define which nodes a resource can run on, allowing preferences or avoidance. They can be used to implement “opt-in” (resources don’t run anywhere by default) or “opt-out” (resources can run anywhere by default) strategies.

    Ordering Constraints: Define the sequence in which resources start and stop.

    Colocation Constraints: Determine where resources are placed relative to other resources. The influence option (RHEL 8.4+) determines whether primary resources move with dependent resources upon failure.
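    One illustrative example of each constraint type, using hypothetical resource and node names:

    ```shell
    # Location: prefer node1 with a score of 50 (INFINITY would be mandatory)
    pcs constraint location WebSite prefers node1.example.com=50

    # Ordering: start the IP before the web server
    pcs constraint order start VirtualIP then start WebSite

    # Colocation: always run the web server where the IP is
    pcs constraint colocation add WebSite with VirtualIP INFINITY
    ```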

    Cloned Resources: Allow a resource to be active on multiple nodes simultaneously (e.g., for load balancing). Clones are slightly sticky by default, preferring to stay on their current node.

    Multistate (Master/Slave) Resources: A specialization of clones where instances can operate in two modes: Master and Slave. They also have stickiness by default.
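    Sketched pcs commands for both variants (resource names hypothetical; on RHEL 8, multistate resources are created as promotable clones):

    ```shell
    # Run a copy of the resource on every node
    pcs resource clone WebServer

    # Create a promotable (master/slave) clone
    pcs resource promotable Database
    ```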

    LVM Logical Volumes: Supported in two cluster configurations: High Availability LVM (HA-LVM) for active/passive failover (single node access) and LVM volumes using lvmlockd for active/active configurations (multiple node access). Both must be configured as cluster resources and managed by Pacemaker. For RHEL 8.5+, vgcreate --setautoactivation n is used to prevent automatic activation outside Pacemaker.

    GFS2 File Systems: Can be configured in a Pacemaker cluster, requiring the dlm (Distributed Lock Manager) and lvmlockd resources. The no-quorum-policy for GFS2 clusters should be set to freeze.
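    A sketch of the supporting resources for GFS2, following the pattern in Red Hat's documentation (group and resource names illustrative; dlm maps to the ocf:pacemaker:controld agent):

    ```shell
    # dlm and lvmlockd must run on every node that mounts GFS2,
    # so they are grouped and the group is cloned
    pcs resource create dlm --group locking ocf:pacemaker:controld \
        op monitor interval=30s on-fail=fence
    pcs resource create lvmlockd --group locking ocf:heartbeat:lvmlockd \
        op monitor interval=30s on-fail=fence
    pcs resource clone locking interleave=true

    # GFS2 clusters must wait for quorum rather than fence on quorum loss
    pcs property set no-quorum-policy=freeze
    ```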

    Virtual Domains as Resources: Virtual machines managed by libvirt can be configured as cluster resources using the VirtualDomain resource type. Once configured, they should only be started, stopped, or migrated via cluster tools. Live migration is possible if allow-migrate=true is set.
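    A sketch of managing a libvirt guest with live migration enabled (the guest name, hypervisor URI, and config path are placeholders):

    ```shell
    # Let Pacemaker manage the VM; allow-migrate enables live migration
    pcs resource create guest1-vm VirtualDomain \
        hypervisor="qemu:///system" \
        config="/etc/libvirt/qemu/guest1.xml" \
        migration_transport=ssh \
        meta allow-migrate=true
    ```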

    Pacemaker Remote Nodes: Allows nodes not running corosync (e.g., virtual guests, remote hosts) to integrate into the cluster and have their resources managed. Connections are secured using TLS with a pre-shared key (/etc/pacemaker/authkey).

    Pacemaker Bundles (Docker Containers): Pacemaker can launch Docker containers as bundles, encapsulating resources within them. The container image must include the pacemaker_remote daemon.

    Managing Resources and Cluster State

    Displaying Status: pcs resource status displays configured resources. pcs status --full shows detailed cluster status, including online/offline nodes and resource states.

    Clearing Failure Status: pcs resource cleanup resets a resource’s status and failcount after a failure is resolved. pcs resource refresh re-probes the live state of resources whether or not any failures are recorded.
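    The day-to-day status and recovery commands in brief (WebSite is a hypothetical resource):

    ```shell
    # Show configured resources and full cluster state
    pcs resource status
    pcs status --full

    # After fixing the underlying problem, clear failure history and failcount
    pcs resource cleanup WebSite

    # Re-probe the live state of all resources
    pcs resource refresh
    ```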

    Moving Resources: Resources can be manually moved using pcs resource move or pcs resource relocate. Resources can also be configured to move after a set number of failures (migration-threshold) or due to connectivity changes by using a ping resource and location constraints.

    Enabling/Disabling/Banning Resources: Resources can be disabled (pcs resource disable) to manually stop them and prevent the cluster from starting them. They can be re-enabled (pcs resource enable). pcs resource ban prevents a resource from running on a specific node.
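    Sketches of the manual placement commands above (resource and node names hypothetical):

    ```shell
    # Move a resource to a specific node
    pcs resource move WebSite node2.example.com

    # Keep a resource off a particular node
    pcs resource ban WebSite node1.example.com

    # Stop a resource and prevent restarts, then hand it back to the cluster
    pcs resource disable WebSite
    pcs resource enable WebSite
    ```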

    Unmanaged Mode: Resources can be set to unmanaged mode, meaning Pacemaker will not start or stop them, while still keeping them in the configuration.

    Node Standby Mode: A node can be put into standby mode (pcs node standby) to prevent it from hosting resources, effectively moving its active resources to other nodes. This is useful for maintenance or testing.

    Cluster Maintenance Mode: The entire cluster can be put into maintenance mode (pcs property set maintenance-mode=true) to stop all services from being started or stopped by Pacemaker until the mode is exited.
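    Both maintenance mechanisms in command form (node name is a placeholder):

    ```shell
    # Drain one node; its active resources move to other nodes
    pcs node standby node1.example.com
    pcs node unstandby node1.example.com

    # Freeze the whole cluster: Pacemaker starts and stops nothing
    pcs property set maintenance-mode=true
    pcs property set maintenance-mode=false
    ```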

    Updating Clusters: Updates to the High Availability Add-On packages can be performed via rolling updates (one node at a time) or by stopping the entire cluster, updating all nodes, and then restarting. When stopping pacemaker_remote on a remote/guest node for maintenance, disable its connection resource first to prevent monitor failures and allow its resources to migrate gracefully.

    Disaster Recovery Clusters: Two clusters can be configured for disaster recovery, with one as the primary and the other as the recovery site. Resources are typically run in production on the primary and in demoted mode (or not at all) on the recovery site, with data synchronisation handled by the applications themselves. pcs dr commands (RHEL 8.2+) allow displaying status of both clusters from a single node.