eG Monitoring
 

Measures reported by RedisClstrStatusTest

Redis Cluster provides a way to run a Redis installation where data is automatically sharded across multiple Redis nodes.

Redis Cluster also provides some degree of availability during partitions; in practical terms, this is the ability to continue operating when some nodes fail or are unable to communicate. However, the cluster stops operating in the event of larger failures (for example, when the majority of masters are unavailable).

Redis Cluster does not use consistent hashing, but a different form of sharding where every key is conceptually part of what we call a hash slot.

Every node in a Redis Cluster is responsible for a subset of the hash slots, so for example you may have a cluster with 3 nodes, where:

  • Node A contains hash slots from 0 to 5500.

  • Node B contains hash slots from 5501 to 11000.

  • Node C contains hash slots from 11001 to 16383.

This makes it easy to add and remove nodes in the cluster.
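The slot assignment above can be sketched in a few lines. This is only an illustration: the text does not say how a key is mapped to a slot, so the CRC16 (XMODEM variant) modulo 16384 rule used here is taken from the Redis Cluster specification, and the names `key_slot`, `node_for_slot` and `SLOT_RANGES` are made up for this example. Hash tags are ignored.

```python
import binascii

TOTAL_SLOTS = 16384  # slots 0..16383, matching the ranges in the example

# Slot ranges of the three-node example (inclusive bounds).
SLOT_RANGES = {
    "A": (0, 5500),
    "B": (5501, 11000),
    "C": (11001, 16383),
}


def key_slot(key: bytes) -> int:
    """Map a key to a hash slot: CRC16 (XMODEM) of the key, modulo 16384.

    Assumption: hash tags ({...}) are not handled in this sketch.
    """
    return binascii.crc_hqx(key, 0) % TOTAL_SLOTS


def node_for_slot(slot: int) -> str:
    """Return the example node whose slot range contains the given slot."""
    for node, (low, high) in SLOT_RANGES.items():
        if low <= slot <= high:
            return node
    raise ValueError(f"slot {slot} is outside 0..{TOTAL_SLOTS - 1}")
```

For instance, `node_for_slot(key_slot(b"foo"))` tells you which of the three example nodes would serve the key `foo`.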

In order to remain available when a subset of master nodes fail or are unable to communicate with the majority of nodes, Redis Cluster uses a master-slave model where every hash slot has from 1 (the master itself) to N replicas (N-1 additional slave nodes).

In our example cluster with nodes A, B, C, if node B fails the cluster is not able to continue, since we no longer have a way to serve hash slots in the range 5501-11000.

However, if we add a slave node to every master when the cluster is created (or at a later time), the final cluster is composed of A, B, C, which are master nodes, and A1, B1, C1, which are slave nodes. This way, the system is able to continue if node B fails.

Because node B1 replicates B, if B fails the cluster will promote node B1 to be the new master and will continue to operate correctly.

However, note that if nodes B and B1 fail at the same time, Redis Cluster is not able to continue to operate.
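The failure scenarios described above can be modelled with a small sketch. It assumes the example topology (masters A, B, C with one slave each) and only checks whether some copy of each slot range survives; it does not model the election that Redis Cluster performs when promoting a slave. The names `REPLICAS`, `SLOTS`, `surviving_owner` and `uncovered_ranges` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Example topology: each master and the slave nodes that replicate it.
REPLICAS: Dict[str, List[str]] = {"A": ["A1"], "B": ["B1"], "C": ["C1"]}

# Slot range (inclusive) served by each master in the example.
SLOTS: Dict[str, Tuple[int, int]] = {
    "A": (0, 5500),
    "B": (5501, 11000),
    "C": (11001, 16383),
}


def surviving_owner(master: str, failed: Set[str]) -> Optional[str]:
    """Return the node that can serve the master's slots, or None if all copies failed."""
    if master not in failed:
        return master
    for slave in REPLICAS[master]:
        if slave not in failed:
            return slave  # this slave would be promoted to master
    return None


def uncovered_ranges(failed: Set[str]) -> List[Tuple[int, int]]:
    """List the slot ranges that no surviving node can serve."""
    return [SLOTS[m] for m in SLOTS if surviving_owner(m, failed) is None]
```

With this model, `uncovered_ranges({"B"})` is empty because B1 takes over, whereas `uncovered_ranges({"B", "B1"})` returns the range 5501-11000, which is the situation in which the cluster cannot continue.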

To avoid this, administrators must monitor the Redis cluster, understand how many master nodes it is composed of, track the status of the hash slots assigned to each node, and be promptly alerted if any hash slot fails. To achieve this, administrators can use the RedisClstrStatusTest.

For a cluster-enabled Redis instance, this test reports the composition of the cluster in terms of the number of master nodes and hash slots assigned to the cluster. In addition, the test tracks the status of the hash slots, and notifies administrators if any hash slot fails. The test also alerts administrators if any node is added to or removed from the cluster.

Outputs of the test: One set of results for the cluster-enabled instance being monitored.

The measures made by this test are as follows:

Measurement Description Measurement Unit Interpretation
cluster_enabled Indicates whether or not the cluster feature is enabled for the target Redis instance.   If the instance is cluster-enabled, then this measure will report the value Yes. For a cluster-disabled instance, this measure will report the value No.

The values reported by this measure and their numeric equivalents are available in the table below:

Measure Value    Numeric Value
Yes              1
No               0


Note:

This measure reports the Measure Values listed in the table above to indicate whether or not the target instance is cluster-enabled. However, in the graph, this measure is indicated using the Numeric Values listed in the table above.
cluster_slots_ok Indicates the number of hash slots in the cluster that are in the OK state. Number If the value of this measure is the same as the value of the Number of hash slots assigned to cluster measure (cluster_slots_assigned), it means that all hash slots mapped to all nodes in the cluster are working correctly.

On the other hand, if the value of this measure is lower than the value of the Number of hash slots assigned to cluster measure, it means that the hash slots mapped to some nodes are in the FAIL or PFAIL state. You may want to look up the values of the Number of hash slots in PFAIL state (cluster_slots_pfail) and Number of hash slots in FAIL state (cluster_slots_fail) measures to confirm this.
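The comparison described above can be illustrated as follows. The measure names in this test match the field names in the reply of the Redis `CLUSTER INFO` command, so this sketch parses text in that `name:value` format. Whether the test itself collects its data this way is an assumption, and the sample values, `parse_cluster_info` and `unhealthy_slots` are made up for illustration.

```python
from typing import Dict

# Illustrative reply in the "name:value" format of CLUSTER INFO (values are made up).
SAMPLE_CLUSTER_INFO = """\
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
"""


def parse_cluster_info(text: str) -> Dict[str, str]:
    """Split each 'name:value' line into a dictionary entry; skip anything else."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep:
            fields[name] = value
    return fields


def unhealthy_slots(fields: Dict[str, str]) -> int:
    """Number of assigned slots that are not in the OK state."""
    return int(fields["cluster_slots_assigned"]) - int(fields["cluster_slots_ok"])
```

A result of 0 from `unhealthy_slots` corresponds to the healthy case; any positive result is a cue to check the PFAIL and FAIL slot counts.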
cluster_slots_pfail Indicates the number of hash slots that are mapped to a node in PFAIL state. Number Ideally, the value of this measure should be very low or 0.

A node flags another node with the PFAIL flag when that node has not been reachable for more than NODE_TIMEOUT time. Both master and slave nodes can flag another node as PFAIL, regardless of its type.

Note that those hash slots still work correctly, as long as the PFAIL state is not promoted to FAIL by the failure detection algorithm. PFAIL only means that we are currently not able to talk with the node, but this may be just a transient error.
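The PFAIL rule can be expressed as a simple timeout check. The sketch below is a simplification under stated assumptions: times are plain seconds, the "last reply" timestamp stands in for whatever reachability evidence a node keeps, and the name `pfail_peers` is hypothetical.

```python
from typing import Dict, Set


def pfail_peers(last_reply_at: Dict[str, float], now: float, node_timeout: float) -> Set[str]:
    """Return the peers this node would flag as PFAIL.

    A peer is flagged when it has been unreachable for more than node_timeout,
    approximated here as 'no reply seen for longer than node_timeout'.
    """
    return {
        peer
        for peer, seen_at in last_reply_at.items()
        if now - seen_at > node_timeout
    }
```

For example, with a timeout of 15 seconds, a peer last heard from 20 seconds ago is flagged, while one heard from 6 seconds ago is not.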
cluster_stats_messages_sent Indicates the number of messages sent via the cluster node-to-node binary bus. Number All the cluster nodes are connected using a TCP bus and a binary protocol, called the Redis Cluster Bus. Every node is connected to every other node in the cluster using the cluster bus. Nodes use a gossip protocol to propagate information about the cluster in order to discover new nodes, to send ping packets to make sure all the other nodes are working properly, and to send cluster messages needed to signal specific conditions. The cluster bus is also used in order to propagate Pub/Sub messages across the cluster and to orchestrate manual failovers when requested by users (manual failovers are failovers which are not initiated by the Redis Cluster failure detector, but by the system administrator directly).
cluster_stats_messages_received Indicates the number of messages received via the cluster node-to-node binary bus. Number
cluster_state Indicates the current state of the cluster.   This measure can report any of the following values:

  • OK: If the node is able to receive queries, then this measure will report the value OK.

  • Fail: If at least one hash slot is unbound (no node associated) or in an error state (the node serving it is flagged with the FAIL flag), or if the majority of masters can't be reached by this node, then this measure will report the value Fail.

The numeric values that correspond to the measure values discussed above are as follows:

Measure Value    Numeric Value
Fail             0
OK               1


Note:

This measure reports the Measure Values listed in the table above to indicate the cluster state. However, in the graph, this measure is indicated using the Numeric Values listed in the table above.
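The conditions behind the OK and Fail values can be sketched as a small function. This is an interpretation, not the actual implementation: "the majority of masters can't be reached" is read here as "this node cannot reach a majority of the masters", and the names `cluster_state` and `STATE_NUMERIC` are hypothetical.

```python
from typing import Dict, Optional, Set

# Numeric equivalents from the table above.
STATE_NUMERIC = {"Fail": 0, "OK": 1}


def cluster_state(
    slot_owner: Dict[int, Optional[str]],
    failed_nodes: Set[str],
    masters: Set[str],
    reachable_masters: Set[str],
) -> str:
    """Derive the state reported by a node from the conditions in the text."""
    for owner in slot_owner.values():
        # Unbound slot, or slot served by a node flagged FAIL.
        if owner is None or owner in failed_nodes:
            return "Fail"
    # Assumption: a majority means more than half of the masters.
    needed = len(masters) // 2 + 1
    if len(reachable_masters & masters) < needed:
        return "Fail"
    return "OK"
```

`STATE_NUMERIC[cluster_state(...)]` then gives the value that would be plotted in the graph.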
cluster_known_nodes Indicates the number of nodes in the cluster. Number To know the details of the nodes in the cluster, use the detailed diagnosis of this measure.
cluster_slots_assigned Indicates the total number of hash slots assigned to the cluster. Number  
cluster_size Indicates the number of master nodes in the cluster. Number  
cluster_slots_fail Indicates the number of hash slots that are mapped to a node in FAIL state. Number Every node sends gossip messages to every other node including the state of a few random known nodes. Every node eventually receives a set of node flags for every other node. This way every node has a mechanism to signal other nodes about failure conditions they have detected.

A PFAIL condition is escalated to a FAIL condition when the following set of conditions is met:

  • Some node, say node A, has another node B flagged as PFAIL.

  • Node A collected, via gossip sections, information about the state of B from the point of view of the majority of masters in the cluster.

  • The majority of masters signaled the PFAIL or FAIL condition within NODE_TIMEOUT * FAIL_REPORT_VALIDITY_MULT time. (The validity factor is set to 2 in the current implementation, so this is just two times the NODE_TIMEOUT time).



If all the above conditions are true, Node A will:

  • Mark the node as FAIL.

  • Send a FAIL message to all the reachable nodes.



Ideally, therefore, the value of this measure should be 0.
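The escalation conditions listed for this measure can be sketched as a single check performed by node A. The sketch assumes times in seconds and that the caller supplies the failure reports collected via gossip (including node A's own view if A is a master). The names `should_mark_fail` and `reports` are hypothetical.

```python
from typing import Dict, Set

FAIL_REPORT_VALIDITY_MULT = 2  # validity factor stated in the text


def should_mark_fail(
    locally_pfail: bool,
    reports: Dict[str, float],
    masters: Set[str],
    now: float,
    node_timeout: float,
) -> bool:
    """Decide whether node A escalates node B from PFAIL to FAIL.

    reports maps a reporting master to the time its PFAIL/FAIL report was received.
    """
    if not locally_pfail:
        return False  # condition 1: A itself must have B flagged as PFAIL
    window = node_timeout * FAIL_REPORT_VALIDITY_MULT
    # Keep only reports from masters that are still within the validity window.
    fresh = {m for m, received_at in reports.items()
             if m in masters and now - received_at <= window}
    # Conditions 2 and 3: a majority of masters reported the failure in time.
    return len(fresh) > len(masters) // 2
```

When the function returns True, node A would mark B as FAIL and send a FAIL message to all reachable nodes.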
cluster_added_nodes Indicates the number of nodes added to the cluster. Number Use the detailed diagnosis of this measure to know which nodes were recently added to the cluster.
cluster_deleted_nodes Indicates the number of nodes deleted from the cluster. Number Use the detailed diagnosis of this measure to know which nodes were recently deleted from the cluster.
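One simple way to arrive at the added and deleted node counts is to compare the set of node identifiers seen in two successive measurement periods. How the test actually detects these changes is not stated, so the following is only a sketch, and `diff_nodes` is a hypothetical name.

```python
from typing import Set, Tuple


def diff_nodes(previous: Set[str], current: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (added, deleted) node identifiers between two snapshots."""
    added = current - previous
    deleted = previous - current
    return added, deleted
```

The sizes of the two returned sets would correspond to the counts, and their contents to the kind of information a detailed diagnosis would list.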
cluster_slave_nodes Indicates the number of slave nodes in the cluster. Number
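The master and slave counts can be cross-checked against the reply of the Redis `CLUSTER NODES` command, in which the third field of every line is a comma-separated list of flags such as `master`, `slave`, `fail?` (PFAIL) and `fail`. The sketch below counts nodes by flag. The sample text uses shortened, made-up node IDs and addresses, and `count_roles` is a hypothetical helper, not part of the test.

```python
from collections import Counter

# Made-up sample in CLUSTER NODES layout:
# <id> <address> <flags> <master-id> <ping-sent> <pong-recv> <epoch> <link-state> [slots]
SAMPLE_CLUSTER_NODES = """\
node-a 10.0.0.1:6379@16379 myself,master - 0 0 1 connected 0-5500
node-b 10.0.0.2:6379@16379 master - 0 0 2 connected 5501-11000
node-c 10.0.0.3:6379@16379 master,fail? - 0 0 3 connected 11001-16383
node-a1 10.0.0.4:6379@16379 slave node-a 0 0 1 connected
node-b1 10.0.0.5:6379@16379 slave node-b 0 0 2 connected
node-c1 10.0.0.6:6379@16379 slave node-c 0 0 3 connected
"""


def count_roles(text: str) -> Counter:
    """Count masters, slaves, and nodes flagged PFAIL ('fail?') or FAIL ('fail')."""
    counts: Counter = Counter()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # not a complete node line
        flags = set(parts[2].split(","))
        if "master" in flags:
            counts["master"] += 1
        elif "slave" in flags:
            counts["slave"] += 1
        if "fail?" in flags:
            counts["pfail"] += 1
        if "fail" in flags:
            counts["fail"] += 1
    return counts
```

For the sample above, the function counts three masters and three slaves, with one master currently flagged PFAIL.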