In datacenters, failures are the norm not the exception.
Say the rate of failure of one machine (OS/disk/motherboard/network, etc) is once every 10 years (120 months) on average
When there are 120 servers in datacenter, MTTF of a machine is 1 month
With 12,000 servers, MTTF is 7.2 hours and with 120,000 it becomes 0.72 hours.
Hence, need for failure detector. Failing to do so will lead to
1. Data loss
2. inconsistency in system
3. Loss of revenue
4. closure of data center
Target settings for failure detection:
1. Process group-based systems (clouds/datacenters, replicated servers, distributed databases)
2. Crash-stop/Fail-stop process failures. Once a process (member of a group) fails, it stops executing any instructions from that point to infinity. This is different from crash-recovery model when processes can join and recovered with different identifiers
The main challenge of membership protocol is communication over unreliable network
2. Dissemination : Mechanism that notifies failures to other processes once detected; can also handle process joins and process leaves in a system
When one process crashes among a group of processes, atleast one non-faulty process detects the failure and disseminates to other member processes in the group so that other processes can update their membership lists
2. Almost-complete list (weakly consistent) : Eg. Gossip-style, SWIM, etc
3. Partial-random list Eg. SCAMP, T-MAN, Cyclon, etc
1. pj crashes
Nothing we can do about it. Its a frequent occurence. Common case rather than exception and frequency goes up linearly with size of datacenter.
2. Failure detectors desirable properties
a. Completeness (correctness property): Each failure is detected eventually by a non-faulty process
b. Accuracy (correctness property): There is no mistaken detection
The above two are impossible together in a lossy network. Failure detector problem is similar to other well-known problem in distributed systems called "consensus" problem.
The consensus problem is where a group of processes that are trying to decide on the value of a single bit and there is a well-known proof which shows that it's impossible to do over lossy and delayed network.
In real life, completeness is guaranteed (100%) but accuracy is partial/probabilistic guarantee (nearly 100% but never 100%)
This is correct since all failures need to be detected. However, false positives due to accuracy trade-off, might cause an overhead of falsely removing a process and rejoining with different identifier. This is still better than the other way round (100% accuracy with nearly 100% completeness since we might lose failures)
c. Speed (performance property): Time to first detection of failure
d. Scale (performance property): Equal load on each member and no bottlenecks or SPOF. Low overall network message load
All the desirable properties are required in spite of arbitrary simultaneous process failures
b. All the processes send heartbeats periodically to a central process
c. The central process maintains list of processes and heartbeats
d. If central process doesn't receive heartbeat from a process within a timeout, it marks the process as failed
Disadvantages:
- SPOF of central process (violates 100% complete desirable property)
- Central process is highly overloaded
If a neighbor process doesn't receive heartbeat from the process within a specified timeout, it marks the process as failed
It's better than centralized heartbeating since it reduces load on a single process.
Disadvantages:
- It's unpredictable on simultaneous multiple failures where in 3 consecutive processes in a ring fail. It's possible that few failures go undetected (violates 100% complete property)
- This approach has the overhead of repairing the ring in the event of failures
Advantage is it has equal load per member, though the overall load is high.
It is 100% complete since no failures go undetected.
However, if one of the processes is slow in receiving hearbeats, it's possible that lot of processes might be incorrectly marked as failed.
Hence need to make all-to-all hearbeating protocol more robust to increase the probabilistic guarantee of accuracy.
Periodically, this local membership list/table is sent to "b" targets similar to the gossip protocol.
On receipt of membership list at the target, it updates its local membership list (updating the entries for pid, if they are more recent than what it has).
For example, in a 4 process system, let's assume
P1 membership list : <{1, 10120, 66}, {2, 10103, 62}, {3, 10098, 63}, {4, 10111, 65}>
P2 membership list : <{1, 10118, 64}, {2, 10110, 64}, {3, 10090, 58}, {4, 10111, 65}>
If P1 sends its membership list to P2, on receipt P2 updates its membership list as follows
<{1, 10120, 70}, {2, 10110, 64}, {3, 10098, 70}, {4, 10111, 65}>
Note: current time is 70 at P2 (asynchronous clocks)
If a particular node/process last update entry in the membership list exceeds a threshold wrt current time (t-fail secs), that process is marked failed by the protocol.
However, that failed entry is not immediately deleted from the row. Only after T-cleanup seconds (which is close to T-fail) it will delete the member from the list.
Need for two different timeout (T-fail and T-cleanup):
Due to the nature of gossip protocol, it's possible that one process deletes the entry from its list but it could be chosen as target for gossip by another process which still has the entry, thus causing the deleted entry to be incorrectly re-added.
The two processes can incorrectly ping-pong the failed process retaining the entry without actually detecting. Hence to avoid this ping-pong affect, instead of deleting the entry immediately the process entry is left for another T-cleanup (which is typically equal to T-fail) seconds.
What happens if gossip period T-gossip is decreased?
Detection time decreases, but bandwidth/messages increase.
What happens to P-mistake(false positive rate) as T-fail,T-cleanup is increased?
Detection time increases, but P-mistake decreases because we give slightly longer time for non-faulty nodes heartbeats to go across.
Tradeoff: False positive rate (P-mistake) vs. detection time (T-gossip, T-fail, T-cleanup) vs. bandwidth
Analysis:
- All-to-all heartbeating : Load = N/T (T is heartbeat sent evert T time units)
- Gossip-variant of all-to-all heartbeating : Load = N/tg where tg is gossip period; L = N Log N / T. Note that gossip load is more than all-to-all, but has better accuracy.
- Optimal L is independent of N, but both the above are dependent on N and hence are sub-optimal
* L= O(N/T)
* Try to achieve simultaneous detection at all processes
* fail to distinguish Failure Detection and Dissemination componentns
Key:
- separate the two components
- use a non heartbeat-based Failure detection component
b. Stage 1: Process pi picks a random process pj and sends ping. Process pj responds back with an ACK
c. If either the original ping (pi -> pj) or the ACK (pj -> pi) is lost, pi tries to reach pj again, but by sending ping to "k" other random processes which ping process pj indirectly and send the ACK back to pi indirectly.
Time-bounded completeness
- key : select each membership element once as a ping target in a traversal
* round-robin pinging
* random permutation of list after each traversal
- Each failure is detected in worst case 2N-1 (local) protocol periods
- Preserves Failure Detection properties
- not enabled in all routers/switches
- multiple simultaneous multicasts especially when multiple processes detect the same failure almost at the same time
* Epidemic/gossip style dissemination
* maintain a buffer of recently joined/evicted processes (piggyback from this buffer; prefer recent updates)
* Buffer elements are garbage collected after a while
2. Indirect pinging may not solve the problem (eg. correlated message losses near pinged host)
3. Key: suspect a process before declaring it as failed in the group. If while suspect messages are being disseminated and process become alive, mark the process alive.
* inc# for pi can be represented only by pi (when it receives a <suspect, pi> message)
* somewhat similar to DSDV routing protocol
- Within an inc# : (suspect inc#) > (alive, inc #)
- (failed, inc#) overrides everything else
2. Every distributed system uses a failure detector
3. Many distributed systems use a membership service
4. Ring failure detection underlies IBM SP2 and many other similar clusters/machines
5. Gossip-style failure detection underlies Amazon EC2/S3 (rumored)
Say the rate of failure of one machine (OS/disk/motherboard/network, etc) is once every 10 years (120 months) on average
When there are 120 servers in datacenter, MTTF of a machine is 1 month
With 12,000 servers, MTTF is 7.2 hours and with 120,000 it becomes 0.72 hours.
Hence, need for failure detector. Failing to do so will lead to
1. Data loss
2. inconsistency in system
3. Loss of revenue
4. closure of data center
Target settings for failure detection:
1. Process group-based systems (clouds/datacenters, replicated servers, distributed databases)
2. Crash-stop/Fail-stop process failures. Once a process (member of a group) fails, it stops executing any instructions from that point to infinity. This is different from crash-recovery model when processes can join and recovered with different identifiers
Group Membership Service
Each application process (pi) maintains a membership list. This membership list is created, maintained and removed by membership protocol. Applications query the membership list using gossip, overlays, DHTs, etcThe main challenge of membership protocol is communication over unreliable network
Two sub-protocols
1. Failure detector : Mechanism that detects failures2. Dissemination : Mechanism that notifies failures to other processes once detected; can also handle process joins and process leaves in a system
When one process crashes among a group of processes, atleast one non-faulty process detects the failure and disseminates to other member processes in the group so that other processes can update their membership lists
Types of membership lists:
1. Complete list all the time across all the processes in the group (strongly consistent) : Eg. virtual synchrony2. Almost-complete list (weakly consistent) : Eg. Gossip-style, SWIM, etc
3. Partial-random list Eg. SCAMP, T-MAN, Cyclon, etc
Failure Detectors
Remember large group of processes, scalability is the goal, the same code runs at each process and communication is over unreliable network1. pj crashes
Nothing we can do about it. Its a frequent occurence. Common case rather than exception and frequency goes up linearly with size of datacenter.
2. Failure detectors desirable properties
a. Completeness (correctness property): Each failure is detected eventually by a non-faulty process
b. Accuracy (correctness property): There is no mistaken detection
The above two are impossible together in a lossy network. Failure detector problem is similar to other well-known problem in distributed systems called "consensus" problem.
The consensus problem is where a group of processes that are trying to decide on the value of a single bit and there is a well-known proof which shows that it's impossible to do over lossy and delayed network.
In real life, completeness is guaranteed (100%) but accuracy is partial/probabilistic guarantee (nearly 100% but never 100%)
This is correct since all failures need to be detected. However, false positives due to accuracy trade-off, might cause an overhead of falsely removing a process and rejoining with different identifier. This is still better than the other way round (100% accuracy with nearly 100% completeness since we might lose failures)
c. Speed (performance property): Time to first detection of failure
d. Scale (performance property): Equal load on each member and no bottlenecks or SPOF. Low overall network message load
All the desirable properties are required in spite of arbitrary simultaneous process failures
Failure detection protocols
1. Centralized Heartbeating
a. Heartbeat is a seq no. which is incremented (i++)b. All the processes send heartbeats periodically to a central process
c. The central process maintains list of processes and heartbeats
d. If central process doesn't receive heartbeat from a process within a timeout, it marks the process as failed
Disadvantages:
- SPOF of central process (violates 100% complete desirable property)
- Central process is highly overloaded
2. Ring Heartbeating
All processes are in a ring-like structure and each process sends heartbeat to its left and right neighbors.If a neighbor process doesn't receive heartbeat from the process within a specified timeout, it marks the process as failed
It's better than centralized heartbeating since it reduces load on a single process.
Disadvantages:
- It's unpredictable on simultaneous multiple failures where in 3 consecutive processes in a ring fail. It's possible that few failures go undetected (violates 100% complete property)
- This approach has the overhead of repairing the ring in the event of failures
3. All-to-all Heartbeating
Each process sends heartbeat to all the processes in the systemAdvantage is it has equal load per member, though the overall load is high.
It is 100% complete since no failures go undetected.
However, if one of the processes is slow in receiving hearbeats, it's possible that lot of processes might be incorrectly marked as failed.
Hence need to make all-to-all hearbeating protocol more robust to increase the probabilistic guarantee of accuracy.
4. Gossip-stype failure detection (variant of all-to-all heartbeating but more robust)
In addition to above all-to-all heartbeating semantics, each node maintains a membership list/table consisting of the <pid, seq#, local time when seq was received from pid>.Periodically, this local membership list/table is sent to "b" targets similar to the gossip protocol.
On receipt of membership list at the target, it updates its local membership list (updating the entries for pid, if they are more recent than what it has).
For example, in a 4 process system, let's assume
P1 membership list : <{1, 10120, 66}, {2, 10103, 62}, {3, 10098, 63}, {4, 10111, 65}>
P2 membership list : <{1, 10118, 64}, {2, 10110, 64}, {3, 10090, 58}, {4, 10111, 65}>
If P1 sends its membership list to P2, on receipt P2 updates its membership list as follows
<{1, 10120, 70}, {2, 10110, 64}, {3, 10098, 70}, {4, 10111, 65}>
Note: current time is 70 at P2 (asynchronous clocks)
If a particular node/process last update entry in the membership list exceeds a threshold wrt current time (t-fail secs), that process is marked failed by the protocol.
However, that failed entry is not immediately deleted from the row. Only after T-cleanup seconds (which is close to T-fail) it will delete the member from the list.
Need for two different timeout (T-fail and T-cleanup):
Due to the nature of gossip protocol, it's possible that one process deletes the entry from its list but it could be chosen as target for gossip by another process which still has the entry, thus causing the deleted entry to be incorrectly re-added.
The two processes can incorrectly ping-pong the failed process retaining the entry without actually detecting. Hence to avoid this ping-pong affect, instead of deleting the entry immediately the process entry is left for another T-cleanup (which is typically equal to T-fail) seconds.
What happens if gossip period T-gossip is decreased?
Detection time decreases, but bandwidth/messages increase.
What happens to P-mistake(false positive rate) as T-fail,T-cleanup is increased?
Detection time increases, but P-mistake decreases because we give slightly longer time for non-faulty nodes heartbeats to go across.
Tradeoff: False positive rate (P-mistake) vs. detection time (T-gossip, T-fail, T-cleanup) vs. bandwidth
Analysis:
- All-to-all heartbeating : Load = N/T (T is heartbeat sent evert T time units)
- Gossip-variant of all-to-all heartbeating : Load = N/tg where tg is gossip period; L = N Log N / T. Note that gossip load is more than all-to-all, but has better accuracy.
- Optimal L is independent of N, but both the above are dependent on N and hence are sub-optimal
* L= O(N/T)
* Try to achieve simultaneous detection at all processes
* fail to distinguish Failure Detection and Dissemination componentns
Key:
- separate the two components
- use a non heartbeat-based Failure detection component
5. SWIM (Scalable Weakly-consistent Infection-style Membership) Failure Detector Protocol : Probabilistic Failure Detector
a. Instead of heart-beating, we use pinging in each protocol period (T time units)b. Stage 1: Process pi picks a random process pj and sends ping. Process pj responds back with an ACK
c. If either the original ping (pi -> pj) or the ACK (pj -> pi) is lost, pi tries to reach pj again, but by sending ping to "k" other random processes which ping process pj indirectly and send the ACK back to pi indirectly.
Time-bounded completeness
- key : select each membership element once as a ping target in a traversal
* round-robin pinging
* random permutation of list after each traversal
- Each failure is detected in worst case 2N-1 (local) protocol periods
- Preserves Failure Detection properties
Dissemination and Suspicion
Dissemination options1. Multicast (Hardware/IP)
- unreliable- not enabled in all routers/switches
- multiple simultaneous multicasts especially when multiple processes detect the same failure almost at the same time
2. Point-to-point (TCP/UDP)
- expensive especially when there are thousands of processes3. Zero extra messages (SWIM failure detection): Piggyback on failure detector messages
- infection-style dissemination with ping, ack messages* Epidemic/gossip style dissemination
* maintain a buffer of recently joined/evicted processes (piggyback from this buffer; prefer recent updates)
* Buffer elements are garbage collected after a while
Suspicion Mechanism
1. False detections due to perturbed processes, packet losses (from congestion)2. Indirect pinging may not solve the problem (eg. correlated message losses near pinged host)
3. Key: suspect a process before declaring it as failed in the group. If while suspect messages are being disseminated and process become alive, mark the process alive.
- Distinguish multiple suspicions of a process
* per-process incarnation number* inc# for pi can be represented only by pi (when it receives a <suspect, pi> message)
* somewhat similar to DSDV routing protocol
Rules of incarnation# and "alive, suspect, failed" state machine
- Higher inc# notification over-ride lower inc#s- Within an inc# : (suspect inc#) > (alive, inc #)
- (failed, inc#) overrides everything else
Wrap Up
1. Failures are the norm, not the exception in datacenters2. Every distributed system uses a failure detector
3. Many distributed systems use a membership service
4. Ring failure detection underlies IBM SP2 and many other similar clusters/machines
5. Gossip-style failure detection underlies Amazon EC2/S3 (rumored)
 
No comments:
Post a Comment