Re: [Xen-devel] Is continuous replication of state possible?
On Sat, 2005-01-08 at 18:21 -0800, Jacob Gorm Hansen wrote:
> There is a paper by Bressoud and Schneider about hypervisor-based fault
> tolerance on the PA-RISC (which had precise performance counters) which
> is worth reading, I found a copy online at
> http://roc.cs.berkeley.edu/294fall01/readings/bressoud.pdf .

Thanks, I hadn't read this paper before and found it interesting. There are some good ideas in it, such as the use of the recovery register to delimit epochs of instruction execution, but I think they made a mistake in choosing a primary-backup design rather than an N-way active one, and the I/O architecture is fundamentally flawed as a result of confusion over whether the peripheral devices sit inside or outside the boundary of the replicated system.

A better approach for a fail-stop fault-tolerant system would be to choose an N-way active design and put the peripherals very clearly outside the boundary of the replicated system. With this approach (there is a rough sketch of the replica loop after the list):

1) A distributed consensus protocol is used to reach agreement on a sequence of inputs to the system.

2) The same sequence of inputs is passed to all replicas. This is the input boundary of the replicated fault-tolerant context.

3) The replicas start in the same state and receive the same sequence of inputs, so they produce the same sequence of responses. Every response, on every replica, is of the form "execute response X on node Y", so the output is identical across all replicas. This is the output boundary of the replicated fault-tolerant context.

4) The output of the replicas therefore already specifies which node must actually action it. That one node actions the output; the remaining nodes do nothing.

5) With peripherals outside the replication boundary, all communication from a peripheral to the virtual machine is passed through the distributed consensus protocol and committed to the input sequence before being processed by the replicas. All communication from the fault-tolerant virtual machine to peripherals goes through a specific node chosen by the virtual machine. If a peripheral is accessible from multiple nodes, the virtual machine sees multiple redundant paths to it and can use multi-pathing software to fail over to another path when a node fails. If a peripheral is accessible from only one node, the virtual machine can use RAID across several such peripherals attached to different nodes to build a fault-tolerant virtual peripheral.

This approach also extends to byzantine fault tolerant systems if you substitute a byzantine fault tolerant distributed consensus protocol and have each replica digitally sign each output and forward the signature to the replica responsible for actioning it, so that the actioner can prove it is operating on behalf of a quorum of replicas (second sketch below). A byzantine failure of a node then still looks like a path failure to the virtual machine, because communications from the faulty node are dropped by the recipient when the digital signatures are found to be insufficient.
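To make steps 1-4 concrete, here is a minimal sketch of the replica loop in Python. Everything in it (Replica, the state machine, the consensus log, the network object) is a made-up name for illustration, not any real API:

    # N-way active replica loop, per steps 1-4 above. All names here are
    # hypothetical stand-ins, not a real API.

    class Replica:
        def __init__(self, node_id, state_machine, consensus_log, net):
            self.node_id = node_id      # which physical node this replica runs on
            self.state = state_machine  # deterministic: same inputs -> same outputs
            self.log = consensus_log    # totally ordered inputs agreed by consensus
            self.net = net

        def run(self):
            # Input boundary: every replica consumes the identical agreed sequence.
            for event in self.log:
                # Output boundary: identical outputs on every replica, each of
                # the form "execute response X on node Y".
                for out in self.state.apply(event):
                    if out.target_node == self.node_id:
                        # Only the named node actions the output...
                        self.net.send(out.peripheral, out.payload)
                    # ...the remaining replicas do nothing with it.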
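And a sketch of the output path in the byzantine variant. The sign(), verify() and public_key_of() helpers, and the 2f+1 quorum size, are my assumptions standing in for a real signature scheme and BFT protocol underneath:

    # Byzantine variant: every replica signs each output and forwards the
    # signature to the actioning node, which acts only once it holds a
    # quorum of valid signatures. Helper names are hypothetical.

    MAX_FAULTY = 1                 # f: byzantine nodes tolerated
    QUORUM = 2 * MAX_FAULTY + 1    # e.g. 2f+1 signatures out of 3f+1 replicas

    def forward_signature(replica, out):
        # Every replica runs this for every output, including outputs it
        # will not action itself.
        sig = sign(replica.private_key, out.digest())
        replica.net.send_to_replica(out.target_node,
                                    (out.digest(), replica.node_id, sig))

    def action_if_quorum(replica, out, collected):
        # Run by the actioning node once forwarded signatures arrive.
        valid = [sig for (sender, sig) in collected
                 if verify(public_key_of(sender), out.digest(), sig)]
        if len(valid) >= QUORUM:
            replica.net.send(out.peripheral, out.payload, proof=valid)
        # Otherwise the output is dropped: to the virtual machine this is
        # indistinguishable from a path failure.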
A few other assorted comments on the paper:

With the recovery-register approach you can obviously end an epoch early if you encounter an idle instruction, since that will be deterministic across all replicas. If you emulate the CPU in software this is very easy.

The time-of-day clock is a good example of solving the problem of non-deterministic operations by asking the hypervisor and having it pass the result through the consensus protocol to all replicas. In terms of the approach outlined above, this translates into a replica output directed at a specific node, asking that node for the time; the node sends the time back through the distributed consensus protocol, to be received by all replicas.

This works for random number generation as well (as someone noted in another post on this topic), but in that case you'd want to ask a node for a big batch of random numbers in advance. The node would generate a batch and pass it through the distributed consensus protocol to the replicas, so that every replica had the same random number pool and didn't incur the consensus-protocol overhead on every request. Using a consensus protocol like PAXOS, which supports multiple concurrent ballots, would allow the pool to be replenished at a low-water mark without ever stalling the virtual machine (rough sketch at the end of this message).

"Fundamental to our design is communication among the hypervisors. This implies that the hypervisors must not be partitioned by communications failures."

PAXOS does much better than this and, I think, shows the limit of what can be achieved for fail-stop fault-tolerant systems.

"I/O Accessibility Assumption: I/O operations possible by the processor executing the primary virtual machine are also possible by the processor executing a backup virtual machine."

See the discussion above.

"We assume that the channel linking the primary and backup processors is FIFO...that the processor executing the backup detects the primary's processor failure only after receiving the last message sent by the primary's hypervisor"

The PAXOS protocol does significantly better than this too.

For anyone interested in replication, "The Part-Time Parliament" by Leslie Lamport, which describes PAXOS, is a great read. After that, I'd recommend "Practical Byzantine Fault Tolerance" by Miguel Castro and Barbara Liskov, and "Provably Secure Competitive Routing against Proactive Byzantine Adversaries via Reinforcement Learning" by Baruch Awerbuch, David Holmer and Herbert Rubens.
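As promised above, a small sketch of the replenished random-number pool, again in Python with invented names (LOW_WATER, BATCH and the replica interface are illustrative). The point is that the refill request is just another replica output, and the batch comes back through the consensus log:

    # Replenished random-number pool shared by all replicas.

    LOW_WATER = 64   # request a refill before the pool runs dry
    BATCH = 256      # numbers fetched per consensus round

    class ReplicatedRandomPool:
        def __init__(self, replica):
            self.replica = replica
            self.pool = []               # identical contents on every replica
            self.refill_pending = False

        def next_random(self):
            if len(self.pool) <= LOW_WATER and not self.refill_pending:
                # Just another replica output: ask one specific node for a
                # batch. With concurrent PAXOS ballots this proceeds while
                # the existing pool is still being consumed, so nothing stalls.
                self.replica.emit_output(target_node=self.replica.rng_node(),
                                         request=("generate_batch", BATCH))
                self.refill_pending = True
            return self.pool.pop(0)      # same consumption order everywhere

        def on_batch_committed(self, batch):
            # The batch arrives through the consensus log, so every replica
            # extends its pool identically.
            self.pool.extend(batch)
            self.refill_pending = False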
-- 
Harry Butterworth <harry@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>