Concept and Practice of Fault-Tolerant Distributed Systems

Abstract

Many fault-tolerant distributed systems have been developed over the years, mainly for mission-critical applications such as scientific computation, business transaction processing, and military systems. This lecture revisits some of these existing systems. The systems are introduced using an abstract model, which uses the 'depend' relation to specify the interactions among server and client components. The commonly used assumptions about the failure semantics of systems and subsystems are described, along with techniques to enforce and validate a given failure semantics. Examples of these issues are drawn from the practical systems introduced.