Distributed Fault-Tolerance

David Powell
LAAS-CNRS, Toulouse, France
david.powell@laas.fr

Full paper available as a PostScript file.

A distributed system can be defined as a set of computing nodes, interconnected by a communication network, that cooperate to carry out some common work. The nodes can typically be considered independent from the viewpoint of failures, so distribution can be a useful framework for providing fault tolerance. However, the distribution of system state and other dependencies between cooperating tasks also mean that a distributed service can easily be disrupted if any of the nodes involved should fail. With the added potential complication of unreliable communication, distributed services often need to be made fault-tolerant if they are to be useful. These two opposing facets of the relationship between fault tolerance and distribution are strong motivations for distributed fault-tolerance techniques. An important characteristic of these techniques is that error processing and fault treatment are implemented mainly in software, using distributed message-passing algorithms. The tutorial will address distributed system models and assumptions, consistency issues, and fault tolerance based on recovery and replication.

Speaker's biography

Dr. David Powell holds the position of "Directeur de Recherche" at LAAS-CNRS, Toulouse, where he leads the Dependable Computing and Fault Tolerance Research Group. He has written over ninety papers for international and national journals and conferences, is co-author of two books on dependable computing, and holds a patent for a fault- and damage-tolerant network for data transmission. His current research concerns distributed algorithms for software-implemented fault tolerance, stochastic Petri net modelling for dependability evaluation, and the use of fault injection for estimating fault-tolerance coverage. He has managed several national and European research contracts. In particular, he was the Scientific Director of the Delta-4 Esprit project and is currently chief architect of the GUARDS Esprit project. He has acted as a consultant for several aerospace, telecommunication and data processing companies in France and for the European Commission. Dr. Powell is a member of ACM and IEEE.