Rationale
The scope, complexity, and pervasiveness of computer-based and computer-controlled systems continue to increase dramatically. The consequences of such systems failing can be severe: serious injury or loss of life, destruction of human-made and natural systems, security breaches, business failures, or lost opportunities. In modern systems, software increasingly assumes the responsibility of providing functionality and control, and therefore becomes more significant to overall system performance and dependability.
Ideally, the processes by which such software is conceptualized, created, analyzed, and tested would have advanced to the point where software could be developed without errors. Although significant progress has been achieved in recent years, for example by combining well-defined software development methods with quality assurance techniques and rigorous testing, not all errors are prevented. Even when the best people, practices, and tools are used, it would be very risky to assume that the software developed is error-free. It is therefore important to teach students the current techniques that can be used to write fault-tolerant software.
Course Contents
The goal of this course is to study the techniques that software developers can apply to produce fault-tolerant software, i.e. software that continues to deliver service in spite of the operational effects of software design faults and faults of the surrounding environment. The course not only presents the concepts, but also concentrates on implementation issues.
The first part of the course presents the need for reliability and relates fault tolerance to other reliability issues, such as verification and validation, fault prevention, and quality assurance. Then, the main concepts of fault tolerance are introduced: failure classification, types of redundancy, and types of recovery. The features available in modern programming languages to support fault tolerance are reviewed: exceptions, serialization, and threading.
The second part of the course concentrates on different forms of error recovery for sequential and concurrent systems at run-time. The notion of error confinement and the idealized fault-tolerant component are presented. Various advanced transaction and atomic action models are studied, including ways of implementing the schemes in programming languages. Advantages and disadvantages of the different models are highlighted. Design diversity, e.g. N-version programming, and data diversity are presented.
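As an illustration of design diversity, the following is a minimal sketch of N-version programming with a majority voter, written in Python. It is not taken from the course material: the function names (`isqrt_newton`, `isqrt_binary`, `isqrt_faulty`, `n_version_execute`) and the choice of integer square root as the computation are hypothetical, and a fault is deliberately seeded in one version to show the voter masking it.

```python
import math
from collections import Counter


def isqrt_newton(n):
    """Version 1: integer square root by Newton iteration."""
    if n < 2:
        return n
    x = n
    while x * x > n:
        x = (x + n // x) // 2
    return x


def isqrt_binary(n):
    """Version 2: integer square root by binary search."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def isqrt_faulty(n):
    """Version 3: deliberately seeded design fault (rounds instead of flooring)."""
    return round(math.sqrt(n))


def n_version_execute(versions, arg):
    """Run every version on the same input and return the strict-majority result.

    A version that raises an exception casts no vote. If no result is
    produced by more than half of the versions, a RuntimeError is raised.
    """
    results = []
    for version in versions:
        try:
            results.append(version(arg))
        except Exception:
            pass  # a crashed version simply does not vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(versions) // 2:
            return value
    raise RuntimeError("no majority among versions")


if __name__ == "__main__":
    versions = [isqrt_newton, isqrt_binary, isqrt_faulty]
    # For 15 the faulty version disagrees with the two correct ones,
    # and the voter masks the fault.
    print(n_version_execute(versions, 15))
```

The voter only masks faults that the versions do not share: if a majority of versions contain the same design fault, the wrong result wins the vote, which is why the versions are meant to be developed independently.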
Finally, the third part of the course presents a dependability-focussed requirements engineering process called DREP. It helps a software developer design a system, and the interactions between the system and its environment, while taking into account that faults in the environment may threaten the safety and reliability of the system under development.
Throughout the course, several case studies illustrate the concepts covered.
Last modified: November 23, 2015, Jörg Kienzle