Software Fault Tolerance

Short Description

Software fault tolerance, concepts and implementation. Failure classification; information and time redundancy; forward and backward error recovery; error confinement; idealized fault-tolerant component; sequential and concurrent systems; exception handling; transactions and atomic actions; voting; design diversity. Case studies.


Instructor

Jörg Kienzle

McConnell Engineering, Room 327

Phone: 514-398-2040


Office hours: Monday 11:30 - 12:30

Teaching Assistant

Wisam Al Abed

McConnell Engineering, Room 322

Phone: 514-398-7071 ext. 00116


Office hours: upon request


Prerequisites

COMP 409 - Concurrent Programming, or consent of the instructor

Textbooks That Could Be Helpful

  1. Laura L. Pullum: Software Fault Tolerance: Techniques and Implementation, Artech House, Norwood, MA, 2001. ISBN: 1-58053-137-7

     This book presents recovery blocks and n-version programming in detail, along with other advanced fault tolerance models based on these two initial models.

  2. Jörg Kienzle: Open Multithreaded Transactions: A Transaction Model for Concurrent Object-Oriented Programming. Kluwer Academic Publishers, 2003. ISBN 1-4020-1727-8

     This book gives a nice overview of classic and advanced transaction models, and explains open multithreaded transactions in detail. It also describes the design of OPTIMA, an object-oriented framework providing support for transactions to concurrent object-oriented programming languages. The most important programming language features used for implementing transactions are also covered.

  3. Lee, P. A.; Anderson, T.: Fault Tolerance - Principles and Practice, 2nd edition, Springer Verlag, 1990.

     This book covers all parts of the course, but is a little outdated. In particular, recent developments in the field of advanced transaction and atomic action models are not addressed, and it does not go into implementation details.

  4. Ramamritham, K.; Chrysanthis, P. K.: Advances in Concurrency Control and Transaction Processing, ACM Press, Los Alamitos, California, 1997.

     This book covers in detail the different transaction models and concurrency control techniques employed in transaction processing.

  5. Jean-Claude Geffroy and Gilles Motet: Design of Dependable Computing Systems, Kluwer Academic Publishers, 2002. ISBN 1-4020-0437-0

     This book does a very good job of presenting the fundamental concepts of fault tolerance. It also goes into detail on fault avoidance and fault removal.


Grading

  1. 4 homework assignments (55%)

     - 1 warmup assignment (1 x 5%)

     - 2 programming assignments (2 x 20%)

     - 1 dependability-focused requirements engineering assignment (1 x 10%)

  2. Project (45%), one of:

     - Implement a software fault tolerance scheme (distributed or concurrent) as a library / framework for a programming language of your choice, or

     - Study a specific software fault tolerance scheme / middleware, or an application using software fault tolerance (e.g. Airbus, Space Shuttle, TGV, air-traffic control, nuclear power plant, etc.), and present it in class

Note on Academic Integrity

McGill University values academic integrity. Therefore, all students must understand the meaning and consequences of cheating, plagiarism and other academic offences under the Code of Student Conduct and Disciplinary Procedures (see for more information).

Last modified: December 3, 2013, Jörg Kienzle
