Fault Tolerant Computing
Fault Tolerant Computing is an activity of the PARALLEL TEMPUS project.
Available kinds of materials:
- Document in HTML format
- Document in (Gzipped) Postscript
- Transparencies in (Gzipped) Postscript
Overview
Documents
- Basic terminology by István Majzik (9 pages Gzipped Postscript 40k)
Motivation by dr. András Pataricza (12 slides Gzipped Postscript 34k)
Software Based Fault Tolerance by István Majzik (75 slides Gzipped Postscript 90k)
- The dependability concept
- On the impairments of dependability
- On the attributes of dependability
- On the means of dependability
- Redundancy techniques
- Modelling of fault tolerant systems by György Csertán (40 pages Gzipped Postscript 126k)
Fault Modeling by dr. András Pataricza (15 slides Gzipped Postscript 41k)
Automatic Test Generation by dr. András Pataricza (21 slides Gzipped Postscript 61k)
- Introduction
- The modeling approach
- Dataflow networks
- Informal presentation of the model
- Formalism of dataflow networks
- Fault modeling
- Uncertainty modeling
- Features of the approach
- An example
- Model refinement
- Approaches to model refinement
- Dataflow refinement
- Domain refinement
- Structure refinement
- Refinement of the example
- Appendix
- Concurrent error detection by István Majzik (28 pages Gzipped Postscript 95k)
Watchdog Processors by dr. András Pataricza (24 slides Gzipped Postscript 62k)
Dependable Multiprocessors by dr. András Pataricza (9 slides Gzipped Postscript 24k)
- Basics of error detection
- Classification of errors in microprocessor systems
- Concurrent error detection techniques
- Watchdog processors
- Control flow checking using derived signatures
- Control flow checking using assigned signatures
- Watchdog processors in parallel systems
- Conclusion
- Master-checker mode by dr. András Pataricza (9 pages Gzipped Postscript 31k)
Master-Checker Setup by dr. András Pataricza (14 slides Gzipped Postscript 35k)
CPU Testing by dr. András Pataricza (12 slides Gzipped Postscript 82k)
- The master-checker principle
- Fault coverage
- Error latency
- Comparison of the different self-test techniques
- Memory protection by dr. András Pataricza (15 pages Gzipped Postscript 40k)
Memory Testing by dr. András Pataricza (17 slides Gzipped Postscript 54k)
- Introduction
- Information vs. time redundancy
- Error detecting codes in compact testing
- Overview of RAID by András Petri (14 pages Gzipped Postscript 113k)
Overview of RAID by András Petri (8 pages Gzipped Postscript 85k)
- Overview
- Disk array basics
- Data striping and redundancy
- Basic RAID organizations
- Performance and cost comparisons
- Comparisons
- Reliability
- Implementation considerations
- Statistical techniques for analyzing fault-tolerant systems by András Petri (8 pages Gzipped Postscript 70k)
- Introduction
- Statistical techniques
- Parameter estimation
- Distribution characterization
- Software fault tolerance by István Majzik (13 pages Gzipped Postscript 62k)
Software Based Fault Tolerance by István Majzik (75 slides Gzipped Postscript 90k)
- Exception handling
- The recovery block scheme
- The N-version programming scheme
- The N-self-checking programming scheme
- Self-configuring optimistic programming
- Language support for software fault tolerance
- Hardware architecture and software fault tolerance
- Comparison of the schemes
- System-Level Fault Diagnosis by Tamás Bartha (80 pages Gzipped Postscript 188k)
Integrated Diagnostics by dr. András Pataricza (29 slides Gzipped Postscript 98k)
- Introduction
- System model
- Fault models
- Testing models
- Diagnostic algorithms
- Classification of diagnostic algorithms
- Deterministic algorithms
- Probabilistic algorithms
- AI based diagnosis
Additional transparencies
- Components of Dependable Distributed Systems by Tamás Bartha
Transparencies (27 slides Gzipped Postscript 93k)
Abstract (HTML)
- Concept and Practice of Fault-Tolerant Distributed Systems by Tamás Bartha
Transparencies (34 slides Gzipped Postscript 159k)
Abstract (HTML)
Literature
- S. Mishra and R. D. Schlichting
Abstractions for Constructing Dependable Distributed Systems
TR 92-19, University of Arizona, 1992.
- F. Cristian
Understanding Fault-Tolerant Distributed Systems
Comm. of the ACM, vol. 34, no. 2, pp. 57-78, Feb. 1991.