Document Type

Thesis - University Access Only

Award Date


Degree Name

Master of Science (MS)

Department / School

Computer Science


Fault tolerance is the ability of a computer system to remain operational by recovering from, or compensating for, its software or hardware errors. As parallel and distributed systems grow larger and more important, they need fault tolerance features more than ever. Unfortunately, most systems provide no mechanisms for writing fault-tolerant programs, so programmers have to deal with faults themselves. One of the most important problems in achieving fault tolerance for parallel and distributed systems is the overhead cost of redundancy. Since redundancy is essential to fault tolerance, the overhead it introduces should be minimized. This thesis discusses the factors that affect fault tolerance overhead in parallel and distributed systems and the problem of optimizing those factors. First, we develop a fault-tolerant structure for a distributed system. Second, we construct a mathematical model of fault tolerance overhead for this structure. Third, we identify the factors that reconcile fault tolerance overhead with reliability, a trade-off that has long been a point of contention. Fourth, we optimize these factors by calculation and mathematical proof. Fifth, we validate the optimized factors experimentally by applying them to a real program. Finally, we generalize the fault-tolerant structure developed for the distributed system model.

Library of Congress Subject Headings

Fault-tolerant computing
Electronic data processing -- Distributed processing -- Mathematical models
Overhead costs



Number of Pages



South Dakota State University