Document Type

Thesis - University Access Only

Award Date

1998

Degree Name

Master of Science (MS)

Department / School

Computer Science

Abstract

Fault tolerance can be defined as a recovery concept that keeps a computer system operational by compensating for its software or hardware errors. As parallel and/or distributed systems grow larger and more important, they need fault tolerance features more than ever. Unfortunately, most systems do not even provide mechanisms for writing fault-tolerant programs, so programmers must handle faults themselves. One of the most important problems in achieving fault tolerance for parallel and/or distributed systems is the overhead cost caused by redundancy. Because redundancy is essential to fault tolerance, this overhead must be minimized to obtain the best result. This thesis discusses the factors that affect fault tolerance overhead in parallel and/or distributed systems and the problem of optimizing those factors to obtain the best output. First, we develop a fault-tolerant structure for a distributed system. Then, we construct a mathematical model of the fault tolerance overhead for this structure. Next, we identify factors that reconcile fault tolerance overhead with reliability, a trade-off that has long been controversial. These factors are then optimized through calculation and mathematical proof, and validated experimentally by applying the optimized values to a real program. Finally, the fault-tolerant structure for the distributed system model is generalized.

Library of Congress Subject Headings

Fault-tolerant computing
Electronic data processing -- Distributed processing -- Mathematical models
Overhead costs

Format

application/pdf

Publisher

South Dakota State University

Recommended Citation

Shim, Yong-Sang, "Modeling and Analyzing Fault Tolerance Overhead for Distributed Systems" (1998). Electronic Theses and Dissertations. 404.
https://openprairie.sdstate.edu/etd2/404

Open PRAIRIE: Open Public Research Access Institutional Repository and Information Exchange

Electronic Theses and Dissertations

Modeling and Analyzing Fault Tolerance Overhead for Distributed Systems