Abstract

Fault tolerance is central to distributed system design and spans a variety of methodologies. This paper investigates the types of faults and the fault tolerance techniques used in real-time distributed systems. Faults can be detected and recovered from by many techniques, and an appropriate fault detector can prevent losses caused by a system crash or other failure. The paper proposes a framework for detecting faults in a real-time system, in which detected faults are passed to a coordinator for further handling.

Keywords: Fault tolerance, Fault detection, Real-time distributed system, Conceptual framework, Coordinator, Software and hardware faults

Received: 18 July 2018 / Revised: 27 August 2018 / Accepted: 2 October 2018 / Published: 6 November 2018

Contribution/ Originality

This study contributes to the existing literature on fault tolerance in real-time distributed systems.

1. INTRODUCTION

When multiple instances of an application run on several machines and one of the servers goes down, an autonomic fault tolerance technique is needed to handle the fault. Distributed computing systems consist of a variety of hardware and software components, and the failure of any of these components can lead to unpredictable system behavior, so that the availability and reliability of critical services can no longer be guaranteed [1]. A failure occurs when a hardware component breaks and needs replacement, when a node or processor halts or is forced to reboot, or when software fails to complete its run. Fault tolerance is the property of a system that continues to work even when a fault is present. When a fault in a real-time distributed system is not detected and recovered from in time, it results in system failure.

A task running on a real-time distributed system should be feasible, reliable and scalable. Real-time distributed systems such as nuclear systems, robotics, air traffic control systems and grids depend heavily on meeting deadlines, and they must function with high availability even under software and hardware faults. A fault can be detected by a reliable fault detector and then handled by a recovery technique. Hardware fault tolerance is achieved by adding extra hardware such as processors, communication links and resources (memory, I/O devices), whereas software fault tolerance adds tasks and messages to the system to deal with faults. The main aim of fault-tolerant distributed computing is to respond properly to these faults when they occur and to make the system more dependable by increasing its reliability.

A real-time computer controller is typically connected to sensors that provide readings at periodic intervals, and the computer must respond by sending signals to actuators. Unexpected or irregular events may also occur, and these must receive a response as well. In all cases, there is a time bound within which the response must be delivered. The ability of the computer to meet these demands depends on its capacity to perform the necessary computations in the given time.

The paper is organized as follows. Section 2 gives the background: basic concepts, types of fault and the behavior of failed systems. Section 3 describes related work. Section 4 presents the proposed fault tolerance framework, and Section 5 describes its method. Section 6 concludes and outlines future work.

2. BACKGROUND

2.1. Basic Concepts and Types of Fault

A fault-tolerant system is closely related to the concept of a dependable system: to be dependable, a system must be available, reliable and secure. The following terms are closely related to the dependability of a system and its behavior.

A fault can be defined as a defect at the lowest level of abstraction. The faults in a system can be of the following types [2].

Processor faults (node faults): A processor fault occurs when the processor behaves in an unexpected manner.

It can be classified into three types:

Fail-stop: The processor is either active and participating in distributed protocols, or it has failed completely and will never respond. In this case the neighboring processors can detect the failed processor.
Slowdown: The processor may run in a degraded fashion or may fail completely.
Byzantine: The processor may fail, run in a degraded fashion for some time, or execute at normal speed while trying to make the computation fail.

Network faults: These occur in a network due to network partition, packet loss, packet corruption, destination failure, link failure, etc.

Physical faults: These occur in hardware, such as faults in CPUs, memory or storage.

Media faults: These occur due to media head crashes.

Faults can also be categorized on the basis of computing resources and time.

Faults on the basis of resources are:

Timing failure
Response failure
Crash failure

Faults with respect to time are:

Permanent: appears and persists until repaired
Intermittent: appears, disappears and reappears
Transient: appears once and then disappears
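The time-based categories above can be illustrated with a minimal sketch. It assumes that a monitor samples a component at regular intervals and records whether a fault was seen at each check; the names `TemporalFault` and `classify_by_time`, and the rule that a fault still present at the last check counts as permanent, are illustrative assumptions rather than part of the cited classification.

```python
from enum import Enum
from typing import Sequence


class TemporalFault(Enum):
    NONE = "none"
    TRANSIENT = "transient"
    INTERMITTENT = "intermittent"
    PERMANENT = "permanent"


def classify_by_time(observations: Sequence[bool]) -> TemporalFault:
    """Classify a fault from periodic checks (True means a fault was seen)."""
    # Count separate fault episodes: each False -> True transition starts one.
    episodes = 0
    previous = False
    for faulty in observations:
        if faulty and not previous:
            episodes += 1
        previous = faulty

    if episodes == 0:
        return TemporalFault.NONE
    if episodes > 1:
        # The fault appeared, disappeared and reappeared.
        return TemporalFault.INTERMITTENT
    if observations[-1]:
        # Assumption: one episode that never cleared is treated as permanent.
        return TemporalFault.PERMANENT
    # One episode that cleared on its own.
    return TemporalFault.TRANSIENT
```

With a finite observation window, a permanent fault cannot be told apart from a long transient one, so a real detector would need a repair signal or a longer history to decide.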

Failed System Behavior

After a failure, a system can behave in one of three ways:

Fail-stop system

Produces no output once it fails
Immediately stops sending any message or event
Does not respond to any message received over the network
Any failure in the system results in a permanent fault

Byzantine system

Does not stop after a failure but gives wrong output

Fail-fast system

Behaves like a Byzantine system for a short period of time, then moves into fail-stop mode
Performs no further operation once it has stopped
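The three behaviors can be contrasted with a small toy model. This is only a sketch: the class `FailedNode`, the parameter `grace_requests` (how many wrong answers a fail-fast node gives before going silent) and the choice of "correct value plus one" as the wrong output are all hypothetical, chosen to keep the example deterministic.

```python
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    FAIL_STOP = "fail-stop"
    BYZANTINE = "byzantine"
    FAIL_FAST = "fail-fast"


class FailedNode:
    """Toy model of how a node answers requests after it has failed."""

    def __init__(self, mode: FailureMode, grace_requests: int = 2) -> None:
        self.mode = mode
        # Hypothetical knob: wrong answers a fail-fast node gives before it stops.
        self.grace_requests = grace_requests
        self.requests_seen = 0

    def respond(self, correct_value: int) -> Optional[int]:
        """Return the node's reply, or None when the node stays silent."""
        self.requests_seen += 1
        if self.mode is FailureMode.FAIL_STOP:
            return None  # no output at all
        if self.mode is FailureMode.BYZANTINE:
            return correct_value + 1  # keeps running but gives wrong output
        # Fail-fast: wrong output for a short while, then silence.
        if self.requests_seen <= self.grace_requests:
            return correct_value + 1
        return None
```

From the outside, a fail-stop node is the easiest to detect because it simply stops answering, while a Byzantine node can only be caught by checking its answers against an independent source.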

2.2. Motivation

The occurrence of a fault in a system cannot be predicted, and even a small change or failure can have a large effect. Fault tolerance is therefore essential to make transaction processing reliable and to produce correct results even in the presence of faults.

2.3. Objectives

To provide safety for the real-time system: the system must be designed so that, if it is not working correctly, it fails in a safe manner.

3. RELATED WORK

Several authors have proposed important methods for tolerating faults in various systems. According to Alain, et al. [3], the Algorithm Architecture Adequation (AAA) method automatically generates static code for real-time distributed embedded systems. This method is mainly used for processor failures with fail-stop behavior.

Luo et al. state that TERCOS and DEBUS are the best approaches for exploiting redundancies in fault-tolerant and real-time distributed systems [4].

Girault et al. state that processor and communication link failures can be tolerated by using an offline scheduling technique that generates a fault-tolerant distributed schedule. Job scheduling is one of the methods used in grid computing for scheduling tasks. Faults that occur in loosely coupled job scheduling can be tolerated with a job replication scheme, so that jobs are executed efficiently and reliably [5].
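The idea behind job replication can be sketched in a few lines. The sketch below is not the scheme of the cited work: it is a simplified, sequential illustration in which each replica is a callable standing in for the same job submitted to a different node, and the function name `run_replicated` is hypothetical. Real grid schedulers typically run replicas concurrently.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def run_replicated(replicas: Iterable[Callable[[], T]]) -> T:
    """Try each replica of a job in turn and return the first successful result."""
    failures = 0
    for attempt in replicas:
        try:
            return attempt()
        except Exception:
            # This replica (node) failed; fall through to the next one.
            failures += 1
    raise RuntimeError(f"all {failures} replicas failed")
```

The job completes as long as at least one replica succeeds, at the cost of the extra resources the replicas consume.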

In programming asynchronous multiprocessing systems, the customary approach has been to make process synchronization independent of the execution rates of the components. This requires algorithms in which one process must wait for another to do something before it proceeds. Such time-independent algorithms cannot be fault-tolerant, because a process could fail by doing nothing, and such a failure manifests itself only as a reduction of the process's execution rate [3].

Leslie Lamport [6] makes the additional assumption that the clocks are synchronized to keep approximately the same absolute time, and shows how they can be used to solve the synchronization problems that occur in distributed systems; the use of clocks allows acknowledgement messages to be eliminated. If a distributed system is really a single system, then its processes must be synchronized in some way. Conceptually, the easiest way to synchronize processes is to have them do the same thing at the same time. The same work also describes a kernel that performs the necessary synchronization, such as making sure that two different processes do not try to modify a file at the same time.

4. FRAMEWORK

The proposed system consists of two major components:

a) Fault Detector and b) Coordinator

Figure-1. General Architecture of Fault Detection System

4.1. Fault Detector

The fault detector is used to detect faults in the system and runs on each node in user space. It monitors all system activities and classifies them into two classes:

Normal type
Anomaly type

The fault detector checks whether each activity is of the normal type or the anomaly type. If an activity is of the anomaly type, the fault detector raises an alarm and the next step is handled by the coordinator.
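A minimal sketch of such a detector is given below. The paper does not specify how activities are classified, so the sketch assumes a simple threshold rule on two hypothetical metrics (the time since a node's last heartbeat and its CPU load); the names `Activity`, `Alarm` and `FaultDetector` and the default thresholds are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Activity:
    node_id: str
    heartbeat_gap_s: float  # seconds since the node's last heartbeat (assumed metric)
    cpu_load: float         # fraction between 0.0 and 1.0 (assumed metric)


@dataclass
class Alarm:
    node_id: str
    reason: str


class FaultDetector:
    """Classifies each activity as normal or anomalous and raises an alarm."""

    def __init__(self, on_alarm: Callable[[Alarm], None],
                 max_gap_s: float = 2.0, max_load: float = 0.95) -> None:
        self.on_alarm = on_alarm      # in the framework, the coordinator's entry point
        self.max_gap_s = max_gap_s    # hypothetical threshold
        self.max_load = max_load      # hypothetical threshold

    def classify(self, activity: Activity) -> str:
        if activity.heartbeat_gap_s > self.max_gap_s:
            return "anomaly: missed heartbeat"
        if activity.cpu_load > self.max_load:
            return "anomaly: overload"
        return "normal"

    def observe(self, activity: Activity) -> bool:
        """Return True and raise an alarm if the activity is anomalous."""
        verdict = self.classify(activity)
        if verdict == "normal":
            return False
        self.on_alarm(Alarm(activity.node_id, verdict))
        return True
```

Passing the alarm handler in as a callback keeps the detector independent of the coordinator, which matches the split into two components.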

4.2. Coordinator

After a fault is detected in the system, the coordinator is responsible for carrying out the further tasks. Once a fault is detected, the coordinator receives the alarm from the fault detector and must be able to take corrective action within a certain period of time. Finally, all faults are recorded in the fault log table.
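The coordinator's two duties, acting within a time bound and recording the fault, can be sketched as follows. The field names of `FaultLogEntry`, the `deadline_s` parameter and the injectable `clock` are assumptions made for illustration; the paper does not define the layout of the fault log table.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FaultLogEntry:
    node_id: str
    reason: str
    handled_in_s: float
    met_deadline: bool


@dataclass
class Coordinator:
    deadline_s: float                             # assumed time bound for corrective action
    clock: Callable[[], float] = time.monotonic   # injectable so the sketch can be tested
    fault_log: List[FaultLogEntry] = field(default_factory=list)

    def handle_alarm(self, node_id: str, reason: str,
                     corrective_action: Callable[[str], None]) -> FaultLogEntry:
        """Run the corrective action, time it, and record the outcome."""
        started = self.clock()
        corrective_action(node_id)
        elapsed = self.clock() - started
        entry = FaultLogEntry(node_id, reason, elapsed, elapsed <= self.deadline_s)
        self.fault_log.append(entry)
        return entry
```

This sketch only records whether the deadline was met after the fact; enforcing the bound (for example by aborting a slow corrective action) would need additional machinery.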

5. METHOD DEVELOPMENT

Detecting and handling a fault in the system is carried out in the following two phases:

Monitoring Phase
Corrective Action Phase

Figure-2. Work Flow of Proposed System

Monitoring Phase

Monitoring is performed by the fault detector, whose task is to take input from the system. Once the input is received, the fault detector checks whether any type of fault is present. If a fault is found, the occurrence is reported to the coordinator, which performs the corrective task. If no fault is present, the system continues with its further tasks.

Corrective Action Phase

This phase is activated as soon as a fault is present in the system. Once a fault is detected, the coordinator receives an alarm and becomes responsible for handling the fault. To handle it, the coordinator searches for a new node with sufficient resources. It then requests that node to perform the task and halts the process running on the faulty node.
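The corrective steps can be sketched as below. The resource model (free CPU and free memory per node), the names `Node`, `Task`, `find_replacement` and `recover`, and the policy of picking the candidate with the most free CPU are all assumptions; the paper only requires a node with sufficient resources.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    node_id: str
    free_cpu: float
    free_mem_mb: int
    healthy: bool = True
    tasks: List[str] = field(default_factory=list)


@dataclass
class Task:
    name: str
    cpu: float
    mem_mb: int


def find_replacement(nodes: List[Node], task: Task, faulty_id: str) -> Optional[Node]:
    """Search for a healthy node with sufficient resources for the task."""
    candidates = [n for n in nodes
                  if n.healthy and n.node_id != faulty_id
                  and n.free_cpu >= task.cpu and n.free_mem_mb >= task.mem_mb]
    # Assumed policy: prefer the candidate with the most free CPU.
    return max(candidates, key=lambda n: n.free_cpu, default=None)


def recover(nodes: List[Node], task: Task, faulty_id: str) -> Optional[str]:
    """Move the task off the faulty node; return the new node's id, or None."""
    faulty = next(n for n in nodes if n.node_id == faulty_id)
    faulty.healthy = False
    target = find_replacement(nodes, task, faulty_id)
    if target is None:
        return None  # no node has enough resources; left to the caller
    # Request the new node to perform the task ...
    target.tasks.append(task.name)
    target.free_cpu -= task.cpu
    target.free_mem_mb -= task.mem_mb
    # ... then halt the process on the faulty node.
    if task.name in faulty.tasks:
        faulty.tasks.remove(task.name)
    return target.node_id
```

The task is started on the new node before it is halted on the faulty one, following the order given in the text; when no suitable node exists, the sketch leaves the decision to the caller.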

6. CONCLUSION

Fault tolerance has been an important issue in distributed systems from the beginning. Much research has been carried out to solve it, yet it remains a prominent issue in this area. This paper reviews existing approaches to fault tolerance and, to address the issue, proposes a framework together with its working mechanism. The model still needs to be validated, which can be done through simulation.

Funding: This study received no specific financial support.

Competing Interests: The authors declare that they have no competing interests.

Contributors/Acknowledgement: Both authors contributed equally to the conception and design of the study.

REFERENCES

[1] M. Shah, "Fault tolerant distributed computing, Texas, US," 2002.

[2] C. Randy and J. Theodore, Distributed operating systems and algorithms, 1st ed.: Addison Wesley Longman Inc, 1997.

[3] G. Alain, L. Christophe, S. Mihaela, and S. Yves, "Generation of fault-tolerant static scheduling for real-time distributed embedded systems with multi-point links," presented at the IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, San Francisco, USA, April 2001.

[4] L. Wei, Y. FuMin, T. Gang, P. LiPing, and Q. Xiao, "TERCOS: A novel technique for exploiting redundancies in fault-tolerant and real-time distributed systems," presented at the 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2007), 2007.

[5] P. Townend and J. Xu, "Fault tolerance within a grid environment," Time-Out, vol. 1, p. S3, 2003.

[6] M. J. Fischer, N. A. Lynch, and M. S. Paterson, "Impossibility of distributed consensus with one faulty process (No. RR-245)." Yale Univ New Haven CT Dept of Computer Science, 1982.
