Embedded supercomputing is becoming indispensable for complex, computing-intensive scientific and industrial applications, and parallel systems are supplanting traditional uniprocessor platforms. Dependability and fault tolerance thus become critical to the performance of parallel systems.
Failures are no longer just undesirable situations; depending on the application, they can be hazardous or even catastrophic. Migrating to parallel systems offers application developers new prospects but also exposes them to new dangers. Multiprocessor cooperation, a parallel system’s most powerful feature, can also be its fatal weakness. More processors mean more faults, and the failure of a single processor can crash the whole system.
A major factor is the communication system. Interprocessor communication, which coordinates processors and enhances their power, is key to a successful parallel system. Distributed-memory multiprocessor systems rely on message communication between nodes. Message-passing applications are based on either synchronous (blocking) or asynchronous (nonblocking) communication for the coherence of parallel tasks. In the synchronous mode, problems arise when communication links or communicating threads are in an erroneous state (broken links, threads in infinite loops, and so on).
When such errors occur, communicating threads remain blocked, since communication cannot be initiated or completed. Likewise, problems arise in asynchronous communication when communicating threads are in an erroneous state, or when the mailbox mechanisms supporting asynchronous communication malfunction. Clearly, fault-tolerant communication mechanisms are key factors in parallel system dependability and can unlock a system’s full potential.
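To make the blocking problem concrete, the following is a minimal sketch in Python of a synchronous receive guarded by a timeout, so that a hung sender or broken link surfaces as a detectable error instead of blocking the receiver forever. It is an illustration only, not the EFTOS mechanism: the names `guarded_receive`, `LinkTimeout`, and `demo` are hypothetical, and a standard-library `queue.Queue` stands in for a communication link between two nodes.

```python
import queue
import threading


class LinkTimeout(Exception):
    """Hypothetical error: the peer did not deliver a message in time."""


def guarded_receive(channel, timeout_s):
    """Blocking receive that gives up after timeout_s seconds.

    A plain channel.get() would block forever if the sender hangs or
    the link is broken; the timeout turns that into a detectable error.
    """
    try:
        return channel.get(timeout=timeout_s)
    except queue.Empty:
        raise LinkTimeout(f"no message within {timeout_s} s") from None


def healthy_sender(channel):
    # Stand-in for a correctly working remote thread.
    channel.put("result")


def demo():
    # Case 1: the sender works, so the receive completes normally.
    ok_channel = queue.Queue()
    sender = threading.Thread(target=healthy_sender, args=(ok_channel,))
    sender.start()
    first = guarded_receive(ok_channel, timeout_s=1.0)
    sender.join()

    # Case 2: nobody ever sends (simulated broken link or hung thread).
    dead_channel = queue.Queue()
    try:
        guarded_receive(dead_channel, timeout_s=0.1)
        second = "received"
    except LinkTimeout:
        second = "timeout detected"
    return first, second
```

Choosing the timeout is the difficult part in practice: too short a value reports a slow but healthy peer as faulty, while too long a value delays detection.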
The approach to more dependable systems involves taking fault tolerance (FT) measures at two levels:
- The operating system level
- The application level.
Neither level alone is ideal: generic operating system mechanisms carry cost and performance overhead, while application-level solutions burden developers with ad hoc FT programming. Fortunately, there is a middle way. Developer-proposed solutions often stem from common requirements. These requirements can be categorized and addressed in a framework that lies between the application and the operating system. An application developer can then select the desired FT level and tailor the FT mechanisms to the application, thereby reducing development effort and shortening the time to market. Several recent research initiatives have investigated fault tolerance for embedded applications on distributed systems.
These range from proposals for generic architectures for dependable distributed computing and predictably dependable distributed computing systems to approaches for treating hardware and software faults in real-time applications and applied software FT solutions in massively parallel systems. In addition, extensive research has focused on developing distributed real-time operating systems with fault-tolerant behavior. Meanwhile, complex models and frameworks have emerged to evaluate FT system dependability. The Esprit project EFTOS (Embedded Fault-Tolerant Supercomputing) develops a framework to integrate fault tolerance flexibly and easily into distributed, embedded, high-performance computing (HPC) applications.
This framework consists of reusable FT modules acting at different levels. The cost and performance overhead of generic operating system and hardware-level FT mechanisms is avoided, and application developers aren't burdened with providing ad hoc FT programming. Integration of this functionality into actual embedded applications has validated the approach and provided promising results.
FAULT TOLERANCE REQUIREMENTS
Systems within the scope of the EFTOS project exhibit errors in the following categories:
- Untested or unforeseen input values triggering software errors
- Electromagnetic interference causing hardware faults
- Propagation of errors through communication channels from one part of the system to other parts
- Errors propagating from one process to another and causing memory corruption, since processes run concurrently in the same memory space
- Loss of subsequent inputs because of a faulty input item
- Failures to meet deadlines and time constraints
The EFTOS project developed FT modules that deal with the different types of errors in these categories. Application developers prioritized the modules according to their applications' needs, the modules’ impact on fault tolerance, and the feasibility of integrating them into the target application. We were then able to determine where fault tolerance was required (processing and networking modules) and the steps needed to achieve it (detection, isolation, and recovery mechanisms).
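The three steps of detection, isolation, and recovery can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the EFTOS modules themselves: `run_with_recovery` and `flaky_task` are hypothetical names, a fault is modeled as a `RuntimeError`, and recovery is modeled as retrying the same work item on the next available node.

```python
def run_with_recovery(task, nodes, item):
    """Try task on each node in turn until one succeeds.

    Returns (result, isolated), where isolated lists the nodes that
    were taken out of service along the way.
    """
    isolated = []
    for node in nodes:
        try:
            # Detection: a fault surfaces as an exception.
            result = task(node, item)
        except RuntimeError:
            # Isolation: record the faulty node and stop using it.
            isolated.append(node)
            # Recovery: retry the same item on the next node.
            continue
        return result, isolated
    raise RuntimeError("no healthy node left")


def flaky_task(node, item):
    # Hypothetical workload: node0 always fails, other nodes succeed.
    if node == "node0":
        raise RuntimeError("simulated hardware fault")
    return f"{node} processed {item}"
```

Calling `run_with_recovery(flaky_task, ["node0", "node1"], 7)` detects the simulated fault on `node0`, isolates that node, and recovers by completing the item on `node1`. A real system would also need to limit error propagation before the fault is detected, which this sketch does not attempt.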