Elektron: the Journal of the South African Institute of the Electrical Engineers, Jan. 2000.


Operating Systems for Safety-Critical Applications

by Dr. Yinong Chen, Programme for Highly Dependable Systems,
University of the Witwatersrand


Introduction
Computer applications span far more areas than people normally imagine. 
Besides visible applications such as word processing and Internet access 
on a desktop computer, they cover a vast spectrum, 
from nuclear reactor and aircraft control systems to vehicle antilock brake systems 
(ABS) and electronic toys like play stations. Some of these applications are 
safety-critical, that is, catastrophic consequences may occur if a computer in the 
control system becomes faulty. Design errors and operational faults are in general 
not avoidable. All we can do is make the probability of system failure as 
low as possible, that is, make the system as dependable as possible.

Dependability has been defined as that property of a computer system such that reliance 
can justifiably be placed on the service it delivers [1]. Dependability covers a wide 
range of attributes such as reliability, availability, safety and security. Safety is the 
non-occurrence of catastrophic consequences for human life or the environment. 
A safety-critical system is one on which this safety depends, and it must have 
a predictable failure probability.

This article discusses issues related to software and operating systems 
in safety-critical systems.

Software in Safety-Critical Systems

Because of the potentially catastrophic consequences, every component in a safety-critical 
system must be proved correct or shown to have a dependability that complies with 
the safety standard. For this reason, commercial off-the-shelf (COTS) operating systems 
are normally not acceptable for safety-critical systems. Even a simple operating system 
is too complex to be verified or to be proved to meet the dependability requirement. 
Traditionally, application programs have had to run on a "bare" machine, giving the 
application software designers total control over, and visibility of, the entire 
software system.

As the complexity of the system and the software increases, it becomes extremely inconvenient 
and difficult to write correct code for a bare machine, where the programmers have to worry 
about task scheduling, memory sharing and input/output management. This has led to the 
introduction of a small "runtime kernel" that provides the necessary operating system functions. 
The kernel must not include any function that the particular application does not need, 
so that it remains small enough for full correctness verification.

Can We Rely on COTS Operating Systems?

Growing demand for functionality drives the software complexity in safety-critical 
systems to an extent where a full operating system becomes necessary. Fig.1 shows 
the size of the software, measured in words of executable code, used by Airbus civil 
planes [2]. A software system consisting of 10 million words is very 
difficult, even impossible, to develop in the traditional way on a bare machine 
without a proper operating system environment.


Fig.1 Growing complexity of software in Airbus aircraft

To address this problem, a natural thought is to use existing operating systems. 
The question is, can we rely on COTS (commercial off-the-shelf) operating systems 
for building dependable systems?

The advantages of using COTS operating systems are obvious: doing so reduces the cost 
and time of software development. The problem is that COTS operating systems are not 
necessarily developed with the conditions and requirements of a safety-critical 
system in mind. According to research results from 
the Institute for Complex Systems at Carnegie Mellon University, conventional 
COTS operating systems are not adequate for building safety-critical systems. 
Fifteen operating systems from ten vendors were tested [3]. The failure rates 
(number of failures detected over the total number of tests conducted) of these 
systems are shown in Fig.2. The failure rate ranges from 10% to 23%. Note that the 
test inputs are not the normal inputs that an operating system 
is designed to handle. They are unexpected inputs that may occur only when an 
operator error or system error occurs. According to the results, AIX4.1 has 
the lowest failure rate. The free Linux operating system also behaves well in the tests.

Fig.2 Failure rates of ten commercial operating systems
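
The failure rate used above is simply the number of failures detected divided by the 
total number of tests conducted. A minimal sketch of that arithmetic follows; the 
function name and the counts are hypothetical and are not taken from [3].

```python
def failure_rate(failures: int, tests: int) -> float:
    """Return failures / tests, the fraction of tests that ended in a failure."""
    if tests <= 0:
        raise ValueError("at least one test is required")
    if not 0 <= failures <= tests:
        raise ValueError("failures must lie between 0 and tests")
    return failures / tests


# Hypothetical counts for illustration only (not measured data).
rate = failure_rate(failures=150, tests=1000)
print(f"{rate:.1%}")  # prints 15.0%
```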

Is There a Solution?

We must never give up the hope of finding a solution. The computer scientists 
in the Dependable Computing Group at LAAS-CNRS in France, who pioneered the 
research in this area, came up with a solution using microkernel technology. 
The latest generation of operating systems is developed as middleware on 
top of a microkernel. The idea is to use microkernel technology to 
re-develop a highly dependable operating system with fault-tolerant mechanisms 
integrated into the system, instead of building these mechanisms on top of 
an operating system [4]. The advantages are that a level in the hierarchy is 
saved and that the complexity can be reduced by implementing only those functions that 
are needed by the specific application.

Another solution follows from the study by the researchers at 
Carnegie Mellon University. They found that different operating systems fail on 
different inputs. Their idea for reducing the failure rate is to run the same 
application on multiple operating systems. The outputs from these replicated 
applications are compared against each other, and the majority output is used as 
the final output. The experimental results show a significant reduction of 
the failure rate, as shown in Fig.3. Initially, the failure rates range from 
10% to 23%. As the number of operating systems used increases, the average 
failure rate decreases.


Fig.3 Multi-version comparisons reduce the failure rate
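
The comparison step described above can be sketched as a simple majority voter. This is 
an illustration only, not the implementation used in the study: the function name, the 
strict-majority rule (more than half of the replicas must agree) and the sample outputs 
are assumptions.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the output produced by more than half of the replicas, else None."""
    if not outputs:
        return None
    # most_common(1) yields the single most frequent output and its count.
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


# Hypothetical outputs of the same application run on three operating systems;
# one replica returns an error where the other two agree.
print(majority_vote(["ok", "ok", "error"]))  # prints ok
```

When no output reaches a strict majority the voter returns None, leaving it to the 
caller to decide how to handle a vote that cannot be resolved.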

Higher dependability is important not only for safety-critical systems. 
A more dependable system also increases productivity and user satisfaction 
in conventional systems. One of the research projects in the Programme for 
Highly Dependable Systems at Wits University has been using the dependable 
computing concept to improve the availability of Internet service [5]. The 
idea is to build a distributed system using COTS operating systems on which 
fault-tolerant mechanisms are implemented, as shown in Fig.4. As explained 
in previous sections, such a system is not appropriate for safety-critical 
applications because of the unpredictable failure rate at the COTS operating system level. 
Our target application, however, is commercial Internet servers, which 
are not safety-critical but for which higher availability is extremely important.

Fig.4 Distributed operating system with fault-tolerant extension


References
[1]	Laprie, Dependability of Computer Systems: from Concepts to Limits, IFIP International Workshop on Dependable Computing and its Applications, Johannesburg, January 1998, pp. 108-126 (also see www.cs.wits.ac.za/research/workshop/programme.html).
[2]	Potocki de Montalk J.P., Computer software in civil aircraft, Microprocessors and Microsystems, 17 (1) 1993, pp. 17-23.
[3]	Koopman P.J., De Vale J., Comparing the robustness of POSIX operating systems, IEEE 29th Annual International Symposium on Fault-Tolerant Computing, Madison, June 1999, pp. 30-37.
[4]	Salles F., Arlat J., Fabre J-C., Can we rely on COTS microkernels for building fault-tolerant systems? The 33rd Meeting of IFIP 10.4 WG, Cape Town, January 1998, pp. 13-20.
[5]	Chen Y., Hazelhurst S., Galpin V., Mateer R., Mueller C., Modelling software development of a decentralised virtual service redirector for Internet applications, The 7th IEEE Workshop on Future Trends of Distributed Computing Systems, Cape Town, December 1999, pp. 235-241.