A Design Methodology for Software Fault Injection in Embedded Systems

N. Krishnamurthy, V. Jhaveri and Jacob A. Abraham 

The Computer Engineering Research Center 
Department of Electrical & Computer Engineering 
ENS 424 
The University of Texas at Austin 
Austin, TX 78712-1014 
UT-Campus Mail: CERC C8800
USA

Ph: (512) 471-8010
    (512) 471-8011
Fax: (512) 471-8967
Email : krishnam@cerc.utexas.edu



KEYWORDS
--------
Fault Injection, Design Methodology, Embedded Systems, Dependability, VxWorks


ABSTRACT
--------

Most fault injection tools are tightly coupled to their implementation
platforms, are not easy to port and have primarily dealt with time-sharing
systems.  This paper addresses the problem of designing software fault
injection systems for dependability evaluation of embedded systems and, more
specifically, the implementation of one such fault injector, FIESTA (Fault
Injection for Embedded System Target Applications).  A generic design
methodology, to enable quick prototyping of software-implemented fault
injectors, is presented.  Unlike most other design methodologies for fault
injection tools, our approach achieves a very fast and efficient
implementation of such tools for commercial embedded systems.  Our technique
is compatible with most existing embedded systems and distributed
architectures in the commercial market.  The proposed methodology has been
implemented through the realization of a fault injection tool (FIESTA) for a
commercial real-time MC68040 VME-based system running the VxWorks 5.3 (TORNADO
1.0) operating system.  Results of fault injection experiments using FIESTA
are also presented.

We have attempted to use fault injection for two purposes.  The first
objective is to evaluate the dependability properties of an embedded system by
emulating possible hardware faults in software.  The second is to evaluate
and test the robustness of the software applications running on such systems.
Traditionally, most fault injection tools have addressed the first objective;
recently, there has been growing interest in using fault injection for
software robustness evaluation and software fault tolerance.  Such evaluation
has always mattered for mission-critical systems, but it is becoming more
widely relevant as embedded systems proliferate into all aspects of our lives.
Our design methodology enables us to use fault injection for both of the
above objectives.

Design Methodology
------------------

After the analysis of the design and architecture of most of the popular
existing commercial implementations of UNIX and non-UNIX distributed and
embedded systems in the market, we found that the design of a fault injection
tool, for such systems, would be heavily influenced by its specific
architecture and operating system implementation.  The fault injection
application would have to rely on architecture-dependent debugger support
functions and operating system calls (such as "ptrace" on Linux/SunOS, or
taskRegsSet() or dbgBrkGetSoft() on VxWorks 5.3).  This would tie the fault
injection tool to the specific product and make it non-portable.  Our
objective was to develop a fault injection tool that could be used on any
commercial embedded system without having to be ported every time it is moved
between systems.  For example, a fault injection tool for an MC68040 board
running VxWorks 5.3 should also work for an MC68040 board running LynxOS
without the entire tool having to be rewritten.

Based on our analysis of these different commercial architectures, we
concluded that the best way to implement a fault injection tool for such
systems would be to build upon the existing infrastructure offered by them and
to identify the common design aspects of these systems.  With this in mind,
we identified the following key characteristics shared by these embedded
systems.

A) Most of these systems were used in commercial applications and so had a
        well established and robust development environment.
B) All of these systems had a debugger/monitoring tool that was fine-tuned to
        their particular product.
C) The debugger and development environment architecture design had been
        developed and was being continually upgraded so that it posed minimum
        intrusiveness on the target system.
D) The debugger had a command line interface through which
        applications/threads could be debugged.
E) Some of the debuggers in certain modes were capable of setting breakpoints
        in interrupt handlers, device drivers and kernel level data
        structures.

Based on the above observations, we decided that our underlying engine for the
fault injection tool would be the real-time debugger that is part of the
development environment sold with these systems.  This design architecture
reflects our philosophy that it is preferable to separate a single fault
injection tool that "does it all" into two complementary and cooperating
systems, and let each do what it does best.  The debugger would handle the
critical distributed and real-time chores (it also abstracts away the
architecture-dependent support functions that would otherwise have to be
written and implemented on the target), while the fault injection management
would be done separately on the host.  The fault injection tool's
interface to the target would be through the real-time debugger.

The most important difference between our technique and others is that our
fault injection tool is built on top of the debugger.  As vendors improve the
performance and real-time debugging features of their debuggers, and as the
hardware architecture changes, the performance and capabilities of our fault
injection tool improve directly, without the tool having to be recoded.  This
is possible because the debugger absorbs the changes into its functionality
while maintaining the same interfaces.  In other approaches, by contrast, the
entire fault injection agent would have to be rewritten to take advantage of
these improvements.  We therefore expect fewer software design bugs and
shorter turnaround times for fault injection tools built with our design
methodology.

An assumption we make about the debugger is that it is capable of supporting
the following operations.

1) Setting and deleting a breakpoint
2) Examining and modifying registers and memory
3) Examining and modifying kernel data structures and library routines
4) Single stepping and continuing
5) Running the application
6) Catching exceptions and reporting them to the user
7) Reporting exit status back to the user
8) Setting and deleting breakpoints in interrupt handlers and device drivers

Once the commands for the above actions and their responses are identified,
the development of the fault injection tool is straightforward.
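
The host/debugger split described above can be sketched as follows.  This is
an illustration only, not FIESTA's implementation: the Debugger interface, the
SimulatedDebugger stand-in for a real target, and inject_memory_fault() are
all hypothetical names, and only a subset of the eight operations is modelled.
A real back end would translate each method into the vendor debugger's own
commands and parse its responses.

```python
from abc import ABC, abstractmethod


class Debugger(ABC):
    """Hypothetical host-side interface for a subset of the operations above."""

    @abstractmethod
    def set_breakpoint(self, address: int) -> None: ...

    @abstractmethod
    def delete_breakpoint(self, address: int) -> None: ...

    @abstractmethod
    def read_memory(self, address: int) -> int: ...

    @abstractmethod
    def write_memory(self, address: int, value: int) -> None: ...

    @abstractmethod
    def run(self) -> str:
        """Run or continue the target; return "breakpoint" or "exit"."""


class SimulatedDebugger(Debugger):
    """In-memory stand-in for a target (assumption: no real debugger attached)."""

    def __init__(self, trace, memory):
        self.trace = list(trace)      # instruction addresses in execution order
        self.memory = dict(memory)    # address -> word value
        self.breakpoints = set()
        self.position = 0             # index of the next instruction to execute

    def set_breakpoint(self, address):
        self.breakpoints.add(address)

    def delete_breakpoint(self, address):
        self.breakpoints.discard(address)

    def read_memory(self, address):
        return self.memory[address]

    def write_memory(self, address, value):
        self.memory[address] = value

    def run(self):
        # Simplification: the stop is reported after stepping past the address.
        while self.position < len(self.trace):
            address = self.trace[self.position]
            self.position += 1
            if address in self.breakpoints:
                return "breakpoint"
        return "exit"


def inject_memory_fault(dbg, break_address, target_address, mask):
    """Stop at break_address, XOR mask into one memory word, then resume."""
    dbg.set_breakpoint(break_address)
    if dbg.run() != "breakpoint":
        return "exit"                 # injection point was never reached
    original = dbg.read_memory(target_address)
    dbg.write_memory(target_address, original ^ mask)
    dbg.delete_breakpoint(break_address)
    return dbg.run()
```

Because the injection routine depends only on the abstract interface, moving
to another target would, under these assumptions, mean writing a new Debugger
subclass rather than a new injector.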

The fault injection tool FIESTA
-------------------------------

FIESTA was developed for the commercial real-time operating system VxWorks
(one of the more popular commercial embedded operating systems) running on an
MC68040 target.  FIESTA performs fault injection on the VxWorks target
application through its debugger "gdbm68k" and is capable of injecting faults
into a target application in two different configurations of the embedded
system.

a) Target Server host-based approach.
b) Target Based approach.

This is possible because, under the design methodology we adopted, the fault
injection system is independent of the configuration in which the target
system operates.  Most embedded systems in the commercial market operate in
one of the two configurations above.

FIESTA is capable of random and specific fault injection functions, for both
memory faults and CPU register faults, on the target application. It is
capable of emulating both data and address line faults on the target. The
fault types supported are stuck-at-1, stuck-at-0 and random faults.  It also
supports random mutation of application executables.
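
The three fault types can be illustrated on a single data word.  The sketch
below is not taken from FIESTA: the function names are hypothetical, a 32-bit
word is assumed, and "random fault" is interpreted here as a single bit flip
at a random position, which is only one possible reading.

```python
import random

WORD_BITS = 32                        # assumption: 32-bit registers and words
WORD_MASK = (1 << WORD_BITS) - 1


def stuck_at_1(value: int, bit: int) -> int:
    """Force the given bit of the word to 1."""
    return (value | (1 << bit)) & WORD_MASK


def stuck_at_0(value: int, bit: int) -> int:
    """Force the given bit of the word to 0."""
    return value & ~(1 << bit) & WORD_MASK


def random_bit_flip(value: int, rng: random.Random) -> int:
    """Invert one bit chosen uniformly at random (assumed fault model)."""
    bit = rng.randrange(WORD_BITS)
    return (value ^ (1 << bit)) & WORD_MASK
```

The faulty value would be written back to the register or memory location
through the debugger.  Passing a seeded random.Random makes a random
experiment repeatable.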

As mentioned in our first paragraph, the random mutation of code is used to
"stress" the software application.  Moreover, study of the behavior of the
embedded application under transient random faults rather than specific fault
models was regarded as more vital to the dependability evaluation of the
target system.  Fault injection is carried out by defining a fault injection
experiment and setting all the relevant parameters in a GUI (developed in
Tcl/Tk).

Recovery Feature
----------------

A recovery mechanism has been incorporated into FIESTA to enable it to recover
from a target crash.  This requirement arose because fault injection on the
target application occasionally hangs or crashes the target.  Because of the
host/target architecture, the fault injection process running on the host is
unaffected.  A target hang is detected by a timeout mechanism.
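
A host-side timeout of this kind might look like the sketch below.  The names
wait_for_target() and run_experiment() are hypothetical, as are the poll and
reset_target callbacks; the text above does not say how FIESTA polls the
target or what recovery action it takes, so a reset is assumed purely for
illustration.

```python
import time


def wait_for_target(poll, timeout_s, interval_s=0.01):
    """Poll until the target reports a status; return None on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = poll()               # None means "no response yet"
        if status is not None:
            return status
        time.sleep(interval_s)
    return None


def run_experiment(poll, reset_target, timeout_s):
    """Classify one run; reset the target if it stays silent (assumed recovery)."""
    status = wait_for_target(poll, timeout_s)
    if status is None:
        reset_target()
        return "hang"
    return status
```

The timeout has to exceed the longest fault-free run of the application,
otherwise slow but correct runs would be misclassified as hangs.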

Dynamic Execution Count Injection
---------------------------------

A feature of FIESTA is the possibility of injecting a fault on a particular
dynamic execution count of an instruction rather than the very first time the
instruction is executed.  This differs from many earlier injection techniques.
The dynamic execution count injection is also important in cases where
recursive functions and loops are present, for example, control loops.  The
user could target a particular execution count of a machine instruction for an
injection.  This feature can also be used to inject communication faults
between embedded systems by targeting the injection on the instructions that
comprise the communication layer interface.
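
The idea of counting breakpoint hits before injecting can be sketched as
below.  This is a simulation under stated assumptions, not FIESTA's code: the
target is modelled as a list of executed instruction addresses, and
breakpoint_stops() and inject_at_execution_count() are hypothetical names.

```python
from typing import Callable, Iterable, Iterator


def breakpoint_stops(trace: Iterable[int], break_address: int) -> Iterator[int]:
    """Simulated target: yield the trace index at each hit of the breakpoint."""
    for index, address in enumerate(trace):
        if address == break_address:
            yield index


def inject_at_execution_count(
    trace: Iterable[int],
    break_address: int,
    target_count: int,
    inject: Callable[[int], None],
) -> bool:
    """Call inject() on the target_count-th execution of break_address."""
    hits = 0
    for stop_index in breakpoint_stops(trace, break_address):
        hits += 1
        if hits == target_count:
            inject(stop_index)
            return True
        # Earlier hits: resume the target and wait for the next stop.
    return False                      # instruction ran fewer times than requested
```

For a loop body at one address, target_count selects the iteration in which
the fault is injected; a False result tells the host that the requested
execution count was never reached.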

Kernel Injection
----------------

FIESTA is also capable of injecting faults into kernel level data structures
and functions.  This capability can also be used to "stress" the kernel and
analyze its robustness under faults.  It again demonstrates the strength of
our design methodology and can be implemented on any embedded
system whose debugger is capable of accessing internal kernel functions and
data structures.

Experimental Results
--------------------

A number of fault injection experiments are being carried out on an
application executing on the MC68040, VxWorks 5.3 (TORNADO) platform and will
be presented in the paper.  Distributions of the various exceptions and their
effect on the target will also be presented.

CONCLUSIONS
-----------

In this paper, we have developed a generic design methodology for quick
prototyping of fault injection tools for embedded systems and have implemented
a fault injection tool FIESTA based on our design philosophy.  To our
knowledge, this is the first time that such an approach has been applied to
the design of a fault injection tool for commercially available embedded
systems.  By partitioning the functionality of the fault injection process
into two separate domains and utilizing the robust, time-tested techniques of
commercial vendors' debuggers, we have shown that it is quite easy, efficient
and practical to develop a fault injection tool for any embedded system that
is supported by a good debugger.  It would be fair to say that we have found a
technique for designing fault injection tools for dependability evaluation of
embedded systems.

We are currently working on integrating FIESTA with an Ada real-time debugger
for fault injection into distributed embedded Ada applications.