A Design Methodology for Software Fault Injection in Embedded Systems

N. Krishnamurthy, V. Jhaveri and Jacob A. Abraham
The Computer Engineering Research Center
Department of Electrical & Computer Engineering
ENS 424
The University of Texas at Austin
Austin, TX 78712-1014
UT-Campus Mail: CERC C8800
USA
Ph: (512) 471-8010, (512) 471-8011
Fax: (512) 471-8967
Email: krishnam@cerc.utexas.edu

KEYWORDS
--------
Fault Injection, Design Methodology, Embedded Systems, Dependability, VxWorks

ABSTRACT
--------
Most fault injection tools are tightly coupled to their implementation platforms, are not easy to port, and have primarily dealt with time-sharing systems. This paper addresses the problem of designing software fault injection systems for dependability evaluation of embedded systems and, more specifically, the implementation of one such fault injector, FIESTA (Fault Injection for Embedded System Target Applications). A generic design methodology that enables quick prototyping of software-implemented fault injectors is presented. Unlike most other design methodologies for fault injection tools, our approach achieves a very fast and efficient implementation of such tools for commercial embedded systems. Our technique is compatible with most existing embedded systems and distributed architectures in the commercial market. The proposed methodology has been implemented through the realization of a fault injection tool (FIESTA) for a commercial real-time MC68040 VME-based system running the VxWorks 5.3 (TORNADO 1.0) operating system. Results of fault injection experiments using FIESTA are also presented.

We have attempted to use fault injection for two purposes. The first is to evaluate the dependability properties of an embedded system by emulating possible hardware faults in software. The second is to evaluate and test the robustness of the software applications running on such systems. Traditionally, most fault injection tools have attempted to cover the first objective; recently, there has been a growing interest in using fault injection for "software robustness" evaluation and "software fault" tolerance. This interest has always existed for mission-critical systems, but it is now broader because embedded systems are proliferating into all aspects of our lives. Our design methodology enables us to use fault injection for both of the above objectives.

Design Methodology
------------------
After analyzing the design and architecture of most of the popular commercial implementations of UNIX and non-UNIX distributed and embedded systems in the market, we found that the design of a fault injection tool for such systems would be heavily influenced by their specific architecture and operating system implementation. The fault injection application would have to rely on architecture-dependent debugger support functions and operating system calls (such as "ptrace" on Linux/SunOS, or taskRegsSet() or dbgBrkGetSoft() on VxWorks 5.3). This would tie the fault injection tool to a specific product and make it non-portable. Our objective was to develop a fault injection tool that could be used on any commercial embedded system without having to port it every time between different systems. For example, a fault injection tool for an MC68040 board running VxWorks 5.3 should also work for an MC68040 board running LynxOS without having to rewrite the entire tool.

Based on our analysis of these different commercial architectures, we concluded that the best way to implement a fault injection tool for such systems would be to build upon the existing infrastructure offered by them and to identify the common design aspects of these systems. With this in perspective, we identified the following key principles common to all of these embedded systems.

A) Most of these systems were used in commercial applications and so had a well established and robust development environment.
B) All of these systems had a debugger/monitoring tool that was fine-tuned to their particular product.
C) The debugger and development environment architecture had been developed, and was being continually upgraded, so that it imposed minimal intrusiveness on the target system.
D) The debugger had a command line interface through which applications/threads could be debugged.
E) Some of the debuggers, in certain modes, were capable of setting breakpoints in interrupt handlers, device drivers and kernel level data structures.

Based on the above observations, we decided that the underlying engine for our fault injection tool would be the real-time debugger that is part of the development environment sold with these systems. This design architecture reflects our philosophy that it is preferable to separate a single fault injection tool that "does it all" into two complementary and cooperating systems, and let each do what it does best. The debugger would handle the critical distributed and real-time chores (it also abstracts away the specific architecture-dependent support functions, which otherwise would have to be written and implemented on the target), while the fault injection management would be done separately on the host. The fault injection tool's interface to the target would be through the real-time debugger.

The most important difference between our technique and others is that our fault injection is built on top of the debugger. As the vendors improve the performance and real-time debugging features of their debuggers, and as the hardware architecture changes, there is a direct improvement in the performance and capabilities of our fault injection tool without having to recode the tool. This is possible because the debugger subsumes the changes into its functionality but maintains the same interfaces. In other approaches, however, the entire fault injection agent would have to be rewritten to take advantage of these improvements. Fewer software design bugs and shorter turnaround times for fault injection tools are therefore characteristic of our design methodology.

We assume that the debugger is capable of supporting the following operations.

1) Setting and deleting a breakpoint
2) Examining and modifying registers and memory
3) Examining and modifying kernel data structures and library routines
4) Single stepping and continuing
5) Running the application
6) Catching exceptions and reporting them to the user
7) Reporting exit status back to the user
8) Setting and deleting breakpoints in interrupt handlers and device drivers

Once the commands for the above actions and their responses are identified, the development of the fault injection tool is straightforward.

The fault injection tool FIESTA
-------------------------------
FIESTA was developed for the commercial real-time operating system VxWorks (one of the more popular commercial embedded operating systems) running on an MC68040 target. FIESTA performs fault injection on the VxWorks target application through the VxWorks debugger "gdbm68k" and is capable of injecting faults into a target application for two different configurations of the embedded system.

a) Target Server host-based approach.
b) Target Based approach.

This is possible because of the design methodology that we adopted: the fault injection system is independent of the configuration in which the target system exists. Most embedded systems in the commercial market operate in one of the two configurations above.

FIESTA is capable of random and specific fault injection, for both memory faults and CPU register faults, on the target application. It is capable of emulating both data and address line faults on the target. The fault types supported are stuck-at-1, stuck-at-0 and random faults. It also supports random mutation of application executables. As mentioned earlier, the random mutation of code is used to "stress" the software application.
Moreover, studying the behavior of the embedded application under transient random faults, rather than under specific fault models, was regarded as more vital to the dependability evaluation of the target system. Fault injection is carried out by defining a fault injection experiment and setting all the relevant parameters in a GUI (developed in Tcl/Tk).

Recovery Feature
----------------
A recovery mechanism has been incorporated into FIESTA to enable it to recover from a target crash. This requirement arose because fault injection on the target application occasionally causes the target to hang or crash. Because of the host/target architecture, the fault injection process running on the host is unaffected. The target hang is detected by a timeout mechanism.

Dynamic Execution Count Injection
---------------------------------
A feature of FIESTA is the possibility of injecting a fault on a particular dynamic execution count of an instruction rather than the very first time the instruction is executed. This differs from many earlier injection techniques. Dynamic execution count injection is also important in cases where recursive functions and loops are present, for example, control loops. The user can target a particular execution count of a machine instruction for an injection. This feature can also be used to inject communication faults between embedded systems by targeting the injection on the instructions that comprise the communication layer interface.

Kernel Injection
----------------
FIESTA is also capable of injecting faults into kernel level data structures and functions. This capability can also be used to "stress" the kernel and analyze its robustness under faults. This capability again demonstrates the strength of our design methodology and can be implemented on any embedded system whose debugger is capable of accessing internal kernel functions and data structures.
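
The fault types and the dynamic execution count injection described above can be illustrated with a minimal sketch. It is not FIESTA's implementation. The SimulatedDebugger class, its method names, the 32-bit word size, the addresses and the program-counter trace are all hypothetical stand-ins for a real debugger's command interface, and the reading of a "random" fault as a single bit flipped at a random position is an assumption.

```python
import random


class SimulatedDebugger:
    """Hypothetical stand-in for a vendor debugger's command interface."""

    def __init__(self, memory):
        self.memory = dict(memory)      # address -> 32-bit word
        self.breakpoints = set()

    def set_breakpoint(self, pc):
        self.breakpoints.add(pc)

    def delete_breakpoint(self, pc):
        self.breakpoints.discard(pc)

    def read_word(self, addr):
        return self.memory[addr]

    def write_word(self, addr, value):
        self.memory[addr] = value & 0xFFFFFFFF


def apply_fault(value, bit, kind, rng=random):
    """Return a 32-bit word with one bit altered according to the fault type."""
    if kind == "random":
        # Assumption: a random fault flips one bit at a random position.
        return (value ^ (1 << rng.randrange(32))) & 0xFFFFFFFF
    mask = 1 << bit
    if kind == "stuck-at-1":
        return (value | mask) & 0xFFFFFFFF
    if kind == "stuck-at-0":
        return value & ~mask & 0xFFFFFFFF
    raise ValueError(f"unknown fault type: {kind}")


def inject_on_execution_count(dbg, trace, pc, target_count, addr, bit, kind):
    """Inject a fault the Nth time the instruction at `pc` is reached.

    `trace` is a simulated sequence of program-counter values. Returns the
    trace index at which the fault was injected, or None if the instruction
    was reached fewer than `target_count` times.
    """
    dbg.set_breakpoint(pc)
    hits = 0
    for step, current_pc in enumerate(trace):
        if current_pc in dbg.breakpoints:
            hits += 1                   # breakpoint hit: count it
            if hits == target_count:
                faulty = apply_fault(dbg.read_word(addr), bit, kind)
                dbg.write_word(addr, faulty)
                dbg.delete_breakpoint(pc)
                return step
    dbg.delete_breakpoint(pc)
    return None


if __name__ == "__main__":
    dbg = SimulatedDebugger({0x1000: 0x000000F0})
    # A loop body at 0x404..0x408 executed three times.
    trace = [0x400, 0x404, 0x408, 0x404, 0x408, 0x404, 0x40C]
    step = inject_on_execution_count(
        dbg, trace, pc=0x404, target_count=3,
        addr=0x1000, bit=0, kind="stuck-at-1")
    print(step, hex(dbg.read_word(0x1000)))
```

In this sketch the injector only counts breakpoint hits and writes memory through the debugger interface, which mirrors the split between host-side fault management and debugger-side target access. With a real debugger, the loop over the trace would be replaced by repeatedly continuing the target and waiting for the breakpoint to be reported.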

Experimental Results
--------------------
A number of fault injection experiments are being carried out on an application executing on the MC68040, VxWorks 5.3 (TORNADO) platform and will be presented in the paper. Distributions of the various exceptions and their effect on the target will also be presented.

CONCLUSIONS
-----------
In this paper, we have developed a generic design methodology for quick prototyping of fault injection tools for embedded systems and have implemented a fault injection tool, FIESTA, based on our design philosophy. To our knowledge, this is the first time that such an approach has been applied to the design of a fault injection tool for commercially available embedded systems. By partitioning the functionality of the fault injection process into two separate domains and utilizing the robust, time-tested techniques of the commercial vendors' debuggers, we have shown that it is easy, efficient and practical to develop a fault injection tool for any embedded system that is supported by a good debugger. It would be fair to say that we have found a technique for designing fault injection tools for dependability evaluation of embedded systems. We are currently working on integrating FIESTA with an Ada real-time debugger for fault injection into distributed embedded Ada applications.