The Fault Tolerant Computer System of the Brazilian Scientific Application Microsatellites

Alderico Rodrigues de Paula Jr., Claudio Roland Sonnenburg
Instituto de Pesquisas e Desenvolvimento - IP&D
Universidade do Vale do Paraíba - UNIVAP
Av. Shishima Hifume 2.911, 12244-000 São José dos Campos, SP, Brazil
FAX: (55) 12 349 1252
alderico@univap.br; sonnen@univap.br
Extended Abstract

A fault tolerant computer system has been conceived as the standard framework for the future family of Brazilian small satellites for scientific applications. To minimize cost and fabrication time, a modular microsatellite bus that can easily be adapted to different mission requirements was developed. The development of the first satellite, SACI-1 (Primeiro Satélite Científico Brasileiro), started in 1994 in cooperation with some Brazilian universities. Four experiments were selected for the first mission: Airglow Photometer, Plasma Bubble Experiments, Solar and Anomalous Cosmic Rays Observation in Magnetosphere, and Geomagnetic Experiment.

The experiment boxes are assembled with the microsatellite bus in a 400 x 400 x 600 mm parallelepiped structure. The bus is a sandwich-type structure that holds the power supply, telecommunication and computer subsystems. The microsatellite is spin stabilized. Its total mass is 60 kg and the onboard equipment consumes 30 W on average.

One of the objectives of the SACI program is the development of a microsatellite bus that can easily be adapted to different scientific missions. Based on this premise, a standard modular computer system was developed to satisfy the following requirements:

a) The computer system should be a fault tolerant distributed system able to survive any single hardware fault without degradation and to survive most double hardware faults. When a second fault occurs, the computer system should still be able to execute at least the critical tasks. The computer reliability should be better than 0.98 for a two-year mission.

b) The interfaces with the satellite subsystems should be completely redundant. A hardware fault in one of the interfaces should not affect the communication between the computer and the satellite subsystems.
c) The computer system should have a large mass memory, capable of storing the data collected from the experiments while the satellite is not in contact with a ground station.

d) The computer system should be easily adaptable to different mission performance requirements without a complete redesign.

e) The computer system components should withstand a total radiation dose of up to 20 krad over a two-year mission, and all components should be latchup free for particles with energies up to 50 MeV or should have a circuit to recover from latchups.

Different computer architectures were analyzed for the scientific microsatellite computer system. The one selected was a modified mesh architecture. Each processing module should be able to communicate with four other processing modules, or with the satellite subsystem interfaces, through high speed serial links. Every interface should be connected to two different processing modules. To allow graceful degradation, the mesh architecture was modified by adding to each processing module a watch-dog-timer and a set of switches that disconnect the processing module from the mesh and connect its input links to its output links when a catastrophic fault is detected.

The main tasks assigned to the SACI-1 onboard computer are:

a) Reception and processing of the telecommands from the Control Center.

b) Acquisition of the telemetry signals from the satellite subsystems and experiments.

c) Onboard processing of the data requested by the experiments.

An analysis was done at the beginning of the project to select off-the-shelf microprocessors that satisfied the radiation requirements and were suitable for the proposed architecture. As a result of this analysis, the Transputer T805, manufactured by INMOS, was selected.
The performance analysis demonstrated that, for the computer system to execute all specified onboard tasks and to store all acquired data, two processing modules were necessary, each with a program memory of 128 Kbytes and a mass memory of 8 Mbytes. The dependability analysis demonstrated that the reliability of each processing module for a two-year mission was 0.98. However, the reliability of a computer system that needs both of its two processing modules was 0.96, which is less than what was specified for the computer system. To satisfy the reliability requirement, three processing modules were necessary, resulting in a reliability of 0.998 for a two-year mission.

The designed onboard computer is composed of three processing modules based on the INMOS Transputer T805 and three internally redundant interfaces (SRI, UAC and TCTM), interconnected by 10 Mbps serial lines. The SRI is the interface with the experiments that exchange messages in serial form. The UAC generates the commands to the satellite subsystems and acquires the telemetry signals. Finally, the TCTM receives the telecommand frames from the Control Center and sends the telemetry frames to the ground stations.

The SACI-1 computer system was designed to degrade gracefully to two processing modules or to one. Initially, all three processing modules are powered on and the tasks are distributed among them. When a catastrophic fault occurs in one processing module, the computer system degrades to two processing modules. In this case, the tasks are reallocated and all of them continue normal execution. The system is also designed to degrade to one processing module when any two processing modules fail. In this case, not all tasks are executed; only the most important ones are selected.

The onboard software is composed of a set of processes that exchange messages among themselves through a routing process.
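The reliability figures quoted above can be reproduced with a simple k-out-of-n model. The sketch below is only an illustration: it assumes that module failures are independent, that the two-module configuration needs both modules, and that the three-module configuration counts as working while at least two modules survive. The text does not state which model the dependability analysis actually used, and the interfaces are ignored here. The function name `k_of_n_reliability` is hypothetical.

```python
from math import comb

def k_of_n_reliability(r: float, k: int, n: int) -> float:
    """Probability that at least k of n independent modules survive,
    each module having reliability r over the mission."""
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))

R_MODULE = 0.98  # per-module reliability for a two-year mission (from the text)

# Assumed model: with two modules, both are needed to run every task.
two_modules = k_of_n_reliability(R_MODULE, 2, 2)

# Assumed model: with three modules, any two are enough to run every task.
three_modules = k_of_n_reliability(R_MODULE, 2, 3)

print(f"2 modules, both needed: {two_modules:.4f}")
print(f"3 modules, 2 needed:    {three_modules:.4f}")
```

Under these assumptions the two-module value is 0.98 squared, about 0.96, and the three-module value is about 0.9988, which is consistent with the 0.96 and 0.998 quoted in the text.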
The active processes in each processing module are defined by a task table that can be updated by the routing process or by telecommands sent from the ground station. The processes are organized in three classes: Application, System and Interface. The Application processes execute the onboard data processing required by the experiments. The System processes execute general tasks such as telemetry and telecommand processing, fault analysis and diagnosis routines, and housekeeping. Finally, the Interface (I/O) processes control the data exchange among the processing modules and the satellite subsystem interfaces.

The critical part of the onboard software (composed of the routing processes and the fault treatment routines) was described using the CSP-Z formal language. CSP-Z combines the Z language, used to describe sequential systems, with the CSP language, used to describe concurrent systems. From the formal description, the critical parts of the software were formally proved to be deadlock free.

The SACI-1 fault tolerance mechanisms were organized in two hierarchical layers: the upper layer corresponds to the Control Center and the lower layer corresponds to the onboard computer. Some of the fault tolerance mechanisms operate automatically, without human intervention, using one or more layers. Others may need the intervention of the Control Center operator or even of the specialist group.

The Control Center receives the service telemetry sent by the onboard computer and verifies the fault reports and the consistency and range of the telemetry parameters. If a discrepancy is detected and the Control Center is programmed to handle it, it sends the commands to reconfigure the onboard computer. Otherwise, it sets an alarm to inform the operator that a problem was detected onboard. When the operator is unable to solve the problem, the operator should consult the specialist team.
Based on the specialists' analysis, the operator can program the Control Center computer to send commands to reconfigure the computer system or to execute an additional diagnosis program onboard to obtain more information about the fault.

The lower layer, which corresponds to the onboard computer, is subdivided into hardware, routing and application program layers. Some mechanisms use only one layer; others use two or more. Most of the fault tolerance mechanisms are organized in three phases: error detection, fault analysis and recovery. The error detection mechanisms are divided into two classes: those that run concurrently with normal operation and those that do not, such as a diagnosis program. When an error is detected and is not corrected by the hardware, the fault analysis routine is activated. The fault analysis routine is programmed to deal with certain classes of faults. If it cannot solve the problem in its own layer, it sends a report to the fault tolerance mechanisms of a higher layer. Based on the fault analysis results, the system either returns to normal operation or activates the failure flag line.

The failure flag of each processing module is connected to an interrupt input port of the other two modules. Therefore, when the failure flag of one processing module is activated, it causes a general system reset. After a reset, all surviving processing modules execute the initialization program, which contains a diagnosis routine that evaluates the health of the processing module. If a catastrophic fault is detected, the processing module keeps its failure flag activated. The modules in good health read the failure flags of the other modules and then load an execution task table chosen according to those flags. The task table can also be updated by Control Center commands.
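The reconfiguration step described above, in which each healthy module reads the failure flags and loads a matching task table, can be sketched as follows. This is a minimal illustration, not the flight software: the task names, the number of critical tasks, and the round-robin split are all hypothetical, since the text does not list the actual SACI-1 task tables, which are presumably predefined rather than computed.

```python
from typing import Dict, List

# Hypothetical task list, ordered by priority. The first N_CRITICAL entries
# are treated as the critical tasks kept when only one module survives.
TASKS: List[str] = [
    "telecommand_processing",
    "telemetry_processing",
    "fault_analysis",
    "housekeeping",
    "experiment_data_processing",
    "mass_memory_management",
]
N_CRITICAL = 3

def select_task_table(module_id: int, failure_flags: Dict[int, bool]) -> List[str]:
    """Return the tasks this module should run, given every module's
    failure flag (True means the module reported a catastrophic fault)."""
    healthy = sorted(m for m, failed in failure_flags.items() if not failed)
    if module_id not in healthy:
        return []  # a failed module runs nothing
    # With two or three healthy modules all tasks run; with a single
    # survivor only the critical tasks are kept.
    tasks = TASKS if len(healthy) >= 2 else TASKS[:N_CRITICAL]
    # Deterministic round-robin split, so that every module derives the
    # same allocation independently from the same flags.
    position = healthy.index(module_id)
    return tasks[position::len(healthy)]

# Example: module 1 has failed, modules 0 and 2 share all tasks.
flags = {0: False, 1: True, 2: False}
for module in (0, 1, 2):
    print(module, select_task_table(module, flags))
```

The point of the sketch is that the allocation depends only on the failure flags, so the surviving modules reach the same configuration after a general reset without negotiating with each other.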
Each processing module is composed of a CPU based on the Transputer T805, a watch-dog-timer with link switches (WDT-SW), an interrupt and I/O circuit (INTER-I/O), a main memory (MAIN MEM) and a mass memory (MASS MEM).

The main fault detection mechanism in the CPU is the watch-dog-timer (WDT). If the WDT is not reset within a predefined time interval, it generates an interrupt. If the WDT is still not reset after the interrupt, it generates an initialization pulse to restart the processing module. If the processing module is not able to return to normal operation after this procedure, the WDT activates the failure flag signal, which causes a general system reset.

In the onboard computer, the main memory is composed of memory chips that are sensitive to space radiation. To overcome this problem, an error detection and correction (EDC) circuit that corrects single errors and detects double errors is used. Single errors are corrected automatically by the EDC. When a double error is detected, an interrupt request is sent to the CPU and the fault handling routine is activated. When the double error occurs in a critical program code or critical variable region, the processing module is reset. Otherwise, the fault handling routine may decide to reload the routine in which the double error was detected.

When the SACI-1 project started, high capacity memory chips qualified for space applications were not available on the market. The design team therefore decided to use industrial memory chips that tolerate a radiation dose of up to 20 krad and have a low probability of latchup caused by high energy particles. After analyzing different memory chips, a 512 K x 8-bit chip was selected. The mass memory chips were organized in two independent banks of 4 Mbytes each. Each bank has a circuit that limits the latchup current and disconnects the power supply of the bank for 10 ms when a latchup is detected.
Since the mass memory is only used to store experiment and telemetry frames for a few hours, and the stored data are not used onboard, it was decided not to add hardware to detect or correct errors onboard, but to encode the data frames with an error detection and correction code. With high density memory chips such as the 512 Kbyte chips, a single cosmic particle can cause double or triple errors in a single word of the memory. To circumvent this problem, the frames are encoded with a Hamming code applied to each bit column of the frame, so that the bits of one word belong to different code words. For each 1 Kbyte data frame, 10 parity bytes are added.

In the system layer, the main fault detection mechanisms are the diagnosis program, which is executed periodically, and the verification of the addresses of the messages exchanged between processes, which is performed by the routing process.

The error detection mechanisms in the application layer depend on the specific application and are not discussed in this paper. In general, these mechanisms verify the consistency of the computed data.

At present, SACI-1 is in the qualification test phase. Current plans are for it to be launched piggyback with the China-Brazil Earth Resources Satellite (CBERS) on the Chinese Long March 4 launcher in the first half of 1998.
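The column-wise coding of the mass memory frames described above can be illustrated with a short sketch. It is an illustration under stated assumptions, not the SACI-1 encoder: it uses a plain single-error-correcting Hamming code with 10 check bits per bit column, packs check bit j of the eight columns into parity byte j (which gives the 10 parity bytes mentioned in the text), and limits the frame to 1013 data bytes, the most that 10 Hamming check bits can cover. How the 1 Kbyte frame of the text maps onto that limit is not stated. The names `encode`, `correct` and `DATA_POS` are hypothetical.

```python
from typing import List

R = 10                       # check bits per bit column = parity bytes per frame
MAX_DATA = (1 << R) - 1 - R  # 1013 data bytes covered by a 10-bit Hamming code

# Code word positions used by data bits: 1..1023 without the powers of two,
# which are reserved for the check bits.
DATA_POS = [p for p in range(1, 1 << R) if p & (p - 1)]

def _column_checks(data: bytes) -> List[int]:
    """For each of the 8 bit columns, XOR the positions of the set bits."""
    checks = [0] * 8
    for pos, byte in zip(DATA_POS, data):
        for col in range(8):
            if (byte >> col) & 1:
                checks[col] ^= pos
    return checks

def encode(data: bytes) -> bytes:
    """Return R parity bytes; bit `col` of byte j is check bit j of that column."""
    if len(data) > MAX_DATA:
        raise ValueError("frame too long for the chosen number of check bits")
    checks = _column_checks(data)
    return bytes(
        sum(((checks[col] >> j) & 1) << col for col in range(8)) for j in range(R)
    )

def correct(data: bytes, parity: bytes) -> bytes:
    """Repair up to one flipped bit per bit column and return the data."""
    fixed = bytearray(data)
    checks = _column_checks(data)
    for col in range(8):
        stored = sum(((parity[j] >> col) & 1) << j for j in range(R))
        syndrome = checks[col] ^ stored
        # Zero: no error. Power of two: the error is in a check bit, data is fine.
        if syndrome and syndrome & (syndrome - 1):
            index = DATA_POS.index(syndrome)
            if index >= len(data):
                raise ValueError("uncorrectable error in column %d" % col)
            fixed[index] ^= 1 << col
    return bytes(fixed)

frame = bytes(range(256)) * 3   # 768-byte example frame
parity = encode(frame)
damaged = bytearray(frame)
damaged[100] ^= 0b00000111      # one particle flips three bits of a single word
repaired = correct(bytes(damaged), parity)
print(repaired == frame)
```

Because each bit of a word belongs to a different code word, the triple upset in byte 100 appears as one error in each of three columns, and each column corrects its own bit. Two errors in the same column are beyond what this plain Hamming sketch can handle.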