The Fault Tolerant Computer System of The Brazilian 
Scientific Application Microsatellites

Alderico Rodrigues de Paula Jr. 
Claudio Roland Sonnenburg 
Instituto de Pesquisas e Desenvolvimento - IP&D
Universidade do Vale do Paraíba - UNIVAP
AV. Shishima Hifume 2.911
12244-000 São José dos Campos, SP, Brazil
FAX: (55) 12 349 1252
alderico@univap.br;  sonnen@univap.br


Extended Abstract

	A fault tolerant computer system has been conceived to become the
standard framework for the future family of Brazilian small satellites
for scientific applications. To minimize cost and fabrication time, a
modular microsatellite bus that can easily be adapted to different
mission requirements was developed. The development of the first
satellite (SACI-1 - Primeiro Satélite Científico Brasileiro) started in
1994 in cooperation with some Brazilian universities. Four experiments
were selected for the first mission: Airglow Photometer, Plasma Bubble
Experiments, Solar and Anomalous Cosmic Rays Observation in
Magnetosphere, and Geomagnetic Experiment.

	The experiment boxes are assembled with the microsatellite bus in a
400 x 400 x 600 mm parallelepiped structure. The bus is a sandwich-type
structure that holds the power supply, telecommunication and computer
subsystems. The microsatellite is spin stabilized. Its total mass is 60
kg and the onboard equipment power consumption is 30 W on average.

One of the objectives of the SACI program is the development of a
microsatellite bus that could easily be adapted for different scientific
application missions. Based on this premise, a standard modular computer
system that could satisfy the following requirements was developed:

a) The computer system should be a fault tolerant distributed system
able to survive any single hardware fault without degradation and also
to survive most double hardware faults. When a second fault occurs,
the computer system should be able to execute at least the critical
tasks. The computer reliability should be better than 0.98 for a
two-year mission.
b) The interface with the satellite subsystems should be completely
redundant. Any hardware fault in one of the interfaces should not affect
the communication between the computer and the satellite subsystems.
c) The computer system should have a large mass memory, capable of
storing the data collected from the experiments when the satellite is
not in contact with a ground station. 
d) The computer system should be easily adapted for different mission
performance requirements without a complete redesign.
e) The computer system components should be able to withstand a cosmic
radiation total dose of up to 20 krad for a two-year mission, and all
components should be latchup free for radiation particles up to 50 MeV
or should have a circuit to recover from latchups.

	Different computer architectures were analyzed for the scientific
microsatellite computer system. The selected one was the modified mesh
architecture. Each processing module should be able to communicate with
four other processing modules or with the satellite subsystems interface
through a high speed serial link. All interfaces should be connected to
two different processing modules. To allow graceful degradation, the
mesh architecture was modified with the addition to each processing
module of a watch-dog-timer and a set of switches used to disconnect the
processing module from the mesh and to connect the input links to the
output links when a catastrophic fault is detected. 

	The main tasks assigned to the SACI-1 onboard computer are: 

a)  Reception and processing of the telecommands from the Control
Center.
b)  Acquisition of the telemetry signals from the satellite subsystems
and experiments. 
c)  Onboard processing of the data requested by the experiments.

An analysis was done at the beginning of the project to select
off-the-shelf microprocessors that satisfied the radiation requirements
and that were suitable to be used in the proposed architecture. As a
result of this analysis, the Transputer T805, manufactured by INMOS, was
selected. The performance analysis demonstrated that, in order for the
computer system to execute all specified onboard tasks and to store all
acquired data, two processing modules, each having a program memory of
128 Kbytes and a mass memory of 8 Mbytes, were necessary. The
dependability analysis demonstrated that the reliability of each
processing module for a two-year mission was 0.98. However, the
reliability of the computer system using two processing modules was
0.96, which is less than what was specified for the computer system. To
satisfy the reliability requirement, three processing modules were
necessary, resulting in a reliability of 0.998 for a two-year mission.
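
The figures above can be reproduced with a simple k-out-of-n model. The
sketch below is only an illustration: it assumes that module failures
are independent, that the system works as long as at least two modules
survive, and that the interfaces are perfectly reliable. None of these
assumptions is stated in this abstract, and the function name is
hypothetical.

```python
from math import comb


def k_of_n_reliability(p: float, k: int, n: int) -> float:
    """Probability that at least k of n independent modules survive,
    each with survival probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_module = 0.98  # two-year reliability of one processing module (from the text)

# Two modules, both required: no spare.
two_of_two = k_of_n_reliability(p_module, 2, 2)
# Three modules, any two sufficient: one spare.
two_of_three = k_of_n_reliability(p_module, 2, 3)

print(f"{two_of_two:.4f}")    # 0.9604
print(f"{two_of_three:.4f}")  # 0.9988
```

Under these assumptions the two-module value rounds to the 0.96 quoted
above, and the three-module value is consistent with the quoted 0.998.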

The designed onboard computer is composed of three processing modules
based on the INMOS Transputer T805 and three internally redundant
interfaces (SRI, UAC and TCTM) that are interconnected by a 10 Mbps
serial line. The SRI is an interface with the experiments that exchanges
messages in serial form. The UAC is used to generate the commands to the
satellite subsystems and also to acquire the telemetry signals. Finally,
the TCTM receives the telecommand frames from the Control Center and
sends the telemetry frames to the ground stations. The SACI-1 computer
system was designed to degrade gracefully to two or to one processing
module. Initially, all three processing modules are powered on and the
tasks are distributed among them. When a catastrophic fault occurs in
one processing module, the computer system degrades to two processing
modules. In this case, the tasks are relocated and all tasks can
continue normal execution. Additionally, the system is designed to
degrade to one processing module when any two processing modules fail.
In this case, not all tasks are executed; only the most important ones
are selected.

The onboard software is composed of a set of processes that exchange
messages among themselves through a routing process. The active
processes in each processing module are defined by a task table that can
be updated by the routing process or by telecommands sent from the
ground station. The processes are organized in three classes:
Application, System and Interface (I/O). The application processes
execute the onboard data processing required by the experiments. The
system processes execute general tasks such as telemetry and telecommand
processing, fault analysis and diagnosis routines, and housekeeping
tasks. Finally, the I/O processes control the data exchange among the
processing modules and the satellite subsystem interfaces.

	The critical part of the onboard software (composed of the routing
processes and fault treatment routines) was described using the CSP-Z
formal language. The CSP-Z language is the union of the Z language, used
to describe sequential systems, with the CSP language, used for
concurrent system description. After the formal description, it was
formally proved that the critical parts of the software were deadlock
free.

The SACI-1 fault-tolerance mechanisms were organized in two hierarchical
layers: the upper layer that corresponds to the Control Center and the
lower layer that corresponds to the onboard computer. Some of the fault
tolerance mechanisms operate automatically without human intervention
using one or more layers. Others may need the intervention of the
Control Center operator or even the specialist group.

The Control Center receives the service telemetry sent by the onboard
computer and verifies the consistency and range of the telemetry
parameters and the fault reports. If a discrepancy is detected and the
Control Center is programmed to solve the detected problem, it sends the
commands to reconfigure the onboard computer. Otherwise, it sets an
alarm to inform the operator that a problem was detected onboard. When
the operator is unable to solve the problem, he should consult the
specialist team. Based on the specialist's analysis, the operator can
program the Control Center computer to send commands to reconfigure the
computer system or to execute an additional diagnosis program onboard to
obtain more information about the fault.
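
The automatic part of this Control Center procedure can be sketched as
follows. This is a minimal illustration, not the actual ground software:
the parameter names, limits, and command mnemonic are all hypothetical,
and only the range check is modeled (the consistency check and the fault
reports are left out).

```python
# Hypothetical telemetry limits: parameter name -> (minimum, maximum).
LIMITS = {
    "bus_voltage": (24.0, 32.0),
    "cpu_temp_c": (-10.0, 60.0),
}

# Hypothetical problems the Control Center is programmed to solve:
# parameter name -> reconfiguration command to send.
AUTO_RESPONSES = {
    "cpu_temp_c": "SWITCH_OFF_MODULE",
}


def check_telemetry(frame):
    """Return (commands, alarms) for one telemetry frame (a dict)."""
    commands, alarms = [], []
    for name, value in frame.items():
        low, high = LIMITS.get(name, (float("-inf"), float("inf")))
        if low <= value <= high:
            continue  # parameter is within range
        if name in AUTO_RESPONSES:
            # Programmed case: reconfigure without operator intervention.
            commands.append(AUTO_RESPONSES[name])
        else:
            # Unprogrammed case: raise an alarm for the operator.
            alarms.append(f"{name}={value} outside [{low}, {high}]")
    return commands, alarms


commands, alarms = check_telemetry({"bus_voltage": 20.0, "cpu_temp_c": 75.0})
```

Here the temperature discrepancy yields a command, while the voltage
discrepancy, for which no response is programmed, yields an alarm.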

The lower layer, which corresponds to the onboard computer, is
subdivided into hardware, routing and application program layers. Some
mechanisms use only one layer; others use two or more layers. Most of
the fault tolerance mechanisms are organized in three phases: error
detection, fault analysis and recovery.

The error detection mechanisms are divided into two classes: concurrent
with the operation and not concurrent with the operation, such as a
diagnosis program. When an error is detected and it is not corrected by
the hardware, the fault analysis routine is activated. The fault
analysis routine is programmed to deal with some classes of faults. If
it cannot solve the problem in its own layer, it sends a report to a
higher-layer fault tolerance mechanism. Based on the fault analysis
results, the system can return to normal operation or activate the
failure flag line. The failure flag of a processing module is connected
to the interrupt input port of the other two modules. Therefore, when
the failure flag of one processing module is activated, it causes a
general system reset.

After a reset, all surviving processing modules execute the
initialization program that contains a diagnosis routine that evaluates
the health of the processing module. If a catastrophic fault is
detected, the processing module maintains the failure flag activated.
The modules in good health read the failure flags from the other modules
and, then, load an execution task table based on the failure flag line
of the other modules. The task table can also be updated by Control
Center commands.
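
The selection of a task table from the failure flags can be sketched as
below. The task names, their criticality, and the round-robin allocation
are assumptions made only for illustration; the abstract does not list
the actual tables. The sketch also builds one global table, whereas
onboard each healthy module would load its own entry.

```python
# Hypothetical task list as (name, critical) pairs.
TASKS = [
    ("telecommand_processing", True),
    ("telemetry_acquisition", True),
    ("housekeeping", True),
    ("experiment_data_processing", False),
    ("periodic_diagnosis", False),
]


def build_task_table(failure_flags):
    """Map each healthy module to its tasks.

    failure_flags: dict of module name -> True if its failure flag is set.
    """
    healthy = sorted(m for m, failed in failure_flags.items() if not failed)
    if not healthy:
        return {}
    if len(healthy) == 1:
        # Single survivor: only the critical tasks are executed.
        selected = [name for name, critical in TASKS if critical]
    else:
        # Two or three modules: all tasks continue, redistributed.
        selected = [name for name, _ in TASKS]
    table = {m: [] for m in healthy}
    for i, name in enumerate(selected):
        table[healthy[i % len(healthy)]].append(name)
    return table


two_left = build_task_table({"PM1": False, "PM2": True, "PM3": False})
one_left = build_task_table({"PM1": True, "PM2": False, "PM3": True})
```

With two survivors all five hypothetical tasks remain allocated; with
one survivor only the three tasks marked critical are kept.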

Each processing module is composed of a CPU based on the Transputer
T805, a watch-dog-timer and link switches (WDT-SW), an interrupt and I/O
circuit (INTER-I/O), a main memory (MAIN MEM) and a mass memory (MASS
MEM).

The main fault detection mechanism in the CPU is the watch-dog-timer
(WDT). If the WDT is not reset within a predefined time interval, it
generates an interrupt. If the WDT is still not reset after the
interrupt, it generates an initialization pulse to restart the
processing module. If, after this procedure, the processing module is
not able to return to normal operation, the WDT activates the failure
flag signal, which causes a general system reset.
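
The three-stage escalation described above can be modeled as a small
state machine. The sketch below assumes that each stage corresponds to
one expired timeout interval and that any successful reset of the WDT
returns it to the initial state; the abstract does not give these
details, and the class and method names are hypothetical.

```python
from enum import Enum


class WdtAction(Enum):
    INTERRUPT = 1     # first timeout: interrupt the CPU
    INIT_PULSE = 2    # second timeout: restart the processing module
    FAILURE_FLAG = 3  # third timeout: raise the failure flag


class Watchdog:
    """Software model of the escalating watch-dog-timer."""

    def __init__(self):
        self.missed = 0  # consecutive timeouts without a reset

    def kick(self):
        """Called by healthy software to reset the WDT."""
        self.missed = 0

    def timeout(self):
        """Called when the time interval expires without a reset."""
        self.missed += 1
        if self.missed == 1:
            return WdtAction.INTERRUPT
        if self.missed == 2:
            return WdtAction.INIT_PULSE
        return WdtAction.FAILURE_FLAG


wdt = Watchdog()
actions = [wdt.timeout(), wdt.timeout(), wdt.timeout()]
```

A module that recovers after the interrupt or the restart calls `kick()`
and the escalation starts again from the first stage.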

In the onboard computer, the main memory is composed of memory chips
that are sensitive to space radiation. To overcome this problem, an
error detection and correction (EDC) circuit that detects double errors
and corrects single errors is utilized. Single errors are corrected
automatically by the EDC. When a double error is detected, an interrupt
request is sent to the CPU and the fault handling routine is activated.
When the double error occurs in a critical program code or critical
variable region, the processing module is reset. Otherwise, the fault
handling routine may decide to reload the routine in which the double
error was detected.

When the SACI-1 project started, high capacity memory chips qualified
for space applications were not available on the market. The design
team therefore decided to use industrial memory chips that were
radiation resistant up to 20 krad and that presented a low probability
of latchup due to high energy particles. After analyzing different
memory chips, a 512 Kword x 8 bit chip was selected. The memory chips
were organized in two independent banks of 4 Mbytes. Each bank has a
circuit that limits the latchup current and disconnects the power supply
of the bank for 10 ms when a latchup is detected.

Since the mass memory is only used to store experiment and telemetry
frames for a few hours and the stored data will not be used onboard, a
decision has been made not to use additional hardware to detect or to
correct errors onboard, but to code the data frames using an error
detection and correction code. With the utilization of high density
memory chips such as the 512 Kbyte chips, a single cosmic particle can
cause double or triple errors in a single word of the memory. To
circumvent this problem, the frames are coded using a Hamming code in
each bit column of the frame. For each 1 Kbyte data frame, 10 parity
bytes were added.
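
A column-wise Hamming code of this kind can be sketched as follows. The
layout is an assumption, since the abstract does not describe the actual
frame format: bytes are placed at the classic Hamming positions 1..n,
with parity bytes at the power-of-two positions, so that each of the 8
bit columns forms an independent single-error-correcting codeword. With
this particular construction, 10 parity bytes cover up to 1013 data
bytes. All function names are hypothetical.

```python
def parity_count(n_data):
    """Smallest r with 2**r >= n_data + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < n_data + r + 1:
        r += 1
    return r


def encode(data):
    """Return the coded frame; index i holds Hamming position i + 1."""
    r = parity_count(len(data))
    n = len(data) + r
    code = bytearray(n)
    it = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two: data position
            code[pos - 1] = next(it)
    for k in range(r):
        p = 1 << k
        acc = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                acc ^= code[pos - 1]  # XOR acts on all 8 columns at once
        code[p - 1] = acc
    return code


def decode(code):
    """Correct up to one flipped bit per bit column and return the data."""
    n = len(code)
    fixed = bytearray(code)
    r = n.bit_length()  # number of power-of-two positions <= n
    syndromes = []
    for k in range(r):
        acc = 0
        for pos in range(1, n + 1):
            if pos & (1 << k):
                acc ^= code[pos - 1]
        syndromes.append(acc)
    for bit in range(8):  # each bit column is decoded independently
        err_pos = sum(((s >> bit) & 1) << k for k, s in enumerate(syndromes))
        if 1 <= err_pos <= n:
            fixed[err_pos - 1] ^= 1 << bit
    return bytes(fixed[pos - 1] for pos in range(1, n + 1) if pos & (pos - 1))


frame = bytes(range(20))
coded = encode(frame)
coded[10] ^= 0b00010110  # triple error inside a single stored byte
recovered = decode(coded)
```

Because the three flipped bits fall in three different columns, each
column sees only a single error and the original frame is recovered,
which is the property the text relies on for multiple upsets within one
memory word.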

In the system layer, the main fault detection mechanisms are the
diagnosis program that is executed periodically and the verification of
the address of messages exchanged between the processes that is
accomplished by the routing process. The error detection mechanisms in
the application layer depend on the specific application and will not be
discussed in this paper. Generally, these mechanisms perform the
verification of the consistency of the computed data.

	At present, SACI-1 is in the qualification test phase. Current plans
are for it to be launched as a piggyback payload with the China-Brazil
Earth Resources Satellite (CBERS) on the Chinese Long March 4 launcher
in the first semester of 1998.