A reliable distributed file system for UNIX based on NFS

Mario Magalhaes Leboute (leboute@pro.via-rs.com.br)
Taisy Silva Weber (taisy@inf.ufrgs.br)
Instituto de Informatica - UFRGS
Caixa Postal 15064, 91501-970, Porto Alegre, RS
FAX 55.051.3191576

Full paper available as a PostScript file.
Extended abstract

Redundancy is the key to providing fault tolerance in computer systems. Although sophisticated fault tolerance techniques can guarantee the continuity of highly critical tasks, such as the computer control of an aircraft, even after the loss of processors, they are usually not suitable for ordinary computer systems. In an ordinary installation, such as a corporate network, the most critical computer resource is data storage. Losses due to data corruption and data unavailability largely outweigh all other causes of loss in computer systems, and data availability also positively affects the general availability of the systems on the network. Considering the growing need for reliable data storage, several specific techniques ([BHI90], [BRE86], [ELA85], [HIS90], [KUM91], [LAD90], [LEV90], [LIS91], [PUR87]) have been proposed to increase data dependability. However, none of these proposals has effectively become a standard, nor has any commercial system come even remotely close to offering all the features that fault tolerance theory makes possible for replicated distributed file systems.

RNFS aims to provide a replicated distributed file system for UNIX, built on one of the most popular distributed file systems: NFS [SAN85], [RFC1094]. The choice of NFS is based on its wide acceptance, its well-designed administration model, and its stateless design, which is in itself convenient for fault tolerance. NFS has already been used as a starting point for fault-tolerant file systems [LIS91], but these systems did not reach the mainstream. RNFS reaches dependability through automatic file replication over an arbitrary number of servers in a network. When a server fails, the system automatically makes a copy of the data available from another server.

The RNFS replication model is based on a primary copy algorithm. Primary copy algorithms use a centralized entity to coordinate updates to the pool of data copies and to ensure data consistency. In RNFS, one server in each group of replicated data is dynamically chosen as the group coordinator and is called the primary server. The data stored on this server is called the primary copy, and the other servers in the group are called secondary servers. All clients communicate only with the primary server, which is responsible for broadcasting updates to the secondary servers in order to keep their data up to date. If the primary server fails, an election algorithm promotes a new primary from among the remaining secondary servers, and all clients switch their access to this server. If secondary servers fail, they are removed from the update cycle on the primary and the system continues to run.

RNFS adopts the UNIX "import/export" file volume as its unit of replication. This keeps administration compatible with existing NFS and avoids the fragmentation of replicating individual files. It is also far more flexible than the complete server or disk mirroring implemented by most commercial systems, because it makes replication highly configurable: RNFS can support many replicated volumes at the same time, spread over a pool of servers in any arbitrary arrangement, and the replication degree is configurable on a per-volume basis. RNFS also aims at complete NFS compatibility, with clients and servers of both systems accessing each other transparently, at the cost of some fault-tolerance features on NFS-RNFS mounts.
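To make the primary copy model concrete, the following minimal sketch in C models a replication group for one exported volume: clients send every operation to the primary, the primary forwards updates to each live secondary in a sequential loop, and an election routine promotes the first live server when the primary fails. All names and structures are illustrative, not taken from the RNFS implementation; real servers exchange NFS RPCs rather than calling local functions.

/* Illustrative sketch of the RNFS primary copy model for one replicated
 * volume.  All names are hypothetical; this is not RNFS source code. */
#include <stdio.h>

#define MAX_SERVERS 8

struct server {
    int id;
    int alive;                      /* maintained by the "I'm alive" protocol */
};

struct repl_group {
    char          volume[32];       /* replicated volume: the unit of replication */
    struct server servers[MAX_SERVERS];
    int           nservers;
    int           primary;          /* index of the current primary server */
};

/* A client update goes only to the primary, which synchronously forwards
 * it to every live secondary (sequential loop, as in the current prototype). */
static void client_write(struct repl_group *g, const char *data)
{
    printf("primary %d applies write: %s\n", g->servers[g->primary].id, data);
    for (int i = 0; i < g->nservers; i++) {
        if (i == g->primary || !g->servers[i].alive)
            continue;               /* failed secondaries are skipped */
        printf("  forwarded to secondary %d\n", g->servers[i].id);
    }
}

/* When the primary fails, promote the first live server of the group. */
static void elect_primary(struct repl_group *g)
{
    for (int i = 0; i < g->nservers; i++) {
        if (g->servers[i].alive) {
            g->primary = i;
            printf("server %d elected primary of %s\n",
                   g->servers[i].id, g->volume);
            return;
        }
    }
}

int main(void)
{
    struct repl_group g = { "/export/home",
                            { {1, 1}, {2, 1}, {3, 1} }, 3, 0 };

    client_write(&g, "block 42");
    g.servers[0].alive = 0;         /* the primary fails                 */
    elect_primary(&g);              /* clients switch to the new primary */
    client_write(&g, "block 43");
    return 0;
}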
The fault model supported by RNFS follows the expected behavior of a network of UNIX workstations and servers, linked by gateways and routers and powered by an unreliable electrical source. In this scenario, an important failure event is the partitioning of the network, which RNFS handles with a "minimal connected group" approach. Simultaneous faults can also occur, for instance when power fails across the whole organization. Another important aspect taken into account is that file servers can exhibit "pseudo-faults", i.e., invalid states of a particular copy of the data that do not result from hardware or software failures, but from situations such as a full disk unit or an arbitrary failure in the synchronization of replicated volumes. RNFS detects and signals these faults differently from hardware faults, and they are not subject to automatic correction.

Another important feature of RNFS is the automatic recovery of failed servers. When a failed server comes back up, it is detected by an "I'm alive" protocol, and the current primary server tries to recover its contents through a volume transfer algorithm. During a volume transfer, the recovering server is kept in a special state and the full contents of its out-of-date replicated volumes are copied to it. If the operation succeeds, the recovered server becomes a secondary and rejoins the group. A more efficient recovery algorithm, based on version numbers, is described further below.

The performance of client operations, especially updates, is always an important bottleneck in replicated file systems. RNFS pursues performance by not requiring atomic operations at any level. The synchronous diffusion of updates by the primary server ensures that client operations follow "at most once" semantics on all servers under any combination of failures, and the NFS principle of idempotent operations is used to allow a "redo recovery" of partially completed operations during server recovery. RNFS currently uses a synchronous sequential loop to broadcast updates from the primary to the secondary servers; the use of RPC broadcast to speed up write operations is under current research.

An especially demanding issue in building replicated file systems based on NFS is the internal representation of file handles. File handles are converted three times during a normal NFS access: first from a user handle to a kernel VFS vnode, then from the vnode to an NFS file handle, and again on the server to a VFS vnode. The translation from the kernel vnode to an NFS file handle is particularly difficult to adapt to replication schemes, because the NFS implementation uses "reverse paths" (i.e., lists of physical directory inode numbers), together with "file handle signatures", to form an NFS file handle. This means that clients keep in their kernels, in their vnode pool, a sort of image of the physical file system structure of the server, and this image cannot be easily translated to another server when access is switched after a failure, or (more importantly) during the broadcast of updates from the primary to the secondary servers. We believe that this particular problem has prevented the full development of NFS-based replicated file systems to date, and the research team has concentrated heavily on it.
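The sketch below illustrates why such handles are hard to translate between replicas. The layout is hypothetical (it is not the actual Sun implementation): an NFS-style handle is built from server-local inode numbers along the file's reverse path plus a per-server signature, so the same file on another replica yields a different handle.

/* Hypothetical layout of an NFS-style file handle, illustrating why a
 * handle built on one server cannot be reused on a replica: the reverse
 * path is a list of inode numbers local to that server's disk. */
#include <stdint.h>
#include <stdio.h>

#define MAX_DEPTH 8

struct fhandle {
    uint32_t fsid;                      /* exported volume id                  */
    uint32_t reverse_path[MAX_DEPTH];   /* inode numbers from file up to root  */
    uint32_t depth;
    uint32_t signature;                 /* per-server "file handle signature"  */
};

/* Two servers of the same replication group hold the same file tree,
 * but their physical inode numbers differ, so the handles differ too. */
int main(void)
{
    struct fhandle on_primary   = { 7, { 5034, 211, 2 }, 3, 0xbeef0001 };
    struct fhandle on_secondary = { 7, { 8190, 774, 2 }, 3, 0x5ec00002 };

    printf("same file, primary:   signature %#x, leaf inode %u\n",
           on_primary.signature, on_primary.reverse_path[0]);
    printf("same file, secondary: signature %#x, leaf inode %u\n",
           on_secondary.signature, on_secondary.reverse_path[0]);
    return 0;
}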
RNFS currently solves the handle translation problem by maintaining a "multiple signature table" on the RNFS servers, together with a cross-mounting scheme among all servers of a replication group, which allows fast translation of handles during update diffusion. An explicit "inode revalidation" system call is currently being implemented on clients to recompute the vnodes of replicated files when their primary server fails and access has to be switched to a new one.

The current research also covers a differential recovery algorithm that brings failed servers up to date from the data on the primary. This algorithm uses slightly modified Ext2 (extended file system, type 2) physical inodes on the servers' file systems to hold a 32-bit file version number. The version number is updated together with the file data at every physical write and reflects the number of the last write operation on the file, or that number minus one (again, operations are not atomic). The counters are not allowed to wrap around: if the maximum possible integer value is reached they stop there, and this value also denotes an "invalid counter state". During server recovery, the file version numbers on the primary and secondary servers are compared by a recursive search algorithm, and only the files or directories whose version numbers differ, or whose version numbers are invalid, are transferred from the primary to the secondary servers; afterwards the version numbers are reset to zero on all connected copies. The contents of files or directories created during a failure are always copied completely to the recovering server through the basic recursive copy method. We believe this method will provide very fast recovery from common failures, and an enormous advantage in the recovery from short failures over large file volumes. We are also working to prove the correctness of this scheme under all possible combinations of failures allowed by the fault model, and are studying the extension of this solution to UNIX file systems other than Ext2.
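The following sketch shows the idea behind the differential recovery walk. It is an in-memory model with illustrative names, not the Ext2-level implementation: the tree is traversed recursively, entries whose version numbers match are skipped, entries that differ or carry the invalid value are copied from the primary, and the counters of the synchronized copies are reset to zero.

/* Illustrative in-memory model of the differential recovery walk.
 * In the real system the version number lives in a modified Ext2 inode
 * and "copy" means a transfer from the primary to the recovering server.
 * Children are matched positionally here for brevity; a real walk would
 * match directory entries by name. */
#include <stdint.h>
#include <stdio.h>

#define INVALID_VERSION UINT32_MAX   /* saturated counter: copy unconditionally */

struct node {
    const char  *name;
    uint32_t     version;           /* last write number recorded in the inode */
    struct node *child[4];          /* NULL-terminated list of children        */
};

/* Compare the primary and secondary copies of one subtree; transfer only
 * what differs or is invalid, then reset both version counters to zero. */
static void differential_recover(struct node *pri, struct node *sec)
{
    if (sec == NULL || pri->version != sec->version ||
        pri->version == INVALID_VERSION) {
        printf("copy %s from primary\n", pri->name);
    } else {
        printf("skip %s (up to date)\n", pri->name);
    }
    pri->version = 0;
    if (sec)
        sec->version = 0;

    for (int i = 0; pri->child[i] != NULL; i++) {
        struct node *s = (sec && sec->child[i]) ? sec->child[i] : NULL;
        differential_recover(pri->child[i], s);
    }
}

int main(void)
{
    /* The primary holds one file that was written after the secondary failed. */
    struct node p_file = { "home/a.txt", 12, { NULL } };
    struct node p_root = { "home",        3, { &p_file, NULL } };

    struct node s_file = { "home/a.txt", 10, { NULL } };
    struct node s_root = { "home",        3, { &s_file, NULL } };

    differential_recover(&p_root, &s_root);
    return 0;
}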
Bibliography

[BHI90] BHIDE, A.; ELNOZAHY, E.; MORGAN, S. Implicit Replication in a Network File Server. In: WORKSHOP ON MANAGEMENT OF REPLICATED DATA, 1990, Houston. Washington: IEEE Computer Society Press, c1990.
[BRE86] BRERETON, O. P. Management of Replicated Files in a UNIX Environment. Software - Practice and Experience, Sussex, England, v.16, p.771-780, Aug. 1986.
[ELA85] EL-ABBADI, A.; SKEEN, D.; CRISTIAN, F. An Efficient Fault-Tolerant Protocol for Replicated Data Management. In: SYMP. ON PRINCIPLES OF DATABASE SYSTEMS, 4., [S.l.], 1985. Proceedings... New York: ACM Press, 1985. p.215-229.
[HIS90] HISGEN, A. et al. Granularity and Semantic Level of Replication in the Echo Distributed File System. In: WORKSHOP ON MANAGEMENT OF REPLICATED DATA, 1990, Houston. Washington: IEEE Computer Society Press, c1990.
[KUM91] KUMAR, A. Hierarchical Quorum Consensus: A New Algorithm for Managing Replicated Data. IEEE Trans. on Computers, New York, v.40, n.9, p.996-1004, Sept. 1991.
[LAD90] LADIN, R.; LISKOV, B.; SHRIRA, L. Lazy Replication: Exploiting the Semantics of Distributed Services. Operating Systems Review, New York, v.25, n.1, p.49-55, Jan. 1991.
[LEV90] LEVY, E.; SILBERSCHATZ, A. Distributed File Systems: Concepts and Examples. Computing Surveys, New York, v.22, n.4, p.321-374, Dec. 1990.
[LIS91] LISKOV, B. et al. Replication in the Harp File System. Operating Systems Review, New York, v.25, n.5, p.226-238, 1991.
[PUR87] PURDIN, T.; SCHLICHTING, R.; ANDREWS, G. File Replication Facility for Berkeley UNIX. Software - Practice and Experience, Sussex, England, v.17, p.923-940, Dec. 1987.
[RFC1094] IETF Request for Comments 1094. NFS: Network File System Protocol Specification. Available from the server www.internic.net.
[SAN85] SANDBERG, R. et al. Design and Implementation of the Sun Network File System. In: USENIX ASSOCIATION CONFERENCE, 1985, Berkeley. Proceedings... Berkeley: USENIX, 1985.