A reliable distributed file system for UNIX based on NFS

Mario Magalhaes Leboute (leboute@pro.via-rs.com.br)
Taisy Silva Weber (taisy@inf.ufrgs.br)
Instituto de Informatica - UFRGS
Caixa Postal 15064, 91501-970, Porto Alegre, RS
FAX 55.051.3191576

Full paper available as a PostScript file.
Extended abstract

Redundancy is the key to providing fault tolerance in computer systems. Although sophisticated fault tolerance techniques can guarantee the continuity of highly critical tasks, such as the computer control of an aircraft, even after the loss of processors, they are usually not suitable for ordinary computer systems. In an ordinary installation, such as a corporate network, the most critical computer resource is data storage. Losses due to data corruption and data unavailability largely outweigh all other causes of loss in computer systems, and data availability also positively affects the general availability of the systems on the network. Considering the growing need for reliable data storage, several specific techniques ([BHI90], [BRE86], [ELA85], [HIS90], [KUM91], [LAD90], [LEV90], [LIS91], [PUR87]) have been proposed to increase data dependability. However, none of these proposals has effectively become a standard, nor has any commercial system come even remotely close to offering all the features that fault tolerance theory makes possible for replicated distributed file systems.

RNFS aims to provide a replicated distributed file system for UNIX, built on one of the most popular distributed file systems: NFS [SAN85], [RFC1094]. The choice of NFS is based on its wide acceptance, its well-designed administration model, and its stateless design, which is in itself convenient for fault tolerance. NFS has already been used as a starting point for fault-tolerant file systems [LIS91], but these systems did not reach the mainstream. RNFS reaches dependability through automatic file replication over an arbitrary number of servers in a network. When a server fails, the system automatically makes a copy of the data available from another server.

The RNFS replication model is based on a primary copy algorithm. Primary copy algorithms use a centralized entity to coordinate updates to the pool of data copies and to ensure data consistency. In RNFS, one server in each group of replicated data is dynamically chosen as the group coordinator and is called the primary server. The data stored on this server is called the primary copy, and the other servers in the group are called secondary servers. All clients communicate only with the primary server, which is responsible for broadcasting updates to the secondary servers in order to keep their data up to date. If the primary server fails, an election algorithm promotes a new primary from among the remaining secondary servers, and all clients switch their access to this server. If secondary servers fail, they are removed from the update cycle on the primary and the system continues to run.

RNFS adopts the UNIX "import/export" file volume as its unit of replication. This keeps administration compatible with existing NFS and avoids the fragmentation of replicating individual files. It is also far more flexible than the complete server or disk mirroring implemented by most commercial systems, because it makes replication highly configurable: RNFS can support many replicated volumes at the same time, spread over a pool of servers in any arbitrary arrangement, and the replication degree is configurable on a per-volume basis. RNFS also aims at complete NFS compatibility, with clients and servers of both systems accessing each other transparently, at the cost of some fault-tolerance features on NFS-RNFS mounts.
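To make the primary copy model concrete, the following minimal sketch in C models a replication group for one exported volume: clients send every operation to the primary, the primary forwards updates to each live secondary in a sequential loop, and an election routine promotes the first live server when the primary fails. All names and structures are illustrative, not taken from the RNFS implementation; real servers exchange NFS RPCs rather than calling local functions.

/* Illustrative sketch of the RNFS primary copy model for one replicated
 * volume.  All names are hypothetical; this is not RNFS source code. */
#include <stdio.h>

#define MAX_SERVERS 8

struct server {
    int id;
    int alive;                      /* maintained by the "I'm alive" protocol */
};

struct repl_group {
    char          volume[32];       /* replicated volume: the unit of replication */
    struct server servers[MAX_SERVERS];
    int           nservers;
    int           primary;          /* index of the current primary server */
};

/* A client update goes only to the primary, which synchronously forwards
 * it to every live secondary (sequential loop, as in the current prototype). */
static void client_write(struct repl_group *g, const char *data)
{
    printf("primary %d applies write: %s\n", g->servers[g->primary].id, data);
    for (int i = 0; i < g->nservers; i++) {
        if (i == g->primary || !g->servers[i].alive)
            continue;               /* failed secondaries are skipped */
        printf("  forwarded to secondary %d\n", g->servers[i].id);
    }
}

/* When the primary fails, promote the first live server of the group. */
static void elect_primary(struct repl_group *g)
{
    for (int i = 0; i < g->nservers; i++) {
        if (g->servers[i].alive) {
            g->primary = i;
            printf("server %d elected primary of %s\n",
                   g->servers[i].id, g->volume);
            return;
        }
    }
}

int main(void)
{
    struct repl_group g = { "/export/home",
                            { {1, 1}, {2, 1}, {3, 1} }, 3, 0 };

    client_write(&g, "block 42");
    g.servers[0].alive = 0;         /* the primary fails                 */
    elect_primary(&g);              /* clients switch to the new primary */
    client_write(&g, "block 43");
    return 0;
}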
The fault model supported by RNFS follows the expected behavior of a network of UNIX workstations and servers, linked by gateways and routers and powered by an unreliable electrical source. In this scenario, an important failure event is the partitioning of the network, which RNFS handles with a "minimal connected group" approach. Simultaneous faults can also occur, for instance when power fails across the whole organization. Another important aspect taken into account is that file servers can exhibit "pseudo-faults", i.e., invalid states of a particular copy of the data that do not result from hardware or software failures, but from situations such as a full disk unit or an arbitrary failure in the synchronization of replicated volumes. RNFS detects and signals these faults differently from hardware faults, and they are not subject to automatic correction.

Another important feature of RNFS is the automatic recovery of failed servers. When a failed server comes back up, it is detected by an "I'm alive" protocol, and the current primary server tries to recover its contents through a volume transfer algorithm. During a volume transfer, the recovering server is kept in a special state and the full contents of its out-of-date replicated volumes are copied to it. If the operation succeeds, the recovered server becomes a secondary and rejoins the group. A more efficient recovery algorithm, based on version numbers, is described further below.

The performance of client operations, especially updates, is always an important bottleneck in replicated file systems. RNFS pursues performance by not requiring atomic operations at any level. The synchronous diffusion of updates by the primary server ensures that client operations follow "at most once" semantics on all servers under any combination of failures, and the NFS principle of idempotent operations is used to allow a "redo recovery" of partially completed operations during server recovery. RNFS currently uses a synchronous sequential loop to broadcast updates from the primary to the secondary servers; the use of RPC broadcast to speed up write operations is under current research.

An especially demanding issue in building replicated file systems based on NFS is the internal representation of file handles. File handles are converted three times during a normal NFS access: first from a user handle to a kernel VFS vnode, then from the vnode to an NFS file handle, and again on the server to a VFS vnode. The translation from the kernel vnode to an NFS file handle is particularly difficult to adapt to replication schemes, because the NFS implementation uses "reverse paths" (i.e., lists of physical directory inode numbers), together with "file handle signatures", to form an NFS file handle. This means that clients keep in their kernels, in their vnode pool, a sort of image of the physical file system structure of the server, and this image cannot be easily translated to another server when access is switched after a failure, or (more importantly) during the broadcast of updates from the primary to the secondary servers. We believe that this particular problem has prevented the full development of NFS-based replicated file systems to date, and the research team has concentrated heavily on it.
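The sketch below illustrates why such handles are hard to translate between replicas. The layout is hypothetical (it is not the actual Sun implementation): an NFS-style handle is built from server-local inode numbers along the file's reverse path plus a per-server signature, so the same file on another replica yields a different handle.

/* Hypothetical layout of an NFS-style file handle, illustrating why a
 * handle built on one server cannot be reused on a replica: the reverse
 * path is a list of inode numbers local to that server's disk. */
#include <stdint.h>
#include <stdio.h>

#define MAX_DEPTH 8

struct fhandle {
    uint32_t fsid;                      /* exported volume id                  */
    uint32_t reverse_path[MAX_DEPTH];   /* inode numbers from file up to root  */
    uint32_t depth;
    uint32_t signature;                 /* per-server "file handle signature"  */
};

/* Two servers of the same replication group hold the same file tree,
 * but their physical inode numbers differ, so the handles differ too. */
int main(void)
{
    struct fhandle on_primary   = { 7, { 5034, 211, 2 }, 3, 0xbeef0001 };
    struct fhandle on_secondary = { 7, { 8190, 774, 2 }, 3, 0x5ec00002 };

    printf("same file, primary:   signature %#x, leaf inode %u\n",
           on_primary.signature, on_primary.reverse_path[0]);
    printf("same file, secondary: signature %#x, leaf inode %u\n",
           on_secondary.signature, on_secondary.reverse_path[0]);
    return 0;
}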
RNFS currently solves the handle translation problem by maintaining a "multiple signature table" on the RNFS servers, together with a cross-mounting scheme among all servers of a replication group, which allows fast translation of handles during update diffusion. An explicit "inode revalidation" system call is currently being implemented on clients to recompute the vnodes of replicated files when their primary server fails and access has to be switched to a new one.

The current research also covers a differential recovery algorithm that brings failed servers up to date from the data on the primary. This algorithm uses slightly modified Ext2 (extended file system, type 2) physical inodes on the servers' file systems to hold a 32-bit file version number. The version number is updated together with the file data at every physical write and reflects the number of the last write operation on the file, or that number minus one (again, operations are not atomic). The counters are not allowed to wrap around: if the maximum possible integer value is reached they stop there, and this value also denotes an "invalid counter state". During server recovery, the file version numbers on the primary and secondary servers are compared by a recursive search algorithm, and only the files or directories whose version numbers differ, or whose version numbers are invalid, are transferred from the primary to the secondary servers; afterwards the version numbers are reset to zero on all connected copies. The contents of files or directories created during a failure are always copied completely to the recovering server through the basic recursive copy method. We believe this method will provide very fast recovery from common failures, and an enormous advantage in the recovery from short failures over large file volumes. We are also working to prove the correctness of this scheme under all possible combinations of failures allowed by the fault model, and are studying the extension of this solution to UNIX file systems other than Ext2.
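The following sketch shows the idea behind the differential recovery walk. It is an in-memory model with illustrative names, not the Ext2-level implementation: the tree is traversed recursively, entries whose version numbers match are skipped, entries that differ or carry the invalid value are copied from the primary, and the counters of the synchronized copies are reset to zero.

/* Illustrative in-memory model of the differential recovery walk.
 * In the real system the version number lives in a modified Ext2 inode
 * and "copy" means a transfer from the primary to the recovering server.
 * Children are matched positionally here for brevity; a real walk would
 * match directory entries by name. */
#include <stdint.h>
#include <stdio.h>

#define INVALID_VERSION UINT32_MAX   /* saturated counter: copy unconditionally */

struct node {
    const char  *name;
    uint32_t     version;           /* last write number recorded in the inode */
    struct node *child[4];          /* NULL-terminated list of children        */
};

/* Compare the primary and secondary copies of one subtree; transfer only
 * what differs or is invalid, then reset both version counters to zero. */
static void differential_recover(struct node *pri, struct node *sec)
{
    if (sec == NULL || pri->version != sec->version ||
        pri->version == INVALID_VERSION) {
        printf("copy %s from primary\n", pri->name);
    } else {
        printf("skip %s (up to date)\n", pri->name);
    }
    pri->version = 0;
    if (sec)
        sec->version = 0;

    for (int i = 0; pri->child[i] != NULL; i++) {
        struct node *s = (sec && sec->child[i]) ? sec->child[i] : NULL;
        differential_recover(pri->child[i], s);
    }
}

int main(void)
{
    /* The primary holds one file that was written after the secondary failed. */
    struct node p_file = { "home/a.txt", 12, { NULL } };
    struct node p_root = { "home",        3, { &p_file, NULL } };

    struct node s_file = { "home/a.txt", 10, { NULL } };
    struct node s_root = { "home",        3, { &s_file, NULL } };

    differential_recover(&p_root, &s_root);
    return 0;
}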
Bibliography

[BHI90] BHIDE, A.; ELNOZAHY, E.; MORGAN, S. Implicit Replication in a Network File Server. In: WORKSHOP ON MANAGEMENT OF REPLICATED DATA, 1990, Houston. Washington: IEEE Computer Society Press, c1990.
[BRE86] BRERETON, O. P. Management of Replicated Files in a UNIX Environment. Software - Practice and Experience, Sussex, England, v.16, p.771-780, Aug. 1986.
[ELA85] EL-ABBADI, A.; SKEEN, D.; CRISTIAN, F. An Efficient Fault-Tolerant Protocol for Replicated Data Management. In: SYMP. ON PRINCIPLES OF DATABASE SYSTEMS, 4., [S.l.], 1985. Proceedings... New York: ACM Press, 1985. p.215-229.
[HIS90] HISGEN, A. et al. Granularity and Semantic Level of Replication in the Echo Distributed File System. In: WORKSHOP ON MANAGEMENT OF REPLICATED DATA, 1990, Houston. Washington: IEEE Computer Society Press, c1990.
[KUM91] KUMAR, A. Hierarchical Quorum Consensus: A New Algorithm for Managing Replicated Data. IEEE Trans. on Computers, New York, v.40, n.9, p.996-1004, Sept. 1991.
[LAD90] LADIN, R.; LISKOV, B.; SHRIRA, L. Lazy Replication: Exploiting the Semantics of Distributed Services. Operating Systems Review, New York, v.25, n.1, p.49-55, Jan. 1991.
[LEV90] LEVY, E.; SILBERSCHATZ, A. Distributed File Systems: Concepts and Examples. Computing Surveys, New York, v.22, n.4, p.321-374, Dec. 1990.
[LIS91] LISKOV, B. et al. Replication in the Harp File System. Operating Systems Review, New York, v.25, n.5, p.226-238, 1991.
[PUR87] PURDIN, T.; SCHLICHTING, R.; ANDREWS, G. File Replication Facility for Berkeley UNIX. Software - Practice and Experience, Sussex, England, v.17, p.923-940, Dec. 1987.
[RFC1094] IETF Request for Comments 1094. NFS: Network File System Protocol Specification. Available from the server www.internic.net.
[SAN85] SANDBERG, R. et al. Design and Implementation of the Sun Network File System. In: USENIX ASSOCIATION CONFERENCE, 1985, Berkeley. Proceedings... Berkeley: USENIX, 1985.