NFS Recovery and Client Migration
From Linux NFS
Exporting a shared cluster file system with multiple NFS servers can increase performance and availability through load-balancing and failover. The NFSv4 protocol has migration and failover features, but implementation is challenging, and some protocol extensions are required.
Consider a few scenarios, in order of increasing complexity:
- Server 1 and server 2 share a cluster filesystem. Server 1 runs an NFS server, but server 2 doesn't. When server 1 fails or is shut down, an NFS service is started on server 2, which also takes over server 1's IP address.
- Servers 1, 2, and 3 share a cluster filesystem. Everything is as in the previous scenario, except that server 3 is also running a live NFS server throughout. Here, we need to ensure that server 3 is not allowed to acquire locks that it shouldn't while server 2 is taking over from server 1.
- Servers 1, 2, and 3 share a cluster filesystem. Everything is as in the previous scenario, except that server 2 is already running a live NFS server. We must handle the failover with minimal interruption to the preexisting clients of server 2.
- As in the previous scenario, except that we expect to keep server 1 running throughout, and migrate only some of its clients. This allows us to do load-balancing.
The implementation may choose to block all locking activity during the transition, possibly for as long as a grace period, which is on the order of a minute. This may be simpler to implement and may be adequate for some applications. However, we prefer to implement the transition period in such a way that applications see no significant delay. A variety of intermediate behaviors (e.g., limiting delays to certain files) are also possible.
Finally, the implementation may allow the client to continue to use all of its existing state on the new server, or may require the client to go through the same state recovery process it would go through on server reboot. The latter approach requires less intrusive modifications to the NFS server, and can be done without requiring that locking activity cease, but the former method still permits optimizations that may reduce latency for the migrating client.
We are exploring most of these possibilities as part of our Linux NFSv4 server implementation effort.
In the process of designing the migration implementation for Linux, we have identified two small deficiencies in the NFSv4 protocol that limit the migration scenarios that an NFSv4 implementation can reliably support.
Migration to a live server
Scenarios 3 and 4 above involve migrating clients to an NFSv4 server that is already serving clients of its own. This causes some problems, which require a little background to explain.
To manage client state, we first need a reliable way to identify individual clients; ideally, it should allow us to identify clients even across client reboots. Thanks to NAT, DHCP, and userspace NFS clients, the IP address is not a reliable way to identify clients. Therefore the NFSv4 protocol uses a client-generated "client identifier" for this purpose.
However, to avoid some potential problems caused by servers that have multiple IP addresses, the NFSv4 spec requires the client to calculate the client identifier in such a way that it is always different for different server IP addresses.
This creates some confusion during migration: the client identifier that the client presents to the new server will differ from the one it presented to the old server, so we do not have a way to track the client across the migration.
The problem can be avoided in scenarios 1 and 2 by allowing the new server to take over the old server IP address.
The problem has been discussed in the NFSv4 IETF working group, but a solution has not yet been agreed on. Nevertheless, we expect the problem to be solved in NFSv4.1.
Transparent state migration
We have identified one small protocol change necessary to support transparent migration of state--that is, migration that doesn't require the client to perform lock reclaims as it would on server reboot.
The problem is that a client, when migrating, does not know whether the server to which it is migrating wants it to continue to use the state it acquired on the previous server, or wants it to reacquire its state. The client could attempt to find out by trying to use its current state with the new server and seeing what kind of error it gets back. However, there is no guarantee this will work--accidental collisions in the stateids handed out by the two servers may mean, for example, that the server cannot return a helpful error in this case.
The current NFSv4.1 draft partially solves this problem by defining a new error (NFS4ERR_MOVED_DATA_AND_STATE) that the server can return to simultaneously trigger a client migration and to indicate to the client that the new server is prepared to accept state handed out by the old server. (This solution is only partial because it doesn't help with failover--in that case, it's too late for the old server to return any errors to the client.) The final 4.1 specification will probably contain a more comprehensive solution, so at this point we're confident that the problem will be solved.
Linux implementation issues
Scenario 1 and NFSv4 reboot recovery
The current Linux implementation supports the above scenario 1 with NFSv2/v3. (See, for example, Falko Timme's Setting up a high-availability NFS server, which actually shares data at the block level instead of using a cluster filesystem.) It is possible to support the same scenario in NFSv4 as long as the directory where the NFSv4 server stores its reboot recovery information (/var/lib/nfs/v4recovery/ by default) is located on shared storage. However, there is a regression compared to v2/v3, because the v2/v3 implementation also provides synchronous callouts to arbitrary scripts whenever that information changes, so that the information can be shared using methods other than shared storage. The NFSv4 reboot recovery information is currently under redesign, and one of the side effects of the new design will be to allow such callouts. This work is not yet completed.
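As one possible shape for such a setup (a sketch only; the hostnames, addresses, and device are hypothetical, and this follows the Heartbeat-style configuration used in setups like the Timme article above), the failover pair can be wired so that the service IP, the shared storage carrying /var/lib/nfs (including v4recovery/), and the NFS service all move together:

```
# /etc/ha.d/haresources (Heartbeat v1 style) -- hypothetical values.
# server1 normally owns the service IP, the shared block device
# mounted on /var/lib/nfs (so v4recovery/ travels with it), and the
# NFS service; on failure, the same stack is started on server2.
server1 IPaddr::192.168.0.100/24 \
        Filesystem::/dev/drbd0::/var/lib/nfs::ext3 \
        nfs-kernel-server
```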
Scenario 2 and grace period control
The simplest way to support scenario 2 is to require clients of server 1 to recover their state on server 2 using the current server reboot recovery mechanism, by forcing both servers 2 and 3 to observe a grace period.
We have patches to implement this approach available from the "server-state-recovery" branch of our public git repository; see a browsable version of the repository.
This approach is unsatisfactory because it forces the NFS server to observe a grace period for all exported filesystems, even those that aren't affected (or even shared across nodes), and because it blocks all locking activity across the cluster for the duration of the grace period.
Therefore, we have a second design that allows us to limit the impact: instead of simply forcing servers 2 and 3 into grace, we remove all grace period checking from the NFS server itself, instead allowing the underlying filesystem to enforce the grace period when it is called to perform lock operations. (This new behavior is enabled only for filesystems, such as cluster filesystems, that define lock methods; behavior for other filesystems is unchanged.) In order to enable the filesystem to make correct grace period decisions, we also need to distinguish between "normal" lock operations and reclaims; we accomplish this by flagging reclaim locks when they are passed to the filesystem. More detail on this design is available.
The cluster filesystem may then choose how to handle the migration; it may choose to continue to enforce a grace period globally across the whole filesystem, but it is now given the information to enable it to make more sophisticated decisions if it prefers.
The patches in our git repository, referenced above, include incomplete support for this new design.
Due to the minimal support for cluster filesystems in the mainstream Linux kernel, these patches (like the byte-range locking patches) are unlikely to be accepted for the time being.
While implementing this, we also noticed a problem with our current NFSv4 implementation: its grace period is not necessarily synchronized with the grace period used by lockd, which can cause problems in a mixed NFSv2/v3/v4 environment. Patches fixing this are available from our git repository, and we expect them to proceed upstream normally.
Scenarios 3 and 4, and reboot recovery interfaces
Like statd, the NFSv4 server is required to maintain information in stable storage in order to keep track of which clients have successfully established state with it. This solves certain problems identified in RFC 3530, where combinations of server reboots and network partitions can lead to situations where neither client nor server could otherwise determine whether the client should still be allowed to reclaim state after a reboot.
We are in the process of redesigning the linux implementation of this system, partly for reasons given under "Scenario 1 and NFSv4 reboot recovery" above.
As part of this new design, we need a way for a userland program to tell the NFS server on startup which clients have valid state with it (after that program has retrieved this information from stable storage).
Scenarios 3 and 4 require a kernel interface allowing an administrator to migrate particular clients from and to NFS servers. To this end, we plan to use the same reboot recovery interface.
The interface will consist of a call that takes a client identifier and a status (as an integer).
For normal nfsd startup, one call will be made for each known client, with a status of 0.
Similarly, to inform a server that a new client is migrating to it (and hence that it should allow lock reclaims from that client), we will again make one call for that client with status 0.
To inform a client that it should initiate a migration event, by returning NFS4ERR_MOVED to the client, we'll make a call for that client with status NFS4ERR_MOVED.
This interface may also be extended in the future to allow for, for example, administrative revocation of locks.