Cluster client migration prototype
From Linux NFS
As part of CITI's work with IBM, we looked at some of the issues involved with NFSv4 client migration and developed an initial prototype. Our setup involved a cluster of equivalent NFS servers attached to a GFS2 disk array, with each server exporting the same directory from the GFS2 filesystem. The intent was to provide an interface by which an administrator could selectively migrate NFSv4 clients from one server to another (e.g., to take a server down for maintenance).
The prototype is a proof-of-concept: the "right way" to migrate a client would be to transfer all of the client-related state from one server to another and then have the client reorient to the new server and continue without interruption; instead, this prototype leverages parts of the existing reboot-recovery process. To briefly explain reboot-recovery, when a Linux NFSv4 server starts, it enters a ~90sec phase called a grace period; during this time, eligible clients may contact the server and reclaim state for open files and locks they were holding prior to a server crash/reboot. In order to allow clients to reclaim state without conflicts, new opens, etc, are disallowed during the grace period.
During a migration, the cluster is put into an artificial grace period and the target-server is notified that a new client is eligible to perform reclaims. When the client contacts the source-server, it receives an error message saying that the file system has moved and sees that it should migrate to the target-server. The client establishes a connection to the target-server and reclaims its state almost identically to how it would after a server reboot. Shortly thereafter, the grace period expires, the client is purged from the source-server, and then it's business as usual.
To go into a bit more detail, the migration prototype is based off of a redesigned approach to reboot-recovery that Andy Adamson developed, wherein a new userspace daemon (so far named rpc.stransd) takes over some responsibilities previously handled within the kernel. For the most part, the daemon is responsible for keeping track of the clientids of legitimate NFS clients who have established state on the server; the daemon records these clientids in stable storage.
For migration, the administrator runs a client program (so far called rpc.strans) that contacts the source-server's rpc.stransd and sends the IP address of the client to migrate. rpc.stransd looks up all clientids associated with that IP address and sends them to the target-server's rpc.stransd, which saves them in stable storage and notifies its (the target-server's) knfsd that the clientids are eligible for reclaim. Then, when the client has received the error message that the file system has moved, it sends an FS_LOCATIONS request to the source-server in order to find out where next it should go and receives a reply containing the target-server's IP address. Since it is migrating, the client reuses its existing clientid (already in the target-server's eligible-to-reclaim list) when it contacts the target-server instead of creating a new one, and thereafter proceeds to reclaim its state.
The mechanism by which one rpc.stransd transfers clientids to another will be expanded so that all client open/lock/delegation state held on the source-server can be directly sent to the target-server and loaded into memory. In order to facilitate that, the underlying cluster filesystem will also need to transfer its own bookkeeping of opens/locks/leases from the source node to the target node. By directly transferring the state instead of relying on reclaims, the invasive and problematic cluster-wide grace period can be avoided entirely.
The existing prototype is limited in many ways: for ease of integration, only the creation of a symlink completes a migration event on the client; there is no security associated with the triggering of a migration; the GFS2 and dlm code in the kernel version used in the prototype are quite fragile; the list goes on. Nevertheless, we have migrated clients at CITI that are able to -- to the extent that the maturity of that kernel version permits -- continue functioning normally after a migration.
As a compromise between the setup of the original prototype and the relative stability of GFS2 exported by NFS, the current code is based off of the 126.96.36.199 Linux kernel. Until proper git repositories are online, there is a patch for the kernel and a tarball of the source for the userland components.
Some instructions on how to test the setup and how to work around some cluster-related kinks are in the README file in the userland tarball. Once the kernels have been built on the nfs servers and the client, and once the userland components are built on the servers, my basic steps are:
- boot the cluster, bring up cman and clvmd everywhere
- mount the gfs2 filesystem on the nfs servers
- cat the files that'll be involved in the reclaims on each of the nfs servers (see the README)
- then start up nfs on the servers, making sure that rpc.stransd is running by the time knfsd is starting up
- start wireshark on the client
- have the client mount the source-server and hold a file open with, e.g., less(1)
- arrange the migration: $ rpc.strans -m <clientIP> <target-serverIP> <source-serverIP>
- in a second shell on the client, try to create a symlink over nfs -- it should fail and the client should migrate
- the logs, wireshark, netstat, etc, should show the client to have migrated and the client should be able to keep going (but again, functionality is limited -- reading files works). note that mount(1) will continue to show the source-server, though that's not actually the case.
A network trace of the client 188.8.131.52 migrating from server 184.108.40.206 to 220.127.116.11 is available from CITI's website. Packets 104/106 show a file initially being opened; then the migration was triggered; then, packets 128/130 show the client trying to make a symlink and getting a "moved" error; packets 140/142 show the client making contact with the target server; packets 156/158 show the client reclaiming state for the file it had open; and finally, packets 239/241 show subsequent "normal" operation as another file is read after the artificial grace period expired.