Nfsd4 server recovery
This incorporates revisions based on comments on the original document posted at .
The Linux server's reboot recovery code has long-standing architectural problems, fails to adhere to the specifications in some cases, and doesn't take full advantage of NFSv4.1. An overhaul has been a long-standing todo.
This is my attempt to state the problem and a rough solution.
Requirements, as compared to current code:
- Correctly implements the algorithm described in section 8.6.3 of rfc 3530, and eliminates known race conditions on recovery.
- Does not attempt to manage files and directories directly from inside the kernel.
Requirements, in more detail:
A server can go down and come back up again for any number of reasons:
- The server may crash.
- Power may go out.
- The administrator may reboot the server.
- The administrator may manually stop and restart the NFS server without stopping other services on the machine, for example using "service nfs stop" and "service nfs start" (where the details may vary from one distribution to another).
We will call any of these events a "restart".
A "server instance" is the lifetime from start to shutdown of a server; a restart ends one server instance and starts another. Normally a server instance consists of a grace period followed by a period of normal operation. However, a server could go down before the grace period completes. Call a server instance that completes the grace period "full", and one that does not "partial".
Call a client "active" if it holds unexpired state on the server. Then:
- An NFSv4.0 client becomes active as soon as it successfully performs its first OPEN_CONFIRM, or its first reclaim OPEN.
- An NFSv4.1 client becomes active when it successfully performs a RECLAIM_COMPLETE.
- Active clients become inactive when they expire. (Or when they are revoked--but the Linux server does not currently support revocation.)
- On startup all clients are initially inactive.
On startup the server needs access to the list of clients which are permitted to reclaim state. That list is exactly the list of clients that were active at the end of the most recent full server instance.
To maintain such a list, we need records to be stored in stable storage. Whenever a client changes from inactive to active, or active to inactive, stable storage must be updated, and until the update has completed the server must do nothing that acknowledges the new state.
So, the fundamental requirements are:
- When a new client becomes active, a record for that client must be created in stable storage before responding to the rpc in question (OPEN, OPEN_CONFIRM, or RECLAIM_COMPLETE).
- When a client expires, the record must be removed (or otherwise marked expired) before responding to any requests for locks or other state which would conflict with state held by the expiring client.
- Updates must be made by upcalls to userspace; the kernel will not be directly involved in managing stable storage. The upcall interface should be extensible.
- The records must include the client owner name, to allow identifying clients on restart. The protocol allows client owner names to consist of up to 1024 bytes of binary data. (This is the client-supplied long form, not the server-generated shorthand clientid; co_ownerid for 4.1).
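To make the ordering requirement concrete, here is a minimal sketch of the bookkeeping. The names RecoveryTracker and MemoryStore are hypothetical, and MemoryStore merely stands in for whatever stable storage the userspace daemon provides. The only point illustrated is that the storage update completes before the in-memory state changes, and therefore before any reply could be sent.

```python
class MemoryStore:
    """Stand-in for stable storage (hypothetical; a real store would hit disk)."""

    def __init__(self):
        self.records = set()
        self.log = []

    def create(self, owner):
        self.records.add(owner)
        self.log.append(("create", owner))

    def remove(self, owner):
        self.records.discard(owner)
        self.log.append(("remove", owner))


class RecoveryTracker:
    """Tracks which clients are active, updating storage before acknowledging."""

    def __init__(self, store):
        self.store = store
        self.active = set()

    def client_became_active(self, owner):
        # Called for OPEN, OPEN_CONFIRM, or RECLAIM_COMPLETE.
        if owner in self.active:
            return                      # already recorded, nothing to store
        self.store.create(owner)        # must finish before the rpc reply
        self.active.add(owner)

    def client_expired(self, owner):
        if owner not in self.active:
            return
        self.store.remove(owner)        # must finish before conflicting grants
        self.active.discard(owner)
```

Only transitions between inactive and active touch storage, so a client that opens many files costs one record update, not one per OPEN.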
Nice to have
Also desirable, but not absolutely required in the first implementation:
- We should not take the state lock while waiting for records to be stored. (Doing so blocks all other stateful operations while we wait for disk.)
- The server should be able to end the grace period early when the list of clients allowed to reclaim is empty, or when they are all 4.1 clients, after all have sent RECLAIM_COMPLETE. (But see note about NLM below).
- We should allow pluggable methods for storage of reboot recovery records, as the NFSv2 and NFSv3 code currently does. These may be used by some high-availability systems.
Possibly also desirable:
- Record the principal that originally created the client, and whether it had EXCHGID4_FLAG_BIND_PRINC_STATEID (see the EXCHANGE_ID description in rfc 5661, section 18.35).
Also, note that our server supports NLM as well. As long as NLM is enabled, we must ensure that an NFSv4 client is not able to perform a non-reclaim lock while still waiting for lock reclaims from NLM clients, and vice versa. And note that NLM also supports share locks, so even NFSv4 open reclaims have the same issue.
We will write a new userspace daemon to manage state in userspace. The new daemon will be written with the possibility in mind of later combining it with one of the other existing daemons (such as idmapd), but it may stand alone at first.
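The early-exit condition for the grace period can be sketched as a small predicate. This is an illustration only: ReclaimClient and may_end_grace_early are hypothetical names, and treating "NLM enabled" as simply forbidding an early exit is one conservative reading of the NLM note above, not a settled design.

```python
from dataclasses import dataclass


@dataclass
class ReclaimClient:
    minor_version: int              # 0 for NFSv4.0, 1 for NFSv4.1
    reclaim_complete: bool = False  # has the client sent RECLAIM_COMPLETE?


def may_end_grace_early(clients, nlm_enabled):
    """Return True if the grace period could be lifted before it times out."""
    if nlm_enabled:
        # Assumption: NLM clients give no "done reclaiming" signal,
        # so wait out the full grace period whenever NLM is in use.
        return False
    # An empty list passes trivially; a 4.0 client can never satisfy this,
    # since it has no RECLAIM_COMPLETE to send.
    return all(c.minor_version >= 1 and c.reclaim_complete for c in clients)
```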
Previous prototype code from CITI will be considered as a starting point.
Kernel<->user communication will use four files in the "nfsd" filesystem. All of them will use the encoding used for rpc cache upcalls and downcalls, which consists of whitespace-separated fields escaped as necessary to allow binary data.
Three of them will be used for upcalls; the daemon reads requests from them, and writes responses back:
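As an illustration of what "whitespace-separated fields escaped as necessary" could look like, the sketch below passes plain printable fields through unchanged and writes anything else as \x followed by hex digits. This is an assumed scheme for the sake of the example; the kernel's rpc cache code has its own quoting rules, and the function names here are hypothetical.

```python
def encode_field(data: bytes) -> str:
    """Encode one field so that it contains no whitespace."""
    # Printable ASCII without space or backslash can go through as-is.
    safe = all(0x21 <= b <= 0x7E and b != 0x5C for b in data)
    if data and safe:
        return data.decode("ascii")
    return "\\x" + data.hex()


def decode_field(text: str) -> bytes:
    """Inverse of encode_field."""
    if text.startswith("\\x"):
        return bytes.fromhex(text[2:])
    return text.encode("ascii")


def encode_message(fields) -> str:
    """Join encoded fields with spaces and terminate with a newline."""
    return " ".join(encode_field(f) for f in fields) + "\n"


def decode_message(line: str):
    """Split a message line back into its binary fields."""
    body = line.rstrip("\n")
    return [decode_field(t) for t in body.split(" ")] if body else []
```

Because a field containing a backslash is always hex-encoded, a decoded field can never be mistaken for an escaped one, which matters for client owner names that are arbitrary binary data.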
- create_client: given a client owner, returns an error. Does not return until a new record has safely been recorded on disk. The kernel will call this on the first reclaim OPEN or OPEN_CONFIRM (for v4.0 clients) or on RECLAIM_COMPLETE (for 4.1 clients).
- grace_done: request and reply are both empty; the daemon returns only after it has recorded to disk the fact that the grace period completed. The kernel will not allow any non-reclaim opens until this returns.
- expire_client: given a client owner, replies with an empty reply. Replies only after it has recorded to disk the fact that the client has expired. The kernel will call this when a client loses its lease, before removing its locks and opens (and allowing potentially conflicting operations).
One additional file will be used for a downcall:
- allow_client: before starting the server, the daemon will open this file, write a newline-separated list of client owners permitted to recover, then close the file. If no clients are allowed to recover, it will still open and close the file.
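The downcall might look roughly like the following on the daemon side. The path is passed in by the caller because the exact location in the "nfsd" filesystem is not fixed here, and the \x-plus-hex encoding of each owner is an assumption standing in for the real rpc cache escaping.

```python
def write_allow_client(path, owners):
    """Write the list of client owners permitted to reclaim, one per line."""
    # The open and close happen even for an empty list: the act of
    # opening the file is itself part of the protocol.
    with open(path, "w", encoding="ascii") as f:
        for owner in owners:
            f.write("\\x" + owner.hex() + "\n")
```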
The daemon will use the presence of these upcalls to determine whether the server supports the new recovery mechanism (and may just exit if it does not). Also, nfsd may use the daemon's open of allow_client to decide whether userspace supports the new mechanism. This allows a mismatched kernel and userspace to still maintain reboot recovery records.
In addition, we could support seamless reboot recovery across the transition to the new system by making the daemon convert between on-disk formats. However, for simplicity's sake we plan for the server to refuse all reclaims on the first boot after the transition.
By default, the daemon will store records as files in the directory /var/lib/nfs/v4clients. The file name will be a hash of the client_owner, and the contents will consist of two newline-separated fields:
- The client owner, encoded as in the upcall.
- A timestamp.
More fields may be added in the future.
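A record in this format could be written as sketched below. SHA-256 as the hash, hex as the owner encoding, and the write-to-temporary-then-rename pattern are all assumptions made for the example; the design above only says "a hash" and "encoded as in the upcall". The fsync calls are what make it safe to reply to the kernel once write_record returns.

```python
import hashlib
import os


def record_name(owner: bytes) -> str:
    """File name for a client record (assumed here to be SHA-256 in hex)."""
    return hashlib.sha256(owner).hexdigest()


def write_record(dirpath, owner, timestamp):
    """Durably store a record: encoded owner, then timestamp, one per line."""
    os.makedirs(dirpath, exist_ok=True)
    final = os.path.join(dirpath, record_name(owner))
    tmp = final + ".tmp"
    with open(tmp, "w", encoding="ascii") as f:
        f.write(owner.hex() + "\n" + str(int(timestamp)) + "\n")
        f.flush()
        os.fsync(f.fileno())            # file contents reach disk
    os.replace(tmp, final)              # atomic rename into place
    dirfd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(dirfd)                 # the new directory entry reaches disk
    finally:
        os.close(dirfd)
    return final


def read_record(path):
    """Return (owner, timestamp), ignoring any later fields."""
    with open(path, encoding="ascii") as f:
        owner_hex, stamp = f.read().split("\n")[:2]
    return bytes.fromhex(owner_hex), int(stamp)
```

Reading only the first two fields leaves room for the future additions mentioned above without breaking older readers.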
Before starting the server, and writing to allow_client, the daemon will manage boot times and old clients using files in /var/lib/nfs:
- Record the current time in new_boot_time (replacing any existing such file).
- If the file boot_time exists:
- It will be read, and the contents interpreted as an ascii-encoded unix time in seconds.
- All client records older than that time will be removed.
- All remaining clients will be written to allow_client.
- If boot_time does not exist, an empty /var/lib/nfs/v4clients/ is created if necessary.
The daemon will then wait for create_client, expire_client, and grace_done calls. On grace_done, it will rename new_boot_time to boot_time.
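The startup steps above can be sketched as follows. The function names are hypothetical, each record is assumed to hold a hex-encoded owner on the first line and a timestamp on the second, and the statedir argument stands in for /var/lib/nfs. Unlike the description, the sketch creates v4clients unconditionally, which is harmless when it already exists.

```python
import os
import time


def prepare_startup(statedir, now=None):
    """Run the pre-start steps; return the client owners allowed to reclaim."""
    now = int(time.time()) if now is None else int(now)
    clients_dir = os.path.join(statedir, "v4clients")
    os.makedirs(clients_dir, exist_ok=True)

    # Record the current time, replacing any leftover new_boot_time.
    with open(os.path.join(statedir, "new_boot_time"), "w") as f:
        f.write(f"{now}\n")
        f.flush()
        os.fsync(f.fileno())

    boot_path = os.path.join(statedir, "boot_time")
    if not os.path.exists(boot_path):
        return []                       # no full instance on record

    with open(boot_path) as f:
        boot_time = int(f.read().strip())

    allowed = []
    for name in sorted(os.listdir(clients_dir)):
        path = os.path.join(clients_dir, name)
        with open(path) as f:
            owner_hex, stamp = f.read().split("\n")[:2]
        if int(stamp) < boot_time:
            os.unlink(path)             # predates the last full instance
        else:
            allowed.append(bytes.fromhex(owner_hex))
    return allowed


def grace_done(statedir):
    """On grace_done, promote new_boot_time to boot_time."""
    os.replace(os.path.join(statedir, "new_boot_time"),
               os.path.join(statedir, "boot_time"))
```

Because boot_time is only replaced on grace_done, a partial server instance leaves it untouched, and the next startup still prunes against the start of the most recent full instance.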
Another Draft Design
This draft design will use rpc_pipefs to handle most of the communications. The reason for that is that I'm not convinced that polling on /proc/fs/nfsd files will work well. How would the daemon know that the kernel has data ready to read off the socket? That may be fixable, but rpc_pipefs should already be suitable for this.
Most of the files would be replaced with a single rpc_pipefs pipe in a nfsd subdirectory in rpc_pipefs. The daemon would open this pipe and listen on it. The upcall format would contain a command field that the daemon would use to determine what sort of message this was. The commands are as follows:
- set_client: given a server IP address, and information that can be used to generate clientid4, returns clientid4 (and a verifier?). Called during SETCLIENTID/EXCHANGE_ID phase. If set_client is called during the grace period, then the client must already be in the db and the existing clientid4 is returned. If not, an error will be returned (NFS4ERR_GRACE?). If set_client is called outside the grace period, then a unique clientid4 is generated, stored on disk and returned to the kernel. The stored clientid is marked "incomplete".
- confirm_client: given a clientid4 and a server-side IP address, returns an error. The kernel will call this on the first reclaim OPEN or OPEN_CONFIRM (for v4.0 clients) or on RECLAIM_COMPLETE (for 4.1 clients). Does not return until client is marked "confirmed".
- expire_client: given a clientid4 and server IP address, replies with an empty reply. Replies only after it has recorded to disk the fact that the client has expired. The kernel will call this when a client loses its lease, before removing its locks and opens (and allowing potentially conflicting operations).
- begin_grace: kernel will send a server IP address and number of seconds. Daemon responds with a count of clients that need to be reclaimed. If that number is 0, then the kernel will know that the grace period can be immediately lifted. The daemon need not do anything else.
- grace_complete: given a server IP address, returns with an empty reply. The kernel will upcall with this command after the grace period expires. The daemon will use that to purge any unreclaimed client records for the given server address.
One concern is that this data is not per-client like most of the stuff under rpc_pipefs, but hopefully that "weirdness" won't be a show stopper.
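The command set above could be handled by the daemon roughly as in this in-memory sketch. RecoveryDB is a hypothetical name, the records live in a dict where a real daemon would use disk, and several details are assumptions: LookupError stands in for the error returned during grace, only confirmed records count as reclaimable, and a record is treated as reclaimed once confirm_client is called for it during the grace period.

```python
import itertools


class RecoveryDB:
    """In-memory stand-in for the daemon's client database."""

    def __init__(self):
        self.records = {}               # (server_ip, client_key) -> record
        self.in_grace = set()           # server addresses currently in grace
        self._ids = itertools.count(1)  # source of unique clientid4 values

    def begin_grace(self, server_ip, seconds):
        # 'seconds' is accepted to match the upcall but unused in this sketch.
        self.in_grace.add(server_ip)
        count = 0
        for (ip, _), rec in self.records.items():
            if ip == server_ip and rec["confirmed"]:
                rec["pending"] = True   # still waiting to be reclaimed
                count += 1
        return count

    def set_client(self, server_ip, client_key):
        rec = self.records.get((server_ip, client_key))
        if server_ip in self.in_grace:
            if rec is None or not rec["confirmed"]:
                raise LookupError("client not known during grace period")
            return rec["clientid"]
        rec = {"clientid": next(self._ids), "confirmed": False, "pending": False}
        self.records[(server_ip, client_key)] = rec   # stored as "incomplete"
        return rec["clientid"]

    def confirm_client(self, server_ip, clientid):
        for (ip, _), rec in self.records.items():
            if ip == server_ip and rec["clientid"] == clientid:
                rec["confirmed"] = True
                rec["pending"] = False
                return
        raise LookupError("unknown clientid")

    def expire_client(self, server_ip, clientid):
        doomed = [k for k, r in self.records.items()
                  if k[0] == server_ip and r["clientid"] == clientid]
        for key in doomed:
            del self.records[key]

    def grace_complete(self, server_ip):
        self.in_grace.discard(server_ip)
        stale = [k for k, r in self.records.items()
                 if k[0] == server_ip and r["pending"]]
        for key in stale:
            del self.records[key]       # purge unreclaimed clients
```

Keying every record by server address as well as client keeps each address's grace period independent, which is what the per-address begin_grace and grace_complete commands require.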
The important thing to note is that all of the above commands are kernel-initiated. We also want to allow the daemon to initiate a grace_complete when all of the client records for an address have been reclaimed. That requires a different interface, possibly a simple file in /proc/fs/nfsd. When the last state record for an IP address has been recovered, the daemon would write that IP address into the file. The kernel would then know that the grace period for that IP address is now complete.
Accommodating active/active NFSv4 clusters
To handle the situation where we have an active/active NFSv4 cluster with IP addresses that float between machines, we'll need to further tie each of the client records to one of the server's IP addresses. The create_client and expire_client interfaces will need to include an encoded server IP address, and the daemon will need to store that information. We'll also need to have some way to tell the daemon to only feed records for a certain IP address to the server, so that when the server picks up a new address it can get the new records.
What may be best is to have a "simple" state recovery daemon for single server configurations, and a "clustered" one for clustered configurations. The upcall/downcall formats should be the same for them, however.