Cluster Coherent NFSv4 and Delegations
NFSv4 adds a new protocol feature: delegations. RFC 3530 explains:
- The major addition to NFS version 4 in the area of caching is the ability of the server to delegate certain responsibilities to the client. When the server grants a delegation for a file to a client, the client is guaranteed certain semantics with respect to the sharing of that file with other clients. At OPEN, the server may provide the client either a read or write delegation for the file. If the client is granted a read delegation, it is assured that no other client has the ability to write to the file for the duration of the delegation. If the client is granted a write delegation, the client is assured that no other client has read or write access to the file.
- Delegations can be recalled by the server. If another client requests access to the file in such a way that the access conflicts with the granted delegation, the server is able to notify the initial client and recall the delegation. This requires that a callback path exist between the server and client. If this callback path does not exist, then delegations can not be granted. The essence of a delegation is that it allows the client to locally service operations such as OPEN, CLOSE, LOCK, LOCKU, READ, WRITE without immediate interaction with the server.
Linux NFSv4 Delegation Support for Cluster Filesystems
To coordinate NFSv4 delegations with local access, we implement delegations with the lease extension to the VFS lock subsystem. Leases are set and queried from userspace with fcntl(). To allow a lease to be recalled, e.g., on account of a conflicting open, the VFS layer has a break_lease() function.
* The break_lease call needs to be added to the VFS rename and unlink implementations.
* dmr: this is done in the 2.6.16-based directory delegations kernel; that code needs to be broken into patches, moved forward, and tested.
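For reference, the userspace side of the lease interface can be exercised with the Linux-specific fcntl() commands F_SETLEASE and F_GETLEASE. The following is a minimal sketch, not part of the NFSD implementation; probe_read_lease() is a hypothetical helper name, and the call can fail on file systems that do not support leases.

```c
#define _GNU_SOURCE             /* F_SETLEASE and F_GETLEASE are Linux-specific */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Take a read lease on `path`, read it back with F_GETLEASE, then drop it.
 * Returns the lease type reported by the kernel (F_RDLCK on success) or a
 * negative errno. probe_read_lease() is a hypothetical helper for this note.
 */
int probe_read_lease(const char *path)
{
    assert(path != NULL);

    /* A read lease can only be placed on a descriptor opened read-only. */
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;

    /* This fails if, for example, the file is currently open for writing. */
    if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0) {
        int err = errno;
        close(fd);
        return -err;
    }

    int type = fcntl(fd, F_GETLEASE);   /* F_RDLCK, F_WRLCK or F_UNLCK */
    if (type < 0)
        type = -errno;

    fcntl(fd, F_SETLEASE, F_UNLCK);     /* release the lease explicitly */
    close(fd);
    return type;
}
```

When another process performs a conflicting open, the kernel notifies the lease holder with a signal (SIGIO by default) and the holder is expected to downgrade or remove the lease. That notification is the userspace view of what break_lease() triggers inside the kernel.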
When the NFS server's open() method is invoked, it may issue or recall a delegation. A delegation can be issued if it does not conflict with an existing delegation. Issuing a delegation is optional. A delegation can be recalled at any time. Recalling a delegation is mandatory if a conflicting open is received.
A conflicting open can come from a variety of sources: local access, NFS access, Samba access, etc. Every invocation of the VFS open method must check for conflict with an existing delegation and recall it if necessary. NFSD may wait for the delegation recall to complete, or may respond to the OPEN request with NFSERR_DELAY.
* Is that last bit true?
* dmr: if you mean the last sentence, yes, it's true in that the server could do that (or quickly stall for a period less than a client's retry interval), but all it does now is respond with NFSERR_DELAY.
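The conflict check and the "reply with NFSERR_DELAY" behaviour described above can be sketched as a small state model. All type and function names below are hypothetical and chosen for this note; they are not nfsd's own. The conflict rules follow the RFC 3530 text quoted at the top, and the sketch assumes the open comes from a party other than the delegation holder.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for this sketch; not the kernel's or nfsd's own types. */
enum deleg_type  { DELEG_NONE, DELEG_READ, DELEG_WRITE };
enum open_access { OPEN_READ = 1, OPEN_WRITE = 2 };   /* may be OR'ed */
enum open_reply  { REPLY_PROCEED, REPLY_DELAY };

struct file_state {
    enum deleg_type held;   /* delegation currently outstanding */
    bool recall_sent;       /* a recall has been started and not completed */
};

/*
 * Does an open by another party conflict with the outstanding delegation?
 * Read delegation: only opens for write conflict.
 * Write delegation: any open conflicts.
 */
bool open_conflicts(enum deleg_type held, int access)
{
    assert((access & (OPEN_READ | OPEN_WRITE)) != 0);
    if (held == DELEG_WRITE)
        return true;
    if (held == DELEG_READ)
        return (access & OPEN_WRITE) != 0;
    return false;
}

/*
 * One OPEN as seen by the server: on conflict, start the recall once and
 * tell the opener to retry later instead of waiting for the recall.
 */
enum open_reply handle_open(struct file_state *st, int access)
{
    if (!open_conflicts(st->held, access))
        return REPLY_PROCEED;
    if (!st->recall_sent)
        st->recall_sent = true;     /* stands in for sending CB_RECALL */
    return REPLY_DELAY;
}

/* The holder has returned the delegation; later opens may proceed. */
void delegation_returned(struct file_state *st)
{
    st->held = DELEG_NONE;
    st->recall_sent = false;
}
```

A retried open succeeds only after delegation_returned() has run, which mirrors a client retrying the OPEN after receiving the delay error.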
If an OPEN request forces a delegation recall, NFSD issues a CB_RECALL request to all clients holding the conflicting delegation. This is implemented on the client with the VFS layer break_lease() call, which notifies lease holders that a conflicting OPEN has occurred. The VFS layer makes this determination without consulting the underlying file system.
* Now I'm confused.
Once the recall of conflicting delegations is complete, NFSD can proceed with its pending OPEN request. In order to determine whether it can issue a delegation for the request, NFSD needs information that lives on the other side of the VFS layer. The VFS lease subsystem can make the determination by examining the entry for the file in the open inode table: if there are no writers, then a READ delegation can be issued; if there are no readers or writers, then a WRITE delegation can be issued. NFSD must obtain the result of this determination from the VFS layer.
* I think I got this wrong.
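Taking the rule exactly as the paragraph above states it (and keeping in mind that the comment flags it as uncertain), the determination is a simple function of the reader and writer counts. This is a sketch only: struct open_entry is a hypothetical view of an open inode table entry, and the counts are assumed to exclude the open that is asking for the delegation.

```c
#include <assert.h>
#include <stddef.h>

enum deleg_type { DELEG_NONE, DELEG_READ, DELEG_WRITE };

/*
 * Hypothetical view of one entry in the open inode table. The counts are
 * assumed to exclude the open that is asking for the delegation.
 */
struct open_entry {
    unsigned readers;
    unsigned writers;
};

/*
 * Strongest delegation that may be offered for this file:
 *   no readers and no writers -> WRITE delegation
 *   readers but no writers    -> READ delegation
 *   any writer                -> none
 */
enum deleg_type grantable_delegation(const struct open_entry *e)
{
    assert(e != NULL);
    if (e->writers != 0)
        return DELEG_NONE;
    return e->readers == 0 ? DELEG_WRITE : DELEG_READ;
}
```

Since issuing a delegation is optional, the result is an upper bound: NFSD may still decide to grant a weaker delegation or none at all.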
If NFSD elects to grant a delegation, it must inform the underlying file system.
- VFS OPEN must ask the file system to check for delegation recall in progress prior to granting an OPEN, granting a delegation, or initiating a recall.
* Does "delegation recall in progress" cover (a) recalls initiated as a result of the current open and (b) recalls from other opens that have not completed?
- If delegation is issued, the NFS client must set up a callback path for a potential CB_RECALL request from the server.
- NFSD must ask the file system if a delegation can be granted.
- The VFS must tell the file system of a lease conflict (rename, unlink, etc.) and compel it to recall any delegations.
Extend the setlease/getlease/break_lease interfaces to service cluster file systems. The extensions will resemble the POSIX locking extensions (callbacks, etc.).
What we probably need is new inode operations:
- break_lease(inode, mode)
- setlease(filp, mode)
- getlease(filp, &mode)
Here mode can be one of read, write, or unlock. Should we also allow the mode to be OR'ed with a nonblocking flag?
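To make the proposal concrete, here is one way the three operations could look as an operations table, with a toy single-node implementation behind it. Everything here is an assumption for discussion: the struct and constant names are hypothetical, struct inode and struct file are stand-ins for the kernel types, and the toy tracks only one lease per inode. For break_lease, mode is taken to mean the access the caller wants, so a read breaks only a write lease.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical mode values and flag for the proposed interface. */
enum lease_mode { LEASE_READ = 1, LEASE_WRITE = 2, LEASE_UNLOCK = 3 };
#define LEASE_MODE_MASK 0xff
#define LEASE_NONBLOCK  0x100   /* OR'ed into mode: fail instead of waiting */

struct inode;
struct file { struct inode *f_inode; };   /* stand-in for the kernel type */

/* The three operations proposed above, as a per-filesystem table. */
struct lease_operations {
    int (*break_lease)(struct inode *inode, int mode);
    int (*setlease)(struct file *filp, int mode);
    int (*getlease)(struct file *filp, int *mode);
};

struct inode {                            /* stand-in for the kernel type */
    const struct lease_operations *l_op;
    int lease;                            /* LEASE_UNLOCK when none is held */
};

static int toy_setlease(struct file *filp, int mode)
{
    struct inode *inode = filp->f_inode;
    int want = mode & LEASE_MODE_MASK;

    if (want == LEASE_UNLOCK) {
        inode->lease = LEASE_UNLOCK;
        return 0;
    }
    if (want != LEASE_READ && want != LEASE_WRITE)
        return -EINVAL;
    if (inode->lease != LEASE_UNLOCK && inode->lease != want)
        return -EAGAIN;                   /* a different lease is already out */
    inode->lease = want;
    return 0;
}

static int toy_getlease(struct file *filp, int *mode)
{
    assert(mode != NULL);
    *mode = filp->f_inode->lease;
    return 0;
}

static int toy_break_lease(struct inode *inode, int mode)
{
    int want = mode & LEASE_MODE_MASK;
    int conflict = inode->lease == LEASE_WRITE ||
                   (inode->lease == LEASE_READ && want == LEASE_WRITE);

    if (!conflict)
        return 0;
    if (mode & LEASE_NONBLOCK)
        return -EWOULDBLOCK;              /* caller asked not to wait */
    inode->lease = LEASE_UNLOCK;          /* pretend the recall has completed */
    return 0;
}

const struct lease_operations toy_lease_ops = {
    .break_lease = toy_break_lease,
    .setlease    = toy_setlease,
    .getlease    = toy_getlease,
};
```

A cluster file system would replace the toy functions with ones that consult the other nodes. The table shape is the same whether the pointers end up in the inode operations or the file operations.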
The VFS lease subsystem includes a series of lock manager callbacks. Will these be sufficient for the cluster filesystem case?
Actually, the current setlease and getlease functions use a struct file_lock instead of (or in addition to) the mode. Do we need that?
Also, setlease and getlease could be file operations instead of inode operations. This is probably a fairly arbitrary choice.
To handle the possibility that break_lease, setlease, getlease, etc. might block, even in the absence of contention, we might want to allow an -EINPROGRESS return to be followed by a callback, e.g. break_lease_result(inode, stat), where stat might be -EAGAIN (we're waiting for the lease to be broken) or OK (it was immediately broken, or there never was one).
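The -EINPROGRESS idea might look roughly like the following. This is only a sketch of the calling convention: break_lease_async(), cluster_reply() and the fields of the stand-in struct inode are hypothetical, and the "cluster" is simulated by a direct function call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct inode;
typedef void (*break_result_fn)(struct inode *inode, int stat);

struct inode {                      /* stand-in for the kernel type */
    int needs_roundtrip;            /* fs must ask other nodes before answering */
    break_result_fn pending_cb;     /* set while an answer is outstanding */
};

/*
 * Start breaking any lease on the inode. Returns 0 if the answer is known
 * at once, or -EINPROGRESS if the result will be delivered later via cb.
 */
int break_lease_async(struct inode *inode, break_result_fn cb)
{
    assert(inode != NULL && cb != NULL);
    if (!inode->needs_roundtrip)
        return 0;                   /* nothing to wait for */
    inode->pending_cb = cb;
    return -EINPROGRESS;
}

/*
 * Called by the (simulated) file system when the cluster has answered.
 * stat is -EAGAIN if a lease exists and is still being broken, or 0 if it
 * was broken immediately or there never was one.
 */
void cluster_reply(struct inode *inode, int lease_outstanding)
{
    break_result_fn cb = inode->pending_cb;

    if (cb == NULL)
        return;                     /* no request in flight */
    inode->pending_cb = NULL;
    cb(inode, lease_outstanding ? -EAGAIN : 0);
}

/* Example callback: record the status so the caller can act on it. */
int last_break_stat = 1;            /* 1 means "no result delivered yet" */

void break_lease_result(struct inode *inode, int stat)
{
    (void)inode;
    last_break_stat = stat;
}
```

A caller that sees -EAGAIN in the callback would presumably retry later or wait for a further notification; which of the two applies is one of the open issues above.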
Implementation awaits progress in resolving the above issues.