CITI Experience with Directory Delegations
From Linux NFS
Revision as of 19:36, 12 October 2006
NOTE: this is a rough work-in-progress; please send criticism to richterd at (nospam) citi.umich.edu thank you.
[2006-8-2: I've added some rough, preliminary numbers of opcounts from doing compiles with/without directory delegations]
Background
NFSv4 allows clients to cache directory contents:
- READDIR uses a directory entry cache
- LOOKUP uses the name cache
- ACCESS and GETATTR use a directory metadata cache.
To limit the use of stale cached information, RFC 3530 suggests a time-bounded consistency model, which forces the client to revalidate cached directory information.
"Directory caching for the NFS version 4 protocol is similar to previous versions. Clients typically cache directory information for a duration determined by the client. At the end of a predefined timeout, the client will query the server to see if the directory has been updated. By caching attributes, clients reduce the number of GETATTR calls made to the server to validate attributes. Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call. By caching name and inode information about most recently looked up entries in DNLC (Directory Name Lookup Cache), clients do not need to send LOOKUP calls to the server every time these files are accessed." NFSv4.1 Internet Draft
Revalidation of directory information is wasteful and opens a window during which a client might use stale cached directory information.
Analysis of network traces at the University of Michigan (FIXME: need link to a copy of Brian Wickman's prelim) shows that a surprising amount of NFSv3 traffic is due to GETATTRs triggered by client directory cache revalidation.
"[The NFSv4] caching approach works reasonably well at reducing network traffic in many environments. However, it does not address environments where there are numerous queries for files that do not exist. In these cases of "misses", the client must make RPC calls to the server in order to provide reasonable application semantics and promptly detect the creation of new directory entries. Examples of high miss activity are compilation in software development environments. The current behavior of NFS limits its potential scalability and wide-area sharing effectiveness in these types of environments." NFSv4.1 Internet Draft
To improve performance and reliability, NFSv4.1 introduces read-only directory delegations, a protocol extension that allows consistent caching of directory contents. CITI is implementing directory delegations as described in Section 11 of the minor version draft. (Section 11 also describes a directory notification extension that CITI is not implementing.)
Directory Delegation Operations
An NFSv4.1 client requests a directory delegation with the GET_DIR_DELEGATION operation. Granting a delegation request is entirely at the server's discretion.
Upon receipt of an operation that conflicts with an existing delegation, the server recalls the delegation from all clients holding it by issuing them the XXX callback operation. When a client receives a recall request, it relinquishes the delegation and responds to the server with the DELEGRETURN operation. When all the clients have returned the delegation, the server proceeds with the conflicting operation.
Although NFS clients and servers have knowledge of the acquisition and recall of directory delegations, delegation state is opaque to applications.
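The request, recall, and return sequence described above can be sketched as a toy server-side model. This is only an illustration under simplifying assumptions, not CITI's implementation: the class and method names are hypothetical, the recall callback is represented by returning the set of clients that would be recalled, and the grant policy (refuse only while a conflicting operation is waiting) is an assumption made for the example.

```python
class DirDelegationServer:
    """Toy model of directory delegation grant, recall, and return."""

    def __init__(self):
        self.holders = {}   # directory -> set of client ids holding a delegation
        self.pending = {}   # directory -> operations waiting on outstanding recalls
        self.log = []       # operations that have actually been performed

    def get_dir_delegation(self, client, directory):
        # Granting is at the server's discretion. This sketch refuses only
        # while a conflicting operation is waiting on the directory.
        if self.pending.get(directory):
            return False
        self.holders.setdefault(directory, set()).add(client)
        return True

    def modify(self, client, directory, op):
        """Attempt a directory-modifying operation; return the clients recalled."""
        holders = self.holders.get(directory, set())
        if not holders:
            self.log.append(op)          # no conflict: proceed immediately
            return set()
        self.pending.setdefault(directory, []).append(op)
        return set(holders)              # each of these would receive a recall

    def delegreturn(self, client, directory):
        remaining = self.holders.get(directory, set())
        remaining.discard(client)
        if not remaining:
            # Every holder has returned: run the delayed operations.
            self.log.extend(self.pending.pop(directory, []))


server = DirDelegationServer()
server.get_dir_delegation("clientA", "/src")
server.get_dir_delegation("clientB", "/src")
recalled = server.modify("clientC", "/src", "CREATE foo")
log_before_returns = list(server.log)    # still empty: the CREATE is delayed
server.delegreturn("clientA", "/src")
server.delegreturn("clientB", "/src")
```

In this model the CREATE is logged only after both holders have issued their returns, which mirrors the ordering described in the text.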
(problems and solutions)
Common examples of the previously mentioned "high miss" cases involve the PATH shell variable and the loading of shared libraries. When a user executes a program, the user's shell examines the list of directories in the PATH environment variable and looks for the program binary in each of those directories, in turn, until the program is found. Often there can be 5 to 10 (or more) PATH entries, and normally a given program binary is in only one of those directories. Even when the client is searching for repeatedly-absent files, it must nevertheless check with the server in case they have appeared.
A similar situation arises during software compilation, when the include paths are repeatedly serially searched for header files. Given that header files are generally in only one of those directories, this results in a high miss-rate.
With respect to the PATH and shared library cases (where no directory-mutating operations are being performed), directory delegations provide a significant advantage. This stems from "negative dentry caching" -- that is, the caching of information about non-existent directory entries. In the absence of directory delegations, if a client attempts to OPEN a non-existent file, close-to-open consistency semantics require that the operation be sent to the server, regardless of whether the client has a negative dentry cached. However, if a client holds a delegation on the directory and has a negative dentry stored for the missing file, it can "trust" that the file has not appeared, which obviates the need for the OPEN.
As another example, if a client performs an 'ls' or a 'stat' on a non-existent file, three separate RPC calls are made to service an ACCESS, a LOOKUP, and a GETATTR -- only to find that the file still does not exist. If the directory were delegated and the client had a negative dentry for the non-existent file, however, the client would once again be assured that the file has not appeared.
Beyond just these "high miss" cases, analysis of NFSv3 (whose client cache revalidation semantics NFSv4 roughly mirrors) network traces by Brian Wickman at the University of Michigan shows that a significant amount of NFS traffic consists of the periodic GETATTRs which clients send when an attribute timeout triggers a cache revalidation. Naturally, if a directory is delegated, it need not be revalidated until the directory is mutated.
(notifications)
Another aspect concerning directory delegations in the minor version draft is an extension called notifications. The intent behind notifications is to avoid having to revoke a delegation and force the client to refetch the contents of a directory when only a relatively small change has been made. If a delegated directory is very large, it can be expensive to return a delegation -- which involves two RPCs -- and subsequently refetch the directory's entire contents, particularly if multiple clients have delegations on that directory.
Notifications mitigate this circumstance by allowing a client to request that the server merely send a message describing the change; this avoids having to revoke the delegation and refetch the directory. In environments where files are created and deleted with some moderate degree of frequency, notifications could conceivably provide significant benefits where plain directory delegations alone would result in a prohibitive number of recalls and directory refreshes. Examples might include directories where lockfiles are used, or where a few new files are created or deleted periodically, as with some compilation or when doing CVS updates.
In the proposed model, a client would be able to request notifications on directory entry and directory attribute changes, as well as directory entry attribute changes. Enabling a server to track that would involve a lot of extra state. Furthermore, the client and server negotiate a rate at which notifications are sent, which allows the server to batch several notifications and deliver them asynchronously and conceivably even prune self-cancelling notifications (e.g., "CREATE foo ... REMOVE foo"). Notifications would also introduce another level of "fairness" to maintain, in terms of deciding how to allot notifications among multiple clients.
Wickman's simulator work at CITI investigated some aspects of enabling notifications and found that in some cases, certainly with directory entry attribute changes, the number of notifications dispatched to support the directory delegation far outweighed the cost of simply not using a delegation at all. He also found that for certain workloads, if a server batched notifications for a long time (>20 seconds, sometimes >50 seconds), a significant reduction (5x-50x) in traffic could be achieved. For instance, lockfiles in mailboxes often have a lifetime under 10 seconds, so addition/deletion notifications could be pruned. There is, however, a direct trade-off between the batching delay and the client's cache consistency.
A lesser version of notifications -- wherein only directory-mutating operations would generate notifications -- has been loosely proposed and would involve much less server state, but seems not to be going anywhere. Primarily because of the complexity of implementation and the open questions of how best to benefit from notifications, we are not implementing them at this time.
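The pruning of self-cancelling notifications within one batch might look roughly like the sketch below. It models only the "CREATE foo ... REMOVE foo" case from the text; the function name and the (operation, name) tuple format are assumptions for illustration and do not reflect the notification types defined in the draft.

```python
def prune_batch(notifications):
    """Drop CREATE/REMOVE pairs on the same name within one batch.

    `notifications` is a list of (op, name) tuples in arrival order.
    """
    pending_create = {}   # name -> index in `kept` of an unmatched CREATE
    kept = []
    for op, name in notifications:
        if op == "REMOVE" and name in pending_create:
            kept[pending_create.pop(name)] = None   # cancel the earlier CREATE
            continue                                # and drop this REMOVE too
        if op == "CREATE":
            pending_create[name] = len(kept)
        kept.append((op, name))
    return [n for n in kept if n is not None]


batch = [
    ("CREATE", "mbox.lock"),
    ("CREATE", "a.o"),
    ("REMOVE", "mbox.lock"),
    ("REMOVE", "old.c"),
]
pruned = prune_batch(batch)
```

The longer the server holds a batch, the more such pairs fall inside it, which is the traffic saving described above; the cost is that the client's view of the directory lags by up to the batching delay.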
Using Directory Delegations
While a client holds a delegation on a directory, it is assured that the directory will not be modified without the delegation first being recalled. The server must delay any operation that modifies a directory until all the clients holding delegations on that directory have returned their delegations.
However, as a special case, the server may allow the client that is modifying a directory to keep its own delegation on that directory. (Obviously, other clients' delegations on that directory must still be recalled.)
Note that even though we may permit a client to modify a directory while it holds a read delegation, this is not the same as providing that client with an exclusive (write) delegation; a write delegation would also allow the client to modify the directory locally, and this is explicitly forbidden in section 11 of the minor version draft:
"The delegation is read-only and the client may not make changes to the directory other than by performing NFSv4 operations that modify the directory or the associated file attributes so that the server has knowledge of these changes."
Note that in order to make the special exception that allows a client to modify a directory without recalling its own lease, we must know which client is performing the operation.
Currently we are using the client's IP address for this. However, the NFSv4 protocol does not prohibit the client from changing IP addresses, and does not prohibit multiple clients from sharing an IP address. The final code will instead use the new sessions extensions in NFSv4.1 to identify the client.
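A small sketch shows why the choice of client identity matters for this exception. The helper name and the example identifiers are hypothetical; the per-client identifier stands in for whatever the sessions extension would provide.

```python
def clients_to_recall(holders, modifier):
    """Holders whose delegations must be recalled before `modifier` may
    change the directory; the modifier keeps its own delegation."""
    return {h for h in holders if h != modifier}


# Two distinct clients, X and Y, that happen to share one IP address.
ip_of = {"X": "10.0.0.5", "Y": "10.0.0.5"}

# Identified by IP address, X and Y are indistinguishable, so when X
# modifies the directory, Y's delegation is wrongly left in place.
by_ip = clients_to_recall({ip_of["X"], ip_of["Y"]}, ip_of["X"])

# Identified by a per-client identifier, Y is correctly recalled.
by_id = clients_to_recall({"X", "Y"}, "X")
```

With IP-based identity the recall set comes out empty, so Y could keep serving stale directory contents from its cache.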
Negative Caching
One opportunity offered by directory delegations is the chance to significantly extend the usefulness of negative dentry caching on the client. Currently, close-to-open consistency requires, for example, that all OPENs be sent to the server (i.e., negative caching provides no benefit in that case). With directory delegations, one is assured that no entries have been added or removed while a delegation is in effect; this implies that negative dentries in a delegated directory actually can be "trusted".
This could translate into a marked decrease in the number of unnecessary and repeated checks for non-existent files, e.g. when searching for an executable in PATH or a shared library in LD_LIBRARY_PATH. Knowing just when to acquire those delegations may be a matter to address in client-side policy.
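The effect on repeated misses can be sketched with a toy client cache that counts simulated round trips. This is an illustration only: the `DirCache` class is hypothetical, a Python set stands in for the server's directory, and a single counter stands in for whatever RPCs a real lookup would need.

```python
class DirCache:
    """Toy client-side cache for one directory."""

    def __init__(self, server_entries):
        self.server_entries = server_entries  # stands in for the server's directory
        self.negative = set()                 # names known to be absent
        self.delegated = False
        self.rpcs = 0                         # simulated round trips to the server

    def lookup(self, name):
        if self.delegated and name in self.negative:
            return False          # trusted negative entry: no RPC needed
        self.rpcs += 1            # otherwise the server must be asked
        found = name in self.server_entries
        if not found:
            self.negative.add(name)
        return found


# Without a delegation, every miss goes to the server.
plain = DirCache({"ls"})
for _ in range(3):
    plain.lookup("gcc")

# With a delegation, only the first miss does.
deleg = DirCache({"ls"})
deleg.delegated = True
for _ in range(3):
    deleg.lookup("gcc")
```

Here the undelegated cache makes three round trips for three misses, while the delegated one makes a single round trip and answers the other two locally.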
Delegations and the Linux VFS Lease Subsystem
We have implemented directory delegations on the server by extending the Linux VFS file lease subsystem. A lease is a type of lock that gives the lease-holder the chance to perform any necessary tasks (e.g., flushing data) when an operation that conflicts with the lease-type is about to occur -- the caller who is causing the lease to break will block until the lease-holder signals that it is finished cleaning-up (or the lease is forcefully broken after a timeout).
The existing lease subsystem only works on files, and leases are only broken when a file is opened for writing or is truncated. In order to implement directory delegations, we have added support for directory leases. These will break when a leased directory is mutated by any additions, deletions, or renames, or when the directory's own metadata changes (e.g., chown(1)). Note that changes to the contents of existing files, for example, will not break directory leases.
Our current implementation modifies the NFS server so that NFS protocol operations will break directory leases. However, it is still possible for a local process on the server to modify a directory without breaking directory leases.
The final implementation will also ensure that operations by local processes break directory leases. This will require addressing some tricky VFS locking issues: the difficulty is that, given that breaking a lease involves blocking the caller, one must ensure that no important locks -- like a directory inode's i_mutex -- are held while the calling kernel thread blocks.
UPDATE
At this point, we are testing general VFS-level directory lease-breaking -- i.e., both NFS and non-NFS operations will break leases. Our approach is described in the next section.
Leases are usually acquired via the fcntl(2) call, and a lease-holder usually receives a signal from the kernel when a lease is being broken; the lease-holder indicates that any cleanup is finished with another fcntl(2) call. NFS leases are all acquired and revoked in-kernel.
Recalling NFS Delegations vs. Breaking Linux VFS (Non-NFS) Leases
In the following I will refer to the leases used to implement delegations as "NFS leases" and all other leases as "non-NFS leases".
NFS leases and non-NFS leases differ in how they handle the case where a lease-holder is also the caller performing an operation that conflicts with the lease-type, as described above.
Any operation that breaks a lease, and hence requires delegation recalls, has to wait for delegations to be returned. There are a number of different ways to do this:
- Delay responding to the original operation until all recalls are complete.
- Immediately return NFS4ERR_DELAY to the client; the process on the client will then block while the client polls on its behalf.
- Delay the response from the server for a little while, to handle the (probably common) case of a quick delegation return, and only return NFS4ERR_DELAY if the delegations aren't returned quickly enough.
For now, we have implemented the first option (delaying the response until all recalls are complete).
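The three options above differ only in how long the server is willing to wait, which a sketch with a single grace period can capture. This is a user-space illustration, not nfsd code: the function name is hypothetical, and a `threading.Event` stands in for "all recalled delegations have been returned".

```python
import threading

NFS4_OK = "NFS4_OK"
NFS4ERR_DELAY = "NFS4ERR_DELAY"


def handle_conflicting_op(all_returned, grace_seconds):
    """Wait up to `grace_seconds` for delegation returns, then give up.

    `all_returned` is a threading.Event that is set once every recalled
    delegation has been returned.
    """
    if all_returned.wait(timeout=grace_seconds):
        return NFS4_OK          # returned in time: perform the operation now
    return NFS4ERR_DELAY        # too slow: let the client retry later


returned = threading.Event()
slow = handle_conflicting_op(returned, grace_seconds=0.01)
returned.set()
quick = handle_conflicting_op(returned, grace_seconds=0.01)
```

Passing `grace_seconds=None` waits indefinitely, like the first option; passing `0` answers NFS4ERR_DELAY at once unless the delegations are already back, like the second; a small positive value gives the third.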
UPDATE
The approach we're currently taking to tackle the issues of integrating NFS delegations with Linux VFS leases (i.e., all directory mutating operations, whether locally on the server or over NFS, will break directory leases/delegations on the server) goes something like this:
When breaking a lease where the call is coming over NFS:
1) During processing, whenever the directory's dentry becomes available (e.g., after a lookup), disable lease-granting for its inode and try break_lease() with O_NONBLOCK. This will avoid blocking while locks are held, as well as avoid tying up server threads for (potentially) long periods.
2) If there was not a lease, finish the operation, re-enable lease-granting on the inode, and we're done.
3) If there was a lease, break_lease() will send the break signal(s) and nfsd will also fail (re-enabling lease-granting on the inode first) and the client gets NFS4ERR_DELAY (and should retry).
The downside to this is that a pathological case could arise wherein we break a lease, return NFS4ERR_DELAY, then the client retries the operation -- but another client has acquired a lease in the interim, and we could end up with a cycle.
When breaking a lease where the call is server-local:
1) Again, whenever a directory's dentry becomes available, disable lease-granting for its inode.
2) If locks (e.g., an i_mutex) are not held, call break_lease() and, as per normal lease-semantics, block the breaker until leases are returned, after which the breaker is unblocked and its operation succeeds.
3) If locks are held, call break_lease() with O_NONBLOCK; we assume the common-case to be that no lease is present. If break_lease() returns -EWOULDBLOCK, drop the locks and call break_lease() and allow it to block. Once the caller unblocks, restart the operation by reacquiring the locks and, e.g., redoing a lookup to make sure the file system object(s) still exist(s). Since lease-granting was disabled early-on, the operation will succeed in one pass.
4) Regardless of whether 2) or 3) happened, at the end lease-granting is naturally re-enabled for the inode(s) in question.
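The control flow of the two paths can be sketched in user-space Python. This is a simplified model, not kernel code: `Inode`, `WouldBlock`, `nfs_mutate`, and `local_mutate` are hypothetical names, the `break_lease` function here only imitates the non-blocking and blocking behaviours described above, and the blocking path simply clears the lease instead of actually waiting for the holder.

```python
class WouldBlock(Exception):
    """Stands in for break_lease() returning -EWOULDBLOCK."""


class Inode:
    def __init__(self, leased=False):
        self.leased = leased
        self.grants_disabled = False


def break_lease(inode, nonblock):
    if not inode.leased:
        return
    if nonblock:
        raise WouldBlock      # break has been signalled; caller must not wait
    inode.leased = False      # blocking path: "wait" until the lease is returned


def nfs_mutate(inode, do_op):
    """Call arriving over NFS: never block, fail with NFS4ERR_DELAY instead."""
    inode.grants_disabled = True
    try:
        break_lease(inode, nonblock=True)
        do_op()
        return "NFS4_OK"
    except WouldBlock:
        return "NFS4ERR_DELAY"
    finally:
        inode.grants_disabled = False   # re-enabled on every exit path


def local_mutate(inode, do_op, trace):
    """Server-local call made with locks held (case 3 in the text)."""
    inode.grants_disabled = True
    try:
        trace.append("lock")
        try:
            break_lease(inode, nonblock=True)
        except WouldBlock:
            trace.append("unlock")              # drop locks before blocking
            break_lease(inode, nonblock=False)  # block until the lease is gone
            trace.append("lock")                # reacquire and redo lookups
        do_op()
        trace.append("unlock")
    finally:
        inode.grants_disabled = False


ops = []
nfs_inode = Inode(leased=True)
first = nfs_mutate(nfs_inode, lambda: ops.append("CREATE"))
nfs_inode.leased = False                 # the holder returns its lease
retry = nfs_mutate(nfs_inode, lambda: ops.append("CREATE"))

trace = []
local_inode = Inode(leased=True)
local_mutate(local_inode, lambda: trace.append("op"), trace)
```

Because lease-granting stays disabled for the whole of `local_mutate`, no new lease can appear between the blocking break and the restarted operation, which is why a single restart suffices in that path. The NFS path re-enables granting before returning NFS4ERR_DELAY, which is where the retry cycle mentioned above can arise.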
Policy (partial)
client: prior to a READDIR, request.
client: if we've sent 3 or 5 revalidations and a directory hasn't changed, request.
client: when to voluntarily surrender? e.g., after a kernel-compile, i hold hundreds of delegations.
server: if a directory's delegation has been recalled in the last N minutes, don't grant new ones.
server: will need to ID "misbehaving" clients and cordon them off.
server: when to preemptively recall? --> server load metric
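Two of the candidate policies above (the client's "N unchanged revalidations" rule and the server's "recently recalled" rule) could be expressed as follows. This is a sketch under assumptions: the class names, the default threshold of 3, and the use of caller-supplied timestamps in seconds are all choices made for the example, not settled policy.

```python
class ClientRequestPolicy:
    """Request a delegation before READDIR or after repeated unchanged revalidations."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.unchanged = {}      # directory -> consecutive unchanged revalidations

    def note_revalidation(self, directory, changed):
        self.unchanged[directory] = 0 if changed else self.unchanged.get(directory, 0) + 1

    def should_request(self, directory, about_to_readdir=False):
        return about_to_readdir or self.unchanged.get(directory, 0) >= self.threshold


class ServerGrantPolicy:
    """Refuse new delegations on a directory that was recalled recently."""

    def __init__(self, quiet_seconds):
        self.quiet_seconds = quiet_seconds
        self.last_recall = {}    # directory -> time of the most recent recall

    def note_recall(self, directory, now):
        self.last_recall[directory] = now

    def may_grant(self, directory, now):
        last = self.last_recall.get(directory)
        return last is None or now - last >= self.quiet_seconds


client = ClientRequestPolicy(threshold=3)
for _ in range(3):
    client.note_revalidation("/usr/include", changed=False)

server = ServerGrantPolicy(quiet_seconds=300)
server.note_recall("/tmp", now=1000)
```

Passing the current time in explicitly keeps the server policy easy to test; a real implementation would read a clock instead.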
(simulator)
Previous work at CITI by Brian Wickman consisted of prototyping and analyzing file and directory delegations, based on recorded network traces of NFSv3 use in college environments. The stateless nature of NFSv3 required the instrumentation of OPEN and CLOSE operations into the traces, e.g., but given that in the absence of delegations, NFSv4 client-side cache validation closely mimics that of NFSv3, enough information was available to get an overall impression of the state of the clients' caches. Wickman wrote a simulator to use the instrumented traces to test different delegation models and policies. We now want to use real-world NFSv4 network traces with the simulator, but given the current absence of widescale mainstream deployment of NFSv4, we need to find such traces of representative workloads. Using actual NFSv4 traffic will give a more accurate picture of client-cache state and will more clearly identify operations obviated by delegations; this is both because the traces will not need to be instrumented, and because NFSv3 lacks the COMPOUND operation, with which NFSv4 coalesces groups of commands. NFSv4 traces used with the simulator will allow us to develop client- and server-side policies for requesting and granting delegations.
Some preliminary numbers
A significant demonstration of the benefits of negative dentry caching is with software compilation. For instance, when building software using the make(1) program, various directories are repeatedly searched for header files. Since header files tend only to be located in one of the directories, and since many object files depend on the same headers, there are a great number of unnecessary checks. By caching negative dentries, a significant number of NFS operations are obviated.
We have some very rough numbers in terms of opcounts with vs. without directory (not file) delegations enabled. We used a very naive client policy of simply requesting delegations prior to a READDIR (note that make(1) periodically calls getdents(2) on its own). ACCESS, GETATTR, and LOOKUP are where the real savings are; the other opcounts are just included for context. Again, these numbers are rough, but indicate that compilation environments stand to benefit from directory delegations.
Doing make(1) on cscope-15.5 (first without, then with directory delegations):
Operation        Without   With   Reduction
READ             136       124
WRITE            137       136
OPEN             1576      1576
ACCESS           1169      161    86%
GETATTR          903       628    30%
LOOKUP           1494      496    67%
GET_DIR_DELEG    -         7
DELEGRETURN      -         1
Doing make(1) on the 2.6.16 linux kernel (first without, then with directory delegations):
Operation        Without   With     Reduction
READ             19803     19892
WRITE            21921     21869
OPEN             497472    494648
ACCESS           20638     3406     83.5%
GETATTR          41794     24563    41.0%
LOOKUP           45063     17447    61.3%
READDIR          1016      884      13.0%
GET_DIR_DELEG    -         750
DELEGRETURN      -         none
Status
At the moment, we are working on reasonably representative tests that show the benefits of directory delegations (in terms of opcounts); pynfs tests are also being written.
The client
- The client currently requests a delegation just prior to issuing a READDIR on an undelegated directory, or when it has done "a few" parent directory revalidations and noticed that it hasn't changed during that span.
- As long as the client has such a delegation, it will generally refrain from issuing ACCESS, GETATTR, and READDIR calls on the directory (see below) ...
- .. in some cases, though, the client's cache(s) may be deliberately invalidated and require a refresh (e.g., a client creates a file in a directory delegated to it, which won't break its delegation; however, in order to see the file, the client must revalidate its pagecache and send a READDIR on the wire).
- TODO: get more opcounts! (hosting a webserver's docroot off an nfs mount? PATH or LD_LIBRARY_PATH stuff?)
- TODO: redo existing opcount tests and instead tally bandwidth savings ...
- getting real NFSv4 workload network traces would be great -- can you help? (richterd AT citi.umich.edu)
- When should/can we decide to voluntarily return delegations (other than when we have no more active open-state)?
The server
- Differentiate between turning file/directory delegations on/off at runtime (done) and enabling/disabling the capability itself (not done; would prevent our client from ever asking for delegations in the first place, independent of its requesting policy).
- The following NFS operations currently break directory delegations: CREATE, LINK, REMOVE, RENAME, and OPEN(w/create). SETATTR on directories is pending.
- An NFS SETATTR breaks file delegations when the file size is changing. Breaking on metadata changes is pending.
- The corresponding VFS-level operations also break delegations and are being tested.
- How to acknowledge/when to act upon resource pressures? --> e.g., after compiling the linux kernel, a client holds ~750 delegations -- that's like 50KB of state on the server, and nearly as much on the client.
- TODO: get NFSv2/NFSv3 operations to break (file and directory) delegations at all of the right times, too.
- TODO: also -- policy, look at dir deleg/file deleg interactions, ..