Cluster Coherent NFSv4 and Share Reservations

From Linux NFS

Revision as of 18:15, 5 April 2006 by Andros (Talk | contribs)



NFSv4 share reservations come in two flavors: ACCESS and DENY. ACCESS reservations are familiar to Linux users: they correspond to the POSIX open() flags O_RDONLY, O_WRONLY, and O_RDWR, which are mapped to the NFSv4 access shares READ, WRITE, and BOTH respectively.
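
As a rough illustration of this mapping, the sketch below translates the access mode of POSIX open() flags into an NFSv4 access share. The constant values follow the NFSv4 protocol definition (RFC 3530); the helper name access_share_from_open_flags is made up for this example and is not part of any existing client.

```python
import os

# Share access values as defined by the NFSv4 protocol (RFC 3530).
OPEN4_SHARE_ACCESS_READ = 0x1
OPEN4_SHARE_ACCESS_WRITE = 0x2
OPEN4_SHARE_ACCESS_BOTH = 0x3

# Mask covering the access-mode bits of the open() flags.
_ACCMODE = os.O_RDONLY | os.O_WRONLY | os.O_RDWR


def access_share_from_open_flags(flags):
    """Hypothetical helper: map open() flags to an NFSv4 access share."""
    mode = flags & _ACCMODE
    if mode == os.O_RDONLY:
        return OPEN4_SHARE_ACCESS_READ
    if mode == os.O_WRONLY:
        return OPEN4_SHARE_ACCESS_WRITE
    if mode == os.O_RDWR:
        return OPEN4_SHARE_ACCESS_BOTH
    raise ValueError("invalid access mode in open flags")
```

Other flags such as O_CREAT do not influence the result, since only the access-mode bits are examined.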

NFSv4 also has deny shares NONE, READ, WRITE, and BOTH. With the exception of deny NONE, deny shares act as a type of whole file lock: requesting deny READ at open means that no other open with read access will succeed.
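
The conflict rule can be sketched as a pair of bit tests: a new open conflicts with an existing one if it asks for access the existing open denies, or denies access the existing open holds. The constants follow RFC 3530; the function shares_conflict is a hypothetical name used only for illustration.

```python
# Share access and deny values as defined by the NFSv4 protocol (RFC 3530).
OPEN4_SHARE_ACCESS_READ = 0x1
OPEN4_SHARE_ACCESS_WRITE = 0x2
OPEN4_SHARE_ACCESS_BOTH = 0x3

OPEN4_SHARE_DENY_NONE = 0x0
OPEN4_SHARE_DENY_READ = 0x1
OPEN4_SHARE_DENY_WRITE = 0x2
OPEN4_SHARE_DENY_BOTH = 0x3


def shares_conflict(held_access, held_deny, new_access, new_deny):
    """Hypothetical check: does a new open conflict with an existing one?

    READ and WRITE use the same bit positions in the access and deny
    values, so a bitwise AND is enough to detect an overlap.
    """
    return bool((new_access & held_deny) or (new_deny & held_access))
```

With deny NONE on both sides the function never reports a conflict, which is the situation for Linux clients described next.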

Linux supports a POSIX syscall interface, which does not include support for share reservations. Specifically, there is no way for an application to request deny shares. So Linux NFSv4 clients always use a deny of NONE, because there is no way to express any other deny share through the POSIX open() interface.

This is also true for local access to NFSv4 exports. While the NFSv4 server keeps track of and enforces deny shares from clients that can express them (e.g. Windows clients), there is no way to enforce deny shares on local access.

In the cluster file system case, where multiple NFSv4 servers are exporting the same back-end file system, the share ACCESS/DENY decision needs to be distributed to take into account shares from other NFSv4 servers; in other words, the NFSv4 server has to ask the cluster file system if an incoming OPEN share can be granted.
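
A toy model of that arrangement, assuming the cluster file system offers a single point where share state is checked and recorded atomically. ClusterShareTable and NFSServer are invented names; a real cluster file system would need a distributed mechanism in place of the in-process lock used here.

```python
import threading


class ClusterShareTable:
    """In-memory stand-in for the cluster file system's share state."""

    def __init__(self):
        self._lock = threading.Lock()
        self._shares = {}  # path -> list of (server_id, access, deny)

    def try_open(self, path, server_id, access, deny):
        """Check and record a share in one step; return True if granted."""
        with self._lock:
            held = self._shares.setdefault(path, [])
            for _, held_access, held_deny in held:
                if (access & held_deny) or (deny & held_access):
                    return False
            held.append((server_id, access, deny))
            return True


class NFSServer:
    """Hypothetical NFSv4 server that defers share decisions to the cluster."""

    def __init__(self, server_id, cluster):
        self.server_id = server_id
        self.cluster = cluster

    def handle_open(self, path, access, deny):
        # The server does not decide locally; it asks the shared table.
        return self.cluster.try_open(path, self.server_id, access, deny)
```

Two NFSServer objects built on the same ClusterShareTable see each other's shares, so a deny READ granted through one server causes a read open through the other to be refused.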

Linux Deny Share Support

Reasons that getting deny share support into the kernel will be difficult include:

   * Deny shares are not present in POSIX systems such as Linux.
   * Deny shares are only needed to support NFSv4 Windows clients.
   * There is no native NFSv4 Windows client (all are third party: Hummingbird).
   * There are currently no open Linux file systems that support deny shares.
   * The user-level Samba server uses open and flock (with all the races) to implement deny share locking.
   * Unix NFSv4 clients (no deny shares, only access shares) currently work correctly.

Implementation Issues

We want to correctly enforce open share deny bits across the whole cluster, for the benefit of Windows NFSv4 clients. This is complicated, since an open is simultaneously:

   * a lookup
   * a create (possibly)
   * a lock

We manage to do the lookup and the create atomically on the client with open intents. The distributed filesystem may have to do the same thing. We also need to handle the lock atomically somehow.

One possible problem (there may be others): you can't lock before create, so you must create first. But once you've created, someone else may find the file and get a share lock. Returning a deny to an open that created a file is probably unexpected behavior.

So it'd be nice to add the share_lock to the open instead of making it a separate operation.

One approach

   * Add 2 bits to the open flags, deny_read and deny_write. (Use the existing open bits as the allow bits.) Also make sure these get propagated to the intent structure.
   * Provide an operation adjust_share(file, flags). The FS should be allowed to refuse operations that could not result from an open or a close (that is, anything that neither only turns bits on nor only turns them off).
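
A minimal sketch of the refusal rule for adjust_share, assuming a made-up bit layout in which two allow bits are followed by the two proposed deny bits. The OpenFile class and the constant values are placeholders for illustration, not an existing kernel interface.

```python
# Hypothetical bit layout: the two low bits stand in for the existing
# open (allow) bits, the next two are the proposed deny bits.
ALLOW_READ = 0x1
ALLOW_WRITE = 0x2
DENY_READ = 0x4
DENY_WRITE = 0x8


class OpenFile:
    """Minimal stand-in for an open file carrying its current share bits."""

    def __init__(self, share_flags):
        self.share_flags = share_flags


def adjust_share(file, flags):
    """Set the share bits, refusing changes that mix turning on and off."""
    turned_on = flags & ~file.share_flags
    turned_off = file.share_flags & ~flags
    if turned_on and turned_off:
        # Neither a pure upgrade (open) nor a pure downgrade (close).
        raise ValueError("change mixes turning bits on and off")
    file.share_flags = flags
```

Going from ALLOW_READ to ALLOW_READ | ALLOW_WRITE is accepted as an upgrade, while going from ALLOW_READ directly to ALLOW_WRITE is refused.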

Is there a race here? Say we get an open create with a share lock. How do we decide whether to treat it as an upgrade or an open?

Best attempt

   * look up; upgrade if we find it.
   * open; if we get an error indicating a share conflict, retry the lookup. Etc.
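
The two steps above can be sketched as a retry loop. This assumes the lookup returns the caller's existing open state for the file (or None), which is one reading of the notes; lookup, upgrade, open_create, and the ShareConflict exception are all hypothetical callables supplied by the caller, and the attempt limit is an addition to keep the loop bounded.

```python
class ShareConflict(Exception):
    """Raised by the hypothetical open_create() on a share conflict."""


def open_or_upgrade(lookup, upgrade, open_create, name, flags, max_attempts=3):
    """Look up and upgrade if found; otherwise open, retrying on conflict."""
    for _ in range(max_attempts):
        handle = lookup(name)
        if handle is not None:
            # Existing state found: treat the request as an upgrade.
            return upgrade(handle, flags)
        try:
            return open_create(name, flags)
        except ShareConflict:
            # State may have appeared between lookup and open: look again.
            continue
    raise ShareConflict("gave up after %d attempts" % max_attempts)
```

The window between the lookup and the open is still there; the loop only retries when it is hit, which is why this is a best attempt rather than a fix.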

Obviously not ideal. Would it help to get a reference on the dentry before trying the open?

Is there currently a lookup/open race if the backend is a distributed filesystem? I suppose that's up to them--we need to look at how we implement open and make sure it does the intent stuff right. On a brief glance it looks to me like we probably don't.

An alternative might be to expose something similar to the openowner to the vfs and let it decide (by comparing openowners) whether a given open is an upgrade or a new open.
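
One way to picture the openowner idea: keep open state keyed by (file, openowner), so that a second open from the same owner is recognized as an upgrade and only opens from other owners are checked for share conflicts. OpenTable and its method are invented for this sketch and do not correspond to an existing VFS interface.

```python
class OpenTable:
    """Hypothetical table of opens keyed by (file_id, openowner)."""

    def __init__(self):
        self._opens = {}  # (file_id, openowner) -> (access, deny)

    def open(self, file_id, openowner, access, deny):
        """Return "upgrade" or "new"; raise PermissionError on conflict."""
        key = (file_id, openowner)
        # Only opens held by *other* owners on the same file can conflict.
        for (held_file, held_owner), (h_access, h_deny) in self._opens.items():
            if held_file != file_id or held_owner == openowner:
                continue
            if (access & h_deny) or (deny & h_access):
                raise PermissionError("share conflict")
        if key in self._opens:
            # Same openowner: merge the new bits into the existing open.
            old_access, old_deny = self._opens[key]
            self._opens[key] = (old_access | access, old_deny | deny)
            return "upgrade"
        self._opens[key] = (access, deny)
        return "new"
```

Because the decision is made inside one table, the caller no longer has to guess between upgrade and open before issuing the request.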


No progress to report.
