Cluster Coherent NFSv4 and Share Reservations

Background

NFSv4 share reservations control the concurrent sharing of files at the time they are opened. Share reservations come in two flavors, ACCESS and DENY. There are three types of ACCESS reservations: READ, WRITE, and BOTH; and four types of DENY reservations: NONE, READ, WRITE, and BOTH.

ACCESS reservations are familiar to Linux users, as they map directly to POSIX open() flags. NFSv4 ACCESS shares of READ, WRITE, and BOTH map directly to O_RDONLY, O_WRONLY, and O_RDWR, respectively.
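
A rough illustration of this mapping (the helper name is made up for illustration; the OPEN4_SHARE_ACCESS_* values are the ones defined in RFC 3530):

    /*
     * Illustrative sketch only: translating a POSIX open() access mode
     * into an NFSv4 OPEN share_access value.
     */
    #include <fcntl.h>

    #define OPEN4_SHARE_ACCESS_READ   0x00000001
    #define OPEN4_SHARE_ACCESS_WRITE  0x00000002
    #define OPEN4_SHARE_ACCESS_BOTH   0x00000003

    static unsigned int accmode_to_share_access(int open_flags)
    {
            switch (open_flags & O_ACCMODE) {
            case O_RDONLY:
                    return OPEN4_SHARE_ACCESS_READ;
            case O_WRONLY:
                    return OPEN4_SHARE_ACCESS_WRITE;
            case O_RDWR:
                    return OPEN4_SHARE_ACCESS_BOTH;
            default:
                    return 0;       /* invalid access mode */
            }
    }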

NFSv4 DENY reservations act as a type of whole-file lock applied when a file is opened. NFSv4 DENY shares of READ, WRITE, and BOTH prevent other opens that request read access, write access, or any access, respectively, from succeeding. DENY NONE allows other opens to proceed.
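
The conflict test implied by these definitions is symmetric: a new open conflicts with an existing one if the access it requests intersects the existing deny, or the deny it requests intersects the existing access. A minimal sketch, using the RFC 3530 bit values (the function name is made up for illustration):

    #define OPEN4_SHARE_ACCESS_READ   0x00000001
    #define OPEN4_SHARE_ACCESS_WRITE  0x00000002
    #define OPEN4_SHARE_DENY_NONE     0x00000000
    #define OPEN4_SHARE_DENY_READ     0x00000001
    #define OPEN4_SHARE_DENY_WRITE    0x00000002

    /* Non-zero if the requested open conflicts with an existing open. */
    static int shares_conflict(unsigned int cur_access, unsigned int cur_deny,
                               unsigned int req_access, unsigned int req_deny)
    {
            /*
             * READ and WRITE use the same bit values in access and deny,
             * so a bitwise intersection is enough.
             */
            return (req_access & cur_deny) || (req_deny & cur_access);
    }

For example, if an existing open holds DENY WRITE, a later open requesting write access fails; a later open requesting only read access with DENY NONE succeeds.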

The Linux system call interface for open() follows the POSIX standard, which does not include support for share reservations. In particular, there is no direct analog in POSIX for an application to request DENY READ, WRITE, or BOTH shares. Consequently, Linux NFSv4 clients always use DENY NONE.

The mismatch between POSIX and NFSv4 shares is also reflected on an NFSv4 server. The Linux NFSv4 server, when it receives DENY reservations from clients that can express them (in practice, Windows clients), does the appropriate bookkeeping and enforcement, but the local file system is unable to enforce DENY shares against local access on the server.

When a cluster file system is exported with NFSv4, multiple NFSv4 servers export a common back-end file system, so ACCESS and DENY reservations must be coordinated to take into account shares granted by the other NFSv4 servers. In other words, each NFSv4 server has to ask the cluster file system whether an incoming OPEN share can be granted.
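
There is no such interface in the VFS today; the following is only a hypothetical sketch of what an "ask the cluster file system" hook might look like, with all names invented for illustration:

    struct inode;

    /* Requested share reservation, in OPEN4_SHARE_* terms. */
    struct share_request {
            unsigned int access;    /* OPEN4_SHARE_ACCESS_* bits */
            unsigned int deny;      /* OPEN4_SHARE_DENY_* bits   */
    };

    /*
     * Hypothetical hook the NFSv4 server would call before granting an
     * OPEN: the cluster file system checks the request against opens made
     * through every node exporting the same back-end file system, and
     * returns 0 to grant the share or a negative error on conflict.
     */
    typedef int (*cluster_check_share_t)(struct inode *inode,
                                         const struct share_request *req);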

Linux Deny Share Support

Getting deny share support into the kernel will be difficult for several reasons:

   * Deny shares are not present in POSIX systems such as Linux.
   * Deny shares are only needed to support NFSv4 Windows clients.
   * There is no native Windows NFSv4 client; the existing ones are all third party 
     (e.g., Hummingbird).
   * There are currently no open source Linux file systems that support deny shares.
   * The user-level Samba server implements deny share locking with open and flock, 
     with all the attendant races.
   * Unix NFSv4 clients (which use only access shares, never deny shares) currently 
     work correctly.

Implementation Issues

We want to correctly enforce open share deny bits, for the benefit of Windows NFSv4 clients, across the whole cluster. This is complicated, since an open is simultaneously:

   * (a) a lookup
   * (b) a create (possibly)
   * (c) a lock

We manage to do (a) and (b) atomically on the client with open intents. The distributed file system may have to do the same thing. We also need to deal with (c) atomically somehow.

One possible problem (there may be others): you can't take the share lock before the create, so you must create first. But once you've created the file, someone else may find it and get a conflicting share lock. Returning a deny error to an open that just created the file is probably unexpected behavior.

So it'd be nice to add the share_lock to the open instead of making it a separate operation.

One approach

   * Add two bits to the open flags, deny_read and deny_write. (Use the existing open 
     access bits as the allow bits.) Also make sure these get propagated to the intent 
     structure.
   * Provide an operation adjust_share(file, flags). The file system should be allowed 
     to refuse transitions that could not result from an open or a close (that is, 
     anything that doesn't only turn bits on or only turn them off). A sketch of this 
     interface appears after this list.

Is there a race here? Say we get an open with O_CREAT together with a share lock. How do we decide whether to treat it as an upgrade or a new open?

Best attempt

   * look up; upgrade the existing share if we find it.
   * open; if we get an error indicating a share conflict, retry the lookup, and so on 
     (a sketch of this retry loop follows the list).
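
A minimal sketch of that retry loop, with every helper name invented for illustration (find_existing_open, upgrade_share, open_with_share) and -EWOULDBLOCK standing in for whatever error the file system would use to signal a share conflict:

    #include <errno.h>

    struct open_state;
    struct open_state *find_existing_open(const char *path);
    int upgrade_share(struct open_state *st, unsigned int access, unsigned int deny);
    int open_with_share(const char *path, unsigned int access, unsigned int deny);

    int open_or_upgrade(const char *path, unsigned int access, unsigned int deny)
    {
            for (;;) {
                    /* If we already have an open, treat this as an upgrade. */
                    struct open_state *st = find_existing_open(path);
                    if (st)
                            return upgrade_share(st, access, deny);

                    /* Otherwise try a fresh open with the requested shares. */
                    int err = open_with_share(path, access, deny);
                    if (err != -EWOULDBLOCK)
                            return err;

                    /* Lost a race and hit a share conflict; look up again. */
            }
    }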

Obviously not ideal. Would it help to get a reference on the dentry before trying the open?

Is there currently a lookup/open race if the back end is a distributed file system? I suppose that's up to them; we need to look at how we implement open and make sure it handles the intent machinery correctly. At a brief glance, it looks like we probably don't.

An alternative might be to expose something similar to the openowner to the VFS and let it decide (by comparing openowners) whether a given open is an upgrade or a new open.
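
A hypothetical sketch of what comparing openowners might look like (the struct mirrors the protocol's open_owner4, a client identifier plus an opaque owner string; nothing like this is exposed to the VFS today):

    #include <string.h>

    struct openowner {
            unsigned long long clientid;    /* NFSv4 client id */
            unsigned int       owner_len;
            unsigned char      owner[1024]; /* opaque owner supplied by the client */
    };

    /* Same owner: treat the OPEN as an upgrade; otherwise it is a new open. */
    static int same_openowner(const struct openowner *a, const struct openowner *b)
    {
            return a->clientid == b->clientid &&
                   a->owner_len == b->owner_len &&
                   memcmp(a->owner, b->owner, a->owner_len) == 0;
    }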

Status

No progress to report.
