Cluster Coherent NFS and Byte Range Locking

From Linux NFS

Latest revision as of 17:09, 11 October 2006

Background

For some time, exporting byte-range locks to NFS has been a challenge in Linux. Support for file system locks was designed with a process model and a local file system in mind. This suggested a synchronous interface in which a process that requests a lock is either granted the lock or suspended and placed on a queue. When the lock becomes available, a suspended process is granted the lock and allowed to proceed.

This synchronous approach breaks down when the request is made by a server, e.g., LOCKD or NFSD, where threads are a scarce resource. The synchronous approach threatens to block the server process, which constitutes a disaster.

Hence an asynchronous lock request interface has emerged.

One complexity in making that transformation is devising a mechanism for queueing contending requests. The queue should be fair, not giving preference to one source of lock requests over another. Ideally, contending lock requests should be granted in the order in which they are issued. This argues for a single queue of pending requests, regardless of their source.

Cluster file systems exported with NFS introduce another layer of complexity: often they need to coordinate their locks with a lock manager in the back end. But back end coordination can be delayed, e.g., by inter-node communication, which poses another threat to a threaded server.

Finally, NFSv4 introduces one more layer of complexity: unlike NLM locks, which block, NFSv4 byte-range locks are non-blocking, so clients contending for a lock must poll. This raises the stakes for fair queueing, as a local process waiting for a lock will almost always acquire the contended lock before an NFSv4 client can.

NFSv4 Blocking Locks

Addressing fair queueing, the NFSv4 spec suggests that the server should maintain an ordered list of pending blocking locks. More broadly, queue fairness suggests that all lock requestors (local processes, LOCKD, and the NFSv4 server) should share such an ordered list.

Tasks

  • Implement a shared blocking lock fair queue
  • Implement the NFSv4 server fl_notify and use the fair queue

Progress

We have written patches that change the semantics of the existing file_lock->fl_block queue to integrate it with the NFSv4 server and to make it more 'fair.' This queue holds all blocking locks in the order in which they were requested. New blockers are added to the tail.
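The ordering rule described above can be sketched in user-space C. This is a minimal illustration, not the patch code: the names `fair_queue`, `waiter`, `lock_source`, `fq_enqueue`, and `fq_dequeue` are hypothetical and do not correspond to kernel identifiers. It shows only the property the text asks for, a single queue shared by all request sources that serves waiters in arrival order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical origin of a lock request; not the kernel's own types. */
enum lock_source { SRC_LOCAL, SRC_LOCKD, SRC_NFSD };

struct waiter {
    enum lock_source source;
    int id;                 /* illustrative request identifier */
    struct waiter *next;
};

/* One queue shared by every source, kept in arrival order. */
struct fair_queue {
    struct waiter *head;
    struct waiter *tail;
};

void fq_init(struct fair_queue *q)
{
    q->head = NULL;
    q->tail = NULL;
}

/* New blockers always join at the tail, whatever their source. */
void fq_enqueue(struct fair_queue *q, struct waiter *w)
{
    assert(w != NULL);
    w->next = NULL;
    if (q->tail)
        q->tail->next = w;
    else
        q->head = w;
    q->tail = w;
}

/* The oldest waiter is served first; returns NULL when the queue is empty. */
struct waiter *fq_dequeue(struct fair_queue *q)
{
    struct waiter *w = q->head;

    if (!w)
        return NULL;
    q->head = w->next;
    if (!q->head)
        q->tail = NULL;
    w->next = NULL;
    return w;
}
```

Because the dequeue path never inspects `source`, a request from the NFSv4 server that arrived first is served before a later local request, which is the fairness property the shared queue is meant to provide.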

These patches have not been reviewed by the wider kernel community. However, the effort exposed a number of spec and implementation problems for which fixes were incorporated into the Linux kernel and the new NFSv4.1 draft.

The existing fl_block semantics

When a lock becomes available, local blocked processes are awakened and contending NLM clients are issued a "grant" callback. Contending NFSv4 clients, which do not block in anticipation of a server callback, receive no notification. Instead, they repeatedly poll the server to discover whether the blocked lock is available.

In more detail, when a lock is released, the kernel traverses the lock's fl_block list and wakes each blocked requester, resulting in a 'scrum' to get the lock. The winner then places all losers on its fl_block list.

This queue is fair to contending processes in that blockers wake in order, although a process awakened late will likely find the lock already claimed. But it is not fair to LOCKD, which has to perform some bookkeeping tasks before requesting the lock, giving local processes an advantage. And it is especially unfair to the NFSv4 server, which must wait for the contending client to poll again before it can attempt to acquire the lock.

The new 'fair' fl_block semantics

We tried modifying the VFS lock code so that it grants locks to queued contenders, wakes the lucky ones whose locks succeed, and returns the others to the fl_block list. We used a kernel lock to protect the fl_block list during processing. We immediately ran into a few problems:

  • Claiming the lock means calling posix_lock_file, which calls kmalloc, which can sleep. Sleeping is not allowed while holding a spinlock, so we would have to use a semaphore or mutex instead.
  • However, for the purposes of mandatory lock checking, this new lock must also be obtained in the read/write path to check for lock compliance, and adding a semaphore or mutex to the performance-critical read/write path is thought to be inefficient.

We investigated alternative locking schemes, but we soon identified a critical problem: an NFSv4 client that has been polling for a lock may stop polling at any time without notice. (For example, a user might grow weary of waiting for an application that is polling for a lock and interrupt it.)

Granting a lock to a client that does not want it is benign if the lock grant can be revoked. However, in some cases it may be difficult for the server to revoke the errantly granted lock, e.g., if the lock has been downgraded or coalesced with other locks. In these cases, the incorrect behavior cannot be reversed with a simple unlock.

This argues for the ability to request a new kind of byte-range lock from the back end file system -- a provisional lock -- that supersedes contending lock requests, but that does not downgrade or coalesce existing posix locks. This lets us remove the lock safely and easily if the client does not return.

Our patches add this provisional lock type to the VFS lock code. After these patches, the VFS lock code again walks through the fl_block list, now applying provisional locks as it can, and waking these queued contenders. We do not upgrade the lock to a real posix byte-range lock until the contender wakes up and requests (or, optionally, cancels) the lock.

To address the concern that the contender may never return, we consider the three cases: process, NLM client, and NFSv4 client.

First, the structure of the Linux kernel guarantees that a contending process on the queue must return: a process can lose interest in a lock only through an external signal, and the kernel signal handling code removes the process from the lock queue. Similarly, an NLM client that loses interest in a lock cancels its request when it wakes up, giving LOCKD the opportunity to revoke the lock request. Finally, if an NFSv4 client loses interest in a lock, NFSD revokes the lock request after a timeout.

The provisional lock is simple enough to be applied without requiring memory allocations, which sidesteps the kernel spinlock problems described earlier.
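The life cycle of a provisional lock described above can be sketched as a small state machine in user-space C. This is only an illustration under stated assumptions: the names `lock_req`, `req_state`, `req_reserve`, `req_claim`, `req_cancel`, and `req_expire` are hypothetical, time is modeled as a plain tick counter, and the timeout stands in for whatever interval NFSD would actually use. Note that every transition works on preallocated state, in keeping with the observation that no memory allocation is needed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states for one queued request; names are illustrative. */
enum req_state { REQ_WAITING, REQ_PROVISIONAL, REQ_GRANTED, REQ_DROPPED };

struct lock_req {
    enum req_state state;
    long deadline;      /* tick after which an unclaimed hold is dropped */
};

/* Lock released: reserve it for the waiter without touching real locks. */
bool req_reserve(struct lock_req *r, long now, long timeout)
{
    assert(timeout >= 0);
    if (r->state != REQ_WAITING)
        return false;
    r->state = REQ_PROVISIONAL;
    r->deadline = now + timeout;
    return true;
}

/* Contender returned and asked for the lock: upgrade to a real lock. */
bool req_claim(struct lock_req *r, long now)
{
    if (r->state != REQ_PROVISIONAL || now > r->deadline)
        return false;
    r->state = REQ_GRANTED;
    return true;
}

/* Contender cancelled (signal or NLM cancel): discard the request. */
bool req_cancel(struct lock_req *r)
{
    if (r->state != REQ_WAITING && r->state != REQ_PROVISIONAL)
        return false;
    r->state = REQ_DROPPED;
    return true;
}

/* Periodic sweep: a polling client that never returned loses its hold. */
bool req_expire(struct lock_req *r, long now)
{
    if (r->state != REQ_PROVISIONAL || now <= r->deadline)
        return false;
    r->state = REQ_DROPPED;
    return true;
}
```

The three exits from `REQ_PROVISIONAL` correspond to the three cases in the text: a claim upgrades the hold, a cancel covers local processes and NLM clients, and expiry covers an NFSv4 client that stops polling.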

Along the way, we identified and fixed some problems with the NFSv4 protocol:

  • The NFSv4 protocol has no equivalent to the NLM "cancel" call. This means that when a client process stops blocking on a lock, the server may wait up to a lease period (typically about a minute) before giving up and allowing another waiting client to take the lock. We found a solution that is backwards compatible (and thus implementable by current NFSv4.0 clients and servers), and also added language describing this solution to the new NFSv4.1 draft.
  • The NFSv4 protocol has no equivalent to the "grant" call; clients must thus poll very frequently if they wish to acquire contended locks in a timely manner. However, the traditional NLM grant call is known to have problems, e.g., numerous race conditions. We therefore proposed an alternate mechanism that allows a server to notify a client of lock availability without committing the server to granting the lock to that client. Specific language for NFSv4.1 has been proposed and met with interest, but is awaiting working group consensus.

Cluster Filesystem lock() Interface

A Linux file system is allowed to export its own lock() method, but only a few file systems bother to do so. In particular, none of the file systems exported with NFS export a private lock() method. Consequently, neither LOCKD nor NFSD attempts to use private lock() methods.

Cluster file systems, on the other hand, do want to export a lock() method that is called by LOCKD and NFSD so that the back end can maintain a consistent view across servers. However, the current private lock() interface is unsuitable for cluster file systems, for two reasons:

  • As before, we can't afford to block the NFSv4 server or LOCKD threads, which argues for an asynchronous interface. This is especially helpful for non-blocking locks, which do not offer the option of returning a temporary "blocked" response followed by a callback that grants the request.
  • From the earlier discussion, even if the request is for a blocking lock, the file system must anticipate a return from the lock() method without having fully acquired the lock. We also need to anticipate cases where a process on the client is interrupted and the client cancels the lock.

Tasks

  • Design and implement an asynchronous interface to the private lock() method
  • Have LOCKD and NFSD test for the presence of a private lock() method and invoke the method when it is present

Progress

Acquiring a cluster file system lock may require communication with remote hosts. To avoid blocking lock manager threads during such communication, we allow the results to be returned asynchronously.

If a file system specific lock() invocation decides that it must block, e.g., because of a delay incurred in the course of granting a non-blocking lock request, the file system returns -EINPROGRESS. Later, the file system returns the result of the lock request through a callback registered in the lock_manager_operations struct.
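The return-then-callback pattern described above can be sketched in user-space C. This is a simplified illustration, not the kernel interface: `lock_manager_ops`, its `grant` member, `cluster_fs`, `cluster_fs_lock`, `cluster_fs_reply`, and `record_result` are hypothetical names, and the sketch tracks only one outstanding request. The only detail taken from the text is that the lock method returns -EINPROGRESS and the result arrives later through a callback registered by the lock manager.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct lock_request;

/* Stand-in for the callback slot that the text places in
 * lock_manager_operations; field name and signature are assumptions. */
struct lock_manager_ops {
    void (*grant)(struct lock_request *req, int result);
};

struct lock_request {
    const struct lock_manager_ops *ops;
    int result;     /* filled in by the callback */
    int done;       /* nonzero once the callback has run */
};

/* Hypothetical cluster file system with one deferred request slot. */
struct cluster_fs {
    struct lock_request *pending;
};

/* lock() entry point: never waits for the remote node. */
int cluster_fs_lock(struct cluster_fs *fs, struct lock_request *req)
{
    assert(req != NULL && req->ops != NULL);
    if (fs->pending)
        return -EAGAIN;     /* sketch handles a single outstanding request */
    fs->pending = req;
    return -EINPROGRESS;    /* the calling thread is free to serve others */
}

/* Runs when the back end answers; delivers the deferred result. */
void cluster_fs_reply(struct cluster_fs *fs, int result)
{
    struct lock_request *req = fs->pending;

    if (!req)
        return;
    fs->pending = NULL;
    req->ops->grant(req, result);
}

/* Example callback that a lock manager might register. */
void record_result(struct lock_request *req, int result)
{
    req->result = result;
    req->done = 1;
}
```

A caller would treat -EINPROGRESS as "answer pending", return to servicing other clients, and complete its reply to the client only when the callback fires.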

An FL_CANCEL flag is added to the struct file_lock to indicate to the file system that the caller wants to cancel the provided lock.

New routines vfs_lock_file, vfs_test_lock, and vfs_cancel_lock replace posix_lock_file, posix_test_lock, and posix_cancel_lock in LOCKD and the NFSv4 server. They invoke the private lock() method if it exists; otherwise they invoke the posix lock() method.
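The dispatch rule can be sketched in user-space C as follows. The types and functions here (`file_ops`, `sketch_vfs_lock`, `sketch_posix_lock`, `sketch_cluster_lock`) are simplified, hypothetical stand-ins rather than the kernel's definitions or signatures; the sketch shows only the "private method if present, generic path otherwise" decision.

```c
#include <assert.h>
#include <stddef.h>

struct file;

/* Simplified stand-in for a file's operations table; only lock() is modeled. */
struct file_ops {
    int (*lock)(struct file *f, int cmd);
};

struct file {
    const struct file_ops *ops;
    int last_path;      /* records which path ran, for illustration only */
};

enum { PATH_NONE, PATH_POSIX, PATH_PRIVATE };

/* Stand-in for the generic posix locking path. */
int sketch_posix_lock(struct file *f, int cmd)
{
    (void)cmd;
    f->last_path = PATH_POSIX;
    return 0;
}

/* Dispatcher in the spirit of vfs_lock_file: prefer the private method. */
int sketch_vfs_lock(struct file *f, int cmd)
{
    assert(f != NULL);
    if (f->ops && f->ops->lock)
        return f->ops->lock(f, cmd);
    return sketch_posix_lock(f, cmd);
}

/* Example private method, as a cluster file system might supply. */
int sketch_cluster_lock(struct file *f, int cmd)
{
    (void)cmd;
    f->last_path = PATH_PRIVATE;
    return 0;
}
```

Keeping the test in one wrapper means LOCKD and the NFSv4 server need no knowledge of which file systems supply their own method.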

Status

Our solution has been tested with the GPFS file system. Patches have been submitted to the Linux community, and we are responding to comments.

The lack of a consumer in the Linux kernel, e.g., a cluster file system with byte-range locking, has impeded acceptance, but with GFS2 now included in the Linux 2.6.19-rc1 kernel, we have reason for optimism.
