From Linux NFS

Revision as of 16:43, 19 January 2022 by Bfields (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.

You'll need nfs-utils at least 1.3.5 (specifically, 3f520e8f6f5 "exportfs: Make sure pass all valid export flags to nfsd"). Otherwise, on recent kernels, attempts to re-export NFS will likely result in "exportfs: <path> does not support NFS export".

The "fsid=" option is required on any export of an NFS filesystem.

For now you should probably also mount readonly and with -onolock (and don't depend on working file locking), and don't allow the re-exporting server to reboot.

known issues

fsid= required, crossmnt broken

The re-export server needs to encode into each filehandle something that identifies the specific filesystem being exported. Otherwise it's stuck when it gets a filehandle back from the client--the operation it uses to map the incoming filehandle to a dentry can't even work without a superblock. The usual ways of identifying a filesystem don't work for the case of NFS, so we require the "fsid=" export option on any re-export of an NFS filesystem.

Note also that normally you can export a tree of filesystems by exporting only the parent with the "crossmnt" option, and any filesystems underneath are then automatically exported with the same options. However, that doesn't apply to the fsid= option: it's purpose is to provide a unique identifier for each export, so it can't be automatically copied to the child filesystems.

That means that re-exporting a tree of NFS filesystems in that way won't work--clients will be able to access the top-level export, but attempts to traverse mountpoints underneath will just result in IO errors.

In theory, if the server could at least determine that the filehandle is for an object on an NFS filesystem, and figure out which server the filesystem's from, it could (given some new interface) ask the NFS client to work out the rest.

One idea might be an NFS proxy-only mode where a server is dedicated to reexporting the filesystems of exactly *one* other server, as is.

reboot recovery

NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)

But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.

Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that allows the client to state that it has lost its protocol state and wants to reclaim? (And then the client would issue reclaims as reclaims from the re-export server's clients came in.) Tentative plan: reboot recovery for re-export servers

Maybe the re-export server could take the stateids returned from the server and return them to its clients, avoiding the need for it to keep very much state.

filehandle limits

NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server. There's no guarantee this will work. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting repeatedly. See more details on [filehandle limits https://www.kernel.org/doc/html/latest/filesystems/nfs/reexport.html#filehandle-limits]

If re-export servers could reuse filehandles from the original server, that'd solve the problem. It would also make it easier for clients to migrate between the original server and other re-export servers, which could be useful.

The wrapping is needed so that the server can identify, even after it may have long forgotten about that particular filehandle, which export the filehandle refers to, so it can refer the operation to the correct underlying filesystem or server, and so it can enforce export permissions.

If a server exports only a single NFS filesystem, then there'd be no problem with it reusing the file handle it got from the original server. Possibly that's a common enough use case to be helpful? With containers we could still allow a single physical machine to handle multiple exports even if each container only handles on each.

Cooperating servers could agree on the structure of filehandles in a way that allowed them to reuse each others' filehandles. Possibly that could be standardized if it proved useful.

errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients

When re-exporting NFSv4.0 filesystems IO errors have been seen after dropping caches on the re-export server. This is probably due to the fact that an NFSv4 client has to open files to perform IO to them, but NFSv3 client only provides filehandles, and NFSv4.0 cannot open by filehandle (it can only open by (parent filehandle, filename) pair). NFSv4.1 allows open by filehandle.

Best is not to do this; use NFSv4.1 or NFSv4.2 on the original server, or NFSv4 on the clients.

If that's not possible, a workaround is to configure the re-export server to be reluctant to evict inodes from cache.

Some more details at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note some other cases there (NFSv3 re-exports of NFSv3) are fixed by patches probably headed for 5.11.

Maybe the NFSv4.0 client could also be made to support open-by-filehandle by skipping the open and using special stateids instead? I'm not sure.

unnecessary GETATTRs

We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.

Also, depending on NFS versions on originating and re-exporting servers, we could probably save some GETATTRs, and set the atomic bit in some cases, if we passed along wcc information from the original server. Requires a special knfsd<->nfs interface. Should be doable.

re-export not reading more than 128K at a time

For some reason when the client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. Workaround is to manually increase client readahead; see https://lore.kernel.org/linux-nfs/1688437957.87985749.1605554507783.JavaMail.zimbra@dneg.com/

open DENY bits ignored

NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which allow you, for example, to open a file in a mode which forbids other read opens or write opens. The Linux client doesn't use them, and the server's support has always been incomplete: they are enforced only against other NFS users, not against processes accessing the exported filesystem locally. A re-export server will also not pass them along to the original server, so they will not be enforced between clients of different re-export servers.

This is probably not too hard to fix, but also probably not a high priority.

Known problems that we've fixed

Problems with sporadic stale filehandles should be fixed by https://lore.kernel.org/linux-nfs/20201019175330.595894-1-trondmy@kernel.org/ (queued for 5.11?)
Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. We have fixes for 5.11.
File locking crashes should be fixed as of 5.15. (But note reboot recovery is still unsupported.)
delegations and leases should work; this could probably use some testing.

Use cases

Scaling read bandwidth

You should be able to scale bandwidth by adding more re-export servers; fscache on the re-export servers should also help.

Hiding latency of distant servers

You should also be able to hide latency when the original server is far away. AFS read-only replication is an interesting precedent here, often used to distribute software that is rarely updated. CernVM-FS occupies a similar niche. fscache should help here too.

NFS version support

It's also being used as a way to add support for all NFS versions to servers that only support a subset. Careful attention to filehandle limits is required.

NFS re-export