Proposed Device Management Design
Rules from RFC 5661
- Device IDs are not guaranteed to be valid across metadata server restarts.
- A device ID is unique per client ID and layout type.
- Device ID to device address mappings are not leased, and can be changed at any time. (Note that while device ID to device address mappings are likely to change after the metadata server restarts, the server is not required to change the mappings.)
- The NFSv4.1 protocol has no optimal way to recall all layouts that referred to a particular device ID.
- It is possible that GETDEVICEINFO (and GETDEVICELIST) will race with CB_NOTIFY_DEVICEID, i.e., CB_NOTIFY_DEVICEID arrives before the client gets and processes the response to GETDEVICEINFO or GETDEVICELIST. The analysis of the race leverages the fact that the server MUST NOT delete a device ID that is referred to by a layout the client has.
Overview of current design
Read/write/commit pNFS paths look up a layout matching the request range and call LAYOUTGET if the range is not satisfied in the per-inode layout cache. Upon return, the file layout driver checks the returned layout for validity prior to inserting the layout (segment) into the layout cache. This includes looking up the device ID in the device ID cache.
If the device ID is not found, GETDEVICEINFO is called as a synchronous RPC with a max count of PAGE_SIZE. If the call fails, the LAYOUTGET fails, unless NFS4ERR_TOOSMALL is returned, in which case a single retry with a max count of up to 6 * PAGE_SIZE is sent.
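The single-retry rule can be sketched in user-space C as follows. This is a minimal illustration, not the kernel code: `getdeviceinfo_with_retry`, `getdevinfo_fn`, `fake_server`, `fake_send`, and `SKETCH_PAGE_SIZE` are hypothetical names, the page size is assumed to be 4 KiB, and the retry simply uses the full 6-page cap. The error is spelled NFS4ERR_TOOSMALL, as in RFC 5661.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE  4096u   /* assumption: 4 KiB pages */
#define NFS4_OK           0
#define NFS4ERR_TOOSMALL  10005   /* error code as defined in RFC 5661 */

/* Hypothetical transport hook: sends one GETDEVICEINFO with the given
 * max count and returns an NFSv4 status code. */
typedef int (*getdevinfo_fn)(void *ctx, unsigned int maxcount);

/* First attempt with one page; on NFS4ERR_TOOSMALL retry exactly once
 * with the larger cap. Any other error is returned to the caller, which
 * then fails the LAYOUTGET. */
int getdeviceinfo_with_retry(getdevinfo_fn send, void *ctx)
{
    int status = send(ctx, SKETCH_PAGE_SIZE);

    if (status == NFS4ERR_TOOSMALL)
        status = send(ctx, 6 * SKETCH_PAGE_SIZE);
    return status;
}

/* Stand-in for a metadata server, for illustration only: the reply
 * needs 'needed' bytes, and 'calls' counts the requests received. */
struct fake_server {
    unsigned int needed;
    int calls;
};

int fake_send(void *ctx, unsigned int maxcount)
{
    struct fake_server *s = ctx;

    s->calls++;
    return maxcount >= s->needed ? NFS4_OK : NFS4ERR_TOOSMALL;
}
```

With a fake server whose reply needs three pages, the first call fails with NFS4ERR_TOOSMALL and the retry succeeds; a reply larger than six pages fails after exactly two calls.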
Upon a successful return, the device ID cache is searched again for the device ID. If the device ID is found (e.g. a race with another process for the same device ID), the GETDEVICEINFO result is discarded. Otherwise, the result is added to the device ID cache, and the data server cache is searched for each returned data server. If a data server is found, a reference count is incremented. If a data server is not found, EXCHANGE_ID and CREATE_SESSION are sent, and if they succeed, the data server is inserted into the data server cache.
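The "search again, then insert or discard" step can be sketched as below. This is a single-threaded user-space illustration with hypothetical names (`dev_cache`, `dev_entry`, `dev_cache_find`, `dev_cache_add`, `dev_entry_new`); in the design described here the second lookup and the insertion would both run under the cache lock, which the sketch omits.

```c
#include <stdlib.h>
#include <string.h>

#define DEVICEID_SIZE 16   /* a deviceid4 is 16 opaque bytes in RFC 5661 */

struct dev_entry {
    unsigned char id[DEVICEID_SIZE];
    struct dev_entry *next;
};

struct dev_cache {
    struct dev_entry *head;   /* simple list; the real cache differs */
};

/* Allocate an entry holding a copy of the device ID (NULL on failure). */
struct dev_entry *dev_entry_new(const unsigned char *id)
{
    struct dev_entry *e = calloc(1, sizeof(*e));

    if (e)
        memcpy(e->id, id, DEVICEID_SIZE);
    return e;
}

struct dev_entry *dev_cache_find(struct dev_cache *c, const unsigned char *id)
{
    struct dev_entry *e;

    for (e = c->head; e; e = e->next)
        if (memcmp(e->id, id, DEVICEID_SIZE) == 0)
            return e;
    return NULL;
}

/* Called with a freshly decoded GETDEVICEINFO result. If another
 * process already inserted the same device ID, discard the fresh copy
 * and return the cached one; otherwise insert and return the new one. */
struct dev_entry *dev_cache_add(struct dev_cache *c, struct dev_entry *fresh)
{
    struct dev_entry *old;

    if (!fresh)
        return NULL;
    old = dev_cache_find(c, fresh->id);
    if (old) {
        free(fresh);          /* lost the race: drop our result */
        return old;
    }
    fresh->next = c->head;
    c->head = fresh;
    return fresh;
}
```

Callers use whichever entry `dev_cache_add` returns, so the loser of the race ends up holding the same entry as the winner.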
Only valid layout segments (including resolved device IDs) are added to the layout cache. Only connected data servers (established session) are added to the data server cache.
The layout is returned to the (application context) process, which continues on to perform pNFS I/O. This includes identifying the correct data server(s) to perform I/O for a given range. The layout and associated device ID are consulted. This code could also call GETDEVICEINFO if the device ID was not found, a historical remnant of the code that predates layout validation.
A single rw spinlock protects both the file-layout-specific device ID cache and the data server cache, which are kept per mounted filesystem (in struct nfs_server).
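As an illustration of how a file offset can be mapped to a data server for the file layout type, the sketch below follows the stripe index calculation described in RFC 5661 (stripe unit number, offset by the first stripe index, modulo the stripe count, then one level of indirection through the stripe indices array). The struct and function names (`fl_layout`, `fl_device`, `fl_ds_index`) are hypothetical and are not the kernel's structures.

```c
#include <stdint.h>

/* Per-layout fields, modeled on the nfl_* fields of the file layout. */
struct fl_layout {
    uint64_t pattern_offset;      /* nfl_pattern_offset */
    uint32_t stripe_unit;         /* stripe unit size in bytes */
    uint32_t first_stripe_index;  /* nfl_first_stripe_index */
};

/* Per-device fields, modeled on the nflda_* fields returned by
 * GETDEVICEINFO. */
struct fl_device {
    uint32_t stripe_count;           /* entries in stripe_indices */
    const uint32_t *stripe_indices;  /* indexes into the data server list */
};

/* Return the index into the device's data server list that serves the
 * byte at 'offset'. Assumes offset >= pattern_offset and that
 * stripe_unit and stripe_count are non-zero. */
uint32_t fl_ds_index(const struct fl_layout *lo,
                     const struct fl_device *dev, uint64_t offset)
{
    uint64_t su = (offset - lo->pattern_offset) / lo->stripe_unit;
    uint32_t j = (uint32_t)((su + lo->first_stripe_index) %
                            dev->stripe_count);

    return dev->stripe_indices[j];
}
```

The lookup is pure arithmetic plus two array reads, which is why the I/O path needs no cache search once the device ID has been resolved.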
Summary of design changes
- Change scope of deviceID/data-server cache from per mounted file system to per clientid.
- Allows for sharing of device IDs and storage devices
- Add reference counting to device ID for each layout that references it.
- Reap device ID upon last reference.
- Change from rw spinlocks to RCU
- As per kernel Documentation which requests no new rwlocks.
- Share device ID cache with all layout types
- Move device ID lookup and update into generic client so that the RCU code is done once.
- Move data server cache to a stand alone cache
- Data server cache is only updated on a GETDEVICEINFO call or a umount. The I/O paths find the appropriate data server via array index lookups in the deviceid structure. Therefore, there is no need for RCU, rw spinlocks, or an hlist.
- Only call get_device_info from filelayout_check which performs a device ID cache lookup (read lock) at the end of each LAYOUTGET prior to inserting layout segment into layout cache.
- Assumes layoutget code only caches layouts with resolved device IDs.
- Device IDs are only reaped when nfs_client expires or all layouts referencing the device ID are returned.
- Only attach to data servers when first required for I/O, not upon the GETDEVICEINFO return.
- Only cache the first data server in the multipath_list4 array.
- Handle GETDEVICEINFO session level errors (and perhaps others) via nfs4_handle_exception
- Some GETDEVICEINFO errors result in failing LAYOUTGET via filelayout_check
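The proposed per-layout reference counting, with the device ID reaped on the last reference, might look roughly like the following user-space sketch. All names (`devid_cache`, `devid`, `devid_create`, `devid_get`, `devid_put`) are hypothetical, and the counter is a plain integer; a kernel version would presumably need an atomic count and would have to defer the free until concurrent RCU readers are done.

```c
#include <stdlib.h>

struct devid_cache {
    unsigned int nr_entries;   /* device IDs currently cached */
};

struct devid {
    struct devid_cache *cache;
    unsigned int ref;          /* one reference per layout using it */
};

/* Create a device ID entry on behalf of the first layout that
 * references it; the entry starts with a reference count of 1. */
struct devid *devid_create(struct devid_cache *c)
{
    struct devid *d = calloc(1, sizeof(*d));

    if (!d)
        return NULL;
    d->cache = c;
    d->ref = 1;
    c->nr_entries++;
    return d;
}

/* A further layout starts referencing the device ID. */
void devid_get(struct devid *d)
{
    d->ref++;
}

/* A layout is returned. Returns 1 if this was the last reference and
 * the device ID was reaped (removed from the cache and freed), else 0. */
int devid_put(struct devid *d)
{
    if (--d->ref > 0)
        return 0;
    d->cache->nr_entries--;
    free(d);
    return 1;
}
```

With two layouts sharing one device ID, the first `devid_put` leaves the entry cached and the second one reaps it.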