General troubleshooting recommendations
From Linux NFS
Depending on your configuration, there are a number of ways that NFS can fail to work, and it can be difficult to determine exactly why. This page describes some general techniques for diagnosing the issue.
If you cannot resolve your problem and plan to report it to the developers, see Reporting bugs.
General NFSv4 Issues
First, double-check that you've followed all the directions for setting up NFSv4 at the CITI NFSv4 site. Some of the steps differ from setting up NFSv3, and if you miss one, things won't work as you expect. These are the most common problems:
- Did you mount rpc_pipefs and nfsd?
- Have you patched util-linux so that you get a version of mount supporting NFSv4? (If you get errors like 'mount: wrong fs type, bad option, bad superblock', then you may need to upgrade util-linux.)
- Are you using the right mount syntax? (It's slightly different between NFSv3 and NFSv4; see below.)
Check server's NFSv4 capability
Make sure your server has NFSv4 available:
rpcinfo -p `hostname`
That should show the versions of NFS available. Also check that the client and server are running the appropriate NFSv4 processes:
ps aux | grep rpc
As a minimum, the server should show:
rpc.mountd rpc.idmapd rpc.nfsd
And the client should show:
rpc.idmapd
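Both checks above can be scripted. The following is only a minimal sketch: the function names `has_nfs_version` and `missing_daemons` are made up for illustration, and the first one assumes the usual `rpcinfo -p` column layout (program, version, protocol, port, service), in which NFS is registered as RPC program 100003.

```shell
# has_nfs_version VERSION
# Reads `rpcinfo -p` output on stdin and succeeds if NFS (RPC program
# 100003) is registered with the given version number.
has_nfs_version() {
    awk -v want="$1" '
        $1 == 100003 && $2 == want { found = 1 }
        END { exit !found }
    '
}

# missing_daemons NAME...
# Reads a process listing (for example from `ps aux`) on stdin and prints
# every NAME that does not appear anywhere in it.
missing_daemons() {
    listing=$(cat)
    for name in "$@"; do
        case "$listing" in
            *"$name"*) ;;              # present, nothing to report
            *) echo "$name" ;;         # absent, report it
        esac
    done
}

# Example use on a server (not executed here):
#   rpcinfo -p "$(hostname)" | has_nfs_version 4 || echo "NFSv4 not registered"
#   ps aux | missing_daemons rpc.mountd rpc.idmapd rpc.nfsd
```

On a client, the second function would be called with rpc.idmapd only. Any name it prints is a process that was expected but not found in the listing.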
Check server's exports
Double-check that your server is exporting what you think it is. On the server, run the command:
exportfs -v
If you need to make modifications, edit /etc/exports and re-export using the command:
exportfs -r
Remember that pseudo-filesystems in NFSv4 work very differently than in NFSv3. Review the Using NFSv4 directions if you have questions.
Check server mount functionality
Try mounting the NFSv4 export on the server itself by mounting localhost:/. This helps determine whether the problem lies in the server configuration or somewhere between the client and the server.
Check client mount functionality
If your server is exporting something like /home, in NFSv3 you'd mount it as
mount -t nfs -o nfsvers=3 host:/home /mnt/host_home
But NFSv4 uses a "pseudofilesystem". This means you would do this mount using the following syntax:
mount -t nfs4 host:/ /mnt
Note that you mount / even if your export is something like /home/foo/bar.
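To make the difference concrete, here is a small sketch that prints (but does not run) the mount command for each version, following the syntax described above. The helper name `nfs_mount_cmd` and its argument order are hypothetical.

```shell
# nfs_mount_cmd VERSION HOST EXPORT_PATH MOUNTPOINT
# Prints the mount command for NFSv3 or NFSv4. For NFSv4 the export path
# is deliberately ignored, because the client mounts the pseudo-filesystem
# root ("/") as described in the text.
nfs_mount_cmd() {
    version=$1
    host=$2
    export_path=$3
    mountpoint=$4
    case "$version" in
        3) echo "mount -t nfs -o nfsvers=3 $host:$export_path $mountpoint" ;;
        4) echo "mount -t nfs4 $host:/ $mountpoint" ;;
        *) echo "unsupported NFS version: $version" >&2
           return 1 ;;
    esac
}

# Example (not executed here): show the command for a local test mount
#   nfs_mount_cmd 4 localhost /home/foo/bar /mnt
```

Calling it with `localhost` as the host gives the command for the server-side test mount from the previous section.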
Getting detailed debug output of the client/server interactions
NFS and RPC Trace Debugging
You can capture more information about exactly what the client or server thinks is going on by enabling trace debugging. Trace debugging puts messages on the console and in /var/log/messages as the client or server goes through its paces so you can track progress and have some idea what request is being processed.
The debugging value is a bit mask that indicates which types of events you'd like to see traced. For information on the flag values, look in include/linux/nfs_fs.h, include/linux/lockd/debug.h, include/linux/sunrpc/debug.h, or include/linux/nfsd/debug.h.
To set the debugging value, you use a sysctl like so:
sudo sysctl -w sunrpc.nfs_debug=3
and to turn off debugging, just do this:
sudo sysctl -w sunrpc.nfs_debug=0
See also sunrpc.nfsd_debug, sunrpc.rpc_debug, and sunrpc.nlm_debug.
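Because the value is a bit mask, several flags are combined by OR-ing their values together; the 3 used above is simply 0x1 | 0x2. A minimal helper for computing such a value might look like this. The name `debug_mask` is invented here, and the actual flag values must be taken from the header files of your kernel version.

```shell
# debug_mask FLAG...
# Prints the bitwise OR of all given flag values (decimal or 0x-prefixed
# hex) as a decimal number suitable for a sysctl setting.
debug_mask() {
    mask=0
    for flag in "$@"; do
        mask=$(( $mask | $flag ))
    done
    echo "$mask"
}

# Example (not executed here):
#   sudo sysctl -w sunrpc.nfs_debug="$(debug_mask 0x1 0x2)"
```

Passing the same flag twice is harmless, since OR-ing a bit that is already set leaves the mask unchanged.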
Sometimes this kind of tracing can produce voluminous output. To ensure that your system log daemon can handle the traffic, make these adjustments:
- When you build your kernel, set the CONFIG_LOG_BUF_SHIFT option to a larger value than is recommended for your hardware. That will allow the kernel to buffer more log messages.
- Edit /etc/syslog.conf and place a "-" in front of "/var/log/messages" -- so you get "-/var/log/messages". That will switch syslogd into async mode to allow it to keep up.
You may also consider enabling serial console support. This delays every printk() call by the time it takes to write the message to the serial port. Kernel logging can then easily keep up with trace message logging, but the change in timing is significant and may make your problem unreproducible!
Capturing a Network Trace
If you suspect the problem may involve some sort of miscommunication between the client and server, it can be useful for debugging purposes to dump the communication stream:
Start `tcpdump -s 9000 -w /tmp/dump.out port 2049` on the client, then conduct the client/server interaction. Review the /tmp/dump.out file (or include it with your bug report).
Useful tips:
- If you build your own kernels, enable CONFIG_PACKET_MMAP (under Device Drivers --> Networking Support --> Network Options) to help tcpdump keep up with traffic.
- Use a tmpfs file system for the tcpdump output file. tcpdump will keep up more easily, especially with gigabit speed transfer rates.
- Capture a trace on both ends if you suspect a network problem. Comparing the traces will show what each side of the communication is seeing.
- Leave off the "port 2049" to capture DNS, NIS, LDAP, or Kerberos traffic, if you suspect one of these auxiliary protocols is causing misbehavior.
- Don't forget about the command-line tools tcpslice and tethereal (since renamed tshark) if you have a really big trace and need to split it into manageable chunks.
Kernel Stack Traceback
If you have hung processes, capture a stack traceback to show where the processes are waiting in the kernel. You will need to build your kernel with the CONFIG_MAGIC_SYSRQ option (under Kernel Hacking) to enable stack traceback.
First, look in /etc/sysctl.conf to see if kernel.sysrq is set to 1. If not, then run this command:
echo 1 > /proc/sys/kernel/sysrq
Next, trigger a stack traceback via this command:
echo t > /proc/sysrq-trigger
Look on your console or in /var/log/messages for the output.
Another option, which doesn't require rebuilding your kernel, is to grab the contents of /proc/<pid>/wchan for every process on your system. This doesn't give a full traceback, but it shows where each process is waiting, which is sometimes useful. A simple bash script to do this might look like this:
for i in /proc/*/wchan
do
    echo "Process" "$i"
    cat "$i"
    echo " "
done
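A variant of the loop above that prints one line per process is easier to sort or filter. This is only a sketch: the function name `wchan_report` and its optional directory argument (which defaults to /proc and exists mainly so the function can be tried on a test directory) are not part of any standard tool.

```shell
# wchan_report [PROC_ROOT]
# Prints "PID WAIT_CHANNEL" for every numeric process directory under
# PROC_ROOT (default: /proc).
wchan_report() {
    root=${1:-/proc}
    for f in "$root"/[0-9]*/wchan; do
        [ -r "$f" ] || continue        # process exited or file unreadable
        pid=${f%/wchan}                # strip the trailing /wchan
        pid=${pid##*/}                 # keep only the PID component
        chan=$(cat "$f" 2>/dev/null)
        echo "$pid ${chan:-unknown}"   # "unknown" if the file was empty
    done
}

# Example (not executed here): count processes per wait channel
#   wchan_report | awk '{ print $2 }' | sort | uniq -c
```

Hung NFS processes tend to pile up in the same wait channel, so the per-channel count is often the quickest thing to look at.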
Making Sense of a Kernel Oops report
Tip: to get a clean oops report, make sure you've enabled the CONFIG_FRAME_POINTER option under Kernel Hacking when you build your kernel. Then, when you install, copy the System.map file from your build to your boot directory and name it "System.map-`uname -r`" so that the kernel can find it to resolve symbols properly.
"Reboot" the NFSv4 server without shutting down the machine
Just shut down rpc.nfsd and start it again.
Comparing results when mounting via NFSv3 and NFSv4
Find a file that differs between the v3 and v4 mounts, and compare the output of the `stat` utility for each.
Or use `ls -lid --time-style=full-iso` and `ls -lid --time=ctime --time-style=full-iso` if you don't have stat.
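The comparison can be automated along these lines. This is a rough sketch that assumes GNU coreutils `stat` (for the `-c` format option); the function names `stat_fields` and `compare_stat` and the mount points in the example are hypothetical.

```shell
# stat_fields PATH
# Prints the attributes worth comparing on a single line.
stat_fields() {
    stat -c 'inode=%i mode=%A uid=%u gid=%g size=%s mtime=%y ctime=%z' "$1"
}

# compare_stat PATH_A PATH_B
# Prints "identical" if both paths report the same attributes; otherwise
# prints both attribute lines and returns 1. Returns 2 if stat fails.
compare_stat() {
    a=$(stat_fields "$1") || return 2
    b=$(stat_fields "$2") || return 2
    if [ "$a" = "$b" ]; then
        echo "identical"
    else
        printf 'first:  %s\nsecond: %s\n' "$a" "$b"
        return 1
    fi
}

# Example (not executed here), with the same export mounted twice:
#   compare_stat /mnt/v3/some/file /mnt/v4/some/file
```

Printing both lines in full, rather than only a verdict, makes it easy to paste the result into a bug report.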
Kerberos issues
Check hostnames
Kerberos requires that the hostname/domainname used in the keytab be correct. Run `hostname` and look in /etc/hosts to double-check that it is set properly. Compare with what you've listed in your keytab file.
Check keytabs
Run the following command to check your keytab:
klist -k
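To compare the keytab against the hostname without reading the listing by eye, something like the following may help. It is a sketch under two assumptions: that `klist -k` prints one principal per line in the second column (as MIT Kerberos does), and that principals have the form service/host@REALM. The function name `keytab_has_host` is invented for this example.

```shell
# keytab_has_host HOSTNAME
# Reads `klist -k` output on stdin and succeeds if any principal contains
# "/HOSTNAME@", i.e. some service principal exists for that exact host.
keytab_has_host() {
    awk -v host="$1" '
        index($2, "/" host "@") > 0 { found = 1 }
        END { exit !found }
    '
}

# Example (not executed here):
#   klist -k | keytab_has_host "$(hostname -f)" || echo "no principal for this host"
```

A failure here usually means the keytab was generated for a different name (for example a short name instead of the fully qualified one) than the host currently reports.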
Check krb5 ccache file
If you see log messages regarding something like 'FILE:/tmp/krb5cc_machine_FOO.BAR.AD.ROOT', you can review the file after trying to do the mount via:
klist -e -f -c /tmp/krb5cc_machine_FOO.BAR.AD.ROOT
This will list info about your principals such as the valid/expire dates, encryption types, etc.