GFS2 cluster3 userland notes
August 6, 2008: the current cluster3 code from the cluster.git repo only builds against revision 1579 of openAIS. So, to get that:
- $ svn checkout --revision 1579 http://svn.fedorahosted.org/svn/openais
Um, this might not even be true tomorrow :) The reason is that openAIS is being split into two pieces, with "corosync" becoming the main core system and the new "openAIS" component handling the SA Forum APIs.
IBM and CITI are working to integrate GFS2 with pNFS with the purpose of demonstrating that an in-kernel cluster filesystem can be successfully exported over pNFS and take advantage of pNFS's capabilities.
Part of the work involves extending existing GFS2 userland tools and daemons to handle pNFS requests for state information and the like. That task requires developing an out-of-band GFS2-specific control channel so that pNFS servers exporting GFS2 can issue and process these requests during the course of normal NFS processing.
The extant version of the GFS2 userland when the GFS2/pNFS work began is referred to as "cluster2"; however, as work was getting under way, David Teigland at Red Hat (lead developer of the cluster suite) suggested that new development be integrated with the next version of the cluster suite.
There are 3 versions of the GFS cluster suite that Red Hat ships, referred to simply as cluster1, cluster2, and cluster3.
- cluster1 (RHEL4-ish, IIRC) was mostly (all?) implemented in-kernel; it was tricky, and it was redesigned for a variety of reasons.
- cluster2 (RHEL5, Fedora 9) moves several of the daemons into userland and makes use of OpenAIS, a big powerful framework beyond the scope of these notes. One of the main daemons became an OpenAIS plugin; Red Hat is making a deliberate effort to use things from and give things back to the open source community, rather than sticking to building everything in-house.
- cluster3 (Fedora 10, ..) continues the progression, integrating things more closely with OpenAIS and removing a bunch of code that cluster2 used to bridge between existing daemons and OpenAIS. Although cluster3 is still under active development, it is going to be in the wild around early October when Fedora 10 is released; that makes cluster3 the place to focus. However, things like build and configuration setups are still sketchy -- and their development repo is updated many times a day -- so a little persistence is required.
First off, you can save yourself a lot of hassle by not starting out with an existing cluster2 install; I bet this whole thing would've been pretty easy otherwise. I made that mistake and consequently spent a lot of time picking things apart. If these things are lurking around on your system, you'll probably want to remove them first:
- /sbin/gfs_controld, /sbin/gfs_tool, /etc/init.d/cman, /etc/cluster/cluster.conf, /etc/init.d/clvmd
- for ease of removal, you can find the original RPM package names like this: $ sudo rpm -q --whatprovides /etc/cluster/cluster.conf
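To check all of the leftovers in one go, something like the following might help. It is only a sketch: the `find_owners` helper is a made-up name, the file list is just the one from these notes, and it uses `rpm -qf` (query the package owning a file) rather than `--whatprovides`. It only reports; removing the packages is still up to you.

```shell
#!/bin/sh
# Hypothetical helper: print each leftover cluster2 file that exists,
# together with the RPM package that owns it (if any).
# The list below is the one from these notes; adjust for your system.
stale_files="/sbin/gfs_controld /sbin/gfs_tool /etc/init.d/cman /etc/cluster/cluster.conf /etc/init.d/clvmd"

find_owners() {
    for f in "$@"; do
        [ -e "$f" ] || continue          # nothing to report for absent files
        if command -v rpm >/dev/null 2>&1; then
            # rpm -qf exits non-zero when no package owns the file
            owner=$(rpm -qf "$f" 2>/dev/null) || owner="(not owned by any package)"
        else
            owner="(rpm not available)"
        fi
        printf '%s\t%s\n' "$f" "$owner"
    done
}

# Word splitting on $stale_files is intentional here.
find_owners $stale_files
```

Files that were installed by hand rather than from an RPM show up as not owned, which is a hint that they need to be cleaned out manually.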
Get the newest versions of things:
- you'll need libvolume_id-devel, but that's okay to get from an RPM.
- latest device-mapper source
- use svn to clone the openAIS repo:
- $ svn checkout http://svn.osdl.org/openais
- $ cd openais && svn export . ../openais-checkout/
- the build stuff is in the "trunk" subdirectory.
- use git to clone the cluster3 repo:
- $ git clone http://sources.redhat.com/git/cluster.git
- their branch "master" is their ongoing cluster3 development.
- latest LVM2 source
- build the device-mapper first, shouldn't be a problem.
- next, openAIS; I keep having to futz with the DESTDIR string in their Makefile.inc -- it's not playing correctly with the --prefix option.
- before you can build cluster3, you need to already be running a 2.6.26-based kernel and have its build sources available. So snag/build/boot the 2.6.26-based pnfs-gfs2 kernel.
- cluster3 took me several tries -- but it seems like nearly everything that went wrong was related to the existing cluster2 install.
- $ ./configure --openaislibdir=/usr/lib/openais --openaisincdir=/usr/include --dlmincdir=/lib/modules/2.6.26-pnfs/source/include
- last, build LVM2. Make sure to specify the clvmd type; I also always disable LVM1 compatibility:
- $ ./configure --with-lvm1=none --with-clvmd=cman --prefix=/usr
Huh.. I had more notes on the afternoon it took me to sort out and finally get cluster3 working, but I'm not seeing them. In the end, I had to run around with ldd and verify that everything really was linking the right ways. Maybe this will all be really easy for everyone else and I just got unlucky <shrug>.
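For the ldd pass, a small wrapper along these lines would have saved me some typing. Treat it as a sketch under assumptions: `check_links` is a made-up helper, the binary paths are guesses at where things end up (yours depend on the prefixes you configured with), and the library name pattern is just what I would expect to matter.

```shell
#!/bin/sh
# Hypothetical helper: show which shared libraries a binary resolves to
# (filtered by an optional pattern) and warn if any cannot be resolved.
check_links() {
    bin=$1
    pattern=${2:-.}                      # default pattern matches every line
    [ -x "$bin" ] || { echo "skip: $bin (not found)"; return 0; }
    # Print the resolved paths so a stale cluster2-era library stands out.
    ldd "$bin" 2>/dev/null | grep -E "$pattern"
    if ldd "$bin" 2>/dev/null | grep -q 'not found'; then
        echo "WARNING: $bin has unresolved libraries"
        return 1
    fi
    return 0
}

# Assumed install locations; substitute the paths from your own build.
for b in /sbin/gfs_controld /usr/sbin/clvmd; do
    check_links "$b" 'openais|cman|dlm' || true
done
```

The thing to look for in the output is a library path that points into the old install rather than the tree you just built.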
- you'll need a /etc/cluster/cluster.conf. this is a sample for a 3-node cluster.
- you'll also need a /etc/lvm/lvm.conf, but there's not really any tweaking you'll need to do other than make sure that the locking_type is set to the clustered (DLM-backed) type (3, IIRC).
- also, the /etc/init.d/ scripts for starting/stopping the services. I've hacked on them some so they work in a cluster3 setup, but are by no means perfect. cman.init and clvmd.init
- Note: cluster3 has two different modes of operation -- one which is back-compatible with a cluster2 environment and one which only works with other cluster3 members. We want the new, cleaner code paths, and so we run in the cluster3-only mode. You can set this up two different ways (note that my init scripts and sample cluster.conf above do both, meh):
- in /etc/cluster/cluster.conf, add the entry <group groupd_compat="0"/>
- when starting the daemons, start them all with -g0
- once you've brought up the cluster, you can then go create a gfs2 filesystem and be on your way.
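Before starting the daemons, the two config settings mentioned in the list above can be sanity-checked with a quick grep. This is a rough sketch: `check_setting` is a made-up helper, and the patterns assume the settings are written the way these notes spell them (`locking_type = 3` in lvm.conf, `groupd_compat="0"` in cluster.conf).

```shell
#!/bin/sh
# Hypothetical pre-flight check: report whether a file contains a line
# matching an extended regular expression.
check_setting() {
    file=$1
    pattern=$2
    label=$3
    if [ -r "$file" ] && grep -Eq "$pattern" "$file"; then
        echo "ok: $label"
    else
        echo "MISSING: $label ($file)"
        return 1
    fi
}

# locking_type = 3, ignoring indentation and not matching e.g. 30.
lvm_pattern='^[[:space:]]*locking_type[[:space:]]*=[[:space:]]*3([^0-9]|$)'
# The cluster3-only switch from these notes.
compat_pattern='groupd_compat="0"'

check_setting /etc/lvm/lvm.conf "$lvm_pattern" "clustered LVM locking" || true
check_setting /etc/cluster/cluster.conf "$compat_pattern" "cluster3-only mode" || true
```

A MISSING line for the cluster.conf check is not necessarily a problem if the daemons are all started with -g0 instead.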
.... more will be added here as work progresses. In particular, there'll be a writeup all about the addition of the pNFS control channel to cluster3.
I wish I'd kept more careful notes about the things that went wrong. I'll spool future things into here.
LVM can't see my volume group any longer??
During the upgrade from cluster2 to cluster3, one of the machines somehow lost sight of the ATA-over-ethernet device that I'm using for the cluster's shared storage. The problem wasn't with the aoe module, though; lvscan simply never saw the device, even though the other two nodes could see it.
Turns out that LVM actually got confused somehow -- I'd been under the impression that, sure, while it does maintain a cache of devices (/etc/lvm/cache/.cache), it'd nevertheless grok new ones one way or another. And it always had, until now -- it wasn't until I edited that cache file by hand and added the AoE device's /dev/ entry that lvscan was able to see it. Good thing to keep in mind for future debugging: apparently it is possible for LVM's device cache to go stale, and I didn't see anything in any manpages about how to poke it with lvm or something.
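A quick way to tell whether the cache is the culprit is to grep it for the device. The sketch below assumes the cache file lists devices as quoted paths, which is what mine looked like; `cache_has_device` is a made-up helper and `/dev/etherd/e0.0` is only an example AoE device name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the LVM device cache lists the given
# device as a quoted path, fail otherwise (or if the cache is unreadable).
cache_has_device() {
    cache=$1
    dev=$2
    [ -r "$cache" ] && grep -Fq "\"$dev\"" "$cache"
}

cache=/etc/lvm/cache/.cache
dev=/dev/etherd/e0.0        # example name only; substitute your device

if cache_has_device "$cache" "$dev"; then
    echo "$dev is in the LVM cache"
else
    echo "$dev is NOT in the LVM cache (stale or unreadable cache?)"
fi
```

If the device turns out to be missing, the vgscan manpage describes that command as rebuilding caches, so running it is probably worth a try before editing the file by hand; I have not gone back to confirm that it would have fixed this particular case.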