submitted
December 6th, 2010
Last time, I wrote more about FreeBSD packaging; I’ve been gradually working on the OpenAFS port, and it is now far enough along that I have submitted it for inclusion in the Ports Collection (pending testing, review, etc.). Follow the PR for updates on issues with the packaging that get reported and fixed.
Anatomy of a FreeBSD port (part 5)
November 21st, 2010
It’s been quite some time since I last posted about FreeBSD packaging; today I’m coming back to it to talk a bit about other things that can go in the files/ directory. I just recently got packaging for OpenAFS in good enough shape to submit to the FreeBSD Ports Collection (the PR is here); there’s a little bit of cleverness in the Makefile that I’ll skip for now, in favor of the rc scripts.
Long, long ago, in the early days of Unix (read: before my time), there was a single shell script /etc/rc (the Jargon file claims it to stand for “runcom”) that would be run during system startup, executing commands to set up the local environment, start daemons, etc. Eventually it grew so huge that it was split up into multiple files, and later a large infrastructure was created so that each service would have a script in /etc/rc.d/ and the administrator had mechanisms for controlling which scripts would be run when. Many of these controls are placed in /etc/rc.conf, and the rc scripts for software from the Ports Collection go in /usr/local/etc/rc.d to keep them separate from the base system.
Instead of just being a shell script that is sourced at startup, modern usage involves invocations such as:
/usr/local/etc/rc.d/afsd onestart
/usr/local/etc/rc.d/afsd forcestop
/usr/local/etc/rc.d/afsd start
with multiple variations on the “start” and “stop” commands. For a service to be started by a plain start, the appropriate rc.conf variable must be set to enable it; onestart is a way to start it manually regardless. For this to work, each rc script has to define several shell variables and functions that hook into the (massive) rc.subr (that’s “subroutine”) infrastructure. Here’s what I ended up with in files/afsd.in:
#!/bin/sh
#
# we require afsserver for the (rare, untested) case when a client
# and server are running on the same machine -- the client must not
# start until the server is running.
#
# PROVIDE: afsd
# REQUIRE: afsserver named
These keywords are used to order all the rc scripts on system startup (and shutdown) — dependencies are declared explicitly.
. /etc/rc.subr
name="afsd"
rcvar="afsd_enable"
start_cmd="afsd_start"
start_precmd="afsd_prestart"
stop_cmd="afsd_stop"
command="%%PREFIX%%/sbin/${name}"
kmod="libafs"
vicedir="%%PREFIX%%/etc/openafs"
This file has the .in suffix because variable substitution is applied to it before it is installed. Here, %%PREFIX%% gets expanded to the prefix that the port is being built with; this is usually /usr/local but can be other things.
load_rc_config "$name"
eval "${rcvar}=\${${rcvar}:-'NO'}"
This is us checking if we’re listed in rc.conf; default to disabled if not mentioned.
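To see what that eval line does in isolation, here is a small stand-alone sketch. The helper name default_rcvar is made up for this illustration (it is not something rc.subr provides), and the script is meant to be run by hand rather than installed anywhere:

```sh
#!/bin/sh
# Stand-alone illustration of the defaulting idiom used in afsd.in.
# default_rcvar is a hypothetical helper, not part of rc.subr.

default_rcvar()
{
    # $1 holds the *name* of the variable. The escaped \$ survives the
    # first expansion, so eval sees e.g.: afsd_enable=${afsd_enable:-'NO'}
    eval "$1=\${$1:-'NO'}"
}

rcvar="afsd_enable"

# Case 1: rc.conf never mentioned the variable.
unset afsd_enable
default_rcvar "$rcvar"
echo "not mentioned: ${afsd_enable}"

# Case 2: rc.conf already set it; the default must not clobber it.
afsd_enable="YES"
default_rcvar "$rcvar"
echo "set to YES:    ${afsd_enable}"
```

The indirection is needed because the variable name itself lives in another variable (rcvar); a plain ${afsd_enable:-NO} would work just as well if the name were hard-coded.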
afsd_prestart()
{
    # not going very far without a kernel module
    if ! kldstat -qm afs; then
        echo "Loading AFS kernel module..."
        if ! kldload $kmod; then
            echo "Failed to enable kernel support. Aborting."
            return 1
        fi
    fi
    # now we have a kernel module; check for conffiles
    for file in cacheinfo ThisCell CellServDB; do
        if [ ! -f ${vicedir}/${file} ]; then
            echo "${vicedir}/${file} does not exist. Not starting AFS client."
            return 1
        fi
    done
    # need a mountpoint and a cache dir (well, if we have a disk cache)
    for dir in $(awk -F: '{print $1, $2}' ${vicedir}/cacheinfo); do
        if [ ! -d ${dir} ]; then
            echo "${dir} does not exist. Not starting AFS client."
            return 2
        fi
    done
}
This is one of the functions that hooks into rc.subr. AFS requires several configuration files and a kernel module to be in place before it can start, so we check that they’re all there and give a useful error if not. This is quite helpful for users who are not familiar with how to start afsd manually.
afsd_start()
{
    # you probably don't want to change these
    afsd_default_args="-memcache -dynroot -fakestat-all -afsdb"
    # either set explicit extra args or just a size; default medium
    afsd_args=${afsd_args:-'MEDIUM'}
    case ${afsd_args} in
    LARGE)
        afsd_args="-stat 2800 -dcache 2400 -daemons 5 -volumes 128"
        ;;
    MEDIUM)
        afsd_args="-stat 2000 -dcache 800 -daemons 3 -volumes 70"
        ;;
    SMALL)
        afsd_args="-stat 300 -dcache 100 -daemons 2 -volumes 50"
        ;;
    esac
    # pass the size-derived (or explicit) arguments along with the defaults
    ${command} ${afsd_default_args} ${afsd_args}
}
The actual start function. We check to see if we’ve been given extra arguments, using sane defaults if not. There are also a few things that you basically always want (for example, non-memcache is currently broken), which are listed in a local variable.
afsd_stop()
{
    afsdir=$(awk -F: '{print $1}' ${vicedir}/cacheinfo)
    umount ${afsdir}
    _return=$?
    [ "${_return}" -ne 0 ] && [ -n "${rc_force}" ] && umount -f ${afsdir}
    kldunload ${kmod}
}
Stopping does not actually involve touching afsd at all — those processes will happily ignore whatever you throw at them. We must check that AFS is mounted (as someone might be erroneously running onestop), and then just run the umount command to stop things. We also check for whether force is being used, passing that on to umount if needed.
run_rc_command "$1"
This last line is very important! It looks very mundane, but it is how we actually interface with the rc system; rc.subr defines this function, which does all the necessary variable-name munging and calls the appropriate function(s) that we have defined.
At install time for the port, the substituted variables are replaced, and the script is installed into ${PREFIX}/etc/rc.d and added to the package list to be removed at deinstall time. All in all, we wrap a standard interface around the complicated afsd semantics.
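From the administrator’s side, turning all of this on comes down to a couple of lines in /etc/rc.conf. A sketch, using only the variable names that the script above defines (the chosen values are just an example):

```sh
# /etc/rc.conf (fragment)
afsd_enable="YES"    # the rcvar; without it, only onestart will run afsd
afsd_args="LARGE"    # SMALL, MEDIUM (the default), LARGE, or explicit afsd options
```

With those in place, /usr/local/etc/rc.d/afsd start (assuming the default prefix) should go through the prestart checks and then launch afsd with the arguments that the script intends for the chosen size.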
and bad locking
November 15th, 2010
Last time, I discussed some locking issues in OpenAFS and mentioned that fixing them uncovered a race condition elsewhere.
OpenAFS has a somewhat complicated locking strategy, but there are parts of it that rely on the afs_global_mtx, or the GLOCK for short. The GLOCK should not be held across sleeps, as this could cause the client to hang. But it is needed for synchronization by some code paths that must sleep. So, the sleep routines are built on mtx_sleep, which drops the mutex before the actual sleep and reacquires it afterwards. However, other threads may have acquired the GLOCK during the intervening time, so any checks which were made before the sleep must be made again (or the programmer must otherwise ensure that their values could not have changed). This was problematic in the afs_root function, where the GLOCK is used to serialize access to a global variable, afs_globalVp, which points to the vcache entry for the root AFS vnode. The relevant code is:
    if (afs_globalVp && (afs_globalVp->f.states & CStatd)) {
        tvp = afs_globalVp;
        error = 0;
    } else {
tryagain:
        if (afs_globalVp) {
            afs_PutVCache(afs_globalVp);
            /* vrele() needed here or not? */
            afs_globalVp = NULL;
        }
The afs_PutVCache function sleeps, dropping the GLOCK in the process. So it is possible for some other thread to have come into this block of code at the same time, and also try to call afs_PutVCache on afs_globalVp. When this happens, the core vputx routine used to implement afs_PutVCache sees that the reference count on the global vnode entry is negative, which violates an invariant of the VFS layer. (This means a kernel panic under my debugging options.)
The fix, then, is to make sure that all changes to afs_globalVp involve conditions and actions checked while the GLOCK is held and no sleeps have been made. If the thread sleeps, conditions must be re-checked.
For this block of code, the change is easy: save the value of afs_globalVp in an intermediate variable, set afs_globalVp to NULL, and only then call afs_PutVCache on the saved value, so that other threads will see that the removal has been queued, even if it has not actually taken place yet. However, this is not the only bug of this form in the afs_root function, which is a reminder that great care must be taken with locking strategies, and that sleeps can come in surprising places.
more locking
November 7th, 2010
Last time, I started off with some locking fixups for the OpenAFS client on FreeBSD, but left off in the middle. We had fixed the cosmetic errors coming from afs_FlushVCache, but not the underlying problem. Recall that the cosmetic issues arose due to incorrect locking around accesses to the v_usecount field of the vnode; it turns out that there are only a small number of places where this happens. Most of them are uninteresting, just a quick check and not much else. But in osi_VM_FlushVCache (which is FreeBSD-specific code), we check the use count once, and then again later on in the function, and then we call vgone().
vgone() is a heavyweight function call, marking a vnode as being free for reuse. As such, it requires some pretty heavy locking around calls to it; in particular, it requires an exclusive lock on the vnode. (vgone also acquires the vnode interlock internally.) However, this is not quite sufficient, as once vgone places the vnode on the free list, it is susceptible to being destroyed (the FreeBSD VFS layer runs a cleaner periodically). But we still have the vnode locked! We need to unlock it after vgone returns, and need some mechanism to ensure that it doesn’t go away in the meantime. This is done by placing a “hold” on the vnode, or increasing its “hold count”. This is a closely related idea to the use count (and in fact the usual way to increment the use count also increments the hold count), but needs to be tracked separately for implementation details like this. So, then, we put a hold on the vnode before sending it away (keeping the interlock held for efficiency), then do the unlock, and then drop the hold.
This procedure is sufficiently internal to the VFS layer so as to not be documented in a man page; I learned it because a FreeBSD VFS expert directed me to the vlrureclaim function in sys/kern/vfs_subr.c. That function is not quite the same, as it is iterating over a list of vnodes, but it does go and free up vnodes that are not currently being used, so the checks and locks it takes are a good example for my use case.
The extra bonus of going and implementing a proper set of instructions around vgone is that it allowed the removal of some duplicated work! In addition to osi_VM_FlushVCache, the osi_TryEvictVCache function was doing some checks and then calling vgone. Well, more properly, it was calling vgonel, which is not supposed to be an exported symbol but just happened to work due to an implementation detail of the kernel linker! It turns out that the checks it was doing are exactly the same ones done in osi_VM_FlushVCache, so osi_TryEvictVCache can be implemented in terms of osi_VM_FlushVCache, removing a goodly chunk of code. (It actually wants to be implemented in terms of afs_FlushVCache, which does some additional bookkeeping on the number of vcaches in use, but I didn’t realize this until after the code was committed; I ran into it while tracking down another issue.) This change allowed my (multi-threaded) testing to go far enough to expose a rare race condition elsewhere in the codebase, which we’ll cover next time.
locking
November 1st, 2010
Things have been kind of locked up, here, for the past few weeks. But I have finally gotten around to getting some interesting work done on the OpenAFS front. During the long run of a “buildworld” in AFS, I would eventually get a warning on the console that the afs_vop_reclaim() function (i.e. the actual function that gets used when the VOP_RECLAIM() operation is performed on a vnode of type “afs”) had hit an error condition where the routine that should have removed all AFS content from that vnode failed:
afs_vop_reclaim: afs_FlushVCache failed code 16
Code 16 is EBUSY, which is actually returned from several places in that function. Placing print statements before all of them, and triggering the bug again, revealed that the reference count of the vnode (as determined by AFS’s VREFCOUNT() macro) was too large, implying that someone else was probably trying to use that vnode. For extra fun, later on the buildworld step would fail, usually claiming that it could not find a particular header file. Using the fs getfid command (from a different computer), it was clear that the file existed, and that the fid used to identify that file was the same as the one that we had failed to flush properly. Clearly, this bug was leaving a corrupt vnode floating around, and this corruption was later crashing the build.
Now, what does VREFCOUNT actually do? The relevant block of code is
#if defined(AFS_XBSD_ENV) || defined(AFS_DARWIN_ENV)
#define vrefCount v->v_usecount
[...]
#elif defined(AFS_XBSD_ENV) || defined(AFS_DARWIN_ENV)
#define VREFCOUNT(v) ((v)->vrefCount)
#define VREFCOUNT_GT(v, y) (AFSTOV(v)->v_usecount > (y))
(which is rather ugly); both VREFCOUNT and VREFCOUNT_GT use the v_usecount field of the vnode associated with the given vcache. Now, FreeBSD has a locking strategy for vnode elements (and for the vnodes themselves, but that’s getting ahead of ourselves), which is laid out in the sys/vnode.h system header.
/*
 * Reading or writing any of these items requires holding the appropriate lock.
 *
 * Lock reference:
 *      c - namecache mutex
 *      f - freelist mutex
 *      G - Giant
 *      i - interlock
 *      m - mntvnodes mutex
 *      p - pollinfo lock
 *      s - spechash mutex
 *      S - syncer mutex
 *      u - Only a reference to the vnode is needed to read.
 *      v - vnode lock
[...]
        /*
         * Locking
         */
        struct lock v_lock;        /* u (if fs don't have one) */
        struct mtx v_interlock;    /* lock for "i" things */
        struct lock *v_vnlock;     /* u pointer to vnode lock */
        int v_holdcnt;             /* i prevents recycling. */
        int v_usecount;            /* i ref count of users */
        u_long v_iflag;            /* i vnode flags (see below) */
        u_long v_vflag;            /* v vnode flags */
        int v_writecount;          /* v ref count of writers */
As we can see, the locking strategy requires that accesses to v_usecount hold the vnode interlock; OpenAFS was failing to do so. Conveniently, there is a wrapper function vrefcnt() that takes the interlock, reads the use count into a local variable, and then returns the local variable. Changing the VREFCOUNT macros to use this function did eliminate the console warnings about afs_FlushVCache returning EBUSY … but it did not fix the buildworld. Still, we get compilation errors stemming from files that are mysteriously “missing”. The fix involves more locking, but the story is a bit more involved than I have space left, here; it’s on tap for next time.
Marmalade
October 4th, 2010
A month or so back, I went to a friend’s “house-cooling”, that is, the party right before they moved across the country and wanted to give away all the stuff they weren’t moving with them. At this point, there basically wasn’t any furniture in the house, so we were standing and/or sitting on the floor. Various people had brought snacks and mixers to go with the alcohol collection (which was also up for grabs! Sadly, his dad got dibs on the Scotch whisky), which were quite delicious. Among the things I came home with were a different class of foodstuffs, though: marmalades. He had quite a collection of them, sometimes picking up very interesting things while traveling. Among the delicacies I acquired were both lemon and lime marmalade, and a clementine marmalade that, instead of having strips of zest, had whole slices of the fruit! Of course, since I got the collection used, as it were, many of the jars were almost gone, so I only got a little bit of use from them. They were nonetheless good enough that I have switched from putting maple syrup on my daily waffle breakfast to using jams, jellies, and marmalades. I have a relatively standard orange marmalade that I picked up from Trader Joe’s, but kind of wonder whether I could find something more interesting in the local area.
Such a clang
September 27th, 2010
Recent versions of FreeBSD are shipping with a clang binary, the C compiler that uses the LLVM compiler backend. Very recent versions of FreeBSD can even be compiled with it and run normally. Clang is an exciting development, since it has lots of nice static analysis and very clear warnings (well, at least as compared to gcc) and is pretty easily extensible.
Of course, having acquired a clang binary that can compile the FreeBSD kernel, my first instinct was to throw the OpenAFS source at it, to see what sorts of new and exciting warnings it gives about the codebase. After a bit of a detour figuring out how to suppress the color (!) in the output (which doesn’t work very well when logging it to a file), I did get a libafs.ko module built with clang, and went to try it out. Very quickly, I got an “unexpected FPU use in kernel mode” page fault, and perhaps not too surprisingly, attempting to get a crash dump caused the system to hang.
Unfortunately, my test system is remote, so my clang-y crash has taken the wind out of my pipes.
archive or archive
September 19th, 2010
Another AFS-inspired post, though here it is just barely involved. Previously, I mentioned that I could recompile the entire operating system with a recent version of OpenAFS on FreeBSD. However, my first attempt to do so failed, with a rather curious error:
building static egacy library
ar: fatal: Numeric user ID too large
*** Error code 70
The ar(1) utility is used to generate static libraries, and many such libraries are generated during the buildworld process. However, the ar file format stores the uid and gid of the object files that comprise the archive, and there is a fixed-width field for storing the uid (six columns). This usually works just fine, since Unix user IDs are capped at 2^16 or so, which is only five columns. However, in AFS, uids must be globally unique, and can be quite large. In particular, the protection database entry for daemon.freebuild (the Kerberos principal I was using for testing) is 33554737, which decidedly does not fit into six columns!
I got around to doing some research, and none of the other systems I tested made a failure to represent the uid a fatal error: Linux simply truncated it to 335547, Solaris capped it at 600001, and OS X took the remainder modulo some power of 2 (between 8 and 25), leaving it as 217. FreeBSD’s libarchive infrastructure makes this a fairly easy patch to write, and I’m currently testing a patch that makes this condition non-fatal for submission to upstream.
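To make the column arithmetic concrete, here is a small sketch that mimics the six-column field in plain sh. The function names are mine, and this is not how ar or libarchive implement anything; it only counts decimal digits and reproduces the keep-the-leading-digits truncation I saw on Linux:

```sh
#!/bin/sh
# Sketch only: mimics the six-column decimal uid field described above.
AR_ID_WIDTH=6

# Succeeds if the decimal id fits in the field.
fits_ar_field()
{
    # ${#1} is the number of characters in the decimal representation.
    [ "${#1}" -le "$AR_ID_WIDTH" ]
}

# Keep only the leading digits that fit in the field.
truncate_to_field()
{
    printf "%.${AR_ID_WIDTH}s\n" "$1"
}

for id in 65535 33554737; do
    if fits_ar_field "$id"; then
        echo "$id fits"
    else
        echo "$id too wide; truncated form: $(truncate_to_field "$id")"
    fi
done
```

A traditional 16-bit uid passes the check, while the eight-digit AFS id does not, which is exactly the case that the FreeBSD ar treats as fatal.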
lockers
September 13th, 2010
MIT makes heavy use of the AFS network filesystem in its Athena computing environment. (This leads to my interest in OpenAFS support for FreeBSD, of course.) One nifty feature of this setup is the concept of a “locker”, which corresponds to a particular bucket of AFS storage. (Actually, there can be lockers backed with other filesystems, but those are quite rare these days.) AFS divides storage into volumes, which have a particular quota and are mounted at a particular point in the global /afs namespace.
What a locker does is give an AFS volume an entry in a /mit namespace, collapsing a very broad tree into a nice flat area. I only attach the lockers that I’m interested in, so /mit doesn’t get very full.
For example, my home directory is /afs/athena.mit.edu/user/k/a/kaduk (quite a mouthful!), but is available at /mit/kaduk when I log in.
In addition to user lockers, there are also organization lockers (for student groups and the like), project lockers (for (usually software) projects), and more. Project lockers are useful in that software can be installed in them which is available on any Athena machine, without being locally installed on each machine’s hard drive. This is something of a feat when different Athena machines are based on different operating systems (or even different versions of the same OS) or have different word lengths. A key feature of AFS here is its use of sysnames: each machine has a list of well-known names that it accepts as identifying compatible software. For example, on most current Athena cluster machines, the primary sysname is amd64_ubuntu1004, as these are 64-bit Ubuntu Lucid Lynx machines. But they could probably also run code for amd64_ubuntu910 machines, and maybe even i386_ubuntu1004 machines. A maintainer of locker software can deploy multiple copies of the same software, compiled for different systems, and make bin a symlink to arch/@sys/bin so that the appropriate version is selected automatically.
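The symlink trick is easy to sketch. The layout below is built in a scratch directory (not a real locker), and the two sysnames are just ones mentioned above; the @sys substitution itself only happens when the path is looked up through an AFS client, so on a local disk the link simply dangles:

```sh
#!/bin/sh
# Sketch: lay out a locker-style tree with one build per sysname.
# $locker is a throwaway scratch directory, not a real locker.
locker=$(mktemp -d)

for sysname in amd64_ubuntu1004 i386_ubuntu1004; do
    mkdir -p "$locker/arch/$sysname/bin"
done

# On AFS, the client replaces "@sys" with one of the machine's sysnames
# during lookup. The quotes keep the shell from touching the name.
ln -s 'arch/@sys/bin' "$locker/bin"

ls -l "$locker"
```

On an AFS client, fs sysname shows the sysname list that the machine is using, which is what decides where a link like this ends up pointing.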
I have recently become a maintainer of locker software in earnest, installing the tmux utility in the bsd locker (after a suggestion in the SIPB office). Of course, I only have it installed for one sysname at the moment, so there’s more work to be done …
a world of success
August 30th, 2010
Having finally made some advances on the OpenAFS front, I had reached a state where I could copy, read, and write a large data set without error, hang, or crash. However, I was unable to run executables from AFS, which presented a serious obstacle to passing the lazy man’s filesystem stress test: ‘make buildworld’. This target recompiles an entire build toolchain from scratch, and then uses that (updated) toolchain to rebuild the entire operating system. As such, it can put a fair bit of load on a filesystem (and a CPU, for that matter).
When I asked on the freebsd-fs@FreeBSD.org mailing list, someone made a simple suggestion that would account for the displayed symptoms. (This involved two different mechanisms for tracking what is effectively a file’s size, and only one of them being updated.) After applying that fix, and a workaround for some locking issues, I now have an OpenAFS installation that can survive the buildworld informal stress test.
To be fair, it’s not perfect — attempting a parallel make with simultaneous compilation processes still causes a deadlock, but it’s a big milestone, and cause for some celebration.