Archive for November, 2010

Anatomy of a FreeBSD port (part 5)

Sunday, November 21st, 2010

It’s been quite some time since I last posted about FreeBSD packaging; today I’m coming back to it to talk a bit about other things that can go in the files/ directory. I just recently got packaging for OpenAFS in good enough shape to submit to the FreeBSD Ports Collection (the PR is here); there’s a little bit of cleverness in the Makefile that I’ll skip for now, in favor of the rc scripts.
Long, long ago, in the early days of Unix (read: before my time), there was a single shell script /etc/rc (the Jargon File claims the name stands for “runcom”) that was run during system startup, executing commands to set up the local environment, start daemons, etc. Eventually it grew so large that it was split into multiple files, and later a large infrastructure grew up around it, so that each service has a script in /etc/rc.d/ and the administrator has mechanisms for controlling which scripts run when. Many of these controls live in /etc/rc.conf, and the rc scripts for software from the Ports Collection go in /usr/local/etc/rc.d to keep them separate from the base system.
Instead of just being a shell script that is sourced at startup, modern usage involves invocations such as:

/usr/local/etc/rc.d/afsd onestart
/usr/local/etc/rc.d/afsd forcestop
/usr/local/etc/rc.d/afsd start

with multiple variations on the “start” and “stop” commands. For a service to be started, the appropriate rc.conf variable must be set to enable it; onestart is a way to start it manually regardless. For this to work, each rc script has to define several shell functions that hook into the (massive) rc.subr (that’s “subroutine”) infrastructure. Here’s what I ended up with in files/afsd.in:

#!/bin/sh
#
# we require afsserver for the (rare, untested) case when a client
# and server are running on the same machine -- the client must not
# start until the server is running.
#
# PROVIDE: afsd
# REQUIRE: afsserver named

These keywords are used to order all the rc scripts on system startup (and shutdown) — dependencies are declared explicitly.
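On FreeBSD the ordering is computed by rcorder(8), which reads these comment keywords. As a rough stand-in that runs anywhere, the dependency relation declared above can be fed to the standard tsort utility. This only illustrates the ordering idea; it is not how the rc system is implemented.

```sh
#!/bin/sh
# Each input line is "prerequisite dependent", taken from the header above:
# afsd REQUIREs afsserver and named.
# Using tsort in place of rcorder(8) is an assumption made for a portable demo.
order=$(printf '%s\n' "afsserver afsd" "named afsd" | tsort)

echo "${order}"

# afsd comes out last; the relative order of its two prerequisites is
# not specified.
last=$(echo "${order}" | tail -n 1)
```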

. /etc/rc.subr

name="afsd"
rcvar="afsd_enable"
start_cmd="afsd_start"
start_precmd="afsd_prestart"
stop_cmd="afsd_stop"
command="%%PREFIX%%/sbin/${name}"
kmod="libafs"
vicedir="%%PREFIX%%/etc/openafs"

This file has the .in suffix because variable substitution is applied to it. Here, %%PREFIX%% gets expanded to the prefix that the port is being built with; this is usually /usr/local but can be something else.
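The ports framework performs this substitution for us. Its effect is roughly that of a sed replacement, as in the sketch below. The prefix value and the temporary file names are assumptions made for the demonstration, and only two lines of the script are used as input.

```sh
#!/bin/sh
# Rough imitation of the substitution step using plain sed.
# prefix and workdir are demo values, not something the port defines.
prefix="/usr/local"
workdir=$(mktemp -d)

# A two-line excerpt of afsd.in; the quoted heredoc keeps ${name} literal.
cat > "${workdir}/afsd.in" <<'EOF'
command="%%PREFIX%%/sbin/${name}"
vicedir="%%PREFIX%%/etc/openafs"
EOF

# Replace every %%PREFIX%% token with the chosen prefix.
sed -e "s|%%PREFIX%%|${prefix}|g" "${workdir}/afsd.in" > "${workdir}/afsd"

substituted=$(cat "${workdir}/afsd")
echo "${substituted}"
rm -rf "${workdir}"
```

Using | as the sed delimiter avoids having to escape the slashes in the prefix.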


load_rc_config "$name"
eval "${rcvar}=\${${rcvar}:-'NO'}"

This is us checking if we’re listed in rc.conf; default to disabled if not mentioned.
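The eval line is dense, so here is a stand-alone illustration of what it does. It is not part of afsd.in: load_rc_config is left out, and the eval is wrapped in a hypothetical helper named set_default_no so that both cases can be shown.

```sh
#!/bin/sh
# set_default_no is a hypothetical helper name; rc.subr does not provide it.
rcvar="afsd_enable"

set_default_no()
{
        # After expansion, eval runs: afsd_enable=${afsd_enable:-'NO'}
        eval "${rcvar}=\${${rcvar}:-'NO'}"
}

# Case 1: the variable is not mentioned anywhere, so it falls back to NO.
unset afsd_enable
set_default_no
unset_result="${afsd_enable}"

# Case 2: the variable is already set, as a line afsd_enable="YES" in
# rc.conf would do; the existing value is kept.
afsd_enable="YES"
set_default_no
set_result="${afsd_enable}"

echo "unset: ${unset_result}, set: ${set_result}"
```

The backslash keeps the outer double quotes from expanding the second ${...}, so the default is applied only when eval runs the generated assignment.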


afsd_prestart()

This is one of the functions that hooks into rc.subr — AFS requires several configuration files and a kernel module to be in place before it can start, so we check that they’re all there and give a useful error if not. This is quite helpful for users who are not familiar with how to start afsd manually.

{
        # not going very far without a kernel module
        if ! kldstat -qm afs; then
                echo "Loading AFS kernel module..."
                if ! kldload $kmod; then
                        echo "Failed to enable kernel support. Aborting."
                        return 1;
                fi
        fi
        # now we have a kernel module; check for conffiles
        for file in cacheinfo ThisCell CellServDB; do
                if [ ! -f ${vicedir}/${file} ]; then
                        echo "${vicedir}/${file} does not exist.  Not starting AFS client."
                        return 1
                fi
        done
        # need a mountpoint and a cache dir (well, if we have a disk cache)
        for dir in $(awk -F: '{print $1, $2}' ${vicedir}/cacheinfo); do
                if [ ! -d ${dir} ]; then
                        echo "${dir} does not exist. Not starting AFS client."
                        return 2
                fi
        done
}
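The awk call relies on cacheinfo being a single colon-separated line of mount point, cache directory, and cache size. A small sketch of what the script extracts from it follows; the sample paths and size are made up for the demonstration.

```sh
#!/bin/sh
# Sample cacheinfo contents; these values are illustrative only.
cacheinfo=$(mktemp)
echo "/afs:/usr/vice/cache:150000" > "${cacheinfo}"

# Same awk invocation as in afsd_prestart: fields 1 and 2.
dirs=$(awk -F: '{print $1, $2}' "${cacheinfo}")

# Same awk invocation as in afsd_stop: field 1 only.
mountpoint=$(awk -F: '{print $1}' "${cacheinfo}")

# The unquoted expansion is split on whitespace, one directory per word.
for dir in ${dirs}; do
        echo "would check: ${dir}"
done
rm -f "${cacheinfo}"
```

Because the loop depends on word splitting, a mount point or cache directory containing spaces would be split into several words.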

afsd_start()
{
        # you probably don't want to change these
        afsd_default_args="-memcache -dynroot -fakestat-all -afsdb"
        # either set explicit extra args or just a size; default medium
        afsd_args=${afsd_args:-'MEDIUM'}
        case ${afsd_args} in
        LARGE)
                afsd_args="-stat 2800 -dcache 2400 -daemons 5 -volumes 128"
                ;;
        MEDIUM)
                afsd_args="-stat 2000 -dcache 800 -daemons 3 -volumes 70"
                ;;
        SMALL)
                afsd_args="-stat 300 -dcache 100 -daemons 2 -volumes 50"
                ;;
        esac
        ${command} ${afsd_default_args} ${afsd_args}
}

The actual start function. We check to see if we’ve been given extra arguments, using sane defaults if not. There are also a few things that you basically always want (for example, non-memcache is currently broken), which are listed in a local variable.
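The size keywords are meant to be set from rc.conf. The sketch below shows how a value is resolved, using a hypothetical helper resolve_afsd_args that mirrors the case statement above but prints the arguments instead of launching afsd.

```sh
#!/bin/sh
# resolve_afsd_args is a hypothetical helper for illustration; the real
# script does this inline in afsd_start.
resolve_afsd_args()
{
        _args=${1:-'MEDIUM'}
        case ${_args} in
        LARGE)  _args="-stat 2800 -dcache 2400 -daemons 5 -volumes 128" ;;
        MEDIUM) _args="-stat 2000 -dcache 800 -daemons 3 -volumes 70" ;;
        SMALL)  _args="-stat 300 -dcache 100 -daemons 2 -volumes 50" ;;
        esac
        # Anything that is not a size keyword passes through unchanged.
        echo "${_args}"
}

default_args=$(resolve_afsd_args "")
small_args=$(resolve_afsd_args SMALL)
custom_args=$(resolve_afsd_args "-stat 500 -daemons 4")

echo "default: ${default_args}"
echo "small:   ${small_args}"
echo "custom:  ${custom_args}"
```

So a line such as afsd_args="SMALL" in rc.conf selects a preset, while any other string is handed to afsd as explicit arguments.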


afsd_stop()
{
        afsdir=$(awk -F: '{print $1}' ${vicedir}/cacheinfo)
        umount ${afsdir}
        _return=$?
        [ "${_return}" -ne 0 ] && [ -n "${rc_force}" ] && umount -f ${afsdir}
        kldunload ${kmod}
}

Stopping does not actually involve touching afsd at all: those processes will happily ignore whatever you throw at them. Instead, we read the AFS mount point from cacheinfo and run umount on it. Note that the function does not check beforehand whether AFS is actually mounted (someone might be erroneously running onestop); it relies on umount failing in that case. If the umount fails and force is being used (rc_force is set), we retry with umount -f. Finally, the kernel module is unloaded.


run_rc_command "$1"

This last line is very important! It looks very mundane, but it is how we actually interface with the rc system; rc.subr defines this function, which does all the necessary variable-name munging and calls the appropriate function(s) that we have defined.

At install time for the port, the substituted variables are replaced, and the script is installed into ${PREFIX}/etc/rc.d and added to the package list to be removed at deinstall time. All in all, we wrap a standard interface around the complicated afsd semantics.

and bad locking

Monday, November 15th, 2010

Last time, I discussed some locking issues in OpenAFS and mentioned that fixing them uncovered a race condition elsewhere.
OpenAFS has a somewhat complicated locking strategy, but parts of it rely on the afs_global_mtx, or the GLOCK for short. The GLOCK should not be held across sleeps, as this could cause the client to hang. But it is needed for synchronization around some operations that must sleep. So the sleep routines are backed by mtx_sleep, which drops the mutex before the actual sleep and reacquires it afterwards. However, other threads may have acquired the GLOCK during the intervening time, so any checks which were made before the sleep must be made again (or the programmer must otherwise ensure that their values could not have changed). This was problematic in the afs_root function, where the GLOCK is used to serialize access to a global variable, afs_globalVp, which points to the vcache entry for the root AFS vnode. The relevant code is:

    if (afs_globalVp && (afs_globalVp->f.states & CStatd)) {
        tvp = afs_globalVp;
        error = 0;
    } else {
tryagain:
        if (afs_globalVp) {
            afs_PutVCache(afs_globalVp);
            /* vrele() needed here or not? */
            afs_globalVp = NULL;
        }

The afs_PutVCache function sleeps, dropping the GLOCK in the process. So it is possible for some other thread to have come into this block of code at the same time, and also try to call afs_PutVCache on afs_globalVp. When this happens, the core vputx routine used to implement afs_PutVCache sees that the reference count on the global vnode entry is negative, which violates an invariant of the VFS layer. (This means a kernel panic under my debugging options.)
The fix, then, is to make sure that all changes to afs_globalVp involve conditions and actions checked while the GLOCK is held and no sleeps have been made. If the thread sleeps, conditions must be re-checked.
For this block of code, the change is easy: set afs_globalVp to NULL before calling afs_PutVCache (storing the old value in an intermediate variable), so that other threads will see that the removal has been queued, even if it has not actually taken place yet. However, this is not the only bug of this form in the afs_root function, which is a reminder that great care must be taken with locking strategies, and that sleeps can come in surprising places.

more locking

Sunday, November 7th, 2010

Last time, I started off with some locking fixups for the OpenAFS client on FreeBSD, but left off in the middle. We had fixed the cosmetic errors coming from afs_FlushVCache, but not the underlying problem. We recall that the cosmetic issues arose due to incorrect locking around accesses to the v_usecount field of the vnode; it turns out that there are only a small number of places where this happens. Most of them are uninteresting, just a quick check and not much else.
But in osi_VM_FlushVCache (which is FreeBSD-specific code), we check the use count once, and then again later on in the function, and then we call vgone(). vgone() is a heavyweight function call, marking a vnode as being free for reuse. As such, it requires some pretty heavy locking around calls to it; in particular, it requires an exclusive lock on the vnode. (vgone also acquires the vnode interlock internally.) However, this is not quite sufficient, as once vgone places the vnode on the free list, it is susceptible to being destroyed (the FreeBSD VFS layer runs a cleaner periodically). But we still have the vnode locked! We need to unlock it after vgone returns, and need some mechanism to ensure that it doesn’t go away in the meantime.
This is done by placing a “hold” on the vnode, or increasing its “hold count”. This is a closely related idea to the use count (and in fact the usual way to increment the use count also increments the hold count), but needs to be tracked separately for implementation details like this. So, then, we put a hold on the vnode before sending it away (keeping the interlock held for efficiency), then do the unlock, and then drop the hold. This procedure is sufficiently internal to the VFS layer so as to not be documented in a man page; I learned it because a FreeBSD VFS expert directed me to the vlrureclaim function in sys/kern/vfs_subr.c. It is not quite the same, as it is iterating over a list of vnodes, but it does go and free up vnodes that are not currently being used, so the checks and locks it takes are a good example for my use case.
The extra bonus of going and implementing a proper set of instructions around vgone is that it allowed the removal of some duplicated work! In addition to osi_VM_FlushVCache, the osi_TryEvictVCache function was doing some checks and then calling vgone. Well, more properly, it was calling vgonel, which is not supposed to be an exported symbol but just happened to work due to an implementation detail of the kernel linker! It turns out that the checks it was doing are exactly the same ones done in osi_VM_FlushVCache, so the latter can be implemented in terms of the former, removing a goodly chunk of code. (It actually wants to be implemented in terms of afs_FlushVCache, which does some additional bookkeeping on the number of vcaches in use, but I didn’t realize this until after the code was committed; I ran into it while tracking down another issue.) This change allowed my (multi-threaded) testing to go far enough to expose a rare race condition elsewhere in the codebase, which we’ll cover next time.

locking

Monday, November 1st, 2010

Things have been kind of locked up, here, for the past few weeks. But I have finally gotten around to getting some interesting work done, on the OpenAFS front. During the long run of a “buildworld” in AFS, I would eventually get a warning on the console that the afs_vop_reclaim() function (i.e. the actual function that gets used when the VOP_RECLAIM() operation is performed on a vnode of type “afs”) had hit an error condition where the routine that should have removed all AFS content from that vnode failed:

afs_vop_reclaim: afs_FlushVCache failed code 16

Code 16 is EBUSY, which is actually returned from several places in that function. Placing print statements before all of them, and triggering the bug again, reveals that the reference count of the vnode (as determined by AFS’s VREFCOUNT() macro) was too large, implying that someone else was probably trying to use that vnode. For extra fun, later on the buildworld step would fail, usually claiming that it could not find a particular header file. Using the fs getfid command (from a different computer), it was clear that the file existed, and the fid used to identify that file is the same as the one that we failed to flush properly. Clearly, this bug was leaving a corrupt vnode floating around, and this corruption was later crashing the build.
Now, what does VREFCOUNT actually do? The relevant block of code is

#if defined(AFS_XBSD_ENV) || defined(AFS_DARWIN_ENV)
#define vrefCount   v->v_usecount
[...]
#elif defined(AFS_XBSD_ENV) || defined(AFS_DARWIN_ENV)
#define VREFCOUNT(v)          ((v)->vrefCount)
#define VREFCOUNT_GT(v, y)    (AFSTOV(v)->v_usecount > (y))

(which is rather ugly); both VREFCOUNT and VREFCOUNT_GT use the v_usecount field of the vnode associated with the given vcache. Now, FreeBSD has a locking strategy for vnode elements (and for the vnodes themselves, but that’s getting ahead of ourselves), which is laid out in the sys/vnode.h system header.

/*
 * Reading or writing any of these items requires holding the appropriate lock.
 *
 * Lock reference:
 *      c - namecache mutex
 *      f - freelist mutex
 *      G - Giant
 *      i - interlock
 *      m - mntvnodes mutex
 *      p - pollinfo lock
 *      s - spechash mutex
 *      S - syncer mutex
 *      u - Only a reference to the vnode is needed to read.
 *      v - vnode lock
[...]
        /*
         * Locking
         */
        struct  lock v_lock;                    /* u (if fs don't have one) */
        struct  mtx v_interlock;                /* lock for "i" things */
        struct  lock *v_vnlock;                 /* u pointer to vnode lock */
        int     v_holdcnt;                      /* i prevents recycling. */
        int     v_usecount;                     /* i ref count of users */
        u_long  v_iflag;                        /* i vnode flags (see below) */
        u_long  v_vflag;                        /* v vnode flags */
        int     v_writecount;                   /* v ref count of writers */

As we can see, the locking strategy requires that accesses to v_usecount hold the vnode interlock; OpenAFS was failing to do so. Conveniently, there is a wrapper function vrefcnt() that takes the interlock, reads the use count into a local variable, and then returns the local variable. Changing the VREFCOUNT macros to use this function did eliminate the console warnings about afs_FlushVCache returning EBUSY … but it did not fix the buildworld. Still, we get compilation errors stemming from files that are mysteriously “missing”. The fix involves more locking, but the story is a bit more involved than I have space left, here; it’s on tap for next time.