Archive for July, 2010

linking

Monday, July 26th, 2010

As I mentioned last time, I’ve been working to upgrade the software and development environment on our research group’s new login node.
One of the main troubles was that, when going from the serial-execution build to the MPI-parallelized compiler wrapper, linkage errors popped up about the symbols fftw_version, fftw_malloc, and fftw_free. In addition to being defined in the fftw library, these symbols are also defined in an Intel MKL library. For some reason the change of compiler caused complaints about the duplicate symbols; I don’t see how adding the MPI pieces to the fftw libraries could have affected this, as they do not involve the relevant symbols (as verified by objdump(1)). After playing around with the configure options for both fftw and lam (the MPI library), I eventually broke down and compiled a version of the fftw libraries that did not define those three symbols.
This actually turned out to be rather more exciting than it ought to have been … it turns out that fftw likes to compile and run some test programs as part of its build. When you go and remove symbols from its libraries, these programs don’t like to compile, which causes the build to fail. It doesn’t help that the build system is quite autogoo-ified, making it hard to manually follow dependencies and see which make targets (if any) would just build the libraries and not test them.
I ended up running a full build of the clean package (using debuild to get the debian packaging additions), logging the build to a file. Then, I went in and removed the definitions of these three symbols, replacing those statements with ‘extern’ declarations of them. After removing the compiled object files of interest (and the final libraries), I failed to find ‘make’ invocations that would build just the libraries I was really interested in, so I ran ‘make malloc.o’ and such by hand, and then copy/pasted the ar invocation from the log of the normal build to create the static libraries I wanted.
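A rough sketch of the kind of duplicate-symbol check described earlier in this post, in Python. It assumes the symbol tables were dumped in nm(1)-style "address type name" format rather than with the objdump(1) actually used here, and the two sample listings are invented for illustration; they are not taken from the real fftw or MKL libraries.

```python
# Sample listings in nm(1)-style format. These are made up for illustration.
FFTW_NM = """\
malloc.o:
0000000000000000 T fftw_malloc
0000000000000040 T fftw_free
                 U malloc
0000000000000080 t local_helper
"""

MKL_NM = """\
0000000000000000 T fftw_malloc
0000000000000010 T fftw_free
                 U mkl_serv_allocate_placeholder
"""


def defined_symbols(nm_output):
    """Return the globally visible symbols that a listing defines."""
    symbols = set()
    for line in nm_output.splitlines():
        fields = line.split()
        # Defined entries have three fields: address, type letter, name.
        # Undefined entries ("U name") have only two, and member headers one.
        # An uppercase type letter marks a global (externally visible) symbol.
        if len(fields) == 3 and fields[1].isupper() and fields[1] != "U":
            symbols.add(fields[2])
    return symbols


def duplicate_symbols(nm_a, nm_b):
    """Return the symbols that both listings define globally."""
    return defined_symbols(nm_a) & defined_symbols(nm_b)
```

Calling duplicate_symbols(FFTW_NM, MKL_NM) on the sample data reports fftw_malloc and fftw_free, which is the shape of conflict the linker complained about; the local (lowercase t) and undefined (U) entries are ignored.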
I could then manually copy these libraries into a deployed tree (I just copied an existing one for the other libraries I wasn’t modifying), and build the software that has been giving me such trouble. The link step went uneventfully, and then I went to go and actually test this code.
Initial signs were promising, as the test jobs for the code I’ve already written worked okay, but jobs that explicitly tested the fft routines produced all NaNs as the result. (I am assuming that this stems from the MKL fftw_malloc’s failure to sufficiently align the memory it returns for use of the pipelined assembly instructions that fftw uses internally, but have not checked thoroughly.) For bonus points, if I pointed a serial compilation of my software at those libraries, the build failed, claiming that these symbols were not implemented!
This actually led me to note that the FFTW_LIBS were listed after the INTEL_LIBS on the link line, and thus the fftw library not picking up fftw_malloc from the Intel libraries made sense (since we recall that libraries earlier on the link line are not searched again when resolving symbols needed by a later object). Moving FFTW_LIBS before INTEL_LIBS allowed that static build to finish (though it still produced broken code). More interestingly, leaving FFTW_LIBS before INTEL_LIBS allowed the parallel compilation to link while using the normal (well, except for the underscores) fftw libraries. The most plausible explanation I have is that mpic++ and icpc differ in their treatment of -Wl,--start-group and -Wl,--end-group, as INTEL_LIBS requires repeated searching to resolve inter-library dependencies, and the original complaint was for multiple implementations within the grouped objects. But that’s kind of a stretch; I’d really like to know what’s actually going on.
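The link-order effect can be illustrated with a small Python model of a traditional single-pass linker. This is only a sketch under simplifying assumptions: each archive is a list of members, each member is a (defines, needs) pair of symbol sets, an archive is rescanned until it contributes nothing new, and no archive is ever revisited (that is, no --start-group). The archive contents are hypothetical stand-ins, not the real library layouts.

```python
def unresolved_after_link(program_needs, archives):
    """Simulate one left-to-right pass over static archives.

    Returns the set of symbols that are still undefined at the end.
    """
    undefined = set(program_needs)
    defined = set()
    for archive in archives:
        remaining = list(archive)
        progress = True
        # Keep rescanning this one archive while it still supplies something.
        while progress:
            progress = False
            for member in list(remaining):
                defines, needs = member
                # A member is pulled in only if it defines a symbol that is
                # undefined at this moment.
                if undefined & defines:
                    remaining.remove(member)
                    defined |= defines
                    undefined |= needs
                    undefined -= defined
                    progress = True
        # Moving on: this archive will not be searched again.
    return undefined


# Hypothetical archive contents, for illustration only.
FFTW = [({"fftw_create_plan"}, {"fftw_malloc"})]   # modified fftw: uses but does not define fftw_malloc
INTEL = [({"fftw_malloc", "fftw_free"}, set())]    # MKL member supplying the allocator
```

With the program needing fftw_create_plan, the order [INTEL, FFTW] leaves fftw_malloc unresolved, because the Intel archive is scanned before anything asks for that symbol. The order [FFTW, INTEL] resolves everything.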
Now that the linking is mostly sorted out, I have a new problem to deal with: the lam runtime environment for the parallel calculations is bailing on me after only one part of multi-part jobs. The displayed error is “bufferd (getroute): invalid node”. Source-diving to the line printing the error is not too hard, but trying to back out where the error originates is not so easy. (The code itself is moderately interesting, doing things such as having functions that return function pointers, which results in a type signature of void(*(bufferd()))().) Playing around with my PATH and the hardcoded calls to mpirun results in several ominous WARNINGS about version mismatch between the runtime and the libraries that the binary was compiled with. I do have some reasons for having multiple versions of the lam code installed on this machine, but I think I’ll save those for another post (hopefully one where I have solved all my problems and gotten back to the more interesting business of actually doing chemistry!).

packaging

Monday, July 19th, 2010

Our research group is in the process of upgrading our main login node from a two-core Core2 Duo (3 GHz) to an eight-core Xeon machine (2.5 GHz); as one of the group’s system administrators, I am faced with the exciting task of porting all of the group’s software from the old machine to the new one. Along with the hardware upgrade, we are also upgrading from the EOL’d Ubuntu Gutsy Gibbon to the latest Ubuntu Long-Term Support release, Lucid Lynx.
Basically none of the actual scientific software we use (things like Q-Chem, CHARMM, GAMESS, TURBOMOLE and the like) is packaged for Debian/Ubuntu, so there is a great maze of hacked software deployment on the old machine. And, as with most amateur sysadmin projects, it is horrendously under-documented. Luckily, some of the actual binaries are statically-linked, and can thus be copied over as-is and will continue to work. Unfortunately, much of the work our group does involves new method development, and writing new code into the software requires recompiling it.
We have traditionally compiled our production software using the Intel compiler collection in preference to the GNU offerings, as one expects the processor manufacturer’s compiler to produce the most efficient code at high optimization levels. This, in turn, leads to an amusing cascade of circumstances as all the pieces don’t quite fit together ….
In addition to the Intel compilers, our applications also link against the fftw (“Fastest Fourier Transform in the West”) libraries, and the lammpi implementation of the MPI (Message-Passing Interface) parallelization API. But! Instead of calling into fftw from Fortran, as it seems to assume will be the case, the existing code base calls fftw from C++, and does not append a trailing underscore to the symbols it uses, as is commonly done when interlinking C++ (well, C) and Fortran object files. This means that the fftw package available from Ubuntu is unsuitable for us, as the libraries it distributes have trailing underscores.
However, I can still leverage the existing Debian packaging: apt-get source fftw2 gives me the source code and packaging files for the several related packages that we will need. Within that directory, I can modify the debian/rules Makefile to use gfortran -fno-underscoring instead of ordinary gfortran, and get libraries that will actually link against the existing codebase. (I ended up using dpkg --extract instead of just installing the resulting packages, so as to keep the unmodified Ubuntu packages in their normal place; the modified packages live in a subdirectory in /opt.)
With that out of the way, I could compile and link the codebase for single-threaded use. This is good, but our calculations that will be running for a month or two really want to benefit from a parallelization speedup, so on to MPI!
The recommended way to use MPI is to use the distributed set of compiler wrappers (e.g. mpic++) instead of the standard compiler; this takes care of linking in the appropriate MPI libraries as needed. Unfortunately, it turns out that these compiler wrappers hardcode, at their own compile time, which backend compiler to use. In the case of the Ubuntu packaged versions, this means gcc. It turns out that the particular set of compiler arguments that our build system normally passes to icc (the Intel C compiler) do not error out when passed to gcc, so the compilation process would proceed just fine until the first time that it attempted to collect individual object files into an archive. This failed because none of the object files it was supposed to be collecting could be found. The root cause of this is quite hilarious: the Intel compiler takes -openmp, enabling OpenMP parallelization. The GNU compiler interprets this as -o penmp, so that each object file produced is named “penmp” (overwriting the previous file). The archive step, of course, can’t find filename.o, since it doesn’t exist!
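The -openmp mix-up follows from the usual Unix convention that a short option taking an argument may have that argument attached directly. As a hedged illustration, here is a sketch using Python's getopt module, which follows the same convention; gcc has its own option parser, so this only models the behaviour rather than reproducing gcc's code, and the option string "co:" is a stand-in covering just -c and -o.

```python
import getopt


def output_name(argv):
    """Return the argument of -o as a getopt-style parser sees it, or None."""
    # "co:" means: -c takes no argument, -o takes one (attached or separate).
    opts, _rest = getopt.getopt(argv, "co:")
    for flag, value in opts:
        if flag == "-o":
            return value
    return None
```

Under this convention, output_name(["-openmp", "-c", "foo.c"]) yields "penmp": the parser sees -o followed immediately by its argument, which matches the file name that kept getting overwritten.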
Again, we can leverage the Ubuntu packaging, running apt-get source to get the lam4-dev source, and change the rules file to use icc and friends. Of course, there are a few wrenches in the works … the Intel compilers want some environment preparation before being executed, in the form of a shell script that is sourced in. The debuild utility, used to build the package files, sanitizes the build environment by default, so any setup done in the shell invoking debuild is lost. A solution is to create wrapper scripts that do the setup before calling the actual Intel binaries; I chose to put these in /usr/local/bin, so as to not interfere with the system namespace. However, debuild doesn’t include that in the PATH, so I ended up putting symlinks in /usr/bin anyway. Even this is insufficient, though, as the rules file calls into dh_shlibdeps, which checks the dependency information for the various shared library targets that the package provides. The code compiled with the Intel compilers links into Intel libraries (provided with the compiler suite), but these libraries are not packaged, and have no versioning information available to the Debian utilities. In this case, dh_shlibdeps errors out, which causes the entire build process to fail. A quick hack around this is to just ignore the lam libraries when doing the dependency checks, by passing -Xlam to dh_shlibdeps. This allows the lam4-dev package to build with (and backend to) the Intel compilers; I then extracted these packages into /opt and expected things to Just Work.
Alas, it proved not to be the case. All the application code compiled, but at the final link stage (some 45 minutes in), the link failed with undefined symbol references, complaining about the lam libraries installed in /usr/lib. That is, the ones installed by the standard lam4-dev package, not my custom-built version. I haven’t had much of a chance to debug the root cause of this, but hope that it can be solved by passing a non-standard --prefix argument to configure in the rules file (so that the runtime code will use a more appropriate library search path). We’ll see what actually happens.

On Garlic

Sunday, July 11th, 2010

I tend to be a fan of spicy and flavorful food, and one of the flavorings that I especially like is garlic. A couple weeks ago at dinner, I mentioned that it really seems that garlic as an ingredient should be measured in heads, not cloves. (E.g., I have this recipe for black bean stew that calls for three heads of garlic (ca. 30 cloves).) My friend Karl replied with the catchphrase “I am intrigued by your proposition and wish to subscribe to your newsletter”; feel free to think of this as issue one.

Garlic Rosemary Chicken with risotto

1 quart chicken stock
1.5 pounds frozen chicken pieces
1 head garlic (peeled)
1 tsp rosemary
1 tsp basil
4 tbsp butter
1 7/8 cups rice

Combine the chicken stock, garlic, and herbs in a saucepan and add the chicken pieces. Bring to a simmer, and poach the chicken, partially covered, for about 45 minutes (adjusting for the size of the pieces as necessary). If there’s a lot of scum, skim it off while the chicken poaches. Remove the chicken and the cloves of garlic, and add 2 tbsp of butter and the rice to the stock. (The stock and rice should be in a 2:1 ratio, but some of the stock will have evaporated by this point.) The rice will need frequent stirring, especially towards the end of its cooking. Melt the remaining 2 tbsp of butter in a skillet, and fry the chicken pieces and garlic cloves until golden brown. The chicken should not end up too crispy, so it may be necessary to remove it a bit before the garlic, which should end up a beautiful golden brown. The garlic will be quite soft from the poaching, so handle it carefully. Be sure to stir the rice while this is happening; you don’t want it to burn!
When the rice is done (technically, I wouldn’t really call it a risotto, but it is a decent approximation), it will be like a thick glossy sauce; at this point, declare the garlic (and chicken) done, and enjoy the feast!

a bowl of risotto with chicken and fried garlic

I am kind of tempted to do this recipe again with two units of garlic instead of one, as it disappears very quickly at the end. Those three lonely cloves of garlic aren’t enough to make it through all the rice ….

integrity

Monday, July 5th, 2010

As we saw last time, OpenAFS on FreeBSD (amd64 architecture) suffered from some serious corruption issues, being susceptible to page faults in kernel mode on small-valued (but largely non-NULL) addresses. It turns out that the corruption stems from the OpenAFS kernel module (libafs.ko) being compiled with different arguments to the gcc compiler than the main FreeBSD kernel. (The libafs.ko build procedure has been largely unchanged since the days of FreeBSD 4.X, being updated only when things break; the FreeBSD kernel build system has received more love.) In particular, the main kernel build passed -mno-red-zone, whereas until very recently the libafs.ko build did not. This argument has the effect of disabling the “Red Zone” feature of the x86_64 ABI, in which an extra 128-byte region just below the stack pointer is available for use without adjusting the stack pointer. In effect, libafs was relying on that region, and the kernel was stomping all over its data! Of course, the kernel is not at fault here; libafs needed to be more careful about where it was storing things. Still, it is an amusing way to get corruption: I don’t trust the libafs code, and mostly trust the main kernel, but it was the kernel that smashed my stack. (The culprit is unlikely to be an actual function call into the main kernel, as gcc should synchronize the stack pointer before function calls, but rather an interrupt handler that triggered at an unfortunate moment.)

In my excitement of having found the bug I had been tracking for the past several weeks, I went and deleted all (fifteen or so) saved kernel coredumps that I had from the issue, so I can’t actually show the object code that caused the crash I quoted in last week’s post (it seems that I was using different compiler flags at some point, as the object code that is currently produced for afs_vop_close does not put the address of afs_global_mtx on the stack). Anyway, I can at least give an idea of what the differences look like:
--- osi_vnodeops.S-bad
+++ osi_vnodeops.S-good
@@ -1,22 +1,22 @@
0000000000001560 :
- 1560: 48 89 5c 24 e0 mov %rbx,0xffffffffffffffe0(%rsp)
- 1565: 48 89 6c 24 e8 mov %rbp,0xffffffffffffffe8(%rsp)
- 156a: 48 8d 15 00 00 00 00 lea 0(%rip),%rdx # 1571
- 156d: R_X86_64_PC32 .LC9+0xfffffffffffffffc
- 1571: 4c 89 64 24 f0 mov %r12,0xfffffffffffffff0(%rsp)
- 1576: 4c 89 6c 24 f8 mov %r13,0xfffffffffffffff8(%rsp)
- 157b: 48 83 ec 28 sub $0x28,%rsp
- 157f: 48 8b 1d 00 00 00 00 mov 0(%rip),%rbx # 1586
- 1582: R_X86_64_GOTPCREL afs_global_mtx+0xfffffffffffffffc
- 1586: 48 8b 47 08 mov 0x8(%rdi),%rax
- 158a: 31 f6 xor %esi,%esi
- 158c: 48 89 fd mov %rdi,%rbp
- 158f: b9 87 02 00 00 mov $0x287,%ecx
+ 1560: 48 83 ec 28 sub $0x28,%rsp
+ 1564: 48 8d 15 00 00 00 00 lea 0(%rip),%rdx # 156b
+ 1567: R_X86_64_PC32 .LC9+0xfffffffffffffffc
+ 156b: 31 f6 xor %esi,%esi
+ 156d: 48 89 5c 24 08 mov %rbx,0x8(%rsp)
+ 1572: 48 8b 1d 00 00 00 00 mov 0(%rip),%rbx # 1579
+ 1575: R_X86_64_GOTPCREL afs_global_mtx+0xfffffffffffffffc
+ 1579: b9 87 02 00 00 mov $0x287,%ecx
+ 157e: 48 89 6c 24 10 mov %rbp,0x10(%rsp)
+ 1583: 4c 89 64 24 18 mov %r12,0x18(%rsp)
+ 1588: 48 89 fd mov %rdi,%rbp
+ 158b: 4c 89 6c 24 20 mov %r13,0x20(%rsp)
+ 1590: 48 8b 47 08 mov 0x8(%rdi),%rax
1594: 48 89 df mov %rbx,%rdi
1597: 4c 8b 60 18 mov 0x18(%rax),%r12
159b: e8 00 00 00 00 callq 15a0
159c: R_X86_64_PLT32 _mtx_assert+0xfffffffffffffffc
15a0: 48 8d 15 00 00 00 00 lea 0(%rip),%rdx # 15a7
15a3: R_X86_64_PC32 .LC9+0xfffffffffffffffc
15a7: 31 f6 xor %esi,%esi
15a9: b9 87 02 00 00 mov $0x287,%ecx

The differences for this particular function (which is rather short) are limited to the beginning of the function, where register values are saved onto the stack. The “bad” version just starts storing values from registers down onto the stack (mov %rbx,0xffffffffffffffe0(%rsp) stores the value from register %rbx into the memory location at 0xffffffffffffffe0 (that is, -32) plus the value stored in %rsp, the stack pointer). Only after storing some values below the stack pointer does the code decrement the stack pointer. The "good" version is good and decrements the stack pointer (the stack grows down, we recall) before storing anything to it. This difference is key, as without a "red zone", the kernel is free to write to anything below the stack pointer while fielding a scheduler interrupt, say. If the "bad" afs_vop_close was interrupted in the first few instructions, it could end up using whatever bogus values the kernel left on the stack, instead of what it put there; this can easily lead to page faults trying to access things in very low portions of address space.
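The race described above can be mimicked with a toy Python model. This is a sketch under loud assumptions: memory is a dictionary, the starting stack pointer 0x7000 and the register value 0xCAFE are arbitrary, and the "interrupt" is modelled as something that overwrites the 64 bytes immediately below the current stack pointer right after %rbx is saved. Only the 0x28 frame size and the two store offsets come from the disassembly.

```python
def signed64(value):
    """Interpret a 64-bit displacement as a two's-complement signed number."""
    return value - (1 << 64) if value & (1 << 63) else value


def saved_rbx_after_prologue(adjust_first):
    """Return what the saved-%rbx slot holds once the prologue has finished."""
    memory = {}
    rsp = 0x7000   # arbitrary illustrative stack pointer
    rbx = 0xCAFE   # arbitrary illustrative register contents

    if adjust_first:
        rsp -= 0x28                                 # sub $0x28,%rsp
        slot = rsp + 0x8                            # mov %rbx,0x8(%rsp)
    else:
        slot = rsp + signed64(0xffffffffffffffe0)   # mov %rbx,-32(%rsp)
    memory[slot] = rbx

    # Modelled interrupt: with no red zone honoured, the handler's frame
    # overwrites whatever lies just below the current stack pointer.
    for addr in range(rsp - 64, rsp, 8):
        memory[addr] = 0xDEAD

    if not adjust_first:
        rsp -= 0x28                                 # the late sub $0x28,%rsp

    return memory[slot]
```

In this model both prologues save %rbx at the same address, but only the version that adjusts the stack pointer first still finds 0xCAFE there afterwards; the other reads back the handler's 0xDEAD. Note also that signed64(0xffffffffffffffe0) evaluates to -32, matching the offset given in the text.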

With this bug squashed, I've been free to fix a couple other minor issues, and now my AFS client is sufficiently stable that I can copy half a gigabyte of data into AFS without trouble --- things are looking good for the future.