integrity

As we saw last time, OpenAFS on FreeBSD (amd64 architecture) suffered from some serious corruption issues, being susceptible to page faults in kernel mode on small-valued (but largely non-NULL) addresses. It turns out that the corruption stems from the OpenAFS kernel module (libafs.ko) being compiled with different arguments to the gcc compiler than the main FreeBSD kernel. (The libafs.ko build procedure has been largely unchanged since the days of FreeBSD 4.X, being updated only when things break; the FreeBSD kernel build system has received more love.) In particular, the main kernel build passed -mno-red-zones, whereas until very recently the libafs.ko build did not. This argument has the effect of disabling the “Red Zone” feature of the x86_64 ABI, in which an extra 128-byte region at the end of the stack is available for use without adjusting the location of the stack pointer. In effect, libafs was calling into the kernel, and the kernel was stomping all over its data! Of course, the kernel is not at fault, here, libafs needed to be more careful about where it was storing things, but it is an amusing way to get corruption — I don’t trust the libafs code, and mostly trust the main kernel, but it was the kernel that smashed my stack. (The culprit is unlikely to be an actual function call into the main kernel, as gcc should synchronize the stack pointer before function calls, but rather an interrupt handler that triggered at an unfortunate moment.)

In my excitement of having found the bug I had been tracking for the past several weeks, I went and deleted all (fifteen or so) saved kernel coredumps that I had from the issue, so I can’t actually show the object code that caused the crash I quoted in last week’s post (it seems that I was using different compiler flags at some point, as the object code that is currently produced for afs_vop_close does not put the address of afs_global_mtx on the stack). Anyway, I can at least give an idea of what the differences look like:
--- osi_vnodeops.S-bad
+++ osi_vnodeops.S-good
@@ -1,22 +1,22 @@
0000000000001560 :
- 1560: 48 89 5c 24 e0 mov %rbx,0xffffffffffffffe0(%rsp)
- 1565: 48 89 6c 24 e8 mov %rbp,0xffffffffffffffe8(%rsp)
- 156a: 48 8d 15 00 00 00 00 lea 0(%rip),%rdx # 1571
- 156d: R_X86_64_PC32 .LC9+0xfffffffffffffffc
- 1571: 4c 89 64 24 f0 mov %r12,0xfffffffffffffff0(%rsp)
- 1576: 4c 89 6c 24 f8 mov %r13,0xfffffffffffffff8(%rsp)
- 157b: 48 83 ec 28 sub $0x28,%rsp
- 157f: 48 8b 1d 00 00 00 00 mov 0(%rip),%rbx # 1586
- 1582: R_X86_64_GOTPCREL afs_global_mtx+0xfffffffffffffffc
- 1586: 48 8b 47 08 mov 0x8(%rdi),%rax
- 158a: 31 f6 xor %esi,%esi
- 158c: 48 89 fd mov %rdi,%rbp
- 158f: b9 87 02 00 00 mov $0x287,%ecx
+ 1560: 48 83 ec 28 sub $0x28,%rsp
+ 1564: 48 8d 15 00 00 00 00 lea 0(%rip),%rdx # 156b
+ 1567: R_X86_64_PC32 .LC9+0xfffffffffffffffc
+ 156b: 31 f6 xor %esi,%esi
+ 156d: 48 89 5c 24 08 mov %rbx,0x8(%rsp)
+ 1572: 48 8b 1d 00 00 00 00 mov 0(%rip),%rbx # 1579
+ 1575: R_X86_64_GOTPCREL afs_global_mtx+0xfffffffffffffffc
+ 1579: b9 87 02 00 00 mov $0x287,%ecx
+ 157e: 48 89 6c 24 10 mov %rbp,0x10(%rsp)
+ 1583: 4c 89 64 24 18 mov %r12,0x18(%rsp)
+ 1588: 48 89 fd mov %rdi,%rbp
+ 158b: 4c 89 6c 24 20 mov %r13,0x20(%rsp)
+ 1590: 48 8b 47 08 mov 0x8(%rdi),%rax
1594: 48 89 df mov %rbx,%rdi
1597: 4c 8b 60 18 mov 0x18(%rax),%r12
159b: e8 00 00 00 00 callq 15a0
159c: R_X86_64_PLT32 _mtx_assert+0xfffffffffffffffc
15a0: 48 8d 15 00 00 00 00 lea 0(%rip),%rdx # 15a7
15a3: R_X86_64_PC32 .LC9+0xfffffffffffffffc
15a7: 31 f6 xor %esi,%esi
15a9: b9 87 02 00 00 mov $0x287,%ecx

The differences for this particular function (which is rather short) are limited to the beginning of the function, where values are stored into registers and on the stack. The “bad” version just starts storing values from registers down onto the stack ( mov %rbx,0xffffffffffffffe0(%rsp) is storing the value from register %rbx into the memory location pointed to by 0xffffffffffffffe0 (that is, -32) plus the value stored in %rsp, the stack pointer. Only after pushing some values onto the stack does the code decrement the stack pointer. The "good" version is good and decrements the stack pointer (the stack grows down, we recall) before storing anything to it. This difference is key, as without a "red zone", the kernel is free to write to anything below the stack pointer while fielding a scheduler interrupt, say. If the "bad" afs_vop_close was interrupted in the first few instructions, it could end up using whatever bogus values the kernel left on the stack, instead of what it put there; this can easily lead to page faults trying to access things in very low portions of address space.

With this bug squashed, I've been free to fix a couple other minor issues, and now my AFS client is sufficiently stable that I can copy half a gigabyte of data into AFS without trouble --- things are looking good for the future.

Comments are closed.