The tail of an off-by-one error in GHC's linker

This is the story of an odd off-by-one error in GHC’s internal static linker for the Mach-O file format on AArch64. This is our beloved macOS/iOS platform!

For a while we have been observing odd non-deterministic failures in CI. We’d see a bunch of test for the aarch64-darwin runner fail, but rarely the same, and if re-run different tests would fail. The only observable correlation was that the failing tests all somehow had to do with iserv, ghc’s external interpreter. This did hint towards an issue in the linker, but no clear indication what was really the fundamental problem.

While Matthew Pickering and I eventually managed to isolate tests from the test-suite that had slightly higher failure rates (e.g. they’d fail every other or so time), getting the failure into the debugger didn’t help much. We’d just have a crash in invalid memory. SIGBUS. The program counter would have advanced too far to be useful, we didn’t know where we came from. This is especially the case for code GHC generates for Haskell. We don’t really “return”, but have knowledge of where we’d eventually want to continue, thus assembly generated by GHC makes lots of use of (un)conditional branch instructions, but rarely branch and link instructions that would record the caller in the link register; and permit to return.

We don’t really have rr yet on aarch64-darwin to my knowledge, nor does lldb support something akin to gdb’s record feature. And of course we don’t have gcc/gdb on aarch64-darwin just yet.

So what we really want is to be able to record where we come from, but we also do not want to mess with the link register, otherwise we’d likely break unrelated code. Let’s go swimming. While meditating on this issue for a bit, I did remember that we reserved a register r16 for spill/reload offset computations that were too large to handle. (AArch64 can’t do arbitrary offsets as immediate, so we occasionally need to add two registers together to obtain the offset value we want). We only care about the registers value for a branch instruction, thus it’s outside of any spill/reload computations. Looks like we can use r16. After modifying the aarch64 codegen to write the current program location into r16 prior to any branch instruction via ADR x16 ., we should now be able to inspect r16 when we hit SIGBUS in the debugger again. After a rebuild, and a few tries to get the crash into the debugger, we are presented with the following debug session:

(lldb) p/x $x16
(unsigned long) $11 = 0x000000010ff99d54
(lldb) p/x $pc
(unsigned long) $12 = 0x0000000108384658
(lldb) dis -s $x16-0x20 -c 20
    0x10ff99d34: cmp    x18, x28
    0x10ff99d38: adr    x16, #0x0
    0x10ff99d3c: b.lo   0x10ff99dbc
    0x10ff99d40: adrp   x18, 0
    0x10ff99d44: add    x18, x18, #0xde0          ; =0xde0
    0x10ff99d48: stur   x18, [x20, #-0x10]
    0x10ff99d4c: stur   x24, [x20, #-0x8]
    0x10ff99d50: sub    x20, x20, #0x10           ; =0x10
    0x10ff99d54: adr    x16, #0x0
    0x10ff99d58: b      0x108384658
    0x10ff99d5c: mov    w18, #0x10
    0x10ff99d60: str    x18, [x19, #0x388]
    0x10ff99d64: adr    x16, #0x0
    0x10ff99d68: b      0x10ff9e57c
    0x10ff99d6c: udf    #0x1
    0x10ff99d70: udf    #0x0
    0x10ff99d74: udf    #0x1e
    0x10ff99d78: udf    #0x0
    0x10ff99d7c: add    x21, x21, #0x10           ; =0x10
    0x10ff99d80: ldr    x18, [x19, #0x358]

We can see we ended up in 0x108384658, which is where the unconditional branch instruction in 0x10ff99d58, which is following our adr x16, 0x0 instruction as expected.

I’ve ran the experiment a few more times to make sure all the crashes looked sufficiently similar and we’d unlikely have to deal with a variety of odd crashes that just happened to present themselves quite similar.

Now we have an address, but we don’t know where this actually came from. GHC’s Run Time System has a function called printLoadedObjects, which when invoked will print a table of all object files the internal linker loaded into memory. (Again I couldn’t get lldb to execute that function, so I ended up just printing all loaded objects after each symbol resolution pass…)

The output of printLoadedObjects looks like this (just a lot more)

_build/stage1/lib/aarch64-osx-ghc-9.3.20210621/template-haskell-2.18.0.0/libHStemplate-haskell-2.18.0.0.a(Internal.o)
      sec  0[alloc: 2; kind: 0]: 0x10ff70000 - 0x10ff9e538; mmaped: 0x10ff70000 - 0x10ffa8000
      sec  1[alloc: 2; kind: 1]: 0x10efe0000 - 0x10efe1098; mmaped: 0x10efe0000 - 0x10efe4000
      sec  2[alloc: 2; kind: 4]: 0x10efe4000 - 0x10efe40b6; mmaped: 0x10efe4000 - 0x10efe8000

We know we called from 0x10ff99d58, which is in the first (section 0), of the Internal.o file from template-haskell. Disassembling Internal.o to verify we found the right location yields the following:

_templatezmhaskell_LanguageziHaskellziTHziLibziInternal_litE_info:
0000000000029d30	sub	x18, x20, #0x10
0000000000029d34	cmp	x18, x28                        ; Latency: 2
0000000000029d38	adr	x16, #0x0
0000000000029d3c	b.lo	0x29dbc
0000000000029d40	adrp	x18, 0 ; 0x29000
0000000000029d44	add	x18, x18, #0x0
0000000000029d48	stur	x18, [x20, #-0x10]              ; Latency: 4
0000000000029d4c	stur	x24, [x20, #-0x8]               ; Latency: 4
0000000000029d50	sub	x20, x20, #0x10
0000000000029d54	adr	x16, #0x0
0000000000029d58	b	0x29d58
0000000000029d5c	mov	w18, #0x10
0000000000029d60	str	x18, [x19, #0x388]              ; Latency: 4
0000000000029d64	adr	x16, #0x0
0000000000029d68	b	0x29d68
0000000000029d6c	udf	#0x1
0000000000029d70	udf	#0x0
0000000000029d74	udf	#0x1e
0000000000029d78	udf	#0x0

This looks rather promising and is exactly where we had our crash. So what happened?

Interlude: Relocations

Relocations are records in object files that the linker can use to resolve symbols. This is important as we do not know where the final symbol might end up in memory, and thus need a way to reference a symbol for which we do not know its resting place. We therefore record symbolic pointers. The linkers job is then to resolve those pointers. The pointer contains a location where we need to patch up an address, the name of the symbol that we want to point to, and one of a few pre-defined relocation types. The linker has to iterate over all relocation after it loaded an object, for each object it loads.

We can ask otool for relocations, and providing -v we’ll get some easier to read nameing as well.

address  pcrel length extern type    scattered symbolnum/value
00029d58 True  long   True   BR26    False     _stg_gc_unpt_r1

So for 0x29d58 (relative to the resting place of the object file in memory), there is a BR26 (Branch 26) relocation, that points to the Extern symbol _stg_gc_unpt_r1. And this is what we’d end up patching up in the object file. This somehow goes wrong.

Let’s look at the broken relocation again: We jump to 0x108384658 from 0x10ff99d58, so the relative value in the branch instruction is 0x108384658 - 0x10ff99d58 = 0xf83EA900. Note the leading 0xf?, yes that’s a negative value. Let’s look at it in binary and we see

1111 1000 0011 1110 1010 1001 0000 0000
|         |         |         |       |
31        23        15        7       0

The BR26 relocation can encode 26 bits of relocation information (e.g. encode them right in the Branch instruction). However due to guaranteed alignment on 4 bytes, we do not need to encode the last two bits as they will always be zero. This gives us a range of 28. Which however needs to include the sign bit. So we can really only encode values in the range of +-2^27. For values outside of this range we’ll create a forward jump to the final destination, that is reachable. At this point we might already have a good idea why this failed. And yes, the linker did check for +-2^28, and thus when we just happened to hit a positive value with the 28th bit set, we’d write it into the relocation, and call it a day. Just for the CPU to read it as a negative jump and jump to a random address.

Thus the final fix was just an additional -1, to the range check for relocations of the BR26 type. And thus we can conclude the mystery of the random crashes on aarch64-darwin (macOS), and the story of an odd off-by-one error in GHC’s in-memory linker.

On a final note: if you like this write up and want to support this work as well as GHC’s Continuous Integration, and you also happen hold Cardanos Crypto Currency ADA, I’d like to invite you to stake your ADA with the ZW3RK pool. It is a very competitive 1% pool, and operational rewards will go towards operating part of GHC’s CI infrastructure and write ups like these.