exclusive store workaround fails on some AArch64 hardware with multi-block sequences
Split from #1698 in order to find a quick workaround.
This focuses on memtrace_simple, though other tools such as drmemtrace behave similarly.
Our current workaround is to skip inserting a clean call before an exclusive store. That worked on the ARM hardware we tested on, but it is not sufficient on some AArch64 hardware. I'm seeing spinning when a clean call is added in front of the OP_ldaxr in the 3rd block of this sequence:
interp: start_pc = 0x0000ffff80ba53e8
0x0000ffff80ba53e8 a9bc7bfd stp %x29 %x30 %sp $0xffffffffffffffc0 -> -0x40(%sp)[16byte] %sp
0x0000ffff80ba53ec 52800021 movz $0x0001 lsl $0x00 -> %w1
0x0000ffff80ba53f0 910003fd add %sp $0x0000 lsl $0x00 -> %x29
0x0000ffff80ba53f4 a90153f3 stp %x19 %x20 -> +0x10(%sp)[16byte]
0x0000ffff80ba53f8 d0000893 adrp <rel> 0x0000ffff80cb7000 -> %x19
0x0000ffff80ba53fc f90013f5 str %x21 -> +0x20(%sp)[8byte]
0x0000ffff80ba5400 aa0003f5 orr %xzr %x0 lsl $0x00 -> %x21
0x0000ffff80ba5404 91286260 add %x19 $0x0a18 lsl $0x00 -> %x0
0x0000ffff80ba5408 b9003fbf str %wzr -> +0x3c(%x29)[4byte]
0x0000ffff80ba540c 885ffc02 ldaxr (%x0)[4byte] $0x1f $0x1f -> %w2
0x0000ffff80ba5410 7100005f subs %w2 $0x0000 lsl $0x00 -> %wzr
0x0000ffff80ba5414 54000061 b.ne $0x0000ffff80ba5420
end_pc = 0x0000ffff80ba5418
interp: start_pc = 0x0000ffff80ba5418
0x0000ffff80ba5418 88037c01 stxr %w1 $0x1f -> (%x0)[4byte] %w3
0x0000ffff80ba541c 35ffff83 cbnz $0x0000ffff80ba540c %w3
end_pc = 0x0000ffff80ba5420
interp: start_pc = 0x0000ffff80ba540c
0x0000ffff80ba540c 885ffc02 ldaxr (%x0)[4byte] $0x1f $0x1f -> %w2
0x0000ffff80ba5410 7100005f subs %w2 $0x0000 lsl $0x00 -> %wzr
0x0000ffff80ba5414 54000061 b.ne $0x0000ffff80ba5420
end_pc = 0x0000ffff80ba5418
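
For reference, the exclusive load/store pair in the log above can be recognized directly from the raw encodings. The following is a minimal standalone sketch, not DynamoRIO API: the function names are hypothetical, and the bit positions are taken from the A64 "load/store exclusive" encoding class. It only illustrates how an instrumentation pass might decide which instructions bracket an exclusive-monitor region; it does not by itself fix the spinning.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers (not part of DynamoRIO): classify a raw A64
 * encoding as an exclusive-monitor load or store. */

/* The A64 load/store exclusive class has bits [29:24] == 0b001000. */
static bool
in_ldst_exclusive_class(uint32_t enc)
{
    return ((enc >> 24) & 0x3f) == 0x08;
}

/* Within that class, keep only the forms that use the exclusive monitor. */
static bool
uses_exclusive_monitor(uint32_t enc)
{
    if (!in_ldst_exclusive_class(enc))
        return false;
    /* o2 (bit 23) set: LDAR/STLR-style or CAS forms, no monitor involved. */
    if ((enc >> 23) & 1)
        return false;
    /* o1 (bit 21) set with bit 31 clear is CASP, not LDXP/STXP. */
    if (((enc >> 21) & 1) && !((enc >> 31) & 1))
        return false;
    return true;
}

/* L (bit 22) distinguishes loads (1) from stores (0). */
bool
enc_is_exclusive_load(uint32_t enc)
{
    return uses_exclusive_monitor(enc) && ((enc >> 22) & 1);
}

bool
enc_is_exclusive_store(uint32_t enc)
{
    return uses_exclusive_monitor(enc) && !((enc >> 22) & 1);
}
```

With the encodings from the log, `enc_is_exclusive_load(0x885ffc02)` holds for the ldaxr and `enc_is_exclusive_store(0x88037c01)` holds for the stxr, while ordinary stores such as `0xb9003fbf` match neither predicate.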