drreg state restore problems with DR spill slots
First, drreg fails to properly identify DR spill slots, because it assumes the TLS slots are ordered low to high (see the sketch after the backtrace below).
Second, once that bug is fixed, drreg starts restoring registers it did not restore before (previously it thought they were not its spills), which leads to this assert:
<Application /work/dr/git/build_x64_dbg_tests/clients/bin64/tool.drcacheoff.burst_threads (16044) DynamoRIO usage error : dr_read_saved_reg(): drcontext does not belong to current thread>
#5 0x00000000004b2803 in external_error (file=0x73aac8 "/work/dr/git/src/core/lib/instrument.c", line=5734,
msg=0x73c910 "dr_read_saved_reg(): drcontext does not belong to current thread") at /work/dr/git/src/core/utils.c:200
#6 0x00000000005be4f1 in dr_read_saved_reg (drcontext=0x475c2080, slot=SPILL_SLOT_1) at /work/dr/git/src/core/lib/instrument.c:5733
#7 0x00000000006d6580 in get_spilled_value (drcontext=0x475c2080, slot=4) at /work/dr/git/src/ext/drreg/drreg.c:253
#8 0x00000000006db73e in drreg_event_restore_state (drcontext=0x475c2080, restore_memory=0 '\000', info=0x7f3feacec130)
at /work/dr/git/src/ext/drreg/drreg.c:1767
#9 0x00000000006def9f in drmgr_restore_state_event (drcontext=0x475c2080, restore_memory=0 '\000', info=0x7f3feacec130)
at /work/dr/git/src/ext/drmgr/drmgr.c:1864
#10 0x00000000005b54dc in instrument_restore_state (dcontext=0x475c2080, restore_memory=false, info=0x7f3feacec130)
at /work/dr/git/src/core/lib/instrument.c:1715
#11 0x00000000005c91c6 in recreate_app_state_internal (tdcontext=0x475c2080, mcontext=0x7f3feacec800, just_pc=false, owning_f=0x0,
restore_memory=false) at /work/dr/git/src/core/translate.c:1208
#12 0x00000000005c9598 in recreate_app_state (tdcontext=0x475c2080, mcontext=0x7f3feacec800, restore_memory=false, f=0x0)
at /work/dr/git/src/core/translate.c:1331
#13 0x00000000005a2c99 in at_safe_spot (trec=0x474ae128, mc=0x7f3feacec800, desired_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT)
at /work/dr/git/src/core/synch.c:616
#14 0x00000000005a4156 in synch_with_thread (id=16047, block=false, hold_initexit_lock=true, caller_state=THREAD_SYNCH_NONE,
desired_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT, flags=17) at /work/dr/git/src/core/synch.c:1034
#15 0x00000000005a5801 in synch_with_all_threads (desired_synch_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT,
threads_out=0x7f3feacece90, num_threads_out=0x7f3feacece8c, cur_state=THREAD_SYNCH_NO_LOCKS_NO_XFER, flags=20)
at /work/dr/git/src/core/synch.c:1414
#16 0x00000000005a7baf in detach_on_permanent_stack (internal=true, do_cleanup=true) at /work/dr/git/src/core/synch.c:2058
That assert should be removed: passing a dcontext for another thread is a valid usage scenario during a synchall.
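For reference, here is a minimal sketch of the first problem. This is assumed logic, not a verbatim drreg excerpt, and offs_is_dr_slot is a hypothetical helper: mapping the TLS offset seen in a decoded spill back to a DR slot index only works if DR lays its slots out contiguously at ascending offsets, which is exactly the assumption that does not hold.

#include "dr_api.h"

/* Hypothetical helper illustrating the flawed assumption: DR's TLS spill
 * slots are treated as a contiguous, low-to-high run starting at
 * SPILL_SLOT_1's offset.  When the real layout is not ordered this way,
 * DR spill slot accesses are misidentified. */
static bool
offs_is_dr_slot(void *drcontext, uint offs, uint *slot_out)
{
    uint base = (uint)opnd_get_disp(dr_reg_spill_slot_opnd(drcontext, SPILL_SLOT_1));
    uint last = (uint)opnd_get_disp(
        dr_reg_spill_slot_opnd(drcontext, dr_max_opnd_accessible_spill_slot()));
    if (offs < base || offs > last)
        return false; /* wrong whenever the slots are not ascending from base */
    *slot_out = (uint)((offs - base) / sizeof(reg_t));
    return true;
}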
The underlying issue is that drreg sees what looks like an rax spill with no corresponding restore:
0x4ef7373f: movabs %rax,%gs:0x8
0x4ef7374a: lahf
0x4ef7374b: seto %al
0x4ef7374e: cmp %gs:0x8,%rcx
0x4ef73757: jne 0x4ef3830f
Here is the source, the x64 stay-on-trace cmp:
+267 L3 65 48 a3 00 00 00 00 mov %rax -> %gs:0x00[8byte]
00 00 00 00
+278 L3 48 b8 a4 45 61 4c 60 mov $0x00007f604c6145a4 -> %rax
7f 00 00
+288 L3 65 48 a3 08 00 00 00 mov %rax -> %gs:0x08[8byte]
00 00 00 00
+299 L3 9f lahf -> %ah
+300 L3 0f 90 c0 seto -> %al
+303 L3 65 48 3b 0c 25 08 00 cmp %rcx %gs:0x08[8byte]
00 00
+312 L4 @0x0000000054505698 0f 85 19 63 f8 ff jnz $0x00000000543f8f0f <shared_trace_cmp_ret>
+318 L3 65 48 8b 0c 25 10 00 mov %gs:0x10[8byte] -> %rcx
00 00
+327 L3 65 48 a1 00 00 00 00 mov %gs:0x00[8byte] -> %rax
00 00 00 00
So that's not a spill: DR is just using that TLS slot instead of a 3rd scratch reg. That's TLS_REG1_SLOT, which is the 3rd and final one exposed to clients.
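To make the failure mode concrete, here is a minimal sketch of a restore-state walk of the kind drreg performs; the two instr_is_tls_* decoders are hypothetical helpers, and this is not drreg's actual code. The walk records the last slot each GPR was spilled to and clears that on a restore; any register still marked spilled at the translation point gets read back from its slot. Since the stay-on-trace cmp spills rax with no restore, such a walk ends with rax marked spilled and clobbers the app's value.

#include "dr_api.h"

#define SLOT_NONE ((uint)-1)

/* Hypothetical decoders (not DR API): recognize "mov %reg -> %gs:slot"
 * and "mov %gs:slot -> %reg" respectively. */
static bool instr_is_tls_spill_of_gpr(void *dc, instr_t *i, reg_id_t *reg, uint *slot);
static bool instr_is_tls_restore_of_gpr(void *dc, instr_t *i, reg_id_t *reg, uint *slot);

/* Sketch of a spill-tracking walk over the fragment's instruction list. */
static void
scan_for_outstanding_spills(void *drcontext, instrlist_t *ilist,
                            uint spilled_slot[DR_NUM_GPR_REGS])
{
    for (uint i = 0; i < DR_NUM_GPR_REGS; i++)
        spilled_slot[i] = SLOT_NONE;
    for (instr_t *instr = instrlist_first(ilist); instr != NULL;
         instr = instr_get_next(instr)) {
        reg_id_t reg;
        uint slot;
        if (instr_is_tls_spill_of_gpr(drcontext, instr, &reg, &slot))
            spilled_slot[reg - DR_REG_START_GPR] = slot;
        else if (instr_is_tls_restore_of_gpr(drcontext, instr, &reg, &slot))
            spilled_slot[reg - DR_REG_START_GPR] = SLOT_NONE;
    }
    /* Anything still != SLOT_NONE is presumed spilled and will be read back:
     * exactly the wrong call for the no-restore trace-cmp use above. */
}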
Note that drreg already handles DR slots not lasting across app instrs by restoring them non-lazily, but it looks like we have more fundamental issues with state restore and DR slots. For simple DR uses that are a local spill plus restore (such as for rip-rel mangling, though that uses a slot not exposed to clients), a state restore in between would simply restore the register twice, once by DR and once by drreg, with no ill effects.
But we have this no-restore DR use for the trace cmp, and another tool component could be using DR slots too.
Is the only safe thing to do to require enough dedicated drreg-only raw-TLS slots that drreg never uses DR slots? Just avoiding the 3rd slot would reduce the likelihood but not eliminate the theoretical problems.
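The client-facing side of that direction is just asking drreg_init() for enough dedicated slots up front. A sketch, assuming a tool that needs 4 simultaneous slots (the count is an assumption; each tool would size it for its own worst case):

#include "dr_api.h"
#include "drreg.h"

DR_EXPORT void
dr_client_main(client_id_t id, int argc, const char *argv[])
{
    /* Request enough dedicated raw-TLS slots that drreg never falls back
     * to dr_save_reg()'s shared DR slots. */
    drreg_options_t ops = { sizeof(ops), 4 /*num_spill_slots*/, false /*conservative*/ };
    if (drreg_init(&ops) != DRREG_SUCCESS)
        DR_ASSERT(false);
}

Of course this only keeps drreg itself out of DR's slots; as noted above, other components calling dr_save_reg() directly would still be outside drreg's view.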
It would be nice if, at least for recreation (as opposed to stored-info state restores), we could combine extra information gathered during recreation with the raw mcontext snapshot.