CRASH during detach in now-native thread
An app with a statically-linked DR often crashes during detach:
<Starting application myapp (595523)>
<Attached to 653 threads in application myapp (595523)>
<Detaching from application myapp (595523)>
SIGSEGV received by PID 595523 (TID 598205)
PC: @ 0x4b3dc5d (unknown) safe_read_tls_magic
@ 0xfee3683 880 appSignalHandler()
@ 0x7f8574c47390 1440 (unknown)
@ 0x4bf5836 992 master_signal_handler_C
@ 0x4b3d883 321120 (unknown)
(gdb) thread find 598205
Thread 1 has target id 'LWP 598205'
(gdb) dps $rsp $rsp+8000
<...>
0x00007f76be49ddc8 0x000000000fee36d3 appSignalHandler
<...>
0x00007f76be49e6c8 0x0000000004be9559 get_thread_private_dcontext + 89 in section .text
0x00007f76be49e6d0 0x00007f76be49eab0 No symbol matches (void *)$retaddr.
0x00007f76be49e6d8 0x0000000004bf5836 master_signal_handler_C + 54 in section .text
0x00007f76be49e6e0 0x0000000000000000 No symbol matches (void *)$retaddr.
<...>
0x00007f76be49eab8 0x0000000004b3d883 dynamorio_sigreturn in section .text
(gdb) dps 0x00007f76be49eab8 0x00007f76be49eab8+512
0x00007f76be49eab8 0x0000000004b3d883 dynamorio_sigreturn in section .text
0x00007f76be49eac0 0x0000000000000001 No symbol matches (void *)$retaddr.
0x00007f76be49eac8 0x0000000000000000 No symbol matches (void *)$retaddr.
sp 0x00007f76be49ead0 0x00007f76be48f000 No symbol matches (void *)$retaddr.
<...>
rsp 0x00007f76be49eb60 0x00007f76be4ecf50 No symbol matches (void *)$retaddr.
rip 0x00007f76be49eb68 0x00000000073fc4dc appfunc
<...>
sig 0x00007f76be49ebf0 0x000000000000001b No symbol matches (void *)$retaddr.
0x00007f76be49ebf8 0x0000000000000080 No symbol matches (void *)$retaddr.
This is just a SIGPROF (sig 0x1b) arriving at a random point in a thread that has already been detached and is now native. Our handler is still in place, and it calls get_thread_private_dcontext().
Looking back down the stack at the SIGSEGV:
rsp 0x00007f76be49e1e0 0x00007f76be49e6c8 No symbol matches (void *)$retaddr.
rip 0x00007f76be49e1e8 0x0000000004b3dc5d safe_read_tls_magic in section .text
<...>
sig 0x00007f76be49e270 0x000000000000000b No symbol matches (void *)$retaddr.
0x00007f76be49e278 0x0000000000000001 No symbol matches (void *)$retaddr.
(gdb) x/2i 0x0000000004b3dc5d
0x4b3dc5d <safe_read_tls_magic>: mov %gs:0x60,%eax
So this is the expected fault (sig 0xb, SIGSEGV) from reading %gs after we have removed our segment. Why, then, didn't the SIGSEGV reach our safe_read_tls_magic check and get redirected from there to safe_read_tls_magic_recover? Is it a race where we removed our handler before the SIGSEGV was delivered, so that it went to the app's handler instead? We remove our handler once we detach from the final thread; or, actually, only once we have also done thread exit for the detaching thread, right?
(gdb) p dynamo_exited
$2 = 0
(gdb) p doing_detach
$3 = 256
(gdb) p dynamo_detaching_flag
$4 = -1
(gdb) p dynamo_exited_and_cleaned
$5 = 0
(gdb) p num_known_threads
$6 = 0
(gdb) p dynamo_initialized
$7 = 0
(gdb) p do_once_generation
$8 = 2
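Judging by these values (dynamo_initialized and dynamo_exited are both 0, and do_once_generation is 2), it looks like dynamo_exit_post_detach() has run, though maybe the detacher made further progress while the fault was being processed.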
Proposal: check doing_detach in master_signal_handler_C and, if it is set and the signal is an alarm signal, just drop it on the floor? Or try to invoke the app handler if the signal is not SIGUSR2 (or a fault?).
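One possible reading of the proposal, as a standalone sketch rather than a patch to master_signal_handler_C: the enum, the helper names, and the choice to combine both options into a single policy are all assumptions made for illustration. The `detaching` parameter stands in for doing_detach.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical dispositions for a signal that arrives mid-detach. */
typedef enum {
    SIG_ACTION_HANDLE_NORMALLY, /* keep the existing handler path */
    SIG_ACTION_DROP,            /* drop it on the floor */
    SIG_ACTION_FORWARD_TO_APP   /* try to invoke the app's handler */
} sig_action_t;

/* Alarm/timer signals: losing one is assumed to be tolerable. */
static bool is_alarm_signal(int sig)
{
    return sig == SIGALRM || sig == SIGVTALRM || sig == SIGPROF;
}

/* Synchronous faults, which the proposal leaves as an open question. */
static bool is_fault_signal(int sig)
{
    return sig == SIGSEGV || sig == SIGBUS || sig == SIGILL || sig == SIGFPE;
}

/* 'detaching' stands in for the doing_detach flag from the report. */
static sig_action_t classify_signal_during_detach(int sig, bool detaching)
{
    if (!detaching)
        return SIG_ACTION_HANDLE_NORMALLY;
    if (is_alarm_signal(sig))
        return SIG_ACTION_DROP;
    /* SIGUSR2 and faults are excluded from forwarding, as in the proposal. */
    if (sig == SIGUSR2 || is_fault_signal(sig))
        return SIG_ACTION_HANDLE_NORMALLY;
    return SIG_ACTION_FORWARD_TO_APP;
}
```

Under this policy the SIGPROF in the trace above would be dropped before reaching get_thread_private_dcontext(), so the read of %gs in safe_read_tls_magic would not happen on that path. Whether dropping or forwarding is the right choice for alarm signals is left open, as in the proposal.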