CRASH and HANG attaching on AMD processors
Regular DR operation seems to work fine on AMD processors, but attach runs into strange problems. They stem from to-be-attached threads taking on the gs base of the master thread, which confuses the signal handler for the SIGUSR2 sent to take them over:
$ ./api.startstop
<Starting application api.startstop (129458)>
<unable to determine lib path for cross-arch execve>
<Initial options = -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<(1+x) Handling our fault in a TRY at 0x0000000000658809>
in dr_client_main
retakeover for cur-native thread 129458
TAKEOVER: 10 threads to take over
For master thread tid=129458, 0=>gs base 0x00000000528af000
TAKEOVER: publishing takeover records
TAKEOVER: will signal thread 129459
TAKEOVER: will signal thread 129460
TAKEOVER: will signal thread 129461
TAKEOVER: will signal thread 129462
TAKEOVER: will signal thread 129463
TAKEOVER: will signal thread 129464
TAKEOVER: will signal thread 129465
TAKEOVER: will signal thread 129466
TAKEOVER: will signal thread 129467
TAKEOVER: will signal thread 129468
For SIGUSR2 in tid=129459, 0=>gs base is 0x00000000528af000
For SIGUSR2 in tid=129467, 0=>gs base is 0x00000000554d3000
For SIGUSR2 in tid=129461, 0=>gs base is 0x0000000055512000
For SIGUSR2 in tid=129464, 0=>gs base is 0x0000000055512000
For SIGUSR2 in tid=129462, 0=>gs base is 0x00000000554a4000
For SIGUSR2 in tid=129460, 0=>gs base is 0x0000000000000000
TAKEOVER: received signal in thread 129467
TAKEOVER: received signal in thread 129464
For SIGUSR2 in tid=129466, 0=>gs base is 0x00000000554a4000
For SIGUSR2 in tid=129465, 0=>gs base is 0x0000000055512000
For SIGUSR2 in tid=129463, 0=>gs base is 0x0000000000000000
<For SIGUSR2 in tid=129459, clearing instead of 129458's; res=0>
TAKEOVER: received signal in thread 129459
TAKEOVER: received signal in thread 129466
TAKEOVER: received signal in thread 129465
TAKEOVER: received signal in thread 129463
For SIGUSR2 in tid=129468, 0=>gs base is 0x0000000000000000
TAKEOVER: received signal in thread 129461
TAKEOVER: received signal in thread 129468
TAKEOVER: received signal in thread 129460
TAKEOVER: received signal in thread 129462
<Application api.startstop (129458). Internal Error: DynamoRIO debug check failure: dr/git/src/core/unix/os.c:1981 !is_thread_tls_initialized()
(Error occurred @14 frags)
version 7.0.17906, custom build
-stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
This reproduces every time.
Other experiments show that on AMD, the gs base as read by arch_prctl is a bogus non-zero value at process startup, while on Intel processors it's always 0.
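For anyone who wants to check the startup value on their own machine, here is a minimal sketch of reading the calling thread's gs base through arch_prctl. It assumes x86-64 Linux and goes through the raw syscall, since glibc does not necessarily export an arch_prctl wrapper; the helper name read_gs_base is made up for illustration and is not DR code.

```c
#define _GNU_SOURCE
#include <asm/prctl.h>   /* ARCH_GET_GS */
#include <sys/syscall.h> /* SYS_arch_prctl */
#include <unistd.h>      /* syscall */

/* Hypothetical helper (not part of DR): stores the calling thread's
 * gs base in *base.  Returns 0 on success, -1 on failure with errno set.
 * x86-64 Linux only.
 */
static int
read_gs_base(unsigned long *base)
{
    return (int)syscall(SYS_arch_prctl, ARCH_GET_GS, base);
}
```

Calling this first thing in main() and printing the result with "%#lx" is enough to compare an AMD machine against an Intel one.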
In our signal handler, on receiving SIGUSR2, when dcontext is NULL or when dcontext->owning_thread doesn't match get_sys_thread_id(), setting the gs base to any non-zero value fixes the problem. Setting it to zero does not fix the problem.
It seems like there's a problem where the kernel fails to clear gs base on a context switch: we set it in our master thread and it bleeds over into one or more of the threads we signal. However, I looked at the kernel context switch code and I do not yet see how it could happen. Plus, a simple app that sets gs base in one thread and switches to another thread does not exhibit the problem. I'm still trying to nail down a minimal reproducer without DR.
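For reference, the kind of simple standalone test described above could look roughly like the sketch below. It is an illustration under assumptions, not the exact program that was run: all names are hypothetical, it targets x86-64 Linux, and it needs to be built with -pthread. The worker is created before the main thread sets its gs base, because a thread created afterwards would simply inherit the value through clone, which is not the case of interest. The worker then reads its own gs base from inside a SIGUSR2 handler, mirroring the attach path.

```c
#define _GNU_SOURCE
#include <asm/prctl.h>   /* ARCH_GET_GS, ARCH_SET_GS */
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* All names below are hypothetical; none of this is DR code. */
static char fake_tls[64]; /* stand-in for the master thread's TLS block */
static sem_t worker_ready, handler_done, worker_exit;
static volatile unsigned long gs_seen_in_handler;
static volatile int handler_res;

/* Runs in the worker: records the gs base that thread sees. */
static void
on_sigusr2(int sig)
{
    unsigned long base = 0;
    (void)sig;
    handler_res = (int)syscall(SYS_arch_prctl, ARCH_GET_GS, &base);
    gs_seen_in_handler = base;
    sem_post(&handler_done); /* sem_post is async-signal-safe */
}

/* Waits on a semaphore, retrying if a signal interrupts the wait. */
static void
wait_sem(sem_t *sem)
{
    while (sem_wait(sem) == -1 && errno == EINTR)
        ; /* retry */
}

static void *
worker(void *arg)
{
    (void)arg;
    sem_post(&worker_ready);
    wait_sem(&worker_exit); /* stay alive until main is done with us */
    return NULL;
}

/* Sets gs base in the calling thread, signals an already-running worker,
 * and stores the gs base the worker observed in *seen.
 * Returns 0 on success, -1 on any failure.
 */
static int
check_gs_leak(unsigned long *seen)
{
    pthread_t thr;
    struct sigaction act;

    memset(&act, 0, sizeof(act));
    act.sa_handler = on_sigusr2;
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGUSR2, &act, NULL) != 0)
        return -1;
    if (sem_init(&worker_ready, 0, 0) != 0 || sem_init(&handler_done, 0, 0) != 0 ||
        sem_init(&worker_exit, 0, 0) != 0)
        return -1;
    if (pthread_create(&thr, NULL, worker, NULL) != 0)
        return -1;
    wait_sem(&worker_ready);

    /* Only now set gs base in this thread, so the worker cannot have
     * inherited it at creation time. */
    if (syscall(SYS_arch_prctl, ARCH_SET_GS, (unsigned long)fake_tls) != 0)
        return -1;
    pthread_kill(thr, SIGUSR2);
    wait_sem(&handler_done);

    sem_post(&worker_exit);
    pthread_join(thr, NULL);
    syscall(SYS_arch_prctl, ARCH_SET_GS, 0UL); /* undo our change */

    *seen = gs_seen_in_handler;
    return handler_res;
}
```

If the value stored in *seen equals the address of fake_tls, the main thread's gs base has bled into the worker. A single worker and a single signal may well be too simple to trigger whatever DR is hitting, so a closer match would use ten workers, as in the log above, and repeat the check in a loop.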