HANG on Appveyor hitting 1-hour timeout with logs ending much sooner
While moving Appveyor to VS2017/Win10 (#2924 (closed)) I hit several cases where Appveyor failed because the build hit their 1-hour time limit. Yet the full log ended at much less than 1 hour: 40 minutes, 20 minutes, and in the latest case 17 minutes. It ended during the 64-bit build a few times, and during the 32-bit build a few times. I ran locally to see if I could repro any popups or something and I did find a few, which I fixed in PR #3999. I thought it was all behind us but now I just saw another timeout:
https://ci.appveyor.com/project/DynamoRIO/dynamorio/builds/30450667
I've attached the full log. It ends with:
[00:17:28] test 261
[00:17:28] Start 261: code_api,disable_traces|client.events
[00:17:28]
[00:17:28] 261: Test command: C:\projects\dynamorio\build\build_debug-internal-32\bin32\drrun.exe "-s" "90" "-quiet" "-debug" "-use_dll" "C:/projects/dynamorio/build/build_debug-internal-32/lib32/debug/dynamorio.dll" "-exit0" "-stderr_mask" "0xC" "-msgbox_mask" "0" "-dumpcore_mask" "0x7d" "-staged" "-code_api" "-disable_traces" "-c" "C:/projects/dynamorio/build/build_debug-internal-32/suite/tests/bin/client.events.dll.dll" "--" "C:/projects/dynamorio/build/build_debug-internal-32/suite/tests/bin/client.events.exe" "C:/projects/dynamorio/build/build_debug-internal-32/suite/tests/bin/client.events.appdll.dll"
[00:17:28] 261: Test timeout computed to be: 600
[00:17:28] 261: appdll initialized
[00:17:28] 261: exception event redirect
[00:17:28] 261: Redirect success!
[00:17:28] 261: exit event 2
[00:17:28] 261: exit event 1
[00:17:28] 261: module load event 1 is called 1 time
[00:17:28] 261: module load event 2 is called 1 time
[00:17:28] 261: thread init event 1 is called 1 time
[00:17:28] 261: thread init event 2 is called 1 time
[00:17:28] 261: bb event 1 is called 1 time
[00:17:28] 261: bb event 2 is called 1 time
[00:17:28] 261: delete event 1 is called 1 time
[00:17:28] 261: delete event 2 is called 1 time
[00:17:28] 261: filter syscall event 1 is called 1 time
[00:17:28] 261: filter syscall event 2 is called 1 time
[00:17:28] 261: pre syscall event 1 is called 1 time
[00:17:28] 261: pre syscall event 2 is called 1 time
[00:17:28] 261: post syscall event 1 is called 1 time
[00:17:28] 261: post syscall event 2 is called 1 time
The time prefix is the total time since the build started: so if the log itself is not truncated, it sat there for 43 minutes from that point.
Two tests have not finished:
$ egrep 'Test #|Start [0-9]' ~/Downloads/log\ \(24\).txt | cut -c 31-90 | awk '{print $1}' | sed 's/\r$//' | sort | uniq -u
code_api,disable_traces|client.events
code_api|tool.drcachesim.invariants
The invariants test:
[00:14:25] Start 221: code_api|tool.drcachesim.invariants
[00:14:25] 221: Test command: C:\projects\dynamorio\build\build_debug-internal-32\bin32\drrun.exe "-s" "90" "-quiet" "-debug" "-use_dll" "C:/projects/dynamorio/build/build_debug-internal-32/lib32/debug/dynamorio.dll" "-exit0" "-stderr_mask" "0xC" "-msgbox_mask" "0" "-dumpcore_mask" "0x7d" "-staged" "-code_api" "-t" "drcachesim" "-ipc_name" "drtestpipe_invariants" "-test_mode" "--" "C:/projects/dynamorio/build/build_debug-internal-32/suite/tests/bin/client.winxfer.exe"
[00:14:25] 221: Test timeout computed to be: 600
[00:14:25] 221: About to create thread
[00:17:15] 221: <Timeout expired - 1st wait, possible deadlock (or you were debugging)
[00:17:15] 221: version 7.91.18291, custom build
[00:17:15] 221: -no_dynamic_options -client_lib 'C:\projects\dynamorio\build\build_debug-internal-32/clients/lib32/debug/drmemtrace.dll;0;-ipc_name drtestpipe_invariants -test_mode' -code_api -probe_api -dumpcore_mask 125 -stderr_mask 12 -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -staged -native_exec_
Looks very similar to an earlier 1-hour timeout where I thought it was from a VCRT abort messagebox I saw over Appveyor RDP (could not repro locally; did not attempt to debug over RDP). I configured VCRT to not popup on abort via _CrtSetReportMode in PR #2999. That abort was:
---------------------------
Microsoft Visual C++ Runtime Library
---------------------------
Debug Assertion Failed!
Program: ...io\build\build_debug-internal-64\clients\bin64\drcachesim.exe
File: C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\include\vector
Line: 1742
Expression: vector subscript out of range
For information on how your program can cause an assertion
failure, see the Visual C++ documentation on asserts.
(Press Retry to debug the application)
---------------------------
Abort Retry Ignore
---------------------------
In a local run there was one other msgbox I did not yet try to suppress, b/c the suite kept going and the app was killed properly:
win32.tls.exe popped this up but it disappeared after ~30s:
---------------------------
Microsoft Visual C++ Runtime Library
---------------------------
Debug Error!
Program: ...d_suite\build_debug-internal-32\suite\tests\bin\win32.tls.exe
abort() has been called
(Press Retry to debug the application)
---------------------------
Abort Retry Ignore
---------------------------
Several other tests have timed out (our CTest 90s timeout) with their last output 'About to create thread' on this VS2017/Win10 setup, presumably covered by #4058: 32-bit drutil-test; 64-bit callback; 32-bit drmgr-test up above.
So we have several action items:
- Add the _CrtSetReportMode to tools.h to avoid popups in test apps too
- Try to repro the hang in either client.events or drcachesim or the invariants test app
- If the app dies, on Windows drcachesim seems to wait forever for the pipe: can we add a timeout there
- Figure out and fix the underlying cause of the drcachesim vector abort()
- Figure out and fix the underlying cause of the win32.tls.exe abort()
- Try to repro and fix the problem at 'About to create thread'
Some of that may overlap with #4058.