For drmemtrace -trace_after_instrs, if the threshold is large (>10M), each thread now keeps its own instruction counter and only checks the global value every 10K instructions. The resulting count is approximate, which is fine for these use cases.
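As a rough illustration of the per-thread counting scheme described above, here is a minimal standalone C++ sketch. It is not the drmemtrace code: every name in it (`DelayCounter`, `ThreadState`, `count_instrs`, `kFlushInterval`, `instrs_until_trigger`) is hypothetical, and it assumes each thread folds its local count into the shared counter at the 10K boundary and compares the total against the threshold at that point. It also leaves out the choice between the exact path and this approximate path for thresholds above 10M.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical names throughout; none of these are DynamoRIO identifiers.
constexpr uint64_t kFlushInterval = 10000; // touch the global counter every 10K instrs

struct DelayCounter {
    std::atomic<uint64_t> global_count{0}; // shared by all threads
    uint64_t threshold;                    // instruction count at which tracing starts
    explicit DelayCounter(uint64_t t) : threshold(t) {}
};

struct ThreadState {
    uint64_t local_count = 0; // private to one thread, so no contention
};

// Records n executed instructions for one thread. The shared counter is only
// touched once the local count reaches kFlushInterval. Returns true when this
// thread observes that the global total has reached the threshold.
bool count_instrs(DelayCounter &dc, ThreadState &ts, uint64_t n) {
    ts.local_count += n;
    if (ts.local_count < kFlushInterval)
        return false; // fast path: thread-private update only
    uint64_t flushed = ts.local_count;
    ts.local_count = 0;
    // Assumption: the local count is added to the global one at each check.
    uint64_t total =
        dc.global_count.fetch_add(flushed, std::memory_order_relaxed) + flushed;
    return total >= dc.threshold;
}

// Single-threaded helper: counts one instruction at a time and returns how
// many instructions ran before the threshold was observed.
uint64_t instrs_until_trigger(uint64_t threshold) {
    DelayCounter dc(threshold);
    ThreadState ts;
    uint64_t executed = 0;
    do {
        ++executed;
    } while (!count_instrs(dc, ts, 1));
    return executed;
}
```

In this sketch a thread can overshoot the threshold by up to one flush interval: with a threshold of 25,000, a single thread counting one instruction at a time only observes the crossing at 30,000. That is the kind of approximation the description accepts in exchange for keeping the shared counter off the hot path.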
Tested on the original app, where contention on the global counter causes a 20x slowdown over plain DR. With the new scheme the slowdown over plain DR on this app is 14%, a massive speedup.
Fixes #5026