drmemtrace delay instruction counting causes 20x slowdown on multi-threaded AArch64 apps

Xref #4487 (closed), where inlined instruction counting was added for drmemtrace's delay feature, which greatly improved single-threaded performance.

However, running on a large multi-threaded AArch64 app, I'm seeing huge slowdowns from instruction counting: on the order of 20x versus plain DR!

I tried some experiments:

* Count down and use TBZ to avoid aflags clobbering: this doesn't make much of a difference, so it's not the bottleneck, but we should keep this approach as it should help once the bottleneck is gone.
* Remove drx_insert_counter_update() to avoid re-acquiring the address and value: again, not much difference compared to the big bottleneck.
* s/ldar/ldr;s/stlr/str => 2x faster, so the atomicity is part of the problem, but not the whole story.
* Use thread-local counters => nearly all of the 20x is gone (this was just a quick experiment without any conditional; the final overhead will be a little higher), so the fundamental issue is contention on the single global counter.

My proposal is to use thread-private counters, count down, use TBZ to see when they hit 0, and only then update the global value and check whether it has hit the threshold. A thread might update the global every 1K or 10K instructions. We'd use this mode for any large delay value, and we'd document that the threshold won't be hit exactly, which should be fine when asking to delay 4 billion instructions. For small values (say, <10M) we'd keep the current approach of updating the global counter every time.
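
To make the proposal concrete, here is a minimal C11 sketch of the batched scheme. It is not drmemtrace code: `FLUSH_BATCH`, `thread_counter_t`, `thread_counter_init()`, and `count_instrs()` are hypothetical names, the 10K batch size is just one of the values floated above, and the real version would be inlined instrumentation rather than a C function call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; none of these are DynamoRIO APIs. */

/* Instructions a thread counts privately between global updates (assumed value). */
#define FLUSH_BATCH 10000

/* Shared total; written only once per batch per thread. */
static _Atomic uint64_t global_count;

typedef struct {
    int64_t remaining; /* Thread-private countdown; plain loads and stores. */
} thread_counter_t;

static void
thread_counter_init(thread_counter_t *tc)
{
    assert(FLUSH_BATCH > 0);
    tc->remaining = FLUSH_BATCH;
}

/* Called once per basic block with that block's instruction count.
 * Returns true once the global total has reached the threshold.
 */
static bool
count_instrs(thread_counter_t *tc, uint32_t bb_instrs, uint64_t threshold)
{
    tc->remaining -= bb_instrs;
    /* Fast path: only the sign of the private counter is examined. In inlined
     * AArch64 code this check could be a TBZ on the sign bit, which does not
     * touch the arithmetic flags.
     */
    if (tc->remaining >= 0)
        return false;
    /* Slow path: publish everything counted since the last flush. */
    uint64_t counted = (uint64_t)(FLUSH_BATCH - tc->remaining);
    tc->remaining = FLUSH_BATCH;
    uint64_t total =
        atomic_fetch_add_explicit(&global_count, counted, memory_order_relaxed) +
        counted;
    return total >= threshold;
}
```

In this sketch the shared counter is touched roughly once per `FLUSH_BATCH` instructions per thread instead of once per basic block, and the threshold can be overshot by about one batch per running thread, which is the inexactness that would need documenting.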