Adds a thread filtering feature to drcachesim, driven by a new API routine drmemtrace_filter_threads() that decides for each thread whether to trace it. In the tracer, an untraced thread has no trace buffer, and the missing buffer serves as the flag that tells the instrumentation to skip that thread. The flag is checked via the same mechanism used to skip the clean call, using jecxz on x86; the shared code is factored out into a new utility, insert_conditional_skip().
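As a rough illustration of the "no buffer means untraced" idea, here is a minimal standalone sketch in plain C. It is not the tracer code and does not use the real drmemtrace_filter_threads() signature: every `sketch_` name, the struct layout, and the one-byte entry format are assumptions made up for this example. The NULL check in `sketch_record()` stands in for the inline check that the tracer emits (jecxz on x86 jumps when the register holds zero).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback type: returns true if thread `tid` should be traced. */
typedef bool (*sketch_filter_cb_t)(int tid, void *user_data);

/* Hypothetical per-thread state. A NULL buf is the "skip instrumentation" flag. */
typedef struct {
    int tid;
    unsigned char *buf;
    size_t buf_size;
    size_t used;
} sketch_thread_t;

static sketch_filter_cb_t sketch_cb = NULL;
static void *sketch_cb_data = NULL;

/* Registers the per-thread filter (a stand-in for the new API routine). */
void
sketch_filter_threads(sketch_filter_cb_t cb, void *user_data)
{
    sketch_cb = cb;
    sketch_cb_data = user_data;
}

/* Thread init: only threads accepted by the filter get a trace buffer.
 * Returns false only if allocation fails for a thread that should be traced.
 */
bool
sketch_thread_init(sketch_thread_t *t, int tid, size_t buf_size)
{
    assert(t != NULL);
    t->tid = tid;
    t->buf = NULL;
    t->buf_size = 0;
    t->used = 0;
    if (sketch_cb != NULL && !sketch_cb(tid, sketch_cb_data))
        return true; /* Untraced thread: leave buf NULL. */
    t->buf = malloc(buf_size);
    if (t->buf == NULL)
        return false;
    t->buf_size = buf_size;
    return true;
}

/* Models the instrumented path. Returns true if an entry was written. */
bool
sketch_record(sketch_thread_t *t, unsigned char entry)
{
    assert(t != NULL);
    if (t->buf == NULL)
        return false; /* Zero buffer pointer: skip, as the jecxz check would. */
    if (t->used == t->buf_size)
        return false; /* A real tracer would flush here; the sketch drops. */
    t->buf[t->used++] = entry;
    return true;
}

/* Thread exit: releases the buffer, if any. */
void
sketch_thread_exit(sketch_thread_t *t)
{
    assert(t != NULL);
    free(t->buf);
    t->buf = NULL;
    t->buf_size = 0;
    t->used = 0;
}

/* Example filter: trace only the thread whose id is stored in user_data. */
bool
sketch_trace_only_tid(int tid, void *user_data)
{
    return tid == *(int *)user_data;
}
```

To try it, register `sketch_trace_only_tid` with a pointer to the wanted thread id, initialize two threads, and call `sketch_record()` on each: only the selected thread ends up with a buffer and recorded entries, while the other pays just the cost of the NULL check.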
The extra internal control flow adds some complexity in both regular and cache-filter modes, where we need to insert barriers and sometimes wait for spills we don't truly need. The benefit is that the resulting instrumentation is 3x faster than the alternative of always filling the per-thread buffers and filtering the threads only at I/O time.
Adds two new bursty trace tests of the thread filter, one with and one without the cache filter.
Fixes #2820