Changes the reuse_distance and reuse_time tools to operate on each thread separately and then aggregate the final results across threads.
Implements the new parallel analysis interface to perform these computations in parallel, which reduces the substantial additional overhead of per-thread computation.
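For illustration only, the per-thread-then-aggregate approach described above might be sketched as follows. This is not the tools' actual code or the parallel analysis interface: the function names `reuse_distances` and `analyze`, the list-based LRU stack, the use of a thread pool, and the sample addresses are all assumptions made for this sketch.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def reuse_distances(addresses):
    """Histogram of reuse distances for one thread's address stream.

    The distance is the number of distinct addresses touched since the
    previous access to the same address. First-time accesses are skipped.
    """
    stack = []  # hypothetical LRU stack; most recently used is at the end
    hist = Counter()
    for addr in addresses:
        if addr in stack:
            pos = stack.index(addr)
            hist[len(stack) - 1 - pos] += 1
            stack.pop(pos)
        stack.append(addr)
    return hist


def analyze(per_thread_traces):
    """Process each thread's stream independently, then merge the results."""
    with ThreadPoolExecutor() as pool:
        shard_results = list(pool.map(reuse_distances, per_thread_traces.values()))
    total = Counter()
    for hist in shard_results:
        total.update(hist)  # Counter.update adds counts per distance
    return total


# Hypothetical two-thread trace, keyed by thread id.
traces = {
    1: [0x10, 0x20, 0x10, 0x30, 0x20],
    2: [0x10, 0x10, 0x40],
}
result = analyze(traces)
```

In this sketch, thread 2's accesses to 0x10 do not affect the distances computed for thread 1, because each thread keeps its own stack and only the final histograms are combined.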
Adds a new checked-in offline multi-threaded (x64-only) trace and adds reuse_distance and reuse_time tests using it.
Updates the documentation's sample reuse_distance and reuse_time output.
Fixes #3327