Adds concurrent post-processing of raw thread files by default. Adds a new option -jobs for selecting the worker thread pool size. Uses a simple static round-robin work assignment of traced threads to worker threads.
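The static round-robin assignment described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this change: the helper name `assign_round_robin` and the use of plain indices to stand for traced threads are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the actual raw2trace code): statically assigns
// traced-thread indices to workers in round-robin order. The assignment is
// computed once up front, so workers never need to coordinate to find work.
std::vector<std::vector<std::size_t>>
assign_round_robin(std::size_t num_traced_threads, std::size_t num_workers)
{
    assert(num_workers > 0);
    std::vector<std::vector<std::size_t>> tasks(num_workers);
    for (std::size_t i = 0; i < num_traced_threads; ++i)
        tasks[i % num_workers].push_back(i);
    return tasks;
}
```

With 5 traced threads and 2 workers, worker 0 would get threads 0, 2, 4 and worker 1 would get threads 1, 3. A static split like this is simple but does not rebalance if some traced threads have much larger raw files than others.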
Splits the decode cache into a separate cache per worker, to keep it lockless.
Adds a per-traced-thread last_decode_pc mini-cache to speed up rep string loop processing.
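A minimal sketch of how a per-worker decode cache and a per-traced-thread last-PC mini-cache might fit together is shown below. All type and function names here (`decoded_info_t`, `worker_cache_t`, `traced_thread_t`, `fake_decode`, `lookup`) are hypothetical stand-ins, and the "decode" step is a placeholder rather than a real instruction decoder; only the `last_decode_pc` field name is taken from the description above.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical summary of a decoded instruction.
struct decoded_info_t {
    int length;
};

// One cache per worker: only its owning worker touches it, so no lock is needed.
struct worker_cache_t {
    std::unordered_map<std::uint64_t, decoded_info_t> map;
    int decode_count = 0; // number of full decodes performed
};

// Per-traced-thread state holding the most recently looked-up PC.
struct traced_thread_t {
    std::uint64_t last_decode_pc = 0;
    const decoded_info_t *last_decode = nullptr;
    int mini_hits = 0; // lookups served by the mini-cache
};

// Placeholder for a real decoder (assumption made for this example).
decoded_info_t
fake_decode(std::uint64_t pc)
{
    return decoded_info_t{ static_cast<int>(1 + pc % 15) };
}

const decoded_info_t &
lookup(worker_cache_t &cache, traced_thread_t &thread, std::uint64_t pc)
{
    // Fast path: a rep string loop revisits the same PC many times in a row.
    if (thread.last_decode != nullptr && thread.last_decode_pc == pc) {
        ++thread.mini_hits;
        return *thread.last_decode;
    }
    auto it = cache.map.find(pc);
    if (it == cache.map.end()) {
        it = cache.map.emplace(pc, fake_decode(pc)).first;
        ++cache.decode_count;
    }
    // Pointers to unordered_map elements stay valid across later insertions.
    thread.last_decode_pc = pc;
    thread.last_decode = &it->second;
    return it->second;
}
```

In this sketch, repeated lookups of the same PC by one traced thread skip the hash table entirely, which is the case a rep string loop would exercise. Because a traced thread is processed by a single worker, the pointer it keeps always refers into that worker's own cache.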
Issue: #3230
Fixes #3129