drcachesim: optimize cache simulator
Created by: zhaoqin
Currently, the cache simulator runs at roughly 500x the cost of native execution. That overhead includes profiling overhead and communication overhead, but the cache simulation itself dominates the overall slowdown.
One simple optimization is to parallelize the cache simulator by splitting memory into sub-regions and running a separate cache simulator for each sub-region.
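
A minimal sketch of the partitioning idea, in Python rather than drcachesim's own C++ and not taken from its code. It assumes a toy set-associative LRU cache and assumes that the sub-regions are chosen so that each one maps to a disjoint group of cache sets, which is what lets the per-region simulators run independently and still add up to the same hit and miss counts as a single simulator. All names (`LruCache`, `simulate_serial`, `simulate_sharded`) and the cache parameters are hypothetical.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor

# Assumed toy cache geometry, not drcachesim defaults.
LINE_SIZE = 64   # bytes per cache line
NUM_SETS = 8     # sets in the whole simulated cache
ASSOC = 2        # ways per set


class LruCache:
    """Tiny set-associative LRU model that only counts hits and misses."""

    def __init__(self, num_sets, assoc):
        self.assoc = assoc
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, set_idx, tag):
        lines = self.sets[set_idx]
        if tag in lines:
            lines.move_to_end(tag)          # mark as most recently used
            self.hits += 1
        else:
            if len(lines) >= self.assoc:
                lines.popitem(last=False)   # evict the least recently used line
            lines[tag] = True
            self.misses += 1


def simulate_serial(addrs):
    """Baseline: one simulator sees every memory reference."""
    cache = LruCache(NUM_SETS, ASSOC)
    for addr in addrs:
        line = addr // LINE_SIZE
        cache.access(line % NUM_SETS, line // NUM_SETS)
    return cache.hits, cache.misses


def simulate_sharded(addrs, num_shards):
    """Split references by the cache set they map to; one simulator per shard."""
    if NUM_SETS % num_shards != 0:
        raise ValueError("num_shards must divide NUM_SETS")
    sets_per_shard = NUM_SETS // num_shards

    # Route each reference to the shard that owns its set, preserving order.
    queues = [[] for _ in range(num_shards)]
    for addr in addrs:
        line = addr // LINE_SIZE
        set_idx = line % NUM_SETS
        shard = set_idx // sets_per_shard
        queues[shard].append((set_idx % sets_per_shard, line // NUM_SETS))

    def run(queue):
        cache = LruCache(sets_per_shard, ASSOC)
        for local_set, tag in queue:
            cache.access(local_set, tag)
        return cache.hits, cache.misses

    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        results = list(pool.map(run, queues))
    return sum(h for h, _ in results), sum(m for _, m in results)
```

Because references to the same set keep their relative order inside a shard, the sharded totals match the serial ones in this model. The threads here only illustrate the structure: in CPython the global interpreter lock prevents a real speedup, so a native implementation would be needed to measure any gain.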