Improve memory trace performance
This is a broad issue covering analyzing and improving the performance of the following:
- drcachesim's traces, both for online and offline trace analysis
- the memtrace and instrace samples
- Xref #1929 (closed): memtrace slowness due to unbuffered printf
- Xref #790: online trace compression
I wanted a place to dump some of my notes on this. The #790 notes are somewhat duplicated in that issue:
memtrace_binary sample perf: 70x (SSD) to 180x (HDD); 4x-25x (avg 18x) w/o disk; 36x (SSD) w/ no PC field
mcf test:
% cd /extsw/spec2006/mcf-test
% /usr/bin/time ./mcf_base.gcc-64bit inp.in
1.63user 0.03system 0:01.67elapsed 99%CPU (0avgtext+0avgdata 159668maxresident)k
% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./mcf_base.gcc-64bit inp.in
18.41user 10.45system 5:05.96elapsed 9%CPU (0avgtext+0avgdata 164712maxresident)k
=> clearly i/o bound: 9% CPU. Produces a 41GB file with 1.3 billion memrefs (~32 bytes per record); slowdown: 183x.
More like 70x on laptop, and higher %CPU:
% /usr/bin/time ./mcf_base.clang-64bit inp.in
2.00user 0.09system 0:02.09elapsed 99%CPU (0avgtext+0avgdata 159648maxresident)k
% /usr/bin/time ~/dr/git/build_x64_rel/bin64/drrun -c ~/dr/git/build_x64_rel/api/bin/libmemtrace_x86_binary.so -- ./mcf_base.clang-64bit inp.in
15.18user 21.44system 2:23.12elapsed 25%CPU (0avgtext+0avgdata 165128maxresident)k
Is that because the laptop has an SSD? Or also because its CPU is slower (a higher CPU-to-disk ratio, plus a slower native baseline)?
Disabling dr_write_file:
% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./mcf_base.gcc-64bit inp.in
9.28user 0.04system 0:09.41elapsed 99%CPU (0avgtext+0avgdata 164908maxresident)k
=> 5.6x. That's with each record containing the PC, a read/write flag, the size, and the address. It should be easy to improve by 2x by removing the read/write flag and size (both statically recoverable from the instruction) and only including the PC once per bb, or even less often.
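To make that concrete, here is a minimal sketch of the kind of shrunken record layout meant here (the struct names are hypothetical, not the sample's actual format):

```c
#include <stdint.h>

/* Hypothetical compact layout (a sketch, not the sample's real format):
 * the read/write flag and access size can be recovered statically by
 * disassembling at the recorded PC, so if the PC is emitted once per
 * basic block, each memref needs only its effective address. */
typedef struct {
    uint64_t bb_pc; /* one entry per basic block */
} bb_entry_t;

typedef struct {
    uint64_t addr; /* one entry per memref: address only */
} ref_entry_t;
```

That cuts a ~32-byte record down to 8 bytes per memref plus one 8-byte entry per bb.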
But it's much worse on the rest of SPEC. A full ref run was taking too long, so I killed it 9 hours in; the benchmarks showing 0.00 below had not yet finished (instrumented seconds / native seconds):
% spec2k6cmp result/CINT2006.045.ref.txt result/res-native-allc-int
400.perlbench 0.00 ( 0 / 207)
401.bzip2 16.28 ( 5029 / 309)
403.gcc 22.69 ( 4447 / 196)
429.mcf 4.27 ( 994 / 233)
445.gobmk 18.95 ( 5761 / 304)
456.hmmer 25.61 ( 7095 / 277)
458.sjeng 18.37 ( 5788 / 315)
462.libquantum 0.00 ( 0 / 299)
464.h264ref 0.00 ( 0 / 327)
471.omnetpp 0.00 ( 0 / 271)
473.astar 0.00 ( 0 / 267)
483.xalancbmk 0.00 ( 0 / 150)
average 17.70
Qin: "if memtrace is 100x, if you can make the profile 1/5 the size, can hit 20x"
Can shrink some fields, but not to 1/5. Online gzip compression should easily give 1/5. Simple test: I see >20x gzip compression (though w/ naive starting format):
% /work/dr/git/exports/bin64/drrun -c /work/dr/git/build_x64_dbg/api/bin/libmemtrace_x86_binary.so -- ls
% ls -sh /work/dr/git/build_x64_dbg/api/bin/*.log
12M /work/dr/git/build_x64_dbg/api/bin/memtrace.ls.15577.0000.log
% gzip /work/dr/git/build_x64_dbg/api/bin/*.log
% ls -sh /work/dr/git/build_x64_dbg/api/bin/*.log*
528K /work/dr/git/build_x64_dbg/api/bin/memtrace.ls.15577.0000.log.gz
Removing the PC field:
% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./mcf_base.gcc-64bit inp.in
13.26user 11.07system 3:31.28elapsed 11%CPU (0avgtext+0avgdata 164684maxresident)k
=> still i/o bound: 11% CPU. Produces a 31GB file; slowdown: 126x.
On laptop:
% /usr/bin/time ~/dr/git/build_x64_rel/bin64/drrun -c ~/dr/git/build_x64_rel/api/bin/libmemtrace_x86_binary.so -- ./mcf_base.clang-64bit inp.in
13.14user 15.50system 1:15.94elapsed 37%CPU (0avgtext+0avgdata 164736maxresident)k
Up to 37% CPU, and a 36x slowdown.
drcachesim tracer performance => 2x slower b/c of icache entries
Switching from the mcf test to the bzip2 test, since bzip2 is closer to the 18x average for the memtrace sample without disk writes and so is more representative:
native:
% /usr/bin/time ./bzip2_base.gcc-64bit dryer.jpg 2
1.95user 0.00system 0:01.96elapsed 100%CPU (0avgtext+0avgdata 17176maxresident)k
No disk writes at all:
% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./bzip2_base.gcc-64bit dryer.jpg 2
30.23user 0.01system 0:30.49elapsed 99%CPU (0avgtext+0avgdata 24300maxresident)k
That's 15.6x.
% /usr/bin/time /work/dr/git/build_x64_rel/bin64/drrun -c /work/dr/git/build_x64_rel/clients/lib64/release/libdrmemtrace.so -- ./bzip2_base.gcc-64bit dryer.jpg 2
60.32user 0.01system 1:00.34elapsed 99%CPU (0avgtext+0avgdata 25564maxresident)k
30.8x! That's 2x vs the memtrace sample, presumably because it includes icache (instruction fetch) entries.
Currently trace_entry_t is 4+8 bytes => 16 bytes b/c of alignment (we didn't pack it; was that b/c we only cared about 32-bit?).
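For reference, a sketch of the alignment issue and the packed fix (the field types are my guess at the rough shape, not the exact source):

```c
#include <stdint.h>

/* Rough shape of trace_entry_t as described above (an approximation,
 * not the exact source): 2+2 bytes of type+size plus an 8-byte
 * address.  The 8-byte field forces 8-byte struct alignment, so the
 * compiler pads 4+8 bytes out to 16. */
typedef struct {
    uint16_t type;
    uint16_t size;
    uint64_t addr;
} entry_unpacked_t; /* sizeof == 16: 4 bytes of padding */

/* Packing eliminates the padding, cutting each entry to 12 bytes: */
typedef struct __attribute__((__packed__)) {
    uint16_t type;
    uint16_t size;
    uint64_t addr;
} entry_packed_t; /* sizeof == 12 */
```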
Packing trace_entry_t w/o any other changes to the struct:
55.70user 0.01system 0:55.72elapsed 99%CPU (0avgtext+0avgdata 25560maxresident)k
Also compressing size+type from 4 bytes into 2 (might need an extra escape entry for memsz > 256):
54.11user 0.00system 0:54.12elapsed 100%CPU (0avgtext+0avgdata 25264maxresident)k
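A sketch of one possible 2-byte type+size encoding with the escape (the 8/8 bit split and the names are assumptions):

```c
#include <stdint.h>

/* One possible 2-byte type+size encoding (the bit split is an
 * assumption): the low byte holds the entry type, the high byte the
 * memref size.  Sizes that don't fit in a byte would use a reserved
 * escape type whose addr field carries the real size. */
#define ENTRY_TYPE(ts) ((uint8_t)((ts) & 0xff))
#define ENTRY_SIZE(ts) ((uint8_t)((ts) >> 8))

static inline uint16_t
encode_type_size(uint8_t type, uint8_t size)
{
    return (uint16_t)type | ((uint16_t)size << 8);
}
```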
Also shrinking pc/addr field to 4 bytes:
53.57user 0.01system 0:53.59elapsed 99%CPU (0avgtext+0avgdata 25060maxresident)k
Also removing INSTR_BUNDLE (it always has a preceding absolute-pc entry, so it's redundant):
48.63user 0.01system 0:48.99elapsed 99%CPU (0avgtext+0avgdata 24952maxresident)k
Also removing all instr entries (thus there's no PC at all) => 10.7x:
20.88user 0.01system 0:21.05elapsed 99%CPU (0avgtext+0avgdata 24896maxresident)k
Having the instr bundles and all the instr boundary info come from the tracer seems worth it for online simulation, where having the simulator dig the info up from disassembly of a giant binary would be slower than the tracer providing it. But for offline, it does seem like we want to really optimize the tracing -- thus we need a split tracer!
Adding back one instr entry per bb (the 1st instr in the bb) => 14.3x:
27.94user 0.00system 0:27.95elapsed 100%CPU (0avgtext+0avgdata 24768maxresident)k
Significant cost for instr-entry-per-bb: 33% more expensive. Maybe we can leverage traces to bring it down: one instr entry per trace + a bit per cbr + an extra entry per mbr crossed?
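On the offline side, the reader can regenerate per-instruction PCs from a single per-bb entry plus static decoding; here is a sketch using DR's decoder (the helper and its callback are hypothetical):

```c
#include "dr_api.h"

/* Sketch of offline PC expansion (hypothetical reader-side helper):
 * given the first PC of a bb from the trace plus its instruction
 * count, walk the instruction boundaries with DR's decoder rather
 * than storing a PC entry per instruction in the trace itself. */
static void
expand_bb_pcs(void *drcontext, app_pc bb_start, int instr_count,
              void (*emit_pc)(app_pc))
{
    app_pc pc = bb_start;
    for (int i = 0; i < instr_count && pc != NULL; i++) {
        emit_pc(pc);
        pc = decode_next_pc(drcontext, pc); /* start of next instr */
    }
}
```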
#790: try online compression with zlib
With the private loader, we should be able to just use the zlib library directly.
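A sketch of the gz-interface approach (the file name and buffer plumbing are illustrative, not the sample's exact code):

```c
#include <zlib.h>

/* Sketch: stream each filled trace buffer through zlib's gz interface.
 * The private loader lets the client link and call zlib directly. */
static gzFile trace_file;

static void
open_trace(const char *path)
{
    trace_file = gzopen(path, "wb");
}

static void
flush_buffer(const void *buf, unsigned int len)
{
    gzwrite(trace_file, buf, len);
}

static void
close_trace(void)
{
    gzclose(trace_file);
}
```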
% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_gz.so -- ./mcf_base.gcc-64bit inp.in
484.16user 1.91system 8:13.96elapsed 98%CPU (0avgtext+0avgdata 165552maxresident)k
It produces a 4GB file (vs the 41GB uncompressed binary) but is much slower: 295x vs native and 1.6x vs uncompressed. At 98% CPU it's now compute bound, too.
Try the zlib format instead of the gz format, where we can set high speed => Z_BEST_SPEED is faster than uncompressed on an HDD, but still not on an SSD.
Z_BEST_SPEED:
We have to use the deflate interface directly with the zlib compression format: the gz interface uses the gzip format and apparently has no way to set the speed-vs-size tradeoff.
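A sketch of the deflate-interface version at Z_BEST_SPEED (buffer sizing and error handling elided; the shape is an assumption, not the sample's exact code):

```c
#include <zlib.h>

static z_stream strm; /* zero-initialized: zalloc/zfree default to libc */

/* Set up a zlib-format stream at the fastest compression level. */
static void
compress_init(void)
{
    deflateInit(&strm, Z_BEST_SPEED);
}

/* Compress one trace buffer into out; returns the compressed size.
 * Assumes out_max is large enough (deflateBound() can size it); the
 * caller then writes the result out with dr_write_file(). */
static size_t
compress_buffer(unsigned char *buf, unsigned int len,
                unsigned char *out, unsigned int out_max)
{
    strm.next_in = buf;
    strm.avail_in = len;
    strm.next_out = out;
    strm.avail_out = out_max;
    deflate(&strm, Z_SYNC_FLUSH); /* flush so each buffer is self-contained */
    return out_max - strm.avail_out;
}
```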
% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_gz.so -- ./mcf_base.gcc-64bit inp.in
181.29user 1.91system 3:10.70elapsed 96%CPU (0avgtext+0avgdata 165220maxresident)k
It produces a 4.5GB file and is significantly faster than uncompressed, but it's still 114x vs native.
On the laptop it makes a 4.3GB file (I should have saved it to check whether it really differs) and:
229.51user 2.56system 3:53.45elapsed 99%CPU (0avgtext+0avgdata 167372maxresident)k
So even Z_BEST_SPEED is slower than uncompressed on an SSD!