memtrace is very slow due to dr_fprintf not being buffered
Even though memtrace_x86.c only flushes data to the file every 8K memrefs, it still prints them one at a time, with a separate dr_fprintf call for each one.
Natively, gzip compresses this 1.5MB file (created by concatenating 10 copies of this build dir's Makefile) almost instantaneously:
# /usr/bin/time gzip /tmp/z ; gunzip /tmp/z.gz
0.01user 0.00system 0:00.01elapsed 94%CPU (0avgtext+0avgdata 1684maxresident)k
Here's memtrace_x86 from TOT:
# /usr/bin/time bin64/drrun -c api/bin/libmemtrace_x86.so -- gzip /tmp/z ; gunzip /tmp/z.gz
18.65user 18.61system 0:40.09elapsed 92%CPU (0avgtext+0avgdata 6684maxresident)k
And here it is with buffering: essentially calling fdopen(data->log, "w") once and then using fprintf on the resulting FILE* instead of dr_fprintf:
# /usr/bin/time bin64/drrun -c api/bin/libmemtrace_x86.so -- gzip /tmp/z ; gunzip /tmp/z.gz
5.81user 0.47system 0:08.66elapsed 72%CPU (0avgtext+0avgdata 7176maxresident)k
That is a 4.6x difference in elapsed time. The trace log file created is 1.5GB.
Here's the client with the printing removed, showing that virtually all the time is spent printing and not in the instrumentation itself:
# /usr/bin/time bin64/drrun -c api/bin/libmemtrace_x86.so -- gzip /tmp/z ; gunzip /tmp/z.gz
0.32user 0.00system 0:00.33elapsed 97%CPU (0avgtext+0avgdata 6604maxresident)k
In some cases the perf difference between dr_fprintf and fprintf is 20x or more: e.g., for a naive instruction-tracing client that prints on every single instruction.
We do document that dr_fprintf is not buffered, but we then use it in examples like memtrace. We should either add buffering to memtrace, switch it to libc's fprintf, or add buffering to dr_fprintf itself. We should also review all other places where dr_fprintf is used, and if we decide not to add buffering, we may want to strengthen the warnings in the docs.