Redo rep string expansion to avoid the instruction fetch on subsequent iterations
This is related to #4915, where we may remove the instruction fetches for subsequent rep string iterations (currently marked as special "non-fetched" instructions) during offline drcachesim trace post-processing. Here, though, the goal is to remove the instruction fetch for all clients by changing the expansion itself.
Today the single-block-external-loop expansion results in a string operation instruction in each iteration:
$ bin64/drrun -c api/bin/libmemtrace_simple.so -- suite/tests/bin/allasm_repstr && cat `ls -1td api/bin/*.log|head -1`
Format: <data address>: <data size>, <(r)ead/(w)rite/opcode>
0x401018: 2, rep movs
0x40200e: 1, r
0x402000: 1, w
0x401018: 2, rep movs
0x40200f: 1, r
0x402001: 1, w
0x401018: 2, rep movs
0x402010: 1, r
0x402002: 1, w
0x401018: 2, rep movs
0x402011: 1, r
0x402003: 1, w
0x401018: 2, rep movs
0x402012: 1, r
0x402004: 1, w
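To make the redundancy concrete, here is a minimal sketch that tallies the records in the log above. It assumes only the format stated in the log header (address, size, then either r, w, or an opcode name); the helper names parse_line and summarize are made up for illustration and are not part of memtrace_simple.

```python
from collections import Counter

# The log lines copied from the output above (header line omitted).
SAMPLE_LOG = """\
0x401018: 2, rep movs
0x40200e: 1, r
0x402000: 1, w
0x401018: 2, rep movs
0x40200f: 1, r
0x402001: 1, w
0x401018: 2, rep movs
0x402010: 1, r
0x402002: 1, w
0x401018: 2, rep movs
0x402011: 1, r
0x402003: 1, w
0x401018: 2, rep movs
0x402012: 1, r
0x402004: 1, w
"""


def parse_line(line):
    """Split '<addr>: <size>, <kind>' into (addr, size, kind)."""
    addr, rest = line.split(":", 1)
    size, kind = rest.split(",", 1)
    return int(addr, 16), int(size), kind.strip()


def summarize(log_text):
    """Count reads, writes, and instruction fetches (keyed by opcode)."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line or line.startswith("Format"):
            continue  # Skip blank lines and the format header.
        _, _, kind = parse_line(line)
        if kind in ("r", "w"):
            counts[kind] += 1
        else:
            # Anything that is not r/w is an opcode, i.e. an instr fetch.
            counts["fetch:" + kind] += 1
    return counts


summary = summarize(SAMPLE_LOG)
print(dict(summary))
```

For this 5-byte copy the tally shows as many rep movs fetch records as there are iterations, whereas the goal here is a single fetch record followed by the reads and writes.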
There are 3 ways we could remove it (aside from an offline post-processing model as in #4915):
A) Use runtime state:
    if TLS.flag == 0
        record instr fetch
        set TLS.flag to 1
    <today's expansion but without emulated instr fetch>
    if loop is not going to be taken (check ecx)
        set TLS.flag to 0
    loop
This is expensive: every iteration pays for a TLS load, compare, and branch, plus the extra ecx check.
B) Unroll the 1st iter in the expansion. This adds a bunch more branches and we'll want to try to ensure it won't mess up drreg or other linear-flow libraries.
C) Unroll the 1st iter across 2 blocks: 1st iter block plus rest-of-loop block. This may be harder than B due to needing unique PC entries.
Probably B would be the way to go.
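As a sanity check on the intended result, the following is a rough Python model (not DynamoRIO code) of the records that options A and B would produce. Everything here is an assumption made for illustration: the ThreadState class stands in for the TLS flag, the record tuples are invented, the copy is byte-sized and moves forward, and the count is at least 1 (how a zero count should be traced is deliberately left open).

```python
class ThreadState:
    """Hypothetical stand-in for the per-thread TLS flag in option A."""

    def __init__(self):
        self.flag = 0


def trace_option_a(tls, pc, src, dst, count):
    """Option A: a runtime flag decides whether to record the fetch."""
    if count < 1:
        raise ValueError("this model only covers count >= 1")
    out = []
    remaining = count
    offset = 0
    while True:
        if tls.flag == 0:
            out.append(("fetch", pc))  # Only on the first iteration.
            tls.flag = 1
        # Today's expansion minus the emulated instr fetch.
        out.append(("r", src + offset))
        out.append(("w", dst + offset))
        remaining -= 1
        offset += 1
        if remaining == 0:
            tls.flag = 0  # Loop will not be taken: reset for next time.
            break
    return out


def trace_option_b(pc, src, dst, count):
    """Option B: the first iteration is unrolled and carries the fetch."""
    if count < 1:
        raise ValueError("this model only covers count >= 1")
    # Unrolled first iteration.
    out = [("fetch", pc), ("r", src), ("w", dst)]
    # Remaining iterations loop without any fetch record.
    for offset in range(1, count):
        out.append(("r", src + offset))
        out.append(("w", dst + offset))
    return out


tls = ThreadState()
# Addresses and count taken from the allasm_repstr log above.
records_a = trace_option_a(tls, 0x401018, 0x40200E, 0x402000, 5)
records_b = trace_option_b(0x401018, 0x40200E, 0x402000, 5)
fetches = [r for r in records_a if r[0] == "fetch"]
print(records_a == records_b, len(fetches))
```

In this model both options yield the same record stream, so the choice between them comes down to runtime cost and expansion complexity rather than trace content. Option A carries per-iteration flag checks, while option B moves that cost into extra code and branches in the expansion.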