ldrex..strex pair constraints challenge instrumentation and even core operation
An ldrex..strex pair has some constraints that make inserting instrumentation in between, or even core DR operation, challenging:
From the ARMv8 manual, section E2.10.5:
An implementation might clear an exclusive monitor between the LoadExcl
instruction and the StoreExcl instruction, without any application-related
cause. For example, this might happen because of cache evictions. Software
must, in any single thread of execution, avoid having any explicit memory
accesses or cache maintenance instructions between the LoadExcl instruction
and the associated StoreExcl instruction.
Implementations can benefit from keeping the LoadExcl and StoreExcl
operations close together in a single thread of execution. This minimizes
the likelihood of the exclusive monitor state being cleared between the
LoadExcl instruction and the StoreExcl instruction. Therefore, for best
performance, ARM strongly recommends a limit of 128 bytes between
LoadExcl and StoreExcl instructions in a single thread of execution.
One-time switches back to dispatch (for example during code discovery or a synchall) should be fine, because ldrex..strex code has to loop and be prepared to fail a few times. The problems all stem from inserted memory operations that execute every single time.
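To illustrate why an occasional failure is tolerable, here is a minimal sketch of the retry loop that application code built on exclusives already has to contain. It uses C11 atomics rather than raw assembly; on ARM targets without single-instruction atomics, compilers typically lower the weak compare-exchange to an ldrex..strex pair, though that lowering is an assumption about the toolchain. The helper name `fetch_add_retry` and its `retries` out-parameter are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: atomically adds delta to *p and returns the old value.
 * If retries is non-NULL, it receives the number of failed attempts. */
int fetch_add_retry(atomic_int *p, int delta, unsigned *retries)
{
    assert(p != NULL);
    int expected = atomic_load(p);
    unsigned failed = 0;
    /* The weak form is allowed to fail spuriously, which is what a cleared
     * exclusive monitor looks like at the C level. On failure, expected is
     * refreshed with the current value and the loop simply tries again. */
    while (!atomic_compare_exchange_weak(p, &expected, expected + delta))
        failed++;
    if (retries != NULL)
        *retries = failed;
    return expected;
}
```

A rare extra iteration costs little here. A memory access that clears the monitor on every iteration, in contrast, could keep the loop from ever succeeding.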
Thus we have problems with:
- Stolen register mangling in between or on ldrex/strex themselves
- Inserted instrumentation
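Both cases come down to extra memory accesses, or extra bytes, landing between the pair. As a rough sketch of how that could be detected after mangling and instrumentation, the checker below walks a toy instruction list and reports what sits between the first LoadExcl and its StoreExcl. The types `toy_instr_t` and `excl_report_t` and both functions are hypothetical models for illustration, not DR's IR or API, and the sketch assumes the pair ends up in one straight-line list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction kinds for this sketch only. */
typedef enum { OP_OTHER, OP_LOAD_EXCL, OP_STORE_EXCL, OP_MEM_ACCESS } op_kind_t;

typedef struct {
    op_kind_t kind;
    unsigned size; /* encoded size in bytes */
} toy_instr_t;

typedef struct {
    bool found_pair;        /* a LoadExcl followed by a StoreExcl was seen */
    unsigned mem_between;   /* explicit memory accesses between the pair */
    unsigned bytes_between; /* instruction bytes strictly between the pair */
} excl_report_t;

/* Reports on the first complete LoadExcl..StoreExcl pair in code[0..n). */
excl_report_t check_exclusive_region(const toy_instr_t *code, size_t n)
{
    assert(code != NULL || n == 0);
    excl_report_t r = { false, 0, 0 };
    bool inside = false;
    unsigned mem = 0, bytes = 0;
    for (size_t i = 0; i < n; i++) {
        switch (code[i].kind) {
        case OP_LOAD_EXCL:
            /* A new LoadExcl restarts the region. */
            inside = true;
            mem = 0;
            bytes = 0;
            break;
        case OP_STORE_EXCL:
            if (inside) {
                r.found_pair = true;
                r.mem_between = mem;
                r.bytes_between = bytes;
                return r;
            }
            break;
        case OP_MEM_ACCESS:
            if (inside) {
                mem++;
                bytes += code[i].size;
            }
            break;
        default:
            if (inside)
                bytes += code[i].size;
            break;
        }
    }
    return r;
}

/* True if the region breaks the manual's guidance quoted above: any explicit
 * memory access between the pair, or more than 128 bytes between them. */
bool region_is_risky(excl_report_t r)
{
    return r.found_pair && (r.mem_between > 0 || r.bytes_between > 128);
}
```

A check like this only flags the problem; it does not say how to move or rewrite the offending accesses.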
There could be multiple cbrs between ldrex and strex, so do we give up on getting the pair into a single bb? Should we study sequences we see in the wild to find out whether a few assumptions would make the problem drastically easier? Should we have a special trace head trigger and stop supporting -disable_traces?
One suggestion is to add a check to each instrumentation memory reference so that it is skipped when execution is in the middle of an ldrex..strex pair, but this gets tricky because the check itself cannot use memory references.