ldrex..strex pair constraints challenge instrumentation and even core operation

An ldrex..strex pair has some constraints that make inserting instrumentation in between, or even core DR operation, challenging:

From the ARMv8 manual, section E2.10.5:

  An implementation might clear an exclusive monitor between the LoadExcl
  instruction and the StoreExcl instruction, without any application-related
  cause. For example, this might happen because of cache evictions.  Software
  must, in any single thread of execution, avoid having any explicit memory
  accesses or cache maintenance instructions between the LoadExcl instruction
  and the associated StoreExcl instruction.

  Implementations can benefit from keeping the LoadExcl and StoreExcl
  operations close together in a single thread of execution. This minimizes
  the likelihood of the exclusive monitor state being cleared between the
  LoadExcl instruction and the StoreExcl instruction. Therefore, for best
  performance, ARM strongly recommends a limit of 128 bytes between
  LoadExcl and StoreExcl instructions in a single thread of execution.

One-time switches back to dispatch (during code discovery, or synchall or something) should be fine, because the ldrex..strex code has to loop and be prepared to fail a few times. The problems are all related to inserted memory operations that execute every single time.
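To make the "has to loop and be prepared to fail" point concrete, here is a minimal C11 sketch of the retry pattern. The function name `atomic_add_retry` and its `retries` out-parameter are made up for illustration. On ARM targets without single-instruction atomics, compilers typically lower a weak compare-exchange loop like this to an ldrex..strex (or ldxr..stxr) loop, and a cleared exclusive monitor shows up as a spurious failure that simply goes around the loop again.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: atomically add `delta` to *counter and return the
 * new value. If `retries` is non-NULL it receives the number of failed
 * attempts, which may be nonzero even with no contention on LL/SC targets. */
uint32_t atomic_add_retry(_Atomic uint32_t *counter, uint32_t delta,
                          unsigned *retries)
{
    assert(counter != NULL);
    uint32_t old = atomic_load_explicit(counter, memory_order_relaxed);
    unsigned attempts = 0;
    /* The weak form may fail spuriously. On failure `old` is reloaded with
     * the current value, so the loop just tries again. */
    while (!atomic_compare_exchange_weak_explicit(counter, &old, old + delta,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed)) {
        attempts++;
    }
    if (retries != NULL)
        *retries = attempts;
    return old + delta;
}
```

An occasional extra iteration here is harmless, which is why a one-time trip back to dispatch is tolerable. A memory access that clears the monitor on every iteration could keep the loop from ever exiting.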

Thus we have problems with:

Stolen register mangling in between or on ldrex/strex themselves
Inserted instrumentation

There could be multiple cbrs between the ldrex and the strex, so do we give up on getting the pair into a single bb? Should we study sequences we see in the wild to see whether making some assumptions would make the problem drastically easier? Should we have a special trace head trigger and stop supporting -disable_traces?

One suggestion is to add a check before each instrumentation memory reference that skips it when execution is in the middle of an ldrex..strex pair, but this gets tricky because the check itself cannot make any memory references.
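One way to sidestep the runtime check, at least for pairs that land in a single block, is to decide statically at decode time. The sketch below is not DR code: `insn_t`, `op_kind_t`, and `mark_exclusive_regions` are hypothetical names, and it assumes the block has already been decoded into a flat array. It marks every instruction strictly between a load-exclusive and the next store-exclusive so an instrumentation pass could skip them, and it reports the largest ldrex-to-strex distance in bytes for comparison against the 128-byte recommendation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified instruction classification. */
typedef enum { OP_OTHER, OP_MEM, OP_LOAD_EXCL, OP_STORE_EXCL } op_kind_t;

typedef struct {
    op_kind_t kind;
    unsigned length;      /* encoded size in bytes */
    bool in_excl_region;  /* output: lies between a ldrex and its strex */
} insn_t;

/* Marks instructions strictly between a load-exclusive and the next
 * store-exclusive. Returns the largest distance in bytes from the start
 * of a load-exclusive to the start of its store-exclusive. A region that
 * is still open at the end of the array stays marked (conservative). */
unsigned mark_exclusive_regions(insn_t *insns, size_t count)
{
    assert(insns != NULL || count == 0);
    bool open = false;
    unsigned span = 0, max_span = 0;
    for (size_t i = 0; i < count; i++) {
        insns[i].in_excl_region = false;
        if (insns[i].kind == OP_LOAD_EXCL) {
            open = true;   /* a new load-exclusive restarts the region */
            span = 0;
        } else if (open && insns[i].kind == OP_STORE_EXCL) {
            open = false;
            if (span > max_span)
                max_span = span;
        } else if (open) {
            insns[i].in_excl_region = true;
        }
        if (open)
            span += insns[i].length;
    }
    return max_span;
}
```

This only covers straight-line pairs. It does nothing for pairs split across blocks by intervening branches, and it does not address stolen-register mangling on the ldrex/strex instructions themselves.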