Add more atomic support: 64-bit add for 32-bit code; atomic reads and writes for aarchxx
We have several use cases where we want better atomic support from DR:
- 32-bit support for dr_atomic_add64()
- 32-bit support for DRX_COUNTER_LOCK for 64-bit counters in drx_insert_counter_update()
- atomic loads and stores on aarchxx that become visible relatively promptly, for global flags accessed from gencode
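
To make the requested semantics concrete, here is a minimal sketch in portable C11. It is not DR's implementation. The `sketch_*` names are hypothetical, and using `<stdatomic.h>` is an assumption made for illustration only. The add is written as a compare-and-swap retry loop, which is one common way to get a 64-bit atomic add on a target with no native 64-bit atomic add instruction; the flag helpers use a release store and an acquire load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a DR API): atomically adds delta to *counter with a
 * compare-and-swap retry loop and returns the resulting sum.
 */
static uint64_t
sketch_atomic_add64(_Atomic uint64_t *counter, uint64_t delta)
{
    assert(counter != NULL);
    uint64_t old_val = atomic_load_explicit(counter, memory_order_relaxed);
    /* On failure the exchange writes the current value into old_val, so the
     * next iteration retries with a fresh snapshot.
     */
    while (!atomic_compare_exchange_weak_explicit(counter, &old_val, old_val + delta,
                                                  memory_order_seq_cst,
                                                  memory_order_relaxed)) {
        /* Retry until no other thread has modified the counter in between. */
    }
    return old_val + delta;
}

/* Hypothetical helper: publishes a global flag with release ordering. */
static void
sketch_flag_store(atomic_bool *flag, bool value)
{
    assert(flag != NULL);
    atomic_store_explicit(flag, value, memory_order_release);
}

/* Hypothetical helper: reads a global flag with acquire ordering. */
static bool
sketch_flag_load(atomic_bool *flag)
{
    assert(flag != NULL);
    return atomic_load_explicit(flag, memory_order_acquire);
}
```

Gencode would need the equivalent instruction sequences emitted inline rather than calls to C helpers like these; the sketch only shows the behavior that clients would rely on.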