Adds 32-bit and 64-bit load and store atomics to core DR and exports them to the client API.
Adds missing barriers to the existing store atomic macros on both ARM and AArch64.
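As a rough illustration of why a store atomic needs ordering, here is a minimal sketch in C11 `<stdatomic.h>`. It is not the DR macros and does not show which barrier instructions they emit. The names `payload`, `ready`, `publish`, and `try_consume` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical example state: ordinary data plus a flag that publishes it. */
static int payload;
static atomic_bool ready;

/* Writer: fill in the data, then publish the flag with a store atomic. */
void
publish(int value)
{
    payload = value;
    /* Release ordering keeps the payload write from being reordered
     * after the flag store. Without it, a weakly ordered CPU such as
     * ARM could make the flag visible before the data. */
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: only touch the data once the flag has been observed. */
bool
try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

On x86 an ordinary store already has this ordering, which is likely why a missing barrier would only show up on ARM and AArch64.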
Adds a sanity test that at least exercises the new functions. Adds the test sequence to client.thread and client.tls (of the two, only client.tls is currently enabled on AArchXX).
Uses dr_atomic_load* in drwrap in place of the previous dr_atomic_add*, which was overkill there.
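The difference between the two idioms can be sketched in C11 `<stdatomic.h>`, assuming the old code used an add purely to read a value (this message does not say so explicitly). The names `shared_count`, `read_via_add`, and `read_via_load` are hypothetical and are not the drwrap code.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared value, standing in for state that is only read. */
static _Atomic int32_t shared_count;

/* Old idiom: read by atomically adding zero. This is a read-modify-write,
 * so it is treated as a write to the location even though the value is
 * unchanged. */
int32_t
read_via_add(void)
{
    return atomic_fetch_add(&shared_count, 0);
}

/* New idiom: a plain atomic load, which only reads. */
int32_t
read_via_load(void)
{
    return atomic_load(&shared_count);
}
```

Both functions return the same value; the load simply avoids the cost of a read-modify-write.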
The new atomics will also help with #2502.
The next step is to add gencode versions of these, either to the core or drx.