clean call optimizations on AArch64: out-of-line calls, callee analysis to reduce context switch size, inlining
We put in place basic unoptimized clean call support on ARM, but all of the performance features are missing. This issue covers adding out-of-line calls (perhaps less necessary because the register lists on ARM reduce code size) and analysis of callees, both to reduce the context switch size and to inline the callee where possible.
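To make the callee-analysis part concrete, here is a minimal sketch in C of the kind of decision it would feed. It is not DynamoRIO code: every type, name, and threshold below (`sketch_instr_t`, `sketch_analyze_callee`, `SKETCH_MAX_INLINE_INSTRS`) is hypothetical, the callee is modelled as a pre-decoded array rather than real instructions, and SIMD/FP registers are ignored for brevity. The idea is to scan the callee, collect the general-purpose registers and flags it writes, save only those around the call, and treat a short leaf callee as an inlining candidate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one decoded callee instruction (not a DynamoRIO type). */
typedef struct {
    uint32_t gpr_written; /* bit i set if the instruction writes Xi, i in 0..30 */
    bool writes_flags;    /* true if the instruction updates NZCV */
    bool is_call;         /* true for BL/BLR, i.e. the callee is not a leaf */
} sketch_instr_t;

/* Result of the analysis: what the context switch has to preserve. */
typedef struct {
    uint32_t gpr_to_save; /* bitmask of X registers to save and restore */
    bool save_flags;      /* whether NZCV must be preserved */
    bool is_leaf;         /* callee makes no further calls */
    bool can_inline;      /* short leaf callee: candidate for inlining */
} sketch_callee_info_t;

#define SKETCH_ALL_GPRS 0x7fffffffu /* X0..X30 */
#define SKETCH_MAX_INLINE_INSTRS 8  /* assumed threshold, purely illustrative */

sketch_callee_info_t
sketch_analyze_callee(const sketch_instr_t *instrs, size_t count)
{
    sketch_callee_info_t info = { 0, false, true, false };
    assert(instrs != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        /* Accumulate everything the callee may clobber. */
        info.gpr_to_save |= instrs[i].gpr_written & SKETCH_ALL_GPRS;
        info.save_flags = info.save_flags || instrs[i].writes_flags;
        if (instrs[i].is_call)
            info.is_leaf = false;
    }
    if (!info.is_leaf) {
        /* The callee reaches code we did not analyze: save everything. */
        info.gpr_to_save = SKETCH_ALL_GPRS;
        info.save_flags = true;
    }
    info.can_inline = info.is_leaf && count <= SKETCH_MAX_INLINE_INSTRS;
    return info;
}

/* Number of registers in the save set, i.e. the size of the reduced switch. */
unsigned
sketch_num_saved_gprs(const sketch_callee_info_t *info)
{
    unsigned n = 0;
    for (uint32_t mask = info->gpr_to_save; mask != 0; mask >>= 1)
        n += mask & 1u;
    return n;
}
```

In this model, a two-instruction leaf callee that writes only X0, X1 and the flags yields a save set of two registers plus NZCV instead of the full context, while any callee containing a call falls back to the full save and is not inlined.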