reduce per-thread memory usage on UNIX

We've optimized memory usage for each thread on Windows in the past, but not on UNIX.

After some investigation, this is what we allocate for each thread on x86_64:

dstack, 64K w/ guards
DR TLS, 12K w/ guards (so 4K wasted with 16K vmm blocks): os_tls_init
privlib TLS, 16K w/ guards: privload_tls_init (from os_tls_app_seg_init, os_tls_init)
local_heap, 32K w/ guards
nonpersistent_heap, 32K w/ guards
signal special heap, 32K w/ guards
alt sig stack, 64K w/ guards

So that's 7 requests: two at 64K, three at 32K, and two that each occupy a 16K vmm block. That is 16 16K units == 256K per thread. That's too much: a large app with 1000 threads will need about 256M, which will likely be much more memory than it needs for its code cache or heap.

Where is the gencode? Private gencode is never created for x86_64. Where is the thread-private cache? Never created because there are no traces and no inlined syscalls?

Things to consider:

turn off guard pages by default to save memory? This is a robustness-performance tradeoff: maybe let users make that decision, and document the -guard_pages option
if reset is disabled, don't split nonpersistent heap
combine the 2 TLS regions into one or, alternatively, use -vmm_block_size 4K (see below): unlike on Windows, there is no kernel min size forcing us to add code complexity to share allocs
shrink vmm unit to 4K to avoid waste and to make it easy to shrink other things incrementally? What is the cost of a smaller block size? Extra space and time for the vmm block data struct: bitmap_element_t blocks[BITMAP_INDEX(MAX_VMM_HEAP_UNIT_SIZE / MIN_VMM_BLOCK_SIZE)]; So the array is hardcoded for the max unit size and min block size no matter the runtime options. It was 512*1024K / 16K = 32K blocks, / 32 bits per uint = 1024 uints = 4K of space. With a 4K min block size it's a 16K array. vmm_dump_map is also 4x bigger in the logfile. We could do 8K instead of 4K: only the DR TLS would waste a page right now, but we want to shrink the thread heap down to 4K too.
shrink local heap: it's almost always used only for IR, right? Plus, we will just make a new unit if we need more. Measure how much is used? Clients would be expected to not use much, and a new unit will be made if they really want a lot. I would think 4K should be plenty. But make that Linux-only? Windows should take 64K and commit one page at a time (pretty sure it's already doing that).
shrink signal heap or stack? What is the size of a signal frame on Skylake? What is the size of sigpending_t?
shrink dstack: that's up to client