reduce per-thread memory usage on UNIX
We've optimized memory usage for each thread on Windows in the past, but not on UNIX.
After some investigation, this is what we allocate for each thread on x86_64:
- dstack, 64K w/ guards
- DR TLS, 12K w/ guards (so 4K wasted with 16K vmm blocks): os_tls_init
- privlib TLS, 16K w/ guards: privload_tls_init (from os_tls_app_seg_init, os_tls_init)
- local_heap, 32K w/ guards
- nonpersistent_heap, 32K w/ guards
- signal special heap, 32K w/ guards
- alt sig stack, 64K w/ guards
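So that's 7 requests. Counting each as at least one 32K unit, and the two 64K ones as two, gives 9 32K == 18 16K units == 288K per thread. (Taken at face value, the listed sizes sum to 256K, since the two TLS regions occupy 16K each.) Either way that's too much: a large app with 1000 threads will need 256M to 288M, which will likely be much more memory than it needs for its code cache or heap.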
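A rough sketch of that tally, taking the listed sizes at face value and rounding each request up to the vmm block size. The sizes come from the list above; the names per_thread_requests, round_up_to_block, and per_thread_footprint are hypothetical and not taken from the DR sources.

```c
#include <assert.h>
#include <stddef.h>

/* Per-thread requests from the list above, in bytes, guards included.
 * The DR TLS entry is the 12K actually requested, before block rounding. */
static const size_t per_thread_requests[] = {
    64 * 1024, /* dstack */
    12 * 1024, /* DR TLS */
    16 * 1024, /* privlib TLS */
    32 * 1024, /* local_heap */
    32 * 1024, /* nonpersistent_heap */
    32 * 1024, /* signal special heap */
    64 * 1024, /* alt sig stack */
};

/* Round size up to a multiple of block (hypothetical helper). */
static size_t
round_up_to_block(size_t size, size_t block)
{
    assert(block > 0);
    return (size + block - 1) / block * block;
}

/* Total bytes reserved per thread for a given vmm block size. */
static size_t
per_thread_footprint(size_t vmm_block_size)
{
    size_t total = 0;
    size_t n = sizeof(per_thread_requests) / sizeof(per_thread_requests[0]);
    for (size_t i = 0; i < n; i++)
        total += round_up_to_block(per_thread_requests[i], vmm_block_size);
    return total;
}
```

Under these assumptions per_thread_footprint(16 * 1024) comes to 256K, below the 288K estimate because the two TLS requests take 16K rather than 32K each. With 4K blocks only the DR TLS entry shrinks, giving 252K, so block size alone recovers little; the savings have to come from shrinking the individual requests.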
Where is the gencode? Private gencode is never created for x86_64. Where is the thread-private cache? Never created because there are no traces and no inlined syscalls?
Things to consider:
- turn off guard pages by default to save memory? This is a robustness-performance tradeoff: maybe let users make that decision, and document the -guard_pages option
- if reset is disabled, don't split nonpersistent heap
- combine the 2 TLS regions into one, or use -vmm_block_size 4K (see below): unlike on Windows, there is no kernel min size forcing us to add code complexity to share allocs
- shrink the vmm block size to 4K to avoid waste and make it easy to shrink other things incrementally? What is the cost of a smaller block size? Extra space+time for the vmm block data struct? bitmap_element_t blocks[BITMAP_INDEX(MAX_VMM_HEAP_UNIT_SIZE / MIN_VMM_BLOCK_SIZE)]; So it is hardcoded for max + min no matter the runtime options. It was 512*1024K/16K = 32K blocks, / 32 bits per uint = 1024 uints = 4K of space. With a 4K min block size it's a 16K array. vmm_dump_map is also 4x bigger in the logfile. We could do 8K instead of 4K: only the DR TLS would waste a page right now, but we want to shrink the thread heap down to 4K too.
- shrink local heap: it's almost always only used for IR, right? Plus, it will just make a new unit if it needs more. Measure how much is used? Clients would be expected to not use much, and a new unit will be made if they really want a lot. I would think 4K should be plenty. But make that Linux-only? Windows should take 64K and commit one page at a time (pretty sure it's already doing that).
- shrink signal heap or stack? What is the size of a signal frame on Skylake? What is the size of sigpending_t?
- shrink dstack: that's up to the client
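The bitmap cost in the vmm block size item can be checked with a small sketch. The constants below are assumptions copied from the numbers quoted in that item (a 512*1024K max unit and 32 bits per bitmap element), not from the DR headers, and vmm_bitmap_bytes is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, taken from the figures quoted in the list above. */
#define ASSUMED_MAX_UNIT_BYTES ((uint64_t)512 * 1024 * 1024) /* 512*1024K */
#define ASSUMED_BITS_PER_ELEMENT 32 /* one 32-bit uint tracks 32 blocks */

/* Bytes needed for a one-bit-per-block bitmap covering the max unit size. */
static size_t
vmm_bitmap_bytes(uint64_t min_block_bytes)
{
    assert(min_block_bytes > 0 &&
           ASSUMED_MAX_UNIT_BYTES % min_block_bytes == 0);
    uint64_t blocks = ASSUMED_MAX_UNIT_BYTES / min_block_bytes;
    uint64_t elements =
        (blocks + ASSUMED_BITS_PER_ELEMENT - 1) / ASSUMED_BITS_PER_ELEMENT;
    return (size_t)(elements * sizeof(uint32_t));
}
```

Under these assumptions a 16K min block gives a 4K array, 8K gives 8K, and 4K gives 16K. This is a one-time static cost rather than a per-thread one, so it is small next to the per-thread savings being discussed.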