Rather than issuing a separate write for each thread's initial header, we merge the header into that thread's first buffer, at the cost of one conditional in the buffer dump routine. This helps with strategies where writing a unit smaller than a full buffer is costly.
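
The idea can be sketched roughly as follows. This is a minimal illustration and not the actual implementation: the names `ThreadBuffer`, `CountingSink`, `HEADER_SIZE`, and the header layout are all hypothetical, and the sketch assumes a fixed-size header whose space is reserved at the front of the first buffer.

```python
HEADER_SIZE = 8  # hypothetical fixed header length, in bytes


class CountingSink:
    """Stand-in for the output target; records each write it receives."""

    def __init__(self) -> None:
        self.writes = []

    def write(self, chunk: bytes) -> None:
        self.writes.append(chunk)


class ThreadBuffer:
    """Per-thread buffer whose first dump also carries the thread header."""

    def __init__(self, thread_id: int, capacity: int = 64) -> None:
        self.thread_id = thread_id
        self.capacity = capacity
        self.header_pending = True
        # Reserve room for the header at the front of the first buffer.
        self.data = bytearray(HEADER_SIZE)

    def append(self, payload: bytes, sink: CountingSink) -> None:
        # Flush first if the payload would overflow the current buffer.
        if len(self.data) + len(payload) > self.capacity:
            self.dump(sink)
        self.data += payload

    def dump(self, sink: CountingSink) -> None:
        # The extra conditional: only the first dump fills in the header.
        if self.header_pending:
            header = b"HDR" + self.thread_id.to_bytes(5, "big")
            self.data[:HEADER_SIZE] = header
            self.header_pending = False
        if self.data:
            sink.write(bytes(self.data))  # one write per dump
        self.data = bytearray()
```

With this arrangement the header never goes out as a short write of its own; it travels with the first buffer's payload, and later dumps only pay for the `header_pending` check.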