Switch raw2trace from a per-instr hashtable to a per-block hashtable

For #3977 I changed raw2trace to store per-instruction data with the block start PC as part of the key, so that instructions duplicated across multiple blocks are handled. However, raw2trace was still doing a hashtable lookup for every individual instruction, and that lookup took 10% of CPU time (with i/o disabled). With the best C++ STL table, the lookup rose to 28% of the time (see #2056 (closed) for the initial measurements; I re-measured with the new block,instr keys), and the overall run was significantly slower (about 7.5s vs. 6.4s of user time):

hashtable_t in HEAD:

6.39user 0.09system 0:06.49elapsed 99%CPU (0avgtext+0avgdata 10964maxresident)k
6.33user 0.07system 0:06.41elapsed 99%CPU (0avgtext+0avgdata 10964maxresident)k
6.45user 0.06system 0:06.57elapsed 99%CPU (0avgtext+0avgdata 10904maxresident)k

c++11 std::unordered_map.find, max_load_factor 0.5, init size 1<<16, custom hash+cmp:

7.61user 0.09system 0:07.71elapsed 99%CPU (0avgtext+0avgdata 11360maxresident)k
7.51user 0.04system 0:07.55elapsed 99%CPU (0avgtext+0avgdata 11516maxresident)k
7.44user 0.07system 0:07.60elapsed 99%CPU (0avgtext+0avgdata 11428maxresident)k
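
For reference, a table configured along the lines of the std::unordered_map run above might look like the following minimal sketch. The key layout, the hash-combine function, the payload struct, and all type names here are illustrative assumptions, not the code that was measured; "init size" is interpreted as the initial bucket count.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical key: block start PC plus the instruction's own PC.
struct instr_key_t {
    uint64_t block_pc;
    uint64_t instr_pc;
};

// Custom hash (assumption: a simple hash-combine; the measured hash is not shown).
struct instr_key_hash_t {
    size_t operator()(const instr_key_t &k) const
    {
        size_t h = std::hash<uint64_t>()(k.block_pc);
        return h ^ (std::hash<uint64_t>()(k.instr_pc) + 0x9e3779b9 + (h << 6) + (h >> 2));
    }
};

// Custom comparison: both halves of the key must match.
struct instr_key_eq_t {
    bool operator()(const instr_key_t &a, const instr_key_t &b) const
    {
        return a.block_pc == b.block_pc && a.instr_pc == b.instr_pc;
    }
};

// Placeholder payload standing in for the real per-instruction data.
struct instr_info_t {
    int length;
};

using instr_table_t =
    std::unordered_map<instr_key_t, instr_info_t, instr_key_hash_t, instr_key_eq_t>;

// Builds an empty table with max_load_factor 0.5 and at least 1<<16 buckets.
inline instr_table_t
make_instr_table()
{
    instr_table_t table;
    table.max_load_factor(0.5f);
    table.rehash(1 << 16);
    return table;
}
```

Even with a low load factor and a pre-sized table, every instruction still pays for one hash computation and one bucket probe, which is the cost the per-block design avoids.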

As part of #3316 (closed) I'm trying to remove all reliance on the full DR library from raw2trace. The use of the drcontainers hashtable_t is one such reliance. Making a version of drcontainers with no DR dependencies would be rather hacky (it has persistence support and uses DR locks), so instead I'm proposing refactoring the raw2trace hashtables to store per-block info, query per block, remember the last block, and store per-instr info in a vector inside the block. All callers have access to the instr count and instr index, provided we store the index in a note for the elision walk.
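
The proposed layout can be sketched roughly as follows. This is only an illustration of the idea (per-block table, last-block cache, per-instr vector indexed by position in the block); the class, field, and method names are hypothetical and do not reflect the actual raw2trace code.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical per-instruction payload.
struct instr_summary_t {
    uint64_t pc = 0;
    int length = 0;
};

// Per-block info: the vector is sized up front from the block's instr count.
struct block_summary_t {
    block_summary_t(uint64_t start, size_t instr_count)
        : start_pc(start)
        , instrs(instr_count)
    {
    }
    uint64_t start_pc;
    std::vector<instr_summary_t> instrs;
};

class block_cache_t {
public:
    // Registers a block (no-op if already present) and remembers it as the last block.
    block_summary_t *
    add_block(uint64_t start_pc, size_t instr_count)
    {
        auto res = table_.emplace(start_pc, block_summary_t(start_pc, instr_count));
        // Pointers to unordered_map elements stay valid across rehashing.
        last_block_ = &res.first->second;
        return last_block_;
    }

    // Returns the per-instr slot, or nullptr if the block or index is unknown.
    instr_summary_t *
    lookup_instr(uint64_t start_pc, size_t index)
    {
        block_summary_t *block = lookup_block(start_pc);
        if (block == nullptr || index >= block->instrs.size())
            return nullptr;
        return &block->instrs[index];
    }

    // Number of times the hashtable itself was consulted.
    size_t
    hash_lookups() const
    {
        return hash_lookups_;
    }

private:
    block_summary_t *
    lookup_block(uint64_t start_pc)
    {
        // Fast path: consecutive instructions in the same block skip the table.
        if (last_block_ != nullptr && last_block_->start_pc == start_pc)
            return last_block_;
        ++hash_lookups_;
        auto it = table_.find(start_pc);
        if (it == table_.end())
            return nullptr;
        last_block_ = &it->second;
        return last_block_;
    }

    std::unordered_map<uint64_t, block_summary_t> table_;
    block_summary_t *last_block_ = nullptr;
    size_t hash_lookups_ = 0;
};
```

With this shape, a block of N instructions costs at most one table lookup plus N vector indexings, rather than N table lookups.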

I implemented this and the speedup is nice, roughly 18% less user time than HEAD (remember this is w/o i/o):

5.22user 0.11system 0:05.34elapsed 99%CPU (0avgtext+0avgdata 10728maxresident)k
5.21user 0.05system 0:05.27elapsed 99%CPU (0avgtext+0avgdata 10704maxresident)k
5.30user 0.06system 0:05.37elapsed 99%CPU (0avgtext+0avgdata 10664maxresident)k

The time spent in hashtable_lookup is now 0.95%. So we should be able to swap to the C++ table without noticeable overhead, making it easier to move raw2trace to use drdecode.