Packet "time_base" manipulations are tricky
While trying to encode frames (with correct PTS values) to an H.264 stream, I kept running into problems when submitting the "flush" packets to the muxer: it complained that the packets had non-monotonically increasing PTS.
After some digging around I have come to the conclusion that PyAV's practice of storing a "time_base" on packets, and its attempts to manipulate their pts / dts / duration, are harmful.
The root of the problem is how PyAV attempts to associate a "time_base" with a packet (a concept that FFmpeg's AVPacket does not have) during encoding. In essence, CodecContext.encode works like this (see the sketch after this list):
- send a frame to the encoder
- check if it has some packets for us
- if it does, try to put either the frame's time_base or the stream's into the packet (setup_encoded_packet)
- return the packets
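To make that flow concrete, here is a toy model of it. This is my own illustration, not PyAV's actual code: ToyBufferingEncoder and the dict-based packets are made up, and setup_encoded_packet here only mirrors the role of the real helper.

```python
from collections import deque

class ToyBufferingEncoder:
    """Stands in for a codec that buffers a few frames before emitting any packet."""
    DELAY = 3

    def __init__(self):
        self._queue = deque()

    def send_frame(self, frame):
        if frame is not None:          # None means "start flushing"
            self._queue.append(frame)

    def receive_packets(self, flushing=False):
        # Only emit once enough frames are buffered (or when flushing).
        while len(self._queue) > self.DELAY or (flushing and self._queue):
            yield {"pts": self._queue.popleft()["pts"]}

def setup_encoded_packet(packet, frame, stream_time_base):
    # The association being criticised: the time_base is taken from the
    # frame that was *just* submitted (or the stream when there is none),
    # even though the packet may belong to a much earlier frame.
    packet["time_base"] = frame["time_base"] if frame else stream_time_base

encoder = ToyBufferingEncoder()
for i in range(5):
    frame = {"pts": i, "time_base": (1, 30)}
    encoder.send_frame(frame)
    for packet in encoder.receive_packets():
        setup_encoded_packet(packet, frame, (1, 90000))
        print(f"packet pts={packet['pts']} tagged while submitting frame pts={frame['pts']}")

for packet in encoder.receive_packets(flushing=True):
    setup_encoded_packet(packet, None, (1, 90000))
    print(f"flush packet pts={packet['pts']} time_base={packet['time_base']}")
```

Running this, the packet for pts 0 only comes out while frame 3 is being submitted, and the flush emits packets with no corresponding frame at all; that is exactly the mismatch described next.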
The underlying assumption here is that the packets we receive are related to the frame we just submitted, but this is not necessarily the case. The reason a "flushing" mechanism exists at all is precisely that the codec may be buffering.
H.264 is a good example here: if you look at the timeline you can clearly see some buffering going on. You get no packets for a long time even though you keep submitting frames, and conversely the flush produces packets even though you didn't submit a frame, so setup_encoded_packet is operating on a wrong assumption.
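This is easy to observe with PyAV itself. A minimal sketch, assuming PyAV and numpy are installed (the exact packet counts and delay depend on the libx264 build and its settings):

```python
import av
import numpy as np

container = av.open("out.mp4", "w")
stream = container.add_stream("h264", rate=30)
stream.width = 320
stream.height = 240
stream.pix_fmt = "yuv420p"

for i in range(30):
    img = np.zeros((240, 320, 3), dtype=np.uint8)
    frame = av.VideoFrame.from_ndarray(img, format="rgb24")
    frame.pts = i
    packets = stream.encode(frame)
    print(f"frame {i}: {len(packets)} packet(s)")  # typically 0 for the first frames
    for packet in packets:
        container.mux(packet)

# Flushing: packets come out even though no frame was submitted.
for packet in stream.encode():
    print(f"flush packet: pts={packet.pts}")
    container.mux(packet)

container.close()
```

With a typical libx264 configuration, the first several encode() calls return empty lists and the flush loop then drains the buffered packets, matching the timeline described above.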
Also, it's not clear to me why setup_encoded_packet doesn't crash when given a None frame, since it reaches into frame._time_base?