[FFmpeg-devel] AVCHD/H.264 decoder: further development/corrections
Ivan Schreter
schreter
Sun Jan 25 20:08:06 CET 2009
Hello * (especially h264 maintainers),
In the last few days I tried to find out what has to be done in the H.264
decoder in order to correctly support AVCHD files from full-HD
camcorders (which is IMHO quite an important use case). I admit I'm a
bit selfish there, since I want to get my Panasonic HDC-SD9 fully
supported :-). But I'm also willing to invest more time and to fix
the code. However, I need some advice, preferably somewhat more detailed.
I identified the following problems and potential solutions:
1. Inconsistency between packets returned via av_read_frame() and
actually delivered full frames from avcodec_decode_video()
2. Key frame calculation and seeking
3. Reporting frame type to libavformat
Now the details:
*1. Inconsistency between packets and decoded frames*
The H.264 decoder returns AVPackets via av_read_frame(), which contain
either a full frame or just a field (half a frame). The former case is
not problematic, since decoded frames map 1:1 to returned packets. It is
problematic, though, when the demuxer returns packets which do NOT
correspond to a full frame. This is the case for interlaced AVCHD video
as produced by various full-HD camcorders (at least Panasonic, Sony and
Canon). The H.264 standard allows coding by field, so one picture in
H.264 terms (as currently returned as an AVPacket from av_read_frame())
can contain a single field, two fields (a frame) or even repeated
fields (so 1-3 fields per AVPacket in total).
I'd concentrate first on H.264 pictures having 1 or 2 fields only, since
the other case (3 fields per picture) is probably not that interesting
now (it is used to quasi-stretch original cinema material to
television frame rates).
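The field count per coded picture can be derived from the pic_struct value of the picture timing SEI message (H.264 spec, Table D-1). A minimal sketch of that mapping; the function name is invented for this example, and the real decoder would of course read pic_struct from the bitstream:

```c
/* Number of fields carried by one coded picture, keyed by the
 * pic_struct value from the picture timing SEI (H.264 Table D-1).
 * Values 5 and 6 are the "repeated field" 3-field pictures used
 * for pulled-up cinema material. */
static int fields_in_picture(int pic_struct)
{
    switch (pic_struct) {
    case 0:         return 2; /* (progressive) frame: two fields' worth */
    case 1: case 2: return 1; /* single top / bottom field */
    case 3: case 4: return 2; /* top+bottom or bottom+top field pair */
    case 5: case 6: return 3; /* field pair plus repeated first field */
    default:        return 2; /* frame doubling/tripling: one frame coded */
    }
}
```

So the 1-3 fields per AVPacket mentioned above correspond exactly to pic_struct values 1/2, 0/3/4 and 5/6.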
Although the decoder itself takes this into account, the interface in
libavformat doesn't. Thus, currently only video with full frames per
packet decodes really correctly (and even that only with the
not-yet-applied patch concerning frame types). Reason: av_read_frame()
doesn't return whole frames, although it is documented to do so.
*Potential solution:* For field pictures, delay returning a packet from
h264_parse() until the second field picture has also been read. The
decoder should then take care of decoding both fields correctly and
returning a full frame for each packet.
*Alternative solution:* Return the field packet from h264_parse()
immediately, but somehow tell libavformat that the packet does not
represent a full frame and that the second field has to be read as well.
Read it in libavformat, extending the existing packet, so that
av_read_frame() then returns a full frame.
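The first solution boils down to a small state machine in the parser: buffer the first field and only emit once the complementary field has been appended. A minimal sketch under invented names (the real parser works on the h264 context and AVPacket buffers, and would also have to handle the pathological case of two first fields in a row):

```c
#include <string.h>

#define MAX_PKT 4096

/* Illustrative "delay until second field" combiner. */
struct field_combiner {
    unsigned char buf[MAX_PKT];
    int size;               /* bytes buffered so far */
    int have_first_field;   /* 1 = waiting for the complementary field */
};

/* Feed one coded picture; returns the combined size once a full frame
 * is ready to be emitted as a packet, or 0 while still waiting for the
 * second field. */
static int feed_picture(struct field_combiner *c,
                        const unsigned char *data, int size,
                        int is_field)
{
    memcpy(c->buf + c->size, data, size);
    c->size += size;
    if (is_field && !c->have_first_field) {
        c->have_first_field = 1;   /* hold the packet back */
        return 0;
    }
    int out = c->size;             /* frame picture, or second field: emit */
    c->size = 0;
    c->have_first_field = 0;
    return out;
}
```

With this, a frame picture passes through unchanged, while two field pictures come out as one combined packet, which is exactly what keeps the av_read_frame() contract intact.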
*No solution:* Leave libavformat and h264_parse as they are and handle
the second half-frame in ffmpeg.c and other libavformat users. This
won't work, as we would need to adjust the API and thus every single
program using ffmpeg to correctly handle field frames. Furthermore,
libavformat computes a wrong DTS/PTS for the second field (equal to the
DTS/PTS of the first field of the _next_ frame instead of in between,
since the second field doesn't specify a DTS/PTS at all), which causes
do_video_out() to drop and duplicate frames, producing very jerky video.
*No solution 2:* Tell libavformat which field of the full frame the
returned packet contains and adjust the DTS/PTS calculation in
compute_pkt_fields() appropriately, returning last_DTS+duration/2 and
last_PTS+duration/2 for the DTS/PTS of the second field. Again, this is
an API change, since av_read_frame() would not return full frames.
Though it works in ffmpeg.c, it is unclear whether it works in other
programs using libavformat (probably not).
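For clarity, the half-duration placement in the last two variants is just plain timestamp arithmetic (the function name is illustrative, not anything in compute_pkt_fields()):

```c
/* A second field carries no DTS/PTS of its own, so the natural value
 * is half a frame duration after the first field's timestamp, instead
 * of the next frame's timestamp as computed today. */
static long long second_field_ts(long long first_field_ts,
                                 long long frame_duration)
{
    return first_field_ts + frame_duration / 2;
}
```

E.g. for 25 fps material in 90 kHz ticks (frame duration 3600), a first field at 9000 puts the second field at 10800, not at 12600 where the next frame starts.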
Now the question: which solution is the "right" one? I'd go for the
first one, or possibly the alternative. The first proposed solution
seems the most "compatible", since we don't need to extend AVPacket to
address the issue.
Your opinions? Or eventually a different idea?
*2. Key frame calculation and seeking*
H.264 differs from other video codecs in that it doesn't have fixed
key frames. Instead, several reference pictures from the history can be
used to decode a particular picture. There are IDR pictures, which are
effectively key frames, but these don't seem to be really used: AVCHD
files from camcorders have exactly one IDR frame at the beginning of the
file.
Other than that, the stream provides information (SEI recovery point)
about how many frames need to be decoded before the video synchronizes
starting from the given point. There is already a field
AVPacket.convergence_duration, which is supposed to address exactly this
(until now unused in h264, though).
My suggestion is to report key frames for IDR pictures and for the
appropriate frames after an SEI recovery point (after counting down the
number of frames given in the recovery point SEI message).
Alternatively, key frames could be reported for IDR pictures and for
pictures having a recovery point. In this case, the application would
have to handle it via AVPacket.convergence_duration. Unfortunately, no
one seems to handle convergence_duration in any application, and I don't
believe anyone would like to. So IMHO this is a no-go.
My suggestion for the current av_seek_frame() would be the following for
streams needing convergence_duration (is there a flag for it already?):
when seeking to a certain PTS, seek to the frame with the given PTS and
then roll *backward* until the last frame with a recovery point whose
convergence_duration <= distance is found (how to find it most
efficiently?), and then re-decode all reference frames (i.e., skipping
unneeded B-frames) into dummy buffers from this point up to just before
the given PTS. That way, the next av_read_frame() will read a key frame
which can be decoded correctly, and the application doesn't have to
handle convergence_duration by itself.
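The backward roll is essentially a search over known recovery points. A minimal sketch over a sorted index, using frame numbers in place of PTS and invented structure names (the real implementation would work on byte positions and timestamps from the index):

```c
struct recovery_entry {
    int frame;                 /* position of the frame carrying the SEI */
    int convergence_duration;  /* frames until output is correct again */
};

/* Walk backward from the seek target through the recovery-point index
 * (sorted by frame) and return the index of the entry to start
 * re-decoding from, or -1 if no recovery point fits before 'target'. */
static int find_seek_start(const struct recovery_entry *list, int n,
                           int target)
{
    for (int i = n - 1; i >= 0; i--) {
        if (list[i].frame <= target &&
            target - list[i].frame >= list[i].convergence_duration)
            return i;
    }
    return -1;
}
```

Note the condition: a recovery point closer to the target than its own convergence_duration is useless, which is why the search may have to skip over the nearest one.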
Michael suggested a new seeking API. Maybe this should be addressed
there via a flag (seek to the frame with the recovery point and use
convergence_duration in the application, or let libavformat decode up
to the key frame as described above), but for now an alternative needs
to be implemented for the current seeking API.
Furthermore, I'd propose keeping a small cache of (PTS, position,
convergence_duration) triples for frames containing a recovery point
SEI message, so that seeking around the "current" location would be
faster. Reason: video editing software, where one often needs to seek
one frame forward/backward.
Your opinions/suggestions?
*3. Reporting frame type to libavformat*
This is a minor thing, but still important for correct computation of
PTS/DTS and key frame flags. compute_pkt_fields() relies on having
information about the picture type (I/P/B-frame). However, H.264 doesn't
have strict I/P/B frames; it is even possible to have slices of mixed
types inside one frame. Indeed, in interlaced mode my camcorder produces
the top field as an I-slice and the bottom field as a P-slice referring
to the top field.
So my suggestion is: report picture type I-frame for key frames (which
frames are key frames is discussed above), report P-frame for all frames
containing only P- and I-slices, and report B-frame for frames that also
contain B-slices.
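The slice-based part of that rule is a simple fold over the slice types seen in one frame. A sketch with illustrative constants (the values below happen to match slice_type % 5 in the H.264 slice header, but the decoder would use its own enums):

```c
enum { SLICE_P = 0, SLICE_B = 1, SLICE_I = 2 };

/* Any B-slice makes the frame a B-frame; otherwise any P-slice makes
 * it a P-frame; a frame of only I-slices is an I-frame. Key frames
 * would additionally be forced to 'I' as described above. */
static char frame_type(const int *slice_types, int n)
{
    int has_p = 0, has_b = 0;
    for (int i = 0; i < n; i++) {
        if (slice_types[i] == SLICE_B) has_b = 1;
        if (slice_types[i] == SLICE_P) has_p = 1;
    }
    return has_b ? 'B' : (has_p ? 'P' : 'I');
}
```

So my camcorder's mixed I-slice/P-slice frames would simply be reported as P-frames, which matches how compute_pkt_fields() expects them to behave.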
Your opinions/suggestions?
Thanks in advance.
Regards,
Ivan