[FFmpeg-devel] [PATCH]?avcodec_decode_audio3?and?multiple?frames in a packet

Thu Sep 17 13:37:10 CEST 2009

Hi,

On Mittwoch, 16. September 2009, Michael Niedermayer wrote:
> On Wed, Sep 16, 2009 at 09:24:24PM +0200, Sascha Sommer wrote:
> > Hi,
> >
> > On Mittwoch, 16. September 2009, Michael Niedermayer wrote:
> > > On Wed, Sep 16, 2009 at 05:51:03PM +0200, Sascha Sommer wrote:
> > > > Hi,
> > > >
> > > > On Mittwoch, 16. September 2009, Michael Niedermayer wrote:
> > > > > On Wed, Sep 16, 2009 at 02:52:29PM +0200, Sascha Sommer wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Samstag, 12. September 2009, Justin Ruggles wrote:
> > > > > > > Michael Niedermayer wrote:
> > > > > > > > On Fri, Sep 11, 2009 at 12:13:02PM +0200, Sascha Sommer
> > > > > > > > wrote: [...]
> > > > > > > >
> > > > > > > >> Previously a value of 0 meant that no frame was decoded.
> > > > > > > >
> > > > > > > > no
> > > > > > > > you read the docs backward, it says if no frame was decoded
> > > > > > > > return 0 it does not say that 0 means no frames have been
> > > > > > > > decoded, it could equally well mean 0 bytes used
> > > > > > >
> > > > > > > Ah, good.  So, although the current text is technically correct
> > > > > > > if interpreted that way, it is ambiguous.  Why do we need to
> > > > > > > have a 0 return value also possibly mean no frames have been
> > > > > > > decoded?   If frame_size_ptr is set to 0, that always means no
> > > > > > > frames have been decoded, without regard to the return value. 
> > > > > > > And a return value of 0 should mean zero bytes were used,
> > > > > > > without regard to what frame_size_ptr is set to.  They seem
> > > > > > > mutually exclusive to me...
> > > > > >
> > > > > > I agree. The return value controls the number of input bytes,
> > > > > > frame_size_ptr the number of output bytes. I don't see why 0
> > > > > > needs to be returned when no frame was outputted.
> > > > >
> > > > > what exactly did the decoder then do with the data?
> > > > > and what was that data it did not decode?
> > > >
> > > > It maybe skipped it? For example when the packet contained only DSE
> > > > syntax elements in AAC. I did not check the spec if this can happen
> > > > or for what these Data Stream Elements are actually used but as we
> > > > already found out ffmpeg does
> > > > ? ? ? ? ? ? ? ? avpkt.data += ret;
> > > > ? ? ? ? ? ? ? ? avpkt.size -= ret;
> > > > So this will decode always the same data.
> > >
> > > yes
> > > a packet that is input to a decoder MUST in general contain 1 frame.
> >
> > 1 frame that can be decoded by the decoder, right?
> > But can we under all circumstances know what the decoder produces out of
> > this 1 input frame?
>
> no it can have an error and return a negative value representing the type
> of error

Maybe we are talking about different things. Let me try to explain again where 
I have a problem with understanding the current version of 
avcodec_decode_audio3:

avcodec_decode_audio3 says the following:
 * @return On error a negative value is returned, otherwise the number of 
bytes
 * used or zero if no frame could be decompressed.

How is that written in C (probably the wrong interpretation)?

        if(error)       //On error a negative value is returned
            return error;
        if(number_of_bytes_used) //otherwise the number of bytes used
            return number_of_bytes_used;
        if(!*data_size) //or zero if no frame could be decompressed
            return 0;
         ... //what to return here?
      I think  this can be reached with CODEC_CAP_DELAY when avpkt->size == 0
      So there should also be return 0

So this looks to me the same as
        if(error)
            return error;
        if(number_of_bytes_used)
            return number_of_bytes_used;
and the "or zero if no frame could be decompressed." can be removed from the 
API description. When number_of_bytes_used is 0 return 0.

However, judging from the discussion, I think the API description has to be 
seen as.

           if(error)
               return error;
            if(!*data_size)
               return 0;
            return number_of_bytes_used.

This would mean that 0 is always returned when there is no frame, no matter 
what the value of number_of_bytes_used is.

ffmpeg.c contains the following code:

    while (avpkt.size > 0 || (!pkt && ist->next_pts != ist->pts)) {
...
                ret = avcodec_decode_audio3(ist->st->codec, samples, 
&data_size,
                                            &avpkt);
                if (ret < 0)
                    goto fail_decode;
                avpkt.data += ret;
                avpkt.size -= ret;
                /* Some bug in mpeg audio decoder gives */
                /* data_size < 0, it seems they are overflows */
                if (data_size <= 0) {
                    /* no audio frame */
                    continue;
                }
...

So when there is no frame (ret = 0, data_size <= 0) ffmpeg will call 
avcodec_decode_audio3 again with the same avpkt without incrementing 
avpkt.data and without decreasing avpkt.size.
Now it depends on the decoder how it handles that.
I think there could be an endless loop when the packet is interpreted the same 
way as during the first call or there might be a bitstream error when the 
decoder expected a follow-up packet.

If I understand you correctly ("a packet that is input to a decoder MUST in 
general contain 1 frame" and "my definition of frame contains a positive 
number of samples, not 0 and not 
negative"), the previously described case is not allowed to happen.
So why is this not a real error then?

I always assumed this could happen and this is why I thought that the API not 
only needs to be clarified but also should be changed so that it always 
returns the number of consumed bytes when there was no error to stay 
consistent with the case when a frame is returned and to have an unambiguous 
indicator of how many bytes have been used from the packet. When "no bytes 
used from the input packet" is the same as "bytes used from the input packet 
but no frame returned", the data_size variable needs to be checked to decide 
if the next packet is to be decoded or the same packet is to be decoded 
again.
Otherwise it would have been enough to check if all bytes from a packet have 
been consumed or if an error occurred.

>
> > > Now, there are formats that use inseperable frames intermingled where
> > > a decoder hs to be feeded with more than 1 frame.
> > > packets that contain no frame at all would have a duration of 0, would
> > > have dts equal to the next packet would _not_ have a pts because they
> > > are not presented, practically no container could store them ...
> >
> > Still there could be a more or less "raw" format X that contains only the
> > information that it contains an audio frame of format Y with len Z bytes.
> > Audio format Y could have variable samples per frame, allowing even 0
> > samples
>
> We also could consider changing the API once these X-Y-Z turns up in
> reality besides my definition of frame contains a positive number of
> samples, not 0 and not negative
>

Maybe frame was not even the right name in the description of X-Y-Z. 
What comes out of the demuxer are packets. And my "definition" of a packet is 
simply a block of data. Such blocks of data can become something else when 
there is a decoder that knows how to understand the packet.
With wmapro and other codecs one such packet can become multiple frames and I 
assumed that also the other direction is possible: Sometimes such a packet 
may not contain enough data for a frame without that condition being an 
error.

I can't name real formats similar to X-Y-Z where frames merely change the 
state of the decoder and do not result in output samples.
I checked the AAC spec and a packet always has to result in samples so my 
example with a packet consisting of pure DSE elements does not happen.
I was unable to create a wmapro stream that had a first packet that contained 
one and a half compressed frames (for wmapro the imdct output for the first 
frame is discarded so this would cause such a no samples case)
I was unable to seek to a wmapro packet that did not contain at least one 
complete frame so maybe this really is not allowed in wmapro then.
However changing avcodec_decode_audio3 to:

       ret = avctx->codec->decode(avctx, samples, frame_size_ptr, avpkt);
        avctx->frame_number++;

        if (*frame_size_ptr == 0) {
           return 0;
        }

seems to result in an endless loop for 
http://samples.mplayerhq.hu/A-codecs/lossless/luckynight.flac
So at least ffmpeg.c would need to be changed then and there should a better 
description on how to handle the return values of avcodec_decode_audio3. This 
is not clear from the description.

> > (these frames could for example contain some decode tables for the next
> > frames)
> > As format X is meant to be generic it does not know what audio format Y
> > stores in its frames and can't concatenate the frames so that always one
> > frame is outputted.
> >
> > > please dont change the API to support this, at least not without first
> > > explaining me how it could work on the muxer side or even how the
> > > demuxer side should deal with all the special cases for timestamps, it
> > > surely does not currently
> >
> > I think when the format is decoded, we only need to think about what
> > comes out of the decoder and here, when a new sample is outputted, the
> > sum of all previously outputted samples can be used to calculate the pts.
>
> not in reality, no sound recording device has a perfect clock, the sample
> rate is not exact nor is it perfectly constant.
> you might be lucky of course that its good enough or that it has been
> resampled before muxing but if not your sum of samples can differ from the
> correct pts.
> of course its good enough for a single packet but possibly not a whole file

Yes but either we have pts stored in the container and we can use it or we 
don't have pts and then we have to calculate it otherwise. And if the 
inventor of X-Y-Z thinks that pts is to be calculated from the number of 
outputted samples, it should not matter if the decoder sometimes returns 
without a frame as long as all returned frames have the correct pts.

> > Input packets
> > with duration 0 do not need to result in an output packet, or how are
> > they currently handled?
>
> like a bug in the demuxer, besides duration==0 means unknown duration in
> the code not zero duration.
>

I did not know that.

> > Muxing the compressed data is another story but please let's figure out
> > how to do the decoding first.
>
> both muxing and decoding work in ffmpeg since a few years, a change to the
> API will have to keep both functioning. So what is it that you meant?
>
> (yes i do feel a little upset about the API change discussion prior to
>  ANY exlpanation why the current would be worse than the proposed)
> I would really prefer if you first would describe a _real_ situation in
> which the current is insufficent.
>

It was not my intention to upset you, sorry. I think that the description of 
avcodec_decode_audio3 is unclear and by trying to understand it
I had the impression that there are use cases that are not handled well by the 
current API. However it looks like they only need to be handled differently 
in ffmpeg.c and the API can stay as is.

Regards

Sascha