[FFmpeg-devel] [PATCHv3] VP4 video decoder

James Almer jamrial at gmail.com
Sun May 19 17:20:06 EEST 2019


On 5/19/2019 4:22 AM, Peter Ross wrote:
> On Fri, May 17, 2019 at 08:13:51PM +0200, Reimar Döffinger wrote:
>> On Fri, May 17, 2019 at 08:09:45PM +1000, Peter Ross wrote:
>>> ah, i see what you did there! it works perfectly, it is just missing
>>> UPDATE_CACHE at the start and in the loop.
>>>
>>> test results on an i7 decoding a 3-minute 4K video with VP4.
>>
>> Looks fairly close to noise to me, though to me it made it a bit
>> more obvious how the encoding works (which was the primary reason
>> to suggest it).
>> If one really wanted to optimize it for performance, the
>> arrangement of the conditions could probably be improved, e.g.
>> the 0x1ff check is now the very first one even though
>> it is the least likely one (but it avoids duplicating code
>> or needing crazy goto or loop constructs and is thus
>> more readable), and depending on the probabilities,
>> doing the range checks in a more tree-like structure
>> might also be better.
>> But as said, optimizing this probably has at most
>> curiosity value :)
> 
> i like your while loop, it makes the code more obvious.
> i don't have enough source sequences (or interest) to generate
> those probabilities.
> 
> another run, on an original Raspberry Pi with a 720p sequence.
> 
> vp4 patch v3:
> bench: utime=110.393s stime=5.232s rtime=115.835s
> bench: utime=112.869s stime=4.981s rtime=118.235s
> bench: utime=111.737s stime=5.220s rtime=117.168s
> bench: utime=113.265s stime=5.250s rtime=118.730s
> bench: utime=112.638s stime=5.120s rtime=117.938s
> bench: utime=110.732s stime=5.190s rtime=116.133s
> bench: utime=111.218s stime=5.013s rtime=116.444s
> bench: utime=111.096s stime=4.768s rtime=116.076s
> bench: utime=111.318s stime=5.073s rtime=116.603s
> 
> your patch + UPDATE_CACHE:
> bench: utime=111.583s stime=5.421s rtime=117.251s
> bench: utime=112.145s stime=4.799s rtime=117.155s
> bench: utime=111.235s stime=5.552s rtime=116.967s
> bench: utime=112.169s stime=5.248s rtime=117.628s
> bench: utime=112.424s stime=5.178s rtime=117.813s
> bench: utime=112.721s stime=5.182s rtime=118.115s
> bench: utime=112.753s stime=5.162s rtime=118.125s
> bench: utime=111.587s stime=5.267s rtime=117.065s
> bench: utime=112.641s stime=4.952s rtime=117.805s
> 
> averaging the results, your patch is only marginally slower,
> but there is basically no difference.
> 
> converting your patch to use ordinary get_bits/show_bits, however,
> does make it slower, by approximately 0.4 seconds on average:
> 
> bench: utime=114.315s stime=5.292s rtime=119.820s
> bench: utime=111.625s stime=5.481s rtime=117.317s
> bench: utime=112.706s stime=5.063s rtime=117.982s
> bench: utime=111.707s stime=5.221s rtime=117.137s
> bench: utime=113.653s stime=5.451s rtime=119.318s
> bench: utime=113.559s stime=5.332s rtime=119.104s
> bench: utime=113.149s stime=5.261s rtime=118.621s
> bench: utime=112.254s stime=5.401s rtime=117.868s
> bench: utime=112.954s stime=5.532s rtime=118.698s
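
For readers without the patch at hand, the macro-based form being
benchmarked looks roughly like the sketch below. This is only an
illustration with invented widths, values and semantics (read_value,
the 9-bit fields and the 0x1ff escape are placeholders), not the actual
vp4 code; the point it shows is that UPDATE_CACHE has to be issued once
before the first SHOW_UBITS and again whenever bits are consumed inside
the loop.

/* sketch only: uses the cache macros from libavcodec/get_bits.h */
static int read_value(GetBitContext *gb)
{
    int v = 0;

    OPEN_READER(re, gb);
    UPDATE_CACHE(re, gb);                    /* fill the cache before the first peek */
    while (SHOW_UBITS(re, gb, 9) == 0x1ff) { /* escape code: accumulate and read again */
        v += 0x1ff;
        LAST_SKIP_BITS(re, gb, 9);
        UPDATE_CACHE(re, gb);                /* refill after consuming bits in the loop */
    }
    v += SHOW_UBITS(re, gb, 9);
    LAST_SKIP_BITS(re, gb, 9);
    CLOSE_READER(re, gb);

    return v;
}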

Is it really important? VP4 samples are apparently not even available in
the wild. This decoder matters more, IMO, as an open-source reference
implementation for the codec than as a speed-focused implementation, so
simplicity and readability should be the priority, and standard
get_bits() calls provide that.
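
For comparison, the same loop written against the public API looks
roughly like this (again a sketch with placeholder values, not the
actual vp4 code):

/* sketch only: plain calls from libavcodec/get_bits.h */
static int read_value(GetBitContext *gb)
{
    int v = 0;

    while (show_bits(gb, 9) == 0x1ff) { /* escape code: accumulate and read again */
        v += 0x1ff;
        skip_bits(gb, 9);
    }
    v += get_bits(gb, 9);

    return v;
}

The cache handling disappears entirely, and the reader no longer has to
know where UPDATE_CACHE is required.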

A 0.4-second speed boost on a 720p sample on a Raspberry Pi is not worth
making the code harder to read and port (you can easily port a get_bits
call to a different bit reader framework, something you can't do if you
instead use the internal macro definitions).
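
To illustrate the porting point: the plain-call version maps one-to-one
onto any reader that offers peek/skip/read. A deliberately trivial,
hypothetical stand-in (MiniReader and the mini_* helpers are made up
for this example) is enough to show the shape of the mapping:

#include <stddef.h>
#include <stdint.h>

/* hypothetical minimal MSB-first bit reader, for illustration only */
typedef struct MiniReader {
    const uint8_t *buf;
    size_t pos; /* bit position */
} MiniReader;

static unsigned mini_peek(const MiniReader *r, int n)
{
    unsigned v = 0;
    for (int i = 0; i < n; i++)
        v = (v << 1) | ((r->buf[(r->pos + i) >> 3] >> (7 - ((r->pos + i) & 7))) & 1);
    return v;
}

static void mini_skip(MiniReader *r, int n) { r->pos += n; }

static unsigned mini_read(MiniReader *r, int n)
{
    unsigned v = mini_peek(r, n);
    r->pos += n;
    return v;
}

/* the get_bits() loop ports by renaming three calls; the cache-macro
   version cannot, because it exposes the get_bits.h refill machinery */
static int read_value_mini(MiniReader *r)
{
    int v = 0;
    while (mini_peek(r, 9) == 0x1ff) {
        v += 0x1ff;
        mini_skip(r, 9);
    }
    return v + mini_read(r, 9);
}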

