[FFmpeg-devel] [PATCH] avoid mb_xy recalculation in h264

Sun May 11 03:03:40 CEST 2008

On Sat, May 10, 2008 at 07:40:44PM -0400, Alexander Strange wrote:
>
> On May 10, 2008, at 7:56 AM, Michael Niedermayer wrote:
>
>> On Wed, May 07, 2008 at 10:42:04PM -0400, Alexander Strange wrote:
>>>
>>> On May 7, 2008, at 3:15 PM, Michael Niedermayer wrote:
>>>
>>>> On Wed, May 07, 2008 at 02:49:20PM -0400, Alexander Strange wrote:
>>>>>
>>>>> On May 7, 2008, at 8:53 AM, Michael Niedermayer wrote:
>>>>>
>>>>>> On Wed, May 07, 2008 at 03:05:47AM -0400, Alexander Strange wrote:
>>>>>>> This line appears in a lot of h264.c functions:
>>>>>>> const int mb_xy= s->mb_x + s->mb_y*s->mb_stride;
>>>>>>>
>>>>>>> It only changes once per MB, so we can add it to the context and
>>>>>>> only
>>>>>>> recalculate it in decode_slice().
>>>>>>> I put it at the end of H264Context, but it could be moved up
>>>>>>> earlier if
>>>>>>> someone likes it there. The position doesn't seem to affect
>>>>>>> performance
>>>>>>> much; it can't be moved next to mb_x/mb_y since that would put it
>>>>>>> in
>>>>>>> MpegEncContext, and none of the other codecs use it.
>>>>>>>
>>>>>>> Patches:
>>>>>>> 1- does that, updating it for MBAFF when needed
>>>>>>> 2- removes some newly unused variables
>>>>>>> 3- does the same thing to svq3.c
>>>>>>>
>>>>>>> Before: avg 8.799s max 8.8s min 8.794s
>>>>>>> After: avg 8.645s max 8.651s min 8.641s
>>>>>>
>>>>>> Knowing the cpu and compiler would also be usefull.
>>>>>
>>>>> gcc 4.0.1, core 2 x86-32 Darwin
>>>>>
>>>>> I guess it's about the same as the usual distro compiler, but this
>>>>> isn't
>>>>> really a sensitive messing-with-the-register-allocator patch.
>>>>
>>>> How does the change compare to a splited (litterally duplicated)
>>>> decode_cabac_residual()
>>>> with the mb_xy only calculated for the "cat"s which actually need it?
>>>> That is one for cat=0/3 and one for the others
>>>> or
>>>> with the mb_xy calc moved in the if(cat=0/3) which use it?
>>>
>>> Actually, the compiled decode_cabac_residual is barely different at
>>> all. gcc loads h->mb_xy, immediately spills it to the same place in
>>> the stack, and the rest of the function is the same. I suppose
>>> changing mb_xy to h->mb_xy for the DC cats might be an improvement.
>>>
>>> Diffing the asm shows:
>>> * less imull
>>> * hl_motion is different (mb_x load moved into an if())
>>> * write_back_intra_pred_mode() is inlined and removed
>>> * slightly different stack sizes (hard to know if that means anything)
>>>
>>> Making write_back_intra_pred_mode() always_inline doesn't change cavlc
>>> time, but explains about half the speed change for cabac.
>>
>> My question is still the same, how does adding mb_xy into the context
>> compare (speedwise/benchmark) to changing the decode_cabac_residual()
>> so it does not use mb_xy 90% of the time?
>>
>> I think the main speed difference (besides inline changes) is in
>> decode_cabac_residual()
>> its kinda logic if you consider that mb_xy is calculated 16 times alone
>> for the luma blocks but not used at all. Changing that to loading it from 
>> the
>> context is not correct. decode_cabac_residual() should not touch mb_xy
>> when it does not need it.
>>
>> That is what id like to see is
>> How much speed difference there is between svn and the calculation of
>> mb_xy moved into the cats which need it?
>> How much speed gain we can achive by making write_back_intra_pred_mode()
>> inline if any.
>> What speed differenc remains between the best combination of above and in
>> addition putting mb_xy in the context?
>>
>> What iam trying to do is chop the changes up into small parts and see 
>> their
>> individual contriution, so we do not commit something which might have a
>> negative speed impact by mistake.
>> Also if we know if X is better inlined or not inlined we can force gcc
>> to always do what is better.
>
> I tried it with diffferent benchmarking methods and compilers (gcc 4.2).
>
> As is -
> avg 10.933s min 10.932s max 10.934s
>
> Moving mb_xy calculation into if (cat == 0/3) -
> avg 10.89s min 10.888s max 10.892s
>
> Using h->mb_xy just in if (cat == 0/3) (attached patch) -
> avg 10.888s min 10.887s max 10.891s
>
> Original patch -
> avg 10.856s min 10.855s max 10.859s
>

> Patch + attached patch -
> avg 10.826s min 10.823s max 10.828s

ok, (seperate commits thouh where possible)

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The greatest way to live with honor in this world is to be what we pretend
to be. -- Socrates
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080511/4c6e2f5a/attachment.pgp>