[FFmpeg-devel] [PATCH 5/7] ARM: NEON optimised H.264 8x8 and 16x16 qpel MC
Måns Rullgård
mans at mansr.com
Mon Dec 8 18:25:44 CET 2008
"Ian Caulfield" <ian.caulfield at gmail.com> writes:
> 2008/12/8 Ian Caulfield <ian.caulfield at gmail.com>:
>> 2008/12/8 Måns Rullgård <mans at mansr.com>:
>>> "Ian Caulfield" <ian.caulfield at gmail.com> writes:
>>>
>>>> 2008/12/5 Mans Rullgard <mans at mansr.com>:
>>>>
>>>>> +
>>>>> + vshl.i16 q3, q1, #4
>>>>> + vshl.i16 q1, q1, #2
>>>>> + vshl.i16 q15, q2, #2
>>>>> + vadd.i16 q1, q1, q3
>>>>> + vadd.i16 q2, q2, q15
>>>>> +
>>>>> + vshl.i16 q3, q9, #4
>>>>> + vshl.i16 q9, q9, #2
>>>>> + vshl.i16 q15, q10, #2
>>>>> + vadd.i16 q9, q9, q3
>>>>> + vadd.i16 q10, q10, q15
>>>>> +
>>>>> + vsub.i16 q1, q1, q2
>>>>> + vsub.i16 q9, q9, q10
>>>>
>>>> Is this any faster? I don't know what the interlocking will be like,
>>>> nor whether you have a spare register to hold the scalar... (or even
>>>> if setting up the scalars would make it slower)
>>>>
>>>> vmul.i16 q1, q1, <scalar set to 6>
>>>> vmul.i16 q9, q9, <scalar set to 6>
>>>> vmls.i16 q1, q2, <scalar set to 3>
>>>> vmls.i16 q9, q10, <scalar set to 3>
>>>
>
> On further inspection, I think the following vadds could be merged in thus:
>
> vmla.i16 q0, q1, <scalar set to 20>
> vmla.i16 q8, q9, <scalar set to 20>
> vmls.i16 q0, q2, <scalar set to 5>
> vmls.i16 q8, q10, <scalar set to 5>
You're probably right. I'll look into it.
Thanks for taking the time to find these things.
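
For reference, a minimal sketch of why the multiply-accumulate form with 20 and 5 should match the shift/add sequence in the patch. It models a single 16-bit lane in plain Python rather than NEON. The function names, and the assumption that the "following vadds" simply add the result into q0/q8, are illustrative and not taken from the patch. The shifts by 4 and 2 sum to a factor of 20, and x + (x << 2) gives 5, so the constants 6 and 3 from the earlier suggestion would not reproduce the same value.

```python
MASK = 0xFFFF  # one 16-bit lane; .i16 arithmetic wraps modulo 2**16


def shift_add(acc, a, b):
    """Shift/add sequence from the patch, modelled for one lane."""
    t = (a << 4) & MASK      # vshl.i16 q3,  q1, #4   -> 16*a
    a = (a << 2) & MASK      # vshl.i16 q1,  q1, #2   ->  4*a
    u = (b << 2) & MASK      # vshl.i16 q15, q2, #2   ->  4*b
    a = (a + t) & MASK       # vadd.i16 q1,  q1, q3   -> 20*a
    b = (b + u) & MASK       # vadd.i16 q2,  q2, q15  ->  5*b
    d = (a - b) & MASK       # vsub.i16 q1,  q1, q2   -> 20*a - 5*b
    # Assumption: the vadd that follows in the patch accumulates into q0.
    return (acc + d) & MASK


def mul_acc(acc, a, b):
    """Proposed replacement: multiply-accumulate by scalar constants."""
    acc = (acc + 20 * a) & MASK   # vmla.i16 q0, q1, <scalar set to 20>
    acc = (acc - 5 * b) & MASK    # vmls.i16 q0, q2, <scalar set to 5>
    return acc


def lanes_agree(values):
    """Return True if both forms agree for every (acc, a, b) in values."""
    return all(
        shift_add(acc, a, b) == mul_acc(acc, a, b)
        for acc in values
        for a in values
        for b in values
    )
```

This only checks arithmetic equivalence. Whether the vmla/vmls version is actually faster depends on the multiplier latency and issue rules of the target core, which the sketch says nothing about.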
--
Måns Rullgård
mans at mansr.com