[FFmpeg-devel] [PATCH] NEON: put_pixels_clamped

Wed Apr 29 10:57:37 CEST 2009

On Apr 16, 2009, at 4:23 PM, Måns Rullgård wrote:

> David Conrad <lessen42 at gmail.com> writes:
>
>> On Apr 16, 2009, at 3:44 PM, David Conrad wrote:
>>
>>> On Apr 16, 2009, at 3:32 PM, Måns Rullgård wrote:
>>>
>>>> David Conrad <lessen42 at gmail.com> writes:
>>>>
>>>>> Hi,
>>>>>
>>>>> Apparently this is used for some wmv3 files in addition to the
>>>>> signed variant.
>>>>> < 1% faster decode.
>>>>>
>>>>>
>>>>> commit 38cac0d21d8308e077bb762d712ab7c19e8c826d
>>>>> Author: David Conrad <davedc at Kozue.local>
>>>>> Date:   Thu Apr 16 14:30:29 2009 -0400
>>>>>
>>>>>  NEON: put_pixels_clamped
>>>>>
>>>>> diff --git a/libavcodec/arm/dsputil_neon.c b/libavcodec/arm/
>>>>> dsputil_neon.c
>>>>> index 37425a3..9b95130 100644
>>>>> --- a/libavcodec/arm/dsputil_neon.c
>>>>> +++ b/libavcodec/arm/dsputil_neon.c
>>>>> @@ -42,6 +42,7 @@ void ff_put_pixels8_xy2_no_rnd_neon(uint8_t *,
>>>>> const uint8_t *, int, int);
>>>>> void ff_avg_pixels16_neon(uint8_t *, const uint8_t *, int, int);
>>>>>
>>>>> void ff_add_pixels_clamped_neon(const DCTELEM *, uint8_t *, int);
>>>>> +void ff_put_pixels_clamped_neon(const DCTELEM *, uint8_t *, int);
>>>>> void ff_put_signed_pixels_clamped_neon(const DCTELEM *, uint8_t *,
>>>>> int);
>>>>>
>>>>> void ff_put_h264_qpel16_mc00_neon(uint8_t *, uint8_t *, int);
>>>>> @@ -180,6 +181,7 @@ void ff_dsputil_init_neon(DSPContext *c,
>>>>> AVCodecContext *avctx)
>>>>>   c->avg_pixels_tab[0][0] = ff_avg_pixels16_neon;
>>>>>
>>>>>   c->add_pixels_clamped = ff_add_pixels_clamped_neon;
>>>>> +    c->put_pixels_clamped = ff_put_pixels_clamped_neon;
>>>>>   c->put_signed_pixels_clamped =  
>>>>> ff_put_signed_pixels_clamped_neon;
>>>>>
>>>>>   c->put_h264_chroma_pixels_tab[0] = ff_put_h264_chroma_mc8_neon;
>>>>> diff --git a/libavcodec/arm/dsputil_neon_s.S b/libavcodec/arm/
>>>>> dsputil_neon_s.S
>>>>> index f16293d..159ee64 100644
>>>>> --- a/libavcodec/arm/dsputil_neon_s.S
>>>>> +++ b/libavcodec/arm/dsputil_neon_s.S
>>>>> @@ -273,6 +273,30 @@ function ff_put_h264_qpel8_mc00_neon,  
>>>>> export=1
>>>>>       pixfunc2 put_ pixels8_y2,   _no_rnd, vhadd.u8
>>>>>       pixfunc2 put_ pixels8_xy2,  _no_rnd, vshrn.u16, 1
>>>>>
>>>>> +function ff_put_pixels_clamped_neon, export=1
>>>>> +        vld1.64         {d16-d19}, [r0,:128]!
>>>>> +        vqmovn.u16      d0, q8
>>>>> +        vld1.64         {d20-d23}, [r0,:128]!
>>>>> +        vqmovn.u16      d1, q9
>>>>> +        vqmovn.u16      d2, q10
>>>>> +        vld1.64         {d24-d27}, [r0,:128]!
>>>>> +        vqmovn.u16      d3, q11
>>>>> +        vqmovn.u16      d4, q12
>>>>> +        vld1.64         {d28-d31}, [r0,:128]!
>>>>> +        vqmovn.u16      d5, q13
>>>>> +        vqmovn.u16      d6, q14
>>>>> +        vst1.64         {d0},      [r1,:64], r2
>>>>> +        vqmovn.u16      d7, q15
>>>>> +        vst1.64         {d1},      [r1,:64], r2
>>>>> +        vst1.64         {d2},      [r1,:64], r2
>>>>> +        vst1.64         {d3},      [r1,:64], r2
>>>>> +        vst1.64         {d4},      [r1,:64], r2
>>>>> +        vst1.64         {d5},      [r1,:64], r2
>>>>> +        vst1.64         {d6},      [r1,:64], r2
>>>>> +        vst1.64         {d7},      [r1,:64], r2
>>>>> +        bx              lr
>>>>> +        .endfunc
>>>>
>>>> Shouldn't those be vqmovun.s16?  I'd also try to interleave them with
>>>> the loads and stores a bit more for better dual-issue opportunities.
>>>
>>> Unsigned pixels; MMX does the same (packuswb for put_pixels_clamped
>>> vs. packsswb for put_signed_pixels_clamped)
>>
>> Oops, you're right, I didn't read packuswb. New patch attached.
>>
>>> Also, the loads take two issue cycles since they're loading 4
>>> registers; shouldn't they be able to dual issue on both cycles?
>
> On Cortex-A8 NEON instructions with more than one issue cycle can
> dual-issue on the first or the last cycle but not both.

As discussed on IRC, all the multi-cycle NEON instructions I've checked
can dual-issue on both their first and last cycles.
I can't measure any speed difference doing it this way, but it looks a
bit more consistent with the other put_/add_ pixels_clamped functions.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ffmpeg-neon-put_pixels_clamped.txt
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20090429/fd4bba0a/attachment.txt>