[FFmpeg-devel] [PATCH] swscale/arm: add yuv2planeX_8_neon

Mon Apr 11 18:16:21 CEST 2016

On Mon, Apr 11, 2016 at 4:18 PM, Matthieu Bouron <matthieu.bouron at gmail.com>
wrote:

>
>
> On Mon, Apr 11, 2016 at 9:58 AM, Benoit Fouet <benoit.fouet at free.fr>
> wrote:
>
>> Hi,
>>
>> (again, thanks to both of you for documenting all this assembly/NEON
>> code)
>>
>> On 09/04/2016 10:22, Matthieu Bouron wrote:
>>
>>> From: Matthieu Bouron <matthieu.bouron at stupeflix.com>
>>>
>>> ---
>>>
>>> Hello,
>>>
>>> The following patch adds the yuv2planeX_8_neon function for the ARM
>>> platform. It is currently restricted to 8-bit-per-component sources
>>> until I fix the FATE failures with 10-bit sources (the dnxhd-*-10bit
>>> tests fail, but I haven't figured out yet where the failure comes
>>> from).
>>>
>>> Matthieu
>>>
>>> ---
>>>   libswscale/arm/Makefile  |  1 +
>>>   libswscale/arm/output.S  | 78 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>   libswscale/arm/swscale.c |  7 +++++
>>>   libswscale/utils.c       |  3 +-
>>>   4 files changed, 88 insertions(+), 1 deletion(-)
>>>   create mode 100644 libswscale/arm/output.S
>>>
>>> [...]
>>>
>>> diff --git a/libswscale/arm/output.S b/libswscale/arm/output.S
>>> new file mode 100644
>>> index 0000000..4437447
>>> --- /dev/null
>>> +++ b/libswscale/arm/output.S
>>> @@ -0,0 +1,78 @@
>>>
>>
>> [...]
>>
>>
>>> +function ff_yuv2planeX_8_neon, export=1
>>> +    push {r4-r12, lr}
>>> +    vpush {q4-q7}
>>> +    ldr                 r4, [sp, #104]          @ dstW
>>> +    ldr                 r5, [sp, #108]          @ dither
>>> +    ldr                 r6, [sp, #112]          @ offset
>>> +    vld1.8              {d0}, [r5]              @ load 8x8-bit dither values
>>> +    tst                 r6, #0                  @ check offsetting which can be 0 or 3 only
>>> +    beq                 1f
>>> +    vext.u8             d0, d0, d0, #3          @ honor offsetting which can be 3 only
>>> +1:  vmovl.u8            q0, d0                  @ extend dither to 16-bit
>>> +    vshll.u16           q1, d0, #12             @ extend dither to 32-bit with left shift by 12 (part 1)
>>> +    vshll.u16           q2, d1, #12             @ extend dither to 32-bit with left shift by 12 (part 2)
>>> +    mov                 r7, #0                  @ i = 0
>>> +2:  vmov.u8             q3, q1                  @ initialize accumulator with dithering values (part 1)
>>> +    vmov.u8             q4, q2                  @ initialize accumulator with dithering values (part 2)
>>> +    mov                 r8, r1                  @ tmpFilterSize = filterSize
>>> +    mov                 r9, r2                  @ srcp
>>> +    mov                 r10, r0                 @ filterp
>>> +3:  ldr                 r11, [r9], #4           @ get pointer @ src[j]
>>> +    ldr                 r12, [r9], #4           @ get pointer @ src[j+1]
>>> +    add                 r11, r11, r7, lsl #1    @ &src[j][i]
>>> +    add                 r12, r12, r7, lsl #1    @ &src[j+1][i]
>>> +    vld1.16             {q5}, [r11]             @ read 8x16-bit @ src[j  ][i + {0..7}]: A,B,C,D,E,F,G,H
>>> +    vld1.16             {q6}, [r12]             @ read 8x16-bit @ src[j+1][i + {0..7}]: I,J,K,L,M,N,O,P
>>> +    ldr                 r11, [r10], #4          @ read 2x16-bit coeffs (X, Y) at (filter[j], filter[j+1])
>>> +    vmov.16             q7, q5                  @ copy 8x16-bit @ src[j  ][i + {0..7}] for following inplace zip instruction
>>> +    vmov.16             q8, q6                  @ copy 8x16-bit @ src[j+1][i + {0..7}] for following inplace zip instruction
>>> +    vzip.16             q7, q8                  @ A,I,B,J,C,K,D,L,E,M,F,N,G,O,H,L
>>>
>>
>> nit: O,H,P
>
>
> Fixed.
>
> Patch updated, fixing the FATE failures with 10-bit sources (the code was
> not honoring the offset: tst r6, #0 has been replaced with cmp r6, #0).
> If there are no objections, I will push the patch in the next few hours.
>
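For readers wondering why the tst/cmp change quoted above matters: on ARM,
tst sets the flags from a bitwise AND of its operands, while cmp sets them
from a subtraction. ANDing with #0 always yields zero, so the Z flag is
always set and the following beq 1f is always taken, which skips the vext
rotation even when offset is 3. The C model below is only an illustrative
sketch of that flag behavior; the helper names (z_after_tst, z_after_cmp,
rotation_applied) are hypothetical and are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Z flag after `tst rN, #imm`: flags are derived from (rN AND imm). */
static bool z_after_tst(uint32_t reg, uint32_t imm)
{
    return (reg & imm) == 0;
}

/* Z flag after `cmp rN, #imm`: flags are derived from (rN - imm). */
static bool z_after_cmp(uint32_t reg, uint32_t imm)
{
    return (reg - imm) == 0;
}

/* `beq 1f` branches over the vext rotation whenever Z is set,
 * so the rotation only runs when Z is clear. */
static bool rotation_applied(bool z_flag)
{
    return !z_flag;
}
```

With offset = 3, rotation_applied(z_after_tst(3, 0)) is false (the dither
rotation is skipped), while rotation_applied(z_after_cmp(3, 0)) is true.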

Patch applied.

Matthieu
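
As a reading aid for the quoted assembly, here is a scalar C sketch of what
the NEON loop appears to compute: the accumulator starts at the dither value
shifted left by 12, then each filter tap adds src[j][i] * filter[j]. The
(i + offset) & 7 indexing corresponds to the vext.u8 rotation by 3. The
final shift right by 19 and the clip to 8 bits are not visible in the
excerpt above; they are an assumption based on my understanding of the C
reference in libswscale. The names yuv2planeX_8_sketch and clip_uint8 are
hypothetical.

```c
#include <stdint.h>

/* Clamp an int to the 0..255 range of an 8-bit sample. */
static uint8_t clip_uint8(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* Scalar sketch of the vertical filter; parameter order follows the
 * registers and stack slots used in the quoted assembly. */
static void yuv2planeX_8_sketch(const int16_t *filter, int filterSize,
                                const int16_t **src, uint8_t *dest, int dstW,
                                const uint8_t *dither, int offset)
{
    for (int i = 0; i < dstW; i++) {
        /* Accumulator starts from the dither value, shifted left by 12. */
        int val = dither[(i + offset) & 7] << 12;

        /* One multiply-accumulate per filter tap. */
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];

        /* Assumed final step: scale back down and clip to 8 bits.
         * Relies on arithmetic right shift for negative values. */
        dest[i] = clip_uint8(val >> 19);
    }
}
```

A sketch like this can serve as a reference when comparing the NEON output
sample by sample, for example with offset set to 0 and then to 3.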