[FFmpeg-devel] [PATCH] swscale/arm: add yuv2planeX_8_neon
Matthieu Bouron
matthieu.bouron at gmail.com
Mon Apr 11 18:16:21 CEST 2016
On Mon, Apr 11, 2016 at 4:18 PM, Matthieu Bouron <matthieu.bouron at gmail.com>
wrote:
>
>
> On Mon, Apr 11, 2016 at 9:58 AM, Benoit Fouet <benoit.fouet at free.fr>
> wrote:
>
>> Hi,
>>
>> (again, thanks to both of you for documenting all this assembly/NEON
>> code)
>>
>> On 09/04/2016 10:22, Matthieu Bouron wrote:
>>
>>> From: Matthieu Bouron <matthieu.bouron at stupeflix.com>
>>>
>>> ---
>>>
>>> Hello,
>>>
>>> The following patch adds the yuv2planeX_8_neon function for the arm
>>> platform. It is currently restricted to 8-bit-per-component sources
>>> until I fix the FATE issues with 10-bit sources (the dnxhd-*-10bit
>>> tests fail, but I haven't figured out yet where the failure comes
>>> from).
>>>
>>> Matthieu
>>>
>>> ---
>>> libswscale/arm/Makefile  |  1 +
>>> libswscale/arm/output.S  | 78 ++++++++++++++++++++++++++++++++++++++++++++++++
>>> libswscale/arm/swscale.c |  7 +++++
>>> libswscale/utils.c       |  3 +-
>>> 4 files changed, 88 insertions(+), 1 deletion(-)
>>> create mode 100644 libswscale/arm/output.S
>>>
>>> [...]
>>>
>>> diff --git a/libswscale/arm/output.S b/libswscale/arm/output.S
>>> new file mode 100644
>>> index 0000000..4437447
>>> --- /dev/null
>>> +++ b/libswscale/arm/output.S
>>> @@ -0,0 +1,78 @@
>>>
>>
>> [...]
>>
>>
>>> +function ff_yuv2planeX_8_neon, export=1
>>> +    push {r4-r12, lr}
>>> +    vpush {q4-q7}
>>> +    ldr r4, [sp, #104]        @ dstW
>>> +    ldr r5, [sp, #108]        @ dither
>>> +    ldr r6, [sp, #112]        @ offset
>>> +    vld1.8 {d0}, [r5]         @ load 8x8-bit dither values
>>> +    tst r6, #0                @ check offsetting which can be 0 or 3 only
>>> +    beq 1f
>>> +    vext.u8 d0, d0, d0, #3    @ honor offsetting which can be 3 only
>>> +1:  vmovl.u8 q0, d0           @ extend dither to 16-bit
>>> +    vshll.u16 q1, d0, #12     @ extend dither to 32-bit with left shift by 12 (part 1)
>>> +    vshll.u16 q2, d1, #12     @ extend dither to 32-bit with left shift by 12 (part 2)
>>> +    mov r7, #0                @ i = 0
>>> +2:  vmov.u8 q3, q1            @ initialize accumulator with dithering values (part 1)
>>> +    vmov.u8 q4, q2            @ initialize accumulator with dithering values (part 2)
>>> +    mov r8, r1                @ tmpFilterSize = filterSize
>>> +    mov r9, r2                @ srcp
>>> +    mov r10, r0               @ filterp
>>> +3:  ldr r11, [r9], #4         @ get pointer @ src[j]
>>> +    ldr r12, [r9], #4         @ get pointer @ src[j+1]
>>> +    add r11, r11, r7, lsl #1  @ &src[j][i]
>>> +    add r12, r12, r7, lsl #1  @ &src[j+1][i]
>>> +    vld1.16 {q5}, [r11]       @ read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H
>>> +    vld1.16 {q6}, [r12]       @ read 8x16-bit @ src[j+1][i + {0..7}]: I,J,K,L,M,N,O,P
>>> +    ldr r11, [r10], #4        @ read 2x16-bit coeffs (X, Y) at (filter[j], filter[j+1])
>>> +    vmov.16 q7, q5            @ copy 8x16-bit @ src[j ][i + {0..7}] for following inplace zip instruction
>>> +    vmov.16 q8, q6            @ copy 8x16-bit @ src[j+1][i + {0..7}] for following inplace zip instruction
>>> +    vzip.16 q7, q8            @ A,I,B,J,C,K,D,L,E,M,F,N,G,O,H,L
>>>
>>
>> nit: O,H,P
>
>
> Fixed.
>
> Patch updated; it fixes the FATE issues with 10-bit sources (the code was
> not honoring the offset: tst r6, #0 has been replaced with cmp r6, #0).
> If there are no objections, I will push the patch in the next few hours.
>
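To illustrate why the tst form could never honor the offset, here is a
minimal C sketch. It assumes the usual ARM flag semantics: tst sets Z from
rn AND imm, and cmp sets Z from rn - imm. All helper names are hypothetical
and only model the Z flag and the vext.u8 byte rotation; none of them are
part of the patch.

```c
#include <stdint.h>

/* Hypothetical model of the Z flag after "tst rn, #imm" (flags from rn & imm). */
static int z_after_tst(uint32_t rn, uint32_t imm)
{
    return (rn & imm) == 0;
}

/* Hypothetical model of the Z flag after "cmp rn, #imm" (flags from rn - imm). */
static int z_after_cmp(uint32_t rn, uint32_t imm)
{
    return rn == imm;
}

/* Hypothetical model of "vext.u8 d0, d0, d0, #n": byte i of the result is
 * byte (i + n) & 7 of the input, i.e. a rotation of the 8 dither bytes. */
static void rotate_dither(const uint8_t in[8], uint8_t out[8], unsigned n)
{
    for (unsigned i = 0; i < 8; i++)
        out[i] = in[(i + n) & 7];
}

/* Returns 1 when the following "beq 1f" would skip the vext rotation. */
static int skips_rotation(uint32_t offset, int use_cmp)
{
    return use_cmp ? z_after_cmp(offset, 0) : z_after_tst(offset, 0);
}
```

With an immediate of 0, the AND is zero for every offset, so the tst form
always branches over the vext, even when offset is 3. The cmp form only
branches when offset is 0.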
Patch applied.
Matthieu
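
For readers following the NEON comments in the quoted patch, the per-pixel
work they describe can be sketched in scalar C as below. This is only a
sketch under assumptions: the function name yuv2planeX_8_sketch and the
clip_uint8 helper are hypothetical, and the final right shift by 19 with a
clamp to 8 bits is not visible in the quoted excerpt; it follows the
libswscale C reference as far as I can tell.

```c
#include <stdint.h>

/* Clamp to the 0..255 range of an 8-bit sample. */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar sketch of the vertical filter described by the NEON comments.
 * ASSUMPTION: the ">> 19" and the clamp are not shown in the quoted
 * excerpt and are taken from the C reference as recalled. */
static void yuv2planeX_8_sketch(const int16_t *filter, int filterSize,
                                const int16_t **src, uint8_t *dest, int dstW,
                                const uint8_t *dither, int offset)
{
    for (int i = 0; i < dstW; i++) {
        /* Accumulator starts at the dither value shifted left by 12,
         * with the offset rotating the 8-entry dither table. */
        int val = dither[(i + offset) & 7] << 12;

        /* Accumulate src[j][i] * filter[j] over all filter taps. */
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];

        dest[i] = clip_uint8(val >> 19);
    }
}
```

The (i + offset) & 7 indexing is the scalar counterpart of rotating the
dither vector once with vext.u8 before the main loop.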