[FFmpeg-devel] [PATCH 06/10] swscale/arm/yuv2rgb: only process one line at a time for the yuv420p and nv{12, 21} formats

Wed Mar 30 23:36:34 CEST 2016

Hi,

Le 26/03/2016 13:05, Matthieu Bouron a écrit :
> On Sat, Mar 26, 2016 at 2:09 AM, Michael Niedermayer <michael at niedermayer.cc
>> >wrote:
>> >On Fri, Mar 25, 2016 at 11:46:01PM +0100, Matthieu Bouron wrote:
>>> > >From: Matthieu Bouron<matthieu.bouron at stupeflix.com>
>>> > >
>>> > >---
>>> > >  libswscale/arm/yuv2rgb_neon.S | 89
>> >++++++++++++-------------------------------
>>> > >  1 file changed, 24 insertions(+), 65 deletions(-)
>> >
>> >breaks build
>> >
>> >  make distclean ; ../configure --cross-prefix=/usr/arm-linux-gnueabi/bin/
>> >--cc='ccache arm-linux-gnueabi-gcc-4.5' --extra-cflags='-mfpu=neon
>> >-mfloat-abi=softfp' --cpu=cortex-a8 --arch=armv7 --target-os=linux
>> >--enable-cross-compile && make -j12
>> >
>> >CC      libavutil/arm/float_dsp_init_arm.o
>> >src/libswscale/arm/yuv2rgb_neon.S: Assembler messages:
>> >src/libswscale/arm/yuv2rgb_neon.S:269: Error: thumb conditional
>> >instruction should be in IT block -- `subeq r6,r6,r0'
>> >src/libswscale/arm/yuv2rgb_neon.S:269: Error: thumb conditional
>> >instruction should be in IT block -- `addne r6,r7'
>> >
> [...]
>
> Patch updated with the relevant it instructions added. It still does build
> on my rpi2 setup but is not tested on the same setup as yours.
> Can you confirm it builds/works on your setup ?
>
> If it works, i will send an updated version of the next patch (07/10) to
> resolve the conflicts.
>
> Matthieu
>
> 0006-swscale-arm-yuv2rgb-only-process-one-line-at-a-time-.patch
>
>
>  From 7b3affff405b2b483fb16f549b69ce6f21d8a946 Mon Sep 17 00:00:00 2001
> From: Matthieu Bouron<matthieu.bouron at stupeflix.com>
> Date: Wed, 23 Mar 2016 11:26:13 +0000
> Subject: [PATCH 06/10] swscale/arm/yuv2rgb: only process one line at a time
>   for the yuv420p and nv{12,21} formats
>
> ---
>   libswscale/arm/yuv2rgb_neon.S | 92 +++++++++++++------------------------------
>   1 file changed, 27 insertions(+), 65 deletions(-)
>
> diff --git a/libswscale/arm/yuv2rgb_neon.S b/libswscale/arm/yuv2rgb_neon.S
> index ef7b0a6..6aeccae 100644
> --- a/libswscale/arm/yuv2rgb_neon.S
> +++ b/libswscale/arm/yuv2rgb_neon.S
> @@ -105,16 +105,6 @@
>       compute_16px        r2, d14, d15, \ofmt
>   .endm
>   
> -.macro process_2l_16px ofmt
> -    compute_premult     d28, d29, d30, d31
> -
> -    vld1.8              {q7}, [r4]!                                    @ first line of luma
> -    compute_16px        r2, d14, d15, \ofmt
> -
> -    vld1.8              {q7}, [r12]!                                   @ second line of luma
> -    compute_16px        r11, d14, d15, \ofmt
> -.endm
> -
>   .macro load_args_nvx
>       push                {r4-r12, lr}
>       vpush               {q4-q7}
> @@ -127,13 +117,9 @@
>       ldr                 r10,[sp, #128]                                 @ r10 = y_coeff
>       vdup.16             d0, r10                                        @ d0  = y_coeff
>       vld1.16             {d1}, [r8]                                     @ d1  = *table
> -    add                 r11, r2, r3                                    @ r11 = dst + linesize (dst2)
> -    add                 r12, r4, r5                                    @ r12 = srcY + linesizeY (srcY2)

Nit: this lets r11 and r12 unused by the NV conversions. It should be 
possible not to push/pop them
If not (which I would certainly understand), what would you think about 
moving the registers save out of the 'load_args_*' macro?
It seems weird to have all the push/vpush that are not factored, and the 
pop/vpop that is done in only one place, at the end of each function.

[snip]

Looks good to me anyway (as well as the remainder of the series).

-- 
Ben