[FFmpeg-devel] [PATCH v2 8/9] swscale/arm/yuv2rgb: save a few instructions by processing the luma line interleaved

Thu Mar 31 15:22:23 CEST 2016

On Thu, Mar 31, 2016 at 11:17 AM, Benoit Fouet <benoit.fouet at free.fr> wrote:

> Hi,
>
> On 28/03/2016 21:19, Matthieu Bouron wrote:
>
>> ---
>>   libswscale/arm/yuv2rgb_neon.S | 88 +++++++++++++++++--------------------------
>>   1 file changed, 34 insertions(+), 54 deletions(-)
>>
>> diff --git a/libswscale/arm/yuv2rgb_neon.S b/libswscale/arm/yuv2rgb_neon.S
>> index 124d7d3..6b911c8 100644
>> --- a/libswscale/arm/yuv2rgb_neon.S
>> +++ b/libswscale/arm/yuv2rgb_neon.S
>>
>> [...]
>>
>> @@ -94,25 +67,29 @@
>>   .ifc \ofmt,bgra
>>       compute_rgba        d8, d7, d6, d9, d12, d11, d10, d13
>>   .endif
>> +
>> +    vzip.8              d6, d10
>> +    vzip.8              d7, d11
>> +    vzip.8              d8, d12
>> +    vzip.8              d9, d13
>>
>
> Adding a comment to explain the resulting interleaving would be nice.

Added locally:

+    vzip.8              d6, d10                                        @ d6 = R1R2R3R4R5R6R7R8 d10 = R9R10R11R12R13R14R15R16
+    vzip.8              d7, d11                                        @ d7 = G1G2G3G4G5G6G7G8 d11 = G9G10G11G12G13G14G15G16
+    vzip.8              d8, d12                                        @ d8 = B1B2B3B4B5B6B7B8 d12 = B9B10B11B12B13B14B15B16
+    vzip.8              d9, d13                                        @ d9 = A1A2A3A4A5A6A7A8 d13 = A9A10A11A12A13A14A15A16
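
For readers following along without a NEON target, here is a rough scalar model in C of what the comments describe. It is only a sketch: model_vld2_8, model_vzip_8 and interleave_roundtrip_ok are hypothetical names that are not part of the patch, and the model covers a single 16-byte row rather than the real macros. It shows that loading with vld2.8 splits the row into odd and even pixels, and that vzip.8 on the two halves restores pixel order 1..8 and 9..16.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of "vld2.8 {dA, dB}, [src]": de-interleaving load.
 * Bytes at even offsets go to a[], bytes at odd offsets go to b[]. */
static void model_vld2_8(uint8_t a[8], uint8_t b[8], const uint8_t src[16])
{
    for (int i = 0; i < 8; i++) {
        a[i] = src[2 * i];
        b[i] = src[2 * i + 1];
    }
}

/* Scalar model of "vzip.8 dA, dB": interleave the two registers in place.
 * a[] receives the first 8 bytes of the zipped sequence, b[] the last 8. */
static void model_vzip_8(uint8_t a[8], uint8_t b[8])
{
    uint8_t z[16];

    for (int i = 0; i < 8; i++) {
        z[2 * i]     = a[i];
        z[2 * i + 1] = b[i];
    }
    memcpy(a, z, 8);
    memcpy(b, z + 8, 8);
}

/* Hypothetical self-check: returns 1 if the de-interleaving load followed
 * by the zip yields pixels 1..8 in lo[] and 9..16 in hi[]. */
static int interleave_roundtrip_ok(void)
{
    uint8_t px[16], lo[8], hi[8];

    for (int i = 0; i < 16; i++)
        px[i] = (uint8_t)(i + 1);          /* pixels numbered 1..16 */

    model_vld2_8(lo, hi, px);
    /* Here lo = 1,3,...,15 and hi = 2,4,...,16 (the interleaved layout). */
    assert(lo[0] == 1 && lo[1] == 3 && hi[0] == 2 && hi[7] == 16);

    model_vzip_8(lo, hi);
    /* Here lo = 1..8 and hi = 9..16, matching the comments above. */
    for (int i = 0; i < 8; i++)
        if (lo[i] != i + 1 || hi[i] != i + 9)
            return 0;
    return 1;
}
```

Calling interleave_roundtrip_ok() from any test main should return 1. The same reasoning applies to each of the R, G, B and A register pairs.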

>
>
>>       vst4.8              {q3, q4}, [\dst,:128]!
>>       vst4.8              {q5, q6}, [\dst,:128]!
>> -
>>   .endm
>>
>>   .macro process_1l ofmt
>> -    compute_premult     d28, d29, d30, d31
>> -    vld1.8              {q7}, [r4]!
>> -    compute             r2, d14, d15, \ofmt
>> +    compute_premult
>> +    vld2.8              {d14, d15}, [r4]!
>> +    compute             r2, \ofmt
>>   .endm
>>
>>   .macro process_2l ofmt
>> -    compute_premult     d28, d29, d30, d31
>> +    compute_premult
>>
>> -    vld1.8              {q7}, [r4]!                                    @ first line of luma
>> -    compute             r2, d14, d15, \ofmt
>> +    vld2.8              {d14, d15}, [r4]!                              @ q7 = Y (interleaved)
>> +    compute             r2, \ofmt
>>
>> -    vld1.8              {q7}, [r12]!                                   @ second line of luma
>> -    compute             r11, d14, d15, \ofmt
>> +    vld2.8              {d14, d15}, [r12]!                             @ q7 = Y (interleaved)
>> +    compute             r11, \ofmt
>>   .endm
>>
>>
>
> What about adding a level of macro here? Something like:
> .macro process_1l_internal ofmt src_addr res
>     compute_premult
>     vld2.8            {d14, d15}, [\src_addr]!
>     compute        \res, \ofmt
> .endm
>
> (again, the naming could be changed, according to your own taste :-) )
>
> This way, we would get:
> .macro process_1l ofmt
>     process_1l_internal \ofmt, r4, r2
> .endm
>
> .macro process_2l ofmt
>     process_1l_internal \ofmt, r4,  r2
>     process_1l_internal \ofmt, r12, r11
> .endm

Done locally: process_1l_16px_internal is added in the macro-ify patch and
then renamed to process_1l_internal in a later patch.

Thanks,
Matthieu