[FFmpeg-devel] [PATCH v2 8/9] swscale/arm/yuv2rgb: save a few instructions by processing the luma line interleaved
Matthieu Bouron
matthieu.bouron at gmail.com
Thu Mar 31 15:22:23 CEST 2016
On Thu, Mar 31, 2016 at 11:17 AM, Benoit Fouet <benoit.fouet at free.fr> wrote:
> Hi,
>
> On 28/03/2016 21:19, Matthieu Bouron wrote:
>
>> ---
>> libswscale/arm/yuv2rgb_neon.S | 88 +++++++++++++++++--------------------------
>> 1 file changed, 34 insertions(+), 54 deletions(-)
>>
>> diff --git a/libswscale/arm/yuv2rgb_neon.S b/libswscale/arm/yuv2rgb_neon.S
>> index 124d7d3..6b911c8 100644
>> --- a/libswscale/arm/yuv2rgb_neon.S
>> +++ b/libswscale/arm/yuv2rgb_neon.S
>>
>> [...]
>>
>> @@ -94,25 +67,29 @@
>> .ifc \ofmt,bgra
>> compute_rgba d8, d7, d6, d9, d12, d11, d10, d13
>> .endif
>> +
>> + vzip.8 d6, d10
>> + vzip.8 d7, d11
>> + vzip.8 d8, d12
>> + vzip.8 d9, d13
>>
>
> Adding a comment to explain the resulting interleaving would be nice
Added locally:
+ vzip.8 d6, d10 @ d6 = R1R2R3R4R5R6R7R8 d10 = R9R10R11R12R13R14R15R16
+ vzip.8 d7, d11 @ d7 = G1G2G3G4G5G6G7G8 d11 = G9G10G11G12G13G14G15G16
+ vzip.8 d8, d12 @ d8 = B1B2B3B4B5B6B7B8 d12 = B9B10B11B12B13B14B15B16
+ vzip.8 d9, d13 @ d9 = A1A2A3A4A5A6A7A8 d13 = A9A10A11A12A13A14A15A16
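
For anyone following along, here is a rough scalar C model of the two instructions involved. It is only an illustration, not code from the patch. It assumes that vld2.8 splits 16 consecutive bytes into their even- and odd-indexed halves and that vzip.8 interleaves two 8-byte registers in place. The function names (model_vld2_8, model_vzip_8, roundtrip_restores_order, self_check) are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of "vld2.8 {dA, dB}, [src]": de-interleave 16 bytes so that
 * dA receives elements 0, 2, 4, ... and dB receives elements 1, 3, 5, ... */
static void model_vld2_8(const uint8_t src[16], uint8_t da[8], uint8_t db[8])
{
    for (int i = 0; i < 8; i++) {
        da[i] = src[2 * i];
        db[i] = src[2 * i + 1];
    }
}

/* Scalar model of "vzip.8 dA, dB": interleave the two registers in place.
 * Afterwards dA holds the first 8 interleaved bytes and dB the last 8. */
static void model_vzip_8(uint8_t da[8], uint8_t db[8])
{
    uint8_t tmp[16];

    for (int i = 0; i < 8; i++) {
        tmp[2 * i]     = da[i];
        tmp[2 * i + 1] = db[i];
    }
    memcpy(da, tmp, 8);
    memcpy(db, tmp + 8, 8);
}

/* Returns 1 when a de-interleaving load followed by a zip puts the 16
 * bytes back in their original order (per-lane work would sit in between). */
static int roundtrip_restores_order(const uint8_t row[16])
{
    uint8_t lo[8], hi[8];

    model_vld2_8(row, lo, hi);
    model_vzip_8(lo, hi);
    return memcmp(lo, row, 8) == 0 && memcmp(hi, row + 8, 8) == 0;
}

/* Small self-check on a row holding the values 1..16. */
static void self_check(void)
{
    uint8_t row[16];

    for (int i = 0; i < 16; i++)
        row[i] = (uint8_t)(i + 1);
    assert(roundtrip_restores_order(row));
}
```

Calling self_check() from a main() should pass silently. The point is that the interleaved luma load leaves each component split across two registers by pixel parity, and the vzip.8 restores pixel order before the vst4.8 stores.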
>
>
>> vst4.8 {q3, q4}, [\dst,:128]!
>> vst4.8 {q5, q6}, [\dst,:128]!
>> -
>> .endm
>> .macro process_1l ofmt
>> - compute_premult d28, d29, d30, d31
>> - vld1.8 {q7}, [r4]!
>> - compute r2, d14, d15, \ofmt
>> + compute_premult
>> + vld2.8 {d14, d15}, [r4]!
>> + compute r2, \ofmt
>> .endm
>> .macro process_2l ofmt
>> - compute_premult d28, d29, d30, d31
>> + compute_premult
>> - vld1.8 {q7}, [r4]! @ first line of luma
>> - compute r2, d14, d15, \ofmt
>> + vld2.8 {d14, d15}, [r4]! @ q7 = Y (interleaved)
>> + compute r2, \ofmt
>> - vld1.8 {q7}, [r12]! @ second line of luma
>> - compute r11, d14, d15, \ofmt
>> + vld2.8 {d14, d15}, [r12]! @ q7 = Y (interleaved)
>> + compute r11, \ofmt
>> .endm
>>
>>
>
> What about adding a level of macro here? Something like:
> .macro process_1l_internal ofmt src_addr res
> compute_premult
> vld2.8 {d14, d15}, [\src_addr]!
> compute \res, \ofmt
> .endm
>
> (again, the naming could be changed, according to your own taste :-) )
>
> This way, we would get:
> .macro process_1l ofmt
> process_1l_internal \ofmt, r4, r2
> .endm
>
> .macro process_2l ofmt
> process_1l_internal \ofmt, r4, r2
> process_1l_internal \ofmt, r12, r11
> .endm
Added locally: I added process_1l_16px_internal to the macro-ify patch and
renamed it to process_1l_internal in a later patch.
Thanks,
Matthieu