[FFmpeg-devel] [PATCH 5/5] lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_hv

Sun Jul 2 00:28:17 EEST 2023

On Sun, 18 Jun 2023, Logan.Lyu wrote:

> Hi, Martin,
>
> I modified it according to your comments. Please review again.

> From 47b7f7af634add7680b56a216fff7dbe1f08cd11 Mon Sep 17 00:00:00 2001
> From: Logan Lyu <Logan.Lyu at myais.com.cn>
> Date: Sun, 28 May 2023 10:35:43 +0800
> Subject: [PATCH 5/5] lavc/aarch64: new optimization for 8-bit
>  hevc_epel_uni_w_hv
> 
> Signed-off-by: Logan Lyu <Logan.Lyu at myais.com.cn>
> ---
>  libavcodec/aarch64/hevcdsp_epel_neon.S    | 694 ++++++++++++++++++++++
>  libavcodec/aarch64/hevcdsp_init_aarch64.c |   6 +
>  2 files changed, 700 insertions(+)
> 
> diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
> index 8b6f396a0b..355679af29 100644
> --- a/libavcodec/aarch64/hevcdsp_epel_neon.S
> +++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
> @@ -717,6 +717,700 @@ function ff_hevc_put_hevc_epel_uni_w_h64_8_neon_i8mm, export=1
>          ret
>  endfunc
> 
> +.macro epel_uni_w_hv_start
> +        mov             x15, x5         //denom
> +        mov             x16, x6         //wx
> +        mov             x17, x7         //ox
> +        add             w15, w15, #6    //shift = denom+6
> +
> +
> +        ldp             x5, x6, [sp]
> +        ldr             x7, [sp, #16]
> +
> +        stp             q12, q13, [sp, #-128]!
> +        stp             q14, q15, [sp, #32]
> +        stp             q8, q9,   [sp, #64]
> +        stp             q10, q11, [sp, #96]

Only need to back up 64 bytes, by backing up d8-d15. Also, the order
is quite weird here, why not keep them in e.g. linear order?

> +function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
> +        epel_uni_w_hv_start
> +        sxtw            x4, w4
> +
> +        add             x10, x4, #3
> +        lsl             x10, x10, #7
> +        sub             sp, sp, x10     // tmp_array
> +        stp             xzr, x30, [sp, #-48]!

As mentioned already in the previous review - why do you back up and
restore xzr here? That's not necessary. Yes, you should keep the stack
16 byte aligned, but you can just leave an empty slot, and just do
"str x30, [sp, #-48]!" here, and vice versa with "ldr" instead of ldp
when restoring.

The same goes in all functions here.

> +2:
> +        ldp             q14, q15, [sp, #32]
> +        ldp             q8, q9,   [sp, #64]
> +        ldp             q10, q11, [sp, #96]
> +        ldp             q12, q13, [sp], #128

Only need d8-d15, and weird register order here, and elsewhere.

> +function ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm, export=1
> +        epel_uni_w_hv_start
> +        sxtw            x4, w4

FWIW, it's unusual to need an explicit sxtw instruction, but I guess
if you use it in the form "add x10, x4, #3" it might be needed.

> +function ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm, export=1
> +        ldp             x15, x16, [sp]
> +        stp             x0, x30, [sp, #-16]!
> +        stp             x1, x2, [sp, #-16]!
> +        stp             x3, x4, [sp, #-16]!
> +        stp             x5, x6, [sp, #-16]!

Don't do consecutive stack pointer updates like this, but merge it
into one large stack decrement followed by positive offsets, like in
all the other cases of stp/ldp.

> +        mov             x17, #16
> +        stp             x17, x7, [sp, #-16]!
> +        stp             x15, x16, [sp, #-16]!
> +        bl              X(ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm)
> +        ldp             x15, x16, [sp], #16
> +        ldp             x17, x7, [sp], #16
> +        ldp             x5, x6, [sp], #16
> +        ldp             x3, x4, [sp], #16
> +        ldp             x1, x2, [sp], #16
> +        ldr             x0, [sp]
> +        add             x0, x0, #16
> +        add             x2, x2, #16
> +        mov             x17, #16
> +        stp             x17, xzr, [sp, #-16]!
> +        stp             x15, x16, [sp, #-16]!

Don't do multiple stack decrements, don't needlessly store xzr here.

The same goes for all the other functions in this patch.

// Martin