[FFmpeg-devel] [PATCH 1/5] lavc/aarch64: new optimization for 8-bit hevc_pel_uni_pixels
Martin Storsjö
martin at martin.st
Mon Jun 12 10:47:54 EEST 2023
On Sun, 4 Jun 2023, Logan.Lyu at myais.com.cn wrote:
> From: Logan Lyu <Logan.Lyu at myais.com.cn>
>
> Signed-off-by: Logan Lyu <Logan.Lyu at myais.com.cn>
> ---
> libavcodec/aarch64/hevcdsp_init_aarch64.c | 5 ++
> libavcodec/aarch64/hevcdsp_qpel_neon.S | 104 ++++++++++++++++++++++
> 2 files changed, 109 insertions(+)
>
> diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
> index 483a9d5253..5a1d520eec 100644
> --- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
> +++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
> @@ -152,6 +152,9 @@ void ff_hevc_put_hevc_qpel_bi_h16_8_neon(uint8_t *_dst, ptrdiff_t _dststride, co
> void ff_hevc_put_hevc_##fn##32_8_neon##ext args; \
> void ff_hevc_put_hevc_##fn##64_8_neon##ext args; \
>
> +NEON8_FNPROTO(pel_uni_pixels, (uint8_t *_dst, ptrdiff_t _dststride,
> + const uint8_t *_src, ptrdiff_t _srcstride,
> + int height, intptr_t mx, intptr_t my, int width),);
>
> NEON8_FNPROTO(pel_uni_w_pixels, (uint8_t *_dst, ptrdiff_t _dststride,
> const uint8_t *_src, ptrdiff_t _srcstride,
> @@ -263,6 +266,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
> c->put_hevc_qpel_bi[8][0][1] =
> c->put_hevc_qpel_bi[9][0][1] = ff_hevc_put_hevc_qpel_bi_h16_8_neon;
>
> + NEON8_FNASSIGN(c->put_hevc_epel_uni, 0, 0, pel_uni_pixels,);
> + NEON8_FNASSIGN(c->put_hevc_qpel_uni, 0, 0, pel_uni_pixels,);
> NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 0, 0, pel_uni_w_pixels,);
> NEON8_FNASSIGN(c->put_hevc_qpel_uni_w, 0, 0, pel_uni_w_pixels,);
> NEON8_FNASSIGN_PARTIAL_4(c->put_hevc_qpel_uni_w, 1, 0, qpel_uni_w_v,);
> diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
> index ed659cfe9b..6ca05b7201 100644
> --- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
> +++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
> @@ -490,6 +490,110 @@ put_hevc qpel
> put_hevc qpel_uni
> put_hevc qpel_bi
>
> +function ff_hevc_put_hevc_pel_uni_pixels4_8_neon, export=1
> +1:
> + ldr s0, [x2]
> + ldr s1, [x2, x3]
> + add x2, x2, x3, lsl #1
> + str s0, [x0]
> + str s1, [x0, x1]
> + add x0, x0, x1, lsl #1
> + subs w4, w4, #2
> + b.hi 1b
> + ret
> +endfunc
In a loop like this, I would recommend moving the "subs" instruction
further away from the branch that depends on it, so that the condition
flags are ready by the time the branch is reached. For cores with in-order
execution, this matters a fair bit, while it probably makes little
difference for cores with out-of-order execution. Here, the ideal location
is probably right after the two loads at the start. The same goes for all
the other functions in this patch.
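
As an untested sketch, the 4-pixel function with the "subs" moved up could
look roughly like this. None of the instructions between the "subs" and the
"b.hi" modify the condition flags (plain "add", not "adds"), so the loop
should behave the same as before:

function ff_hevc_put_hevc_pel_uni_pixels4_8_neon, export=1
1:
        ldr             s0, [x2]
        ldr             s1, [x2, x3]
        subs            w4, w4, #2              // decrement height early
        add             x2, x2, x3, lsl #1
        str             s0, [x0]
        str             s1, [x0, x1]
        add             x0, x0, x1, lsl #1
        b.hi            1b                      // flags set several instructions ago
        ret
endfunc
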
Other than that, this looks ok.
// Martin