[FFmpeg-devel] [PATCH] af_afir: RISC-V V fcmul_add

Mon Nov 13 17:35:35 EET 2023

   Hi,

On Monday, 13 November 2023 at 11:43:01 EET, flow gg wrote:
> Sorry for the long delay in responding.

No problem. Working with T-Head C910 (or C920?) cores is very tedious. I have 
given up on that and switched over to the Kendryte K230 (based on the C908).

> How is the modified patch now?

It looks better, but some minor improvements are still possible.

> no longer using register stride(learn from your code) and have switched to
> shNadd instead.
> 
> (using m4 and m2 as they are slightly faster than m8 and m4)
> 
> benchmark:
> fcmul_add_c: 2179
> fcmul_add_rvv_f32: 1652

> diff --git a/libavfilter/af_afirdsp.h b/libavfilter/af_afirdsp.h
> index 4208501393..d2d1e909c1 100644
> --- a/libavfilter/af_afirdsp.h
> +++ b/libavfilter/af_afirdsp.h
> @@ -34,6 +34,7 @@ typedef struct AudioFIRDSPContext {
>  } AudioFIRDSPContext;
> 
>  void ff_afir_init_x86(AudioFIRDSPContext *s);
> +void ff_afir_init_riscv(AudioFIRDSPContext *s);

Nit: please stick to alphabetical order (riscv before x86), like most similar 
code.

> 
>  static void fcmul_add_c(float *sum, const float *t, const float *c,
> ptrdiff_t len)
>  {
> @@ -76,6 +77,8 @@ static av_unused void ff_afir_init(AudioFIRDSPContext
> *dsp)
> 
>  #if ARCH_X86
>      ff_afir_init_x86(dsp);
> +#elif ARCH_RISCV
> +    ff_afir_init_riscv(dsp);

Ditto.

>  #endif
>  }
> 
> diff --git a/libavfilter/riscv/Makefile b/libavfilter/riscv/Makefile
> new file mode 100644
> index 0000000000..0b968a9c0d
> --- /dev/null
> +++ b/libavfilter/riscv/Makefile
> @@ -0,0 +1,2 @@
> +OBJS += riscv/af_afir_init.o
> +RVV-OBJS += riscv/af_afir_rvv.o
> diff --git a/libavfilter/riscv/af_afir_init.c
> b/libavfilter/riscv/af_afir_init.c new file mode 100644
> index 0000000000..13df8341e7
> --- /dev/null
> +++ b/libavfilter/riscv/af_afir_init.c
> @@ -0,0 +1,39 @@
> +/*
> + * Copyright (c) 2023 Institue of Software Chinese Academy of Sciences
> (ISCAS).
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301
> USA
> + */
> +
> +#include <stdint.h>
> +
> +#include "config.h"
> +#include "libavutil/attributes.h"
> +#include "libavutil/cpu.h"
> +#include "libavfilter/af_afirdsp.h"
> +
> +void ff_fcmul_add_rvv(float *sum, const float *t, const float *c,
> +                       ptrdiff_t len);
> +
> +av_cold void ff_afir_init_riscv(AudioFIRDSPContext *s)
> +{
> +#if HAVE_RVV
> +    int flags = av_get_cpu_flags();
> +
> +    if (flags & AV_CPU_FLAG_RVV_F32)

You need to check for Zba as well here, since the assembler code uses sh3add. 
I doubt that we'll see hardware with V and without Zba in real life, but for 
the sake of correctness...
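
To illustrate, the combined check could look roughly like the sketch below. 
This is standalone C rather than FFmpeg code: the flag constants and the 
helper name are made up for the example. In the real init function the flags 
come from av_get_cpu_flags(), and I believe the Zba bit is 
AV_CPU_FLAG_RVB_ADDR, but please check libavutil/cpu.h.

```c
/* Hypothetical stand-ins for the CPU flag bits; the values are made up. */
enum {
    SKETCH_FLAG_RVV_F32 = 1 << 0, /* vectors with 32-bit float elements */
    SKETCH_FLAG_ZBA     = 1 << 1, /* address generation (sh1add/sh2add/sh3add) */
};

/* Return 1 only if every extension used by the vector routine is present. */
static int can_use_rvv_fcmul_add(int flags)
{
    const int needed = SKETCH_FLAG_RVV_F32 | SKETCH_FLAG_ZBA;

    return (flags & needed) == needed;
}
```

Testing both bits against a single mask keeps the condition short if more 
requirements are added later.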

> +        s->fcmul_add = ff_fcmul_add_rvv;
> +#endif
> +}
> diff --git a/libavfilter/riscv/af_afir_rvv.S
> b/libavfilter/riscv/af_afir_rvv.S new file mode 100644
> index 0000000000..078cac8e7e
> --- /dev/null
> +++ b/libavfilter/riscv/af_afir_rvv.S
> @@ -0,0 +1,61 @@
> +/*
> + * Copyright (c) 2023 Institue of Software Chinese Academy of Sciences
> (ISCAS).
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301
> USA
> + */
> +
> +#include "libavutil/riscv/asm.S"
> +
> +//  void ff_fcmul_add(float *sum, const float *t, const float *c, int len)
> +func ff_fcmul_add_rvv, zve32f
> +        li          t1, 32
> +1:
> +        vsetvli     t0, a3, e64, m4, ta, ma

You can set SEW=32 and the corresponding LMUL (m2, which keeps the same 
SEW/LMUL ratio) here. Then you can remove all the other VSETVLI instances 
below. (Note that this will NOT work on draft 0.7.1 hardware, but it does work 
on conformant hardware.)

> +        vle64.v     v12, (a0)

This requires 64-bit alignment. I don't know if this is correct for this 
specific filter, so I leave it to other people to comment here.
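
To spell out the concern: a float buffer is only guaranteed 32-bit alignment 
by the language, whereas the load above accesses 64-bit elements. A minimal C 
sketch of the property that all three pointers would have to satisfy (the 
helper name is made up for illustration):

```c
#include <stdint.h>

/* Hypothetical helper: true if p is aligned for 64-bit element accesses. */
static int is_aligned_64(const void *p)
{
    return ((uintptr_t)p & 7) == 0;
}
```

Whether the filter's buffers always meet this is exactly the open question.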

> +        sub         a3, a3, t0
> +        vsetvli     zero, zero, e32, m2, ta, ma
> +        vnsrl.vx    v8, v12, zero
> +        vnsrl.vx    v10, v12, t1
> +        vsetvli     zero, zero, e64, m4, ta, ma
> +        vle64.v     v12, (a1)
> +        sh3add      a1, t0, a1
> +        vsetvli     zero, zero, e32, m2, ta, ma
> +        vnsrl.vx    v0, v12, zero
> +        vnsrl.vx    v2, v12, t1
> +        vsetvli     zero, zero, e64, m4, ta, ma
> +        vle64.v     v12, (a2)
> +        sh3add      a2, t0, a2
> +        vsetvli     zero, zero, e32, m2, ta, ma
> +        vnsrl.vx    v4, v12, zero
> +        vnsrl.vx    v6, v12, t1
> +        vfmacc.vv   v8, v0, v4
> +        vfnmsac.vv  v8, v2, v6
> +        vfmacc.vv   v10, v0, v6

Swap the two instructions above so that the updates to v8 and v10 alternate. 
That avoids back-to-back dependent instructions and gives better pipeline 
utilisation on in-order CPUs.

> +        vfmacc.vv   v10, v2, v4
> +        vsseg2e32.v v8, (a0)
> +        sh3add      a0, t0, a0
> +        bgtz        a3, 1b
> +
> +        flw         fa0, 0(a1)
> +        flw         fa1, 0(a2)
> +        flw         fa2, 0(a0)
> +        fmul.s      fa0, fa0, fa1
> +        fadd.s      fa2, fa2, fa0

It won't make much difference, but you can use a fused multiply-add (fmadd.s) 
here.

> +        fsw         fa2, 0(a0)
> +
> +        ret
> +endfunc

While you're at it, this looks like it could easily be adapted for the double 
precision version. In fact, it will be simpler, since you will have to use 
vlseg2e64 rather than vle128.v+vnsrl.vx+vnsrl.vx. But if you decide to 
implement that too, please keep it a separate patch.
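
For reference, here is a scalar C sketch of what I would expect a double 
precision routine to compute. It assumes the same interleaved (re, im) layout 
and the same trailing real-only element as the single precision code above. 
The function name is made up, so check it against the actual C version before 
relying on it.

```c
#include <stddef.h>

/* Hypothetical scalar reference: complex multiply-accumulate on doubles. */
static void dcmul_add_ref(double *sum, const double *t, const double *c,
                          ptrdiff_t len)
{
    ptrdiff_t n;

    for (n = 0; n < len; n++) {
        const double tre = t[2 * n], tim = t[2 * n + 1];
        const double cre = c[2 * n], cim = c[2 * n + 1];

        sum[2 * n]     += tre * cre - tim * cim; /* real part */
        sum[2 * n + 1] += tre * cim + tim * cre; /* imaginary part */
    }
    sum[2 * n] += t[2 * n] * c[2 * n]; /* trailing real-only element */
}
```

A segmented load delivers the real and imaginary parts in separate vector 
registers directly, which is why the vector version needs no narrowing shifts.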

-- 
レミ・デニ-クールモン
http://www.remlab.net/