[FFmpeg-devel] [PATCH] swresample/arm: add ff_resample_common_apply_filter_{x4, x8}_{float, s16}_neon

Thu May 12 10:01:24 CEST 2016

Hi,

I mostly have nits remarks.

On 11/05/2016 18:39, Matthieu Bouron wrote:
> From: Matthieu Bouron <matthieu.bouron at stupeflix.com>
>

[...]

> diff --git a/libswresample/arm/resample.S b/libswresample/arm/resample.S
> new file mode 100644
> index 0000000..13462e3
> --- /dev/null
> +++ b/libswresample/arm/resample.S
> @@ -0,0 +1,77 @@
>
> [...]
>
> +function ff_resample_common_apply_filter_x4_float_neon, export=1
> +    vmov.f32            q0, #0.0                                       @ accumulator
> +1:  vld1.32             {q1}, [r1]!                                    @ src
> +    vld1.32             {q2}, [r2]!                                    @ filter
> +    vmla.f32            q0, q1, q2                                     @ src + {0..3} * filter + {0..3}

nit: the comment could be "accu += src[0..3] . filter[0..3]"
same for the other ones below

[...]

> +    subs                r3, #4                                         @ filter_length -= 4
> +    bgt                 1b                                             @ loop until filter_length
> +    vpadd.f32           d0, d0, d1                                     @ pair adding of the 4x32-bit accumulated values
> +    vpadd.f32           d0, d0, d0                                     @ pair adding of the 4x32-bit accumulator values
> +    vst1.32             {d0[0]}, [r0]                                  @ write accumulator
> +    mov pc, lr
> +endfunc
> +
> +function ff_resample_common_apply_filter_x8_float_neon, export=1
> +    vmov.f32            q0, #0.0                                       @ accumulator
> +1:  vld1.32             {q1}, [r1]!                                    @ src1
> +    vld1.32             {q2}, [r2]!                                    @ filter1
> +    vld1.32             {q8}, [r1]!                                    @ src2
> +    vld1.32             {q9}, [r2]!                                    @ filter2
> +    vmla.f32            q0, q1, q2                                     @ src1 + {0..3} * filter1 + {0..3}
> +    vmla.f32            q0, q8, q9                                     @ src2 + {0..3} * filter2 + {0..3}

instead of using src1 and src2, you may want to use src[0..3] and src[4..7]
so, if I reuse the formulation I proposed above:
accu += src[0..3] . filter[0..3]
accu += src[4..7] . filter[4..7]

> +    subs                r3, #8                                         @ filter_length -= 4

-= 8

[...]

> diff --git a/libswresample/arm/resample_init.c b/libswresample/arm/resample_init.c
> new file mode 100644
> index 0000000..c817d03
> --- /dev/null
> +++ b/libswresample/arm/resample_init.c
>
> [...]
>
> +static int ff_resample_common_##TYPE##_neon(ResampleContext *c, void *dest, const void *source,   \
> +                                            int n, int update_ctx)                                \
> +{                                                                                                 \
> +    DELEM *dst = dest;                                                                            \
> +    const DELEM *src = source;                                                                    \
> +    int dst_index;                                                                                \
> +    int index= c->index;                                                                          \
> +    int frac= c->frac;                                                                            \
> +    int sample_index = index >> c->phase_shift;                                                   \
> +    int x4_aligned_filter_length = c->filter_length & ~3;                                         \
> +    int x8_aligned_filter_length = c->filter_length & ~7;                                         \
> +                                                                                                  \
> +    index &= c->phase_mask;                                                                       \
> +    for (dst_index = 0; dst_index < n; dst_index++) {                                             \
> +        FELEM *filter = ((FELEM *) c->filter_bank) + c->filter_alloc * index;                     \
> +                                                                                                  \
> +        FELEM2 val=0;                                                                             \
> +        int i = 0;                                                                                \
> +        if (x8_aligned_filter_length >= 8) {                                                      \
> +            ff_resample_common_apply_filter_x8_##TYPE##_neon(&val, &src[sample_index],            \
> +                                                             filter, x8_aligned_filter_length);   \
> +            i += x8_aligned_filter_length;                                                        \
> +                                                                                                  \
> +        } else if (x4_aligned_filter_length >= 4) {                                               \

do you think there could be a gain processing the remainder of the 
8-aligned part through the 4-aligned part of the code? e.g. for a filter 
length of 15, that would make:
  - one run of the 8-aligned
  - one run of the 4-aligned
  - 3 C loops
As you stated filter length seems to commonly be 32, I guess that 
wouldn't be easy to benchmark, though.

[...]

-- 
Ben