[FFmpeg-devel] [PATCH] swresample/arm: add ff_resample_common_apply_filter_{x4, x8}_{float, s16}_neon
Benoit Fouet
benoit.fouet at free.fr
Thu May 12 10:01:24 CEST 2016
Hi,
I mostly have nit-level remarks.
On 11/05/2016 18:39, Matthieu Bouron wrote:
> From: Matthieu Bouron <matthieu.bouron at stupeflix.com>
>
[...]
> diff --git a/libswresample/arm/resample.S b/libswresample/arm/resample.S
> new file mode 100644
> index 0000000..13462e3
> --- /dev/null
> +++ b/libswresample/arm/resample.S
> @@ -0,0 +1,77 @@
>
> [...]
>
> +function ff_resample_common_apply_filter_x4_float_neon, export=1
> + vmov.f32 q0, #0.0 @ accumulator
> +1: vld1.32 {q1}, [r1]! @ src
> + vld1.32 {q2}, [r2]! @ filter
> + vmla.f32 q0, q1, q2 @ src + {0..3} * filter + {0..3}
nit: the comment could read "accu += src[0..3] . filter[0..3]"
The same applies to the similar comments below.
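
For reference, here is a scalar C sketch of what the quoted x4 float loop appears to compute, which is why the "accu +=" wording fits. The function name is hypothetical and not part of the patch. The sketch assumes r0..r3 hold dst, src, filter and filter_length (as the call in resample_init.c suggests) and that filter_length is a positive multiple of 4.

```c
/* Scalar model of the quoted x4 float NEON loop (hypothetical name).
 * Assumption: filter_length is a positive multiple of 4. */
static void apply_filter_x4_float_model(float *dst, const float *src,
                                        const float *filter, int filter_length)
{
    float acc[4] = { 0.0f, 0.0f, 0.0f, 0.0f };      /* q0: four lane accumulators */

    do {
        for (int lane = 0; lane < 4; lane++)
            acc[lane] += src[lane] * filter[lane];  /* vmla.f32 q0, q1, q2 */
        src    += 4;                                /* post-increment of r1 */
        filter += 4;                                /* post-increment of r2 */
        filter_length -= 4;                         /* subs r3, #4 */
    } while (filter_length > 0);                    /* bgt 1b */

    /* The two vpadd.f32 instructions reduce the four lanes to one value. */
    float lo = acc[0] + acc[1];
    float hi = acc[2] + acc[3];
    *dst = lo + hi;                                 /* vst1.32 {d0[0]}, [r0] */
}
```

Note that the multiply-accumulate is lane-wise. The horizontal sum only happens once, after the loop.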
[...]
> + subs r3, #4 @ filter_length -= 4
> + bgt 1b @ loop until filter_length
> + vpadd.f32 d0, d0, d1 @ pair adding of the 4x32-bit accumulated values
> + vpadd.f32 d0, d0, d0 @ pair adding of the 4x32-bit accumulator values
> + vst1.32 {d0[0]}, [r0] @ write accumulator
> + mov pc, lr
> +endfunc
> +
> +function ff_resample_common_apply_filter_x8_float_neon, export=1
> + vmov.f32 q0, #0.0 @ accumulator
> +1: vld1.32 {q1}, [r1]! @ src1
> + vld1.32 {q2}, [r2]! @ filter1
> + vld1.32 {q8}, [r1]! @ src2
> + vld1.32 {q9}, [r2]! @ filter2
> + vmla.f32 q0, q1, q2 @ src1 + {0..3} * filter1 + {0..3}
> + vmla.f32 q0, q8, q9 @ src2 + {0..3} * filter2 + {0..3}
Instead of src1 and src2, you may want to use src[0..3] and src[4..7].
So, reusing the wording I proposed above:
accu += src[0..3] . filter[0..3]
accu += src[4..7] . filter[4..7]
> + subs r3, #8 @ filter_length -= 4
nit: the comment should say "filter_length -= 8"
[...]
> diff --git a/libswresample/arm/resample_init.c b/libswresample/arm/resample_init.c
> new file mode 100644
> index 0000000..c817d03
> --- /dev/null
> +++ b/libswresample/arm/resample_init.c
>
> [...]
>
> +static int ff_resample_common_##TYPE##_neon(ResampleContext *c, void *dest, const void *source, \
> + int n, int update_ctx) \
> +{ \
> + DELEM *dst = dest; \
> + const DELEM *src = source; \
> + int dst_index; \
> + int index= c->index; \
> + int frac= c->frac; \
> + int sample_index = index >> c->phase_shift; \
> + int x4_aligned_filter_length = c->filter_length & ~3; \
> + int x8_aligned_filter_length = c->filter_length & ~7; \
> + \
> + index &= c->phase_mask; \
> + for (dst_index = 0; dst_index < n; dst_index++) { \
> + FELEM *filter = ((FELEM *) c->filter_bank) + c->filter_alloc * index; \
> + \
> + FELEM2 val=0; \
> + int i = 0; \
> + if (x8_aligned_filter_length >= 8) { \
> + ff_resample_common_apply_filter_x8_##TYPE##_neon(&val, &src[sample_index], \
> + filter, x8_aligned_filter_length); \
> + i += x8_aligned_filter_length; \
> + \
> + } else if (x4_aligned_filter_length >= 4) { \
Do you think there could be a gain in processing the remainder of the
8-aligned part through the 4-aligned part of the code? E.g. for a filter
length of 15, that would make:
- one run of the 8-aligned code
- one run of the 4-aligned code
- 3 iterations of the C loop
Since, as you said, the filter length seems to commonly be 32, I guess
that wouldn't be easy to benchmark, though.
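
A minimal C sketch of that cascade, to make the suggestion concrete. The names are hypothetical: dot_block() is a scalar stand-in for the NEON x8 and x4 kernels, and the surrounding context handling from the patch is left out.

```c
/* Scalar stand-in for a NEON kernel (hypothetical): dot product over len taps. */
static float dot_block(const float *src, const float *filter, int len)
{
    float acc = 0.0f;
    for (int i = 0; i < len; i++)
        acc += src[i] * filter[i];
    return acc;
}

/* Cascade: 8-aligned part, then a 4-aligned part of the remainder,
 * then a scalar tail. Replaces the "else if" with a second "if". */
static float apply_filter_cascade(const float *src, const float *filter,
                                  int filter_length)
{
    float val = 0.0f;
    int i = 0;

    int x8_len = filter_length & ~7;          /* largest multiple of 8 */
    if (x8_len >= 8) {
        val += dot_block(src, filter, x8_len);          /* x8 kernel here */
        i += x8_len;
    }

    int x4_len = (filter_length - i) & ~3;    /* multiple of 4 in what is left */
    if (x4_len >= 4) {
        val += dot_block(src + i, filter + i, x4_len);  /* x4 kernel here */
        i += x4_len;
    }

    for (; i < filter_length; i++)            /* at most 3 remaining taps */
        val += src[i] * filter[i];

    return val;
}
```

With a filter length of 15 this gives 8 taps in the first block, 4 in the second and 3 in the C loop. With a length of 32 only the first block runs, so the common case would be unchanged.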
[...]
--
Ben