[FFmpeg-devel] [PATCH] swresample/arm: add ff_resample_common_apply_filter_{x4, x8}_{float, s16}_neon
Benoit Fouet
benoit.fouet at free.fr
Thu May 12 15:50:10 CEST 2016
Hi,
On 12/05/2016 15:22, Matthieu Bouron wrote:
> On Thu, May 12, 2016 at 10:01 AM, Benoit Fouet <benoit.fouet at free.fr> wrote:
>
>> Hi,
>>
>> I mostly have nits remarks.
>>
>> On 11/05/2016 18:39, Matthieu Bouron wrote:
>>
>>> From: Matthieu Bouron <matthieu.bouron at stupeflix.com>
>>>
>>>
>> [...]
>>
>> diff --git a/libswresample/arm/resample.S b/libswresample/arm/resample.S
>>> new file mode 100644
>>> index 0000000..13462e3
>>> --- /dev/null
>>> +++ b/libswresample/arm/resample.S
>>> @@ -0,0 +1,77 @@
>>>
>>> [...]
>>>
>>> +function ff_resample_common_apply_filter_x4_float_neon, export=1
>>> + vmov.f32 q0, #0.0 @
>>> accumulator
>>> +1: vld1.32 {q1}, [r1]! @
>>> src
>>> + vld1.32 {q2}, [r2]! @
>>> filter
>>> + vmla.f32 q0, q1, q2 @
>>> src + {0..3} * filter + {0..3}
>>>
>> nit: the comment could be "accu += src[0..3] . filter[0..3]"
>> same for the other ones below
>>
>> [...]
>>
>> + subs r3, #4 @
>>> filter_length -= 4
>>> + bgt 1b @
>>> loop until filter_length
>>> + vpadd.f32 d0, d0, d1 @
>>> pair adding of the 4x32-bit accumulated values
>>> + vpadd.f32 d0, d0, d0 @
>>> pair adding of the 4x32-bit accumulator values
>>> + vst1.32 {d0[0]}, [r0] @
>>> write accumulator
>>> + mov pc, lr
>>> +endfunc
>>> +
>>> +function ff_resample_common_apply_filter_x8_float_neon, export=1
>>> + vmov.f32 q0, #0.0 @
>>> accumulator
>>> +1: vld1.32 {q1}, [r1]! @
>>> src1
>>> + vld1.32 {q2}, [r2]! @
>>> filter1
>>> + vld1.32 {q8}, [r1]! @
>>> src2
>>> + vld1.32 {q9}, [r2]! @
>>> filter2
>>> + vmla.f32 q0, q1, q2 @
>>> src1 + {0..3} * filter1 + {0..3}
>>> + vmla.f32 q0, q8, q9 @
>>> src2 + {0..3} * filter2 + {0..3}
>>>
>> instead of using src1 and src2, you may want to use src[0..3] and src[4..7]
>> so, if I reuse the formulation I proposed above:
>> accu += src[0..3] . filter[0..3]
>> accu += src[4..7] . filter[4..7]
>>
> Fixed locally (as well as the other case you mentionned) with:
> - vmla.f32 q0, q1, q2 @
> src1 + {0..3} * filter1 + {0..3}
> - vmla.f32 q0, q8, q9 @
> src2 + {0..3} * filter2 + {0..3}
> + vmla.f32 q0, q1, q2 @
> accumulator += src1 + {0..3} * filter1 + {0..3}
> + vmla.f32 q0, q8, q9 @
> accumulator += src2 + {4..7} * filter2 + {4..7}
>
> I prefer to use + {0..3} instead of [0..3] to make the comments consistent
> with what has been done in swscale/arm.
>
Fine for me (I chose the "[]" notation to be consistent with the "."
notation also, in order to do as if it were a dot product between two
vectors).
>> + subs r3, #8 @
>>> filter_length -= 4
>>>
>> -= 8
>>
> Fixed locally.
>
>
>> [...]
>>
>> diff --git a/libswresample/arm/resample_init.c
>>> b/libswresample/arm/resample_init.c
>>> new file mode 100644
>>> index 0000000..c817d03
>>> --- /dev/null
>>> +++ b/libswresample/arm/resample_init.c
>>>
>>> [...]
>>>
>>> +static int ff_resample_common_##TYPE##_neon(ResampleContext *c, void
>>> *dest, const void *source, \
>>> + int n, int update_ctx)
>>> \
>>> +{
>>> \
>>> + DELEM *dst = dest;
>>> \
>>> + const DELEM *src = source;
>>> \
>>> + int dst_index;
>>> \
>>> + int index= c->index;
>>> \
>>> + int frac= c->frac;
>>> \
>>> + int sample_index = index >> c->phase_shift;
>>> \
>>> + int x4_aligned_filter_length = c->filter_length & ~3;
>>> \
>>> + int x8_aligned_filter_length = c->filter_length & ~7;
>>> \
>>> +
>>> \
>>> + index &= c->phase_mask;
>>> \
>>> + for (dst_index = 0; dst_index < n; dst_index++) {
>>> \
>>> + FELEM *filter = ((FELEM *) c->filter_bank) + c->filter_alloc *
>>> index; \
>>> +
>>> \
>>> + FELEM2 val=0;
>>> \
>>> + int i = 0;
>>> \
>>> + if (x8_aligned_filter_length >= 8) {
>>> \
>>> + ff_resample_common_apply_filter_x8_##TYPE##_neon(&val,
>>> &src[sample_index], \
>>> + filter,
>>> x8_aligned_filter_length); \
>>> + i += x8_aligned_filter_length;
>>> \
>>> +
>>> \
>>> + } else if (x4_aligned_filter_length >= 4) {
>>> \
>>>
>> do you think there could be a gain processing the remainder of the
>> 8-aligned part through the 4-aligned part of the code? e.g. for a filter
>> length of 15, that would make:
>> - one run of the 8-aligned
>> - one run of the 4-aligned
>> - 3 C loops
>> As you stated filter length seems to commonly be 32, I guess that wouldn't
>> be easy to benchmark, though.
>>
> I'll see if I could trigger a case where the filter_length is 15 and come
> up with a benchmark. If there is a performance gain, would you be ok if it
> is implemented as a separate patch ?
Sure, no problem for me.
Thanks,
--
Ben
More information about the ffmpeg-devel
mailing list