[FFmpeg-devel] [PATCH] aacenc_utils: unroll loops to allow compiler to use SIMD.
James Almer
jamrial at gmail.com
Sun Mar 6 20:46:08 CET 2016
On 3/6/2016 4:14 PM, Reimar Döffinger wrote:
> On Sun, Mar 06, 2016 at 03:49:00PM -0300, James Almer wrote:
>> On 3/6/2016 3:35 PM, Reimar Döffinger wrote:
>>> Approximately 10% faster transcode from mp3 to aac
>>> with default settings.
>>>
>>> Signed-off-by: Reimar Döffinger <Reimar.Doeffinger at gmx.de>
>>> ---
>>> libavcodec/aacenc_utils.h | 47 ++++++++++++++++++++++++++++++++++++++---------
>>> 1 file changed, 38 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/libavcodec/aacenc_utils.h b/libavcodec/aacenc_utils.h
>>> index b9bd6bf..1639021 100644
>>> --- a/libavcodec/aacenc_utils.h
>>> +++ b/libavcodec/aacenc_utils.h
>>> @@ -36,15 +36,29 @@
>>> #define ROUND_TO_ZERO 0.1054f
>>> #define C_QUANT 0.4054f
>>>
>>> +#define ABSPOW(inv, outv) \
>>> +do { \
>>> + float a = (inv); \
>>> + a = fabsf(a); \
>>> + (outv) = sqrtf(a * sqrtf(a)); \
>>> +} while(0)
>>> +
>>> static inline void abs_pow34_v(float *out, const float *in, const int size)
>>> {
>>> int i;
>>> - for (i = 0; i < size; i++) {
>>> - float a = fabsf(in[i]);
>>> - out[i] = sqrtf(a * sqrtf(a));
>>> + for (i = 0; i < size - 3; i += 4) {
>>> + ABSPOW(in[i], out[i]);
>>> + ABSPOW(in[i+1], out[i+1]);
>>> + ABSPOW(in[i+2], out[i+2]);
>>> + ABSPOW(in[i+3], out[i+3]);
>>> + }
>>
>> Are you sure this wasn't vectorized already? I remember I checked and it mostly
>> was, at least on gcc 5.3 mingw-w64 with default settings.
>
> Then it would hardly get 10% faster, would it (though
> I admit I didn't test the two parts separately)?
> But I am fairly sure that before the patch it only
> used sqrtss instructions and not sqrtps.
Without your patch (GCC 5.3, mingw-w64 x86_64, default settings):
$ make libavcodec/aacenc_ltp.o && objdump -d -M intel libavcodec/aacenc_ltp.o | grep sqrtps
CC libavcodec/aacenc_ltp.o
1029: 0f 51 c8 sqrtps xmm1,xmm0
102f: 0f 51 c0 sqrtps xmm0,xmm0
161d: 0f 51 c8 sqrtps xmm1,xmm0
1623: 0f 51 c0 sqrtps xmm0,xmm0
1ccf: 0f 51 c8 sqrtps xmm1,xmm0
1cd5: 0f 51 c0 sqrtps xmm0,xmm0
2745: 0f 51 c8 sqrtps xmm1,xmm0
274b: 0f 51 c0 sqrtps xmm0,xmm0
34e4: 0f 51 c8 sqrtps xmm1,xmm0
34ea: 0f 51 c0 sqrtps xmm0,xmm0
42f7: 0f 51 c8 sqrtps xmm1,xmm0
42fd: 0f 51 c0 sqrtps xmm0,xmm0
44ac: 0f 51 c8 sqrtps xmm1,xmm0
44b2: 0f 51 c0 sqrtps xmm0,xmm0
With your patch:
11fd: 0f 51 c8 sqrtps xmm1,xmm0
1203: 0f 51 c0 sqrtps xmm0,xmm0
12cb: 0f 51 c8 sqrtps xmm1,xmm0
12d1: 0f 51 c0 sqrtps xmm0,xmm0
1d43: 0f 51 c8 sqrtps xmm1,xmm0
1d49: 0f 51 c0 sqrtps xmm0,xmm0
1e21: 0f 51 c8 sqrtps xmm1,xmm0
1e27: 0f 51 c0 sqrtps xmm0,xmm0
2964: 0f 51 c8 sqrtps xmm1,xmm0
296a: 0f 51 c0 sqrtps xmm0,xmm0
2a3c: 0f 51 c8 sqrtps xmm1,xmm0
2a42: 0f 51 c0 sqrtps xmm0,xmm0
35f3: 0f 51 c8 sqrtps xmm1,xmm0
35f9: 0f 51 c0 sqrtps xmm0,xmm0
36bc: 0f 51 c8 sqrtps xmm1,xmm0
36c2: 0f 51 c0 sqrtps xmm0,xmm0
457b: 0f 51 c8 sqrtps xmm1,xmm0
4581: 0f 51 c0 sqrtps xmm0,xmm0
464c: 0f 51 c8 sqrtps xmm1,xmm0
4652: 0f 51 c0 sqrtps xmm0,xmm0
54b3: 0f 51 c8 sqrtps xmm1,xmm0
54b9: 0f 51 c0 sqrtps xmm0,xmm0
558f: 0f 51 c8 sqrtps xmm1,xmm0
5595: 0f 51 c0 sqrtps xmm0,xmm0
56e4: 0f 51 c8 sqrtps xmm1,xmm0
56ea: 0f 51 c0 sqrtps xmm0,xmm0
I didn't benchmark it, but the patch seems to help GCC vectorize more
effectively (26 sqrtps instances versus 14 in the listings above), so it is
probably OK, especially if it is what lets your compiler vectorize the loop
at all.