[FFmpeg-devel] [PATCH] aacenc_utils: unroll loops to allow compiler to use SIMD.

Sun Mar 6 20:46:08 CET 2016

On 3/6/2016 4:14 PM, Reimar Döffinger wrote:
> On Sun, Mar 06, 2016 at 03:49:00PM -0300, James Almer wrote:
>> On 3/6/2016 3:35 PM, Reimar Döffinger wrote:
>>> Approximately 10% faster transcode from mp3 to aac
>>> with default settings.
>>>
>>> Signed-off-by: Reimar Döffinger <Reimar.Doeffinger at gmx.de>
>>> ---
>>>  libavcodec/aacenc_utils.h | 47 ++++++++++++++++++++++++++++++++++++++---------
>>>  1 file changed, 38 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/libavcodec/aacenc_utils.h b/libavcodec/aacenc_utils.h
>>> index b9bd6bf..1639021 100644
>>> --- a/libavcodec/aacenc_utils.h
>>> +++ b/libavcodec/aacenc_utils.h
>>> @@ -36,15 +36,29 @@
>>>  #define ROUND_TO_ZERO 0.1054f
>>>  #define C_QUANT 0.4054f
>>>  
>>> +#define ABSPOW(inv, outv) \
>>> +do { \
>>> +    float a = (inv); \
>>> +    a = fabsf(a); \
>>> +    (outv) = sqrtf(a * sqrtf(a)); \
>>> +} while(0)
>>> +
>>>  static inline void abs_pow34_v(float *out, const float *in, const int size)
>>>  {
>>>      int i;
>>> -    for (i = 0; i < size; i++) {
>>> -        float a = fabsf(in[i]);
>>> -        out[i] = sqrtf(a * sqrtf(a));
>>> +    for (i = 0; i < size - 3; i += 4) {
>>> +        ABSPOW(in[i], out[i]);
>>> +        ABSPOW(in[i+1], out[i+1]);
>>> +        ABSPOW(in[i+2], out[i+2]);
>>> +        ABSPOW(in[i+3], out[i+3]);
>>> +    }
>>
>> Are you sure this wasn't vectorized already? I remember I checked and it mostly
>> was, at least on gcc 5.3 mingw-w64 with default settings.
> 
> Then it would hardly get 10% faster, would it (though
> I admit I didn't test the two parts separately)?
> But I am fairly sure that before the patch it only
> used sqrtss instructions and not sqrtps.

Without your patch (GCC 5.3, mingw-w64 x86_64, default settings):

$ make libavcodec/aacenc_ltp.o && objdump -d -M intel libavcodec/aacenc_ltp.o | grep sqrtps
CC      libavcodec/aacenc_ltp.o
    1029:       0f 51 c8                sqrtps xmm1,xmm0
    102f:       0f 51 c0                sqrtps xmm0,xmm0
    161d:       0f 51 c8                sqrtps xmm1,xmm0
    1623:       0f 51 c0                sqrtps xmm0,xmm0
    1ccf:       0f 51 c8                sqrtps xmm1,xmm0
    1cd5:       0f 51 c0                sqrtps xmm0,xmm0
    2745:       0f 51 c8                sqrtps xmm1,xmm0
    274b:       0f 51 c0                sqrtps xmm0,xmm0
    34e4:       0f 51 c8                sqrtps xmm1,xmm0
    34ea:       0f 51 c0                sqrtps xmm0,xmm0
    42f7:       0f 51 c8                sqrtps xmm1,xmm0
    42fd:       0f 51 c0                sqrtps xmm0,xmm0
    44ac:       0f 51 c8                sqrtps xmm1,xmm0
    44b2:       0f 51 c0                sqrtps xmm0,xmm0

With your patch:

    11fd:       0f 51 c8                sqrtps xmm1,xmm0
    1203:       0f 51 c0                sqrtps xmm0,xmm0
    12cb:       0f 51 c8                sqrtps xmm1,xmm0
    12d1:       0f 51 c0                sqrtps xmm0,xmm0
    1d43:       0f 51 c8                sqrtps xmm1,xmm0
    1d49:       0f 51 c0                sqrtps xmm0,xmm0
    1e21:       0f 51 c8                sqrtps xmm1,xmm0
    1e27:       0f 51 c0                sqrtps xmm0,xmm0
    2964:       0f 51 c8                sqrtps xmm1,xmm0
    296a:       0f 51 c0                sqrtps xmm0,xmm0
    2a3c:       0f 51 c8                sqrtps xmm1,xmm0
    2a42:       0f 51 c0                sqrtps xmm0,xmm0
    35f3:       0f 51 c8                sqrtps xmm1,xmm0
    35f9:       0f 51 c0                sqrtps xmm0,xmm0
    36bc:       0f 51 c8                sqrtps xmm1,xmm0
    36c2:       0f 51 c0                sqrtps xmm0,xmm0
    457b:       0f 51 c8                sqrtps xmm1,xmm0
    4581:       0f 51 c0                sqrtps xmm0,xmm0
    464c:       0f 51 c8                sqrtps xmm1,xmm0
    4652:       0f 51 c0                sqrtps xmm0,xmm0
    54b3:       0f 51 c8                sqrtps xmm1,xmm0
    54b9:       0f 51 c0                sqrtps xmm0,xmm0
    558f:       0f 51 c8                sqrtps xmm1,xmm0
    5595:       0f 51 c0                sqrtps xmm0,xmm0
    56e4:       0f 51 c8                sqrtps xmm1,xmm0
    56ea:       0f 51 c0                sqrtps xmm0,xmm0

I didn't benchmark it, but the patch seems to help GCC vectorize more
efficiently: aacenc_ltp.o goes from 14 to 26 sqrtps instructions. So the patch
is probably OK, especially if in your case it is what allowed your compiler to
vectorize at all.