[FFmpeg-devel] [PATCH 2/3] x86/float_dsp: unroll loop in vector_fmac_scalar
James Almer
jamrial at gmail.com
Wed Apr 16 18:12:23 CEST 2014
On 16/04/14 7:06 AM, Christophe Gisquet wrote:
> Hi,
>
>> ~6% faster SSE2 performance. AVX/FMA3 are unaffected.
>
> What CPU, environment and test case have you used?
>
> For SSE2, if I'm not mistaken, the difference in the code is having
> different regs used in the unrolled part. When I tested that with AAC,
> which often performs calls for 64 elements, this was not a win for mingw64.
>
> But a 6% win for most typical systems is certainly better than a <1% loss
> for a few. I'm OK with the change otherwise.
>
> Best regards,
> Christophe
Athlon 64 7750+, mingw-w64. It went from 274 cycles to 257 when I benched with
the dts-es sample I uploaded for the FATE test.
Also, does aac even use vector_fmac_scalar? A grep on libavcodec shows
results only in dcadec.c.
The objdump disassembly for win64 looks like this pre-patch:
movaps (%rdx,%r9,1),%xmm1
mulps %xmm0,%xmm1
movaps 0x10(%rdx,%r9,1),%xmm2
mulps %xmm0,%xmm2
addps (%rcx,%r9,1),%xmm1
addps 0x10(%rcx,%r9,1),%xmm2
movaps %xmm1,(%rcx,%r9,1)
movaps %xmm2,0x10(%rcx,%r9,1)
movaps 0x20(%rdx,%r9,1),%xmm1
mulps %xmm0,%xmm1
movaps 0x30(%rdx,%r9,1),%xmm2
mulps %xmm0,%xmm2
addps 0x20(%rcx,%r9,1),%xmm1
addps 0x30(%rcx,%r9,1),%xmm2
movaps %xmm1,0x20(%rcx,%r9,1)
movaps %xmm2,0x30(%rcx,%r9,1)
And post-patch:
movaps (%rdx,%r9,1),%xmm1
mulps %xmm2,%xmm1
movaps 0x10(%rdx,%r9,1),%xmm0
mulps %xmm2,%xmm0
movaps 0x20(%rdx,%r9,1),%xmm3
mulps %xmm2,%xmm3
movaps 0x30(%rdx,%r9,1),%xmm4
mulps %xmm2,%xmm4
addps (%rcx,%r9,1),%xmm1
addps 0x10(%rcx,%r9,1),%xmm0
addps 0x20(%rcx,%r9,1),%xmm3
addps 0x30(%rcx,%r9,1),%xmm4
movaps %xmm1,(%rcx,%r9,1)
movaps %xmm0,0x10(%rcx,%r9,1)
movaps %xmm3,0x20(%rcx,%r9,1)
movaps %xmm4,0x30(%rcx,%r9,1)
The difference in the resulting code is in the order of instructions, thanks
to the unrolling of the loop. The mulps now have enough room to finish before
the addps are executed, and so do the addps before the movaps stores to memory.
Currently the addps come basically right after the mulps, which is afaik not
optimal.
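
For readers who don't read AT&T syntax, here is a rough C sketch of the same idea. It is not the actual yasm code from the patch: the function names fmac_scalar_ref and fmac_scalar_grouped are made up for illustration, each scalar float stands in for one xmm register, and a C compiler is free to reorder these statements anyway. The first function is just the operation being computed (dst[i] += src[i] * mul); the second groups all multiplies, then all adds, then all stores, like the post-patch listing.

```c
#include <assert.h>

/* Reference semantics of the operation: dst[i] += src[i] * mul. */
static void fmac_scalar_ref(float *dst, const float *src, float mul, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] += src[i] * mul;
}

/* Hypothetical 4-way unrolled variant mirroring the post-patch ordering.
 * Assumption for this sketch: len is a multiple of 4. */
static void fmac_scalar_grouped(float *dst, const float *src, float mul, int len)
{
    assert(len % 4 == 0);

    for (int i = 0; i < len; i += 4) {
        /* Four independent multiplies issued back to back (the mulps group). */
        float t0 = src[i + 0] * mul;
        float t1 = src[i + 1] * mul;
        float t2 = src[i + 2] * mul;
        float t3 = src[i + 3] * mul;

        /* Adds start only after all multiplies were issued (the addps group). */
        t0 += dst[i + 0];
        t1 += dst[i + 1];
        t2 += dst[i + 2];
        t3 += dst[i + 3];

        /* Stores come last (the movaps-to-memory group). */
        dst[i + 0] = t0;
        dst[i + 1] = t1;
        dst[i + 2] = t2;
        dst[i + 3] = t3;
    }
}
```

Both functions do the same arithmetic per element; only the order in which independent operations are issued differs, which is the whole point of the patch.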