[FFmpeg-devel] [RFC] SSE3/4 implementation of flac_encode_residual_lpc
Jason Garrett-Glaser
darkshikari
Sat Apr 25 02:53:23 CEST 2009
>"lddqu -16(%3,%0), %%xmm4 \n\t" // xmm4 = smp [i-4 .. i-1]
LDDQU does not help here: on all SSE4-supporting CPUs it is equivalent
to movdqu, except that it takes one more byte to encode.
>"cvtdq2pd -8(%3,%0), %%xmm5 \n\t" // xmm5 = smp [i-2, i-1]
Is it really required to constantly convert in and out of floating
point here? Mubench ( http://akuvian.org/src/mubench_results.txt )
says that this operation is horrifically slow on Athlon 64, for
example. Why not use integer math?
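To make the integer suggestion concrete, here is a rough scalar sketch of an LPC residual computed without any floating-point conversion. The function name, argument layout, and the 64-bit accumulator are my own assumptions for illustration, not taken from the patch; the arithmetic is the usual FLAC form, sample minus (sum of coefficient * previous sample, shifted right).

```c
#include <stdint.h>

/* Hypothetical helper (not from the patch): LPC residual in pure integer math.
 * res[i] = smp[i] - ((sum_j coefs[j] * smp[i-j-1]) >> shift)  for i >= order.
 * Assumes >> on a negative value is an arithmetic shift, as it is on
 * mainstream x86 compilers. */
static void residual_lpc_int(int32_t *res, const int32_t *smp, int n,
                             int order, const int32_t *coefs, int shift)
{
    int i, j;

    /* The first `order` samples have no full history; copy them verbatim. */
    for (i = 0; i < order && i < n; i++)
        res[i] = smp[i];

    for (i = order; i < n; i++) {
        /* 64-bit accumulator is an assumption, chosen to avoid overflow. */
        int64_t pred = 0;
        for (j = 0; j < order; j++)
            pred += (int64_t)coefs[j] * smp[i - j - 1];
        res[i] = smp[i] - (int32_t)(pred >> shift);
    }
}
```

The inner multiply-accumulate is what a SIMD version would vectorize with integer multiplies instead of cvtdq2pd and double-precision math.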
>"phaddd %%xmm0, %%xmm0 \n\t"
PHADD is slow and should be avoided where possible. If you're looking
to sum the values in a register, a chain of binary-search-style
shift/add is better. Here's what x264 uses:
%macro HADDD 2
    movhlps %2, %1
    paddd   %1, %2
    pshuflw %2, %1, 0xE
    paddd   %1, %2
%endmacro
%macro HADDW 2
    pmaddwd %1, [pw_1 GLOBAL]
    HADDD   %1, %2
%endmacro
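For anyone more comfortable with intrinsics, the same shuffle/add sequence as HADDD can be sketched in C roughly as follows. The function name is made up for this example; it uses only SSE/SSE2 intrinsics.

```c
#include <stdint.h>
#include <emmintrin.h>

/* Hypothetical name: sum the four 32-bit lanes of v with two
 * shuffle+add steps instead of phaddd. */
static int32_t hadd_epi32(__m128i v)
{
    /* movhlps: copy the high 64 bits of v into the low 64 bits of tmp. */
    __m128 vf = _mm_castsi128_ps(v);
    __m128i tmp = _mm_castps_si128(_mm_movehl_ps(vf, vf));

    /* paddd: lanes 0 and 1 now hold v0+v2 and v1+v3. */
    v = _mm_add_epi32(v, tmp);

    /* pshuflw 0xE: move words 2,3 (dword lane 1) down to dword lane 0. */
    tmp = _mm_shufflelo_epi16(v, 0xE);

    /* paddd: lane 0 now holds v0+v1+v2+v3; the upper lanes are junk. */
    v = _mm_add_epi32(v, tmp);

    return _mm_cvtsi128_si32(v);
}
```

Only the low lane of the result is meaningful, which is fine when the sum is moved to a general-purpose register right afterwards.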
> +// TODO: look into palignr?
Yes, do this. Your code is going to be slow on Penryn, where
cacheline-split loads are very expensive.
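A minimal sketch of the palignr idea, assuming the sample buffer is 16-byte aligned and GCC or Clang is used (the function name and the fixed one-sample offset are just for illustration): do two aligned loads and stitch the misaligned window together in registers, so no load ever straddles a cacheline.

```c
#include <stdint.h>
#include <tmmintrin.h>

/* Hypothetical helper: fetch base[1..4] using two aligned loads plus
 * palignr. `base` must be 16-byte aligned and have at least 8 elements.
 * The target attribute (GCC/Clang) enables SSSE3 for this function only. */
__attribute__((target("ssse3")))
static void load_window_off1(int32_t out[4], const int32_t *base)
{
    __m128i lo = _mm_load_si128((const __m128i *)base);       /* base[0..3] */
    __m128i hi = _mm_load_si128((const __m128i *)(base + 4)); /* base[4..7] */

    /* Concatenate hi:lo, shift right by 4 bytes, keep the low 128 bits. */
    __m128i w = _mm_alignr_epi8(hi, lo, 4);

    _mm_storeu_si128((__m128i *)out, w);
}
```

The shift count of palignr is an immediate, so a real loop would need one unrolled variant per alignment offset, and `hi` can be reused as the next iteration's `lo`.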
Dark Shikari