[FFmpeg-devel] [RFC] SSE3/4 implementation of flac_encode_residual_lpc

Mon May 4 07:00:54 CEST 2009

On Sun, May 3, 2009 at 9:46 PM, Bobby Bingham <uhmmmm at gmail.com> wrote:
> On Sun, 3 May 2009 21:21:19 -0700
> Jason Garrett-Glaser <darkshikari at gmail.com> wrote:
>
>> On Sun, May 3, 2009 at 8:39 PM, Bobby Bingham <uhmmmm at gmail.com>
>> wrote:
>> > On Sat, 25 Apr 2009 03:03:30 +0000 (UTC)
>> > Loren Merritt <lorenm at u.washington.edu> wrote:
>> >
>> >> On Fri, 24 Apr 2009, Bobby Bingham wrote:
>> >>
>> >> > Attached are patches to move flac_encode_residual_lpc to
>> >> > dsputils, and to add SSE3 and SSE4 implementations.  I wrote the
>> >> > SSE3 first, but since it doesn't have signed 32x32
>> >> > multiplication AFAICT, I ended up using double precision floats
>> >> > for it, and the result is code that's slower than the C version.
>> >> > Unless somebody has a suggestion of how to fix this, I'll drop
>> >> > the SSE3 version.
>> >> >
>> >> > I tried an SSE4 version because it does have signed 32x32->32
>> >> > multiplication, like the C version uses.  Unfortunately, I don't
>> >> > have an SSE4-capable processor to test it with, so I can't check
>> >> > its speed or even its correctness.  Benchmarks welcome.
>> >>
>> >> fails regression test on my Penryn.
>> >>
>> >> > +// TODO: look into palignr?
>> >>
>> >> Yea, do that. It should be possible to load each sample just once
>> >> (aligned), and do all other manipulation in registers.
>> >> There are no cpus with both lddqu and sse4, so you're paying the
>> >> full cost of unaligned loads.
>> >
>> > I've changed the code to use palignr, and hopefully fixed it to work
>> > correctly now.  I've also removed the SSE3 code from this patch as I
>> > haven't managed to get it any faster by using integer arithmetic
>> > yet.
>>
>> >"movdqu  -16(%3,%0), %%xmm4         \n\t"   // xmm4 = smp  [i-4 .. i-1]
>> >"movdqu  -12(%3,%0), %%xmm6         \n\t"   // xmm6 = smp  [i-3 .. i  ]
>>
>> Any reason you didn't use palignr here?
>
> Because it slipped my mind?
>
>>
>> >"movdqu     %%xmm5, %2              \n\t"
>>
>> Is there a good reason why this store has to be unaligned?
>
> Not if the calling code is changed to ensure the input and output
> arrays have the same alignment.
>
>>
>> > "phaddd     %%xmm1, %%xmm0          \n\t"
>> > "phaddd     %%xmm3, %%xmm2          \n\t"
>> > "phaddd     %%xmm2, %%xmm0          \n\t"   // xmm0 = [p0, p1, p2, p3]
>>
>> Did you not find a better way of doing this without PHADD, given how
>> slow it is?
>
> Also slipped my mind.
>
>>
>> >pmulld
>>
>> pmulld is really really slow (6 clocks on Nehalem!).  If you make
>> certain assumptions about the nature of the input data (say, restrict
>> your code to only 16-bit samples), you might be able to use a faster
>> instruction.
>
> Well, when trying to get rid of the floating point conversions in the
> SSE3 version, I tried using 16 bit multiplies along with the fact that
> the lpc coefficients are <16 bits, but I didn't manage to get it even to
> the speed of the floating point code because like Loren said, the
> samples might be 17 bits after stereo decorrelation, and there doesn't
> seem to be any single instruction signed 16x16->32 multiply instruction
> that I've seen.  I expect that pmulld is faster than trying to
> implement something similar on top of other instructions, but as I
> don't have an SSE4 capable CPU, I can't really benchmark it.

Stereo decorrelation means *some* samples might be 17 bits, but that
still leaves a lot of data where the samples aren't 17 bits, from what
I recall Loren saying.  So special-casing the code may still be
worthwhile.
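
A minimal sketch of the range check such a special case would need, in
plain C.  The helper name samples_fit_16bit and its signature are
hypothetical and not taken from the patch; it only reports whether every
sample in a block fits in a signed 16-bit value, so that a caller could
pick a 16-bit path for that block and fall back to the generic 32-bit
path otherwise.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the patch: returns 1 if every sample
 * in smp[0..n-1] fits in a signed 16-bit value, 0 otherwise. */
static int samples_fit_16bit(const int32_t *smp, int n)
{
    for (int i = 0; i < n; i++)
        if (smp[i] < INT16_MIN || smp[i] > INT16_MAX)
            return 0;   /* e.g. a 17-bit sample after stereo decorrelation */
    return 1;
}
```

Whether the scan pays for itself depends on how often blocks actually
qualify, which would have to be measured.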

pmaddwd is your 16x16->32 signed multiply instruction.  It will do
just as much work as pmulld in the case where the data is limited to
16 bits--except at twice the speed.
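
To make the lane layout concrete, here is a scalar C model of what one
128-bit pmaddwd computes (the SSE2 intrinsic for it is _mm_madd_epi16).
It is a sketch for illustration only: pmaddwd_model and dot4_16bit are
hypothetical names, and the 4-tap dot product assumes that both the
coefficients and the samples already fit in 16 bits.

```c
#include <stdint.h>

/* Scalar model of one 128-bit pmaddwd: eight signed 16-bit lanes per
 * operand, four signed 32-bit results.  Each result is the sum of two
 * adjacent 16x16->32 products. */
static void pmaddwd_model(const int16_t a[8], const int16_t b[8],
                          int32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        int64_t sum = (int64_t)a[2*i]     * b[2*i]
                    + (int64_t)a[2*i + 1] * b[2*i + 1];
        /* The sum only leaves the int32 range when all four inputs are
         * -32768; the hardware wraps in that case.  The cast below is
         * implementation-defined in C and wraps on common compilers. */
        out[i] = (int32_t)sum;
    }
}

/* Hypothetical 4-tap dot product built on the model: the upper four
 * lanes are left at zero, so out[0] + out[1] is the full sum. */
static int32_t dot4_16bit(const int16_t coefs[4], const int16_t smp[4])
{
    int16_t a[8] = {0}, b[8] = {0};
    int32_t out[4];

    for (int i = 0; i < 4; i++) {
        a[i] = coefs[i];
        b[i] = smp[i];
    }
    pmaddwd_model(a, b, out);
    return out[0] + out[1];
}
```

Because the instruction already adds adjacent products, part of the
horizontal summing comes for free, which also reduces the need for
phaddd.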

Dark Shikari