[FFmpeg-devel] [rfc] qualification task: SSE2 IDCT
Alexander Strange
astrange
Sun Mar 30 09:37:28 CEST 2008
I didn't have much time this week for anything but school, but I've
written a working SSE2 adaptation of simple_idct. It's not finished:
it's still too slow for me to accept, but I've run out of obvious
low-level optimizations with this approach and don't want to just
disappear.
Current times from dct-test:
IDCT SIMPLE-C: 3610.0 kdct/s
IDCT SIMPLE-MMX: 12738.6 kdct/s
IDCT SIMPLE-SSE2: 9086.8 kdct/s
IDCT XVID-MMX: 6837.2 kdct/s
IDCT XVID-MMX2: 7819.4 kdct/s
IDCT XVID-SKAL-SSE2: 11803.0 kdct/s
(Making minor changes to dct-test decreases all of these rates by 30%.
I guess there's some code alignment problem there, but it doesn't
affect accuracy.)
The current problems that I see are:
* SSE pack/unpack instructions are really slow on Core 2: 3 or 4 cycles
vs. 1 for MMX. I didn't realize this until I was done, so maybe I can
get rid of some of them. Loading the same DCT into the MMX registers
and using them for zero short-circuiting was actually faster than using
SSE... This might be better on A64 and Penryn.
* All the negations are folded into pmaddwd, so there aren't any psubd
uses in the main part. I'm not sure if psubd/paddd can be parallelized
any more than two adds.
* It doesn't use transposed input. How does SIMPLE_IDCT_PERM work? I
can't really see how it would save any shuffling in the row transform,
but then I haven't tried it. It seems like any other input order would
make checking for all-zero row ACs slower, which is the most important
bit.
* The column part might suck; it runs out of registers, so it can't
really be rescheduled, and I don't like the use of movq. Using
transposed input and the row transform twice would avoid it, but there
would have to be another transpose in the middle, using the slow
punpcklwd. The one in simple_idct_mmx looks clean, but I haven't
checked how it works yet.
* This is really easy to port to AltiVec and might be faster than the
current idct-altivec, which is different from both simple and xvid idct.
I'll try to get around to writing simple_idct_altivec sometime.
* skal should license his idct under LGPL so I can port it from nasm
without having to #ifdef it!
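
To illustrate the point about folding negations into pmaddwd, here is a
minimal sketch using SSE2 intrinsics. It is not taken from
simple_idct_sse2.c; the function name butterfly_pmaddwd and the
coefficient values in the usage note are hypothetical. The idea is that
a*c0 - b*c1 can be produced by pmaddwd against a coefficient vector in
which c1 is stored negated, so the sum and the difference both come from
multiply-adds and no psubd is needed.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper, not from the patch.
 * pairs[] holds four interleaved 16-bit pairs: a0 b0 a1 b1 a2 b2 a3 b3.
 * For each pair this computes
 *     sum[i]  = a_i * c0 + b_i * c1
 *     diff[i] = a_i * c0 - b_i * c1
 * The subtraction comes from pmaddwd with c1 negated in the constant
 * vector, so no psubd is issued. Assumes c1 != INT16_MIN. */
static void butterfly_pmaddwd(const int16_t pairs[8], int16_t c0, int16_t c1,
                              int32_t sum[4], int32_t diff[4])
{
    assert(pairs && sum && diff);

    const __m128i x = _mm_loadu_si128((const __m128i *)pairs);

    /* _mm_set_epi16 lists lanes from highest to lowest, so the last
     * argument is lane 0 and lines up with a0. */
    const __m128i k_add = _mm_set_epi16(c1, c0, c1, c0, c1, c0, c1, c0);
    const __m128i k_sub = _mm_set_epi16((int16_t)-c1, c0, (int16_t)-c1, c0,
                                        (int16_t)-c1, c0, (int16_t)-c1, c0);

    /* pmaddwd: 32-bit lane i = x[2i] * k[2i] + x[2i+1] * k[2i+1] */
    _mm_storeu_si128((__m128i *)sum,  _mm_madd_epi16(x, k_add));
    _mm_storeu_si128((__m128i *)diff, _mm_madd_epi16(x, k_sub));
}
```

For example, with pairs = {1, 2, 3, 4, 5, 6, 7, 8}, c0 = 3 and c1 = 5,
the first lane gives sum[0] = 1*3 + 2*5 = 13 and diff[0] = 1*3 - 2*5 = -7.
The cost is one extra constant vector per sign pattern.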
The dct-test patch depends on the last ones I posted.
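
For reference, the all-zero row AC check mentioned in the list above
could look roughly like this when done entirely in SSE2. This is only a
sketch under the assumption that a row is eight int16_t coefficients in
natural order with the DC term first; row_ac_is_zero is a hypothetical
name, and the attached code may do the test differently (for instance
via MMX registers, as described above).

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper, not from the patch.
 * Returns 1 if coefficients 1..7 of the row are all zero (DC-only row),
 * 0 otherwise. Assumes natural (non-permuted) coefficient order. */
static int row_ac_is_zero(const int16_t row[8])
{
    assert(row);

    const __m128i v = _mm_loadu_si128((const __m128i *)row);

    /* Shift the whole register right by 2 bytes: the DC term drops out
     * and a zero word is shifted in at the top. */
    const __m128i ac = _mm_srli_si128(v, 2);

    /* Each 16-bit lane becomes 0xFFFF where the coefficient is zero. */
    const __m128i eq = _mm_cmpeq_epi16(ac, _mm_setzero_si128());

    /* One mask bit per byte; 0xFFFF means every lane compared equal. */
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```

A row transform can branch on this result and take a DC-only shortcut.
With a permuted input order the AC terms would no longer be contiguous,
so the single byte shift above would not be enough to isolate them.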
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dcttest-sse2.diff
Type: application/octet-stream
Size: 527 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080330/f6b35c76/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: add-sse2idct.diff
Type: application/octet-stream
Size: 2092 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080330/f6b35c76/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simple_idct_sse2.c
Type: application/octet-stream
Size: 10554 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080330/f6b35c76/attachment-0002.obj>