[FFmpeg-devel] [PATCH] mmx implementation of vc-1 inverse transformations
Michael Niedermayer
michaelni
Sat Jul 5 01:59:34 CEST 2008
On Thu, Jul 03, 2008 at 02:51:18PM +0200, Victor Pollex wrote:
> Michael Niedermayer schrieb:
>> On Sat, Jun 28, 2008 at 12:31:41PM +0200, Victor Pollex wrote:
[...]
>>
>> [...]
>>
>>
>>> +#define LOAD_ADD_CLAMP_STORE_8X1(io,t0,t1,r0,r1)\
>>> + "movq "#io", "#t0"\n\t"\
>>> + "movq "#t0", "#t1"\n\t"\
>>> + "punpcklbw %%mm7, "#t0"\n\t"\
>>> + "punpckhbw %%mm7, "#t1"\n\t"\
>>> + "paddw "#r0", "#t0"\n\t"\
>>> + "paddw "#r1", "#t1"\n\t"\
>>> + "packuswb "#t1", "#t0"\n\t"\
>>> + "movq "#t0", "#io"\n\t"
>>>
>>
>> some of the movq seem redundant
>>
> I'm sorry but I don't see it. Although this is now obsolete, I'd still like
> to know which one is redundant and why.
I thought io was a register; if it is a memory operand, things are different.
Sorry.
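For reference, a minimal scalar sketch of what the macro computes when io is
a memory operand: the first movq is then the load, the last movq is the store,
and the register copy is needed because both unpacks consume the loaded value.
The function names and the flat res[] array are hypothetical stand-ins for the
macro and for r0/r1, and the sketch assumes each sum fits in 16 bits (paddw
would wrap otherwise).

```c
#include <stdint.h>

/* Saturate to 0..255, as packuswb does for each 16-bit lane. */
static uint8_t clamp_u8(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Hypothetical scalar stand-in for LOAD_ADD_CLAMP_STORE_8X1.
 * io points at 8 pixels in memory; res holds 8 signed 16-bit residuals
 * (res[0..3] plays the role of r0, res[4..7] the role of r1). */
static void load_add_clamp_store_8x1(uint8_t *io, const int16_t *res)
{
    for (int i = 0; i < 8; i++) {
        /* punpcklbw/punpckhbw against a zero register widen the pixel */
        int wide = io[i];
        /* paddw adds the residual; assumed here not to overflow 16 bits */
        int sum = wide + res[i];
        /* packuswb saturates, then the final movq writes back to memory */
        io[i] = clamp_u8(sum);
    }
}
```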
[...]
the LOAD4/STORE4 patch is ok
[...]
> +/*
> + postcondition:
> + dst0 = [15:0](dst0 + src);
> + dst1 = [15:0](dst1 + src);
> + dst2 = [15:0](dst2 - src);
> + dst3 = [15:0](dst3 - src);
> +*/
> +#define ADD2SUB2(src, dst0, dst1, dst2, dst3)\
> + ADD1SUB1(src, dst0, dst2)\
> + ADD1SUB1(src, dst1, dst3)
This occurs 6 times, so it saves 6 ADD1SUB1 lines, while the macro with its
documentation needs 10 lines. Overall this is a loss: not only is it one more
macro the reader has to understand, it is also more code with the macro than
without.
[...]
> +/*
> + precondition:
> + for all values v in r0, r1, r2, r3: -3971 <= v <= 3971
> +
> + postcondition:
> + r3 = ((17 * (r0 + r2) + (22 * r1 + 10 * r3) + c) >> 3)
> + r4 = ((17 * (r0 - r2) - (10 * r1 - 22 * r3) + c) >> 3)
> + r1 = ((17 * (r0 - r2) + (10 * r1 - 22 * r3) + c) >> 3)
> + r2 = ((17 * (r0 + r2) - (22 * r1 + 10 * r3) + c) >> 3)
> + r0 undefined
> + r5 undefined
> + r6 undefined
> + r7 undefined
> +*/
> +#define TRANSFORM_4X4_ROW(r0,r1,r2,r3,r4,r5,r6,r7,c)\
> + TRANSPOSE4(r0,r1,r2,r3,r4)\
> + TRANSFORM_4X4_COMMON(r0,r3,r4,r2,r1,r5,r6,r7,c)\
> + "paddw "#r4", "#r4"\n\t" /* 2 * (r0 + r2) */\
> + SUMSUB_BA(r3,r4)\
> + "paddw "#r1", "#r3"\n\t"\
> + "paddw "#r7", "#r4"\n\t"\
> + "paddw "#r0", "#r0"\n\t" /* 2 * (r0 - r2) */\
> + SUMSUB_BA(r2,r0)\
> + "paddw "#r5", "#r0"\n\t"\
> + "paddw "#r6", "#r2"\n\t"\
> + TRANSPOSE4(r3,r0,r2,r4,r1)
It should be possible to merge one transpose into the scantable (the mpeg1/2/4
decoder does that too).
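A rough sketch of the idea, under stated assumptions: if the coefficients are
written into the block through a scan table whose entries already point at the
transposed positions, the block arrives transposed and the first TRANSPOSE4
can be dropped. The zigzag order below is a generic 4x4 example, not the VC-1
scan table, and all names are hypothetical.

```c
#include <stdint.h>

/* Illustrative 4x4 zigzag order (row-major indices). Generic example only,
 * not the actual VC-1 scan table. */
static const uint8_t zigzag4x4[16] = {
     0,  1,  4,  8,
     5,  2,  3,  6,
     9, 12, 13, 10,
     7, 11, 14, 15
};

/* Build a scan table whose entries address the transposed position:
 * an entry for (row, col) becomes an entry for (col, row). */
static void build_transposed_scan(const uint8_t *scan, uint8_t *out)
{
    for (int i = 0; i < 16; i++) {
        int row = scan[i] >> 2;
        int col = scan[i] & 3;
        out[i] = (uint8_t)(col * 4 + row);
    }
}

/* Store decoded coefficients into a 4x4 block through a scan table. */
static void scatter(const int16_t *coeffs, const uint8_t *scan, int16_t *block)
{
    for (int i = 0; i < 16; i++)
        block[scan[i]] = coeffs[i];
}
```

The table is built once at init time, so the transpose costs nothing per
block.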
[...]
> +/*
> + postcondition:
> + r0 = [15:0](2 * r0);
> + r1 = [15:0](3 * r0);
> +*/
> +#define G3X(r0,r1)\
> + "movq "#r0", "#r1"\n\t" /* r0 */\
> + "paddw "#r0", "#r0"\n\t" /* 2 * r0 */\
> + "paddw "#r0", "#r1"\n\t" /* 3 * r0 */
4 uses, saving 8 lines; the macro with docs is 9 lines, so this is a loss too.
[...]
> +static void vc1_inv_trans_4x4_mmx(uint8_t *dest, int linesize, DCTELEM *block)
> +{
> + asm volatile(
> + LOAD4(0x10,0x00(%0),%%mm0,%%mm1,%%mm2,%%mm3)
> + TRANSFORM_4X4_ROW(%%mm0,%%mm1,%%mm2,%%mm3,%%mm4,%%mm5,%%mm6,%%mm7,%3)
> + TRANSFORM_4X4_COL(%%mm3,%%mm4,%%mm1,%%mm2,%%mm0,%%mm5,%%mm6,%%mm7,0x08+%3)
> + "pxor %%mm7, %%mm7\n\t"
> + LOAD_ADD_CLAMP_STORE_4X4(%2,%1,%%mm0,%%mm4,%%mm3,%%mm2,%%mm1)
> + :
> + : "r"(block), "r"(dest), "r"(linesize), "m"(constants[0])
> + : "memory"
You are modifying a register (%1) that is only an input. This isn't
correct; gcc could assume it didn't change ...
It should be "+r", not "r", and listed with the outputs.
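A minimal sketch of the constraint fix, not the patch itself: the helper name
and the single add instruction are made up for illustration, and the x86 asm
is guarded so other targets fall back to plain C.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: advance a pointer by stride bytes inside inline asm. */
static uint8_t *advance_row(uint8_t *dest, ptrdiff_t stride)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile(
        "add %1, %0\n\t"   /* dest += stride: the asm writes operand 0 */
        : "+r"(dest)       /* read-write operand, listed with the outputs */
        : "r"(stride)      /* read-only input */
        : "cc"             /* add changes the flags */
    );
#else
    dest += stride;        /* portable fallback for non-x86 builds */
#endif
    return dest;
}
```

With a plain "r" input the compiler may keep using the register as if it
still held the old pointer, which is the failure mode described above.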
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Asymptotically faster algorithms should always be preferred if you have
asymptotical amounts of data