[FFmpeg-devel] [PATCH] mmx implementation of vc-1 inverse transformations

Sat Jul 5 01:59:34 CEST 2008

On Thu, Jul 03, 2008 at 02:51:18PM +0200, Victor Pollex wrote:
> Michael Niedermayer schrieb:
>> On Sat, Jun 28, 2008 at 12:31:41PM +0200, Victor Pollex wrote:
[...]
>>
>> [...]
>>
>>   
>>> +#define LOAD_ADD_CLAMP_STORE_8X1(io,t0,t1,r0,r1)\
>>> +    "movq      "#io", "#t0"\n\t"\
>>> +    "movq      "#t0", "#t1"\n\t"\
>>> +    "punpcklbw %%mm7, "#t0"\n\t"\
>>> +    "punpckhbw %%mm7, "#t1"\n\t"\
>>> +    "paddw     "#r0", "#t0"\n\t"\
>>> +    "paddw     "#r1", "#t1"\n\t"\
>>> +    "packuswb  "#t1", "#t0"\n\t"\
>>> +    "movq      "#t0", "#io"\n\t"
>>>     
>>
>> some of the movq seem redundant
>>   
> I'm sorry but I don't see it. Although this is now obsolete, I'd still like 
> to know which one is redundant and why.

i thought io was a register; if it is memory, then things are different and
the movq load and store are both needed.
sorry
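
For reference, the quoted macro can be modelled in scalar C roughly as below.
This is only a sketch: the function names are made up for illustration, io is
taken to be a row of 8 pixels in memory, res[0..3] stands for r0 and res[4..7]
for r1, and each sum is assumed to fit in 16 bits so that paddw does not wrap.

```c
#include <stdint.h>

/* Clamp a signed value to 0..255, as packuswb does for each 16-bit lane. */
static uint8_t clamp_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar model of one 8-pixel row (hypothetical helper, not from the patch):
 * load the pixels, widen them, add the 16-bit residuals, clamp, store back. */
static void load_add_clamp_store_8x1(uint8_t *io, const int16_t *res)
{
    for (int i = 0; i < 8; i++)
        io[i] = clamp_u8(io[i] + res[i]);
}
```

With io in memory, the first movq is the load, the second copies the row so
that the low and high halves can be unpacked separately, and the last one is
the store.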

[...]

the LOAD4/STORE4 patch is ok

[...]

> +/*
> +    postcondition:
> +        dst0 = [15:0](dst0 + src);
> +        dst1 = [15:0](dst1 + src);
> +        dst2 = [15:0](dst2 - src);
> +        dst3 = [15:0](dst3 - src);
> +*/
> +#define ADD2SUB2(src, dst0, dst1, dst2, dst3)\
> +    ADD1SUB1(src, dst0, dst2)\
> +    ADD1SUB1(src, dst1, dst3)

This occurs 6 times, so it saves 6 ADD1SUB1 lines, while the macro with its
documentation needs 10 lines. Overall this is a loss: it is one more macro
the reader has to understand, and it is also more code with the macro than
without.

[...]
> +/*
> +    precondition:
> +        for all values v in r0, r1, r2, r3: -3971 <= v <= 3971
> +
> +    postcondition:
> +        r3 = ((17 * (r0 + r2) + (22 * r1 + 10 * r3) + c) >> 3)
> +        r4 = ((17 * (r0 - r2) - (10 * r1 - 22 * r3) + c) >> 3)
> +        r1 = ((17 * (r0 - r2) + (10 * r1 - 22 * r3) + c) >> 3)
> +        r2 = ((17 * (r0 + r2) - (22 * r1 + 10 * r3) + c) >> 3)
> +        r0 undefined
> +        r5 undefined
> +        r6 undefined
> +        r7 undefined
> +*/
> +#define TRANSFORM_4X4_ROW(r0,r1,r2,r3,r4,r5,r6,r7,c)\
> +    TRANSPOSE4(r0,r1,r2,r3,r4)\
> +    TRANSFORM_4X4_COMMON(r0,r3,r4,r2,r1,r5,r6,r7,c)\
> +    "paddw "#r4", "#r4"\n\t" /* 2 * (r0 + r2) */\
> +    SUMSUB_BA(r3,r4)\
> +    "paddw "#r1", "#r3"\n\t"\
> +    "paddw "#r7", "#r4"\n\t"\
> +    "paddw "#r0", "#r0"\n\t" /* 2 * (r0 - r2) */\
> +    SUMSUB_BA(r2,r0)\
> +    "paddw "#r5", "#r0"\n\t"\
> +    "paddw "#r6", "#r2"\n\t"\
> +    TRANSPOSE4(r3,r0,r2,r4,r1)

It should be possible to merge one transpose into the scantable (the mpeg1/2/4
decoder does that too).
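
A rough sketch of the idea in plain C, assuming a 4x4 block stored row by row
(position = 4 * row + col). The helper names and any scan order passed in are
hypothetical, not the decoder's real tables: if the coefficients are written
through a transposed scan table, the block arrives already transposed and the
IDCT can skip one TRANSPOSE4.

```c
#include <stdint.h>

/* Build a scan table with every position transposed inside the 4x4 block:
 * position 4 * row + col becomes 4 * col + row. */
static void transpose_scan_4x4(uint8_t dst[16], const uint8_t src[16])
{
    for (int i = 0; i < 16; i++) {
        int row = src[i] >> 2;
        int col = src[i] & 3;
        dst[i] = (uint8_t)(4 * col + row);
    }
}

/* Store coefficients, given in scan order, into a 4x4 block. */
static void place_coeffs_4x4(int16_t block[16], const int16_t coeffs[16],
                             const uint8_t scan[16])
{
    for (int i = 0; i < 16; i++)
        block[scan[i]] = coeffs[i];
}
```

The table is transposed once at init time, so the per-block cost of the
transpose disappears.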

[...]
> +/*
> +    postcondition:
> +        r0 = [15:0](2 * r0);
> +        r1 = [15:0](3 * r0);
> +*/
> +#define G3X(r0,r1)\
> +    "movq  "#r0", "#r1"\n\t" /* r0 */\
> +    "paddw "#r0", "#r0"\n\t" /* 2 * r0 */\
> +    "paddw "#r0", "#r1"\n\t" /* 3 * r0 */

4 uses, saving 8 lines; the macro with docs is 9 lines, so this is a loss too

[...]
> +static void vc1_inv_trans_4x4_mmx(uint8_t *dest, int linesize, DCTELEM *block)
> +{
> +    asm volatile(
> +    LOAD4(0x10,0x00(%0),%%mm0,%%mm1,%%mm2,%%mm3)
> +    TRANSFORM_4X4_ROW(%%mm0,%%mm1,%%mm2,%%mm3,%%mm4,%%mm5,%%mm6,%%mm7,%3)
> +    TRANSFORM_4X4_COL(%%mm3,%%mm4,%%mm1,%%mm2,%%mm0,%%mm5,%%mm6,%%mm7,0x08+%3)
> +    "pxor %%mm7, %%mm7\n\t"
> +    LOAD_ADD_CLAMP_STORE_4X4(%2,%1,%%mm0,%%mm4,%%mm3,%%mm2,%%mm1)
> +    :
> +    : "r"(block), "r"(dest), "r"(linesize), "m"(constants[0])
> +    : "memory"

you are modifying a register (%1) which is just an input; this isn't
correct, gcc could assume it didn't change ...
it should be "+r" not "r" and listed with the outputs.
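
A minimal sketch of the constraint change, not the actual fix for the patch:
advance_row is a made-up helper, and the stride is taken as ptrdiff_t so that
both operands have pointer width. The point is only that an operand the asm
writes to goes into the output list with "+r".

```c
#include <stddef.h>
#include <stdint.h>

/* Advance a pointer by one stride (hypothetical helper).
 * The asm modifies p, so p is a read-write operand ("+r") in the output
 * list; stride is only read and stays a plain "r" input. */
static uint8_t *advance_row(uint8_t *p, ptrdiff_t stride)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ (
        "add %1, %0\n\t"
        : "+r"(p)        /* read and written by the asm */
        : "r"(stride)    /* read only */
        : "cc"           /* add changes the flags */
    );
#else
    p += stride;         /* portable fallback for other compilers/targets */
#endif
    return p;
}
```

With "r"(p) in the input list instead, gcc would be free to keep using the
old value of p after the asm statement.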

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Asymptotically faster algorithms should always be preferred if you have
asymptotical amounts of data