[FFmpeg-devel] [PATCH] Altivec version of h264_idct_add

Sat Jun 2 10:25:11 CEST 2007

David Conrad wrote:
> Hi,
> 
> This is an updated version of ff_h264_idct_add_altivec, based on a patch
> by Mauricio Alvarez [1]. It's 1.9 times faster than the scalar version
> on my G4. Regression tests pass except for seektest, which is currently
> broken for me with vanilla SVN (should it work?)
> 
> +#define VEC_LOAD_U8_ADD_S16_STORE_U8(p,va,perm)               \
> +    vdst = vec_ld(0, p);                                      \
> +    vdst_ss = (vec_s16_t)vec_mergeh(zero_u8v, vdst);          \
> +    va = vec_add(va,vdst_ss);                                 \
> +    va_u8 = vec_packsu(va, zero_u8v);                         \
                              ^^^^^^^^ zero_s16v
> +    vfdst = vec_perm(vdst, va_u8, perm);                      \
> +    vec_st(vfdst, 0, dst);
> +

Wouldn't vec_ste (word) work the same way?
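Whichever store ends up being used, each row has to come out the same. Here is a minimal scalar sketch of that result: four destination bytes plus four 16-bit residuals, saturated to 0..255 the way vec_packsu does it. The helper names are made up for illustration, and it assumes the residuals are already in their final scaled form and that the sums fit in 16 bits.

```c
#include <stdint.h>

/* Clamp a sum to 0..255, mirroring the per-element unsigned
 * saturation that vec_packsu applies to signed 16-bit input. */
static uint8_t clip_u8(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* Hypothetical helper (not part of the patch): add four residuals
 * to four destination pixels in place, one row of a 4x4 block. */
static void add_residual_row4(uint8_t *dst, const int16_t *res)
{
    for (int i = 0; i < 4; i++)
        dst[i] = clip_u8(dst[i] + res[i]);
}
```

Only these four bytes of the row may change. Any vector store has to leave the other twelve bytes of the 16-byte line untouched, which is what the permute before vec_st is for.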

> +    if ((unsigned long)dst & 0xF){
> +        vec_u8_t vdst_mask;
> +        switch ((unsigned long)dst & 0xF){
> +        case 4:
> +            dstperm = (vec_u8_t)AVV(0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13,
> +                                    0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F);
> +            break;
> +        case 8:
> +            dstperm = (vec_u8_t)AVV(0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
> +                                    0x10, 0x11, 0x12, 0x13, 0x0C, 0x0D, 0x0E, 0x0F);
> +            break;
> +        default:    // case 12
> +            dstperm = (vec_u8_t)AVV(0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
> +                                    0x08, 0x09, 0x0A, 0x0B, 0x10, 0x11, 0x12, 0x13);
> +            break;
> +        }
> +
> +        vdst_mask = vec_lvsl(0, dst);
> +
> +        VEC_LOAD_UNALIGNED_U8_ADD_S16_STORE_U8(dst,vdst_mask,va0,dstperm);
> +        dst += stride;
> +        VEC_LOAD_UNALIGNED_U8_ADD_S16_STORE_U8(dst,vdst_mask,va1,dstperm);
> +        dst += stride;
> +        VEC_LOAD_UNALIGNED_U8_ADD_S16_STORE_U8(dst,vdst_mask,va2,dstperm);
> +        dst += stride;
> +        VEC_LOAD_UNALIGNED_U8_ADD_S16_STORE_U8(dst,vdst_mask,va3,dstperm);
> +    }
> +    else{
> +        dstperm = (vec_u8_t)AVV(0x10, 0x11, 0x12, 0x13, 0x04, 0x05, 0x06, 0x07,
> +                                0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F);
> +        VEC_LOAD_U8_ADD_S16_STORE_U8(dst,va0,dstperm);
> +        dst += stride;
> +        VEC_LOAD_U8_ADD_S16_STORE_U8(dst,va1,dstperm);
> +        dst += stride;
> +        VEC_LOAD_U8_ADD_S16_STORE_U8(dst,va2,dstperm);
> +        dst += stride;
> +        VEC_LOAD_U8_ADD_S16_STORE_U8(dst,va3,dstperm);
> +    }
> +}
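To make the dstperm constants easier to check, here is a scalar model of vec_perm applied to the "case 4" mask from the patch. It is only a sketch: perm16 and perm_off4 are illustrative names, and plain byte arrays stand in for the vector registers.

```c
#include <stdint.h>

/* Scalar model of vec_perm: only the low 5 bits of each mask byte
 * are used; values 0x00..0x0F pick a byte from a, 0x10..0x1F pick
 * a byte from b. */
static void perm16(uint8_t out[16], const uint8_t a[16],
                   const uint8_t b[16], const uint8_t mask[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t m = mask[i] & 0x1F;
        out[i] = (m & 0x10) ? b[m & 0x0F] : a[m];
    }
}

/* The "case 4" mask quoted above: bytes 4..7 come from the first
 * four bytes of the second operand, everything else from the first. */
static const uint8_t perm_off4[16] = {
    0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13,
    0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
};
```

With a as the old destination line and b as the packed result, only bytes 4..7 of the line are replaced. The other masks move the same four-byte window to offsets 0, 8 and 12.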
> +

Overall (decode time)

g4: between 1/100 and 1/70 faster.

cell: pretty much the same; the if+switch probably killed it.
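One possible way to drop the switch is to index a small table of permute masks by the destination offset. This is untested on either machine, so treat it as a sketch: dstperm_table and dstperm_index are hypothetical names, dst is assumed to be 4-byte aligned, and in real AltiVec code the table would be a 16-byte-aligned array of vec_u8_t read with vec_ld.

```c
#include <stdint.h>

/* The four dstperm masks from the patch, ordered by destination
 * offset within the 16-byte line: 0, 4, 8, 12. */
static const uint8_t dstperm_table[4][16] = {
    { 0x10, 0x11, 0x12, 0x13, 0x04, 0x05, 0x06, 0x07,
      0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F },
    { 0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13,
      0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F },
    { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
      0x10, 0x11, 0x12, 0x13, 0x0C, 0x0D, 0x0E, 0x0F },
    { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
      0x08, 0x09, 0x0A, 0x0B, 0x10, 0x11, 0x12, 0x13 }
};

/* Map an address to a table row; assumes the address is a
 * multiple of 4, as for 4x4 block destinations. */
static unsigned dstperm_index(uintptr_t addr)
{
    return (unsigned)((addr & 0xF) >> 2);
}
```

This only replaces the switch. The aligned and unaligned paths in the patch still use different load macros, so the outer if remains unless those are unified as well.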

lu

-- 

Luca Barbato

Gentoo/linux Gentoo/PPC
http://dev.gentoo.org/~lu_zero