[FFmpeg-devel] [PATCH] Altivec version of h264_idct_add
Luca Barbato
lu_zero
Sat Jun 2 10:25:11 CEST 2007
David Conrad wrote:
> Hi,
>
> This is an updated version of ff_h264_idct_add_altivec, based on a patch
> by Mauricio Alvarez [1]. It's 1.9 times faster than the scalar version
> on my G4. Regression tests pass except for seektest, which is currently
> broken for me with vanilla SVN (should it work?)
>
> +#define VEC_LOAD_U8_ADD_S16_STORE_U8(p,va,perm) \
> + vdst = vec_ld(0, p); \
> + vdst_ss = (vec_s16_t)vec_mergeh(zero_u8v, vdst); \
> + va = vec_add(va,vdst_ss); \
> + va_u8 = vec_packsu(va, zero_u8v); \
^^^^^^^^ the second argument (zero_u8v) should be zero_s16v
> + vfdst = vec_perm(vdst, va_u8, perm); \
> + vec_st(vfdst, 0, dst);
> +
Wouldn't vec_ste (word) work the same way?
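For reference, in scalar terms each invocation of the macro above only has to update four pixels of one row: widen the destination bytes, add the 16-bit residual, saturate back to 8 bits and write four bytes. A minimal sketch of that behaviour follows; clip_u8 and add_row_4 are hypothetical names used for illustration, not functions from the patch, and the sketch assumes the sum fits in 16 bits (vec_add itself wraps).

```c
#include <stdint.h>

/* Clamp to 0..255, which is what vec_packsu does for each element. */
static uint8_t clip_u8(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* Scalar model of one row: dst[0..3] = saturate(dst[0..3] + residual[0..3]).
 * Hypothetical helper, not part of the patch. Only 4 bytes are written,
 * which is why a 32-bit element store might be enough. */
static void add_row_4(uint8_t *dst, const int16_t *residual)
{
    for (int i = 0; i < 4; i++)
        dst[i] = clip_u8(dst[i] + residual[i]);
}
```

Since only one 32-bit word per row changes, a word-sized store looks sufficient in principle; whether it is actually faster than the load/permute/store sequence would need measuring.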
> + if ((unsigned long)dst & 0xF){
> + vec_u8_t vdst_mask;
> + switch ((unsigned long)dst & 0xF){
> + case 4:
> + dstperm = (vec_u8_t)AVV(0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13,
> + 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F);
> + break;
> + case 8:
> + dstperm = (vec_u8_t)AVV(0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
> + 0x10, 0x11, 0x12, 0x13, 0x0C, 0x0D, 0x0E, 0x0F);
> + break;
> + default: // case 12
> + dstperm = (vec_u8_t)AVV(0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
> + 0x08, 0x09, 0x0A, 0x0B, 0x10, 0x11, 0x12, 0x13);
> + break;
> + }
> +
> + vdst_mask = vec_lvsl(0, dst);
> +
> + VEC_LOAD_UNALIGNED_U8_ADD_S16_STORE_U8(dst,vdst_mask,va0,dstperm);
> + dst += stride;
> + VEC_LOAD_UNALIGNED_U8_ADD_S16_STORE_U8(dst,vdst_mask,va1,dstperm);
> + dst += stride;
> + VEC_LOAD_UNALIGNED_U8_ADD_S16_STORE_U8(dst,vdst_mask,va2,dstperm);
> + dst += stride;
> + VEC_LOAD_UNALIGNED_U8_ADD_S16_STORE_U8(dst,vdst_mask,va3,dstperm);
> + }
> + else{
> + dstperm = (vec_u8_t)AVV(0x10, 0x11, 0x12, 0x13, 0x04, 0x05, 0x06, 0x07,
> + 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F);
> + VEC_LOAD_U8_ADD_S16_STORE_U8(dst,va0,dstperm);
> + dst += stride;
> + VEC_LOAD_U8_ADD_S16_STORE_U8(dst,va1,dstperm);
> + dst += stride;
> + VEC_LOAD_U8_ADD_S16_STORE_U8(dst,va2,dstperm);
> + dst += stride;
> + VEC_LOAD_U8_ADD_S16_STORE_U8(dst,va3,dstperm);
> + }
> +}
> +
Overall (decode time)
g4: 1/100 to 1/70 faster.
cell: pretty much the same; the if+switch probably killed it.
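One possible way around the if+switch: all four dstperm constants in the patch follow the same pattern, identity everywhere except that bytes off..off+3 select 0x10..0x13, where off is dst & 0xF (0, 4, 8 or 12). So the mask could be derived from the offset, or looked up in a small table indexed by off >> 2, without branching. A scalar sketch of the pattern is below; build_dstperm is a hypothetical helper, not part of the patch, and I have not tested whether this helps on cell.

```c
#include <stdint.h>

/* Build the 16-byte permute pattern used by the patch for a given offset.
 * Bytes off..off+3 pick 0x10..0x13 (first four bytes of the second
 * vec_perm operand); every other byte picks itself from the first operand.
 * Hypothetical helper for illustration; off is assumed to be 0, 4, 8 or 12. */
static void build_dstperm(uint8_t perm[16], unsigned off)
{
    for (unsigned i = 0; i < 16; i++) {
        if (i >= off && i < off + 4)
            perm[i] = (uint8_t)(0x10 + (i - off));
        else
            perm[i] = (uint8_t)i;
    }
}
```

Precomputing the four results into a static table would reduce the per-call work to one aligned vector load.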
lu
--
Luca Barbato
Gentoo/linux Gentoo/PPC
http://dev.gentoo.org/~lu_zero