[FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
Ronald S. Bultje
rsbultje at gmail.com
Wed Jul 6 16:55:27 EEST 2016
Hi,
On Tue, Jul 5, 2016 at 10:37 PM, Dan Parrot <dan.parrot at mail.com> wrote:
> rgb24ToY_c 0.92
OK, so let's be data-driven from now on; I really don't like this
name-calling. Your speedup on average is close to 1, so let's
compare this to x86. I ran this patch:
diff --git a/libswscale/hscale.c b/libswscale/hscale.c
index eca0635..5d0b39d 100644
--- a/libswscale/hscale.c
+++ b/libswscale/hscale.c
@@ -105,7 +105,9 @@ static int lum_convert(SwsContext *c, SwsFilterDescriptor *desc, int sliceY, int
         uint8_t * dst = desc->dst->plane[0].line[i];
         if (c->lumToYV12) {
+START_TIMER
             c->lumToYV12(dst, src[0], src[1], src[2], srcW, pal);
+STOP_TIMER("rgb24toy");
         } else if (c->readLumPlanar) {
             c->readLumPlanar(dst, src, srcW, c->input_rgb2yuv_table);
         }
And then I ran these command lines:
$ ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero \
    -pix_fmt yuv420p -f null -vframes 100 -v error -nostats - 2>&1 | tail -n1
13890 decicycles in rgb24toy, 65428 runs, 108 skips
$ ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -cpuflags 0 -s hd1080 -i /dev/zero \
    -pix_fmt yuv420p -f null -vframes 100 -v error -nostats - 2>&1 | tail -n1
62186 decicycles in rgb24toy, 65497 runs, 39 skips
As you can see, I get a ~4x speedup in this function (62186 vs. 13890
decicycles) from using the AVX SIMD function (ff_rgb24ToY_avx) instead of
the C equivalent (rgb24ToY_c). The AVX function uses a register width of
16 bytes (i.e. not avx2). PPC64 has the same register width in its altivec
instruction set, so I'd expect a roughly equal speedup there.
I now want to figure out why you're not seeing a ~4x speedup in your
altivec/ppc64 implementation of rgb24ToY, and hopefully that can serve as a
template for understanding why, in general, you're not seeing any speedups.
Ronald