[FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

Wed Jul 6 16:55:27 EEST 2016

Hi,

On Tue, Jul 5, 2016 at 10:37 PM, Dan Parrot <dan.parrot at mail.com> wrote:

> rgb24ToY_c              0.92


OK, so let's be data-driven from now on; I really don't like this
name-calling. Your average speedup is close to 1, so let's compare that
to x86. I applied this patch:

diff --git a/libswscale/hscale.c b/libswscale/hscale.c
index eca0635..5d0b39d 100644
--- a/libswscale/hscale.c
+++ b/libswscale/hscale.c
@@ -105,7 +105,9 @@ static int lum_convert(SwsContext *c, SwsFilterDescriptor *desc, int sliceY, int
         uint8_t * dst = desc->dst->plane[0].line[i];

         if (c->lumToYV12) {
+START_TIMER
             c->lumToYV12(dst, src[0], src[1], src[2], srcW, pal);
+STOP_TIMER("rgb24toy");
         } else if (c->readLumPlanar) {
             c->readLumPlanar(dst, src, srcW, c->input_rgb2yuv_table);
         }

Then I ran these command lines, first with the default CPU flags and then
with -cpuflags 0:

$ ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero \
    -pix_fmt yuv420p -f null -vframes 100 -v error -nostats - 2>&1 | tail -n1
  13890 decicycles in rgb24toy,   65428 runs,    108 skips
$ ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -cpuflags 0 -s hd1080 -i /dev/zero \
    -pix_fmt yuv420p -f null -vframes 100 -v error -nostats - 2>&1 | tail -n1
  62186 decicycles in rgb24toy,   65497 runs,     39 skips

As you can see, I get a ~4x speedup in this function (62186 vs. 13890
decicycles, about 4.5x) from the SIMD version (ff_rgb24ToY_avx) over the C
equivalent (rgb24ToY_c, which is what runs with -cpuflags 0). The AVX
function uses a register width of 16 bytes (i.e. not avx2). For PPC64,
which has equal register width in its altivec instruction set, I'd expect
a roughly equal speedup.
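
To put a number on it, here is a minimal C sketch (not part of the patch)
that derives the ratio from the two decicycle counts above. The function
names are made up for illustration, and the per-pixel figure rests on my
assumption that each timed call converts one 1920-pixel line, which I infer
from -s hd1080.

```c
#include <assert.h>
#include <stdio.h>

/* Decicycle counts copied from the two timer lines quoted above. */
#define SIMD_DECICYCLES   13890.0 /* default run, SIMD enabled */
#define SCALAR_DECICYCLES 62186.0 /* run with -cpuflags 0, C fallback */

/* Assumption (not stated in the output): one call handles one
 * 1920-pixel line, because the input size is hd1080. */
#define PIXELS_PER_CALL 1920.0

/* Hypothetical helper: ratio of scalar cost to SIMD cost.
 * A result above 1 means the SIMD path is faster. */
double speedup(double scalar_decicycles, double simd_decicycles)
{
    assert(simd_decicycles > 0.0);
    return scalar_decicycles / simd_decicycles;
}

/* Hypothetical helper: a decicycle is a tenth of a cycle, so divide
 * by 10 before normalizing by the number of pixels per call. */
double cycles_per_pixel(double decicycles, double pixels)
{
    assert(pixels > 0.0);
    return decicycles / 10.0 / pixels;
}

/* Print the derived figures for the two measurements. */
void report(void)
{
    printf("speedup:            %.2fx\n",
           speedup(SCALAR_DECICYCLES, SIMD_DECICYCLES));
    printf("SIMD cycles/pixel:  %.2f\n",
           cycles_per_pixel(SIMD_DECICYCLES, PIXELS_PER_CALL));
    printf("C cycles/pixel:     %.2f\n",
           cycles_per_pixel(SCALAR_DECICYCLES, PIXELS_PER_CALL));
}
```

Calling report() from a main() prints the three figures. Running the same
arithmetic on the PPC64 numbers would give a directly comparable ratio,
provided the timer brackets the same call site on both machines.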

I now want to figure out why you're not seeing a ~4x speedup in your
altivec/ppc64 implementation of rgb24ToY. Hopefully that can serve as a
template for understanding why, in general, you're not seeing any speedups.

Ronald