[FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

Mon Jul 4 18:20:47 EEST 2016

On Mon, 2016-07-04 at 09:20 +0000, Carl Eugen Hoyos wrote:
> Dan Parrot <dan.parrot <at> mail.com> writes:
> 
> > The dataset used was the entire FATE regression suite.
> 
> I don't think this is a particularly useful testcase:
> It takes very long but mostly tests other things.
> 
> Did you test if using ffmpeg -benchmark -f rawvideo -i /dev/zero... 
> showed different results?
> I believe this should be both easier and faster to test.
Sorry, I don't understand what that command line just above is trying to
achieve. Could you elaborate?

> > name: rgb24ToY_c_vsx. 
> > no. of calls: 9999. min: 3832 ns. avg: 4709 ns. max: 37550 ns. 
> > total: 47093533 ns. 
> > 
> > name: rgb24ToY_c. 
> > no. of calls: 9999. min: 3809 ns. avg: 4707 ns. max: 29041 ns. 
> > total: 47072923 ns.
> 
> Without any data, I would have thought that this is the most 
> important function (and "no. of calls" seems to confirm this).
> 
> Why is this not faster?
Surprisingly, gcc is producing badly suboptimal assembly. I need to
follow up with IBM's Linux Technology Center. The major issue is that
multiplying vector quantities in C generates as many multiply
instructions in assembly as scalar multiplication in a loop would.
That should not be happening.
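
To make the codegen issue concrete, here is a minimal sketch of the kind of comparison involved. It is not taken from the patch: it assumes the GCC/Clang generic vector extension (`vector_size`) rather than the VSX types the patch itself uses, and the names `v4si`, `mul_vector`, `mul_scalar` and `self_check` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Four 32-bit lanes via the GCC/Clang generic vector extension.
 * Assumption: this stands in for the patch's actual vector types. */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Element-wise product written as a single vector expression. */
static v4si mul_vector(v4si a, v4si b)
{
    return a * b;
}

/* Scalar reference: the same four products, one multiply per element. */
static void mul_scalar(const int32_t *a, const int32_t *b, int32_t *out)
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i] * b[i];
}

/* Checks that both versions agree lane by lane on a fixed input. */
static void self_check(void)
{
    v4si va = {1, 2, 3, 4};
    v4si vb = {10, 20, 30, 40};
    int32_t sa[4] = {1, 2, 3, 4};
    int32_t sb[4] = {10, 20, 30, 40};
    int32_t out[4];

    v4si vp = mul_vector(va, vb);
    mul_scalar(sa, sb, out);

    for (int i = 0; i < 4; i++)
        assert(vp[i] == out[i]);
}
```

Compiling a file like this with `gcc -O2 -S` (adding `-mcpu=power8 -mvsx` on a POWER8 machine) and counting the multiply instructions emitted for `mul_vector` versus `mul_scalar` is one way to check whether the vector expression really becomes a vector multiply.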

> Can you confirm with START_TIMER / STOP_TIMER that there is no 
> gain?
SystemTap probes provide equivalent functionality: they measure the
delta between function entry and function return.
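
For readers unfamiliar with this style of measurement, the following is a rough sketch in plain C of the entry/return delta bookkeeping. It is neither SystemTap nor FFmpeg's START_TIMER / STOP_TIMER; it assumes POSIX `clock_gettime`, and `call_stats`, `now_ns`, `record_call` and `avg_ns` are hypothetical names.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Per-function accumulator, mirroring the fields in the report above. */
struct call_stats {
    uint64_t calls;
    uint64_t min_ns;
    uint64_t max_ns;
    uint64_t total_ns;
};

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Record one call, given timestamps taken at entry and at return. */
static void record_call(struct call_stats *s,
                        uint64_t entry_ns, uint64_t return_ns)
{
    assert(return_ns >= entry_ns);
    uint64_t delta = return_ns - entry_ns;

    if (s->calls == 0 || delta < s->min_ns)
        s->min_ns = delta;
    if (delta > s->max_ns)
        s->max_ns = delta;
    s->total_ns += delta;
    s->calls++;
}

/* Integer average; 0 when nothing has been recorded yet. */
static uint64_t avg_ns(const struct call_stats *s)
{
    return s->calls ? s->total_ns / s->calls : 0;
}
```

A caller would take `now_ns()` immediately before and after the function under test and pass both values to `record_call`. The truncating division in `avg_ns` is consistent with the figures quoted above: 47093533 / 9999 truncates to 4709 and 47072923 / 9999 truncates to 4707.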