[FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.
Dan Parrot
dan.parrot at mail.com
Mon Jul 4 18:20:47 EEST 2016
On Mon, 2016-07-04 at 09:20 +0000, Carl Eugen Hoyos wrote:
> Dan Parrot <dan.parrot <at> mail.com> writes:
>
> > The dataset used was the entire FATE regression suite.
>
> I don't think this is a particularly useful testcase:
> It takes very long but mostly tests other things.
>
> Did you test if using ffmpeg -benchmark -f rawvideo -i /dev/zero...
> showed different results?
> I believe this should be both easier and faster to test.
Sorry, I don't understand what the command line above is meant to
achieve. Could you elaborate?
> > name: rgb24ToY_c_vsx.
> > no. of calls: 9999. min: 3832 ns. avg: 4709 ns. max: 37550 ns.
> > total: 47093533 ns.
> >
> > name: rgb24ToY_c.
> > no. of calls: 9999. min: 3809 ns. avg: 4707 ns. max: 29041 ns.
> > total: 47072923 ns.
>
> Without any data, I would have thought that this is the most
> important function (and "no. of calls" seems to confirm this).
>
> Why is this not faster?
Surprisingly, gcc is producing badly suboptimal assembly. I need to
follow up with IBM's Linux Technology Center. The major issue is that
multiplying vector quantities in C generates as many multiply
instructions in the assembly as scalar multiplication in a loop would.
That should not be happening.
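To make the comparison concrete, here is a minimal sketch of the kind of source involved. It is not the code from the patch: it uses the generic GCC/Clang vector_size extension rather than VSX intrinsics, and the names (rgb_to_y_scalar, rgb_to_y_vector) and the fixed-point weights are illustrative assumptions. Whether the vector multiplies compile to vector instructions or to one scalar multiply per lane depends on the compiler version, target and flags, so the generated assembly (for example from gcc -O3 -S) is what needs to be inspected.

```c
#include <assert.h>
#include <stdint.h>

/* GCC/Clang vector extension: four 32-bit lanes in one 16-byte vector. */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Illustrative fixed-point luma weights (assumed values, scaled by 2^15). */
enum {
    SHIFT = 15,
    RY = 8414, GY = 16519, BY = 3208,
    BIAS = (16 << SHIFT) + (1 << (SHIFT - 1))  /* +16 offset, plus rounding */
};

/* Scalar reference: one pixel per iteration, three multiplies each. */
static void rgb_to_y_scalar(uint8_t *dst, const uint8_t *src, int width)
{
    assert(width >= 0);
    for (int i = 0; i < width; i++) {
        int r = src[3 * i], g = src[3 * i + 1], b = src[3 * i + 2];
        dst[i] = (uint8_t)((RY * r + GY * g + BY * b + BIAS) >> SHIFT);
    }
}

/* Vector version: four pixels per iteration, three vector multiplies each. */
static void rgb_to_y_vector(uint8_t *dst, const uint8_t *src, int width)
{
    const v4si ry = { RY, RY, RY, RY };
    const v4si gy = { GY, GY, GY, GY };
    const v4si by = { BY, BY, BY, BY };
    const v4si bias = { BIAS, BIAS, BIAS, BIAS };
    const v4si shift = { SHIFT, SHIFT, SHIFT, SHIFT };
    int i = 0;

    assert(width >= 0);
    for (; i + 4 <= width; i += 4) {
        const uint8_t *p = src + 3 * i;
        /* De-interleave packed RGB into one vector per channel. */
        v4si r = { p[0], p[3], p[6], p[9] };
        v4si g = { p[1], p[4], p[7], p[10] };
        v4si b = { p[2], p[5], p[8], p[11] };
        v4si y = (ry * r + gy * g + by * b + bias) >> shift;
        for (int k = 0; k < 4; k++)
            dst[i + k] = (uint8_t)y[k];
    }
    /* Leftover pixels (fewer than four) go through the scalar path. */
    rgb_to_y_scalar(dst + i, src + 3 * i, width - i);
}
```

Both functions should produce identical output for the same input, so any difference in run time comes from the generated code only.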
> Can you confirm with START_TIMER / STOP_TIMER that there is no
> gain?
SystemTap probes provide identical functionality by measuring deltas
between function entry and function return.
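
For illustration, the entry-to-return delta measurement can be sketched in plain C. This is neither SystemTap nor FFmpeg's START_TIMER / STOP_TIMER macros; it is a user-space approximation built on POSIX clock_gettime(), and every name in it (TimingStats, timed_call, spin) is hypothetical. It collects the same kind of figures as reported above: number of calls, min, avg, max and total in nanoseconds.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Running statistics for one timed function, in nanoseconds. */
typedef struct {
    uint64_t calls, total, min, max;
} TimingStats;

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Fold one entry-to-return delta into the statistics. */
static void stats_record(TimingStats *s, uint64_t delta)
{
    assert(s);
    if (s->calls == 0 || delta < s->min)
        s->min = delta;
    if (delta > s->max)
        s->max = delta;
    s->total += delta;
    s->calls++;
}

/* Integer average, 0 if nothing has been recorded yet. */
static uint64_t stats_avg(const TimingStats *s)
{
    return s->calls ? s->total / s->calls : 0;
}

/* Time a single call of fn(arg) from entry to return. */
static void timed_call(TimingStats *s, void (*fn)(void *), void *arg)
{
    uint64_t t0 = now_ns();
    fn(arg);
    stats_record(s, now_ns() - t0);
}

/* Hypothetical stand-in workload: spin for *arg iterations. */
static void spin(void *arg)
{
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < *(unsigned *)arg; i++)
        sink += i;
}
```

A caller would zero-initialise a TimingStats, wrap each call as timed_call(&stats, spin, &n), and read min, max, total and stats_avg(&stats) afterwards. Unlike an external probe, this adds the cost of two clock reads to every measured call.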