[FFmpeg-devel] [PATCH] PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD.

Wed Jul 6 10:07:12 EEST 2016

On Wed, Jul 6, 2016 at 4:37 AM, Dan Parrot <dan.parrot at mail.com> wrote:
> Finish providing SIMD versions for POWER8 VSX of functions in libswscale/input.c. That should allow trac ticket #5570 to be closed.
> The speedups obtained for the functions are:
>
> abgrToA_c               1.19
> bgr24ToUV_c             1.23
> bgr24ToUV_half_c        1.37
> bgr24ToY_c_vsx          1.43
> nv12ToUV_c              1.05
> nv21ToUV_c              1.06
> planar_rgb_to_uv        1.25
> planar_rgb_to_y         1.26
> rgb24ToUV_c             1.11
> rgb24ToUV_half_c        1.10
> rgb24ToY_c              0.92
> rgbaToA_c               0.88
> uyvyToUV_c              1.05
> uyvyToY_c               1.15
> yuy2ToUV_c              1.07
> yuy2ToY_c               1.17
> yvy2ToUV_c              1.05

SIMD implementations that in the best case improve the speed by 43%
(and in some cases are *slower*) seem barely worth it. One would expect
a proper SIMD implementation to offer speed increases of 100% or more;
at least that's the general expectation on x86 with SSE/AVX.
So the question here is: is VSX itself bad at this, or are the
intrinsics bad? What would the speedup be with proper hand-written ASM?
If hand-written ASM can give us the usual 100-200% improvements we would
expect from SIMD, then that approach should generally be favored.
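To make the arithmetic behind the 43% figure explicit, here is a minimal illustrative sketch that converts the quoted speedup ratios into percentage gains and lists the functions that regressed. It assumes each ratio means C reference time divided by VSX time, which is how the mail appears to use them; the helper names `percent_gain` and `summarize` are hypothetical and not part of FFmpeg.

```python
# Speedup ratios as quoted in the patch mail.
# Assumption: ratio = C reference time / VSX time (so > 1.0 means faster).
SPEEDUPS = {
    "abgrToA_c": 1.19, "bgr24ToUV_c": 1.23, "bgr24ToUV_half_c": 1.37,
    "bgr24ToY_c_vsx": 1.43, "nv12ToUV_c": 1.05, "nv21ToUV_c": 1.06,
    "planar_rgb_to_uv": 1.25, "planar_rgb_to_y": 1.26, "rgb24ToUV_c": 1.11,
    "rgb24ToUV_half_c": 1.10, "rgb24ToY_c": 0.92, "rgbaToA_c": 0.88,
    "uyvyToUV_c": 1.05, "uyvyToY_c": 1.15, "yuy2ToUV_c": 1.07,
    "yuy2ToY_c": 1.17, "yvy2ToUV_c": 1.05,
}


def percent_gain(ratio: float) -> float:
    """Percentage speed increase implied by a speedup ratio.

    1.43 -> about +43%; 2.0 -> +100%; anything below 1.0 is a slowdown.
    """
    return (ratio - 1.0) * 100.0


def summarize(speedups: dict[str, float]) -> tuple[str, float, list[str]]:
    """Return the best function, its percentage gain, and the regressions."""
    best = max(speedups, key=speedups.get)
    slower = sorted(name for name, r in speedups.items() if r < 1.0)
    return best, percent_gain(speedups[best]), slower


if __name__ == "__main__":
    best, gain, slower = summarize(SPEEDUPS)
    print(f"best: {best} (+{gain:.0f}%)")
    print(f"slower than C: {', '.join(slower)}")
    # The ~100% gain commonly expected from SSE/AVX is a 2.0x ratio.
    print(f"2.0x -> +{percent_gain(2.0):.0f}%")
```

Put this way, the best result (1.43x) is less than halfway to the 2.0x that a 100% improvement would require.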

Also, one further thought:
From the commit message, it sounds like you might only be doing this
for the bounty in #5570. Do you plan to maintain these optimizations
in the future?

- Hendrik