[FFmpeg-devel] [PATCH] Add support for "omp simd" pragma.

Mon Jan 11 02:26:11 EET 2021

On Sun, Jan 10, 2021 at 19:55, Lynne <dev at lynne.ee> wrote:
>
> Jan 10, 2021, 17:43 by Reimar.Doeffinger at gmx.de:
>
> > From: Reimar Döffinger <Reimar.Doeffinger at gmx.de>
> >
> > This requests loops to be vectorized using SIMD
> > instructions.
> > The performance increase is far from hand-optimized
> > assembly but still significant over the plain C version.
> > Typical values are a 2-4x speedup where a hand-written
> > version would achieve 4x-10x.
> > So it is far from a replacement; however, some architectures
> > will get hand-written assembler quite late or not at all,
> > and this is a good improvement for a trivial amount of work.
> > The cause of the remaining gap, besides the compiler being a
> > compiler, is usually that it does not manage to use saturating
> > instructions and thus has to use 32-bit operations where
> > saturating 16-bit operations would actually be sufficient.
> > Other causes are for example the av_clip functions that
> > are not ideal for vectorization (and even as scalar code
> > not optimal for any modern CPU that has either CSEL or
> > MAX/MIN instructions).
> > And of course this only works for relatively simple
> > loops, the IDCT functions for example seemed not possible
> > to optimize that way.
> > Also note that while clang may accept the code and sometimes
> > produces warnings, it does not seem to do anything actually
> > useful at all.
> > Here are example measurements using gcc 10 under Linux (in a VM unfortunately)
> > on AArch64 on Apple M1:
> > Command:
> > time ./ffplay_g LG\ 4K\ HDR\ Demo\ -\ New\ York.ts -t 10 -autoexit -threads 1 -noframedrop
> >
> > Original code:
> > real    0m19.572s
> > user    0m23.386s
> > sys     0m0.213s
> >
> > Changing all put_hevc:
> > real    0m15.648s
> > user    0m19.503s (83.4% of original)
> > sys     0m0.186s
> >
> > In addition changing add_residual:
> > real    0m15.424s
> > user    0m19.278s (82.4% of original)
> > sys     0m0.133s
> >
> > In addition changing planar copy dither:
> > real    0m15.040s
> > user    0m18.874s (80.7% of original)
> > sys     0m0.168s
> >
>
> I think I have to disagree.

> The performance gains are marginal

This sounds wrong.

Carl Eugen
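
For illustration of the saturation point in the quoted patch description,
here is a minimal sketch. It is not taken from the patch: add_sat16 and
clip_int16 are hypothetical names, and clip_int16 is only a stand-in, not
FFmpeg's av_clip. The loop computes the sum in int and then clamps it, which
is the 32-bit pattern the description says the compiler tends to keep where
a saturating 16-bit add would do. It assumes gcc with something like
-O3 -fopenmp-simd; without an OpenMP flag the pragma is simply ignored and
the result is the same scalar code.

```c
#include <stdint.h>

/* Hypothetical stand-in for a clip helper: clamp v to the int16_t range. */
static inline int clip_int16(int v)
{
    if (v < INT16_MIN)
        return INT16_MIN;
    if (v > INT16_MAX)
        return INT16_MAX;
    return v;
}

/* Hypothetical example loop: dst[i] = saturate16(a[i] + b[i]).
 * The pragma asks the compiler to vectorize the loop; the caller
 * must make sure dst does not overlap a or b in a way that would
 * create a dependency between iterations. */
static void add_sat16(int16_t *dst, const int16_t *a,
                      const int16_t *b, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        dst[i] = (int16_t)clip_int16(a[i] + b[i]); /* sum is done in int */
}
```

Whether the compiler turns the clamp into a saturating instruction or keeps
the widened arithmetic can be checked by comparing the assembly output
(gcc -S) with and without the pragma.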