[FFmpeg-devel] [PATCH] avutil/lls: speed up performance of solve_lls

Michael Niedermayer michaelni at gmx.at
Wed Nov 25 03:05:57 CET 2015


On Tue, Nov 24, 2015 at 08:32:05PM -0500, Ganesh Ajjanagadde wrote:
> On Tue, Nov 24, 2015 at 8:19 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> > On Mon, Nov 23, 2015 at 08:00:12PM -0500, Ganesh Ajjanagadde wrote:
> >> This is a trivial rewrite of the loops that results in better cache
> >> efficiency. Essentially, iterating backwards over an array is a bad
> >> idea, since it leads to many cache misses: forward iteration fetches a
> >> new cache line that gets subsequently used, while backwards iteration fetches the
> >> cache line thinking that the indexing would move forwards (the common behavior),
> >> resulting in many cache misses.
> >>
> >> Speedup is nontrivial. Benchmarks obtained by 10^6 iterations within
> >> solve_lls, with START/STOP_TIMER. File is tests/data/fate/flac-16-lpc-cholesky.err.
> >> Hardware: x86-64, Haswell, GNU/Linux.
> >>
> >> new:
> >>   12598 decicycles in solve_lls, 2096871 runs,    281 skips
> >>   12393 decicycles in solve_lls, 4193889 runs,    415 skips
> >>   12250 decicycles in solve_lls, 8387253 runs,   1355 skips
> >>   12585 decicycles in solve_lls,16775089 runs,   2127 skips
> >>   12785 decicycles in solve_lls,33550272 runs,   4160 skips
> >>   12483 decicycles in solve_lls,67101734 runs,   7130 skips
> >>   12610 decicycles in solve_lls,134202614 runs,  15114 skips
> >>
> >> old:
> >>   18101 decicycles in solve_lls, 2096659 runs,    493 skips
> >>   17850 decicycles in solve_lls, 4193609 runs,    695 skips
> >>   17757 decicycles in solve_lls, 8387458 runs,   1150 skips
> >>   17746 decicycles in solve_lls,16775207 runs,   2009 skips
> >>   17906 decicycles in solve_lls,33550820 runs,   3612 skips
> >>   17931 decicycles in solve_lls,67102891 runs,   5973 skips
> >>   17907 decicycles in solve_lls,134206693 runs,  11035 skips
> >>
> >> -------------------------------------------------------------------------------
> >> Barring asm for this function, there are two (more interesting) potential
> >> optimizations for the Cholesky decomposition here:
> >> 1. Notice that update_lls is doing a rank one update of the matrix via the var
> >> vector. Rank one update of the Cholesky decomposition can be done much more
> >> efficiently than a full blown decomposition, see e.g Wikipedia. This will almost
> >> certainly require some API thought.
> >>
> >> 2. LDL' form of the Cholesky factorization offers 2 main advantages:
> >> a) Avoiding sqrt and its associated cost, trading it off with extra multiplications.
> >> This may or may not be worth the computational cost, though it seems like in
> >> FFmpeg, the Cholesky operates on small matrices of dim ~ 3-50, resulting in
> >> not too many extra multiplies.
> >> b) It provides benefits especially in the poorly conditioned
> >> case since one does not have to worry about nan's and associated tainting due
> >> to the sqrt.
> >> This may or may not require API thought.
> >>
> >> I personally view 1 as being more unambiguously worthy of exploration at this stage.
> >>
> >> Signed-off-by: Ganesh Ajjanagadde <gajjanagadde at gmail.com>
> >> ---
> >>  libavutil/lls.c | 6 +++---
> >>  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > this significantly changes the output for
> > libavutil/lls-test
> >
> > is that intended ?
> 
> Wait, I thought there was only one line changed in a diff between the
> old and new output, and it was trivial involving a nan. Maybe I posted
> the wrong patch, in which case sorry.

--- old 2015-11-25 02:15:20.396932045 +0100
+++ new 2015-11-25 02:15:26.620932176 +0100
@@ -2,299 +2,299 @@
 real:-0.828987 order:1 pred:-0.828987 var:0.000000 coeffs:2.166433  0.000000  0.000000
 real:-0.828987 order:2 pred:-0.828987 var:0.000000 coeffs:2.166433  0.000000  0.000000
 real:-0.326534 order:0 pred:-0.588792 var:0.272632 coeffs:1.427836 -0.385559  0.000000
-real:-0.326534 order:1 pred:-0.326534 var:-nan coeffs:3.274787 -1.436430  0.000000
-real:-0.326534 order:2 pred:-0.326534 var:-nan coeffs:3.274787 -1.436430  0.000000
+real:-0.326534 order:1 pred: 0.435087 var:0.734692 coeffs:1.427836 -1.436430  0.000000
+real:-0.326534 order:2 pred: 0.435087 var:0.734692 coeffs:1.427836 -1.436430  0.000000
 real:-0.496309 order:0 pred:-0.560478 var:0.227332 coeffs:1.343229 -0.128675 -0.372131
-real:-0.496309 order:1 pred:-0.649268 var:0.214850 coeffs:1.541607 -0.245332  0.000000
-real:-0.496309 order:2 pred:-0.496309 var:0.000000 coeffs:1.629421  0.626407 -0.697480
+real:-0.496309 order:1 pred:-1.051144 var:0.444802 coeffs:2.504735 -0.245332  0.000000
+real:-0.496309 order:2 pred:-0.882916 var:0.272824 coeffs:2.504735 -0.245332 -0.697480
 real: 0.386201 order:0 pred: 0.644642 var:0.263532 coeffs:1.005471 -0.232581 -0.189366
-real: 0.386201 order:1 pred: 0.590550 var:0.236486 coeffs:1.416443 -0.420822  0.000000
-real: 0.386201 order:2 pred: 0.653364 var:0.216704 coeffs:1.417320 -0.084129 -0.312548
+real: 0.386201 order:1 pred: 0.480672 var:0.250092 coeffs:1.245061 -0.420822  0.000000
+real: 0.386201 order:2 pred: 1.327563 var:0.521050 coeffs:1.980792  0.330539 -0.312548
 real:-0.193176 order:0 pred:-0.419259 var:0.261372 coeffs:0.886941 -0.255165 -0.256307
-real:-0.193176 order:1 pred:-0.397912 var:0.235146 coeffs:1.347037 -0.459663  0.000000
-real:-0.193176 order:2 pred:-0.320068 var:0.205317 coeffs:1.374542 -0.016455 -0.397716
+real:-0.193176 order:1 pred:-0.379534 var:0.235868 coeffs:1.308159 -0.459663  0.000000
+real:-0.193176 order:2 pred:-0.058885 var:0.330313 coeffs:0.926332 -0.111365 -0.397716
[...and lots more...]


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

No great genius has ever existed without some touch of madness. -- Aristotle
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 181 bytes
Desc: Digital signature
URL: <http://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20151125/cffb6559/attachment.sig>


More information about the ffmpeg-devel mailing list