[FFmpeg-devel] [PATCH] avutil/lls: speed up performance of solve_lls

Tue Nov 24 02:00:12 CET 2015

This is a trivial rewrite of the loops that results in better cache
efficiency. Essentially, iterating backwards over an array is a bad
idea, since it leads to many cache misses: forward iteration fetches a
new cache line that gets subsequently used, while backwards iteration fetches the
cache line thinking that the indexing would move forwards (the common behavior),
resulting in many cache misses.

Speedup is nontrivial. Benchmarks obtained by 10^6 iterations within
solve_lls, with START/STOP_TIMER. File is tests/data/fate/flac-16-lpc-cholesky.err.
Hardware: x86-64, Haswell, GNU/Linux.

new:
  12598 decicycles in solve_lls, 2096871 runs,    281 skips
  12393 decicycles in solve_lls, 4193889 runs,    415 skips
  12250 decicycles in solve_lls, 8387253 runs,   1355 skips
  12585 decicycles in solve_lls,16775089 runs,   2127 skips
  12785 decicycles in solve_lls,33550272 runs,   4160 skips
  12483 decicycles in solve_lls,67101734 runs,   7130 skips
  12610 decicycles in solve_lls,134202614 runs,  15114 skips

old:
  18101 decicycles in solve_lls, 2096659 runs,    493 skips
  17850 decicycles in solve_lls, 4193609 runs,    695 skips
  17757 decicycles in solve_lls, 8387458 runs,   1150 skips
  17746 decicycles in solve_lls,16775207 runs,   2009 skips
  17906 decicycles in solve_lls,33550820 runs,   3612 skips
  17931 decicycles in solve_lls,67102891 runs,   5973 skips
  17907 decicycles in solve_lls,134206693 runs,  11035 skips

-------------------------------------------------------------------------------
Barring asm for this function, there are two (more interesting) potential
optimizations for the Cholesky decomposition here:
1. Notice that update_lls is doing a rank one update of the matrix via the var
vector. Rank one update of the Cholesky decomposition can be done much more
efficiently than a full blown decomposition, see e.g Wikipedia. This will almost
certainly require some API thought.

2. LDL' form of the Cholesky factorization offers 2 main advantages:
a) Avoiding sqrt and its associated cost, trading it off with extra multiplications.
This may or may not be worth the computational cost, though it seems like in
FFmpeg, the Cholesky operates on small matrices of dim ~ 3-50, resulting in
not too many extra multiplies.
b) It provides benefits especially in the poorly conditioned
case since one does not have to worry about nan's and associated tainting due
to the sqrt.
This may or may not require API thought.

I personally view 1 as being more unambiguously worthy of exploration at this stage.

Signed-off-by: Ganesh Ajjanagadde <gajjanagadde at gmail.com>
---
 libavutil/lls.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/libavutil/lls.c b/libavutil/lls.c
index f77043b..9fd756e 100644
--- a/libavutil/lls.c
+++ b/libavutil/lls.c
@@ -55,7 +55,7 @@ void avpriv_solve_lls(LLSModel *m, double threshold, unsigned short min_order)
         for (j = i; j < count; j++) {
             double sum = covar[i][j];
 
-            for (k = i - 1; k >= 0; k--)
+            for (k = 0; k <= i-1; k++)
                 sum -= factor[i][k] * factor[j][k];
 
             if (i == j) {
@@ -71,14 +71,14 @@ void avpriv_solve_lls(LLSModel *m, double threshold, unsigned short min_order)
     for (i = 0; i < count; i++) {
         double sum = covar_y[i + 1];
 
-        for (k = i - 1; k >= 0; k--)
+        for (k = 0; k <= i-1; k++)
             sum -= factor[i][k] * m->coeff[0][k];
 
         m->coeff[0][i] = sum / factor[i][i];
     }
 
     for (j = count - 1; j >= min_order; j--) {
-        for (i = j; i >= 0; i--) {
+        for (i = 0; i <= j; i++) {
             double sum = m->coeff[0][i];
 
             for (k = i + 1; k <= j; k++)
-- 
2.6.2