[FFmpeg-devel] IRC shit smearing

Sat Jan 16 03:58:07 CET 2010

> We surely dont gain L1 cache hits, a 16x16 block and all the variables
> easily fit in.

Are you so sure about this?  With hyperthreading on a Core i7, for
example, you effectively have 16K of L1 data cache per thread (32K per
core, shared by two threads).  Is this enough for everything, especially
considering the overhead caused by 64-byte cache lines?  I _strongly_
doubt it.  And even if you're right, every byte we take out of L1 cache
usage is a byte left over for motion compensation to not cache-miss on.
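
To put a rough number on the cache-line overhead, here is a minimal
sketch that counts how many distinct 64-byte lines a block of pixels
touches.  The function name lines_touched and the 1920-byte stride used
in the note below are assumptions for illustration, not anything taken
from libavcodec.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64

/* Count the distinct cache lines covered by a width x height byte block
 * starting at address `base`, with rows `stride` bytes apart.
 * Assumes width <= stride, so rows appear in increasing address order. */
static unsigned lines_touched(uintptr_t base, unsigned width,
                              unsigned height, unsigned stride)
{
    unsigned count = 0;
    uintptr_t next_uncounted = 0; /* lowest line index not yet counted */

    assert(width > 0 && width <= stride);

    for (unsigned y = 0; y < height; y++) {
        uintptr_t row   = base + (uintptr_t)y * stride;
        uintptr_t first = row / CACHE_LINE;
        uintptr_t last  = (row + width - 1) / CACHE_LINE;

        /* Skip lines already counted for the previous row. */
        if (first < next_uncounted)
            first = next_uncounted;
        if (last >= first)
            count += (unsigned)(last - first + 1);
        next_uncounted = last + 1;
    }
    return count;
}
```

With an aligned 16x16 block in a frame with a 1920-byte stride, this
gives 16 lines, i.e. 1024 bytes of L1 for 256 bytes of pixels.  If the
block starts 56 bytes into a line, every row straddles two lines and
the count doubles to 32.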

> So whats left is keeping things in registers (not with gcc in this world IMHO)
> ,merging branches
> and letting the CPU reorder instructions accross decode and pixel_handling
> That said, the loop filter needs cbp, mv, nnz if iam not mistaken so its
> still far from as localized as one might think
>
> also our other decoders are similarly split and are quite a bit faster than
> the alternatives _when_ our decoders where optimized by someone who invested
> a serious effort.
> In that sense i think that multithreaded frame decoding would gain us most
> and after that i think there are many optimizations that would gain more
> speed per messiness than your suggestion above.
> But i would be very happy if you could elaborate on why such rearchitecture
> would be faster in your oppinon. (it does not seem particularly hard to do
> such change, i just dont terribly like it and doubt its speed advantage)

Benefits other than cache:

1) Not all data in fill_caches needs to be loaded; only the data
relevant to the current block.
2) It's generally more efficient to merge loops together.  For example, we do:

for( i = 0; i < 16; i++ ) { decode idct block() }
... later ...
for( i = 0; i < 16; i++ ) { if( nnz ) { idct } }

I would think:

for( i = 0; i < 16; i++ ) { if( decode idct block() > 0 ) { idct } }

would be more efficient.
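
As a compilable sketch of the two forms, with the caveat that
decode_block and idct_block are hypothetical stand-ins (the first just
reads a non-zero count from a table, the second only counts calls) and
not the real libavcodec functions:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BLOCKS 16

/* Hypothetical stand-in for residual decoding: returns the number of
 * non-zero coefficients in block i (here simply read from a table). */
static int decode_block(const uint8_t *nnz_table, int i)
{
    return nnz_table[i];
}

/* Hypothetical stand-in for the IDCT: only counts how often it runs. */
static void idct_block(int i, int *idct_calls)
{
    (void)i;
    (*idct_calls)++;
}

/* Split form: decode all blocks, store nnz, then loop again later. */
static int decode_then_idct(const uint8_t *nnz_table)
{
    int nnz[NUM_BLOCKS];
    int idct_calls = 0;

    assert(nnz_table);
    for (int i = 0; i < NUM_BLOCKS; i++)
        nnz[i] = decode_block(nnz_table, i);
    /* ... later ... */
    for (int i = 0; i < NUM_BLOCKS; i++)
        if (nnz[i])
            idct_block(i, &idct_calls);
    return idct_calls;
}

/* Merged form: run the IDCT right after decoding, so no nnz array has
 * to be written out and re-read. */
static int decode_and_idct(const uint8_t *nnz_table)
{
    int idct_calls = 0;

    assert(nnz_table);
    for (int i = 0; i < NUM_BLOCKS; i++)
        if (decode_block(nnz_table, i) > 0)
            idct_block(i, &idct_calls);
    return idct_calls;
}
```

Both forms do the same IDCT work; the merged one simply avoids the
intermediate array and the second pass.  Whether that is measurably
faster in the real decoder would need benchmarking.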

3) More important than anything else...

This is the way CoreAVC does it.  Accordingly, everything is templated
for progressive/PAFF/MBAFF and CAVLC/CABAC.
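
I can't paste CoreAVC's code, so the following is only a sketch of one
common way to get this kind of templating in plain C: an inline
function that takes the mode flags as arguments, plus thin wrappers
that pass constants so the compiler can drop the dead branches.  All
names here are hypothetical, and only two of the flags are shown.

```c
#include <assert.h>

/* "Template": the body is written once; flags select the paths.
 * Returns a bitmask of the paths taken, purely for illustration. */
static inline int decode_mb_template(int is_cabac, int is_mbaff,
                                     int mb_index)
{
    int path = 0;

    assert(mb_index >= 0);
    if (is_cabac)
        path |= 1; /* a real decoder would run CABAC residual decoding */
    if (is_mbaff)
        path |= 2; /* a real decoder would do MBAFF neighbour handling */
    return path;
}

/* Specializations: each passes compile-time constants, so after
 * inlining the flag tests can be resolved at compile time. */
static int decode_mb_cavlc_prog(int i)  { return decode_mb_template(0, 0, i); }
static int decode_mb_cabac_prog(int i)  { return decode_mb_template(1, 0, i); }
static int decode_mb_cavlc_mbaff(int i) { return decode_mb_template(0, 1, i); }
static int decode_mb_cabac_mbaff(int i) { return decode_mb_template(1, 1, i); }

typedef int (*decode_mb_fn)(int);

/* Pick the specialization once per slice rather than branching on the
 * flags for every macroblock. */
static decode_mb_fn select_decode_mb(int is_cabac, int is_mbaff)
{
    static const decode_mb_fn table[2][2] = {
        { decode_mb_cavlc_prog,  decode_mb_cabac_prog  },
        { decode_mb_cavlc_mbaff, decode_mb_cabac_mbaff },
    };
    return table[!!is_mbaff][!!is_cabac];
}
```

The trade-off is code size: each specialization is a separate copy of
the macroblock function.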

CoreAVC is a full 50% faster than libavcodec (as measured by Mans),
with a single thread, when using the exact same assembly code on the
exact same compiler.  As someone who has read through the entire
codebase, I can say it does not have a single "major" optimization
that libavcodec doesn't, such as SIMD deblock-strength calculation.  There
is no magic here; if anything, ffmpeg has a large variety of
optimizations that CoreAVC doesn't have, such as ff_emulated_edge_mc.
This 50% cannot be made up in ffmpeg merely by tons of
micro-optimizations.

If we want to compete, we should start by trying to do things the way
that faster decoders do them.

Dark Shikari