[FFmpeg-devel] IRC shit smearing
Jason Garrett-Glaser
darkshikari
Sat Jan 16 03:58:07 CET 2010
> We surely don't gain L1 cache hits; a 16x16 block and all the variables
> easily fit in.
Are you so sure about this? With hyperthreading on a Core i7, for
example, the 32K L1 data cache is shared by two threads, so you
effectively have 16K per thread. Is this enough for everything,
especially considering the overhead caused by 64-byte cache lines?
I _strongly_ doubt it. And even if you're right, every byte we take
out of L1 cache usage is a byte that is left over for motion
compensation to not cache-miss on.
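To put a rough number on the cache-line overhead, here is a minimal sketch that counts how many 64-byte lines a block of 8-bit pixels touches. The function name and the assumption that the row stride is a multiple of the line size are mine, chosen so that rows never share a line; real frame buffers may differ.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64 /* bytes per L1 line, as in the text above */

/* Count the cache lines touched by a w x h block of bytes whose top-left
 * byte sits at byte offset `start` in a buffer with row pitch `stride`.
 * Assumption: stride is a nonzero multiple of CACHE_LINE and w <= stride,
 * so no two rows of the block ever share a cache line. */
static size_t cache_lines_touched(size_t start, size_t stride,
                                  size_t w, size_t h)
{
    size_t lines = 0;

    assert(w > 0 && w <= stride);
    assert(stride > 0 && stride % CACHE_LINE == 0);

    for (size_t y = 0; y < h; y++) {
        size_t first = (start + y * stride) / CACHE_LINE;         /* line of first byte */
        size_t last  = (start + y * stride + w - 1) / CACHE_LINE; /* line of last byte  */
        lines += last - first + 1;
    }
    return lines;
}
```

With a hypothetical stride of 2048 bytes, cache_lines_touched(0, 2048, 16, 16) gives 16 lines, i.e. 1024 bytes of L1 for 256 bytes of pixels. A block starting at byte offset 56 straddles a line boundary in every row and gives 32 lines, i.e. 2048 bytes, an eighth of a 16K budget for a single 16x16 reference.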
> So what's left is keeping things in registers (not with gcc in this world IMHO),
> merging branches,
> and letting the CPU reorder instructions across decode and pixel_handling.
> That said, the loop filter needs cbp, mv, nnz if I am not mistaken, so it's
> still far from as localized as one might think.
>
> Also, our other decoders are similarly split and are quite a bit faster than
> the alternatives _when_ our decoders were optimized by someone who invested
> a serious effort.
> In that sense I think that multithreaded frame decoding would gain us most,
> and after that I think there are many optimizations that would gain more
> speed per messiness than your suggestion above.
> But I would be very happy if you could elaborate on why such a rearchitecture
> would be faster in your opinion. (It does not seem particularly hard to make
> such a change; I just don't terribly like it and doubt its speed advantage.)
Benefits other than cache:
1) Not all data in fill_caches needs to be loaded; only the data
relevant to the current block.
2) It's generally more efficient to merge loops together. For example, we do:
for( i = 0; i < 16; i++ ) { decode idct block() }
... later ...
for( i = 0; i < 16; i++ ) { if( nnz ) { idct } }
I would think:
for( i = 0; i < 16; i++ ) { if( decode idct block() > 0 ) { idct } }
would be more efficient.
3) More important than anything else...
This is the way CoreAVC does it. Accordingly, everything is templated
for progressive/PAFF/MBAFF and CAVLC/CABAC.
CoreAVC is a full 50% faster than libavcodec (as measured by Mans),
with a single thread, when using the exact same assembly code on the
exact same compiler. And as someone who has read through the entire
codebase, I can say it does not have a single "major" optimization that
libavcodec lacks, such as SIMD deblock-strength calculation. There
is no magic here; if anything, ffmpeg has a large variety of
optimizations that CoreAVC doesn't have, such as ff_emulated_edge_mc.
This 50% cannot be made up in ffmpeg merely by tons of
micro-optimizations.
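For point 2, a minimal sketch of the two loop structures side by side. decode_residual_block and idct_add are hypothetical stand-ins I made up for illustration (a copy and a clamped add), not libavcodec functions and not a real entropy decoder or IDCT; only the loop shape matters here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_BLOCKS   16 /* 4x4 blocks in one 16x16 macroblock */
#define BLOCK_COEFFS 16 /* coefficients per 4x4 block */

/* Hypothetical stand-in for decoding one residual block: copies the
 * coefficients and returns how many of them are nonzero. */
static int decode_residual_block(const int16_t *src, int16_t *coeffs)
{
    int nnz = 0;
    memcpy(coeffs, src, BLOCK_COEFFS * sizeof(*coeffs));
    for (int k = 0; k < BLOCK_COEFFS; k++)
        nnz += (coeffs[k] != 0);
    return nnz;
}

/* Hypothetical stand-in for the inverse transform: a clamped add,
 * NOT a real IDCT. */
static void idct_add(uint8_t *dst, const int16_t *coeffs)
{
    for (int k = 0; k < BLOCK_COEFFS; k++) {
        int v = dst[k] + coeffs[k];
        dst[k] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
}

/* Split structure: decode all 16 blocks, then revisit them later. */
static void reconstruct_split(const int16_t *in, uint8_t *pixels)
{
    int16_t coeffs[NUM_BLOCKS][BLOCK_COEFFS];
    int nnz[NUM_BLOCKS];

    assert(in && pixels);
    for (int i = 0; i < NUM_BLOCKS; i++)
        nnz[i] = decode_residual_block(in + i * BLOCK_COEFFS, coeffs[i]);
    /* ... later ... */
    for (int i = 0; i < NUM_BLOCKS; i++)
        if (nnz[i])
            idct_add(pixels + i * BLOCK_COEFFS, coeffs[i]);
}

/* Merged structure: transform each block right after decoding it. */
static void reconstruct_merged(const int16_t *in, uint8_t *pixels)
{
    int16_t coeffs[BLOCK_COEFFS];

    assert(in && pixels);
    for (int i = 0; i < NUM_BLOCKS; i++)
        if (decode_residual_block(in + i * BLOCK_COEFFS, coeffs) > 0)
            idct_add(pixels + i * BLOCK_COEFFS, coeffs);
}
```

Both produce the same pixels. The merged version needs one block of scratch coefficients instead of sixteen plus an nnz array, and the coefficients are still hot when the transform runs. Whether that translates into a measurable win in a real decoder is exactly the open question in this thread.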
If we want to compete, we should start by trying to do things the way
that faster decoders do them.
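As a rough illustration of what "templated" can mean in plain C: write one generic body, force it inline, and instantiate it once per mode with constant arguments so the compiler can drop the branches that do not apply. This is only a sketch of the idiom, not CoreAVC's code; every name below is hypothetical and the arithmetic is a placeholder for real decoding work.

```c
#include <assert.h>

enum entropy_mode { ENTROPY_CAVLC, ENTROPY_CABAC };
enum frame_mode   { FRAME_PROGRESSIVE, FRAME_PAFF, FRAME_MBAFF };

#if defined(__GNUC__)
#define ALWAYS_INLINE static inline __attribute__((always_inline))
#else
#define ALWAYS_INLINE static inline
#endif

/* Generic body. In each wrapper below, e and f are compile-time constants,
 * so after inlining the compiler can remove the untaken branches.
 * The added constants are placeholders that only tag which path ran. */
ALWAYS_INLINE int decode_mb_generic(int mb_data, enum entropy_mode e,
                                    enum frame_mode f)
{
    int r = mb_data;

    if (e == ENTROPY_CABAC)
        r += 1000; /* placeholder for the CABAC path */
    else
        r += 2000; /* placeholder for the CAVLC path */

    if (f == FRAME_MBAFF)
        r += 30;   /* placeholder for MBAFF handling */
    else if (f == FRAME_PAFF)
        r += 20;   /* placeholder for PAFF handling */
    else
        r += 10;   /* placeholder for progressive handling */

    return r;
}

/* One specialized function per (entropy coder, frame mode) pair. */
#define DEFINE_DECODER(name, e, f) \
    static int name(int mb_data) { return decode_mb_generic(mb_data, e, f); }

DEFINE_DECODER(decode_mb_cavlc_prog,  ENTROPY_CAVLC, FRAME_PROGRESSIVE)
DEFINE_DECODER(decode_mb_cavlc_paff,  ENTROPY_CAVLC, FRAME_PAFF)
DEFINE_DECODER(decode_mb_cavlc_mbaff, ENTROPY_CAVLC, FRAME_MBAFF)
DEFINE_DECODER(decode_mb_cabac_prog,  ENTROPY_CABAC, FRAME_PROGRESSIVE)
DEFINE_DECODER(decode_mb_cabac_paff,  ENTROPY_CABAC, FRAME_PAFF)
DEFINE_DECODER(decode_mb_cabac_mbaff, ENTROPY_CABAC, FRAME_MBAFF)

typedef int (*decode_mb_fn)(int mb_data);

/* Pick the specialized variant once, outside the per-macroblock loop. */
static decode_mb_fn select_decoder(enum entropy_mode e, enum frame_mode f)
{
    static const decode_mb_fn table[2][3] = {
        { decode_mb_cavlc_prog, decode_mb_cavlc_paff, decode_mb_cavlc_mbaff },
        { decode_mb_cabac_prog, decode_mb_cabac_paff, decode_mb_cabac_mbaff },
    };

    assert((unsigned)e < 2 && (unsigned)f < 3);
    return table[e][f];
}
```

The mode checks then cost one table lookup per stream (or per slice) instead of several branches per macroblock, at the price of compiling the body six times.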
Dark Shikari