[FFmpeg-devel] [RFC] An improved implementation of ARMv5TE IDCT (simple_idct_armv5te.S)
Siarhei Siamashka
siarhei.siamashka
Thu Sep 13 18:34:06 CEST 2007
On 12 September 2007, Michael Niedermayer wrote:
> On Wed, Sep 12, 2007 at 11:31:02AM +0300, Siarhei Siamashka wrote:
> > I have been working on improving IDCT performance for ARM for some time
> > already and now think that it is more or less ready (performance wise) to
> > get accepted upstream:
> > https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/armv
> >4l/simple_idct_armv5te.S?root=mplayer&view=markup
>
> [...]
>
> > So now the question is: how it would be best to integrate this idct code
> > into ffmpeg? Should it replace the current simple_idct_armv5te.S file?
>
> yes if its always faster
OK, I just thought about adding support for ARMv6 SIMD instructions later and
other cpu specific optimizations to the same source file, generating optimized
implementation for each of the instruction sets from a common template via
macro expansion. In this sense, this file would not be '_armv5te' anymore :)
But I guess, renaming or moving source files is not a problem and can be done
any time in the future.
A positive effect is the reuse of some common blocks of code, eliminating code
duplication. A negative effect is a somewhat impaired readability. Having a
proper regression tests suite is a requirement to keep such code maintainable.
For example, there are three ways to perform pixel data cropping:
1. Similar to old armv5te code (3 instructions per pixel, 192 cycles overhead)
2. Use table lookups like in this patch (1 cycle per pixel, one register
occupied to store address of cropping table, harder code scheduling because of
high data load latency, potential data cache misses when performing lookups).
Overall, it seems to work better than 1. in practice.
3. Use ARMv6 instructions (no disadvantages)
Addition of ARMv6 instructions use for pixel data cropping to 0-255 range
should be not difficult.
> > Could you please review the copyright part to check if it is ok?
>
> is there anything special on it? it looks like normal LGPL
>
> > I can provide a patch with omitted experimental prefetch code, all the
> > globals getting 'ff_' prefix, use of 'ff_cropTbl' instead of keeping its
> > own copy of cropping table
>
> patch welcome
>
> > (by the way, it would be nice to extend MAX_NEX_CROP to 2048
> > as idct can produce results in +-2K range when feeded with completely
> > random data on input).
>
> patch welcome
Patch attached. In order not to break badly on possible MAX_NEG_CROP
changes in the future, header file 'dsputil.h' was split into two
parts: 'dsputil.h' and 'dsputil_asm.h'. The latter is supposed to contain
only defines for some constants and no C code at all, so that it can be
included from assembly sources too. If it is unacceptable, probably we'll
have to keep cropping table in 'simple_idct_armv5te.S'. The point is
that 'simple_idct_croptbl_armv5te' pointer variable is currently in
code section, so it is read-only and can't be modified on initialization
from 'dsputil_arm.c'. Moving this variable to data section introduces
addressing problems, as all the instructions on ARM are 32-bit long
and only 12-bits can be used as immediate offset (relative to instruction
pointer). Accessing this variable in data section would require additional
pointer-to-pointer intermediate variable placed into code section making
code more complicated and reducing performance. Right now the size of compiled
code is 2336 bytes for simple_idct_armv5te.S (3052 bytes for older code).
Additional crop table embedded directly in simple_idct_armv5te.S source would
increase compiled size by extra 4352 bytes.
Standard ffmpeg regression tests passed with this patch applied to the latest
ffmpeg SVN head (with a tweak to use FF_IDCT_SIMPLEARMV5TE instead of
FF_IDCT_SIMPLE in 'dsputil_arm.c', otherwise assembly optimized code would not
be used at all in the regression tests).
I also used mplayer-20070622 snapshot (currently gentoo stable) for
benchmarking the effect of this patch on video decoding:
http://distro.ibiblio.org/pub/linux/distributions/gentoo/distfiles/mplayer-20070622.tar.bz2
All the ARM optimized IDCT code is up to date in it and is not different from
ffmpeg SVN head (except for a few changes in license headers).
gcc version 3.4.4 (release) (CodeSourcery ARM 2005q3-2)
The mplayer binaries were configured with -O2 and O3 optimization settings:
CFLAGS="-O2 -fomit-frame-pointer -march=armv5te" ./configure --enable-armv5te --enable-armv6
CFLAGS="-O3 -fomit-frame-pointer -march=armv5te" ./configure --enable-armv5te --enable-armv6
The results of a long batch test with matrixbench_normdivx_vbrmp3.avi video
clip from http://samples.mplayerhq.hu/benchmark/testsuite1/ are listed at the
end of this message.
Overall, it is clear that we can not say that benchmarking is an easy task.
Just using different optimization settings (-O2 vs. -O3) makes the results
drift in a somewhat unpredictable way (probably instructions cache size and
cache aliasing problems related). Testing with other videos will most likely
result in different relative numbers. Anyway, the patch provides a visible
performance improvement (5% or more) over the previous code both on ARM9E
and ARM11 cores (XScale results should be similar to ARM11).
As an additional test, I compared '-vo md5sum' mplayer output using
different idct for this matrix video clip with each other. The results
from C simple_idct and both old and new simple_idct_armv5te are identical.
The only result that is different, was provided by simple_idct_armv6.
I wonder if it is expected behaviour or probably I have found some bug in
armv6 idct implementation?
=== Nokia 770 (ARM9E core, 252MHz, 32K I-cache, 16K D-cache) ===
-O2, old simple_idct_armv5te
BENCHMARKs: VC: 179.475s VO: 0.063s A: 0.000s Sys: 2.483s = 182.021s
BENCHMARKs: VC: 181.505s VO: 0.060s A: 0.000s Sys: 2.398s = 183.964s
BENCHMARKs: VC: 181.374s VO: 0.063s A: 0.000s Sys: 2.402s = 183.839s
BENCHMARKs: VC: 182.206s VO: 0.064s A: 0.000s Sys: 2.371s = 184.641s
BENCHMARKs: VC: 181.735s VO: 0.064s A: 0.000s Sys: 2.388s = 184.187s
BENCHMARKs: VC: 181.592s VO: 0.063s A: 0.000s Sys: 2.389s = 184.044s
BENCHMARKs: VC: 181.331s VO: 0.062s A: 0.000s Sys: 2.378s = 183.770s
BENCHMARKs: VC: 181.505s VO: 0.065s A: 0.000s Sys: 2.398s = 183.969s
-O2, new simple_idct_armv5te
BENCHMARKs: VC: 169.270s VO: 0.065s A: 0.000s Sys: 2.419s = 171.753s
BENCHMARKs: VC: 170.924s VO: 0.065s A: 0.000s Sys: 2.408s = 173.397s
BENCHMARKs: VC: 170.835s VO: 0.065s A: 0.000s Sys: 2.400s = 173.300s
BENCHMARKs: VC: 173.313s VO: 0.065s A: 0.000s Sys: 2.396s = 175.774s
BENCHMARKs: VC: 170.923s VO: 0.066s A: 0.000s Sys: 2.419s = 173.408s
BENCHMARKs: VC: 170.716s VO: 0.066s A: 0.000s Sys: 2.391s = 173.173s
BENCHMARKs: VC: 170.813s VO: 0.066s A: 0.000s Sys: 2.398s = 173.277s
BENCHMARKs: VC: 170.533s VO: 0.063s A: 0.000s Sys: 2.410s = 173.007s
-O2, C implementation of simple_idct
BENCHMARKs: VC: 213.861s VO: 0.063s A: 0.000s Sys: 2.385s = 216.309s
BENCHMARKs: VC: 216.335s VO: 0.062s A: 0.000s Sys: 2.388s = 218.785s
BENCHMARKs: VC: 216.333s VO: 0.062s A: 0.000s Sys: 2.386s = 218.781s
BENCHMARKs: VC: 217.918s VO: 0.062s A: 0.000s Sys: 2.376s = 220.356s
BENCHMARKs: VC: 216.272s VO: 0.061s A: 0.000s Sys: 2.374s = 218.707s
BENCHMARKs: VC: 216.372s VO: 0.063s A: 0.000s Sys: 2.387s = 218.822s
BENCHMARKs: VC: 216.092s VO: 0.062s A: 0.000s Sys: 2.401s = 218.555s
BENCHMARKs: VC: 216.325s VO: 0.063s A: 0.000s Sys: 2.394s = 218.783s
-O3, old simple_idct_armv5te
BENCHMARKs: VC: 183.081s VO: 0.063s A: 0.000s Sys: 2.367s = 185.510s
BENCHMARKs: VC: 183.142s VO: 0.060s A: 0.000s Sys: 2.356s = 185.558s
BENCHMARKs: VC: 183.167s VO: 0.064s A: 0.000s Sys: 2.374s = 185.606s
BENCHMARKs: VC: 183.601s VO: 0.062s A: 0.000s Sys: 2.353s = 186.016s
BENCHMARKs: VC: 183.261s VO: 0.062s A: 0.000s Sys: 2.358s = 185.681s
BENCHMARKs: VC: 183.214s VO: 0.064s A: 0.000s Sys: 2.353s = 185.632s
BENCHMARKs: VC: 183.225s VO: 0.062s A: 0.000s Sys: 2.344s = 185.631s
BENCHMARKs: VC: 183.241s VO: 0.062s A: 0.000s Sys: 2.359s = 185.661s
-O3, new simple_idct_armv5te
BENCHMARKs: VC: 170.260s VO: 0.063s A: 0.000s Sys: 2.356s = 172.679s
BENCHMARKs: VC: 170.576s VO: 0.068s A: 0.000s Sys: 2.353s = 172.997s
BENCHMARKs: VC: 170.344s VO: 0.062s A: 0.000s Sys: 2.350s = 172.756s
BENCHMARKs: VC: 170.727s VO: 0.065s A: 0.000s Sys: 2.360s = 173.153s
BENCHMARKs: VC: 170.634s VO: 0.062s A: 0.000s Sys: 2.354s = 173.050s
BENCHMARKs: VC: 170.433s VO: 0.067s A: 0.000s Sys: 2.341s = 172.841s
BENCHMARKs: VC: 170.409s VO: 0.067s A: 0.000s Sys: 2.347s = 172.823s
BENCHMARKs: VC: 170.503s VO: 0.063s A: 0.000s Sys: 2.372s = 172.938s
-O3, C implementation of simple_idct
BENCHMARKs: VC: 217.215s VO: 0.059s A: 0.000s Sys: 2.417s = 219.691s
BENCHMARKs: VC: 219.585s VO: 0.065s A: 0.000s Sys: 2.479s = 222.129s
BENCHMARKs: VC: 215.359s VO: 0.064s A: 0.000s Sys: 2.410s = 217.833s
BENCHMARKs: VC: 215.369s VO: 0.064s A: 0.000s Sys: 2.396s = 217.828s
BENCHMARKs: VC: 215.499s VO: 0.066s A: 0.000s Sys: 2.404s = 217.969s
BENCHMARKs: VC: 215.410s VO: 0.065s A: 0.000s Sys: 2.399s = 217.874s
BENCHMARKs: VC: 215.377s VO: 0.064s A: 0.000s Sys: 2.399s = 217.841s
BENCHMARKs: VC: 215.347s VO: 0.065s A: 0.000s Sys: 2.412s = 217.824s
=== Nokia N800 (ARM11 core, 330MHz, 32K I-cache, 32K D-cache) ===
-O2, old simple_idct_armv5te
BENCHMARKs: VC: 138.124s VO: 0.065s A: 0.000s Sys: 1.741s = 139.930s
BENCHMARKs: VC: 139.550s VO: 0.067s A: 0.000s Sys: 1.757s = 141.374s
BENCHMARKs: VC: 139.904s VO: 0.065s A: 0.000s Sys: 1.788s = 141.758s
BENCHMARKs: VC: 139.080s VO: 0.064s A: 0.000s Sys: 1.759s = 140.903s
BENCHMARKs: VC: 139.916s VO: 0.063s A: 0.000s Sys: 1.760s = 141.739s
BENCHMARKs: VC: 139.748s VO: 0.063s A: 0.000s Sys: 1.766s = 141.577s
BENCHMARKs: VC: 139.325s VO: 0.063s A: 0.000s Sys: 1.761s = 141.150s
BENCHMARKs: VC: 139.053s VO: 0.064s A: 0.000s Sys: 1.767s = 140.884s
-O2, new simple_idct_armv5te
BENCHMARKs: VC: 124.846s VO: 0.065s A: 0.000s Sys: 1.720s = 126.631s
BENCHMARKs: VC: 127.525s VO: 0.065s A: 0.000s Sys: 1.745s = 129.335s
BENCHMARKs: VC: 126.842s VO: 0.064s A: 0.000s Sys: 1.733s = 128.639s
BENCHMARKs: VC: 127.362s VO: 0.072s A: 0.000s Sys: 1.775s = 129.209s
BENCHMARKs: VC: 127.891s VO: 0.066s A: 0.000s Sys: 1.746s = 129.703s
BENCHMARKs: VC: 127.964s VO: 0.066s A: 0.000s Sys: 1.760s = 129.790s
BENCHMARKs: VC: 127.302s VO: 0.065s A: 0.000s Sys: 1.745s = 129.113s
BENCHMARKs: VC: 126.942s VO: 0.068s A: 0.000s Sys: 1.758s = 128.768s
-O2, simple_idct_armv6
BENCHMARKs: VC: 121.252s VO: 0.061s A: 0.000s Sys: 1.724s = 123.038s
BENCHMARKs: VC: 124.197s VO: 0.062s A: 0.000s Sys: 1.733s = 125.992s
BENCHMARKs: VC: 123.555s VO: 0.065s A: 0.000s Sys: 1.736s = 125.356s
BENCHMARKs: VC: 123.519s VO: 0.065s A: 0.000s Sys: 1.720s = 125.305s
BENCHMARKs: VC: 124.071s VO: 0.063s A: 0.000s Sys: 1.714s = 125.848s
BENCHMARKs: VC: 124.146s VO: 0.064s A: 0.000s Sys: 1.732s = 125.942s
BENCHMARKs: VC: 123.530s VO: 0.066s A: 0.000s Sys: 1.723s = 125.319s
BENCHMARKs: VC: 123.470s VO: 0.064s A: 0.000s Sys: 1.720s = 125.254s
-O2, C implementation of simple_idct
BENCHMARKs: VC: 164.477s VO: 0.065s A: 0.000s Sys: 1.761s = 166.303s
BENCHMARKs: VC: 167.452s VO: 0.064s A: 0.000s Sys: 1.755s = 169.271s
BENCHMARKs: VC: 167.380s VO: 0.065s A: 0.000s Sys: 1.757s = 169.202s
BENCHMARKs: VC: 166.895s VO: 0.066s A: 0.000s Sys: 1.758s = 168.719s
BENCHMARKs: VC: 167.504s VO: 0.065s A: 0.000s Sys: 1.757s = 169.326s
BENCHMARKs: VC: 167.455s VO: 0.064s A: 0.000s Sys: 1.753s = 169.271s
BENCHMARKs: VC: 167.216s VO: 0.065s A: 0.000s Sys: 1.765s = 169.046s
BENCHMARKs: VC: 167.416s VO: 0.063s A: 0.000s Sys: 1.784s = 169.264s
-O3, old simple_idct_armv5te
BENCHMARKs: VC: 142.686s VO: 0.063s A: 0.000s Sys: 1.701s = 144.450s
BENCHMARKs: VC: 143.233s VO: 0.062s A: 0.000s Sys: 1.708s = 145.003s
BENCHMARKs: VC: 142.283s VO: 0.064s A: 0.000s Sys: 1.714s = 144.062s
BENCHMARKs: VC: 142.233s VO: 0.065s A: 0.000s Sys: 1.736s = 144.034s
BENCHMARKs: VC: 143.095s VO: 0.065s A: 0.000s Sys: 1.693s = 144.853s
BENCHMARKs: VC: 142.917s VO: 0.064s A: 0.000s Sys: 1.701s = 144.683s
BENCHMARKs: VC: 141.926s VO: 0.065s A: 0.000s Sys: 1.730s = 143.721s
BENCHMARKs: VC: 142.785s VO: 0.065s A: 0.000s Sys: 1.720s = 144.569s
-O3, new simple_idct_armv5te
BENCHMARKs: VC: 125.601s VO: 0.061s A: 0.000s Sys: 1.703s = 127.366s
BENCHMARKs: VC: 126.094s VO: 0.063s A: 0.000s Sys: 1.727s = 127.883s
BENCHMARKs: VC: 125.614s VO: 0.064s A: 0.000s Sys: 1.737s = 127.414s
BENCHMARKs: VC: 125.480s VO: 0.062s A: 0.000s Sys: 1.719s = 127.261s
BENCHMARKs: VC: 126.102s VO: 0.062s A: 0.000s Sys: 1.730s = 127.894s
BENCHMARKs: VC: 126.362s VO: 0.066s A: 0.000s Sys: 1.745s = 128.173s
BENCHMARKs: VC: 125.380s VO: 0.062s A: 0.000s Sys: 1.754s = 127.196s
BENCHMARKs: VC: 125.538s VO: 0.061s A: 0.000s Sys: 1.727s = 127.326s
-O3, simple_idct_armv6
BENCHMARKs: VC: 125.411s VO: 0.064s A: 0.000s Sys: 1.742s = 127.217s
BENCHMARKs: VC: 125.459s VO: 0.062s A: 0.000s Sys: 1.783s = 127.304s
BENCHMARKs: VC: 124.586s VO: 0.067s A: 0.000s Sys: 1.738s = 126.391s
BENCHMARKs: VC: 124.555s VO: 0.062s A: 0.000s Sys: 1.740s = 126.357s
BENCHMARKs: VC: 125.617s VO: 0.064s A: 0.000s Sys: 1.729s = 127.410s
BENCHMARKs: VC: 125.187s VO: 0.065s A: 0.000s Sys: 1.743s = 126.995s
BENCHMARKs: VC: 124.505s VO: 0.064s A: 0.000s Sys: 1.741s = 126.310s
BENCHMARKs: VC: 124.444s VO: 0.063s A: 0.000s Sys: 1.742s = 126.248s
-O3, C implementation of simple_idct
BENCHMARKs: VC: 162.218s VO: 0.073s A: 0.000s Sys: 1.742s = 164.033s
BENCHMARKs: VC: 163.582s VO: 0.071s A: 0.000s Sys: 1.761s = 165.414s
BENCHMARKs: VC: 162.605s VO: 0.072s A: 0.000s Sys: 1.759s = 164.436s
BENCHMARKs: VC: 162.713s VO: 0.073s A: 0.000s Sys: 1.726s = 164.512s
BENCHMARKs: VC: 163.388s VO: 0.071s A: 0.000s Sys: 1.730s = 165.189s
BENCHMARKs: VC: 163.241s VO: 0.070s A: 0.000s Sys: 1.739s = 165.050s
BENCHMARKs: VC: 162.843s VO: 0.073s A: 0.000s Sys: 1.739s = 164.655s
BENCHMARKs: VC: 162.440s VO: 0.073s A: 0.000s Sys: 1.748s = 164.260s
-------------- next part --------------
A non-text attachment was scrubbed...
Name: improved-simple-idct-armv5te.diff
Type: text/x-diff
Size: 47619 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070913/59fd1e03/attachment.diff>
More information about the ffmpeg-devel
mailing list