[Ffmpeg-devel] Native H.264 encoder (was: I'm giving up)

Mon Dec 11 13:24:54 CET 2006

Hi Michael,

On Mon, 2006-12-11 at 01:20 +0100, Panagiotis Issaris wrote:
> On Sat, Dec 09, 2006 at 02:47:02AM +0100, Michael Niedermayer wrote:
> >[...]
> > > +    c = pieces[2][0]-pieces[2][3];
> > > +    b = pieces[2][1]+pieces[2][2];
> > > +    d = pieces[2][1]-pieces[2][2];
> > > +    block[0][2] = a+b;
> > > +    block[2][2] = a-b;
> > > +    block[1][2] = (c<<1)+d;
> > > +    block[3][2] = c-(d<<1);
> > > +
> > > +    a = pieces[3][0]+pieces[3][3];
> > > +    c = pieces[3][0]-pieces[3][3];
> > > +    b = pieces[3][1]+pieces[3][2];
> > > +    d = pieces[3][1]-pieces[3][2];
> > > +    block[0][3] = a+b;
> > > +    block[2][3] = a-b;
> > > +    block[1][3] = (c<<1)+d;
> > > +    block[3][3] = c-(d<<1);
> > > +}
> > 
> > i assume that a for loop would slow this down significantly? if so a macro would
> > make that much smaller without speed loss ...
> 
> I've tested this like this:
> START_TIMER
>     DCTELEM pieces[4][4];
>     DCTELEM a, b, c, d;
>     int i;
> 
>     for (i=0; i<4; i++)
>     {
>         a = block[0][i]+block[3][i];
>         c = block[0][i]-block[3][i];
>         b = block[1][i]+block[2][i];
>         d = block[1][i]-block[2][i];
>         pieces[0][i] = a+b;
>         pieces[2][i] = a-b;
>         pieces[1][i] = (c<<1)+d;
>         pieces[3][i] = c-(d<<1);
>     }
> 
>     for (i=0; i<4; i++)
>     {
>         a = pieces[i][0]+pieces[i][3];
>         c = pieces[i][0]-pieces[i][3];
>         b = pieces[i][1]+pieces[i][2];
>         d = pieces[i][1]-pieces[i][2];
>         block[0][i] = a+b;
>         block[2][i] = a-b;
>         block[1][i] = (c<<1)+d;
>         block[3][i] = c-(d<<1);
>     }
> STOP_TIMER("DCTFOR")
> 
> Resulting in:
> ...
> 924 dezicycles in DCTFOR, 8387443 runs, 1165 skips
> frame= 1989 q=-1.0 Lsize=   11046kB time=66.4 bitrate=1363.5kbits/s    
> video:11020kB audio:0kB global headers:0kB muxing overhead 0.233141%
> 
> When using the DCT without loops:
> ...
> 914 dezicycles in DCT, 8387499 runs, 1109 skips
> frame= 1989 q=-1.0 Lsize=   11046kB time=66.4 bitrate=1363.5kbits/s    
> video:11020kB audio:0kB global headers:0kB muxing overhead 0.233141%
> 
> But the results varied between runs by more than the difference shown above. I
> got 924, 944 and more dezicycles for the DCT without the loops as well. Likewise,
> for the DCT with the for loops, the time spent in the DCT varied from 910 to
> 980 dezicycles. So, to me, it appears that adding the loop doesn't hurt much. The
> tests above took place on an Athlon64 X2 3800+. I will run the same tests
> tomorrow on a P4 and see if it makes a considerable difference on that machine.

I reran the tests on a 3.20GHz Pentium 4, and on that machine the loops
appear to make a consistent difference of about 200 dezicycles (roughly
20 clock cycles, since a dezicycle is a tenth of a cycle).

With the for loops:
...
1983 dezicycles in DCTFOR, 16775281 runs, 1935 skips
frame=  101 q=-1.0 Lsize=    1652kB time=4.0 bitrate=3350.1kbits/s    
video:1652kB audio:0kB global headers:0kB muxing overhead 0.000000%

Repeated runs gave: 1991, 1986, 1994, 1995, 1997, 2061

Without the for loops:
...
1809 dezicycles in DCT, 16776700 runs, 516 skips
frame=  101 q=-1.0 Lsize=    1652kB time=4.0 bitrate=3350.1kbits/s    
video:1652kB audio:0kB global headers:0kB muxing overhead 0.000000%

Repeated runs gave: 1806, 1790, 1805, 1814, 1826, 1835
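
To put the P4 numbers side by side, here is a small sketch that summarizes
them. The helper names (mean_of, min_of, max_of, ranges_disjoint) are made up
for illustration, and it treats the first reported figure plus the six
repeated runs as seven samples per variant, which is an assumption about how
the figures should be pooled.

```c
#include <stddef.h>

#define N_RUNS 7

/* Dezicycle figures copied from the P4 runs above
 * (first reported value followed by the six repeated runs). */
static const int with_loops[N_RUNS]    = {1983, 1991, 1986, 1994, 1995, 1997, 2061};
static const int without_loops[N_RUNS] = {1809, 1806, 1790, 1805, 1814, 1826, 1835};

/* Arithmetic mean of n samples; n must be greater than zero. */
static double mean_of(const int *v, size_t n)
{
    double sum = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Smallest of n samples; n must be greater than zero. */
static int min_of(const int *v, size_t n)
{
    int m = v[0];
    size_t i;
    for (i = 1; i < n; i++)
        if (v[i] < m)
            m = v[i];
    return m;
}

/* Largest of n samples; n must be greater than zero. */
static int max_of(const int *v, size_t n)
{
    int m = v[0];
    size_t i;
    for (i = 1; i < n; i++)
        if (v[i] > m)
            m = v[i];
    return m;
}

/* Returns 1 if the slowest unrolled run is still faster than the
 * fastest looped run, i.e. the two sets of measurements do not overlap. */
static int ranges_disjoint(void)
{
    return max_of(without_loops, N_RUNS) < min_of(with_loops, N_RUNS);
}
```

On these figures the means come out at about 2001 dezicycles with the loops
and about 1812 without, and the slowest unrolled run (1835) is still faster
than the fastest looped run (1983), unlike on the Athlon64 where the ranges
overlapped.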

So on the Athlon64 it appears to make no real difference, but on the P4 it does.
I'll try to rewrite it more compactly using a macro.
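
Roughly what I have in mind is sketched below; it is not the final patch.
The macro names (BUTTERFLY, COL, ROW) and the function names are
placeholders, DCTELEM is assumed to be a 16-bit short, and I wrote 2*x
instead of x<<1 so that negative intermediates stay well defined. The
preprocessor expands the macro version into straight-line code, so it should
behave like the hand-unrolled variant while being much shorter to read. A
loop variant built from the same macros is included for comparison.

```c
typedef short DCTELEM; /* assumption: 16-bit coefficient type */

/* One 4-point butterfly: reads s0..s3 and writes d0..d3.
 * Intermediates are kept in int and narrowed on store. */
#define BUTTERFLY(s0, s1, s2, s3, d0, d1, d2, d3) do { \
        const int a = (s0) + (s3);                     \
        const int c = (s0) - (s3);                     \
        const int b = (s1) + (s2);                     \
        const int d = (s1) - (s2);                     \
        (d0) = (DCTELEM)(a + b);                       \
        (d2) = (DCTELEM)(a - b);                       \
        (d1) = (DCTELEM)(2 * c + d);                   \
        (d3) = (DCTELEM)(c - 2 * d);                   \
    } while (0)

/* First pass: transform column i of block into column i of pieces. */
#define COL(i) BUTTERFLY(block[0][i], block[1][i], block[2][i], block[3][i], \
                         pieces[0][i], pieces[1][i], pieces[2][i], pieces[3][i])

/* Second pass: transform row i of pieces into column i of block. */
#define ROW(i) BUTTERFLY(pieces[i][0], pieces[i][1], pieces[i][2], pieces[i][3], \
                         block[0][i], block[1][i], block[2][i], block[3][i])

/* Macro version: the eight butterflies are expanded inline, no loop. */
static void dct4x4_macro(DCTELEM block[4][4])
{
    DCTELEM pieces[4][4];

    COL(0); COL(1); COL(2); COL(3);
    ROW(0); ROW(1); ROW(2); ROW(3);
}

/* Loop version using the same macros, for comparison and timing. */
static void dct4x4_loop(DCTELEM block[4][4])
{
    DCTELEM pieces[4][4];
    int i;

    for (i = 0; i < 4; i++)
        COL(i);
    for (i = 0; i < 4; i++)
        ROW(i);
}
```

Both functions transform the block in place, so feeding the same input to
each and comparing the results is a cheap sanity check before timing them.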

With friendly regards,
Takis
-- 
vCard: http://www.issaris.org/pi.vcf
Public key: http://www.issaris.org/pi.key