[Ffmpeg-devel] Native H.264 encoder (was: I'm giving up)
Panagiotis Issaris
takis.issaris
Mon Dec 11 13:24:54 CET 2006
Hi Michael,
On Mon, 2006-12-11 at 01:20 +0100, Panagiotis Issaris wrote:
> On Sat, Dec 09, 2006 at 02:47:02AM +0100, Michael Niedermayer wrote:
> >[...]
> > > + c = pieces[2][0]-pieces[2][3];
> > > + b = pieces[2][1]+pieces[2][2];
> > > + d = pieces[2][1]-pieces[2][2];
> > > + block[0][2] = a+b;
> > > + block[2][2] = a-b;
> > > + block[1][2] = (c<<1)+d;
> > > + block[3][2] = c-(d<<1);
> > > +
> > > + a = pieces[3][0]+pieces[3][3];
> > > + c = pieces[3][0]-pieces[3][3];
> > > + b = pieces[3][1]+pieces[3][2];
> > > + d = pieces[3][1]-pieces[3][2];
> > > + block[0][3] = a+b;
> > > + block[2][3] = a-b;
> > > + block[1][3] = (c<<1)+d;
> > > + block[3][3] = c-(d<<1);
> > > +}
> >
> > i assume that a for loop would slow this down significantly? if so a macro would
> > make that much smaller without speed loss ...
>
> I've tested this like this:
> START_TIMER
> DCTELEM pieces[4][4];
> DCTELEM a, b, c, d;
> int i;
>
> for (i=0; i<4; i++)
> {
>     a = block[0][i]+block[3][i];
>     c = block[0][i]-block[3][i];
>     b = block[1][i]+block[2][i];
>     d = block[1][i]-block[2][i];
>     pieces[0][i] = a+b;
>     pieces[2][i] = a-b;
>     pieces[1][i] = (c<<1)+d;
>     pieces[3][i] = c-(d<<1);
> }
>
> for (i=0; i<4; i++)
> {
>     a = pieces[i][0]+pieces[i][3];
>     c = pieces[i][0]-pieces[i][3];
>     b = pieces[i][1]+pieces[i][2];
>     d = pieces[i][1]-pieces[i][2];
>     block[0][i] = a+b;
>     block[2][i] = a-b;
>     block[1][i] = (c<<1)+d;
>     block[3][i] = c-(d<<1);
> }
> STOP_TIMER("DCTFOR")
>
> Resulting in:
> ...
> 924 dezicycles in DCTFOR, 8387443 runs, 1165 skips
> frame= 1989 q=-1.0 Lsize= 11046kB time=66.4 bitrate=1363.5kbits/s
> video:11020kB audio:0kB global headers:0kB muxing overhead 0.233141%
>
> When using the DCT without loops:
> ...
> 914 dezicycles in DCT, 8387499 runs, 1109 skips
> frame= 1989 q=-1.0 Lsize= 11046kB time=66.4 bitrate=1363.5kbits/s
> video:11020kB audio:0kB global headers:0kB muxing overhead 0.233141%
>
> But the results varied over a range bigger than the difference shown above. I got
> runs of 924, 944 and more decicycles for the DCT without the loops as well. Same
> for the DCT with the for loops: decicycles spent in the DCT varied from 910 to
> 980. So, to me, it appears adding the loop doesn't hurt much. The tests above
> took place on an Athlon64 X2 3800+. I will conduct the same tests tomorrow on a
> P4 and see if it makes a considerable difference on that machine.
I reran the tests on a Pentium 4 CPU 3.20GHz and on that machine it
appears to make a consistent difference of about 200 dezicycles
(roughly 20 clock cycles, or about 10%) per run.
With the for loops:
...
1983 dezicycles in DCTFOR, 16775281 runs, 1935 skips
frame= 101 q=-1.0 Lsize= 1652kB time=4.0 bitrate=3350.1kbits/s
video:1652kB audio:0kB global headers:0kB muxing overhead 0.000000%
Repeated runs gave: 1991, 1986, 1994, 1995, 1997, 2061
Without the for loops:
...
1809 dezicycles in DCT, 16776700 runs, 516 skips
frame= 101 q=-1.0 Lsize= 1652kB time=4.0 bitrate=3350.1kbits/s
video:1652kB audio:0kB global headers:0kB muxing overhead 0.000000%
Repeated runs gave: 1806, 1790, 1805, 1814, 1826, 1835
So on the Athlon64 the loops appear to make no real difference, but on
the P4 the version without loops is consistently faster.
I'll try and rewrite it a bit shorter using a macro.
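
A minimal sketch of what such a macro-based version might look like,
not the code that was eventually committed. The names DCT4_1D,
h264_dct_macro, h264_dct_loop and dct_self_check are hypothetical, and
the DCTELEM typedef is an assumption (a 16-bit signed type). The sketch
uses "* 2" where the patch uses "<< 1", since left-shifting a negative
value is undefined in C; compilers normally emit the same instruction
for both.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef int16_t DCTELEM; /* assumption: 16-bit signed coefficients */

/* One 4-point butterfly: reads s0..s3, writes d0..d3.
 * Same arithmetic as the a/b/c/d steps in the patch. */
#define DCT4_1D(d0, d1, d2, d3, s0, s1, s2, s3) do { \
    const int a = (s0) + (s3);                       \
    const int c = (s0) - (s3);                       \
    const int b = (s1) + (s2);                       \
    const int d = (s1) - (s2);                       \
    (d0) = (DCTELEM)(a + b);                         \
    (d2) = (DCTELEM)(a - b);                         \
    (d1) = (DCTELEM)((c * 2) + d);                   \
    (d3) = (DCTELEM)(c - (d * 2));                   \
} while (0)

/* First pass: column i of block -> column i of pieces. */
#define DCT_PASS1(i) \
    DCT4_1D(pieces[0][i], pieces[1][i], pieces[2][i], pieces[3][i], \
            block[0][i],  block[1][i],  block[2][i],  block[3][i])

/* Second pass: row i of pieces -> column i of block. */
#define DCT_PASS2(i) \
    DCT4_1D(block[0][i],  block[1][i],  block[2][i],  block[3][i], \
            pieces[i][0], pieces[i][1], pieces[i][2], pieces[i][3])

/* Unrolled through the macro: no loops, but only eight short lines. */
static void h264_dct_macro(DCTELEM block[4][4])
{
    DCTELEM pieces[4][4];

    DCT_PASS1(0); DCT_PASS1(1); DCT_PASS1(2); DCT_PASS1(3);
    DCT_PASS2(0); DCT_PASS2(1); DCT_PASS2(2); DCT_PASS2(3);
}

/* Loop version, kept only as a reference to compare against. */
static void h264_dct_loop(DCTELEM block[4][4])
{
    DCTELEM pieces[4][4];
    int i;

    for (i = 0; i < 4; i++)
        DCT_PASS1(i);
    for (i = 0; i < 4; i++)
        DCT_PASS2(i);
}

/* Checks that both versions produce identical output on a test block. */
static void dct_self_check(void)
{
    DCTELEM x[4][4], y[4][4];
    int i, j;

    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            x[i][j] = y[i][j] = (DCTELEM)(i * 7 - j * 5 + 3);

    h264_dct_macro(x);
    h264_dct_loop(y);
    assert(memcmp(x, y, sizeof(x)) == 0);
}
```

Calling dct_self_check() from a test program is a cheap way to confirm
that the macro version and the loop version stay in sync while timing
them separately with START_TIMER/STOP_TIMER.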
With friendly regards,
Takis
--
vCard: http://www.issaris.org/pi.vcf
Public key: http://www.issaris.org/pi.key