[Ffmpeg-devel] [PATCH] lowres chroma bug
Michael Niedermayer
michaelni
Thu Feb 8 22:32:15 CET 2007
Hi
On Thu, Feb 08, 2007 at 05:32:58AM -0800, Trent Piepho wrote:
> On Thu, 8 Feb 2007, Michael Niedermayer wrote:
> > On Wed, Feb 07, 2007 at 03:51:22PM -0800, Trent Piepho wrote:
> > > >
> > > > Last time I compared hardcoded registers with gcc-chosen ones, the latter
> > > > were slower (that was in cabac.h, in case you want to prove me wrong; I'd
> > > > be happy if we could get rid of the hardcoded registers there ...)
> > >
> > > It's going to depend a lot on how the code is used. If your asm will only
> > > appear in one place, i.e. it's neither a macro nor an inlined function nor
> > > in an unrolled loop, etc., then you could just let gcc pick a register and
> > > then go back and hardcode that same register. That should generate the
> > > exact same code.
> >
> > I agree, it should, but I'm not so sure it really does if you need
> > additional dummy variables for the gcc-chosen register case ...
>
> You can see in the resulting code that gcc doesn't generate any loads or
> stores to the dummy variable, or even allocate any stack space for it.
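A minimal sketch of the two styles being compared, assuming GCC-style inline asm on x86 or x86-64 (AT&T syntax). The function names sum3_hardcoded and sum3_dummy are hypothetical and not taken from cabac.h. The first version hardcodes eax as its scratch register and therefore has to list it as clobbered; the second declares a dummy output operand, tmp, so gcc picks the scratch register itself at each inlining site. Compiling both with gcc -O2 -S is one way to check whether the dummy variable costs any loads, stores or stack space in a given context.

```c
/* x86 / x86-64 only: GCC extended asm, AT&T syntax. */

/* Hardcoded scratch register: eax is clobbered, so gcc must keep every
 * live value out of eax around each inlined copy of this asm. */
static inline int sum3_hardcoded(int x, int y, int z)
{
    int r;
    __asm__("movl %1, %%eax\n\t"
            "addl %2, %%eax\n\t"
            "addl %3, %%eax\n\t"
            "movl %%eax, %0"
            : "=r"(r)
            : "r"(x), "r"(y), "r"(z)
            : "eax", "cc");
    return r;
}

/* gcc-chosen scratch register via a dummy output operand. The "&"
 * (early-clobber) stops gcc from giving tmp the same register as an
 * input, because tmp is written before all inputs have been read. */
static inline int sum3_dummy(int x, int y, int z)
{
    int r, tmp; /* tmp is never used in C; it only reserves a register */
    __asm__("movl %2, %1\n\t"
            "addl %3, %1\n\t"
            "addl %4, %1\n\t"
            "movl %1, %0"
            : "=r"(r), "=&r"(tmp)
            : "r"(x), "r"(y), "r"(z)
            : "cc");
    return r;
}
```

Both compute x + y + z; only the register-allocation freedom given to gcc differs.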
>
> > > The advantage comes when the code is a macro or inlined in multiple places.
> > > With a hardcoded register, the same register must be used each time. If
> > > you let gcc choose, it can pick different registers depending on the
> > > context. In this case, no matter what register you pick, you may do worse
> > > than letting gcc pick.
> >
> > In theory yes; in practice I don't have that much faith in gcc's ability to
> > select registers better than random assignment, and forcing input
> > operands to always be in the same register, rather than random ones,
> > can avoid some instructions.
>
> At least in simple cases, it is easy to see the gcc register assignment is
> much better than random. Here's an example:
> #include <strings.h> /* declares bzero() */
> void bar(int); /* external, defined elsewhere */
> int foo(void)
> {
> int a, b;
> void *d, *s;
>
> asm("# a = %0, b = %1" : "=r"(a), "=r"(b)); /*block 1*/
> bar(a);
> asm("# read a = %0 b = %1" :: "r"(a), "r"(b));
>
> asm("# s = %0, d = %1" : "=r"(s), "=r"(d)); /*block 2*/
> bzero(d, 32);
>
> asm("# a = %0, b = %1" : "=r"(a), "=r"(b) : "r"(s)); /*block 3*/
> return a;
> }
>
> In block 1, a and b need to keep their values across the call to bar().
> gcc generates:
> # a = %ebx, b = %esi # a, b
> pushl %ebx # a
> call bar #
> # read a = %ebx b = %esi # a, b
>
> It chose ebx and esi because those are callee-saved registers and do not
> need to be saved and reloaded across the call to bar(). If the call to
> bar() is commented out, it will choose edx and eax instead.
>
> In block 2, gcc will emit an inline version of bzero using rep stosl, which
> always writes to the address in edi, and so gcc will assign edi to d. Change
> the bzero to use s or a or b, and then that variable will be assigned edi.
> Comment out the bzero, and gcc will just use eax/edx.
>
> In block 3, a is the return value of the function and so will be put in eax
> since that's where the return value needs to go. Change the function to
> return b, and then b will get put in eax.
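The choice in block 2 is forced by the instruction itself: rep stosl stores eax to the address in edi and counts down ecx, so inline asm using it has to pin those operands with GCC's x86 register constraints ("D", "c", "a"). A minimal sketch, assuming x86 or x86-64 and the usual ABI guarantee that the direction flag is clear; zero_dwords is a hypothetical helper, not the code gcc actually emits for bzero.

```c
#include <stddef.h>

/* x86 / x86-64 only. "D" = edi/rdi (destination), "c" = ecx/rcx (count),
 * "a" = eax (value stored). gcc must place the operands in exactly these
 * registers, which is why d ends up in edi in block 2. */
static void zero_dwords(void *d, size_t ndwords)
{
    __asm__ volatile("rep stosl"
                     : "+D"(d), "+c"(ndwords) /* both are advanced/consumed */
                     : "a"(0)                 /* 32-bit value to store */
                     : "memory");             /* writes memory via d */
}
```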
All nice, but why does it not work in practice (cabac.h)? I've tried changing
various registers to gcc-selected ones, but the code was always slower, and the
hardcoded registers in the current cabac.h are just random.
>
> > > Like the inlined put_bits() function in bitstream.h, I think you would get
> > > better code if eax weren't hardcoded.
> >
> > Well, benchmark it and send a patch if it's faster.
>
> I have no idea how to benchmark that function. Adding an rdtsc to the code
> will totally change the register allocation since it clobbers eax and edx.
> Also, better register allocation doesn't make the asm code itself any
> faster, the instructions are the same no matter which register they use.
> Rather, it makes the code around the asm block faster. So, you would need
> to benchmark all the code that put_bits() is inlined into. How could that
> be done? You could benchmark the entire program, but I doubt slightly better
> code in put_bits() would be measurable against everything else.
rdtsc around mpeg1/4_encode_mb() should do, as *encode_mb() isn't inlined;
it's in a separate object, so it can't be ...
Also, you could use something like:
asm(
push eax edx
rdtsc
add elapsed time to global/static variable
pop edx eax
);
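A sketch of the first suggestion (timing around a function that cannot be inlined), assuming GCC or Clang on x86, where <x86intrin.h> provides __rdtsc(). encode_block is a hypothetical stand-in for mpeg1/4_encode_mb(), and the noinline attribute is used here to mimic the separate object file. Unlike the push/pop variant above, this does clobber eax/edx in the caller, but it cannot change register allocation inside the measured function.

```c
#include <stdint.h>
#include <x86intrin.h> /* __rdtsc(), provided by GCC and Clang on x86 */

/* Hypothetical stand-in for mpeg1/4_encode_mb(). noinline mimics the
 * function living in a separate object file. */
static __attribute__((noinline)) int encode_block(const int *blk, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += blk[i] * blk[i];
    return sum;
}

/* Global/static accumulators, as in the pseudocode above. */
static uint64_t total_cycles;
static unsigned long ncalls;

/* Wraps one call with two TSC reads and accumulates the difference.
 * rdtsc is not a serializing instruction, so short calls are only
 * approximately timed; average over many calls. */
static int timed_encode_block(const int *blk, int n)
{
    uint64_t start = __rdtsc();
    int r = encode_block(blk, n);
    total_cycles += __rdtsc() - start;
    ncalls++;
    return r;
}
```

Dividing total_cycles by ncalls at exit gives an average cycle count per call that can be compared before and after a register-allocation change.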
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Democracy is the form of government in which you can choose your dictator