[Ffmpeg-devel] [PATCH] lowres chroma bug

Thu Feb 8 22:32:15 CET 2007

Hi

On Thu, Feb 08, 2007 at 05:32:58AM -0800, Trent Piepho wrote:
> On Thu, 8 Feb 2007, Michael Niedermayer wrote:
> > On Wed, Feb 07, 2007 at 03:51:22PM -0800, Trent Piepho wrote:
> > > >
> > > > last time I compared hardcoded registers with gcc-chosen ones, the latter
> > > > were slower (that was in cabac.h in case you want to prove me wrong, I'd
> > > > be happy if we could get rid of the hardcoded registers there ...)
> > >
> > > It's going to depend a lot on how the code is used.  If your asm will only
> > > appear in one place, i.e. it's neither a macro nor an inlined function nor
> > > in an unrolled loop, etc., then you could just let gcc pick a register and
> > > then go back and hardcode that same register.  That should generate the
> > > exact same code.
> >
> > I agree, it should, but I'm not so sure it really does if you need
> > additional dummy variables for the gcc-chosen register case ...
> 
> You can see in the resulting code that gcc doesn't generate any loads or
> stores to the dummy variable, or even allocate any stack space for it.
> 
> > > The advantage comes when the code is a macro or inlined in multiple places.
> > > With a hard coded register, the same register must be used each time.  If
> > > you let gcc choose, it can pick different registers depending on the
> > > context.  In this case, no matter what register you pick, you may do worse
> > > than letting gcc pick.
> >
> > in theory yes, in practice I don't have that much faith in gcc's ability to
> > select registers better than random assignment would, and forcing
> > input operands to always be in the same register, compared to random ones,
> > can avoid some instructions
> 
> At least in simple cases, it is easy to see the gcc register assignment is
> much better than random.  Here's an example:
> #include <strings.h>   /* bzero() */
> void bar(int);         /* defined elsewhere */
> int foo()
> {
>     int a, b;
>     void *d, *s;
> 
>     asm("# a = %0, b = %1" : "=r"(a), "=r"(b));     /*block 1*/
>     bar(a);
>     asm("# read a = %0 b = %1" :: "r"(a), "r"(b));
> 
>     asm("# s = %0, d = %1" : "=r"(s), "=r"(d));     /*block 2*/
>     bzero(d, 32);
> 
>     asm("# a = %0, b = %1" : "=r"(a), "=r"(b) : "r"(s)); /*block 3*/
>     return a;
> }
> 
> In block 1, a and b need to keep their values across the call to bar().
> gcc generates:
>         # a = %ebx, b = %esi    # a, b
>         pushl   %ebx    # a
>         call    bar     #
>         # read a = %ebx b = %esi        # a, b
> 
> It chose ebx and esi because those are callee-saved registers and do not
> need to be saved and re-loaded across the call to bar().  If the call to
> bar() is commented out, it will choose edx and eax instead.
> 
> In block 2, gcc will emit an inline version of bzero using rep stosl, which
> must write to the address edi, and so gcc will assign edi to d.  Change the
> bzero to use s or a or b, and then that variable will be assigned edi.
> Comment out the bzero, and gcc will just use eax/edx.
> 
> In block 3, a is the return value of the function and so will be put in eax
> since that's where the return value needs to go.  Change the function to
> return b, and then b will get put in eax.

all nice, but why does it not work in practice (cabac.h)? I've tried to change
various registers to gcc-selected ones but the code was always slower, and the
hardcoded registers in current cabac.h are just random
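
To make the comparison concrete, here is a minimal sketch of the two styles
being discussed. It is not taken from cabac.h; the function names add_fixed
and add_chosen are made up for illustration. The first pins the operand to
eax with the "a" constraint, the second uses "r" and lets gcc pick. On
non-x86 targets the sketch falls back to plain C so that it still compiles.

```c
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)

/* Hardcoded register: "+a" forces x into eax at every inlined call site. */
static inline uint32_t add_fixed(uint32_t x, uint32_t y)
{
    __asm__ ("addl %1, %0" : "+a"(x) : "r"(y) : "cc");
    return x;
}

/* gcc-chosen register: "+r" lets the allocator pick per call site. */
static inline uint32_t add_chosen(uint32_t x, uint32_t y)
{
    __asm__ ("addl %1, %0" : "+r"(x) : "r"(y) : "cc");
    return x;
}

#else

/* Fallback for non-x86 targets: same results, no inline asm. */
static inline uint32_t add_fixed(uint32_t x, uint32_t y)  { return x + y; }
static inline uint32_t add_chosen(uint32_t x, uint32_t y) { return x + y; }

#endif
```

Both variants compute the same value; any difference would only show up in
the moves gcc has to emit around the asm at each call site, which can be
checked by comparing the output of gcc -S for callers of each.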

> 
> > > Like the inlined put_bits() function in bitstream.h, I think you would get
> > > better code if the eax wasn't hardcoded.
> >
> > well, benchmark it and send a patch if it's faster
> 
> I have no idea how to benchmark that function.  Adding an rdtsc to the code
> will totally change the register allocation since it clobbers eax and edx.
> Also, better register allocation doesn't make the asm code itself any
> faster, the instructions are the same no matter which register they use.
> Rather, it makes the code around the asm block faster.  So, you would need
> to benchmark all the code that put_bits() is inlined into.  How could that
> be done?  You could benchmark the entire program, but I doubt a bit better
> code in put_bits() would be measurable against everything else.

rdtsc surrounding mpeg1/4_encode_mb() should do, as *encode_mb() isn't inlined;
it's in a separate object so it can't be ...
also you could use something like:

asm(
    push eax edx
    rdtsc
    add elapsed time to global/static variable
    pop edx eax
);
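
A compilable sketch of the first suggestion (timing around a non-inlined
call) could look roughly like this. The names elapsed_ticks, read_ticks,
encode_mb_stub and timed_encode_mb are hypothetical, and encode_mb_stub is
only a stand-in for the real *encode_mb(). Unlike the push/pop pseudocode
above, this lets gcc deal with eax/edx itself, which should be harmless here
only because the timed region is a call to a non-inlined function. On non-x86
targets it falls back to clock(), which is not a cycle counter.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical global accumulator for the measured time. */
static uint64_t elapsed_ticks;

static inline uint64_t read_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    /* rdtsc writes edx:eax; listing both as outputs tells gcc they change. */
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    return (uint64_t)clock();   /* fallback, coarse and not in cycles */
#endif
}

/* Stand-in for a function that is not inlined, like *encode_mb(). */
__attribute__((noinline)) static int encode_mb_stub(int n)
{
    volatile int acc = 0;       /* volatile keeps the loop from vanishing */
    int i;
    for (i = 0; i < n; i++)
        acc += i;
    return acc;
}

/* Wrap the call with two counter reads and accumulate the difference. */
static int timed_encode_mb(int n)
{
    uint64_t start = read_ticks();
    int ret = encode_mb_stub(n);
    elapsed_ticks += read_ticks() - start;
    return ret;
}
```

Reading elapsed_ticks after a full encode, with and without the change under
test, gives the numbers to compare.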

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Democracy is the form of government in which you can choose your dictator