[MPlayer-dev-eng] vo_gl PBO patch ..

Wed Apr 30 22:44:40 CEST 2008

On Wed, Apr 30, 2008 at 02:05:57PM -0600, Sven Gothel wrote:
> On Wednesday 30 April 2008 12:29:13 Reimar Döffinger wrote:
> > On Tue, Apr 29, 2008 at 01:44:32PM -0600, Sven Gothel wrote:
> > > version 4 .. reflecting Reimar's suggestions.
> > 
> > I still have some problems understanding the need for most of those
> > changes.
> > While it is really ugly for several reasons and I will have to clean it up
> > once I have a bit more time,
> Well, we have a different idea what is more complicated and what is not,
> that's for sure :)

Well, the criteria I used is simple: my patch adds about 10 non-trivial
lines of code, yours 50.

> My patch just adds a little PBO helper tool,
> to ease the PBO usage, utilizing a bit OO style,
> so we don't have to bother with all the floating around variables.

Reducing the number of global variables is a good goal, but doing this
in a patch that fixes a performance issue makes it really hard to
understand where the performance issue is, what causes it, and what is
the simplest way to fix it.

> The new feature adds a ways to enable DMA xfers.
> The first DMA is video -> PBO buffer, utilizing memcpy.
> The 2nd is the GPU's DMA xfer using the PBO for glTexSubImage2D.

How do you define DMA? I learned DMA as data transfer between (usually)
main memory and a peripheral device under the control of the device,
thus leaving the CPU free.
To me, the memcpy used in this case seems to fail all these criteria...

> The impact is a performance benefit especially for HD content,
> from around 80% -> 20%, factor 4, on AMD drivers.

This only restates what the whole patch achieves, but it does not
clarify _which parts_ of the patch are relevant for this.
E.g. the current PBO handling code uses 1 PBO for YV12 mode, your patch
changes this to use 3. Does this have any effect on speed?
And what about the changes to draw_slice? Do they improve speed?
Was that code tested at all? May it actually be slower that the original
code?
This case is less critical since the new code can be disabled, but
having all these changes to code paths that are used in very different
situations in one single patch is likely to find out where issues are if
any arise.

> > could you please use attached patch and 
> > tell me what it misses in functionality compared to yours?
> 
> I do.
> 
> As far I understand your patch,
> you perform a memcpy, ie DMA xfer, to shape the mp_image_t, 
> so you get rid of the GL_UNPACK_ROW_LENGTH trouble.
> But this should be redundant, since this PBO case hits if you 
> are already in charge of the texture and video buffer, ie. get_image.
> 
> Your patch would still use slow PIO in glTexSubImage,
> because of the lack of direct PBO in get_image.
> (see below)

I'd prefer if you test, because it should be exactly the other way
round, it should be faster (and with my nVidia card vo CPU usage does
drop from 14% to 6%) but it should not fix the GL_UNPACK_ROW_LENGTH
issue (though it probably is trivial to fix).

> > Another thing is probably the GL_UNPACK_ROW_LENGTH support you said ATI drivers
> > are lacking, if this still causes a problems with this patch I'd need a bit
> > more information on this.
> 
> Well, it seems it is simply a bug in the current ATI driver,
> which enforces us to use multiple memcpy, ie. shape the PBO memory
> so no NPOT stride happen - even though the driver claims to support NPOT.

What is NPOT supposed to mean? My only interpretation of NPOT would be
non-power-of-two but that makes no sense. Do you mean it requires 4-byte
alignment or something like that? In that case, the problem is more
likely to be in the GL_UNPACK_ALIGNMENT handling...

Greetings,
Reimar Döffinger