[MPlayer-dev-eng] vo_gl PBO patch ..
Sven Gothel
sgothel at jausoft.com
Wed Apr 30 23:48:40 CEST 2008
On Wednesday 30 April 2008 14:44:40 Reimar Döffinger wrote:
> On Wed, Apr 30, 2008 at 02:05:57PM -0600, Sven Gothel wrote:
> > On Wednesday 30 April 2008 12:29:13 Reimar Döffinger wrote:
> > > On Tue, Apr 29, 2008 at 01:44:32PM -0600, Sven Gothel wrote:
> > > > version 4 .. reflecting Reimar's suggestions.
> > >
> > > I still have some problems understanding the need for most of those
> > > changes.
> > > While it is really ugly for several reasons and I will have to clean it up
> > > once I have a bit more time,
> > Well, we have a different idea what is more complicated and what is not,
> > that's for sure :)
>
> Well, the criteria I used is simple: my patch adds about 10 non-trivial
> lines of code, yours 50.
>
> > My patch just adds a little PBO helper tool,
> > to ease the PBO usage, utilizing a bit OO style,
> > so we don't have to bother with all the floating around variables.
>
> Reducing the number of global variables is a good goal, but doing this
> in a patch that fixes a performance issue makes it really hard to
> understand where the performance issue is, what causes it, and what is
> the simplest way to fix it.
It depends. Proper management for multiple PBO's indicates a little toolkit.
But .. maybe I am completly wrong, possible,
and the whole thing I did is not necessary - we will see.
>
> > The new feature adds a ways to enable DMA xfers.
> > The first DMA is video -> PBO buffer, utilizing memcpy.
> > The 2nd is the GPU's DMA xfer using the PBO for glTexSubImage2D.
>
> How do you define DMA? I learned DMA as data transfer between (usually)
> main memory and a peripheral device under the control of the device,
> thus leaving the CPU free.
> To me, the memcpy used in this case seems to fail all these criteria...
DMA can be any memory transfer not handled by the CPU in PIO mode.
On a system, there are many DMA controller available.
The GPU one usually is able to handle:
GPUMem <-> SystemMem
GPUMem <-> GPUMem
maybe even (AMD GPU's can do this for sure):
SystemMem <-> SystemMem
The system memory controller under some architectures is able to do the same,
at least SystemMem <-> SystemMem.
This maybe utilized by memcpy, to my knowledge it is on ia/x86 platforms.
Even though memcpy would be performed in PIO mode,
using PBO's for glTexSubImage ensures at least this one is using DMA.
>
> > The impact is a performance benefit especially for HD content,
> > from around 80% -> 20%, factor 4, on AMD drivers.
>
> This only restates what the whole patch achieves, but it does not
> clarify _which parts_ of the patch are relevant for this.
Right, let's drop the details of the patch and
let's discuss the issue itself.
> E.g. the current PBO handling code uses 1 PBO for YV12 mode, your patch
> changes this to use 3. Does this have any effect on speed?
No.
It just ensures, that we setup the PBO only once.
Of course, instead of the 3 (full-size + 2 half-size),
you may create and use the fullsize only, since they are used sequentially.
That would be fine .. IMHO.
> And what about the changes to draw_slice? Do they improve speed?
Yes.
It ensures at least one guaranteed DMA transfer (see above).
> Was that code tested at all? May it actually be slower that the original
> code?
Actually this path is not that critical, at least not with my test cases.
Ie. I do not have HD mpeg content to decode with mpeg12,
ffmpeg12 doesn't use slicing, I have observed.
Regarding the slicing part, I see that you always copy the whole video frame.
So we may can drop the previous slices/copies and do it only on the whole image,
ie when y+slice==h ?
> This case is less critical since the new code can be disabled,
Right.
> but
> having all these changes to code paths that are used in very different
> situations in one single patch is likely to find out where issues are if
> any arise.
I totally agree that we have to test and discuss all of this,
that's why we love code reviews.
And having more test data/information, ie HD codecs, etc,
I like to run a few test with those as well.
However, I believe my patch doesn't change anything fundamental/functional,
and it is optional anyway -> no impact with the default pbodma=0.
The latter stament should be verified first (IMHO),
because it is important that I don't break anything, of course.
And let's drive our performance discussion forward ..
it's fun to do so.
>
> > > could you please use attached patch and
> > > tell me what it misses in functionality compared to yours?
> >
> > I do.
> >
> > As far I understand your patch,
> > you perform a memcpy, ie DMA xfer, to shape the mp_image_t,
> > so you get rid of the GL_UNPACK_ROW_LENGTH trouble.
> > But this should be redundant, since this PBO case hits if you
> > are already in charge of the texture and video buffer, ie. get_image.
> >
> > Your patch would still use slow PIO in glTexSubImage,
> > because of the lack of direct PBO in get_image.
> > (see below)
>
> I'd prefer if you test, because it should be exactly the other way
> round,
Well, at least I have catched that 'GL_UNPACK_ROW_LENGTH fix' :)
> it should be faster (and with my nVidia card vo CPU usage does
> drop from 14% to 6%) but it should not fix the GL_UNPACK_ROW_LENGTH
> issue (though it probably is trivial to fix).
I see.
But NVidia's driver is already using DMA xfer's for a simple texture upload,
even without PBO's.
The GL_UNPACK_ROW_LENGTH ATI bug currently only appears for NPOT strides
using PBO's.
Normal texture uploads just works.
>
> > > Another thing is probably the GL_UNPACK_ROW_LENGTH support you said ATI drivers
> > > are lacking, if this still causes a problems with this patch I'd need a bit
> > > more information on this.
> >
> > Well, it seems it is simply a bug in the current ATI driver,
> > which enforces us to use multiple memcpy, ie. shape the PBO memory
> > so no NPOT stride happen - even though the driver claims to support NPOT.
>
> What is NPOT supposed to mean? My only interpretation of NPOT would be
> non-power-of-two
right. Actually they use it for non power of two textures,
but we get the drift, right.
> but that makes no sense. Do you mean it requires 4-byte
> alignment or something like that?
Nope not a n-byte alignment,
but a required dimension, where each side is = pow(2,n).
You know, the so very old texture requirement.
> In that case, the problem is more
> likely to be in the GL_UNPACK_ALIGNMENT handling...
that's what I tried to communicate - but only related to PBO xfers.
This problem is only an issue using PBO's for tex uploads
and is 'just' a bug for ATI cards, hence the pbodma option
for multiple memcpy.
The root problem is the slow texture upload on, let's say,
non NVidia cards/driver, which just don't use DMA implicit.
Cheers, Sven
>
> Greetings,
> Reimar Döffinger
More information about the MPlayer-dev-eng
mailing list