[FFmpeg-devel] [FFmpeg-cvslog] lavu/qsv: make a copy as libmfx alignment requirement for uploading

Mark Thompson sw at jkqxz.net
Thu Oct 25 01:47:48 EEST 2018


On 24/10/18 21:45, Rogozhkin, Dmitry V wrote:
> On Tue, 2018-10-23 at 23:46 +0100, Mark Thompson wrote:
>> On 23/10/18 08:49, Li, Zhong wrote:
>>>> From: ffmpeg-devel [mailto:ffmpeg-devel-bounces at ffmpeg.org] On Behalf Of Mark Thompson
>>>> Sent: Sunday, October 14, 2018 12:36 AM
>>>> To: ffmpeg-devel at ffmpeg.org
>>>> Subject: Re: [FFmpeg-devel] [FFmpeg-cvslog] lavu/qsv: make a copy as libmfx alignment requirement for uploading
>>>>
>>>> On 11/10/18 06:38, Zhong Li wrote:
>>>>> ffmpeg | branch: master | Zhong Li <zhong.li at intel.com> | Mon Sep 17 19:16:44 2018 +0800| [681aa7d14f97fd98181ca6d61e11be48fe65692d] | committer: Zhong Li
>>>>>
>>>>> lavu/qsv: make a copy as libmfx alignment requirement for uploading
>>>>>
>>>>> Libmfx requires 16-byte-aligned input/output for uploading.
>>>>> Currently only the output is 16-byte aligned; the input is given the
>>>>> same width/height but actually has a smaller buffer, which will
>>>>> definitely cause a segmentation fault.
>>>>>
>>>>> Can reproduce with any 1080p nv12 rawvideo input:
>>>>> ffmpeg -init_hw_device qsv=qsv:hw -hwaccel qsv -filter_hw_device qsv \
>>>>>     -f rawvideo -pix_fmt nv12 -s:v 1920x1080 -i 1080p_nv12.yuv \
>>>>>     -vf 'format=nv12,hwupload=extra_hw_frames=16,hwdownload,format=nv12' \
>>>>>     -an -y out_nv12.yuv
>>>>>
>>>>> It can fix #7418
>>>>>
>>>>> Signed-off-by: Zhong Li <zhong.li at intel.com>
>>>>>
>>>>>>
>>>>>> http://git.videolan.org/gitweb.cgi/ffmpeg.git/?a=commit;h=681aa7d14f97fd98181ca6d61e11be48fe65692d
>>>>>
>>>>> ---
>>>>>
>>>>>  libavutil/hwcontext_qsv.c | 31 +++++++++++++++++++++++++++++--
>>>>>  1 file changed, 29 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/libavutil/hwcontext_qsv.c b/libavutil/hwcontext_qsv.c
>>>>> index 33e121b416..814ce215ce 100644
>>>>> --- a/libavutil/hwcontext_qsv.c
>>>>> +++ b/libavutil/hwcontext_qsv.c
>>>>> @@ -862,6 +862,10 @@ static int qsv_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
>>>>>      mfxSyncPoint sync = NULL;
>>>>>      mfxStatus err;
>>>>>      int ret = 0;
>>>>> +    /* make a copy if the input is not padded as libmfx requires */
>>>>> +    AVFrame tmp_frame, *src_frame;
>>>>> +    int realigned = 0;
>>>>> +
>>>>>
>>>>>      while (!s->session_upload_init && !s->session_upload && !ret) {
>>>>>  #if HAVE_PTHREADS
>>>>> @@ -887,16 +891,36 @@ static int qsv_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
>>>>>      if (ret < 0)
>>>>>          return ret;
>>>>>
>>>>> +
>>>>> +    if (src->height & 16 || src->linesize[0] & 16) {
>>>>
>>>> Doesn't this still need to check that the layout is compatible with the
>>>> other limitations that libmfx has before trying to do the upload?  In
>>>> particular, that the pitches for the different planes are actually all
>>>> the same - I certainly wouldn't expect them to be for YUV420P.
>>>
>>> Do you mean our previous discussion at
>>> https://lists.libav.org/pipermail/libav-devel/2018-April/086070.html ?
>>> If yes, I will try to reproduce it with a hwupload pipeline instead of
>>> a transcoding pipeline, and then send an updated patch.
>>
>> Yes.  You might need to write a small test program to generate an
>> arbitrary allocation pattern, since libavutil only has a fixed layout.
>>
>>>>
>>>> (IIRC there was a requirement 0 < frame->data[N] - frame->data[0] < 2^24
>>>> as well (or something like that, I might be wrong), which also need not
>>>> hold.)
>>>
>>> I don't know of such a requirement; can it be found in the MSDK guide?
>>
>> That came from discussion of hardware capabilities around VAAPI, I
>> think.
> 
> There are upper limits on the frame sizes which HW can support, and the
> limits can be different for different operations: for example, the AVC
> encoder supports 4K as a maximum, while JPEG can support 16K.  This
> depends on the component and may depend on the generation of hardware.
> In any case I don't think these kinds of limitations are something to
> be checked within ffmpeg during allocation.  It's the responsibility of
> the underlying implementation through which you try to allocate to
> return an error to ffmpeg if the allocation cannot be done, or the
> responsibility of the component (msdk or vaapi) to return an error if
> processing of the given surface cannot be done (due to per-component
> limitations).

I meant real hardware capabilities around layout, which are going to be a superset of anything which can actually be reported in any particular API (see below).

>>
>> Won't something like this be the reason for wanting a virtually
>> contiguous buffer?  If there isn't a restriction along these lines
>> then there is no need to make the buffer virtually
>> contiguous.  (After all, the hardware can't know about C undefined
>> behaviour in pointer comparisons...)
> 
> "Virtually contiguous" sounds dangerous to me. There is a discrepancy
> in treating layouts of color formats in ffmpeg and other libraries (and
> drivers). First of all I did not see clear document describing what
> ffmpeg thinks layout should be. What it seems to me is that ffmpeg
> treats color formats per-plane meaning that in NV12 Y and UV could be
> allocated absolutely separately.

Correct.  FFmpeg deliberately places little constraint on the real layout of an NV12 frame - the planes need not be in any particular order or place in memory, nor need the pitch have any particular value relative to the width.  The one constraint we do have is that each line start at an address which is a multiple of some defined platform-specific alignment (and therefore plane pitches are multiples of such a value as well).
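
For example, something like the following would be a perfectly valid NV12 frame as far as libavutil is concerned (an untested sketch, along the lines of the small test program I suggested above - the function name and the particular pitch alignments are just made up for illustration):

#include <libavutil/frame.h>

/* Build an NV12 AVFrame whose two planes live in completely separate
 * allocations with different pitches.  libavutil is fine with this as
 * long as each linesize meets the platform alignment; libmfx is not. */
static AVFrame *alloc_scattered_nv12(int width, int height)
{
    AVFrame *frame = av_frame_alloc();
    if (!frame)
        return NULL;

    frame->format = AV_PIX_FMT_NV12;
    frame->width  = width;
    frame->height = height;

    /* Deliberately different pitches and unrelated buffers for the two
     * planes - nothing requires them to match or to be adjacent. */
    frame->linesize[0] = FFALIGN(width, 64);
    frame->linesize[1] = FFALIGN(width, 128);
    frame->buf[0] = av_buffer_alloc(frame->linesize[0] * height);
    frame->buf[1] = av_buffer_alloc(frame->linesize[1] * (height / 2));
    if (!frame->buf[0] || !frame->buf[1]) {
        av_frame_free(&frame);
        return NULL;
    }
    frame->data[0] = frame->buf[0]->data;
    frame->data[1] = frame->buf[1]->data;

    return frame;
}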

>                                  While you work with SW components this
> might be OK, but HW defines pretty specifically what it expects, and
> typically the memory layout is supposed to be contiguous - I mean
> _physically_ contiguous.

While this is true in a lot of implementations, it isn't in the ones being considered here.  Modern Intel GPUs, including their video hardware, have an MMU with the same level of page access as the CPU does, so anything which is virtually contiguous from the CPU point of view certainly can be from the GPU as well.  Objects which are physically contiguous might be desirable in some marginal cases for performance (desirable ordering properties for RAM chips?), but certainly aren't necessary.

>                          Secondly, HW may impose other requirements on
> the layout, requiring particular alignments: width alignments (i.e.
> pitches), or height padding alignments.  As of now it seems that ffmpeg
> doesn't provide an architecture to satisfy all these needs.  This leads
> to overheads in terms of memory consumption, since additional "rightly
> allocated" memory is required to copy frames into, and eventually this
> implies processing overheads as well.

These real constraints are mostly unsatisfiable for reasonable use-cases of Intel GPUs.  For example, H.264 encoding on Skylake requires Y-tiled NV12 surfaces with a number of additional constraints, such as the chroma plane having the same pitch as the luma plane and being located at a line offset of at most 2^15-1.  This particular constraint is the one I was thinking of above: given a pitch of (say) 2048 = 2^11, the chroma plane can't be offset from the luma plane by more than 2^26 bytes, and in the positive direction only - that pretty much means you need to allocate them virtually contiguously, since planes allocated normally have no reason to satisfy that (especially with ASLR enabled).  (This comes from <https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol02a-commandreference-instructions.pdf> - MFX_SURFACE_STATE, beginning on page 818, is the relevant part.)
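
To make that arithmetic concrete, the sort of check I mean looks roughly like this (an untested sketch; the 15-bit bound is my reading of the MFX_SURFACE_STATE chroma offset field, so treat the exact constant as an assumption):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* With a single pitch shared by both planes and a 15-bit chroma line
 * offset, the chroma plane must start after the luma plane and within
 * pitch * (2^15 - 1) bytes of it - about 2^26 bytes for a 2048 pitch. */
static bool chroma_offset_ok(const uint8_t *luma, const uint8_t *chroma,
                             ptrdiff_t pitch)
{
    ptrdiff_t off = chroma - luma; /* only defined within one allocation */
    return off > 0 && off % pitch == 0 && off / pitch <= (1 << 15) - 1;
}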

The Y-tiling in particular makes it very difficult for any upload case - there is no reasonable software implementation of the address swizzle needed to write Y-tiled planes.  The one possible get-out here is to use the MMU to get a linear map of the Y-tiled surface, but in general this causes problems because such a mapping is uncached and therefore has very low performance for reads.  (You can still use it in FFmpeg with VAAPI if you want - the av_hwframe_map() function and the hwmap filter allow it, though it's only really usable for write-only surfaces.)
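
For reference, the write-only path via av_hwframe_map() looks roughly like this (untested sketch; it assumes the software frame already matches the frames context's sw_format and dimensions):

#include <libavutil/frame.h>
#include <libavutil/hwcontext.h>

static int upload_via_map(AVFrame *hw_frame, const AVFrame *sw_frame)
{
    AVFrame *map = av_frame_alloc();
    int ret;

    if (!map)
        return AVERROR(ENOMEM);

    /* Map for writing only and declare that the whole surface will be
     * overwritten, so no (slow, uncached) read of the existing tiled
     * contents is needed. */
    ret = av_hwframe_map(map, hw_frame,
                         AV_HWFRAME_MAP_WRITE | AV_HWFRAME_MAP_OVERWRITE);
    if (ret >= 0)
        ret = av_frame_copy(map, sw_frame);

    av_frame_free(&map); /* unmaps the surface */
    return ret;
}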

Given that, we have to make a copy somewhere in almost all cases.  I believe the change made here for libmfx is done so that the GPU can perform the copy itself, lifting the input data directly from CPU memory and therefore saving the CPU time which would be used to perform the same copy.  I can't find this particular case in the relevant source code, but I assume it will have similar constraints to the video case above - in particular, the chroma plane location relative to the luma plane is probably constrained in the same way (I admit I've extrapolated that from the virtual-contiguity requirement, but I don't see where else it would come from).

> Here is an example from gstreamer where they describe what they mean by
> their color formats:
> https://gstreamer.freedesktop.org/documentation/design/mediatype-video-raw.html
> It would be wonderful if ffmpeg could define something similar.  Note
> that in gstreamer you get a clear understanding of what the "default"
> NV12 memory layout is: contiguous memory where the UV plane is located
> right after the end of the Y plane, with no additional padding for
> width or height.

It's unclear what you are gaining from this being defined as the "default".  If that's defined in the API then it constrains the allocator to only return frames of that form (so you can't do things like optimising pitches to avoid cache way collisions), but offers little benefit to the user unless they happen to have input in exactly that form somehow (say because they read a raw file with identical layout).  (And it's irrelevant for any component, since they have to deal with the general case anyway.)
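
(For what it's worth, wrapping a raw file in exactly that layout doesn't need any "default" blessed by the API anyway - a rough sketch, using av_image_fill_arrays() with align = 1 to describe a tightly-packed NV12 buffer without copying it:)

#include <libavutil/imgutils.h>

/* Fill data[]/linesize[] so that they describe a tightly-packed raw NV12
 * buffer (pitch == width, UV immediately after Y, no height padding).
 * Returns the number of bytes the buffer must contain, or an error. */
static int wrap_packed_nv12(uint8_t *raw, int width, int height,
                            uint8_t *data[4], int linesize[4])
{
    return av_image_fill_arrays(data, linesize, raw,
                                AV_PIX_FMT_NV12, width, height, 1);
}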

> I have tried to find something similar for ffmpeg, but failed (except
> the ffmpeg source code of course).

The pixdesc descriptors (libavutil/pixdesc.[ch]) contain all the relevant information about pixel formats in a programmatic form, so some information about that could probably be auto-generated.  (Relatedly, that gets used to test which formats are representable on an OpenCL device with a given set of capabilities (e.g. CL_R and CL_RG in CL_UNORM_INT8 -> NV12 support).  Unfortunately very few APIs present this information in a machine-readable form, so generally the formats are hard-coded from fourccs or something like that.)
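
(A trivial, untested example of reading that information back out for NV12:)

#include <stdio.h>
#include <libavutil/pixdesc.h>

int main(void)
{
    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(AV_PIX_FMT_NV12);
    int i;

    /* Two planes, chroma subsampled 2x2, with both chroma components
     * interleaved in the second plane. */
    printf("%s: %d components, log2 chroma subsampling %dx%d\n",
           desc->name, desc->nb_components,
           desc->log2_chroma_w, desc->log2_chroma_h);
    for (i = 0; i < desc->nb_components; i++)
        printf("  component %d: plane %d, step %d, offset %d, depth %d\n",
               i, desc->comp[i].plane, desc->comp[i].step,
               desc->comp[i].offset, desc->comp[i].depth);
    return 0;
}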

Thanks,

- Mark

