[FFmpeg-devel] [PATCH] lavfi: VAAPI video processing filter

Wed Sep 14 21:37:43 EEST 2016

On 14/09/16 02:30, Jun Zhao wrote:
> On 2016/9/14 6:06, Mark Thompson wrote:
>> How about something like this, then?  It adds a new filter to do the video processing, while leaving the scale filter as-is.
> 
> Can we merge the VPP scale and the other VPP filters into one AVFilter, e.g. vf_postprocess_vaapi.c?
> If we split scale from the other VPP filters, I suspect there may be a performance issue:
> when scale and the other VPP filters are merged into one AVFilter, there is only one surface copy,
>  
>     1 input surface-> 1 output surface // one copy for scale/de-noise/sharpness/...
> 
> but if we split them, it will lead to two surface copies in some cases.
> 
>     1 input surface -> 1 output surface -> 2 output surface // 1st copy for scale, 2nd
>                                                             // copy for the other vaapi filters

Can you share what driver/platform you are testing on and what commands you are using to get the result that the combined filter is faster?

For example, I get (1080p H.264 input, current i965 on Skylake):

[With the patch to vf_scale_vaapi]

./ffmpeg_g -y -vaapi_device /dev/dri/renderD128 -hwaccel vaapi -hwaccel_output_format vaapi -i in.mp4 -an -vf 'format=vaapi|nv12,hwupload,scale_vaapi=denoise=50:w=1280:h=720' -c:v h264_vaapi -qp 20 out.mp4

-> 225fps.

[With the patch adding vf_process_vaapi]

./ffmpeg_g -y -vaapi_device /dev/dri/renderD128 -hwaccel vaapi -hwaccel_output_format vaapi -i in.mp4 -an -vf 'format=vaapi|nv12,hwupload,process_vaapi=denoise=50,scale_vaapi=w=1280:h=720' -c:v h264_vaapi -qp 20 out.mp4

-> 255fps.
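
For reference, the relative difference between those two figures can be worked out with a small helper. This is only a sketch: fps_change is a hypothetical name, not part of ffmpeg, and the only inputs are the two frame rates quoted above.

```shell
#!/bin/sh
# fps_change BASELINE_FPS NEW_FPS
# Prints the change from BASELINE_FPS to NEW_FPS as a percentage
# with one decimal place. (Hypothetical helper, not an ffmpeg tool.)
fps_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

# Combined scale_vaapi (225fps) vs. process_vaapi + scale_vaapi (255fps).
fps_change 225 255
```

So on this setup the split version comes out roughly 13% ahead of the combined one.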

I'm not sure why the separate filters are actually faster here, but I was certainly expecting them to be about the same.  Since we haven't introduced any additional synchronisation points in either sequence, it should all be fully pipelined in the batch buffer rings from the decoder to the encoder output.  I believe the argument about surfaces is specious, because the combined case needs the same intermediates and therefore internally allocates temporary surfaces for them.
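
To illustrate the pipelining point, here is a toy model rather than a measurement. The per-stage times are invented for illustration, and pipelined_fps and serial_fps are hypothetical helpers. The assumption is that in a fully pipelined chain the throughput is limited by the slowest stage, whereas a chain that waits for each stage to finish is limited by the sum of all stages.

```shell
#!/bin/sh
# Toy throughput model. Arguments are per-frame stage times in
# milliseconds. All numbers used below are made up for illustration.

# Fully pipelined: stages overlap, so the slowest stage sets the rate.
pipelined_fps() {
    printf '%s\n' "$@" |
        awk 'BEGIN { max = 0 } $1 > max { max = $1 } END { printf "%.0f\n", 1000 / max }'
}

# Fully serialised: every stage waits for the previous one to finish.
serial_fps() {
    printf '%s\n' "$@" |
        awk '{ sum += $1 } END { printf "%.0f\n", 1000 / sum }'
}

# Hypothetical stages: decode 2ms, denoise 1ms, scale 1ms, encode 4ms.
pipelined_fps 2 1 1 4   # split filters, pipelined
pipelined_fps 2 2 4     # denoise+scale merged into one 2ms stage
serial_fps 2 1 1 4      # same split stages with a sync after each one
```

In this model both pipelined cases give the same rate, because the encode stage dominates either way.  Splitting the filter only costs throughput if it adds a synchronisation point, which is the serial case.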

Thanks,

- Mark