[FFmpeg-devel] [PATCH 6/6] ffv1enc_vulkan: switch to receive_packet
Jerome Martinez
jerome at mediaarea.net
Sun Nov 24 17:51:55 EET 2024
On 24/11/2024 at 04:41, Lynne via ffmpeg-devel wrote:
> On 11/23/24 23:10, Jerome Martinez wrote:
>
>> On 23/11/2024 at 20:58, Lynne via ffmpeg-devel wrote:
>>> This allows the encoder to fully saturate all queues the GPU
>>> has, giving a good 10% in certain cases and resolutions.
>>
>>
>> Using an RTX 4070:
>> +50% (!!!) with 2K 10-bit content.
>> +17% with 4K 16-bit content.
>> Also, 2K content now encodes at 4x the speed of 4K content, which
>> matches the SW encoder (with a similar slice count) and is the
>> expected result; it seems a bottleneck with smaller resolutions has
>> been removed.
>>
>>
>> Unfortunately, it has a drawback: 6K5K content, which was handled
>> well without this patch, now fails immediately:
>> [vost#0:0/ffv1_vulkan @ 0x10467840] [enc:ffv1_vulkan @ 0x12c011c0]
>> Error submitting video frame to the encoder
>> [vost#0:0/ffv1_vulkan @ 0x10467840] [enc:ffv1_vulkan @ 0x12c011c0]
>> Error encoding a frame: Cannot allocate memory
>> [vost#0:0/ffv1_vulkan @ 0x10467840] Task finished with error code:
>> -12 (Cannot allocate memory)
>> [vost#0:0/ffv1_vulkan @ 0x10467840] Terminating thread with return
>> code -12 (Cannot allocate memory)
>>
>> This is a problem, as the handling of 6K5K was good on the RTX 4070
>> (3x faster than a CPU at the same price) before this patch.
>> Is it possible to keep the handling of bigger resolutions on such a
>> card while retaining the performance boost of this patch?
>
>
> To an extent. At high resolutions, -async_depth 0 (maximum) harms
> performance. I get the best results with it set to 2 or 3 for 6K
> content, on my odd setup.
> Increasing async_depth increases the amount of VRAM used, so that's
> the tradeoff.
> Detecting it automatically is difficult, as Vulkan doesn't give you
> metrics on how much free VRAM there is, so there's nothing we can do
I am torn between a default with as much performance as possible and a
default that is guaranteed to work (a default value of 1 is OK for the
6K5K content on the RTX 4070, not 2).
Surprisingly, the default async_depth works on 4K (51 MiB) while
async_depth 2 does not work on 6K5K (183 MiB), but I don't know what
the value of nb_queues is.
Maybe the real use case is a user handling 6K5K on the biggest GPU
available, so it does not hurt much to have a default that fails with
such big content.
The encoder catches the allocation error and prints a nice message;
wouldn't it be possible to automatically reduce async_depth and retry
instead of returning the error immediately, when async_depth is not
provided, and only error out if -async_depth 1 does not work?
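Something like this rough sketch, just to illustrate the idea (this is
not the actual ffv1enc_vulkan code; allocate_exec_contexts() and the
context fields are placeholder names):

/* Sketch only: if the user did not set -async_depth explicitly, fall
 * back to smaller values on ENOMEM instead of failing the encode. */
#include "libavcodec/avcodec.h"
#include "libavutil/error.h"
#include "libavutil/log.h"

static int setup_with_fallback(AVCodecContext *avctx,
                               HypotheticalVkEncContext *ctx)
{
    int ret;

    for (;;) {
        /* placeholder for the real per-queue/exec allocation */
        ret = allocate_exec_contexts(avctx, ctx->async_depth);
        if (ret != AVERROR(ENOMEM))
            return ret;
        /* respect an explicitly requested depth; only fall back when
         * async_depth was left at its default, and stop at 1 */
        if (ctx->user_set_async_depth || ctx->async_depth <= 1)
            return ret;
        ctx->async_depth--;
        av_log(avctx, AV_LOG_WARNING,
               "Allocation failed, retrying with async_depth=%d\n",
               ctx->async_depth);
    }
}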
> other than to document it and hope users follow the instructions in
> case they run out of memory.
If trying smaller values automatically is not possible, is it possible
to add "use -async_depth with a value smaller than (the current value)"
to the error message?
> The good news is that -async_depth 1 uses less VRAM than before this
> patch.
> Most of the VRAM used comes from somewhere within Nvidia's black-box
> driver, as RADV uses 1/3rd of the VRAM for the same content and
> async_depth settings. Nothing we can do about this either.
>
>
>>> This also improves error resilience if an allocation fails,
>>> and properly cleans up after itself if it does.
>>
>> It looks like this part does not work; there is still a freeze if an
>> allocation fails.
>
>
> This is due to Nvidia's drivers. If you switch to using their GSP
> firmware, recovery is pretty much instant.
That is beyond my knowledge, and it does not make things worse, so it
is not blocking.