[FFmpeg-devel] [Query] Issues with FFMPEG 7.1 and HW plugins
Kamboj, Nitin
Nitin.Kamboj at amd.com
Sun May 11 16:38:31 EEST 2025
Hi,
We have hit a problem while implementing a zero-copy, hardware-accelerated transcoding
pipeline with FFmpeg. The pipeline uses hardware acceleration to decode, scale, overlay, and
encode video streams. After the move to FFmpeg 7.1 the pipeline stalls, because the multiple
video streams fed into the filter graph are no longer kept in step with each other. Below we
describe the underlying problem, the workaround we tried, and the questions we still need
answered to refine the approach.
Pipeline Workflow:
-Two video sources are decoded into separate 360p and 4K streams.
-The 360p stream is scaled up using a hardware-accelerated scaler.
-Both streams are then combined using a hardware-accelerated overlay.
-The final overlay output is encoded into a final stream.
Use case:-
ffmpeg -hwaccel ... -i 360p60.264 -i 4kp60.264 -filter_complex \
"[IN1]scale_hw[TMP1];[IN2][TMP1]overlay_hw[OUT]" -map "[OUT]" ...
                                                 FILTERGRAPH
                                        +----------------------------+
+-------------+                         |    +--------+ 480p         |
|Decoder(360p)+----+                +-->+--->| Scaler +------+       |
+-------------+    |                |   |    +--------+      |       |
                   |    +-------+   |   |                    |       |
                   +--->| Queue +---+   |                    |       |
                   |    +-------+   |   |                    v       |
+-------------+    |                |   |               +---------+  |     +---------+
|Decoder(4k)  +----+                +-->+-------------->| Overlay +--+---->| Encoder |
+-------------+                         |               +---------+  |     +---------+
                                        +----------------------------+
Issue:-
The above HW-accelerated use case works fine on ffmpeg 6.x but stalls
on ffmpeg n7.1 (after migration).
With all-SW plugins this use case works fine on n7.1 as well, but using
HW-accelerated plugins (for decode, scale, overlay and encode) causes
the pipeline to stall.
Limitations:-
All the HW-accelerated plugins have limited output frame pools
(allocated at init). The HW memory is limited, so these must
be kept as small as possible. If a plugin runs out of frames in
its frame pool, it waits for someone downstream to consume/free
a frame it has sent out.
Cause:-
As per our detailed analysis and current understanding of the
ffmpeg 7.1 application changes (multithreading support),
the issue arises because the outputs of both decoders are fed to the
filter graph through a common thread queue.
Since the 2 decoders are independent and run on separate threads, the
decoder(360p) can generate many frames before the other
decoder(4K) generates a single frame. The overlay plugin needs
at least one frame on both inputs to proceed, so many HW frames
get buffered on one input (scale->overlay) of the overlay filter.
We then run out of free HW frames and the pipeline stalls.
Detailed Explanation:-
1. Frame Pools Involved.
a. Decoder(360p) out pool (frames consumed/freed by scaler)
b. Decoder(4k) out pool (frames consumed/freed by overlay,
when both inputs are available and an output is ready)
c. Scaler out pool (frames consumed/freed by overlay, when
both inputs are available and an output is ready)
d. Overlay out pool (frames consumed by encoder)
2. Execution model of scale_hw plugin.
The plugin runs completely on the filtergraph's thread. Assume the
out pool has a size N. On receiving an input frame, the plugin first
tries to allocate a frame from the out pool.
If no frame is available the filtergraph's thread will block, which
means the overlay filter will never get a chance to run and consume
a frame, and we have a deadlock. Thus we must ensure the scaler is
never asked to process a frame while all N of its output frames are
still waiting at the overlay.
3. Filtergraph Execution model.
All the filters in a filtergraph run on the same thread.
Every filterlink has an unbounded queue to
buffer inputs. For multi-input filters, activate
is only called when all inputs are available.
4. FFmpeg application scheduler (ffmpeg_sched.c)
Filtergraphs with multiple inputs have a single queue
into which the frames from all decoders are written.
Even though the filtergraph has a concept of best input, only the top
entry of the queue is fed to the filtergraph, making the
best input a mere suggestion and not a binding request. This means
a faster demuxer/decoder combo can flood one input of the
filtergraph, causing the pipeline to stall as we run out of HW
frames.
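The interaction of points 2 to 4 can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: the function name and frame labels are hypothetical, only the scaler out pool is modelled (the decoder and overlay pools are ignored), and frames reach the graph strictly in the order the decoders produced them.

```python
from collections import deque

def run_graph(arrivals, pool_size):
    """Feed frames to a toy single-threaded filtergraph in arrival order.

    arrivals: sequence of "360p" / "4k" labels, one per decoded frame.
    Returns ("ok", overlays_done) or ("stall", overlays_done).
    """
    free_scaler_frames = pool_size   # scaler out pool of size N
    scaled = deque()                 # scaler outputs waiting at the overlay
    main = deque()                   # 4K frames waiting at the overlay
    done = 0
    for src in arrivals:
        if src == "360p":
            if free_scaler_frames == 0:
                # A real plugin would block here, on the only graph thread.
                return "stall", done
            free_scaler_frames -= 1
            scaled.append(src)
        else:
            main.append(src)
        # The overlay runs only when both inputs hold a frame.
        while scaled and main:
            scaled.popleft()
            main.popleft()
            free_scaler_frames += 1  # scaled frame goes back to the pool
            done += 1
    return "ok", done
```

With N = 2, an interleaved arrival order such as `["360p", "4k"] * 4` completes, while `["360p"] * 3 + ["4k"] * 3` stalls on the third 360p frame before the overlay has produced anything.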
Workarounds tried:-
Modify ffmpeg_sched.c to use multiple queues (1 for every filtergraph
input) and always feed the best requested input to filtergraph.
(Diff versus n7.1 attached)
Side-Effect
This works for most cases but fails when a single file with
multiple streams is fed to separate decoders and a filter inside the
filtergraph then tries to combine those two inputs.
If the streams inside the file are not properly interleaved, such a
use case causes a circular deadlock with this workaround.
Specifically, the FATE test filter-overlay-dvdsub-2397 hangs.
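The idea behind the workaround (one queue per filtergraph input, feeding only the input the graph asks for) can be sketched as follows. This is not the attached patch: the class and method names are hypothetical and the model ignores threading entirely.

```python
from collections import deque

class PerInputFeeder:
    """Toy model: one queue per filtergraph input; only the requested input is fed."""

    def __init__(self, inputs):
        self.queues = {name: deque() for name in inputs}

    def push(self, name, frame):
        """A decoder delivers a frame for the input called `name`."""
        self.queues[name].append(frame)

    def pop_requested(self, best_input):
        """Return the next frame for the input the graph asked for, or None if
        that decoder has produced nothing yet. Frames queued on the other
        inputs stay where they are."""
        q = self.queues[best_input]
        return q.popleft() if q else None
```

In this model a fast decoder only fills its own queue and never pushes frames into the graph unasked. It also hints at the side effect: if the requested input can only be filled after the producer of the other input makes progress, as with a poorly interleaved file read by a single demuxer, `pop_requested` keeps returning None and nothing moves.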
Questions:-
1. There is a scheduling concept in ffmpeg_sched.c, but it is not very strict
and it works by looking at timestamps at the muxer, not the demuxer or decoder.
Obviously, this is needed to utilize multiple threads fully. But is it
possible to modify the scheduler to be more strict?
For example, add a constraint that one decoder will never run ahead
of any other decoder by more than K frames.
2. If we modify the filter to return TRYAGAIN instead of waiting when its output
buffer pool is empty, and then flush out all available outputs at the
next activation, will the filtergraph mechanism be able to handle such
a filter?
What would the activate function of such a filter look like?
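To make the two questions concrete, here is a toy model of what we have in mind. It is only a sketch: every name is hypothetical, NOT_READY is a stand-in for the "try again" status, and nothing here uses the real libavfilter or ffmpeg_sched.c interfaces.

```python
from collections import deque

def may_decode(frames_out, decoder, k):
    """Question 1: allow `decoder` to produce another frame only if that would
    not put it more than k frames ahead of the slowest decoder.
    frames_out maps a decoder name to the number of frames produced so far."""
    return frames_out[decoder] + 1 - min(frames_out.values()) <= k


class NonBlockingScaler:
    """Question 2: a filter that defers work instead of blocking on its pool."""

    NOT_READY = "not ready"  # stand-in for the "try again" status

    def __init__(self, pool_size):
        self.free = pool_size
        self.pending = deque()  # inputs accepted but not yet processed

    def activate(self, frame=None):
        """Queue `frame` if given, then emit as many outputs as the pool allows."""
        if frame is not None:
            self.pending.append(frame)
        out = []
        while self.pending and self.free:
            self.free -= 1
            out.append("scaled:" + self.pending.popleft())
        return out if out else self.NOT_READY

    def release(self):
        """Called when downstream frees one output frame."""
        self.free += 1
```

One open point with the second part: the deferred inputs in `pending` are still HW frames from the decoder out pool, so this may only move the pressure upstream rather than remove it.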
Regards,
Nitin Kamboj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Make-multiple-queue-changes.patch
Type: application/octet-stream
Size: 4008 bytes
Desc: 0001-Make-multiple-queue-changes.patch
URL: <https://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20250511/1f4bd4a7/attachment.obj>