[FFmpeg-devel] [RFC] New swscale internal design prototype

Sun Mar 9 00:53:42 EET 2025

Hi all,

for the past two months, I have been working on a prototype for a radical
redesign of the swscale internals, specifically the format handling layer.
This includes, or will eventually expand to include, all format input/output
and unscaled special conversion steps.

I am not yet at a point where the new code can replace the scaling kernels,
but in theory we could start using it for the simple unscaled cases right
away.

Rather than repeating the entire design here, I have collected my notes into
a design document on my WIP branch:

https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt

I have spent the past week or so ironing out the last kinks and extensively
benchmarking the new design, at least on x86. Overall it is roughly a 1.9x
improvement (single-threaded) over the existing unscaled special converters,
before even adding any hand written ASM. (This speedup is *just* using the
less-than-optimal compiler output from my reference C code!)

In some cases we measure ~3-4x or even ~6x speedups, especially in cases
where swscale does not currently have hand written SIMD. Overall:

cpu: 16-core AMD Ryzen Threadripper 1950X
gcc 14.2.1:
   single thread:
     Overall speedup=1.887x faster, min=0.250x max=22.578x
   multi thread:
     Overall speedup=1.657x faster, min=0.190x max=87.972x

(The ~0.2x cases, i.e. the slowdowns, are for rgb8/gbr8 input, which requires
 LUT support for efficient decoding, but I wanted to focus on the core
 operations first before worrying about adding LUT-based optimizations to the
 design.)
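
To illustrate what LUT-based decoding means for a format like rgb8, here is a
minimal sketch. It assumes rgb8 is packed 3:3:2 with red in the top bits, and
it expands each byte to 8 bits per channel through a 256-entry table. The type
and function names (Rgb24, expand_bits, build_rgb332_lut, decode_rgb332) are
hypothetical and are not taken from the branch or from swscale.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical output pixel: one byte per channel. */
typedef struct { uint8_t r, g, b; } Rgb24;

/* Scale a `bits`-wide value to the full 0..255 range, with rounding. */
static uint8_t expand_bits(unsigned v, unsigned bits)
{
    unsigned max = (1u << bits) - 1;
    return (uint8_t)((v * 255u + max / 2) / max);
}

/* Precompute the decoded pixel for every possible input byte.
 * Assumed layout: (msb) 3 bits R, 3 bits G, 2 bits B (lsb). */
static void build_rgb332_lut(Rgb24 lut[256])
{
    for (unsigned i = 0; i < 256; i++) {
        lut[i].r = expand_bits((i >> 5) & 7, 3);
        lut[i].g = expand_bits((i >> 2) & 7, 3);
        lut[i].b = expand_bits(i & 3, 2);
    }
}

/* Decoding is then a single table lookup per pixel, with no
 * per-pixel shifting, masking or scaling. */
static void decode_rgb332(const Rgb24 lut[256], const uint8_t *src,
                          Rgb24 *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = lut[src[i]];
}
```

The table is built once per conversion context and then reused for every
line, which is why this kind of input tends to favour a lookup over
arithmetic unpacking.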

I am (almost) ready to begin moving forwards with this design, merging it into
swscale and using it at least for unscaled format conversions, XYZ decoding,
colorspace transformations (subsuming the existing, horribly unoptimized,
3DLUT layer), gamma transformations, and so on.

I wanted to post it here to gather some feedback on the approach. Where does
it fall on the "madness" scale? Is the new operations and optimizer design
comprehensible? Am I trying too hard to reinvent compilers? Are there any
platforms where the high number of function calls per frame would be
prohibitively expensive? What are the thoughts on the float-first approach? See
also the list of limitations and improvement ideas at the bottom of my design
document.

Thanks for your time,
Niklas