[FFmpeg-devel] [PATCH] swscale/aarch64/rgb2rgb_neon: Implemented uyvytoyuv422
Krzysztof Pyrkosz
ffmpeg at szaka.eu
Tue Feb 11 23:24:50 EET 2025
On Mon, Feb 10, 2025 at 03:15:35PM +0200, Martin Storsjö wrote:
> > Just as I'm about to send this patch, I'm thinking if non-interleaved
> > read followed by 4 invocations of TBL wouldn't be more performant. One
> > call to generate a contiguous vector of u, second for v and two for y.
> > I'm curious to find out.
>
> My guess is that it may be more performant on more modern cores, but
> probably not on older ones.
That's the case. The TBL variant is 15% faster on A78 and twice as slow on A72.
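For readers following along, here is a rough scalar C model of the TBL idea, not the NEON code from the patch. It assumes a 64-byte UYVY block (32 pixels, byte order U Y V Y) used as a four-register table, and the helper names tbl64 and uyvy_block_to_planes are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a four-register TBL: every output byte picks
 * table[idx[i]]; an out-of-range index yields 0, as TBL does. */
static void tbl64(uint8_t out[16], const uint8_t table[64],
                  const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = idx[i] < 64 ? table[idx[i]] : 0;
}

/* Deinterleave one 64-byte UYVY block with four lookups:
 * one for u, one for v and two for y. */
static void uyvy_block_to_planes(const uint8_t src[64], uint8_t y[32],
                                 uint8_t u[16], uint8_t v[16])
{
    uint8_t idx_u[16], idx_v[16], idx_y0[16], idx_y1[16];

    for (int i = 0; i < 16; i++) {
        idx_u[i]  = (uint8_t)(4 * i);      /* U at bytes 0, 4, 8, ...  */
        idx_v[i]  = (uint8_t)(4 * i + 2);  /* V at bytes 2, 6, 10, ... */
        idx_y0[i] = (uint8_t)(2 * i + 1);  /* first 16 luma samples    */
        idx_y1[i] = (uint8_t)(2 * i + 33); /* last 16 luma samples     */
    }
    tbl64(u, src, idx_u);
    tbl64(v, src, idx_v);
    tbl64(y, src, idx_y0);
    tbl64(y + 16, src, idx_y1);
}

/* Feed a ramp (src[i] == i) so each output byte reveals its source offset. */
static void uyvy_selfcheck(void)
{
    uint8_t src[64], y[32], u[16], v[16];

    for (int i = 0; i < 64; i++)
        src[i] = (uint8_t)i;
    uyvy_block_to_planes(src, y, u, v);
    assert(u[1] == 4 && v[1] == 6);
    assert(y[0] == 1 && y[16] == 33);
}
```

In real NEON code the four index vectors would be constants loaded once outside the loop; the model only shows which bytes each lookup gathers.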
>
> > + sxtw x7, w7
> > + ldrsw x8, [sp]
> > + ubfx x10, x4, #1, #31
>
> The ubfx instruction is kinda esoteric; I presume what you're doing here is
> essentially the same as "lsr #1"? That'd be much more idiomatic and
> readable.
That's correct. What put me off was that register 4 is passed as an int
(w4), and I expected register 10 to be 64 bits wide with the high bits set
to 0. lsr w10, w4, #1 already does that, since a write to a W register
zeroes the upper 32 bits of the corresponding X register.
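A small C model of the two instructions, to show that they agree even when the upper half of x4 holds garbage. The function names are made up for this illustration; the bodies follow the architectural definitions of UBFX and of a 32-bit LSR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of "ubfx x10, x4, #1, #31": extract 31 bits starting at bit 1
 * of the 64-bit source and zero-extend the result. */
static uint64_t ubfx_1_31(uint64_t x4)
{
    return (x4 >> 1) & ((UINT64_C(1) << 31) - 1);
}

/* Model of "lsr w10, w4, #1": shift the low 32 bits right by one.
 * Writing w10 clears bits 63:32 of x10, hence the widening cast. */
static uint64_t lsr_w_1(uint64_t x4)
{
    uint32_t w4 = (uint32_t)x4;
    return (uint64_t)(w4 >> 1);
}

/* Compare both models on a few values, including dirty upper halves. */
static void check_halving_equivalence(void)
{
    const uint64_t samples[] = {
        0, 1, 2, 1920, UINT64_C(0x7fffffff), UINT64_C(0xffffffff),
        UINT64_C(0xdeadbeef00000780), UINT64_C(0xffffffffffffffff),
    };

    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        assert(ubfx_1_31(samples[i]) == lsr_w_1(samples[i]));
}
```

Both forms ignore whatever the caller left in bits 63:32 of x4, which is what matters for an argument passed as int.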
I modified the code to handle {uyvy,yuyv}toyuv{420,422} using macros,
since these 4 functions share common routines. The code lost some
readability, though.
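To illustrate the sharing in plain C rather than assembly, here is a simplified sketch of how one macro can stamp out all four variants. It is not the code from the patch: the macro, the sketch_ prefix and the parameter names are made up here, and the 4:2:0 case simply keeps chroma from even rows, whereas a real implementation may average vertically adjacent rows instead.

```c
#include <assert.h>
#include <stdint.h>

/* y_off: byte offset of luma inside each 2-byte pair (1 = UYVY, 0 = YUYV).
 * vsub:  1 to keep chroma from even rows only (4:2:0), 0 for 4:2:2. */
#define DEFINE_PACKED_TO_PLANAR(name, y_off, vsub)                    \
static void name(uint8_t *ydst, uint8_t *udst, uint8_t *vdst,         \
                 const uint8_t *src, int width, int height,           \
                 int lum_stride, int chrom_stride, int src_stride)    \
{                                                                     \
    const int c_off = 1 - (y_off);                                    \
    const int chrom_width = width >> 1;                               \
    for (int row = 0; row < height; row++) {                          \
        for (int x = 0; x < width; x++)                               \
            ydst[x] = src[2 * x + (y_off)];                           \
        if (!(vsub) || (row & 1) == 0) {                              \
            for (int x = 0; x < chrom_width; x++) {                   \
                udst[x] = src[4 * x + c_off];                         \
                vdst[x] = src[4 * x + c_off + 2];                     \
            }                                                         \
            udst += chrom_stride;                                     \
            vdst += chrom_stride;                                     \
        }                                                             \
        src  += src_stride;                                           \
        ydst += lum_stride;                                           \
    }                                                                 \
}

DEFINE_PACKED_TO_PLANAR(sketch_uyvytoyuv422, 1, 0)
DEFINE_PACKED_TO_PLANAR(sketch_uyvytoyuv420, 1, 1)
DEFINE_PACKED_TO_PLANAR(sketch_yuyvtoyuv422, 0, 0)
DEFINE_PACKED_TO_PLANAR(sketch_yuyvtoyuv420, 0, 1)

/* One row of four UYVY pixels: U0 Y0 V0 Y1 U1 Y2 V1 Y3. */
static void sketch_selfcheck(void)
{
    const uint8_t src[8] = { 10, 1, 20, 2, 11, 3, 21, 4 };
    uint8_t y[4], u[2], v[2];

    sketch_uyvytoyuv422(y, u, v, src, 4, 1, 4, 2, 8);
    assert(y[0] == 1 && y[3] == 4);
    assert(u[0] == 10 && u[1] == 11);
    assert(v[0] == 20 && v[1] == 21);
}
```

The only differences between the four functions are the two macro arguments, which is also why the shared body reads less directly than four hand-written loops.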
Krzysztof
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-swscale-aarch64-rgb2rgb_neon-Implemented-uyvytoyuv42.patch
Type: text/x-diff
Size: 15407 bytes
Desc: not available
URL: <https://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20250211/a674582f/attachment.patch>