[FFmpeg-devel] [PATCH v2] x86/tx_float: implement inverse MDCT AVX2 assembly

Fri Sep 2 08:49:31 EEST 2022

Version 2 notes: halved the number of loads and loop iterations in the
pre-transform loop by exploiting its symmetry.
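Here is one plausible reading of the v2 change, sketched in C rather than assembly and not taken from the patch. It assumes a pre-rotation loop roughly modelled on FFmpeg's older C iMDCT. In that loop, iteration k reads in[2k] from the front of the input and in[n/2-1-2k] from the back, and iteration n/4-1-k touches the neighbouring element at each end. Handling both iterations together turns each end into one contiguous pair and halves the trip count. The names `prerot_simple`, `prerot_paired` and `cplx` are hypothetical, and the bit-reversal reordering of the outputs is left out.

```c
#include <assert.h>

/* Hypothetical complex type for this sketch. */
typedef struct { float re, im; } cplx;

/* (are + i*aim) * (bre + i*bim) */
static cplx cmul(float are, float aim, float bre, float bim)
{
    cplx r = { are * bre - aim * bim, are * bim + aim * bre };
    return r;
}

/* Straightforward pre-rotation: n is the window length, the input holds
 * n/2 coefficients, and each of the n/4 iterations reads one value from
 * each end of the input. */
static void prerot_simple(const float *in, const float *tc, const float *ts,
                          cplx *z, int n)
{
    int n2 = n / 2, n4 = n / 4;
    for (int k = 0; k < n4; k++)
        z[k] = cmul(in[n2 - 1 - 2 * k], in[2 * k], tc[k], ts[k]);
}

/* Paired pre-rotation: iteration k also handles kk = n4-1-k. Together they
 * need in[2k], in[2k+1] at the front and in[n2-2-2k], in[n2-1-2k] at the
 * back, so each end is one contiguous pair and the loop runs n/8 times.
 * n/4 must be even, i.e. n must be a multiple of 8. */
static void prerot_paired(const float *in, const float *tc, const float *ts,
                          cplx *z, int n)
{
    int n2 = n / 2, n4 = n / 4;
    assert(n % 8 == 0);
    for (int k = 0; k < n4 / 2; k++) {
        int kk = n4 - 1 - k;
        float f0 = in[2 * k],          f1 = in[2 * k + 1];      /* front pair */
        float b0 = in[n2 - 2 - 2 * k], b1 = in[n2 - 1 - 2 * k]; /* back pair  */
        z[k]  = cmul(b1, f0, tc[k],  ts[k]);
        z[kk] = cmul(f1, b0, tc[kk], ts[kk]);
    }
}
```

Both functions give the same output for any input and twiddle values. The paired version visits each input element exactly once through two contiguous reads per iteration instead of four scattered ones.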

This commit implements an iMDCT in pure assembly.

This can process any transform whose length is a multiple of 8, not
just powers of two, but since power-of-two lengths are all we currently
have assembly for, that is all that is supported.
It would benefit greatly if the C code could decide which function to
jump into, but exposing function labels from assembly to C is anything
but easy.
The post-transform loop could probably be improved.

This was somewhat annoying to write, as we must support arbitrary
strides at runtime. There is a fast branch for stride == 4 bytes
and a slower one that uses vgatherdps.
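This is a scalar C model of the two load paths described above, not the patch's assembly. It assumes 8 float lanes per AVX2 register and a stride given in bytes, as in the av_tx API. `load8_strided` and `LANES` are hypothetical names. The contiguous case stands in for a single vector load. The strided case first builds a per-lane byte-offset vector, then fetches each lane from base + offset, which is what vgatherdps does with dword indices and a scale of 1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LANES 8 /* one AVX2 ymm register holds 8 floats */

/* Load 8 floats spaced `stride` bytes apart into dst. */
static void load8_strided(float dst[LANES], const void *base, ptrdiff_t stride)
{
    const unsigned char *p = base;
    assert(dst != NULL && base != NULL);

    if (stride == (ptrdiff_t)sizeof(float)) {
        /* Fast path: lanes are contiguous, i.e. one plain vector load. */
        memcpy(dst, p, LANES * sizeof(float));
        return;
    }

    /* Slow path: per-lane byte offsets, like the gather's index vector. */
    int idx[LANES];
    for (int i = 0; i < LANES; i++)
        idx[i] = (int)(i * stride);

    /* Emulated gather: lane i comes from base + idx[i] bytes. */
    for (int i = 0; i < LANES; i++)
        memcpy(&dst[i], p + idx[i], sizeof(float));
}
```

In assembly the gather path needs an extra index register and is slower than a plain load, which is why stride == 4 gets its own branch.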

Zen 3 benchmarks with stride == 4, old (av_imdct_half) vs. new (av_tx):

128pt:
   2815 decicycles in         av_tx (imdct),16776766 runs,    450 skips
   3097 decicycles in         av_imdct_half,16776745 runs,    471 skips

256pt:
   4931 decicycles in         av_tx (imdct), 4193127 runs,   1177 skips
   5401 decicycles in         av_imdct_half, 2097058 runs,     94 skips

512pt:
   9764 decicycles in         av_tx (imdct), 4193929 runs,    375 skips
  10690 decicycles in         av_imdct_half, 2096948 runs,    204 skips

1024pt:
  20113 decicycles in         av_tx (imdct), 4194202 runs,    102 skips
  21258 decicycles in         av_imdct_half, 2097147 runs,      5 skips
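The relative gains implied by the figures above can be worked out directly from the posted decicycle counts. The snippet below only does that arithmetic (speedup = old / new, saving = 1 - new / old). It adds no new measurements, and the posted numbers are single runs with no variance given. The struct and function names are illustrative.

```c
#include <assert.h>
#include <stdio.h>

/* Decicycle figures copied from the Zen 3 results above (stride == 4). */
struct bench { int points; double av_tx; double imdct_half; };

static const struct bench results[] = {
    {  128,  2815.0,  3097.0 },
    {  256,  4931.0,  5401.0 },
    {  512,  9764.0, 10690.0 },
    { 1024, 20113.0, 21258.0 },
};

/* Speedup of av_tx over av_imdct_half: old time / new time. */
static double speedup(const struct bench *b)
{
    assert(b->av_tx > 0.0);
    return b->imdct_half / b->av_tx;
}

/* Fraction of decicycles saved relative to the old code: 1 - new / old. */
static double saving(const struct bench *b)
{
    return 1.0 - b->av_tx / b->imdct_half;
}

static void print_results(void)
{
    for (size_t i = 0; i < sizeof(results) / sizeof(results[0]); i++)
        printf("%5dpt: %.3fx faster, %.1f%% fewer decicycles\n",
               results[i].points, speedup(&results[i]),
               100.0 * saving(&results[i]));
}
```

On the posted figures this works out to about 1.10x at 128pt, about 1.095x at 256pt and 512pt, and about 1.057x at 1024pt.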

Patch attached.

Name: v2-0001-x86-tx_float-implement-inverse-MDCT-AVX2-assembly.patch
Type: text/x-diff
Size: 12881 bytes
URL: <https://ffmpeg.org/pipermail/ffmpeg-devel/attachments/20220902/034fa486/attachment.patch>