[FFmpeg-devel] [PATCH 1/2] libavutil/cpu: Adds fast gather detection.

Lynne dev at lynne.ee
Mon Jul 12 16:39:56 EEST 2021


12 Jul 2021, 13:53 by jamrial at gmail.com:

> On 7/12/2021 7:46 AM, Lynne wrote:
>
>> 12 Jul 2021, 11:29 by alankelly-at-google.com at ffmpeg.org:
>>
>>> On Fri, Jun 25, 2021 at 1:24 PM Alan Kelly <alankelly at google.com> wrote:
>>>
>>>> On Fri, Jun 25, 2021 at 10:40 AM Lynne <dev at lynne.ee> wrote:
>>>>
>>>>> Jun 25, 2021, 09:54 by alankelly-at-google.com at ffmpeg.org:
>>>>>
>>>>>> Broadwell and later and Zen3 and later have fast gather instructions.
>>>>>> ---
>>>>>>  Gather requires between 9 and 12 cycles on Haswell, 5 to 7 on Broadwell,
>>>>>> and 2 to 5 on Skylake and newer. It is also slow on AMD before Zen 3.
>>>>>>  libavutil/cpu.h     |  2 ++
>>>>>>  libavutil/x86/cpu.c | 18 ++++++++++++++++--
>>>>>>  libavutil/x86/cpu.h |  1 +
>>>>>>  3 files changed, 19 insertions(+), 2 deletions(-)
>>>>>>
>>>>>
>>>>> No, we really don't need more FAST/SLOW flags, especially for
>>>>> something like this which is just fixable by _not_using_vgather_.
>>>>> Take a look at libavutil/x86/tx_float.asm, we only use vgather
>>>>> if it's guaranteed to either be faster for what we're gathering or
>>>>> is just as fast even when it is "slow". If neither is true, we use manual lookups,
>>>>> which is actually advantageous since for AVX2 we can interleave
>>>>> the lookups that happen in each lane.
>>>>>
>>>>> Even if we disregard this, I've extensively benchmarked vgather
>>>>> on Zen 3, Zen 2, Cascade Lake and Skylake, and there's hardly
>>>>> a great vgather improvement to be found in Zen 3 to justify
>>>>> using a new CPU flag for this.
>>>>>
>>>>
>>>> Thanks for your response. I'm not against finding a cleaner way of
>>>> enabling/disabling the code which will be protected by this flag. However,
>>>> the manual lookups solution proposed will not work in this case: the AVX2
>>>> version of hscale will only be faster if fast gathers are available;
>>>> otherwise, the SSSE3 version should be used.
>>>>
>>>> I haven't got access to a Zen 3, so I can't comment on its performance. I
>>>> have tested on a Zen 2 and it is slow. On Broadwell, hscale AVX2 is about
>>>> 10% faster than the SSSE3 version, and on Skylake about 40% faster. Haswell
>>>> has similar performance to Zen 2.
>>>>
>>>> Is there a proxy which could be used for detecting Broadwell or Skylake
>>>> and later? AVX512 seems too strict as there are Skylake chips without
>>>> AVX512. Thanks
>>>>
>>>
>>> Hi,
>>>
>>> I will paste the performance figures from the thread for the other part of
>>> this patch here so that the justification for this flag is clearer:
>>>
>>>                               Skylake  Haswell
>>> hscale_8_to_15_width4_ssse3     761.2      760
>>> hscale_8_to_15_width4_avx2      468.7      957
>>> hscale_8_to_15_width8_ssse3    1170.7     1032
>>> hscale_8_to_15_width8_avx2      865.7     1979
>>> hscale_8_to_15_width12_ssse3   2172.2     2472
>>> hscale_8_to_15_width12_avx2    1245.7     2901
>>> hscale_8_to_15_width16_ssse3   2244.2     2400
>>> hscale_8_to_15_width16_avx2    1647.2     3681
>>>
>>> As you can see, it is catastrophic on Haswell and older chips but the gains
>>> on Skylake are impressive.
>>> As I don't have performance figures for Zen 3, I can disable this feature
>>> on all CPUs apart from Intel Broadwell and later, as you say that there is no
>>> worthwhile improvement on Zen 3. Is this OK with you?
>>>
>>
>> It's not that catastrophic. Since Haswell CPUs generally don't have
>> large AVX2 gains, could you just exclude Haswell only from
>> EXTERNAL_AVX2_FAST, and require EXTERNAL_AVX2_FAST
>> to enable those functions?
>>
>
> And disable all non-gather AVX2 asm functions on Haswell? No. And it's a lie that Haswell doesn't have large gains with AVX2.
>

It won't disable ALL of the AVX2 code, but it'll affect a few random components, the most
prominent of which is some (not all) of the HEVC assembly.
But I think I'd rather not do anything at all. Performance of vgather even on Haswell
is still more than 2x that of the C version, and we barely have any vgathers in our code.
Haswell use is in decline too.
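
For anyone following the thread, the "exclude Haswell only" idea discussed above could be sketched roughly as below. This is only an illustration, not code from the patch or from libavutil: the type and function names (CpuSignature, decode_signature, is_intel_haswell, prefer_avx2_gather) are hypothetical, the caller is assumed to have already checked the vendor string and the AVX2 bit, and the Haswell model numbers are the commonly documented family 6 values, which should be verified against Intel's tables before relying on them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type; nothing here exists in libavutil. */
typedef struct CpuSignature {
    unsigned family;
    unsigned model;
} CpuSignature;

/* Decode the display family/model from the EAX value returned by
 * CPUID leaf 1. The extended fields are only applied for the base
 * families that define them. */
static CpuSignature decode_signature(uint32_t eax)
{
    unsigned base_family = (eax >> 8)  & 0xf;
    unsigned base_model  = (eax >> 4)  & 0xf;
    unsigned ext_family  = (eax >> 20) & 0xff;
    unsigned ext_model   = (eax >> 16) & 0xf;
    CpuSignature sig = { base_family, base_model };

    if (base_family == 0xf)
        sig.family += ext_family;
    if (base_family == 0x6 || base_family == 0xf)
        sig.model |= ext_model << 4;
    return sig;
}

/* Assumption: Intel family 6 models 0x3c, 0x3f, 0x45 and 0x46 are the
 * Haswell parts. The caller must already know the vendor is Intel. */
static bool is_intel_haswell(CpuSignature sig)
{
    if (sig.family != 6)
        return false;
    switch (sig.model) {
    case 0x3c: case 0x3f: case 0x45: case 0x46:
        return true;
    default:
        return false;
    }
}

/* Policy sketch: take the gather-based AVX2 path only on Intel CPUs
 * with AVX2 that are not Haswell (i.e. Broadwell and later). Every
 * other CPU would fall back to the SSSE3 version. */
static bool prefer_avx2_gather(bool is_intel, bool has_avx2, CpuSignature sig)
{
    return is_intel && has_avx2 && !is_intel_haswell(sig);
}
```

With the signature 0x000306C3 (a Haswell desktop part) this decodes to family 6, model 0x3c and selects the SSSE3 path, while 0x000506E3 (Skylake) decodes to model 0x5e and selects the AVX2 path. A model list like this is exactly the kind of maintenance burden argued against above, so take it as a description of the trade-off rather than a recommendation.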

