[FFmpeg-devel] [PATCH 0/5] RISC-V: Improve H264 decoding performance using RVV intrinsic

Wed May 10 15:14:32 EEST 2023

Hi,

On 10 May 2023 11:46:57 GMT+03:00, Arnie Chang <arnie.chang at sifive.com> wrote:
>Considering the benefits of the open ISA like RISC-V,
>the intrinsic code should still have a better chance of being optimized by
>the compiler for hardware variants.

You probably have access to proprietary performance information of SiFive which nobody else here can argue about, so maybe you are onto something here.

However, FFmpeg needs to support any RV64GC CPU with a single build, because that's how many Linux distributions and applications will build it. So we can't really rely on the compiler's per-CPU model tuning for scheduling. In any case, my guess is that there won't be that much room for the compiler to reorder vector code, even if it's using intrinsics.
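To make the single-build constraint concrete, here is a minimal sketch of run-time dispatch through a function pointer. It is not FFmpeg code: `SKETCH_CPU_FLAG_VECTOR`, `add_bytes_fn`, `select_add_bytes` and the two implementations are hypothetical names, and the "vector" routine is a C stand-in for what would be an outline assembler function in a real build.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit for this sketch only. */
#define SKETCH_CPU_FLAG_VECTOR 0x1u

typedef void (*add_bytes_fn)(uint8_t *dst, const uint8_t *src, size_t len);

/* Portable fallback that runs on any RV64GC core. */
static void add_bytes_scalar(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = (uint8_t)(dst[i] + src[i]);
}

/* Stand-in for a routine that would live in a separate assembler file.
 * It is written in C here (assumption) so that the sketch builds anywhere. */
static void add_bytes_vector_stub(uint8_t *dst, const uint8_t *src, size_t len)
{
    add_bytes_scalar(dst, src, len);
}

/* Choose the implementation once, from flags detected at run time,
 * so that the same binary works with or without a vector unit. */
static add_bytes_fn select_add_bytes(unsigned cpu_flags)
{
    if (cpu_flags & SKETCH_CPU_FLAG_VECTOR)
        return add_bytes_vector_stub;
    return add_bytes_scalar;
}
```

The point of the sketch is only that the choice is made by the running program, not by the compiler's target flags.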

On the contrary, I fear that we will need to tune the group multiplier (LMUL) at runtime to get good performance on different processor designs. That is essentially a form of unrolling. And if that turns out to be true, then we *cannot* use intrinsics, since unlike outline assembler they don't support varying the group multiplier at runtime.
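For reference, the arithmetic behind that concern can be sketched as follows. The RISC-V V specification defines the maximum number of elements per instruction as VLMAX = LMUL * VLEN / SEW, so the same loop takes a different number of iterations depending on both the hardware vector length and the chosen group multiplier. The function names are hypothetical, and only integer LMUL values are handled.

```c
#include <assert.h>
#include <stddef.h>

/* VLMAX = LMUL * VLEN / SEW, restricted to integer LMUL in this sketch. */
static size_t sketch_vlmax(size_t vlen_bits, size_t sew_bits, unsigned lmul)
{
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    assert(sew_bits != 0 && vlen_bits % sew_bits == 0);
    return (size_t)lmul * vlen_bits / sew_bits;
}

/* Number of strip-mined loop iterations needed to process n elements. */
static size_t sketch_iterations(size_t n, size_t vlmax)
{
    return (n + vlmax - 1) / vlmax;
}
```

For instance, 100 bytes on a 128-bit implementation take 7 iterations at LMUL=1 but a single one at LMUL=8. Which setting is actually faster depends on the processor design, which is why the choice may have to be made at runtime.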

So I could be completely wrong, but if so, we'd need a more substantial explanation and justification of why.

>At this moment, the intrinsic implementation is the only thing available.
>It would take a significant amount of time to rewrite it in assembly due to
>the large amount of functions.

Is it really that much work? Leaving aside maybe converting the inline functions into assembler macros, it seems mostly like a case of passing the C code through the compiler, then disassembling the result and then reformatting for legibility here and there.

As the proverb goes, "on the Internet, nobody knows you're a monkey". Nobody needs to know that somebody wrote their assembler with the help of intrinsics and a compiler.

>I was wondering if we could treat the intrinsic code as an initial version
>for the RISC-V port with the following modification.
>    - Add an option --enable-rvv-intrinsic to EXPLICITLY enable the
>intrinsic optimization, which is disabled by default.

I will let more senior developers comment here, but I suspect that this would set a bad example that would eventually lead other people to choose intrinsics over outline assembler for new code.

Adding a build option could be viable if we wanted to advise against using the code. But here we rather want to advise against using the code as a reference, not against running it.

If this were the kernel, I'd argue for merging the code into `staging`, but FFmpeg is not so large that it'd have a staging area.

>      Given the current conditions (the state of vector support in GCC, and
>the dislike and limitations of intrinsics), disabling it by default seems a
>reasonable approach.
>
>Those who want to be involved in optimizing the H.264 decoder on RISC-V
>can work on the assembly and decide whether to refer to the intrinsic
>code.
>I believe this would be a good starting point for future optimization.

Well, most likely. The thing is, though, that nobody in the FFmpeg community (except you) has hardware access in any shape or form at this time, at least as far as I know. That's one of the reasons why my own efforts have stalled.

>
>
>On Wed, May 10, 2023 at 12:51 AM Rémi Denis-Courmont <remi at remlab.net>
>wrote:
>
>>         Hi,
>>
>> On Tuesday, 9 May 2023, 12:50:25 EEST, Arnie Chang wrote:
>> > We are submitting a set of patches that significantly improve H.264
>> > decoding performance by utilizing RVV intrinsic code.
>>
>> I believe that there is a general dislike of compiler intrinsics for vector
>> optimisations in FFmpeg for a plurality of reasons. FWIW, that dislike is
>> not limited to FFmpeg:
>> https://www.reddit.com/r/RISCV/comments/131hlgq/comment/ji1ie3l/
>> Indeed, in my personal opinion, RISC-V V intrinsics specifically are
>> painful to read/write compared to assembler.
>>
>> On top of that, in this particular case, intrinsics have at least three,
>> possibly four, additional and more objective challenges as compared to the
>> existing RVV assembler:
>>
>> 1) They are less portable, requiring the most bleeding edge version of
>> compilers. Case in point: our FATE GCC instance does not support them as
>> of today (because Debian Unstable does not).
>>
>> 2) They do not work with run-time CPU detection, at least not currently.
>> This is going to be a major stumbling point for Linux distributions which
>> need to build code that runs on processors without a vector unit.
>>
>> 3) V intrinsics require specifying the group multiplier at every
>> instruction. In most cases, this is just very inconvenient. But in those
>> algorithms that require a fixed vector size (e.g. Opus DSP already now),
>> this simply does _not_ work.
>>
>> Essentially, this is the downside of relying on the compiler to do the
>> register allocation.
>>
>> 4) (Unsure) Intrinsics are notorious for missing some code points.
>>
>>
>> The first two points may be addressed eventually. But the third point is
>> intrinsic to intrinsics (hohoho). So unless there is a case for why
>> intrinsics would be all but _required_, please avoid them.
>>
>> Now I do realise that that means some of the code won't be XLEN-independent.
>> Well, we can cross that bridge with macros if/when somebody actually cares
>> about FFmpeg vector optimisations on RV32I.
>>
>> Br,
>>
>> --
>> 雷米‧德尼-库尔蒙
>> http://www.remlab.net/
>>