[FFmpeg-devel] RISC-V vector DSP functions: Motivation for commit 446b009

Michael Platzer michael.platzer at axelera.ai
Tue Jan 23 19:34:46 EET 2024


Hi Rémi,

Thanks for your reply.

> It was faster on what was the best approximation of real hardware available at the time, i.e. a Sipeed Lichee Pi4A board. There are no benchmarks in the commit because I don't like to publish benchmarks collected from prototypes.
> Nevertheless I think the commit message hints enough that anybody could easily guess that it was a performance optimisation, if I'm being honest.
> 
> This is not exactly surprising: typical hardware can only access so many memory addresses simultaneously (i.e. one or maybe two), so indexed loads and strided loads are bound to be much slower than unit-strided loads.

I agree that indexed and strided loads and stores are certainly slower than unit-strided loads and stores. However, the vrgather instruction is unlikely to be very performant either, unless the vector length is relatively short. In particular, if vector register groups are used via a length multiplier (LMUL) of, e.g., 8, then any element in the destination vector register group can be sourced from any element of the 8 source vector registers (i.e., a quarter of the vector register file).
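As a rough illustration (this is not code from the patch; register choices are made up), a gather over a register group with LMUL=8 looks like this:

    vsetvli     t0, a2, e32, m8, ta, ma   # operate on groups of 8 registers
    vrgather.vv v16, v0, v8               # v16..v23[i] = (v0..v7)[(v8..v15)[i]]

Every destination element may be sourced from any element of the 8-register source group, which is what makes the operation expensive to implement in hardware.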

AFAIK (but please correct me if I am wrong) the Sipeed Lichee Pi4A uses a quad-core XT-910, which, depending on the exact variant, has a vector register length (VLEN) of either 64 or 128 bits. Given the configured element width of 32 bits and a length multiplier of 2, we are therefore looking at vectors of 4 or 8 elements.
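For reference, the number of elements per vector operation is VLMAX = (VLEN / SEW) * LMUL, so with SEW = 32 and LMUL = 2:

    VLEN = 128: (128 / 32) * 2 = 8 elements
    VLEN =  64: ( 64 / 32) * 2 = 4 elements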

There is a comment that reads "e16/m2 and e32/m4 are possible but slower due to gather", which does not surprise me, since the performance of vrgather most likely scales quadratically with the vector length. Similarly, vrgather is likely to be less performant on a RISC-V CPU with a larger VLEN, since the hardware cost of a crossbar able to permute across the full vector register length becomes prohibitive for VLEN beyond 128 bits. The permutation then has to be spread over several iterations instead, which need to cover every combination of input and output element groups (hence the quadratic growth in execution time).
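To put rough, purely illustrative numbers on this: if a naive implementation can only connect W source elements to W destination elements per step, a gather over N elements needs on the order of (N/W)^2 steps to cover all combinations, whereas a strided access only needs on the order of N/W steps:

    N = 64, W = 8:  gather ~ (64/8)^2 = 64 steps,  strided ~ 64/8 = 8 steps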

By contrast, the performance of strided loads and stores, while certainly slower than unit-strided loads and stores, likely scales linearly with the vector length, so on CPUs with large VLEN the original code could very well run faster than the variant with vrgather, despite the slower strided loads and stores.

> Maybe you have access to special hardware that is able to optimise the special case of strides equal to minus one to reduce the number of memory accesses.
> But I didn't back then, and as a matter of fact, I still don't. Hardware donations are welcome.

Hardware availability is indeed still an issue for RISC-V vector processing.

> > The RISC-V vector loads and stores support negative stride values for 
> > use cases such as this one.
> 
> [Citation required]

The purpose of strided loads and stores is to load/store elements that are not consecutive in memory but separated by a constant byte offset. The stride is a signed value, and the authors of the specification deliberately allow negative strides, apparently because they deemed it useful to be able to reverse the order of the accessed elements.
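As a minimal sketch (not the code from the patch), a reverse copy with a negative stride could look like this, assuming a0 points at the last 32-bit element of the source and a1 at the start of the destination:

    li       t1, -4                     # stride of -4 bytes: walk backwards
    vsetvli  t0, a2, e32, m2, ta, ma
    vlse32.v v8, (a0), t1               # loads src[n-1], src[n-2], ...
    vse32.v  v8, (a1)                   # stores them in reversed order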

> > Using vrgather instead replaces the more specific operation with a 
> > more generic one,
> 
> That is a very subjective and unsubstantiated assertion. This feels a bit hypocritical while you are attacking me for not providing justification.

vrgather is more generic because it can perform any permutation of the elements, which strided loads and stores cannot. This is not subjective.

> As far as I can tell, neither instruction is specific to reversing vector element order. An actual real-life specific instruction exists on Arm in the form of vector-reverse. I don't know any ISA with load-reverse or store-reverse.

A load-reverse or store-reverse would just be a special case of a strided load/store with a stride of minus the element size.
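E.g. (again only illustrative), a 32-bit "store-reverse" is simply a strided store with a stride of -4 bytes:

    li       t1, -4
    vsse32.v v8, (a1), t1               # element i goes to a1 - 4*i, i.e. reversed in memory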

> > which is likely to be less performant on most HW architectures.
> 
> Would you care to define "most architectures"? I only know one commercially available hardware architecture as of today, Kendryte K230 SoC with T-Head
> C908 CPU, so I can't make much sense of your sentence here.

When writing about the performance of vrgather, I primarily had in mind the scalability issues explained above. It seems that you have already run into these, since you found that a larger LMUL reduces the performance of vrgather.

> > In addition, it requires to setup an index vector,
> 
> That is irrelevant since in this loop, the vector bank is not a bottleneck.
> The loop can run with maximal LMUL either way. And besides, the loop turned out to be faster with a smaller multiplier.

That is because the performance of vrgather does not scale linearly. I would assume that this does not happen with the original code (i.e., the performance of strided loads/stores does not decrease for larger LMUL).

> > thus raising dynamic instruction count.
> 
> It adds only one instruction (reverse subtraction) in the main loop,

If I read the diff correctly, the strided load is replaced by a unit-strided load plus a vrgather, i.e. one instruction is replaced by two. Together with the reverse subtraction, that makes two additional instructions in the main loop.
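Roughly, as a sketch of my reading of the diff (not a verbatim copy, register names made up), the change in the loop body looks like this:

    # before: one strided load
    vlse32.v    v8, (a0), t1            # t1 holds the negative stride

    # after: unit-strided load plus gather
    vle32.v     v8, (a0)
    vrgather.vv v16, v8, v24            # v24 holds the reversed indices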

> and even that could be optimised away if relevant.

How would the reverse subtraction be optimized away? I assume that it needs to be part of the loop since it depends on the VL of the current iteration.
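For reference, a sketch of how the index vector depends on the vl of each iteration (register choices again made up):

    # outside the loop:
    vsetvli  t0, zero, e32, m2, ta, ma
    vid.v    v8                         # 0, 1, 2, ..., VLMAX-1
    # inside the loop:
    vsetvli  t0, a2, e32, m2, ta, ma    # t0 = vl of this iteration
    addi     t2, t0, -1
    vrsub.vx v24, v8, t2                # (vl-1) - i:  vl-1, vl-2, ..., 0

The vid.v can be hoisted out, but the vrsub.vx seems to have to stay in the loop, since it depends on t0, which may change in the last iteration.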

Michael

