[FFmpeg-devel] RISC-V vector DSP functions: Motivation for commit 446b009

Fri Jan 19 19:14:02 EET 2024

Hi,

On Friday, 19 January 2024, 17.30.00 EET, Michael Platzer via ffmpeg-devel
wrote:
> Commit 446b0090cbb66ee614dcf6ca79c78dc8eb7f0e37 by Remi Denis-Courmont has
> replaced RISC-V vector loads and stores with negative stride with vrgather
> (generalized permutation within vector registers) instructions in order to
> reverse the elements in a vector register. The commit message explains that
> this change was done, but it does not explain why.

It was faster on what was the best approximation of real hardware available 
at the time, i.e. a Sipeed Lichee Pi4A board. There are no benchmarks in the commit 
because I don't like to publish benchmarks collected from prototypes. 
Nevertheless I think the commit message hints enough that anybody could easily 
guess that it was a performance optimisation, if I'm being honest.

This is not exactly surprising: typical hardware can only access so many 
memory addresses simultaneously (i.e. one or maybe two), so indexed loads and 
strided loads are bound to be much slower than unit-strided loads.

Maybe you have access to special hardware that is able to optimise the special 
case of strides equal to minus one to reduce the number of memory accesses. 
But I didn't back then, and as a matter of fact, I still don't. Hardware 
donations are welcome.

> I fail to see what could possibly have motivated this change.

> The RISC-V vector loads and stores support negative stride values for use
> cases such as this one.

[Citation required]

> Using vrgather instead replaces the more specific operation with a more
> generic one,

That is a very subjective and unsubstantiated assertion. This feels a bit 
hypocritical while you are attacking me for not providing justification.

As far as I can tell, neither instruction is specific to reversing vector 
element order. An actual real-life specific instruction exists on Arm in the 
form of vector-reverse. I don't know any ISA with load-reverse or store-
reverse.

> which is likely to be less performant on most HW architectures.

Would you care to define "most architectures"? I only know one commercially 
available hardware architecture as of today, the Kendryte K230 SoC with the 
T-Head C908 CPU, so I can't make much sense of your sentence here.

> In addition, it requires to setup an index vector,

That is irrelevant since in this loop, the vector bank is not a bottleneck. 
The loop can run with maximum LMUL either way. And besides, the loop turned 
out to be faster with a smaller multiplier.

> thus raising dynamic instruction count.

It adds only one instruction (reverse subtraction) in the main loop, and even 
that could be optimised away if relevant.
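
For illustration only, here is a scalar C model of the two ways of reversing 
elements discussed above. It is a sketch of the semantics, not the code from 
the commit: the function names are made up for this example, the index 
computation is assumed to be an element-index vector (as vid.v would produce) 
followed by a reverse subtraction from vl - 1 (as vrsub.vx would do), and the 
out-of-range check compares against vl where real hardware would compare 
against VLMAX.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: models vid.v followed by a reverse subtraction,
 * giving idx[i] = (vl - 1) - i. This is the one extra step the gather
 * approach needs whenever vl changes. */
static void make_reverse_index(uint16_t *idx, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        idx[i] = (uint16_t)((vl - 1) - i);
}

/* Hypothetical helper: models a register gather, dst[i] = src[idx[i]].
 * The source elements are assumed to have been fetched beforehand with
 * a single unit-stride load. Out-of-range indices yield zero. */
static void gather(float *dst, const float *src, const uint16_t *idx,
                   size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = idx[i] < vl ? src[idx[i]] : 0.0f;
}

/* Hypothetical helper: models a strided load with a byte stride of
 * -sizeof(float), starting from the last element. Each element has
 * its own address, walking downwards through memory. */
static void load_negative_stride(float *dst, const float *last, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = *(last - i);
}
```

Both paths yield the same reversed vector. What differs is where the work 
happens: the gather path does one contiguous load and then permutes inside 
the register file, whereas the strided path issues one address per element 
unless the hardware special-cases a stride of minus one element.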

-- 
レミ・デニ-クールモン
http://www.remlab.net/