[FFmpeg-devel] RISC-V vector DSP functions: Motivation for commit 446b009

Tue Jan 23 20:02:02 EET 2024

On Tuesday, 23 January 2024, 19:34:46 EET, Michael Platzer via ffmpeg-devel
wrote:
> I agree that the indexed and strided loads and stores are certainly slower
> than unit-strided loads and stores. However, the vrgather instruction is
> unlikely to be very performant either, unless the vector length is
> relatively short.

> Particularly, if vector register groups are used via a
> length multiplier LMUL of, e.g., 8, then any element in the destination
> vector register could be sourced from any element in the 8 source vector
> registers (i.e., 1/4 of the vector register file).

Gather instructions seem to scale quadratically with LMUL on existing
hardware, which is bad. That's why the FFmpeg code was later modified to use
LMUL=1 in that particular case.

Now if you want to argue that VLSE is better, then please provide a patch 
exhibiting better performance on FFmpeg's checkasm on real hardware.
Otherwise, this discussion is not much more than he-said-she-said.

> By contrast, the performance of strided loads and stores, while certainly
> slower than unit-strided loads and stores, likely scales linearly with the
> vector length, so on CPUs with large VLEN the original code could very well
> run faster than the variant with vrgather, despite the slower strided loads
> and stores.

Yes, but it's a stretch to expect that accessing memory will be faster than
accessing registers, especially when the dataset is typically too large to fit
in L1. Furthermore, strided loads require adders to compute the accessed
address - something VRGATHER (or even VLUXEI) does not need.

Some people wish that processor cores would special-case strides of minus
EEW/8 bytes, i.e. negative unit strides. And sure, that would be nice. But so
far that's just wishful thinking.
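
To make the two variants under discussion concrete, here is a minimal scalar
sketch in plain C. It only models what each approach computes, not how fast
any hardware runs it. All function names are hypothetical, the 64-element
bound is an arbitrary assumption, and this is not the FFmpeg code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a 32-bit strided load (VLSE32.V): element i comes from
 * base + i * stride, with the stride in bytes and possibly negative. */
static void strided_load32(uint32_t *vd, const void *base,
                           ptrdiff_t stride, size_t vl)
{
    const unsigned char *p = base;

    for (size_t i = 0; i < vl; i++)
        memcpy(&vd[i], p + (ptrdiff_t)i * stride, sizeof (vd[i]));
}

/* Scalar model of a register gather (VRGATHER.VV or VRGATHEREI16.VV):
 * vd[i] = vs2[vs1[i]], with no memory access involved. */
static void gather32(uint32_t *vd, const uint32_t *vs2,
                     const uint16_t *vs1, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        vd[i] = vs2[vs1[i]];
}

/* Variant A: reverse while loading, using a stride of -EEW/8 = -4 bytes.
 * Requires 1 <= vl. */
static void load_reversed_strided(uint32_t *vd, const uint32_t *src, size_t vl)
{
    strided_load32(vd, src + vl - 1, -(ptrdiff_t)sizeof (*src), vl);
}

/* Variant B: unit-stride load, then permute within registers.
 * Requires 1 <= vl <= 64 (arbitrary bound for this sketch). */
static void load_reversed_gather(uint32_t *vd, const uint32_t *src, size_t vl)
{
    uint32_t tmp[64];
    uint16_t idx[64];

    memcpy(tmp, src, vl * sizeof (*src));   /* models VLE32.V */
    for (size_t i = 0; i < vl; i++)         /* models VID.V + VRSUB.VX */
        idx[i] = (uint16_t)(vl - 1 - i);
    gather32(vd, tmp, idx, vl);
}
```

Both variants produce the same register contents. They differ only in where
the permutation happens: in the load unit for variant A, in the register
file for variant B. Which one is faster can only be settled by benchmarks.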

> > > The RISC-V vector loads and stores support negative stride values for 
> > > use cases such as this one.
> > 
> > [Citation required]
>
> The purpose of strided loads and stores is to load/store elements that are
> not consecutive in memory, but instead separated by a constant offset.
> Additionally, the authors of the specification decided to allow negative
> stride values, since they apparently deemed it useful to be able to reverse
> the order of those elements.

FFmpeg *still* uses strided loads and stores where applicable, typically where 
the stride is legitimately variable. I cannot find a justification that small 
constant non-unit strides would be a good idea anywhere though.

Just because you can use negative offsets does not mean that this will be 
optimised for negative-unit offsets. Again, I have only seen some wishful 
thinking from some developers here and there. I have yet to see a serious 
quote from an IP vendor or a benchmark that would support this.

> > > Using vrgather instead replaces the more specific operation with a 
> > > more generic one,
> > 
> > 
> > That is a very subjective and unsubstantiated assertion. This feels a bit
> > hypocritical while you are attacking me for not providing justification.
> 
> vrgather is more generic because it can be used for any kind of permutation,
> which strided loads and stores cannot. This is not subjective.

That would be a fair comparison of vrgather with hypothetical vreverse or 
vtranspose instructions. But you're comparing apples and oranges here.

> > As far as I can tell, neither instruction is specific to reversing vector
> > element order. An actual real-life specific instruction exists on Arm in
> > the form of vector-reverse. I don't know any ISA with load-reverse or
> > store-reverse.
> 
> A load-reverse or store-reverse would just be a special case of strided
> load/store.

By that logic, a unit-stride load is just a special case of a strided load, 
and a strided load is just a special case of an indexed load. From an 
architectural functional standpoint, that is indeed definitely true. From a 
hardware silicon design and microbenchmark standpoint, that is however 
certainly false.

> When writing about the performance of vrgather I primarily had the
> scalability issues explained above in mind. It seems that you have already
> experienced these, since you found that a larger LMUL reduces the
> performance of vrgather.

> How would the reverse subtraction be optimized away? I assume that it needs
> to be part of the loop since it depends on the VL of the current iteration.

VRSUB computes the same vector in all but the last two iterations. All you
need to do is make a special case for the tail iterations. Then VRSUB can be
run just twice for the whole function, zero times per loop iteration.
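
As a rough illustration of that hoisting, here is a scalar C sketch that
rebuilds the reversed index vector only when the vector length changes. It
rests on stated assumptions: VLMAX is a made-up constant, all names are
hypothetical, and vl is modelled as min(n, VLMAX), so only the final
iteration can differ here. A real VSETVLI is also allowed to shorten the
second-to-last iteration, which is why two tail iterations need care in
practice.

```c
#include <stddef.h>
#include <stdint.h>

#define VLMAX 8 /* hypothetical number of elements per vector register */

/* Models VID.V + VRSUB.VX: idx[i] = (vl - 1) - i. */
static void make_reverse_indices(uint16_t *idx, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        idx[i] = (uint16_t)(vl - 1 - i);
}

/* Models a register gather: dst[i] = src[idx[i]]. */
static void gather(float *dst, const float *src,
                   const uint16_t *idx, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = src[idx[i]];
}

/* Writes src[0..n) in reverse order into dst[0..n).
 * Returns how many times the index vector had to be computed. */
static unsigned reverse_copy(float *dst, const float *src, size_t n)
{
    uint16_t idx[VLMAX];
    size_t cur_vl = 0;          /* vl for which idx[] is currently valid */
    unsigned index_builds = 0;

    while (n > 0) {
        size_t vl = n < VLMAX ? n : VLMAX;   /* simplified VSETVLI */

        if (vl != cur_vl) {     /* taken once for full blocks, once for tail */
            make_reverse_indices(idx, vl);
            cur_vl = vl;
            index_builds++;
        }
        n -= vl;
        gather(dst + n, src, idx, vl);  /* first source block lands last */
        src += vl;
    }
    return index_builds;
}
```

With n = 20 and VLMAX = 8, the loop runs three times (vl = 8, 8, 4) but
builds the index vector only twice. Actual assembly would more likely peel
the tail into a separate code path than compare vl on every iteration.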

-- 
雷米‧德尼-库尔蒙
http://www.remlab.net/