[FFmpeg-devel] Some IWMMXT functions for libavcodec

Siarhei Siamashka siarhei.siamashka
Sat May 17 11:41:40 CEST 2008


On Thursday 15 May 2008, Dmitry Antipov wrote:
> Siarhei Siamashka wrote:
> > And the Intel manual recommends inserting non-memory related instructions
> > after 'wldrd'; you could quite conveniently insert a pointer increment by
> > stride there, or anything else. Back-to-back 'wldrd' instructions
> > introduce a pipeline stall.
>
> This probably requires additional investigation from my side. But
> "Intel Wireless MMX Technology Developer Guide" says:
>
> "...Currently, there are two 64-bit buffer slots for Load operations and
> one 64-bit buffer slot available for Store transactions. If the memory
> buffer is currently empty, the Memory pipeline resource availability delay
> is only one clock. However, if the buffer is currently full due to a
> sequence of memory transactions, the next instruction must wait for space
> in the buffer. The resource availability delay in this case is two
> cycles...
> ...The buffering in the Memory pipeline allows two Load transactions to be
> issued sequentially without incurring a penalty (stall). More than two
> outstanding Load transactions causes a stall and loss in performance."
>
> I.e. this code is good:
>
> wldrd wr0, [%1]
> wldrd wr1, [%2]
>
> But this code:
>
> wldrd wr0, [%1]
> wldrd wr1, [%2]
> wldrd wr2, [%3]
>
> is not so good since it will work, but will cause a memory pipeline stall. As
> you can see, I'm not doing more than 2 sequential loads with WLDRx.

Well, I was reading this document:
http://www.intel.com/design/intelxscale/314510.htm

And it states in section "D.3.2.3. Memory Control Pipeline" that:
"There is also an additional stall introduced by the core when 2 double 
word (64 bits) are issued back to back such as:
WLDRD or WSTRD
WLDR[B,H,W,D] or WSTR[B,H,W,D] <- 1 cycle stall.
Critical inner loop sequences can use non memory related instructions 
following a WLDRD or WSTRD."

Does Intel contradict itself? Or is there some variation between different
revisions of XScale cores, so that they have different optimization rules? Can
you provide a direct link to the document you are using?
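
To make the difference between the two readings concrete, here is a toy C
model that counts stall cycles for a short instruction sequence under each
rule. It is only a sketch: the enum, the function names and the simplification
that every stall costs exactly one cycle are assumptions made for illustration,
not something taken from either Intel document, and real hardware also depends
on load latency and store traffic.

```c
#include <stddef.h>

/* Hypothetical instruction classes, used only by this toy model. */
enum insn { WLDRD, OTHER };

/* Reading of section D.3.2.3: a 64-bit memory access issued directly
 * after another one costs one stall cycle. */
static int stalls_back_to_back(const enum insn *seq, size_t n)
{
    int stalls = 0;
    for (size_t i = 1; i < n; i++)
        if (seq[i - 1] == WLDRD && seq[i] == WLDRD)
            stalls++;
    return stalls;
}

/* Reading of the Developer Guide quote: two loads in a row are free,
 * every further consecutive load costs one stall cycle. Treating
 * "outstanding" as "consecutive" is a simplifying assumption. */
static int stalls_two_slot(const enum insn *seq, size_t n)
{
    int stalls = 0;
    int run = 0; /* length of the current run of consecutive loads */
    for (size_t i = 0; i < n; i++) {
        if (seq[i] != WLDRD) {
            run = 0;
            continue;
        }
        if (++run > 2)
            stalls++;
    }
    return stalls;
}
```

For two consecutive loads the first function reports one stall and the second
reports none, which is exactly the disagreement in question. Both report zero
once a non-memory instruction is placed between the loads.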


One more interesting issue with the WLDRD instruction is that, according to
the manual, it should support a register offset addressing mode. So you should
have been able to use:
    wldrd wr2, [%1, #8]
    wldrd wr1, [%1], %2
instead of
    wldrd wr1, [%1]
    wldrd wr2, [%1, #8]
    add %1, %1, %2

But the toolchain I'm using (I also tried gcc 4.3 and binutils 2.18) seems
to silently ignore the register offset and generate a wrong instruction here
(without the register postincrement). Either I'm misunderstanding something,
or it is a bug in binutils. Could you please investigate it further
and submit a bug report to binutils if needed?
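
For reference, the behaviour expected from the two-instruction form can be
modelled in plain C. This is only a sketch of the intended semantics (load from
the base address, then add the register offset to the base); the function names
are made up for illustration, and the 64-bit wR registers are represented by
ordinary uint64_t variables.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Three-instruction form: two loads followed by an explicit add. */
static const uint8_t *load_then_add(const uint8_t *p, ptrdiff_t stride,
                                    uint64_t *wr1, uint64_t *wr2)
{
    memcpy(wr1, p, 8);     /* wldrd wr1, [%1]     */
    memcpy(wr2, p + 8, 8); /* wldrd wr2, [%1, #8] */
    return p + stride;     /* add   %1, %1, %2    */
}

/* Two-instruction form: the second load is post-indexed by a register. */
static const uint8_t *load_post_indexed(const uint8_t *p, ptrdiff_t stride,
                                        uint64_t *wr1, uint64_t *wr2)
{
    memcpy(wr2, p + 8, 8); /* wldrd wr2, [%1, #8]              */
    memcpy(wr1, p, 8);     /* wldrd wr1, [%1], %2: load first, */
    return p + stride;     /* then %1 += %2                    */
}
```

Both functions should return the same pointer and fill the two registers with
the same data. An assembler that drops the register offset would correspond to
the second function returning p unchanged.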


-- 
Best regards,
Siarhei Siamashka



