[FFmpeg-devel] [PATCH] Fix mm_flags, mm_support for ARM

Tue Jul 1 12:25:01 CEST 2008

On Tue, Jul 1, 2008 at 11:37 AM, Laurent Desnogues
<laurent.desnogues at gmail.com> wrote:
> On Tue, Jul 1, 2008 at 10:00 AM, Siarhei Siamashka
> <siarhei.siamashka at gmail.com> wrote:
>> Loading from memory (actually from L1 cache) is faster than using an immediate
>> operand. Just because you can easily initialize 2 registers per cycle instead
>> of just one (on ARM11). The only drawback is the higher latency. With
>> immediate operand, you can use this constant right away, but when loading it
>> from memory, you have to wait a bit.
>>
>> You have a number of constants to be loaded from memory anyway, putting one
>> more constant into the same cache line will have absolutely no impact on
>> performance.
>
> What you say is not always true:  when you have data close
> to instructions, you pollute your Icache with data, and your
> Dcache with instructions;  on top of that you make sure you
> need one Itlb *plus* one Dtlb entry.

In order to reduce instruction/data cache pollution, data and code can
be aligned at cache line boundaries, hence the use of .balign
directives.
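
For data defined in C rather than in assembly, the same idea can be sketched roughly as follows. This is only an illustration: the table name, its contents and the 32-byte line size are assumptions (32 bytes matches ARM9/ARM11; Cortex-A8 would need 64), and `_Alignas` requires a C11 compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 32-byte cache line (ARM9/ARM11). Use 64 for Cortex-A8. */
#define CACHE_LINE_SIZE 32

/* Hypothetical constant table, for illustration only. */
struct const_table {
    uint32_t c[8];
};

/* The whole table must fit into one cache line for the argument to hold. */
_Static_assert(sizeof(struct const_table) <= CACHE_LINE_SIZE,
               "constant table is larger than one cache line");

/* C11 counterpart of a .balign directive: start the table on a
   cache line boundary so that it does not share a line with code. */
static const _Alignas(CACHE_LINE_SIZE) struct const_table k_consts = {
    { 0x12345678u, 0x1234u, 0x12u, 0, 0, 0, 0, 0 }
};

/* Returns 1 if the table really starts on a cache line boundary. */
int consts_are_line_aligned(void)
{
    return ((uintptr_t)&k_consts % CACHE_LINE_SIZE) == 0;
}
```

Aligning the start is only half of the job: the code that follows the data also has to begin on a new line, otherwise the tail of the data line still holds instructions.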

Do you know of any way to generate ARM code that does not intermix
instructions with data? Keep in mind that all ARM instructions (I'm
not considering Thumb here) have a fixed size of 32 bits, so an
arbitrary constant cannot fit into an immediate operand. Moreover,
you can't encode an absolute address into an instruction and have it
fixed up by applying relocations. So absolute addresses are always
stored intermixed with code and accessed using pc-relative
addressing. Please try compiling something like the following
fragment to see what is generated (pay attention to how external
variables are accessed so that this code can be linked with other
object files):

extern int x;
extern int y;
extern int z;

void set_global_variables()
{
    x = 0x12345678;
    y = 0x1234;
    z = 0x12;
}
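
To see why only some of these constants fit into an instruction, here is a small sketch of the ARM data-processing immediate rule: an 8-bit value rotated right by an even number of bits. The function names are made up for this example, and it covers only the classic ARM encoding (not Thumb).

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n must be in 0..31). */
static uint32_t rotl32(uint32_t v, unsigned n)
{
    assert(n < 32);
    return n == 0 ? v : (v << n) | (v >> (32 - n));
}

/* Returns 1 if 'value' can be encoded as an ARM data-processing
   immediate, i.e. as an 8-bit constant rotated right by an even
   amount (0, 2, ..., 30); returns 0 otherwise. */
int arm_imm_encodable(uint32_t value)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a rotate-right by 'rot'; if what remains fits into
           8 bits, the constant is encodable with this rotation. */
        if (rotl32(value, rot) <= 0xFFu)
            return 1;
    }
    return 0;
}
```

With this check, 0x12 is encodable, while 0x1234 and 0x12345678 are not, so a single instruction cannot carry them as an immediate operand.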

> I think both approaches have to be benchmarked in real
> life situation, and on several processors.

Please do it. Any improvements are very much welcome. Based on your
previous posts, I assume that you have ARM hardware to run these
tests.

> Also when loading from memory, if your data side is blocking
> then you are basically stalling your pipeline while the data is
> loaded.

When all the data fits into a single cache line, adding one more
constant such that the data set still fits into that cache line will
not introduce extra cache misses. Is there anything wrong with this
statement (except for my English grammar)? The cache line is 32 bytes
on ARM9/ARM11 and 64 bytes on Cortex-A8.
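
The arithmetic behind this statement can be sketched as below. The helper name is made up for the example, and the offset is assumed to be measured from a cache-line-aligned base address.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cache lines touched by a block of 'size' bytes that starts
   'offset' bytes past a cache-line-aligned base address. */
size_t cache_lines_touched(size_t offset, size_t size, size_t line_size)
{
    assert(line_size != 0 && size != 0);
    size_t first = offset / line_size;              /* line holding the first byte */
    size_t last = (offset + size - 1) / line_size;  /* line holding the last byte */
    return last - first + 1;
}
```

For example, seven 32-bit constants (28 bytes) starting at a line boundary occupy one 32-byte line, and an eighth constant (32 bytes in total) still occupies that same line. Only a ninth constant would spill into a second line on ARM9/ARM11, while on Cortex-A8 with 64-byte lines it would not.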