[FFmpeg-devel] [PATCH] Demuxer for Leitch/Harris' VR native stream format (LXF)
Måns Rullgård
mans
Mon Sep 13 22:51:22 CEST 2010
Michael Niedermayer <michaelni at gmx.at> writes:
> On Mon, Sep 13, 2010 at 08:47:37PM +0100, M?ns Rullg?rd wrote:
>> Michael Niedermayer <michaelni at gmx.at> writes:
>>
>> > On Mon, Sep 13, 2010 at 04:56:40PM +0100, M?ns Rullg?rd wrote:
>> >> Daniel Verkamp <daniel at drv.nu> writes:
>> >>
>> >> > 2010/9/13 M?ns Rullg?rd <mans at mansr.com>:
>> >> >> Tomas H?rdin <tomas.hardin at codemill.se> writes:
>> >> >>
>> >> >>>> > > > +//returns number of bits set in value
>> >> >>>> > > > +static int num_set_bits(uint32_t value) {
>> >> >>>> > > > + ? ?int ret;
>> >> >>>> > > > +
>> >> >>>> > > > + ? ?for(ret = 0; value; ret += (value & 1), value >>= 1);
>> >> >>>> > > > +
>> >> >>>> > > > + ? ?return ret;
>> >> >>>> > > > +}
>> >> >>>> > >
>> >> >>>> > > if we dont have a population count function yet, than one should be added
>> >> >>>> > > to some header in libavutil
>> >> >>>> >
>> >> >>>> > I couldn't find one. That probably belongs in its own thread though.
>> >> >>>> >
>> >> >>>> > Which files would such a function belong in - intmath.h/c, common.h or
>> >> >>>> > somewhere else? Also, which name would be best: ff_count_bits(),
>> >> >>>> > av_count_bits() or something else?
>> >> >>>>
>> >> >>>> av_popcount()
>> >> >>>> would be similar to gccs __builtin_popcount()
>> >> >>>
>> >> >>> OK. I attached popcount.patch which adds such a function to common.h.
>> >> >>> Also bumped minor of lavu. The implementation uses a 16-byte LUT and
>> >> >>> therefore counts four bits at a time. I suspect there are better
>> >> >>> solutions though. I did verify that it returns exactly the same number
>> >> >>> the other implementation does for all 2^32 possible input values.
>> >> >>
>> >> >> I can't think of a better generic solution off the top of my head.
>> >> >
>> >> > There is at least one algorithm to do this without loops or lookup
>> >> > tables using SWAR tricks, but I haven't benchmarked it:
>> >> > http://aggregate.org/MAGIC/#Population Count (Ones Count)
>> >>
>> >> That method will be several times slower on any modern hardware.
>> >
>> > hardly
>> > the patch needs 32 operations (i assume its unrolled) 8 of which are
>> > table lookups which might be less than fast on some hw
>> > the aggregate.org code needs 15 operations the suggested modification for
>> > athlons (aka modernb cpus with fast multipler) would reduce that to 12
>>
>> Did you count the operations needed to construct the constants? I
>> think not.
>
> i used simple mathematical counting of operations as you surely see
> and as you know constants dont count there. a counting of cpu cycles
> of your specific embeded cpu on which constant construction is
> expensice i could not do, iam sorry.
Naive operation counting is useless. Even division counts a only 1
there, while everybody knows it takes many times longer than addition
on real hardware. Even on modern hardware Multiplication also takes
significantly longer than addition.
> but if you doubt that the aggregate algorithm is faster in modern desktop
> hardware i can benchmark that, i cannot benchmark it on arm though as i have
> none.
Whatever gcc produced (it's quite OK) for ARM takes about 15 cycles
(including setup) in both cases. Using a bigger table is obviously
faster (I measured). OTOH, avoiding the table might be worth
something too.
--
M?ns Rullg?rd
mans at mansr.com
More information about the ffmpeg-devel
mailing list