[FFmpeg-devel] Nellymoser encoder

Michael Niedermayer michaelni
Sat Aug 30 18:10:41 CEST 2008


On Sat, Aug 30, 2008 at 03:42:37PM +0200, Bartlomiej Wolowiec wrote:
> On Friday 29 August 2008 22:36:10 Michael Niedermayer wrote:
> > > > > > > +
> > > > > > > +void apply_mdct(NellyMoserEncodeContext *s, float *in, float *coefs)
> > > > > > > +{
> > > > > > > +    DECLARE_ALIGNED_16(float, in_buff[NELLY_SAMPLES]);
> > > > > > > +
> > > > > > > +    memcpy(&in_buff[0], &in[0], NELLY_SAMPLES * sizeof(float));
> > > > > > > +    s->dsp.vector_fmul(in_buff, ff_sine_128, NELLY_BUF_LEN);
> > > > > > > +    s->dsp.vector_fmul_reverse(in_buff + NELLY_BUF_LEN,
> > > > > > > +                               in_buff + NELLY_BUF_LEN,
> > > > > > > +                               ff_sine_128, NELLY_BUF_LEN);
> > > > > > > +    ff_mdct_calc(&s->mdct_ctx, coefs, in_buff);
> > > > > > > +}
> > > > > >
> > > > > > The data is copied once in encode_frame and twice here.
> > > > > > There is no need to copy the data 3 times.
> > > > > > vector_fmul can be used with a single memcpy to get the data into
> > > > > > any destination, and vector_fmul_reverse doesn't even need a memcpy,
> > > > > > so overall a single memcpy is enough.
> > > > >
> > > > > Hope that you meant something similar to my solution.
> > > >
> > > > no, you still do 2 memcpy() but now the code is really messy as well.
> > > >
> > > > what you should do is, for each block of samples you get from the user:
> > > > 1. apply one half of the window onto it with vector_fmul_reverse, with
> > > >    some internal buffer as the destination
> > > > 2. memcpy it into the 2nd destination and apply the other half of the
> > > >    window onto it with vector_fmul
> > > > 3. run the mdct as appropriate on the internal buffers.
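The three steps above can be sketched as plain C; `window_block`, `BUF_LEN` and the loop-based stand-ins for the dsputil calls are illustrative (the real code would use `s->dsp.vector_fmul` / `s->dsp.vector_fmul_reverse` on aligned buffers), but they show how one memcpy per block suffices:

```c
#include <string.h>

#define BUF_LEN 128  /* stands in for NELLY_BUF_LEN */

/* plain-C stand-ins for the dsputil calls (assumed semantics) */
static void vector_fmul(float *dst, const float *src, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] *= src[i];
}

static void vector_fmul_reverse(float *dst, const float *src0,
                                const float *src1, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] = src0[i] * src1[len - 1 - i];
}

/* One new block of samples arrives; window it with a single memcpy:
 * - the falling window half goes into the tail of the previous MDCT
 *   input via vector_fmul_reverse (no copy needed),
 * - the rising half is copied once and then multiplied in place. */
static void window_block(const float *in, const float *window,
                         float *prev_tail, float *cur_head)
{
    vector_fmul_reverse(prev_tail, in, window, BUF_LEN); /* no memcpy */
    memcpy(cur_head, in, BUF_LEN * sizeof(float));       /* the single copy */
    vector_fmul(cur_head, window, BUF_LEN);
}
```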
> > >
> > > Hmm, I considered it, but I don't understand exactly what I should
> > > change... In the code I copy data two times:
> > > a) in encode_frame - I convert int16_t to float and copy the data to
> > > s->buf - I need to do it somewhere because vector_fmul requires float *.
> > > Additionally, part of the data is needed for the next call of encode_frame.
> > > b) in apply_mdct - here I think some additional part of the buffer is
> > > needed. If I understood correctly I have to get rid of a), but how do I
> > > get access to the old data when the next call of encode_frame is
> > > performed, and how do I call vector_fmul on int16_t?
> >
> > have you tried setting AVCodec.sample_fmts to SAMPLE_FMT_FLT?
> > I think ffmpeg should support this already. If it does not work then we can
> > keep int16 for now, which would imply more copying.
> 
> Hmm... I tried to use SAMPLE_FMT_FLT, but something doesn't work. I made only 
> these changes:
> 
> float *samples = data;
> ...
> for (i = 0; i < avctx->frame_size; i++) {
>     s->buf[s->bufsel][i] = samples[i]*(1<<15);
> }
> ...
> .sample_fmts = (enum SampleFormat[]){SAMPLE_FMT_FLT,SAMPLE_FMT_NONE},

hmm


> 

> [...]
> > > +
> > > +#define find_best(val, table, LUT, LUT_add, LUT_size) \
> > > +    best_idx = \
> > > +        LUT[av_clip ((((int)val) >> 8) + LUT_add, 0, LUT_size - 1)]; \
> > > +    if (abs(val - table[best_idx]) > abs(val - table[best_idx + 1])) \
> > > +        best_idx++;
> >
> > (int)some_float is slow, lrintf() should be faster
> 
> yes, but here rounding down is needed. lrintf(x-0.5) ?

ok but merge that 0.5 into LUT_add


> 
> > also if val is a float instead of an int then fabs() may actually be better
> > than abs()
> ok
> 
> > > +
> > > +/**
> > > + * Encodes NELLY_SAMPLES samples. It assumes that samples contains
> > > + * 3 * NELLY_BUF_LEN values.
> > > + *  @param s               encoder context
> > > + *  @param output          output buffer
> > > + *  @param output_size     size of output buffer
> > > + */
> > > +static void encode_block(NellyMoserEncodeContext *s, unsigned char *output,
> > > +                         int output_size)
> > > +{
> > > +    PutBitContext pb;
> > > +    int i, band, block, best_idx, power_idx = 0;
> > > +    float power_val, power_candidate, coeff, coeff_sum;
> > > +    int band_start, band_end;
> > > +    float pows[NELLY_FILL_LEN];
> > > +    int bits[NELLY_BUF_LEN];
> > > +
> > > +    const float C = 1.0;
> > > +    const float D = 2.0;
> > > +
> > > +    apply_mdct(s);
> > > +
> > > +    init_put_bits(&pb, output, output_size * 8);
> > > +
> > >
> > > +    band_start = 0;
> > > +    band_end = ff_nelly_band_sizes_table[0];
> > > +    for (band = 0; band < NELLY_BANDS; band++) {
> > > +        coeff_sum = 0;
> > > +        for (i = band_start; i < band_end; i++) {
> > > +            //coeff_sum += s->mdct_out[i                ] * s->mdct_out[i                ]
> > > +            //           + s->mdct_out[i + NELLY_BUF_LEN] * s->mdct_out[i + NELLY_BUF_LEN];
> > > +            coeff_sum += pow(fabs(s->mdct_out[i]), D)
> > > +                       + pow(fabs(s->mdct_out[i + NELLY_BUF_LEN]), D);
> > > +        }
> > > +        power_candidate =
> > > +            //log(FFMAX(1.0, coeff_sum / (ff_nelly_band_sizes_table[band] << 7))) * 1024.0 / M_LN2;
> > > +            C * log(FFMAX(1.0, coeff_sum / (ff_nelly_band_sizes_table[band] << 7))) * 1024.0 / log(D);
> > > +
> > > +        if (band) {
> > > +            power_candidate -= power_idx;
> > > +            find_best(power_candidate, ff_nelly_delta_table, sf_delta_lut, 37, 78);
> > > +            put_bits(&pb, 5, best_idx);
> > > +            power_idx += ff_nelly_delta_table[best_idx];
> > > +        } else {
> > > +            //base exponent
> > > +            find_best(power_candidate, ff_nelly_init_table, sf_lut, -20, 96);
> > > +            put_bits(&pb, 6, best_idx);
> > > +            power_idx = ff_nelly_init_table[best_idx];
> > > +        }
> >
> > the choice of power_idx/best_idx values could still be found with a
> > viterbi search. It's somewhat similar to (and simpler than) our
> > viterbi/trellis ADPCM encoder.
> 
> Ok. I read some information about viterbi. 
> ( http://en.wikipedia.org/wiki/Viterbi_algorithm ;) ).
> I see that it should be assumed:
> states - the values - there will be about 2^16 of them?
> observations - the successive values of power_candidate
> probabilities - e.g. |searched value - assumed value|^C
> transition probability - anything, e.g. 1

Wiki is using the term "viterbi algorithm" in a very narrow sense;
I've not seen anyone outside wiki do that.
Wikipedians seem to love describing the same algorithm 50 times
with different variables and attributing it to different authors. I am not
disputing that it has been reinvented >50 times in various fields, but
it's still the same algorithm.
I suspect several points of what is under
"Algorithms that use dynamic programming" are identical to viterbi, just
with different variable names.

Considering wiki's narrow view, "viterbi" might not be the best choice of
reference, as it is written with a specific use case in mind that is different
from ours. I am not sure there is an abstract description outside of
specific use cases on wiki.


> 
> and write it cleverer. It seems to me, that it will be a quite slow algorithm. 

yes, if it's not written cleverly, it would be.

let's first agree on a little ad-hoc terminology that is closer to our
use case:
there are the power_candidate values, the ideal values we want to store but
likely won't be able to store exactly.

there is a graph with bands*2^16 nodes, a column for each band and a row
for each value.
We want to find the path with the least "weight" through this graph from
left to right. The weight of each node is simply the (squared) distance
from its power_candidate. The steps we are allowed to take correspond to
the 32 difference or 64 start values that could be stored.

Without optimizations, we would thus have to keep track of up to 2^16
paths and their weights and try each of these paths with each of the
possible 32 differences for each band, that's

23*2^16*32 ~ 50 million operations per frame

If we now limit ourselves to just the 2000 nodes surrounding each
power_candidate, there would be just 2000*32 transitions to consider; but as
only <=6 of the values from the difference table would fall within the next
2000-node area, it is really just 2000*6, making the overall operation count

23*2000*6 = 276000 operations per frame, which is IMHO an acceptable speed
for a high quality mode.
And it is unlikely that the globally optimal path will have nodes far away
from the 2000 surrounding ones; besides, it is easy to adjust this for
speed vs. quality.
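A minimal sketch of that windowed trellis, under stated assumptions: `trellis_quant`, `WIN`, and the table shapes are illustrative, not the real Nellymoser tables or code. It keeps only the WIN values centered on each band's candidate, seeds the first band from the init table, relaxes through the delta table, and backtracks the cheapest path:

```c
#define MAX_BANDS 23
#define WIN       2000   /* nodes kept around each band's power_candidate */

/* Find per-band values out[0..n-1] minimizing the summed squared distance
 * to the ideal values cand[0..n-1], where out[0] must come from the init
 * table and each out[b+1]-out[b] must be an entry of the delta table. */
static double trellis_quant(const int *cand, int n,
                            const int *init_tab, int n_init,
                            const int *delta_tab, int n_delta,
                            int *out)
{
    static double cost[MAX_BANDS][WIN];
    static int    from[MAX_BANDS][WIN];
    int base[MAX_BANDS];                 /* lowest value in band's window */

    for (int b = 0; b < n; b++) {
        base[b] = cand[b] - WIN / 2;
        for (int i = 0; i < WIN; i++)
            cost[b][i] = 1e30;
    }

    /* first band: seed from the init table */
    for (int i = 0; i < n_init; i++) {
        int idx = init_tab[i] - base[0];
        double c = (double)(init_tab[i] - cand[0]) * (init_tab[i] - cand[0]);
        if (idx >= 0 && idx < WIN && c < cost[0][idx]) {
            cost[0][idx] = c;
            from[0][idx] = -1;
        }
    }

    /* later bands: only deltas landing inside the next window survive */
    for (int b = 1; b < n; b++)
        for (int i = 0; i < WIN; i++) {
            if (cost[b - 1][i] >= 1e30)
                continue;
            int v = base[b - 1] + i;
            for (int d = 0; d < n_delta; d++) {
                int idx = v + delta_tab[d] - base[b];
                if (idx < 0 || idx >= WIN)
                    continue;
                double c = cost[b - 1][i] +
                    (double)(base[b] + idx - cand[b]) * (base[b] + idx - cand[b]);
                if (c < cost[b][idx]) {
                    cost[b][idx] = c;
                    from[b][idx] = i;
                }
            }
        }

    /* pick the cheapest end node and walk the path back */
    int best = 0;
    for (int i = 1; i < WIN; i++)
        if (cost[n - 1][i] < cost[n - 1][best])
            best = i;
    double total = cost[n - 1][best];
    for (int b = n - 1; b >= 0; b--) {
        out[b] = base[b] + best;
        best = from[b][best];
    }
    return total;
}
```

The inner relaxation only touches transitions whose targets fall inside the next band's window, which is exactly what cuts the 2000*32 candidate transitions down to roughly 2000*6.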


An alternative would be, instead of keeping the N paths that are closest
to the current power_candidate, to keep the N overall best paths so far;
that's what adpcm.c does, and it likely has better quality per speed,
but is harder to implement.


> Maybe, relying only on this idea, it would be better to allow 
> transitions to e.g. the 3 best states? (but personally I hardly see a 
> connection with viterbi here...).

optimized viterbi :)

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Everything should be made as simple as possible, but not simpler.
-- Albert Einstein


