[FFmpeg-devel] [RFC] AAC Encoder
Michael Niedermayer
michaelni
Sun Aug 17 16:41:39 CEST 2008
On Sun, Aug 17, 2008 at 02:57:48PM +0300, Kostya wrote:
> On Sun, Aug 17, 2008 at 03:08:58AM +0200, Michael Niedermayer wrote:
> > On Sat, Aug 16, 2008 at 06:00:39PM +0300, Kostya wrote:
> > > On Sat, Aug 16, 2008 at 03:57:56AM +0200, Michael Niedermayer wrote:
> > > > On Fri, Aug 15, 2008 at 07:59:52PM +0300, Kostya wrote:
[...]
> > > }
> > >
> > > if(cpe->ch[ch].ics.window_sequence[0] != EIGHT_SHORT_SEQUENCE){
> > > cpe->ch[ch].ics.use_kb_window[0] = 1;
> > > cpe->ch[ch].ics.num_windows = 1;
> > > cpe->ch[ch].ics.swb_sizes = apc->bands1024;
> > > cpe->ch[ch].ics.num_swb = apc->num_bands1024;
> > > cpe->ch[ch].ics.num_window_groups = 1;
> > > cpe->ch[ch].ics.group_len[0] = 1;
> > > }else{
> > > cpe->ch[ch].ics.use_kb_window[0] = 1;
> > > cpe->ch[ch].ics.num_windows = 8;
> > > cpe->ch[ch].ics.swb_sizes = apc->bands128;
> > > cpe->ch[ch].ics.num_swb = apc->num_bands128;
> >
> > > cpe->ch[ch].ics.num_window_groups = 4;
> > > for(i = 0; i < 4; i++)
> > > cpe->ch[ch].ics.group_len[i] = 2;
> >
> > this is not optimal
>
> of course. It's a simple test model without anything resembling optimal
so it is useless ...
random() at least would exercise all parts of the encoder for testing ...
[...]
> > [...]
> > > /**
> > > * window grouping information stored as bits (0 - new group, 1 - group continues)
> > > */
> > > static const uint8_t window_grouping[9] = {
> > > 0xB6, 0x6C, 0xD8, 0xB2, 0x66, 0xC6, 0x96, 0x36, 0x36
> > > };
> > >
> > > /**
> > > * Tell encoder which window types to use.
> > > * @see 3GPP TS26.403 5.4.1 "Blockswitching"
> > > */
> > > static void psy_3gpp_window(AACPsyContext *apc, int16_t *audio, int16_t *la, int tag, int type, ChannelElement *cpe)
> > > {
> > > int ch;
> > > int chans = type == TYPE_CPE ? 2 : 1;
> > > int i, j;
> > > int br = apc->avctx->bit_rate / apc->avctx->channels;
> > > int attack_ratio = (br <= 16000 + 8000*chans) ? 18 : 10;
> > > Psy3gppContext *pctx = (Psy3gppContext*) apc->model_priv_data;
> > > Psy3gppChannel *pch = &pctx->ch[tag];
> > > uint8_t grouping[2];
> > > enum WindowSequence win[2];
> > >
> > > if(la && !(apc->flags & PSY_MODEL_NO_SWITCH)){
> > > float s[8], v;
> > > for(ch = 0; ch < chans; ch++){
> > > enum WindowSequence last_window_sequence = cpe->ch[ch].ics.window_sequence[0];
> > > int switch_to_eight = 0;
> > > float sum = 0.0, sum2 = 0.0;
> > > int attack_n = 0;
> > > for(i = 0; i < 8; i++){
> > > for(j = 0; j < 128; j++){
> > > v = iir_filter(audio[(i*128+j)*apc->avctx->channels+ch], pch->iir_state[ch]);
> > > sum += v*v;
> > > }
> > > s[i] = sum;
> > > sum2 += sum;
> > > }
> > > for(i = 0; i < 8; i++){
> > > if(s[i] > pch->win_energy[ch] * attack_ratio){
> > > attack_n = i + 1;
> > > switch_to_eight = 1;
> > > break;
> > > }
> > > }
> > > pch->win_energy[ch] = pch->win_energy[ch]*7/8 + sum2/64;
> > >
> > > switch(last_window_sequence){
> > > case ONLY_LONG_SEQUENCE:
> > > win[ch] = switch_to_eight ? LONG_START_SEQUENCE : ONLY_LONG_SEQUENCE;
> > > grouping[ch] = 0;
> > > break;
> > > case LONG_START_SEQUENCE:
> > > win[ch] = EIGHT_SHORT_SEQUENCE;
> > > grouping[ch] = pch->next_grouping[ch];
> > > break;
> > > case LONG_STOP_SEQUENCE:
> > > win[ch] = ONLY_LONG_SEQUENCE;
> > > grouping[ch] = 0;
> > > break;
> > > case EIGHT_SHORT_SEQUENCE:
> > > win[ch] = switch_to_eight ? EIGHT_SHORT_SEQUENCE : LONG_STOP_SEQUENCE;
> > > grouping[ch] = switch_to_eight ? pch->next_grouping[ch] : 0;
> > > break;
> > > }
> > > pch->next_grouping[ch] = window_grouping[attack_n];
> >
> > this is limited to 9 of 256 possible groupings, not to mention that i have my
> > doubts about the optimality of the highpass based selection.
>
> 128, actually (the first window always belongs to group one)
> and it's all in the spec
using just 9 of 128 groupings still does not look much better to me
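A minimal sketch of where the figure 128 comes from: the first window always opens group one, and each of the remaining 7 windows either continues the current group or starts a new one. The helper names and the bit order (bit 6 for window 1, a set bit meaning "group continues") are assumptions for illustration, not the layout of the window_grouping[] table in the patch.

```c
#include <stdint.h>

/* Hypothetical helper: decode a 7-bit grouping mask into group lengths.
 * Assumed convention: bit 6 describes window 1, bit 0 describes window 7,
 * and a set bit means the window continues the current group.
 * Returns the number of groups. */
static int decode_grouping(uint8_t mask, uint8_t group_len[8])
{
    int num_groups = 1;
    int w;

    group_len[0] = 1;                       /* window 0 always opens group one */
    for (w = 1; w < 8; w++) {
        if (mask & (1 << (7 - w)))
            group_len[num_groups - 1]++;    /* window continues current group */
        else
            group_len[num_groups++] = 1;    /* window starts a new group */
    }
    return num_groups;
}

/* Count the masks that decode to a valid partition of the 8 short windows. */
static int count_groupings(void)
{
    uint8_t group_len[8];
    int mask, count = 0;

    for (mask = 0; mask < 128; mask++) {
        int n = decode_grouping((uint8_t)mask, group_len);
        int sum = 0, i;

        for (i = 0; i < n; i++)
            sum += group_len[i];
        count += (sum == 8);
    }
    return count;
}
```

Under this assumed convention the fixed "4 groups of 2" layout from the test model corresponds to the mask 0x55, which is one point in a search space an encoder could in principle evaluate exhaustively.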
[...]
> > > }
> > >
> > > /**
> > > * Determine scalefactors and prepare coefficients for encoding.
> > > * @see 3GPP TS26.403 5.4 "Psychoacoustic model"
> > > */
> > > static void psy_3gpp_process(AACPsyContext *apc, int tag, int type, ChannelElement *cpe)
> > > {
> > > int start;
> > > int ch, w, wg, g, i;
> > > int prev_scale;
> > > Psy3gppContext *pctx = (Psy3gppContext*) apc->model_priv_data;
> > > float pe_target;
> > > int bits_avail;
> > > int chans = type == TYPE_CPE ? 2 : 1;
> > > Psy3gppChannel *pch = &pctx->ch[tag];
> > >
> > > //calculate energies, initial thresholds and related values - 5.4.2 "Threshold Calculation"
> > > memset(pch->band, 0, sizeof(pch->band));
> > > for(ch = 0; ch < chans; ch++){
> > > start = 0;
> > > for(w = 0; w < cpe->ch[ch].ics.num_windows*16; w += 16){
> > > for(g = 0; g < cpe->ch[ch].ics.num_swb; g++){
> > > for(i = 0; i < cpe->ch[ch].ics.swb_sizes[g]; i++)
> > > pch->band[ch][w+g].energy += cpe->ch[ch].coeffs[start+i] * cpe->ch[ch].coeffs[start+i];
> > > pch->band[ch][w+g].energy /= 262144.0f;
> > > pch->band[ch][w+g].thr = pch->band[ch][w+g].energy * 0.001258925f;
> > > start += cpe->ch[ch].ics.swb_sizes[g];
> >
> > > if(pch->band[ch][w+g].energy != 0.0){
> > > float ffac = 0.0;
> > >
> > > for(i = 0; i < cpe->ch[ch].ics.swb_sizes[g]; i++)
> > > ffac += sqrt(FFABS(cpe->ch[ch].coeffs[start+i]));
> > > pch->band[ch][w+g].ffac = ffac / sqrt(512.0);
> > > }
> >
> > apparently not used before M/S and it's calculated after M/S again
>
> it's recalculated only for M/S bands
it's recalculated unnecessarily (for M/S bands)
[...]
> > > pch->band[ch][w+g].thr = FFMAX(pch->band[ch][w+g].thr, pch->band[ch][w+g].thr_quiet * 0.25);
> > > }
> > > }
> > > }
> > >
> > > // M/S detection - 5.5.2 "Mid/Side Stereo"
> > > if(chans > 1 && cpe->common_window){
> > > start = 0;
> > > for(w = 0; w < cpe->ch[0].ics.num_windows*16; w += 16){
> > > for(g = 0; g < cpe->ch[0].ics.num_swb; g++){
> > > double en_m = 0.0, en_s = 0.0, ff_m = 0.0, ff_s = 0.0, minthr;
> > > float m, s;
> > >
> > > cpe->ms_mask[w+g] = 0;
> > > if(pch->band[0][w+g].energy == 0.0 || pch->band[1][w+g].energy == 0.0)
> > > continue;
> > > for(i = 0; i < cpe->ch[0].ics.swb_sizes[g]; i++){
> > > m = cpe->ch[0].coeffs[start+i] + cpe->ch[1].coeffs[start+i];
> > > s = cpe->ch[0].coeffs[start+i] - cpe->ch[1].coeffs[start+i];
> > > en_m += m*m;
> > > en_s += s*s;
> > > }
> > > en_m /= 262144.0*4.0;
> > > en_s /= 262144.0*4.0;
> > > minthr = FFMIN(pch->band[0][w+g].thr, pch->band[1][w+g].thr);
> >
> > > if(minthr * minthr * pch->band[0][w+g].energy * pch->band[1][w+g].energy >= (pch->band[0][w+g].thr * pch->band[1][w+g].thr * en_m * en_s)){
> >
> > i have in my previous review simplified this line already
> >
> > anyway before the AAC encoder can reach svn it MUST be significantly improved
> > in terms of optimality as well as code cleanliness
> > basically everything should be RD optimal unless either a faster and equally
> > good heuristic exists or the RD optimal code is too slow.
>
> well, that requires developing a much better RD-aware psy model.
Are you saying that what you implemented is much worse than what's possible?
Well, if you are saying that then I'll believe it, so please implement the best
;)
Besides, a psychoacoustic model IMHO produces perceptual weights, either per
band or per coefficient.
Everything else should be done per RD theory.
Now in principle other decisions could also be made in a psychoacoustically
aware way, but as we have seen from the quantization this is clearly not the
case in the current model.
What the current model does is calculate these weights (in the form of
scale factors); the rest has absolutely nothing to do with psychoacoustics.
It is just a trivial reference quantizer, trivial M/S selection based on better
decorrelation, and trivial IIR filter based short window selection [and this one
is even suboptimal in its own way as it limits itself to 9 out of 128
groupings].
The scalefactors from the psy model should be usable as RD factors for
weighting between rate and distortion. I am pretty sure a relation like
lambda = A*sf^B with A and B constants should be more than good enough
for our purposes; it is for MPEG-4 ASP. I guess Loren or Dark Shikari can
comment on what relation is commonly used in H.264 between the quantization
factor (NOT QP, which has a log scale) and lambda?
And to preempt the question about the values of A and B: they can be found
simply by comparing an RD based encoder (which selects scalefactors based on
the lambda values) to the 3gpp model. A and B should be selected so that
both encoders choose the most similar scalefactors.
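A rough sketch of the RD decision this implies: each candidate encoding of a band has a distortion D and a rate R in bits, and the encoder keeps the one with the smallest J = D + lambda*R. The struct and function names are hypothetical and not part of the patch. lambda is passed in directly here; in the scheme above it would be derived from the psy model's scalefactor as A*sf^B, with A and B still to be fitted.

```c
#include <stddef.h>

/* Hypothetical type: one candidate encoding of a band. */
typedef struct {
    double distortion;  /* weighted quantization error D */
    double bits;        /* rate R needed to code this candidate */
} RDCandidate;

/* Return the index of the candidate minimizing J = D + lambda * R.
 * lambda is assumed to come from the psy model, e.g. A * sf^B. */
static size_t rd_pick(const RDCandidate *c, size_t n, double lambda)
{
    size_t best = 0;
    size_t i;

    for (i = 1; i < n; i++) {
        double j_i    = c[i].distortion    + lambda * c[i].bits;
        double j_best = c[best].distortion + lambda * c[best].bits;

        if (j_i < j_best)
            best = i;
    }
    return best;
}
```

With a small lambda the choice leans toward low distortion at the cost of more bits; with a large lambda it leans toward cheap candidates. That is the lever a fitted A and B would control.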
i will review the patch later, you have plenty of ideas to work on left as
far as i can see.
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Those who are too smart to engage in politics are punished by being
governed by those who are dumber. -- Plato