[FFmpeg-devel] [PATCH 01/N] RV30/40 Decoder - Core

Sun Dec 9 00:00:04 CET 2007

On Sat, Dec 08, 2007 at 12:10:17PM +0200, Kostya wrote:
> On Sat, Dec 08, 2007 at 05:32:07AM +0100, Michael Niedermayer wrote:
> > On Fri, Dec 07, 2007 at 09:04:07AM +0200, Kostya wrote:
> [...]
> > > > please check if the following are identical (+-1 doesnt count as
> > > > different)
> > > > 
> > > > svq3_get_ue_golomb() and 
> > > > svq3_get_se_golomb()
> > > > 
> > > > and if they are identical post START/STOP_TIMER scores
> > > > also id like to see START/STOP_TIMER scores in that case for the
> > > > original vlc based code
> > > > 
> > > > the fastest should be placed in golomb.c/h
> > > > and used
> > > 
> > > I don't know when I can do benchmarks but switched to svq3_get_[su]e_golomb()
> > > anyway.
> > 
> > ok, just dont forget about the benchmarks please
> 
> Hmm, I don't have large and long enough clip to show timer values.

no? it shows nothing? just a blank screen? or all have the same value? but
then you can show them ...

you can also put the START/STOP_TIMER over a block or mb decode function
or over the for loop in svq3.c after
/* decode prediction codes for luma blocks */
that contains always 8 svq3_get_ue_golomb()

> I guess it is fast enough. For the reference, rv34_inv_transform() takes 50 dezicycles.

its not the question if its fast enough i want to know which is the fastest

>  
> [...]
> > [...]
> > > /** essential slice information */
> > > typedef struct SliceInfo{
> > >     int type;              ///< slice type (intra, inter)
> > >     int size;              ///< size of the slice in bits
> > >     int quant;             ///< quantizer used for this slice
> > >     int vlc_set;           ///< VLCs used for this slice
> > >     int start, end;        ///< start and end macroblocks of the slice
> > >     int width;             ///< coded width
> > >     int height;            ///< coded height
> > > }SliceInfo;
> > > 
> > > /** decoder context */
> > > typedef struct RV34DecContext{
> > >     MpegEncContext s;
> > >     int8_t *intra_types_hist;///< old block types, used for prediction
> > >     int8_t *intra_types;     ///< block types
> > >     int block_start;         ///< start of slice in blocks
> > >     uint8_t *luma_dc_quant_i;///< luma subblock DC quantizer for intraframes
> > >     uint8_t *luma_dc_quant_p;///< luma subblock DC quantizer for interframes
> > > 
> > >     RV34VLC *cur_vlcs;       ///< VLC set used for current frame decoding
> > >     int bits;                ///< slice size in bits
> > >     H264PredContext h;       ///< functions for 4x4 and 16x16 intra block prediction
> > >     SliceInfo si;            ///< current slice information
> > >     uint8_t *slice_data;     ///< saved slice data
> > > 
> > >     int *mb_type;            ///< internal macroblock types
> > >     int block_type;          ///< current block type
> > >     int luma_vlc;            ///< which VLC set will be used for decoding of luma blocks
> > >     int chroma_vlc;          ///< which VLC set will be used for decoding of chroma blocks
> > >     int is16;                ///< current block has additional 16x16 specific features or not
> > >     int dmv[4][2];           ///< differential motion vectors for the current macroblock
> > > 
> > >     int rv30;                ///< indicates which RV variasnt is currently decoded
> > >     int rpr;                 ///< one field size in RV30 slice header
> > > 
> > >     int avail[4];            ///< whether left, top, top rights and top left MBs are available
> > > 
> > >     int (*parse_slice_header)(struct RV34DecContext *r, GetBitContext *gb, SliceInfo *si);
> > >     int (*decode_mb_info)(struct RV34DecContext *r);
> > >     int (*decode_intra_types)(struct RV34DecContext *r, GetBitContext *gb, int8_t *dst);
> > >     void (*loop_filter)(struct RV34DecContext *r);
> > > }RV34DecContext;
> > 
> > SliceInfo.size  is redundant with bits
> > SliceInfo.quant with qscale
> > SliceInfo.start with block_start
> 
> dropped size, and start is used in calculations.

block_start=start
both are used, but they have the same value you can drop one if not
id like to hear what the difference between the 2 is

> Maybe it will be used in parallel decoding as well.

i thought you said its fast enough? 

>  
> > slice_data is a very ugly way to pass an argument
> > id use:
> >     slice_data = buf + offset;
> >     rv34_decode_slice(r, r->si.size, r->si.end, &last, slice_data);
> > instead of
> >     r->slice_data = buf + offset;
> >     rv34_decode_slice(r, r->si.size, r->si.end, &last);
> > theres also only one use of it ...
> 
> changed

you can drop slice_data out of the struct now as well as its unused

[...]

>     for(mask = 8; mask; mask >>= 1, curshift++){
>         if(!(pattern & mask)) continue;
>         cbp |= get_vlc2(gb, vlc->cbp[table][ones].table, vlc->cbp[table][ones].bits, 1) << curshift[0];

if(pattern & mask)
    cbp |= get_vlc2(gb, vlc->cbp[table][ones].table, vlc->cbp[table][ones].bits, 1) << curshift[0];

[...]
>     if(!is_block2){
>         decode_coeff(dst+1, coeffs[1], 2, gb, vlc);
>         decode_coeff(dst+8, coeffs[2], 2, gb, vlc);
>     }else{
>         decode_coeff(dst+8, coeffs[1], 2, gb, vlc);
>         decode_coeff(dst+1, coeffs[2], 2, gb, vlc);
>     }

if(is_block2){
    decode_coeff(dst+8, coeffs[1], 2, gb, vlc);
    decode_coeff(dst+1, coeffs[2], 2, gb, vlc);
}else{
    decode_coeff(dst+1, coeffs[1], 2, gb, vlc);
    decode_coeff(dst+8, coeffs[2], 2, gb, vlc);
}

[...]
>     switch(block_type){
>     case RV34_MB_P_16x16:
>     case RV34_MB_P_MIX16x16:
>         if(has_C){
>             C[0] = s->current_picture_ptr->motion_val[0][mv_pos-s->b8_stride+2][0];
>             C[1] = s->current_picture_ptr->motion_val[0][mv_pos-s->b8_stride+2][1];
>         }
>         break;
>     case RV34_MB_P_8x8:
>         mv_pos += (subblock_no & 1) + (subblock_no >> 1)*s->b8_stride;
>         if(subblock_no & 1) has_A = 1;
>         if(subblock_no & 2) has_B = 1;
>         if(subblock_no == 2) has_C = 1;
>         if(!subblock_no) has_C = has_B;
>         if(has_C){
>             C[0] = s->current_picture_ptr->motion_val[0][mv_pos-s->b8_stride+1][0];
>             C[1] = s->current_picture_ptr->motion_val[0][mv_pos-s->b8_stride+1][1];
>         }
>         if(subblock_no == 3){
>             has_C = 1;
>             C[0] = s->current_picture_ptr->motion_val[0][mv_pos-s->b8_stride-1][0];
>             C[1] = s->current_picture_ptr->motion_val[0][mv_pos-s->b8_stride-1][1];
>         }
>         break;
>     case RV34_MB_P_16x8:
>         mv_pos += subblock_no*s->b8_stride;
>         has_B |= subblock_no;
>         has_C &= ~subblock_no;
>         if(has_C){
>             C[0] = s->current_picture_ptr->motion_val[0][mv_pos-s->b8_stride+2][0];
>             C[1] = s->current_picture_ptr->motion_val[0][mv_pos-s->b8_stride+2][1];
>         }
>         break;
>     case RV34_MB_P_8x16:
>         mv_pos += subblock_no;
>         has_A |= subblock_no;
>         if(!subblock_no) has_C = has_B;
>         if(has_C){
>             C[0] = s->current_picture_ptr->motion_val[0][mv_pos-s->b8_stride+1][0];
>             C[1] = s->current_picture_ptr->motion_val[0][mv_pos-s->b8_stride+1][1];
>         }
>         break;
>     default:
>         has_A = has_B = has_C = 0;
>     }

C_off= part_sizes_w[block_type];

mv_pos += (subblock_no & 1) + (subblock_no >> 1)*s->b8_stride;  
yes fix the subblock_no to make sense!

if(subblock_no & 1) has_A = 1;
if(subblock_no & 2) has_B = 1;
if(subblock_no == 3) C_off =-1;
if((subblock_no&1) + C_off < 2){
    has_C = has_B;
}else if(subblock_no&2)
    has_C =0;
if(has_C){
    C[0] = s->current_picture_ptr->motion_val[0][mv_pos-s->b8_stride+C_off][0];
    C[1] = s->current_picture_ptr->motion_val[0][mv_pos-s->b8_stride+C_off][1];
}

[...]
>     if(!has_C){
>         if(!has_B || (!has_A && !r->rv30)){
>             C[0] = A[0];
>             C[1] = A[1];
>         }else{
>             C[0] = s->current_picture_ptr->motion_val[0][mv_pos-s->b8_stride-1][0];
>             C[1] = s->current_picture_ptr->motion_val[0][mv_pos-s->b8_stride-1][1];
>         }
>     }

this looks suspect, i have my doubts about it being correct
first for a 16x8 block at the left border it would set C from uninitalized
memory, possibly even segfaulting
second it would use previous slices for prediction

but then rv is crap so i guess your code is correct ...

[...]
> /**
>  * Predict motion vector for B-frame macroblock.
>  */
> static inline void rv34_pred_b_vector(int A[2], int B[2], int C[2], int no_A, int no_B, int no_C, int *mx, int *my)
> {
>     if(no_A + no_B + no_C){
>         *mx = A[0] + B[0] + C[0];
>         *my = A[1] + B[1] + C[1];
>         if(no_A + no_B + no_C == 1){
>             *mx /= 2;
>             *my /= 2;
>         }
>     }else{
>         *mx = mid_pred(A[0], B[0], C[0]);
>         *my = mid_pred(A[1], B[1], C[1]);
>     }
> }

could you please here as well replace the negated flags by normal ones, it
would besides better readability avoid the negations on the calls

[...]
>     if(!r->avail[0])
>         has_A = 0;
>     else{
>         int A_type = r->mb_type[mb_pos - 1];
>         has_A = A_type == RV34_MB_B_BIDIR || A_type == (dir ? RV34_MB_B_BACKWARD : RV34_MB_B_FORWARD);
> 
>         if(has_A){
>             A[0] = s->current_picture_ptr->motion_val[dir][mv_pos - 1][0];
>             A[1] = s->current_picture_ptr->motion_val[dir][mv_pos - 1][1];
>         }
>     }
>     if(!r->avail[1]){
>         has_B = 0;
>     }else{
>         int B_type = r->mb_type[mb_pos - s->mb_stride];
>         has_B = B_type == RV34_MB_B_BIDIR || B_type == (dir ? RV34_MB_B_BACKWARD : RV34_MB_B_FORWARD);
> 
>         if(has_B){
>             B[0] = s->current_picture_ptr->motion_val[dir][mv_pos - s->b8_stride][0];
>             B[1] = s->current_picture_ptr->motion_val[dir][mv_pos - s->b8_stride][1];
>         }
>     }
>     if(r->avail[2]){
>         int C_type = r->mb_type[mb_pos - s->mb_stride + 1];
>         has_C = C_type == RV34_MB_B_BIDIR || C_type == (dir ? RV34_MB_B_BACKWARD : RV34_MB_B_FORWARD);
> 
>         if(has_C){
>             C[0] = s->current_picture_ptr->motion_val[dir][mv_pos - s->b8_stride + 1][0];
>             C[1] = s->current_picture_ptr->motion_val[dir][mv_pos - s->b8_stride + 1][1];
>         }
>     }else if(r->avail[3]){
>         int C_type = r->mb_type[mb_pos - s->mb_stride - 1];
>         has_C = C_type == RV34_MB_B_BIDIR || C_type == (dir ? RV34_MB_B_BACKWARD : RV34_MB_B_FORWARD);
> 
>         if(has_C){
>             C[0] = s->current_picture_ptr->motion_val[dir][mv_pos - s->b8_stride - 1][0];
>             C[1] = s->current_picture_ptr->motion_val[dir][mv_pos - s->b8_stride - 1][1];
>         }
>     }else
>         has_C = 0;

the code above is quadrupled
also with the normal lavc mb_types the direction would be just a bit and
wouldnt need long and obfuscated checks

[...]
> /**
>  * generic motion compensation function
>  *
>  * @param r decoder context
>  * @param block_type type of the current block
>  * @param xoff horizontal offset from the start of the current block
>  * @param yoff vertical offset from the start of the current block
>  * @param mv_off offset to the motion vector information
>  * @param width width of the current partition in 8x8 blocks
>  * @param height height of the current partition in 8x8 blocks
>  */
> static inline void rv34_mc(RV34DecContext *r, const int block_type,
>                           const int xoff, const int yoff, int mv_off,
>                           const int width, const int height, int dir,
>                           qpel_mc_func (*qpel_mc)[16],
>                           h264_chroma_mc_func (*chroma_mc))
> {
>     MpegEncContext *s = &r->s;
>     uint8_t *Y, *U, *V, *srcY, *srcU, *srcV;
>     int dxy, mx, my, uvmx, uvmy, src_x, src_y, uvsrc_x, uvsrc_y;
>     int mv_pos = s->mb_x * 2 + s->mb_y * 2 * s->b8_stride + mv_off;
>     int is16x16 = 1;
> 
>     mx = s->current_picture_ptr->motion_val[dir][mv_pos][0];
>     my = s->current_picture_ptr->motion_val[dir][mv_pos][1];
>     srcY = dir ? s->next_picture_ptr->data[0] : s->last_picture_ptr->data[0];
>     srcU = dir ? s->next_picture_ptr->data[1] : s->last_picture_ptr->data[1];
>     srcV = dir ? s->next_picture_ptr->data[2] : s->last_picture_ptr->data[2];
>     src_x = s->mb_x * 16 + xoff + (mx >> 2);
>     src_y = s->mb_y * 16 + yoff + (my >> 2);
>     uvsrc_x = s->mb_x * 8 + (xoff >> 1) + (mx >> 3);
>     uvsrc_y = s->mb_y * 8 + (yoff >> 1) + (my >> 3);
>     srcY += src_y * s->linesize + src_x;
>     srcU += uvsrc_y * s->uvlinesize + uvsrc_x;
>     srcV += uvsrc_y * s->uvlinesize + uvsrc_x;
>     if(   (unsigned)(src_x - !!(mx&3)*2) > s->h_edge_pos - !!(mx&3)*2 - (width <<3) - 3
>        || (unsigned)(src_y - !!(my&3)*2) > s->v_edge_pos - !!(my&3)*2 - (height<<3) - 3){
>         uint8_t *uvbuf= s->edge_emu_buffer + 20 * s->linesize;
> 
>         srcY -= 2 + 2*s->linesize;
>         ff_emulated_edge_mc(s->edge_emu_buffer, srcY, s->linesize, (width<<3)+4, (height<<3)+4,
>                             src_x - 2, src_y - 2, s->h_edge_pos, s->v_edge_pos);
>         srcY = s->edge_emu_buffer + 2 + 2*s->linesize;
>         ff_emulated_edge_mc(uvbuf     , srcU, s->uvlinesize, (width<<2)+1, (height<<2)+1,
>                             uvsrc_x, uvsrc_y, s->h_edge_pos >> 1, s->v_edge_pos >> 1);
>         ff_emulated_edge_mc(uvbuf + 16, srcV, s->uvlinesize, (width<<2)+1, (height<<2)+1,
>                             uvsrc_x, uvsrc_y, s->h_edge_pos >> 1, s->v_edge_pos >> 1);
>         srcU = uvbuf;
>         srcV = uvbuf + 16;
>     }
>     dxy = ((my & 3) << 2) | (mx & 3);
>     uvmx = mx & 6;
>     uvmy = my & 6;
>     Y = s->dest[0] + xoff      + yoff     *s->linesize;
>     U = s->dest[1] + (xoff>>1) + (yoff>>1)*s->uvlinesize;
>     V = s->dest[2] + (xoff>>1) + (yoff>>1)*s->uvlinesize;
> 
>     if(block_type == RV34_MB_P_16x8){
>         qpel_mc[1][dxy](Y, srcY, s->linesize);
>         Y    += 8;
>         srcY += 8;
>     }else if(block_type == RV34_MB_P_8x16){
>         qpel_mc[1][dxy](Y, srcY, s->linesize);
>         Y    += 8 * s->linesize;
>         srcY += 8 * s->linesize;
>     }
>     is16x16 = (block_type != RV34_MB_P_8x8) && (block_type != RV34_MB_P_16x8) && (block_type != RV34_MB_P_8x16);
>     qpel_mc[!is16x16][dxy](Y, srcY, s->linesize);
>     chroma_mc[2-width]   (U, srcU, s->uvlinesize, height*4, uvmx, uvmy);
>     chroma_mc[2-width]   (V, srcV, s->uvlinesize, height*4, uvmx, uvmy);
> }

how does this differ from the equivalent functions in h264.c?
that is mc_dir_part() IIRC

[...]
>         for(j = 0; j < 2; j++){
>             for(i = 0; i < 2; i++){
>                 s->current_picture_ptr->motion_val[0][s->mb_x * 2 + s->mb_y * 2 * s->b8_stride + i + j*s->b8_stride][0] = 0;
>                 s->current_picture_ptr->motion_val[0][s->mb_x * 2 + s->mb_y * 2 * s->b8_stride + i + j*s->b8_stride][1] = 0;
>             }
>         }

fill_rectangle() should be used here, yes please
move fill_rectangle() into a file from where it can be used, its a very
usefull function

>         return 0;

>     case RV34_MB_SKIP:
>         if(s->pict_type == P_TYPE){
>             rv34_pred_mv(r, block_type, 0);

makes no sense, the predictor is always 0,0

[...]
>     if(!is16){
>         for(j = 0; j < 4; j++){
>             no_left = !r->avail[0];
>             for(i = 0; i < 4; i++, cbp >>= 1, Y += 4){
>                 no_topright = no_up || (i==3 && j) || (i==3 && !j && (s->mb_x-1) == s->mb_width);
>                 rv34_pred_4x4_block(r, Y, s->linesize, ittrans[intra_types[i]], no_up, no_left, i || (j==3), no_topright);
>                 no_left = 0;
>                 if(cbp & 1)
>                     rv34_add_4x4_block(Y, s->linesize, s->block[(i>>1)+(j&2)], (i&1)*4+(j&1)*32);
>             }
>             no_up = 0;
>             Y += s->linesize * 4 - 4*4;
>             intra_types += s->b4_stride;
>         }

h264 uses a MUCH simpler system to keep track of block availability
take a small array and fill it with mb_types/border availability for the
current MB and then just use block-8 block-1 block-8+1

with that it would just be
has_topright= avail[i + 8*j - 7];

instead of
no_topright = no_up || (i==3 && j) || (i==3 && !j && (s->mb_x-1) == s->mb_width);

[...]
>             for(i = 0; i < 16; i++)
>                 intra_types[(i & 3) + (i>>2) * s->b4_stride] = 0;

fill_rectangle()

[...]

>             for(i = 0; i < 16; i++)
>                 intra_types[(i & 3) + (i>>2) * s->b4_stride] = t;

fill_rectangle()

[...]
> static int check_slice_end(RV34DecContext *r, MpegEncContext *s)
> {
>     int bits;
>     if(!s->mb_num_left)
>         return 1;
>     if(r->s.mb_skip_run > 1)
>         return 0;
>     if(s->mb_y >= s->mb_height)
>         return 1;

no this is wrong, if you are over s->mb_height then the slice has ended no
matter if mb_skip_run ia > 1 or not

[...]
>     ff_er_add_slice(s, s->resync_mb_x, s->resync_mb_y, s->mb_x-1, s->mb_y, AC_END|DC_END|MV_END);
>     if(s->mb_y >= s->mb_height)
>         return 1;
>     if(r->bits > get_bits_count(gb) && show_bits(gb, r->bits-get_bits_count(gb)))
>         return 1;

this is wrong, ff_er_add_slice() must be supplied with the correct information
you cannot just say the slice is ok while it isnt

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Breaking DRM is a little like attempting to break through a door even
though the window is wide open and the only thing in the house is a bunch
of things you dont want and which you would get tomorrow for free anyway
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20071209/63d2b0ee/attachment.pgp>