[MPlayer-dev-eng] Help with MMX asm code

Thu Oct 23 19:07:29 CEST 2003

Jason Tackaberry (tack at auc.ca):

> Representation of the image was taken from the original bmovl filter.
> The image is stored in YUVA where each channel is a separate array: Y
> and A channels being size width*height, and U and V channels being
> size width*height/4.  This is also how mplayer represents the image
> (except that there is no alpha channel).

  If your Cb/Cr channels (U and V) are only width*height/4, then you
have a 4:2:0 image, not a 4:2:2 image.  For an explanation, see:

  http://www.poynton.com/PDFs/Chroma_subsampling_notation.pdf

> So computation is done rather straightforwardly byte for byte between
> corresponding elements of the src and dst arrays.  Where mpimg is the
> video frame, and img is the image stored as described above (to be
> overlaid), the process is roughly this, if we assume mpimg and img are
> the same dimensions:
> 
> foreach y in height:
> 	foreach x in width:
> 		pos = y * width + x
> 		a = layer_alpha/255 * img.A[pos]

  I would recommend instead: a = layer_alpha/256 * img.A[pos].
Division by 255 is expensive, while division by 256 is just a right
shift by 8, and it's cheap to keep the layer alpha in 4 bytes instead
of only 1 byte.
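
  A minimal sketch of that in C, assuming the layer alpha is held as a
0..256 value (256 meaning fully opaque) so the shift loses nothing at
the top end.  Both helper names are hypothetical and not taken from
bmovl:

```c
#include <stdint.h>

/* Hypothetical helper: map a 0..255 layer alpha onto 0..256, so that
 * 255 (fully opaque) becomes exactly 256. */
static inline uint32_t layer_alpha_to_256(uint8_t layer_alpha)
{
    return (uint32_t)layer_alpha + (layer_alpha >> 7);
}

/* Hypothetical helper: combine the 0..256 layer alpha with one 0..255
 * per-pixel alpha.  The division by 256 is a right shift by 8. */
static inline uint32_t scale_alpha(uint32_t layer_alpha_256, uint8_t pixel_alpha)
{
    return (layer_alpha_256 * pixel_alpha) >> 8;
}
```

  The conversion only needs to run once per frame (or whenever the
layer alpha changes), so the per-pixel work is one multiply and one
shift.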

> 		mpimg.Y[pos] = blend(mpimg.Y[pos], img.Y[pos], a)
> 		if y % 2 and x % 2:
> 			pos = y/2 * width/2 + x/2
> 			mpimg.U[pos] = blend(mpimg.U[pos], img.U[pos], a)
> 			mpimg.V[pos] = blend(mpimg.V[pos], img.V[pos], a)

  First, this seems wrong.  If we look at a block of four pixels:

  A B
  C D

  You're using the alpha from pixel D to apply to the Cb/Cr components.
For MPEG-2, the chroma samples are positioned halfway between A and C,
so if you want to be really correct, you should filter the alpha
channel, for example by taking the average alpha value between A and C.
If this is expensive, at least use the alpha of pixel A and not pixel D.
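
  Roughly, the averaging could look like the sketch below.  It assumes
the alpha plane is width*height bytes with no row padding, that width
and height are even, and that (cx, cy) indexes the chroma plane.  The
function name is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: alpha for the chroma sample at (cx, cy), taken
 * as the rounded average of pixel A (top-left of the 2x2 block) and
 * pixel C (directly below A). */
static inline uint8_t chroma_alpha(const uint8_t *alpha, int width, int cx, int cy)
{
    size_t top = (size_t)(2 * cy) * (size_t)width + (size_t)(2 * cx);
    size_t bottom = top + (size_t)width;   /* same column, next luma row */
    return (uint8_t)((alpha[top] + alpha[bottom] + 1) >> 1);
}
```

  The cheaper fallback is to return alpha[top] alone, which is the
alpha of pixel A.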

> def blend(p1, p2, a):
> 	# Which you pointed out is wrong ...
> 	return ( (255-a)*p1 + a*p2 ) >> 8

  Yeah, you should fix that :)
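
  The problem with the quoted version is that the weights add up to 255
but the result is divided by 256, so blending 255 with 255 gives
(255*255) >> 8 = 254 whatever the alpha is.  One possible fix, as a
sketch that assumes a is in 0..256 rather than 0..255, is to make the
weights add up to 256:

```c
#include <stdint.h>

/* Sketch of a blend whose weights sum to 256, matching the >> 8.
 * Assumption: a is in 0..256, where 0 keeps p1 and 256 gives p2. */
static inline uint8_t blend(uint8_t p1, uint8_t p2, uint32_t a)
{
    return (uint8_t)(((256 - a) * p1 + a * p2) >> 8);
}
```

  With this form, a = 0 returns p1 unchanged and a = 256 returns p2
unchanged.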

> My thoughts were to use MMX to parallelize the blend computation
> several bytes at once.  But maybe for now I should go back to the
> beginning and rework the above approach?

  This memory layout is fine, and you can optimize it as it is.  My
code might help as a starting point ...  I can definitely edit any code
you come up with too :)

  If you want help let me know.

  -Billy