SwScaler performance help (was Re: [MPlayer-dev-eng] [PATCH] vf_osd updates - fully baked?)

Tue Sep 13 18:51:35 CEST 2005

Hi

On Mon, Sep 12, 2005 at 10:34:58PM -0400, Jason Tackaberry wrote:
> On Mon, 2005-09-12 at 11:59 -0400, Jason Tackaberry wrote:
> > > As I mentioned before, since you have to separate out the alpha in an
> > > extra plane anyway, you can first do that and then scale. I think. Btw.
> > > the swscaler can do the conversion and scaling in one step AFAIK.
> > 
> > It does.  I'll try to rework the code to use Swscaler.  I agree that
> > it's just a better design that way.  I may have to ask for help. :)
> 
> Initial results are not very encouraging.  This approach, using
> swscaler, is nearly 3 times slower than my current code.  My code will
> convert a 640x480 BGRA image to 5 planes (luma, 2 chroma, luma alpha,
> chroma alpha) in about 4200 usec.  Using swscaler to convert BGR32 to
> YV12, then separating the alpha channel to a separate plane and using
> swscaler to scale Y800 for luma and chroma alpha, this takes about 11500
> usec.
> 
> Here's the code I'm using for swscaler.  In vf_config:
> 
>     sws_getFlagsAndFilterFromCmdLine(&sws_flags, &srcFilterParam, &dstFilterParam);
>     priv->sws_bgr32 = sws_getContext(priv->w, priv->h, IMGFMT_BGR32, width, height, IMGFMT_YV12,
>                                        get_sws_cpuflags() | sws_flags | SWS_PRINT_INFO,
>                                        srcFilterParam, dstFilterParam, NULL);
>     priv->sws_y800_l = sws_getContext(priv->w, priv->h, IMGFMT_Y800, width, height, IMGFMT_Y800,
>                                        get_sws_cpuflags() | sws_flags | SWS_PRINT_INFO,
>                                        srcFilterParam, dstFilterParam, NULL);
>     priv->sws_y800_c = sws_getContext(priv->w, priv->h, IMGFMT_Y800, width>>1, height>>1, IMGFMT_Y800,
>                                        get_sws_cpuflags() | sws_flags | SWS_PRINT_INFO,
>                                        srcFilterParam, dstFilterParam, NULL);
> 
> Note that I'm testing with a fixed OSD, so that means priv->w == width
> and priv->h == height.  (In other words, no scaling is happening except
> for sws_y800_c.)
> 
> And for the conversion (it's messy, but it's just test code):
> 
>     unsigned char *alpha = malloc(priv->w*priv->h);
>     int i, j;
>     for (i=3, j=0; i < priv->w * priv->h * 4; i+=4, j++)
>         alpha[j] = priv->bgra_imgbuf[i];
>     {
>     uint8_t *src[3] = {priv->bgra_imgbuf, NULL, NULL};
>     int src_strides[3] = {priv->w * 4, 0, 0};
>     uint8_t *dst[3] = {priv->y, priv->u, priv->v};
>     int dst_strides[3] = {priv->mpi_w, priv->mpi_w>>1, priv->mpi_w>>1};
>     sws_scale_ordered(priv->sws_bgr32, src, src_strides, 0, priv->h, dst, dst_strides);
>     }
>     uint8_t *src[3] = {alpha, NULL, NULL};
>     int src_strides[3] = {priv->w, 0, 0};
>     {
>     uint8_t *dst[3] = {priv->a, NULL, NULL};
>     int dst_strides[3] = {priv->w, 0, 0};
>     sws_scale_ordered(priv->sws_y800_l, src, src_strides, 0, priv->h, dst, dst_strides);
>     }
>     {
>     uint8_t *dst[3] = {priv->uva, NULL, NULL};
>     int dst_strides[3] = {priv->w>>1, 0, 0};
>     sws_scale_ordered(priv->sws_y800_c, src, src_strides, 0, priv->h, dst, dst_strides);
>     }
>     free(alpha);
> 
> (Note the malloc/free isn't being included in the timings since it should be moved elsewhere.)
> 
> Here's the info messages from swscaler:
> 
>         SwScaler: using unscaled Planar YV12 -> Planar YV12 special converter
>         
>         SwScaler: BICUBIC scaler, from Planar YV12 to Planar YV12 using MMX2
>         
>         SwScaler: BICUBIC scaler, from BGRA to Planar YV12 using MMX2
>         SwScaler: using unscaled Planar Y800 -> Planar Y800 special converter
>         
>         SwScaler: BICUBIC scaler, from Planar Y800 to Planar Y800 using MMX2
> 
> (Note that I've aligned bgra_imgbuf.)
> 
> An increase from 4200 usec to 11500 usec is no small potatoes.  Am I
> doing anything wrong?  I must be.  When I comment out the two last
> scales and just do BGR32 to YV12, it's still slower (about 8000 usec).
> I would have expected swscaler to be faster.

bgr32->yv12 in sws doesn't seem to be optimized at all; it's uncommon for playback
of some codecs to require bgr32->yv12 conversion

i really think you should add your code to the swscaler, and if that's really too
hard then at least put it in postproc/... a filter is not the correct place
for it

btw, ensure that all arrays are aligned at 16-byte boundaries, and linesizes/strides too
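
To make the alignment point concrete, here is a minimal sketch of one way to
get 16-byte aligned planes with 16-byte aligned strides. The names PLANE_ALIGN,
aligned_stride and alloc_plane are hypothetical (they are not part of swscaler
or MPlayer), and posix_memalign is assumed to be available; MPlayer code would
more likely use its own allocation wrappers.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign */
#endif
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical constant: the 16-byte alignment suggested above. */
#define PLANE_ALIGN 16

/* Round a row width in bytes up to the next multiple of PLANE_ALIGN. */
static int aligned_stride(int width_bytes)
{
    return (width_bytes + PLANE_ALIGN - 1) & ~(PLANE_ALIGN - 1);
}

/*
 * Allocate one image plane whose base address and stride are both
 * multiples of PLANE_ALIGN. The chosen stride is stored in *stride.
 * Returns NULL on failure; release the plane with free().
 */
static uint8_t *alloc_plane(int width_bytes, int height, int *stride)
{
    void *p = NULL;

    *stride = aligned_stride(width_bytes);
    if (posix_memalign(&p, PLANE_ALIGN, (size_t)*stride * (size_t)height) != 0)
        return NULL;
    return p;
}
```

With something like this, a 640-pixel-wide BGRA source would get a stride of
2560 bytes, and the half-width chroma alpha plane a stride of 320 bytes; the
strides handed to the scaler would then be the values returned in *stride
rather than the raw widths.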

[...]

-- 
Michael