[FFmpeg-devel] [PATCH v2] fbdetile cpu based framebuffer layout detiling v02
Paul B Mahol
onemda at gmail.com
Sat Jun 27 23:00:02 EEST 2020
What is this?
Missing documentation.
NAK
On 6/27/20, hanishkvc <hanishkvc at gmail.com> wrote:
> v02-20200627IST2331
>
> Unrolled Intel Legacy Tile-Y detiling logic.
>
> Also a consolidated patch file, instead of the previous development
> flow based multiple patch files.
>
> v01-20200627IST1308
>
> Implemented Intel Legacy Tile-X and Tile-Y detiling logic
>
> NOTES:
>
> This video filter allows framebuffers which are tiled to be detiled
> using logic running on the cpu, into a linear layout.
>
> Currently it supports Intel Legacy Tile-X and Tile-Y layout detiling.
> This should help one to work with frames captured (say using kmsgrab)
> on laptops having Intel GPU.
>
> Tile-X conversion logic has been explicitly cross checked, with Tile-X
> based frames. However Tile-Y conv logic hasn't been tested with Tile-Y
> based frames, but it should potentially do the job, based on my current
> understanding of the Tile-Y layout format.
>
> TODO1: At a later time have to generate Tile-Y based frames, and then
> cross check the corresponding logic explicitly.
>
> TODO2: Maybe use OpenGL or Vulkan buffer helper routines to do the
> layout conversion. But some online discussions from sometime back seem
> to indicate that this path is not fully bug free currently.
> ---
> Changelog | 1 +
> doc/filters.texi | 62 ++++++++
> libavfilter/Makefile | 1 +
> libavfilter/allfilters.c | 1 +
> libavfilter/vf_fbdetile.c | 309 ++++++++++++++++++++++++++++++++++++++
> 5 files changed, 374 insertions(+)
> create mode 100644 libavfilter/vf_fbdetile.c
>
> diff --git a/Changelog b/Changelog
> index a60e7d2eb8..0e03491f6a 100644
> --- a/Changelog
> +++ b/Changelog
> @@ -2,6 +2,7 @@ Entries are sorted chronologically from oldest to youngest within each release,
> releases are sorted from youngest to oldest.
>
> version <next>:
> +- fbdetile cpu based framebuffer layout detiling video filter
> - AudioToolbox output device
> - MacCaption demuxer
>
> diff --git a/doc/filters.texi b/doc/filters.texi
> index 3c2dd2eb90..73ba21af89 100644
> --- a/doc/filters.texi
> +++ b/doc/filters.texi
> @@ -12210,6 +12210,68 @@ It accepts the following optional parameters:
> The number of the CUDA device to use
> @end table
>
> +@anchor{fbdetile}
> +@section fbdetile
> +
> +Detiles the Framebuffer tile layout into a linear layout using CPU.
> +
> +It currently supports conversion from Intel legacy tile-x and tile-y layouts
> +into a linear layout. This is useful if one is using kmsgrab and hwdownload
> +to capture a screen which is using one of these non-linear layouts.
> +
> +Currently it expects the data to be a 32bit RGB based pixel format. However
> +the logic doesn't do any pixel format conversion. Support for 16bit RGB data
> +will be enabled later, as the logic is transparent to it at one level.
> +
> +One could either insert this into the filter chain while capturing itself,
> +or else, if it is slowing things down, one could instead insert it into the
> +filter chain during playback or transcoding.
> +
> +It supports the following optional parameters:
> +
> +@table @option
> +@item type
> +Specify which detiling conversion to apply. The supported values are
> +@table @var
> +@item 0
> +intel tile-x to linear conversion (the default)
> +@item 1
> +intel tile-y to linear conversion.
> +@end table
> +@end table
> +
> +If one wants to convert during capture itself, one could do
> +@example
> +ffmpeg -f kmsgrab -i - -vf "hwdownload, fbdetile" OUTPUT
> +@end example
> +
> +However if one wants to convert after the tiled data has been already captured
> +@example
> +ffmpeg -i INPUT -vf "fbdetile" OUTPUT
> +@end example
> +@example
> +ffplay -i INPUT -vf "fbdetile"
> +@end example
> +
> +NOTE: While transcoding a test 1080p h264 stream, with 276 frames, with two
> +runs of each situation, the performance was as given below. However this
> +was for the older | initial version of the logic, and it was run on
> +the default linux chromebook->vm->container, so the perf values need not be
> +proper. But in a relative sense the overhead would be similar.
> +@example
> +rm out.mp4; time ./ffmpeg -i input.mp4 out.mp4
> +rm out.mp4; time ./ffmpeg -i input.mp4 -vf fbdetile=0 out.mp4
> +rm out.mp4; time ./ffmpeg -i input.mp4 -vf fbdetile=1 out.mp4
> +@end example
> +@table @option
> +@item with no fbdetile filter
> +it took ~7.28 secs,
> +@item with fbdetile=0 filter
> +it took ~8.69 secs,
> +@item with fbdetile=1 filter
> +it took ~9.20 secs.
> +@end table
> +
> @section hqx
>
> Apply a high-quality magnification filter designed for pixel art. This filter
> diff --git a/libavfilter/Makefile b/libavfilter/Makefile
> index 5123540653..bdb0c379ae 100644
> --- a/libavfilter/Makefile
> +++ b/libavfilter/Makefile
> @@ -280,6 +280,7 @@ OBJS-$(CONFIG_HWDOWNLOAD_FILTER) += vf_hwdownload.o
> OBJS-$(CONFIG_HWMAP_FILTER) += vf_hwmap.o
> OBJS-$(CONFIG_HWUPLOAD_CUDA_FILTER) += vf_hwupload_cuda.o
> OBJS-$(CONFIG_HWUPLOAD_FILTER) += vf_hwupload.o
> +OBJS-$(CONFIG_FBDETILE_FILTER) += vf_fbdetile.o
> OBJS-$(CONFIG_HYSTERESIS_FILTER) += vf_hysteresis.o framesync.o
> OBJS-$(CONFIG_IDET_FILTER) += vf_idet.o
> OBJS-$(CONFIG_IL_FILTER) += vf_il.o
> diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
> index 1183e40267..f8dceb2a88 100644
> --- a/libavfilter/allfilters.c
> +++ b/libavfilter/allfilters.c
> @@ -265,6 +265,7 @@ extern AVFilter ff_vf_hwdownload;
> extern AVFilter ff_vf_hwmap;
> extern AVFilter ff_vf_hwupload;
> extern AVFilter ff_vf_hwupload_cuda;
> +extern AVFilter ff_vf_fbdetile;
> extern AVFilter ff_vf_hysteresis;
> extern AVFilter ff_vf_idet;
> extern AVFilter ff_vf_il;
> diff --git a/libavfilter/vf_fbdetile.c b/libavfilter/vf_fbdetile.c
> new file mode 100644
> index 0000000000..8b20c96d2c
> --- /dev/null
> +++ b/libavfilter/vf_fbdetile.c
> @@ -0,0 +1,309 @@
> +/*
> + * Copyright (c) 2020 HanishKVC
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +/**
> + * @file
> + * Detile the Frame buffer's tile layout using the cpu
> + * Currently it supports legacy Intel Tile-X and Tile-Y layout detiling.
> + *
> + */
> +
> +/*
> + * ToThink|Check: Optimisations
> + *
> + * Do the gcc settings used by ffmpeg allow memcpy | stringops inlining,
> + * loop unrolling, better native matching instructions, additional
> + * optimisations, ...
> + *
> + * Does gcc map to optimal memcpy logic, based on the situation it is
> + * used in.
> + *
> + * If not, maybe look at vector_size or intrinsics or appropriate arch
> + * and cpu specific inline asm or ...
> + *
> + */
> +
> +#include "libavutil/avassert.h"
> +#include "libavutil/imgutils.h"
> +#include "libavutil/opt.h"
> +#include "avfilter.h"
> +#include "formats.h"
> +#include "internal.h"
> +#include "video.h"
> +
> +enum FilterMode {
> +    TYPE_INTELX,
> +    TYPE_INTELY,
> +    NB_TYPE
> +};
> +
> +typedef struct FBDetileContext {
> +    const AVClass *class;
> +    int width, height;
> +    int type;
> +} FBDetileContext;
> +
> +#define OFFSET(x) offsetof(FBDetileContext, x)
> +#define FLAGS AV_OPT_FLAG_FILTERING_PARAM|AV_OPT_FLAG_VIDEO_PARAM
> +static const AVOption fbdetile_options[] = {
> +    { "type", "set framebuffer format_modifier type", OFFSET(type), AV_OPT_TYPE_INT, {.i64=TYPE_INTELX}, 0, NB_TYPE-1, FLAGS, "type" },
> +    { "intelx", "Intel Tile-X layout", 0, AV_OPT_TYPE_CONST, {.i64=TYPE_INTELX}, INT_MIN, INT_MAX, FLAGS, "type" },
> +    { "intely", "Intel Tile-Y layout", 0, AV_OPT_TYPE_CONST, {.i64=TYPE_INTELY}, INT_MIN, INT_MAX, FLAGS, "type" },
> +    { NULL }
> +};
> +
> +AVFILTER_DEFINE_CLASS(fbdetile);
> +
> +static av_cold int init(AVFilterContext *ctx)
> +{
> +    FBDetileContext *fbdetile = ctx->priv;
> +
> +    if (fbdetile->type == TYPE_INTELX) {
> +        fprintf(stderr,"INFO:fbdetile:init: Intel tile-x to linear\n");
> +    } else if (fbdetile->type == TYPE_INTELY) {
> +        fprintf(stderr,"INFO:fbdetile:init: Intel tile-y to linear\n");
> +    } else {
> +        fprintf(stderr,"DBUG:fbdetile:init: Unknown Tile format specified, shouldnt reach here\n");
> +    }
> +    fbdetile->width = 1920;
> +    fbdetile->height = 1080;
> +    return 0;
> +}
> +
> +static int query_formats(AVFilterContext *ctx)
> +{
> +    // Currently only RGB based 32bit formats are specified
> +    // TODO: Technically the logic is transparent to 16bit RGB formats also
> +    static const enum AVPixelFormat pix_fmts[] = {AV_PIX_FMT_RGB0, AV_PIX_FMT_0RGB, AV_PIX_FMT_BGR0, AV_PIX_FMT_0BGR,
> +                                                  AV_PIX_FMT_RGBA, AV_PIX_FMT_ARGB, AV_PIX_FMT_BGRA, AV_PIX_FMT_ABGR,
> +                                                  AV_PIX_FMT_NONE};
> +    AVFilterFormats *fmts_list;
> +
> +    fmts_list = ff_make_format_list(pix_fmts);
> +    if (!fmts_list)
> +        return AVERROR(ENOMEM);
> +    return ff_set_common_formats(ctx, fmts_list);
> +}
> +
> +static int config_props(AVFilterLink *inlink)
> +{
> +    AVFilterContext *ctx = inlink->dst;
> +    FBDetileContext *fbdetile = ctx->priv;
> +
> +    fbdetile->width = inlink->w;
> +    fbdetile->height = inlink->h;
> +    fprintf(stderr,"DBUG:fbdetile:config_props: %d x %d\n", fbdetile->width, fbdetile->height);
> +
> +    return 0;
> +}
> +
> +static void detile_intelx(AVFilterContext *ctx, int w, int h,
> +                          uint8_t *dst, int dstLineSize,
> +                          const uint8_t *src, int srcLineSize)
> +{
> +    // Offsets and LineSize are in bytes
> +    int tileW = 128; // For a 32Bit / Pixel framebuffer, 512/4
> +    int tileH = 8;
> +
> +    if (w*4 != srcLineSize) {
> +        fprintf(stderr,"DBUG:fbdetile:intelx: w%dxh%d, dL%d, sL%d\n", w, h, dstLineSize, srcLineSize);
> +        fprintf(stderr,"ERRR:fbdetile:intelx: dont support LineSize | Pitch going beyond width\n");
> +    }
> +    int sO = 0;
> +    int dX = 0;
> +    int dY = 0;
> +    int nTRows = (w*h)/tileW;
> +    int cTR = 0;
> +    while (cTR < nTRows) {
> +        int dO = dY*dstLineSize + dX*4;
> +#ifdef DEBUG_FBTILE
> +        fprintf(stderr,"DBUG:fbdetile:intelx: dX%d dY%d, sO%d, dO%d\n", dX, dY, sO, dO);
> +#endif
> +        memcpy(dst+dO+0*dstLineSize, src+sO+0*512, 512);
> +        memcpy(dst+dO+1*dstLineSize, src+sO+1*512, 512);
> +        memcpy(dst+dO+2*dstLineSize, src+sO+2*512, 512);
> +        memcpy(dst+dO+3*dstLineSize, src+sO+3*512, 512);
> +        memcpy(dst+dO+4*dstLineSize, src+sO+4*512, 512);
> +        memcpy(dst+dO+5*dstLineSize, src+sO+5*512, 512);
> +        memcpy(dst+dO+6*dstLineSize, src+sO+6*512, 512);
> +        memcpy(dst+dO+7*dstLineSize, src+sO+7*512, 512);
> +        dX += tileW;
> +        if (dX >= w) {
> +            dX = 0;
> +            dY += 8;
> +        }
> +        sO = sO + 8*512;
> +        cTR += 8;
> +    }
> +}
> +
> +/*
> + * Intel Legacy Tile-Y layout conversion support
> + *
> + * currently done in a simple dumb way. Two low hanging optimisations
> + * that could be readily applied are
> + *
> + * a) unrolling the inner for loop
> + * --- Given small size memcpy, should help, DONE
> + *
> + * b) using simd based 128bit loading and storing along with prefetch
> + * hinting.
> + *
> + * TOTHINK|CHECK: Does memcpy already do this and more if situation
> + * is right?!
> + *
> + * As code (or even intrinsics) would be specific to each architecture,
> + * avoiding for now. Later have to check if vector_size attribute and
> + * corresponding implementation by gcc can handle different architectures
> + * properly, such that it wont become worse than memcpy provided for that
> + * architecture.
> + *
> + * Or maybe I could even merge the two intel detiling logics into one, as
> + * the semantic and flow is almost same for both logics.
> + *
> + */
> +static void detile_intely(AVFilterContext *ctx, int w, int h,
> +                          uint8_t *dst, int dstLineSize,
> +                          const uint8_t *src, int srcLineSize)
> +{
> +    // Offsets and LineSize are in bytes
> +    int tileW = 4; // For a 32Bit / Pixel framebuffer, 16/4
> +    int tileH = 32;
> +
> +    if (w*4 != srcLineSize) {
> +        fprintf(stderr,"DBUG:fbdetile:intely: w%dxh%d, dL%d, sL%d\n", w, h, dstLineSize, srcLineSize);
> +        fprintf(stderr,"ERRR:fbdetile:intely: dont support LineSize | Pitch going beyond width\n");
> +    }
> +    int sO = 0;
> +    int dX = 0;
> +    int dY = 0;
> +    int nTRows = (w*h)/tileW;
> +    int cTR = 0;
> +    while (cTR < nTRows) {
> +        int dO = dY*dstLineSize + dX*4;
> +#ifdef DEBUG_FBTILE
> +        fprintf(stderr,"DBUG:fbdetile:intely: dX%d dY%d, sO%d, dO%d\n", dX, dY, sO, dO);
> +#endif
> +
> +        memcpy(dst+dO+0*dstLineSize, src+sO+0*16, 16);
> +        memcpy(dst+dO+1*dstLineSize, src+sO+1*16, 16);
> +        memcpy(dst+dO+2*dstLineSize, src+sO+2*16, 16);
> +        memcpy(dst+dO+3*dstLineSize, src+sO+3*16, 16);
> +        memcpy(dst+dO+4*dstLineSize, src+sO+4*16, 16);
> +        memcpy(dst+dO+5*dstLineSize, src+sO+5*16, 16);
> +        memcpy(dst+dO+6*dstLineSize, src+sO+6*16, 16);
> +        memcpy(dst+dO+7*dstLineSize, src+sO+7*16, 16);
> +        memcpy(dst+dO+8*dstLineSize, src+sO+8*16, 16);
> +        memcpy(dst+dO+9*dstLineSize, src+sO+9*16, 16);
> +        memcpy(dst+dO+10*dstLineSize, src+sO+10*16, 16);
> +        memcpy(dst+dO+11*dstLineSize, src+sO+11*16, 16);
> +        memcpy(dst+dO+12*dstLineSize, src+sO+12*16, 16);
> +        memcpy(dst+dO+13*dstLineSize, src+sO+13*16, 16);
> +        memcpy(dst+dO+14*dstLineSize, src+sO+14*16, 16);
> +        memcpy(dst+dO+15*dstLineSize, src+sO+15*16, 16);
> +        memcpy(dst+dO+16*dstLineSize, src+sO+16*16, 16);
> +        memcpy(dst+dO+17*dstLineSize, src+sO+17*16, 16);
> +        memcpy(dst+dO+18*dstLineSize, src+sO+18*16, 16);
> +        memcpy(dst+dO+19*dstLineSize, src+sO+19*16, 16);
> +        memcpy(dst+dO+20*dstLineSize, src+sO+20*16, 16);
> +        memcpy(dst+dO+21*dstLineSize, src+sO+21*16, 16);
> +        memcpy(dst+dO+22*dstLineSize, src+sO+22*16, 16);
> +        memcpy(dst+dO+23*dstLineSize, src+sO+23*16, 16);
> +        memcpy(dst+dO+24*dstLineSize, src+sO+24*16, 16);
> +        memcpy(dst+dO+25*dstLineSize, src+sO+25*16, 16);
> +        memcpy(dst+dO+26*dstLineSize, src+sO+26*16, 16);
> +        memcpy(dst+dO+27*dstLineSize, src+sO+27*16, 16);
> +        memcpy(dst+dO+28*dstLineSize, src+sO+28*16, 16);
> +        memcpy(dst+dO+29*dstLineSize, src+sO+29*16, 16);
> +        memcpy(dst+dO+30*dstLineSize, src+sO+30*16, 16);
> +        memcpy(dst+dO+31*dstLineSize, src+sO+31*16, 16);
> +
> +        dX += tileW;
> +        if (dX >= w) {
> +            dX = 0;
> +            dY += 32;
> +        }
> +        sO = sO + 32*16;
> +        cTR += 32;
> +    }
> +}
> +
> +static int filter_frame(AVFilterLink *inlink, AVFrame *in)
> +{
> +    AVFilterContext *ctx = inlink->dst;
> +    FBDetileContext *fbdetile = ctx->priv;
> +    AVFilterLink *outlink = ctx->outputs[0];
> +    AVFrame *out;
> +
> +    out = ff_get_video_buffer(outlink, outlink->w, outlink->h);
> +    if (!out) {
> +        av_frame_free(&in);
> +        return AVERROR(ENOMEM);
> +    }
> +    av_frame_copy_props(out, in);
> +
> +    if (fbdetile->type == TYPE_INTELX) {
> +        detile_intelx(ctx, fbdetile->width, fbdetile->height,
> +                      out->data[0], out->linesize[0],
> +                      in->data[0], in->linesize[0]);
> +    } else if (fbdetile->type == TYPE_INTELY) {
> +        detile_intely(ctx, fbdetile->width, fbdetile->height,
> +                      out->data[0], out->linesize[0],
> +                      in->data[0], in->linesize[0]);
> +    }
> +
> +    av_frame_free(&in);
> +    return ff_filter_frame(outlink, out);
> +}
> +
> +static av_cold void uninit(AVFilterContext *ctx)
> +{
> +
> +}
> +
> +static const AVFilterPad fbdetile_inputs[] = {
> +    {
> +        .name = "default",
> +        .type = AVMEDIA_TYPE_VIDEO,
> +        .config_props = config_props,
> +        .filter_frame = filter_frame,
> +    },
> +    { NULL }
> +};
> +
> +static const AVFilterPad fbdetile_outputs[] = {
> +    {
> +        .name = "default",
> +        .type = AVMEDIA_TYPE_VIDEO,
> +    },
> +    { NULL }
> +};
> +
> +AVFilter ff_vf_fbdetile = {
> +    .name = "fbdetile",
> +    .description = NULL_IF_CONFIG_SMALL("Detile Framebuffer using CPU"),
> +    .priv_size = sizeof(FBDetileContext),
> +    .init = init,
> +    .uninit = uninit,
> +    .query_formats = query_formats,
> +    .inputs = fbdetile_inputs,
> +    .outputs = fbdetile_outputs,
> +    .priv_class = &fbdetile_class,
> +};
> --
> 2.20.1
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel at ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request at ffmpeg.org with subject "unsubscribe".
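
For reference, the two detiling loops in the quoted patch share one
structure, and a comment in the patch itself suggests merging them. A
minimal sketch of such a merge follows. It assumes, as the patch does,
that the source is a packed sequential stream of sub-blocks with no extra
pitch. detile_generic and its parameter names are hypothetical; they are
not part of the patch or of FFmpeg. Unlike the patch, the sketch rejects
a frame that is not a whole number of sub-blocks instead of copying
anyway.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): copy a tiled source into a
 * linear destination. The source is read strictly sequentially as
 * sub-blocks of blk_w bytes by blk_h rows; sub-blocks fill the frame
 * left to right, then top to bottom.
 *
 * Returns 0 on success, -1 if the geometry is not supported.
 */
static int detile_generic(uint8_t *dst, size_t dst_linesize,
                          const uint8_t *src,
                          size_t width_bytes, size_t height,
                          size_t blk_w, size_t blk_h)
{
    assert(dst && src);

    /* Refuse partial sub-blocks and destinations narrower than a row. */
    if (blk_w == 0 || blk_h == 0 ||
        width_bytes % blk_w != 0 || height % blk_h != 0 ||
        dst_linesize < width_bytes)
        return -1;

    for (size_t y = 0; y < height; y += blk_h) {          /* band of rows  */
        for (size_t x = 0; x < width_bytes; x += blk_w) { /* sub-block     */
            for (size_t r = 0; r < blk_h; r++) {          /* row in block  */
                memcpy(dst + (y + r) * dst_linesize + x, src, blk_w);
                src += blk_w;
            }
        }
    }
    return 0;
}
```

With 32-bit pixels and width_bytes = w*4, the Tile-X case in the patch
corresponds to blk_w=512, blk_h=8 and its Tile-Y case to blk_w=16,
blk_h=32. A 1920x1080 frame is accepted for Tile-X (1080 is a multiple
of 8) but rejected for Tile-Y (1080 is not a multiple of 32); for that
size the unrolled Tile-Y loop in the patch would address destination
rows 1056 through 1087.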