[FFmpeg-devel] [PATCH] avfilter: add CUDA-accelerated transpose filter
Timo Rothenpieler
timo at rothenpieler.org
Thu Jun 5 21:14:00 EEST 2025
On 05.06.2025 13:09, Faeez Kadiri wrote:
> Add a new CUDA-accelerated transpose filter (transpose_cuda) that provides
> hardware-accelerated video transposition operations on NVIDIA GPUs using
> CUDA. This filter supports all the same transpose operations as the CPU
> transpose filter while leveraging GPU acceleration for improved performance.
>
> Supported operations:
> - 90° clockwise rotation
> - 90° counter-clockwise rotation
> - 90° clockwise + vertical flip
> - 90° counter-clockwise + vertical flip
> - 180° rotation
> - Horizontal flip
> - Vertical flip
>
> Supported pixel formats:
> - YUV420P, NV12, YUV444P (8-bit)
> - P010, P016, YUV444P16 (10/16-bit)
> - RGB32, BGR32, 0RGB32, 0BGR32 (packed RGB)
>
> The implementation uses CUDA texture memory for optimal memory access
> patterns and includes a new CUDA VPP (Video Post-Processing) framework
> that can be reused by future CUDA filters.
>
> Performance improvements over CPU transpose:
> - 4K YUV420P: ~15x faster
> - 1080p YUV420P: ~8x faster
> - Negligible CPU usage during processing
>
> The filter maintains full compatibility with the existing transpose filter
> API and includes passthrough mode for landscape content when enabled.
>
> Dependencies: requires CUDA SDK and ffnvcodec headers.
What exactly requires the CUDA SDK?
Looking at it, it should not do so at all.
> Example usage:
> ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 \
> -vf transpose_cuda=1 -c:v h264_nvenc output.mp4
>
> Signed-off-by: Faeez Kadiri <f1k2faeez at gmail.com>
> ---
> Changelog | 2 +-
> configure | 2 +
> doc/filters.texi | 137 ++++++++++
> libavfilter/Makefile | 1 +
> libavfilter/allfilters.c | 1 +
> libavfilter/cuda/cuda_vpp.c | 248 ++++++++++++++++++
> libavfilter/cuda/cuda_vpp.h | 113 +++++++++
> libavfilter/vf_transpose_cuda.c | 423 +++++++++++++++++++++++++++++++
> libavfilter/vf_transpose_cuda.cu | 219 ++++++++++++++++
> 9 files changed, 1145 insertions(+), 1 deletion(-)
> create mode 100644 libavfilter/cuda/cuda_vpp.c
> create mode 100644 libavfilter/cuda/cuda_vpp.h
> create mode 100644 libavfilter/vf_transpose_cuda.c
> create mode 100644 libavfilter/vf_transpose_cuda.cu
>
> diff --git a/Changelog b/Changelog
> index 4217449438..cf1d019645 100644
> --- a/Changelog
> +++ b/Changelog
> @@ -18,7 +18,7 @@ version <next>:
> - APV encoding support through a libopenapv wrapper
> - VVC decoder supports all content of SCC (Screen Content Coding):
> IBC (Intra Block Copy), Palette Mode and ACT (Adaptive Color Transform)
> -
The two newlines need to stay
> +- Transpose CUDA filter (transpose_cuda)
>
> version 7.1:
> - Raw Captions with Time (RCWT) closed caption demuxer
> diff --git a/configure b/configure
> index 89a766b403..d6b07d7afe 100755
> --- a/configure
> +++ b/configure
> @@ -4016,6 +4016,8 @@ tinterlace_pad_test_deps="tinterlace_filter"
> tonemap_filter_deps="const_nan"
> tonemap_vaapi_filter_deps="vaapi VAProcFilterParameterBufferHDRToneMapping"
> tonemap_opencl_filter_deps="opencl const_nan"
> +transpose_cuda_filter_deps="ffnvcodec"
> +transpose_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
It's a bit of a nit, given not all filters follow this already, but
there is a whole section just for cuda filters, starting with
bilateral_cuda_filter_deps.
> transpose_opencl_filter_deps="opencl"
> transpose_vaapi_filter_deps="vaapi VAProcPipelineCaps_rotation_flags"
> transpose_vt_filter_deps="videotoolbox VTPixelRotationSessionCreate"
> diff --git a/doc/filters.texi b/doc/filters.texi
> index 63f55f5794..8321f847de 100644
> --- a/doc/filters.texi
> +++ b/doc/filters.texi
> @@ -26856,6 +26856,143 @@ Only deinterlace frames marked as interlaced.
> The default value is @code{all}.
> @end table
>
> + at anchor{transpose_cuda}
> + at subsection transpose_cuda
> +
> +Transpose rows with columns in the input video and optionally flip it using CUDA.
> +
> +This is the CUDA variant of the @ref{transpose} filter. It provides hardware-accelerated
> +transposition operations on CUDA-capable devices with support for multiple pixel formats.
> +
> +The filter uses efficient CUDA kernels with texture memory for optimal performance across
> +all supported pixel formats and frame sizes.
> +
> + at subsection Supported Pixel Formats
> +
> +The filter supports the following pixel formats:
> + at itemize
> + at item YUV420P (8-bit planar YUV)
> + at item NV12 (8-bit semi-planar YUV)
> + at item YUV444P (8-bit planar YUV 4:4:4)
> + at item P010LE (10-bit semi-planar YUV)
> + at item P016LE (16-bit semi-planar YUV)
> + at item YUV444P16LE (16-bit planar YUV 4:4:4)
> + at item RGB0 (32-bit RGB with alpha padding)
> + at item BGR0 (32-bit BGR with alpha padding)
> + at item RGBA (32-bit RGBA)
> + at item BGRA (32-bit BGRA)
> + at end itemize
> +
> + at subsection Options
> +
> +It accepts the following parameters:
> +
> + at table @option
> +
> + at item dir
> +Specify the transposition direction.
> +
> +Can assume the following values:
> + at table @samp
> + at item 0, cclock_flip
> +Rotate by 90 degrees counterclockwise and vertically flip (default), that is:
> + at example
> +L.R L.l
> +. . -> . .
> +l.r R.r
> + at end example
> +
> + at item 1, clock
> +Rotate by 90 degrees clockwise, that is:
> + at example
> +L.R l.L
> +. . -> . .
> +l.r r.R
> + at end example
> +
> + at item 2, cclock
> +Rotate by 90 degrees counterclockwise, that is:
> + at example
> +L.R R.r
> +. . -> . .
> +l.r L.l
> + at end example
> +
> + at item 3, clock_flip
> +Rotate by 90 degrees clockwise and vertically flip, that is:
> + at example
> +L.R r.R
> +. . -> . .
> +l.r l.L
> + at end example
> +
> + at item 4, reversal
> +Rotate by 180 degrees, that is:
> + at example
> +L.R r.l
> +. . -> . .
> +l.r R.L
> + at end example
> +
> + at item 5, hflip
> +Flip horizontally, that is:
> + at example
> +L.R R.L
> +. . -> . .
> +l.r r.l
> + at end example
> +
> + at item 6, vflip
> +Flip vertically, that is:
> + at example
> +L.R l.r
> +. . -> . .
> +l.r L.R
> + at end example
> + at end table
> +
> + at item passthrough
> +Do not apply the transposition if the input geometry matches the one
> +implied by the specified value. It accepts the following values:
> + at table @samp
> + at item none
> +Always apply transposition. (default)
> + at item portrait
> +Preserve portrait geometry (when @var{height} >= @var{width}).
> + at item landscape
> +Preserve landscape geometry (when @var{width} >= @var{height}).
> + at end table
> +
> + at end table
> +
> + at subsection Usage Examples
> +
> + at itemize
> + at item
> +Rotate a video 90 degrees clockwise:
> + at example
> +ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 -vf "transpose_cuda=dir=clock" output.mp4
> + at end example
> +
> + at item
> +Rotate a video 90 degrees counterclockwise:
> + at example
> +ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 -vf "transpose_cuda=dir=cclock" output.mp4
> + at end example
> +
> + at item
> +Flip a video horizontally:
> + at example
> +ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 -vf "transpose_cuda=dir=hflip" output.mp4
> + at end example
> +
> + at item
> +Rotate 180 degrees:
> + at example
> +ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 -vf "transpose_cuda=dir=reversal" output.mp4
> + at end example
> + at end itemize
> +
> @anchor{CUDA NPP}
> @section CUDA NPP
> Below is a description of the currently available NVIDIA Performance Primitives (libnpp) video filters.
> diff --git a/libavfilter/Makefile b/libavfilter/Makefile
> index 97f8f17272..737f397315 100644
> --- a/libavfilter/Makefile
> +++ b/libavfilter/Makefile
> @@ -535,6 +535,7 @@ OBJS-$(CONFIG_TONEMAP_OPENCL_FILTER) += vf_tonemap_opencl.o opencl.o \
> OBJS-$(CONFIG_TONEMAP_VAAPI_FILTER) += vf_tonemap_vaapi.o vaapi_vpp.o
> OBJS-$(CONFIG_TPAD_FILTER) += vf_tpad.o
> OBJS-$(CONFIG_TRANSPOSE_FILTER) += vf_transpose.o
> +OBJS-$(CONFIG_TRANSPOSE_CUDA_FILTER) += vf_transpose_cuda.o vf_transpose_cuda.ptx.o cuda/cuda_vpp.o cuda/load_helper.o
> OBJS-$(CONFIG_TRANSPOSE_NPP_FILTER) += vf_transpose_npp.o
> OBJS-$(CONFIG_TRANSPOSE_OPENCL_FILTER) += vf_transpose_opencl.o opencl.o opencl/transpose.o
> OBJS-$(CONFIG_TRANSPOSE_VAAPI_FILTER) += vf_transpose_vaapi.o vaapi_vpp.o
> diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
> index 3bc045b28f..6d0ef5e654 100644
> --- a/libavfilter/allfilters.c
> +++ b/libavfilter/allfilters.c
> @@ -503,6 +503,7 @@ extern const FFFilter ff_vf_tonemap_opencl;
> extern const FFFilter ff_vf_tonemap_vaapi;
> extern const FFFilter ff_vf_tpad;
> extern const FFFilter ff_vf_transpose;
> +extern const FFFilter ff_vf_transpose_cuda;
> extern const FFFilter ff_vf_transpose_npp;
> extern const FFFilter ff_vf_transpose_opencl;
> extern const FFFilter ff_vf_transpose_vaapi;
> diff --git a/libavfilter/cuda/cuda_vpp.c b/libavfilter/cuda/cuda_vpp.c
> new file mode 100644
> index 0000000000..b8a93820f9
> --- /dev/null
> +++ b/libavfilter/cuda/cuda_vpp.c
> @@ -0,0 +1,248 @@
> +/*
> + * Copyright (c) 2025, Faeez Kadiri < f1k2faeez at gmail dot com>
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +
> +#include <string.h>
> +
> +#include "libavutil/avassert.h"
> +#include "libavutil/pixdesc.h"
> +
> +#include "libavfilter/filters.h"
> +#include "libavfilter/formats.h"
> +#include "cuda_vpp.h"
> +#include "load_helper.h"
> +
> +int ff_cuda_vpp_query_formats(const AVFilterContext *avctx,
> + AVFilterFormatsConfig **cfg_in,
> + AVFilterFormatsConfig **cfg_out)
> +{
> + static const enum AVPixelFormat pix_fmts[] = {
> + AV_PIX_FMT_CUDA, AV_PIX_FMT_NONE,
> + };
> + int err;
> +
> + err = ff_set_common_formats_from_list2(avctx, cfg_in, cfg_out, pix_fmts);
> + if (err < 0)
> + return err;
> +
> + return 0;
> +}
This whole function appears to be just FILTER_SINGLE_PIXFMT(AV_PIX_FMT_CUDA)
> +int ff_cuda_vpp_config_input(AVFilterLink *inlink)
> +{
> + FilterLink *l = ff_filter_link(inlink);
> + AVFilterContext *avctx = inlink->dst;
> + CUDAVPPContext *ctx = avctx->priv;
> +
> + if (ctx->pipeline_uninit)
> + ctx->pipeline_uninit(avctx);
> +
> + if (!l->hw_frames_ctx) {
> + av_log(avctx, AV_LOG_ERROR, "A hardware frames reference is "
> + "required to associate the processing device.\n");
> + return AVERROR(EINVAL);
> + }
> +
> + ctx->input_frames_ref = av_buffer_ref(l->hw_frames_ctx);
> + if (!ctx->input_frames_ref) {
> + av_log(avctx, AV_LOG_ERROR, "Failed to create input frames "
> + "reference.\n");
> + return AVERROR(ENOMEM);
> + }
> + ctx->input_frames = (AVHWFramesContext*)ctx->input_frames_ref->data;
> +
> + return 0;
> +}
> +
> +int ff_cuda_vpp_config_output(AVFilterLink *outlink)
> +{
> + FilterLink *outl = ff_filter_link(outlink);
> + AVFilterContext *avctx = outlink->src;
> + AVFilterLink *inlink = avctx->inputs[0];
> + FilterLink *inl = ff_filter_link(inlink);
> + CUDAVPPContext *ctx = avctx->priv;
> + AVHWFramesContext *input_frames;
> + AVBufferRef *hw_frames_ctx;
> + AVHWFramesContext *output_frames;
> + enum AVPixelFormat in_format;
> + int err;
> +
> + if (ctx->pipeline_uninit)
> + ctx->pipeline_uninit(avctx);
> +
> + if (!ctx->output_width)
> + ctx->output_width = avctx->inputs[0]->w;
> + if (!ctx->output_height)
> + ctx->output_height = avctx->inputs[0]->h;
> +
> + outlink->w = ctx->output_width;
> + outlink->h = ctx->output_height;
> +
> + if (ctx->passthrough) {
> + if (inl->hw_frames_ctx)
> + outl->hw_frames_ctx = av_buffer_ref(inl->hw_frames_ctx);
> + av_log(ctx, AV_LOG_VERBOSE, "Using CUDA filter passthrough mode.\n");
> + return 0;
> + }
> +
> + av_assert0(ctx->input_frames);
> + ctx->device_ref = av_buffer_ref(ctx->input_frames->device_ref);
> + if (!ctx->device_ref) {
> + av_log(avctx, AV_LOG_ERROR, "Failed to create device "
> + "reference.\n");
> + return AVERROR(ENOMEM);
> + }
> +
> + input_frames = (AVHWFramesContext*)ctx->input_frames_ref->data;
> + in_format = input_frames->sw_format;
> +
> + ctx->hwctx = input_frames->device_ctx->hwctx;
> + ctx->cuda_dl = ctx->hwctx->internal->cuda_dl;
> + ctx->cu_stream = ctx->hwctx->stream;
> +
> + if (ctx->output_format == AV_PIX_FMT_NONE)
> + ctx->output_format = input_frames->sw_format;
> +
> + // Setup format information
> + err = ff_cuda_vpp_setup_planes(ctx, in_format);
> + if (err < 0)
> + return err;
> +
> + // Load filter-specific functions
> + if (ctx->load_functions) {
> + err = ctx->load_functions(avctx, in_format);
> + if (err < 0)
> + return err;
> + }
> +
> + // Build filter parameters
> + if (ctx->build_filter_params) {
> + err = ctx->build_filter_params(avctx);
> + if (err < 0)
> + return err;
> + }
> +
> + // Initialize hardware frames context for output
> + hw_frames_ctx = av_hwframe_ctx_alloc(ctx->device_ref);
> + if (!hw_frames_ctx)
> + return AVERROR(ENOMEM);
> +
> + output_frames = (AVHWFramesContext*)hw_frames_ctx->data;
> + output_frames->format = AV_PIX_FMT_CUDA;
> + output_frames->sw_format = ctx->output_format;
> + output_frames->width = ctx->output_width;
> + output_frames->height = ctx->output_height;
> +
> + err = av_hwframe_ctx_init(hw_frames_ctx);
> + if (err < 0) {
> + av_buffer_unref(&hw_frames_ctx);
> + return err;
> + }
> +
> + av_buffer_unref(&outl->hw_frames_ctx);
> + outl->hw_frames_ctx = hw_frames_ctx;
> +
> + return 0;
> +}
> +
> +int ff_cuda_vpp_format_is_supported(enum AVPixelFormat fmt, const enum AVPixelFormat *supported_formats, int nb_formats)
> +{
> + int i;
> +
> + for (i = 0; i < nb_formats; i++)
> + if (supported_formats[i] == fmt)
> + return 1;
> + return 0;
> +}
> +
> +int ff_cuda_vpp_setup_planes(CUDAVPPContext *s, enum AVPixelFormat format)
> +{
> + s->in_fmt = format;
> + s->in_desc = av_pix_fmt_desc_get(s->in_fmt);
> + s->in_planes = av_pix_fmt_count_planes(s->in_fmt);
> +
> + // Clear plane information
> + memset(s->in_plane_depths, 0, sizeof(s->in_plane_depths));
> + memset(s->in_plane_channels, 0, sizeof(s->in_plane_channels));
The context is already zero-initialized, so this is not necessary.
> + // Set up plane information
> + for (int i = 0; i < s->in_desc->nb_components; i++) {
> + int d = (s->in_desc->comp[i].depth + 7) / 8;
> + int p = s->in_desc->comp[i].plane;
> + s->in_plane_channels[p] = FFMAX(s->in_plane_channels[p], s->in_desc->comp[i].step / d);
> + s->in_plane_depths[p] = s->in_desc->comp[i].depth;
> + }
> +
> + return 0;
> +}
> +
> +int ff_cuda_vpp_load_module(AVFilterContext *ctx, CUDAVPPContext *s,
> + const unsigned char *ptx_data, unsigned int ptx_len)
> +{
> + CUcontext dummy, cuda_ctx = s->hwctx->cuda_ctx;
> + CudaFunctions *cu = s->cuda_dl;
> + int ret;
> +
> + ret = CHECK_CU(cu->cuCtxPushCurrent(cuda_ctx));
> + if (ret < 0)
> + return ret;
> +
> + ret = ff_cuda_load_module(ctx, s->hwctx, &s->cu_module, ptx_data, ptx_len);
> + if (ret < 0)
> + goto fail;
> +
> +fail:
> + CHECK_CU(cu->cuCtxPopCurrent(&dummy));
> + return ret;
> +}
This seems a bit redundant to me. All it does is move push/pop out of
somewhere else, which would potentially be doubled up if something loads
multiple modules.
> +int ff_cuda_vpp_get_function(AVFilterContext *ctx, CUDAVPPContext *s,
> + CUfunction *func, const char *func_name)
> +{
> + CudaFunctions *cu = s->cuda_dl;
> + int ret;
> +
> + ret = CHECK_CU(cu->cuModuleGetFunction(func, s->cu_module, func_name));
> + if (ret < 0) {
> + av_log(ctx, AV_LOG_FATAL, "Failed to load function: %s\n", func_name);
> + return AVERROR(ENOSYS);
> + }
> +
> + return 0;
> +}
This function is also redundant, given it adds nothing on top of just
cuModuleGetFunction itself.
> +void ff_cuda_vpp_ctx_init(AVFilterContext *avctx)
> +{
> + CUDAVPPContext *ctx = avctx->priv;
> +
> + ctx->cu_module = NULL;
> + ctx->passthrough = 0;
> +}
Every context is zero-initialized anyway, so this is unnecessary.
> +void ff_cuda_vpp_ctx_uninit(AVFilterContext *avctx)
> +{
> + CUDAVPPContext *ctx = avctx->priv;
> +
> + if (ctx->pipeline_uninit)
> + ctx->pipeline_uninit(avctx);
> +
> + av_buffer_unref(&ctx->input_frames_ref);
> + av_buffer_unref(&ctx->device_ref);
> +}
> diff --git a/libavfilter/cuda/cuda_vpp.h b/libavfilter/cuda/cuda_vpp.h
> new file mode 100644
> index 0000000000..1241d36180
> --- /dev/null
> +++ b/libavfilter/cuda/cuda_vpp.h
> @@ -0,0 +1,113 @@
> +/*
> + * Copyright (c) 2025, Faeez Kadiri < f1k2faeez at gmail dot com>
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#ifndef AVFILTER_CUDA_VPP_H
> +#define AVFILTER_CUDA_VPP_H
> +
> +#include "libavutil/hwcontext.h"
> +#include "libavutil/hwcontext_cuda_internal.h"
> +#include "libavutil/cuda_check.h"
> +#include "libavfilter/avfilter.h"
> +
> +#define CHECK_CU(x) FF_CUDA_CHECK_DL(ctx, s->hwctx->internal->cuda_dl, x)
> +
> +typedef struct CUDAVPPContext {
> + const AVClass *class;
> +
> + AVCUDADeviceContext *hwctx;
> + CudaFunctions *cuda_dl;
> + AVBufferRef *device_ref;
> +
> + CUcontext cu_ctx;
> + CUmodule cu_module;
> + CUstream cu_stream;
> +
> + AVBufferRef *input_frames_ref;
> + AVHWFramesContext *input_frames;
> +
> + enum AVPixelFormat output_format;
> + int output_width; // computed width
> + int output_height; // computed height
> +
> + int passthrough;
> +
> + // Format information
> + enum AVPixelFormat in_fmt;
> + const AVPixFmtDescriptor *in_desc;
> + int in_planes;
> + int in_plane_depths[4];
> + int in_plane_channels[4];
> +
> + // Function pointers for filter-specific operations
> + int (*load_functions)(AVFilterContext *avctx, enum AVPixelFormat format);
> + int (*build_filter_params)(AVFilterContext *avctx);
> + void (*pipeline_uninit)(AVFilterContext *avctx);
> +} CUDAVPPContext;
I'm not fully convinced this is really necessary; filters tend to all be
different enough that something like this will have more and more stuff
added to it over time.
> +/**
> + * Initialize CUDA VPP context
> + */
> +void ff_cuda_vpp_ctx_init(AVFilterContext *avctx);
> +
> +/**
> + * Uninitialize CUDA VPP context
> + */
> +void ff_cuda_vpp_ctx_uninit(AVFilterContext *avctx);
> +
> +/**
> + * Query supported formats for CUDA VPP
> + */
> +int ff_cuda_vpp_query_formats(const AVFilterContext *avctx,
> + AVFilterFormatsConfig **cfg_in,
> + AVFilterFormatsConfig **cfg_out);
> +
> +/**
> + * Configure input for CUDA VPP
> + */
> +int ff_cuda_vpp_config_input(AVFilterLink *inlink);
> +
> +/**
> + * Configure output for CUDA VPP
> + */
> +int ff_cuda_vpp_config_output(AVFilterLink *outlink);
> +
> +/**
> + * Check if a pixel format is supported
> + */
> +int ff_cuda_vpp_format_is_supported(enum AVPixelFormat fmt, const enum AVPixelFormat *supported_formats, int nb_formats);
> +
> +/**
> + * Setup plane information for a given format
> + */
> +int ff_cuda_vpp_setup_planes(CUDAVPPContext *s, enum AVPixelFormat format);
> +
> +/**
> + * Load CUDA module from PTX data
> + */
> +int ff_cuda_vpp_load_module(AVFilterContext *ctx, CUDAVPPContext *s,
> + const unsigned char *ptx_data, unsigned int ptx_len);
> +
> +/**
> + * Get CUDA function from loaded module
> + */
> +int ff_cuda_vpp_get_function(AVFilterContext *ctx, CUDAVPPContext *s,
> + CUfunction *func, const char *func_name);
> +
> +#endif /* AVFILTER_CUDA_VPP_H */
It'd be better if this was added in its own commit.
> diff --git a/libavfilter/vf_transpose_cuda.c b/libavfilter/vf_transpose_cuda.c
> new file mode 100644
> index 0000000000..bb7959ce0f
> --- /dev/null
> +++ b/libavfilter/vf_transpose_cuda.c
> @@ -0,0 +1,423 @@
> +/*
> + * Copyright (c) 2025, Faeez Kadiri < f1k2faeez at gmail dot com>
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +/**
> + * @file
> + * Hardware accelerated transpose filter based on CUDA
> + */
> +
> +#include "libavutil/opt.h"
> +#include "libavutil/common.h"
> +#include "libavutil/pixdesc.h"
> +#include "libavutil/hwcontext.h"
> +#include "libavutil/hwcontext_cuda_internal.h"
> +#include "libavutil/cuda_check.h"
> +#include "libavutil/avstring.h"
> +#include "libavutil/avassert.h"
> +#include "libavutil/imgutils.h"
> +
> +#include "filters.h"
> +#include "formats.h"
> +#include "video.h"
> +#include "transpose.h"
> +#include "cuda/cuda_vpp.h"
> +
> +static const enum AVPixelFormat supported_formats[] = {
> + AV_PIX_FMT_YUV420P,
> + AV_PIX_FMT_NV12,
> + AV_PIX_FMT_YUV444P,
> + AV_PIX_FMT_P010,
> + AV_PIX_FMT_P016,
> + AV_PIX_FMT_YUV444P16,
> + AV_PIX_FMT_0RGB32,
> + AV_PIX_FMT_0BGR32,
> + AV_PIX_FMT_RGB32,
> + AV_PIX_FMT_BGR32,
> +};
> +
> +#define DIV_UP(a, b) ( ((a) + (b) - 1) / (b) )
> +#define BLOCKX 32
> +#define BLOCKY 16
> +
> +typedef struct TransposeCUDAContext {
> + CUDAVPPContext vpp_ctx; // must be the first field
> +
> + int passthrough; // PassthroughType, landscape passthrough mode enabled
I'd prefer if this was named more explicitly, like passthrough_mode,
to prevent confusion with the passthrough parameter in the VPPContext.
> + int dir; // TransposeDir
> +
> + // CUDA functions for different operations
> + CUfunction cu_func_transpose;
> + CUfunction cu_func_transpose_uv;
> +} TransposeCUDAContext;
> +
> +static int format_is_supported(enum AVPixelFormat fmt)
> +{
> + int i;
> +
> + for (i = 0; i < FF_ARRAY_ELEMS(supported_formats); i++)
> + if (supported_formats[i] == fmt)
> + return 1;
> + return 0;
> +}
> +
> +static av_cold int transpose_cuda_load_functions(AVFilterContext *avctx, enum AVPixelFormat format)
> +{
> + TransposeCUDAContext *ctx = avctx->priv;
> + CUDAVPPContext *vpp_ctx = &ctx->vpp_ctx;
> + int ret;
> + char buf[128];
> +
> + const char *fmt_name = av_get_pix_fmt_name(format);
> +
> + extern const unsigned char ff_vf_transpose_cuda_ptx_data[];
> + extern const unsigned int ff_vf_transpose_cuda_ptx_len;
> +
> + ret = ff_cuda_vpp_load_module(avctx, vpp_ctx,
> + ff_vf_transpose_cuda_ptx_data, ff_vf_transpose_cuda_ptx_len);
> + if (ret < 0)
> + return ret;
> +
> + // Load transpose functions
> + snprintf(buf, sizeof(buf), "Transpose_%s", fmt_name);
> + ret = ff_cuda_vpp_get_function(avctx, vpp_ctx, &ctx->cu_func_transpose, buf);
> + if (ret < 0) {
> + av_log(avctx, AV_LOG_FATAL, "Unsupported format for transpose: %s\n", fmt_name);
> + return AVERROR(ENOSYS);
> + }
> +
> + snprintf(buf, sizeof(buf), "Transpose_%s_uv", fmt_name);
> + ret = ff_cuda_vpp_get_function(avctx, vpp_ctx, &ctx->cu_func_transpose_uv, buf);
> + if (ret < 0 && vpp_ctx->in_planes > 1) {
> + av_log(avctx, AV_LOG_WARNING, "UV function not found for format: %s\n", fmt_name);
> + }
> +
> + return 0;
> +}
> +
> +static int transpose_cuda_build_filter_params(AVFilterContext *avctx)
> +{
> + TransposeCUDAContext *ctx = avctx->priv;
> + CUDAVPPContext *vpp_ctx = &ctx->vpp_ctx;
> +
> + if (!format_is_supported(vpp_ctx->in_fmt)) {
> + av_log(avctx, AV_LOG_ERROR, "Unsupported input format: %s\n",
> + av_get_pix_fmt_name(vpp_ctx->in_fmt));
> + return AVERROR(ENOSYS);
> + }
> +
> + return 0;
> +}
> +
> +static av_cold int transpose_cuda_kernel(AVFilterContext *avctx, CUfunction func,
> + CUtexObject src_tex[4],
> + AVFrame *out_frame,
> + int width, int height,
> + int dst_width, int dst_height, int dst_pitch,
> + int src_width, int src_height, int dir)
> +{
> + TransposeCUDAContext *ctx = avctx->priv;
> + CUDAVPPContext *s = &ctx->vpp_ctx;
> + CudaFunctions *cu = s->cuda_dl;
> +
> + CUdeviceptr dst_devptr[4] = {
> + (CUdeviceptr)out_frame->data[0], (CUdeviceptr)out_frame->data[1],
> + (CUdeviceptr)out_frame->data[2], (CUdeviceptr)out_frame->data[3]
> + };
> +
> + void *args[] = {
> + &src_tex[0], &src_tex[1], &src_tex[2], &src_tex[3],
> + &dst_devptr[0], &dst_devptr[1], &dst_devptr[2], &dst_devptr[3],
> + &width, &height, &dst_pitch,
> + &dst_width, &dst_height,
> + &src_width, &src_height,
> + &dir
> + };
> +
> + return CHECK_CU(cu->cuLaunchKernel(func,
> + DIV_UP(width, BLOCKX), DIV_UP(height, BLOCKY), 1,
> + BLOCKX, BLOCKY, 1,
> + 0, s->cu_stream, args, NULL));
> +}
> +
> +static int transpose_cuda_filter_frame(AVFilterLink *inlink, AVFrame *input_frame)
> +{
> + AVFilterContext *avctx = inlink->dst;
> + AVFilterLink *outlink = avctx->outputs[0];
> + TransposeCUDAContext *ctx = avctx->priv;
> + CUDAVPPContext *s = &ctx->vpp_ctx;
> + CudaFunctions *cu = s->cuda_dl;
> + AVFrame *output_frame = NULL;
> + CUtexObject tex[4] = { 0, 0, 0, 0 };
> + int ret = 0;
> + int i;
> + CUcontext dummy;
> +
> + if (ctx->passthrough)
> + return ff_filter_frame(outlink, input_frame);
> +
> + av_log(avctx, AV_LOG_DEBUG, "Filter input: %s, %ux%u (%"PRId64").\n",
> + av_get_pix_fmt_name(input_frame->format),
> + input_frame->width, input_frame->height, input_frame->pts);
> +
> + // Push CUDA context
> + ret = CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx));
> + if (ret < 0)
> + return ret;
> +
> + output_frame = ff_get_video_buffer(outlink, s->output_width,
> + s->output_height);
> + if (!output_frame) {
> + ret = AVERROR(ENOMEM);
> + goto fail;
> + }
> +
> + ret = av_frame_copy_props(output_frame, input_frame);
> + if (ret < 0)
> + goto fail;
> +
> + // Create texture objects for input
> + for (i = 0; i < s->in_planes; i++) {
> + CUDA_TEXTURE_DESC tex_desc = {
> + .filterMode = CU_TR_FILTER_MODE_POINT,
> + .flags = CU_TRSF_READ_AS_INTEGER,
> + };
> +
> + CUDA_RESOURCE_DESC res_desc = {
> + .resType = CU_RESOURCE_TYPE_PITCH2D,
> + .res.pitch2D.format = s->in_plane_depths[i] <= 8 ?
> + CU_AD_FORMAT_UNSIGNED_INT8 :
> + CU_AD_FORMAT_UNSIGNED_INT16,
> + .res.pitch2D.numChannels = s->in_plane_channels[i],
> + .res.pitch2D.pitchInBytes = input_frame->linesize[i],
> + .res.pitch2D.devPtr = (CUdeviceptr)input_frame->data[i],
> + };
> +
> + if (i == 1 || i == 2) {
> + res_desc.res.pitch2D.width = AV_CEIL_RSHIFT(input_frame->width, s->in_desc->log2_chroma_w);
> + res_desc.res.pitch2D.height = AV_CEIL_RSHIFT(input_frame->height, s->in_desc->log2_chroma_h);
> + } else {
> + res_desc.res.pitch2D.width = input_frame->width;
> + res_desc.res.pitch2D.height = input_frame->height;
> + }
> +
> + ret = CHECK_CU(cu->cuTexObjectCreate(&tex[i], &res_desc, &tex_desc, NULL));
This is unrelated to this filter and review, but I do wonder if these
texture objects couldn't be cached in the frame somehow.
Not sure how high the overhead of constantly creating and destroying
these is, but it can't be fully free.
> + if (ret < 0)
> + goto fail;
> + }
> +
> + // Process luma plane
> + ret = transpose_cuda_kernel(avctx, ctx->cu_func_transpose, tex, output_frame,
> + output_frame->width, output_frame->height,
> + output_frame->width, output_frame->height,
> + output_frame->linesize[0],
> + input_frame->width, input_frame->height, ctx->dir);
> + if (ret < 0) {
> + av_log(avctx, AV_LOG_ERROR, "Error during luma transpose: %d\n", ret);
> + goto fail;
> + }
> +
> + // Process chroma planes if present
> + if (s->in_planes > 1) {
> + ret = transpose_cuda_kernel(avctx, ctx->cu_func_transpose_uv, tex, output_frame,
> + AV_CEIL_RSHIFT(output_frame->width, s->in_desc->log2_chroma_w),
> + AV_CEIL_RSHIFT(output_frame->height, s->in_desc->log2_chroma_h),
> + output_frame->width, output_frame->height,
> + output_frame->linesize[1],
> + AV_CEIL_RSHIFT(input_frame->width, s->in_desc->log2_chroma_w),
> + AV_CEIL_RSHIFT(input_frame->height, s->in_desc->log2_chroma_h),
> + ctx->dir);
> + if (ret < 0) {
> + av_log(avctx, AV_LOG_ERROR, "Error during chroma transpose: %d\n", ret);
> + goto fail;
> + }
> + }
> +
> + // Handle sample aspect ratio
> + if (input_frame->sample_aspect_ratio.num == 0) {
> + output_frame->sample_aspect_ratio = input_frame->sample_aspect_ratio;
> + } else {
> + output_frame->sample_aspect_ratio.num = input_frame->sample_aspect_ratio.den;
> + output_frame->sample_aspect_ratio.den = input_frame->sample_aspect_ratio.num;
> + }
> +
> + av_frame_free(&input_frame);
> +
> + av_log(avctx, AV_LOG_DEBUG, "Filter output: %s, %ux%u (%"PRId64").\n",
> + av_get_pix_fmt_name(output_frame->format),
> + output_frame->width, output_frame->height, output_frame->pts);
> +
> + // Cleanup texture objects
> + for (i = 0; i < FF_ARRAY_ELEMS(tex); i++)
> + if (tex[i])
> + CHECK_CU(cu->cuTexObjectDestroy(tex[i]));
> +
> + CHECK_CU(cu->cuCtxPopCurrent(&dummy));
> + return ff_filter_frame(outlink, output_frame);
> +
> +fail:
> + for (i = 0; i < FF_ARRAY_ELEMS(tex); i++)
> + if (tex[i])
> + CHECK_CU(cu->cuTexObjectDestroy(tex[i]));
> +
> + av_frame_free(&input_frame);
> + av_frame_free(&output_frame);
> + CHECK_CU(cu->cuCtxPopCurrent(&dummy));
> + return ret;
> +}
> +
> +static void transpose_cuda_uninit(AVFilterContext *avctx)
> +{
> + TransposeCUDAContext *ctx = avctx->priv;
> + CUDAVPPContext *s = &ctx->vpp_ctx;
> +
> + if (s->cu_module) {
> + CudaFunctions *cu = s->cuda_dl;
> + CUcontext dummy;
> +
> + if (s->hwctx) {
> + CHECK_CU(cu->cuCtxPushCurrent(s->hwctx->cuda_ctx));
> + CHECK_CU(cu->cuModuleUnload(s->cu_module));
> + s->cu_module = NULL;
> + CHECK_CU(cu->cuCtxPopCurrent(&dummy));
> + }
> + }
> +
> + av_buffer_unref(&s->device_ref);
> + s->hwctx = NULL;
> +}
> +
> +static av_cold int transpose_cuda_init(AVFilterContext *avctx)
> +{
> + TransposeCUDAContext *ctx = avctx->priv;
> + CUDAVPPContext *vpp_ctx = &ctx->vpp_ctx;
> +
> + ff_cuda_vpp_ctx_init(avctx);
> + vpp_ctx->load_functions = transpose_cuda_load_functions;
> + vpp_ctx->build_filter_params = transpose_cuda_build_filter_params;
> + vpp_ctx->pipeline_uninit = transpose_cuda_uninit;
> + vpp_ctx->output_format = AV_PIX_FMT_NONE;
> +
> + return 0;
> +}
> +
> +static int transpose_cuda_config_output(AVFilterLink *outlink)
> +{
> + AVFilterContext *avctx = outlink->src;
> + TransposeCUDAContext *ctx = avctx->priv;
> + CUDAVPPContext *vpp_ctx = &ctx->vpp_ctx;
> + AVFilterLink *inlink = avctx->inputs[0];
> +
> + if ((inlink->w >= inlink->h && ctx->passthrough == TRANSPOSE_PT_TYPE_LANDSCAPE) ||
> + (inlink->w <= inlink->h && ctx->passthrough == TRANSPOSE_PT_TYPE_PORTRAIT)) {
> + vpp_ctx->passthrough = 1;
> + av_log(avctx, AV_LOG_VERBOSE,
> + "w:%d h:%d -> w:%d h:%d (passthrough mode)\n",
> + inlink->w, inlink->h, inlink->w, inlink->h);
> + return ff_cuda_vpp_config_output(outlink);
> + }
> + ctx->passthrough = TRANSPOSE_PT_TYPE_NONE;
> +
> + // For transpose operations that swap dimensions
> + switch (ctx->dir) {
> + case TRANSPOSE_CCLOCK_FLIP:
> + case TRANSPOSE_CCLOCK:
> + case TRANSPOSE_CLOCK:
> + case TRANSPOSE_CLOCK_FLIP:
> + vpp_ctx->output_width = avctx->inputs[0]->h;
> + vpp_ctx->output_height = avctx->inputs[0]->w;
> + av_log(avctx, AV_LOG_DEBUG, "swap width and height for clock/cclock rotation\n");
> + break;
> + default:
> + vpp_ctx->output_width = avctx->inputs[0]->w;
> + vpp_ctx->output_height = avctx->inputs[0]->h;
> + break;
> + }
> +
> + av_log(avctx, AV_LOG_VERBOSE,
> + "w:%d h:%d dir:%d -> w:%d h:%d rotation:%s vflip:%d\n",
> + inlink->w, inlink->h, ctx->dir, vpp_ctx->output_width, vpp_ctx->output_height,
> + ctx->dir == 1 || ctx->dir == 3 ? "clockwise" : "counterclockwise",
> + ctx->dir == 0 || ctx->dir == 3);
> +
> + return ff_cuda_vpp_config_output(outlink);
> +}
> +
> +static AVFrame *get_video_buffer(AVFilterLink *inlink, int w, int h)
> +{
> + TransposeCUDAContext *ctx = inlink->dst->priv;
> +
> + return ctx->passthrough ?
> + ff_null_get_video_buffer(inlink, w, h) :
> + ff_default_get_video_buffer(inlink, w, h);
> +}
> +
> +#define OFFSET(x) offsetof(TransposeCUDAContext, x)
> +#define FLAGS (AV_OPT_FLAG_VIDEO_PARAM | AV_OPT_FLAG_FILTERING_PARAM)
> +static const AVOption transpose_cuda_options[] = {
> + { "dir", "set transpose direction", OFFSET(dir), AV_OPT_TYPE_INT, { .i64 = TRANSPOSE_CCLOCK_FLIP }, 0, 6, FLAGS, .unit = "dir" },
> + { "cclock_flip", "rotate counter-clockwise with vertical flip", 0, AV_OPT_TYPE_CONST, { .i64 = TRANSPOSE_CCLOCK_FLIP }, .flags=FLAGS, .unit = "dir" },
> + { "clock", "rotate clockwise", 0, AV_OPT_TYPE_CONST, { .i64 = TRANSPOSE_CLOCK }, .flags=FLAGS, .unit = "dir" },
> + { "cclock", "rotate counter-clockwise", 0, AV_OPT_TYPE_CONST, { .i64 = TRANSPOSE_CCLOCK }, .flags=FLAGS, .unit = "dir" },
> + { "clock_flip", "rotate clockwise with vertical flip", 0, AV_OPT_TYPE_CONST, { .i64 = TRANSPOSE_CLOCK_FLIP }, .flags=FLAGS, .unit = "dir" },
> + { "reversal", "rotate by half-turn", 0, AV_OPT_TYPE_CONST, { .i64 = TRANSPOSE_REVERSAL }, .flags=FLAGS, .unit = "dir" },
> + { "hflip", "flip horizontally", 0, AV_OPT_TYPE_CONST, { .i64 = TRANSPOSE_HFLIP }, .flags=FLAGS, .unit = "dir" },
> + { "vflip", "flip vertically", 0, AV_OPT_TYPE_CONST, { .i64 = TRANSPOSE_VFLIP }, .flags=FLAGS, .unit = "dir" },
> +
> + { "passthrough", "do not apply transposition if the input matches the specified geometry",
> + OFFSET(passthrough), AV_OPT_TYPE_INT, {.i64=TRANSPOSE_PT_TYPE_NONE}, 0, INT_MAX, FLAGS, .unit = "passthrough" },
> + { "none", "always apply transposition", 0, AV_OPT_TYPE_CONST, {.i64=TRANSPOSE_PT_TYPE_NONE}, INT_MIN, INT_MAX, FLAGS, .unit = "passthrough" },
> + { "portrait", "preserve portrait geometry", 0, AV_OPT_TYPE_CONST, {.i64=TRANSPOSE_PT_TYPE_PORTRAIT}, INT_MIN, INT_MAX, FLAGS, .unit = "passthrough" },
> + { "landscape", "preserve landscape geometry", 0, AV_OPT_TYPE_CONST, {.i64=TRANSPOSE_PT_TYPE_LANDSCAPE}, INT_MIN, INT_MAX, FLAGS, .unit = "passthrough" },
> +
> + { NULL }
> +};
> +
> +AVFILTER_DEFINE_CLASS(transpose_cuda);
> +
> +static const AVFilterPad transpose_cuda_inputs[] = {
> + {
> + .name = "default",
> + .type = AVMEDIA_TYPE_VIDEO,
> + .filter_frame = transpose_cuda_filter_frame,
> + .get_buffer.video = get_video_buffer,
> + .config_props = ff_cuda_vpp_config_input,
> + },
> +};
> +
> +static const AVFilterPad transpose_cuda_outputs[] = {
> + {
> + .name = "default",
> + .type = AVMEDIA_TYPE_VIDEO,
> + .config_props = transpose_cuda_config_output,
> + },
> +};
> +
> +const FFFilter ff_vf_transpose_cuda = {
> + .p.name = "transpose_cuda",
> + .p.description = NULL_IF_CONFIG_SMALL("CUDA accelerated video transpose"),
> + .p.priv_class = &transpose_cuda_class,
> + .priv_size = sizeof(TransposeCUDAContext),
> + .init = transpose_cuda_init,
> + .uninit = ff_cuda_vpp_ctx_uninit,
> + FILTER_INPUTS(transpose_cuda_inputs),
> + FILTER_OUTPUTS(transpose_cuda_outputs),
> + FILTER_QUERY_FUNC2(ff_cuda_vpp_query_formats),
> + .flags_internal = FF_FILTER_FLAG_HWFRAME_AWARE,
> +};
> diff --git a/libavfilter/vf_transpose_cuda.cu b/libavfilter/vf_transpose_cuda.cu
> new file mode 100644
> index 0000000000..1384c228e3
> --- /dev/null
> +++ b/libavfilter/vf_transpose_cuda.cu
> @@ -0,0 +1,219 @@
> +/*
> + * Copyright (c) 2025, Faeez Kadiri < f1k2faeez at gmail dot com>
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "cuda/vector_helpers.cuh"
> +
> +// Transpose direction constants (from transpose.h)
> +#define TRANSPOSE_CCLOCK_FLIP 0
> +#define TRANSPOSE_CLOCK 1
> +#define TRANSPOSE_CCLOCK 2
> +#define TRANSPOSE_CLOCK_FLIP 3
> +#define TRANSPOSE_REVERSAL 4
> +#define TRANSPOSE_HFLIP 5
> +#define TRANSPOSE_VFLIP 6
You should be able to include the header here; it's just normal C++.
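i.e. something along these lines should work (untested, assuming the relative
include resolves the same way it does for vector_helpers.cuh, since the .cu
file lives in libavfilter/):

```cuda
// instead of duplicating the TRANSPOSE_* constants,
// pull them in from the existing header:
#include "transpose.h"
```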
> +// FFmpeg passes pitch in bytes, CUDA uses potentially larger types
> +#define FIXED_PITCH(T) \
> + (dst_pitch/sizeof(T))
> +
> +#define DEFAULT_DST(n, T) \
> + dst[n][yo*FIXED_PITCH(T)+xo]
> +
> +// --- COORDINATE TRANSFORMATION FUNCTIONS ---
> +
> +__device__ static inline void get_transpose_coords(int src_x, int src_y, int src_width, int src_height,
> + int *dst_x, int *dst_y, int dst_width, int dst_height, int dir)
> +{
> + switch (dir) {
> + case TRANSPOSE_CCLOCK_FLIP: // 90° CCW + vertical flip
> + *dst_x = src_y;
> + *dst_y = src_x;
> + break;
> + case TRANSPOSE_CLOCK: // 90° CW
> + *dst_x = src_y;
> + *dst_y = src_width - 1 - src_x;
> + break;
> + case TRANSPOSE_CCLOCK: // 90° CCW
> + *dst_x = src_height - 1 - src_y;
> + *dst_y = src_x;
> + break;
> + case TRANSPOSE_CLOCK_FLIP: // 90° CW + vertical flip
> + *dst_x = src_height - 1 - src_y;
> + *dst_y = src_width - 1 - src_x;
> + break;
> + case TRANSPOSE_REVERSAL: // 180° rotation
> + *dst_x = src_width - 1 - src_x;
> + *dst_y = src_height - 1 - src_y;
> + break;
> + case TRANSPOSE_HFLIP: // Horizontal flip
> + *dst_x = src_width - 1 - src_x;
> + *dst_y = src_y;
> + break;
> + case TRANSPOSE_VFLIP: // Vertical flip
> + *dst_x = src_x;
> + *dst_y = src_height - 1 - src_y;
> + break;
> + default:
> + *dst_x = src_x;
> + *dst_y = src_y;
> + break;
> + }
> +}
> +
> +// --- TRANSPOSE KERNELS ---
> +
> +#define TRANSPOSE_DEF(name, in_type, out_type) \
> +__device__ static inline void Transpose_##name##_impl( \
> + cudaTextureObject_t src_tex[4], out_type *dst[4], \
> + int xo, int yo, int width, int height, int dst_pitch, \
> + int dst_width, int dst_height, int src_width, int src_height, int dir) \
> +{ \
> + int src_x, src_y; \
> + get_transpose_coords(xo, yo, width, height, &src_x, &src_y, src_width, src_height, dir); \
> + \
> + in_type pixel = tex2D<in_type>(src_tex[0], src_x + 0.5f, src_y + 0.5f); \
> + DEFAULT_DST(0, out_type) = pixel; \
> +}
> +
> +#define TRANSPOSE_UV_DEF(name, in_type_uv, out_type_uv) \
> +__device__ static inline void Transpose_##name##_uv_impl( \
> + cudaTextureObject_t src_tex[4], out_type_uv *dst[4], \
> + int xo, int yo, int width, int height, int dst_pitch, \
> + int dst_width, int dst_height, int src_width, int src_height, int dir) \
> +{ \
> + int src_x, src_y; \
> + get_transpose_coords(xo, yo, width, height, &src_x, &src_y, src_width, src_height, dir); \
> + \
> + in_type_uv pixel_u = tex2D<in_type_uv>(src_tex[1], src_x + 0.5f, src_y + 0.5f); \
> + in_type_uv pixel_v = tex2D<in_type_uv>(src_tex[2], src_x + 0.5f, src_y + 0.5f); \
> + DEFAULT_DST(1, out_type_uv) = pixel_u; \
> + DEFAULT_DST(2, out_type_uv) = pixel_v; \
> +}
> +
> +#define TRANSPOSE_NV_UV_DEF(name, in_type_uv, out_type_uv) \
> +__device__ static inline void Transpose_##name##_uv_impl( \
> + cudaTextureObject_t src_tex[4], out_type_uv *dst[4], \
> + int xo, int yo, int width, int height, int dst_pitch, \
> + int dst_width, int dst_height, int src_width, int src_height, int dir) \
> +{ \
> + int src_x, src_y; \
> + get_transpose_coords(xo, yo, width, height, &src_x, &src_y, src_width, src_height, dir); \
> + \
> + in_type_uv pixel_uv = tex2D<in_type_uv>(src_tex[1], src_x + 0.5f, src_y + 0.5f); \
> + DEFAULT_DST(1, out_type_uv) = pixel_uv; \
> +}
> +
> +
> +// Define transpose implementations for all formats
> +TRANSPOSE_DEF(yuv420p, uchar, uchar)
> +TRANSPOSE_UV_DEF(yuv420p, uchar, uchar)
> +
> +TRANSPOSE_DEF(nv12, uchar, uchar)
> +TRANSPOSE_NV_UV_DEF(nv12, uchar2, uchar2)
> +
> +TRANSPOSE_DEF(yuv444p, uchar, uchar)
> +TRANSPOSE_UV_DEF(yuv444p, uchar, uchar)
> +
> +TRANSPOSE_DEF(p010le, ushort, ushort)
> +TRANSPOSE_NV_UV_DEF(p010le, ushort2, ushort2)
> +
> +TRANSPOSE_DEF(p016le, ushort, ushort)
> +TRANSPOSE_NV_UV_DEF(p016le, ushort2, ushort2)
> +
> +TRANSPOSE_DEF(yuv444p16le, ushort, ushort)
> +TRANSPOSE_UV_DEF(yuv444p16le, ushort, ushort)
> +
> +TRANSPOSE_DEF(rgb0, uchar4, uchar4)
> +TRANSPOSE_DEF(bgr0, uchar4, uchar4)
> +TRANSPOSE_DEF(rgba, uchar4, uchar4)
> +TRANSPOSE_DEF(bgra, uchar4, uchar4)
Shouldn't it be possible to reduce the number of kernels quite
drastically here, by just having one per element-size and layout?
The kernels then don't need to care about the pix_fmts anymore.
The filter then combines the correct kernels from the pixel format info.
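Roughly like this (untested sketch, reusing get_transpose_coords() from above):
one format-agnostic kernel per element type, instantiated for the five element
types the supported formats actually use. The filter would then pick
Transpose_uchar for 8-bit planar planes, Transpose_uchar2 for nv12 chroma,
Transpose_uchar4 for packed RGB, and so on, based on the pixel format
descriptor, instead of loading per-format kernels:

```cuda
#define TRANSPOSE_PLANE_KERNEL(T)                                         \
__global__ void Transpose_##T(cudaTextureObject_t src, T *dst,            \
                              int width, int height, int dst_pitch,       \
                              int src_width, int src_height, int dir)     \
{                                                                         \
    int xo = blockIdx.x * blockDim.x + threadIdx.x;                       \
    int yo = blockIdx.y * blockDim.y + threadIdx.y;                       \
    if (xo >= width || yo >= height)                                      \
        return;                                                           \
    int sx, sy;                                                           \
    get_transpose_coords(xo, yo, width, height, &sx, &sy,                 \
                         src_width, src_height, dir);                     \
    dst[yo * (dst_pitch / sizeof(T)) + xo] =                              \
        tex2D<T>(src, sx + 0.5f, sy + 0.5f);                              \
}

extern "C" {
TRANSPOSE_PLANE_KERNEL(uchar)   /* 8-bit planar planes            */
TRANSPOSE_PLANE_KERNEL(uchar2)  /* nv12 interleaved chroma        */
TRANSPOSE_PLANE_KERNEL(uchar4)  /* packed RGB/BGR                 */
TRANSPOSE_PLANE_KERNEL(ushort)  /* 16-bit planar planes           */
TRANSPOSE_PLANE_KERNEL(ushort2) /* p010/p016 interleaved chroma   */
}
```

That also makes the empty RGB "UV" kernels below unnecessary: the filter just
launches one kernel per plane that actually exists.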
> +// --- KERNEL ARGUMENT DEFINITIONS ---
> +
> +#define TRANSPOSE_KERNEL_ARGS(T) \
> + cudaTextureObject_t src_tex_0, cudaTextureObject_t src_tex_1, \
> + cudaTextureObject_t src_tex_2, cudaTextureObject_t src_tex_3, \
> + T *dst_0, T *dst_1, T *dst_2, T *dst_3, \
> + int width, int height, int dst_pitch, \
> + int dst_width, int dst_height, \
> + int src_width, int src_height, int dir
> +
> +#define TRANSPOSE_KERNEL_IMPL(func_impl, T) \
> + cudaTextureObject_t src_tex[4] = { src_tex_0, src_tex_1, src_tex_2, src_tex_3 }; \
> + T *dst[4] = { dst_0, dst_1, dst_2, dst_3 }; \
> + int xo = blockIdx.x * blockDim.x + threadIdx.x; \
> + int yo = blockIdx.y * blockDim.y + threadIdx.y; \
> + if (xo >= width || yo >= height) return; \
> + \
> + func_impl(src_tex, dst, xo, yo, width, height, dst_pitch, \
> + dst_width, dst_height, src_width, src_height, dir);
> +
> +extern "C" {
> +
> +// --- TRANSPOSE KERNELS ---
> +
> +#define TRANSPOSE_KERNEL(name, T) \
> +__global__ void Transpose_##name(TRANSPOSE_KERNEL_ARGS(T)) \
> +{ \
> + TRANSPOSE_KERNEL_IMPL(Transpose_##name##_impl, T) \
> +}
> +
> +#define TRANSPOSE_UV_KERNEL(name, T) \
> +__global__ void Transpose_##name##_uv(TRANSPOSE_KERNEL_ARGS(T)) \
> +{ \
> + TRANSPOSE_KERNEL_IMPL(Transpose_##name##_uv_impl, T) \
> +}
> +
> +// Transpose kernels for all formats
> +TRANSPOSE_KERNEL(yuv420p, uchar)
> +TRANSPOSE_UV_KERNEL(yuv420p, uchar)
> +
> +TRANSPOSE_KERNEL(nv12, uchar)
> +TRANSPOSE_UV_KERNEL(nv12, uchar2)
> +
> +TRANSPOSE_KERNEL(yuv444p, uchar)
> +TRANSPOSE_UV_KERNEL(yuv444p, uchar)
> +
> +TRANSPOSE_KERNEL(p010le, ushort)
> +TRANSPOSE_UV_KERNEL(p010le, ushort2)
> +
> +TRANSPOSE_KERNEL(p016le, ushort)
> +TRANSPOSE_UV_KERNEL(p016le, ushort2)
> +
> +TRANSPOSE_KERNEL(yuv444p16le, ushort)
> +TRANSPOSE_UV_KERNEL(yuv444p16le, ushort)
> +
> +TRANSPOSE_KERNEL(rgb0, uchar4)
> +TRANSPOSE_KERNEL(bgr0, uchar4)
> +TRANSPOSE_KERNEL(rgba, uchar4)
> +TRANSPOSE_KERNEL(bgra, uchar4)
> +
> +// For RGB formats, UV kernels are not needed, but we provide empty implementations
> +// to maintain consistency with the function loading logic
> +
> +#define EMPTY_UV_KERNEL(name, T) \
> +__global__ void Transpose_##name##_uv(TRANSPOSE_KERNEL_ARGS(T)) { } \
> +
> +EMPTY_UV_KERNEL(rgb0, uchar)
> +EMPTY_UV_KERNEL(bgr0, uchar)
> +EMPTY_UV_KERNEL(rgba, uchar)
> +EMPTY_UV_KERNEL(bgra, uchar)
Same as above, this should all be possible to simplify by not having one
kernel per pixel-format.
> +}