[FFmpeg-devel] [PATCH FFmpeg 12/15] doc: move dnn_classify filter documentation to the Multimedia Filters chapter
m.kaindl0208 at gmail.com
Sat Mar 8 17:01:56 EET 2025
Try the new filters using my GitHub repo: https://github.com/MaximilianKaindl/DeepFFMPEGVideoClassification.
Any feedback is appreciated!
Signed-off-by: MaximilianKaindl <m.kaindl0208@gmail.com>
---
doc/filters.texi | 170 +++++++++++++++++++++++------------------------
1 file changed, 85 insertions(+), 85 deletions(-)
diff --git a/doc/filters.texi b/doc/filters.texi
index bd75982d7d..915e0244cd 100644
--- a/doc/filters.texi
+++ b/doc/filters.texi
@@ -11970,91 +11970,6 @@ ffmpeg -i INPUT -f lavfi -i nullsrc=hd720,geq='r=128+80*(sin(sqrt((X-W/2)*(X-W/2
@end example
@end itemize
-@section dnn_classify
-Analyze media (video frames or audio) using deep neural networks to apply classifications based on the content.
-This filter supports three classification modes:
-
-@itemize @bullet
-@item Standard image classification (OpenVINO backend)
-@item CLIP (Contrastive Language-Image Pre-training) classification (Torch backend)
-@item CLAP (Contrastive Language-Audio Pre-training) classification (Torch backend)
-@end itemize
-
-The filter accepts the following options:
-@table @option
-@item dnn_backend
-Specify which DNN backend to use for model loading and execution. Currently supports:
-@table @samp
-@item openvino
-Use OpenVINO backend (standard image classification only).
-@item torch
-Use LibTorch backend (supports CLIP for images and CLAP for audio).
-@end table
-@item confidence
-Set the confidence threshold (default: 0.5). Classifications with confidence below this value will be filtered out.
-@item labels
-Set path to a label file specifying classification labels. This is required for standard classification and can be used for CLIP/CLAP classification.
-Each label is written on a separate line in the file. Trailing spaces and empty lines are skipped.
-@item categories
-Path to a categories file for hierarchical classification (CLIP/CLAP only). This allows classification to be organized into multiple category units with individual categories containing related labels.
-@item tokenizer
-Path to the text tokenizer.json file (CLIP/CLAP only). Required for text embedding generation.
-@item target
-Specify which objects to classify. When omitted, the entire frame is classified. When specified, only bounding boxes with detection labels matching this value are classified.
-@item is_audio
-Enable audio processing mode for CLAP models (default: 0). Set to 1 to process audio input instead of video frames.
-@item logit_scale
-Logit scale for similarity calculation in CLIP/CLAP (default: 4.6052 for CLIP, 33.37 for CLAP). Values below 0 use the default.
-@item temperature
-Softmax temperature for CLIP/CLAP models (default: 1.0). Lower values make the output more peaked, higher values make it smoother.
-@item forward_order
-Order of forward output for CLIP/CLAP: 0 for media-text order, 1 for text-media order (default depends on model type).
-@item normalize
-Whether to normalize the input tensor for CLIP/CLAP (default depends on model type). Some scripted models already do this in the forward, so this is not necessary in some cases.
-@item input_res
-Expected input resolution for video processing models (default: automatically detected).
-@item sample_rate
-Expected sample rate for audio processing models (default: 44100).
-@item sample_duration
-Expected sample duration in seconds for audio processing models (default: 7).
-@item token_dimension
-Dimension of token vector for text embeddings (default: 77).
-@item optimize
-Enable graph executor optimization (0: disabled, 1: enabled).
-@end table
-@subsection Category Files Format
-For CLIP/CLAP models, a hierarchical categories file can be provided with the following format:
-@example
-[RecordingSystem]
-(Professional)
-a photo with high level of detail
-a professionally recorded sound
-(HomeRecording)
-a photo with low level of detail
-an amateur recording
-[ContentType]
-(Nature)
-trees
-mountains
-birds singing
-(Urban)
-buildings
-street noise
-traffic sounds
-@end example
-Each unit enclosed in square brackets [] creates a classification group. Within each group, categories are defined with parentheses () and the labels under each category are used to classify the input.
-@subsection Examples
-@example
-Classify video using OpenVINO
-ffmpeg -i input.mp4 -vf "dnn_classify=dnn_backend=openvino:model=model.xml:labels=labels.txt" output.mp4
-Classify video using CLIP
-ffmpeg -i input.mp4 -vf "dnn_classify=dnn_backend=torch:model=clip_model.pt:categories=categories.txt:tokenizer=tokenizer.json" output.mp4
-Classify only person objects in a video
-ffmpeg -i input.mp4 -vf "dnn_detect=model=detection.xml:input=data:output=detection_out:confidence=0.5,dnn_classify=model=clip_model.pt:dnn_backend=torch:tokenizer=tokenizer.json:labels=labels.txt:target=person" output.mp4
-Classify audio using CLAP
-ffmpeg -i input.mp3 -af "dnn_classify=dnn_backend=torch:model=clap_model.pt:categories=audio_categories.txt:tokenizer=tokenizer.json:is_audio=1:sample_rate=44100:sample_duration=7" output.mp3
-@end example
-
@section dnn_detect
Do object detection with deep neural networks.
@@ -30925,6 +30840,91 @@ bench=start,selectivecolor=reds=-.2 .12 -.49,bench=stop
@end example
@end itemize
+@section dnn_classify
+Analyze media (video frames or audio) with deep neural networks and apply classifications based on the content.
+This filter supports three classification modes:
+
+@itemize @bullet
+@item Standard image classification (OpenVINO backend)
+@item CLIP (Contrastive Language-Image Pre-training) classification (Torch backend)
+@item CLAP (Contrastive Language-Audio Pre-training) classification (Torch backend)
+@end itemize
+
+The filter accepts the following options:
+@table @option
+@item dnn_backend
+Specify which DNN backend to use for model loading and execution. The following backends are currently supported:
+@table @samp
+@item openvino
+Use the OpenVINO backend (standard image classification only).
+@item torch
+Use the LibTorch backend (supports CLIP for images and CLAP for audio).
+@end table
+@item confidence
+Set the confidence threshold (default: 0.5). Classifications with confidence below this value are filtered out.
+@item labels
+Set the path to a label file specifying the classification labels. It is required for standard classification and optional for CLIP/CLAP classification.
+Each label is written on a separate line in the file. Trailing spaces and empty lines are skipped.
+@item categories
+Set the path to a categories file for hierarchical classification (CLIP/CLAP only). This allows classifications to be organized into multiple category units, each containing categories of related labels.
+@item tokenizer
+Set the path to the text tokenizer.json file (CLIP/CLAP only). It is required for text embedding generation.
+@item target
+Specify which objects to classify. When omitted, the entire frame is classified. When specified, only bounding boxes whose detection label matches this value are classified.
+@item is_audio
+Enable audio processing mode for CLAP models (default: 0). Set to 1 to process audio input instead of video frames.
+@item logit_scale
+Logit scale for the similarity calculation in CLIP/CLAP (default: 4.6052 for CLIP, 33.37 for CLAP). Values below 0 select the default.
+@item temperature
+Softmax temperature for CLIP/CLAP models (default: 1.0). Lower values make the output distribution more peaked, while higher values make it smoother; see the numeric sketch after this option list.
+@item forward_order
+Order of the forward output for CLIP/CLAP: 0 for media-text order, 1 for text-media order (the default depends on the model type).
+@item normalize
+Whether to normalize the input tensor for CLIP/CLAP (the default depends on the model type). Some scripted models already normalize in their forward pass, in which case this option is not needed.
+@item input_res
+Expected input resolution for video processing models (default: automatically detected).
+@item sample_rate
+Expected sample rate for audio processing models (default: 44100).
+@item sample_duration
+Expected sample duration in seconds for audio processing models (default: 7).
+@item token_dimension
+Dimension of the token vector for text embeddings (default: 77).
+@item optimize
+Enable graph executor optimization (0: disabled, 1: enabled).
+@end table
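To make the interplay of logit_scale, temperature and confidence concrete, here is a rough numeric sketch in Python. It is illustrative only, not the filter's C implementation, and it assumes that logit_scale is the natural logarithm of the similarity multiplier (the CLIP default 4.6052 is ln(100)) and that a plain softmax is taken over the labels:

import math

def label_probabilities(similarities, logit_scale=4.6052, temperature=1.0,
                        confidence=0.5):
    """Turn per-label cosine similarities into probabilities and keep only
    labels whose probability reaches the confidence threshold (sketch)."""
    scale = math.exp(logit_scale)                # 4.6052 -> ~100, CLIP-style
    logits = [scale * s / temperature for s in similarities]
    top = max(logits)                            # subtract max for stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [(i, p) for i, p in enumerate(probs) if p >= confidence]

# Three hypothetical labels with cosine similarities 0.31, 0.22 and 0.05:
print(label_probabilities([0.31, 0.22, 0.05]))                  # top label ~0.9999
print(label_probabilities([0.31, 0.22, 0.05], temperature=10))  # flattens to ~0.68

Raising the temperature flattens the distribution, so the winning label's probability drops toward the confidence threshold; lowering it sharpens the distribution toward the best-matching label.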
+@subsection Category Files Format
+For CLIP/CLAP models, a hierarchical categories file can be provided with the following format:
+@example
+[RecordingSystem]
+(Professional)
+a photo with high level of detail
+a professionally recorded sound
+(HomeRecording)
+a photo with low level of detail
+an amateur recording
+[ContentType]
+(Nature)
+trees
+mountains
+birds singing
+(Urban)
+buildings
+street noise
+traffic sounds
+@end example
+Each unit enclosed in square brackets [] creates a classification group. Within each group, categories are defined with parentheses () and the labels under each category are used to classify the input.
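As a plain illustration of this structure (not the filter's actual parser, which is implemented in C and may treat malformed files differently), a minimal Python sketch that reads such a categories file into groups, categories and labels could look like this:

def parse_categories(path):
    """Read a hierarchical categories file: '[Name]' opens a classification
    group, '(Name)' opens a category inside the current group, and every
    other non-empty line is a label belonging to the current category."""
    groups = {}
    group = None
    category = None
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if not line:
                continue                              # empty lines are skipped
            if line.startswith("[") and line.endswith("]"):
                group = line[1:-1]
                groups[group] = {}
                category = None
            elif line.startswith("(") and line.endswith(")"):
                category = line[1:-1]
                groups[group][category] = []
            else:
                groups[group][category].append(line)
    return groups

# For the file shown above this yields:
# {'RecordingSystem': {'Professional': [...], 'HomeRecording': [...]},
#  'ContentType': {'Nature': [...], 'Urban': [...]}}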
+@subsection Examples
+@example
+Classify video using OpenVINO
+ffmpeg -i input.mp4 -vf "dnn_classify=dnn_backend=openvino:model=model.xml:labels=labels.txt" output.mp4
+Classify video using CLIP
+ffmpeg -i input.mp4 -vf "dnn_classify=dnn_backend=torch:model=clip_model.pt:categories=categories.txt:tokenizer=tokenizer.json" output.mp4
+Classify only person objects in a video
+ffmpeg -i input.mp4 -vf "dnn_detect=model=detection.xml:input=data:output=detection_out:confidence=0.5,dnn_classify=model=clip_model.pt:dnn_backend=torch:tokenizer=tokenizer.json:labels=labels.txt:target=person" output.mp4
+Classify audio using CLAP
+ffmpeg -i input.mp3 -af "dnn_classify=dnn_backend=torch:model=clap_model.pt:categories=audio_categories.txt:tokenizer=tokenizer.json:is_audio=1:sample_rate=44100:sample_duration=7" output.mp3
+@end example
+
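The CLAP example above passes sample_rate=44100 and sample_duration=7, i.e. the filter defaults. As a back-of-the-envelope sketch (assuming the audio is analyzed in consecutive, non-overlapping windows, which may not match the filter's exact batching), those values imply the following amount of audio per classification:

SAMPLE_RATE = 44100       # dnn_classify default sample_rate
SAMPLE_DURATION = 7       # dnn_classify default sample_duration, in seconds

# Samples fed into one classification window:
window_samples = SAMPLE_RATE * SAMPLE_DURATION                 # 308700 samples

# A hypothetical 3-minute input split into full windows:
clip_seconds = 180
full_windows = (clip_seconds * SAMPLE_RATE) // window_samples  # 25 windows

print(f"{window_samples} samples per window, {full_windows} full windows")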
@section concat
Concatenate audio and video streams, joining them together one after the
--
2.34.1