[FFmpeg-devel] [PATCH V2 08/10] libavutil: add side data AVDnnBoundingBox for dnn based detect/classify filters

Guo, Yejun yejun.guo at intel.com
Thu Feb 11 10:15:26 EET 2021



> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Mark
> Thompson
> Sent: 2021年2月11日 6:19
> To: ffmpeg-devel at ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH V2 08/10] libavutil: add side data
> AVDnnBoundingBox for dnn based detect/classify filters
> 
> On 10/02/2021 09:34, Guo, Yejun wrote:
> > Signed-off-by: Guo, Yejun <yejun.guo at intel.com>
> > ---
> >   doc/APIchanges       |  2 ++
> >   libavutil/Makefile   |  1 +
> >   libavutil/dnn_bbox.h | 68
> ++++++++++++++++++++++++++++++++++++++++++++
> >   libavutil/frame.c    |  1 +
> >   libavutil/frame.h    |  7 +++++
> >   libavutil/version.h  |  2 +-
> >   6 files changed, 80 insertions(+), 1 deletion(-)
> >   create mode 100644 libavutil/dnn_bbox.h
> 
> What is the intended consumer of this box information?  (Is there some other
> filter which will read these are do something with them, or some sort of user
> program?)
> 
> If there is no use in ffmpeg outside libavfilter then the header should probably
> be in libavfilter.


Thanks for the feedback. 

In most cases, other filters will consume this box information.
For example, a classify filter can recognize the plate number after the car plate is detected,
another filter can apply a 'beauty' effect if a face is detected, an updated drawbox filter (planned)
can draw the box for visualization, and a new filter such as bbox_to_roi can be added
to apply ROI encoding to the detected result.

It is also possible that other components will use it;
for example, new codecs are adding AI labels, so libavcodec might need it in the future,
and a user program might do something special such as:
1. use libavcodec to decode
2. use the detect filter
3. write its own code to handle the detection result (see the sketch right below)
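
For step 3, here is a rough sketch of what I mean (just an illustration, not the final API:
I assume here the new frame side data type is named AV_FRAME_DATA_DNN_BBOXES and the side
data buffer holds an array of AVDnnBoundingBox; the actual enum name is whatever the frame.h
change adds):

#include <stdio.h>
#include "libavutil/frame.h"
#include "libavutil/dnn_bbox.h"

/* sketch only: iterate the detected boxes attached to a decoded frame */
static void handle_detect_result(const AVFrame *frame)
{
    /* AV_FRAME_DATA_DNN_BBOXES is only the name I assume for the new side data type */
    AVFrameSideData *sd = av_frame_get_side_data(frame, AV_FRAME_DATA_DNN_BBOXES);
    if (!sd)
        return;

    const AVDnnBoundingBox *bbox = (const AVDnnBoundingBox *)sd->data;
    int nb_bbox = sd->size / sizeof(*bbox);

    for (int i = 0; i < nb_bbox; i++) {
        /* coordinates are in model-input pixels, see the discussion below */
        printf("label %d, confidence %d/%d, box (%d,%d)-(%d,%d)\n",
               bbox[i].detect_label,
               bbox[i].detect_conf.num, bbox[i].detect_conf.den,
               bbox[i].left, bbox[i].top, bbox[i].right, bbox[i].bottom);
    }
}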

As the first step, how about putting it in libavfilter (so it is not exposed
at the API level and we are free to change it when needed)? We can move
it to libavutil once that is required.

> 
> How tied is this to the DNN implementation, and hence the DNN name?  If
> someone made a standalone filter doing object detection by some other
> method, would it make sense for them to reuse this structure?

Yes, this structure is general. I added the dnn prefix for two reasons:
1. There is already a bounding box in libavfilter/bbox.h (see below); it is too simple
for us to reuse, so we need a new name.
typedef struct FFBoundingBox {
    int x1, x2, y1, y2;
} FFBoundingBox;

2. DNN is currently the dominant method for object detection.

How about using 'struct BoundingBox' while it lives in libavfilter (added into the current file bbox.h),
and renaming it to 'AVBoundingBox' when it needs to move to libavutil?
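
To make the proposal concrete, a rough sketch of how it could look inside libavfilter/bbox.h
for now (just the fields from this patch with the renames discussed in this thread, nothing final):

/* sketch only, mirroring the fields from the patch quoted below */
typedef struct BoundingBox {
    /* dimensions of the scaled image the model actually ran on */
    int model_input_width;
    int model_input_height;

    /* box edges, in model-input pixels */
    int top, left, bottom, right;

    /* detection result */
    int detect_label;
    AVRational detect_confidence;

    /* up to 4 classifications on this box, classify_count is 0 if none */
#define NUM_BBOX_CLASSIFY 4
    uint32_t classify_count;
    int classify_labels[NUM_BBOX_CLASSIFY];
    AVRational classify_confidences[NUM_BBOX_CLASSIFY];
} BoundingBox;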

> 
> > diff --git a/libavutil/dnn_bbox.h b/libavutil/dnn_bbox.h new file mode
> > 100644 index 0000000000..50899c4486
> > --- /dev/null
> > +++ b/libavutil/dnn_bbox.h
> > +
> > +    /**
> > +     * Object detection is usually applied to a smaller image that
> > +     * is scaled down from the original frame.
> > +     * width and height are attributes of the scaled image, in pixel.
> > +     */
> > +    int model_input_width;
> > +    int model_input_height;
> 
> Other than to interpret the distances below, what will the user do with this
> information?  (Alternatively: why not map the distances back onto the
> original frame size?)

My idea was: if the original frame is scaled later, the side data (bbox)
stays correct without any modification.

If we express the distances relative to the original frame size, shall we say the behavior
is undefined when the frame is scaled later?
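
And if a consumer does want the coordinates on the original frame, my assumption is that it is
a simple rescale at the point of use, for example (sketch only, assuming the model input is a
plain scale of the frame, no cropping or padding):

#include "libavutil/frame.h"
#include "libavutil/mathematics.h"
#include "libavutil/dnn_bbox.h"

/* sketch only: map box coordinates from model-input pixels to frame pixels */
static void bbox_to_frame_coords(const AVDnnBoundingBox *bbox, const AVFrame *frame,
                                 int *top, int *left, int *bottom, int *right)
{
    *top    = av_rescale(bbox->top,    frame->height, bbox->model_input_height);
    *bottom = av_rescale(bbox->bottom, frame->height, bbox->model_input_height);
    *left   = av_rescale(bbox->left,   frame->width,  bbox->model_input_width);
    *right  = av_rescale(bbox->right,  frame->width,  bbox->model_input_width);
}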

> 
> > +
> > +    /**
> > +     * Distance in pixels from the top edge of the scaled image to top
> > +     * and bottom, and from the left edge of the scaled image to left and
> > +     * right, defining the bounding box.
> > +     */
> > +    int top;
> > +    int left;
> > +    int bottom;
> > +    int right;
> > +
> > +    /**
> > +     * Detect result
> > +     */
> > +    int detect_label;
> 
> How does a user interpret this label?  Is it from some known enum?

The mapping between label_id (int) and label_name (string) is not part of the model file;
it is usually a separate file provided together with the model, and the mapping is specific
to that model.

My idea was to provide the mapping file only when it is needed, for example
when drawing the box and text with the drawbox/drawtext filters to visualize the bounding box.
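
For reference, a rough sketch of how a filter could load such a mapping file when it is given
one (I assume a plain text file with one label name per line, line N being the name for label
id N; the real format is whatever ships with the model, nothing defined by this patch):

#include <stdio.h>
#include <string.h>

#define MAX_LABELS    100
#define MAX_LABEL_LEN 64

/* sketch only: read label names, returns the number of labels or -1 on error */
static int load_label_names(const char *path, char names[MAX_LABELS][MAX_LABEL_LEN])
{
    FILE *f = fopen(path, "r");
    int n = 0;

    if (!f)
        return -1;
    while (n < MAX_LABELS && fgets(names[n], MAX_LABEL_LEN, f)) {
        names[n][strcspn(names[n], "\r\n")] = '\0'; /* strip the trailing newline */
        n++;
    }
    fclose(f);
    return n; /* detect_label is then just an index into names[] */
}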

If we don't care about the size of the side data, we can add the label names into this struct,
for example:
#define BBOX_LABEL_NAME_MAX_LENGTH 32
int detect_label_id;
char detect_label_name[BBOX_LABEL_NAME_MAX_LENGTH + 1];
int classify_label_ids[AV_NUM_BBOX_CLASSIFY];
char classify_label_names[AV_NUM_BBOX_CLASSIFY][BBOX_LABEL_NAME_MAX_LENGTH + 1];

> 
> > +    AVRational detect_conf;
> 
> "conf"... idence?  A longer name and a descriptive comment might help.

Sure, will change them to detect_confidence and classify_confidences[].

> 
> > +
> > +    /**
> > +     * At most 4 classifications based on the detected bounding box.
> > +     * For example, we can get max 4 different attributes with 4 different
> > +     * DNN models on one bounding box.
> > +     * classify_count is zero if no classification.
> > +     */
> > +#define AV_NUM_BBOX_CLASSIFY 4
> > +    uint32_t classify_count;
> > +    int classify_labels[AV_NUM_BBOX_CLASSIFY];
> > +    AVRational classify_confs[AV_NUM_BBOX_CLASSIFY];
> 
> Same comment on these.

and, Happy Chinese New Year!

Thanks
Yejun

