[FFmpeg-devel] [PATCH V7 4/6] lavu: add side data AV_FRAME_DATA_BOUNDING_BOXES
Lynne
dev at lynne.ee
Thu Apr 8 14:35:26 EEST 2021
Apr 8, 2021, 07:35 by yejun.guo at intel.com:
>
>
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Lynne
>> Sent: April 8, 2021 12:14
>> To: FFmpeg development discussions and patches <ffmpeg-devel at ffmpeg.org>
>> Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data
>> AV_FRAME_DATA_BOUNDING_BOXES
>>
>> Apr 8, 2021, 04:48 by yejun.guo at intel.com:
>>
>> >> > +
>> >> > +#ifndef AVUTIL_BOUNDINGBOX_H
>> >> > +#define AVUTIL_BOUNDINGBOX_H
>> >> > +
>> >> > +#include "rational.h"
>> >> > +#include "avassert.h"
>> >> > +#include "frame.h"
>> >> > +
>> >> > +typedef struct AVBoundingBox {
>> >> > + /**
>> >> > + * Distance in pixels from the top edge of the frame to top
>> >> > + * and bottom, and from the left edge of the frame to left and
>> >> > + * right, defining the bounding box.
>> >> > + */
>> >> > + int top;
>> >> > + int left;
>> >> > + int bottom;
>> >> > + int right;
>> >> >
>> >>
>> >> Why not x, y, w, h?
>> >> It makes more sense; that's how coordinates usually go.
>> >>
>> >
>> > We want to keep it consistent with 'struct AVRegionOfInterest',
>> > we also have a plan to add a filter which converts from bounding box
>> > to RegionOfInterest which impacts the encoder.
>> >
>>
>> Not a good idea. Region of interest is only useful to indicate a
>> single region, while this API describes multiple regions
>> within the frame. The video_enc_params API follows
>> x, y, w, h because it's the easiest to work with, so I'm sure
>> it's well worth switching to those coordinates.
>>
>
> RegionOfInterest is similar to the bounding box: it is also used for
> multiple regions in an array, see AV_FRAME_DATA_REGIONS_OF_INTEREST.
>
> Anyway, I'm OK with changing to x,y,w,h.
>
> btw, do we need to change RegionOfInterest to x,y,w,h as well? Is such
> a change allowed now that the major version bump has happened?
>
>> >> > +
>> >> > +typedef struct AVBoundingBoxHeader {
>> >> > + /**
>> >> > + * Information about how the bounding box is generated.
>> >> > + * for example, the DNN model name.
>> >> > + */
>> >> > + char source[128];
>> >> > +
>> >> > + /**
>> >> > + * The size of frame when it is detected.
>> >> > + */
>> >> > + int frame_width;
>> >> > + int frame_height;
>> >> >
>> >>
>> >> Why? This side data is attached to AVFrames only, where we
>> >> already have width and height.
>> >>
>> >
>> > The detection result will be used by other filters, for example,
>> > dnn_classify (see https://github.com/guoyejun/ffmpeg/tree/classify).
>> >
>> > The filter dnn_detect detects all the objects (cat, dog, person, ...) in a
>> > frame, while dnn_classify classifies one detected object (for example, a
>> > person) for an attribute (for example, emotion).
>> >
>> > The filter dnn_classify has to check whether the frame size has changed
>> > since detection, to handle a filter chain like:
>> > dnn_detect -> scale -> dnn_classify
>> >
>>
>> This doesn't look good. Why does dnn_classify need to know
>> the original frame size at all?
>>
>
> For example, the original size of the frame is 100x100, and dnn_detect
> detects a face at (10, 10) -> (30, 40); these values are saved in
> AVBoundingBox.top/left/bottom/right.
>
> Then the frame is scaled to 50x50.
>
> Then dnn_classify is used to analyze the emotion of the face. It needs to
> know the frame size (100x100) at detection time; otherwise it cannot
> work with just (10,10), (30,40) and 50x50.
>
Why can't the scale filter rescale this side data as well?
>> >> > diff --git a/libavutil/frame.h b/libavutil/frame.h
>> >> > index a5ed91b20a..41e22de02a 100644
>> >> > --- a/libavutil/frame.h
>> >> > +++ b/libavutil/frame.h
>> >> > @@ -198,6 +198,13 @@ enum AVFrameSideDataType {
>> >> > * Must be present for every frame which should have film grain applied.
>> >> > */
>> >> > AV_FRAME_DATA_FILM_GRAIN_PARAMS,
>> >> > +
>> >> > + /**
>> >> > +  * Bounding boxes for object detection and classification,
>> >> > +  * the data is an AVBoundingBoxHeader followed by an array
>> >> > +  * of AVBoundingBox, and AVBoundingBoxHeader.nb_bboxes is
>> >> > +  * the number of array elements.
>> >> > +  */
>> >> > + AV_FRAME_DATA_BOUNDING_BOXES,
>> >> > };
>> >> >
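As a sketch of the layout that comment describes (assuming nb_bboxes lives in AVBoundingBoxHeader; struct bodies are abbreviated from the patch, not the final API):

```c
#include <stddef.h>
#include <stdint.h>

/* Abbreviated from the patch for illustration; not the final API. */
typedef struct AVBoundingBox { int top, left, bottom, right; } AVBoundingBox;
typedef struct AVBoundingBoxHeader {
    char     source[128];
    int      frame_width, frame_height;
    uint32_t nb_bboxes;
} AVBoundingBoxHeader;

/* Total side-data size: one header immediately followed by the array. */
static size_t bbox_side_data_size(uint32_t nb_bboxes)
{
    return sizeof(AVBoundingBoxHeader) + nb_bboxes * sizeof(AVBoundingBox);
}

/* The array starts right after the header in the same buffer. */
static AVBoundingBox *first_bbox(AVBoundingBoxHeader *header)
{
    return (AVBoundingBox *)(header + 1);
}
```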
>> >>
>> >> Finally, why call it a Bounding Box? It's not descriptive at all.
>> >> How about "Object Classification"? It makes much more sense; it's
>> >> exactly what this is. So AVObjectClassification,
>> >> AV_FRAME_DATA_OBJECT_CLASSIFICATION and so on.
>> >>
>> >
>> > In object detection papers, the term bounding box is usually used.
>> > We'd better use the same term, imho, thanks.
>> >
>>
>> Not in this case; API users won't have any idea what this is or what
>> it's for. This is user-facing code, after all.
>> Papers in a field can get away with overloading language, but we're
>> trying to make a concise API. Object classification makes sense
>> because that is exactly what this is.
>>
>
> The term bounding box dominates the domain; for example, even the
> HEVC spec uses it, see page 317 of
> https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-H.265-201911-I!!PDF-E&type=items
>
> Also copying some here for your convenience.
> ar_bounding_box_top[ ar_object_idx[ i ] ] u(16)
> ar_bounding_box_left[ ar_object_idx[ i ] ] u(16)
> ar_bounding_box_width[ ar_object_idx[ i ] ] u(16)
> ar_bounding_box_height[ ar_object_idx[ i ] ] u(16)
>
> I would prefer to use bounding box.
>
It's for an entirely different thing, and like I said, that's just overloaded
language they can get away with. We're trying to be generic.
This side data is for detecting _and_ classifying objects. In fact, most of
the structure is dedicated to classification. If you'd like users to actually
use this, give it a good name and don't leave them guessing what this
structure is by throwing in vague jargon some other paper or spec uses
because it's close enough.