[FFmpeg-devel] [PATCH V7 4/6] lavu: add side data AV_FRAME_DATA_BOUNDING_BOXES
Guo, Yejun
yejun.guo at intel.com
Sat Apr 10 12:44:35 EEST 2021
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Lynne
> Sent: April 9, 2021 23:16
> To: FFmpeg development discussions and patches <ffmpeg-devel at ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data
> AV_FRAME_DATA_BOUNDING_BOXES
>
> Apr 9, 2021, 16:35 by bygrandao at gmail.com:
>
> > On Fri, Apr 9, 2021 at 01:13, Guo, Yejun <yejun.guo at intel.com>
> > wrote:
> >
> >>
> >>
> >>
> >> > -----Original Message-----
> >> > From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Lynne
> >> > Sent: April 9, 2021 0:57
> >> > To: FFmpeg development discussions and patches <ffmpeg-devel at ffmpeg.org>
> >> > Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data
> >> > AV_FRAME_DATA_BOUNDING_BOXES
> >> >
> >>
> >> First of all, thanks for the quick replies. I see that all the
> >> discussions/comments are meant to make this patch better, thank you.
> >>
> >> > >> >
> >> > >> >> >> > +
> >> > >> >> >> > +typedef struct AVBoundingBoxHeader {
> >> > >> >> >> > + /**
> >> > >> >> >> > + * Information about how the bounding box is generated.
> >> > >> >> >> > + * for example, the DNN model name.
> >> > >> >> >> > + */
> >> > >> >> >> > + char source[128];
> >> > >> >> >> > +
> >> > >> >> >> > + /**
> >> > >> >> >> > + * The size of the frame when it was detected.
> >> > >> >> >> > + */
> >> > >> >> >> > + int frame_width;
> >> > >> >> >> > + int frame_height;
> >> > >> >> >> >
> >> > >> >> >>
> >> > >> >> >> Why? This side data is attached to AVFrames only, where we
> >> > >> >> >> already have width and height.
> >> > >> >> >>
> >> > >> >> >
> >> > >> >> > The detection result will be used by other filters, for example,
> >> > >> >> > dnn_classify (see https://github.com/guoyejun/ffmpeg/tree/classify).
> >> > >> >> >
> >> > >> >> > The filter dnn_detect detects all the objects (cat, dog, person ...) in a
> >> > >> >> > frame, while dnn_classify classifies one detected object (for example,
> >> > >> >> > person) for its attribute (for example, emotion, etc.)
> >> > >> >> >
> >> > >> >> > The filter dnn_classify has to check if the frame size has changed after
> >> > >> >> > the detection, to handle the below filter chain:
> >> > >> >> > dnn_detect -> scale -> dnn_classify
> >> > >> >> >
> >> > >> >>
> >> > >> >> This doesn't look good. Why does dnn_classify need to know
> >> > >> >> the original frame size at all?
> >> > >> >>
> >> > >> >
> >> > >> > For example, the original size of the frame is 100*100, and dnn_detect
> >> > >> > detects a face at (10, 10) -> (30, 40); such data will be saved in
> >> > >> > AVBoundingBox.top/left/right/bottom.
> >> > >> >
> >> > >> > Then, the frame is scaled to 50*50.
> >> > >> >
> >> > >> > Then, dnn_classify is used to analyze the emotion of the face; it needs to
> >> > >> > know the frame size (100*100) at detection time, otherwise it does not
> >> > >> > work with just (10,10), (30,40) and 50*50.
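> >> > >> >
> >> > >> > To make the arithmetic concrete, a hypothetical helper for this case
> >> > >> > (the field names follow AVBoundingBox above; integer coordinates are assumed):
> >> > >> >
> >> > >> >     /* (10,10)->(30,40) detected on 100x100 maps to (5,5)->(15,20) on 50x50 */
> >> > >> >     static void rescale_bbox(AVBoundingBox *box,
> >> > >> >                              int detected_w, int detected_h,
> >> > >> >                              int cur_w, int cur_h)
> >> > >> >     {
> >> > >> >         box->left   = box->left   * cur_w / detected_w;
> >> > >> >         box->right  = box->right  * cur_w / detected_w;
> >> > >> >         box->top    = box->top    * cur_h / detected_h;
> >> > >> >         box->bottom = box->bottom * cur_h / detected_h;
> >> > >> >     }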
> >> > >> >
> >> > >>
> >> > >> >> Why can't the scale filter rescale this side data as well?
> >> > >>
> >> > >
> >> > > I'm afraid we cannot make sure that all such filters (including filters in
> >> > > the future) do the rescaling. And in the previous discussion, I got to know that
> >> > > 'many other existing side-data types are invalidated by scaling'.
> >> > >
> >> > > So, we need frame_width and frame_height here.
> >> > >
> >> >
> >> > No, you don't. You just need to make sure filters which change resolution
> >> > or do cropping also change the side data parameters.
> >> > It's called maintainership. As-is, this won't even work with cropping,
> >> > only with basic aspect ratio preserving scaling.
> >> > For the lack of a better term, this is a hack.
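> >> >
> >> > For example, a crop filter would have to offset and clamp the coordinates,
> >> > roughly like this (a hypothetical sketch, not part of this patch; crop_x/crop_y
> >> > are the left/top offsets of the cropped area, out_w/out_h its size, and
> >> > av_clip() is from libavutil/common.h):
> >> >
> >> >     box->left   = av_clip(box->left   - crop_x, 0, out_w);
> >> >     box->right  = av_clip(box->right  - crop_x, 0, out_w);
> >> >     box->top    = av_clip(box->top    - crop_y, 0, out_h);
> >> >     box->bottom = av_clip(box->bottom - crop_y, 0, out_h);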
> >>
> >> As discussed in a previous email, for the frame size change case, dnn_classify
> >> (and other filters which use the detection result, for example drawbox) can
> >> just output a warning message to tell the user what happened, and skip the
> >> classification; otherwise, it will give a wrong/weird result which confuses
> >> the user.
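> >>
> >> A minimal sketch of what that check could look like inside dnn_classify,
> >> assuming the AVBoundingBoxHeader fields from this patch (the real filter
> >> code may differ):
> >>
> >>     if (header->frame_width  != frame->width ||
> >>         header->frame_height != frame->height) {
> >>         av_log(ctx, AV_LOG_WARNING,
> >>                "frame size changed since detection (%dx%d -> %dx%d), "
> >>                "not classifying\n",
> >>                header->frame_width, header->frame_height,
> >>                frame->width, frame->height);
> >>         /* pass the frame through without classifying */
> >>     }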
> >>
> >> >
> >> > I would accept just specifying that if the frame dimensions are
> >> > altered in any way, the side-data is no longer valid and it's up
> >> > to users to figure that out by out of bound coordinates.
> >> > This is what we currently do with video_enc_params.
> >>
> >> frame_width/frame_height is not perfect (for cases such as: scale down
> >> + crop + scale up to the same size), but it provides more info than the
> >> checking of 'out of bound coordinates'. There are many other possible issues
> >> when the coordinates are within the frame.
> >>
> >> If we think we'd better not let the user get more info from the warning
> >> message, I'm ok with removing them.
> >>
> >> I'll remove them if there's another comment supporting the removal and
> >> there's no objection.
> >>
> >> >
> >> >
> >> > >> >> >> > diff --git a/libavutil/frame.h b/libavutil/frame.h
> >> > >> >> >> > index a5ed91b20a..41e22de02a 100644
> >> > >> >> >> > --- a/libavutil/frame.h
> >> > >> >> >> > +++ b/libavutil/frame.h
> >> > >> >> >> > @@ -198,6 +198,13 @@ enum AVFrameSideDataType {
> >> > >> >> >> >      * Must be present for every frame which should have film grain applied.
> >> > >> >> >> >      */
> >> > >> >> >> >     AV_FRAME_DATA_FILM_GRAIN_PARAMS,
> >> > >> >> >> > +
> >> > >> >> >> > +    /**
> >> > >> >> >> > +     * Bounding boxes for object detection and classification, the data is an
> >> > >> >> >> > +     * AVBoundingBoxHeader followed by an array of AVBoundingBox, and
> >> > >> >> >> > +     * AVBoundingBoxHeader.nb_bboxes is the number of array elements.
> >> > >> >> >> > +     */
> >> > >> >> >> > +    AV_FRAME_DATA_BOUNDING_BOXES,
> >> > >> >> >> > };
> >> > >> >> >> >
> >> > >> >> >>
> >> > >> >> >> Finally, why call it a Bounding Box? It's not descriptive at all.
> >> > >> >> >> How about "Object Classification"? It makes much more sense, it's
> >> > >> >> >> exactly what this is. So AVObjectClassification,
> >> > >> >> >> AV_FRAME_DATA_OBJECT_CLASSIFICATION and so on.
> >> > >> >> >>
> >> > >> >> >
> >> > >> >> > In object detection papers, bounding box is usually used.
> >> > >> >> > We'd better use the same term, imho, thanks.
> >> > >> >> >
> >> > >> >>
> >> > >> >> Not in this case, API users won't have any idea what this is or what
> >> > >> >> it's for. This is user-facing code after all.
> >> > >> >> Papers in fields can get away with overloading language, but we're
> >> > >> >> trying to make a concise API. Object classification makes sense
> >> > >> >> because this is exactly what this is.
> >> > >> >>
> >> > >> >
> >> > >> > The term bounding box dominates the domain; for example, even the
> >> > >> > HEVC spec uses this term, see page 317 of
> >> > >> > https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-H.265-201911-I!!PDF-E&type=items
> >> > >> >
> >> > >> > also copying some here for your convenience.
> >> > >> > ar_bounding_box_top[ ar_object_idx[ i ] ] u(16)
> >> > >> > ar_bounding_box_left[ ar_object_idx[ i ] ] u(16)
> >> > >> > ar_bounding_box_width[ ar_object_idx[ i ] ] u(16)
> >> > >> > ar_bounding_box_height[ ar_object_idx[ i ] ] u(16)
> >> > >> >
> >> > >> > I would prefer to use bounding box.
> >> > >> >
> >> > >>
> >> > >> It's for an entirely different thing, and like I said, it's just overloaded
> >> > >> language because they can get away with it. We're trying to be generic.
> >> > >> This side data is for detecting _and_ classifying objects. In fact, most of
> >> > >> the structure is dedicated to classifying. If you'd like users to actually
> >> > >> use this, give it a good name and don't leave them guessing what this
> >> > >> structure is by throwing in vague jargon some other paper or spec has
> >> > >> because it's close enough.
> >> > >>
> >> > >
> >> > > All the people in the domain accept bounding box; they can understand this
> >> > > struct name easily and clearly. They might be bothered if we use another
> >> > > name.
> >> > >
> >> > > btw, AVObjectClassification confuses people who just want object detection.
> >> > >
> >> >
> >> > As I said, the name "bounding box" makes no sense once it gets overloaded
> >> > with object classification.
> >>
> >> dnn_detect creates an array of 'bounding box' for all detected objects, and
> >> dnn_classify assigns attributes to a set of bounding boxes (with the same object
> >> id). 'bounding box' serves both detection and classification properly.
> >>
> >>
> >> > Object classification is still the main use of the filters,
> >> > because the original proposal was to have all of this info be ffmpeg-private,
> >> > which would forbid simple object detection.
> >>
> >> The original proposal was to add it as side data which is ffmpeg-public, and then
> >> we spent much time discussing/trying ffmpeg-private as a temporary
> >> method, and since it is not good to be temporary, we now switch back to
> >> ffmpeg-public.
> >>
> >> During the whole period, we never had any intention to
> >> 'forbid simple object detection'; I don't quite understand your point here.
> >>
> >>
> >> > So I still maintain this should be called "Object classification". I'd accept
> >> > "Object detection" as well, but definitely not "bounding box".
> >>
> >> imho, 'Object detection' and 'Object classification' are worse; they just
> >> describe one aspect of the struct. Users might just use the filter dnn_detect,
> >> or they might use the filters dnn_detect + dnn_classify.
> >>
> >>
> >> >
> >> > Since the decision was made to make the side data public, we have to make
> >> > very sure it contains no hacks and is not impossible to extend, since we don't
> >> > want to have an
> >> > "AV_SIDE_DATA_OBJECT_CLASSIFICATION_VERSION_2_SORRY_WE_SCREWED_UP"
> >> > faster than you can say "Recursive cascade correlation artificial neural
> >> > networks".
> >>
> >> Sorry, I don't quite understand your point here.
> >>
> >> 'bounding box' is designed as a general-purpose container for the info for
> >> detection/classification. It doesn't matter which DNN model is used; it doesn't
> >> matter if a traditional (non-DNN) algorithm is used.
> >>
> >> I'm open to using a better name, and bounding box is the best one for me so far.
> >> Everyone in the domain knows the exact meaning of bounding box without
> >> extra explanation. This word has been extended/evolved with such a meaning in
> >> this domain.
> >>
> > +1
> >
> > I think it is wise to use the name which is widely used in the field.
> >
>
> Way to go, ignoring half the exchange to voice an opinion after we
> came to an agreement.
Some people take days for thorough thinking and then give comments.
That's nice and welcome. I have even received some comments on the last day
before pushing.

I might be too busy with preparing the next version of the patch (we have some
video analytics feature code/plans depending on this one, and hope to
move things forward), so I need to wait several more days to get
more comments.

My point is that bounding box is a good (maybe the best) name if we have an
object detection/classification background; it might not be the best name
without such background.

Anyway, I'm open to using a better name.
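
For reference, a minimal sketch of how an API user could read this side data,
assuming the layout described in the patch (an AVBoundingBoxHeader immediately
followed by nb_bboxes AVBoundingBox entries); process_box() is just a placeholder:

    AVFrameSideData *sd = av_frame_get_side_data(frame, AV_FRAME_DATA_BOUNDING_BOXES);
    if (sd) {
        AVBoundingBoxHeader *header = (AVBoundingBoxHeader *)sd->data;
        AVBoundingBox *bbox = (AVBoundingBox *)(header + 1);
        for (int i = 0; i < header->nb_bboxes; i++)
            process_box(&bbox[i]); /* placeholder for per-box handling */
    }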