[FFmpeg-devel] [PATCH V7 4/6] lavu: add side data AV_FRAME_DATA_BOUNDING_BOXES

Lynne dev at lynne.ee
Thu Apr 8 19:57:00 EEST 2021


Apr 8, 2021, 16:51 by yejun.guo at intel.com:

>
>
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Lynne
>> Sent: April 8, 2021 19:35
>> To: FFmpeg development discussions and patches <ffmpeg-devel at ffmpeg.org>
>> Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data
>> AV_FRAME_DATA_BOUNDING_BOXES
>>
>> Apr 8, 2021, 07:35 by yejun.guo at intel.com:
>>
>> >
>> >
>> >> -----Original Message-----
>> >> From: ffmpeg-devel <ffmpeg-devel-bounces at ffmpeg.org> On Behalf Of Lynne
>> >> Sent: April 8, 2021 12:14
>> >> To: FFmpeg development discussions and patches <ffmpeg-devel at ffmpeg.org>
>> >> Subject: Re: [FFmpeg-devel] [PATCH V7 4/6] lavu: add side data
>> >> AV_FRAME_DATA_BOUNDING_BOXES
>> >>
>> >> Apr 8, 2021, 04:48 by yejun.guo at intel.com:
>> >>
>> >> >> > +
>> >> >> > +#ifndef AVUTIL_BOUNDINGBOX_H
>> >> >> > +#define AVUTIL_BOUNDINGBOX_H
>> >> >> > +
>> >> >> > +#include "rational.h"
>> >> >> > +#include "avassert.h"
>> >> >> > +#include "frame.h"
>> >> >> > +
>> >> >> > +typedef struct AVBoundingBox {
>> >> >> > +    /**
>> >> >> > +     * Distances in pixels from the top edge of the frame to the
>> >> >> > +     * box's top and bottom edges, and from the left edge of the
>> >> >> > +     * frame to its left and right edges, defining the bounding box.
>> >> >> > +     */
>> >> >> > +    int top;
>> >> >> > +    int left;
>> >> >> > +    int bottom;
>> >> >> > +    int right;
>> >> >> >
>> >> >>
>> >> >> Why not x, y, w, h?
>> >> >> It makes more sense; coordinates usually go like this.
>> >> >>
>> >> >
>> >> > We want to keep it consistent with 'struct AVRegionOfInterest';
>> >> > we also plan to add a filter which converts bounding boxes to
>> >> > regions of interest, which in turn influences the encoder.
>> >> >
>> >>
>> >> Not a good idea. Region of interest is only useful to indicate a
>> >> single region, while this API describes multiple regions
>> >> within the frame. The video_enc_params API follows
>> >> x, y, w, h because it's the easiest to work with, so I'm sure
>> >> it's well worth switching to such coordinates.
>> >>
>> >
>> > RegionOfInterest is similar to the bounding box: it is also used for
>> > multiple regions in an array, see AV_FRAME_DATA_REGIONS_OF_INTEREST.
>> >
>> > anyway, I'm OK with changing to x,y,w,h.
>> >
>> > btw, do we need to change RegionOfInterest to x,y,w,h as well? Is such
>> > a change allowed after the current major version bump?
>>
>
> any comments on this RegionOfInterest question? thanks.
>
>> >
>> >> >> > +
>> >> >> > +typedef struct AVBoundingBoxHeader {
>> >> >> > +    /**
>> >> >> > +     * Information about how the bounding box is generated,
>> >> >> > +     * for example the DNN model name.
>> >> >> > +     */
>> >> >> > +    char source[128];
>> >> >> > +
>> >> >> > +    /**
>> >> >> > +     * The size of the frame at the time the objects were detected.
>> >> >> > +     */
>> >> >> > +    int frame_width;
>> >> >> > +    int frame_height;
>> >> >> >
>> >> >>
>> >> >> Why? This side data is attached to AVFrames only, where we
>> >> >> already have width and height.
>> >> >>
>> >> >
>> >> > The detection result will be used by other filters, for example,
>> >> > dnn_classify (see https://github.com/guoyejun/ffmpeg/tree/classify).
>> >> >
>> >> > The filter dnn_detect detects all the objects (cat, dog, person, ...)
>> >> > in a frame, while dnn_classify classifies one detected object (for
>> >> > example, a person) for an attribute (for example, emotion).
>> >> >
>> >> > The filter dnn_classify has to check whether the frame size changed
>> >> > after detection, to handle a filter chain like:
>> >> > dnn_detect -> scale -> dnn_classify
>> >> >
>> >>
>> >> This doesn't look good. Why does dnn_classify need to know
>> >> the original frame size at all?
>> >>
>> >
>> > For example, the original size of the frame is 100*100, and dnn_detect
>> > detects a face at (10, 10) -> (30, 40); this data is saved in
>> > AVBoundingBox.top/left/right/bottom.
>> >
>> > Then the frame is scaled to 50*50.
>> >
>> > Then dnn_classify is used to analyze the emotion of the face. It needs
>> > to know the frame size (100*100) at detection time; otherwise (10,10)
>> > and (30,40) cannot be interpreted against the 50*50 frame.
>> >
>>
>> Why can't the scale filter rescale this side data as well?
>>
>
> I'm afraid we cannot make sure all such filters (including future
> filters) do the rescale. And in a previous discussion, I learned that
> 'many other existing side-data types are invalidated by scaling'.
>
> So, we need frame_width and frame_height here.
>

No, you don't. You just need to make sure filters which change resolution
or do cropping also change the side data parameters.
It's called maintainership. As-is, this won't even work with cropping,
only with basic aspect-ratio-preserving scaling.
For lack of a better term, this is a hack.
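To spell out what that maintenance looks like, here is a rough sketch
using the box fields proposed in this patch (a sketch only, not a
drop-in implementation):

/* The box fields from this patch. */
typedef struct AVBoundingBox {
    int top, left, bottom, right;
} AVBoundingBox;

/* What a scaling filter would have to do: map each box from the input
 * dimensions to the output dimensions. */
static void rescale_bbox(AVBoundingBox *box,
                         int src_w, int src_h, int dst_w, int dst_h)
{
    box->top    = box->top    * dst_h / src_h;
    box->bottom = box->bottom * dst_h / src_h;
    box->left   = box->left   * dst_w / src_w;
    box->right  = box->right  * dst_w / src_w;
}

/* What a cropping filter would additionally have to do: shift by the
 * crop origin and drop boxes that leave the visible area. A stored
 * frame_width/frame_height pair cannot recover this mapping. */
static int crop_bbox(AVBoundingBox *box, int x, int y, int w, int h)
{
    box->top    -= y;
    box->bottom -= y;
    box->left   -= x;
    box->right  -= x;
    /* 0 means the box no longer intersects the cropped frame */
    return box->right > 0 && box->bottom > 0 &&
           box->left < w && box->top < h;
}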

I would accept just specifying that if the frame dimensions are
altered in any way, the side-data is no longer valid and it's up
to users to figure that out from out-of-bounds coordinates.
This is what we currently do with video_enc_params.
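A minimal sketch of that convention on the consumer side, assuming the
layout described in the patch comment (an AVBoundingBoxHeader
immediately followed in memory by the array) and an unsigned nb_bboxes:

#include <stdint.h>
#include "libavutil/frame.h"

static int bboxes_still_valid(const AVFrame *frame,
                              const AVBoundingBoxHeader *header)
{
    const AVBoundingBox *boxes = (const AVBoundingBox *)(header + 1);
    for (uint32_t i = 0; i < header->nb_bboxes; i++) {
        if (boxes[i].left < 0 || boxes[i].right  > frame->width ||
            boxes[i].top  < 0 || boxes[i].bottom > frame->height)
            return 0; /* frame was scaled/cropped since detection */
    }
    return 1;
}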


>> >> >> > diff --git a/libavutil/frame.h b/libavutil/frame.h
>> >> >> > index a5ed91b20a..41e22de02a 100644
>> >> >> > --- a/libavutil/frame.h
>> >> >> > +++ b/libavutil/frame.h
>> >> >> > @@ -198,6 +198,13 @@ enum AVFrameSideDataType {
>> >> >> >  * Must be present for every frame which should have film grain applied.
>> >> >> >  */
>> >> >> >  AV_FRAME_DATA_FILM_GRAIN_PARAMS,
>> >> >> > +
>> >> >> > +    /**
>> >> >> > +     * Bounding boxes for object detection and classification.
>> >> >> > +     * The data is an AVBoundingBoxHeader followed by an array of
>> >> >> > +     * AVBoundingBox; AVBoundingBoxHeader.nb_bboxes is the number
>> >> >> > +     * of array elements.
>> >> >> > +     */
>> >> >> > +    AV_FRAME_DATA_BOUNDING_BOXES,
>> >> >> >  };
>> >> >> >
>> >> >>
>> >> >> Finally, why call it a Bounding Box? It's not descriptive at all.
>> >> >> How about "Object Classification"? It makes much more sense; it's
>> >> >> exactly what this is. So AVObjectClassification,
>> >> >> AV_FRAME_DATA_OBJECT_CLASSIFICATION and so on.
>> >> >>
>> >> >
>> >> > In object detection papers, bounding box is usually used.
>> >> > We'd better use the same term, imho, thanks.
>> >> >
>> >>
>> >> Not in this case: API users won't have any idea what this is or what
>> >> it's for. This is user-facing code, after all.
>> >> Papers in a field can get away with overloading language, but we're
>> >> trying to make a concise API. Object classification makes sense
>> >> because that is exactly what this is.
>> >>
>> >
>> > The term "bounding box" dominates the domain; for example, even the
>> > HEVC spec uses it, see page 317 of
>> > https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-H.265-201911-I!!PDF-E&type=items
>> >
>> > also copying some here for your convenience.
>> > ar_bounding_box_top[ ar_object_idx[ i ] ] u(16)
>> > ar_bounding_box_left[ ar_object_idx[ i ] ] u(16)
>> > ar_bounding_box_width[ ar_object_idx[ i ] ] u(16)
>> > ar_bounding_box_height[ ar_object_idx[ i ] ] u(16)
>> >
>> > I would prefer to use bounding box.
>> >
>>
>> It's for an entirely different thing, and like I said, it's just overloaded
>> language because they can get away with it. We're trying to be generic.
>> This side data is for detecting _and_ classifying objects. In fact, most of
>> the structure is dedicated to classifying. If you'd like users to actually
>> use this, give it a good name and don't leave them guessing what this
>> structure is by throwing at them vague jargon some other paper or spec
>> uses because it's close enough.
>>
>
> everyone in the domain accepts "bounding box"; they can understand this
> struct name easily and clearly, and might be bothered if we use another name.
>
> btw, AVObjectClassification confuses people who just want object detection.
>

As I said, the name "bounding box" makes no sense once it gets overloaded
with object classification. Object classification is still the main use of the filters,
because the original proposal was to have all of this info be ffmpeg-private,
which would forbid simple object detection.
So I still maintain this should be called "Object classification". I'd accept
"Object detection" as well, but definitely not "bounding box".

Since the decision was made to make the side data public, we have to make
very sure it contains no hacks and stays extensible, since we don't want
to end up with an "AV_SIDE_DATA_OBJECT_CLASSIFICATION_VERSION_2_SORRY_WE_SCREWED_UP"
faster than you can say "recursive cascade correlation artificial neural networks".
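To sketch what "extensible" could mean here (names hypothetical; the
pattern is the one video_enc_params already uses with nb_blocks,
blocks_offset and block_size in AVVideoEncParams):

#include <stddef.h>
#include <stdint.h>

/* The header stores the element size and the offset of the array, so
 * fields can be appended to the per-object struct in a later version
 * without a new side-data type: old readers keep using the stored
 * stride instead of sizeof(). */
typedef struct AVObjectClassificationHeader {
    uint32_t nb_objects;      /* number of entries in the array */
    size_t   objects_offset;  /* byte offset of the first entry */
    size_t   object_size;     /* size of one entry at write time */
} AVObjectClassificationHeader;

static inline void *
av_object_classification_get(AVObjectClassificationHeader *h, unsigned idx)
{
    return (uint8_t *)h + h->objects_offset + idx * h->object_size;
}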

