CN116958652A - Scene graph generation method based on diffusion model - Google Patents

Scene graph generation method based on diffusion model

Info

Publication number
CN116958652A
Authority
CN
China
Prior art keywords
relation
entity
noise
scene graph
candidate box
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310761058.7A
Other languages
Chinese (zh)
Inventor
袁晓洁 (Yuan Xiaojie)
李伟 (Li Wei)
张海威 (Zhang Haiwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University
Priority to CN202310761058.7A
Publication of CN116958652A
Legal status: Pending (current)


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of computer vision processing, and provides a scene graph generation method based on a diffusion model. The method comprises the following steps: acquiring training data containing annotation information, obtaining entity candidate boxes and relation candidate boxes, and adding noise to them to obtain noisy entity candidate boxes and noisy relation candidate boxes; extracting features of the image to be processed through the noisy entity candidate boxes and noisy relation candidate boxes to obtain entity features and relation features; constructing a deep learning network based on the entity features and relation features and learning the reverse diffusion process of entity detection and relation detection to obtain a diffusion model; obtaining entity position boxes and relation position boxes of the image to be processed through the diffusion model, calculating their intersection-over-union (IoU), and matching by highest IoU to obtain relation triplets; and generating a scene graph based on the relation triplets and a graph structure constraint. The method uses a diffusion model to accomplish a flexible and extensible end-to-end scene graph generation task.

Description

Scene graph generation method based on diffusion model
Technical Field
The invention relates to the technical field of computer vision processing, in particular to a scene graph generation method based on a diffusion model.
Background
With the progress of the internet age, a large amount of data, including a large volume of image data, has been generated and recorded in human production and daily life. For the same data volume, an image can convey far more information than plain text. A single picture may contain dozens of target entities and a large number of relationships between them; in visual understanding tasks these relationships can be modeled as "subject-predicate-object" triplets. Based on such relation triplets, the image scene can be further organized into a graph structure, i.e., a scene graph, in which the nodes represent target instances in the image and the edges represent the relationships between pairs of objects. The scene graph generation task takes a picture as input and produces its scene graph, providing structured semantic understanding of the image; compared with other visual understanding tasks it exhibits stronger capability in visual reasoning, and it has broad application prospects in fields such as image retrieval, visual question answering, and image generation and editing.
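As a concrete illustration (not the patent's data format), the minimal sketch below stores a scene graph as entity nodes plus subject-predicate-object triplets.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Entity:
    label: str                                    # e.g. "person"
    box: Tuple[float, float, float, float]        # (x1, y1, x2, y2) in image coordinates

@dataclass
class SceneGraph:
    entities: List[Entity]                        # nodes: target instances in the image
    triples: List[Tuple[int, str, int]]           # edges: (subject_idx, predicate, object_idx)

sg = SceneGraph(
    entities=[Entity("person", (10, 20, 110, 300)), Entity("horse", (80, 120, 400, 380))],
    triples=[(0, "riding", 1)],                   # person -riding-> horse
)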
In addition, diffusion models currently have numerous applications in image semantic understanding and have achieved good results in fields such as image segmentation and image object detection, but research on and application of diffusion models in scene graph generation are still lacking. The biggest challenge is that the traditional relation-modeling architecture is difficult to adapt to the probabilistic sampling and step-by-step denoising optimization mechanism of a diffusion model. The flexibility, extensibility, and ease of training of diffusion models give them great application prospects in scene graph generation.
Identifying relationships between target entities in an image is a very important part of scene graph generation, because relation recognition is central to a deep understanding of image semantics. Most existing scene graph generation work can be broadly divided into two categories of relation modeling: modeling based on graph structures and modeling based on pairwise queries. Graph-structure-based relation modeling mainly consists of two stages: first, an existing pre-trained object detection model performs target entity detection on the input image to obtain a set of entity candidates and their corresponding entity features; then the entity candidates are taken as the node set of a directed acyclic graph, and edges between every pair of entities are taken as candidate relations, so that a graph structure is initially constructed to model the image semantics. Entity and relation context is modeled on top of this graph structure, and entity and relation categories are then classified to predict the relation triplets.
However, such graph-structure-based relation modeling depends heavily on the performance of the object detector, and graph-based contextual feature learning introduces contextual noise and significant time complexity. Recently, research has proposed single-stage scene graph generation, which mainly treats relation modeling as a sparse triplet query task and trains end to end, thereby reducing time cost and the dependence on the object detector; however, existing single-stage approaches suffer from poor performance due to the lack of explicit modeling of target entities, and from limited flexibility due to their highly coupled model structures.
Disclosure of Invention
The present invention is directed to solving at least one of the technical problems existing in the related art. For this purpose, the invention provides a scene graph generation method based on a diffusion model.
The invention provides a scene graph generation method based on a diffusion model, which comprises the following steps:
S100: acquiring training data containing annotation information, obtaining entity candidate boxes and relation candidate boxes from the training data, and adding noise to the entity candidate boxes and relation candidate boxes to obtain noisy entity candidate boxes and noisy relation candidate boxes;
S200: extracting features of the image to be processed through the noisy entity candidate boxes and the noisy relation candidate boxes to obtain entity features and relation features;
S300: constructing a deep learning network based on the entity features and the relation features, and obtaining a diffusion model by having the deep learning network learn the reverse diffusion process of entity detection and relation detection;
S400: obtaining entity position boxes and relation position boxes of the image to be processed through the diffusion model, calculating the intersection-over-union (IoU) of the entity position boxes and the relation position boxes, and matching by highest IoU to obtain relation triplets;
S500: generating a scene graph based on the relation triplets and a graph structure constraint.
According to the scene graph generation method based on the diffusion model provided by the invention, the noise is random noise following a Gaussian distribution.
According to the scene graph generation method based on the diffusion model provided by the invention, the step S100 comprises the following steps:
S110: introducing a scene graph generation dataset, and selecting training data from the scene graph generation dataset;
S120: extracting the entity candidate boxes and relation candidate boxes in the training data;
S130: converting the entity candidate boxes and relation candidate boxes from the top-left/bottom-right corner-coordinate representation space to the center-coordinate-and-size representation space, obtaining entity candidate boxes and relation candidate boxes in the center-coordinate-and-size space;
S140: adding noise to the entity candidate boxes and relation candidate boxes in the center-coordinate-and-size space to obtain preliminary noisy entity candidate boxes and preliminary noisy relation candidate boxes;
S150: converting the preliminary noisy entity candidate boxes and preliminary noisy relation candidate boxes from the center-coordinate-and-size representation space back to the top-left/bottom-right corner-coordinate representation space to obtain the noisy entity candidate boxes and noisy relation candidate boxes.
According to the scene graph generation method based on the diffusion model provided by the invention, the noise adding process in step S140 is expressed as follows:

b^e_t = q(b^e_0, ε_t),  b^r_t = q(b^r_0, ε_t)

where b^e_t is the noisy entity candidate box, b^r_t is the noisy relation candidate box, b^e_0 is the ground-truth entity candidate box annotation, b^r_0 is the ground-truth relation candidate box annotation, q(·) is the diffusion (noising) process, and ε_t is the noise added at time step t.
According to the scene graph generating method based on the diffusion model provided by the invention, the step S300 comprises the following steps:
S310: embedding the sampling time step into the entity features and the relation features, and correcting the noisy entity candidate boxes and noisy relation candidate boxes according to the entity features and relation features after the sampling time step is embedded;
S320: predicting, with a classification network, on the entity features and relation features after the sampling time step is embedded, to obtain predicted entity categories and predicted relation categories;
S330: constructing the deep learning network based on the entity features, the relation features, the predicted entity categories and the predicted relation categories;
S340: assigning ground-truth labels to the predicted entity categories and predicted relation categories through optimal transport to optimize the deep learning network;
S350: having the optimized deep learning network learn the reverse diffusion process, and obtaining the diffusion model through training.
According to the scene graph generating method based on the diffusion model provided by the invention, step S340 comprises the following steps:
S341: predicting on a training image with the deep learning network to obtain prediction results;
S342: calculating a cross-entropy loss function, a mean absolute error loss function and a generalized IoU loss function on the prediction results via an optimal transport assignment method;
S343: iterating the deep learning network for a number of rounds based on the cross-entropy loss, mean absolute error loss and generalized IoU loss to obtain the optimized deep learning network.
According to the scene graph generation method based on the diffusion model provided by the invention, step S500 further comprises:
S510: screening valid relation triplets from the relation triplets, which are used together with the graph structure constraint to generate the scene graph.
According to the scene graph generating method based on the diffusion model provided by the invention, the step S510 comprises the following steps:
S511: for a relation triplet in which a single relation predicate exists between two entities, marking the relation triplet as a valid relation triplet;
S512: for relation triplets in which multiple relation predicates exist between two entities, calculating the prediction probability of each relation predicate, and marking the relation predicate with the highest prediction probability for the two entities as the valid relation triplet;
S513: constructing a directed acyclic graph structure based on the valid relation triplets, and generating the scene graph.
According to the scene graph generation method based on the diffusion model provided by the invention, entity features can be explicitly modeled well, yielding a better scene graph generation effect; by exploiting the step-by-step sampling and iterative optimization characteristics of the diffusion model, flexible and extensible scene graph generation can be effectively realized, so that the end-to-end scene graph generation task can be better accomplished.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is apparent that the drawings described below illustrate some embodiments of the invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a flowchart of a scene graph generating method based on a diffusion model according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.
In the description of the embodiments of the present invention, it should be noted that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the embodiments of the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the embodiments of the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In describing embodiments of the present invention, it should be noted that, unless explicitly stated and limited otherwise, the terms "coupled" and "connected" should be construed broadly; for example, a connection may be a fixed connection, a removable connection, or an integral connection; it may be mechanical or electrical; and it may be direct, or indirect through an intermediate medium. The specific meaning of the above terms in embodiments of the present invention will be understood by those of ordinary skill in the art according to the specific circumstances.
In embodiments of the invention, unless expressly specified and limited otherwise, a first feature "up" or "down" on a second feature may be that the first and second features are in direct contact, or that the first and second features are in indirect contact via an intervening medium. Moreover, a first feature being "above," "over" and "on" a second feature may be a first feature being directly above or obliquely above the second feature, or simply indicating that the first feature is level higher than the second feature. The first feature being "under", "below" and "beneath" the second feature may be the first feature being directly under or obliquely below the second feature, or simply indicating that the first feature is less level than the second feature.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the embodiments of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
An embodiment provided by the present invention is described below with reference to fig. 1.
The invention provides a scene graph generation method based on a diffusion model, which comprises the following steps:
S100: acquiring training data containing annotation information, obtaining entity candidate boxes and relation candidate boxes from the training data, and adding noise to the entity candidate boxes and relation candidate boxes to obtain noisy entity candidate boxes and noisy relation candidate boxes;
wherein the noise is random noise following a Gaussian distribution.
Wherein, step S100 includes:
S110: introducing a scene graph generation dataset, and selecting training data from the scene graph generation dataset;
S120: extracting the entity candidate boxes and relation candidate boxes in the training data;
S130: converting the entity candidate boxes and relation candidate boxes from the top-left/bottom-right corner-coordinate representation space to the center-coordinate-and-size representation space, obtaining entity candidate boxes and relation candidate boxes in the center-coordinate-and-size space;
S140: adding noise to the entity candidate boxes and relation candidate boxes in the center-coordinate-and-size space to obtain preliminary noisy entity candidate boxes and preliminary noisy relation candidate boxes;
S150: converting the preliminary noisy entity candidate boxes and preliminary noisy relation candidate boxes from the center-coordinate-and-size representation space back to the top-left/bottom-right corner-coordinate representation space to obtain the noisy entity candidate boxes and noisy relation candidate boxes.
The noise adding process in step S140 is represented as:

b^e_t = q(b^e_0, ε_t),  b^r_t = q(b^r_0, ε_t)

where b^e_t is the noisy entity candidate box, b^r_t is the noisy relation candidate box, b^e_0 is the ground-truth entity candidate box annotation, b^r_0 is the ground-truth relation candidate box annotation, q(·) is the diffusion (noising) process, and ε_t is the noise added at time step t.
In some embodiments, training data are first obtained from the public scene graph generation dataset Visual Genome, and the real scene pictures I and the corresponding scene graph annotation information in the dataset are used as dataset samples. The given scene graph annotation information comprises the position information and category information of the entities in the image, as well as the relation category information existing between entities in the image; that is, the scene graph can be represented as a set of relation triplets.
Further, ground-truth entity candidate boxes and ground-truth relation candidate box pairs are obtained from the position box data of the annotations. The position box representation is converted from two-corner (top-left/bottom-right) coordinates into the center-coordinate-and-size representation space, i.e., the center point coordinates of the box together with its width and height, and the value range of the center coordinates and sizes is scaled relative to the image size. Gaussian noise is then added in this space, and finally the position box representation is converted back from the center-coordinate-and-size space to the top-left/bottom-right coordinate space through the reverse conversion, thereby obtaining the noisy entity candidate boxes and noisy relation candidate boxes.
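As a concrete illustration of this step, the sketch below converts boxes between the two representation spaces and adds DDPM-style Gaussian noise at step t; the cumulative schedule alphas_bar and the exact scaling are assumptions, since the patent only states that Gaussian noise is added in the scaled center-coordinate-and-size space.

import torch

def xyxy_to_cxcywh(boxes):
    # top-left/bottom-right corners -> center coordinates and size
    x1, y1, x2, y2 = boxes.unbind(-1)
    return torch.stack([(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1], dim=-1)

def cxcywh_to_xyxy(boxes):
    # center coordinates and size -> top-left/bottom-right corners
    cx, cy, w, h = boxes.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

def add_box_noise(gt_boxes_xyxy, t, alphas_bar, image_size):
    """Forward (noising) process q: ground-truth boxes -> noisy candidate boxes at step t."""
    b0 = xyxy_to_cxcywh(gt_boxes_xyxy) / image_size            # scale relative to the image size (assumption)
    eps = torch.randn_like(b0)                                 # Gaussian noise epsilon_t
    bt = alphas_bar[t].sqrt() * b0 + (1.0 - alphas_bar[t]).sqrt() * eps
    return cxcywh_to_xyxy(bt * image_size)                     # convert back to corner coordinates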
S200: extracting features of the image to be processed through the noisy entity candidate boxes and the noisy relation candidate boxes to obtain entity features and relation features;
In some embodiments, an entity feature and relation feature extraction module is first configured: for any entity candidate box, and for the subject and object candidate boxes of a relation candidate box pair, the corresponding entity features and relation subject/object features are extracted by ROI Pooling from the feature map generated by a pre-trained backbone model, calculated as:

f_e = Φ(F(I), b^e_t),  f_s = Φ(F(I), b^s_t),  f_o = Φ(F(I), b^o_t),  f_u = Φ(F(I), b^u_t)

where f_e is the entity feature, Φ(·) is the unified feature extraction (ROI Pooling) operation, F is the feature extraction backbone network, I denotes the input picture, f_s is the visual feature corresponding to the subject in the relation feature, f_o is the visual feature corresponding to the object in the relation feature, f_u is the visual feature corresponding to the union region in the relation feature, and b^s_t, b^o_t, b^u_t denote the subject, object and union boxes of the noisy relation candidate pair.
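A minimal sketch of this extraction step follows, using torchvision's roi_align as the pooling operator; the single-level feature map, stride and output size are assumptions, since the patent does not fix a particular backbone.

import torch
from torchvision.ops import roi_align

def extract_box_features(fmap, boxes_xyxy, stride, out_size=7):
    """Pool one feature vector per box from a backbone feature map (fmap: [1, C, H, W])."""
    pooled = roi_align(fmap, [boxes_xyxy], output_size=out_size,
                       spatial_scale=1.0 / stride, aligned=True)
    return pooled.flatten(1)                                   # [num_boxes, C * out_size * out_size]

def union_boxes(subj_boxes, obj_boxes):
    """Smallest box enclosing each subject-object pair, used for the union-region feature."""
    return torch.stack([torch.minimum(subj_boxes[:, 0], obj_boxes[:, 0]),
                        torch.minimum(subj_boxes[:, 1], obj_boxes[:, 1]),
                        torch.maximum(subj_boxes[:, 2], obj_boxes[:, 2]),
                        torch.maximum(subj_boxes[:, 3], obj_boxes[:, 3])], dim=1)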
S300: constructing a deep learning network based on the entity features and the relation features, and obtaining a diffusion model by having the deep learning network learn the reverse diffusion process of entity detection and relation detection;
wherein, step S300 includes:
S310: embedding the sampling time step into the entity features and the relation features, and correcting the noisy entity candidate boxes and noisy relation candidate boxes according to the entity features and relation features after the sampling time step is embedded;
In some embodiments, based on the entity features and relation features acquired in step S200, the position box correction offsets of the noisy entity and relation candidate boxes can be predicted, calculated as:

Δb^e = g_e(f_e, t),  Δb^r = g_r(f_r, t),  b̂^e = Corr_e(b^e_t, Δb^e),  b̂^r = Corr_r(b^r_t, Δb^r)

where Δb^e is the correction offset of the noisy entity candidate box, Δb^r is the correction offset of the subject or object candidate box involved in the noisy relation candidate box, g_e is the entity correction-offset prediction model, g_r is the relation correction-offset prediction model, t is the sampling time step, f_r is the relation feature, b̂^e is the position-corrected noisy entity candidate box, b̂^r is the noisy relation candidate box whose subject and object candidate box positions have been corrected, Corr_e is the correction operation on the noisy entity candidate box, and Corr_r is the correction operation on the noisy relation candidate box.
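One plausible form of these correction heads is sketched below; the sinusoidal time embedding, layer sizes and the simple additive correction are assumptions, since the patent only specifies that the sampling time step is embedded into the features and a position offset is regressed.

import math
import torch
import torch.nn as nn

def time_embedding(t, dim):
    """Sinusoidal embedding of the sampling step t (returns a [dim] vector)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    ang = float(t) * freqs
    return torch.cat([ang.sin(), ang.cos()])

class BoxCorrectionHead(nn.Module):
    """Predicts per-box correction offsets from box features plus the time embedding."""
    def __init__(self, feat_dim=256, time_dim=256):
        super().__init__()
        self.time_dim = time_dim
        self.mlp = nn.Sequential(nn.Linear(feat_dim + time_dim, 256), nn.ReLU(), nn.Linear(256, 4))

    def forward(self, box_feats, t):
        temb = time_embedding(t, self.time_dim).to(box_feats)
        temb = temb.unsqueeze(0).expand(box_feats.shape[0], -1)
        return self.mlp(torch.cat([box_feats, temb], dim=-1))    # (dcx, dcy, dw, dh) per box

def correct_boxes(boxes_cxcywh, offsets):
    return boxes_cxcywh + offsets                                 # additive correction (assumption)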
S320: predicting the entity characteristics after the embedding sampling time and the relation characteristics after the embedding sampling time in combination with a classification network to obtain a predicted entity category and a predicted relation category;
In some embodiments, based on the entity features acquired in step S200, the corresponding entity category is predicted. The entity category prediction is mainly calculated from the input entity visual feature representation and the time embedding of the current sampling step as:

ĉ_e = h_e(f_e, t)

where ĉ_e is the predicted entity category and h_e is the entity category classification network.
In some embodiments, based on the relationship feature acquired in step S200, particularly referring to a host-guest relationship feature, a predicate category that may exist between a host and an object position of the corresponding relationship can be predicted, and in particular, for the construction of the relationship feature representation, the present invention uses the visual feature stitching feature of the host, the object and the union region as the relationship feature representation, and uses the predicate classification network to perform the relationship predicate classification, with the following calculation method:
wherein ,for predictive relationship category->The relation predicate classification network predicts predicate categories of corresponding relation based on the spliced relation characteristic representation and time embedding of the current sampling step.
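A minimal sketch of the two classification heads follows; the hidden layer sizes are assumptions, while the class counts (150 entity categories, 50 predicates) match the VG150 split used in the experiments below.

import torch
import torch.nn as nn

class EntityClassifier(nn.Module):
    """h_e: entity logits from the entity feature plus the time embedding."""
    def __init__(self, feat_dim=256, time_dim=256, num_classes=150):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim + time_dim, 256), nn.ReLU(),
                                  nn.Linear(256, num_classes))

    def forward(self, ent_feats, temb):
        return self.head(torch.cat([ent_feats, temb], dim=-1))

class PredicateClassifier(nn.Module):
    """h_r: predicate logits from concatenated subject, object and union features plus the time embedding."""
    def __init__(self, feat_dim=256, time_dim=256, num_predicates=50):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(3 * feat_dim + time_dim, 256), nn.ReLU(),
                                  nn.Linear(256, num_predicates))

    def forward(self, subj_f, obj_f, union_f, temb):
        rel_feat = torch.cat([subj_f, obj_f, union_f, temb], dim=-1)
        return self.head(rel_feat)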
S330: constructing the deep learning network based on the entity characteristics, the relationship characteristics, the predicted entity category and the predicted relationship category;
s340: distributing true values of the predicted entity category and the predicted relation category through optimal transmission to optimize the deep learning network;
s350: and (5) the optimized deep learning network learns the back diffusion process, and the diffusion model is obtained through training.
In some embodiments, step S340 further comprises assigning matched ground-truth position boxes to the corresponding entities and relations. In addition, for the noisy entity candidate boxes, the noisy relation candidate boxes, and the relation predicate categories among the predicted entity categories and predicted relation categories obtained in the above steps, the invention assigns the best-matching ground-truth labels to the prediction results based on an optimal transport assignment method, so that the loss functions for the entity and relation detection results can be calculated.
Further, loss functions based on multi-class cross entropy over categories and a position box prediction consistency constraint are used together with the AdamW optimizer, and scene graph generation is gradually optimized over multiple iterations; the model is trained to learn the reverse diffusion process of the noising in step S100 for entity detection and relation detection, so that the corresponding entities and position-based relations can be predicted from randomly sampled noise candidate boxes.
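As an illustration of the label assignment, the sketch below uses a Hungarian assignment (via scipy) over a cost combining classification, L1 and GIoU terms; this is a simplified stand-in for the optimal transport assignment described in the patent, and the cost weights are assumptions.

import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou

def assign_targets(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Match each ground-truth box to a prediction by minimizing a combined cost."""
    cls_cost = -pred_logits.softmax(-1)[:, gt_labels]             # [num_pred, num_gt]
    l1_cost = torch.cdist(pred_boxes, gt_boxes, p=1)               # box regression cost
    giou_cost = -generalized_box_iou(pred_boxes, gt_boxes)         # overlap cost
    cost = cls_cost + 5.0 * l1_cost + 2.0 * giou_cost              # weights are assumptions
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx                                        # indices of matched pairs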
Wherein, step S340 includes:
S341: predicting on a training image with the deep learning network to obtain prediction results;
S342: calculating a cross-entropy loss function, a mean absolute error loss function and a generalized IoU loss function on the prediction results via the optimal transport assignment method;
S343: iterating the deep learning network for a number of rounds based on the cross-entropy loss, mean absolute error loss and generalized IoU loss to obtain the optimized deep learning network.
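The three losses named in S342 can be computed on the matched pairs roughly as in the following sketch; the relative loss weights are assumptions.

import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def detection_loss(pred_logits, pred_boxes, gt_labels, gt_boxes, pred_idx, gt_idx):
    """Cross-entropy + mean absolute error (L1) + generalized IoU loss on matched pairs."""
    ce = F.cross_entropy(pred_logits[pred_idx], gt_labels[gt_idx])
    l1 = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx])
    giou = generalized_box_iou_loss(pred_boxes[pred_idx], gt_boxes[gt_idx], reduction="mean")
    return ce + 5.0 * l1 + 2.0 * giou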
S400: obtaining entity position boxes and relation position boxes of the image to be processed through the diffusion model, calculating the intersection-over-union (IoU) of the entity position boxes and the relation position boxes, and matching by highest IoU to obtain relation triplets;
In some embodiments, the objective of this stage is to obtain relation triplets. Unlike traditional graph-structure-based relation modeling and query-pair-based relation modeling approaches, the invention detects entities and position-based relations separately, and matches the subject and object position boxes of each relation with at most Q entities having the highest IoU via an entity-and-position-box matching strategy, thereby obtaining relation triplets.
Further, pairs of noisy entity candidate boxes and noisy relation candidate boxes are first randomly sampled from Gaussian noise, the candidate entities existing in the input image and the position-based relations are predicted by the deep learning model, and the IoU of each relation's subject and object position boxes with all candidate entity position boxes is calculated.
Further, based on the calculated IoU between the relation subject/object position boxes and the entity position boxes, the Q detected entity predictions with the largest IoU with a relation position box are selected by a top-Q selection operation as the matching result of that relation's subject or object position box, calculated as:

M = top-Q( IoU(b^{s/o}, B^e) )

where top-Q(·) is the maximum-Q-value selection operation, Q is the entity-relation matching number, IoU(·) computes the intersection-over-union of the relation's subject or object box b^{s/o} with the detected entity position boxes B^e, and M denotes the mapping matrix from the relation's subject or object position boxes to the matched entities.
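A minimal sketch of this matching step follows, using torchvision's pairwise box_iou; the value of Q is an assumption.

import torch
from torchvision.ops import box_iou

def match_relations_to_entities(rel_subj_boxes, rel_obj_boxes, ent_boxes, q=3):
    """Return the indices of the top-Q entities (by IoU) for each relation's subject and object box."""
    k = min(q, ent_boxes.shape[0])
    subj_match = box_iou(rel_subj_boxes, ent_boxes).topk(k, dim=1).indices   # [num_rel, k]
    obj_match = box_iou(rel_obj_boxes, ent_boxes).topk(k, dim=1).indices
    return subj_match, obj_match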
S500: generating a scene graph based on the relation triplets and a graph structure constraint.
Wherein, step S500 further includes:
S510: screening valid relation triplets from the relation triplets, which are used together with the graph structure constraint to generate the scene graph.
Wherein, step S510 includes:
S511: for a relation triplet in which a single relation predicate exists between two entities, marking the relation triplet as a valid relation triplet;
S512: for relation triplets in which multiple relation predicates exist between two entities, calculating the prediction probability of each relation predicate, and marking the relation predicate with the highest prediction probability for the two entities as the valid relation triplet;
S513: constructing a directed acyclic graph structure based on the valid relation triplets, and generating the scene graph (see the sketch below).
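As an illustration of this filtering and graph construction, a minimal sketch follows; the triple representation (index, predicate, probability) is a hypothetical simplification.

from collections import defaultdict

def filter_valid_triples(triples):
    """triples: list of (subject_idx, object_idx, predicate, probability).
    Keep at most one predicate per subject-object pair: the highest-probability one."""
    best = {}
    for s, o, pred, prob in triples:
        if (s, o) not in best or prob > best[(s, o)][3]:
            best[(s, o)] = (s, o, pred, prob)
    return list(best.values())

def build_scene_graph(valid_triples):
    """Adjacency-list scene graph: subject index -> list of (predicate, object index)."""
    graph = defaultdict(list)
    for s, o, pred, _ in valid_triples:
        graph[s].append((pred, o))
    return graph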
The effectiveness of the scene graph generation method based on the diffusion model has been verified experimentally, and the results show that the method is superior to other methods in terms of the quality and extensibility of end-to-end scene graph generation.
The invention carries out scene graph generation experiments on the widely used public scene graph generation dataset Visual Genome, specifically adopting the VG150 dataset split. The dataset contains 108,000 pictures, 150 target entity categories and 50 predicate categories; 70% of the data is used for training and 30% for testing.
The experiments cover three aspects: the predicate classification subtask (PredCls: given pairs of target entity categories and position boxes, classify the relation predicates between the two entities), the scene graph classification subtask (SGCls: given the positions of the target entities, classify the target entities and the relations between them), and the scene graph generation subtask (SGDet: generate the scene graph of a given image). Six commonly used evaluation metrics are used for the three experimental settings: R@K (Recall of top-K triplets), ng-R@K (no-graph-constraint Recall of top-K triplets), mR@K (mean Recall of top-K triplets), ng-mR@K (no-graph-constraint mean Recall of top-K triplets), zR@K (zero-shot Recall of top-K triplets), and ng-zR@K (no-graph-constraint zero-shot Recall of top-K triplets).
The experimental results show that, compared with other end-to-end scene graph generation methods, the proposed method achieves improvements of varying degrees under different experimental settings, and it is particularly clearly superior to existing end-to-end scene graph generation methods (FCSGG and CoRF) on the scene graph generation subtask. In addition, under different inference settings the performance of the method can be significantly improved, indicating a flexibility and extensibility that existing scene graph generation methods do not possess. In particular, the generated scene graphs detect the semantic relationships present in images more completely and more accurately, and the comparison results fully demonstrate that the proposed method performs excellently on the end-to-end scene graph generation task.
The scene graph generation method based on the diffusion model is used for end-to-end image scene graph generation. By explicitly extracting entity and relation visual features through random noise boxes, more accurate feature representations of entities and relations can be obtained, giving stronger relation detection capability. Based on the diffusion model architecture, relation predicates that may exist at any two positions of the image are predicted from randomly sampled noise boxes, which removes the dependence of traditional scene graph generation architectures on an object detection model, enables the detection of relations between multi-scale target entities, and shows stronger performance on complex relations and zero-shot relation predicates. To enhance the flexibility and extensibility of the scene graph generation model, the training and inference architecture based on the diffusion model allows a model trained once to cope with a variety of inference settings, realizing a trade-off between model performance and time cost. That is, the invention can explicitly and more accurately model image entities and relation features, significantly improving scene graph generation performance; meanwhile, the relation detection mechanism based on pairs of random noise boxes removes the dependence on an object detection model, achieving stronger zero-shot and complex-relation detection capability; in addition, the diffusion-model-based training and inference architecture significantly enhances the flexibility and extensibility of the model at the inference stage.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A scene graph generation method based on a diffusion model, comprising:
S100: acquiring training data containing annotation information, obtaining entity candidate boxes and relation candidate boxes from the training data, and adding noise to the entity candidate boxes and relation candidate boxes to obtain noisy entity candidate boxes and noisy relation candidate boxes;
S200: extracting features of the image to be processed through the noisy entity candidate boxes and the noisy relation candidate boxes to obtain entity features and relation features;
S300: constructing a deep learning network based on the entity features and the relation features, and obtaining a diffusion model by having the deep learning network learn the reverse diffusion process of entity detection and relation detection;
S400: obtaining entity position boxes and relation position boxes of the image to be processed through the diffusion model, calculating the intersection-over-union (IoU) of the entity position boxes and the relation position boxes, and matching by highest IoU to obtain relation triplets;
S500: generating a scene graph based on the relation triplets and a graph structure constraint.
2. The scene graph generation method based on a diffusion model according to claim 1, wherein the noise is random noise following a Gaussian distribution.
3. The scene graph generation method based on the diffusion model according to claim 1, wherein the step S100 includes:
S110: introducing a scene graph generation dataset, and selecting training data from the scene graph generation dataset;
S120: extracting the entity candidate boxes and relation candidate boxes in the training data;
S130: converting the entity candidate boxes and relation candidate boxes from the top-left/bottom-right corner-coordinate representation space to the center-coordinate-and-size representation space, obtaining entity candidate boxes and relation candidate boxes in the center-coordinate-and-size space;
S140: adding noise to the entity candidate boxes and relation candidate boxes in the center-coordinate-and-size space to obtain preliminary noisy entity candidate boxes and preliminary noisy relation candidate boxes;
S150: converting the preliminary noisy entity candidate boxes and preliminary noisy relation candidate boxes from the center-coordinate-and-size representation space back to the top-left/bottom-right corner-coordinate representation space to obtain the noisy entity candidate boxes and noisy relation candidate boxes.
4. A scene graph generation method based on a diffusion model according to claim 3, wherein the noise adding process in step S140 is represented as:

b^e_t = q(b^e_0, ε_t),  b^r_t = q(b^r_0, ε_t)

where b^e_t is the noisy entity candidate box, b^r_t is the noisy relation candidate box, b^e_0 is the ground-truth entity candidate box annotation, b^r_0 is the ground-truth relation candidate box annotation, q(·) is the diffusion (noising) process, and ε_t is the noise added at time step t.
5. The scene graph generation method based on the diffusion model according to claim 1, wherein the step S300 includes:
S310: embedding the sampling time step into the entity features and the relation features, and correcting the noisy entity candidate boxes and noisy relation candidate boxes according to the entity features and relation features after the sampling time step is embedded;
S320: predicting, with a classification network, on the entity features and relation features after the sampling time step is embedded, to obtain predicted entity categories and predicted relation categories;
S330: constructing the deep learning network based on the entity features, the relation features, the predicted entity categories and the predicted relation categories;
S340: assigning ground-truth labels to the predicted entity categories and predicted relation categories through optimal transport to optimize the deep learning network;
S350: having the optimized deep learning network learn the reverse diffusion process, and obtaining the diffusion model through training.
6. The scene graph generation method based on the diffusion model according to claim 5, wherein step S340 includes:
S341: predicting on a training image with the deep learning network to obtain prediction results;
S342: calculating a cross-entropy loss function, a mean absolute error loss function and a generalized IoU loss function on the prediction results via an optimal transport assignment method;
S343: iterating the deep learning network for a number of rounds based on the cross-entropy loss, mean absolute error loss and generalized IoU loss to obtain the optimized deep learning network.
7. The scene graph generation method based on the diffusion model according to claim 1, wherein the step S500 further comprises:
S510: screening valid relation triplets from the relation triplets, which are used together with the graph structure constraint to generate the scene graph.
8. The scene graph generation method based on the diffusion model according to claim 7, wherein the step S510 includes:
S511: for a relation triplet in which a single relation predicate exists between two entities, marking the relation triplet as a valid relation triplet;
S512: for relation triplets in which multiple relation predicates exist between two entities, calculating the prediction probability of each relation predicate, and marking the relation predicate with the highest prediction probability for the two entities as the valid relation triplet;
S513: constructing a directed acyclic graph structure based on the valid relation triplets, and generating the scene graph.
CN202310761058.7A 2023-06-27 2023-06-27 Scene graph generation method based on diffusion model Pending CN116958652A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310761058.7A CN116958652A (en) 2023-06-27 2023-06-27 Scene graph generation method based on diffusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310761058.7A CN116958652A (en) 2023-06-27 2023-06-27 Scene graph generation method based on diffusion model

Publications (1)

Publication Number Publication Date
CN116958652A (en) 2023-10-27

Family

ID=88443604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310761058.7A Pending CN116958652A (en) 2023-06-27 2023-06-27 Scene graph generation method based on diffusion model

Country Status (1)

Country Link
CN (1) CN116958652A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination