CN114842248A - Scene graph generation method and system based on causal association mining model - Google Patents


Publication number
CN114842248A
CN114842248A (application CN202210425654.3A; granted as CN114842248B)
Authority
CN
China
Prior art keywords
data
relation
training
loss function
category
Prior art date
Legal status
Granted
Application number
CN202210425654.3A
Other languages
Chinese (zh)
Other versions
CN114842248B (en)
Inventor
罗廷金 (Luo Tingjin)
周浩 (Zhou Hao)
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202210425654.3A
Publication of CN114842248A
Application granted
Publication of CN114842248B
Current legal status: Active

Classifications

    • G06F18/2415 — Pattern recognition; classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/047 — Neural networks; probabilistic or stochastic networks
    • G06N3/048 — Neural networks; activation functions
    • G06N3/08 — Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a scene graph generation method and system based on a causal association mining model. The method comprises: acquiring image data; analyzing a target object in the image data with a pre-constructed causal association mining model to obtain a relation classification result, which the model outputs to complete the analysis; and constructing a scene graph from the relation classification result. The method mines the intrinsic causal relations between objects and their relationships while strengthening object personality features to eliminate the side effects of category commonality features.

Description

Scene graph generation method and system based on causal association mining model
Technical Field
The application relates to the technical field of computer vision, in particular to a scene graph generation method and system based on a causal association mining model.
Background
Scene graph datasets usually exhibit a strong frequency bias. Existing scene graph generation models typically fuse the statistical dependency between object categories and relations into the model framework to improve the accuracy of relation prediction; on this basis, the statistical dependency between object categories and predicate relations is implicitly fused into the scene graph generation process. In mainstream frameworks, prediction of the relation classification result mainly depends on the features of objects and their categories.
Under the current model framework, however, the object classification task strengthens the category commonality components in the object features, so that in relation prediction the category commonality features dominate the object features and the object personality features are suppressed. Such a framework inevitably establishes an incorrect causal relationship between the object category and the relation classification result, leading to highly biased relation predictions. Learning the correct causal relationship between objects and relations is therefore crucial for improving scene graph performance.
Disclosure of Invention
In view of the above, an objective of the present application is to provide a method and a system for generating a scene graph based on a causal association mining model, so as to solve the above technical problems.
In view of the above, a first aspect of the present application provides a scene graph generation method based on a causal association mining model, including:
acquiring image data;
the process of analyzing and processing the target object in the image data based on the pre-constructed causal association mining model comprises the following steps:
extracting object features from the image data to obtain object feature data, and decomposing the object feature data into category common features and object individual features;
obtaining category average characteristics according to the category common characteristics and the object individual characteristics, obtaining first relation data based on the category common characteristics and the object individual characteristics, and obtaining second relation data based on the category average characteristics;
performing difference processing on the first relation data and the second relation data to obtain third relation data;
carrying out relation classification according to the first relation data and the third relation data to obtain a relation classification result, and outputting the relation classification result by the causal association mining model to finish the analysis processing process;
and constructing a scene graph according to the relation classification result.
A second aspect of the present application provides a scene graph generation system based on a causal association mining model, including:
a data acquisition module configured to acquire image data;
an extraction classification module configured to perform analysis processing on a target object in the image data based on a pre-constructed causal association mining model, the extraction classification module including:
the extraction and decomposition unit is configured to extract object features of the image data to obtain object feature data, and decompose the object feature data into category common features and object individual features;
the relation data acquisition unit is configured to obtain a category average characteristic according to the category common characteristic and the object individual characteristic, obtain first relation data based on the category common characteristic and the object individual characteristic, and obtain second relation data based on the category average characteristic;
a difference processing unit configured to perform difference processing on the first relation data and the second relation data to obtain third relation data;
the relation classification unit is configured to perform relation classification according to the first relation data and the third relation data to obtain a relation classification result, and the causal association mining model outputs the relation classification result to complete the analysis processing process;
and the scene graph building module is configured to build a scene graph according to the relation classification result.
A third aspect of the application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect when executing the program.
A fourth aspect of the present application provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect.
As can be seen from the above, in the scene graph generation method and system based on the causal association mining model provided by the present application, object feature data is extracted from the acquired image data through a pre-constructed causal association mining network model and decomposed into object personality features and category commonality features. First relation data is obtained based on the category commonality features and the object personality features, so it is dominated by both; second relation data is obtained based on the category average features, so it is dominated only by the category commonality features of the target object. Difference processing of the first and second relation data yields third relation data, which is dominated by the object personality features, releasing the suppression of the object personality features by the category commonality features and letting the object personality features fully play their role in relation classification. Relation classification is then performed according to the first and third relation data to obtain a relation classification result, and a scene graph is constructed from that result. Hierarchical classification from coarse-grained to fine-grained semantics is thereby realized in doubly unbalanced and semantically overlapping data, establishing the causal association between the object personality data and the relation classification result.
Drawings
To illustrate the technical solutions in the present application or the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. The drawings described below are only embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a scene graph generation method according to an embodiment of the present application;
FIG. 2-a is a schematic causal diagram of an embodiment of the present application;
FIG. 2-b is an exploded view of object feature data according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a scene graph generation framework according to an embodiment of the present application;
FIG. 4 is a block diagram illustrating a scene graph generation system based on a causal association mining model according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings in combination with specific embodiments.
It should be noted that technical terms or scientific terms used in the embodiments of the present application should have a general meaning as understood by those having ordinary skill in the art to which the present application belongs, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the present application is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
In related-art scene graph generation models, the statistical dependency between object categories and relation classification results is usually merged into the model framework to improve the accuracy of relation prediction; on this basis, the statistical dependency between object categories and predicate relations is implicitly merged into the scene graph generation process. In mainstream frameworks, prediction of the relation classification result mainly depends on the object feature data and the object categories. However, such a framework inevitably establishes a wrong causal relation between the object categories and the relation classification result, causing highly biased predictions. Under the current model frameworks, the classification task strengthens the category commonality characteristics in the object feature data, so relation prediction is dominated by the category commonality characteristics; owing to the influence of statistical dependency, these scene graph generation methods take the dominant category commonality characteristics as the basis of relation classification and therefore cannot predict the correct relation according to richer semantics.
The embodiment of the application provides a scene graph generation method based on a causal association mining model. Extracted feature data is decomposed by a pre-constructed causal association mining network model. A factual branch model measures the influence of the category commonality features and the object personality features on relation classification to obtain first relation data, while a counterfactual branch model measures the influence of the category commonality features alone to obtain second relation data, extracting more essential relation features, enhancing the contribution of the object personality features and counteracting the adverse effect of the category commonality features. Difference processing of the first and second relation data yields third relation data, so that the influence of the object personality features on relation classification is learned from the difference between the two branches' predictions, mining the causal features between the object feature data and the relations of the target object. Relation classification is then predicted by a classifier trained with a multi-level de-biasing loss function to obtain a relation classification result; relation categories with similar semantics can be distinguished from coarse to fine granularity, so that the classifier learns the correspondence between more essential predicate features and predicate labels. The causal association mining model outputs the relation classification result, and finally a scene graph is constructed according to it.
As shown in fig. 1, the method of the present embodiment includes:
step 101, image data is acquired.
In this step, the acquired image data is image data for which a scene map is to be generated.
Step 102, analyzing and processing the target object in the image data based on a pre-constructed causal association mining model to obtain a relation classification result, the causal association mining model outputting the relation classification result to complete the analysis process.
In this step, the target object in the image data is analyzed through the causal association mining network model (CAE-Net), establishing the correct causal association between the interaction state of the target object and the relation classification result. The relation classification result is thus obtained and correct predictions are made according to richer semantics, avoiding highly biased prediction of the relation classification result.
Step 102 comprises:
and 1021, performing object feature extraction on the image data to obtain object feature data, and decomposing the object feature data into category commonality features and object personality features.
In this step, the object feature data refers to the features or characteristics, or a collection thereof, by which a certain type of object in the image data differs from other types of objects. Feature extraction is performed through the Faster R-CNN framework to obtain the object feature data.
Faster R-CNN integrates basic feature extraction, region proposal, bounding-box regression and classification into a single network, greatly improving comprehensive performance and speeding up the extraction of object features from the image data.
The object feature data is decomposed into category commonality features and object personality features. As shown in fig. 2-a, the object feature data O of a target object includes two parts: the category commonality feature O_g and the object personality feature O_s, while R describes the relationship between two target objects. As shown in fig. 2-b, the category commonality features mainly express the structural or appearance characteristics shared by target objects of the same type; for example, the category commonality features of a dog include information such as Nose, Ears, Tail and Four legs. The object personality features mainly describe what distinguishes a target object from target objects of other categories; for example, the object personality features of a particular dog may include information such as Straight front legs, Crouched rear legs and Open mouth.
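One simple way to picture the decomposition described above is to approximate the category commonality part O_g by the per-class mean feature and take the residual as the object personality part O_s. This is an illustrative sketch only; the patent does not fix this particular realization, and all names are assumptions:

```python
def decompose_object_features(features, labels, num_classes):
    """Split each object feature vector O into a category-commonality part
    O_g (approximated here by the per-class mean feature) and an
    object-personality part O_s (the residual O - O_g)."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for f, c in zip(features, labels):
        counts[c] += 1
        for i in range(dim):
            sums[c][i] += f[i]
    means = [[s / counts[c] if counts[c] else 0.0 for s in sums[c]]
             for c in range(num_classes)]
    o_g = [means[c][:] for c in labels]        # commonality part, per object
    o_s = [[f[i] - g[i] for i in range(dim)]   # personality part (residual)
           for f, g in zip(features, o_g)]
    return o_g, o_s

# toy example: two "dog" features and one "cat" feature (class ids 0 and 1)
feats = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0], [0.0, 0.0, 5.0]]
o_g, o_s = decompose_object_features(feats, [0, 0, 1], num_classes=2)
```

By construction O_g + O_s reconstructs the original feature exactly, so no information is lost by the split.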
Step 1022, obtaining a category average feature according to the category commonality features and the object personality features, obtaining first relation data based on the category commonality features and the object personality features, and obtaining second relation data based on the category average feature.
In this step, the category commonality features can learn the statistical knowledge of the categories in the dataset and their natural interdependencies, narrowing the candidate set and standardizing the semantic space, while the object personality features can reflect the real interaction between object pairs, helping the classifier make more accurate decisions at the causal level. First relation data is obtained based on the category commonality features and the object personality features, and second relation data is obtained based on the category average features. Processing the object personality features and the category commonality features separately avoids both the spurious association between the category commonality features and relation classification and the suppression of the object personality features' contribution, so the object personality features are strengthened while the advantages of the category commonality features are retained.
Step 1023, performing difference processing on the first relation data and the second relation data to obtain third relation data.
In this step, the first relation data represents the influence of the category commonality features and the object personality features on relation classification, while the second relation data represents the influence of the category commonality features alone. The influence of the object personality features on relation classification is learned from the difference between the two, thereby mining the causal features between the object feature data and the relations of the target object.
Step 1024, performing relation classification according to the first relation data and the third relation data to obtain a relation classification result, the causal association mining model outputting the relation classification result to complete the analysis process.
In this step, relation classification is performed according to the first relation data and the third relation data to obtain a relation classification result. The category commonality features and the object personality features are balanced so that the target object interaction information in the object personality features fully plays its role while the advantages brought by the category commonality features are retained.
Step 103, constructing a scene graph according to the relation classification result.
In this step, the scene graph is constructed according to the semantically richer, correct relation classification result, improving the performance of the scene graph.
According to the above scheme, the extracted feature data is decomposed by a pre-constructed causal association mining network model into object personality features and category commonality features. The category commonality features can learn the statistical knowledge of the categories in the dataset and their natural interdependencies, narrowing the candidate set and standardizing the semantic space, while the object personality features can reflect the real interaction between object pairs, helping the classifier make more accurate decisions at the causal level. First relation data is obtained based on the category commonality features and the object personality features, and second relation data is obtained based on the category average features; processing the two kinds of features separately avoids both the spurious association between the category commonality features and relation classification and the suppression of the object personality features' contribution, so the object personality features are strengthened while the advantages of the category commonality features are retained. Difference processing of the first and second relation data yields third relation data, and the influence of the object personality features on relation classification is learned from the difference between the two influences, mining the causal features between the object feature data of the target object and the relations.
Relation classification is then performed according to the first relation data and the third relation data to obtain a relation classification result, balancing the category commonality features and the object personality features so that the target object interaction information fully plays its role while the advantages of the category commonality features are retained. The causal association mining model outputs the relation classification result, and finally a scene graph is constructed from this semantically richer, correct result, improving scene graph performance.
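The data flow of this scheme — decomposed features feeding a factual branch, category averages feeding a counterfactual branch, and a difference step isolating the personality-driven signal — can be sketched as follows. All function and variable names are illustrative assumptions, and the stand-in encoder replaces whatever relation network the model actually uses:

```python
def causal_relation_features(subj, obj, subj_mean, obj_mean, encode):
    """Factual branch sees commonality + personality features; the
    counterfactual branch sees only category-average features; their
    difference isolates the object-personality contribution."""
    pair = subj + obj                       # factual input: full features
    avg_pair = subj_mean + obj_mean         # counterfactual: class averages
    r1 = encode(pair)                       # first relation data
    r2 = encode(avg_pair)                   # second relation data
    r3 = [a - b for a, b in zip(r1, r2)]    # third relation data (difference)
    return r1, r3                           # both feed the relation classifier

encode = lambda x: [2.0 * v for v in x]     # stand-in relation encoder
r1, r3 = causal_relation_features([1.5, 1.0], [0.0, 3.5],
                                  [1.0, 1.0], [0.0, 3.0], encode)
```

With this linear stand-in encoder, r3 depends only on how far each object deviates from its class average, which is exactly the personality signal the scheme aims to expose.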
In some embodiments, the causal association mining model comprises a factual branch model and a counterfactual branch model;
Step 1022 includes:
inputting the category commonality features and the object personality features into the factual branch model, which processes them and outputs the first relation data;
meanwhile, inputting the category average features into the counterfactual branch model, which outputs the second relation data.
In this scheme, the factual branch model measures the influence of the category commonality features and the object personality features on relation classification, while the counterfactual branch model measures the influence of the category commonality features separately. Processing the two kinds of features separately releases the suppression of the object personality features by the category commonality features and lets the object personality features fully play their role in relation classification, avoiding both the spurious association between the category commonality features and relation classification and the suppression of the object personality features' contribution, so the object personality features are strengthened while the advantages of the category commonality features are maintained.
In some embodiments, the causal association mining model comprises a classifier;
Step 1024 includes:
inputting the first relation data and the third relation data into a trained classifier, performing relation classification through the classifier, and outputting the relation classification result.
In the above scheme, there is considerable semantic overlap between relation classification results, for example "on", "talking on", "standing on" and "walking on". Although similar in coarse-grained semantics, they differ slightly in fine-grained semantics. The trained classifier learns hierarchical relation classification and distinguishes semantically similar relation categories from coarse to fine granularity, so it can discern the correspondence between more essential predicate features and predicate labels, realizing the causal association between object feature data and relations.
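The coarse-to-fine discrimination described above can be pictured as grouping semantically similar predicates and deciding in two stages. The taxonomy, decision rule, and probabilities below are hypothetical illustrations, not part of the patent:

```python
# Hypothetical predicate taxonomy: coarse-grained groups of semantically
# similar relation categories, each refined into fine-grained predicates.
TAXONOMY = {
    "on": ["on", "standing on", "walking on"],
    "near": ["near", "behind"],
}

def coarse_to_fine(coarse_probs, fine_probs):
    """Pick the most likely coarse group first, then the most likely
    fine-grained predicate within that group (illustrative rule only)."""
    group = max(coarse_probs, key=coarse_probs.get)
    return max(TAXONOMY[group], key=lambda p: fine_probs.get(p, 0.0))

pred = coarse_to_fine(
    {"on": 0.7, "near": 0.3},
    {"on": 0.2, "standing on": 0.5, "walking on": 0.1,
     "near": 0.05, "behind": 0.15},
)
```

Note that "standing on" wins here even though the generic "on" has non-trivial mass: the coarse stage commits to the group and the fine stage resolves within it.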
In some embodiments, the obtaining of the classifier comprises:
acquiring training data, and constructing a softmax regression pre-training model;
determining a first loss function l based on the training data bf
Based on the first loss function l bf And determining a second loss function/from said training data ht
According to said first loss function l bf The second loss function l ht And determining a third loss function/from said training data fore
According to said first loss function l bf The second loss function l ht And said third loss function l fore A multi-level droop loss function is determined,
wherein the multi-level droop loss function/ MHD Expressed as:
l MHD =l bf +αl ht +(1-α)l fore
wherein α represents a weight of the multi-layer degradation partial loss function;
inputting the training data into the softmax regression pre-training model, performing minimization processing based on the multi-level droop loss function, continuously training and adjusting the softmax regression pre-training model to obtain a trained softmax regression pre-training model, and taking the trained softmax regression pre-training model as the classifier.
In the above scheme, the distribution of relation classification results in scene graph construction is doubly unbalanced. First, the number of background samples without relation labels is significantly greater than the number of foreground samples, where the foreground refers to the target objects in the image data and the background refers to objects other than the target objects. Second, unlike the imbalance faced by other tasks, there is considerable semantic overlap between relation categories in scene graph construction: for example, "on", "talking on", "standing on" and "walking on" differ in fine-grained semantics although they are similar in coarse-grained semantics. Generally, coarse-grained relation categories are concentrated in a head of frequent samples (categories to which many target objects belong), while fine-grained relation categories are distributed in a smaller-sample tail (categories to which few target objects belong), so tail information is usually suppressed by the background and the head foreground.
Therefore, through the multi-level de-biasing loss function (MHD loss), the softmax regression pre-training model learns a hierarchical relation classification during training, distinguishing semantically similar relation classes from coarse granularity to fine granularity, so that the model learns a more essential correspondence between predicate features and predicate labels. The multi-level de-biasing loss function neither gives higher priority to the tail when computing the loss and gradient, nor introduces any design that could harm the expression of head features. Under training on doubly imbalanced data, the classifier can effectively distinguish the background from the foreground and the head from the tail through the multi-level de-biasing loss function, classify relations hierarchically from coarse to fine granularity according to semantics, and finally capture the causal association between features and relations, thereby associating object feature data with relations more accurately.
Wherein the first loss function l_bf is a binary classification loss that discriminates whether the first relation data and/or the third relation data belong to the foreground or the background; the second loss function l_ht is a binary classification loss that distinguishes whether the first relation data and/or the third relation data belong to the head class or the tail class within the foreground relations; and the third loss function l_fore is a multi-class loss that distinguishes all foreground classes.
In some embodiments, the training data comprises object feature data of objects in training image data and the real relation classification results of that object feature data, wherein the objects in the training data comprise target training objects and non-target training objects;
the first loss function l_bf is expressed as:

l_bf = -Σ_{n∈{0,1}} y_bf^(n) · log( σ( x_bf^(n) ) )

wherein y_bf^(n) indicates whether the object feature data of an object in the training image data belongs to the foreground or the background, the foreground being the target training object and the background being the non-target training object; n is 1 or 0, n = 0 representing the background and n = 1 representing the foreground; x_bf^(n) is the collapsed feature of the object feature data of an object in the training image data for the background (n = 0) or the foreground (n = 1); and σ is the sigmoid function;
said second loss function l_ht is expressed as:

l_ht = -β · Σ_{n∈{0,1}} y_ht^(n) · log( σ( x_ht^(n) ) )

wherein y_ht^(n) indicates the real relation classification result of the object feature data of the training data; n is 1 or 0, n = 0 representing the head, the head being the real relation classes containing a large number of objects, and n = 1 representing the tail, the tail being the real relation classes containing a small number of objects; x_ht^(n) is the collapsed feature of the object feature data of an object in the training image data for the head (n = 0) or the tail (n = 1); σ is the sigmoid function; and β is a weight parameter;
the third loss function l_fore is expressed as:

l_fore = -Σ_{j=1}^{R} y'_j · log( p'_j )

wherein y'_j is the real relation classification result (one-hot label) of the object feature data of the training data in the foreground; j indexes the real relation classification result of the object feature data of the j-th class; R represents the number of real relation classification results of the object feature data of the training data; and p'_j represents the probability information of the relation classification result of the object feature data of the training data.
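The composition of the three losses into the MHD loss can be sketched in Python/NumPy as follows. This is an illustrative sketch, not the patented implementation: the exact placement of the weight β within l_ht, and the function and variable names, are assumptions for illustration.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def _softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def l_bf(x_bf, y_bf):
    # binary cross-entropy over the collapsed (background, foreground) pair
    return -np.sum(y_bf * np.log(_sigmoid(x_bf)))

def l_ht(x_ht, y_ht, beta=1.0):
    # binary cross-entropy over the collapsed (head, tail) pair;
    # beta is assumed here to scale the whole term
    return -beta * np.sum(y_ht * np.log(_sigmoid(x_ht)))

def l_fore(x_fore, y_fore):
    # multi-class cross-entropy over all foreground relation classes
    return -np.sum(y_fore * np.log(_softmax(x_fore)))

def mhd_loss(x_bf, y_bf, x_ht, y_ht, x_fore, y_fore, alpha=0.5, beta=1.0):
    # l_MHD = l_bf + alpha * l_ht + (1 - alpha) * l_fore
    return (l_bf(x_bf, y_bf)
            + alpha * l_ht(x_ht, y_ht, beta)
            + (1.0 - alpha) * l_fore(x_fore, y_fore))
```

Minimizing mhd_loss over the training data yields the trained classifier described above.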
In the above scheme, the object feature data of objects in the training image data is first classified as foreground or background through the first loss function; the object feature data of objects in the foreground is then classified as head or tail through the second loss function; and finally the object feature data of all objects in the foreground is classified into relation classes through the third loss function.
The training data is the object feature data x = (x_0, x_1, ..., x_R) of an object in the foreground training image data, where R is the number of relation classes, together with the real relation classification result in one-hot form y = (y_0, y_1, ..., y_R).
The probability information of the relation classification result of the object feature data of the training data is estimated as p = (p_0, p_1, ..., p_R) = softmax(x).
For the first loss function l_bf, the pair x_bf = (x_bf^(0), x_bf^(1)) is constructed as follows. If the predicted class of the original object feature data is the background, then x_bf^(0) = x_0 and x_bf^(1) is set to the average of the object feature data over all foreground classes. On the contrary, if the predicted class is a foreground class i, then x_bf^(1) = x_i and x_bf^(0) = x_0. When y_0 = 1, y_bf = (1, 0); otherwise y_bf = (0, 1).
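The background/foreground collapse just described can be sketched as follows, under the convention (assumed here) that index 0 is the background class and indices 1..R are foreground relation classes:

```python
import numpy as np

def collapse_bf(x, y):
    """Collapse full logits x = (x_0, ..., x_R) and one-hot label y into the
    2-d background/foreground pair (x_bf, y_bf) used by the first loss l_bf."""
    pred = int(np.argmax(x))
    if pred == 0:
        # predicted background: foreground term is the mean of all foreground logits
        x_bf = np.array([x[0], np.mean(x[1:])])
    else:
        # predicted a foreground class: keep its logit, background term stays x_0
        x_bf = np.array([x[0], x[pred]])
    y_bf = np.array([1.0, 0.0]) if y[0] == 1 else np.array([0.0, 1.0])
    return x_bf, y_bf
```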
For the second loss function l_ht, the real relation classification result y is converted into y_ht as follows: y_ht = (1, 0) if the real relation class belongs to the head, and y_ht = (0, 1) if it belongs to the tail, where head denotes the head classes and tail denotes the tail classes. The predicted class used in the conversion below is obtained from the probability information p = (p_0, p_1, ..., p_R) = softmax(x) of the relation classification result of the object feature data of the training data.
The object feature data of objects in the training image data is converted into the pair x_ht = (x_ht^(head), x_ht^(tail)) for the head foreground relations and the tail foreground relations, where m represents the number of head relation classes and n represents the number of tail relation classes. When the predicted class of the original probability distribution is a head relation class i, the head term x_ht^(head) is set to the element x_i of the corresponding class in the original relation feature data, and the tail term x_ht^(tail) is set to the average of all tail elements in the original relation feature data. When the predicted class of the original probability distribution is a tail relation class i, the tail term x_ht^(tail) is set to the element x_i of the corresponding class, and the head term x_ht^(head) is set to the average of all head elements in the original relation feature data.
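The head/tail collapse can be sketched analogously; how the head/tail partition of the foreground classes is chosen is left open here, so head_idx and tail_idx are assumed to be given as index lists:

```python
import numpy as np

def collapse_ht(x, y, head_idx, tail_idx):
    """Collapse foreground logits into the 2-d (head, tail) pair (x_ht, y_ht)
    used by the second loss l_ht. head_idx / tail_idx partition the
    foreground relation class indices."""
    pred = int(np.argmax(x))
    if pred in head_idx:
        # predicted a head class: keep its logit, average the tail logits
        x_ht = np.array([x[pred], np.mean(x[tail_idx])])
    else:
        # predicted a tail class: average the head logits, keep its logit
        x_ht = np.array([np.mean(x[head_idx]), x[pred]])
    true = int(np.argmax(y))
    y_ht = np.array([1.0, 0.0]) if true in head_idx else np.array([0.0, 1.0])
    return x_ht, y_ht
```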
In some embodiments, the first relation data is expressed as:

L_f(o_i, o_j) = FC(g(o_i, o_j))

wherein FC(·) represents the fully connected layer learning algorithm of the Faster R-CNN model; g(·) represents the binary tree long short-term memory network (BiTreeLSTM) learning algorithm; o_i represents the object feature data of the i-th target object in the image data; and o_j represents the object feature data of the j-th target object in the image data;
the second relation data is expressed as:

L_cf(o_i, o_j) = FC(g(ō_i, ō_j))

wherein ō_i represents the class average feature of the i-th target object in the image data and ō_j represents the class average feature of the j-th target object in the image data;

the class average feature is expressed as:

ō_i^(t) = (1 - λ)·ō_i^(t-1) + λ·x_i^(t)

wherein i indexes the object feature data of the i-th target object; t represents the number of iterations; λ represents the update weight; x_i^(t) represents the object feature data of the target object at the t-th iteration; ō_i^(t-1) represents the class average feature of the target object at the (t-1)-th iteration; and ō_i^(t) represents the class average feature of the target object at the t-th iteration;
the third relation data is expressed as:

L_sp(o_i, o_j) = L_f(o_i, o_j) - L_cf(o_i, o_j)
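A minimal sketch of the iterative class-average update and the difference step follows; the direction in which λ weights the new sample versus the running average is an assumption consistent with the update formula above:

```python
import numpy as np

def update_class_avg(class_avg_prev, x_t, lam=0.1):
    """Momentum-style update of the class average feature:
    o_bar^(t) = (1 - lam) * o_bar^(t-1) + lam * x^(t)."""
    return (1.0 - lam) * class_avg_prev + lam * x_t

def third_relation_data(L_f, L_cf):
    """Third relation data as the difference of the fact-branch and
    counterfactual-branch outputs: L_sp = L_f - L_cf."""
    return L_f - L_cf
```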
in the above scheme, the binary tree long-term short-term memory network algorithm in the first relational data is a BiTreeLST network algorithm, and the first relational data in which the category common characteristic data and the object individual characteristic data cooperate with each other is generated through a fact branch model.
To generate the second relation data, which is affected only by the category commonality features, the counterfactual branch model takes as input the class average feature vector accumulated statistically during training. This vector is independent of the current input image and does not belong to any object of the current scene; it is a feature that conflicts with what is really present, i.e., it is counterfactual.
The causal association between feature data and relations is implied in the object personality features, because they capture the interaction between objects. The first relation data of the fact branch model is the result of the joint influence of the category commonality features and the object personality features, whereas the second relation data of the counterfactual branch model is dominated by the category commonality features. Therefore, by comparing the first relation data L_f(o_i, o_j) output by the fact branch model with the second relation data L_cf(o_i, o_j) output by the counterfactual branch model, the influence of the object personality features on the relation classification can be evaluated from their difference, namely the third relation data L_sp(o_i, o_j). Finally, the causal association mining model combines the first relation data L_f(o_i, o_j) of the fact branch model with the third relation data L_sp(o_i, o_j), which reflects the influence of the object personality features, to perform relation prediction. This strengthens the object personality features while retaining the advantages of the category commonality features: the category commonality features learn the statistical knowledge of categories in the dataset and their natural interdependencies, thereby narrowing the candidate set and the semantic space, while the object personality features reflect the real interaction between object pairs, so that a more accurate decision can be made at the causal level during classification.
In some embodiments, the causal association mining model comprises a Faster R-CNN model;
step 1021, comprising:
obtaining a candidate region of the target object through the Faster R-CNN model based on the image data;
and extracting object features of the candidate region of the target object to obtain object feature data.
In the above scheme, the target detector in the Faster R-CNN model obtains the candidate regions of the target objects from the image data together with their corresponding position coordinates.
Structurally, Faster R-CNN mainly comprises convolution layers, an RPN (Region Proposal Network) layer, an RoI pooling layer, and a classification and regression layer. The convolution layers mainly serve to extract a feature map of the whole image data and are built from convolution, activation function, and pooling operations.
The RPN layer makes fast and efficient use of the convolutional neural network: anchor boxes are generated when producing candidate regions of key objects, an internal discriminant function then judges whether each anchor belongs to the foreground or the background, and bounding-box regression performs a first adjustment of the anchors to obtain accurate candidate regions of the key objects.
The RoI pooling layer is added mainly to solve the problem that the feature maps finally fed into the fully connected layer differ in size; it produces outputs of a fixed size.
Finally, the classification layer and the regression layer respectively judge which category each object belongs to and fine-tune the position of the candidate region of the target object.
In some embodiments, as shown in fig. 3, the causal association mining network model (CAE-Net model) performs feature extraction on the target objects in the image data through the Faster R-CNN model to obtain object feature data, wherein the target objects are, for example, a person and an animal. The object feature data is decomposed into category commonality features and object personality features. The object features (category commonality features and object personality features) are input into the fact branch model (fact branch) and undergo joint feature embedding; the fact branch model outputs the relation logits L (i.e., the first relation data), which are jointly influenced by the category commonality features and the object personality features. Meanwhile, feature statistics are performed to obtain the class average features based on the category commonality features and the object personality features; the class average features are input into the counterfactual branch model (counterfactual branch) and undergo joint feature embedding, and the counterfactual branch model outputs the relation logits L_cf (i.e., the second relation data), which are affected only by the category commonality features. Difference processing of L (the first relation data) and L_cf (the second relation data) yields the relation logits L_sp (i.e., the third relation data), which reflect the influence of the object personality features. Finally, L (the first relation data) and L_sp (the third relation data) are input into the classifier trained with the MHD loss (multi-level de-biasing loss function), which outputs the relation classification result, and the scene graph is constructed based on the relation classification result.
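The end-to-end flow of fig. 3 can be summarized in a short sketch. The branch and classifier functions are placeholders, and the way L and L_sp are fused before classification (here simple addition via `fuse`) is an assumption, since the text only states that both are input to the classifier:

```python
import numpy as np

def cae_net_forward(o_i, o_j, avg_i, avg_j,
                    fact_branch, cf_branch, classifier,
                    fuse=lambda a, b: a + b):
    L_f  = fact_branch(o_i, o_j)    # first relation data: commonality + personality
    L_cf = cf_branch(avg_i, avg_j)  # second relation data: class averages only
    L_sp = L_f - L_cf               # third relation data: personality influence
    return classifier(fuse(L_f, L_sp))
```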
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.
It should be noted that the above describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, corresponding to the method of any embodiment, the application also provides a scene graph generation system based on the causal association mining model.
Referring to fig. 4, the scene graph generation system based on the causal association mining model includes:
a data acquisition module 401 configured to acquire image data;
an extraction classification module 402 configured to perform an analysis process on a target object in the image data based on a pre-constructed causal association mining model, wherein the extraction classification module includes:
an extraction and decomposition unit 4021 configured to perform object feature extraction on the image data to obtain object feature data, and decompose the object feature data into category commonality features and object personality features;
the relationship data obtaining unit 4022 is configured to obtain a category average feature according to the category commonality feature and the object personality feature, obtain first relationship data based on the category commonality feature and the object personality feature, and obtain second relationship data based on the category average feature;
a difference processing unit 4023 configured to perform difference processing on the first relationship data and the second relationship data to obtain third relationship data;
the relationship classification unit 4024 is configured to perform relationship classification according to the first relationship data and the third relationship data to obtain a relationship classification result, and the causal association mining model outputs the relationship classification result to complete the analysis processing process;
and a scene graph constructing module 403 configured to construct a scene graph according to the relationship classification result.
In some embodiments, the causal association mining model comprises a factual branch model and a counter-factual branch model;
the relationship data obtaining unit 4022 is specifically configured to:
inputting the category commonality characteristics and the object personality characteristics into the fact branch model, and outputting the first relation data after processing through the fact branch model;
meanwhile, the category average features are input into the counterfactual branch model, and the second relation data is output through the counterfactual branch model.
In some embodiments, the causal association mining model comprises a classifier;
the relationship classification unit 4024 is specifically configured to:
and inputting the first relation data and the third relation data into a trained classifier, performing relation classification through the classifier, and outputting a relation classification result.
In some embodiments, the obtaining of the classifier comprises:
acquiring training data, and constructing a softmax regression pre-training model;
determining a first loss function l_bf based on the training data;
determining a second loss function l_ht based on the first loss function l_bf and the training data;
determining a third loss function l_fore according to the first loss function l_bf, the second loss function l_ht, and the training data;
determining a multi-level de-biasing loss function according to the first loss function l_bf, the second loss function l_ht, and the third loss function l_fore,
wherein the multi-level de-biasing loss function l_MHD is expressed as:
l_MHD = l_bf + α·l_ht + (1-α)·l_fore
wherein α represents a weight that balances the component losses of the multi-level de-biasing loss function;
inputting the training data into the softmax regression pre-training model, minimizing the multi-level de-biasing loss function, and iteratively training and adjusting the softmax regression pre-training model to obtain a trained softmax regression pre-training model, which is taken as the classifier.
In some embodiments, the training data comprises object feature data of objects in training image data and the real relation classification results of that object feature data, wherein the objects in the training data comprise target training objects and non-target training objects;

the first loss function l_bf is expressed as:

l_bf = -Σ_{n∈{0,1}} y_bf^(n) · log( σ( x_bf^(n) ) )

wherein y_bf^(n) indicates whether the object feature data of an object in the training image data belongs to the foreground or the background, the foreground being the target training object and the background being the non-target training object; n is 1 or 0, n = 0 representing the background and n = 1 representing the foreground; x_bf^(n) is the collapsed feature of the object feature data of an object in the training image data for the background (n = 0) or the foreground (n = 1); and σ is the sigmoid function;

said second loss function l_ht is expressed as:

l_ht = -β · Σ_{n∈{0,1}} y_ht^(n) · log( σ( x_ht^(n) ) )

wherein y_ht^(n) indicates the real relation classification result of the object feature data of the training data; n is 1 or 0, n = 0 representing the head, the head being the real relation classes containing a large number of objects, and n = 1 representing the tail, the tail being the real relation classes containing a small number of objects; x_ht^(n) is the collapsed feature of the object feature data of an object in the training image data for the head (n = 0) or the tail (n = 1); σ is the sigmoid function; and β is a weight parameter;

said third loss function l_fore is expressed as:

l_fore = -Σ_{j=1}^{R} y'_j · log( p'_j )

wherein y'_j is the real relation classification result (one-hot label) of the object feature data of the training data in the foreground; j indexes the real relation classification result of the object feature data of the j-th class; R represents the number of real relation classification results of the object feature data of the training data; and p'_j represents the probability information of the relation classification result of the object feature data of the training data.
In some embodiments, the first relation data is expressed as:

L_f(o_i, o_j) = FC(g(o_i, o_j))

wherein FC(·) represents the fully connected layer learning algorithm of the Faster R-CNN model; g(·) represents the binary tree long short-term memory network (BiTreeLSTM) learning algorithm; o_i represents the object feature data of the i-th target object in the image data; and o_j represents the object feature data of the j-th target object in the image data;
the second relation data is expressed as:

L_cf(o_i, o_j) = FC(g(ō_i, ō_j))

wherein ō_i represents the class average feature of the i-th target object in the image data and ō_j represents the class average feature of the j-th target object in the image data;

the class average feature is expressed as:

ō_i^(t) = (1 - λ)·ō_i^(t-1) + λ·x_i^(t)

wherein i indexes the object feature data of the i-th target object; t represents the number of iterations; λ represents the update weight; x_i^(t) represents the object feature data of the target object at the t-th iteration; ō_i^(t-1) represents the class average feature of the target object at the (t-1)-th iteration; and ō_i^(t) represents the class average feature of the target object at the t-th iteration;

the third relation data is expressed as:

L_sp(o_i, o_j) = L_f(o_i, o_j) - L_cf(o_i, o_j)
in some embodiments, the causal association mining model comprises a FasterR-CNN model;
the extraction decomposition unit 4021 is specifically configured to:
the performing object feature extraction on the image data to obtain object feature data includes:
obtaining a candidate region of the target object through the Faster R-CNN model based on the image data;
and extracting object features of the candidate region of the target object to obtain object feature data.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the various modules may be implemented in the same one or more software and/or hardware implementations as the present application.
The device of the above embodiment is used for implementing the corresponding scene graph generation method based on the causal association mining model in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to the method of any embodiment described above, the application further provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the program, the scene graph generation method based on the causal association mining model described in any embodiment described above is implemented.
Fig. 5 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 501, a memory 502, an input/output interface 503, a communication interface 504, and a bus 505. Wherein the processor 501, the memory 502, the input/output interface 503 and the communication interface 504 are communicatively connected to each other within the device via a bus 505.
The processor 501 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present specification.
The Memory 502 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 502 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 502 and called to be executed by the processor 501.
The input/output interface 503 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 504 is used for connecting a communication module (not shown in the figure) to realize communication interaction between the device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 505 comprises a path that transfers information between the various components of the device, such as processor 501, memory 502, input/output interface 503, and communication interface 504.
It should be noted that although the above-mentioned device only shows the processor 501, the memory 502, the input/output interface 503, the communication interface 504 and the bus 505, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the foregoing embodiment is used to implement the corresponding scene graph generation method based on the causal association mining model in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above embodiments, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the scenegraph generation method based on causal association mining model as described in any of the above embodiments.
Computer-readable media of the present embodiments, including both volatile and non-volatile, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
The computer instructions stored in the storage medium of the above embodiment are used to enable the computer to execute the scene graph generation method based on the causal association mining model according to any of the above embodiments, and have the beneficial effects of corresponding method embodiments, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the context of the present application, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic ram (dram)) may use the discussed embodiments.
The present embodiments are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present application are intended to be included within the scope of the present application.

Claims (10)

1. A scene graph generation method based on a causal association mining model is characterized by comprising the following steps:
acquiring image data;
the process of analyzing and processing the target object in the image data based on the pre-constructed causal association mining model comprises the following steps:
extracting object features from the image data to obtain object feature data, and decomposing the object feature data into category common features and object individual features;
obtaining category average characteristics according to the category common characteristics and the object individual characteristics, obtaining first relation data based on the category common characteristics and the object individual characteristics, and obtaining second relation data based on the category average characteristics;
performing difference processing on the first relation data and the second relation data to obtain third relation data;
performing relation classification according to the first relation data and the third relation data to obtain a relation classification result, and outputting the relation classification result by the causal association mining model to complete the analysis processing;
and constructing a scene graph according to the relation classification result.
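The pipeline of claim 1 — a factual relation prediction from which a counterfactual prediction built on class-average features is subtracted — can be sketched numerically. This is an illustrative toy, not the patent's networks: `fact_branch`, `counterfactual_branch`, and the linear combinations below are hypothetical stand-ins for the learned models.

```python
import numpy as np

rng = np.random.default_rng(0)

def fact_branch(common, individual):
    # first relation data: prediction from class-common + object-individual features
    return common + individual

def counterfactual_branch(class_avg):
    # second relation data: prediction from class-average features alone
    return class_avg

common = rng.normal(size=5)      # category commonality features (toy values)
individual = rng.normal(size=5)  # object personality features (toy values)
class_avg = 0.9 * common         # class average tracks the common part (toy choice)

first = fact_branch(common, individual)
second = counterfactual_branch(class_avg)
third = first - second           # difference processing -> third relation data

# relation classification over the concatenated first and third relation data
logits = np.concatenate([first, third])
relation = int(np.argmax(logits))
```

The subtraction removes the class-level bias captured by the counterfactual branch, leaving the instance-specific causal contribution for the classifier.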
2. The method of claim 1, wherein the causal association mining model comprises a fact branch model and a counterfactual branch model;
the obtaining of the first relationship data based on the category commonality characteristic and the object personality characteristic, and the obtaining of the second relationship data based on the category average characteristic include:
inputting the category commonality characteristics and the object personality characteristics into the fact branch model, and outputting the first relation data after processing through the fact branch model;
meanwhile, the category average features are input into the counterfactual branch model, and the second relation data is output through the counterfactual branch model.
3. The method of claim 1, wherein the causal association mining model comprises a classifier;
the performing relationship classification according to the first relationship data and the third relationship data to obtain a relationship classification result includes:
and inputting the first relation data and the third relation data into a trained classifier, performing relation classification through the classifier, and outputting a relation classification result.
4. The method of claim 3, wherein the obtaining of the classifier comprises:
acquiring training data, and constructing a softmax regression pre-training model;
determining a first loss function l based on the training data bf
Based on the first loss function l bf And determining a second loss function/from said training data ht
According to said first loss function l bf The second loss function l ht And determining a third loss function/from said training data fore
According to said first loss function l bf The second loss function l ht And said third loss function l fore A multi-level droop loss function is determined,
wherein the multi-level droop loss function/ MHD Expressed as:
l MHD =l bf +αl ht +(1-α)l fore
wherein α represents a weight of the multi-layer degradation partial loss function;
inputting the training data into the softmax regression pre-training model, performing minimization processing based on the multi-level droop loss function, continuously training and adjusting the softmax regression pre-training model to obtain a trained softmax regression pre-training model, and taking the trained softmax regression pre-training model as the classifier.
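The weighted combination in claim 4 is straightforward to state in code. The component losses are plain numbers here, since the exact forms of l_bf, l_ht and l_fore are defined by the image formulas of claim 5; the function name is a hypothetical label for illustration.

```python
def multi_level_droop_loss(l_bf, l_ht, l_fore, alpha):
    """Combine the three component losses as l_MHD = l_bf + a*l_ht + (1-a)*l_fore.

    alpha is a convex weight trading off the l_ht term against the l_fore term;
    the component losses are computed elsewhere.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return l_bf + alpha * l_ht + (1.0 - alpha) * l_fore
```

For example, with l_bf = 1.0, l_ht = 2.0, l_fore = 4.0 and α = 0.5 the combined loss is 1.0 + 1.0 + 2.0 = 4.0.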
5. The method of claim 4, wherein the training data comprises object feature data of objects in training image data and real relation classification results of the object feature data, and wherein the objects in the training data comprise target training objects and non-target training objects;
the first loss function l_bf is expressed as:
[formula image FDA0003609489910000021]
wherein
[symbol image FDA0003609489910000022]
denotes, in the first loss function l_bf, whether the object feature data of an object in the training image data belongs to the foreground or the background, the foreground being the target training object and the background being the non-target training object; n is 1 or 0, n = 0 representing the background and n = 1 representing the foreground;
[symbol image FDA0003609489910000031]
denotes the object feature data of a foreground or background object in the training image data in the first loss function l_bf; σ is the sigmoid function;
the second loss function l_ht is expressed as:
[formula image FDA0003609489910000032]
wherein
[symbol image FDA0003609489910000033]
denotes, in the second loss function l_ht, the real relation classification result of the object feature data of the training data; n is 1 or 0, n = 0 representing the head, the head being a real relation classification result containing a large number of objects, and n = 1 representing the tail, the tail being a real relation classification result containing a small number of objects;
[symbol image FDA0003609489910000034]
denotes the object feature data of a head or tail object in the training image data in the second loss function l_ht; σ is the sigmoid function; β is a weight parameter;
the third loss function l_fore is expressed as:
[formula image FDA0003609489910000035]
wherein y'_j denotes the real relation classification result of the object feature data of the training data belonging to the foreground; j indexes the real relation classification result of the object feature data of the j-th training sample; r denotes the number of real relation classification results of the object feature data of the training data; and p'_j denotes the probability information of the relation classification result of the object feature data of the training data.
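The exact formula for l_fore is only given as an image in the claim, but the listed symbols (y'_j a real classification result, p'_j its predicted probability, r the number of relation classes) are consistent with a standard cross-entropy over foreground relations; the following sketch assumes that reading, which is an interpretation rather than the patent's stated formula.

```python
import math

def foreground_cross_entropy(y_true, p_pred):
    """Assumed form of l_fore: -sum over the r classes of y'_j * log(p'_j).

    y_true: one-hot list of length r (real relation classification result).
    p_pred: predicted class probabilities of length r.
    """
    eps = 1e-12  # numerical floor so log(0) cannot occur
    return -sum(y * math.log(p + eps) for y, p in zip(y_true, p_pred))
```

Under this assumption, a correct class predicted with probability 0.5 yields a loss of −log 0.5 ≈ 0.693.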
6. The method of claim 2, wherein the first relation data is expressed as:
L_f(o_i, o_j) = FC(g(o_i, o_j))
wherein FC(·) denotes the fully connected layer learning algorithm of the Faster R-CNN model; g(·) denotes the binary tree long short-term memory network learning algorithm; o_i denotes the object feature data of the i-th target object in the image data; and o_j denotes the object feature data of the j-th target object in the image data;
the second relation data is expressed as:
[formula image FDA0003609489910000041]
wherein
[symbol image FDA0003609489910000042]
denotes the category average feature of the i-th target object in the image data, obtained based on the first relation feature data and the second relation feature data, and
[symbol image FDA0003609489910000043]
denotes the category average feature of the j-th target object in the image data, obtained in the same way;
the category average feature is expressed as:
[formula image FDA0003609489910000044]
wherein i indexes the object feature data of the i-th target object; t denotes the number of iterations; λ denotes the update weight;
[symbol image FDA0003609489910000045]
denotes the object feature data of the target object at the t-th iteration;
[symbol image FDA0003609489910000046]
denotes the category average feature of the target object at the (t − 1)-th iteration; and
[symbol image FDA0003609489910000047]
denotes the category average feature of the target object at the t-th iteration;
the third relation data is expressed as:
[formula image FDA0003609489910000048]
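The category-average update in claim 6 is likewise only shown as an image; the listed symbols (iteration t, update weight λ, current object feature, previous category average) are consistent with an exponential-moving-average update of the form avg_t = λ·o_t + (1 − λ)·avg_{t−1}. The sketch below assumes that form, which is an interpretation for illustration.

```python
def update_class_average(prev_avg, obj_feat, lam):
    """Assumed EMA form: avg_t = lam * o_t + (1 - lam) * avg_{t-1}."""
    return [lam * o + (1.0 - lam) * a for o, a in zip(obj_feat, prev_avg)]

# repeated updates with a constant object feature pull the running average
# toward that feature, so the category average stabilizes over iterations
avg = [0.0, 0.0]
for feat in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    avg = update_class_average(avg, feat, lam=0.5)
```

With λ = 0.5 and a constant feature [1.0, 2.0], three updates from a zero initial average give [0.875, 1.75], i.e. geometric convergence toward the feature.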
7. the method of claim 1, wherein the causal association mining model comprises a Faster R-CNN model;
the performing object feature extraction on the image data to obtain object feature data includes:
obtaining a candidate region of the target object through the Faster R-CNN model based on the image data;
and extracting object features of the candidate region of the target object to obtain object feature data.
8. A scene graph generation system based on a causal association mining model is characterized by comprising:
a data acquisition module configured to acquire image data;
an extraction classification module configured to perform analysis processing on a target object in the image data based on a pre-constructed causal association mining model, the extraction classification module including:
the extraction and decomposition unit is configured to extract object features of the image data to obtain object feature data, and decompose the object feature data into category common features and object individual features;
the relation data acquisition unit is configured to obtain a category average characteristic according to the category common characteristic and the object individual characteristic, obtain first relation data based on the category common characteristic and the object individual characteristic, and obtain second relation data based on the category average characteristic;
a difference processing unit configured to perform difference processing on the first relation data and the second relation data to obtain third relation data;
the relation classification unit is configured to perform relation classification according to the first relation data and the third relation data to obtain a relation classification result, the causal association mining model outputting the relation classification result to complete the analysis processing;
and the scene graph building module is configured to build a scene graph according to the relation classification result.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
CN202210425654.3A 2022-04-22 2022-04-22 Scene graph generation method and system based on causal association mining model Active CN114842248B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210425654.3A CN114842248B (en) 2022-04-22 2022-04-22 Scene graph generation method and system based on causal association mining model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210425654.3A CN114842248B (en) 2022-04-22 2022-04-22 Scene graph generation method and system based on causal association mining model

Publications (2)

Publication Number Publication Date
CN114842248A true CN114842248A (en) 2022-08-02
CN114842248B CN114842248B (en) 2024-02-02

Family

ID=82566216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210425654.3A Active CN114842248B (en) 2022-04-22 2022-04-22 Scene graph generation method and system based on causal association mining model

Country Status (1)

Country Link
CN (1) CN114842248B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115470304A (en) * 2022-08-31 2022-12-13 北京九章云极科技有限公司 Characteristic cause and effect warehouse management method and system

Citations (6)

Publication number Priority date Publication date Assignee Title
US20160357854A1 (en) * 2013-12-20 2016-12-08 National Institute Of Information And Communications Technology Scenario generating apparatus and computer program therefor
CN109241989A (en) * 2018-07-17 2019-01-18 中国电力科学研究院有限公司 A kind of method and system of the intelligent substation intrusion scenario reduction based on space-time similarity mode
CN111950631A (en) * 2020-08-12 2020-11-17 华南师范大学 Feature deconstruction-oriented counterwork cooperation network module and counterwork cooperation method thereof
US20210004589A1 (en) * 2018-12-18 2021-01-07 Slyce Acquisition Inc. Scene and user-input context aided visual search
CN114119803A (en) * 2022-01-27 2022-03-01 浙江大学 Scene image generation method based on causal graph
CN114359568A (en) * 2022-01-17 2022-04-15 浙江大学 Multi-label scene graph generation method based on multi-granularity characteristics


Non-Patent Citations (2)

Title
H.ZHOU等: "Relationship-Aware Primal-Dual Graph Attention Network For Scene Graph Generation", 《2021 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME)》, pages 1 - 6 *
J. GU等: "Scene Graph Generation With External Knowledge and Image Reconstruction", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》, pages 1969 - 1978 *

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN115470304A (en) * 2022-08-31 2022-12-13 北京九章云极科技有限公司 Characteristic cause and effect warehouse management method and system
CN115470304B (en) * 2022-08-31 2023-08-25 北京九章云极科技有限公司 Feature causal warehouse management method and system

Also Published As

Publication number Publication date
CN114842248B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
US10891524B2 (en) Method and an apparatus for evaluating generative machine learning model
US11367271B2 (en) Similarity propagation for one-shot and few-shot image segmentation
US20180336453A1 (en) Domain specific language for generation of recurrent neural network architectures
US20220004935A1 (en) Ensemble learning for deep feature defect detection
CN110998604A (en) Identification and reconstruction of objects with local appearance
US20220044767A1 (en) Compound property analysis method, model training method, apparatuses, and storage medium
CN109344920B (en) Customer attribute prediction method, storage medium, system and device
US20230102467A1 (en) Method of detecting image, electronic device, and storage medium
CN114463825B (en) Face prediction method based on multi-mode fusion and related equipment
US11514315B2 (en) Deep neural network training method and apparatus, and computer device
CN113222700A (en) Session-based recommendation method and device
CN111667027B (en) Multi-modal image segmentation model training method, image processing method and device
CN113128671B (en) Service demand dynamic prediction method and system based on multi-mode machine learning
KR20220047228A (en) Method and apparatus for generating image classification model, electronic device, storage medium, computer program, roadside device and cloud control platform
US20230410487A1 (en) Online learning method and system for action recognition
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN114359775A (en) Key frame detection method, device, equipment, storage medium and program product
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN114842248A (en) Scene graph generation method and system based on causal association mining model
US11816185B1 (en) Multi-view image analysis using neural networks
Lin et al. Real-time foreground object segmentation networks using long and short skip connections
Wakchaure et al. A scheme of answer selection in community question answering using machine learning techniques
CN114723989A (en) Multitask learning method and device and electronic equipment
CN111767825B (en) Face attribute invariant robustness face recognition method and system
Jokela Person counter using real-time object detection and a small neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant