CN114842248B - Scene graph generation method and system based on causal association mining model - Google Patents

Scene graph generation method and system based on causal association mining model

Info

Publication number
CN114842248B
CN114842248B CN202210425654.3A
Authority
CN
China
Prior art keywords
data
training
loss function
relation
relationship
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210425654.3A
Other languages
Chinese (zh)
Other versions
CN114842248A (en)
Inventor
罗廷金
周浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210425654.3A priority Critical patent/CN114842248B/en
Publication of CN114842248A publication Critical patent/CN114842248A/en
Application granted granted Critical
Publication of CN114842248B publication Critical patent/CN114842248B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a scene graph generation method and system based on a causal association mining model. The method comprises the following steps: acquiring image data; analyzing the target object in the image data based on a pre-constructed causal association mining model to obtain a relation classification result, the causal association mining model outputting the relation classification result to complete the analysis process; and constructing a scene graph according to the relation classification result. The method can mine the intrinsic causal relationship between objects and relations, and strengthen the individual features of objects to eliminate the side effects of the category commonality features.

Description

Scene graph generation method and system based on causal association mining model
Technical Field
The application relates to the technical field of computer vision, in particular to a scene graph generation method and system based on a causal association mining model.
Background
Scene graph datasets usually exhibit a strong frequency bias. Existing scene graph generation models therefore integrate the statistical dependency between object categories and relationships into the model framework to improve the accuracy of relationship prediction; on this basis, the statistical dependency between object categories and predicate relationships is implicitly incorporated into the scene graph generation process. In this mainstream framework, the prediction of the relationship classification result depends mainly on the features of the objects and on the object categories.
Under this model framework, the object classification task reinforces the category commonality components of the object features, so that in relationship prediction the category commonality features dominate the object features and the object personality features are suppressed. Such a framework inevitably establishes an erroneous causal relation between the object category and the relationship classification result, leading to highly biased relationship predictions. Learning the correct causal relationship between objects and relations is therefore crucial to improving scene graph performance.
Disclosure of Invention
In view of the foregoing, an objective of the present application is to provide a scene graph generation method and system based on a causal association mining model, so as to solve the above technical problems.
With the above object in view, a first aspect of the present application provides a method for generating a scene graph based on a causal association mining model, including:
acquiring image data;
the analysis processing process for the target object in the image data based on the pre-constructed causal association mining model comprises the following steps:
extracting object features from the image data to obtain object feature data, and decomposing the object feature data into category commonality features and object personality features;
obtaining category average features according to the category commonality features and the object personality features, obtaining first relation data based on the category commonality features and the object personality features, and obtaining second relation data based on the category average features;
performing difference processing on the first relationship data and the second relationship data to obtain third relationship data;
performing relation classification according to the first relation data and the third relation data to obtain a relation classification result, the causal association mining model outputting the relation classification result to complete the analysis process;
and constructing a scene graph according to the relation classification result.
A second aspect of the present application provides a scene graph generation system based on a causal association mining model, comprising:
a data acquisition module configured to acquire image data;
an extraction classification module configured to analyze a target object in the image data based on a pre-constructed causal association mining model, the extraction classification module comprising:
the extraction and decomposition unit is configured to extract object features of the image data to obtain object feature data, and decompose the object feature data into category commonality features and object personality features;
the relation data acquisition unit, configured to obtain category average features according to the category commonality features and the object personality features, obtain first relation data based on the category commonality features and the object personality features, and obtain second relation data based on the category average features;
the difference processing unit, configured to perform difference processing on the first relation data and the second relation data to obtain third relation data;
the relation classification unit, configured to perform relation classification according to the first relation data and the third relation data to obtain a relation classification result, the causal association mining model outputting the relation classification result to complete the analysis process;
and the scene graph construction module is configured to construct a scene graph according to the relation classification result.
A third aspect of the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect when executing the program.
A fourth aspect of the present application provides a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect.
From the above, it can be seen that the scene graph generation method and system based on the causal association mining model provided by the present application extract object feature data from the acquired image data through a pre-constructed causal association mining network model and decompose the feature data into object personality features and category commonality features. First relation data is obtained based on the category commonality features and the object personality features of the target object; at this time, the first relation data is dominated jointly by the category commonality features and the object personality features. Second relation data is obtained based on the category average features; at this time, the second relation data is dominated only by the category commonality features of the target object. Difference processing is performed on the first relation data and the second relation data to obtain third relation data; at this time, the third relation data is dominated by the object personality features, which releases the suppression of the object personality features by the category commonality features and gives full play to the object personality features in relation classification. The relation classification result is then obtained by distinguishing semantically similar relation categories from coarse to fine granularity, so that the intrinsic causal association between the object features and the relations is mined while the object personality features are strengthened.
Drawings
In order to more clearly illustrate the technical solutions of the present application or related art, the drawings that are required to be used in the description of the embodiments or related art will be briefly described below, and it is apparent that the drawings in the following description are only embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to those of ordinary skill in the art.
FIG. 1 is a flow chart of a scene graph generation method according to an embodiment of the present application;
FIG. 2-a is a schematic illustration of causal relationships according to an embodiment of the present application;
FIG. 2-b is an exploded view of object feature data according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a scene graph generation framework according to an embodiment of the present application;
FIG. 4 is a block diagram of a scene graph generation system based on a causal association mining model according to an embodiment of the present application;
fig. 5 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail below with reference to the accompanying drawings.
It should be noted that unless otherwise defined, technical or scientific terms used in the embodiments of the present application should be given the ordinary meaning as understood by one of ordinary skill in the art to which the present application belongs. The terms "first," "second," and the like, as used in embodiments of the present application, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
The scene graph generation model in the related art generally integrates the statistical dependency between object feature data and relationship classification results into the model framework to improve the accuracy of relationship prediction; on this basis, the statistical dependency between object feature data and predicate relationships is implicitly incorporated into the scene graph generation process. In this mainstream framework, the prediction of the relationship classification result depends mainly on the object feature data. However, such a framework inevitably establishes an erroneous causal relation between the object feature data and the relationship classification result, resulting in highly biased predictions: under the current model framework, the category commonality features in the object feature data are reinforced by the classification task, so relationship prediction is dominated by the category commonality features, and, owing to the statistical dependency, these scene graph generation methods take the category commonality features as the basis of relationship classification and therefore cannot predict the correct relationships from richer semantics.
The embodiment of the application provides a scene graph generation method based on a causal association mining model. The extracted feature data is decomposed by a pre-constructed causal association mining network model. The influence of the category commonality features and the object personality features on relation classification is measured by a fact branch model to obtain first relation data, while the influence of the category commonality features alone is measured by a counterfactual branch model to obtain second relation data, so that more essential relation features are extracted, the contribution of the object personality features is enhanced, and the adverse effect of the category commonality features is counteracted. Difference processing is performed on the first relation data and the second relation data to obtain third relation data, so that the influence of the object personality features on relation classification is learned from the difference between the two predictions, thereby mining the causal features between the object feature data of the target object and the relation. The relation category is then predicted by a classifier trained with a multi-level bias-reducing loss function to obtain the relation classification result, distinguishing semantically similar relation categories from coarse to fine granularity so that the classifier learns the correspondence between more essential predicate features and predicate labels. The causal association mining model outputs the relation classification result, and finally the scene graph is constructed according to the relation classification result.
As shown in fig. 1, the method of the present embodiment includes:
Step 101, image data is acquired.
In this step, the acquired image data is image data for which a scene graph is to be generated.
Step 102, analyzing and processing the target object in the image data based on a pre-constructed causal association mining model to obtain a relation classification result, and outputting the relation classification result by the causal association mining model to complete the analysis process.
In this step, a causal association mining network model (CAE-Net) is used to analyze the target objects in the image data and to establish the correct causal association between the interaction state of the target objects and the relationship classification result, so that the relationship classification result is predicted correctly according to richer semantics and highly biased predictions are avoided.
Step 102 comprises:
and 1021, extracting object features of the image data to obtain object feature data, and decomposing the object feature data into category commonality features and object personality features.
In this step, the object feature data refers to the features or characteristics of a certain type of object in the image data that distinguish it from other types of object, or a set of such features and characteristics. Feature extraction is performed through the Faster R-CNN framework to obtain the object feature data.
Faster R-CNN integrates basic feature extraction, region proposal, bounding-box regression and classification into one network, which greatly improves overall performance and further speeds up the extraction of object features from image data.
The object feature data is decomposed into category commonality features and object personality features. As shown in fig. 2-a, the object feature data O of the target object includes two parts: the category commonality feature O_g and the object personality feature O_s; R describes the relationship between two target objects. As shown in fig. 2-b, the category commonality features mainly express common structural or appearance characteristics among objects of the same class; for example, the category commonality features of a dog include information such as Nose, Ears, Tail and Four legs. The object personality features of a target object mainly describe the characteristics that distinguish it from other target objects; for example, the object personality features of a dog include information such as Straight front legs, Crouched hind legs and Open mouth.
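By way of illustration only, the decomposition step can be sketched in code. The following is a minimal sketch assuming the two feature components are produced by two learned linear projection heads over the detector's object feature; the module, dimension and parameter names are our own and are not taken from the patent.

```python
# Hypothetical sketch of the decomposition O -> (O_g, O_s): two learned
# projection heads over a detector object feature. Names are illustrative.
import torch
import torch.nn as nn

class FeatureDecomposer(nn.Module):
    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        self.commonality_head = nn.Linear(feat_dim, feat_dim)  # -> O_g
        self.personality_head = nn.Linear(feat_dim, feat_dim)  # -> O_s

    def forward(self, obj_feat: torch.Tensor):
        o_g = self.commonality_head(obj_feat)  # category commonality feature
        o_s = self.personality_head(obj_feat)  # object personality feature
        return o_g, o_s

# Usage: one feature vector per detected object.
decomposer = FeatureDecomposer(1024)
o_g, o_s = decomposer(torch.randn(8, 1024))  # 8 detected objects
```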
Step 1022, obtaining a class average feature according to the class commonality feature and the object individuality feature, obtaining first relation data based on the class commonality feature and the object individuality feature, and obtaining second relation data based on the class average feature.
In this step, the category commonality features can learn the statistical knowledge of categories and their natural interdependence in the dataset, so as to reduce the candidate set and narrow the semantic space, while the object personality features can reflect the real interaction between object pairs, helping the classifier make more accurate decisions at the causal level. The first relation data is obtained based on the category commonality features and the object personality features, while the second relation data is obtained based on the category average features. Processing the object personality features and the category commonality features separately avoids both the spurious association between the category commonality features and relation classification and the suppression of the object personality features' contribution, preserving the advantages of the category commonality features while strengthening the object personality features.
Step 1023, performing difference processing on the first relationship data and the second relationship data to obtain third relationship data.
In this step, the first relation data represents the influence of both the category commonality features and the object personality features on relation classification, while the second relation data represents the influence of the category commonality features alone. The influence of the object personality features on relation classification is learned from the difference between these two influences, thereby mining the causal features between the object feature data of the target object and the relation.
Step 1024, performing relation classification according to the first relation data and the third relation data to obtain a relation classification result, and outputting the relation classification result by the causal association mining model to complete the analysis process.
In this step, relation classification is performed according to the first relation data and the third relation data to obtain the relation classification result, balancing the category commonality features and the object personality features, so as to fully exploit the target object interaction information in the object personality features while keeping the advantages brought by the category commonality features.
Step 103, constructing a scene graph according to the relation classification result.
In this step, the scene graph is constructed according to the correct relation classification result with richer semantics, improving the performance of the scene graph.
According to the above scheme, the extracted feature data is decomposed into object personality features and category commonality features by the pre-constructed causal association mining network model. The category commonality features can learn the statistical knowledge of categories and their natural interdependence in the dataset, so as to reduce the candidate set and narrow the semantic space, while the object personality features can reflect the real interaction between object pairs, helping the classifier make more accurate decisions at the causal level. The first relation data is obtained based on the category commonality features and the object personality features, while the second relation data is obtained based on the category average features; processing the two kinds of features separately avoids both the spurious association between the category commonality features and relation classification and the suppression of the object personality features' contribution, preserving the advantages of the category commonality features while strengthening the object personality features. Difference processing is performed on the first relation data and the second relation data to obtain third relation data, and the influence of the object personality features on relation classification is learned from the difference between the two influences, thereby mining the causal features between the object feature data of the target object and the relation. Relation classification is then performed according to the first relation data and the third relation data to obtain the relation classification result, balancing the category commonality features and the object personality features so as to fully exploit the target object interaction information in the personality features while keeping the advantages of the commonality features. The causal association mining model outputs the relation classification result, and finally the scene graph is constructed according to this semantically richer, correct relation classification result, improving the performance of the scene graph.
In some embodiments, the causal association mining model includes a fact branch model and a counterfactual branch model;
Step 1022 includes:
inputting the category commonality features and the object personality features into the fact branch model, processing them through the fact branch model, and outputting the first relation data;
and simultaneously, inputting the category average feature into the counterfactual branch model, and outputting the second relation data through the counterfactual branch model.
In this scheme, the influence of the category commonality features and the object personality features on relation classification is measured by the fact branch model, while the influence of the category commonality features alone is measured by the counterfactual branch model. Processing the object personality features and the category commonality features separately releases the suppression of the object personality features by the category commonality features, gives full play to the object personality features in relation classification, avoids the spurious association between the category commonality features and relation classification, and preserves the advantages of the category commonality features while strengthening the object personality features.
In some embodiments, the causal association mining model includes a classifier;
Step 1024 includes:
and inputting the first relation data and the third relation data into a trained classifier, classifying the relation by the classifier, and outputting the relation classification result.
In the above scheme, there are many semantic overlaps between the relationship classification results, such as "on" (at a certain position), "parking on" (stopped at a certain position), "standing on" (standing at a certain position), and "walking on" (walking at a certain position). Although they are semantically similar at coarse granularity, they differ slightly at fine granularity. The trained classifier learns a hierarchical relationship classification that distinguishes semantically similar relation categories from coarse to fine granularity, so that the correspondence between more essential predicate features and predicate labels can be distinguished, realizing the causal association between object feature data and relations.
In some embodiments, the obtaining of the classifier includes:
acquiring training data and constructing a softmax regression pre-training model;
determining a first loss function l_bf based on the training data;
determining a second loss function l_ht based on the first loss function l_bf and the training data;
determining a third loss function l_fore according to the first loss function l_bf, the second loss function l_ht and the training data;
determining a multi-level bias-reducing loss function according to the first loss function l_bf, the second loss function l_ht and the third loss function l_fore,
wherein the multi-level bias-reducing loss function l_MHD is expressed as follows:

l_MHD = l_bf + α·l_ht + (1 − α)·l_fore

where α represents the weighting coefficient of the multi-level bias-reducing loss function;
and inputting the training data into the softmax regression pre-training model, minimizing the multi-level bias-reducing loss function, and continuously training and adjusting the softmax regression pre-training model to obtain a trained softmax regression pre-training model, which is taken as the classifier.
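By way of a minimal sketch (Python/PyTorch, with names of our own choosing), the weighted combination above and its use in a training step could look as follows; the three component losses are assumed to be computed elsewhere:

```python
# Sketch of the multi-level bias-reducing (MHD) loss
# l_MHD = l_bf + α·l_ht + (1 − α)·l_fore, with alpha as the weight.
import torch

def mhd_loss(l_bf: torch.Tensor, l_ht: torch.Tensor,
             l_fore: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    return l_bf + alpha * l_ht + (1.0 - alpha) * l_fore

# Schematic training step minimizing l_MHD over the classifier parameters:
#   optimizer.zero_grad()
#   loss = mhd_loss(l_bf, l_ht, l_fore)
#   loss.backward()
#   optimizer.step()
```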
In the above scheme, the distribution of relationship classification results in scene graph construction is doubly unbalanced. On the one hand, background annotations clearly outnumber foreground annotations, where the foreground refers to target objects in the image data and the background refers to objects other than the target objects. On the other hand, unlike the imbalance faced by other tasks, there is much semantic overlap between relationship categories in scene graph construction. For example, "on" (at a certain position), "parking on" (stopped at a certain position), "standing on" (standing at a certain position) and "walking on" (walking at a certain position) are similar in coarse-grained semantics but differ slightly in fine-grained semantics. Generally, coarse-grained relationship classification results are concentrated in the head with frequent samples, where the head refers to relationship classification results to which a greater number of target objects belong, while fine-grained relationship classification results are distributed in the less numerous tail, where the tail refers to relationship classification results to which fewer target objects belong; the information of the tail is therefore typically suppressed by the background and the head foreground.
Therefore, through the multi-level bias-reducing loss function (MHD loss), the softmax regression pre-training model can learn hierarchical relationship classification during training, distinguishing semantically similar relationship classes from coarse to fine granularity, so that the correspondence between more essential predicate features and predicate labels is learned by the softmax regression pre-training model. The multi-level bias-reducing loss function neither gives the tail a higher priority when computing the loss and gradient, nor introduces any design that could harm the head feature expression. Under training on doubly unbalanced data, the classifier can effectively distinguish the differences between background and foreground and between head and tail through the multi-level bias-reducing loss function, classify relations hierarchically from coarse to fine granularity according to semantics, and finally realize, in the learning of the classifier, a more accurate causal association between object feature data and relations.
The first loss function l_bf is a binary loss function distinguishing the first and/or third relation data as foreground or background; the second loss function l_ht distinguishes the first and/or third relation data as a head or tail class within the foreground relations; and the third loss function l_fore is a multi-class loss function distinguishing all foreground classes.
In some embodiments, the training data comprises object feature data of objects in training image data and the true relationship classification results of that object feature data, wherein the objects in the training data comprise target training objects and non-target training objects;
the first loss function l bf The expression is as follows:
wherein,represented as the first loss function l bf The object feature data of the object in the training image data is foreground or background, the foreground is the target training object, and the background is the non-target training object; n is 1 or 0, n is 0 representing the background, n is 1 representing the foreground; />Represented as the first loss function l bf Object feature data of an object in the training image data of the foreground or background; sigma is a sigmoid function;
the second loss function l ht The expression is as follows:
wherein,represented as l in the second loss function ht The true relationship classification result of the object feature data of the training data; n is 1 or 0, n is 0 and represents a head, the head represents a real relation classification result with a large number of objects, n is 1 and represents a tail, and the tail represents a real relation classification result with a small number of objects; / >Represented as l in the second loss function ht Object feature data of an object in the training image data of the head or tail; sigma is a sigmoid function; beta is a weight parameter;
the third loss function l fore The expression is as follows:
wherein y' j A true relationship classification result of object feature data of the training data expressed as a foreground; j represents the true relationship classification result of the object feature data of the jth training data; r is expressed as the number of true relationship classification results of the object feature data of the training data; p's' j Probability information expressed as a result of the relational classification of the object feature data of the training data.
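As an illustrative sketch only, and assuming the component losses take the cross-entropy forms reconstructed above (the formulas in the filing are rendered as images), they could be computed as follows; all function names are ours:

```python
# Sketch of the component losses, assuming sigmoid binary cross-entropy
# forms for l_bf / l_ht and a plain cross-entropy for l_fore.
import torch

def binary_loss(x2: torch.Tensor, y2: torch.Tensor, beta=None) -> torch.Tensor:
    # -Σ_n β^n · y^n · log σ(x^n); beta defaults to uniform weights (for l_bf).
    w = beta if beta is not None else torch.ones_like(y2)
    return -(w * y2 * torch.log(torch.sigmoid(x2))).sum()

def fore_loss(p: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # l_fore = -Σ_j y'_j · log p'_j over the foreground classes.
    return -(y * torch.log(p)).sum()

# l_bf / l_ht use converted 2-element logit/label pairs (x_bf, y_bf) and
# (x_ht, y_ht); l_fore uses the softmax probabilities p and one-hot y.
```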
In the above scheme, the object feature data of objects in the training image data is first classified as foreground or background by the first loss function; the foreground object feature data is then classified as head or tail by the second loss function; and finally, all foreground object feature data is relation-classified by the third loss function.
The training data consists of the object feature data x = (x_0, x_1, ..., x_R) of the objects in the foreground training image data, where R is the number of object feature data, together with the true relationship classification result in the form of a one-hot vector, y = (y_0, y_1, ..., y_R).
The probability information of the relationship classification result of the object feature data of the training data, p = (p_0, p_1, ..., p_R) = softmax(x), is obtained from the object feature data of the objects in the foreground training image data.
For the first loss function l_bf, the conversion is as follows: if the predicted category of the original object feature data is the background, the background item x_bf^0 is set to the background element x_0 and the foreground item x_bf^1 is set to the average of all foreground elements of the object feature data; on the contrary, the foreground item x_bf^1 is set to the element x_i of the corresponding foreground category and the background item x_bf^0 is set to the background element x_0. When y_0 = 1, y_bf = (1, 0); otherwise y_bf = (0, 1).
For the second loss function l_ht, the true relationship classification result y_ht is converted according to whether the relationship class belongs to the head or the tail, where head denotes the head, tail denotes the tail, and the assignment follows the true relationship classification results obtained from the probability information p = (p_0, p_1, ..., p_R) = softmax(x) of the relationship classification result of the object feature data of the training data.
The object feature data of objects in the training image data is converted for head-foreground and tail-foreground relationships as follows, where m denotes the number of head relationship classification results and n denotes the number of tail relationship classification results. When the predicted relationship classification result of the original probability distribution is a head class, y_ht is set to (1, 0); the head item of the corresponding x_ht is set to the element x_i of the corresponding category in the original relation feature data, and the tail item is set to the average of all tail elements in the original relation feature data. When the predicted category of the original probability distribution is a tail class, y_ht is set to (0, 1); the tail item of x_ht is set to the element x_i of the corresponding category in the original relation feature data, and the head item is set to the average of all head elements in the original relation feature data.
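For illustration, the head/tail conversion just described might be sketched as follows; the background/foreground conversion for l_bf follows the same pattern, and `head_idx`/`tail_idx` together with every other name here are our own assumptions:

```python
# Sketch of the head/tail conversion: collapse the R-way relation logits x
# into a 2-element pair x_ht and derive the 2-element label y_ht.
import torch

def build_ht(x: torch.Tensor, y_true: int, head_idx: list, tail_idx: list):
    pred = int(torch.argmax(x))                    # predicted relation class
    head, tail = torch.tensor(head_idx), torch.tensor(tail_idx)
    if pred in head_idx:
        # head item: logit of the predicted class; tail item: mean of tail logits
        x_ht = torch.stack([x[pred], x[tail].mean()])
    else:
        # tail item: logit of the predicted class; head item: mean of head logits
        x_ht = torch.stack([x[head].mean(), x[pred]])
    y_ht = torch.tensor([1.0, 0.0]) if y_true in head_idx else torch.tensor([0.0, 1.0])
    return x_ht, y_ht

# Example: relation classes 0-2 form the head, classes 3-4 the tail.
x_ht, y_ht = build_ht(torch.randn(5), y_true=3, head_idx=[0, 1, 2], tail_idx=[3, 4])
```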
In some embodiments, the first relationship data is expressed as:
L_f(o_i, o_j) = FC(g(o_i, o_j))

where FC(·) denotes the fully-connected layer learning algorithm of the Faster R-CNN model; g(·) denotes the binary tree long short-term memory network learning algorithm; o_i is the object feature data of the i-th target object in the image data; and o_j is the object feature data of the j-th target object in the image data.
The second relation data is expressed as:

L_cf(ō_i, ō_j) = FC(g(ō_i, ō_j))

where ō_i and ō_j denote the category average features of the i-th and j-th target objects in the image data, from which the second relation data is obtained.
The category average feature is expressed as:

ō_i^t = λ·o_i^t + (1 − λ)·ō_i^(t−1)

where i indexes the object feature data of the i-th target object; t is the number of iterations; λ is the update weight; o_i^t is the object feature data of the target object at the t-th iteration; ō_i^(t−1) is the category average feature of the target object at the (t−1)-th iteration; and ō_i^t is the category average feature of the target object at the t-th iteration.
The third relation data is expressed as:

L_sp(o_i, o_j) = L_f(o_i, o_j) − L_cf(ō_i, ō_j)

In this scheme, the binary tree long short-term memory network algorithm used for the first relation data is the BiTreeLSTM network algorithm, and the first relation data, which jointly combines the category commonality features and the object personality features, is generated by the fact branch model.
To generate second relation data affected only by the category commonality features, the counterfactual branch model takes as input the category average feature vector accumulated during training, which is independent of the current input image: the category average feature does not belong to any object of the current scene, conflicts with what really exists, and is therefore counterfactual.
The causality between the feature data and the relationships is implicit in the object personality features, as it involves interactions between the objects. The first relation data of the fact branch model is the result of the joint influence of the category commonality features and the object personality features, while the second relation data of the counterfactual branch model is dominated mainly by the category commonality features. Thus, by comparing the first relation data L_f(o_i, o_j) output by the fact branch model with the second relation data L_cf(ō_i, ō_j) output by the counterfactual branch model, the influence of the object personality features on relation classification, L_sp(o_i, o_j), can be evaluated. Finally, the causal association mining model combines the first relation data L_f(o_i, o_j) and the personality influence L_sp(o_i, o_j) to perform relation prediction, preserving the advantages of the category commonality features while strengthening the object personality features: the commonality features learn the statistical knowledge of categories and their natural interdependence in the dataset, reducing the candidate set and narrowing the semantic space, while the personality features reflect the real interaction between object pairs and support more accurate causal-level decisions in classification.
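Putting the two branches together, a schematic sketch might look as follows. The joint embedding g(·) is a BiTreeLSTM in the patent; a simple concatenation-plus-MLP placeholder is substituted here, and the moving-average form of the category-average update is our assumption:

```python
# Schematic of the fact / counterfactual branches and their difference.
import torch
import torch.nn as nn

class CausalBranches(nn.Module):
    def __init__(self, feat_dim: int, num_rel: int, lam: float = 0.5):
        super().__init__()
        # Placeholder joint embedding standing in for the BiTreeLSTM g(.)
        self.joint = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        self.fc = nn.Linear(feat_dim, num_rel)  # FC(.) producing relation logits
        self.lam = lam                          # λ, the update weight

    def update_class_mean(self, mean_prev: torch.Tensor, o_t: torch.Tensor):
        # Assumed form: ō^t = λ·o^t + (1 − λ)·ō^(t−1)
        return self.lam * o_t + (1 - self.lam) * mean_prev

    def forward(self, o_i, o_j, mean_i, mean_j):
        l_f = self.fc(self.joint(torch.cat([o_i, o_j], -1)))         # first relation data
        l_cf = self.fc(self.joint(torch.cat([mean_i, mean_j], -1)))  # second relation data
        l_sp = l_f - l_cf                                            # third relation data
        return l_f, l_cf, l_sp
```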
In some embodiments, the causal association mining model comprises a Faster R-CNN model;
Step 1021 comprises:
obtaining a candidate region of the target object through the Faster R-CNN model based on the image data;
and extracting object characteristics from the candidate region of the target object to obtain object characteristic data.
In the above scheme, the target detector in the Faster R-CNN model obtains the candidate region of the target object from the image data together with the corresponding position coordinates.
Target detection means combining the segmentation and recognition of targets into one whole. Structurally, Faster R-CNN mainly comprises a convolution layer, an RPN layer (Region Proposal Network), an RoI mapping layer (Region of Interest) and a classification-regression layer. The convolution layer mainly extracts the feature map of the whole image data; its structure includes convolution, activation-function and pooling operations.
The RPN layer can quickly and more efficiently utilize the convolutional neural network: it generates anchor points or anchor boxes when generating candidate regions of key objects, judges through a discriminant function whether these anchors belong to the foreground or the background, and then performs a first adjustment through bounding-box regression to obtain accurate candidate regions of key objects.
The RoI mapping layer is added mainly to solve the problem that the feature maps finally fed to the fully-connected layer differ in size; a fixed size is obtained through up-sampling.
Finally, the classification layer and the regression layer respectively judge which category the object belongs to and fine-tune the position of the candidate region of the target object.
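For orientation only, candidate regions can be obtained with an off-the-shelf Faster R-CNN such as the torchvision implementation, used here as a stand-in for the detector described above; this is not the patent's own detector configuration:

```python
# Sketch: candidate regions (boxes with position coordinates) from a
# pretrained torchvision Faster R-CNN, standing in for the detector.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 600, 800)          # placeholder input image
with torch.no_grad():
    det = model([image])[0]              # dict with "boxes", "labels", "scores"

boxes = det["boxes"]                     # candidate regions of target objects
# Per-region features would then be pooled (e.g. via RoIAlign) and fed to
# the feature decomposition step sketched earlier.
```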
In some embodiments, as shown in fig. 3, the causal association mining network model (CAE-Net model) performs feature extraction on the target objects in the image data (here a person and an animal) through the Faster R-CNN model to obtain the object features, and decomposes them into category commonality features and object personality features. The object features O (category commonality features and object personality features) are input into the fact branch model for joint feature embedding, which yields the relation logits on O, L (i.e., the first relation data), reflecting the joint influence of the category commonality and object personality features. At the same time, feature statistics are performed: the category average feature is obtained based on the category commonality features and the object personality features and input into the counterfactual branch model for joint feature embedding, which yields the relation logits on O_g, L_cf (i.e., the second relation data), reflecting the influence of the category commonality features alone. The difference between L (the first relation data) and L_cf (the second relation data) yields the relation logits on O_s, L_sp (i.e., the third relation data), reflecting the influence of the object personality features. L (the first relation data) and L_sp (the third relation data) are then input into the classifier trained with the MHD loss (the multi-level bias-reducing loss function), which outputs the relation classification result, and the scene graph is constructed based on this relation classification result.
It should be noted that, the method of the embodiments of the present application may be performed by a single device, for example, a computer or a server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the methods of embodiments of the present application, and the devices may interact with each other to complete the methods.
It should be noted that some embodiments of the present application are described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same inventive concept, the application also provides a scene graph generation system based on a causal association mining model, which corresponds to the method of any of the above embodiments.
Referring to fig. 4, the scene graph generation system based on the causal association mining model includes:
a data acquisition module 401 configured to acquire image data;
an extraction classification module 402 configured to analyze a target object in the image data based on a pre-constructed causal link mining model, the extraction classification module comprising:
an extraction and decomposition unit 4021 configured to perform object feature extraction on the image data to obtain object feature data, and decompose the object feature data into category commonality features and object personality features;
a relationship data obtaining unit 4022 configured to obtain a category average feature according to the category commonality feature and the object individuality feature, obtain first relationship data based on the category commonality feature and the object individuality feature, and obtain second relationship data based on the category average feature;
a difference processing unit 4023 configured to perform difference processing on the first relationship data and the second relationship data to obtain third relationship data;
a relation classification unit 4024 configured to perform relation classification according to the first relation data and the third relation data to obtain a relation classification result, the causal association mining model outputting the relation classification result to complete the analysis process;
a scene graph construction module 403 configured to construct a scene graph according to the relationship classification result.
In some embodiments, the causal association mining model includes a fact branch model and a counterfactual branch model;
the relationship data acquisition unit 4022 is specifically configured to:
inputting the category commonality features and the object personality features into the fact branch model, processing them through the fact branch model, and outputting the first relation data;
and simultaneously, inputting the category average feature into the counterfactual branch model, and outputting the second relation data through the counterfactual branch model.
In some embodiments, the causal association mining model includes a classifier;
the relationship classifying unit 4024 is specifically configured to:
and inputting the first relation data and the third relation data into a trained classifier, classifying the relation by the classifier, and outputting the relation classification result.
In some embodiments, the obtaining of the classifier includes:
acquiring training data and constructing a softmax regression pre-training model;
determining a first loss function l_bf based on the training data;
determining a second loss function l_ht based on the first loss function l_bf and the training data;
determining a third loss function l_fore according to the first loss function l_bf, the second loss function l_ht and the training data;
determining a multi-level bias-reducing loss function according to the first loss function l_bf, the second loss function l_ht and the third loss function l_fore,
wherein the multi-level bias-reducing loss function l_MHD is expressed as follows:

l_MHD = l_bf + α·l_ht + (1 − α)·l_fore

where α represents the weighting coefficient of the multi-level bias-reducing loss function;
and inputting the training data into the softmax regression pre-training model, minimizing the multi-level bias-reducing loss function, and continuously training and adjusting the softmax regression pre-training model to obtain a trained softmax regression pre-training model, which is taken as the classifier.
In some embodiments, the training data comprises object feature data of objects in training image data and the true relationship classification results of that object feature data, wherein the objects in the training data comprise target training objects and non-target training objects;
The first loss function l_bf is expressed as follows:

l_bf = − Σ_n y_bf^n · log σ(x_bf^n)

where y_bf^n indicates whether the object feature data of the object in the training image data is foreground or background, the foreground being the target training object and the background being a non-target training object; n is 1 or 0, with n = 0 representing the background and n = 1 representing the foreground; x_bf^n is the foreground or background element of the converted object feature data of the object in the training image data; and σ is the sigmoid function.
The second loss function l_ht is expressed as follows:

l_ht = − Σ_n β^n · y_ht^n · log σ(x_ht^n)

where y_ht^n is the converted true relationship classification result of the object feature data of the training data; n is 1 or 0, with n = 0 representing the head (true relationship classification results to which a large number of objects belong) and n = 1 representing the tail (true relationship classification results to which a small number of objects belong); x_ht^n is the head or tail element of the converted object feature data of the object in the training image data; σ is the sigmoid function; and β is a weight parameter (β^n denoting its value for the head or tail term).
The third loss function l_fore is expressed as follows:

l_fore = − Σ_j y'_j · log p'_j

where y'_j is the true relationship classification result of the object feature data of the training data expressed as foreground; j indexes the true relationship classification result of the object feature data of the j-th training sample; R is the number of true relationship classification results of the object feature data of the training data; and p'_j is the probability information of the relationship classification result of the object feature data of the training data.
In some embodiments, the first relationship data is expressed as:
L_f(o_i, o_j) = FC(g(o_i, o_j))

where FC(·) denotes the fully-connected layer learning algorithm of the Faster R-CNN model; g(·) denotes the binary tree long short-term memory network learning algorithm; o_i is the object feature data of the i-th target object in the image data; and o_j is the object feature data of the j-th target object in the image data.
The second relation data is expressed as:

L_cf(ō_i, ō_j) = FC(g(ō_i, ō_j))

where ō_i and ō_j denote the category average features of the i-th and j-th target objects in the image data, from which the second relation data is obtained.
The category average feature is expressed as:

ō_i^t = λ·o_i^t + (1 − λ)·ō_i^(t−1)

where i indexes the object feature data of the i-th target object; t is the number of iterations; λ is the update weight; o_i^t is the object feature data of the target object at the t-th iteration; ō_i^(t−1) is the category average feature of the target object at the (t−1)-th iteration; and ō_i^t is the category average feature of the target object at the t-th iteration.
The third relation data is expressed as:

L_sp(o_i, o_j) = L_f(o_i, o_j) − L_cf(ō_i, ō_j)
in some embodiments, the causal link mining model comprises a FaterR-CNN model;
the extraction and decomposition unit 4021 is specifically configured to:
the extracting the object feature of the image data to obtain object feature data includes:
obtaining a candidate region of the target object through the Faster R-CNN model based on the image data;
and extracting object characteristics from the candidate region of the target object to obtain object characteristic data.
For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of each module may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.
The apparatus of the foregoing embodiment is configured to implement the corresponding scene graph generation method based on the causal association mining model in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Based on the same inventive concept, and corresponding to the method of any of the above embodiments, the application also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the scene graph generation method based on the causal association mining model according to any of the above embodiments when executing the program.
Fig. 5 shows a more specific hardware architecture of an electronic device according to this embodiment, where the device may include: a processor 501, a memory 502, an input/output interface 503, a communication interface 504, and a bus 505. Wherein the processor 501, the memory 502, the input/output interface 503 and the communication interface 504 enable a communication connection between each other inside the device via the bus 505.
The processor 501 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.

The memory 502 may be implemented in the form of ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 502 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present specification are implemented in software or firmware, the relevant program code is stored in the memory 502 and invoked by the processor 501 for execution.
The input/output interface 503 is used to connect with an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
The communication interface 504 is used to connect a communication module (not shown in the figure) to enable communication interaction between this device and other devices. The communication module may communicate in a wired manner (e.g., USB, network cable) or in a wireless manner (e.g., mobile network, Wi-Fi, Bluetooth).
Bus 505 includes a path to transfer information between elements of the device (e.g., processor 501, memory 502, input/output interface 503, and communication interface 504).
It should be noted that, although the above device only shows the processor 501, the memory 502, the input/output interface 503, the communication interface 504, and the bus 505, in the implementation, the device may further include other components necessary for achieving normal operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The electronic device of the foregoing embodiment is configured to implement the corresponding scene graph generation method based on the causal association mining model in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Based on the same inventive concept, and corresponding to any of the above method embodiments, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the scene graph generation method based on the causal association mining model according to any of the above embodiments.
The computer-readable media of the present embodiments include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The storage medium of the foregoing embodiment stores computer instructions for causing the computer to perform the scene graph generation method based on the causal association mining model according to any one of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the application (including the claims) is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the present application, the steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the embodiments of the present application. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform on which the embodiments of the present application are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, the embodiments discussed may be used with other memory architectures (e.g., dynamic RAM (DRAM)).

The present embodiments are intended to embrace all such alternatives, modifications, and variations as fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalents, improvements, and the like that are within the spirit and principles of the embodiments are intended to be included within the scope of the present application.

Claims (6)

1. A scene graph generation method based on a causal association mining model, comprising:
acquiring image data;
performing, based on a pre-constructed causal association mining model, an analysis process on a target object in the image data, the analysis process comprising the following steps:
extracting object features of the target object in the image data to obtain object feature data, and decomposing the object feature data into category commonality features and object individuality features;
obtaining class average features according to the category commonality features and the object individuality features, obtaining first relationship data based on the category commonality features and the object individuality features, and obtaining second relationship data based on the class average features, wherein the causal association mining model comprises a fact branch model and a counterfactual branch model, and obtaining the first relationship data and the second relationship data comprises: inputting the category commonality features and the object individuality features into the fact branch model, processing them through the fact branch model, and outputting the first relationship data; and, at the same time, inputting the class average features into the counterfactual branch model and outputting the second relationship data through the counterfactual branch model;
Performing difference processing on the first relationship data and the second relationship data to obtain third relationship data;
performing relationship classification according to the first relationship data and the third relationship data to obtain a relationship classification result, the causal association mining model outputting the relationship classification result to complete the analysis process;
constructing a scene graph according to the relationship classification result;
the causal association mining model comprises a classifier;
the performing relationship classification according to the first relationship data and the third relationship data to obtain a relationship classification result, including:
inputting the first relationship data and the third relationship data into a trained classifier, performing relationship classification through the classifier, and outputting the relationship classification result;
the obtaining process of the classifier comprises the following steps:
acquiring training data and constructing a softmax regression pre-training model;
determining a first loss function l_bf based on the training data;

determining a second loss function l_ht based on the first loss function l_bf and the training data;

determining a third loss function l_fore according to the first loss function l_bf, the second loss function l_ht, and the training data;

determining a multi-level de-biasing loss function according to the first loss function l_bf, the second loss function l_ht, and the third loss function l_fore,
wherein the multi-level de-biasing loss function l_MHD is expressed as follows:

l_MHD = l_bf + α·l_ht + (1 − α)·l_fore

where α denotes the weighting coefficient of the multi-level de-biasing loss function;
inputting the training data into the softmax regression pre-training model, performing minimization based on the multi-level de-biasing loss function, and continuously training and adjusting the softmax regression pre-training model to obtain a trained softmax regression pre-training model, the trained softmax regression pre-training model being used as the classifier;
wherein the training data comprises object feature data of objects in training image data and true relationship classification results of the object feature data of the training data, and the objects in the training data comprise target training objects and non-target training objects;
the first loss function l_bf is expressed as follows:

l_bf = −Σ_{n=0}^{1} ŷ^{(n)}·log σ(x̂^{(n)})

where ŷ^{(n)} denotes, in the first loss function l_bf, the label indicating whether the object feature data of an object in the training image data is foreground or background, the foreground being a target training object and the background being a non-target training object; n is 1 or 0, n = 0 denoting the background and n = 1 denoting the foreground; x̂^{(n)} denotes, in the first loss function l_bf, the object feature data of an object of the foreground or background in the training image data; and σ is the sigmoid function;
the second loss function l_ht is expressed as follows:

l_ht = −Σ_{n=0}^{1} β^n·ŷ^{(n)}·log σ(x̂^{(n)})

where ŷ^{(n)} denotes, in the second loss function l_ht, the true relationship classification result of the object feature data of the training data; n is 1 or 0, n = 0 denoting the head, the head being the true relationship classes with a large number of instances, and n = 1 denoting the tail, the tail being the true relationship classes with a small number of instances; x̂^{(n)} denotes, in the second loss function l_ht, the object feature data of an object of the head or tail in the training image data; σ is the sigmoid function; and β is a weight parameter;
the third loss function l_fore is expressed as follows:

l_fore = −Σ_{j=1}^{r} y'_j·log p'_j

where y'_j denotes the true relationship classification label of the object feature data of the training data belonging to the foreground; j indexes the true relationship classification results of the object feature data of the training data; r denotes the number of true relationship classes of the object feature data of the training data; and p'_j denotes the predicted probability of the j-th relationship class for the object feature data of the training data.
2. The method of claim 1, wherein the first relationship data is expressed as:

L_f(o_i, o_j) = FC(g(o_i, o_j))
where FC(·) denotes the fully connected layer learning algorithm of the Faster R-CNN model; g(·) denotes the binary-tree long short-term memory network learning algorithm; o_i denotes the object feature data of the i-th target object in the image data; and o_j denotes the object feature data of the j-th target object in the image data;
the second relationship data is expressed as:

L_cf(ō_i, ō_j) = FC(g(ō_i, ō_j))

where ō_i and ō_j denote the class average features of the i-th and j-th target objects in the image data, obtained based on the first relationship data and the second relationship data;
the class average feature is expressed as:

ō_i^t = λ·o_i^t + (1 − λ)·ō_i^{t−1}

where i indexes the object feature data of the i-th target object; t denotes the iteration number; λ denotes the update weight; o_i^t denotes the object feature data of the target object at the t-th iteration; ō_i^{t−1} denotes the class average feature of the target object at the (t−1)-th iteration; and ō_i^t denotes the class average feature of the target object at the t-th iteration;
the third relationship data is expressed as:

L_d(o_i, o_j) = L_f(o_i, o_j) − L_cf(ō_i, ō_j).
3. The method of claim 1, wherein the causal association mining model comprises a Faster R-CNN model;
extracting the object features of the image data to obtain the object feature data comprises:

obtaining candidate regions of the target object through the Faster R-CNN model based on the image data;

and extracting object features from the candidate regions of the target object to obtain the object feature data.
4. A scene graph generation system based on a causal association mining model, comprising:
a data acquisition module configured to acquire image data;
an extraction and classification module configured to analyze a target object in the image data based on a pre-constructed causal association mining model, the extraction and classification module comprising:
an extraction and decomposition unit configured to extract object features of the target object in the image data to obtain object feature data, and to decompose the object feature data into category commonality features and object individuality features;
a relationship data obtaining unit configured to obtain class average features according to the category commonality features and the object individuality features, obtain first relationship data based on the category commonality features and the object individuality features, and obtain second relationship data based on the class average features, wherein the causal association mining model comprises a fact branch model and a counterfactual branch model, and the unit is specifically configured to input the category commonality features and the object individuality features into the fact branch model, process them through the fact branch model, and output the first relationship data, and, at the same time, input the class average features into the counterfactual branch model and output the second relationship data through the counterfactual branch model;
a difference processing unit configured to perform difference processing on the first relationship data and the second relationship data to obtain third relationship data;
a relationship classification unit configured to perform relationship classification according to the first relationship data and the third relationship data to obtain a relationship classification result, the causal association mining model outputting the relationship classification result to complete the analysis process;
a scene graph construction module configured to construct a scene graph according to the relationship classification result;
the causal association mining model comprises a classifier;
wherein the relationship classification unit is specifically configured to input the first relationship data and the third relationship data into a trained classifier, perform relationship classification through the classifier, and output the relationship classification result;
the obtaining process of the classifier comprises the following steps:
acquiring training data and constructing a softmax regression pre-training model;
determining a first loss function l_bf based on the training data;

determining a second loss function l_ht based on the first loss function l_bf and the training data;

determining a third loss function l_fore according to the first loss function l_bf, the second loss function l_ht, and the training data;

determining a multi-level de-biasing loss function according to the first loss function l_bf, the second loss function l_ht, and the third loss function l_fore,
wherein the multi-level de-biasing loss function l_MHD is expressed as follows:

l_MHD = l_bf + α·l_ht + (1 − α)·l_fore

where α denotes the weighting coefficient of the multi-level de-biasing loss function;
inputting the training data into the softmax regression pre-training model, performing minimization based on the multi-level de-biasing loss function, and continuously training and adjusting the softmax regression pre-training model to obtain a trained softmax regression pre-training model, the trained softmax regression pre-training model being used as the classifier;
wherein the training data comprises object feature data of objects in training image data and true relationship classification results of the object feature data of the training data, and the objects in the training data comprise target training objects and non-target training objects;
the first loss function l_bf is expressed as follows:

l_bf = −Σ_{n=0}^{1} ŷ^{(n)}·log σ(x̂^{(n)})

where ŷ^{(n)} denotes, in the first loss function l_bf, the label indicating whether the object feature data of an object in the training image data is foreground or background, the foreground being a target training object and the background being a non-target training object; n is 1 or 0, n = 0 denoting the background and n = 1 denoting the foreground; x̂^{(n)} denotes, in the first loss function l_bf, the object feature data of an object of the foreground or background in the training image data; and σ is the sigmoid function;
the second loss function l_ht is expressed as follows:

l_ht = −Σ_{n=0}^{1} β^n·ŷ^{(n)}·log σ(x̂^{(n)})

where ŷ^{(n)} denotes, in the second loss function l_ht, the true relationship classification result of the object feature data of the training data; n is 1 or 0, n = 0 denoting the head, the head being the true relationship classes with a large number of instances, and n = 1 denoting the tail, the tail being the true relationship classes with a small number of instances; x̂^{(n)} denotes, in the second loss function l_ht, the object feature data of an object of the head or tail in the training image data; σ is the sigmoid function; and β is a weight parameter;
the third loss function l_fore is expressed as follows:

l_fore = −Σ_{j=1}^{r} y'_j·log p'_j

where y'_j denotes the true relationship classification label of the object feature data of the training data belonging to the foreground; j indexes the true relationship classification results of the object feature data of the training data; r denotes the number of true relationship classes of the object feature data of the training data; and p'_j denotes the predicted probability of the j-th relationship class for the object feature data of the training data.
5. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 3 when executing the program.
6. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 3.
CN202210425654.3A 2022-04-22 2022-04-22 Scene graph generation method and system based on causal association mining model Active CN114842248B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210425654.3A CN114842248B (en) 2022-04-22 2022-04-22 Scene graph generation method and system based on causal association mining model

Publications (2)

Publication Number Publication Date
CN114842248A CN114842248A (en) 2022-08-02
CN114842248B (en) 2024-02-02

Family

ID=82566216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210425654.3A Active CN114842248B (en) 2022-04-22 2022-04-22 Scene graph generation method and system based on causal association mining model

Country Status (1)

Country Link
CN (1) CN114842248B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115470304B (en) * 2022-08-31 2023-08-25 北京九章云极科技有限公司 Feature causal warehouse management method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5904559B2 (en) * 2013-12-20 2016-04-13 国立研究開発法人情報通信研究機構 Scenario generation device and computer program therefor
US11756291B2 (en) * 2018-12-18 2023-09-12 Slyce Acquisition Inc. Scene and user-input context aided visual search

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241989A (en) * 2018-07-17 2019-01-18 中国电力科学研究院有限公司 A kind of method and system of the intelligent substation intrusion scenario reduction based on space-time similarity mode
CN111950631A (en) * 2020-08-12 2020-11-17 华南师范大学 Feature deconstruction-oriented counterwork cooperation network module and counterwork cooperation method thereof
CN114359568A (en) * 2022-01-17 2022-04-15 浙江大学 Multi-label scene graph generation method based on multi-granularity characteristics
CN114119803A (en) * 2022-01-27 2022-03-01 浙江大学 Scene image generation method based on causal graph

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Relationship-Aware Primal-Dual Graph Attention Network For Scene Graph Generation; H. Zhou et al.; 2021 IEEE International Conference on Multimedia and Expo (ICME); pp. 1-6 *
Scene Graph Generation With External Knowledge and Image Reconstruction; J. Gu et al.; 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); pp. 1969-1978 *


Similar Documents

Publication Publication Date Title
Zheng et al. A full stage data augmentation method in deep convolutional neural network for natural image classification
US20240144566A1 (en) Image classification through label progression
Chen et al. Improving image captioning with conditional generative adversarial nets
Liang et al. Interpretable structure-evolving LSTM
Foggia et al. Multi-task learning on the edge for effective gender, age, ethnicity and emotion recognition
KR20220047228A (en) Method and apparatus for generating image classification model, electronic device, storage medium, computer program, roadside device and cloud control platform
CN108171261A (en) Adaptive semi-supervision image classification method, device, equipment and the medium of robust
CN113656660B (en) Cross-modal data matching method, device, equipment and medium
CN110781970A (en) Method, device and equipment for generating classifier and storage medium
US20210232855A1 (en) Movement state recognition model training device, movement state recognition device, methods and programs therefor
Shambharkar et al. Generating caption for image using beam search and analyzation with unsupervised image captioning algorithm
CN114842248B (en) Scene graph generation method and system based on causal association mining model
Patel et al. Correlated discrete data generation using adversarial training
Chen et al. Dynamic facial expression recognition model based on BiLSTM-Attention
US20220148293A1 (en) Image feature visualization method, image feature visualization apparatus, and electronic device
WO2021179189A1 (en) Visualization method and device for evaluating brain addiction traits, and medium
US20200279148A1 (en) Material structure analysis method and material structure analyzer
Dong et al. A supervised dictionary learning and discriminative weighting model for action recognition
CN115954019B (en) Method and system for identifying environmental noise by fusing self-attention and convolution operation
CN111445545A (en) Text-to-map method, device, storage medium and electronic equipment
CN116361657A (en) Method, system and storage medium for disambiguating ash sample labels
Xu et al. Real-time target detection and recognition with deep convolutional networks for intelligent visual surveillance
Wang Motion recognition based on deep learning and human joint points
Singh et al. Visual content generation from textual description using improved adversarial network
Tran et al. Augmentation-Enhanced Deep Learning for Face Detection and Emotion Recognition in Elderly Care Robots

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant