CN115546626B - Data double imbalance-oriented depolarization scene graph generation method and system - Google Patents
- Publication number
- CN115546626B (application CN202210210795.3A)
- Authority
- CN
- China
- Prior art keywords: relation, loss function, tree, foreground, relationship
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06T5/00—Image enhancement or restoration
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- G06T2207/20088—Trinocular vision calculations; trifocal tensor
Abstract
The invention discloses a method and system for generating a debiased scene graph oriented to data double imbalance. The method comprises the following steps: acquiring an original image; inputting the original image into an image recognition combination model, and obtaining a plurality of object candidate regions and corresponding object categories output by the model; constructing a causal intervention tree from the object candidate regions and an introduced average external object, and learning unbiased relation features; inputting the unbiased relation features into a classifier optimized with a bias-resistance loss function, and obtaining the predicted relations output by the classifier; and generating a debiased scene graph from the image recognition result and the predicted relations. The causal intervention tree reduces the context bias caused by data imbalance, eliminates spurious correlations, and learns unbiased relation features; optimizing the classifier with the bias-resistance loss function reduces the bias that data double imbalance introduces into the scene graph generation task. The accuracy of relation prediction is thereby significantly improved, achieving the effective generation of a debiased scene graph.
Description
Technical Field
The invention relates to the technical field of image processing, and in particular to a method and system for generating a debiased scene graph oriented to data double imbalance.
Background
Currently, scene graph generation (Scene Graph Generation, SGG for short) plays a vital role in deeply understanding the semantics of visual scenes by exploring the relationships between objects, and is widely used in visual intelligence and reasoning tasks such as image retrieval and visual question answering.
In reality, the relation distribution of a dataset is usually biased: head relation categories dominate, yielding a typical long-tail distribution. Lacking sufficient training samples, tail relation categories with few samples are suppressed during training by head relation categories with many samples, and relation recognition accuracy drops because the features of the affected categories are insufficiently representative. To improve relation recognition under long-tail distributions in SGG, earlier biased scene graph generation methods mined statistical correlations among objects to fit the highly biased distribution, tolerating poor performance on tail-class relations. Debiasing-based scene graph generation methods have since been proposed to alleviate the suppression of tail relation categories by head relation categories; for example, the paper "Unbiased Scene Graph Generation from Biased Training" by Tang Kaihua et al. proposes counterfactual causal reasoning to eliminate the influence of harmful bias.
However, besides the long-tail imbalance among foreground relations, the number of background samples (i.e., unlabeled relationships) in SGG greatly exceeds the number of foreground samples (i.e., manually labeled relationships). For example, in VG150, a subset of the widely used authoritative Visual Genome dataset containing the 150 objects and 50 relationships with the highest frequency of occurrence, the number of background samples is about 18 times the number of foreground samples. Specifically, according to the statistics of the VG150 dataset, the scene graph of each image contains about 11.5 objects and 6.2 relationships; that is, of the roughly 120 candidate relationships generated by these objects, only 6.2 are labeled as a foreground category (about 5.13%), and the remaining candidate relationships are background (about 94.87%). SGG therefore actually faces the challenge of data double imbalance: the long-tail imbalance within the foreground, and the extreme imbalance between background and foreground. Because existing debiasing-based scene graph generation methods ignore the background-foreground imbalance, they tend to assign head-relation instances to the background category or to tail categories, which greatly weakens relation recognition performance on head categories.
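The VG150 figures quoted above are mutually consistent, which a little arithmetic confirms; the sketch below assumes, as the counts in the text imply, that every ordered pair of distinct objects is a candidate relation:

```python
# Average per-image statistics of VG150 as quoted in the text.
avg_objects = 11.5       # objects per image
avg_foreground = 6.2     # manually labelled (foreground) relations per image

# Assumption: every ordered pair of distinct objects is a candidate relation.
candidate_pairs = avg_objects * (avg_objects - 1)      # 11.5 * 10.5 = 120.75, i.e. ~120
foreground_ratio = avg_foreground / candidate_pairs    # ~5.13% foreground
background_ratio = 1.0 - foreground_ratio              # ~94.87% background
bg_to_fg = (candidate_pairs - avg_foreground) / avg_foreground  # ~18x more background
```

Plugging in the quoted averages reproduces the roughly 120 candidate relations per image, the 5.13%/94.87% foreground/background split, and the roughly 18-fold background excess stated above.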
Therefore, how to alleviate the bias that data double imbalance, namely the long-tail distribution of foreground categories together with the extremely unbalanced distribution of background and foreground, brings to the SGG task remains an urgent problem.
Disclosure of Invention
Based on this, it is necessary to provide a method and system for generating a debiased scene graph oriented to data double imbalance that solve the above technical problems.
Based on the above object, the present invention provides a method for generating a debiased scene graph oriented to data double imbalance, comprising:
acquiring an original image;
inputting the original image into a preset image recognition combination model, and obtaining the image recognition result output by the model; the image recognition result comprises a plurality of object candidate regions and corresponding object categories;
acquiring an average external object, constructing a causal intervention tree from the plurality of object candidate regions and the average external object, and learning unbiased relation features based on the causal intervention tree;
constructing a bias-resistance loss function and optimizing a classifier with it; the bias-resistance loss function decouples the discrimination between foreground and background relations from the classification among the different foreground relations;
inputting the unbiased relation features into the optimized classifier, and obtaining the predicted relations output by the classifier;
and generating a debiased scene graph from the image recognition result and the predicted relations.
Preferably, acquiring an average external object, constructing a causal intervention tree from the plurality of object candidate regions and the average external object, and learning unbiased relation features based on the causal intervention tree comprises:
constructing, for an original image with n object candidate regions, an initial tree of n initial nodes by a minimum spanning tree algorithm;
constructing a causal intervention tree of (n+1) nodes by adding an additional node to the initial tree; the additional node represents the average external object;
assigning a feature to each node in the causal intervention tree;
inputting the feature of each node into a gated recurrent unit (GRU) network, performing message passing among the subject object, the object, and the average external object of each candidate relation through the GRU network, and outputting the logit vector of the candidate relation through a fully connected layer.
Preferably, assigning a feature to each node in the causal intervention tree comprises:
setting the feature of each initial node of the causal intervention tree to the object feature learned from the corresponding object candidate region by a deep neural network, and setting the feature of the additional node of the causal intervention tree to the feature of the average external object; the feature of the average external object is the average of the features of all external objects, computed by a moving-average method.
Preferably, constructing a bias-resistance loss function and optimizing a classifier with it comprises:
constructing a binary classification loss function used to discriminate foreground relations from background relations;
constructing a multi-classification loss function used to classify the individual foreground relations;
and combining the binary classification loss function and the multi-classification loss function into the bias-resistance loss function with which the classifier is optimized.
Preferably, constructing the binary classification loss function used to discriminate foreground relations from background relations comprises:
letting the logit vector output by the deep neural network be X = (x_0, x_1, ..., x_|R|), where |R| is the number of relation categories, letting the probability distribution corresponding to the logit vector be P = (p_0, p_1, ..., p_|R|), and letting the corresponding ground-truth label vector be Y = (y_0, y_1, ..., y_|R|);
obtaining the first conversion relation, which maps the logit vector X output by the deep neural network to the two-dimensional logit vector X_bf required for foreground-background binary classification; here x_0 and x_i are the first and the i-th logit values in X, with i in {1, 2, ..., |R|}; p_0 and p_i are the first and the i-th probabilities in P; and β is a first weight parameter that controls the background relation category in X;
obtaining the second conversion relation, which maps the ground-truth label vector Y corresponding to X to the two-dimensional ground-truth label vector Y_bf required for foreground-background binary classification; here y_0 is the first ground-truth relation label in Y;
and constructing the binary classification loss function l_bf from the binary cross-entropy loss function, the first conversion relation, and the second conversion relation, where X_bf is the logit vector required for the binary classification, Y_bf is the corresponding ground-truth label vector, σ is the sigmoid function, and α is a weight parameter for foreground relation samples.
Preferably, constructing the multi-classification loss function used to classify the individual foreground relations comprises:
defining a weight term ω over the candidate relation r;
and constructing the multi-classification loss function l_fore by introducing the weight term ω into the softmax cross-entropy loss function, where p_j is the j-th probability in the probability distribution P, with j in {0, 1, 2, ..., |R|}, and y_j is the ground-truth relation label corresponding to p_j.
Preferably, constructing the multi-classification loss function used to classify the individual foreground relations further comprises:
updating the probability p_j of the multi-classification loss function l_fore based on the softmax equalization loss, the updated probability being
p_j = exp(x_j) / Σ_k ω_k · exp(x_k),
where x_j is the logit value corresponding to p_j; x_k is the logit value corresponding to the k-th relation category, with k in {0, 1, 2, ..., |R|}; and ω_k is the weight of the k-th relation category, expressed as
ω_k = 1 − E(k) · T_λ(f_k) · (1 − y_k),
where E(k) is a binary term with E(k) = 0 when k = 0 (the background relation category) and E(k) = 1 when k > 0 (a foreground relation category); T_λ(·) is a threshold function with T_λ(f_k) = 1 when the frequency f_k of the k-th relation category is below the threshold λ and T_λ(f_k) = 0 otherwise; and y_k is the ground-truth relation label of the k-th relation category.
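The weight ω_k and the equalized softmax described above can be sketched directly; the per-class frequencies f_k and the threshold λ below are illustrative values, not figures from the patent:

```python
import math

def eql_weights(labels, freqs, lam):
    # omega_k = 1 - E(k) * T_lambda(f_k) * (1 - y_k)
    # E(k): 0 for the background class (k = 0), 1 for foreground classes.
    # T_lambda(f_k): 1 when the class frequency f_k is below the threshold lam.
    omega = []
    for k, (y_k, f_k) in enumerate(zip(labels, freqs)):
        E = 0 if k == 0 else 1
        T = 1 if f_k < lam else 0
        omega.append(1 - E * T * (1 - y_k))
    return omega

def eql_softmax(logits, omega):
    # Updated probability p_j = exp(x_j) / sum_k omega_k * exp(x_k):
    # rare negative foreground classes (omega_k = 0) drop out of the
    # normaliser, so they no longer suppress the positive class.
    denom = sum(w * math.exp(x) for w, x in zip(omega, logits))
    return [math.exp(x) / denom for x in logits]
```

With a head-class label and two rare tail classes below the threshold, the tail classes receive ω_k = 0 and the positive class keeps a larger probability than under a plain softmax, which is how the suppression of rarely seen categories is relieved.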
Preferably, the image recognition combination model consists of a fast region-based convolutional neural network and a bidirectional tree-structured long short-term memory network connected in sequence; inputting the original image into the preset image recognition combination model and obtaining the image recognition result output by the model comprises:
performing object detection on the input original image through the fast region-based convolutional neural network to obtain a plurality of object candidate regions in the original image;
and inputting each object candidate region into the bidirectional tree-structured long short-term memory network, extracting the object feature corresponding to each object candidate region through this network, and recognizing the corresponding object category from each object feature.
Based on the same inventive concept, the invention also provides a system for generating a debiased scene graph oriented to data double imbalance, comprising:
an image acquisition module for acquiring an original image;
an image recognition module for inputting the original image into a preset image recognition combination model and obtaining the image recognition result output by the model; the image recognition result comprises a plurality of object candidate regions and corresponding object categories;
an unbiased feature learning module for acquiring an average external object, constructing a causal intervention tree from the plurality of object candidate regions and the average external object, and learning unbiased relation features based on the causal intervention tree;
an optimization module for constructing a bias-resistance loss function and optimizing a classifier with it; the bias-resistance loss function decouples the discrimination between foreground and background relations from the classification among the different foreground relations;
a relation prediction module for inputting the unbiased relation features into the optimized classifier and obtaining the predicted relations output by the classifier;
and a scene graph generation module for generating a debiased scene graph from the image recognition result and the predicted relations.
Preferably, the unbiased feature learning module comprises:
an initial tree construction sub-module for constructing, for an original image that yields n object candidate regions, an initial tree of n initial nodes by a minimum spanning tree algorithm;
a causal intervention sub-module for constructing a causal intervention tree of (n+1) nodes by adding an additional node to the initial tree; the additional node represents the average external object;
a feature assignment sub-module for assigning a feature to each node in the causal intervention tree;
and a feature output sub-module for inputting the feature of each node into a gated recurrent unit network, performing message passing among the subject object, the object, and the average external object of each candidate relation through the network, and outputting the logit vector of the candidate relation through the fully connected layer.
In the method for generating a debiased scene graph oriented to data double imbalance, object detection and object classification are performed on the input original image by the image recognition combination model to obtain the object candidate regions and object categories; a causal intervention tree is constructed from the object candidate regions and an introduced average external object, and unbiased relation features are learned on it; the unbiased relation features are input into a classifier optimized with a bias-resistance loss function to obtain the predicted relations; and finally the debiased scene graph is generated from the object candidate regions, the object categories, and the predicted relations. The causal intervention tree reduces the context bias caused by data imbalance, eliminates spurious correlations, and learns unbiased relation features; optimizing the classifier with the bias-resistance loss function reduces the double-imbalance bias, from foreground to background and from tail foreground to head foreground, in the scene graph generation task, significantly improving the accuracy of relation prediction, so that a debiased scene graph is effectively generated.
Drawings
To describe the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings required in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the invention; other drawings can be derived from them by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of a method for generating a debiased scene graph oriented to data double imbalance according to an embodiment of the invention;
FIG. 2 is a flowchart of step S30 of the method for generating a debiased scene graph oriented to data double imbalance according to an embodiment of the invention;
FIG. 3a is a graph of spatial distribution of relationship features in the generation of a biased scene graph in accordance with an embodiment of the invention;
FIG. 3b is a graph of spatial distribution of relationship features in unbiased scene graph generation in accordance with an embodiment of the invention;
FIG. 4 is a schematic structural diagram of a system for generating a debiased scene graph oriented to data double imbalance according to an embodiment of the invention;
FIG. 5 is a schematic structural diagram of the unbiased feature learning module of the system for generating a debiased scene graph oriented to data double imbalance according to an embodiment of the invention.
Detailed Description
To make the technical problems to be solved, the technical solutions, and the beneficial effects of the invention clearer, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the invention.
Some of the terms involved in the present invention are explained as follows:
Fast R-CNN: Fast Region-based Convolutional Neural Network;
Bi-Tree LSTM: Bidirectional Tree-structured Long Short-Term Memory network;
GRU: Gated Recurrent Unit.
As shown in fig. 1, the method for generating a debiased scene graph oriented to data double imbalance provided by an embodiment of the present invention specifically includes the following steps:
step S10, acquiring an original image.
In this embodiment, the original image I is an image that needs to generate a scene graph, and a plurality of objects can be identified and obtained from the original image I.
Step S20, inputting the original image into a preset image recognition combination model and obtaining the image recognition result output by the model, the result comprising a plurality of object candidate regions and corresponding object categories.
In the present embodiment, for a given original image I, object detection and object classification are performed on I by the image recognition combination model to obtain the object candidate region set B = (b_1, b_2, ..., b_n) and the object category set O = (o_1, o_2, ..., o_n), and the image recognition result is assembled from B and O. Each object candidate region b_i in the set B corresponds to an object category o_i in the set O.
It will be appreciated that object classification, together with the relation classification in the subsequent steps, plays an important role in the scene graph generation task. The loss function for scene graph generation may include an object classification component and a relation classification component. For object classification, the image recognition combination model can be trained and optimized with a softmax cross-entropy loss and then used to classify objects.
Preferably, the image recognition combination model consists of Fast R-CNN and Bi-Tree LSTM connected in sequence; step S20 includes the following steps:
first, performing object detection on the input original image through Fast R-CNN to obtain a plurality of object candidate regions in the original image;
then, inputting each object candidate region into the Bi-Tree LSTM, extracting the object feature corresponding to each object candidate region, and recognizing the corresponding object category from each object feature.
It can be understood that, first, the first-layer structure Fast R-CNN of the image recognition combination model performs object detection on the original image I and obtains the n object candidate regions b_i, i.e., the object-box component P(B|I) for generating the scene graph; then the second-layer structure Bi-Tree LSTM extracts the feature of each object candidate region b_i and recognizes the object category o_i corresponding to each b_i, i.e., the object component P(O|I, B) for generating the scene graph. For the object component P(O|I, B), since the class distribution of object samples in the VG dataset is approximately balanced, it can be regarded as an unbiased learning process.
Further, the scene graph G corresponding to the original image I can be obtained by combining these components with the relation component P(R|I, B, O, Z) for unbiased relation prediction obtained in the subsequent steps.
It should be noted that Fast R-CNN and Bi-Tree LSTM are widely used in the field of image recognition and are not described in detail here.
Step S30, constructing a causal intervention tree from the plurality of object candidate regions and the introduced average external object, and learning unbiased relation features through the causal intervention tree.
It will be appreciated that a decision in the scene graph generation task (i.e., a relation prediction) results from the interaction of content and context. The content contains the object features of the object candidate regions and is the dominant, intrinsic factor of the decision; the context is an exogenous factor that plays an auxiliary role. Context can enhance understanding of the image and reduce the number of potential candidate relations in the relation decision. However, the context produced under data double imbalance may contain bias that is detrimental to fair decisions; to eliminate this adverse effect, this embodiment constructs a causal intervention tree to reduce the context bias, eliminate spurious correlations, and learn unbiased features.
Preferably, as shown in fig. 2, step S30 includes the steps of:
step S301, constructing initial trees of n initial nodes by a minimum spanning tree algorithm for the original images of the n object candidate areas.
In step S301, the minimum spanning tree algorithm is a Prim algorithm, and the implementation flow of dynamically constructing an initial tree for n object candidate regions by the Prim algorithm is as follows: 1) Establishing an edge set for storing results, establishing a node set for storing nodes and marking whether the nodes are accessed or not, and establishing a minimum heap of edges; 2) Starting traversing all nodes, if not accessing, adding the nodes to a node set, and then piling the connected edges; 3) Taking the minimum edge from the heap, judging whether a target node corresponding to the minimum edge is accessed, if not, adding the minimum edge into the initial tree, and marking the target node to access; 4) Adding the edge connected with the target node to the minimum heap; 5) And (5) cycling the steps until all the nodes are traversed, and obtaining an initial tree.
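The five steps above can be sketched as follows; the weight function over node pairs is a stand-in for whatever edge scoring the embodiment applies to object candidate regions:

```python
import heapq

def prim_initial_tree(n, weight):
    # Builds the initial tree over n object-candidate-region nodes with
    # Prim's algorithm, mirroring the steps above: a result edge set, a
    # visited-node set, and a min-heap of candidate edges.
    # `weight(u, v)` scores an edge; here it is an arbitrary callable.
    visited = {0}
    heap = [(weight(0, v), 0, v) for v in range(1, n)]
    heapq.heapify(heap)
    tree_edges = []
    while heap and len(visited) < n:
        w, u, v = heapq.heappop(heap)      # take the minimum edge
        if v in visited:                   # target already visited: skip
            continue
        visited.add(v)                     # mark target node as visited
        tree_edges.append((u, v, w))       # add the edge to the tree
        for t in range(n):                 # push edges incident to v
            if t not in visited:
                heapq.heappush(heap, (weight(v, t), v, t))
    return tree_edges
```

For instance, with `weight = lambda u, v: abs(u - v)` the result is a chain of n nodes with n − 1 unit-weight edges, as expected of a spanning tree.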
Step S302, constructing a causal intervention tree of (n+1) nodes by adding an additional node to the initial tree; the additional node represents the average external object.
In step S302, the constructed causal intervention tree comprises n initial nodes, each representing an object candidate region, and one additional node representing the introduced average external object.
Step S303, assigning a feature to each node in the causal intervention tree.
Preferably, the feature of each initial node of the causal intervention tree is set to the object feature learned from the corresponding object candidate region by a deep neural network, and the feature of the additional node is set to the feature of the average external object, which is the average obtained by computing the features of all external objects with a moving-average method.
It should be noted that the deep neural network may be the Bi-Tree LSTM of step S20.
Step S304, inputting the characteristics of each node into GRU network, transmitting information between subject object, object and average external object of each candidate relation through GRU network, and outputting logic vector of candidate relation through full connection layerWherein the logic vector->Can be expressed as:
in the formula (1), o i 、o j A subject object and an object respectively; FC (), || represent the full connection layer and feature splice layer in the GRU network, respectively.
It can be appreciated that the present embodiment uses the average external object z̄ in place of the plurality of external objects z_i. In order to fully exert the causal intervention effect of the external objects, additional message passing is added between the subject object o_i, the object o_j and the average external object z̄ of each candidate relation r_{i→j} through the GRU network, and the logit vector x_{i→j} of each candidate relation r_{i→j} is calculated with formula (1) through the fully connected layer; that is, the present embodiment takes the logit vector x_{i→j} of the candidate relation r_{i→j} as the unbiased relation feature.
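The fusion step of formula (1) can be sketched as follows; this is a simplified illustration in which a single linear layer stands in for FC(·), plain lists stand in for the learned features o_i, o_j and z̄, and the GRU message-passing update itself is omitted:

```python
def concat(*vectors):
    """Feature splicing '‖': concatenate feature vectors end to end."""
    out = []
    for v in vectors:
        out.extend(v)
    return out

def fc(weights, bias, x):
    """A single fully connected layer standing in for FC(.)."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relation_logits(o_i, o_j, z_bar, weights, bias):
    """Formula (1): x_{i->j} = FC(o_i || o_j || z_bar)."""
    return fc(weights, bias, concat(o_i, o_j, z_bar))
```

A real implementation would learn the weights of FC(·) jointly with the GRU; here they are supplied explicitly only so the computation is concrete.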
Further, in the process of learning unbiased relation features through the causal intervention tree, after the logit vectors of the candidate relations are obtained through the GRU network, the logit vectors output by the GRU network and the preset target logit vectors can be input together into a preset loss function to output a loss value Loss, and the feature z̄ of the average external object in the causal intervention tree is iteratively updated with a back-propagation algorithm according to the loss value Loss, until the loss value Loss is less than or equal to a preset loss threshold, whereupon the causal intervention tree with the updated feature z̄ of the average external object is saved. At this time, for the t-th training iteration, the feature z̄_t of the average external object can be expressed as:

z̄_t = (1 − λ)·z̄_{t−1} + (λ/C)·Σ_{k=1}^{C} v(z_k) (2)
in formula (2), λ is the update weight of each training iteration; z_k denotes the introduced k-th class of external object, and k = 1, 2, …, C; C is the number of object classes of the introduced external objects; v(z_k) is the feature of the external object z_k, and the feature of the external object z_k is the average of the features of all k-th class objects in the VG dataset.
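The moving-average update of formula (2) can be sketched as follows; the default value lam=0.1 is an illustrative assumption, not a parameter fixed by the embodiment:

```python
def update_average_external_object(z_bar, class_features, lam=0.1):
    """Formula (2): moving-average update of the average external object.

    z_bar          -- current feature z_bar_{t-1} of the average external object
    class_features -- v(z_k) for k = 1..C, one feature vector per class
    lam            -- update weight lambda of this training iteration
    """
    c = len(class_features)
    dim = len(z_bar)
    # mean over the C external-object class features
    mean = [sum(f[d] for f in class_features) / c for d in range(dim)]
    # blend the old estimate with the new mean
    return [(1 - lam) * z + lam * m for z, m in zip(z_bar, mean)]
```

With λ small, z̄ changes slowly across iterations, which keeps the causal intervention stable while the rest of the network is trained.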
As can be seen from the above, compared with independently calculating, for each candidate relation, the logit vectors under the intervention of a plurality of external objects using a convolutional neural network, the present embodiment uses the average external object in place of the plurality of external objects, which significantly reduces the amount of computation, effectively alleviates the context bias generated by the double data imbalance, and filters out spurious correlations between objects. Secondly, by learning the logit vectors of the candidate relations through causal intervention, effective support can be provided for unbiased relation prediction.
Step S40, constructing a bias-resistance loss function, and optimizing a classifier according to the bias-resistance loss function; the bias-resistance loss function is used for decoupling the identification of foreground and background relations from the classification of the different foreground relations.
Existing scene graph generation methods typically employ cross-entropy loss functions suited to balanced distributions to optimize their relation classification models. However, the relation distribution in scene graph generation is doubly imbalanced, as in the relation feature space distribution of biased scene graph generation shown in fig. 3a. When training on biased data, the feature space of the foreground relation classes is suppressed by the feature space of the background relation class, and in particular the samples of the tail foreground relations are always under-represented, so foreground relation samples are often predicted as the background relation class, seriously degrading the performance of relation prediction.
In the embodiment, a bias resistance loss function is constructed to enable a classifier to learn unbiased relation prediction. Preferably, step S40 includes the steps of:
step S401, constructing a two-class loss function, wherein the two-class loss function is used for identifying the relation between the foreground and the background;
step S402, constructing a multi-classification loss function, wherein the multi-classification loss function is used for classifying each foreground relation;
step S403, constructing a partial resistance loss function by combining the two-class loss function and the multi-class loss function, and optimizing the classifier.
Specifically, in order to balance the influence of the background relation class, the frequent (head) foreground relation classes and the rare (tail) foreground relation classes on classifier training, the bias-resistance loss function decouples the identification of the foreground-background relation from the classification of the foreground relations, so that an unbiased feature is classified both as a foreground or background relation and as a specific foreground relation. The bias-resistance loss function l_BR consists of two parts: one part is a binary classification loss function l_bf for identifying the foreground-background relation, and the other part is a loss function l_fore for classifying the different foreground relations. The bias-resistance loss function l_BR can be expressed as:
l_BR = l_bf + l_fore (3)
In an alternative embodiment, step S401 includes the steps of:
Step S4011, set the logit vector output by the deep neural network to X = (x_0, x_1, …, x_|R|), where |R| is the number of relation classes; set the probability distribution corresponding to the logit vector to P = (p_0, p_1, …, p_|R|), and the corresponding true label vector to Y = (y_0, y_1, …, y_|R|).
The deep neural network may be the GRU network in step S303; the relationship between the logit vector and the probability distribution is P = softmax(X); the true label vector Y is a one-hot encoded vector.
Step S4012, obtaining the first conversion relation between the two-dimensional logit vector X_bf required for foreground-background binary classification and the logit vector X output by the deep neural network; the first conversion relation is:

X_bf = (x_0, (1/|R|)·Σ_{i=1}^{|R|} x_i), if p_0 ≥ max(p_1, …, p_|R|); X_bf = (β·x_0, max(x_1, …, x_|R|)), otherwise (4)

in formula (4), x_0 and x_i are the first and the i-th logit value in the logit vector X, respectively, and i = 1, 2, …, |R|; p_0 and p_i are the first and the i-th probability in the probability distribution P, respectively; β is a first weight parameter used to control the background relation term in the logit vector X; max(·) is the maximum function.
Step S4013, obtaining the second conversion relation between the two-dimensional true label vector Y_bf required for foreground-background binary classification and the true label vector Y corresponding to the logit vector X; the second conversion relation is:

Y_bf = (1, 0), if y_0 = 1; Y_bf = (0, 1), otherwise (5)

in formula (5), y_0 is the first true relation label in the true label vector Y.
Step S4014, constructing the binary classification loss function l_bf according to the binary cross-entropy loss function, the first conversion relation and the second conversion relation; the binary classification loss function l_bf can be expressed as:

l_bf = −[y_bf^b·log σ(x_bf^b) + α·y_bf^f·log σ(x_bf^f)] (6)

in formula (6), X_bf = (x_bf^b, x_bf^f) is the two-dimensional logit vector required for foreground-background binary classification, Y_bf = (y_bf^b, y_bf^f) is the two-dimensional true label vector required for foreground-background binary classification (superscripts b and f denote the background and foreground components), σ is the sigmoid function, and α is the weight parameter of the foreground relation samples.
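The foreground-background branch can be sketched as follows; the branch condition of formula (4), the placement of the weight α on the foreground term of formula (6), and the default values of α and β are assumptions made for illustration:

```python
import math

def to_binary_logits(x, p, beta=0.5):
    """Formula (4): collapse the (|R|+1)-dim logit vector X into the
    two-dim X_bf = (background term, foreground term)."""
    fg_logits, fg_probs = x[1:], p[1:]
    if p[0] >= max(fg_probs):                     # predicted as background
        return (x[0], sum(fg_logits) / len(fg_logits))
    return (beta * x[0], max(fg_logits))          # predicted as foreground

def to_binary_labels(y):
    """Formula (5): Y_bf = (1, 0) if the true label is background."""
    return (1, 0) if y[0] == 1 else (0, 1)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def binary_loss(x_bf, y_bf, alpha=2.0):
    """Formula (6): weighted binary cross entropy; alpha weights the
    foreground term so background gradients are relatively attenuated."""
    b, f = x_bf
    return -(y_bf[0] * math.log(sigmoid(b)) +
             alpha * y_bf[1] * math.log(sigmoid(f)))
```

Only the background/foreground boundary is touched by α and β; the per-class foreground logits themselves are left unchanged, which is what preserves the head-class feature distribution.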
Understandably, the key point of constructing the binary classification loss function l_bf is to obtain the conversion relations between the logit vector X output by the GRU network (with its corresponding true label vector Y) and the two-dimensional logit vector X_bf and two-dimensional true label vector Y_bf used for foreground-background binary classification. The present embodiment uses the first conversion relation shown in formula (4) to convert the original logit vector X into the two-dimensional logit vector X_bf required for foreground-background binary classification, and uses the second conversion relation shown in formula (5) to convert the original true label vector into the two-dimensional true label vector Y_bf required for foreground-background binary classification.
From the first conversion relation shown in formula (4), if the predicted class of the original logit vector X belongs to the background relation class, then in the two-dimensional binary-classification logit vector X_bf the first element is x_0 and the second element is the average of the logit values x_i corresponding to all foreground relation classes; conversely, if the predicted class of the original logit vector X belongs to a foreground relation class, then the first element of X_bf is β·x_0 and the second element is the maximum foreground logit value, where the weight parameter β is used to control the size of the background term element in the original logit vector X.
From the second conversion relation shown in formula (5), it can be seen that if y_0 = 1 in the original true label vector Y, then the binary-classification two-dimensional true label vector Y_bf is (1, 0), and otherwise (0, 1).
Combining formula (4) and formula (6), it can be seen that the binary classification loss function l_bf can adjust the contribution of each of its parts through the two weight parameters α and β. The main function of the weight parameter α is to balance, in the binary classification, the weights between the gradients of the foreground relation samples and the gradients of the background relation samples; the main function of the weight parameter β is to adjust the influence of the imbalance on each element in the logit vector. That is, when the foreground-background distribution is extremely imbalanced, the background relation samples may suppress the gradients of the foreground relation samples; in order to obtain a classifier with balanced and fair decision boundaries, the binary classification loss function l_bf uses the weight parameter α to attenuate the effect of the background gradients. At the same time, imbalanced data may inflate the value of the background term element in the logit vector; in order to make the contributions of the foreground and background term elements in the logit vector relatively equal and to give more gradient support to correctly classified foreground relation samples, the binary classification loss function l_bf uses the weight parameter β to reduce the influence of the background term in the logit vector. It can be understood that the present embodiment adjusts only the class decision boundary between background and foreground through the two weight parameters α and β, without affecting the decision boundaries of the individual classes within the foreground relations, so that the constructed binary classification loss function l_bf does not damage the feature distribution of the head foreground relation classes and helps the classifier learn to distinguish the feature representations of the individual foreground relation classes.
In an alternative embodiment, step S402 includes the steps of:
In step S4021, a weight term ω is defined, which can be expressed as:

ω = 1 − y_0^(r)·p_0^(r) (7)

in formula (7), r is a candidate relation; p_0^(r) is the first value of the probability distribution P corresponding to the candidate relation r, and y_0^(r) is the first value of the true label vector Y corresponding to the candidate relation r.
Step S4022, constructing the multi-classification loss function l_fore by introducing the weight term ω into the softmax cross-entropy loss function; the multi-classification loss function l_fore can be expressed as:

l_fore = −ω·Σ_{j=0}^{|R|} y_j·log p_j (8)

in formula (8), p_j is the j-th probability in the probability distribution P, j = 0, 1, 2, …, |R|, and y_j is the true relation label corresponding to the probability p_j.
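Formulas (7) and (8) can be sketched as follows; the exact form of the weight term ω is reconstructed from the surrounding description (it vanishes only for background samples that the classifier already predicts as background with high confidence), so treat it as an assumption:

```python
import math

def weight_term(y, p):
    """Formula (7): omega = 1 - y_0 * p_0. Near 0 for a true background
    sample the classifier already assigns high background probability;
    exactly 1 for any foreground sample (y_0 = 0)."""
    return 1.0 - y[0] * p[0]

def multi_class_loss(y, p):
    """Formula (8): softmax cross entropy scaled by the weight term."""
    omega = weight_term(y, p)
    ce = -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, p) if y_j > 0)
    return omega * ce
```

The scaling keeps the full gradient for foreground samples while damping the loss of background samples the binary branch has already handled.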
It can be appreciated that, to reduce the suppression of the feature space of the foreground relations by the feature space of the background relation and to strengthen the feature representation of the foreground relation classes, the multi-classification loss function l_fore is constructed by introducing the weight term ω into the softmax cross-entropy loss function, so as to ignore the gradients of background relation samples that the binary classification loss function l_bf has already correctly classified.
Combining formulas (7) and (8), it can be seen that the gradients of samples truly labeled as foreground relations and of samples classified as foreground relations by the classifier are retained, while for background relation samples already correctly classified by the binary classification loss function l_bf, the resulting loss and gradient are set to 0 under the adjustment of the weight term ω; that is, only the gradients of background relation samples misclassified by the binary classification loss function l_bf are retained in the multi-classification loss function l_fore, which enhances the robustness of the classifier and maintains the feature distribution that the classifier learns for the foreground relation classes.
Further, step S4022 further includes the following steps:
Step S4023, updating the probability p_j of the multi-classification loss function l_fore according to the softmax equalization loss function; the updated probability p_j can be expressed as:

p_j = e^{x_j} / Σ_{k=0}^{|R|} ω_k·e^{x_k} (9)

in formula (9), x_j is the logit value corresponding to the probability p_j; x_k is the logit value corresponding to the k-th relation class, and k = 0, 1, 2, …, |R|; ω_k is the weight corresponding to the k-th relation class, which can be expressed as:
ω_k = 1 − E(k)·T_λ(f_k)·(1 − y_k) (10)
in formula (10), E(k) is a binary term: when k = 0, i.e. the background relation class, E(k) = 0, and when k > 0, i.e. a foreground relation class, E(k) = 1; T_λ(·) is a threshold function: when the frequency f_k of the k-th relation class is less than the threshold λ, T_λ(f_k) = 1, otherwise T_λ(f_k) = 0; y_k is the true relation label corresponding to the k-th relation class. Optionally, the threshold λ = 0.1.
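Formulas (9) and (10) can be sketched as follows; note that, as in the softmax equalization loss, dropping rare-foreground terms from the denominator means the updated values need not sum to 1:

```python
import math

def equalized_softmax(x, y, freq, lam=0.1):
    """Formulas (9)-(10): softmax whose denominator drops the
    contribution of rare (tail) foreground classes on negative labels.

    x    -- logit vector (x_0, ..., x_|R|); index 0 is the background class
    y    -- one-hot true label vector
    freq -- per-class sample frequency f_k
    lam  -- frequency threshold lambda
    """
    def omega(k):
        e_k = 0 if k == 0 else 1                  # E(k): background vs foreground
        t_k = 1 if freq[k] < lam else 0           # T_lambda(f_k): rare class?
        return 1 - e_k * t_k * (1 - y[k])         # formula (10)

    denom = sum(omega(k) * math.exp(x_k) for k, x_k in enumerate(x))
    return [math.exp(x_j) / denom for x_j in x]   # formula (9)
```

Because ω_k = 0 only for rare foreground classes with y_k = 0, a tail class never contributes a negative-sample gradient, yet keeps its full positive-sample gradient when it is the true label.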
It can be appreciated that, in order to enhance the contribution of the tail relation samples to the optimization training of the classifier, the probability p_j is updated according to the softmax equalization loss function so as to ignore the negative-sample gradients on the tail foreground relation classes.
As can be seen from the foregoing, in this embodiment, by jointly constructing the bias-resistance loss function from the binary classification loss function and the multi-classification loss function and using it to optimize the training of the classifier for relation classification, the bias from the background relation class toward the foreground relation classes and from the tail foreground relations toward the head foreground relations can be effectively suppressed. As shown in the relation feature space distribution diagram of unbiased scene graph generation in fig. 3b, by moving the decision boundary of the classifier and expanding the feature space of the foreground relation classes, the classifier optimized by the bias-resistance loss function can learn a more discriminative representation of each foreground relation class and can more reasonably distinguish background relation samples, head foreground relation samples and tail foreground relation samples. In addition, compared with methods such as re-weighting and re-sampling, the bias-resistance loss function with its weight parameters retains more effective object feature information in the head foreground classes.
Step S50, inputting the unbiased relation features into the optimized classifier, and acquiring the predicted relation output by the classifier.
In the present embodiment, after the classifier for relation classification is optimized with the bias-resistance loss function l_BR, the unbiased relation features (preferably, the logit vectors of the candidate relations) learned through the causal intervention tree are input into the classifier, and the predicted relation R output by the classifier is obtained.
Step S60, generating a debiased scene graph according to the image recognition result and the predicted relation.
In the present embodiment, a scene graph G corresponding to the original image I is generated from the object candidate regions B and object classes O obtained by the image recognition combined model and the predicted relations R obtained by the classifier.
As can be seen from the foregoing, in the data-double-imbalance-oriented debiased scene graph generation method provided in this embodiment, object detection and object classification are performed on the input original image through the image recognition combined model to obtain the object candidate regions and object classes; a causal intervention tree is then constructed from the object candidate regions and the introduced average external object, and unbiased relation features are learned based on the causal intervention tree; the unbiased relation features are input into a classifier optimized according to the bias-resistance loss function to obtain the predicted relations; and finally a scene graph is generated from the object candidate regions, object classes and predicted relations. This embodiment reduces the context bias caused by imbalanced data through the causal intervention tree, eliminating spurious correlations and learning unbiased relation features, while optimizing the classifier with the bias-resistance loss function reduces the double-imbalance bias from foreground to background and from tail foreground to head foreground in the scene graph generation task, thereby significantly improving the accuracy of relation prediction and achieving the goal of effectively generating unbiased scene graphs. In addition, scene graph generation was performed on the imbalanced dataset VG150, and experiments show that the data-double-imbalance-oriented debiased scene graph generation method provided by this embodiment is accurate and effective.
As shown in fig. 4, based on the same inventive concept and corresponding to the method of any embodiment, an embodiment of the present invention further provides a data-double-imbalance-oriented debiased scene graph generating system, which includes an image acquisition module 110, an image recognition module 120, an unbiased feature learning module 130, an optimization module 140, a relationship prediction module 150 and a scene graph generating module 160, where the detailed descriptions of the functional modules are as follows:
an image acquisition module 110 for acquiring an original image;
the image recognition module 120 is configured to input an original image into a preset image recognition combination model, and obtain an image recognition result output by the image recognition combination model; the image recognition result comprises a plurality of object candidate areas and corresponding object categories;
the unbiased feature learning module 130 is configured to acquire an average external object, construct a causal intervention tree according to the plurality of object candidate regions and the average external object, and learn unbiased features based on the causal intervention tree;
the optimizing module 140 is configured to construct a partial resistance loss function, and optimize the classifier according to the partial resistance loss function; the bias resistance loss function is used for decoupling the identification of the foreground relationship and the background relationship and the classification of different foreground relationships;
The relation prediction module 150 is configured to input the unbiased relation feature into the optimized classifier, and obtain a predicted relation output by the classifier;
the scene graph generating module 160 is configured to generate a debiased scene graph according to the image recognition result and the predicted relation.
In an alternative embodiment, the image recognition module 120 includes the following sub-modules, and each of the following sub-modules is described in detail below:
the object detection sub-module is used for carrying out object detection on the input original image through the fast region convolution neural network to obtain a plurality of object candidate regions in the original image;
the object classification sub-module is used for inputting each object candidate region into the two-way tree structure long-short term memory network, extracting the object characteristics corresponding to each object candidate region through the two-way tree structure long-short term memory network, and identifying and obtaining the corresponding object category based on each object characteristic.
In an alternative embodiment, as shown in fig. 5, the unbiased feature learning module 130 includes the following sub-modules, and detailed descriptions of the functional sub-modules are as follows:
an initial tree construction sub-module 131, configured to construct an initial tree of n initial nodes by a minimum spanning tree algorithm for an original image having n object candidate regions;
A causal intervention sub-module 132 for building a causal intervention tree of (n+1) nodes by adding an additional node to the initial tree; the additional node represents an average external object;
a feature assigning module 133, configured to assign features to each node in the causal intervention tree, and determine features of each node;
the feature output sub-module 134 is configured to input the features of each node into the gated recurrent unit network, perform message passing between the subject object, the object and the average external object of each candidate relation through the gated recurrent unit network, and output the logit vector of the candidate relation through a fully connected layer.
In an alternative embodiment, the feature assigning module 133 is configured to set the features of each initial node of the causal intervention tree to the object features learned from the corresponding object candidate region through the deep neural network, and to set the feature of the additional node of the causal intervention tree to the feature of the average external object; the feature of the average external object is the average feature calculated over the features of all external objects by a moving-average method.
In an alternative embodiment, the optimization module 140 includes the following sub-modules, and the detailed descriptions of each functional sub-module are as follows:
The classification submodule is used for constructing a classification loss function which is used for identifying the relation between the foreground and the background;
the multi-classification sub-module is used for constructing a multi-classification loss function, and the multi-classification loss function is used for classifying each foreground relation;
and the joint optimization sub-module is used for constructing a partial resistance loss function by combining the two-class loss function and the multi-class loss function and optimizing the classifier.
In an alternative embodiment, the two-classification sub-module includes the following units, and the detailed descriptions of each functional unit are as follows:
a setting unit for setting the logit vector output by the deep neural network to X = (x_0, x_1, …, x_|R|), where |R| is the number of relation classes, setting the probability distribution corresponding to the logit vector to P = (p_0, p_1, …, p_|R|), and the corresponding true label vector to Y = (y_0, y_1, …, y_|R|);
a first conversion processing unit for obtaining the first conversion relation between the two-dimensional logit vector X_bf required for foreground-background binary classification and the logit vector X output by the deep neural network; the first conversion relation is:

X_bf = (x_0, (1/|R|)·Σ_{i=1}^{|R|} x_i), if p_0 ≥ max(p_1, …, p_|R|); X_bf = (β·x_0, max(x_1, …, x_|R|)), otherwise,

wherein x_0 and x_i are the first and the i-th logit value in the logit vector X, respectively, and i = 1, 2, …, |R|; p_0 and p_i are the first and the i-th probability in the probability distribution P, respectively; β is a first weight parameter used to control the background relation term in the logit vector X;
a second conversion processing unit for obtaining the second conversion relation between the two-dimensional true label vector Y_bf required for foreground-background binary classification and the true label vector Y corresponding to the logit vector X; the second conversion relation is:

Y_bf = (1, 0), if y_0 = 1; Y_bf = (0, 1), otherwise,

wherein y_0 is the first true relation label in the true label vector Y;
a first loss function construction unit for constructing the binary classification loss function l_bf according to the binary cross-entropy loss function, the first conversion relation and the second conversion relation; the binary classification loss function l_bf is expressed as:

l_bf = −[y_bf^b·log σ(x_bf^b) + α·y_bf^f·log σ(x_bf^f)],

wherein X_bf = (x_bf^b, x_bf^f) is the logit vector required for foreground-background binary classification; Y_bf = (y_bf^b, y_bf^f) is the true label vector required for foreground-background binary classification; σ is the sigmoid function; α is the weight parameter of the foreground relation samples.
In an alternative embodiment, the multi-classification submodule includes the following elements, and the detailed descriptions of each functional element are as follows:
a weight definition unit for defining the weight term ω, the weight term ω being expressed as:

ω = 1 − y_0^(r)·p_0^(r),

wherein r is a candidate relation; p_0^(r) and y_0^(r) are the first value of the probability distribution P and the first value of the true label vector Y corresponding to the candidate relation r, respectively;
a second loss function construction unit for constructing the multi-classification loss function l_fore by introducing the weight term ω into the softmax cross-entropy loss function; the multi-classification loss function l_fore is expressed as:

l_fore = −ω·Σ_{j=0}^{|R|} y_j·log p_j,

wherein p_j is the j-th probability in the probability distribution P, and j = 0, 1, 2, …, |R|; y_j is the true relation label corresponding to the probability p_j.
In an alternative embodiment, the multi-classification sub-module further includes the following units, and the detailed descriptions of each functional unit are as follows:
a probability updating unit for updating the probability p_j of the multi-classification loss function l_fore according to the softmax equalization loss function; the updated probability p_j is expressed as:

p_j = e^{x_j} / Σ_{k=0}^{|R|} ω_k·e^{x_k},

wherein x_j is the logit value corresponding to the probability p_j; x_k is the logit value corresponding to the k-th relation class, and k = 0, 1, 2, …, |R|; ω_k is the weight corresponding to the k-th relation class, expressed as:
ω_k = 1 − E(k)·T_λ(f_k)·(1 − y_k),
wherein E(k) is a binary term: E(k) = 0 when k = 0, i.e. the background relation class, and E(k) = 1 when k > 0, i.e. a foreground relation class; T_λ(·) is a threshold function: when the frequency f_k of the k-th relation class is less than the threshold λ, T_λ(f_k) = 1, otherwise T_λ(f_k) = 0; y_k is the true relation label corresponding to the k-th relation class.
The system of the foregoing embodiment is configured to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the invention is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the invention, the steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the present invention. Therefore, any omissions, modifications, equivalent substitutions, improvements, and the like, which are within the spirit and principles of the embodiments of the invention, are intended to be included within the scope of the invention.
Claims (6)
1. A method for generating a data-double-imbalance-oriented debiased scene graph, characterized by comprising the following steps:
acquiring an original image;
inputting the original image into a preset image recognition combined model, and acquiring an image recognition result output by the image recognition combined model; the image recognition combined model consists of a rapid regional convolution neural network and a two-way tree structure long-term and short-term memory network which are connected in sequence; the image recognition result comprises a plurality of object candidate areas and corresponding object categories;
obtaining an average external object, wherein the characteristic of the average external object is that the characteristic of all external objects is calculated by a moving average method to obtain an average characteristic, the external object is obtained from the context of the object candidate region, a causal intervention tree is constructed according to a plurality of object candidate regions and the average external object, and the causal intervention tree-based learning unbiased relation characteristic comprises the following steps:
Constructing an initial tree of n initial nodes by a minimum spanning tree algorithm for the original image of n object candidate areas, and constructing a causal intervention tree of n+1 nodes by adding an additional node in the initial tree, wherein the additional node represents an average external object; feature assigning is performed on each node in the causal intervention tree, and the feature of each node is determined, including: setting a feature of each of the initial nodes of the causal intervention tree as an object feature learned from the corresponding object candidate region through a deep neural network, and setting a feature of the additional nodes of the causal intervention tree as a feature of the average external object; inputting the characteristics of each node into a gating circulation unit network, carrying out message transmission on a subject object, an object and an average external object of each candidate relation through the gating circulation unit network, and outputting a logic vector of the candidate relation through a full connection layer;
constructing a partial resistance loss function, optimizing a classifier according to the partial resistance loss function, and comprising the following steps: constructing a two-class loss function, wherein the two-class loss function is used for identifying the relation between the foreground and the background; constructing a multi-classification loss function, wherein the multi-classification loss function is used for classifying each foreground relation; constructing a partial resistance loss function by combining the two-class loss function and the multi-class loss function, and optimizing a classifier; the partial resistance loss function is used for decoupling the identification of the foreground relationship and the background relationship and the classification of different foreground relationships;
Inputting the unbiased relation features into the optimized classifier, and obtaining a prediction relation output by the classifier;
and generating a declination scene graph according to the image recognition result and the prediction relation.
2. The method for generating a data-double-imbalance-oriented debiased scene graph according to claim 1, wherein the constructing a binary classification loss function for identifying the foreground-background relation comprises:
setting deep nerveThe logic vector of the network output is x= (X) 0 ,x 1 ,…,x |R| ) The |r| is the number of relation categories, and the probability distribution corresponding to the logic vector is set to be p= (P) 0 ,p 1 ,…,p |R| ) The corresponding true tag vector is y= (Y) 0 ,y 1 ,…,y |R| );
acquiring a first conversion relationship between the two-dimensional logit vector X_bf required for foreground-background binary classification and the logit vector X output by the deep neural network; the first conversion relationship is:
wherein x_0 and x_i are respectively the first and the i-th logit values in the logit vector X, with i = 1, 2, …, |R|; p_0 and p_i are respectively the first and the i-th probabilities in the probability distribution P; and β is a first weight parameter for controlling the background relationship category in the logit vector X;
acquiring a second conversion relationship between the two-dimensional true label vector Y_bf required for foreground-background binary classification and the true label vector Y corresponding to the logit vector X; the second conversion relationship is:
wherein y_0 is the first true relationship label in the true label vector Y;
constructing the binary classification loss function l_bf according to the binary cross-entropy loss function, the first conversion relationship, and the second conversion relationship; the binary classification loss function l_bf is expressed as:
wherein X_bf is the logit vector required for the foreground-background binary classification; Y_bf is the true label vector required for the foreground-background binary classification; σ is the sigmoid function; and α is a weight parameter for the foreground relationship samples.
3. The method for generating a debiased scene graph oriented to doubly imbalanced data according to claim 2, wherein constructing the multi-classification loss function for classifying each foreground relationship comprises:
defining a weight term ω, expressed as:
wherein r is a candidate relationship, p_0 is the first value of the probability distribution P corresponding to the candidate relationship r, and y_0 is the first value of the true label vector Y corresponding to the candidate relationship r;
constructing the multi-classification loss function l_fore by introducing the weight term ω into the softmax cross-entropy loss function; the multi-classification loss function l_fore is expressed as:
wherein p_j is the j-th probability in the probability distribution P, with j = 0, 1, 2, …, |R|; and y_j is the true relationship label corresponding to the probability p_j.
4. The method for generating a debiased scene graph oriented to doubly imbalanced data according to claim 3, wherein constructing the multi-classification loss function for classifying each foreground relationship further comprises:
updating the probability p_j in the multi-classification loss function l_fore based on the softmax equalization loss function; the updated probability p_j is expressed as:
wherein x_j is the logit value corresponding to the probability p_j; x_k is the logit value corresponding to the k-th relationship category, with k = 0, 1, 2, …, |R|; and ω_k is the weight corresponding to the k-th relationship category, expressed as:
ω_k = 1 − E(k)·T_λ(f_k)·(1 − y_k),
wherein E(k) is a binary term: E(k) = 0 when k = 0, i.e., the background relationship category, and E(k) = 1 when k > 0, i.e., a foreground relationship category; T_λ(·) is a threshold function: T_λ(f_k) = 1 when the frequency f_k of the k-th relationship category is less than the threshold λ, and T_λ(f_k) = 0 otherwise; and y_k is the true relationship label corresponding to the k-th relationship category.
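The weight ω_k above is fully specified by claim 4, while the updated-probability formula itself is an image in the source; the form used below, p_j = e^{x_j} / Σ_k ω_k·e^{x_k}, is an assumed reconstruction that matches the standard softmax equalization loss and the variables the claim lists.

```python
import math

def eql_weight(k, freq, y, lam):
    """ω_k = 1 - E(k) * T_λ(f_k) * (1 - y_k), as given in claim 4.

    k:    relationship-category index (0 = background)
    freq: f_k, training frequency of category k
    y:    y_k, true label for category k (1 if it is the gold class)
    lam:  frequency threshold λ
    """
    E = 0 if k == 0 else 1        # the background category is never suppressed
    T = 1 if freq < lam else 0    # only rare (low-frequency) categories are suppressed
    return 1 - E * T * (1 - y)

def eql_softmax(logits, freqs, labels, lam):
    """Assumed updated softmax: p_j = exp(x_j) / sum_k ω_k exp(x_k)."""
    w = [eql_weight(k, f, y, lam) for k, (f, y) in enumerate(zip(freqs, labels))]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    denom = sum(wk * e for wk, e in zip(w, exps))
    return [e / denom for e in exps]
```

The effect is that rare foreground categories which are not the gold class drop out of the softmax denominator, so frequent categories stop suppressing their gradients.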
5. The method for generating a debiased scene graph oriented to doubly imbalanced data according to any one of claims 1 to 4, wherein inputting the original image into the preset image recognition combination model and obtaining the image recognition result output by the image recognition combination model comprises:
performing object detection on the input original image through the faster region-based convolutional neural network (Faster R-CNN) to obtain a plurality of object candidate regions in the original image;
and inputting each object candidate region into the bidirectional tree-structured long short-term memory network, extracting the object feature corresponding to each object candidate region through the bidirectional tree-structured long short-term memory network, and identifying the corresponding object category based on each object feature.
6. A debiased scene graph generation system oriented to doubly imbalanced data, comprising:
an image acquisition module, configured to acquire an original image;
an image recognition module, configured to input the original image into a preset image recognition combination model and acquire an image recognition result output by the image recognition combination model, wherein the image recognition combination model consists of a faster region-based convolutional neural network (Faster R-CNN) and a bidirectional tree-structured long short-term memory network connected in sequence, and the image recognition result comprises a plurality of object candidate regions and corresponding object categories;
an unbiased feature learning module, configured to acquire an average external object, wherein the feature of the average external object is obtained by computing the features of all external objects with a moving average method, and the external objects are acquired from the context of the object candidate regions; to construct a causal intervention tree according to the plurality of object candidate regions and the average external object; and to learn unbiased relationship features based on the causal intervention tree;
The unbiased feature learning module includes:
an initial tree construction sub-module, configured to construct an initial tree of n initial nodes by a minimum spanning tree algorithm for the original image with n object candidate regions;
a causal intervention sub-module, configured to construct a causal intervention tree of n+1 nodes by adding an additional node to the initial tree, wherein the additional node represents the average external object;
a feature assignment sub-module, configured to assign a feature to each node in the causal intervention tree and determine the feature of each node, including: setting the feature of each initial node of the causal intervention tree to the object feature learned from the corresponding object candidate region through a deep neural network, and setting the feature of the additional node of the causal intervention tree to the feature of the average external object;
a feature output sub-module, configured to input the feature of each node into a gated recurrent unit (GRU) network, perform message passing among the subject object, the object, and the average external object of each candidate relationship through the GRU network, and output a logit vector of the candidate relationship through a fully connected layer;
an optimization module, configured to construct a bias-resistant loss function and optimize a classifier according to the bias-resistant loss function, including: constructing a binary classification loss function for distinguishing foreground relationships from background relationships; constructing a multi-classification loss function for classifying each foreground relationship; and combining the binary classification loss function and the multi-classification loss function into the bias-resistant loss function and optimizing the classifier, wherein the bias-resistant loss function is used for decoupling the identification of foreground versus background relationships from the classification among different foreground relationships;
a relationship prediction module, configured to input the unbiased relationship features into the optimized classifier and obtain the predicted relationship output by the classifier;
and a scene graph generation module, configured to generate a debiased scene graph according to the image recognition result and the predicted relationship.
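The moving-average computation of the average external object's feature, referenced in the unbiased feature learning module, can be sketched as follows. The patent only states that a moving average over all external objects' features is used; the elementwise exponential-moving-average update rule and the momentum value below are assumptions.

```python
def update_average_external(avg_feat, new_feats, momentum=0.9):
    """Moving-average update for the average external object's feature.

    avg_feat:  current average feature vector, or None before the first update
    new_feats: iterable of external-object feature vectors from the context
               of the object candidate regions
    momentum:  retention factor of the running average (assumed value)
    """
    for feat in new_feats:
        if avg_feat is None:
            avg_feat = list(feat)   # initialize from the first external object
        else:
            avg_feat = [momentum * a + (1 - momentum) * f
                        for a, f in zip(avg_feat, feat)]
    return avg_feat
```

Each new external object nudges the running average toward its feature, so the average external object tracks the dataset-wide context rather than any single image.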
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210210795.3A CN115546626B (en) | 2022-03-03 | 2022-03-03 | Data double imbalance-oriented depolarization scene graph generation method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115546626A CN115546626A (en) | 2022-12-30 |
CN115546626B true CN115546626B (en) | 2024-02-02 |
Family
ID=84723939
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210210795.3A Active CN115546626B (en) | 2022-03-03 | 2022-03-03 | Data double imbalance-oriented depolarization scene graph generation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115546626B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117893988B (en) * | 2024-01-19 | 2024-06-18 | 元橡科技(北京)有限公司 | All-terrain scene pavement recognition method and training method |
CN117669713B (en) * | 2024-01-31 | 2024-07-12 | 宁德时代新能源科技股份有限公司 | Battery information processing method, device, electronic equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111062441A (en) * | 2019-12-18 | 2020-04-24 | 武汉大学 | Scene classification method and device based on self-supervision mechanism and regional suggestion network |
CN111553193A (en) * | 2020-04-01 | 2020-08-18 | 东南大学 | Visual SLAM closed-loop detection method based on lightweight deep neural network |
CN112131967A (en) * | 2020-09-01 | 2020-12-25 | 河海大学 | Remote sensing scene classification method based on multi-classifier anti-transfer learning |
CN112990202A (en) * | 2021-05-08 | 2021-06-18 | 中国人民解放军国防科技大学 | Scene graph generation method and system based on sparse representation |
CN113554129A (en) * | 2021-09-22 | 2021-10-26 | 航天宏康智能科技(北京)有限公司 | Scene graph generation method and generation device |
CN113627557A (en) * | 2021-08-19 | 2021-11-09 | 电子科技大学 | Scene graph generation method based on context graph attention mechanism |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10467760B2 (en) * | 2017-02-23 | 2019-11-05 | Adobe Inc. | Segmenting three-dimensional shapes into labeled component shapes |
US20220026557A1 (en) * | 2020-07-22 | 2022-01-27 | Plato Systems, Inc. | Spatial sensor system with background scene subtraction |
- 2022-03-03: CN application CN202210210795.3A filed; granted as CN115546626B (status: active)
Also Published As
Publication number | Publication date |
---|---|
CN115546626A (en) | 2022-12-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113378632B (en) | Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method | |
CN115546626B (en) | Data double imbalance-oriented depolarization scene graph generation method and system | |
WO2020114378A1 (en) | Video watermark identification method and apparatus, device, and storage medium | |
CN106570464B (en) | Face recognition method and device for rapidly processing face shielding | |
Wang et al. | Efficient learning by directed acyclic graph for resource constrained prediction | |
CN113469186B (en) | Cross-domain migration image segmentation method based on small number of point labels | |
CN114139676A (en) | Training method of domain adaptive neural network | |
CN115578248B (en) | Generalized enhanced image classification algorithm based on style guidance | |
CN116781346A (en) | Convolution two-way long-term and short-term memory network intrusion detection method based on data enhancement | |
CN115797735A (en) | Target detection method, device, equipment and storage medium | |
CN109101984B (en) | Image identification method and device based on convolutional neural network | |
WO2022166578A1 (en) | Method and apparatus for domain adaptation learning, and device, medium and product | |
CN118250169A (en) | Network asset class recommendation method, device and storage medium | |
CN111259442B (en) | Differential privacy protection method for decision tree under MapReduce framework | |
CN117371511A (en) | Training method, device, equipment and storage medium for image classification model | |
CN113128659A (en) | Neural network localization method and device, electronic equipment and readable storage medium | |
CN114792114B (en) | Unsupervised domain adaptation method based on black box multi-source domain general scene | |
CN115116115A (en) | Face recognition and model training method and device thereof | |
Celestine et al. | Investigations on adaptive connectivity and shape prior based fuzzy graph‐cut colour image segmentation | |
CN115170838A (en) | Data screening method and device | |
CN113902959A (en) | Image recognition method and device, computer equipment and storage medium | |
CN113128616A (en) | Method and device for determining black box attack algorithm, computer storage medium and terminal | |
Masmela-Caita et al. | Imputation of Missing Data Using Linear Gaussian Cluster-Weighted Modeling | |
CN114037857B (en) | Image classification precision improving method | |
KR102580958B1 (en) | Learning device, learning method, device and method for determing wildfire for embedding drone |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||