CN115861779A - Unbiased scene graph generation method based on effective feature representation - Google Patents

Unbiased scene graph generation method based on effective feature representation

Info

Publication number
CN115861779A
CN115861779A (application CN202211506846.3A)
Authority
CN
China
Prior art keywords
network
scene graph
effective
classification
graph generation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211506846.3A
Other languages
Chinese (zh)
Inventor
王菡子
马文熙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202211506846.3A priority Critical patent/CN115861779A/en
Publication of CN115861779A publication Critical patent/CN115861779A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

An unbiased scene graph generation method based on effective feature representation relates to computer vision. A training strategy that decouples the feature extraction network from the classification network is adopted. Visual features of objects are first extracted with a pre-trained backbone network; after target detection, the extracted visual features, object position encodings, and object class encodings are paired, combined, and re-encoded to obtain encoding features suited to predicate classification; predicates are then classified through a fully connected layer, and the feature extraction network is trained on this loss. At inference time the fully-connected classification network is not used: the mean encoding feature of each predicate class is computed, and a sample is classified by the cosine similarity between its encoding feature and each class mean. Abandoning the fully connected classifier and classifying directly on predicate features avoids the problem that fully connected layer parameters are easily affected by long-tailed data, thereby improving performance on the scene graph generation task.

Description

Unbiased scene graph generation method based on effective feature representation
Technical Field
The invention relates to computer vision technology, and in particular to an unbiased scene graph generation method based on effective feature representation.
Background
In recent years, scene graph generation algorithms based on deep learning have made significant progress. However, the data sets used to train scene graph generation models suffer from a severe long-tail problem: the number of samples of a few head predicates far exceeds that of the body and tail predicates. This extreme imbalance between classes heavily biases the model's predictions, giving it a strong tendency to predict head predicates and preventing it from learning relationship prediction well. Solving the poor algorithm performance caused by long-tailed data is therefore very important for the scene graph generation task.
A natural way to improve predicate classification accuracy in the scene graph generation task is to improve the network, increasing its capacity so that it extracts better predicate features. Early SGG methods therefore focused on building better feature extraction networks. Guojun Yin et al. (Yin, G., Sheng, L., Liu, B., Yu, N., Wang, X., Shao, J., & Loy, C. C. (2018). Zoom-Net: Mining deep feature interactions for visual relationship recognition. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 322-338)) exploit local feature interactions to improve scene graph generation. Kaihua Tang et al. (Tang, K., Zhang, H., Wu, B., Luo, W., & Liu, W. (2019). Learning to compose dynamic tree structures for visual contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6619-6628)) learn global visual context by message passing with a standard recurrent neural network structure or a variant of it. Jianwei Yang et al. (Yang, J., Lu, J., Lee, S., Batra, D., & Parikh, D. (2018). Graph R-CNN for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 670-685)) propose to generate the final sparse scene graph by pruning an initially dense, complete scene graph. These works concentrate on improving the structure of the feature extraction network and neglect the long-tail problem, which is a major factor limiting training effectiveness. Accordingly, Kaihua Tang et al. (Tang, K., Niu, Y., Huang, J., Shi, J., & Zhang, H. (2020). Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3716-3725)) remove bias through counterfactual causal analysis during training. Yuyu Guo et al. (Guo, Y., Gao, L., Wang, X., Hu, Y., Xu, X., Lu, X., & Song, J. (2021). From general to specific: Informative scene graph generation via balance adjustment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 16383-16392)) adopt a two-step training strategy that alleviates the long-tail problem by fine-tuning part of the network parameters on a constructed balanced training domain.
Disclosure of Invention
To address the problems in the prior art, the invention aims to solve the poor relationship detection performance caused by long-tailed training data in the scene graph generation task, and provides an unbiased scene graph generation method based on effective feature representation.
The invention comprises the following steps:
A. collecting a scene graph generation data set, dividing the scene graph generation data set into a training set, a verification set and a test set, and carrying out image preprocessing;
B. extracting visual characteristics of the object by using a pre-trained backbone network, and sending the visual characteristics into a target detection branch to obtain the position and the category of the object;
C. encoding the object position and the object class obtained in step B, respectively, to obtain object position encoding features and object class encoding features;
D. concatenating the object visual features obtained in step B with the object position encoding features and object class encoding features obtained in step C to obtain the effective feature representation of the object;
E. feeding the effective feature representations of all objects in each image obtained in step D into a relation fusion feature encoder, and pairing the encoding results pairwise to obtain a series of effective feature representations of relations;
F. feeding the effective feature representations of relations obtained in step E into a fully connected network for classification, and computing the classification loss so as to update the parameters of the network;
G. after training converges, for each predicate class, computing the mean of the effective relation features of the training samples containing that predicate using steps A to E; at inference, computing the cosine similarity between the effective relation feature of the sample to be classified and each computed class mean, and taking the class with the largest similarity as the classification result.
In step A, the scene graph generation data set is the public data set VG-150 (Xu, D., Zhu, Y., Choy, C. B., & Fei-Fei, L. (2017). Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5410-5419)). The data set contains 108,077 pictures covering 150 object classes and 50 predicate classes. It is divided into a training set and a test set at a ratio of 7:3, with the first 5,000 pictures of the training set used as a verification set. During model training, the pictures undergo preprocessing operations such as random cropping, random flipping and normalization to further enrich the training samples.
In step B, the backbone network is a ResNeXt-101-FPN network (Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1492-1500); Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2117-2125)), and the target detection branch is a Faster R-CNN network (Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28). The visual feature of an object is a 4096-dimensional vector produced by the neural network. The position of an object is a four-dimensional vector giving the horizontal and vertical coordinates of the upper-left and lower-right corners of the object box, and the class of the object is an integer in the range [0, C_O), where C_O is the number of object classes in the data set.
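As a rough, non-authoritative illustration of step B, the sketch below uses torchvision's stock fasterrcnn_resnet50_fpn as a stand-in for the ResNeXt-101-FPN + Faster R-CNN detector described above; the model choice, tensor shapes, and variable names are assumptions for illustration only.

```python
# Rough sketch of step B: run a pre-trained detector to obtain object boxes
# and class labels. torchvision's fasterrcnn_resnet50_fpn is used here only
# as a stand-in for the ResNeXt-101-FPN + Faster R-CNN detector described
# in the patent; it is not the same backbone.
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

image = torch.rand(3, 600, 800)            # dummy RGB image tensor in [0, 1]
with torch.no_grad():
    detections = detector([image])[0]      # dict with 'boxes', 'labels', 'scores'

boxes = detections["boxes"]                # (N, 4): x1, y1, x2, y2 per detected object
labels = detections["labels"]              # (N,): predicted object classes
# The 4096-dimensional per-object visual feature would come from the detector's
# ROI box head (e.g., via a forward hook); that plumbing is omitted here.
```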
In step C, the position encoding feature of an object is obtained as follows: first a nine-dimensional vector is computed whose entries are: object box width / image width, object box height / image height, abscissa of the object box center / image width, ordinate of the object box center / image height, abscissa of the upper-left corner / image width, ordinate of the upper-left corner / image height, abscissa of the lower-right corner / image width, ordinate of the lower-right corner / image height, and (object box width) / (image height); this vector is then linearly transformed into a 128-dimensional vector. The class encoding of an object is a 200-dimensional vector learned by a neural network embedding layer.
In step D, the effective feature representation of an object is obtained by first concatenating the object visual feature from step B with the object position encoding feature and object class encoding feature from step C, and then passing the concatenation through a fully connected layer to obtain a 768-dimensional vector.
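The following is a minimal sketch of steps C and D under the dimensions stated above (a nine-dimensional box geometry projected to 128 dimensions, a 200-dimensional class embedding, a 4096-dimensional visual feature, all concatenated and projected to 768 dimensions); the module and variable names are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

def box_geometry(box, img_w, img_h):
    """Nine-dimensional geometric encoding of a box (x1, y1, x2, y2), as listed in step C."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2.0, y1 + h / 2.0
    return torch.tensor([w / img_w, h / img_h, cx / img_w, cy / img_h,
                         x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                         w / img_h])        # ninth entry exactly as stated in the text

class ObjectFeature(nn.Module):
    """Steps C and D: fuse visual, position, and class information into a 768-d feature."""
    def __init__(self, num_classes=150):
        super().__init__()
        self.pos_fc = nn.Linear(9, 128)                   # position encoding (step C)
        self.cls_emb = nn.Embedding(num_classes, 200)     # class encoding (step C)
        self.fuse_fc = nn.Linear(4096 + 128 + 200, 768)   # concatenation + projection (step D)

    def forward(self, visual_feat, box, label, img_w, img_h):
        pos = self.pos_fc(box_geometry(box, img_w, img_h))
        cls = self.cls_emb(label)
        return self.fuse_fc(torch.cat([visual_feat, pos, cls], dim=-1))
```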
In step E, the following substeps are further included:
E1. the relation fusion encoder consists of a series of Transformer encoders (Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30), with two fusion strategies added;
E2. the first part of the two fusion strategies is a fusion operation on the input of the Transformer encoders; specifically, the input of the 1st Transformer encoder is the effective feature representation of the object obtained in step D, and each subsequent encoder, except the (M+1)-th, takes the output of the previous encoder as input; to prevent the effective feature representation of the object from being forgotten during encoding, the input of the (M+1)-th Transformer encoder is instead the fusion of the output of the M-th Transformer encoder with the effective feature representation of the object, computed as follows:
X_{M+1} = (X_1 + Y_M) W_in + b_in   (formula one)
where X_{M+1} is the input of the (M+1)-th Transformer encoder; X_1, the input of the 1st Transformer encoder, is the effective feature representation of the object; Y_M is the output of the M-th Transformer encoder; and W_in and b_in are the matrix and vector of a linear transformation;
E3. the second part of the two fusion strategies is a fusion operation performed on the output of every Transformer encoder, so that the encoding result contains multi-level features; it is computed as:
(formula two; the equation is given as an image in the original publication and is not reproduced here)
where Y denotes the fusion result of the outputs of the Transformer encoders and M + N is the number of Transformer encoders;
E4. the fused encoding result of the Transformer encoders is computed for every object, and the results are concatenated pairwise to obtain a series of effective feature representations of relations; specifically, for a pair of objects <s, o>, the effective feature representation F_{s,o} of the relation between the two objects is computed as:
F_{s,o} = cat(Y_s W_out + b_out, Y_o W_out + b_out)   (formula three)
where cat(·,·) denotes vector concatenation, Y_s and Y_o are the fused Transformer encoding results for object s and object o respectively, and W_out and b_out are the matrix and vector of a linear transformation; the resulting effective feature representation F_{s,o} of the relation is a 768-dimensional vector.
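The sketch below illustrates the two fusion strategies and the pairwise pairing of step E (formulas one to three). Because the exact form of formula two is not reproduced in this text, the output fusion is approximated by a plain mean over encoder outputs, and W_out is assumed to halve the feature width so that the concatenated relation feature is again 768-dimensional; these choices and all names are assumptions.

```python
import torch
import torch.nn as nn

class RelationFusionEncoder(nn.Module):
    """Stack of M + N Transformer encoders implementing the two fusion strategies of step E."""
    def __init__(self, dim=768, m=3, n=3, heads=8):
        super().__init__()
        self.m = m
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
             for _ in range(m + n)])
        self.in_fc = nn.Linear(dim, dim)        # W_in, b_in of formula one
        self.out_fc = nn.Linear(dim, dim // 2)  # W_out, b_out of formula three (halves the width)

    def forward(self, obj_feats):               # obj_feats: (num_objects, dim) for one image
        x = obj_feats.unsqueeze(0)              # add a batch dimension
        outputs = []
        for i, layer in enumerate(self.layers):
            if i == self.m:                     # formula one: re-inject the object features
                x = self.in_fc(obj_feats.unsqueeze(0) + outputs[-1])
            x = layer(x)
            outputs.append(x)
        # Formula two fuses the outputs of all encoders; the exact form is not
        # reproduced in the text, so a plain mean over layers is assumed here.
        y = torch.stack(outputs).mean(dim=0).squeeze(0)   # (num_objects, dim)
        return self.out_fc(y)                   # (num_objects, dim // 2)

def pairwise_relation_features(encoded):
    """Formula three: concatenate the encoded features of every ordered object pair <s, o>."""
    n = encoded.size(0)
    pairs = [(s, o) for s in range(n) for o in range(n) if s != o]
    feats = torch.stack([torch.cat([encoded[s], encoded[o]]) for s, o in pairs])
    return feats, pairs                          # feats: (num_pairs, 768)
```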
In step F, to allow the network to update its parameters, the effective feature representations of relations obtained in step E are fed into a fully connected network for predicate classification, and the gradients of the cross-entropy loss of predicate classification are back-propagated to update the parameters of the feature extraction part of the network.
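A minimal sketch of the training-time classification of step F, assuming the 50 predicate classes of VG-150 and a single fully connected head trained with cross-entropy; the function and variable names are illustrative.

```python
import torch
import torch.nn as nn

num_predicates = 50                               # VG-150 predicate classes
predicate_head = nn.Linear(768, num_predicates)   # fully connected classifier used only for training
criterion = nn.CrossEntropyLoss()

def training_step(relation_feats, predicate_labels, optimizer):
    """relation_feats: (num_pairs, 768); predicate_labels: (num_pairs,) long tensor."""
    logits = predicate_head(relation_feats)
    loss = criterion(logits, predicate_labels)
    optimizer.zero_grad()
    loss.backward()        # gradients flow back into the feature extraction network
    optimizer.step()
    return loss.item()
```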
In step G, the following substeps are further included:
G1. to avoid the problem that the fully connected layer parameters tend to favour classes with many samples, all parameters of the whole network are frozen once the training of steps A-F has converged; then, for each predicate class, the mean of the effective feature representations of the relations in the training set that contain that predicate is computed using steps A-E, giving C_R relation feature means, where C_R is the number of predicate classes in the data set; the relation feature mean μ_i of the i-th predicate class r_i is computed as:
μ_i = (1/n_i) Σ_{<s,o>} 1(p_{s,o} = r_i) · F_{s,o}   (formula four)
where n_i is the number of training samples containing the i-th predicate, p_{s,o} is the ground-truth predicate between the object pair <s, o>, and 1(·) is the indicator function, equal to 1 when its argument is true and 0 otherwise;
G2. in the model inference stage, the effective relation feature F̂_{s,o} of the sample to be classified is first computed according to steps A-E; its cosine similarity to each of the relation feature means computed in step G1 is then evaluated, and the class with the largest similarity is taken as the inference-time classification result ĉ, i.e. ĉ = argmax_i cos(F̂_{s,o}, μ_i).
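A short sketch of step G under the formulas above: per-predicate mean features are computed once over the training set after the network is frozen, and inference returns the class whose mean is most cosine-similar to the sample feature; variable names and the looping strategy are assumptions.

```python
import torch
import torch.nn.functional as F

def predicate_means(relation_feats, predicate_labels, num_predicates=50):
    """Formula four: mean effective relation feature of each predicate class,
    computed once over the training set after the network has converged and been frozen."""
    means = torch.zeros(num_predicates, relation_feats.size(1))
    for i in range(num_predicates):
        mask = predicate_labels == i
        if mask.any():
            means[i] = relation_feats[mask].mean(dim=0)
    return means

def classify_by_cosine(sample_feat, means):
    """Inference: pick the predicate whose class mean is most cosine-similar to the sample."""
    sims = F.cosine_similarity(sample_feat.unsqueeze(0), means, dim=1)
    return int(sims.argmax())
```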
the invention provides an unbiased scene graph generation method based on effective feature representation. Therefore, the whole training of the scene graph generation network is firstly carried out, then the classifier in the form of the full connection layer in the original network is discarded during reasoning, and the predicate classification is carried out by using a feature cosine similarity matching method instead. The method requires that the predicate characteristics learned by the network can practically and accurately represent the relation between the main predicates in each relation, and a relation characteristic fusion encoder is provided for performing multi-level fusion operation of the predicate characteristics, so that more practical and effective relation characteristic expression is obtained. By the method, the problem that the classifier learning is biased due to long-tail data can be effectively solved, and the performance of generating the scene graph is effectively improved.
Drawings
Fig. 1 is a diagram of the entire network structure according to the embodiment of the present invention.
Fig. 2 compares scene graphs generated by a baseline method and by the method of the present invention on several pictures randomly drawn from the scene graph generation data set VG-150.
Detailed Description
The present invention is further described below with reference to an embodiment; the present application is not limited to this embodiment.
Referring to fig. 1, an implementation of an embodiment of the invention includes the steps of:
A. A scene graph generation data set is collected, divided into a training set, a verification set and a test set, and image preprocessing is performed. The specific method is as follows: the invention adopts the public data set VG-150 (Xu, D., Zhu, Y., Choy, C. B., & Fei-Fei, L. (2017). Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5410-5419)). The data set contains 108,077 pictures covering 150 object classes and 50 predicate classes. It is divided into a training set and a test set at a ratio of 7:3, with the first 5,000 pictures of the training set used as a verification set. During model training, the pictures undergo preprocessing operations such as random cropping, random flipping and normalization to further enrich the training samples.
B. The visual features of objects are extracted with the pre-trained backbone network and sent to the target detection branch to obtain object positions and classes. The specific method is as follows: the backbone network is a ResNeXt-101-FPN network (Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1492-1500); Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2117-2125)), and the target detection branch adopts a Faster R-CNN network (Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28). The visual feature of an object is a 4096-dimensional vector produced by the neural network. The position of an object is a four-dimensional vector giving the horizontal and vertical coordinates of the upper-left and lower-right corners of the object box, and the class of the object is an integer in the range [0, C_O), where C_O is the number of object classes in the data set.
C. The position and class of each object obtained in step B are encoded separately to obtain the object position encoding feature and the object class encoding feature. The specific method is as follows: the position encoding feature of an object is obtained by first computing a nine-dimensional vector whose entries are: object box width / image width, object box height / image height, abscissa of the object box center / image width, ordinate of the object box center / image height, abscissa of the upper-left corner / image width, ordinate of the upper-left corner / image height, abscissa of the lower-right corner / image width, ordinate of the lower-right corner / image height, and (object box width) / (image height); this vector is then linearly transformed into a 128-dimensional vector. The class encoding of an object is a 200-dimensional vector learned by a neural network embedding layer.
D. The object visual features obtained in step B are concatenated with the object position encoding features and object class encoding features obtained in step C to obtain the effective feature representation of the object. The specific method is as follows: the object visual feature from step B is concatenated with the object position encoding feature and object class encoding feature from step C, and the concatenation is passed through a fully connected layer to obtain a 768-dimensional vector.
E. The effective feature representations of all objects in each image obtained in step D are fed into the relation fusion feature encoder, and the encoding results are paired pairwise to obtain a series of effective feature representations of relations. The specific method comprises the following substeps:
E1. the relation fusion encoder consists of a series of Transformer encoders (Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30), with two fusion strategies added.
E2. The first part of the two fusion strategies is a fusion operation on the input of the Transformer encoders. Specifically, the input of the 1st Transformer encoder is the effective feature representation of the object obtained in step D, and each subsequent encoder, except the (M+1)-th, takes the output of the previous encoder as input. To prevent the effective feature representation of the object from being forgotten during encoding, the input of the (M+1)-th Transformer encoder is instead the fusion of the output of the M-th Transformer encoder with the effective feature representation of the object, computed as follows:
X_{M+1} = (X_1 + Y_M) W_in + b_in   (formula one)
where X_{M+1} is the input of the (M+1)-th Transformer encoder; X_1, the input of the 1st Transformer encoder, is the effective feature representation of the object; Y_M is the output of the M-th Transformer encoder; and W_in and b_in are the matrix and vector of a linear transformation.
E3. The second part of the two fusion strategies is a fusion operation performed on the output of every Transformer encoder, so that the encoding result contains multi-level features. It is computed as:
(formula two; the equation is given as an image in the original publication and is not reproduced here)
where Y denotes the fusion result of the outputs of the Transformer encoders and M + N is the number of Transformer encoders.
E4. The fused encoding result of the Transformer encoders is computed for every object, and the results are concatenated pairwise to obtain a series of effective feature representations of relations. Specifically, for a pair of objects <s, o>, the effective feature representation F_{s,o} of the relation between the two objects is computed as:
F_{s,o} = cat(Y_s W_out + b_out, Y_o W_out + b_out)   (formula three)
where cat(·,·) denotes vector concatenation, Y_s and Y_o are the fused Transformer encoding results for object s and object o respectively, and W_out and b_out are the matrix and vector of a linear transformation; the resulting effective feature representation F_{s,o} of the relation is a 768-dimensional vector.
F. The effective feature representations of relations obtained in step E are fed into a fully connected network for classification, and the classification loss is computed to update the network parameters. The specific method is as follows: to allow the network to update its parameters, the effective feature representations of relations obtained in step E are fed into a fully connected network for predicate classification, and the gradients of the cross-entropy loss of predicate classification are back-propagated to update the parameters of the feature extraction part of the network.
G. After training converges, for each predicate class, the mean of the effective relation features of the training samples containing that predicate is computed using steps A-E; at inference, the cosine similarity between the effective relation feature of the sample to be classified and each computed class mean is evaluated, and the class with the largest similarity is taken as the classification result. The specific method comprises the following substeps:
G1. To avoid the problem that the fully connected layer parameters tend to favour classes with many samples, all parameters of the whole network are frozen once the training of steps A-F has converged. Then, for each predicate class, the mean of the effective feature representations of the relations in the training set that contain that predicate is computed using steps A-E, giving C_R relation feature means, where C_R is the number of predicate classes in the data set; the relation feature mean μ_i of the i-th predicate class r_i is computed as:
μ_i = (1/n_i) Σ_{<s,o>} 1(p_{s,o} = r_i) · F_{s,o}   (formula four)
where n_i is the number of training samples containing the i-th predicate, p_{s,o} is the ground-truth predicate between the object pair <s, o>, and 1(·) is the indicator function, equal to 1 when its argument is true and 0 otherwise.
G2. In the model inference stage, the effective relation feature F̂_{s,o} of the sample to be classified is first computed according to steps A-E; its cosine similarity to each of the relation feature means computed in step G1 is then evaluated, and the class with the largest similarity is taken as the inference-time classification result ĉ, i.e. ĉ = argmax_i cos(F̂_{s,o}, μ_i).
As shown in Fig. 2, given a number of pictures, the proposed method generates more meaningful scene graphs than the baseline method, avoiding the baseline's tendency to always predict high-frequency predicates; it effectively alleviates the long-tail problem in the scene graph generation task and can predict more informative low-frequency predicates.
Table 1 compares the predicate mean recall (mR) of the proposed method with several existing scene graph generation methods on the three common subtasks of the VG-150 test data.
As can be seen from Table 1, the proposed method achieves the highest predicate mean recall (mR) on all three common subtasks used to evaluate scene graph generation models on the VG-150 data set.
TABLE 1
(Table 1 is given as an image in the original publication; it reports the mean recall (mR) of each compared method on the three subtasks and is not reproduced here.)
IMP corresponds to the method proposed by Danfei Xu et al. (Xu, D., Zhu, Y., Choy, C. B., & Fei-Fei, L. (2017). Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5410-5419));
MotifNet corresponds to the method proposed by Rowan Zellers et al. (Zellers, R., Yatskar, M., Thomson, S., & Choi, Y. (2018). Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5831-5840));
VCTree corresponds to the method proposed by Kaihua Tang et al. (Tang, K., Zhang, H., Wu, B., Luo, W., & Liu, W. (2019). Learning to compose dynamic tree structures for visual contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6619-6628));
TDE corresponds to the method proposed by Kaihua Tang et al. (Tang, K., Niu, Y., Huang, J., Shi, J., & Zhang, H. (2020). Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3716-3725));
PUM corresponds to the method proposed by Gengcong Yang et al. (Yang, G., Zhang, J., Zhang, Y., Wu, B., & Yang, Y. (2021). Probabilistic modeling of semantic ambiguity for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12527-12536));
BGNN corresponds to the method proposed by Rongjie Li et al. (Li, R., Zhang, S., Wan, B., & He, X. (2021). Bipartite graph network with adaptive message passing for unbiased scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11109-11119));
BA-SGG corresponds to the method proposed by Yuyu Guo et al. (Guo, Y., Gao, L., Wang, X., Hu, Y., Xu, X., Lu, X., & Song, J. (2021). From general to specific: Informative scene graph generation via balance adjustment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 16383-16392)).
Scene graph generation aims to detect the objects in an image and the relationships between them, represented as triples of the form <subject, predicate, object>. The method adopts a training strategy that decouples the feature extraction network from the classification network. It first extracts object visual features with a pre-trained backbone network; then performs target detection and re-encodes paired combinations of the extracted visual features, object position encodings, and object class encodings to obtain encoding features suited to predicate classification; then classifies predicates through a fully connected layer. After the feature extraction network is trained in this way, the fully-connected classification network is not used at inference; instead, the mean encoding feature of each predicate class is computed, and predicates are classified by the cosine similarity between the encoding feature of the sample to be classified and each class mean. Abandoning the fully connected classifier and classifying directly on predicate features avoids the problem that fully connected layer parameters are easily affected by long-tailed data, thereby improving performance on the scene graph generation task.

Claims (8)

1. An unbiased scene graph generation method based on effective feature representation is characterized by comprising the following steps:
A. collecting a scene graph generation data set, dividing the scene graph generation data set into a training set, a verification set and a test set, and carrying out image preprocessing;
B. extracting visual characteristics of the object by using a pre-trained backbone network, and sending the visual characteristics into a target detection branch to obtain the position and the category of the object;
C. respectively coding the object position and the object type obtained in the step B to obtain an object position coding characteristic and an object type coding characteristic;
D. splicing the object visual characteristics obtained in the step B with the object position coding characteristics and the object category coding characteristics obtained in the step C to obtain effective characteristic representation of the object;
E. transmitting the effective feature representations of all objects in each image obtained in step D into a relation fusion feature encoder, and pairing the encoding results pairwise to obtain a series of effective feature representations of relations;
F. transmitting the effective feature representations of relations obtained in step E into a fully connected network for classification, and calculating the classification loss so as to update the parameters of the network;
G. and after the training is converged, calculating the average value of the effective features of the relation of the samples containing the predicates in the training set by utilizing the steps A-E, calculating the cosine similarity between the effective features of the relation of the samples to be classified and the calculated average value of the effective features of the relation of each class during reasoning, and taking the class with the maximum similarity as a classification result.
2. The unbiased scene graph generation method based on effective feature representation as claimed in claim 1, characterized in that: in step A, the public data set VG-150 is adopted as the scene graph generation data set; the data set contains 108077 pictures covering 150 object classes and 50 predicate classes; the data set is divided into a training set and a test set at a ratio of 7:3, and the first 5000 pictures of the training set are taken as a verification set; during model training, the pictures undergo preprocessing operations comprising random cropping, random flipping and normalization to further enrich the training samples.
3. The unbiased scene graph generation method based on effective feature representation as claimed in claim 1, characterized in that: in step B, the backbone network adopts a ResNeXt-101-FPN network, and the target detection branch adopts a Faster R-CNN network; the visual feature of an object is a 4096-dimensional vector learned by the neural network; the position of an object is a four-dimensional vector giving the horizontal and vertical coordinates of the upper-left and lower-right corners of the object box, and the class of the object is an integer in the range [0, C_O), where C_O is the number of object classes in the data set.
4. The unbiased scene graph generation method based on effective feature representation as claimed in claim 1, characterized in that: in step C, the object position encoding feature is obtained by computing a nine-dimensional vector whose entries are: object box width / image width, object box height / image height, abscissa of the object box center / image width, ordinate of the object box center / image height, abscissa of the upper-left corner / image width, ordinate of the upper-left corner / image height, abscissa of the lower-right corner / image width, ordinate of the lower-right corner / image height, and (object box width) / (image height), which is then linearly transformed into a 128-dimensional vector; the object class encoding is a 200-dimensional vector learned by a neural network embedding layer.
5. The unbiased scene graph generation method based on effective feature representation as claimed in claim 1, characterized in that: in step D, the effective feature representation of the object is obtained by concatenating the object visual features obtained in step B with the object position encoding features and object class encoding features obtained in step C, and passing the concatenation result through a fully connected layer to convert it into a 768-dimensional vector.
6. The unbiased scene graph generation method based on effective feature representation as claimed in claim 1, characterized in that: in step E, said obtaining a series of effective feature representations of relations comprises:
E1. the relation fusion encoder consists of a series of Transformer encoders and two fusion strategies are added;
the first part of the two fusion strategies is a fusion operation on the input of the Transformer encoders; specifically, the input of the 1st Transformer encoder is the effective feature representation of the object obtained in step D, and each subsequent encoder, except the (M+1)-th, takes the output of the previous encoder as input; to prevent the effective feature representation of the object from being forgotten during encoding, the input of the (M+1)-th Transformer encoder is instead the fusion of the output of the M-th Transformer encoder with the effective feature representation of the object, computed as follows:
X_{M+1} = (X_1 + Y_M) W_in + b_in
where X_{M+1} is the input of the (M+1)-th Transformer encoder; X_1, the input of the 1st Transformer encoder, is the effective feature representation of the object; Y_M is the output of the M-th Transformer encoder; and W_in and b_in are the matrix and vector of a linear transformation;
the second part of the two fusion strategies is a fusion operation performed on the output of every Transformer encoder, so that the encoding result contains multi-level features; it is computed as:
(the equation is given as an image in the original publication and is not reproduced here)
where Y denotes the fusion result of the outputs of the Transformer encoders and M + N is the number of Transformer encoders;
E2. computing the fused encoding result of the Transformer encoders for every object, and concatenating the results pairwise to obtain a series of effective feature representations of relations; specifically, for a pair of objects <s, o>, the effective feature representation F_{s,o} of the relation between the two objects is computed as:
F_{s,o} = cat(Y_s W_out + b_out, Y_o W_out + b_out)
where cat(·,·) denotes vector concatenation, Y_s and Y_o are the fused Transformer encoding results for object s and object o respectively, and W_out and b_out are the matrix and vector of a linear transformation; the resulting effective feature representation F_{s,o} of the relation is a 768-dimensional vector.
7. The unbiased scene graph generation method based on effective feature representation as claimed in claim 1, characterized in that: in step F, the effective feature representations of relations obtained in step E are transmitted to a fully connected network for classification, and the classification loss is calculated to update the parameters of the network.
8. The unbiased scene graph generation method based on effective feature representation as claimed in claim 1, characterized in that: in step G, the following substeps are included:
G1. to avoid the problem that the fully connected layer parameters tend to favour classes with many samples, all parameters of the whole network are frozen once the training of steps A-F has converged; for each predicate class, the mean of the effective feature representations of the relations in the training set that contain that predicate is computed using steps A-E, giving C_R relation feature means, where C_R is the number of predicate classes in the data set; the relation feature mean μ_i of the i-th predicate class r_i is computed as:
μ_i = (1/n_i) Σ_{<s,o>} 1(p_{s,o} = r_i) · F_{s,o}
where n_i is the number of training samples containing the i-th predicate, p_{s,o} is the ground-truth predicate between the object pair <s, o>, and 1(·) is the indicator function, equal to 1 when its argument is true and 0 otherwise;
G2. in the model inference stage, the effective relation feature F̂_{s,o} of the sample to be classified is computed according to steps A-E; its cosine similarity to each of the relation feature means computed in step G1 is evaluated one by one, and the class with the largest similarity is taken as the inference-time classification result ĉ, i.e. ĉ = argmax_i cos(F̂_{s,o}, μ_i).
CN202211506846.3A 2022-11-29 2022-11-29 Unbiased scene graph generation method based on effective feature representation Pending CN115861779A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211506846.3A CN115861779A (en) 2022-11-29 2022-11-29 Unbiased scene graph generation method based on effective feature representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211506846.3A CN115861779A (en) 2022-11-29 2022-11-29 Unbiased scene graph generation method based on effective feature representation

Publications (1)

Publication Number Publication Date
CN115861779A true CN115861779A (en) 2023-03-28

Family

ID=85667472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211506846.3A Pending CN115861779A (en) 2022-11-29 2022-11-29 Unbiased scene graph generation method based on effective feature representation

Country Status (1)

Country Link
CN (1) CN115861779A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333744A (en) * 2023-09-21 2024-01-02 南通大学 Unbiased scene graph generation method based on spatial feature fusion and prototype embedding
CN117333744B (en) * 2023-09-21 2024-05-28 南通大学 Unbiased scene graph generation method based on spatial feature fusion and prototype embedding

Similar Documents

Publication Publication Date Title
CN108229550B (en) Cloud picture classification method based on multi-granularity cascade forest network
CN111325236B (en) Ultrasonic image classification method based on convolutional neural network
CN112347888B (en) Remote sensing image scene classification method based on bi-directional feature iterative fusion
CN110570433B (en) Image semantic segmentation model construction method and device based on generation countermeasure network
CN110555841B (en) SAR image change detection method based on self-attention image fusion and DEC
CN113361373A (en) Real-time semantic segmentation method for aerial image in agricultural scene
CN114120041B (en) Small sample classification method based on double-countermeasure variable self-encoder
CN109871749B (en) Pedestrian re-identification method and device based on deep hash and computer system
CN113688941A (en) Small sample sonar image classification, identification and optimization method based on generation of countermeasure network
CN116206185A (en) Lightweight small target detection method based on improved YOLOv7
CN115861779A (en) Unbiased scene graph generation method based on effective feature representation
CN112905828A (en) Image retriever, database and retrieval method combined with significant features
CN112733693A (en) Multi-scale residual error road extraction method for global perception high-resolution remote sensing image
CN115565019A (en) Single-channel high-resolution SAR image ground object classification method based on deep self-supervision generation countermeasure
Wang et al. Generative adversarial network based on resnet for conditional image restoration
CN115170943A (en) Improved visual transform seabed substrate sonar image classification method based on transfer learning
CN114675249A (en) Attention mechanism-based radar signal modulation mode identification method
Xie et al. Co-compression via superior gene for remote sensing scene classification
CN112560034B (en) Malicious code sample synthesis method and device based on feedback type deep countermeasure network
CN114168782B (en) Deep hash image retrieval method based on triplet network
CN115965968A (en) Small sample target detection and identification method based on knowledge guidance
CN113343924B (en) Modulation signal identification method based on cyclic spectrum characteristics and generation countermeasure network
CN112966544B (en) Radar radiation source signal classification and identification method adopting ICGAN and ResNet networks
Yang et al. Relative entropy multilevel thresholding method based on genetic optimization
CN112991257B (en) Heterogeneous remote sensing image change rapid detection method based on semi-supervised twin network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination