CN115861779A - Unbiased scene graph generation method based on effective feature representation - Google Patents

Unbiased scene graph generation method based on effective feature representation

Info

Publication number
CN115861779A
CN115861779A (application CN202211506846.3A)
Authority
CN
China
Prior art keywords
network
scene graph
effective
classification
graph generation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211506846.3A
Other languages
Chinese (zh)
Inventor
王菡子
马文熙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202211506846.3A priority Critical patent/CN115861779A/en
Publication of CN115861779A publication Critical patent/CN115861779A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

An unbiased scene graph generation method based on effective feature representation relates to computer vision. A training strategy that decouples the feature extraction network from the classification network is adopted. Visual features of objects are first extracted with a pre-trained backbone network; after target detection, the extracted visual features, object position encodings, and object class encodings are paired, combined, and re-encoded to obtain encoding features suited to predicate classification; predicates are then classified through a fully connected layer, and the feature extraction network is trained on this loss. At inference time the fully-connected classification network is not used: the mean encoding feature of each predicate class is computed, and a sample is classified by the cosine similarity between its encoding feature and each class mean. Abandoning the fully connected classifier and classifying directly on predicate features avoids the problem that fully connected layer parameters are easily affected by long-tailed data, thereby improving performance on the scene graph generation task.

Description

Unbiased scene graph generation method based on effective feature representation
Technical Field
The invention relates to computer vision technology, and in particular to an unbiased scene graph generation method based on effective feature representation.
Background
In recent years, scene graph generation algorithms based on deep learning have made significant progress. However, the data sets used to train scene graph generation models suffer from a severe long-tail problem: the number of samples of a few head predicates far exceeds that of the body and tail predicates. This extreme imbalance between classes heavily biases the model's predictions, giving it a strong tendency to predict head predicates and preventing it from learning relationship prediction well. Solving the poor algorithm performance caused by long-tailed data is therefore very important for the scene graph generation task.
A natural way to improve predicate classification accuracy in the scene graph generation task is to improve the network, increasing its capacity so that it extracts better predicate features. Early SGG methods therefore focused on building better feature extraction networks. Guojun Yin et al. (Yin, G., Sheng, L., Liu, B., Yu, N., Wang, X., Shao, J., & Loy, C. C. (2018). Zoom-Net: Mining deep feature interactions for visual relationship recognition. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 322-338)) exploit local feature interactions to improve scene graph generation. Kaihua Tang et al. (Tang, K., Zhang, H., Wu, B., Luo, W., & Liu, W. (2019). Learning to compose dynamic tree structures for visual contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6619-6628)) learn global visual context by message passing with a standard recurrent neural network structure or a variant of it. Jianwei Yang et al. (Yang, J., Lu, J., Lee, S., Batra, D., & Parikh, D. (2018). Graph R-CNN for scene graph generation. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 670-685)) propose to generate the final sparse scene graph by pruning an initially dense, complete scene graph. These works concentrate on improving the structure of the feature extraction network and neglect the long-tail problem, which is a major factor limiting training effectiveness. Accordingly, Kaihua Tang et al. (Tang, K., Niu, Y., Huang, J., Shi, J., & Zhang, H. (2020). Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3716-3725)) remove bias through counterfactual causal analysis during training. Yuyu Guo et al. (Guo, Y., Gao, L., Wang, X., Hu, Y., Xu, X., Lu, X., & Song, J. (2021). From general to specific: Informative scene graph generation via balance adjustment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 16383-16392)) adopt a two-step training strategy that alleviates the long-tail problem by fine-tuning part of the network parameters on a constructed balanced training domain.
Disclosure of Invention
To address the problems in the prior art, the invention aims to solve the poor relationship detection performance caused by long-tailed training data in the scene graph generation task, and provides an unbiased scene graph generation method based on effective feature representation.
The invention comprises the following steps:
A. collecting a scene graph generation data set, dividing the scene graph generation data set into a training set, a verification set and a test set, and carrying out image preprocessing;
B. extracting visual characteristics of the object by using a pre-trained backbone network, and sending the visual characteristics into a target detection branch to obtain the position and the category of the object;
C. encoding the object position and the object class obtained in step B, respectively, to obtain object position encoding features and object class encoding features;
D. concatenating the object visual features obtained in step B with the object position encoding features and object class encoding features obtained in step C to obtain the effective feature representation of the object;
E. feeding the effective feature representations of all objects in each image obtained in step D into a relation fusion feature encoder, and pairing the encoding results pairwise to obtain a series of effective feature representations of relations;
F. feeding the effective feature representations of relations obtained in step E into a fully connected network for classification, and computing the classification loss so as to update the parameters of the network;
G. after training converges, for each predicate class, computing the mean of the effective relation features of the training samples containing that predicate using steps A to E; at inference, computing the cosine similarity between the effective relation feature of the sample to be classified and each computed class mean, and taking the class with the largest similarity as the classification result.
In step A, the scene graph generation data set is the public data set VG-150 (Xu, D., Zhu, Y., Choy, C. B., & Fei-Fei, L. (2017). Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5410-5419)). The data set contains 108,077 pictures covering 150 object classes and 50 predicate classes. It is divided into a training set and a test set at a ratio of 7:3, with the first 5,000 pictures of the training set used as a verification set. During model training, the pictures undergo preprocessing operations such as random cropping, random flipping and normalization to further enrich the training samples.
In step B, the backbone network is a ResNeXt-101-FPN network (Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1492-1500); Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2117-2125)), and the target detection branch is a Faster R-CNN network (Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28). The visual feature of an object is a 4096-dimensional vector produced by the neural network. The position of an object is a four-dimensional vector giving the horizontal and vertical coordinates of the upper-left and lower-right corners of the object box, and the class of the object is an integer in the range [0, C_O), where C_O is the number of object classes in the data set.
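As a rough, non-authoritative illustration of step B, the sketch below uses torchvision's stock fasterrcnn_resnet50_fpn as a stand-in for the ResNeXt-101-FPN + Faster R-CNN detector described above; the model choice, tensor shapes, and variable names are assumptions for illustration only.

```python
# Rough sketch of step B: run a pre-trained detector to obtain object boxes
# and class labels. torchvision's fasterrcnn_resnet50_fpn is used here only
# as a stand-in for the ResNeXt-101-FPN + Faster R-CNN detector described
# in the patent; it is not the same backbone.
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

image = torch.rand(3, 600, 800)            # dummy RGB image tensor in [0, 1]
with torch.no_grad():
    detections = detector([image])[0]      # dict with 'boxes', 'labels', 'scores'

boxes = detections["boxes"]                # (N, 4): x1, y1, x2, y2 per detected object
labels = detections["labels"]              # (N,): predicted object classes
# The 4096-dimensional per-object visual feature would come from the detector's
# ROI box head (e.g., via a forward hook); that plumbing is omitted here.
```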
In step C, the position encoding feature of an object is obtained as follows: first a nine-dimensional vector is computed whose entries are: object box width / image width, object box height / image height, abscissa of the object box center / image width, ordinate of the object box center / image height, abscissa of the upper-left corner / image width, ordinate of the upper-left corner / image height, abscissa of the lower-right corner / image width, ordinate of the lower-right corner / image height, and (object box width) / (image height); this vector is then linearly transformed into a 128-dimensional vector. The class encoding of an object is a 200-dimensional vector learned by a neural network embedding layer.
In step D, the effective feature representation of an object is obtained by first concatenating the object visual feature from step B with the object position encoding feature and object class encoding feature from step C, and then passing the concatenation through a fully connected layer to obtain a 768-dimensional vector.
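The following is a minimal sketch of steps C and D under the dimensions stated above (a nine-dimensional box geometry projected to 128 dimensions, a 200-dimensional class embedding, a 4096-dimensional visual feature, all concatenated and projected to 768 dimensions); the module and variable names are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

def box_geometry(box, img_w, img_h):
    """Nine-dimensional geometric encoding of a box (x1, y1, x2, y2), as listed in step C."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2.0, y1 + h / 2.0
    return torch.tensor([w / img_w, h / img_h, cx / img_w, cy / img_h,
                         x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                         w / img_h])        # ninth entry exactly as stated in the text

class ObjectFeature(nn.Module):
    """Steps C and D: fuse visual, position, and class information into a 768-d feature."""
    def __init__(self, num_classes=150):
        super().__init__()
        self.pos_fc = nn.Linear(9, 128)                   # position encoding (step C)
        self.cls_emb = nn.Embedding(num_classes, 200)     # class encoding (step C)
        self.fuse_fc = nn.Linear(4096 + 128 + 200, 768)   # concatenation + projection (step D)

    def forward(self, visual_feat, box, label, img_w, img_h):
        pos = self.pos_fc(box_geometry(box, img_w, img_h))
        cls = self.cls_emb(label)
        return self.fuse_fc(torch.cat([visual_feat, pos, cls], dim=-1))
```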
In step E, the following substeps are further included:
E1. the relation fusion encoder consists of a series of Transformer encoders (Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30), with two fusion strategies added;
E2. the first part of the two fusion strategies is a fusion operation on the input of the Transformer encoders; specifically, the input of the 1st Transformer encoder is the effective feature representation of the object obtained in step D, and each subsequent encoder, except the (M+1)-th, takes the output of the previous encoder as input; to prevent the effective feature representation of the object from being forgotten during encoding, the input of the (M+1)-th Transformer encoder is instead the fusion of the output of the M-th Transformer encoder with the effective feature representation of the object, computed as follows:
X_{M+1} = (X_1 + Y_M) W_in + b_in   (formula one)
where X_{M+1} is the input of the (M+1)-th Transformer encoder; X_1, the input of the 1st Transformer encoder, is the effective feature representation of the object; Y_M is the output of the M-th Transformer encoder; and W_in and b_in are the matrix and vector of a linear transformation;
E3. the second part of the two fusion strategies is a fusion operation performed on the output of every Transformer encoder, so that the encoding result contains multi-level features; it is computed as:
(formula two; the equation is given as an image in the original publication and is not reproduced here)
where Y denotes the fusion result of the outputs of the Transformer encoders and M + N is the number of Transformer encoders;
E4. the fused encoding result of the Transformer encoders is computed for every object, and the results are concatenated pairwise to obtain a series of effective feature representations of relations; specifically, for a pair of objects <s, o>, the effective feature representation F_{s,o} of the relation between the two objects is computed as:
F_{s,o} = cat(Y_s W_out + b_out, Y_o W_out + b_out)   (formula three)
where cat(·,·) denotes vector concatenation, Y_s and Y_o are the fused Transformer encoding results for object s and object o respectively, and W_out and b_out are the matrix and vector of a linear transformation; the resulting effective feature representation F_{s,o} of the relation is a 768-dimensional vector.
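The sketch below illustrates the two fusion strategies and the pairwise pairing of step E (formulas one to three). Because the exact form of formula two is not reproduced in this text, the output fusion is approximated by a plain mean over encoder outputs, and W_out is assumed to halve the feature width so that the concatenated relation feature is again 768-dimensional; these choices and all names are assumptions.

```python
import torch
import torch.nn as nn

class RelationFusionEncoder(nn.Module):
    """Stack of M + N Transformer encoders implementing the two fusion strategies of step E."""
    def __init__(self, dim=768, m=3, n=3, heads=8):
        super().__init__()
        self.m = m
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
             for _ in range(m + n)])
        self.in_fc = nn.Linear(dim, dim)        # W_in, b_in of formula one
        self.out_fc = nn.Linear(dim, dim // 2)  # W_out, b_out of formula three (halves the width)

    def forward(self, obj_feats):               # obj_feats: (num_objects, dim) for one image
        x = obj_feats.unsqueeze(0)              # add a batch dimension
        outputs = []
        for i, layer in enumerate(self.layers):
            if i == self.m:                     # formula one: re-inject the object features
                x = self.in_fc(obj_feats.unsqueeze(0) + outputs[-1])
            x = layer(x)
            outputs.append(x)
        # Formula two fuses the outputs of all encoders; the exact form is not
        # reproduced in the text, so a plain mean over layers is assumed here.
        y = torch.stack(outputs).mean(dim=0).squeeze(0)   # (num_objects, dim)
        return self.out_fc(y)                   # (num_objects, dim // 2)

def pairwise_relation_features(encoded):
    """Formula three: concatenate the encoded features of every ordered object pair <s, o>."""
    n = encoded.size(0)
    pairs = [(s, o) for s in range(n) for o in range(n) if s != o]
    feats = torch.stack([torch.cat([encoded[s], encoded[o]]) for s, o in pairs])
    return feats, pairs                          # feats: (num_pairs, 768)
```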
In step F, to allow the network to update its parameters, the effective feature representations of relations obtained in step E are fed into a fully connected network for predicate classification, and the gradients of the cross-entropy loss of predicate classification are back-propagated to update the parameters of the feature extraction part of the network.
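A minimal sketch of the training-time classification of step F, assuming the 50 predicate classes of VG-150 and a single fully connected head trained with cross-entropy; the function and variable names are illustrative.

```python
import torch
import torch.nn as nn

num_predicates = 50                               # VG-150 predicate classes
predicate_head = nn.Linear(768, num_predicates)   # fully connected classifier used only for training
criterion = nn.CrossEntropyLoss()

def training_step(relation_feats, predicate_labels, optimizer):
    """relation_feats: (num_pairs, 768); predicate_labels: (num_pairs,) long tensor."""
    logits = predicate_head(relation_feats)
    loss = criterion(logits, predicate_labels)
    optimizer.zero_grad()
    loss.backward()        # gradients flow back into the feature extraction network
    optimizer.step()
    return loss.item()
```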
In step G, the following substeps are further included:
G1. to avoid the problem that the fully connected layer parameters tend to favour classes with many samples, all parameters of the whole network are frozen once the training of steps A-F has converged; then, for each predicate class, the mean of the effective feature representations of the relations in the training set that contain that predicate is computed using steps A-E, giving C_R relation feature means, where C_R is the number of predicate classes in the data set; the relation feature mean μ_i of the i-th predicate class r_i is computed as:
μ_i = (1/n_i) Σ_{<s,o>} 1(p_{s,o} = r_i) · F_{s,o}   (formula four)
where n_i is the number of training samples containing the i-th predicate, p_{s,o} is the ground-truth predicate between the object pair <s, o>, and 1(·) is the indicator function, equal to 1 when its argument is true and 0 otherwise;
G2. in the model inference stage, the effective relation feature F̂_{s,o} of the sample to be classified is first computed according to steps A-E; its cosine similarity to each of the relation feature means computed in step G1 is then evaluated, and the class with the largest similarity is taken as the inference-time classification result ĉ, i.e. ĉ = argmax_i cos(F̂_{s,o}, μ_i).
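A short sketch of step G under the formulas above: per-predicate mean features are computed once over the training set after the network is frozen, and inference returns the class whose mean is most cosine-similar to the sample feature; variable names and the looping strategy are assumptions.

```python
import torch
import torch.nn.functional as F

def predicate_means(relation_feats, predicate_labels, num_predicates=50):
    """Formula four: mean effective relation feature of each predicate class,
    computed once over the training set after the network has converged and been frozen."""
    means = torch.zeros(num_predicates, relation_feats.size(1))
    for i in range(num_predicates):
        mask = predicate_labels == i
        if mask.any():
            means[i] = relation_feats[mask].mean(dim=0)
    return means

def classify_by_cosine(sample_feat, means):
    """Inference: pick the predicate whose class mean is most cosine-similar to the sample."""
    sims = F.cosine_similarity(sample_feat.unsqueeze(0), means, dim=1)
    return int(sims.argmax())
```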
the invention provides an unbiased scene graph generation method based on effective feature representation. Therefore, the whole training of the scene graph generation network is firstly carried out, then the classifier in the form of the full connection layer in the original network is discarded during reasoning, and the predicate classification is carried out by using a feature cosine similarity matching method instead. The method requires that the predicate characteristics learned by the network can practically and accurately represent the relation between the main predicates in each relation, and a relation characteristic fusion encoder is provided for performing multi-level fusion operation of the predicate characteristics, so that more practical and effective relation characteristic expression is obtained. By the method, the problem that the classifier learning is biased due to long-tail data can be effectively solved, and the performance of generating the scene graph is effectively improved.
Drawings
Fig. 1 is a diagram of the entire network structure according to the embodiment of the present invention.
Fig. 2 compares scene graphs generated by a baseline method and by the method of the present invention on several pictures randomly drawn from the scene graph generation data set VG-150.
Detailed Description
The present invention is further described below with reference to an embodiment; the present application is not limited to this embodiment.
Referring to fig. 1, an implementation of an embodiment of the invention includes the steps of:
A. A scene graph generation data set is collected, divided into a training set, a verification set and a test set, and image preprocessing is performed. The specific method is as follows: the invention adopts the public data set VG-150 (Xu, D., Zhu, Y., Choy, C. B., & Fei-Fei, L. (2017). Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5410-5419)). The data set contains 108,077 pictures covering 150 object classes and 50 predicate classes. It is divided into a training set and a test set at a ratio of 7:3, with the first 5,000 pictures of the training set used as a verification set. During model training, the pictures undergo preprocessing operations such as random cropping, random flipping and normalization to further enrich the training samples.
B. The visual features of objects are extracted with the pre-trained backbone network and sent to the target detection branch to obtain object positions and classes. The specific method is as follows: the backbone network is a ResNeXt-101-FPN network (Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1492-1500); Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2117-2125)), and the target detection branch adopts a Faster R-CNN network (Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28). The visual feature of an object is a 4096-dimensional vector produced by the neural network. The position of an object is a four-dimensional vector giving the horizontal and vertical coordinates of the upper-left and lower-right corners of the object box, and the class of the object is an integer in the range [0, C_O), where C_O is the number of object classes in the data set.
C. The position and class of each object obtained in step B are encoded separately to obtain the object position encoding feature and the object class encoding feature. The specific method is as follows: the position encoding feature of an object is obtained by first computing a nine-dimensional vector whose entries are: object box width / image width, object box height / image height, abscissa of the object box center / image width, ordinate of the object box center / image height, abscissa of the upper-left corner / image width, ordinate of the upper-left corner / image height, abscissa of the lower-right corner / image width, ordinate of the lower-right corner / image height, and (object box width) / (image height); this vector is then linearly transformed into a 128-dimensional vector. The class encoding of an object is a 200-dimensional vector learned by a neural network embedding layer.
D. The object visual features obtained in step B are concatenated with the object position encoding features and object class encoding features obtained in step C to obtain the effective feature representation of the object. The specific method is as follows: the object visual feature from step B is concatenated with the object position encoding feature and object class encoding feature from step C, and the concatenation is passed through a fully connected layer to obtain a 768-dimensional vector.
E. The effective feature representations of all objects in each image obtained in step D are fed into the relation fusion feature encoder, and the encoding results are paired pairwise to obtain a series of effective feature representations of relations. The specific method comprises the following substeps:
E1. the relation fusion encoder consists of a series of Transformer encoders (Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30), with two fusion strategies added.
E2. The first part of the two fusion strategies is a fusion operation on the input of the Transformer encoders. Specifically, the input of the 1st Transformer encoder is the effective feature representation of the object obtained in step D, and each subsequent encoder, except the (M+1)-th, takes the output of the previous encoder as input. To prevent the effective feature representation of the object from being forgotten during encoding, the input of the (M+1)-th Transformer encoder is instead the fusion of the output of the M-th Transformer encoder with the effective feature representation of the object, computed as follows:
X_{M+1} = (X_1 + Y_M) W_in + b_in   (formula one)
where X_{M+1} is the input of the (M+1)-th Transformer encoder; X_1, the input of the 1st Transformer encoder, is the effective feature representation of the object; Y_M is the output of the M-th Transformer encoder; and W_in and b_in are the matrix and vector of a linear transformation.
E3. The second part of the two fusion strategies is a fusion operation performed on the output of every Transformer encoder, so that the encoding result contains multi-level features. It is computed as:
(formula two; the equation is given as an image in the original publication and is not reproduced here)
where Y denotes the fusion result of the outputs of the Transformer encoders and M + N is the number of Transformer encoders.
E4. The fused encoding result of the Transformer encoders is computed for every object, and the results are concatenated pairwise to obtain a series of effective feature representations of relations. Specifically, for a pair of objects <s, o>, the effective feature representation F_{s,o} of the relation between the two objects is computed as:
F_{s,o} = cat(Y_s W_out + b_out, Y_o W_out + b_out)   (formula three)
where cat(·,·) denotes vector concatenation, Y_s and Y_o are the fused Transformer encoding results for object s and object o respectively, and W_out and b_out are the matrix and vector of a linear transformation; the resulting effective feature representation F_{s,o} of the relation is a 768-dimensional vector.
F. The effective feature representations of relations obtained in step E are fed into a fully connected network for classification, and the classification loss is computed to update the network parameters. The specific method is as follows: to allow the network to update its parameters, the effective feature representations of relations obtained in step E are fed into a fully connected network for predicate classification, and the gradients of the cross-entropy loss of predicate classification are back-propagated to update the parameters of the feature extraction part of the network.
G. After training converges, for each predicate class, the mean of the effective relation features of the training samples containing that predicate is computed using steps A-E; at inference, the cosine similarity between the effective relation feature of the sample to be classified and each computed class mean is evaluated, and the class with the largest similarity is taken as the classification result. The specific method comprises the following substeps:
G1. To avoid the problem that the fully connected layer parameters tend to favour classes with many samples, all parameters of the whole network are frozen once the training of steps A-F has converged. Then, for each predicate class, the mean of the effective feature representations of the relations in the training set that contain that predicate is computed using steps A-E, giving C_R relation feature means, where C_R is the number of predicate classes in the data set; the relation feature mean μ_i of the i-th predicate class r_i is computed as:
μ_i = (1/n_i) Σ_{<s,o>} 1(p_{s,o} = r_i) · F_{s,o}   (formula four)
where n_i is the number of training samples containing the i-th predicate, p_{s,o} is the ground-truth predicate between the object pair <s, o>, and 1(·) is the indicator function, equal to 1 when its argument is true and 0 otherwise.
G2. In the model inference stage, the effective relation feature F̂_{s,o} of the sample to be classified is first computed according to steps A-E; its cosine similarity to each of the relation feature means computed in step G1 is then evaluated, and the class with the largest similarity is taken as the inference-time classification result ĉ, i.e. ĉ = argmax_i cos(F̂_{s,o}, μ_i).
As shown in Fig. 2, given a number of pictures, the proposed method generates more meaningful scene graphs than the baseline method, avoiding the baseline's tendency to always predict high-frequency predicates; it effectively alleviates the long-tail problem in the scene graph generation task and can predict more informative low-frequency predicates.
Table 1 compares the predicate mean recall (mR) of the proposed method with several existing scene graph generation methods on the three common subtasks of the VG-150 test data.
As can be seen from Table 1, the proposed method achieves the highest predicate mean recall (mR) on all three common subtasks used to evaluate scene graph generation models on the VG-150 data set.
TABLE 1
(Table 1 is given as an image in the original publication; it reports the mean recall (mR) of each compared method on the three subtasks and is not reproduced here.)
IMP corresponds to the method proposed by Danfei Xu et al. (Xu, D., Zhu, Y., Choy, C. B., & Fei-Fei, L. (2017). Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5410-5419));
MotifNet corresponds to the method proposed by Rowan Zellers et al. (Zellers, R., Yatskar, M., Thomson, S., & Choi, Y. (2018). Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5831-5840));
VCTree corresponds to the method proposed by Kaihua Tang et al. (Tang, K., Zhang, H., Wu, B., Luo, W., & Liu, W. (2019). Learning to compose dynamic tree structures for visual contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6619-6628));
TDE corresponds to the method proposed by Kaihua Tang et al. (Tang, K., Niu, Y., Huang, J., Shi, J., & Zhang, H. (2020). Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3716-3725));
PUM corresponds to the method proposed by Gengcong Yang et al. (Yang, G., Zhang, J., Zhang, Y., Wu, B., & Yang, Y. (2021). Probabilistic modeling of semantic ambiguity for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12527-12536));
BGNN corresponds to the method proposed by Rongjie Li et al. (Li, R., Zhang, S., Wan, B., & He, X. (2021). Bipartite graph network with adaptive message passing for unbiased scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11109-11119));
BA-SGG corresponds to the method proposed by Yuyu Guo et al. (Guo, Y., Gao, L., Wang, X., Hu, Y., Xu, X., Lu, X., & Song, J. (2021). From general to specific: Informative scene graph generation via balance adjustment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 16383-16392)).
Scene graph generation aims to detect the objects in an image and the relationships between them, represented as triples of the form <subject, predicate, object>. The method adopts a training strategy that decouples the feature extraction network from the classification network. It first extracts object visual features with a pre-trained backbone network; then performs target detection and re-encodes paired combinations of the extracted visual features, object position encodings, and object class encodings to obtain encoding features suited to predicate classification; then classifies predicates through a fully connected layer. After the feature extraction network is trained in this way, the fully-connected classification network is not used at inference; instead, the mean encoding feature of each predicate class is computed, and predicates are classified by the cosine similarity between the encoding feature of the sample to be classified and each class mean. Abandoning the fully connected classifier and classifying directly on predicate features avoids the problem that fully connected layer parameters are easily affected by long-tailed data, thereby improving performance on the scene graph generation task.

Claims (8)

1. An unbiased scene graph generation method based on effective feature representation is characterized by comprising the following steps:
A. collecting a scene graph generation data set, dividing the scene graph generation data set into a training set, a verification set and a test set, and carrying out image preprocessing;
B. extracting visual characteristics of the object by using a pre-trained backbone network, and sending the visual characteristics into a target detection branch to obtain the position and the category of the object;
C. respectively coding the object position and the object type obtained in the step B to obtain an object position coding characteristic and an object type coding characteristic;
D. splicing the object visual characteristics obtained in the step B with the object position coding characteristics and the object category coding characteristics obtained in the step C to obtain effective characteristic representation of the object;
E. transmitting the effective feature representations of all objects in each image obtained in step D into a relation fusion feature encoder, and pairing the encoding results pairwise to obtain a series of effective feature representations of relations;
F. transmitting the effective feature representations of relations obtained in step E into a fully connected network for classification, and calculating the classification loss so as to update the parameters of the network;
G. and after the training is converged, calculating the average value of the effective features of the relation of the samples containing the predicates in the training set by utilizing the steps A-E, calculating the cosine similarity between the effective features of the relation of the samples to be classified and the calculated average value of the effective features of the relation of each class during reasoning, and taking the class with the maximum similarity as a classification result.
2. The unbiased scene graph generation method based on effective feature representation as claimed in claim 1, characterized in that: in step A, the public data set VG-150 is adopted as the scene graph generation data set; the data set contains 108077 pictures covering 150 object classes and 50 predicate classes; the data set is divided into a training set and a test set at a ratio of 7:3, and the first 5000 pictures of the training set are taken as a verification set; during model training, the pictures undergo preprocessing operations comprising random cropping, random flipping and normalization to further enrich the training samples.
3. The unbiased scene graph generation method based on effective feature representation as claimed in claim 1, characterized in that: in step B, the backbone network adopts a ResNeXt-101-FPN network, and the target detection branch adopts a Faster R-CNN network; the visual feature of an object is a 4096-dimensional vector learned by the neural network; the position of an object is a four-dimensional vector giving the horizontal and vertical coordinates of the upper-left and lower-right corners of the object box, and the class of the object is an integer in the range [0, C_O), where C_O is the number of object classes in the data set.
4. The unbiased scene graph generation method based on effective feature representation as claimed in claim 1, characterized in that: in step C, the object position encoding feature is obtained by computing a nine-dimensional vector whose entries are: object box width / image width, object box height / image height, abscissa of the object box center / image width, ordinate of the object box center / image height, abscissa of the upper-left corner / image width, ordinate of the upper-left corner / image height, abscissa of the lower-right corner / image width, ordinate of the lower-right corner / image height, and (object box width) / (image height), which is then linearly transformed into a 128-dimensional vector; the object class encoding is a 200-dimensional vector learned by a neural network embedding layer.
5. The unbiased scene graph generation method based on effective feature representation as claimed in claim 1, characterized in that: in step D, the effective feature representation of the object is obtained by concatenating the object visual features obtained in step B with the object position encoding features and object class encoding features obtained in step C, and passing the concatenation result through a fully connected layer to convert it into a 768-dimensional vector.
6. The unbiased scene graph generation method based on effective feature representation as claimed in claim 1, characterized in that: in step E, said obtaining a series of effective feature representations of relations comprises:
E1. the relation fusion encoder consists of a series of Transformer encoders and two fusion strategies are added;
the first part of the two fusion strategies is a fusion operation on the input of the Transformer encoders; specifically, the input of the 1st Transformer encoder is the effective feature representation of the object obtained in step D, and each subsequent encoder, except the (M+1)-th, takes the output of the previous encoder as input; to prevent the effective feature representation of the object from being forgotten during encoding, the input of the (M+1)-th Transformer encoder is instead the fusion of the output of the M-th Transformer encoder with the effective feature representation of the object, computed as follows:
X_{M+1} = (X_1 + Y_M) W_in + b_in
where X_{M+1} is the input of the (M+1)-th Transformer encoder; X_1, the input of the 1st Transformer encoder, is the effective feature representation of the object; Y_M is the output of the M-th Transformer encoder; and W_in and b_in are the matrix and vector of a linear transformation;
the second part of the two fusion strategies is a fusion operation performed on the output of every Transformer encoder, so that the encoding result contains multi-level features; it is computed as:
(the equation is given as an image in the original publication and is not reproduced here)
where Y denotes the fusion result of the outputs of the Transformer encoders and M + N is the number of Transformer encoders;
E2. computing the fused encoding result of the Transformer encoders for every object, and concatenating the results pairwise to obtain a series of effective feature representations of relations; specifically, for a pair of objects <s, o>, the effective feature representation F_{s,o} of the relation between the two objects is computed as:
F_{s,o} = cat(Y_s W_out + b_out, Y_o W_out + b_out)
where cat(·,·) denotes vector concatenation, Y_s and Y_o are the fused Transformer encoding results for object s and object o respectively, and W_out and b_out are the matrix and vector of a linear transformation; the resulting effective feature representation F_{s,o} of the relation is a 768-dimensional vector.
7. The unbiased scene graph generation method based on effective feature representation as claimed in claim 1, characterized in that: in step F, the effective feature representations of relations obtained in step E are transmitted to a fully connected network for classification, and the classification loss is calculated to update the parameters of the network.
8. The unbiased scene graph generation method based on effective feature representation as claimed in claim 1, characterized in that: in step G, the following substeps are included:
G1. to avoid the problem that the fully connected layer parameters tend to favour classes with many samples, all parameters of the whole network are frozen once the training of steps A-F has converged; for each predicate class, the mean of the effective feature representations of the relations in the training set that contain that predicate is computed using steps A-E, giving C_R relation feature means, where C_R is the number of predicate classes in the data set; the relation feature mean μ_i of the i-th predicate class r_i is computed as:
μ_i = (1/n_i) Σ_{<s,o>} 1(p_{s,o} = r_i) · F_{s,o}
where n_i is the number of training samples containing the i-th predicate, p_{s,o} is the ground-truth predicate between the object pair <s, o>, and 1(·) is the indicator function, equal to 1 when its argument is true and 0 otherwise;
G2. in the model inference stage, the effective relation feature F̂_{s,o} of the sample to be classified is computed according to steps A-E; its cosine similarity to each of the relation feature means computed in step G1 is evaluated one by one, and the class with the largest similarity is taken as the inference-time classification result ĉ, i.e. ĉ = argmax_i cos(F̂_{s,o}, μ_i).
CN202211506846.3A 2022-11-29 2022-11-29 Unbiased scene graph generation method based on effective feature representation Pending CN115861779A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211506846.3A CN115861779A (en) 2022-11-29 2022-11-29 Unbiased scene graph generation method based on effective feature representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211506846.3A CN115861779A (en) 2022-11-29 2022-11-29 Unbiased scene graph generation method based on effective feature representation

Publications (1)

Publication Number Publication Date
CN115861779A true CN115861779A (en) 2023-03-28

Family

ID=85667472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211506846.3A Pending CN115861779A (en) 2022-11-29 2022-11-29 Unbiased scene graph generation method based on effective feature representation

Country Status (1)

Country Link
CN (1) CN115861779A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333744A (en) * 2023-09-21 2024-01-02 南通大学 Unbiased scene graph generation method based on spatial feature fusion and prototype embedding
CN117333744B (en) * 2023-09-21 2024-05-28 南通大学 Unbiased scene graph generation method based on spatial feature fusion and prototype embedding

Similar Documents

Publication Publication Date Title
CN108229550B (en) Cloud picture classification method based on multi-granularity cascade forest network
CN111325236B (en) Ultrasonic image classification method based on convolutional neural network
CN112347888B (en) Remote sensing image scene classification method based on bi-directional feature iterative fusion
CN110570433B (en) Image semantic segmentation model construction method and device based on generation countermeasure network
CN110555841B (en) SAR image change detection method based on self-attention image fusion and DEC
CN113361373A (en) Real-time semantic segmentation method for aerial image in agricultural scene
CN114120041B (en) Small sample classification method based on double-countermeasure variable self-encoder
CN109871749B (en) Pedestrian re-identification method and device based on deep hash and computer system
CN113688941A (en) Small sample sonar image classification, identification and optimization method based on generation of countermeasure network
CN116206185A (en) Lightweight small target detection method based on improved YOLOv7
CN115861779A (en) Unbiased scene graph generation method based on effective feature representation
CN112905828A (en) Image retriever, database and retrieval method combined with significant features
CN112733693A (en) Multi-scale residual error road extraction method for global perception high-resolution remote sensing image
CN115565019A (en) Single-channel high-resolution SAR image ground object classification method based on deep self-supervision generation countermeasure
Wang et al. Generative adversarial network based on resnet for conditional image restoration
CN115170943A (en) Improved visual transform seabed substrate sonar image classification method based on transfer learning
CN114675249A (en) Attention mechanism-based radar signal modulation mode identification method
Xie et al. Co-compression via superior gene for remote sensing scene classification
CN112560034B (en) Malicious code sample synthesis method and device based on feedback type deep countermeasure network
CN114168782B (en) Deep hash image retrieval method based on triplet network
CN115965968A (en) Small sample target detection and identification method based on knowledge guidance
CN113343924B (en) Modulation signal identification method based on cyclic spectrum characteristics and generation countermeasure network
CN112966544B (en) Radar radiation source signal classification and identification method adopting ICGAN and ResNet networks
Yang et al. Relative entropy multilevel thresholding method based on genetic optimization
CN112991257B (en) Heterogeneous remote sensing image change rapid detection method based on semi-supervised twin network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination