CN114782791A - Scene graph generation method based on transformer model and category association

Info

Publication number
CN114782791A
CN114782791A (application CN202210388789.7A)
Authority
CN
China
Prior art keywords
scene graph
relation
network
result
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210388789.7A
Other languages
Chinese (zh)
Other versions
CN114782791B (en)
Inventor
曾锦权
丁长兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210388789.7A priority Critical patent/CN114782791B/en
Publication of CN114782791A publication Critical patent/CN114782791A/en
Application granted granted Critical
Publication of CN114782791B publication Critical patent/CN114782791B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74 Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a scene graph generation method based on a transformer model and category association, which comprises the following steps: detecting the positions and object category information of all objects in the picture by using the Faster-RCNN target detection algorithm, and then obtaining the scene graph structure of the input picture through a trained scene graph generation network. On the basis of a transformer-based information fusion mechanism, the invention improves the transformer attention mechanism according to the characteristics of information fusion, reducing the computational complexity and convergence difficulty; it pays attention to the relevance among relation categories and corrects the bias caused by the imbalance of training samples; and it reduces the label noise caused by unlabeled samples with a label-free positive sample learning method.

Description

Scene graph generation method based on transformer model and category association
Technical Field
The invention relates to the technical field of computer vision processing, in particular to a scene graph generation method based on a transformer model and category association.
Background
The goal of the Scene Graph Generation (SGG) method is to automatically detect the specific positions and categories of objects in an input picture by using computer vision, identify the categories of relationships existing between the objects, and finally generate a data structure defined as a scene graph, with the objects in the picture as nodes and the relationships existing between the objects as directed edges. The generated scene graph data structure can provide important auxiliary information for realizing high-level artificial intelligence, and can be widely applied to fields such as intelligent robots, security monitoring, image-text retrieval, and human-computer interaction.
Most existing scene graph generation methods only use a classical long short-term memory (LSTM) unit or a transformer structure for information fusion, and only attempt to correct the imbalance of training samples by changing the decision surface of the classifier. These methods have disadvantages in several respects. Firstly, the design of the information fusion network is not optimized according to the characteristics of information fusion, so the computational complexity and convergence difficulty of the network are not reduced; secondly, when attempting to correct the imbalance of the training samples, the relevance among relation categories is not well considered; finally, unlabeled relation samples are by default assigned the category "no relationship", so the training optimization of the model is affected by label noise.
Disclosure of Invention
In order to overcome the defects and shortcomings in the prior art, the invention provides a scene graph generation method based on a transformer model and category association.
The second purpose of the invention is to provide a scene graph generation system based on a transformer model and category association.
A third object of the present invention is to provide a computer-readable storage medium.
It is a fourth object of the invention to provide a computing device.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a method for generating a scene graph based on a transformer model and category association, which comprises the following steps of:
inputting the original picture into a Faster-RCNN target detection algorithm network, and detecting the positions and object category information of all objects;
constructing a scene graph generation network based on a transformer model;
initializing the constructed scene graph generation network in a random mode;
randomly distributing the pictures in the training set to a batch with a fixed size, and inputting the pictures into a scene graph generation network to obtain a predicted scene graph result;
performing loss calculation on an object prediction result in the predicted scene graph result by adopting a cross entropy loss function;
constructing a de-bias loss function based on relation class association, and performing loss calculation on a relation prediction result in a predicted scene graph structure by adopting the loss function;
adding the loss of the object prediction result and the loss of the relation prediction result, performing gradient calculation, and updating the scene graph generation network through back propagation;
constructing a label-free positive sample learning method, after training optimization of a set step length, utilizing a currently trained network to generate a pseudo label for an unlabeled positive sample, and adding the pseudo label into the training sample to continue training optimization on the network;
and generating a scene graph result for the input picture by using the trained scene graph generation network.
As a preferred technical scheme, the original picture is input into a Faster-RCNN target detection algorithm network to detect the positions and object type information of all objects, and the specific steps include:
using the trained Faster-RCNN target detection network to perform target detection on the input original picture, obtaining the bounding box b_e of each object in the picture and its confidence s_k on each object class, where the subscript k ∈ {1, …, C_e} indexes the corresponding object class and C_e is the number of all object classes.
As a preferred technical solution, constructing the scene graph generation network based on the transformer model specifically comprises:
constructing initial features of the object:
extracting a global feature map from the original picture by using the convolution network ResNeXt-101 in Faster-RCNN; according to the bounding box b_e of an object, intercepting the convolution features of the object bounding box area on the global feature map by using the region-of-interest alignment operation ROIAlign to obtain a visual feature vector; selecting GloVe word vectors to encode the semantic information of the object, taking the confidence of the current object on each object category as a weight, and weighting and summing the word vectors corresponding to each object category to obtain a semantic feature vector; after splicing the visual features and the semantic features, obtaining the initial features of the object by using a fully connected layer;
constructing a transformer encoder for object information fusion;
the method comprises the steps that multiple layers of transformer structures fused aiming at object information are stacked, each layer comprises multi-head linear self-attention and Add & Norm operation, initial features of an object are converted into refined features with context information through a transformer encoder, and an object classifier is used for obtaining output of the object classifier;
initial characteristics of the constructed relationship:
for any pair of object ordered binary combinations, defining a first object as a subject and another object as an object, intercepting convolution characteristics of minimum closed regions of two object bounding boxes on a global characteristic diagram by using a region-of-interest alignment operation ROIAlign, splicing the characteristics with refinement characteristics of the subject and refinement characteristics of the object, and finally obtaining relationship initial characteristics by using a full connection layer;
and constructing a transformer encoder for relational information fusion, stacking a plurality of layers of transformer structures aiming at the relational information fusion, converting the initial characteristics of the relation into refined characteristics with context information through the transformer encoder, and then obtaining the output of a relational classifier by using a full-connection-layer classifier.
As a preferred technical solution, initializing the constructed scene graph generation network in a random manner specifically comprises:
the region-of-interest alignment ROIAlign module of the constructed scene graph generation network inherits its parameters from the corresponding module of the pre-trained Faster-RCNN for initialization, and the remaining parts of the scene graph generation network are randomly initialized using the Xavier weight initialization method.
As a preferred technical solution, adding the loss of the object prediction result and the loss of the relation prediction result, performing gradient calculation, and updating the scene graph generation network through back propagation specifically comprises:
obtaining the relevance coefficients between different relation classes by using the relation classification results on the verification set, using the relevance coefficients as weighting coefficients in the denominator of the softmax function, and inputting the softmax result into a class-balancing loss function without hyperparameters to calculate the loss of the relation, specifically expressed as follows:
s_p^i = exp(r_p^i) / Σ_{m=1..C_r} w_{p,m} · exp(r_m^i)
L_rel = -(1/C_r) Σ_{p=1..C_r} (1/N_p) Σ_{i=1..N_p} log s_p^i
where N_p is the number of samples in the current batch whose relation class is p, r_p^i denotes the classifier output of the i-th sample on relation class p, and w_{p,m} is the relevance coefficient between relation class p and relation class m;
calculating the sum of the loss of the object prediction result and the loss of the relation prediction result, and specifically expressing the sum as follows:
L = L_obj + L_rel
and updating the network parameters by adopting a gradient descent method.
As a preferred technical solution, constructing the label-free positive sample learning method and generating pseudo labels for unlabeled positive samples with the currently trained network after training optimization of a set step length specifically comprises:
given the object position and category annotation information in the current picture, screening out a set number of relation classification prediction results ranked by confidence;
putting the correctly predicted labeled positive samples into a labeled set and the unlabeled samples into an unlabeled set;
recording the confidence of the last retained relation classification prediction result as τ;
for a sample α in the unlabeled set, when there exists a sample β in the labeled set such that the intersection-over-union of the minimum closure regions of the subject and object of sample α and sample β is greater than a set threshold: if the relation classification result p_α of sample α is not the same as the label of sample β, the label of sample α is updated to p_α; if they are the same but sample α has a confidence higher than τ+0.1 on other classes, the class with the highest such confidence is selected to update the label of sample α;
and after the pseudo labels have been updated for all the selected label-free positive samples in the training set, continuing to train and optimize the constructed scene graph generation network until the set step length is reached, and finishing training to obtain the trained scene graph generation network.
As a preferred technical solution, generating the scene graph result for the input picture by using the trained scene graph generation network specifically comprises:
constructing initial features of the objects and obtaining the object classifier output through the constructed transformer encoder for object information fusion;
obtaining all ordered binary combinations of the objects, constructing initial features of the relations, and obtaining the relation classifier output through the constructed transformer encoder for relation information fusion;
for each ordered binary combination of objects, the final score of the relation prediction result between the two objects is calculated as:
s = s_h · s_t · s_r
where s_h is the confidence of the subject object classification result, s_t is the confidence of the object classification result, and s_r is the confidence of the relation classification result, all obtained by applying a softmax function to the object classifier output and the relation classifier output;
and the predicted objects are taken as graph nodes and the predicted relations between the objects as directed edges to construct the resulting scene graph.
In order to achieve the second object, the invention adopts the following technical scheme:
a system for generating a scene graph based on a transform model and category association comprises: the system comprises an object information detection module, a scene graph generation network construction module, an initialization module, a scene graph result prediction module, an object prediction result loss calculation module, a relation prediction result loss calculation module, a gradient calculation module, a training optimization module and a result output module;
the object information detection module is used for inputting the original picture into a fast-RCNN target detection algorithm network to detect the positions and object type information of all objects;
the scene graph generation network construction module is used for constructing a scene graph generation network based on a transformer model;
the initialization module is used for initializing the constructed scene graph generation network in a random mode;
the scene graph result prediction module is used for randomly distributing the pictures in the training set to a batch with a fixed size and inputting the pictures into a scene graph generation network to obtain a predicted scene graph result;
the object prediction result loss calculation module is used for performing loss calculation on an object prediction result in the predicted scene graph result by adopting a cross entropy loss function;
the relation prediction result loss calculation module is used for constructing a de-bias loss function based on relation category association and adopting the loss function to carry out loss calculation on the relation prediction result in the predicted scene graph structure;
the gradient calculation module is used for adding the loss of the object prediction result and the loss of the relation prediction result, performing gradient calculation, and updating the scene graph generation network through back propagation;
the training optimization module is used for constructing a label-free positive sample learning method, after training optimization of a specific step length, reliable pseudo labels are generated for unlabeled positive samples by using the currently trained network, and the labels are added into the training samples to continue training optimization of the network;
and the output module is used for generating a scene graph result for the input picture by using the trained scene graph generation network.
In order to achieve the third object, the invention adopts the following technical scheme:
a storage medium stores a program that, when executed by a processor, implements the above-described method for generating a scene graph based on a transform model and category association.
In order to achieve the fourth object, the invention adopts the following technical scheme:
a computing device comprises a processor and a memory for storing a processor executable program, wherein the processor executes the program stored in the memory to realize the scene graph generation method based on the transform model and the category association.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) On the basis of a transformer-based information fusion mechanism, the method improves the attention calculation of the transformer according to the characteristics of the information fusion mechanism: a linear attention mechanism replaces the traditional softmax attention mechanism, reducing the computational complexity of the relation information fusion module, and the geometric information among objects is used to apply nonlinear weighting to the attention of object information fusion, reducing the convergence difficulty.
(2) The method constructs a de-bias loss function based on relation class association and uses it to calculate the loss of the relation prediction results, guiding the constructed scene graph generation network to pay attention to the relevance among relation classes and correcting the bias caused by the imbalance of training samples.
(3) The invention constructs a label-free positive sample learning method, which generates reliable pseudo labels for unlabeled positive samples and reduces the label noise caused by unlabeled training samples.
Drawings
FIG. 1 is a schematic flow chart of the method for generating a scene graph based on a transformer model and category association according to the present invention;
fig. 2 is a schematic structural diagram of a scene graph generation network according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Examples
As shown in fig. 1, the present embodiment provides a scene graph generation method based on a transformer model and category association, including the following steps:
S1: inputting the original picture into the Faster-RCNN target detection algorithm network and detecting the positions and object category information of all objects, which specifically comprises:
using the trained Faster-RCNN target detection network to perform target detection on the input original picture, obtaining the bounding box b_e of each object in the picture and its confidence s_k on each object class, where the subscript k ∈ {1, …, C_e} indexes the corresponding object class and C_e is the number of all object classes.
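For illustration only, this detection step can be sketched with a publicly available detector. The snippet below uses torchvision's pre-trained Faster R-CNN with a ResNet-50-FPN backbone as a stand-in, whereas the embodiment trains its own ResNeXt-101-based Faster-RCNN on the scene graph dataset; the file name and detector choice here are assumptions.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Stand-in detector; the patent uses a Faster-RCNN with a ResNeXt-101 backbone
# trained on the scene graph dataset instead of this off-the-shelf model.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

image = to_tensor(Image.open("example.jpg").convert("RGB"))
with torch.no_grad():
    detections = detector([image])[0]   # dict with "boxes", "labels", "scores"

boxes = detections["boxes"]     # b_e: (N, 4) bounding boxes in (x1, y1, x2, y2) format
labels = detections["labels"]   # detected object categories
scores = detections["scores"]   # top-1 confidence per box; the scene graph network also
                                # needs the full per-class confidence vector s_k, which a
                                # custom detector head would expose.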
S2: as shown in fig. 2, constructing a scene graph generation network based on a transformer model, which specifically includes the following steps:
S21: constructing initial features of the object:
extracting a global feature map from the original picture by using the convolution network ResNeXt-101 in Faster-RCNN; according to the bounding box b_e of an object, intercepting the convolution features of the object bounding box area on the global feature map by using the region-of-interest alignment operation ROIAlign to obtain a visual feature vector of length 4096; meanwhile, publicly available GloVe word vectors pre-trained on the Wikipedia2014 and Gigaword5 datasets are selected to encode the semantic information of the object, the confidence of the current object on each object category is taken as a weight, and the word vectors corresponding to each object category are weighted and summed to obtain a semantic feature vector of length 200; after the visual features and the semantic features are spliced, a fully connected layer is used to obtain the initial feature f_e of the object with length 512.
This process can be formulated as:
f_e = FC_e([f_v; f_s])
where FC_e denotes the fully connected layer, [;] denotes the vector splicing (concatenation) operation, and f_v and f_s are the visual and semantic feature vectors respectively.
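As an illustrative sketch only, the construction of the object initial feature f_e can be written as follows; the module name, the 2048-channel assumption for the backbone output and the stand-in linear layer that maps the pooled feature to 4096 dimensions (in place of the detector's box head) are assumptions, not part of the original description.

import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ObjectInitialFeature(nn.Module):
    """Sketch of f_e = FC_e([visual feature ; semantic feature])."""
    def __init__(self, glove_vectors, in_channels=2048, visual_dim=4096,
                 embed_dim=200, out_dim=512):
        super().__init__()
        # glove_vectors: (C_e, 200) GloVe word vector for each object class
        self.register_buffer("glove", glove_vectors)
        # stands in for the detector's box head that yields the 4096-d visual vector
        self.fc_visual = nn.Linear(in_channels * 7 * 7, visual_dim)
        self.fc_e = nn.Linear(visual_dim + embed_dim, out_dim)

    def forward(self, feature_map, boxes, class_confidences, spatial_scale):
        # feature_map: (1, C, H, W) global feature map from the ResNeXt-101 backbone
        # boxes: (N, 4) object bounding boxes b_e in image coordinates
        # class_confidences: (N, C_e) confidences s_k of each object on every class
        pooled = roi_align(feature_map, [boxes], output_size=(7, 7),
                           spatial_scale=spatial_scale)            # ROIAlign crop per box
        visual = self.fc_visual(pooled.flatten(1))                 # visual feature, length 4096
        semantic = class_confidences @ self.glove                  # confidence-weighted GloVe sum
        return self.fc_e(torch.cat([visual, semantic], dim=1))     # f_e, length 512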
S22: constructing a transformer encoder for object information fusion;
A transformer encoder for object information fusion is formed by stacking 4 layers of the transformer structure for object information fusion, each layer including multi-head self-attention and Add & Norm operations. In the multi-head self-attention operation, the input feature Z of the current layer is first converted into a query feature Q, a key feature K and a value feature V by three different fully connected layers, with feature vector length 512; these features are then evenly divided into 8 parts along the vector length, one for each of the 8 attention heads. At each attention head, the dot products between Q of the i-th object and K of the other objects are computed after a rectifying operation ReLU, the dot products are then divided by the sum of all the dot products to obtain attention weights, and finally the attention weights are used to compute a weighted sum of the V of the other objects, giving the output result of the current attention head; the calculation formula is as follows:
z_i = Σ_j [ g_{i,j} · ReLU(Q_i)·ReLU(K_j)ᵀ / Σ_k g_{i,k} · ReLU(Q_i)·ReLU(K_k)ᵀ ] · V_j
where z_i is the output result of the i-th object and g_{i,j} is a weighting factor calculated from the geometric information between the i-th object and the j-th object; its calculation formula is as follows:
[formula image: g_{i,j} is computed as a nonlinear function of IoU_{i,j}, ρ_{i,j} and c_{i,j}]
where IoU_{i,j} is the intersection-over-union of the bounding boxes of object i and object j, ρ_{i,j} is the Euclidean distance between the center points of the two object bounding boxes, and c_{i,j} is the diagonal Euclidean distance of the minimum closure region that simultaneously contains the bounding boxes of the two objects, calculated as:
c_{i,j} = sqrt((x_2 - x_1)² + (y_2 - y_1)²)
where (x_1, y_1) are the coordinates of the top-left vertex and (x_2, y_2) the coordinates of the bottom-right vertex of the minimum closure region;
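The three geometric quantities defined above (IoU_{i,j}, ρ_{i,j} and c_{i,j}) can be computed for all object pairs as in the following sketch; only these inputs to g_{i,j} are shown, since the nonlinear mapping itself is given as a formula image in the original, and the function name is illustrative.

import torch

def pairwise_geometry(boxes):
    """Compute IoU_{i,j}, center distance rho_{i,j} and closure diagonal c_{i,j}
    for every pair of boxes in (x1, y1, x2, y2) format; boxes: (N, 4) tensor."""
    x1, y1, x2, y2 = boxes.unbind(dim=1)
    areas = (x2 - x1) * (y2 - y1)

    # pairwise intersection and IoU
    ix1 = torch.max(x1[:, None], x1[None, :])
    iy1 = torch.max(y1[:, None], y1[None, :])
    ix2 = torch.min(x2[:, None], x2[None, :])
    iy2 = torch.min(y2[:, None], y2[None, :])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    iou = inter / (areas[:, None] + areas[None, :] - inter + 1e-8)

    # Euclidean distance between box centers
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    rho = torch.sqrt((cx[:, None] - cx[None, :]) ** 2 + (cy[:, None] - cy[None, :]) ** 2)

    # diagonal of the minimum closure region containing both boxes
    ex1 = torch.min(x1[:, None], x1[None, :])
    ey1 = torch.min(y1[:, None], y1[None, :])
    ex2 = torch.max(x2[:, None], x2[None, :])
    ey2 = torch.max(y2[:, None], y2[None, :])
    c = torch.sqrt((ex2 - ex1) ** 2 + (ey2 - ey1) ** 2)
    return iou, rho, c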
splicing the output results of the 8 heads, and obtaining the output features of the multi-head attention mechanism through a fully connected layer; in the following Add & Norm operation, the input features Z of the current layer are first added to the output features of the multi-head self-attention operation and a layer normalization operation is performed, then the resulting features are added to themselves after an FC-ReLU-FC transformation and another layer normalization operation is performed, finally obtaining the output features of the current transformer layer; the specific calculation formulas are as follows:
Z=LayerNorm(Z+MultiHeadAttention(Z))
Z=LayerNorm(Z+FeedForward(Z))
wherein LayerNorm is the layer normalization operation, MultiHeadAttention is the multi-head attention operation, and FeedForward is the feed-forward network consisting of the FC-ReLU-FC transformation;
the initial features of the object are converted into refined features z_e with context information through the 4-layer transformer encoder, and then a fully-connected-layer classifier is used to obtain the object classifier output o_k;
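A single attention head of this object encoder can be sketched as below. The exact way the geometric coefficient g enters the attention is an assumption here (it is taken to scale the rectified dot products before normalization, consistent with the reconstructed formula above); the multi-head split, concatenation and Add & Norm operations then follow the formulas given earlier.

import torch

def geometric_attention_head(Q, K, V, g, eps=1e-8):
    """One head of the object encoder's self-attention (sketch).
    Q, K, V: (N, d_head) per-object query/key/value features; g: (N, N) geometric weights."""
    scores = torch.relu(Q) @ torch.relu(K).t()                    # rectified dot products
    scores = scores * g                                           # assumed geometric weighting
    weights = scores / (scores.sum(dim=1, keepdim=True) + eps)    # normalise over all objects
    return weights @ V                                            # weighted sum of values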
S23: constructing initial features of the relation;
for any ordered binary combination of objects, the first object is defined as the subject and the other object as the object, and the convolution feature f_u of the minimum closure area of the two object bounding boxes is intercepted on the global feature map by using the region-of-interest alignment operation ROIAlign; the specific calculation formula is expressed as:
x_1^u = min(x_1^h, x_1^t), y_1^u = min(y_1^h, y_1^t), x_2^u = max(x_2^h, x_2^t), y_2^u = max(y_2^h, y_2^t)
f_u = ROIAlign((x_1^u, y_1^u, x_2^u, y_2^u))
where the superscripts h, t and u denote the subject box, the object box and their minimum closure region respectively;
this feature is then spliced with the refined feature z_h of the subject and the refined feature z_t of the object, and a fully connected layer is used to obtain the relation initial feature f_r of length 512, where the subscript r denotes the relation;
S24: constructing a transformer encoder for relation information fusion, formed by stacking 2 layers of the transformer structure for relation information fusion, each layer similar to the transformer structure for object information fusion; the difference is that, in order to reduce the computational complexity caused by the large number of relations, the self-attention calculation is changed into the following formula:
z_i = [ ReLU(Q_i) · Σ_j ReLU(K_j)ᵀ V_j ] / [ ReLU(Q_i) · Σ_j ReLU(K_j)ᵀ ]
after the initial features of the relation are converted into refined features with context information through the 2-layer transformer encoder, a fully-connected-layer classifier is used to obtain the relation classifier output r_p, where the subscript p ∈ {1, …, C_r} indexes the relation class and C_r is the number of all relation classes;
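The relation encoder's attention can be sketched as a kernelized linear attention with a ReLU feature map, which is one standard way to obtain the linear cost described above; the exact formula used by the patent is given only as an image, so the form below is an assumption.

import torch

def linear_self_attention(Q, K, V, eps=1e-8):
    """Linear self-attention sketch for the relation encoder.
    Q, K, V: (N, d). Cost grows linearly in the number N of relation samples."""
    q, k = torch.relu(Q), torch.relu(K)
    kv = k.t() @ V                                     # (d, d) key-value summary, computed once
    normaliser = q @ k.sum(dim=0, keepdim=True).t()    # (N, 1)
    return (q @ kv) / (normaliser + eps)               # (N, d) attended features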
S3: initializing the constructed scene graph generation network in a random manner;
the region-of-interest alignment ROIAlign module of the constructed scene graph generation network directly inherits its parameters from the corresponding module of the pre-trained Faster-RCNN for initialization, and the remaining parts of the scene graph generation network are randomly initialized using the Xavier weight initialization method;
S4: randomly distributing the pictures in the training set into batches of a fixed size, and inputting them into the scene graph generation network to obtain predicted scene graph results;
the pictures in the training set are randomly distributed into batches of a fixed size, and pictures in the same batch have the same size proportion; a batch is input into the constructed scene graph generation network, the initial features of the objects are constructed, the object classifier output o_k is obtained through the constructed transformer encoder for object information fusion, and the loss of the object classifier output o_k is calculated using the cross entropy loss function;
all ordered binary combinations of objects are obtained from one picture; combinations with relation labels and combinations without relation labels are randomly sampled at a ratio of 1:3 to obtain 1024 training samples, the initial features of the relations are constructed, the relation classifier output r_p is obtained through the constructed transformer encoder for relation information fusion, and the loss of the relation classifier output r_p is calculated using the de-bias loss function based on relation class association; specifically, the relevance coefficients between different relation classes are first obtained by using the relation classification results on the validation set, the relevance coefficients are then used as weighting coefficients in the denominator of the softmax function, and finally the softmax result is input into a class-balancing loss function without hyperparameters to calculate the loss of the relation, specifically expressed as the following formulas:
s_p^i = exp(r_p^i) / Σ_{m=1..C_r} w_{p,m} · exp(r_m^i)
L_rel = -(1/C_r) Σ_{p=1..C_r} (1/N_p) Σ_{i=1..N_p} log s_p^i
where N_p is the number of samples in the current batch whose relation class is p, r_p^i denotes the classifier output of the i-th sample on relation class p, and w_{p,m} is the relevance coefficient between relation class p and relation class m; when the number of training samples of relation class p is less than or equal to 10% of that of class m, then w_{p,m} = 0 if more than 80% of the validation set samples of class p are predicted as class m, and w_{p,m} = 0.4 if between 40% and 80% of the validation set samples of class p are predicted as class m; in all other cases w_{p,m} = 1; this is expressed as the following formula:
w_{p,m} = 0 if n_p ≤ 0.1·n_m and a_{p,m} > 0.8; w_{p,m} = 0.4 if n_p ≤ 0.1·n_m and 0.4 ≤ a_{p,m} ≤ 0.8; w_{p,m} = 1 otherwise
where n_p and n_m are the numbers of training samples of class p and class m respectively, and a_{p,m} is the proportion of validation set samples of class p that are predicted as class m;
calculating the sum of the object loss and the relation loss, specifically expressed as:
L = L_obj + L_rel
updating the network parameters using the gradient descent method; before the training optimization of the network, w_{p,m} is initialized to 1 and the batch size is fixed at 16; the learning rate is linearly warmed up to 0.001 over the first 500 iteration steps, then kept unchanged until it is adjusted to 0.0001 at the 10000th iteration step and to 0.00001 at the 16000th iteration step; during this period the currently trained network parameters are saved every 2000 iteration steps, and w_{p,m} is updated with the test results on the validation set;
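A sketch of the de-bias loss and of the relevance-coefficient table described above is given below. The per-class averaging used as the hyper-parameter-free class balancing follows the reconstructed formulas and is therefore an assumption, and all function names are illustrative. As described above, W would be initialized to all ones and recomputed from the validation results of the saved checkpoints every 2000 iterations.

import torch

def relevance_coefficients(train_counts, confusion_ratio):
    """Build w_{p,m} from training sample counts n_p and the validation-set ratios
    a_{p,m} (fraction of class-p samples predicted as class m); both are (C_r,) / (C_r, C_r) tensors."""
    C = len(train_counts)
    W = torch.ones(C, C)
    for p in range(C):
        for m in range(C):
            if p != m and train_counts[p] <= 0.1 * train_counts[m]:
                if confusion_ratio[p, m] > 0.8:
                    W[p, m] = 0.0
                elif confusion_ratio[p, m] >= 0.4:
                    W[p, m] = 0.4
    return W

def relation_debias_loss(logits, targets, W, eps=1e-8):
    """De-bias loss: softmax whose denominator is weighted by w_{p,m}, followed by a
    per-class average of the negative log-likelihood (class balancing without hyper-parameters).
    logits: (B, C_r) relation classifier outputs r_p; targets: (B,) relation labels;
    W: (C_r, C_r) relevance coefficients with W[p, m] = w_{p,m}."""
    exp = torch.exp(logits - logits.max(dim=1, keepdim=True).values)   # stabilised exponentials
    denom = (W[targets] * exp).sum(dim=1) + eps                        # sum_m w_{p,m} exp(r_m)
    num = exp.gather(1, targets[:, None]).squeeze(1)                   # exp(r_p)
    log_prob = torch.log(num / denom + eps)
    present = targets.unique()
    loss = torch.stack([-log_prob[targets == p].mean() for p in present])
    return loss.mean()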
S5: performing loss calculation on an object prediction result in the predicted scene graph result by adopting a cross entropy loss function;
S6: constructing a de-bias loss function based on relation class association, and performing loss calculation on the relation prediction results in the predicted scene graph results by using this loss function;
S7: adding the loss of the object prediction results and the loss of the relation prediction results, performing gradient calculation, and updating the scene graph generation network through back propagation;
S8: constructing a label-free positive sample learning method, generating reliable pseudo labels for the unlabeled positive samples by using the currently trained network after training optimization of a specific step length, and adding them into the training samples to continue training and optimizing the network;
a label-free positive sample learning method is constructed: after 14000 steps of training optimization, the currently trained network is used to generate reliable pseudo labels for the unlabeled positive samples of the training set; specifically, given the object position and category annotation information in the current picture, the top 20 relation classification prediction results with the highest confidence are first obtained, the correctly predicted labeled positive samples are put into a labeled set, the unlabeled samples are put into an unlabeled set, and the confidence of the 20th relation classification prediction result is recorded as τ; for a sample α in the unlabeled set, when there exists a sample β in the labeled set such that the intersection-over-union of the minimum closure regions of the subject and object of sample α and sample β is greater than 0.3: if the relation classification result p_α of sample α is not the same as the label of sample β, the label of sample α is updated to p_α; if they are the same but sample α has a confidence higher than τ+0.1 on other classes, the class with the highest such confidence is selected to update the label of sample α; after the pseudo labels have been updated for all the selected label-free positive samples in the training set, training and optimization of the constructed scene graph generation network continues until the iteration reaches the 18000th step, at which point training ends and the trained scene graph generation network is obtained;
S9: generating the scene graph result for the input picture by using the trained scene graph generation network;
the position information and object category information of all objects are detected by using the Faster-RCNN target detection network, and then all the information is input into the constructed scene graph generation network:
the initial features of the objects are constructed, and the object classifier output o_k is obtained through the constructed transformer encoder for object information fusion;
all ordered binary combinations of the objects are obtained, the initial features of the relations are constructed, and the relation classifier output r_p is obtained through the constructed transformer encoder for relation information fusion;
then, for each ordered binary combination of objects (h, t), the final score of the relation prediction result between the two objects is calculated as:
s = s_h · s_t · s_r
where s_h is the confidence of the subject object classification result, s_t is the confidence of the object classification result, and s_r is the confidence of the relation classification result, all obtained by applying a softmax function to the object classifier output o_k and the relation classifier output r_p; finally, the predicted objects are taken as graph nodes and the predicted relations between the objects as directed edges to construct the resulting scene graph.
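Finally, the scoring and graph assembly of step S9 can be sketched as follows; the function and field names are illustrative assumptions.

import torch

def assemble_scene_graph(object_logits, object_boxes, pairs, relation_logits):
    """Score every (subject, object, relation) triple by the product of the three
    softmax confidences and return the directed edges of the scene graph.
    object_logits: (N, C_e) classifier outputs o_k; pairs: list of (h, t) index tuples;
    relation_logits: (len(pairs), C_r) classifier outputs r_p."""
    obj_probs = torch.softmax(object_logits, dim=1)
    rel_probs = torch.softmax(relation_logits, dim=1)
    obj_scores, obj_classes = obj_probs.max(dim=1)

    edges = []
    for idx, (h, t) in enumerate(pairs):
        rel_score, rel_class = rel_probs[idx].max(dim=0)
        final = obj_scores[h] * obj_scores[t] * rel_score      # s_h * s_t * s_r
        edges.append({
            "subject": (int(obj_classes[h]), object_boxes[h]),  # graph node
            "object": (int(obj_classes[t]), object_boxes[t]),   # graph node
            "relation": int(rel_class),                          # directed edge label
            "score": float(final),
        })
    # keep the highest-scoring edges first when building the final scene graph
    edges.sort(key=lambda e: e["score"], reverse=True)
    return edges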
In order to verify the effectiveness of the method, experiments are carried out on a VG150 public scene graph generation data set, and quantitative and qualitative analysis is carried out.
As shown in Table 1 below, the first row of Table 1 is the result of the baseline model, which uses the classical transformer structure for both the object and relation transformer encoders and uses cross-entropy loss for both the object and relation classification results. The subsequent models add LG (the optimized transformer structure), CC (the de-bias loss function based on relation class association) and PU (the label-free positive sample learning method) respectively, and the experimental results demonstrate the effectiveness of each part.
Table 1 Self-comparison experimental data on VG150
As shown in Table 2, the present invention is compared with the best-performing published methods on VG150:
Table 2 Comparison of the present invention with other methods on VG150
Example 2
The embodiment provides a scene graph generation system based on a transformer model and category association, including: an object information detection module, a scene graph generation network construction module, an initialization module, a scene graph result prediction module, an object prediction result loss calculation module, a relation prediction result loss calculation module, a gradient calculation module, a training optimization module and a result output module;
in this embodiment, the object information detection module is configured to input an original picture into a fast-RCNN target detection algorithm network, and detect positions and object type information of all objects;
in this embodiment, the scene graph generation network construction module is configured to construct a scene graph generation network based on a transform model;
in this embodiment, the initialization module is configured to initialize the constructed scene graph generation network in a random manner;
in this embodiment, the scene graph result prediction module is configured to randomly allocate pictures in a training set to a batch of a fixed size, and input the pictures into a scene graph generation network to obtain a predicted scene graph result;
in this embodiment, the object prediction result loss calculation module is configured to perform loss calculation on an object prediction result in a predicted scene graph result by using a cross entropy loss function;
in this embodiment, the relationship prediction result loss calculation module is configured to construct a de-bias loss function based on relationship class association, and perform loss calculation on the relationship prediction result in the predicted scene graph structure by using the loss function;
in this embodiment, the gradient calculation module is configured to add the loss of the object prediction result and the loss of the relation prediction result, perform gradient calculation, and update the scene graph generation network through back propagation;
in this embodiment, the training optimization module is configured to construct a label-free positive sample learning method, and after training optimization for a specific step length, generate reliable pseudo labels for unlabeled positive samples by using a currently trained network, and add them to the training samples to continue training optimization on the network;
in this embodiment, the output module is configured to generate a scene graph result for the input picture by using the trained scene graph generation network.
Example 3
The present embodiment provides a storage medium, which may be a storage medium such as a ROM, a RAM, a magnetic disk, or an optical disk; the storage medium stores one or more programs, and when the programs are executed by a processor, the method for generating a scene graph based on a transformer model and category association according to embodiment 1 is implemented.
Example 4
The embodiment provides a computing device, which may be a desktop computer, a notebook computer, a smart phone, a PDA handheld terminal, a tablet computer, or another terminal device with a display function; the computing device includes a processor and a memory, the memory stores one or more programs, and when the processor executes the programs stored in the memory, the method for generating a scene graph based on a transformer model and category association according to embodiment 1 is implemented.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such modifications are intended to be included in the scope of the present invention.

Claims (10)

1. A method for generating a scene graph based on a transformer model and category association, characterized by comprising the following steps:
inputting the original picture into a Faster-RCNN target detection algorithm network, and detecting the positions and object category information of all objects;
constructing a scene graph generation network based on a transformer model;
initializing the constructed scene graph generation network in a random mode;
randomly distributing the pictures in the training set to a batch with a fixed size, and inputting the pictures into a scene graph generation network to obtain a predicted scene graph result;
performing loss calculation on an object prediction result in the predicted scene graph result by adopting a cross entropy loss function;
constructing a de-bias loss function based on relation category association, and performing loss calculation on a relation prediction result in a predicted scene graph structure by adopting the loss function;
adding the loss of the object prediction result and the loss of the relation prediction result, performing gradient calculation, and updating the scene graph generation network through back propagation;
constructing a label-free positive sample learning method, after training optimization of a set step length, utilizing a currently trained network to generate a pseudo label for an unmarked positive sample, and adding the pseudo label into the training sample to continue training optimization on the network;
and generating a scene graph result for the input picture by using the trained scene graph generation network.
2. The method for generating a scene graph based on a transformer model and category association as claimed in claim 1, wherein the step of inputting the original picture into the Faster-RCNN target detection algorithm network to detect the positions and object category information of all objects specifically comprises:
using the trained Faster-RCNN target detection network to perform target detection on the input original picture, obtaining the bounding box b_e of each object in the picture and its confidence s_k on each object class, where the subscript k ∈ {1, …, C_e} indexes the corresponding object class and C_e is the number of all object classes.
3. The method for generating a scene graph based on a transformer model and category association as claimed in claim 1, wherein the step of constructing a transformer-model-based scene graph generation network comprises:
constructing initial features of the object:
extracting a global feature map from the original picture by using the convolution network ResNeXt-101 in Faster-RCNN; according to the bounding box b_e of an object, intercepting the convolution features of the object bounding box area on the global feature map by using the region-of-interest alignment operation ROIAlign to obtain a visual feature vector; selecting GloVe word vectors to encode the semantic information of the object, taking the confidence of the current object on each object category as a weight, and weighting and summing the word vectors corresponding to each object category to obtain a semantic feature vector; after splicing the visual features and the semantic features, obtaining the initial features of the object by using a fully connected layer;
constructing a transformer encoder for object information fusion;
stacking multiple layers of the transformer structure for object information fusion, each layer comprising multi-head linear self-attention and Add & Norm operations; the initial features of the object are converted into refined features with context information through the transformer encoder, and a fully-connected-layer classifier is used to obtain the object classifier output;
initial characteristics of the constructed relationship:
for any pair of object ordered binary combinations, defining a first object as a subject and another object as an object, intercepting convolution characteristics of minimum closed regions of two object bounding boxes on a global characteristic diagram by using a region-of-interest alignment operation ROIAlign, splicing the characteristics with refinement characteristics of the subject and refinement characteristics of the object, and finally obtaining relationship initial characteristics by using a full connection layer;
and constructing a transformer encoder for relational information fusion, stacking a plurality of layers of transformer structures aiming at the relational information fusion, converting the initial characteristics of the relation into refined characteristics with context information through the transformer encoder, and then obtaining the output of a relational classifier by using a full-connection-layer classifier.
4. The method for generating a scene graph based on a transformer model and category association according to claim 1, wherein the initialization of the constructed scene graph generation network in a random manner specifically comprises:
the region-of-interest alignment ROIAlign module of the constructed scene graph generation network inherits its parameters from the corresponding module of the pre-trained Faster-RCNN for initialization, and the remaining parts of the scene graph generation network are randomly initialized using the Xavier weight initialization method.
5. The method for generating a scene graph based on a transformer model and category association as claimed in claim 1, wherein the step of adding the loss of the object prediction result and the loss of the relation prediction result, performing gradient calculation, and updating the scene graph generation network by back propagation specifically comprises:
obtaining the relevance coefficients between different relation classes by using the relation classification results on the verification set, using the relevance coefficients as weighting coefficients in the denominator of the softmax function, and inputting the softmax result into a class-balancing loss function without hyperparameters to calculate the loss of the relation, specifically expressed as follows:
s_p^i = exp(r_p^i) / Σ_{m=1..C_r} w_{p,m} · exp(r_m^i)
L_rel = -(1/C_r) Σ_{p=1..C_r} (1/N_p) Σ_{i=1..N_p} log s_p^i
where N_p is the number of samples in the current batch whose relation class is p, r_p^i denotes the classifier output of the i-th sample on relation class p, and w_{p,m} is the relevance coefficient between relation class p and relation class m;
calculating the sum of the loss of the object prediction result and the loss of the relation prediction result, specifically expressed as:
L = L_obj + L_rel
and updating the network parameters by adopting a gradient descent method.
6. The method for generating a scene graph based on a transformer model and category association as claimed in claim 1, wherein constructing the label-free positive sample learning method and generating pseudo labels for unlabeled positive samples with the currently trained network after training optimization of a set step length specifically comprises:
given the object position and category annotation information in the current picture, screening out a set number of relation classification prediction results ranked by confidence;
putting the correctly predicted labeled positive samples into a labeled set and the unlabeled samples into an unlabeled set;
recording the confidence of the last retained relation classification prediction result as τ;
for a sample α in the unlabeled set, when there exists a sample β in the labeled set such that the intersection-over-union of the minimum closure regions of the subject and object of sample α and sample β is greater than a set threshold: if the relation classification result p_α of sample α is not the same as the label of sample β, the label of sample α is updated to p_α; if they are the same but sample α has a confidence higher than τ+0.1 on other classes, the class with the highest such confidence is selected to update the label of sample α;
and after the pseudo labels are updated on all the selected label-free positive samples in the training set, continuing to train and optimize the constructed scene graph generation network until the set step length is reached, and finishing training to obtain the trained scene graph generation network.
7. The method for generating a scene graph based on a transformer model and category association as claimed in claim 1, wherein generating the scene graph result for the input picture by using the trained scene graph generation network specifically comprises:
constructing initial features of the objects and obtaining the object classifier output through the constructed transformer encoder for object information fusion;
obtaining all ordered binary combinations of the objects, constructing initial features of the relations, and obtaining the relation classifier output through the constructed transformer encoder for relation information fusion;
for each ordered binary combination of objects, the final score of the relation prediction result between the two objects is calculated as:
s = s_h · s_t · s_r
where s_h is the confidence of the subject object classification result, s_t is the confidence of the object classification result, and s_r is the confidence of the relation classification result, all obtained by applying a softmax function to the object classifier output and the relation classifier output;
and taking the predicted objects as graph nodes and the predicted relationship between the objects as directed edges to construct the obtained scene graph result.
8. A system for generating a scene graph based on a transformer model and category association, characterized by comprising: an object information detection module, a scene graph generation network construction module, an initialization module, a scene graph result prediction module, an object prediction result loss calculation module, a relation prediction result loss calculation module, a gradient calculation module, a training optimization module and a result output module;
the object information detection module is used for inputting the original picture into a fast-RCNN target detection algorithm network to detect the positions and object type information of all objects;
the scene graph generation network construction module is used for constructing a scene graph generation network based on a transformer model;
the initialization module is used for initializing the constructed scene graph generation network in a random mode;
the scene graph result prediction module is used for randomly distributing the pictures in the training set to a batch with a fixed size and inputting the pictures into a scene graph generation network to obtain a predicted scene graph result;
the object prediction result loss calculation module is used for performing loss calculation on an object prediction result in the predicted scene graph result by adopting a cross entropy loss function;
the relation prediction result loss calculation module is used for constructing a de-bias loss function based on relation category association and adopting the loss function to carry out loss calculation on the relation prediction result in the predicted scene graph structure;
the gradient calculation module is used for adding the loss of the object prediction result and the loss of the relation prediction result, performing gradient calculation, and updating the scene graph generation network through back propagation;
the training optimization module is used for constructing a label-free positive sample learning method, after training optimization with a specific step length is carried out, reliable pseudo labels are generated for unlabeled positive samples by using the currently trained network, and the pseudo labels are added into the training samples to continue training and optimizing the network;
the output module is used for generating a scene graph result for the input picture by using the trained scene graph generation network.
9. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements the method for generating a scene graph based on a transformer model and category association as claimed in any one of claims 1-7.
10. A computing device comprising a processor and a memory for storing a processor-executable program, wherein the processor, when executing the program stored in the memory, implements the method for generating a scene graph based on a transformer model and category association as claimed in any one of claims 1-7.
CN202210388789.7A 2022-04-14 2022-04-14 Scene graph generation method based on transformer model and category association Active CN114782791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210388789.7A CN114782791B (en) 2022-04-14 2022-04-14 Scene graph generation method based on transform model and category association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210388789.7A CN114782791B (en) 2022-04-14 2022-04-14 Scene graph generation method based on transform model and category association

Publications (2)

Publication Number Publication Date
CN114782791A true CN114782791A (en) 2022-07-22
CN114782791B CN114782791B (en) 2024-03-22

Family

ID=82429114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210388789.7A Active CN114782791B (en) 2022-04-14 2022-04-14 Scene graph generation method based on transform model and category association

Country Status (1)

Country Link
CN (1) CN114782791B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210012199A1 (en) * 2019-07-04 2021-01-14 Zhejiang University Address information feature extraction method based on deep neural network model
CN112989927A (en) * 2021-02-03 2021-06-18 杭州电子科技大学 Scene graph generation method based on self-supervision pre-training
CN113065587A (en) * 2021-03-23 2021-07-02 杭州电子科技大学 Scene graph generation method based on hyper-relation learning network
WO2021151296A1 (en) * 2020-07-22 2021-08-05 平安科技(深圳)有限公司 Multi-task classification method and apparatus, computer device, and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210012199A1 (en) * 2019-07-04 2021-01-14 Zhejiang University Address information feature extraction method based on deep neural network model
WO2021151296A1 (en) * 2020-07-22 2021-08-05 平安科技(深圳)有限公司 Multi-task classification method and apparatus, computer device, and storage medium
CN112989927A (en) * 2021-02-03 2021-06-18 杭州电子科技大学 Scene graph generation method based on self-supervision pre-training
CN113065587A (en) * 2021-03-23 2021-07-02 杭州电子科技大学 Scene graph generation method based on hyper-relation learning network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
兰红; 刘秦邑: "Scene graph to image generation model with graph attention network" (图注意力网络的场景图到图像生成模型), Journal of Image and Graphics (中国图象图形学报), no. 08, 12 August 2020 (2020-08-12), pages 83-95 *

Also Published As

Publication number Publication date
CN114782791B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
Wen et al. Preparing lessons: Improve knowledge distillation with better supervision
WO2021143396A1 (en) Method and apparatus for carrying out classification prediction by using text classification model
CN110008338B (en) E-commerce evaluation emotion analysis method integrating GAN and transfer learning
CN110188358B (en) Training method and device for natural language processing model
CN110516095B (en) Semantic migration-based weak supervision deep hash social image retrieval method and system
CN111310672A (en) Video emotion recognition method, device and medium based on time sequence multi-model fusion modeling
CN112905827A (en) Cross-modal image-text matching method and device and computer readable storage medium
CN111680484B (en) Answer model generation method and system for visual general knowledge reasoning question and answer
CN109858015A (en) A kind of semantic similarity calculation method and device based on CTW and KM algorithm
CN113204675B (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN113344206A (en) Knowledge distillation method, device and equipment integrating channel and relation feature learning
CN113806494B (en) Named entity recognition method based on pre-training language model
CN111666406A (en) Short text classification prediction method based on word and label combination of self-attention
CN110349229A (en) A kind of Image Description Methods and device
CN113822125B (en) Processing method and device of lip language recognition model, computer equipment and storage medium
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN117152416A (en) Sparse attention target detection method based on DETR improved model
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN113705238A (en) Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN116129174A (en) Generalized zero sample image classification method based on feature refinement self-supervision learning
CN112817442A (en) Situation information classification recommendation system and method under multi-task condition based on FFM
CN115640418B (en) Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
Zhao et al. Domain adaptation with feature and label adversarial networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant