CN114782791A - Scene graph generation method based on transformer model and category association

Info

Publication number
CN114782791A
CN114782791A (application CN202210388789.7A)
Authority
CN
China
Prior art keywords
scene graph
relation
network
result
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210388789.7A
Other languages
Chinese (zh)
Other versions
CN114782791B (en)
Inventor
曾锦权
丁长兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210388789.7A priority Critical patent/CN114782791B/en
Publication of CN114782791A publication Critical patent/CN114782791A/en
Application granted granted Critical
Publication of CN114782791B publication Critical patent/CN114782791B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74 Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a scene graph generation method based on a transformer model and category association, which comprises the following steps: detecting the positions and object category information of all objects in the picture by using the Faster-RCNN target detection algorithm, and then obtaining the scene graph structure of the input picture through a trained scene graph generation network. On the basis of a transformer-based information fusion mechanism, the invention improves the transformer attention mechanism according to the characteristics of information fusion, reducing the computational complexity and convergence difficulty; it pays attention to the relevance among relation categories and corrects the bias caused by the imbalance of training samples; and it reduces the label noise caused by unlabeled samples with a label-free positive sample learning method.

Description

Scene graph generation method based on transformer model and category association
Technical Field
The invention relates to the technical field of computer vision processing, in particular to a scene graph generation method based on a transformer model and category association.
Background
The goal of the Scene Graph Generation (SGG) method is to automatically detect the specific positions and categories of objects in an input picture by using computer vision, identify the categories of relationships existing between the objects, and finally generate a data structure defined as a scene graph, with the objects in the picture as nodes and the relationships existing between the objects as directed edges. The generated scene graph data structure can provide important auxiliary information for realizing high-level artificial intelligence, and can be widely applied to fields such as intelligent robots, security monitoring, image-text retrieval, and human-computer interaction.
Most existing scene graph generation methods only use a classical long short-term memory (LSTM) unit or a transformer structure for information fusion, and only attempt to correct the imbalance of training samples by changing the decision surface of the classifier. These methods have disadvantages in several respects. Firstly, the design of the information fusion network is not optimized according to the characteristics of information fusion, so the computational complexity and convergence difficulty of the network are not reduced; secondly, when attempting to correct the imbalance of the training samples, the relevance among relation categories is not well considered; finally, unlabeled relation samples are by default assigned the category "no relationship", so the training optimization of the model is affected by label noise.
Disclosure of Invention
In order to overcome the defects and shortcomings in the prior art, the invention provides a scene graph generation method based on a transformer model and category association.
The second purpose of the invention is to provide a scene graph generation system based on a transformer model and category association.
A third object of the present invention is to provide a computer-readable storage medium.
It is a fourth object of the invention to provide a computing device.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a method for generating a scene graph based on a transformer model and category association, which comprises the following steps of:
inputting the original picture into a Faster-RCNN target detection algorithm network, and detecting the positions and object category information of all objects;
constructing a scene graph generation network based on a transformer model;
initializing the constructed scene graph generation network in a random mode;
randomly distributing the pictures in the training set to a batch with a fixed size, and inputting the pictures into a scene graph generation network to obtain a predicted scene graph result;
performing loss calculation on an object prediction result in the predicted scene graph result by adopting a cross entropy loss function;
constructing a de-bias loss function based on relation class association, and performing loss calculation on a relation prediction result in a predicted scene graph structure by adopting the loss function;
adding the loss of the object prediction result and the loss of the relation prediction result, performing gradient calculation, and updating the scene graph generation network through back propagation;
constructing a label-free positive sample learning method, after training optimization of a set step length, utilizing a currently trained network to generate a pseudo label for an unlabeled positive sample, and adding the pseudo label into the training sample to continue training optimization on the network;
and generating a scene graph result for the input picture by using the trained scene graph generation network.
As a preferred technical scheme, the original picture is input into a Faster-RCNN target detection algorithm network to detect the positions and object type information of all objects, and the specific steps include:
using the trained Faster-RCNN target detection network to perform target detection on the input original picture, obtaining the bounding box b_e of each object in the picture and its confidence s_k on each object class, where the subscript k ∈ {1, …, C_e} indexes the corresponding object class and C_e is the number of all object classes.
As a preferred technical solution, constructing the scene graph generation network based on the transformer model specifically comprises:
constructing initial features of the object:
extracting a global feature map from the original picture by using the convolution network ResNeXt-101 in Faster-RCNN; according to the bounding box b_e of an object, intercepting the convolution features of the object bounding box area on the global feature map by using the region-of-interest alignment operation ROIAlign to obtain a visual feature vector; selecting GloVe word vectors to encode the semantic information of the object, taking the confidence of the current object on each object category as a weight, and weighting and summing the word vectors corresponding to each object category to obtain a semantic feature vector; after splicing the visual features and the semantic features, obtaining the initial features of the object by using a fully connected layer;
constructing a transformer encoder for object information fusion;
the method comprises the steps that multiple layers of transformer structures fused aiming at object information are stacked, each layer comprises multi-head linear self-attention and Add & Norm operation, initial features of an object are converted into refined features with context information through a transformer encoder, and an object classifier is used for obtaining output of the object classifier;
initial characteristics of the constructed relationship:
for any pair of object ordered binary combinations, defining a first object as a subject and another object as an object, intercepting convolution characteristics of minimum closed regions of two object bounding boxes on a global characteristic diagram by using a region-of-interest alignment operation ROIAlign, splicing the characteristics with refinement characteristics of the subject and refinement characteristics of the object, and finally obtaining relationship initial characteristics by using a full connection layer;
and constructing a transformer encoder for relational information fusion, stacking a plurality of layers of transformer structures aiming at the relational information fusion, converting the initial characteristics of the relation into refined characteristics with context information through the transformer encoder, and then obtaining the output of a relational classifier by using a full-connection-layer classifier.
As a preferred technical solution, initializing the constructed scene graph generation network in a random manner specifically comprises:
the region-of-interest alignment ROIAlign module of the constructed scene graph generation network inherits its parameters from the corresponding module of the pre-trained Faster-RCNN for initialization, and the remaining parts of the scene graph generation network are randomly initialized using the Xavier weight initialization method.
As a preferred technical solution, adding the loss of the object prediction result and the loss of the relation prediction result, performing gradient calculation, and updating the scene graph generation network through back propagation specifically comprises:
obtaining the relevance coefficients between different relation classes by using the relation classification results on the verification set, using the relevance coefficients as weighting coefficients in the denominator of the softmax function, and inputting the softmax result into a class-balancing loss function without hyperparameters to calculate the loss of the relation, specifically expressed as follows:
s_p^i = exp(r_p^i) / Σ_{m=1..C_r} w_{p,m} · exp(r_m^i)
L_rel = -(1/C_r) Σ_{p=1..C_r} (1/N_p) Σ_{i=1..N_p} log s_p^i
where N_p is the number of samples in the current batch whose relation class is p, r_p^i denotes the classifier output of the i-th sample on relation class p, and w_{p,m} is the relevance coefficient between relation class p and relation class m;
calculating the sum of the loss of the object prediction result and the loss of the relation prediction result, and specifically expressing the sum as follows:
L = L_obj + L_rel
and updating the network parameters by adopting a gradient descent method.
As a preferred technical solution, constructing the label-free positive sample learning method and generating pseudo labels for unlabeled positive samples with the currently trained network after training optimization of a set step length specifically comprises:
given the object position and category annotation information in the current picture, screening out a set number of relation classification prediction results ranked by confidence;
putting the correctly predicted labeled positive samples into a labeled set and the unlabeled samples into an unlabeled set;
recording the confidence of the last retained relation classification prediction result as τ;
for a sample α in the unlabeled set, when there exists a sample β in the labeled set such that the intersection-over-union of the minimum closure regions of the subject and object of sample α and sample β is greater than a set threshold: if the relation classification result p_α of sample α is not the same as the label of sample β, the label of sample α is updated to p_α; if they are the same but sample α has a confidence higher than τ+0.1 on other classes, the class with the highest such confidence is selected to update the label of sample α;
and after the pseudo labels have been updated for all the selected label-free positive samples in the training set, continuing to train and optimize the constructed scene graph generation network until the set step length is reached, and finishing training to obtain the trained scene graph generation network.
As a preferred technical solution, generating the scene graph result for the input picture by using the trained scene graph generation network specifically comprises:
constructing initial features of the objects and obtaining the object classifier output through the constructed transformer encoder for object information fusion;
obtaining all ordered binary combinations of the objects, constructing initial features of the relations, and obtaining the relation classifier output through the constructed transformer encoder for relation information fusion;
for each ordered binary combination of objects, the final score of the relation prediction result between the two objects is calculated as:
s = s_h · s_t · s_r
where s_h is the confidence of the subject object classification result, s_t is the confidence of the object classification result, and s_r is the confidence of the relation classification result, all obtained by applying a softmax function to the object classifier output and the relation classifier output;
and the predicted objects are taken as graph nodes and the predicted relations between the objects as directed edges to construct the resulting scene graph.
In order to achieve the second object, the invention adopts the following technical scheme:
a system for generating a scene graph based on a transform model and category association comprises: the system comprises an object information detection module, a scene graph generation network construction module, an initialization module, a scene graph result prediction module, an object prediction result loss calculation module, a relation prediction result loss calculation module, a gradient calculation module, a training optimization module and a result output module;
the object information detection module is used for inputting the original picture into a fast-RCNN target detection algorithm network to detect the positions and object type information of all objects;
the scene graph generation network construction module is used for constructing a scene graph generation network based on a transformer model;
the initialization module is used for initializing the constructed scene graph generation network in a random mode;
the scene graph result prediction module is used for randomly distributing the pictures in the training set to a batch with a fixed size and inputting the pictures into a scene graph generation network to obtain a predicted scene graph result;
the object prediction result loss calculation module is used for performing loss calculation on an object prediction result in the predicted scene graph result by adopting a cross entropy loss function;
the relation prediction result loss calculation module is used for constructing a de-bias loss function based on relation category association and adopting the loss function to carry out loss calculation on the relation prediction result in the predicted scene graph structure;
the gradient calculation module is used for adding the loss of the object prediction result and the loss of the relation prediction result, performing gradient calculation, and updating the scene graph generation network through back propagation;
the training optimization module is used for constructing a label-free positive sample learning method, after training optimization of a specific step length, reliable pseudo labels are generated for unlabeled positive samples by using the currently trained network, and the labels are added into the training samples to continue training optimization of the network;
and the output module is used for generating a scene graph result for the input picture by using the trained scene graph generation network.
In order to achieve the third object, the invention adopts the following technical scheme:
a storage medium stores a program that, when executed by a processor, implements the above-described method for generating a scene graph based on a transform model and category association.
In order to achieve the fourth object, the invention adopts the following technical scheme:
a computing device comprises a processor and a memory for storing a processor executable program, wherein the processor executes the program stored in the memory to realize the scene graph generation method based on the transform model and the category association.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) On the basis of a transformer-based information fusion mechanism, the method improves the attention calculation of the transformer according to the characteristics of the information fusion mechanism: a linear attention mechanism replaces the traditional softmax attention mechanism, reducing the computational complexity of the relation information fusion module, and the geometric information among objects is used to apply nonlinear weighting to the attention of object information fusion, reducing the convergence difficulty.
(2) The method constructs a de-bias loss function based on relation class association and uses it to calculate the loss of the relation prediction results, guiding the constructed scene graph generation network to pay attention to the relevance among relation classes and correcting the bias caused by the imbalance of training samples.
(3) The invention constructs a label-free positive sample learning method, which generates reliable pseudo labels for unlabeled positive samples and reduces the label noise caused by unlabeled training samples.
Drawings
FIG. 1 is a schematic flow chart of the method for generating a scene graph based on a transformer model and category association according to the present invention;
fig. 2 is a schematic structural diagram of a scene graph generation network according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Examples
As shown in fig. 1, the present embodiment provides a scene graph generation method based on a transformer model and category association, including the following steps:
S1: inputting the original picture into the Faster-RCNN target detection algorithm network and detecting the positions and object category information of all objects, which specifically comprises:
using the trained Faster-RCNN target detection network to perform target detection on the input original picture, obtaining the bounding box b_e of each object in the picture and its confidence s_k on each object class, where the subscript k ∈ {1, …, C_e} indexes the corresponding object class and C_e is the number of all object classes.
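For illustration only, this detection step can be sketched with a publicly available detector. The snippet below uses torchvision's pre-trained Faster R-CNN with a ResNet-50-FPN backbone as a stand-in, whereas the embodiment trains its own ResNeXt-101-based Faster-RCNN on the scene graph dataset; the file name and detector choice here are assumptions.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Stand-in detector; the patent uses a Faster-RCNN with a ResNeXt-101 backbone
# trained on the scene graph dataset instead of this off-the-shelf model.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

image = to_tensor(Image.open("example.jpg").convert("RGB"))
with torch.no_grad():
    detections = detector([image])[0]   # dict with "boxes", "labels", "scores"

boxes = detections["boxes"]     # b_e: (N, 4) bounding boxes in (x1, y1, x2, y2) format
labels = detections["labels"]   # detected object categories
scores = detections["scores"]   # top-1 confidence per box; the scene graph network also
                                # needs the full per-class confidence vector s_k, which a
                                # custom detector head would expose.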
S2: as shown in fig. 2, constructing a scene graph generation network based on a transformer model, which specifically includes the following steps:
S21: constructing initial features of the object:
extracting a global feature map from the original picture by using the convolution network ResNeXt-101 in Faster-RCNN; according to the bounding box b_e of an object, intercepting the convolution features of the object bounding box area on the global feature map by using the region-of-interest alignment operation ROIAlign to obtain a visual feature vector of length 4096; meanwhile, publicly available GloVe word vectors pre-trained on the Wikipedia2014 and Gigaword5 datasets are selected to encode the semantic information of the object, the confidence of the current object on each object category is taken as a weight, and the word vectors corresponding to each object category are weighted and summed to obtain a semantic feature vector of length 200; after the visual features and the semantic features are spliced, a fully connected layer is used to obtain the initial feature f_e of the object with length 512.
This process can be formulated as:
f_e = FC_e([f_v; f_s])
where FC_e denotes the fully connected layer, [;] denotes the vector splicing (concatenation) operation, and f_v and f_s are the visual and semantic feature vectors respectively.
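As an illustrative sketch only, the construction of the object initial feature f_e can be written as follows; the module name, the 2048-channel assumption for the backbone output and the stand-in linear layer that maps the pooled feature to 4096 dimensions (in place of the detector's box head) are assumptions, not part of the original description.

import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ObjectInitialFeature(nn.Module):
    """Sketch of f_e = FC_e([visual feature ; semantic feature])."""
    def __init__(self, glove_vectors, in_channels=2048, visual_dim=4096,
                 embed_dim=200, out_dim=512):
        super().__init__()
        # glove_vectors: (C_e, 200) GloVe word vector for each object class
        self.register_buffer("glove", glove_vectors)
        # stands in for the detector's box head that yields the 4096-d visual vector
        self.fc_visual = nn.Linear(in_channels * 7 * 7, visual_dim)
        self.fc_e = nn.Linear(visual_dim + embed_dim, out_dim)

    def forward(self, feature_map, boxes, class_confidences, spatial_scale):
        # feature_map: (1, C, H, W) global feature map from the ResNeXt-101 backbone
        # boxes: (N, 4) object bounding boxes b_e in image coordinates
        # class_confidences: (N, C_e) confidences s_k of each object on every class
        pooled = roi_align(feature_map, [boxes], output_size=(7, 7),
                           spatial_scale=spatial_scale)            # ROIAlign crop per box
        visual = self.fc_visual(pooled.flatten(1))                 # visual feature, length 4096
        semantic = class_confidences @ self.glove                  # confidence-weighted GloVe sum
        return self.fc_e(torch.cat([visual, semantic], dim=1))     # f_e, length 512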
S22: constructing a transformer encoder for object information fusion;
A transformer encoder for object information fusion is formed by stacking 4 layers of the transformer structure for object information fusion, each layer including multi-head self-attention and Add & Norm operations. In the multi-head self-attention operation, the input feature Z of the current layer is first converted into a query feature Q, a key feature K and a value feature V by three different fully connected layers, with feature vector length 512; these features are then evenly divided into 8 parts along the vector length, one for each of the 8 attention heads. At each attention head, the dot products between Q of the i-th object and K of the other objects are computed after a rectifying operation ReLU, the dot products are then divided by the sum of all the dot products to obtain attention weights, and finally the attention weights are used to compute a weighted sum of the V of the other objects, giving the output result of the current attention head; the calculation formula is as follows:
z_i = Σ_j [ g_{i,j} · ReLU(Q_i)·ReLU(K_j)ᵀ / Σ_k g_{i,k} · ReLU(Q_i)·ReLU(K_k)ᵀ ] · V_j
where z_i is the output result of the i-th object and g_{i,j} is a weighting factor calculated from the geometric information between the i-th object and the j-th object; its calculation formula is as follows:
[formula image: g_{i,j} is computed as a nonlinear function of IoU_{i,j}, ρ_{i,j} and c_{i,j}]
where IoU_{i,j} is the intersection-over-union of the bounding boxes of object i and object j, ρ_{i,j} is the Euclidean distance between the center points of the two object bounding boxes, and c_{i,j} is the diagonal Euclidean distance of the minimum closure region that simultaneously contains the bounding boxes of the two objects, calculated as:
c_{i,j} = sqrt((x_2 - x_1)² + (y_2 - y_1)²)
where (x_1, y_1) are the coordinates of the top-left vertex and (x_2, y_2) the coordinates of the bottom-right vertex of the minimum closure region;
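The three geometric quantities defined above (IoU_{i,j}, ρ_{i,j} and c_{i,j}) can be computed for all object pairs as in the following sketch; only these inputs to g_{i,j} are shown, since the nonlinear mapping itself is given as a formula image in the original, and the function name is illustrative.

import torch

def pairwise_geometry(boxes):
    """Compute IoU_{i,j}, center distance rho_{i,j} and closure diagonal c_{i,j}
    for every pair of boxes in (x1, y1, x2, y2) format; boxes: (N, 4) tensor."""
    x1, y1, x2, y2 = boxes.unbind(dim=1)
    areas = (x2 - x1) * (y2 - y1)

    # pairwise intersection and IoU
    ix1 = torch.max(x1[:, None], x1[None, :])
    iy1 = torch.max(y1[:, None], y1[None, :])
    ix2 = torch.min(x2[:, None], x2[None, :])
    iy2 = torch.min(y2[:, None], y2[None, :])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    iou = inter / (areas[:, None] + areas[None, :] - inter + 1e-8)

    # Euclidean distance between box centers
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    rho = torch.sqrt((cx[:, None] - cx[None, :]) ** 2 + (cy[:, None] - cy[None, :]) ** 2)

    # diagonal of the minimum closure region containing both boxes
    ex1 = torch.min(x1[:, None], x1[None, :])
    ey1 = torch.min(y1[:, None], y1[None, :])
    ex2 = torch.max(x2[:, None], x2[None, :])
    ey2 = torch.max(y2[:, None], y2[None, :])
    c = torch.sqrt((ex2 - ex1) ** 2 + (ey2 - ey1) ** 2)
    return iou, rho, c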
splicing the output results of the 8 heads, and obtaining the output features of the multi-head attention mechanism through a fully connected layer; in the following Add & Norm operation, the input features Z of the current layer are first added to the output features of the multi-head self-attention operation and a layer normalization operation is performed, then the resulting features are added to themselves after an FC-ReLU-FC transformation and another layer normalization operation is performed, finally obtaining the output features of the current transformer layer; the specific calculation formulas are as follows:
Z=LayerNorm(Z+MultiHeadAttention(Z))
Z=LayerNorm(Z+FeedForward(Z))
wherein LayerNorm is the layer normalization operation, MultiHeadAttention is the multi-head attention operation, and FeedForward is the feed-forward network consisting of the FC-ReLU-FC transformation;
the initial features of the object are converted into refined features z_e with context information through the 4-layer transformer encoder, and then a fully-connected-layer classifier is used to obtain the object classifier output o_k;
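A single attention head of this object encoder can be sketched as below. The exact way the geometric coefficient g enters the attention is an assumption here (it is taken to scale the rectified dot products before normalization, consistent with the reconstructed formula above); the multi-head split, concatenation and Add & Norm operations then follow the formulas given earlier.

import torch

def geometric_attention_head(Q, K, V, g, eps=1e-8):
    """One head of the object encoder's self-attention (sketch).
    Q, K, V: (N, d_head) per-object query/key/value features; g: (N, N) geometric weights."""
    scores = torch.relu(Q) @ torch.relu(K).t()                    # rectified dot products
    scores = scores * g                                           # assumed geometric weighting
    weights = scores / (scores.sum(dim=1, keepdim=True) + eps)    # normalise over all objects
    return weights @ V                                            # weighted sum of values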
S23: constructing initial features of the relation;
for any ordered binary combination of objects, the first object is defined as the subject and the other object as the object, and the convolution feature f_u of the minimum closure area of the two object bounding boxes is intercepted on the global feature map by using the region-of-interest alignment operation ROIAlign; the specific calculation formula is expressed as:
x_1^u = min(x_1^h, x_1^t), y_1^u = min(y_1^h, y_1^t), x_2^u = max(x_2^h, x_2^t), y_2^u = max(y_2^h, y_2^t)
f_u = ROIAlign((x_1^u, y_1^u, x_2^u, y_2^u))
where the superscripts h, t and u denote the subject box, the object box and their minimum closure region respectively;
this feature is then spliced with the refined feature z_h of the subject and the refined feature z_t of the object, and a fully connected layer is used to obtain the relation initial feature f_r of length 512, where the subscript r denotes the relation;
S24: constructing a transformer encoder for relation information fusion, formed by stacking 2 layers of the transformer structure for relation information fusion, each layer similar to the transformer structure for object information fusion; the difference is that, in order to reduce the computational complexity caused by the large number of relations, the self-attention calculation is changed into the following formula:
z_i = [ ReLU(Q_i) · Σ_j ReLU(K_j)ᵀ V_j ] / [ ReLU(Q_i) · Σ_j ReLU(K_j)ᵀ ]
after the initial features of the relation are converted into refined features with context information through the 2-layer transformer encoder, a fully-connected-layer classifier is used to obtain the relation classifier output r_p, where the subscript p ∈ {1, …, C_r} indexes the relation class and C_r is the number of all relation classes;
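The relation encoder's attention can be sketched as a kernelized linear attention with a ReLU feature map, which is one standard way to obtain the linear cost described above; the exact formula used by the patent is given only as an image, so the form below is an assumption.

import torch

def linear_self_attention(Q, K, V, eps=1e-8):
    """Linear self-attention sketch for the relation encoder.
    Q, K, V: (N, d). Cost grows linearly in the number N of relation samples."""
    q, k = torch.relu(Q), torch.relu(K)
    kv = k.t() @ V                                     # (d, d) key-value summary, computed once
    normaliser = q @ k.sum(dim=0, keepdim=True).t()    # (N, 1)
    return (q @ kv) / (normaliser + eps)               # (N, d) attended features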
S3: initializing the constructed scene graph generation network in a random manner;
the region-of-interest alignment ROIAlign module of the constructed scene graph generation network directly inherits its parameters from the corresponding module of the pre-trained Faster-RCNN for initialization, and the remaining parts of the scene graph generation network are randomly initialized using the Xavier weight initialization method;
S4: randomly distributing the pictures in the training set into batches of a fixed size, and inputting them into the scene graph generation network to obtain predicted scene graph results;
the pictures in the training set are randomly distributed into batches of a fixed size, and pictures in the same batch have the same size proportion; a batch is input into the constructed scene graph generation network, the initial features of the objects are constructed, the object classifier output o_k is obtained through the constructed transformer encoder for object information fusion, and the loss of the object classifier output o_k is calculated using the cross entropy loss function;
all ordered binary combinations of objects are obtained from one picture; combinations with relation labels and combinations without relation labels are randomly sampled at a ratio of 1:3 to obtain 1024 training samples, the initial features of the relations are constructed, the relation classifier output r_p is obtained through the constructed transformer encoder for relation information fusion, and the loss of the relation classifier output r_p is calculated using the de-bias loss function based on relation class association; specifically, the relevance coefficients between different relation classes are first obtained by using the relation classification results on the validation set, the relevance coefficients are then used as weighting coefficients in the denominator of the softmax function, and finally the softmax result is input into a class-balancing loss function without hyperparameters to calculate the loss of the relation, specifically expressed as the following formulas:
s_p^i = exp(r_p^i) / Σ_{m=1..C_r} w_{p,m} · exp(r_m^i)
L_rel = -(1/C_r) Σ_{p=1..C_r} (1/N_p) Σ_{i=1..N_p} log s_p^i
where N_p is the number of samples in the current batch whose relation class is p, r_p^i denotes the classifier output of the i-th sample on relation class p, and w_{p,m} is the relevance coefficient between relation class p and relation class m; when the number of training samples of relation class p is less than or equal to 10% of that of class m, then w_{p,m} = 0 if more than 80% of the validation set samples of class p are predicted as class m, and w_{p,m} = 0.4 if between 40% and 80% of the validation set samples of class p are predicted as class m; in all other cases w_{p,m} = 1; this is expressed as the following formula:
w_{p,m} = 0 if n_p ≤ 0.1·n_m and a_{p,m} > 0.8; w_{p,m} = 0.4 if n_p ≤ 0.1·n_m and 0.4 ≤ a_{p,m} ≤ 0.8; w_{p,m} = 1 otherwise
where n_p and n_m are the numbers of training samples of class p and class m respectively, and a_{p,m} is the proportion of validation set samples of class p that are predicted as class m;
calculating the sum of the object loss and the relation loss, specifically expressed as:
L = L_obj + L_rel
updating the network parameters using the gradient descent method; before the training optimization of the network, w_{p,m} is initialized to 1 and the batch size is fixed at 16; the learning rate is linearly warmed up to 0.001 over the first 500 iteration steps, then kept unchanged until it is adjusted to 0.0001 at the 10000th iteration step and to 0.00001 at the 16000th iteration step; during this period the currently trained network parameters are saved every 2000 iteration steps, and w_{p,m} is updated with the test results on the validation set;
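A sketch of the de-bias loss and of the relevance-coefficient table described above is given below. The per-class averaging used as the hyper-parameter-free class balancing follows the reconstructed formulas and is therefore an assumption, and all function names are illustrative. As described above, W would be initialized to all ones and recomputed from the validation results of the saved checkpoints every 2000 iterations.

import torch

def relevance_coefficients(train_counts, confusion_ratio):
    """Build w_{p,m} from training sample counts n_p and the validation-set ratios
    a_{p,m} (fraction of class-p samples predicted as class m); both are (C_r,) / (C_r, C_r) tensors."""
    C = len(train_counts)
    W = torch.ones(C, C)
    for p in range(C):
        for m in range(C):
            if p != m and train_counts[p] <= 0.1 * train_counts[m]:
                if confusion_ratio[p, m] > 0.8:
                    W[p, m] = 0.0
                elif confusion_ratio[p, m] >= 0.4:
                    W[p, m] = 0.4
    return W

def relation_debias_loss(logits, targets, W, eps=1e-8):
    """De-bias loss: softmax whose denominator is weighted by w_{p,m}, followed by a
    per-class average of the negative log-likelihood (class balancing without hyper-parameters).
    logits: (B, C_r) relation classifier outputs r_p; targets: (B,) relation labels;
    W: (C_r, C_r) relevance coefficients with W[p, m] = w_{p,m}."""
    exp = torch.exp(logits - logits.max(dim=1, keepdim=True).values)   # stabilised exponentials
    denom = (W[targets] * exp).sum(dim=1) + eps                        # sum_m w_{p,m} exp(r_m)
    num = exp.gather(1, targets[:, None]).squeeze(1)                   # exp(r_p)
    log_prob = torch.log(num / denom + eps)
    present = targets.unique()
    loss = torch.stack([-log_prob[targets == p].mean() for p in present])
    return loss.mean()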
S5: performing loss calculation on an object prediction result in the predicted scene graph result by adopting a cross entropy loss function;
S6: constructing a de-bias loss function based on relation class association, and performing loss calculation on the relation prediction results in the predicted scene graph results by using this loss function;
S7: adding the loss of the object prediction results and the loss of the relation prediction results, performing gradient calculation, and updating the scene graph generation network through back propagation;
S8: constructing a label-free positive sample learning method, generating reliable pseudo labels for the unlabeled positive samples by using the currently trained network after training optimization of a specific step length, and adding them into the training samples to continue training and optimizing the network;
a label-free positive sample learning method is constructed: after 14000 steps of training optimization, the currently trained network is used to generate reliable pseudo labels for the unlabeled positive samples of the training set; specifically, given the object position and category annotation information in the current picture, the top 20 relation classification prediction results with the highest confidence are first obtained, the correctly predicted labeled positive samples are put into a labeled set, the unlabeled samples are put into an unlabeled set, and the confidence of the 20th relation classification prediction result is recorded as τ; for a sample α in the unlabeled set, when there exists a sample β in the labeled set such that the intersection-over-union of the minimum closure regions of the subject and object of sample α and sample β is greater than 0.3: if the relation classification result p_α of sample α is not the same as the label of sample β, the label of sample α is updated to p_α; if they are the same but sample α has a confidence higher than τ+0.1 on other classes, the class with the highest such confidence is selected to update the label of sample α; after the pseudo labels have been updated for all the selected label-free positive samples in the training set, training and optimization of the constructed scene graph generation network continues until the iteration reaches the 18000th step, at which point training ends and the trained scene graph generation network is obtained;
S9: generating the scene graph result for the input picture by using the trained scene graph generation network;
the position information and object category information of all objects are detected by using the Faster-RCNN target detection network, and then all the information is input into the constructed scene graph generation network:
the initial features of the objects are constructed, and the object classifier output o_k is obtained through the constructed transformer encoder for object information fusion;
all ordered binary combinations of the objects are obtained, the initial features of the relations are constructed, and the relation classifier output r_p is obtained through the constructed transformer encoder for relation information fusion;
then, for each ordered binary combination of objects (h, t), the final score of the relation prediction result between the two objects is calculated as:
s = s_h · s_t · s_r
where s_h is the confidence of the subject object classification result, s_t is the confidence of the object classification result, and s_r is the confidence of the relation classification result, all obtained by applying a softmax function to the object classifier output o_k and the relation classifier output r_p; finally, the predicted objects are taken as graph nodes and the predicted relations between the objects as directed edges to construct the resulting scene graph.
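Finally, the scoring and graph assembly of step S9 can be sketched as follows; the function and field names are illustrative assumptions.

import torch

def assemble_scene_graph(object_logits, object_boxes, pairs, relation_logits):
    """Score every (subject, object, relation) triple by the product of the three
    softmax confidences and return the directed edges of the scene graph.
    object_logits: (N, C_e) classifier outputs o_k; pairs: list of (h, t) index tuples;
    relation_logits: (len(pairs), C_r) classifier outputs r_p."""
    obj_probs = torch.softmax(object_logits, dim=1)
    rel_probs = torch.softmax(relation_logits, dim=1)
    obj_scores, obj_classes = obj_probs.max(dim=1)

    edges = []
    for idx, (h, t) in enumerate(pairs):
        rel_score, rel_class = rel_probs[idx].max(dim=0)
        final = obj_scores[h] * obj_scores[t] * rel_score      # s_h * s_t * s_r
        edges.append({
            "subject": (int(obj_classes[h]), object_boxes[h]),  # graph node
            "object": (int(obj_classes[t]), object_boxes[t]),   # graph node
            "relation": int(rel_class),                          # directed edge label
            "score": float(final),
        })
    # keep the highest-scoring edges first when building the final scene graph
    edges.sort(key=lambda e: e["score"], reverse=True)
    return edges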
In order to verify the effectiveness of the method, experiments are carried out on a VG150 public scene graph generation data set, and quantitative and qualitative analysis is carried out.
As shown in Table 1 below, the first row of Table 1 is the result of the baseline model, which uses the classical transformer structure for both the object and relation transformer encoders and uses cross-entropy loss for both the object and relation classification results. The subsequent models add LG (the optimized transformer structure), CC (the de-bias loss function based on relation class association) and PU (the label-free positive sample learning method) respectively, and the experimental results demonstrate the effectiveness of each part.
Table 1 Self-comparison experimental data on VG150
As shown in Table 2, the present invention is compared with the best-performing published methods on VG150:
Table 2 Comparison of the present invention with other methods on VG150
Example 2
The embodiment provides a scene graph generation system based on a transformer model and category association, including: an object information detection module, a scene graph generation network construction module, an initialization module, a scene graph result prediction module, an object prediction result loss calculation module, a relation prediction result loss calculation module, a gradient calculation module, a training optimization module and a result output module;
in this embodiment, the object information detection module is configured to input an original picture into a fast-RCNN target detection algorithm network, and detect positions and object type information of all objects;
in this embodiment, the scene graph generation network construction module is configured to construct a scene graph generation network based on a transform model;
in this embodiment, the initialization module is configured to initialize the constructed scene graph generation network in a random manner;
in this embodiment, the scene graph result prediction module is configured to randomly allocate pictures in a training set to a batch of a fixed size, and input the pictures into a scene graph generation network to obtain a predicted scene graph result;
in this embodiment, the object prediction result loss calculation module is configured to perform loss calculation on an object prediction result in a predicted scene graph result by using a cross entropy loss function;
in this embodiment, the relationship prediction result loss calculation module is configured to construct a de-bias loss function based on relationship class association, and perform loss calculation on the relationship prediction result in the predicted scene graph structure by using the loss function;
in this embodiment, the gradient calculation module is configured to add the loss of the object prediction result and the loss of the relation prediction result, perform gradient calculation, and update the scene graph generation network through back propagation;
in this embodiment, the training optimization module is configured to construct a label-free positive sample learning method, and after training optimization for a specific step length, generate reliable pseudo labels for unlabeled positive samples by using a currently trained network, and add them to the training samples to continue training optimization on the network;
in this embodiment, the output module is configured to generate a scene graph result for the input picture by using the trained scene graph generation network.
Example 3
The present embodiment provides a storage medium, which may be a storage medium such as a ROM, a RAM, a magnetic disk, or an optical disk; the storage medium stores one or more programs, and when the programs are executed by a processor, the method for generating a scene graph based on a transformer model and category association according to embodiment 1 is implemented.
Example 4
The embodiment provides a computing device, which may be a desktop computer, a notebook computer, a smart phone, a PDA handheld terminal, a tablet computer, or another terminal device with a display function; the computing device includes a processor and a memory, the memory stores one or more programs, and when the processor executes the programs stored in the memory, the method for generating a scene graph based on a transformer model and category association according to embodiment 1 is implemented.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such modifications are intended to be included in the scope of the present invention.

Claims (10)

1. A method for generating a scene graph based on a transformer model and category association, characterized by comprising the following steps:
inputting the original picture into a Faster-RCNN target detection algorithm network, and detecting the positions and object category information of all objects;
constructing a scene graph generation network based on a transformer model;
initializing the constructed scene graph generation network in a random mode;
randomly distributing the pictures in the training set to a batch with a fixed size, and inputting the pictures into a scene graph generation network to obtain a predicted scene graph result;
performing loss calculation on an object prediction result in the predicted scene graph result by adopting a cross entropy loss function;
constructing a de-bias loss function based on relation category association, and performing loss calculation on a relation prediction result in a predicted scene graph structure by adopting the loss function;
adding the loss of the object prediction result and the loss of the relation prediction result, performing gradient calculation, and updating the scene graph generation network through back propagation;
constructing a label-free positive sample learning method, after training optimization of a set step length, utilizing a currently trained network to generate a pseudo label for an unmarked positive sample, and adding the pseudo label into the training sample to continue training optimization on the network;
and generating a scene graph result for the input picture by using the trained scene graph generation network.
2. The method for generating a scene graph based on a transformer model and category association as claimed in claim 1, wherein the step of inputting the original picture into the Faster-RCNN target detection algorithm network to detect the positions and object category information of all objects specifically comprises:
using the trained Faster-RCNN target detection network to perform target detection on the input original picture, obtaining the bounding box b_e of each object in the picture and its confidence s_k on each object class, where the subscript k ∈ {1, …, C_e} indexes the corresponding object class and C_e is the number of all object classes.
3. The method for generating a scene graph based on a transformer model and category association as claimed in claim 1, wherein the step of constructing a transformer-model-based scene graph generation network comprises:
constructing initial features of the object:
extracting a global feature map from the original picture by using the convolution network ResNeXt-101 in Faster-RCNN; according to the bounding box b_e of an object, intercepting the convolution features of the object bounding box area on the global feature map by using the region-of-interest alignment operation ROIAlign to obtain a visual feature vector; selecting GloVe word vectors to encode the semantic information of the object, taking the confidence of the current object on each object category as a weight, and weighting and summing the word vectors corresponding to each object category to obtain a semantic feature vector; after splicing the visual features and the semantic features, obtaining the initial features of the object by using a fully connected layer;
constructing a transformer encoder for object information fusion;
stacking multiple layers of the transformer structure for object information fusion, each layer comprising multi-head linear self-attention and Add & Norm operations; the initial features of the object are converted into refined features with context information through the transformer encoder, and a fully-connected-layer classifier is used to obtain the object classifier output;
initial characteristics of the constructed relationship:
for any pair of object ordered binary combinations, defining a first object as a subject and another object as an object, intercepting convolution characteristics of minimum closed regions of two object bounding boxes on a global characteristic diagram by using a region-of-interest alignment operation ROIAlign, splicing the characteristics with refinement characteristics of the subject and refinement characteristics of the object, and finally obtaining relationship initial characteristics by using a full connection layer;
and constructing a transformer encoder for relational information fusion, stacking a plurality of layers of transformer structures aiming at the relational information fusion, converting the initial characteristics of the relation into refined characteristics with context information through the transformer encoder, and then obtaining the output of a relational classifier by using a full-connection-layer classifier.
4. The method for generating a scene graph based on a transformer model and category association according to claim 1, wherein the initialization of the constructed scene graph generation network in a random manner specifically comprises:
the region-of-interest alignment ROIAlign module of the constructed scene graph generation network inherits its parameters from the corresponding module of the pre-trained Faster-RCNN for initialization, and the remaining parts of the scene graph generation network are randomly initialized using the Xavier weight initialization method.
5. The method for generating a scene graph based on a transformer model and category association as claimed in claim 1, wherein the step of adding the loss of the object prediction result and the loss of the relation prediction result, performing gradient calculation, and updating the scene graph generation network by back propagation specifically comprises:
obtaining the relevance coefficients between different relation classes by using the relation classification results on the verification set, using the relevance coefficients as weighting coefficients in the denominator of the softmax function, and inputting the softmax result into a class-balancing loss function without hyperparameters to calculate the loss of the relation, specifically expressed as follows:
s_p^i = exp(r_p^i) / Σ_{m=1..C_r} w_{p,m} · exp(r_m^i)
L_rel = -(1/C_r) Σ_{p=1..C_r} (1/N_p) Σ_{i=1..N_p} log s_p^i
where N_p is the number of samples in the current batch whose relation class is p, r_p^i denotes the classifier output of the i-th sample on relation class p, and w_{p,m} is the relevance coefficient between relation class p and relation class m;
calculating the sum of the loss of the object prediction result and the loss of the relation prediction result, specifically expressed as:
L = L_obj + L_rel
and updating the network parameters by adopting a gradient descent method.
6. The method for generating a scene graph based on a transformer model and category association as claimed in claim 1, wherein constructing the label-free positive sample learning method and generating pseudo labels for unlabeled positive samples with the currently trained network after training optimization of a set step length specifically comprises:
given the object position and category annotation information in the current picture, screening out a set number of relation classification prediction results ranked by confidence;
putting the correctly predicted labeled positive samples into a labeled set and the unlabeled samples into an unlabeled set;
recording the confidence of the last retained relation classification prediction result as τ;
for a sample α in the unlabeled set, when there exists a sample β in the labeled set such that the intersection-over-union of the minimum closure regions of the subject and object of sample α and sample β is greater than a set threshold: if the relation classification result p_α of sample α is not the same as the label of sample β, the label of sample α is updated to p_α; if they are the same but sample α has a confidence higher than τ+0.1 on other classes, the class with the highest such confidence is selected to update the label of sample α;
and after the pseudo labels are updated on all the selected label-free positive samples in the training set, continuing to train and optimize the constructed scene graph generation network until the set step length is reached, and finishing training to obtain the trained scene graph generation network.
7. The method for generating a scene graph based on a transformer model and category association as claimed in claim 1, wherein generating the scene graph result for the input picture by using the trained scene graph generation network specifically comprises:
constructing initial features of the objects and obtaining the object classifier output through the constructed transformer encoder for object information fusion;
obtaining all ordered binary combinations of the objects, constructing initial features of the relations, and obtaining the relation classifier output through the constructed transformer encoder for relation information fusion;
for each ordered binary combination of objects, the final score of the relation prediction result between the two objects is calculated as:
s = s_h · s_t · s_r
where s_h is the confidence of the subject object classification result, s_t is the confidence of the object classification result, and s_r is the confidence of the relation classification result, all obtained by applying a softmax function to the object classifier output and the relation classifier output;
and taking the predicted objects as graph nodes and the predicted relationship between the objects as directed edges to construct the obtained scene graph result.
8. A system for generating a scene graph based on a transformer model and category association, characterized by comprising: an object information detection module, a scene graph generation network construction module, an initialization module, a scene graph result prediction module, an object prediction result loss calculation module, a relation prediction result loss calculation module, a gradient calculation module, a training optimization module and a result output module;
the object information detection module is used for inputting the original picture into a fast-RCNN target detection algorithm network to detect the positions and object type information of all objects;
the scene graph generation network construction module is used for constructing a scene graph generation network based on a transformer model;
the initialization module is used for initializing the constructed scene graph generation network in a random mode;
the scene graph result prediction module is used for randomly distributing the pictures in the training set to a batch with a fixed size and inputting the pictures into a scene graph generation network to obtain a predicted scene graph result;
the object prediction result loss calculation module is used for performing loss calculation on an object prediction result in the predicted scene graph result by adopting a cross entropy loss function;
the relation prediction result loss calculation module is used for constructing a de-bias loss function based on relation category association and adopting the loss function to carry out loss calculation on the relation prediction result in the predicted scene graph structure;
the gradient calculation module is used for adding the loss of the object prediction result and the loss of the relation prediction result, performing gradient calculation, and updating the scene graph generation network through back propagation;
the training optimization module is used for constructing a label-free positive sample learning method, after training optimization with a specific step length is carried out, reliable pseudo labels are generated for unlabeled positive samples by using the currently trained network, and the pseudo labels are added into the training samples to continue training and optimizing the network;
the output module is used for generating a scene graph result for the input picture by using the trained scene graph generation network.
9. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements the method for generating a scene graph based on a transformer model and category association as claimed in any one of claims 1-7.
10. A computing device comprising a processor and a memory for storing a processor-executable program, wherein the processor, when executing the program stored in the memory, implements the method for generating a scene graph based on a transformer model and category association as claimed in any one of claims 1-7.
CN202210388789.7A 2022-04-14 2022-04-14 Scene graph generation method based on transformer model and category association Active CN114782791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210388789.7A CN114782791B (en) 2022-04-14 2022-04-14 Scene graph generation method based on transform model and category association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210388789.7A CN114782791B (en) 2022-04-14 2022-04-14 Scene graph generation method based on transform model and category association

Publications (2)

Publication Number Publication Date
CN114782791A true CN114782791A (en) 2022-07-22
CN114782791B CN114782791B (en) 2024-03-22

Family

ID=82429114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210388789.7A Active CN114782791B (en) 2022-04-14 2022-04-14 Scene graph generation method based on transform model and category association

Country Status (1)

Country Link
CN (1) CN114782791B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210012199A1 (en) * 2019-07-04 2021-01-14 Zhejiang University Address information feature extraction method based on deep neural network model
CN112989927A (en) * 2021-02-03 2021-06-18 杭州电子科技大学 Scene graph generation method based on self-supervision pre-training
CN113065587A (en) * 2021-03-23 2021-07-02 杭州电子科技大学 Scene graph generation method based on hyper-relation learning network
WO2021151296A1 (en) * 2020-07-22 2021-08-05 平安科技(深圳)有限公司 Multi-task classification method and apparatus, computer device, and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210012199A1 (en) * 2019-07-04 2021-01-14 Zhejiang University Address information feature extraction method based on deep neural network model
WO2021151296A1 (en) * 2020-07-22 2021-08-05 平安科技(深圳)有限公司 Multi-task classification method and apparatus, computer device, and storage medium
CN112989927A (en) * 2021-02-03 2021-06-18 杭州电子科技大学 Scene graph generation method based on self-supervision pre-training
CN113065587A (en) * 2021-03-23 2021-07-02 杭州电子科技大学 Scene graph generation method based on hyper-relation learning network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
兰红; 刘秦邑: "Scene graph to image generation model with graph attention network" (图注意力网络的场景图到图像生成模型), Journal of Image and Graphics (中国图象图形学报), no. 08, 12 August 2020 (2020-08-12), pages 83-95 *

Also Published As

Publication number Publication date
CN114782791B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
Wen et al. Preparing lessons: Improve knowledge distillation with better supervision
WO2021143396A1 (en) Method and apparatus for carrying out classification prediction by using text classification model
CN110008338B (en) E-commerce evaluation emotion analysis method integrating GAN and transfer learning
CN110188358B (en) Training method and device for natural language processing model
CN110516095B (en) Semantic migration-based weak supervision deep hash social image retrieval method and system
CN111310672A (en) Video emotion recognition method, device and medium based on time sequence multi-model fusion modeling
CN112905827A (en) Cross-modal image-text matching method and device and computer readable storage medium
CN111680484B (en) Answer model generation method and system for visual general knowledge reasoning question and answer
CN109858015A (en) A kind of semantic similarity calculation method and device based on CTW and KM algorithm
CN113204675B (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN113344206A (en) Knowledge distillation method, device and equipment integrating channel and relation feature learning
CN113806494B (en) Named entity recognition method based on pre-training language model
CN111666406A (en) Short text classification prediction method based on word and label combination of self-attention
CN110349229A (en) A kind of Image Description Methods and device
CN113822125B (en) Processing method and device of lip language recognition model, computer equipment and storage medium
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN117152416A (en) Sparse attention target detection method based on DETR improved model
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN113705238A (en) Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN116129174A (en) Generalized zero sample image classification method based on feature refinement self-supervision learning
CN112817442A (en) Situation information classification recommendation system and method under multi-task condition based on FFM
CN115640418B (en) Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
Zhao et al. Domain adaptation with feature and label adversarial networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant