CN118298428A - Unbiased scene graph generation method based on salient visual context

Unbiased scene graph generation method based on salient visual context

Info

Publication number
CN118298428A
Authority
CN
China
Prior art keywords
predicate
features
scene graph
context
prototype
Prior art date
Legal status
Pending
Application number
CN202410348938.6A
Other languages
Chinese (zh)
Inventor
王进
徐嘉玲
杨子龙
陈梦
Current Assignee
Nantong University
Original Assignee
Nantong University
Priority date
Filing date
2024-03-26
Publication date
2024-07-05
Application filed by Nantong University

Landscapes

  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of artificial intelligence and computer vision, and particularly relates to an unbiased scene graph generation method based on a salient visual context. The method adopts an advanced, lightweight and efficient vision Transformer model, DualToken-ViT, to encode the visual features of the image context into salient visual context features. By combining a convolutional encoder with a position-aware token module, DualToken-ViT captures local details and a global overview of the image content, respectively, thereby constructing an efficient attention mechanism. The salient visual context features generated by the model provide important visual context information for relationship prediction, helping the model understand the image content more accurately and predict the relationships between instances. The method effectively learns the visual features of the image context and improves the robustness of relationship prediction in the scene graph generation model.

Description

Unbiased scene graph generation method based on salient visual context
Technical Field
The invention belongs to the technical field of artificial intelligence and computer vision, and particularly relates to an unbiased scene graph generation method based on a salient visual context.
Background
Scene graph generation is an important task in the field of artificial intelligence. As a frontier task in computer vision, it combines image recognition with natural language processing. The aim is to detect and classify all instances in an image and to predict the visual relationship between every two instances; instances and relationships are represented by simple nouns and predicates, respectively, and are output in the form of triplets. As shown in figs. 1-2, "woman-holding-ball" is a triplet in which "woman" is the subject instance, "ball" is the object instance, and "holding" is the relationship between the two instances; all the triplets in an image are combined to form the scene graph of that image. As an intermediate task, scene graph generation bridges the gap between the upstream object detection task and downstream advanced visual understanding tasks such as image captioning, visual navigation, human pose estimation and visual question answering.
In the current workflow of most scene graph generation models, a target detector is first deployed to recognize the input image and extract instance proposals and the corresponding features, and these extracted features are then used as the basic data for training the model, so that the model can accurately predict the visual relationships between different instances in the image. In scene graph generation research, the volume of image-feature data is enormous. For example, instance features serve as the basis of the scene graph generation task, covering the category information and visual characterization of each instance; spatial features describe the relative positions among objects and help the model analyze their spatial arrangement to predict spatial relationships such as "in front of", "under" and "beside"; context features include information about the background, the environment and other objects, ensuring that the relationships predicted by the model appear reasonable in the context of the entire image and form a consistent context. Capturing the key features of an image is therefore critical to accurate relationship prediction in scene graph generation. By integrating these features, the generated scene graph can describe the image content more fully and accurately, thereby providing more precise scene understanding for various downstream tasks.
In early studies, scene graph generation methods simply detected the relationship between each pair of instances independently, which often ignored the rich context information in the original image. Taking fig. 1 as an example, when facing the instance pair "woman-ball", the model tends to predict that the relationship between the two is "play", following common sense from real life.
However, when the context information in the image is considered, as shown in fig. 2, the model can perform deeper reasoning through context cues such as "light", "goods" and "balls", and further understand that the scene actually takes place in a supermarket, so that the relationship between the two is more accurately predicted as "holding". This advance illustrates the importance of context information for enhancing the understanding and predictive capabilities of the model.
In recent research, introducing context information to refine instance features and thereby enhance the relationship-prediction ability of models has become a popular trend. Taking the 2023 study of Chang et al. as an example, they propose a language context module LCM (Language Context Module) to process the context information extracted by the object detector. First, they generate the corresponding word embeddings using a GloVe module and form the context features by matrix concatenation. These context features are then fed into a Transformer encoder to form context attention features, which are fused with the joint features of the subject and object in the instance pair whose relationship is to be predicted, producing refined subject-object joint features. Finally, these refined features are used in a scene graph generation baseline model for relationship prediction. The overall network framework for processing context information based on the language context module is shown in fig. 3. While the above approach significantly improves the performance of the model in predicting relationships between instances, it primarily encodes the word embeddings in the context information without directly learning the visual features of the image itself. The same words may exhibit different visual features in different visual contexts, and scene graph generation is a task of predicting image content whose effect depends largely on how efficiently the model learns visual features. Relying solely on word embeddings may therefore limit the ability of the model to handle visual diversity and understand complex scenes, affecting the robustness of the model in relationship prediction.
Scene graph generation is a task of relationship prediction on image content, and its effect depends largely on how efficiently the model learns visual features. Existing scene graph generation methods only encode the word embeddings in the context information and do not directly learn the visual features of the image, so the ability of the model to handle visual diversity and understand complex scenes is limited and its robustness is insufficient.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides an unbiased scene graph generation method based on a salient visual context, which further improves the robustness of the scene graph generation model in relationship prediction.
In order to achieve the aim of the invention, the technical scheme adopted by the invention is as follows:
An unbiased scene graph generation method based on a salient visual context, comprising the steps of:
Step 1: inputting training images in the data set into a target detector to generate an instance of a relationship to be predicted and corresponding context information;
Step 2: modeling the host-guest instance and the relation predicate by using a prototype-based representation to obtain corresponding host, guest instance prototypes and predicate prototypes, and carrying out feature fusion on the formed host and guest instance prototypes to obtain host-guest joint features;
step 3: encoding the visual features of the context using DualToken-ViT to form salient visual context features;
Step 4: fusing the host-guest combined features obtained in the step 2 with the obvious visual context features obtained in the step 3 to obtain refined host-guest combined features, and matching the refined host-guest combined features with the predicate prototypes obtained in the step 2 by adopting cosine similarity constraint to obtain matching loss;
step 5: enlarging the differences between predicate prototypes through a Euclidean distance constraint to obtain the distance loss;
step 6: summing the matching loss obtained in the step 4 and the distance loss obtained in the step 5 to obtain total loss, and performing training for generating a scene graph;
Step 7: if the set batch size is reached, returning to step 1; if all the training pictures have been read, entering step 8;
step 8: and outputting the trained model, and ending.
In a further preferred embodiment of the present invention, in step 2, the steps of prototype-based modeling of the instances and subject-object prototype fusion are as follows:
Step 2-1: prototype-based modeling is performed on the subject instance, the object instance and the predicate to obtain the subject prototype f_s, the object prototype f_o and the predicate prototype f_p; the specific modeling process is shown in formula (1), where c_s, c_o, c_p respectively denote the word embeddings of the subject, object and predicate class labels, and w_s, w_o, w_p respectively denote the learnable weight parameters of the subject, object and predicate, which are learned during training;
f_s = w_s^T c_s, f_o = w_o^T c_o, f_p = w_p^T c_p (1);
Step 2-2: the subject prototype f_s and the object prototype f_o are fused using the feature fusion function shown in formula (2) to obtain the subject-object joint feature F(f_s, f_o), where RELU(·) is the activation function, which effectively alleviates the overfitting problem during model training;
F(f_s, f_o) = RELU(f_s + f_o) - (f_s - f_o)^2 (2).
Further, as a preferred technical scheme of the invention, the step of encoding the visual context in step 3 comprises the following steps:
Step 3-1: a convolutional encoder is used to extract the local features V_c^local of the visual context features V_c, as shown in formula (3), where DW is depth-wise convolution, LN is layer normalization, PW is point-wise convolution, and GELU is the activation function;
V_c^local = V_c + PW(GELU(PW(LN(DW(V_c))))) (3);
Step 3-2: a position-aware token module is used to extract the global features V_c^global of the visual context features V_c; the local features V_c^local from step 3-1 are down-sampled step by step, which reduces the information loss during down-sampling and retains more useful information, as shown in formula (4), where V_c^ds denotes the result of the step-wise down-sampling, DS denotes two-fold down-sampling using average pooling, and φ denotes that if the feature size does not match the expected size, multiple convolution and down-sampling operations are performed, each represented by DS(Conv(·));
Step 3-3: multi-head self-attention is used to perform global aggregation on V_c^ds, as shown in formula (5), where V_c^ga is the result of the global aggregation and contains the global information, Q_ds, K_ds and V_ds are generated from V_c^ds by linear projection, and MSA denotes the multi-head self-attention mechanism;
Step 3-4: the position embedding P_c and the aggregated context information V_c^ga are merged by weighted summation to enrich the global information, as shown in formula (6), where MLP is a multi-layer perceptron applied to the result of the weighted summation and α ∈ [0,1] is a preset weight;
Step 3-5: the enriched global information is broadcast to the visual context features V_c through multi-head self-attention to generate the global features V_c^global, as shown in formula (7), where Q_c is generated from V_c by linear projection, and K_g′ and V_g′ are generated from the result of the weighted summation by linear projection;
Step 3-6: the local features V_c^local and the global features V_c^global are concatenated along the y-axis and input into the feed-forward network FN and the two-dimensional attention network BDA to obtain the final salient visual context features, as shown in formula (8);
Further, as a preferred technical scheme of the present invention, step 4 comprises the following steps:
Step 4-1: the subject-object joint features F(f_s, f_o) and the salient visual context features are concatenated along the y-axis and input into a fully connected layer to obtain the refined subject-object joint features, as shown in formula (9), where FC(·) is the fully connected layer;
Step 4-2: a cosine similarity constraint is used to match the refined subject-object joint features with the predicate prototype f_p to obtain the matching loss L_rp_cos, as shown in formula (10), where the refined joint features and the predicate prototype are first unit-normalized, t is the index of the ground-truth class, τ is a learnable temperature hyper-parameter, and N is the number of predicate classes; the smaller L_rp_cos is, the closer the refined subject-object joint features are to f_p, that is, the closer the subject-object instance pair is to the predicate within that relationship predicate class;
Further, as a preferred technical solution of the present invention, the Euclidean distance constraint in step 5 comprises the following steps:
Step 5-1: to enlarge the distinction between any two different predicate prototypes f_pi and f_pj, the Euclidean distance between f_pi and f_pj is first computed under the Euclidean distance constraint to obtain the distance matrix D, where D_ij denotes the Euclidean distance between predicate prototypes f_pi and f_pj and is obtained by formula (11), i, j ∈ {1, 2, 3, ..., N}, N is the number of predicate classes, and |·| is the absolute-value operation;
Step 5-2: the elements of each row of the distance matrix D are sorted in increasing order, and the first k smallest values of each row are summed and averaged to obtain d, as shown in formula (12);
Step 5-3: the distance loss L_p_euc is obtained by formula (13), where γ is another hyper-parameter used to adjust the distance margin; the smaller L_p_euc is, the greater the distinction between any two different predicate prototypes f_pi and f_pj;
L_p_euc = max(0, -d + γ) (13).
Further, as a preferred embodiment of the present invention, the final constraint L obtained in step 6 combines the matching loss L_rp_cos and the distance loss L_p_euc, as shown in formula (14):
L = L_rp_cos + L_p_euc (14).
A test method for testing the trained model output by the above unbiased scene graph generation method based on a salient visual context comprises the following steps:
s1: inputting a test image into the trained model;
s2: performing similarity matching between the relationship representation output by the model and the corresponding predicate prototypes, and selecting the class label of the most similar predicate prototype as the relationship prediction result;
S3: comparing the relation prediction result with a real predicate class label, and evaluating the result by using a common evaluation index;
S4: and outputting a test result and ending.
Further, as a preferred technical scheme of the invention, the test images in S1 come from the test set partitioned from the Visual Genome dataset.
Further as a preferred technical solution of the present invention, in the S2, a cosine chord similarity is adopted to combine the main client and the clientMatching with all predicate prototypes, and selecting class labels R of the predicate prototypes with the maximum similarity results as the result of relation prediction, as shown in a formula (15), wherein i is {1,2,3,..N }, and s i represents the combined characteristics of the host and the guestSimilarity to predicate prototype f pi;
Further, as a preferred technical scheme of the invention, in S3 the relationship prediction results are compared with the ground-truth predicate class labels, and the widely accepted evaluation criteria in the current scene graph generation field, namely Recall@K (R@K) and mean Recall@K (mR@K), are adopted to accurately evaluate the test results of the model; the larger the values of these two evaluation indices, the better the test results, indicating higher robustness and better performance of the model.
Compared with the prior art, the unbiased scene graph generation method based on a salient visual context has the following technical effect: the invention provides a dual-token salient visual context scene graph generation network that effectively learns the visual features of the image context and improves the robustness of the scene graph generation model in relationship prediction.
Drawings
FIG. 1 is a diagram of a comparison of predicate predictions using common sense reasoning in the prior art;
FIG. 2 is a comparison graph of predicate prediction using context information reasoning in the prior art;
FIG. 3 is a general network frame diagram of a prior art language context module LCM processing context information;
FIG. 4 is a flow chart of the unbiased scene graph generation method based on a salient visual context of the present invention;
FIG. 5 is a schematic diagram of a network architecture of a convolutional encoder;
FIG. 6 is a diagram illustrating the network architecture of the position-aware token module;
FIG. 7 is a flow chart of the training phase of the dual-token salient visual context scene graph generation network DSVC-SGG of the present invention.
Detailed Description
The invention is further explained in the following detailed description with reference to the drawings, so that those skilled in the art can more fully understand and practice it; the invention is, however, described below by way of example only and not by way of limitation.
The invention provides an unbiased scene graph generation method based on a salient visual context, which adopts an advanced, lightweight and efficient vision Transformer model, DualToken-ViT, to encode the visual features of the image context into salient visual context features. By combining a convolutional encoder with a position-aware token module, DualToken-ViT captures local details and a global overview of the image content, respectively, thereby constructing an efficient attention mechanism. The salient visual context features generated by the model provide important visual context information for relationship prediction, helping the model understand the image content more accurately and predict the relationships between instances. These features are then fused with the joint features of the subject and object in the instance pair whose relationship is to be predicted, resulting in more refined subject-object joint features. These refined features are further input into PE-Net, an unbiased scene graph generation baseline model with excellent current performance, aiming to enhance the robustness of the scene graph generation model in relationship prediction. Specifically, the invention proposes a dual-token salient visual context scene graph generation network DSVC-SGG (DualToken Salient Visual Context Scene Graph Generation Network), whose overall framework is shown in fig. 4. Compared with the prior art that processes context information with the language context module LCM, as shown in fig. 3, the visual context features are encoded directly, avoiding the introduction of word embeddings, so the visual features of the image context can be captured more accurately and the robustness of relationship prediction in the scene graph generation model is improved.
A flow chart of the training phase of the proposed DSVC-SGG is shown in FIG. 7. The batch size may be adjusted during training according to the capabilities of the computer.
The invention provides a method for generating an unbiased scene graph based on a remarkable visual context, which comprises the following steps:
step 1: inputting the training images in the dataset into a target detector, generating the instances whose relationships are to be predicted and the corresponding context information, and entering step 2;
Step 2: modeling the host-guest instance and the relation predicate by using the representation based on the prototype to obtain corresponding host, guest instance prototype and predicate prototype, performing feature fusion on the formed host and guest instance prototype to obtain host-guest joint features, and entering step 3;
step 3: encoding the visual features of the context using DualToken-ViT to form salient visual context features, proceeding to step 4;
step 4: fusing the subject-object joint features obtained in step 2 with the salient visual context features obtained in step 3 to obtain refined subject-object joint features, matching the refined subject-object joint features with the predicate prototypes obtained in step 2 under a cosine similarity constraint to obtain the matching loss, and entering step 5;
step 5: enlarging the differences between predicate prototypes through a Euclidean distance constraint to obtain the distance loss, and entering step 6;
step 6: summing the matching loss obtained in the step 4 and the distance loss obtained in the step 5 to obtain total loss, performing training for generating a scene graph, and entering the step 7;
step 7: if the set batch size is reached, returning to step 1; if all the training pictures have been read, going to step 8.
Step 8: and outputting the trained model, and ending.
The training images in step 1 all come from the Visual Genome (VG) standard dataset for scene graph generation. The dataset consists of 108077 images, containing the 150 most common instance classes and 50 predicate classes. The invention assigns 70% of the images in the dataset to the training set and the remaining 30% to the test set. The target detector adopts Faster R-CNN, through which the instance class and predicate labels, the visual features of the context, and the corresponding position information are obtained.
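As a concrete illustration of this detection stage, the following sketch uses torchvision's pre-trained Faster R-CNN to obtain instance proposals from an image; the specific torchvision model, the score threshold and the helper-function name are assumptions for illustration, not the detector configuration used by the invention.

# Hypothetical sketch: extracting instance proposals with a pre-trained Faster R-CNN.
# The patent only names Faster R-CNN; the torchvision model and threshold below are assumptions.
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

def detect_instances(image, score_thresh=0.5):
    """Return boxes and class labels for one image tensor of shape (3, H, W) with values in [0, 1]."""
    with torch.no_grad():
        output = detector([image])[0]            # dict with 'boxes', 'labels', 'scores'
    keep = output["scores"] > score_thresh       # keep only confident proposals
    return output["boxes"][keep], output["labels"][keep]

# Example call: a random tensor stands in for a Visual Genome training image.
boxes, labels = detect_instances(torch.rand(3, 600, 800))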
In step 2 of the invention, the steps of prototype-based modeling and subject-object prototype fusion are as follows:
Step 2-1: prototype-based modeling is performed on the subject instance, the object instance and the predicate to obtain the subject prototype f_s, the object prototype f_o and the predicate prototype f_p; the specific modeling process is shown in formula (1), where c_s, c_o, c_p respectively denote the word embeddings of the subject, object and predicate class labels, and w_s, w_o, w_p respectively denote the learnable weight parameters of the subject, object and predicate, which are learned during training;
f_s = w_s^T c_s, f_o = w_o^T c_o, f_p = w_p^T c_p (1);
Step 2-2: the subject prototype f_s and the object prototype f_o are fused using the feature fusion function shown in formula (2) to obtain the subject-object joint feature F(f_s, f_o), where RELU(·) is the activation function, which effectively alleviates the overfitting problem during model training;
F(f_s, f_o) = RELU(f_s + f_o) - (f_s - f_o)^2 (2)
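The prototype modeling of formula (1) and the fusion function of formula (2) can be sketched as follows; the embedding and prototype dimensions, the module name and the use of GloVe-style 300-dimensional word embeddings are illustrative assumptions rather than the exact configuration of the invention.

# Minimal sketch of formulas (1) and (2): prototype modeling and subject-object fusion.
# Dimensions and names are illustrative assumptions, not the patent's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeModel(nn.Module):
    def __init__(self, embed_dim=300, proto_dim=512):
        super().__init__()
        # w_s, w_o, w_p from formula (1), realized here as learnable linear maps
        self.w_s = nn.Linear(embed_dim, proto_dim, bias=False)
        self.w_o = nn.Linear(embed_dim, proto_dim, bias=False)
        self.w_p = nn.Linear(embed_dim, proto_dim, bias=False)

    def forward(self, c_s, c_o, c_p):
        # formula (1): prototypes from the class-label word embeddings
        f_s, f_o, f_p = self.w_s(c_s), self.w_o(c_o), self.w_p(c_p)
        # formula (2): subject-object joint feature
        joint = F.relu(f_s + f_o) - (f_s - f_o) ** 2
        return joint, f_p

model = PrototypeModel()
c_s = torch.randn(4, 300)   # GloVe-style word embeddings of subject class labels (batch of 4)
c_o = torch.randn(4, 300)
c_p = torch.randn(4, 300)
joint, f_p = model(c_s, c_o, c_p)   # joint: (4, 512), f_p: (4, 512)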
The step of encoding the visual context in step 3 of the invention is as follows:
Step 3-1: a convolutional encoder is used to extract the local features V_c^local of the visual context features V_c; the network architecture of the convolutional encoder is shown in fig. 5, and the specific process is shown in formula (3), where DW is depth-wise convolution, LN is layer normalization, PW is point-wise convolution, and GELU is the activation function;
V_c^local = V_c + PW(GELU(PW(LN(DW(V_c))))) (3);
Step 3-2: a position-aware token module is used to extract the global features V_c^global of the visual context features V_c; the network architecture of the position-aware token module is shown in fig. 6. The local features V_c^local from step 3-1 are down-sampled step by step, which reduces the information loss during down-sampling and retains more useful information, as shown in formula (4), where V_c^ds denotes the result of the step-wise down-sampling, DS denotes two-fold down-sampling using average pooling, and φ denotes that if the feature size does not match the expected size, multiple convolution and down-sampling operations are performed, each represented by DS(Conv(·));
Step 3-3: multi-head self-attention is used to perform global aggregation on V_c^ds, as shown in formula (5), where V_c^ga is the result of the global aggregation and contains the global information, Q_ds, K_ds and V_ds are generated from V_c^ds by linear projection, and MSA denotes the multi-head self-attention mechanism;
Step 3-4: the position embedding P_c and the aggregated context information V_c^ga are merged by weighted summation to enrich the global information, as shown in formula (6), where MLP is a multi-layer perceptron applied to the result of the weighted summation and α ∈ [0,1] is a preset weight;
Step 3-5: the enriched global information is broadcast to the visual context features V_c through multi-head self-attention to generate the global features V_c^global, as shown in formula (7), where Q_c is generated from V_c by linear projection, and K_g′ and V_g′ are generated from the result of the weighted summation by linear projection.
Step 3-6: the local features V_c^local and the global features V_c^global are concatenated along the y-axis and input into the feed-forward network FN and the two-dimensional attention network BDA to obtain the final salient visual context features, as shown in formula (8);
The specific steps in step 4 of the invention are as follows:
Step 4-1: the subject-object joint features F(f_s, f_o) and the salient visual context features are concatenated along the y-axis and input into a fully connected layer to obtain the refined subject-object joint features, as shown in formula (9), where FC(·) is the fully connected layer;
Step 4-2: a cosine similarity constraint is used to match the refined subject-object joint features with the predicate prototype f_p to obtain the matching loss L_rp_cos, as shown in formula (10), where the refined joint features and the predicate prototype are first unit-normalized, t is the index of the ground-truth class, τ is a learnable temperature hyper-parameter, and N is the number of predicate classes. The smaller L_rp_cos is, the closer the refined subject-object joint features are to f_p, that is, the closer the subject-object instance pair is to the predicate within that relationship predicate class;
The specific steps of the Euclidean distance constraint in step 5 of the invention are as follows:
Step 5-1: to enlarge the distinction between any two different predicate prototypes f_pi and f_pj, the invention first computes the Euclidean distance between f_pi and f_pj under the Euclidean distance constraint to obtain the distance matrix D, where D_ij denotes the Euclidean distance between predicate prototypes f_pi and f_pj and is obtained by formula (11), i, j ∈ {1, 2, 3, ..., N}, N is the number of predicate classes, and |·| is the absolute-value operation;
Step 5-2: the elements of each row of the distance matrix D are sorted in increasing order, and the first k smallest values of each row are summed and averaged to obtain d, as shown in formula (12);
Step 5-3: the distance loss L_p_euc is determined by formula (13), where γ is another hyper-parameter used to adjust the distance margin. The smaller L_p_euc is, the greater the distinction between any two different predicate prototypes f_pi and f_pj;
L_p_euc = max(0, -d + γ) (13)
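The distance constraint of steps 5-1 to 5-3 can be sketched as follows; excluding the zero self-distances on the diagonal before taking the k smallest values per row is an assumption not stated in the text, and the default k and γ simply restate the values used later in Example 1.

# Hedged sketch of the distance loss in steps 5-1 to 5-3. Masking the diagonal
# (self-distances) out of the top-k is an assumption for illustration.
import torch

def distance_loss(predicate_protos, k=3, gamma=7.0):
    """predicate_protos: (N, d) predicate prototypes f_p1 ... f_pN."""
    D = torch.cdist(predicate_protos, predicate_protos, p=2)     # (N, N) Euclidean distances D_ij
    D = D + torch.eye(D.size(0)) * 1e9                           # push self-distances out of the top-k
    smallest, _ = torch.topk(D, k, dim=1, largest=False)         # k smallest distances per row
    d = smallest.mean()                                          # sum and average over rows, formula (12)
    return torch.clamp(gamma - d, min=0.0)                       # formula (13): max(0, -d + gamma)

loss_p_euc = distance_loss(torch.randn(50, 512))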
The final constraint L obtained in step 6 of the invention combines the matching loss L_rp_cos and the distance loss L_p_euc, as shown in formula (14);
L = L_rp_cos + L_p_euc (14)
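For completeness, a toy sketch of summing the two losses as in formula (14) inside one SGD update is shown below; the tiny linear layer is only a stand-in so the snippet runs end to end, not the network of the invention, and the diagonal handling of the distance term is simplified.

# Toy sketch of formula (14) and one SGD update; all modules here are stand-ins.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(512, 512)                     # stand-in for the relation branch, not DSVC-SGG
protos = torch.nn.Parameter(torch.randn(50, 512))     # stand-in predicate prototypes
optimizer = torch.optim.SGD(list(model.parameters()) + [protos], lr=1e-3)

features = torch.randn(8, 512)                        # stand-in refined subject-object joint features
target = torch.randint(0, 50, (8,))

refined = model(features)
logits = F.normalize(refined, dim=-1) @ F.normalize(protos, dim=-1).t() / 0.07
l_rp_cos = F.cross_entropy(logits, target)            # matching loss from step 4
d = torch.cdist(protos, protos).topk(3, dim=1, largest=False).values.mean()
l_p_euc = torch.clamp(7.0 - d, min=0.0)               # distance loss from step 5 (diagonal not masked)
loss = l_rp_cos + l_p_euc                             # formula (14): L = L_rp_cos + L_p_euc

optimizer.zero_grad()
loss.backward()
optimizer.step()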
The invention is tested under three subtasks: predicate classification (PredCls), scene graph classification (SGCls) and scene graph detection (SGDet).
In predicate classification, the input image contains the position information and class labels of all instances, and the goal of the model is to predict the relationship classes among the instances in the image. This task focuses on understanding the interactions between entities while ignoring the challenge of instance recognition.
The scene graph classification is similar to the predicate classification in that the input image contains location information for all instances, but does not contain instance category information. The model needs to identify the class of each instance first and then predict the relationships between the instances. This task is more challenging than predicate classification because the accuracy of the instance classification directly affects the effect of the relationship prediction.
Scene graph detection is the most challenging task, and the input image does not contain location information and category information for any instance. The model needs to identify the locations and categories of all instances in the image on its own and then predict the relationships between them. Since the model must identify instances from scratch, there may be deviations from the manually annotated data, which increase the difficulty of relational prediction, which in turn may affect the overall accuracy of the model.
To test the trained model output by the unbiased scene graph generation method based on a salient visual context, the test flow of the invention comprises the following steps:
S1: inputting a test image into the trained model, and entering S2;
s2: performing similarity matching between the subject-object joint features output by the model and the corresponding predicate prototypes, selecting the class label of the predicate prototype with the highest similarity as the relationship prediction result, and entering S3;
S3: comparing the relation prediction result with a real predicate class label, evaluating the result by using an evaluation index commonly used at present, and entering S4;
S4: and outputting a test result and ending.
The test images in S1 of the test flow are selected from the test set partitioned from the Visual Genome (VG) dataset, and the specific steps are as follows:
s1-1: when the predicate classification task is executed, the model makes full use of the position and class label information of all instances provided by the test set. Specifically, this means that the model focuses on predicting the relationships between instances given the exact position and class of each instance.
S1-2: when the scene graph classification task is executed, the model has no access to the class label information of the instances in the image. Instead, it needs to predict the class of each instance independently, and these predictions are then used as the basis for the subsequent prediction of the inter-instance relationships.
S1-3: when the scene graph detection task is executed, the model has no access to the position or class label information of the instances in the image. It first needs to identify and locate each instance in the image and then predict the classes of those instances. After these steps are completed, the model further predicts the relationships between the instances. This process requires the model to understand the image content from scratch and is therefore more complex and challenging than the first two tasks.
In S2 of the test flow, the invention uses cosine similarity to match the refined subject-object joint features with all predicate prototypes, and selects the class label R of the predicate prototype with the highest similarity as the relationship prediction result, as shown in formula (15), where i ∈ {1, 2, 3, ..., N} and s_i denotes the similarity between the refined subject-object joint features and the predicate prototype f_pi.
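A minimal sketch of this test-time matching (formula (15)) is given below; the function name and tensor shapes are illustrative assumptions.

# Sketch of test-time prediction: pick the predicate prototype with the highest cosine similarity.
import torch
import torch.nn.functional as F

def predict_relation(refined_joint, predicate_protos):
    """refined_joint: (B, d); predicate_protos: (N, d). Returns predicted class indices (B,)."""
    s = F.normalize(refined_joint, dim=-1) @ F.normalize(predicate_protos, dim=-1).t()  # s_i values
    return s.argmax(dim=-1)              # index of the most similar prototype per instance pair

preds = predict_relation(torch.randn(4, 512), torch.randn(50, 512))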
In S3 of the test flow, the relationship prediction results are compared with the ground-truth predicate class labels, and the widely accepted evaluation criteria in the current scene graph generation field, Recall@K (R@K) and mean Recall@K (mR@K), are adopted to accurately evaluate the test performance of the model. The larger the values of these two evaluation indices, the better the test results, indicating that the model is more robust when facing diverse data and complex scenes and that its performance is better. The Recall@K index measures how often the model correctly identifies the ground-truth predicate category within its top-K predicted relationships; specifically, the invention uses the R@20, R@50 and R@100 indices for evaluation. Mean Recall@K is the average of Recall@K over each predicate class and evaluates the consistency and balance of the model across a broad range of relationship classes; specifically, the invention uses the mR@20, mR@50 and mR@100 indices for evaluation.
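The two indices can be sketched in simplified form as follows; real scene graph evaluation additionally matches subject and object boxes by IoU, which is omitted here, so the snippet is an assumption-laden approximation rather than the official protocol.

# Simplified sketch of Recall@K and mean Recall@K over (subject, predicate, object) triplets.
# Triplets are plain tuples; box matching by IoU is deliberately omitted.
from collections import defaultdict

def recall_at_k(pred_triplets_per_image, gt_triplets_per_image, k):
    """Both arguments: lists (one entry per image) of triplet lists; predictions are ranked."""
    per_image = []
    for preds, gts in zip(pred_triplets_per_image, gt_triplets_per_image):
        if not gts:
            continue
        topk = set(preds[:k])
        per_image.append(sum(1 for t in gts if t in topk) / len(gts))
    return sum(per_image) / max(len(per_image), 1)

def mean_recall_at_k(pred_triplets_per_image, gt_triplets_per_image, k):
    """Average of per-predicate-class recall, reflecting balance across relationship classes."""
    hits, totals = defaultdict(int), defaultdict(int)
    for preds, gts in zip(pred_triplets_per_image, gt_triplets_per_image):
        topk = set(preds[:k])
        for subj, pred, obj in gts:
            totals[pred] += 1
            hits[pred] += int((subj, pred, obj) in topk)
    recalls = [hits[p] / totals[p] for p in totals]
    return sum(recalls) / max(len(recalls), 1)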
Example 1:
This embodiment uses the Visual Genome (VG) dataset to accomplish the task of unbiased scene graph generation based on a salient visual context. The dataset contains 108077 images, covering the 150 most common instance categories and 50 predicate categories. To divide the dataset reasonably, 70% of the images are assigned to the training set and the remaining 30% are used as the test set. For training and testing, Deepin 20.4 is chosen as the operating system, and the experiments are conducted in a torch 1.7.1+cu110 environment. The weight α is set to 0.1, the parameters k and γ are set to 3 and 7, respectively, and 50K training iterations are completed using the SGD optimizer, with the learning rate set to 1e-3 and the batch size kept at 8. All experiments are performed on hardware equipped with an NVIDIA GeForce RTX 4090 GPU to ensure the efficiency of the training and testing process.
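The hyper-parameter values above can be gathered into a configuration sketch such as the following; the dictionary layout and the placeholder model are assumptions for illustration only.

# Hypothetical configuration sketch restating the hyper-parameters reported in this example.
import torch

config = {
    "alpha": 0.1,         # weight for merging the position embedding (step 3-4)
    "k": 3,               # number of smallest prototype distances kept per row (step 5-2)
    "gamma": 7.0,         # distance margin (step 5-3)
    "lr": 1e-3,
    "batch_size": 8,
    "iterations": 50_000,
}

model = torch.nn.Linear(512, 512)     # placeholder for the DSVC-SGG network
optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])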
The invention adopts Faster R-CNN to accurately detect the instances and their position information in the image. As a key milestone in the field of target detection, the architecture of Faster R-CNN has reached a mature stage and provides valuable experience and a solid foundation for the many target detection models developed later. Owing to its excellent performance and flexibility, Faster R-CNN is widely applied in a variety of computer vision tasks, including but not limited to object detection, image segmentation and face detection. In addition, it is widely used to solve practical problems in key fields such as the vision processing of automatic driving systems, dynamic target tracking in safety monitoring and medical image analysis, exhibiting broad influence in modern technical applications.
The invention performs index evaluation on the three subtasks of predicate classification, scene graph classification and scene graph detection, adopting the evaluation indices commonly used in current scene graph generation; the larger these two indices are, the better the test results, indicating that the model is more robust when facing diverse data and complex scenes and that its performance is better. Recall@K measures the probability that the correct predicate label appears among the top-K relationship prediction results; specifically, the invention uses the R@20, R@50 and R@100 indices for evaluation. Mean Recall@K is the mean of Recall@K over each predicate class; specifically, the invention uses the mR@20, mR@50 and mR@100 indices for evaluation. The comparison between the method of the invention and existing schemes on the predicate classification task is shown in Table 1, the comparison on the scene graph classification task is shown in Table 2, and the comparison on the scene graph detection task is shown in Table 3. All three subtasks are tested on the VG dataset.
TABLE 1 comparison of the inventive method with the prior scheme on predicate classification task
TABLE 2 comparison of the method of the present invention with the prior art scheme on scene graph classification tasks
TABLE 3 comparison of the method of the present invention with the prior art scheme on scene graph detection tasks
According to the results shown in the tables, the proposed DSVC-SGG method shows performance improvements on all three subtasks of predicate classification, scene graph classification and scene graph detection when compared with PE-Net, an unbiased scene graph generation baseline model with excellent current performance. Specifically, in the predicate classification task, the DSVC-SGG method improves on PE-Net by about 2% on each of the three key evaluation indices R@20, R@50 and R@100. At the same time, an increase of about 1% is achieved on mR@20, mR@50 and mR@100.
The proposed model of the invention uses the following two indices: Recall@K (R@K) and mean Recall@K (mR@K).
To verify the beneficial effects of the invention, experiments were performed on the Visual Genome (VG) dataset, which contains the 150 most common instance classes and 50 predicate classes. The invention performs index evaluation on the three subtasks of predicate classification, scene graph classification and scene graph detection. In the predicate classification task, the final evaluation results of the method are: R@20 of 58.9%, R@50 of 66.5%, R@100 of 69.1%, mR@20 of 29.4%, mR@50 of 32.0%, and mR@100 of 34.5%. In the scene graph classification task, the final evaluation results are: R@20 of 36.3%, R@50 of 40.0%, R@100 of 40.9%, mR@20 of 16.6%, mR@50 of 18.0%, and mR@100 of 19.1%. In the scene graph detection task, the final evaluation results are: R@20 of 24.2%, R@50 of 31.7%, R@100 of 36.2%, mR@20 of 9.8%, mR@50 of 12.7%, and mR@100 of 14.8%.
Further, compared with the PE-Net scheme incorporating the existing language context module LCM, DSVC-SGG also achieves an increase of about 1% on each of the R@20, R@50 and R@100 indices on the predicate classification task. On both the scene graph classification and scene graph detection subtasks, DSVC-SGG shows performance improvements on both the Recall@K and mean Recall@K evaluation indices, whether compared with the PE-Net model or with the LCM+PE-Net model.
These results clearly demonstrate that incorporating salient visual context features provides the model with richer and more comprehensive context information, making it more effective at understanding complex scenes. Meanwhile, the salient context features allow the model to learn from richer visual information, helping it maintain stable performance when facing challenges such as occlusion, illumination changes and scale changes. The proposed DSVC-SGG method improves the ability of the model to handle visual diversity and understand complex scenes, thereby improving the robustness of the scene graph generation model in relationship prediction.
Example 2:
This embodiment will describe an applicable scenario of the present invention.
On a social media platform, users may share many pictures to the platform's gallery and may also retrieve other pictures from it. However, some pictures are of poor quality or are taken in complex environments, for example with insufficient light, cluttered backgrounds or object occlusion, and the platform cannot add accurate captions to such images. The method provided by the invention can solve this problem.
First, the system of the platform acquires the pictures and performs preprocessing, including image resizing and color correction, and adopts the Faster R-CNN model to extract the features of the pictures.
Then, the invention is used to generate an accurate and rich scene graph for each picture, and the scene graph is used for image caption generation, so that even when facing challenging conditions such as poor image quality, complex scenes or occlusion, the model can generate reasonable and relevant captions for the picture according to the salient visual context features.
The generated captions are then presented on the social media platform together with the original pictures. The captions can accurately describe the scenes and details in the pictures, providing richer information for users.
Finally, users can perform fuzzy search by entering text from the captions, so that pictures of interest can be found quickly. With the method provided by the invention, users can search for and discover pictures on social media more easily, improving the user experience and the accessibility of social media content.
While the foregoing is directed to embodiments of the present invention, other and further details of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (10)

1. An unbiased scene graph generation method based on a salient visual context, characterized by comprising the steps of:
Step 1: inputting training images in the data set into a target detector to generate an instance of a relationship to be predicted and corresponding context information;
Step 2: modeling the host-guest instance and the relation predicate by using a prototype-based representation to obtain corresponding host, guest instance prototypes and predicate prototypes, and carrying out feature fusion on the formed host and guest instance prototypes to obtain host-guest joint features;
step 3: encoding the visual features of the context using DualToken-ViT to form salient visual context features;
Step 4: fusing the host-guest combined features obtained in the step 2 with the obvious visual context features obtained in the step 3 to obtain refined host-guest combined features, and matching the refined host-guest combined features with the predicate prototypes obtained in the step 2 by adopting cosine similarity constraint to obtain matching loss;
step 5: enlarging the differences between predicate prototypes through a Euclidean distance constraint to obtain the distance loss;
step 6: summing the matching loss obtained in the step 4 and the distance loss obtained in the step 5 to obtain total loss, and performing training for generating a scene graph;
Step 7: if the set batch size is reached, returning to step 1; if all the training pictures have been read, entering step 8;
step 8: and outputting the trained model, and ending.
2. The unbiased scene graph generation method based on a salient visual context as claimed in claim 1, wherein in step 2, the steps of prototype-based modeling of the instances and subject-object prototype fusion are as follows:
Step 2-1: prototype-based modeling is performed on the subject instance, the object instance and the predicate to obtain the subject prototype f_s, the object prototype f_o and the predicate prototype f_p; the specific modeling process is shown in formula (1), where c_s, c_o, c_p respectively denote the word embeddings of the subject, object and predicate class labels, and w_s, w_o, w_p respectively denote the learnable weight parameters of the subject, object and predicate, which are learned during training;
f_s = w_s^T c_s, f_o = w_o^T c_o, f_p = w_p^T c_p (1);
Step 2-2: the subject prototype f_s and the object prototype f_o are fused using the feature fusion function shown in formula (2) to obtain the subject-object joint feature F(f_s, f_o), where RELU(·) is the activation function, which effectively alleviates the overfitting problem during model training;
F(f_s, f_o) = RELU(f_s + f_o) - (f_s - f_o)^2 (2).
3. The unbiased scene graph generation method based on a salient visual context as claimed in claim 2, wherein the step of encoding the visual context in step 3 is as follows:
Step 3-1: a convolutional encoder is used to extract the local features V_c^local of the visual context features V_c, as shown in formula (3), where DW is depth-wise convolution, LN is layer normalization, PW is point-wise convolution, and GELU is the activation function;
V_c^local = V_c + PW(GELU(PW(LN(DW(V_c))))) (3);
Step 3-2: a position-aware token module is used to extract the global features V_c^global of the visual context features V_c; the local features V_c^local from step 3-1 are down-sampled step by step, which reduces the information loss during down-sampling and retains more useful information, as shown in formula (4), where V_c^ds denotes the result of the step-wise down-sampling, DS denotes two-fold down-sampling using average pooling, and φ denotes that if the feature size does not match the expected size, multiple convolution and down-sampling operations are performed, each represented by DS(Conv(·));
Step 3-3: multi-head self-attention is used to perform global aggregation on V_c^ds, as shown in formula (5), where V_c^ga is the result of the global aggregation and contains the global information, Q_ds, K_ds and V_ds are generated from V_c^ds by linear projection, and MSA denotes the multi-head self-attention mechanism;
Step 3-4: the position embedding P_c and the aggregated context information V_c^ga are merged by weighted summation to enrich the global information, as shown in formula (6), where MLP is a multi-layer perceptron applied to the result of the weighted summation and α ∈ [0,1] is a preset weight;
Step 3-5: the enriched global information is broadcast to the visual context features V_c through multi-head self-attention to generate the global features V_c^global, as shown in formula (7), where Q_c is generated from V_c by linear projection, and K_g′ and V_g′ are generated from the result of the weighted summation by linear projection;
Step 3-6: the local features V_c^local and the global features V_c^global are concatenated along the y-axis and input into the feed-forward network FN and the two-dimensional attention network BDA to obtain the final salient visual context features, as shown in formula (8);
4. The unbiased scene graph generation method based on a salient visual context as claimed in claim 3, wherein step 4 comprises the following steps:
Step 4-1: the subject-object joint features F(f_s, f_o) and the salient visual context features are concatenated along the y-axis and input into a fully connected layer to obtain the refined subject-object joint features, as shown in formula (9), where FC(·) is the fully connected layer;
Step 4-2: a cosine similarity constraint is used to match the refined subject-object joint features with the predicate prototype f_p to obtain the matching loss L_rp_cos, as shown in formula (10), where the refined joint features and the predicate prototype are first unit-normalized, t is the index of the ground-truth class, τ is a learnable temperature hyper-parameter, and N is the number of predicate classes; the smaller L_rp_cos is, the closer the refined subject-object joint features are to f_p, that is, the closer the subject-object instance pair is to the predicate within that relationship predicate class;
5. The unbiased scene graph generation method based on a salient visual context as claimed in claim 4, wherein the Euclidean distance constraint in step 5 comprises the following steps:
Step 5-1: to enlarge the distinction between any two different predicate prototypes f_pi and f_pj, the Euclidean distance between f_pi and f_pj is first computed under the Euclidean distance constraint to obtain the distance matrix D, where D_ij denotes the Euclidean distance between predicate prototypes f_pi and f_pj and is determined by formula (11), i, j ∈ {1, 2, 3, ..., N}, N is the number of predicate classes, and |·| is the absolute-value operation;
Step 5-2: the elements of each row of the distance matrix D are sorted in increasing order, and the first k smallest values of each row are summed and averaged to obtain d, as shown in formula (12);
Step 5-3: the distance loss L_p_euc is obtained by formula (13), where γ is another hyper-parameter used to adjust the distance margin; the smaller L_p_euc is, the greater the distinction between any two different predicate prototypes f_pi and f_pj;
L_p_euc = max(0, -d + γ) (13).
6. The unbiased scene graph generation method based on a salient visual context as recited in claim 5, wherein the final constraint L obtained in step 6 combines the matching loss L_rp_cos and the distance loss L_p_euc, as shown in formula (14):
L = L_rp_cos + L_p_euc (14).
7. A test method for testing the trained model output by the unbiased scene graph generation method based on a salient visual context as claimed in any one of claims 1 to 6, comprising the following steps:
s1: inputting a test image into the trained model;
s2: performing similarity matching between the relationship representation output by the model and the corresponding predicate prototypes, and selecting the class label of the most similar predicate prototype as the relationship prediction result;
S3: comparing the relation prediction result with a real predicate class label, and evaluating the result by using a common evaluation index;
S4: and outputting a test result and ending.
8. The test method of claim 7, wherein the test images in S1 are selected from the test set partitioned from the Visual Genome dataset.
9. The test method according to claim 7, wherein in S2, cosine similarity is used to match the refined subject-object joint features with all predicate prototypes, and the class label R of the predicate prototype with the highest similarity is selected as the relationship prediction result, as shown in formula (15), where i ∈ {1, 2, 3, ..., N} and s_i denotes the similarity between the refined subject-object joint features and the predicate prototype f_pi;
10. The test method according to claim 7, wherein in S3 the relationship prediction results are compared with the ground-truth predicate class labels, and the widely accepted evaluation criteria in the current scene graph generation field, namely Recall@K (R@K) and mean Recall@K (mR@K), are adopted to accurately evaluate the test results of the model; the larger the values of these two evaluation indices, the better the test results, indicating higher robustness and better performance of the model.
CN202410348938.6A 2024-03-26 2024-03-26 Unbiased scene graph generation method based on salient visual context Pending CN118298428A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410348938.6A CN118298428A (en) 2024-03-26 2024-03-26 Unbiased scene graph generation method based on salient visual context


Publications (1)

Publication Number Publication Date
CN118298428A 2024-07-05

Family

ID=91685292


Country Status (1)

Country Link
CN (1) CN118298428A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination