CN113240033B - Visual relation detection method and device based on scene graph high-order semantic structure - Google Patents

Visual relation detection method and device based on scene graph high-order semantic structure

Info

Publication number
CN113240033B
Authority
CN
China
Prior art keywords
visual
objects
semantic
scene graph
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110573757.XA
Other languages
Chinese (zh)
Other versions
CN113240033A (en)
Inventor
袁春 (Yuan Chun)
魏萌 (Wei Meng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University
Priority to CN202110573757.XA
Publication of CN113240033A
Application granted
Publication of CN113240033B


Classifications

    • G06F18/23 (Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Clustering techniques)
    • G06F18/214 (Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting)
    • G06F18/24 (Pattern recognition; Analysing; Classification techniques)
    • G06N3/045 (Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks)
    • G06N3/08 (Computing arrangements based on biological models; Neural networks; Learning methods)
    • G06V20/00 (Image or video recognition or understanding; Scenes; Scene-specific elements)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a visual relation detection method and device based on a scene graph high-order semantic structure. The method comprises: predicting the categories and positions of all objects in a picture and outputting a visual feature vector corresponding to each object; pairing the detected objects pairwise, extracting a joint visual feature vector based on the pairing results, and encoding the positions to obtain position codes; inputting the categories of all objects into a hierarchical semantic clustering algorithm and processing to obtain a high-level semantic feature vector corresponding to each object; semantically encoding the output of the hierarchical semantic clustering algorithm; generating relation classifier weights; and combining the visual feature vector, the joint visual feature vector and the position code into a unified feature vector, then applying a dot-product operation between the unified feature vector and the relation classifier weights, finally obtaining the conditional probability of the relation between every two objects as a scene graph.

Description

Visual relation detection method and device based on scene graph high-order semantic structure
Technical Field
The invention relates to the field of image processing, in particular to a visual relation detection method and device based on a scene graph high-order semantic structure.
Background
The main goal of the visual relationship detection task is to recognize and localize the visual triplet relationships (subject, relationship, object) present in an image. Recognition refers to identifying the category of a target object, and localization refers to regressing its bounding box. Understanding a visual scene usually requires more than recognizing individual objects: even a perfect object detector cannot perceive the nuance between a person feeding a horse and a person standing beside one. Learning the rich semantic relationships among objects is the purpose of visual relationship detection. The key to a deeper understanding of visual scenes is to build, on top of object recognition, a structured representation of the scene that captures the objects and their semantic relationships. Such a representation not only provides contextual information for the underlying recognition task, but is also of great value for a variety of high-level vision tasks. This structured representation is called a scene graph. A scene graph explicitly models objects and their relationships: in brief, it is a visually grounded graph over the objects of an image, whose edges depict their pairwise relationships.
Visual relationship detection is a core low-level algorithm in many fields and has wide application. In image retrieval, for example, it allows a retrieval algorithm to better understand the relationship between the input text and the images, improving retrieval results; in autonomous driving, it can provide the vehicle with the structure of the current scene, helping it drive safely.
Detecting visual relationships in images is difficult, mainly for two reasons: (1) it is hard to obtain labels of the correct kind and quantity, together with complete triplet annotations; and (2) relationships vary greatly in visual appearance and linguistic description. First, object bounding-box-level labels are hard to obtain. Detecting visual relationships in an image requires localizing the subject and object of an interaction by determining bounding boxes around the corresponding visual entities. For a fully supervised model, the ideal training data therefore consists of images with box-level annotation of the visual relationships, i.e., bounding boxes drawn around the objects, with each pair of interacting objects labeled with a descriptive triplet. Obtaining such annotations is very expensive.
Another reason annotations are hard to obtain is the combinatorial explosion of triplets inherent in visual relationships. For a vocabulary of N different object classes and K different relationships, the number of possible triplets is N × N × K; for example, with N = 100 and K = 100 there are one million possible triplets. Since most of these triplets are rare or unseen in the real world, the training data always exhibits a long-tailed distribution: annotations concentrate on a few relationships, while most triplets in the vocabulary have little or no training data. This long tail is not caused by labeling quality; data collected under natural conditions usually shows the same long-tailed distribution. An algorithm trained on such a data set tends to overfit, focusing on the head categories that dominate the data while neglecting the tail categories. Applying visual relationship detectors to a large number of triplets is therefore a significant challenge. From an industrial standpoint, however, research on algorithms for long-tailed data can greatly reduce labeling cost and improve the efficiency of data collection.
Early visual relationship recognition methods learned each relationship triplet as a whole, i.e., they trained an independent detector for every triplet category. This approach, however, only suits small data sets with few relationship triplets and abundant labels per triplet; in that setting, visual relationship detection closely resembles object detection, except that the detection target changes from a single object to two objects and the relationship between them. As data sets grew, visual relationship data sets were no longer defined over preset triplets but over an open vocabulary, so the number of possible triplets became very large and most triplets lacked sufficient labeled data. Training a detector per triplet then becomes infeasible, which motivated the development of compositional models: instead of detecting each visual relationship triplet individually, the detection targets become simpler visual elements that can be shared among multiple triplets. This change of perspective is inspired by the structure of natural language, in which visual relationships are expressed as the components of triplets, and each component can be observed independently or as part of different visual interactions.
Disclosure of Invention
The invention provides a visual relation detection method and device based on a scene graph high-order semantic structure, aiming to address the technical problem that labels of the correct kind and quantity, together with complete triplet annotations, are difficult to obtain.
To this end, the visual relationship detection method based on scene graph high-order semantics provided by the invention specifically comprises the following steps:
S1, predicting the categories and positions of all objects in the picture, outputting a visual feature vector corresponding to each object, pairing the detected objects pairwise, extracting a joint visual feature vector based on the pairing results, and encoding the positions to obtain position codes;
S2, inputting the categories of all the objects into a hierarchical semantic clustering algorithm and processing to obtain a high-level semantic feature vector corresponding to each object;
S3, semantically encoding the output of the hierarchical semantic clustering algorithm;
S4, generating the relation classifier weights;
and S5, combining the visual feature vector, the joint visual feature vector and the position code into a unified feature vector, and applying a dot-product operation between the unified feature vector and the relation classifier weights, finally obtaining the conditional probability of the relation between every two objects as the scene graph.
Further, the category of an object is a number, obtained by sequentially encoding the objects that may appear in the input data.
Further, the position of an object is a bounding box, determined by two points, namely the upper left corner and the lower right corner of the box, each point comprising abscissa and ordinate values.
Further, encoding the positions specifically comprises, for each object pair, combining the difference between the center points of the two objects' detection boxes and their respective heights and widths into one feature vector, and processing that feature vector with a learnable linear layer to obtain the final position code.
Further, the specific procedure of the hierarchical semantic clustering algorithm comprises converting the category names appearing in all data sets into word vectors through a word-embedding model and then performing hierarchical clustering on the word vectors; after several iterations, a coarse-category word vector corresponding to each category word vector is obtained.
Further, the semantic coding specifically comprises dynamically re-encoding the input high-level semantic features through a latent-space encoding layer; in this re-encoding, all detected objects in a picture are defined as the nodes of a fully connected graph, the co-occurrence probabilities of the objects counted over the whole data set serve as the edges between the nodes, and the context-aware high-level semantic features of all objects in the input image are obtained through multi-layer graph convolutional network processing.
Further, generating the relation classifier weights specifically comprises, for each object pairing, selecting the context-aware high-level semantic features corresponding to the categories of the two objects it contains, and merging the two context-aware features into one weight, finally obtaining the relation classifier weights.
The visual relationship detection device based on scene graph high-order semantics provided by the invention specifically comprises a visual feature extraction module, a hierarchical semantic clustering module, a semantic coding module, a weight generation module and a scene graph generation module;
the visual feature extraction module predicts the categories and positions of all objects in the picture through a convolutional neural network, outputs a visual feature vector corresponding to each object, pairs the detected objects pairwise, extracts a joint visual feature vector through the convolutional neural network based on the pairing results, and encodes the positions to obtain position codes;
the hierarchical semantic clustering module replaces the categories of all objects in the input picture with the semantic vectors obtained by a hierarchical semantic clustering algorithm, yielding a high-level semantic feature vector corresponding to each object;
the semantic coding module comprises a latent-space encoding layer and a graph convolutional network; taking the output of the hierarchical semantic clustering algorithm as input, it dynamically re-encodes the input high-level semantic features through the latent-space encoding layer and obtains, through multi-layer graph convolutional network processing, the context-aware high-level semantic features of all objects in the input image;
the weight generation module, for each object pairing, selects the context-aware high-level semantic features corresponding to the categories of the two objects it contains and merges the two context-aware features into one weight, finally obtaining the relation classifier weights;
the scene graph generation module combines the visual feature vector, the joint visual feature vector and the position code into a unified feature vector and applies a dot-product operation between the unified feature vector and the relation classifier weights, finally obtaining the conditional probability of the relation between every two objects as the scene graph.
Further, the category of an object is a number, obtained by sequentially encoding the objects that may appear in the input data;
the position of an object is a bounding box, determined by two points, namely the upper left corner and the lower right corner of the box, each point comprising abscissa and ordinate values.
Further, encoding the positions specifically comprises, for each object pair, combining the difference between the center points of the two objects' detection boxes and their respective heights and widths into one feature vector, and processing that feature vector with a learnable linear layer to obtain the final position code.
The computer-readable storage medium provided by the invention stores a program executable by a processor; when executed by the processor, the program implements the above visual relationship detection method based on scene graph high-order semantics.
Compared with the prior art, the invention has the following beneficial effects:
a hierarchical clustering algorithm and semantic coding are added to the processing pipeline, so that the classification of relationships between objects in a picture fully considers semantic relationships that generalize.
In some embodiments of the invention, the following advantages are also provided:
1) compared with the prior art, the position-encoding processing in the invention is simpler and more direct, greatly reducing computational complexity and lowering hardware requirements;
2) the invention processes objects and relationships with a graph structure matched to their characteristics and innovatively applies a graph convolutional network, making the method general, more flexible, and better suited to different usage scenarios.
Drawings
Fig. 1 is a flow chart of a visual relationship detection method.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
The visual relationship detection method based on the scene graph high-order semantics provided by the embodiment of the invention specifically comprises the following steps:
s1, visual feature extraction, predicting the category and position of all objects in the picture through a convolutional neural network CNN and a regional convolutional neural network RCNN, wherein the category of the object is a number, generally obtained by sequentially coding the objects possibly appearing in input data, the position of the object is a frame, the frame is determined by two points, namely the upper left corner and the lower right corner of the frame, each point comprises the numerical values of a horizontal coordinate and a vertical coordinate, meanwhile, a visual feature vector corresponding to each object, namely a feature vector of the position corresponding to each object frame on a feature map output by the convolutional neural network, then, pairing every two detected objects once, wherein in each pairing, the detection frames of the corresponding objects are combined into a large frame, extracting the combined visual feature vector through the convolutional neural network, except the visual feature, considering that the relative position of the object is also an important consideration point for classifying the relationship, the position is also encoded, and the above-mentioned object pairing combines the difference value of the central points of the detection frames corresponding to the two objects and the respective height and width into a feature vector, and processes the feature vector by using a learnable linear layer to obtain the final position code.
S2, hierarchical semantic clustering. The categories of all objects in the picture are input to a hierarchical semantic clustering algorithm, and the category of each object is replaced by the semantic vector it produces, yielding a high-level semantic feature vector corresponding to each object. The algorithm is trained from the category names appearing in all training data together with an existing word-embedding model. Concretely, every category name appearing in the data sets is converted into a word vector by a word-embedding model (such as BERT), and hierarchical clustering is then performed on these word vectors; after several iterations, a coarse-category word vector corresponding to each category word vector is obtained. For example, after repeated clustering, the category "dog" may merge with words such as "cat" and "rabbit" to form an "animal" word vector. The advantage of hierarchical semantic clustering is that higher-level semantic features can be obtained efficiently and then used to process the lower-level ones, giving the embodiment strong generality when the amount of data is small.
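A minimal sketch of one clustering level follows, using scikit-learn's agglomerative clustering over placeholder word vectors (random stand-ins for BERT embeddings). Representing each coarse category by the mean of its members' vectors is an illustrative reading; the embodiment fixes neither the clustering library nor the merge rule.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Placeholder word vectors for category names; in practice these would come
# from a word-embedding model such as BERT.
categories = ["dog", "cat", "rabbit", "car", "truck", "bus"]
rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(len(categories), 768))

# One level of the hierarchy: group fine categories into coarse clusters.
labels = AgglomerativeClustering(n_clusters=2).fit(word_vecs).labels_

# Represent each coarse category by the mean of its members' word vectors,
# then map every fine category to its coarse-category vector.
coarse_vecs = {c: word_vecs[labels == c].mean(axis=0) for c in set(labels)}
high_level = {name: coarse_vecs[c] for name, c in zip(categories, labels)}
```

Iterating this step on the coarse vectors themselves would yield the deeper levels of the hierarchy, e.g. merging "dog", "cat" and "rabbit" toward an "animal" vector.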
S3, semantic coding of the output of the hierarchical semantic clustering algorithm. Specifically, the input high-level semantic features are dynamically re-encoded by a latent-space encoding layer. In this re-encoding, all detected objects in the picture are defined as the nodes of a fully connected graph, and the probability that each pair of objects co-occurs, counted over the whole data set, serves as the edge between the corresponding nodes, so that the input high-level semantic features can be processed directly by a graph convolutional neural network. In software engineering, an adjacency matrix is commonly used as the data structure defining a graph, each entry representing the value of the edge between two nodes; accordingly, the adjacency matrix of this graph is defined as the probability that every two object categories appear together in the same picture across the data set. After several graph convolutional layers, the context-aware high-level semantic features of all objects in the input image are obtained.
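The following sketch illustrates the graph convolution over the co-occurrence graph. The two-layer Kipf-and-Welling-style structure and the row normalization of the adjacency matrix are common conventions assumed here, not details fixed by the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoOccurrenceGCN(nn.Module):
    """Sketch of step S3: graph convolutions over the object co-occurrence
    graph, producing context-aware high-level semantic features."""
    def __init__(self, in_dim: int, hid_dim: int, out_dim: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim)
        self.w2 = nn.Linear(hid_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (num_objects, in_dim) high-level semantic features from S2.
        # adj: (num_objects, num_objects) co-occurrence probabilities.
        adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-6)  # row-normalize
        h = F.relu(self.w1(adj @ x))  # propagate over co-occurrence edges
        return self.w2(adj @ h)       # context-aware semantic features
```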
S4, generating the relation classifier weights. For each object pairing, the context-aware high-level semantic features are selected according to the categories of the two objects it contains; the two selected features are merged to form one weight, and repeating this operation for all pairings yields the relation classifier weights.
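One way to realize this step is sketched below. The embodiment states only that the two context-aware features are merged into a weight; the concatenation followed by a learned projection to one weight vector per relation class is an assumption made so that the weights match the unified feature of step S5.

```python
import torch
import torch.nn as nn

class RelationWeightGenerator(nn.Module):
    """Sketch of step S4: builds per-pair relation classifier weights from
    the context-aware semantics of the two paired object categories."""
    def __init__(self, sem_dim: int, feat_dim: int, num_relations: int):
        super().__init__()
        # Assumed merge rule: concatenate, then project to R weight vectors.
        self.proj = nn.Linear(2 * sem_dim, num_relations * feat_dim)
        self.num_relations = num_relations
        self.feat_dim = feat_dim

    def forward(self, ctx: torch.Tensor, subj_idx: torch.Tensor,
                obj_idx: torch.Tensor) -> torch.Tensor:
        # ctx: (num_objects, sem_dim) context-aware features from step S3;
        # subj_idx, obj_idx: (P,) object indices for each pairing.
        pair_sem = torch.cat([ctx[subj_idx], ctx[obj_idx]], dim=1)  # (P, 2*sem_dim)
        w = self.proj(pair_sem)                                     # (P, R*feat_dim)
        return w.view(-1, self.num_relations, self.feat_dim)        # (P, R, feat_dim)
```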
And S5, scene graph generation. The visual feature vector, the joint visual feature vector and the position code obtained in step S1 are combined into one unified feature vector, and a dot-product operation is applied between the unified feature vector and the relation classifier weights, finally yielding the conditional probability of the relation between every two objects as the scene graph.
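Assembled end to end, step S5 can be sketched as below. Concatenating the subject, object and joint visual features with the position code, and normalizing the dot-product scores with a softmax to obtain conditional probabilities, are assumptions consistent with, but not spelled out by, the embodiment.

```python
import torch
import torch.nn.functional as F

def scene_graph_probs(vis_s, vis_o, vis_joint, pos_code, rel_weights):
    """Sketch of step S5: dot the unified feature vector of each object pair
    with its relation classifier weights from step S4.

    vis_s, vis_o: (P, D) visual features of the subject/object in each pair.
    vis_joint:    (P, D) joint visual features of the merged boxes.
    pos_code:     (P, D) position codes from step S1.
    rel_weights:  (P, R, 4*D) per-pair classifier weights (feat_dim = 4*D).
    Returns (P, R) conditional relation probabilities per object pair.
    """
    unified = torch.cat([vis_s, vis_o, vis_joint, pos_code], dim=1)  # (P, 4*D)
    scores = torch.einsum("pd,prd->pr", unified, rel_weights)        # dot products
    return F.softmax(scores, dim=1)                                  # probabilities
```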
The visual relation detection device based on the scene graph high-order semantics comprises a visual feature extraction module, a hierarchical semantic clustering module, a semantic coding module, a weight generation module and a scene graph generation module.
The output of the visual feature extraction module comprises the visual feature vectors, the joint visual feature vectors and the position codes. Its main body is a general object detection algorithm that uses a convolutional neural network to predict the category and position of each object: the category of an object is a number, generally obtained by sequentially encoding the objects that may appear in the input data, and the position of an object is a bounding box, determined by two points, namely the upper left corner and the lower right corner of the box, each point comprising abscissa and ordinate values. The visual feature vector is the feature vector at the position corresponding to each object's box on the feature map output by the convolutional neural network. Within the module, the detected objects are paired pairwise; in each pairing, the detection boxes of the two objects are merged into one large box, and the joint visual feature vector is extracted from it by the convolutional neural network. For each object pair, the difference between the center points of the two objects' detection boxes and their respective heights and widths are combined into one feature vector, which is processed by a learnable linear layer to obtain the final position code.
The hierarchical semantic clustering module is based on a hierarchical semantic clustering algorithm; it replaces the categories of all objects in the input picture with the semantic vectors the algorithm produces, obtaining a high-level semantic feature vector corresponding to each object.
The semantic coding module comprises a latent-space encoding layer and a graph convolutional network. Taking the output of the hierarchical semantic clustering algorithm as input, it dynamically re-encodes the input high-level semantic features through the latent-space encoding layer; after several graph convolutional layers, the context-aware high-level semantic features of all objects in the input image are obtained.
The weight generation module, for each object pairing, selects the context-aware high-level semantic features corresponding to the categories of the two objects it contains, merges the two context-aware features into one weight, and repeats this operation to obtain the relation classifier weights.
The scene graph generation module combines the visual feature vector, the joint visual feature vector and the position code into a unified feature vector and applies a dot-product operation between the unified feature vector and the relation classifier weights, finally obtaining the conditional probability of the relation between every two objects as the scene graph.
For improving the effect of a visual relationship detector, obtaining a better visual relationship feature representation is an important aspect. The visual features employed by embodiments of the invention are structured feature combinations derived from the triplets, i.e., a visual relationship is expressed as a combination of subject and object features, naturally decomposing it into two shareable components. To achieve a better detection effect, however, a finer-grained decomposition could be explored to make the expression of the visual relationship more interpretable and less ambiguous; for example, in human-object interaction detection, expressing the relationship between a person and an object through the interactions between each body part and the object is more meaningful than considering only the relationship between the person as a whole and the object.
Embodiments of the invention learn relationships using only visual features and 2D spatial position features; other features could also be explored. For example, depth information can help resolve ambiguities that 2D spatial position cannot, since different relationships may share the same 2D spatial layout, and segmentation results can complement object bounding boxes, resolving ambiguities caused by occlusion.
The model provided by the embodiments adopts a two-stage framework: a pre-trained object detector first extracts candidate boxes, and relationship detection is then trained on the object pairs formed from them. Given the computational complexity, not all possible object combinations can be trained, which lowers the recall rate; according to actual application requirements, a strategy for selecting the object pairs most likely to exhibit a relationship can be designed, and a better detector can be designed to replace the exhaustive pairing strategy.
Embodiments of the invention provide a visual relationship detection method and device based on scene graph high-order semantics. By training a structured relation classification space, the learning of relationships becomes context-dependent, and relationships occurring in completely different contexts are no longer forced into the same class, which greatly alleviates the problem of excessive intra-class visual variation. At the same time, the structural statistics of relation triplets are introduced: through context vectorization and a shared mapping function, learning is shared among different relation triplets, alleviating the problems caused by data sparsity. The high-level semantic structure also makes the model generalize better, greatly improving behavior under the long-tailed data distribution. Embodiments of the invention can thus handle visual relationship detection well when the amount of data is small.
The embodiment of the invention has the following beneficial effects:
1) for the problem of intra-class diversity, a context-dependent structured relation classification space is established according to the structure of the global scene graph; by vectorizing the context and sharing a mapping function, otherwise independent relation spaces are connected, exploiting structural statistics, while the co-occurrence statistics of objects in the scene data set are applied as prior knowledge to the learning of visual relationships by means of a graph neural network;
2) for the problem of the long-tailed distribution of relation data, a semantic hierarchical clustering algorithm based on a scene knowledge graph is designed to extract the high-level semantic structure of visual relationships, and context vectors are merged and learned according to the clustering results, reducing the problem scale and improving the generalization of the model while alleviating the influence of the long-tailed distribution on model learning.
The above disclosure is only for the purpose of illustrating the preferred embodiments of the present invention, and should not be taken as limiting the scope of the present invention. It should be noted that those skilled in the art should recognize that they may make equivalent variations to the embodiments of the present invention without departing from the spirit and scope of the present invention.

Claims (9)

1. A visual relation detection method based on scene graph high-order semantics is characterized by comprising the following steps:
S1, predicting the categories and positions of all objects in the picture, outputting a visual feature vector corresponding to each object, pairing the detected objects pairwise, extracting a joint visual feature vector based on the pairing results, and encoding the positions to obtain position codes;
S2, inputting the categories of all the objects into a hierarchical semantic clustering algorithm and processing to obtain a high-level semantic feature vector corresponding to each object;
S3, semantically encoding the output of the hierarchical semantic clustering algorithm;
S4, generating the relation classifier weights;
and S5, combining the visual feature vector, the joint visual feature vector and the position code into a unified feature vector, and applying a dot-product operation between the unified feature vector and the relation classifier weights, finally obtaining the conditional probability of the relation between every two objects as the scene graph.
2. The visual relationship detection method based on scene graph high-order semantics of claim 1, wherein the category of an object is a number, obtained by sequentially encoding the objects that may appear in the input data.
3. The visual relationship detection method based on scene graph high-order semantics of claim 1, wherein the position of an object is a bounding box, determined by two points, namely the upper left corner and the lower right corner of the box, each point comprising abscissa and ordinate values.
4. The visual relationship detection method based on scene graph high-order semantics of claim 1, wherein encoding the positions specifically comprises, for each object pair, combining the difference between the center points of the two objects' detection boxes and their respective heights and widths into one feature vector, and processing that feature vector with a learnable linear layer to obtain the final position code.
5. The visual relationship detection method based on scene graph high-order semantics of claim 1, wherein the specific procedure of the hierarchical semantic clustering algorithm comprises converting the category names appearing in all data sets into word vectors through a word-embedding model and then performing hierarchical clustering on the word vectors; after several iterations, a coarse-category word vector corresponding to each category word vector is obtained.
6. The visual relationship detection method based on scene graph high-order semantics of claim 1, wherein the semantic coding specifically comprises dynamically re-encoding the input high-level semantic features through a latent-space encoding layer; in this re-encoding, all detected objects in a picture are defined as the nodes of a fully connected graph, the co-occurrence probabilities of the objects counted over the whole data set serve as the edges between the nodes, and the context-aware high-level semantic features of all objects in the input image are obtained through multi-layer graph convolutional network processing.
7. The visual relationship detection method based on scene graph high-order semantics of claim 6, wherein generating the relation classifier weights specifically comprises, for each object pairing, selecting the context-aware high-level semantic features corresponding to the categories of the two objects it contains, and merging the two context-aware features into one weight, finally obtaining the relation classifier weights.
8. A visual relationship detection device based on scene graph high-order semantics, comprising a processor and a memory, the memory having stored therein a computer program executable by the processor to implement the method of any one of claims 1-7.
9. A computer-readable storage medium storing a program executable by a processor, the program, when executed by the processor, implementing the visual relationship detection method based on scene graph high-order semantics of any one of claims 1-7.
CN202110573757.XA 2021-05-25 2021-05-25 Visual relation detection method and device based on scene graph high-order semantic structure Active CN113240033B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110573757.XA CN113240033B (en) 2021-05-25 2021-05-25 Visual relation detection method and device based on scene graph high-order semantic structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110573757.XA CN113240033B (en) 2021-05-25 2021-05-25 Visual relation detection method and device based on scene graph high-order semantic structure

Publications (2)

Publication Number Publication Date
CN113240033A (en) 2021-08-10
CN113240033B (en) 2022-06-28

Family

ID=77138840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110573757.XA Active CN113240033B (en) 2021-05-25 2021-05-25 Visual relation detection method and device based on scene graph high-order semantic structure

Country Status (1)

Country Link
CN (1) CN113240033B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554129B (en) * 2021-09-22 2021-12-10 航天宏康智能科技(北京)有限公司 (Aerospace Hongkang Intelligent Technology (Beijing) Co., Ltd.) Scene graph generation method and generation device
US20230131935A1 (en) * 2021-10-21 2023-04-27 The Toronto-Dominion Bank Co-learning object and relationship detection with density aware loss


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8494259B2 (en) * 2009-12-28 2013-07-23 Teledyne Scientific & Imaging, Llc Biologically-inspired metadata extraction (BIME) of visual data using a multi-level universal scene descriptor (USD)
US10452923B2 (en) * 2017-11-28 2019-10-22 Visual Semantics, Inc. Method and apparatus for integration of detected object identifiers and semantic scene graph networks for captured visual scene behavior estimation
US11138380B2 (en) * 2019-06-11 2021-10-05 International Business Machines Corporation Identifying semantic relationships using visual recognition

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9928448B1 (en) * 2016-09-23 2018-03-27 International Business Machines Corporation Image classification utilizing semantic relationships in a classification hierarchy
CN109146786A (en) * 2018-08-07 2019-01-04 北京市商汤科技开发有限公司 Scene chart generation method and device, electronic equipment and storage medium
CN111462282A (en) * 2020-04-02 2020-07-28 哈尔滨工程大学 Scene graph generation method
CN111626291A (en) * 2020-04-07 2020-09-04 上海交通大学 Image visual relationship detection method, system and terminal
CN111612070A (en) * 2020-05-13 2020-09-01 清华大学 Image description generation method and device based on scene graph
CN111723814A (en) * 2020-06-05 2020-09-29 中国科学院自动化研究所 Cross-image association based weak supervision image semantic segmentation method, system and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Unbiased Scene Graph Generation From Biased Training; K. Tang et al.; 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020-08-05; pp. 3713-3722 *
Feature extraction, simplification and tracking for visualization (面向可视化的特征提取、简化与追踪); Peng Yi (彭艺); Information Science and Technology Series (《信息科技辑》); 2017-03-15 (No. 3); pp. I138-40 *

Also Published As

Publication number Publication date
CN113240033A (en) 2021-08-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant