CN113240033A

CN113240033A - Visual relation detection method and device based on scene graph high-order semantic structure

Info

Publication number: CN113240033A
Application number: CN202110573757.XA
Authority: CN
Inventors: 袁春; 魏萌
Original assignee: Shenzhen International Graduate School of Tsinghua University
Current assignee: Shenzhen International Graduate School of Tsinghua University
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2021-08-10
Anticipated expiration: 2041-05-25
Also published as: CN113240033B

Abstract

The invention provides a visual relation detection method and a device based on a scene graph high-order semantic structure.A method comprises the steps of predicting the types and positions of all objects in a picture, outputting a visual characteristic vector corresponding to each object, carrying out pairing operation on every two detected objects, extracting a combined visual characteristic vector based on a pairing result, and coding the position to obtain a position code; inputting the categories of all the objects into a hierarchical semantic clustering algorithm, and processing to obtain a high-level semantic feature vector corresponding to each object; performing semantic coding on the output of the hierarchical semantic clustering algorithm; generating a relationship classifier weight; and merging the visual feature vector, the combined visual feature vector and the position code into a unified feature vector, and performing point multiplication operation on the unified feature vector by using the weight of the relation classifier to finally obtain the relation conditional probability between every two objects as a scene graph.

Description

Visual relation detection method and device based on scene graph high-order semantic structure

Technical Field

The invention relates to the field of image processing, in particular to a visual relation detection method and device based on a scene graph high-order semantic structure.

Background

The main goal of the visual relationship detection task is to identify and locate the content of the visual ternary relationships (subjects, relationships, objects) that exist in the image. The identification refers to identifying the category attribute of the target object, and the positioning refers to regressing the bounding box of the target object. Understanding the visual scene is often not just the recognition of a single object, even a perfect object detector, it is difficult to perceive nuances between a person feeding a horse and a person standing beside the horse. Learning rich semantic relationships between these objects is the sense of visual relationship detection. The key to understand visual scenes more deeply is to construct a structured representation from the scene based on identifying objects to capture objects and their semantic relationships. Such a characterization not only provides contextual information for the underlying recognition task, but is also of great value for a variety of advanced visual tasks. This structured representation is called a scene graph. The scene graph provides an expression of explicitly modeled objects and their relationships. Briefly, a scene graph is a visual positioning graph of image objects, where the connection of edges depicts their pairwise relationship.

Visual relation detection is a bottom-layer core algorithm in many fields and has wide application, for example, in the field of image retrieval, the visual relation detection enables a retrieval algorithm to better understand the relation between input texts and images, so that the retrieval effect is improved; in the field of automatic driving, the automatic driving vehicle can be provided with a current scene structure so as to help the automatic driving vehicle to safely drive.

Detecting visual relationships in images is a difficult task, with difficulties arising mainly from two aspects: (1) it is difficult to obtain labels of the correct kind and quantity and complete triple labels; (2) there is great variability in the relationships in terms of visual appearance and language description. It is very difficult to obtain the object bounding box level label first. Detecting visual relationships in an image requires locating the subject and object in the interaction by determining a bounding box around the corresponding visual entity in the image. Thus, for a fully supervised model, the ideal training data is an image of the visual relationship with box-level labeling, i.e. drawing a bounding box around the objects, and each pair of interactive objects is labeled with a descriptive triplet. However, obtaining such annotations is very expensive.

Another reason for the difficulty in obtaining annotations is the ternary combination explosion caused by the combined nature of the visual relationships. For a vocabulary of N different object classes and K different relations, the possible relation coefficients are nxnxnxk, e.g. for N100 there may be one million possible triples. Since most of these triples are rare or invisible in the real world, the training data itself always presents a long-tailed distribution, i.e., the annotations are concentrated on few relationships, while most triples in the vocabulary have little or no training data. The long tail distribution is not caused by the quality of the label, but the distribution of the data collected under natural conditions usually shows the same long tail distribution. The algorithm is trained on a data set with long tail distribution, so that an overfitting phenomenon often occurs, namely the algorithm mainly focuses on the head category in the data volume set, and the learning of the tail category is omitted. Therefore, applying visual relationship detectors to a large number of triplets is a significant challenge. However, from the industrial demand, for the algorithm research on the long tail data, the labeling cost can be greatly reduced, and the data acquisition efficiency is improved.

Early visual relationship identification methods learned the relationship triplets as a whole, i.e., trained an independent detector for each relationship triplet category. However, this method is only suitable for small datasets with fewer relational triples and more label data per triplet. In this case, the visual relationship detection task is very similar to the target detection task except that the detection object at this time is changed from a single object to two objects and the relationship therebetween. However, as the data set is developed, the visual relational data set is no longer defined on the preset relational triples, but is an open vocabulary, so that the number of possible triples is very large, and most triples do not have enough labeled data. In this case, it is not possible to train a triplet detector, and therefore this has facilitated the development of a combined model, i.e. instead of detecting each visual relationship triplet individually, the detection target becomes a simpler visual element that can be shared among multiple visual relationship triplets. This change in perspective is inspired by the structure of natural language, where visual relationships are represented in the form of components of triples, and each component can be observed independently or as part of different visual interactions.

Disclosure of Invention

In order to solve the technical problem that labels of correct types and numbers and complete triple labels are difficult to obtain, the invention provides a visual relationship detection method and device based on a scene graph high-order semantic structure.

Therefore, the visual relationship detection method based on the scene graph high-order semantics, which is provided by the invention, specifically comprises the following steps:

s1, predicting the types and positions of all objects in the picture, outputting a visual feature vector corresponding to each object, pairing every two detected objects, extracting a joint visual feature vector based on a pairing result, and coding the positions to obtain position codes;

s2, inputting the categories of all the objects into a hierarchical semantic clustering algorithm, and processing to obtain a high-level semantic feature vector corresponding to each object;

s3, carrying out semantic coding on the output of the hierarchical semantic clustering algorithm;

s4, generating the weight of the relation classifier;

s5, merging the visual feature vector, the combined visual feature vector and the position code into a unified feature vector, and performing dot product operation on the unified feature vector by using the weight of the relation classifier to finally obtain the relation conditional probability between every two objects as a scene graph.

Further, the class of the object is a number, which is obtained by sequentially encoding objects that may appear in the input data.

Further, the position of the object is a frame, which is determined by two points, namely the upper left corner and the lower right corner of the frame, each point including the values of the abscissa and the ordinate.

Further, the encoding the position specifically includes combining a difference value between center points of detection frames corresponding to the two objects and respective height widths into a feature vector based on object pairing, and processing the feature vector by using a learnable linear layer to obtain a final position code.

Further, the specific process of the hierarchical semantic clustering algorithm includes converting category names appearing in all data sets into word vectors through a word space model, then performing hierarchical clustering on the word vectors, and obtaining a large category word vector corresponding to each category word vector after multiple iterations.

Further, the semantic coding specifically includes performing secondary coding on the input high-level semantic features dynamically through a hidden space coding layer, defining all detected objects in a picture as nodes of a fully-connected graph in the secondary coding, counting the probability of common occurrence of all objects in the whole data set as edges between the nodes, and obtaining the high-level semantic features with contexts of all the objects in the input image through multilayer graph convolution neural network processing.

Further, the generating of the relation classifier weight specifically includes selecting the high-level semantic features with context according to the categories of the two objects included in each pairing of the object pairings, selecting the high-level semantic features with context corresponding to the two processed object categories for each pairing, merging the two high-level semantic features with context to serve as a weight, and finally obtaining the relation classifier weight.

The visual relation detection device based on the scene graph high-order semantics concretely comprises a visual feature extraction module, a hierarchical semantic clustering module, a semantic coding module, a weight generation module and a scene graph generation module;

the visual feature extraction module predicts the categories and positions of all objects in the picture through a convolutional neural network, outputs a visual feature vector corresponding to each object, performs pairing operation on every two detected objects, extracts a combined visual feature vector through the convolutional neural network based on a pairing result, and codes the positions to obtain position codes;

the hierarchical semantic clustering module replaces the categories of all objects in the input picture by using semantic vectors obtained by a hierarchical semantic clustering algorithm to obtain a high-level semantic feature vector corresponding to each object;

the semantic coding module comprises a hidden space coding layer and a graph convolution neural network, the output of a hierarchical semantic clustering algorithm is used as input, the input high-level semantic features are dynamically subjected to secondary coding through the hidden space coding layer, and the high-level semantic features with contexts of all objects in an input image are obtained through multi-layer graph convolution neural network processing;

the weight generation module selects the high-level semantic features with the context according to the categories of the two objects included in each pairing in the object pairing, selects the high-level semantic features with the context corresponding to the two processed object categories for each pairing, and takes the two high-level semantic features with the context as a weight after being combined to finally obtain the weight of the relation classifier;

the scene graph generation module combines the visual feature vector, the combined visual feature vector and the position code into a unified feature vector, and performs point multiplication operation on the unified feature vector by using the weight of the relation classifier to finally obtain the relation conditional probability between every two objects as the scene graph.

Further, the class of the object is a number, which is obtained by sequentially encoding objects that may appear in the input data;

the position of the object is a frame, which is determined by two points, respectively the upper left corner and the lower right corner of the frame, each point comprising a numerical value of an abscissa and an ordinate.

The computer-readable storage medium provided by the invention stores a program which can be run by a processor, and the program can realize the visual relation detection method based on the scene graph high-order semantics in the process of being run by the processor.

Compared with the prior art, the invention has the following beneficial effects:

a hierarchical clustering algorithm and semantic coding are added in the algorithm process, so that the classification of the relationship between objects in the picture fully considers the semantic relationship with universality.

In some embodiments of the invention, the following advantages are also provided:

1) compared with the prior art, the position coding processing in the invention is simpler and more direct, the operation complexity is greatly optimized, and the requirement on hardware is reduced;

2) according to the method, aiming at the characteristics of objects and relations, a graph structure is used for processing the objects and the relations, and a graph convolution neural network is innovatively used, so that the method is universal, more flexible and more suitable for different use scenes.

Drawings

FIG. 1 is a flow chart of a visual relationship detection method.

Detailed Description

In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.

The visual relationship detection method based on the scene graph high-order semantics provided by the embodiment of the invention specifically comprises the following steps:

s1, visual feature extraction, predicting the category and position of all objects in the picture through a convolutional neural network CNN and a regional convolutional neural network RCNN, wherein the category of the object is a number, generally obtained by sequentially coding the objects possibly appearing in input data, the position of the object is a frame, the frame is determined by two points, namely the upper left corner and the lower right corner of the frame, each point comprises the numerical values of a horizontal coordinate and a vertical coordinate, meanwhile, a visual feature vector corresponding to each object, namely a feature vector of the position corresponding to each object frame on a feature map output by the convolutional neural network, then, pairing every two detected objects once, wherein in each pairing, the detection frames of the corresponding objects are combined into a large frame, extracting the combined visual feature vector through the convolutional neural network, except the visual feature, considering that the relative position of the object is also an important consideration point for classifying the relationship, the position is also encoded, and the above-mentioned object pairing combines the difference value of the central points of the detection frames corresponding to the two objects and the respective height and width into a feature vector, and processes the feature vector by using a learnable linear layer to obtain the final position code.

S2, hierarchical semantic clustering, inputting the categories of all objects in the picture into a hierarchical semantic clustering algorithm, replacing the categories of all the objects by using semantic vectors obtained by the hierarchical semantic clustering algorithm to obtain high-level semantic feature vectors corresponding to all the objects, converting the category names appearing in all training data and an existing word space model in the training process of the hierarchical semantic clustering algorithm, specifically, converting the category names appearing in all the data sets into word vectors through a word space model (such as BERT) and the like, then, hierarchically clustering the word vectors, and obtaining a large category word vector corresponding to each category word vector after multiple iterations, for example, the category "dog" can be combined with the words "cat", "rabbit" and the like into an "animal" word vector after multiple clustering. The hierarchical semantic clustering algorithm has the advantages that higher-level semantic features can be obtained more efficiently, and then the lower-level semantic features can be processed by using the higher-level semantic features, so that the embodiment of the invention has strong universality under the condition of less data volume.

S3, semantically coding the output of the hierarchical semantic clustering algorithm, specifically, dynamically coding the input high-level semantic features secondarily through a hidden space coding layer, defining all detected objects in the picture as nodes of a full-connected graph in the secondary coding, and counting the probability of common occurrence of all objects in the whole data set as edges between the nodes, so that the input high-level semantic features can be directly processed by a graph convolution neural network. In the field of software engineering, adjacency matrices are generally used as data structures defining graph structures, each value of an adjacency matrix representing a value of an edge between two nodes, and thus, the adjacency matrix of the graph is defined as a probability that every two objects in the entire data set appear together in the same picture. After the neural network is convolved by a plurality of layers of graphs, the high-level semantic features with the contexts of all objects in the input image can be obtained.

S4, generating a relation classifier weight, selecting the high-level semantic features with the context according to the categories of the two objects included in each pairing in the object pairing, selecting the high-level semantic features with the context corresponding to the two processed object categories for each pairing, combining the two high-level semantic features with the context to be used as a weight, and repeating the operation to obtain the relation classifier weight.

And S5, generating a scene graph, combining the visual feature vector, the combined visual feature vector and the position code obtained in the step S1 into a unified feature vector, and performing point multiplication on the unified feature vector by using the weight of the relation classifier to finally obtain the relation conditional probability between every two objects as the scene graph.

The visual relation detection device based on the scene graph high-order semantics comprises a visual feature extraction module, a hierarchical semantic clustering module, a semantic coding module, a weight generation module and a scene graph generation module.

The output of the visual feature extraction module comprises a visual feature vector, a joint visual feature vector and a position code, the main body of the visual feature extraction module comprises a general object detection algorithm which uses a convolutional neural network to predict the class and the position of an object, the class of the object is a number and is generally obtained by sequentially coding the objects which may appear in input data, the position of the object is a box which is determined by two points, namely the upper left corner and the lower right corner of the box, and each point comprises the numerical values of a horizontal coordinate and a vertical coordinate. The visual feature vector is the feature vector of the corresponding position of each object frame on the feature map output by the convolutional neural network. And pairing every two detected objects once in the visual feature extraction module, combining the detection frames of the corresponding objects into a large frame in each pairing, and extracting the combined visual feature vector through a convolutional neural network. The difference value of the central points of the detection frames corresponding to the two objects and the respective height and width are combined into a feature vector by the above-mentioned object pairing, and a learnable linear layer is used for processing the feature vector to obtain the final position code.

And the hierarchical semantic clustering module replaces the categories of all objects in the input picture by using semantic vectors obtained by the hierarchical semantic clustering algorithm based on the hierarchical semantic clustering algorithm to obtain a high-level semantic feature vector corresponding to each object.

The semantic coding module comprises a hidden space coding layer and a graph convolution neural network, the output of a hierarchical semantic clustering algorithm is used as input, the input high-level semantic features are dynamically subjected to secondary coding through the hidden space coding layer, and the high-level semantic features with the contexts of all objects in the input image can be obtained after passing through a plurality of layers of graph convolution neural networks.

The weight generation module selects the high-level semantic features with the context according to the categories of the two objects included in each pairing in the object pairing, selects the high-level semantic features with the context corresponding to the two processed object categories for each pairing, combines the two high-level semantic features with the context to serve as a weight, and repeats the operation to obtain the weight of the relation classifier.

And the scene graph generation module combines the visual feature vector, the joint visual feature vector and the position code into a unified feature vector, and performs point multiplication operation on the unified feature vector by using the weight of the relation classifier to finally obtain the relation conditional probability between every two objects as a scene graph.

For the improvement of the effect of the visual relationship detector, it is an important aspect to obtain better visual relationship feature expression. The visual features employed by embodiments of the present invention are structured feature combinations obtained based on triples, i.e., expressing visual relationships as a combination of features of subjects and objects, thus, naturally breaking the visual relationships into two components that can be shared. However, in order to achieve a better detection effect, a finer-grained decomposition method may be explored, so that the expression of the visual relationship is more intelligible and is less ambiguous, for example, in the human interaction detection, the relationship between a human and an object is expressed through the interaction between each part of the human body and the object, which is more meaningful than only regarding the relationship between the whole human and the object.

The embodiment of the invention only adopts the visual features and the 2D space position features to learn the relationship, and can also search other features, for example, depth information can help solve the ambiguity problem which cannot be solved by the 2D space position information, because the 2D space position relationships with different relationships are the same, and the segmentation result can be used for complementing the object bounding box, thereby solving the ambiguity problem caused by some occlusion.

The model provided by the embodiment of the invention adopts a two-stage framework, a candidate frame is extracted by means of a pre-trained object detector, then the detection of the training relation is carried out according to the object formed by the candidate frame, all possible object combinations can not be trained by considering the complexity of calculation, so that the recall rate is lower, a strategy for selecting the object pairs which are most likely to generate the relation can be designed according to the actual application requirements, a better detector is designed, and a group pairing strategy is replaced.

The embodiment of the invention provides a visual relation detection method and device based on scene graph high-order semantics, which enable the learning of relations to be context-dependent by training a structured relation classification space, and do not classify the relations under various completely different contexts into the same class any more, so that the problem of overlarge visual difference in the class is well relieved, and meanwhile, the structural statistical property of relation triples is introduced, and by means of context vectorization and a shared mapping function, the learning among different relation triples is shared, and the problem caused by data sparsity is relieved. Meanwhile, the generalization performance of the model is stronger due to the structure of high-level semantics, so that the problem of long-tail distribution of data is greatly improved. The embodiment of the invention can well solve the problem of visual relation detection under the condition of less data volume.

The embodiment of the invention has the following beneficial effects:

1) aiming at the problem of intra-class diversity, a structured relationship classification space related to a context is established according to the structure of a global scene graph, and an independent relationship space is connected by vectorizing the context and sharing a mapping function, so that the statistical characteristic on the structure is utilized, and meanwhile, the common occurrence statistical prior knowledge of objects in a scene data set is applied to the learning of visual relationships by means of a graph neural network;

2) aiming at the problem of long-tail distribution of relational data, a semantic hierarchical clustering algorithm based on a scene knowledge graph is designed for extracting a high-level semantic structure of visual relations, and context vectors are combined and learned according to clustering results, so that the influence of long-tail distribution on model learning is relieved while the problem scale is reduced and the generalization performance of a model is improved.

The above disclosure is only for the purpose of illustrating the preferred embodiments of the present invention, and it should not be understood that the scope of the present invention is limited thereby. It should be noted that those skilled in the art should recognize that they may make equivalent variations to the embodiments of the present invention without departing from the spirit and scope of the present invention.

Claims

1. A visual relation detection method based on scene graph high-order semantics is characterized by specifically comprising the following steps:

s4, generating the weight of the relation classifier;

2. The method of claim 1, wherein the object type is a number, and the object type is obtained by sequentially encoding objects that may appear in the input data.

3. The method of claim 1, wherein the object is located in a frame, and the position is determined by two points, namely, the upper left corner and the lower right corner of the frame, and each point includes a value of an abscissa and a value of an ordinate.

4. The method of claim 1, wherein the encoding of the position specifically includes combining a difference between center points of detection frames corresponding to two objects and respective height widths into a feature vector based on object pairing, and processing the feature vector using a learnable linear layer to obtain a final position code.

5. The method for detecting visual relationship based on scene graph high-order semantics as claimed in claim 1, wherein a specific process of the hierarchical semantic clustering algorithm includes converting category names appearing in all data sets into word vectors through a word space model, then performing hierarchical clustering on the word vectors, and obtaining a large category word vector corresponding to each category word vector after multiple iterations.

6. The method for detecting visual relationship based on scene graph high-order semantics as claimed in claim 1, wherein the semantic coding specifically includes performing secondary coding on the input high-level semantic features dynamically through a hidden space coding layer, defining all detected objects in a picture as nodes of a fully-connected graph in the secondary coding, counting probabilities of all objects appearing together in the whole data set as edges between the nodes, and obtaining the high-level semantic features with contexts of all objects in the input image through multi-layer graph convolution neural network processing.

7. The visual relationship detection method based on scene graph high-order semantics as claimed in claim 6, wherein the generating of the relationship classifier weight specifically includes selecting a high-level semantic feature with context according to categories of two objects included in each pairing of object pairings, selecting a high-level semantic feature with context corresponding to the two processed object categories for each pairing, merging the two high-level semantic features with context as a weight, and finally obtaining the relationship classifier weight.

8. An apparatus for visual relationship detection based on scene graph higher-order semantics, comprising a memory of a processor, the memory having stored therein a computer program executable by the processor to implement the method of any one of claims 1-7.

9. A computer-readable storage medium, in which a program executable by a processor is stored, and the program is capable of implementing the method for detecting visual relationships based on higher-order semantics of a scene graph according to any one of claims 1 to 7 during execution by the processor.