WO2022045531A1

WO2022045531A1 - Scene graph generation system using deep neural network

Info

Publication number: WO2022045531A1
Application number: PCT/KR2021/006634
Authority: WO
Inventors: 김인철; 정가영
Original assignee: 경기대학교 산학협력단
Priority date: 2020-08-24
Filing date: 2021-05-28
Publication date: 2022-03-03
Also published as: KR20220025524A; KR102533140B1

Abstract

A scene graph generation system using a deep neural network is disclosed. The system comprises: an object area detection unit for detecting a plurality of object areas from an input image; an object and relationship detection unit which detects objects and relationships within the image on the basis of the inferred object areas, and which detects the objects and relationships by using multi-modal contextual information including linguistic contextual features, in addition to visual contextual features; and a graph generation unit generating a scene graph for the input image according to the detection results of the object and relationship detection unit.

Description

Scene graph generation system using deep neural network

The present invention relates to a technique for generating a scene graph, and more particularly, to a technique for recognizing objects in an image, identifying the relationships between them, and expressing the result in graph form.

Scene graph generation is one of the representative problems in artificial intelligence and computer vision that require deep image understanding. A scene graph expresses the scene contained in an image in the form of a graph. Each node of the graph represents an object in the image, and each edge represents a relationship between objects. A scene graph can therefore be viewed as a set of facts of the form <subject-relationship predicate-object>. That is, the scene graph generation problem is the problem of generating a knowledge graph that represents the scene of an input image as the result of an in-depth understanding of that image.
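As a minimal illustration of the fact-set view described above, a scene graph can be held as a set of <subject, predicate, object> triples from which the nodes and edges are derived. The `Triple` type and the example labels are hypothetical and are not taken from the specification.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact of the form <subject-relationship predicate-object>."""
    subject: str
    predicate: str
    obj: str


# Hypothetical facts for an image of a tennis player.
scene_graph = {
    Triple("man", "wearing", "shoes"),
    Triple("man", "holding", "racket"),
}

# Nodes are the objects that appear in any fact; edges are the facts themselves.
nodes = sorted({t.subject for t in scene_graph} | {t.obj for t in scene_graph})
edges = sorted(scene_graph)
```

Storing the graph as a set makes duplicate facts collapse automatically, which suits the fact-set interpretation.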

FIG. 1 shows a general scene graph generation process. To generate a scene graph, not only object detection in an image but also relationship detection between objects is essential. Object detection has long been studied in the computer vision field, whereas relationship detection has only recently begun to attract attention and is still in the early stages of research. The possible relationships between two objects in an image are very diverse. In general, studies of scene graph generation deal with spatial relationships and semantic relationships between objects. A spatial relationship represents the relative position of objects placed in the image, such as ‘on’, ‘next to’, and ‘in front of’. A semantic relationship, on the other hand, relates to the action of one object on another, such as ‘wearing’, ‘eating’, and ‘holding’.

Object detection technology using a convolutional neural network (CNN) has now reached a high level, but errors in object identification and region detection may still occur. This means that there may be uncertainties and errors in the identification of the two objects on which relationship detection is based. Even when the two objects forming a relationship are identified clearly, the number of possible relationships between them is large, so accurately determining the relationship is by no means easy. Moreover, there are generally various semantic restrictions on a specific relationship and on the types of the two objects that can have that relationship. Referring to FIG. 1 as an example, the relationship <man-wearing-shoes> is possible, but human common sense tells us that relationships such as <man-wearing-racket> or <shoes-wearing-man> are impossible. Therefore, a method is needed that can effectively generate an accurate scene graph from an image in consideration of these characteristics of the problem.

An object of the present invention is to provide a technical method capable of generating an appropriate scene graph for an image.

A scene graph generation system using a deep neural network according to an aspect may include an object region detector that detects a plurality of object regions in an input image; an object and relationship detector that detects objects and relationships in the image based on the detected object regions, using multi-modal context information that includes language context features in addition to visual context features based on a convolutional neural network (CNN); and a graph generator that generates a scene graph for the input image according to the detection results of the object and relationship detector.

The object region detector may detect object regions in the input image by using Faster R-CNN (Region-based Convolutional Neural Network).

The object and relationship detector may include a graph initializer that generates object nodes and relationship nodes to compose a graph based on the detected object regions and provides an initial feature value for each generated node; a graph reasoning unit that updates the feature value of each node by exchanging context information between neighboring nodes based on the initial feature values; and a graph labeling unit that classifies objects and relationships (node classification) based on the final feature value of each node updated through the graph reasoning unit.

The graph initializer may include an object node initializer that creates an object node for each object region and assigns an initial feature value to the created object node, and a relation node initializer that creates one relation node for each pair of object regions and assigns an initial feature value to the created relation node, the initial feature value being multi-modal context information that includes a text-based linguistic context feature in addition to an image-based visual context feature.

The object node initializer may allocate a visual feature and an object class probability distribution of each object region as initial feature values of each object node.

The linguistic context feature may include a feature in which the expected object category of the subject object is embedded, location information in the image of the subject object region and the object object region, and a feature in which the expected category name of the object object is embedded.

The visual context feature may include the visual feature of the entire input image, the visual feature of the image region surrounding the subject object region and the object object region that can form a relationship, and location information of the region surrounding the subject object and the object object.

The relationship node initializer may embed the components of the language context feature using a bidirectional recurrent neural network.

The graph reasoning unit can use an attentional graph convolution neural network to identify a node to focus on among neighboring nodes, and differentially reflect the information of the neighboring nodes in updating the feature values of each node.

The graph reasoning unit may be composed of a visual reasoning layer based on the attentional graph convolutional neural network and a semantic reasoning layer based on the attentional graph convolutional neural network, and the object and relationship class probability distribution of each node resulting from the visual reasoning layer may be provided as an initial input value of the semantic reasoning layer.

On the other hand, a method for generating a scene graph using a deep neural network according to an aspect may include an object region detection step of detecting a plurality of object regions in an input image; an object and relationship detection step of detecting objects and relationships in the image based on the detected object regions, using multi-modal context information that includes language context features in addition to visual context features based on a convolutional neural network (CNN); and a graph generation step of generating a scene graph for the input image according to the detection results.

The present invention creates an effect that makes it possible to generate an appropriate scene graph for an input image.

FIG. 1 is an exemplary diagram of scene graph generation.

FIG. 2 is a block diagram of a scene graph generation system using a deep neural network according to an embodiment.

FIGS. 3A to 3C are structural diagrams of a scene graph generation model using a deep neural network according to an embodiment.

FIG. 4 is a diagram illustrating a language context feature embedding process based on a bidirectional recurrent neural network according to an embodiment.

The foregoing and further aspects of the present invention will become more apparent through preferred embodiments described with reference to the accompanying drawings. Hereinafter, the present invention will be described in detail so that those skilled in the art can easily understand and reproduce it through these examples.

FIG. 2 is a block diagram of a scene graph generation system using a deep neural network according to an embodiment. The scene graph generation system includes an object region detection unit 100, an object and relationship detection unit 200, and a graph generation unit 600. The object region detection unit 100 detects a plurality of object regions in an image given as an input. In an embodiment, the object region detection unit 100 detects object regions using Faster R-CNN (Region-based Convolutional Neural Network). In this case, up to 64 regions may be detected. Each object region has a convolutional neural network (CNN) visual feature, a position, and a class probability distribution. These feature values are obtained in the object region detection process and are then used to initialize the values of object nodes and relation nodes.
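The per-region outputs described above can be sketched as a simple record type. The field names, the box layout, and the policy of keeping the most confident regions are assumptions made for illustration. The specification states only that up to 64 regions may be detected and that each carries a visual feature, a position, and a class probability distribution.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_REGIONS = 64  # upper bound on detected regions stated in the embodiment


@dataclass
class ObjectRegion:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout
    visual_feature: List[float]             # CNN visual feature of the region
    class_probs: List[float]                # object class probability distribution


def keep_top_regions(regions: List[ObjectRegion],
                     limit: int = MAX_REGIONS) -> List[ObjectRegion]:
    """Keep at most `limit` regions, most confident first (assumed policy)."""
    ranked = sorted(regions, key=lambda r: max(r.class_probs), reverse=True)
    return ranked[:limit]
```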

The object and relationship detection unit 200 detects objects in the input image and the relationships between them based on the detected object regions. In an embodiment, the object and relationship detection unit 200 detects objects and relationships using multi-modal context information that includes language context features in addition to visual context features based on a convolutional neural network (CNN). Specifically, the object and relationship detection unit 200 represents each object region detected by the object region detection unit 100 as an object node and each pair of objects as a relation node, and initializes each node. The object and relationship detection unit 200 then updates the feature value of each node by exchanging context information between neighboring nodes using a graph convolutional neural network (GCN), and classifies each object node and relation node based on the finally obtained feature values. The graph generation unit 600 generates a scene graph for the input image based on the object nodes and relation nodes classified by the object and relationship detection unit 200.

The object and relationship detection unit 200 may include a graph initialization unit 300, a graph reasoning unit 400, and a graph labeling unit 500. The graph initialization unit 300 generates object nodes and relation nodes to construct the graph based on the object regions, and assigns an initial feature value to each generated node. As shown in FIG. 2, the graph initialization unit 300 may include an object node initialization unit 310 and a relation node initialization unit 320, and the relation node initialization unit 320 may include a language context feature embedding unit 321. The object node initialization unit 310 creates an object node for each object region and assigns an initial feature value to the created object node. In this case, the object node initialization unit 310 may assign the visual feature and the object class probability distribution of the object region as the initial feature value of the object node. In addition, the relation node initialization unit 320 creates one relation node for each pair of object nodes and assigns an initial feature value to the created relation node. Here, multi-modal context information that includes text-based language context features in addition to image-based visual context features is assigned as the initial feature value.
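The node creation performed by the graph initialization unit can be sketched as follows. Treating each pair as an ordered (subject, object) pair is an assumption made here because relationships such as <man-wearing-shoes> are directional. The specification says only that one relation node is created per pair. The function name and the integer identifiers are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def build_graph_nodes(num_regions: int) -> Tuple[List[int], Dict[int, Tuple[int, int]]]:
    """Create one object node per region and one relation node per ordered pair."""
    object_nodes = list(range(num_regions))
    # relation id -> (subject node id, object node id)
    relation_nodes = dict(enumerate(permutations(object_nodes, 2)))
    return object_nodes, relation_nodes


object_nodes, relation_nodes = build_graph_nodes(3)
```

Under this assumption, n object nodes give n(n-1) relation nodes, so the cap on detected regions also bounds the size of the graph.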

The visual context feature includes the visual feature of the entire input image, the visual feature of the image region surrounding the subject object region and the object object region that can form a relationship, and location information of the region surrounding the subject object and the object object. The linguistic context feature may include a feature in which the expected object category of the subject object is embedded, location information in the image of the subject object region and the object object region, and a feature in which the expected category name of the object object is embedded. In addition, the language context feature embedding unit 321 may embed the components of the language context feature using a bidirectional recurrent neural network.

The graph reasoning unit 400 updates the feature value of each node by exchanging context information between neighboring nodes based on the initial feature values of each node obtained from the graph initialization unit 300. In one embodiment, the graph reasoning unit 400 uses an attentional graph convolutional neural network to identify the nodes to focus on among neighboring nodes, and differentially reflects the information of the neighboring nodes in updating the feature value of each node. When the feature value of an object node is updated using the attentional graph convolutional neural network, context information is exchanged between the subject object node and the object object node, between the subject object node and the relation node, and between the object object node and the relation node. On the other hand, when the feature value of a relation node is updated, context information is exchanged between the relation node and the subject object node, and between the relation node and the object object node.

As shown in FIG. 2, the graph reasoning unit 400 may include a visual reasoning unit 410 and a semantic reasoning unit 420. The visual reasoning unit 410 represents a graph convolutional neural network layer for visual reasoning, and the semantic reasoning unit 420 represents a graph convolutional neural network layer for semantic reasoning. In each layer, the feature value of each node is updated by exchanging context information between neighboring nodes of the graph based on the initial feature values of each node given through the graph initialization unit 300. In this case, the object and relation class probability distribution of each node obtained from the visual reasoning unit 410 may be provided as an initial node input of the semantic reasoning unit 420.

The graph labeling unit 500 classifies objects and relationships based on the final feature values of each node updated through the graph reasoning unit 400. Each node is classified into the category with the largest value obtained by applying a softmax function to the feature value finally produced by the semantic reasoning unit 420. The object node classifying unit 510 of the graph labeling unit 500 labels each object node with the category having the largest value in the object class probability distribution, and the relation node classifying unit 520 labels each relation node through the same process. Through this, a standardized result in the form of <subject-descriptor-object> is obtained.
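The labeling rule described above (apply softmax, then take the category with the largest value) can be sketched in a few lines. The function names and the category list are hypothetical. The same routine would serve object nodes and relation nodes, each with its own category list.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a node's final feature values."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_node(scores: Sequence[float], categories: Sequence[str]) -> str:
    """Return the category whose softmax probability is largest."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return categories[best]
```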

Hereinafter, a method for generating a scene graph according to the above-described system will be described in more detail. FIGS. 3A to 3C are diagrams illustrating the structure of a neural network model for generating a scene graph. This model consists of three stages: region proposals (RP) of FIG. 3A, object and relationship detection (ORD) of FIG. 3B, and graph generation (GG) of FIG. 3C. In the object region detection (RP) stage, Faster R-CNN, a representative object detection module, is used to obtain, for each object candidate region of the input image, the ResNet101 visual feature vector, the location and size of the bounding box, the probability distribution over object categories (object class distribution), and the like.

The object and relationship detection (ORD) stage again consists of detailed stages of graph initialization, graph reasoning, and graph labeling. In the graph initialization step, object nodes and relational nodes to compose a scene graph are generated based on each object region in the input image obtained through the object region detection (RP) process, and initial values are assigned to these nodes. In the graph reasoning step, using a graph convolutional neural network (GCN), context information is exchanged between neighboring object nodes and relational nodes in the graph, and feature values of each node are updated. In the graph labeling step, an object and a relationship are classified based on the final feature value of each node. Finally, in the graph generation stage, one scene graph is completed based on each classified node.

In the graph initialization step of this model, one object node is created in the graph for each object area detected in the image, and an initial feature value is assigned to the corresponding node. In this model, Faster R-CNN, a representative object detection module, is applied to the input image, and the visual feature vector and object class probability distribution extracted for each object candidate area are assigned as initial feature values of each object node. This initial feature value is then used for classification of object nodes after rich contextual information of neighboring nodes is combined through a graph neural network. Therefore, the object category of each node finally determined in this model may be different from the initial object category estimated by Faster R-CNN.

● object visual feature: the convolutional (CNN) visual feature of the object region

● class probability distribution: the object class probability distribution of the object region

Therefore, the initial feature vector of each object node is the concatenation of these two features, as in Equation 1. The operator in Equation 1 represents a concatenation operation.
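The concatenation in Equation 1 can be illustrated as follows. The function name and the toy feature sizes are hypothetical. In the embodiment the visual feature would be the ResNet101 vector of the region, not the two-element list used here.

```python
from typing import List, Sequence


def init_object_node(visual_feature: Sequence[float],
                     class_probs: Sequence[float]) -> List[float]:
    """Initial object node feature: visual feature followed by class distribution."""
    return list(visual_feature) + list(class_probs)


# Toy example: a 2-dimensional visual feature and a 2-class distribution.
node_feature = init_object_node([0.5, 1.5], [0.9, 0.1])
```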

In the graph initialization step, relation nodes are initialized in addition to object nodes. That is, one relation node is created in the graph for each pair of object regions detected in the image, and an initial feature value is assigned to that node. In this model, for effective relationship detection, rich multi-modal context information that includes text-based linguistic context features in addition to image-based visual context features is assigned as the initial feature value of each relation node. The composition of the visual context feature set and the linguistic context feature set for a relation node is as follows.

● visual context feature set

- Convolutional visual feature of the entire input image

- Convolutional visual feature of the image region (union box) surrounding the subject and object regions that can form a relationship

- Location information of the union box surrounding the subject object and the object object

The terms of Equation 2 are the center coordinates, width, and height of the object region, and the width and height of the union box. The terms of Equation 3 are the coordinates of the upper-left corner and of the lower-right corner of the union box.
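The geometric quantities named for the visual context feature set can be computed as in the sketch below. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates, and it computes only the raw quantities (union box corners, center, width, height). How the specification combines or normalizes them in Equations 2 and 3 is not reproduced here.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def union_box(subject: Box, obj: Box) -> Box:
    """Smallest box that encloses both the subject region and the object region."""
    return (min(subject[0], obj[0]), min(subject[1], obj[1]),
            max(subject[2], obj[2]), max(subject[3], obj[3]))


def center_size(box: Box) -> Tuple[float, float, float, float]:
    """Center coordinates, width, and height of a box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


subject_box: Box = (10.0, 20.0, 50.0, 80.0)
object_box: Box = (30.0, 10.0, 90.0, 60.0)
union = union_box(subject_box, object_box)
```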

● linguistic context feature set

- Feature obtained by embedding the expected object category of the subject object with a multi-layer perceptron (MLP)

- Location information of the subject object region and the object object region in the image

- Feature obtained by embedding the expected category name of the object object with a multi-layer perceptron

Here, the location information is the same as in Equation 3.

On the other hand, the language context feature vector for expressing a relationship can be obtained from the three components introduced above by various combination methods, such as simple concatenation, a unidirectional recurrent neural network (RNN), or a bidirectional recurrent neural network (biRNN). In general, it is desirable to express the relationship between two objects as a sequence, considering the position, order, and role of each of the three language components, as in <subject-relational descriptor-object>. With this in mind, this model sequentially combines the three language components using a bidirectional recurrent neural network (biRNN) to create the language context feature vector. In particular, the components are embedded in this way so that the feature vector effectively captures the linguistic context sequence, namely the bidirectional constraint, grounded in the conceptual relationships of language, between the subject object types and the object object types that can establish a given relationship. FIG. 4 shows the biRNN-based language context feature embedding process, and Equation 4 expresses the process as an equation. The terms of Equation 4 are the learning parameters, the hidden state in the forward direction, and the hidden state in the reverse direction. In this model, the initial feature value of each relation node is given as in Equation 5 by combining the visual context feature vector and the language context feature vector embedded by the biRNN.
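The embedding and combination steps described for Equations 4 and 5 can be sketched with a very small bidirectional RNN written in plain Python. This is an illustration under several assumptions, not the patented formulation: a plain tanh recurrent cell is used, the three components are assumed to share one dimension, the final forward and backward hidden states are concatenated, and the weights are random and untrained. All names are hypothetical.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_pass(seq: List[Vector], w_in: Matrix, w_h: Matrix) -> Vector:
    """Run a plain tanh RNN over the sequence and return the last hidden state."""
    h = [0.0] * len(w_h)
    for x in seq:
        h = [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_h, h))]
    return h


def birnn_embed(seq: List[Vector], p: Dict[str, Matrix]) -> Vector:
    """Concatenate the final forward state and the final backward state."""
    forward = rnn_pass(seq, p["w_in_f"], p["w_h_f"])
    backward = rnn_pass(seq[::-1], p["w_in_b"], p["w_h_b"])
    return forward + backward


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D, H = 4, 3  # toy input and hidden sizes
params = {
    "w_in_f": random_matrix(H, D, rng), "w_h_f": random_matrix(H, H, rng),
    "w_in_b": random_matrix(H, D, rng), "w_h_b": random_matrix(H, H, rng),
}

subject_emb = [1.0, 0.0, 0.0, 0.0]  # embedded expected category of the subject
location = [0.1, 0.2, 0.3, 0.4]     # location information of the two regions
object_emb = [0.0, 1.0, 0.0, 0.0]   # embedded expected category of the object

language_context = birnn_embed([subject_emb, location, object_emb], params)
visual_context = [0.5, 0.5]         # stand-in for the visual context feature vector
relation_node_init = visual_context + language_context
```

Reading the sequence in both directions lets the subject embedding condition the representation of the object and the reverse, which matches the bidirectional constraint described above.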

The graph reasoning process of this model consists of two graph convolutional network layers, representing the visual level and the semantic level, respectively. In each layer, the feature value of each node is newly updated by exchanging context information between neighboring nodes of the graph based on the initial feature values of each node given in the graph initialization step. In particular, this model uses an attentional graph convolutional neural network (attentional GCN), which distinguishes the nodes to focus on from the nodes not to focus on among the neighboring nodes, so that the information of the neighboring nodes is reflected differentially in the feature value update of each node. The attention value of each node is predicted based on the feature values of the two nodes concerned, as shown in Equations 6 and 7. The terms of Equations 6 and 7 are a two-layer perceptron (MLP) and the parameters for learning.
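One common way to realize an attention value of this kind is to score each neighbor with a small perceptron applied to the two node features and then normalize the scores with softmax. The sketch below follows that pattern as an assumption. The exact form of Equations 6 and 7 is not reproduced, and the weights `w1` and `w2` are hypothetical hand-picked values, not learned parameters.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mlp_score(h_i: Vector, h_j: Vector, w1: Matrix, w2: Vector) -> float:
    """Two-layer perceptron score for the pair (node i, neighbor j)."""
    x = h_i + h_j  # concatenate the two node features
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))


def attention_weights(h_i: Vector, neighbors: List[Vector],
                      w1: Matrix, w2: Vector) -> List[float]:
    """Softmax-normalized attention of node i over its neighbors."""
    scores = [mlp_score(h_i, h_j, w1, w2) for h_j in neighbors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


w1 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
w2 = [1.0, -1.0]
weights = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], w1, w2)
```

With these toy weights the first neighbor, whose feature matches the node's own, receives the larger share of attention.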

When the feature value of an object node is updated using the attentional graph neural network, contextual information is exchanged between the subject object node <-> object object node, the subject object node <-> relation node, and the object object node <-> relation node. On the other hand, when the feature value of a relation node is updated, contextual information exchange occurs between the relation node <-> subject object node and the relation node <-> object object node. Therefore, the update of the feature value of each object node in the graph is the same as Equation 8, whereas the update of the feature value of the relation node is the same as Equation 9.

In Equation 8 and Equation 9, the three node symbols denote a subject node, a relationship node, and an object node, respectively. This node feature value update process is performed in each of the two attentional graph neural network layers, that is, in the visual reasoning step and in the semantic reasoning step. However, the object and relation class probability distribution of each node, which is the result of the visual reasoning step, is provided as the initial node input of the semantic reasoning step.
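The message flow behind the two update rules can be sketched as an attention-weighted sum of neighbor features added to the node's own feature. This is a simplification made for illustration: the learned transformation matrices of Equations 8 and 9 are omitted, object and relation features are assumed to share one dimension, and a ReLU is assumed as the nonlinearity. All names and values are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Group = Tuple[Sequence[float], Sequence[Vector]]  # (attention weights, neighbor features)


def aggregate(weights: Sequence[float], features: Sequence[Vector]) -> Vector:
    """Attention-weighted sum of neighbor features."""
    dim = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]


def update_node(h_self: Vector, neighbor_groups: Sequence[Group]) -> Vector:
    """Add the message from each neighbor group to the node, then apply ReLU."""
    out = list(h_self)
    for weights, features in neighbor_groups:
        message = aggregate(weights, features)
        out = [o + m for o, m in zip(out, message)]
    return [max(0.0, v) for v in out]


# Object node: receives messages from other object nodes and from relation nodes.
new_object = update_node(
    [1.0, 0.0],
    [([0.5, 0.5], [[0.0, 2.0], [2.0, 0.0]]),  # neighboring object nodes
     ([1.0], [[-3.0, 1.0]])],                 # neighboring relation nodes
)

# Relation node: receives messages from its subject node and its object node.
new_relation = update_node(
    [0.0, 1.0],
    [([1.0], [[1.0, 0.0]]),   # subject object node
     ([1.0], [[0.0, 2.0]])],  # object object node
)
```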

Finally, in the graph labeling stage, objects and relationships are classified based on the final feature values of each node obtained in the semantic inference stage. Object nodes are labeled with the largest value in the object class probability distribution. Relational nodes are also labeled through the same process. Through this, a standardized result in the form of <subject-descriptor-object> is obtained.
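Once every node carries a label, the triples of the scene graph can be assembled by reading each relation node together with its subject and object nodes. In the sketch below, the `__no_relation__` background label used to drop unrelated pairs is an assumption. The specification does not state how pairs without a relationship are handled, and the function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

TripleT = Tuple[str, str, str]


def build_scene_graph(object_labels: Dict[int, str],
                      relation_labels: Dict[Tuple[int, int], str],
                      background: str = "__no_relation__") -> List[TripleT]:
    """Turn labeled object and relation nodes into <subject, predicate, object> facts."""
    triples = []
    for (subj_id, obj_id), predicate in relation_labels.items():
        if predicate == background:  # assumed filter for unrelated pairs
            continue
        triples.append((object_labels[subj_id], predicate, object_labels[obj_id]))
    return triples


object_labels = {0: "man", 1: "shoes", 2: "racket"}
relation_labels = {(0, 1): "wearing", (0, 2): "holding", (1, 2): "__no_relation__"}
graph = build_scene_graph(object_labels, relation_labels)
```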

So far, the present invention has been looked at with respect to preferred embodiments thereof. Those of ordinary skill in the art to which the present invention pertains will understand that the present invention can be implemented in a modified form without departing from the essential characteristics of the present invention. Therefore, the disclosed embodiments are to be considered in an illustrative rather than a restrictive sense. The scope of the present invention is indicated in the claims rather than the foregoing description, and all differences within the scope equivalent thereto should be construed as being included in the present invention.

Claims

1. A scene graph generation system using a deep neural network, comprising:

an object region detection unit configured to detect a plurality of object regions in an input image;

an object and relationship detection unit that detects objects and relationships in the image based on the detected object regions, using multi-modal context information including language context features in addition to visual context features based on a convolutional neural network; and

a graph generation unit that generates a scene graph for the input image according to the detection result of the object and relationship detection unit.

2. The system of claim 1, wherein the object region detection unit detects object regions in the input image using Faster R-CNN (Region-based Convolutional Neural Network).

3. The system of claim 1, wherein the object and relationship detection unit comprises:

a graph initialization unit that generates object nodes and relation nodes to construct a graph based on the detected object regions, and assigns an initial feature value to each generated node;

a graph reasoning unit that updates the feature value of each node by exchanging context information between neighboring nodes based on the initial feature values of each node obtained from the graph initialization unit; and

a graph labeling unit that classifies objects and relationships based on the final feature value of each node updated through the graph reasoning unit.

4. The system of claim 3, wherein the graph initialization unit comprises:

an object node initialization unit that creates an object node for each object region and assigns an initial feature value to the created object node; and

a relation node initialization unit that creates one relation node for each pair of object regions and assigns an initial feature value to the created relation node, the initial feature value being multi-modal context information including a text-based linguistic context feature in addition to an image-based visual context feature.

5. The system of claim 4, wherein the object node initialization unit assigns the visual feature and the object class probability distribution of each object region as the initial feature value of each object node.

6. The system of claim 4, wherein the linguistic context feature includes a feature in which the expected object category of the subject object is embedded, location information in the image of the subject object region and the object object region, and a feature in which the expected category name of the object object is embedded.

7. The system of claim 6, wherein the visual context feature includes the visual feature of the entire input image, the visual feature of the image region surrounding the subject object region and the object object region that can form a relationship, and location information of the region surrounding the subject object and the object object.

8. The system of claim 6, wherein the relation node initialization unit embeds the components of the linguistic context feature using a bidirectional recurrent neural network.

9. The system of claim 3, wherein the graph reasoning unit uses an attentional graph convolutional neural network to identify the nodes to focus on among neighboring nodes, and differentially reflects the information of the neighboring nodes in updating the feature value of each node.

10. The system of claim 9, wherein the graph reasoning unit is composed of a visual reasoning layer based on the attentional graph convolutional neural network and a semantic reasoning layer based on the attentional graph convolutional neural network, and the object and relation class probability distribution of each node resulting from the visual reasoning layer is provided as an initial input of the semantic reasoning layer.

11. A scene graph generation method using a deep neural network, comprising:

an object region detection step of detecting a plurality of object regions in an input image;

an object and relationship detection step of detecting objects and relationships in the image based on the detected object regions, using multi-modal context information including language context features in addition to visual context features based on a convolutional neural network; and

a graph generation step of generating a scene graph for the input image according to the detection result.

12. The method of claim 11, wherein the object and relationship detection step comprises:

a graph initialization step of generating object nodes and relation nodes to construct a graph based on the detected object regions, and assigning an initial feature value to each generated node;

a graph reasoning step of updating the feature value of each node by exchanging context information between neighboring nodes based on the initial feature values of each node; and

a graph labeling step of classifying objects and relationships based on the final feature value of each node updated through the graph reasoning step.

13. The method of claim 12, wherein the graph initialization step comprises:

an object node initialization step of creating an object node for each object region and assigning an initial feature value to the created object node; and

a relation node initialization step of creating one relation node for each pair of object regions and assigning, as the initial feature value of the created relation node, multi-modal context information including a text-based linguistic context feature in addition to an image-based visual context feature.

14. The method of claim 13, wherein, in the relation node initialization step, the linguistic context feature includes a feature in which the expected object category of the subject object is embedded, location information of the subject object region and the object object region in the image, and a feature in which the expected category name of the object object is embedded, and the visual context feature includes the visual feature of the entire input image, the visual feature of the image region surrounding the subject object region and the object object region that can form a relationship, and location information of the region surrounding the subject object and the object object.

15. The method of claim 14, wherein the relation node initialization step embeds the components of the linguistic context feature using a bidirectional recurrent neural network.

16. The method of claim 14, wherein the object node initialization step assigns the visual feature and the object class probability distribution of each object region as the initial feature value of each object node.

17. The method of claim 12, wherein the graph reasoning step uses an attentional graph convolutional neural network to identify the nodes to focus on among neighboring nodes, and differentially reflects the information of the neighboring nodes in updating the feature value of each node.

18. The method of claim 17, wherein the graph reasoning step comprises:

a visual reasoning step of updating the feature value of each node by exchanging context information between neighboring nodes using an attentional graph convolutional neural network for visual reasoning, based on the initial feature values of each node given in the graph initialization step; and

a semantic reasoning step of updating the feature value of each node by exchanging context information between neighboring nodes using an attentional graph convolutional neural network for semantic reasoning,

wherein the object and relation class probability distribution of each node obtained through the visual reasoning step is provided as the initial node input of the semantic reasoning step.