CN114677544A

CN114677544A - Scene graph generation method, system and equipment based on global context interaction

Info

Publication number: CN114677544A
Application number: CN202210297025.7A
Authority: CN
Inventors: 罗敏楠; 杨名帆; 郑庆华; 董怡翔; 刘欢; 秦涛
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2022-03-24
Filing date: 2022-03-24
Publication date: 2022-06-28
Anticipated expiration: 2042-03-24
Also published as: CN114677544B

Abstract

The invention discloses a scene graph generation method, system and device based on global context interaction, 1) vector joint representation based on the fusion of various features such as object visual features, spatial coordinates, semantic labels, etc.; 2) based on bidirectional gated cyclic neural network Network global feature generation; 3) message iterative delivery mechanism based on global feature vector; 4) scene graph generation based on target and relation state representation. Compared with the existing scene graph generation method, the scene graph generation method based on the global context interaction disclosed in the present invention fully utilizes the global features of the image through context interaction, and has more extensive application; meanwhile, the global features after context interaction are obtained. Then, the message transfer between the target pair and its relationship is carried out, and the existing state is updated by using the potential connection between the targets to generate a more accurate scene graph, which has the advantage of practical application.

Description

A method, system and device for generating scene graph based on global context interaction

技术领域technical field

本发明属于计算机视觉领域，特别涉及一种基于全局上下文交互的场景图生成方法及系统及设备。The invention belongs to the field of computer vision, and in particular relates to a scene graph generation method, system and device based on global context interaction.

背景技术Background technique

由<主语-关系-宾语>三元组构成的场景图能够描述图像中的物体及物体对之间的场景结构关系。场景图主要有两个方面的优点：首先，场景图的<主语-关系-宾语>三元组具有结构化的语义内容，相较于自然语言文本，在细粒化的信息获取与处理过程中有明显优势；其次，场景图能够充分表示图像中的物体及场景结构关系，在多种计算机视觉任务中有广泛的应用前景，例如：在车辆自动驾驶领域，使用场景图进行环境建模可以为决策系统提供更全面的环境信息；在语义图像检索任务中，图像供应商通过场景图对图像的场景结构关系进行建模，使得用户仅需要对主要目标或关系进行描述即可检索到符合需求的图像。基于海量图片以及下游任务对场景图的实时要求，使用计算机进行场景图生成逐渐成为研究热点，对图像理解领域具有重要的意义。The scene graph composed of <subject-relation-object> triples can describe the scene structure relationship between objects and object pairs in the image. The scene graph has two main advantages: First, the <subject-relation-object> triplet of the scene graph has structured semantic content. Compared with natural language text, in the process of fine-grained information acquisition and processing There are obvious advantages; secondly, the scene graph can fully represent the objects in the image and the structure relationship of the scene, and has a wide range of application prospects in a variety of computer vision tasks. The decision-making system provides more comprehensive environmental information; in the semantic image retrieval task, the image supplier models the scene structure relationship of the image through the scene graph, so that the user only needs to describe the main target or relationship to retrieve the desired image. image. Based on the massive images and the real-time requirements of downstream tasks for scene graphs, the use of computers to generate scene graphs has gradually become a research hotspot, which is of great significance to the field of image understanding.

现有的基于消息传递的场景图生成方法目标检查的结果构建目标节点和关系边，并基于消息传递机制，利用循环神经网络在局部子图内进行状态更新，将消息传递后的特征用于关系预测。此种方法采用基于局部上下文思想的消息传递机制，忽略目标之间的隐含约束，仅将目标节点的视觉特征作为初始状态，对关系的检测仅依赖于其主宾语节点特征、联合视觉特征的反复交流，模型无法考虑图像的整体结构，全局信息未在关系预测中发挥作用，因此，限制了模型的预测能力。此外，现有方法未能利用物体坐标，没有从空间角度分析目标间的视觉关系。针对以上问题，本发明提出了一种基于全局上下文交互的场景图生成方法。对现存的场景图生成方法：The existing message-passing-based scene graph generation method constructs target nodes and relation edges based on the result of target inspection, and uses recurrent neural network to update the state in the local subgraph based on the message-passing mechanism, and uses the message-passing features for the relation predict. This method adopts the message passing mechanism based on the idea of local context, ignores the implicit constraints between the targets, only takes the visual features of the target nodes as the initial state, and the detection of the relationship only depends on its subject-object node features and joint visual features. After repeated communication, the model cannot consider the overall structure of the image, and global information does not play a role in relation prediction, therefore, limiting the predictive ability of the model. In addition, existing methods fail to utilize object coordinates and do not analyze the visual relationship between objects from a spatial perspective. In view of the above problems, the present invention proposes a scene graph generation method based on global context interaction. For existing scene graph generation methods:

现有技术1提出了一种图像场景图生成方法，该方法采用将关系分为父类与子类的方式，进行双重关系预测，并采用归一化函数确定精确关系，生成该图像的场景图。The prior art 1 proposes a method for generating an image scene graph. The method adopts a method of dividing relationships into parent classes and subclasses, performs dual relationship prediction, and uses a normalization function to determine the exact relationship to generate a scene graph of the image. .

现有技术2提出了一种基于深度关系自注意力网络的场景图生成方法，方法主要包括：首先，对输入图像进行目标检测，获得标签、物体边框特征、联合边框特征；然后，构建目标特征、相对关系特征；最后，利用深度神经网络生成最终的视觉场景图。The prior art 2 proposes a scene graph generation method based on a deep relational self-attention network. The method mainly includes: first, performing target detection on an input image to obtain labels, object frame features, and joint frame features; then, constructing target features , relative relationship features; finally, a deep neural network is used to generate the final visual scene graph.

现有技术1中的场景图生成方法没有考虑以特征融合方式充分利用特征向量；现有技术2的方法未使用消息传递机制，没有考虑进行目标对与其关系间的信息交互，不能进行上下文传递后的状态更新。且两者均没有使用图像中全体目标之间存在的隐含约束来构建上下文，存在一定不足。The scene graph generation method in the prior art 1 does not consider fully utilizing the feature vector by feature fusion; the method in the prior art 2 does not use the message passing mechanism, does not consider the information exchange between the target pair and its relationship, and cannot perform context transfer. status update. And neither of them uses the implicit constraints that exist among all the objects in the image to construct the context, which has certain shortcomings.

发明内容SUMMARY OF THE INVENTION

本发明的目的在于提供一种基于全局上下文交互的场景图生成方法及系统及设备，以解决上述问题。The purpose of the present invention is to provide a scene graph generation method, system and device based on global context interaction to solve the above problems.

为实现上述目的，本发明采用以下技术方案：To achieve the above object, the present invention adopts the following technical solutions:

与现有技术相比，本发明有以下技术效果：Compared with the prior art, the present invention has the following technical effects:

本发明相较于使用视觉特征代表目标特征的特征表示方法，本发明充分利用目标视觉特征、类别特征与空间坐标信息，使得本发明对信息利用更加充分，提升了场景图生成的关系预测性能；Compared with the feature representation method that uses visual features to represent target features, the present invention makes full use of target visual features, category features and spatial coordinate information, so that the present invention utilizes information more fully and improves the relationship prediction performance of scene graph generation;

本发明相较于使用局部上下文交互的场景图生成方法，本发明利用循环神经网络进行图像的全局上下文提取，实现基于全局上下文的信息交互，随后进行消息传递，充分实现数据交互与信息拓展。Compared with the scene graph generation method using local context interaction, the present invention uses the cyclic neural network to extract the global context of the image, realizes information interaction based on the global context, and then performs message transmission, fully realizing data interaction and information expansion.

附图说明Description of drawings

图1是本发明基于全局上下文交互的场景图生成方法框图。FIG. 1 is a block diagram of a scene graph generation method based on global context interaction according to the present invention.

图2是基于特征融合的向量联合表示的流程图。Figure 2 is a flowchart of a vector joint representation based on feature fusion.

图3是双向门控循环神经网络BiGRU的结构图。Figure 3 is a structural diagram of the bidirectional gated recurrent neural network BiGRU.

图4是基于全局特征向量的消息迭代传递机制的流程图。Figure 4 is a flow chart of the message iterative delivery mechanism based on the global feature vector.

图5是目标检测结果及对应场景图示意图。FIG. 5 is a schematic diagram of a target detection result and a corresponding scene graph.

图6是本发明性能测试结果图。Fig. 6 is the performance test result graph of the present invention.

具体实施方式Detailed ways

以下结合附图及实施例对本发明的实施方式进行详细说明。需要说明的是，此处描述的实施例只用以解释本发明，并不用于限定本发明。此外，在不冲突的情况下，本发明中的实施例涉及的技术特征可以相互结合。The embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples. It should be noted that the embodiments described herein are only used to explain the present invention, and are not used to limit the present invention. In addition, the technical features involved in the embodiments of the present invention may be combined with each other without conflict.

本发明的具体实施过程包括图像的目标检测与特征向量融合、基于全局上下文交互的特征生成和消息传递过程。图1是本发明基于全局上下文交互的场景图生成方法框图。The specific implementation process of the present invention includes image target detection and feature vector fusion, feature generation based on global context interaction, and message transmission processes. FIG. 1 is a block diagram of a scene graph generation method based on global context interaction according to the present invention.

1.图像的目标检测与特征向量融合1. Image target detection and feature vector fusion

给出输入图像后，本发明使用Faster-RCNN深度学习模型进行目标检测，得到其目标集合O＝(o₁,o₂,…,o_n)，对应的视觉特征集合V＝(v₁,v₂,…,v_n)，坐标特征集合B＝(b₁,b₂,…,b_n)、预分类标签集合L＝(l₁,l₂,…,l_n)、两两目标坐标并集框内的视觉特征C＝(c_i→j,i≠j)。After the input image is given, the present invention uses the Faster- _RCNN deep learning model for target detection, and obtains its target set O=(o ₁ ,o ₂ ,...,on ), and the corresponding visual feature set V=(v ₁ ,v ₂ ,...,v _n ), coordinate feature set B=(b ₁ ,b ₂ ,...,b _n ), pre-classification label set L=(l ₁ ,l ₂ ,...,l _n ), pairwise target coordinates The visual feature C=( _ci→j , i≠j) in the set frame.

首先，本发明使用特征融合方法，对每个目标对应的空间坐标特征b_i、视觉特征的向量v_i进行联合表示。对于目标o_i，其绝对位置坐标b＝(x₁,y₁,x₂,y₂)，其中x₁,y₁,x₂,y₂分别代表其矩形回归框左上与右下坐标，本发明利用如下公式将其转化为在图像中相对位置编码b_i：First, the present invention uses a feature fusion method to jointly represent the spatial coordinate feature b _i corresponding to each target and the vector v _i of the visual feature. For the target o _i , its absolute position coordinates b=(x ₁ , y ₁ , x ₂ , y ₂ ), where x ₁ , y ₁ , x ₂ , y ₂ represent the upper left and lower right coordinates of its rectangular regression frame, respectively. The invention uses the following formula to convert it into the relative position code _bi in the image:

式中，wid代表图像I原有宽度，hei代表图像I原有高度。In the formula, wid represents the original width of image I, and hei represents the original height of image I.

然后，使用神经网络的全连接层将相对位置编码b_i扩充为128维特征s_i：Then, the relative position encoding b _i is augmented into a 128-dimensional feature s _i using the fully connected layers of the neural network:

s_i＝σ(W_sb_i+b_s),s _i =σ(W _s b _i +b _s ),

其中，σ代表ReLU激活函数，W_s与b_s为线性变换参数，由神经网络自行学习调整。同时，本方法使用全连接层将目标视觉特征v_i由4096维特征转为512维。Among them, σ represents the ReLU activation function, W _s and b _s are linear transformation parameters, which are learned and adjusted by the neural network. At the same time, this method uses a fully connected layer to convert the target visual feature v _i from 4096-dimensional features to 512-dimensional features.

随后，本发明将经过维度变换的相对位置特征向量s_i和视觉特征v_i进行拼接并维度变换，得到512维目标视觉与坐标特征融合向量f_i，计算流程如下所示：Subsequently, the present invention splices and dimensionally transforms the relative position feature vector _si and the visual feature vi that have undergone dimension transformation to obtain a 512-dimensional target vision and coordinate feature fusion vector f _i _, and the calculation process is as follows:

f_i＝σ(W_f[s_i,v_i]+b_f),f _i =σ(W _f [ _s _i ,vi ]+b _f ),

式中，[·]代表拼接操作，σ代表ReLU激活函数，W_f与b_f为线性变换参数。In the formula, [ ] represents the splicing operation, σ represents the ReLU activation function, and W _f and b _f are linear transformation parameters.

以上特征向量融合流程如图2所示。The above feature vector fusion process is shown in Figure 2.

2.基于双向门控循环神经网络的全局特征生成2. Global feature generation based on bidirectional gated recurrent neural network

在全局特征生成过程中，本发明构建双向门控循环神经网络BiGRU，并使用零向量作为其初始状态，其结构如图3所示。在得到目标集合的特征融合向量F＝(f₁,f₂,…,f_n)后，将其按照相对坐标中的第一项x坐标由左向右进行排序，并按序输入BiGRU中，得到全局上下文目标特征γ＝(γ₁,γ₂,…,γ_n)。具体生成步骤为：In the process of global feature generation, the present invention constructs a bidirectional gated recurrent neural network BiGRU, and uses the zero vector as its initial state, the structure of which is shown in Figure 3. After obtaining the feature fusion vector F=(f ₁ , f ₂ ,..., f _n ) of the target set, sort it from left to right according to the x coordinate of the first item in the relative coordinates, and input it into BiGRU in sequence, Obtain the global context target feature γ=(γ ₁ ,γ ₂ ,...,γ _n ). The specific generation steps are:

(1)初始化零向量作为BiGRU初始状态；(1) Initialize the zero vector as the initial state of BiGRU;

(2)在BiGRU两端，分别将目标集合中的第一个与最后一个特征融合向量f₀与f_n输入，生成对应方向与顺序的隐藏状态

(2) At both ends of BiGRU, input the first and last feature fusion vectors f ₀ and f _n in the target set, respectively, to generate hidden states corresponding to the direction and order.

(3)按序依次向BiGRU两端输入特征向量，生成

(3) Input feature vectors to both ends of BiGRU in sequence to generate

(4)将正向、逆向隐藏状态融合，得到每个目标的上下文融合状态γ_i。(4) Integrate the forward and reverse hidden states to obtain the context fusion state γ _i of each target.

随后，本发明利用Glove词嵌入向量，将目标检测过程中对目标的预分类结果L＝(l₁,l₂,…,l_n)转换为128维的目标类别特征向量g_i。Subsequently, the present invention uses the Glove word embedding vector to convert the pre-classification result L=(l ₁ ,l ₂ ,...,l _n ) of the target in the target detection process into a 128-dimensional target category feature vector g _i .

最后，本发明使用神经网络全连接层将每个目标的全局上下文目标特征γ_i与其类别特征向量g_i进行融合，得到此目标的全局特征c_i。上述计算过程如公式所示：Finally, the present invention uses a neural network fully connected layer to fuse the global contextual target feature γ _i of each target with its category feature vector _gi to obtain the global feature _ci of the target. The above calculation process is shown in the formula:

g_i＝Glove(l_i),g _i =Glove(li _i ),

c_i＝σ(W_c[γ_i,g_i]+b_c),c _i =σ(W _c [γ _i , _gi ]+b _c ),

其中，Glove(l_i)代表使用Glove方式对目标的预分类标签进行编码，[·]代表拼接操作,W_c与b_c为线性变换参数。Among them, Glove(l _i ) represents the use of Glove to encode the pre-classification label of the target, [ ] represents the splicing operation, and W _c and b _c are linear transformation parameters.

3.基于全局特征向量的消息迭代传递机制3. Message iterative delivery mechanism based on global feature vector

消息迭代传递机制分为消息聚合函数和状态更新函数两部分。The message iterative delivery mechanism is divided into two parts: the message aggregation function and the state update function.

首先，本发明构建消息聚合函数：在场景图拓扑中，节点与边分别表示视觉关系中的主宾语目标及其关系，在消息传递时，单一节点或边会同时收到多个来源的信息，需要设计池化函数以计算每部分消息的权重，并使用其加权和以聚合最终的传入消息。根据消息的接收者不同，可将传入消息为由目标节点接收的消息

与由关系边接收的消息

First, the present invention constructs a message aggregation function: in the scene graph topology, nodes and edges respectively represent the subject-object target and its relationship in the visual relationship, and during message transmission, a single node or edge will receive information from multiple sources at the same time, A pooling function needs to be designed to calculate the weight of each part of the message and use its weighted sum to aggregate the final incoming message. Depending on the recipient of the message, incoming messages can be treated as messages received by the destination node

with messages received by relational edges

已知当前节点GRU和关系边GRU的隐藏状态

与

将第t次迭代时传入第i个节点的消息表示为

由目标GRU自身隐藏状态

其出度边GRU隐藏状态

入度边隐藏状态

计算得到，其中i→j代表此关系中目标i为主语，目标j为宾语。Know the hidden state of the current node GRU and relation edge GRU

and

Denote the message passed to the i-th node at the t-th iteration as

Hidden state by the target GRU itself

Its out-degree edge GRU hidden state

In-degree edge hidden state

Calculated, where i→j represents the target i as the subject and the target j as the object in this relation.

相似的，第t次迭代时由第i个目标节点到第j个目标节点的关系边，其聚合消息为

由关系边GRU的上一迭代对应的隐藏状态

主语节点GRU隐藏状态

宾语节点GRU隐藏状态

组成。

与

由以下自适应加权函数求得：Similarly, in the t-th iteration, from the i-th target node to the j-th target node, the aggregated message is

The hidden state corresponding to the previous iteration of the relational edge GRU

Subject node GRU hidden state

Object Node GRU Hidden State

composition.

and

It is obtained by the following adaptive weighting function:

其中，[·]代表拼接操作，σ代表ReLU激活函数，w₁、w₂和v₁、v₂是可学习参数。Among them, [ ] represents the splicing operation, σ represents the ReLU activation function, and w ₁ , w ₂ and v ₁ , v ₂ are learnable parameters.

其次，本发明构建状态更新函数：分别构建目标节点GRU和关系边GRU，对目标和目标间关系的特征向量

的存储和更新。首先，在t＝0时，将每个目标节点与关系边的GRU状态初始化为零向量，将目标的全局特征向量c_i作为目标节点GRU的输入，将两两目标坐标并集框内的视觉特征c_i→j作为其关系边GRU的输入，分别生成目标节点和关系边在初始时刻的隐藏状态

Secondly, the present invention constructs a state update function: constructing the target node GRU and the relationship edge GRU respectively, and the feature vector of the relationship between the target and the target

storage and updating. First, at t=0, initialize the GRU state of each target node and relation edge to a zero vector, use the global feature vector c _i of the target as the input of the target node GRU, and combine the pairwise target coordinates to set the visual in the frame. The feature c _i→j is used as the input of its relational edge GRU to generate the hidden state of the target node and relational edge at the initial moment, respectively

在后续迭代中，每一次迭代t，每个GRU，根据其是目标GRU或关系GRU，将其上一迭代的隐藏状态

或

和上一迭代的传入消息

或

作为输入，并生成一个新的隐藏状态

或

作为输出，用于消息聚合函数生成下一次迭代的消息。In subsequent iterations, at each iteration t, each GRU, depending on whether it is a target GRU or a relational GRU, converts the hidden state of the previous iteration

or

and incoming messages from the previous iteration

or

as input, and generate a new hidden state

or

As output, the message used for the message aggregation function to generate the message for the next iteration.

故整个消息传递机制的具体步骤为：Therefore, the specific steps of the entire message passing mechanism are as follows:

(1)将每个目标节点与关系边的GRU状态初始化为零向量；(1) Initialize the GRU state of each target node and relation edge to a zero vector;

(2)将目标的全局特征向量c_i作为目标节点GRU的输入，将两两目标坐标并集框内的视觉特征c_i→j作为其关系边GRU的输入，分别生成目标节点和关系边在初始时刻的隐藏状态

(2) The global feature vector c _i of the target is used as the input of the target node GRU, and the visual feature c _i→j in the union frame of the paired target coordinates is used as the input of its relational edge GRU, and the target node and relational edge are generated respectively in Hidden state at the initial moment

(3)利用消息聚合函数，计算每个目标与关系的接收到的消息

与

(3) Using the message aggregation function, calculate the received messages for each target and relationship

and

(4)结合隐藏状态

接受到的消息

与

利用GRU更新状态，得到下一时刻状态

(4) Combine the hidden state

received message

and

Use GRU to update the state to get the state at the next moment

(5)若迭代次数达到设定次数，则保存当前目标与关系的状态

否则，返回步骤(3)。(5) If the number of iterations reaches the set number, save the current state of the target and relationship

Otherwise, go back to step (3).

上述消息传递机制流程如图4所示。The flow of the above message passing mechanism is shown in FIG. 4 .

4.基于目标与关系状态表示的场景图生成4. Scene graph generation based on object and relational state representation

将经过消息传递机制更新后的目标与关系隐藏状态视为目标与关系的特征向量，送入神经网络中，使用softmax函数对目标、关系分别进行类别预测，得到每个目标的种类，以及每一对目标之间的关系类别，进而得到能够反映图像中目标与目标间关系的场景图。The target and relationship hidden states updated by the message passing mechanism are regarded as the feature vector of the target and the relationship, and sent to the neural network. The softmax function is used to predict the target and relationship respectively, and the type of each target and For the relationship categories between objects, a scene graph that can reflect the relationship between objects and objects in the image is obtained.

给定输入图像后，目标检测结果及对应场景图示意图如图5所示，本模型的性能测试结果如图6所示。Given the input image, the target detection results and the corresponding scene graph are shown in Figure 5, and the performance test results of this model are shown in Figure 6.

本发明再一实施例中，提供一种基于全局上下文交互的场景图生成系统，能够用于实现上述的基于全局上下文交互的场景图生成方法，具体的，该系统包括：In yet another embodiment of the present invention, a system for generating a scene graph based on global context interaction is provided, which can be used to implement the above-mentioned method for generating scene graphs based on global context interaction. Specifically, the system includes:

目标检测模块，用于对输入图像I进行目标检测，得到其目标集合O＝(o₁,o₂,…,o_n)，以及对应的视觉特征集合V＝(v₁,v₂,…,v_n)、坐标特征集合B＝(b₁,b₂,…,b_n)、预分类标签集合L＝(l₁,l₂,…,l_n)、两两目标坐标并集框内的视觉特征C＝(c_i→j,i≠j)；The target detection module is used for target detection on the input image I to obtain its target set O=(o ₁ , _o ₂ ,...,on ), and the corresponding visual feature set V=(v ₁ ,v ₂ ,..., v _n ), coordinate feature set B=(b ₁ ,b ₂ ,...,b _n ), pre-classification label set L=(l ₁ ,l ₂ ,...,l _n ), in the frame of the union of pairwise target coordinates Visual feature C=( _ci→j , i≠j);

目标视觉与坐标特征的联合表示向量获取模块，用于利用神经网络将各目标的绝对位置坐标，转化得到目标视觉与坐标特征的联合表示向量f_i；The joint representation vector acquisition module of target vision and coordinate features is used to convert the absolute position coordinates of each target using a neural network to obtain a joint representation vector f _i of target vision and coordinate features;

目标全局特征获取模块，用于根据特征融合向量F＝(f₁,f₂,…,f_n)，得到局上下文目标特征γ_i与其类别特征向量g_i，使用神经网络将目标的全局上下文目标特征γ_i与其类别特征向量g_i进行融合，得到此目标的全局特征c_i；The target global feature acquisition module is used to obtain the local context target feature γ _i and its category feature vector g _i according to the feature fusion vector F=(f ₁ , f ₂ ,..., f _n ), and use the neural network to fuse the target global context target The feature γ _i is fused with its category feature vector _gi to obtain the global feature _ci of the target;

场景图获取模块，用于基于每个目标的全局特征向量c_i，每个关系的特征向量c_i→j，初始化其隐藏状态

进而初始计算各节点传入消息

各边传入消息

并进行迭代传递，利用循环神经网络更新隐藏状态

并进行消息聚合得到各时刻i的传入消息

直至达到设置的迭代次数，然后利用目标节点与关系边的最终状态生成能够反映图像中目标与目标间关系的场景图。A scene graph acquisition module for initializing its hidden state based on the global feature vector c _i of each target and the feature vector c _i→j of each relation

Then initially calculate the incoming messages of each node

incoming messages from all sides

And iterative pass, using recurrent neural network to update the hidden state

And perform message aggregation to get the incoming message at each time i

Until the set number of iterations is reached, and then use the final state of the target node and the relationship edge to generate a scene graph that can reflect the relationship between the target and the target in the image.

本发明实施例中对模块的划分是示意性的，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，另外，在本发明各个实施例中的各功能模块可以集成在一个处理器中，也可以是单独物理存在，也可以两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现，也可以采用软件功能模块的形式实现。The division of modules in the embodiments of the present invention is schematic, and is only a logical function division. In actual implementation, there may be other division methods. In addition, each functional module in each embodiment of the present invention may be integrated into one processing unit. In the device, it can also exist physically alone, or two or more modules can be integrated into one module. The above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules.

本发明再一个实施例中，提供了一种计算机设备，该计算机设备包括处理器以及存储器，所述存储器用于存储计算机程序，所述计算机程序包括程序指令，所述处理器用于执行所述计算机存储介质存储的程序指令。处理器可能是中央处理单元(CentralProcessing Unit，CPU)，还可以是其他通用处理器、数字信号处理器(Digital SignalProcessor、DSP)、专用集成电路(Application Specific Integrated Circuit，ASIC)、现成可编程门阵列(Field-Programmable GateArray，FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等，其是终端的计算核心以及控制核心，其适于实现一条或一条以上指令，具体适于加载并执行计算机存储介质内一条或一条以上指令从而实现相应方法流程或相应功能；本发明实施例所述的处理器可以用于基于全局上下文交互的场景图生成方法的操作。In yet another embodiment of the present invention, a computer device is provided, the computer device includes a processor and a memory, the memory is used for storing a computer program, the computer program includes program instructions, and the processor is used for executing the computer Program instructions stored in the storage medium. The processor may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable GateArray, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., which are the computing core and control core of the terminal, which are suitable for implementing one or more instructions, specifically suitable for One or more instructions in the computer storage medium are loaded and executed to implement the corresponding method process or corresponding function; the processor according to the embodiment of the present invention can be used for the operation of the scene graph generation method based on the global context interaction.

本发明公开了一种基于全局上下文交互的场景图生成方法，1)基于物体视觉特征、空间坐标、语义标签等多种特征融合的向量联合表示；2)基于双向门控循环神经网络的全局特征生成；3)基于全局特征向量的消息迭代传递机制；4)基于目标与关系状态表示的场景图生成。本发明所公开的基于全局上下文交互的场景图生成方法，同现存的场景图生成方法相比，通过上下文交互充分利用图像的全局特征，更具有应用广泛性；同时，得到上下文交互后的全局特征后进行目标对与其关系间的消息传递，利用目标间的潜在联系更新现有状态，进行更准确的场景图生成，具有实际应用的优势。The invention discloses a scene graph generation method based on global context interaction, 1) vector joint representation based on the fusion of various features such as object visual features, spatial coordinates, semantic labels, etc.; 2) global features based on bidirectional gated cyclic neural network generation; 3) message iterative delivery mechanism based on global feature vector; 4) scene graph generation based on target and relation state representation. Compared with the existing scene graph generation method, the method for generating a scene graph based on global context interaction disclosed in the present invention fully utilizes the global feature of the image through context interaction, and has more extensive application; at the same time, the global feature after context interaction is obtained. Then, the message transfer between the target pair and its relationship is carried out, and the existing state is updated by using the potential connection between the targets to generate a more accurate scene graph, which has the advantage of practical application.

最后应当说明的是：以上实施例仅用以说明本发明的技术方案而非对其限制，尽管参照上述实施例对本发明进行了详细的说明，所属领域的普通技术人员应当理解：依然可以对本发明的具体实施方式进行修改或者等同替换，而未脱离本发明精神和范围的任何修改或者等同替换，其均应涵盖在本发明的权利要求保护范围之内。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention rather than to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: the present invention can still be Modifications or equivalent replacements are made to the specific embodiments of the present invention, and any modifications or equivalent replacements that do not depart from the spirit and scope of the present invention shall be included within the protection scope of the claims of the present invention.

Claims

1. A scene graph generation method based on global context interaction is characterized by comprising

Carrying out target detection on the input image I to obtain a target set O ═ O₁，o₂，…，o_n) And a corresponding set of visual features V ═ (V ═ V)₁，v₂，…，v_n) And the coordinate feature set B ═ B₁，b₂，…，b_n) And the set of pre-classified labels L ═ L₁，l₂，…，l_n) And (C) combining the two target coordinates and collecting the visual characteristics C in the frame_i→j，i≠j)；

The absolute position coordinates of each target are converted by utilizing a neural network to obtain a joint expression vector f of the target vision and coordinate characteristics_i；

According to the feature fusion vector F ═ F₁，f₂，…，f_n) Obtaining local context target characteristics gamma_iAnd its class feature vector g_iUsing a neural network to map the global context target feature gamma of the target_iAnd its class feature vector g_iFusing to obtain the global feature c of the target_i；

Global feature vector c based on each target_iFeature vector c of each relation_i→jInitialize its hidden state

Further, each node incoming message is initially calculated

Each side incoming message

And performing iterative transfer, and updating hidden state by using recurrent neural network

And carrying out message aggregation to obtain the incoming message of each time i

And generating a scene graph capable of reflecting the relation between the target and the target in the image by using the final states of the target node and the relation edge until the set iteration number is reached.

2. The method as claimed in claim 1, wherein the neural network is used to convert the absolute position coordinates of each target into relative position codes in the image and expand the relative position codes into relative position features s_iVisual features v of the object_iConverting into 512 dimensions, adopting a feature fusion method to obtain a relative position feature vector s_iAnd a visual feature v_iSplicing and converting to obtain a joint expression vector f of the target vision and the coordinate characteristics_i。

3. The method as claimed in claim 2, wherein in the feature fusion-based vector joint representation, after target detection is performed on the input image I by using the fast-RCNN model, the absolute position coordinates of the target are converted into the relative position code b in the image_iFor the object o_iIts coordinate (x)₁，y₁，x₂，y₂) Wherein x is₁，y₁，x₂，y₂Respectively representing the upper left coordinate and the lower right coordinate of a rectangular regression frame, and a relative position code calculation formula:

in the formula, wid represents the original width of the image I, hei represents the original height of the image I; then, the relative position is encoded b using the full connection layer_iExtended to 128-dimensional features s_i：

s_i＝σ(W_sb_i+b_s)，

Where σ represents the ReLU activation function, W_sAnd b_sThe parameters are linear transformation parameters and are automatically learned and adjusted by a neural network; meanwhile, the visual characteristics v of the target are obtained by detecting the target by the same method_iPerforming dimension transformation, and converting 4096-dimensional features into 512-dimensional features by using a full connection layer; then, the relative position feature vector s subjected to dimension transformation_iAnd a visual feature v_iSplicing and converting are carried out, and finally a 512-dimensional target vision and coordinate feature fusion vector f is obtained_iThe calculation flow is as follows:

f_i＝σ(W_f[s_i，v_i]+b_f)，

in the formula [ ·]Represents the stitching operation, σ represents the ReLU activation function, W_fAnd b_fAre linear transformation parameters.

4. The method of claim 1, wherein the fusion vector is F ═ F (F) according to features₁，f₂，…，f_n) Obtaining global context target characteristic gamma (gamma) by using bidirectional gating recurrent neural network (BiGRU)₁，γ₂，…，γ_n) (ii) a Classifying the target by the target detection module to obtain a result L ═ L₁，l₂，…，l_n) To obtain the category feature vector g of each target_iUsing a neural network to characterize the global context of the target by the target feature gamma_iAnd its class feature vector g_iPerforming fusion to obtain the global characteristic c of the target_i。

5. The method as claimed in claim 4, wherein in the global feature generation process based on the bidirectional gated recurrent neural network, a feature fusion vector F ═ F (F) of the target set is obtained₁，f₂，…，f_n) Then, it is expressed as x-coordinate in relative coordinatesSequencing from left to right, and inputting the sequence into a bidirectional gated recurrent neural network BiGRU to realize global context interaction to obtain a global context target characteristic gamma (gamma is ═₁，γ₂，…，γ_n)；

Subsequently, the classification result L ═ of the target using the target detection (L)₁，l₂，…，l_n) Calculating a Glove word embedding vector of the classification label to obtain a 128-dimensional target class feature vector g_iFinally, the global context target feature gamma of each target is determined_iAnd its class feature vector g_iFusing to obtain the global feature c of the target_iThe above calculation process is shown as the formula:

g_i＝Glove(l_i)，

c_i＝σ(W_c[γ_i，g_i]+b_c)，

wherein, Glove (l)_i) Represents the encoding of a pre-sorted tag of an object using the Glove approach [. cndot]Representing a splicing operation, W_cAnd b_cAre linear transformation parameters.

6. The method as claimed in claim 5, wherein γ is a set of parameters_iThe specific generation steps are as follows:

(1) initializing a zero vector as a BiGRU initial state;

(2) at two ends of the BiGRU, respectively fusing the first and the last feature fusion vectors f in the target set₀And f_nInputting, generating hidden states corresponding to the direction and sequence

(3) Sequentially inputting feature vectors to two ends of the BiGRU to generate

(4) Fusing the forward and reverse hidden states to obtain the context of each targetFusion state gamma_i。

7. The method for generating a scene graph based on global context interaction according to claim 1, wherein the message iterative transfer mechanism based on the global feature vector comprises two calculation functions of constructing a message aggregation function and a state update function;

constructing a message aggregation function: known ith target node GRU hidden state

Hidden state of relationship edge GRU from ith target node to jth target node

The message that is transmitted into the ith node at the t iteration is represented as

Then

Hidden state by the target GRU itself

Its out-of-range GRU hidden state

In-degree edge hidden state

Calculated, where i → j represents that target i is the subject and target j is the object in the relationship:

similarly, the ith target node goes to the jth target node at the tth iterationAggregated messages for relational edges of individual target nodes

Hidden states corresponding to last iteration of relational edge GRU

Subject node GRU hidden state

Object node GRU hidden state

The components of the components are as follows,

and

the following adaptive weighting function:

wherein [ ·]Represents the stitching operation, σ represents the ReLU activation function, w₁、w₂And v₁、v₂Is a learnable parameter;

constructing a state updating function: respectively constructing a target node GRU and a relation edge GRU, and carrying out feature vector on the relation between the targets

Storage and update of (2): first, when t is 0, the GRU state of each target node and relationship edge is initialized to a zero vector, and the global feature vector c of the target is initialized to a zero vector_iAs the input of the target node GRU, combining two target coordinates and collecting the visual characteristics c in the frame_i→jAs input to its relational edge GRU, respectively generate targetsHidden states of nodes and relational edges at initial time

In subsequent iterations, each iteration t, each GRU, depending on whether it is a target GRU or a relationship GRU, will have its previous iteration hidden state

Or

And the incoming message of the previous iteration

Or

As input, and generates a new hidden state

Or

As an output, the message aggregation function generates a message for the next iteration:

8. the method for generating a scene graph based on global context interaction according to claim 1, wherein a message iteration delivery mechanism based on a global feature vector specifically comprises the following steps:

(1) initializing GRU states of each target node and the relation edges into zero vectors;

(2) global feature vector c of target_iAs the input of the target node GRU, combining two target coordinates and collecting the visual characteristics c in the frame_i→jAs the input of the relation edge GRU, the hidden states of the target node and the relation edge at the initial time are respectively generated

(3) Computing received messages for each target and relationship using a message aggregation function

And with

(4) Combined with hidden state

Received message

And

utilizing GRU to update state to obtain state of next time

(5) If the iteration times reach the set times, the state of the current target and the relation is stored

Otherwise, returning to the step (3);

(6) and after the message is transmitted, sending the final state vector of the target and the relation into a neural network to obtain a scene graph capable of reflecting the relation between the target and the target in the image.

9. A scene graph generation system based on global context interaction, comprising:

a target detection module for performing target detection on the input image I to obtain a target set O ═ O₁，o₂，…，o_n) And a corresponding set of visual features V ═ V (V ═ V)₁，v₂，…，v_n) And the coordinate feature set B ═ B₁，b₂，…，b_n) And the pre-classified label set L ═ (L)₁，l₂，…，l_n) And (C) combining the two target coordinates and collecting the visual characteristics C in the frame_i→j，i≠j)；

A joint expression vector acquisition module of the target vision and coordinate characteristics, which is used for transforming the absolute position coordinates of each target by using a neural network to obtain a joint expression vector f of the target vision and coordinate characteristics_i；

A target global feature obtaining module for obtaining (F) according to the feature fusion vector F₁，f₂，…，f_n) Obtaining local context target characteristics gamma_iAnd its class feature vector g_iUsing a neural network to map the global context target feature gamma of the target_iAnd its class feature vector g_iFusing to obtain the global feature c of the target_i；

A scene graph acquisition module for acquiring a global feature vector c based on each target_iFeature vector c of each relation_i→jInitialize its hidden state

Further, each node incoming message is initially calculated

Each side incoming message

10. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the scene graph generation method based on global context interaction according to any one of claims 1 to 8 when executing the computer program.