CN114677544A - Scene graph generation method, system and device based on global context interaction - Google Patents
Scene graph generation method, system and device based on global context interaction
- Publication number
- CN114677544A (application number CN202210297025.7A)
- Authority
- CN
- China
- Prior art keywords
- target
- feature
- global
- vector
- gru
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/253—Fusion techniques of extracted features
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Abstract
The invention discloses a scene graph generation method, system and device based on global context interaction, comprising: 1) joint vector representation based on the fusion of multiple features, such as object visual features, spatial coordinates and semantic labels; 2) global feature generation based on a bidirectional gated recurrent neural network; 3) an iterative message passing mechanism based on global feature vectors; and 4) scene graph generation based on target and relation state representations. Compared with existing scene graph generation methods, the disclosed method fully exploits the global features of the image through context interaction and is therefore more broadly applicable; moreover, after the context-interacted global features are obtained, messages are passed between target pairs and their relations, and the latent connections between targets are used to update the current states, producing more accurate scene graphs and giving the method an advantage in practical applications.
Description
Technical Field
The invention belongs to the field of computer vision, and in particular relates to a scene graph generation method, system and device based on global context interaction.
Background Art
A scene graph composed of &lt;subject-relation-object&gt; triples can describe the objects in an image and the structural scene relationships between object pairs. Scene graphs offer two main advantages. First, the &lt;subject-relation-object&gt; triples carry structured semantic content, which gives them a clear edge over natural-language text in fine-grained information acquisition and processing. Second, a scene graph can fully represent the objects in an image and the structural relationships of the scene, and therefore has broad application prospects in many computer vision tasks. For example, in autonomous driving, modeling the environment with scene graphs can provide the decision-making system with more comprehensive environmental information; in semantic image retrieval, image providers model the structural scene relationships of images with scene graphs, so that users can retrieve the images they need by describing only the main targets or relations. Given the massive number of images and the real-time requirements that downstream tasks place on scene graphs, automatic scene graph generation has gradually become a research hotspot and is of great significance to the field of image understanding.
Existing message-passing-based scene graph generation methods construct target nodes and relation edges from object detection results and, based on a message passing mechanism, use recurrent neural networks to update states within local subgraphs, using the post-message-passing features for relation prediction. Such methods adopt a message passing mechanism based on the idea of local context: they ignore the implicit constraints among all targets, take only the visual features of target nodes as the initial states, and make relation detection depend solely on repeated exchanges among subject/object node features and joint visual features. The model therefore cannot take the overall structure of the image into account, global information plays no role in relation prediction, and the model's predictive ability is limited. In addition, existing methods fail to exploit object coordinates and do not analyze the visual relationships between targets from a spatial perspective. To address these problems, the present invention proposes a scene graph generation method based on global context interaction. Regarding existing scene graph generation methods:
Prior art 1 proposes an image scene graph generation method that divides relations into parent classes and child classes, performs dual relation prediction, and uses a normalization function to determine the precise relation, thereby generating the scene graph of the image.
Prior art 2 proposes a scene graph generation method based on a deep relational self-attention network. The method mainly includes: first, performing object detection on the input image to obtain labels, object bounding-box features and joint bounding-box features; then, constructing target features and relative relation features; and finally, using a deep neural network to generate the final visual scene graph.
The scene graph generation method of prior art 1 does not consider making full use of the feature vectors through feature fusion; the method of prior art 2 uses no message passing mechanism, does not consider information interaction between target pairs and their relations, and cannot update states after context propagation. Moreover, neither method uses the implicit constraints that exist among all the targets in the image to build context, leaving certain deficiencies.
Summary of the Invention
The purpose of the present invention is to provide a scene graph generation method, system and device based on global context interaction, so as to solve the above problems.
To achieve the above object, the present invention adopts the following technical solutions:
Compared with the prior art, the present invention has the following technical effects:
Compared with feature representation methods that use visual features alone to represent targets, the present invention makes full use of the targets' visual features, category features and spatial coordinate information, so that information is exploited more fully and the relation prediction performance of scene graph generation is improved;
Compared with scene graph generation methods that use local context interaction, the present invention uses a recurrent neural network to extract the global context of the image, realizes information interaction based on the global context, and then performs message passing, fully achieving data interaction and information expansion.
Brief Description of the Drawings
FIG. 1 is a block diagram of the scene graph generation method based on global context interaction according to the present invention.
FIG. 2 is a flowchart of the joint vector representation based on feature fusion.
FIG. 3 is a structural diagram of the bidirectional gated recurrent neural network BiGRU.
FIG. 4 is a flowchart of the iterative message passing mechanism based on global feature vectors.
FIG. 5 is a schematic diagram of object detection results and the corresponding scene graph.
FIG. 6 shows the performance test results of the present invention.
Detailed Description of the Embodiments
The embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples. It should be noted that the embodiments described herein are only used to explain the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the present invention may be combined with each other as long as they do not conflict.
The specific implementation of the present invention includes object detection on the image with feature vector fusion, feature generation based on global context interaction, and a message passing process. FIG. 1 is a block diagram of the scene graph generation method based on global context interaction according to the present invention.
1. Image object detection and feature vector fusion
Given an input image, the present invention uses the Faster-RCNN deep learning model for object detection, obtaining the target set O=(o1,o2,…,on), the corresponding visual feature set V=(v1,v2,…,vn), the coordinate feature set B=(b1,b2,…,bn), the pre-classification label set L=(l1,l2,…,ln), and the visual features within the union box of each pair of target coordinates, C=(ci→j, i≠j).
First, the present invention uses a feature fusion method to jointly represent the spatial coordinate feature bi and the visual feature vector vi of each target. For a target oi with absolute position coordinates b=(x1,y1,x2,y2), where x1,y1 and x2,y2 are the upper-left and lower-right corners of its rectangular regression box, the invention converts them into a relative position encoding bi within the image by normalizing with the image size:
bi = (x1/wid, y1/hei, x2/wid, y2/hei),
where wid is the original width of image I and hei is its original height.
Then, a fully connected layer of the neural network expands the relative position encoding bi into a 128-dimensional feature si:
si = σ(Ws bi + bs),
where σ is the ReLU activation function, and Ws and bs are linear transformation parameters learned by the network itself. Meanwhile, the method uses a fully connected layer to convert the 4096-dimensional target visual feature vi into 512 dimensions.
Subsequently, the present invention concatenates the dimension-transformed relative position feature vector si with the visual feature vi and applies a further dimension transformation, obtaining the 512-dimensional fused target visual-and-coordinate feature vector fi, computed as:
fi = σ(Wf[si, vi] + bf),
where [·] denotes the concatenation operation, σ is the ReLU activation function, and Wf and bf are linear transformation parameters.
The above feature vector fusion process is shown in FIG. 2.
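For illustration, the fusion pipeline above can be sketched in PyTorch as follows. This is a minimal, non-normative example: the module and parameter names (CoordVisualFusion, coord_fc, etc.) are our own, and the box normalization follows the reading of the formula given above rather than the patent's original equation image.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoordVisualFusion(nn.Module):
    """Sketch of the joint coordinate/visual representation fi described above."""
    def __init__(self, visual_dim=4096, coord_dim=4, out_dim=512):
        super().__init__()
        self.coord_fc = nn.Linear(coord_dim, 128)        # bi -> si (128-d)
        self.visual_fc = nn.Linear(visual_dim, out_dim)  # vi: 4096-d -> 512-d
        self.fuse_fc = nn.Linear(128 + out_dim, out_dim) # [si, vi] -> fi

    def forward(self, boxes, visual, wid, hei):
        # boxes: (n, 4) absolute (x1, y1, x2, y2); divide by the image size to
        # obtain the relative position encoding bi.
        b = boxes / boxes.new_tensor([wid, hei, wid, hei])
        s = F.relu(self.coord_fc(b))                     # si = ReLU(Ws bi + bs)
        v = F.relu(self.visual_fc(visual))               # 512-d visual feature
        return F.relu(self.fuse_fc(torch.cat([s, v], dim=-1)))  # fi
```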
2. Global feature generation based on a bidirectional gated recurrent neural network
For global feature generation, the present invention constructs a bidirectional gated recurrent neural network BiGRU and uses the zero vector as its initial state; its structure is shown in FIG. 3. After the feature fusion vectors F=(f1,f2,…,fn) of the target set are obtained, they are sorted from left to right by the first term of their relative coordinates (the x coordinate) and fed into the BiGRU in order, yielding the global context target features γ=(γ1,γ2,…,γn). The specific generation steps are:
(1) Initialize the zero vector as the initial state of the BiGRU;
(2) At the two ends of the BiGRU, input the first and the last feature fusion vectors of the target set respectively, generating the hidden states of the corresponding directions and orders;
(3) Continue feeding the feature vectors into the two ends of the BiGRU in order, generating the forward and backward hidden states of every target;
(4) Fuse the forward and backward hidden states to obtain the context-fused state γi of each target.
Subsequently, the present invention uses Glove word embedding vectors to convert the pre-classification results L=(l1,l2,…,ln) of the targets from the detection stage into 128-dimensional target category feature vectors gi.
Finally, the present invention uses a fully connected layer of the neural network to fuse the global context target feature γi of each target with its category feature vector gi, obtaining the global feature ci of this target. The computation is:
gi = Glove(li),
ci = σ(Wc[γi, gi] + bc),
where Glove(li) denotes encoding the pre-classification label of the target with Glove, [·] denotes the concatenation operation, and Wc and bc are linear transformation parameters.
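A corresponding non-normative sketch of the global feature generation step: the nn.Embedding table here merely stands in for the pretrained Glove vectors, and the hidden size of 256 per direction (512 after concatenation) is an assumption of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContext(nn.Module):
    """Sketch of the BiGRU global context features and label-embedding fusion."""
    def __init__(self, in_dim=512, hidden=256, num_labels=151, emb_dim=128):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden, bidirectional=True, batch_first=True)
        self.emb = nn.Embedding(num_labels, emb_dim)  # stand-in for Glove table
        self.fuse = nn.Linear(2 * hidden + emb_dim, 512)

    def forward(self, f, labels, x_left):
        # Sort targets left-to-right by the x coordinate of their boxes.
        order = torch.argsort(x_left)
        h, _ = self.bigru(f[order].unsqueeze(0))      # zero initial state
        gamma = h.squeeze(0)                          # [forward; backward] states
        g = self.emb(labels[order])                   # gi: 128-d category feature
        c = F.relu(self.fuse(torch.cat([gamma, g], dim=-1)))  # ci
        inv = torch.empty_like(order)                 # undo the sorting
        inv[order] = torch.arange(order.numel(), device=order.device)
        return c[inv]
```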
3. Iterative message passing mechanism based on global feature vectors
The iterative message passing mechanism consists of two parts: a message aggregation function and a state update function.
First, the present invention constructs the message aggregation function. In the scene graph topology, nodes and edges represent, respectively, the subject/object targets of a visual relationship and the relationship itself. During message passing, a single node or edge receives messages from multiple sources at the same time, so a pooling function must be designed to compute the weight of each partial message and aggregate the final incoming message as their weighted sum. According to the receiver, incoming messages are divided into messages mi received by target nodes and messages mi→j received by relation edges.
Given the hidden states hi(t) of the current node GRUs and hi→j(t) of the relation edge GRUs, the message mi(t) passed to the i-th node at the t-th iteration is computed from the hidden state hi(t) of the target GRU itself, the hidden states hi→j(t) of its out-degree edge GRUs and the hidden states hj→i(t) of its in-degree edge GRUs, where i→j denotes that in this relation target i is the subject and target j is the object.
Similarly, at the t-th iteration, the aggregated message mi→j(t) for the relation edge from the i-th target node to the j-th target node is composed of the hidden state hi→j(t−1) of the relation edge GRU from the previous iteration, the hidden state hi(t) of the subject node GRU and the hidden state hj(t) of the object node GRU. mi(t) and mi→j(t) are obtained by the following adaptive weighting functions:
mi(t) = Σ_{j: i→j} σ(v1·[hi(t), hi→j(t)]) hi→j(t) + Σ_{j: j→i} σ(v2·[hi(t), hj→i(t)]) hj→i(t),
mi→j(t) = σ(w1·[hi(t), hi→j(t)]) hi(t) + σ(w2·[hj(t), hi→j(t)]) hj(t),
where [·] denotes the concatenation operation, σ is the ReLU activation function, and w1, w2, v1 and v2 are learnable parameters.
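Purely as an illustration of these weighting functions (the original equation images are not reproduced here, so the exact form above is our reading of the prose; the tensor layout and the dense n×n edge representation are likewise assumptions of this sketch), the aggregation might look as follows:

```python
import torch
import torch.nn.functional as F

def aggregate_messages(h_node, h_edge, v1, v2, w1, w2):
    """Adaptive message pooling, following the weighting functions above.

    h_node: (n, d) target-node hidden states hi
    h_edge: (n, n, d) relation-edge hidden states hi->j (diagonal unused)
    v1, v2, w1, w2: (2*d,) learnable weight vectors
    """
    n, d = h_node.shape
    hi = h_node.unsqueeze(1).expand(n, n, d)            # hi broadcast over j
    hj = h_node.unsqueeze(0).expand(n, n, d)            # hj broadcast over i
    h_in = h_edge.transpose(0, 1)                       # h_in[i, j] = h(j->i)
    # Messages to target nodes: weighted sum over out-edges and in-edges.
    a_out = F.relu(torch.cat([hi, h_edge], -1) @ v1)    # weights for h(i->j)
    a_in = F.relu(torch.cat([hi, h_in], -1) @ v2)       # weights for h(j->i)
    m_node = (a_out.unsqueeze(-1) * h_edge).sum(1) + (a_in.unsqueeze(-1) * h_in).sum(1)
    # Messages to relation edges: weighted sum of subject and object states.
    b_sub = F.relu(torch.cat([hi, h_edge], -1) @ w1)    # weight for hi
    b_obj = F.relu(torch.cat([hj, h_edge], -1) @ w2)    # weight for hj
    m_edge = b_sub.unsqueeze(-1) * hi + b_obj.unsqueeze(-1) * hj
    return m_node, m_edge
```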
Second, the present invention constructs the state update function: a target node GRU and a relation edge GRU are built to store and update the feature vectors of the targets and of the relations between targets. First, at t=0, the GRU states of all target nodes and relation edges are initialized to zero vectors; the global feature vector ci of each target is used as the input of its target node GRU, and the visual feature ci→j within the union box of each pair of target coordinates is used as the input of the corresponding relation edge GRU, generating the initial hidden states hi(0) of the target nodes and hi→j(0) of the relation edges.
In each subsequent iteration t, every GRU, depending on whether it is a target GRU or a relation GRU, takes the hidden state of the previous iteration, hi(t−1) or hi→j(t−1), together with the incoming message of the previous iteration, mi(t−1) or mi→j(t−1), as input, and produces a new hidden state hi(t) or hi→j(t) as output, which the message aggregation function uses to generate the messages of the next iteration.
The specific steps of the whole message passing mechanism are therefore:
(1) Initialize the GRU states of all target nodes and relation edges to zero vectors;
(2) Use the global feature vector ci of each target as the input of its target node GRU and the visual feature ci→j within the union box of each pair of target coordinates as the input of the corresponding relation edge GRU, generating the initial hidden states hi(0) and hi→j(0) of the target nodes and relation edges;
(3) Use the message aggregation function to compute the messages mi(t) and mi→j(t) received by every target and relation;
(4) Combine the hidden states with the received messages mi(t) and mi→j(t) and update the states with the GRUs, obtaining the states hi(t+1) and hi→j(t+1) of the next step;
(5) If the number of iterations reaches the set value, save the current target and relation states hi and hi→j; otherwise, return to step (3).
The flow of the above message passing mechanism is shown in FIG. 4.
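The whole loop of steps (1) through (5) can then be sketched as below, reusing the aggregate_messages function from the sketch above. Again a minimal, non-normative example: the shared dimension d=512 and the iteration count n_iter=2 are illustrative assumptions, not values fixed by the patent text.

```python
import torch
import torch.nn as nn

class IterativeMessagePassing(nn.Module):
    """Sketch of steps (1)-(5): node/edge GRUs with iterative state updates."""
    def __init__(self, d=512, n_iter=2):
        super().__init__()
        self.node_gru = nn.GRUCell(d, d)            # target-node GRU
        self.edge_gru = nn.GRUCell(d, d)            # relation-edge GRU
        self.n_iter = n_iter
        for name in ("v1", "v2", "w1", "w2"):       # aggregation parameters
            setattr(self, name, nn.Parameter(torch.randn(2 * d)))

    def forward(self, c_node, c_edge):
        # c_node: (n, d) global target features ci; c_edge: (n, n, d) visual
        # features ci->j of the union boxes, used as relation-edge inputs.
        n, d = c_node.shape
        # Steps (1)-(2): zero initial states, then feed ci / ci->j once.
        h_node = self.node_gru(c_node, c_node.new_zeros(n, d))
        h_edge = self.edge_gru(c_edge.reshape(n * n, d),
                               c_edge.new_zeros(n * n, d)).reshape(n, n, d)
        for _ in range(self.n_iter):
            # Step (3): aggregate incoming messages (aggregate_messages above).
            m_node, m_edge = aggregate_messages(h_node, h_edge,
                                                self.v1, self.v2, self.w1, self.w2)
            # Step (4): GRU state update driven by the aggregated messages.
            h_node = self.node_gru(m_node, h_node)
            h_edge = self.edge_gru(m_edge.reshape(n * n, d),
                                   h_edge.reshape(n * n, d)).reshape(n, n, d)
        # Step (5): final states, used below for classification.
        return h_node, h_edge
```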
4. Scene graph generation based on target and relation state representations
The target and relation hidden states updated by the message passing mechanism are taken as the feature vectors of the targets and relations and fed into a neural network; a softmax function predicts the category of each target and of the relation between each pair of targets, from which a scene graph reflecting the targets in the image and the relationships between them is obtained.
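A possible sketch of this final classification step; the class counts num_obj_classes=151 and num_rel_classes=51 are illustrative, Visual-Genome-style assumptions rather than values stated in the patent.

```python
import torch.nn as nn

class SceneGraphHeads(nn.Module):
    """Sketch of the final softmax classifiers over the updated hidden states."""
    def __init__(self, d=512, num_obj_classes=151, num_rel_classes=51):
        super().__init__()
        self.obj_cls = nn.Linear(d, num_obj_classes)   # per-target categories
        self.rel_cls = nn.Linear(d, num_rel_classes)   # per-pair relations

    def forward(self, h_node, h_edge):
        obj_dist = self.obj_cls(h_node).softmax(-1)    # (n, num_obj_classes)
        rel_dist = self.rel_cls(h_edge).softmax(-1)    # (n, n, num_rel_classes)
        # The scene graph is read off as the argmax label of every target and
        # of every ordered target pair.
        return obj_dist.argmax(-1), rel_dist.argmax(-1)
```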
Given an input image, the object detection results and the corresponding scene graph are shown schematically in FIG. 5, and the performance test results of this model are shown in FIG. 6.
In yet another embodiment of the present invention, a scene graph generation system based on global context interaction is provided, which can be used to implement the above scene graph generation method based on global context interaction. Specifically, the system includes:
an object detection module for performing object detection on the input image I to obtain its target set O=(o1,o2,…,on), the corresponding visual feature set V=(v1,v2,…,vn), the coordinate feature set B=(b1,b2,…,bn), the pre-classification label set L=(l1,l2,…,ln), and the visual features within the union box of each pair of target coordinates, C=(ci→j, i≠j);
a joint target visual and coordinate feature representation module for using a neural network to transform the absolute position coordinates of each target, obtaining the joint representation vector fi of target visual and coordinate features;
a target global feature acquisition module for obtaining, from the feature fusion vectors F=(f1,f2,…,fn), the global context target feature γi and its category feature vector gi, and fusing γi with gi through a neural network to obtain the global feature ci of each target;
a scene graph acquisition module for initializing the hidden states hi(0) and hi→j(0) from the global feature vector ci of each target and the feature vector ci→j of each relation, computing the initial incoming messages of all nodes and edges, passing them iteratively, updating the hidden states with the recurrent neural networks and aggregating messages to obtain the incoming messages of each iteration until the set number of iterations is reached, and then using the final states of the target nodes and relation edges to generate a scene graph that reflects the targets in the image and the relationships between them.
The division into modules in the embodiments of the present invention is schematic and is merely a division by logical function; other divisions are possible in actual implementation. In addition, the functional modules in the embodiments of the present invention may be integrated into one processor, may exist separately and physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in the form of hardware or in the form of software functional modules.
In yet another embodiment of the present invention, a computer device is provided. The computer device includes a processor and a memory, the memory storing a computer program comprising program instructions, and the processor executing the program instructions stored in the computer storage medium. The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like; it is the computing core and control core of the terminal and is adapted to implement one or more instructions, specifically to load and execute one or more instructions in the computer storage medium so as to realize the corresponding method flow or function. The processor in the embodiments of the present invention may be used to perform the operations of the scene graph generation method based on global context interaction.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements may still be made to the specific embodiments of the present invention, and any modification or equivalent replacement that does not depart from the spirit and scope of the present invention shall fall within the protection scope of the claims of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210297025.7A CN114677544B (en) | 2022-03-24 | 2022-03-24 | A scene graph generation method, system and device based on global context interaction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210297025.7A CN114677544B (en) | 2022-03-24 | 2022-03-24 | A scene graph generation method, system and device based on global context interaction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114677544A (en) | 2022-06-28 |
CN114677544B CN114677544B (en) | 2024-08-16 |
Family
ID=82073908
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210297025.7A Active CN114677544B (en) | 2022-03-24 | 2022-03-24 | A scene graph generation method, system and device based on global context interaction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114677544B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020244287A1 (en) * | 2019-06-03 | 2020-12-10 | 中国矿业大学 | Method for generating image semantic description |
CN111462282A (en) * | 2020-04-02 | 2020-07-28 | 哈尔滨工程大学 | Scene graph generation method |
KR20220025524A (en) * | 2020-08-24 | 2022-03-03 | 경기대학교 산학협력단 | System for generating scene graph using deep neural network |
CN113221613A (en) * | 2020-12-14 | 2021-08-06 | 国网浙江宁海县供电有限公司 | Power scene early warning method for generating scene graph auxiliary modeling context information |
CN113627557A (en) * | 2021-08-19 | 2021-11-09 | 电子科技大学 | A Scene Graph Generation Method Based on Context Graph Attention Mechanism |
CN113836339A (en) * | 2021-09-01 | 2021-12-24 | 淮阴工学院 | Scene graph generation method based on global information and position embedding |
Non-Patent Citations (1)
Title |
---|
LAN Hong; LIU Qinyi: "Scene graph-to-image generation model with graph attention network" (图注意力网络的场景图到图像生成模型), Journal of Image and Graphics (中国图象图形学报), no. 08, 12 August 2020 (2020-08-12) *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115546589A (en) * | 2022-11-29 | 2022-12-30 | 浙江大学 | An Image Generation Method Based on Graph Neural Network |
CN115546589B (en) * | 2022-11-29 | 2023-04-07 | 浙江大学 | Image generation method based on graph neural network |
CN118015522A (en) * | 2024-03-22 | 2024-05-10 | 广东工业大学 | Time transition regularization method and system for video scene graph generation |
Also Published As
Publication number | Publication date |
---|---|
CN114677544B (en) | 2024-08-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110084296B (en) | A Graph Representation Learning Framework Based on Specific Semantics and Its Multi-label Classification Method | |
CN114048331A (en) | A Knowledge Graph Recommendation Method and System Based on Improved KGAT Model | |
CN110188167A (en) | An end-to-end dialogue method and system incorporating external knowledge | |
CN110399518A (en) | A Visual Question Answering Enhancement Method Based on Graph Convolution | |
CN111191526A (en) | Pedestrian attribute recognition network training method, system, medium and terminal | |
CN114332578A (en) | Image anomaly detection model training method, image anomaly detection method and device | |
CN113095346A (en) | Data labeling method and data labeling device | |
CN114677544B (en) | A scene graph generation method, system and device based on global context interaction | |
CN111462324A (en) | Online spatiotemporal semantic fusion method and system | |
Sutanto et al. | Learning equality constraints for motion planning on manifolds | |
US11270425B2 (en) | Coordinate estimation on n-spheres with spherical regression | |
CN118393329B (en) | A system for testing the performance of AI chips in model training and reasoning | |
CN110196928A (en) | Fully parallelized end-to-end more wheel conversational systems and method with field scalability | |
CN116523583A (en) | Electronic commerce data analysis system and method thereof | |
CN116151270A (en) | Parking test system and method | |
CN118196089A (en) | Lightweight method and system for glass container defect detection network based on knowledge distillation | |
CN117036545A (en) | Image scene feature-based image description text generation method and system | |
CN119128790A (en) | Multimodal data fusion method, system, readable storage medium and computer device | |
CN115204171A (en) | Document-level event extraction method and system based on hypergraph neural network | |
CN113923099B (en) | Root cause positioning method for communication network fault and related equipment | |
CN115880552A (en) | Cross-scale graph similarity guided aggregation system, method and application | |
CN113723511B (en) | Target detection method based on remote sensing electromagnetic radiation and infrared image | |
CN115457268A (en) | Hybrid structure-based segmentation method and device and storage medium | |
CN109993188B (en) | Data label identification method, behavior identification method and device | |
CN110222839A (en) | A kind of method, apparatus and storage medium of network representation study |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |