CN112184805B - Graph attention network construction method based on fusion of visual and spatial relations - Google Patents

Graph attention network construction method based on fusion of visual and spatial relations

Info

Publication number
CN112184805B
CN112184805B (application CN202010946723.6A)
Authority
CN
China
Prior art keywords
visual
node
graph
attention
central node
Prior art date
Legal status
Active
Application number
CN202010946723.6A
Other languages
Chinese (zh)
Other versions
CN112184805A (en)
Inventor
俞俊
杨艳
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202010946723.6A
Publication of CN112184805A
Application granted
Publication of CN112184805B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635Overlay text, e.g. embedded captions in a TV program
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a graph attention network construction method based on the fusion of visual and spatial relations. The method comprises the following steps. 1. Compute the visual feature and the absolute position feature of each target object in an input image; these two features form a dual-attribute node in the graph built from the input image, finally forming a graph. 2. Compute the spatial geometric relative position feature from each neighboring node to the central node in the graph. 3. Compute the attention weight between each neighboring node and the central node. 4. Compute the transfer information from each neighboring node to the central node. 5. Multiply the transfer information of all neighboring nodes corresponding to the central node by the corresponding attention weights and sum the results to obtain the aggregated information of the central node. The visual feature is updated with this information, while the absolute position feature remains unchanged. The invention can be used to assist the analysis of various visual scenes and is a general model that can be embedded into different visual tasks.

Description

Graph attention network construction method based on fusion of visual and spatial relations
Technical Field
The invention relates to the field of modeling the visual and spatial position relations among different objects with graph attention networks, and in particular to a graph attention network construction method based on the fusion of visual and spatial relations, namely a graph attention network that combines visual features with spatial position features.
Background
Exploring the relationship between target objects in an image may assist in understanding the visual scene, thereby helping to improve performance of related visual tasks, such as target detection, image captioning, visual question answering, and the like. However, different images contain different target objects, and there are various relationships between the different objects. In most practical related task applications, no annotation of relationships between objects is provided in advance. Therefore, it is important to design a model that can mine relationships between target objects in detail.
Attention networks are a common mechanism for modeling object relationships. In general, the objects detected in an image and the relationships between them are formed into a graph that consists of nodes and the edges connecting them. An attention network mainly involves two processes: attention weight calculation and information transfer and aggregation. However, most previous attention networks use only the visual information of the nodes for both processes and ignore the spatial relationships between nodes, even though the spatial information of objects has a great influence on modeling the relationships between them.
To address this problem, researchers have attempted to add the spatial position relationships between targets to the attention network as well, creating spatial attention networks. Nevertheless, these methods still suffer from at least two problems: the spatial features currently used do not describe the spatial relationship comprehensively, and most spatial attention networks inject spatial information only into the edge-weight calculation.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a graph attention network based on the fusion of visual and spatial relations, which fuses the visual and spatial relations of an image into a dual-attribute graph and can accurately model the relations among the objects in the image. The network can complete the aggregation and updating of visual and spatial information at the same time and can be applied to related visual tasks to help understand visual scene information. The method achieves good results on tasks such as target detection, image captioning, and visual question answering.
A method for constructing a graph attention network based on the fusion of visual and spatial relations comprises the following specific steps:
Step (1): compute the visual feature and the absolute position feature corresponding to each target object in the input image.
The visual feature and the absolute position feature of each target object form a dual-attribute node in the graph built from the input image, finally forming a graph;
Step (2): compute the spatial geometric relative position feature from each neighboring node to the central node in the graph;
Step (3): compute the attention weight between each neighboring node and the central node, which comprises visual attention and composite attention;
Step (4): compute the transfer information from each neighboring node to the central node, which comprises visual transfer information and composite transfer information;
Step (5): multiply the transfer information from all neighboring nodes corresponding to a central node by the corresponding attention weights and sum the results to obtain the aggregated information of the central node. This information is used only to update the visual feature of the central node; the absolute position feature of the central node remains unchanged.
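Read together, steps (1)-(5) operate on a graph whose nodes carry two attributes, of which only the visual one is ever rewritten. The short Python sketch below is illustrative only (the class and field names are not from the patent) and shows one way such a dual-attribute node can be represented.

from dataclasses import dataclass
import torch

@dataclass
class DualAttributeNode:
    """One dual-attribute node of the graph in steps (1)-(5).

    Only `visual` is replaced by the aggregation of step (5); `abs_pos`
    (center x, center y, width, height) is never changed.
    """
    visual: torch.Tensor    # visual feature of the target object
    abs_pos: torch.Tensor   # absolute position feature (c1, c2, w, h)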
Further, in step (1) an image is input, target detection is performed, and for each detected target the corresponding visual feature and absolute position feature are computed. The visual feature and the absolute position feature of each object form a dual-attribute node in the graph built from the image, finally forming a graph. The implementation is as follows:
The visual feature in step (1) refers to the visual feature of a target object in the input image.
The absolute position feature in step (1) refers to the absolute position feature of a target object in the input image, built from the center coordinates, width, and height of the object; the specific formula is as follows:
where i denotes the i-th dual-attribute node in the graph; c_{1i} and c_{2i} are the center coordinates of the target object, and w_i and h_i denote its width and height, respectively. The target object may be an object inside a detection box obtained by target detection, or a target object framed manually.
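As a concrete illustration of how the absolute position feature can be assembled from a detection box, the minimal Python sketch below is offered under stated assumptions: the (x1, y1, x2, y2) box convention and the optional normalization by image size are not specified by the patent, which only names the center coordinates, width, and height.

import torch

def absolute_position_features(boxes: torch.Tensor,
                               img_w: float, img_h: float,
                               normalize: bool = True) -> torch.Tensor:
    """Absolute position feature (c1, c2, w, h) per box.

    boxes: (N, 4) tensor of (x1, y1, x2, y2) detection boxes.
    Normalizing by image size is an assumption; the patent only names
    the center coordinates, width, and height of each target object.
    """
    x1, y1, x2, y2 = boxes.unbind(dim=1)
    c1 = (x1 + x2) / 2          # center x, i.e. c_1i
    c2 = (y1 + y2) / 2          # center y, i.e. c_2i
    w = x2 - x1                 # box width  w_i
    h = y2 - y1                 # box height h_i
    feat = torch.stack([c1, c2, w, h], dim=1)
    if normalize:
        feat = feat / feat.new_tensor([img_w, img_h, img_w, img_h])
    return feat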
Further, the spatial geometric relative position feature from each neighboring node to the central node in the graph is computed in step (2) as follows:
Given a central node and one of its corresponding neighboring nodes, the relative distance, the relative scale, and the relative direction between them are computed; the calculation formulas of these three spatial position relations are as follows:
Finally, the spatial geometric relative position feature is obtained by embedding these three spatial position features:
where Emb(·) is a position-embedding mapping operation.
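The patent's exact formulas for the three spatial relations and for Emb(·) are published as figures and are not reproduced in this text. The sketch below therefore uses common formulations from the spatial-relation literature (center-offset distance normalized by the central box's diagonal, log area ratio, and the offset angle) with a learned linear layer standing in for Emb(·); all of these concrete choices are assumptions.

import torch
import torch.nn as nn

class RelativePositionEmbedding(nn.Module):
    """Spatial geometric relative position feature between node pairs.

    The relative distance / scale / direction below follow common choices
    in the spatial-relation literature; the patent's own formulas are not
    reproduced here, so treat these as illustrative assumptions.
    """
    def __init__(self, d_geo: int = 64):
        super().__init__()
        self.emb = nn.Linear(3, d_geo)   # stands in for Emb(.)

    def forward(self, abs_pos: torch.Tensor) -> torch.Tensor:
        # abs_pos: (N, 4) rows of (c1, c2, w, h)
        c1, c2, w, h = abs_pos.unbind(dim=1)
        dx = c1.unsqueeze(0) - c1.unsqueeze(1)        # dx[i, j] = offset from central node i to neighbor j
        dy = c2.unsqueeze(0) - c2.unsqueeze(1)
        diag = torch.sqrt(w ** 2 + h ** 2).clamp_min(1e-6)
        rel_dist = torch.sqrt(dx ** 2 + dy ** 2) / diag.unsqueeze(1)   # relative distance (normalized by node i's diagonal)
        rel_scale = torch.log((w * h).unsqueeze(0) /
                              (w * h).unsqueeze(1).clamp_min(1e-6))    # relative scale (log area ratio)
        rel_dir = torch.atan2(dy, dx)                                  # relative direction (angle of the offset)
        geo = torch.stack([rel_dist, rel_scale, rel_dir], dim=-1)      # (N, N, 3)
        return self.emb(geo)                                           # (N, N, d_geo)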
Further, the attention weight from each neighboring node to the central node computed in step (3) comprises two parts, visual attention and composite attention; the specific process is as follows:
3-1. The spatial geometric relative position feature is combined with the visual feature of the neighboring node to generate a composite feature; the specific formula is as follows:
where || denotes a concatenation operation; σ denotes a nonlinear transformation; W_* and b_* denote a weight matrix and a bias, respectively, with the same notation used in the formulas below and * standing for the corresponding subscript.
3-2. The composite attention is computed from the visual feature of the central node and the composite feature; the specific calculation formula is as follows:
where W_{ap+2}, W_{ap+1} and W_t denote the weight matrices of different fully connected layers; similarly, b_{ap+2}, b_{ap+1} and b_t denote the corresponding biases.
3-3. The visual attention is computed from the visual feature of the central node and the visual feature of its corresponding neighboring node; the specific formula is as follows:
where d_{av} denotes the dimension of the corresponding weight matrix; the two weight matrices share the same dimension.
3-4. The computed visual attention and composite attention are used to obtain the attention weight a_{ij} between each neighboring node and the central node in the graph; the specific formula is as follows:
where α is a harmonic weight that balances visual attention and composite attention, and N is the number of neighboring nodes corresponding to the central node.
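The weight-calculation formulas of steps 3-1 to 3-4 are likewise published as figures in the patent. The sketch below therefore shows only one plausible reading: concatenate the relative geometry with the neighbor's visual feature to form the composite feature, score it against the central node's visual feature for the composite attention, use a projected dot product for the visual attention, and blend the two with the harmonic weight α before normalizing over each node's neighbors. The layer shapes, the ReLU nonlinearity, and the scaled dot-product form are assumptions.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedAttentionWeights(nn.Module):
    """Attention weight a_ij combining visual attention and composite attention.

    A plausible reading of steps 3-1 to 3-4; the exact formulas in the patent
    were published as images, so the layer shapes and the scaled dot-product
    form used here are illustrative assumptions.
    """
    def __init__(self, d_v: int, d_geo: int, d_av: int = 64, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha                       # harmonic weight between the two attentions
        # 3-1: composite feature from [geometry || neighbor visual feature]
        self.compose = nn.Linear(d_geo + d_v, d_v)
        # 3-2: composite attention score from the (central visual, composite) pair
        self.comp_score = nn.Sequential(nn.Linear(2 * d_v, d_av), nn.ReLU(),
                                        nn.Linear(d_av, 1))
        # 3-3: visual attention via a projected dot product
        self.q = nn.Linear(d_v, d_av)
        self.k = nn.Linear(d_v, d_av)

    def forward(self, visual: torch.Tensor, rel_geo: torch.Tensor,
                adj: torch.Tensor) -> torch.Tensor:
        n = visual.size(0)
        # 3-1: composite feature, shape (N, N, d_v)
        nbr = visual.unsqueeze(0).expand(n, n, -1)                 # neighbor j features
        comp = torch.relu(self.compose(torch.cat([rel_geo, nbr], dim=-1)))
        # 3-2: composite attention logits
        ctr = visual.unsqueeze(1).expand(n, n, -1)                 # central node i features
        comp_logits = self.comp_score(torch.cat([ctr, comp], dim=-1)).squeeze(-1)
        # 3-3: visual attention logits (scaled dot product)
        vis_logits = self.q(visual) @ self.k(visual).t() / math.sqrt(self.q.out_features)
        # 3-4: blend with the harmonic weight and normalize over each node's neighbors
        logits = self.alpha * vis_logits + (1.0 - self.alpha) * comp_logits
        logits = logits.masked_fill(adj == 0, float('-inf'))
        # Assumes every node has at least one neighbor in `adj`.
        return F.softmax(logits, dim=1)          # a_ij: each row sums to 1 over neighbors j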
Further, in step (4), visual transfer information and composite transfer information are used to compute the transfer information from each neighboring node to the central node, as follows:
4-1. Compute the visual transfer information and the composite transfer information; the corresponding formulas are as follows:
4-2. Using the obtained visual transfer information and composite transfer information, compute the transfer information from each neighboring node to the central node; the specific formula is as follows:
further, in step (5), the transmission information from all neighboring nodes corresponding to a central node to the central node is multiplied by the corresponding attention weight and summed to obtain updated and aggregated information of the central nodeThe specific formula is as follows:
n is the number of all adjacent nodes corresponding to the central node; this aggregate information is used only to update the visual information of the central node, the absolute position characteristics of the node remaining unchanged.
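A matching sketch of steps (4) and (5) follows, again under assumed linear projections for the two kinds of transfer information; it aggregates the per-edge messages with the attention weights a_{ij} and returns only the new visual features, while the absolute position features are carried through unchanged by the caller.

import torch
import torch.nn as nn

class FusedMessagePassing(nn.Module):
    """Steps (4)-(5): transfer information and attention-weighted aggregation.

    The linear projections and the simple sum of visual and composite
    messages are illustrative assumptions; the patent's formulas were
    published as images and are not reproduced here.
    """
    def __init__(self, d_v: int):
        super().__init__()
        self.msg_visual = nn.Linear(d_v, d_v)      # visual transfer information
        self.msg_composite = nn.Linear(d_v, d_v)   # composite transfer information

    def forward(self, visual: torch.Tensor, composite: torch.Tensor,
                attn: torch.Tensor) -> torch.Tensor:
        # visual:    (N, d_v)    node visual features
        # composite: (N, N, d_v) composite feature for each (central, neighbor) pair
        # attn:      (N, N)      attention weights a_ij from step (3)
        n = visual.size(0)
        vis_msg = self.msg_visual(visual).unsqueeze(0).expand(n, n, -1)  # message from neighbor j
        comp_msg = self.msg_composite(composite)
        msg = vis_msg + comp_msg                   # per-edge transfer information
        # Step (5): weight each neighbor's message by a_ij and sum over neighbors j.
        # The caller keeps the absolute position features unchanged.
        return torch.einsum('ij,ijd->id', attn, msg)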
The invention has the beneficial effects that:
the invention uses a dual attribute map to build an analysis model of the target object relationship based on the predecessor. The node attributes of the graph include both visual features of the object and spatial location features of the object. In this figure, the spatial information of the object can naturally be seamlessly combined with the visual features and aggregate propagated in the figure. Meanwhile, the invention combines the relative distance, the relative scale and the relative direction of the target object to construct the spatial position characteristic.
Aiming at the problem of relation modeling of each object in an image, the invention provides a graph annotation meaning network model based on fusion of visual and spatial relations, which combines the visual characteristics and the spatial position characteristics of the object, and improves the accuracy of relation analysis of each object in the image to a certain extent. The model is beneficial to visual scene understanding, and has better effects on tasks such as target detection, image captions, visual question answering and the like. The method can be used for practical applications such as automatic driving behavior detection in the future.
Drawings
Fig. 1 is a schematic flow diagram of the graph attention network based on the fusion of visual and spatial relations in the method of the present invention.
Fig. 2 is a detailed flow diagram of the graph attention network based on the fusion of visual and spatial relations in the method of the present invention.
Fig. 3 is the network framework in which the method of the present invention is applied to object detection.
Fig. 4 is a schematic diagram of the application of the method of the present invention to a target detection task.
Fig. 5 is a schematic diagram of the application of the method of the present invention in an image captioning task.
Fig. 6 is a schematic diagram of the application of the method of the present invention in a visual question answering task.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in Fig. 1, the invention provides a graph attention network based on the fusion of visual and spatial relations, which can be used to model the relations between visual objects and thereby assist visual scene analysis; the detailed flow is shown in Fig. 2. The invention can be embedded into different specific visual tasks, such as the three tasks corresponding to Figs. 4, 5, and 6, respectively: object detection, image captioning, and visual question answering. Notably, beyond these three tasks, other related tasks can also embed the graph attention network based on the fusion of visual and spatial relations into their visual models for better analysis and understanding of visual scenes.
The embodiment is described in detail here by taking the object detection task of Fig. 3 as an example. In general, target detection produces a detection box and the corresponding classification category for each target object. Fig. 3 uses ResNet-101 pre-trained on the ImageNet image library as the backbone network, and the overall target detection network adopts the Fast R-CNN architecture. The graph attention network of the present invention, based on the fusion of visual and spatial relations, is then added after the candidate-box generation module and placed before the object classification and detection-box regression layers. Two graph attention networks based on the fusion of visual and spatial relations are embedded in this implementation. In detail, a layer normalization operation is added in front of each graph attention network based on the fusion of visual and spatial relations, and a fully connected layer with ReLU is added after it. This graph attention network based on the fusion of visual and spatial relations helps improve the accuracy of the final target detection task.
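The arrangement just described (layer normalization in front of each fused graph attention layer, a fully connected layer with ReLU behind it, and two such units stacked between the candidate-box features and the classification/regression layers) can be sketched as follows; the module names, the 512-dimensional width, and the injected gat_layer interface are assumptions rather than the patent's exact implementation.

import torch.nn as nn

class VSFusedBlock(nn.Module):
    """One embedded unit: LayerNorm -> fused graph attention layer -> FC + ReLU."""
    def __init__(self, gat_layer: nn.Module, d_v: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_v)                 # normalization placed before the graph attention
        self.gat = gat_layer                          # any visual/spatial-fusion graph attention layer
        self.fc = nn.Sequential(nn.Linear(d_v, d_v), nn.ReLU())

    def forward(self, visual, abs_pos, adj):
        updated = self.gat(self.norm(visual), abs_pos, adj)   # only the visual features change
        return self.fc(updated)

class DetectionHead(nn.Module):
    """Two stacked blocks between the proposal features and the cls/reg layers."""
    def __init__(self, gat_layer1: nn.Module, gat_layer2: nn.Module,
                 num_classes: int, d_v: int = 512):
        super().__init__()
        self.block1 = VSFusedBlock(gat_layer1, d_v)
        self.block2 = VSFusedBlock(gat_layer2, d_v)
        self.cls = nn.Linear(d_v, num_classes)        # category of each candidate box
        self.reg = nn.Linear(d_v, 4)                  # detection-box coordinates

    def forward(self, visual, abs_pos, adj):
        visual = self.block1(visual, abs_pos, adj)
        visual = self.block2(visual, abs_pos, adj)    # absolute positions stay fixed throughout
        return self.cls(visual), self.reg(visual)

Stacking two blocks mirrors the description of updating the node visual features twice before the final classification and detection-box regression.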
The specific implementation steps are as follows:
Step (1): generate candidate targets for an input image to obtain a set of candidate boxes. In the target detection network, the output of the 5th convolutional layer is linearly transformed into a 512-dimensional feature that serves as the visual feature of each node. The absolute position feature of a node consists of the center coordinates, width, and height of its candidate box. The visual feature and the absolute position feature of each target then form a dual-attribute node in the graph built from the image, finally forming a graph. The number of dual-attribute nodes in the graph equals the number of candidate boxes;
Step (2): compute the spatial geometric relative position feature between each neighboring node and the central node in the dual-attribute graph.
Given a central node and one of its corresponding neighboring nodes, the relative distance, the relative scale, and the relative direction between them are computed; the corresponding calculation formulas are formulas 2, 3, and 4, respectively.
The spatial geometric relative position feature is then computed by embedding these three spatial position features, where Emb(·) is a position-embedding mapping operation.
Step (3): compute the attention weight between each neighboring node and the central node in the dual-attribute graph; the attention weight contains two parts, visual attention and composite attention.
First, the spatial geometric relative position feature is combined with the visual feature of the corresponding neighboring node to generate a composite feature, where || denotes a concatenation operation, σ denotes a nonlinear transformation, and W_* and b_* denote a weight matrix and a bias, respectively.
Then the composite attention is computed from the visual feature of the central node and the composite feature.
The visual attention is computed from the visual feature of the central node and the visual features of its neighbors.
Finally, the attention weight a_{ij} between each neighboring node and the central node in the graph is computed from the visual attention and the composite attention, where α is a harmonic weight that balances visual attention and composite attention.
Step (4): compute the transfer information from each neighboring node to the central node, which comprises two parts: visual transfer information and composite transfer information.
The obtained visual transfer information and composite transfer information are then used to compute the transfer information from each neighboring node to the central node.
Step (5): multiply the transfer information from all neighboring nodes of a central node by the corresponding attention weights and sum the results to obtain the aggregated and updated information of the central node. This information becomes the updated visual feature of the node, while the absolute position feature of the node remains unchanged.
The updated dual-attribute nodes are then fed into the next graph attention network based on the fusion of visual and spatial relations, which performs the same operation. After the visual features of the nodes have been updated twice, the final visual features are used for classification and regression to obtain the category of each detected target and the coordinates of its detection box.
In order to test the performance of the graph attention model based on the fusion of visual and spatial relations, different attention models (SA, SA-P, RN, SORN, SGRN, GAGA-Net) were added to the target detection task and their detection performance was tested. The experimental results are shown in Table 1, where AP, AP50, and AP75 are metrics measuring detection performance. As Table 1 shows, the graph attention network constructed by the method based on the fusion of visual and spatial relations (GAGA-Net) brings a clear improvement in target detection.
Table 1. Performance comparison of different attention models on target detection
To further test the performance of the graph attention network based on the fusion of visual and spatial relations, the different attention models (SA, SA-P, RN, SORN, SGRN, GAGA-Net) were added to the image captioning and visual question answering tasks and their performance was tested; the experimental results are shown in Tables 2 and 3, respectively. In Table 2, CIDEr, BLEU-1, ROUGE, and METEOR are four common metrics measuring the accuracy of caption generation. In Table 3, All, Y/N, Number, and Other are four question types of visual question answering, and the corresponding metric is answer accuracy. As Tables 2 and 3 show, the graph attention network based on the fusion of visual and spatial relations (GAGA-Net) performs markedly well on the image captioning and visual question answering tasks.
Table 2. Performance comparison of different attention models on image captioning
Table 3. Performance comparison of different attention models on visual question answering

Claims (4)

1. A method for constructing a graph attention network based on the fusion of visual and spatial relations, characterized by comprising the following steps:
step (1): computing the visual feature and the absolute position feature corresponding to each target object in an input image;
the visual feature and the absolute position feature of each target object form a dual-attribute node in the graph built from the input image, finally forming a graph;
step (2): computing the spatial geometric relative position feature from each neighboring node to the central node in the graph;
step (3): computing the attention weight between each neighboring node and the central node, which comprises visual attention and composite attention;
step (4): computing the transfer information from each neighboring node to the central node, which comprises visual transfer information and composite transfer information;
step (5): multiplying the transfer information from all neighboring nodes corresponding to a central node by the corresponding attention weights and summing the results to obtain the aggregated information of the central node;
the specific implementation process of the step (2) is as follows:
2-1. Given a central node and one of its corresponding neighboring nodes, the relative distance, the relative scale, and the relative direction between them are computed; the calculation formulas of these three spatial position relations are as follows:
where i denotes the i-th dual-attribute node in the graph; c_{1i} and c_{2i} are the center coordinates of the target object, and w_i and h_i denote its width and height, respectively;
2-2. The spatial geometric relative position feature is obtained by embedding the above three spatial position features:
where Emb(·) is a position-embedding mapping operation;
the specific implementation process of the step (3) is as follows:
3-1. The spatial geometric relative position feature is combined with the visual feature of the neighboring node to generate a composite feature; the specific formula is as follows:
where || denotes a concatenation operation; σ denotes a nonlinear transformation; W_* and b_* denote a weight matrix and a bias, respectively, with the same notation used in the following formulas and * standing for the corresponding subscript;
3-2. The composite attention is computed from the visual feature of the central node and the composite feature; the specific calculation formula is as follows:
where W_{ap+2}, W_{ap+1} and W_t denote the weight matrices of different fully connected layers; similarly, b_{ap+2}, b_{ap+1} and b_t denote the corresponding biases;
3-3. The visual attention is computed from the visual feature of the central node and the visual feature of its corresponding neighboring node; the specific formula is as follows:
where d_{av} denotes the dimension of the corresponding weight matrix;
3-4. The computed visual attention and composite attention are used to obtain the attention weight a_{ij} between each neighboring node and the central node in the graph; the specific formula is as follows:
where α is a harmonic weight that balances visual attention and composite attention, and N is the number of neighboring nodes corresponding to the central node.
2. The method for constructing a graph attention network based on the fusion of visual and spatial relations according to claim 1, wherein step (1) is specifically implemented as follows:
the visual feature in step (1) refers to the visual feature of a target object in the input image;
the absolute position feature in step (1) refers to the absolute position feature of a target object in the input image; the specific formula is as follows:
where i denotes the i-th dual-attribute node in the graph; c_{1i} and c_{2i} are the center coordinates of the target object, and w_i and h_i denote its width and height, respectively.
3. The method for constructing a graph attention network based on the fusion of visual and spatial relations according to claim 1, wherein step (4) uses visual transfer information and composite transfer information to compute the transfer information from each neighboring node to the central node, specifically as follows:
4-1. computing the visual transfer information and the composite transfer information; the corresponding formulas are as follows:
4-2. computing the transfer information from each neighboring node to the central node from the obtained visual transfer information and composite transfer information; the specific formula is as follows:
4. The method for constructing a graph attention network based on the fusion of visual and spatial relations according to claim 3, wherein step (5) is specifically implemented by the following formula:
where N is the number of neighboring nodes corresponding to the central node; this aggregated information is used only to update the visual information of the central node, while the absolute position feature of the node remains unchanged.
CN202010946723.6A 2020-09-10 2020-09-10 Graph attention network construction method based on fusion of visual and spatial relations Active CN112184805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010946723.6A CN112184805B (en) 2020-09-10 2020-09-10 Graph attention network construction method based on fusion of visual and spatial relations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010946723.6A CN112184805B (en) 2020-09-10 2020-09-10 Graph attention network construction method based on fusion of visual and spatial relations

Publications (2)

Publication Number Publication Date
CN112184805A CN112184805A (en) 2021-01-05
CN112184805B true CN112184805B (en) 2024-04-09

Family

ID=73921743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010946723.6A Active CN112184805B (en) 2020-09-10 2020-09-10 Graph attention network construction method based on fusion of visual and spatial relations

Country Status (1)

Country Link
CN (1) CN112184805B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905819B (en) * 2021-01-06 2022-09-23 中国石油大学(华东) Visual question-answering method of original feature injection network based on composite attention

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222770A (en) * 2019-06-10 2019-09-10 成都澳海川科技有限公司 A kind of vision answering method based on syntagmatic attention network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11210572B2 (en) * 2018-12-17 2021-12-28 Sri International Aligning symbols and objects using co-attention for understanding visual content

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222770A (en) * 2019-06-10 2019-09-10 成都澳海川科技有限公司 A kind of vision answering method based on syntagmatic attention network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
增强视觉特征的视觉问答任务研究 (Research on visual question answering with enhanced visual features); 秦淑婧; 杨关; 中原工学院学报 (Journal of Zhongyuan University of Technology); 2020-02-25 (No. 01); full text *

Also Published As

Publication number Publication date
CN112184805A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN111275618A (en) Depth map super-resolution reconstruction network construction method based on double-branch perception
CN111881773B (en) Event camera human body posture estimation method and system based on position offset
CN103606188A (en) Geographical information on-demand acquisition method based on image point cloud
CN105229703A (en) For the system and method using the position data of sensing to carry out generating three-dimensional models
WO2024060395A1 (en) Deep learning-based high-precision point cloud completion method and apparatus
CN111024089A (en) Indoor positioning navigation method based on BIM and computer vision technology
CN113012208B (en) Multi-view remote sensing image registration method and system
CN111881804A (en) Attitude estimation model training method, system, medium and terminal based on joint training
CN112184805B (en) 2020-09-10 2024-04-09 Graph attention network construction method based on fusion of visual and spatial relations
CN111739037B (en) Semantic segmentation method for indoor scene RGB-D image
CN112767486A (en) Monocular 6D attitude estimation method and device based on deep convolutional neural network
CN114494436A (en) Indoor scene positioning method and device
CN114820655A (en) Weak supervision building segmentation method taking reliable area as attention mechanism supervision
CN116385660A (en) Indoor single view scene semantic reconstruction method and system
CN103854271B (en) A kind of planar pickup machine scaling method
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN113807362A (en) Image classification method based on interlayer semantic information fusion deep convolutional network
CN117557804A (en) Multi-label classification method combining target structure embedding and multi-level feature fusion
CN104156952B (en) A kind of image matching method for resisting deformation
CN111932612A (en) Intelligent vehicle vision positioning method and device based on second-order hidden Markov model
CN116433904A (en) Cross-modal RGB-D semantic segmentation method based on shape perception and pixel convolution
CN103761725B (en) A kind of video plane detection method based on innovatory algorithm
CN114155406A (en) Pose estimation method based on region-level feature fusion
CN114549958A (en) Night and disguised target detection method based on context information perception mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant