CN111340189B - Space pyramid graph convolution network implementation method - Google Patents

Space pyramid graph convolution network implementation method

Info

Publication number
CN111340189B
CN111340189B (application CN202010108770.3A)
Authority
CN
China
Prior art keywords
network
convolution
graph
graph convolution
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010108770.3A
Other languages
Chinese (zh)
Other versions
CN111340189A (en)
Inventor
林宙辰 (Zhouchen Lin)
李夏 (Xia Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202010108770.3A priority Critical patent/CN111340189B/en
Publication of CN111340189A publication Critical patent/CN111340189A/en
Application granted granted Critical
Publication of CN111340189B publication Critical patent/CN111340189B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems


Abstract

The invention discloses a spatial pyramid graph convolution network implementation method. An association matrix is first generated for the deep-network feature map, and an efficient computation is derived by decomposing the association matrix and applying the associative law of matrix multiplication, so that graph convolution can be performed directly in the original feature space. As a lightweight network, the invention breaks the limitation of previous methods, in which graph convolution had to be performed on semantic nodes, and further improves the expressive power of the network by performing graph reasoning at multiple scales. The invention effectively alleviates the insufficient receptive field of fully convolutional networks and significantly improves their performance. The proposed graph convolution scheme achieves remarkable performance on dense prediction tasks in computer vision; the examples and experiments fully verify its effectiveness and potential application value.

Description

Space pyramid graph convolution network implementation method
Technical Field
The invention belongs to the field of graph convolutional networks and deep network architecture design, and particularly relates to a method for implementing a spatial pyramid graph convolution network.
Background
In recent years, architectures based on fully convolutional networks have achieved tremendous success in computer vision tasks. A fully convolutional network consists only of convolutional layers and pooling layers; by stacking convolutional layers, the theoretical receptive field of the network grows with its depth. However, the effective receptive field is limited, so each position can only capture local information, and fully convolutional networks therefore have difficulty capturing complex context. For dense prediction tasks such as semantic segmentation and depth estimation, context information is critical; this problem limits the performance of fully convolutional networks.
Many approaches have attempted to address this problem. Spatial pyramids based on atrous (dilated) convolution capture context at different distances by using different dilation rates. Deformable convolution adaptively determines the final receptive field by learning offsets of the sampling locations. Non-local neural networks and dual-attention networks introduce interaction modules that let every pixel perceive the entire space. Recurrent neural networks have also been used for long-range reasoning. These methods enlarge the receptive field and help the network capture long-range information, but they are computationally expensive.
Extending standard convolution to unstructured data yields graph convolution. Subsequent studies approximated the graph convolution formulation to reduce computation and training cost. Building on this work, graph convolution has achieved a series of results on graph-structured data, such as semi-supervised learning, node or graph classification, and molecular prediction. Because graph convolution can capture global information, it has been introduced into fully convolutional networks as a complement to standard convolution. These methods map pixels to a space of semantic nodes, perform graph convolution in that semantic space, and then map back to pixel space.
Disclosure of Invention
The invention aims to provide a spatial pyramid graph convolution network implementation method that addresses the shortcomings of the prior art. By performing a compact graph convolution operation in the original feature space, the invention effectively enlarges the effective receptive field of the network and alleviates the insufficient receptive field of fully convolutional networks.
The aim of the invention is realized by the following technical solution: a method for implementing a spatial pyramid graph convolution network, comprising the following steps:
(1) The image is passed through a convolutional network to obtain the original-scale feature X^0, and X^0 is then downsampled to obtain multi-scale features, as follows:

X^{s+1} = Π_down(X^s)

where Π_down denotes downsampling; the superscript denotes the scale level, ranging from 0 to S; X^s and X^{s+1} denote the features at scale levels s and s+1.
(2) Graph convolution is performed on each of the multi-scale features obtained in step (1); the result at each scale is then upsampled and added to the graph-convolution output at the next finer scale, finally yielding the network output Y^0:

Y^s = GR(X^s) + Π_up(Y^{s+1})

where Y^s denotes the network output for the scale-level-s feature, with Y^S = GR(X^S) and X^S the coarsest-scale feature; GR denotes graph convolution; Π_up denotes upsampling.
Further, the graph convolution is implemented by the following steps:
(2.1) Input a feature map X ∈ R^{H×W×C}, where H, W, C denote the height, width, and number of channels of the feature map X.
(2.2) Obtain the association matrix A through the following sub-steps:
(2.1.1) Transform the X input in step (2.1) by a 1×1 convolution to obtain φ(X; W_φ) with M channels, where W_φ denotes the 1×1 convolution parameters.
(2.1.2) Spatially pool the X input in step (2.1) to 1×1×C, obtain a vector of dimension M by a 1×1 convolution, and finally apply a sigmoid to obtain p(X; W_p), where W_p denotes the 1×1 convolution parameters.
(2.1.3) Compute the association matrix A from steps (2.1.1) to (2.1.2):

A = φ(X; W_φ) diag(p(X; W_p)) φ(X; W_φ)^T

(2.3) Compute the degree matrix D from the association matrix A obtained in step (2.2):

D = diag(A·1) = diag(φ(X; W_φ) Λ(X) (φ(X; W_φ)^T · 1)), where Λ(X) = diag(p(X; W_p))

and 1 ∈ R^{HW} is the all-ones vector of length HW.
(2.4) Compute the normalized Laplacian matrix L from the association matrix A of step (2.2) and the degree matrix D of step (2.3):

L = I - D^{-1/2} A D^{-1/2}

where I is the identity matrix.
(2.5) The output GR(X) of the graph convolution is obtained by:

LX = X - P(Λ(X)(P^T X)), with P = D^{-1/2} φ(X; W_φ)

GR(X) = σ(LXΘ)

where Θ denotes the parameters of the graph convolution and σ is a nonlinear activation function.
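Steps (2.1) to (2.5) can be sketched in NumPy, assuming the feature map is flattened to an HW×C matrix and the learned 1×1 convolutions are replaced by plain matrix multiplies; `W_phi`, `W_p`, and `Theta` are hypothetical stand-ins for the learned parameters, and σ is taken as ReLU:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def graph_reasoning(X, W_phi, W_p, Theta):
    """Sketch of GR(X) = sigma(L X Theta) without materializing any HW x HW matrix."""
    HW = X.shape[0]
    phi = X @ W_phi                            # (HW, M): stands in for the 1x1 conv phi(X; W_phi)
    p = sigmoid(X.mean(axis=0) @ W_p)          # (M,): global pooling, 1x1 conv, then sigmoid
    Lam = np.diag(p)                           # Lambda(X) = diag(p(X; W_p))

    # Degree vector d = A @ 1, computed right-to-left so A is never formed explicitly
    d = phi @ (Lam @ (phi.T @ np.ones(HW)))
    P = (1.0 / np.sqrt(d))[:, None] * phi      # P = D^{-1/2} phi

    LX = X - P @ (Lam @ (P.T @ X))             # L X = X - P Lambda (P^T X)
    return np.maximum(LX @ Theta, 0.0)         # sigma chosen as ReLU here
```

In this toy sketch nonnegative features and weights keep the degree vector positive; in a trained network the learned parameters play that role.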
Further, the number of channels M satisfies M ≪ HW.
Further, the upsampling is implemented by bilinear interpolation.
Further, the downsampling is achieved by a max pooling layer.
Compared with the prior art, the invention has the following beneficial effects. In the proposed spatial pyramid graph convolution network implementation method, the generation of the association matrix A is designed around the characteristics of deep-network feature maps, and the computational complexity is significantly reduced by decomposing A and applying the associative law of matrix multiplication. As a lightweight network, the invention allows the graph convolution layer, previously applied only in a compressed semantic-node space, to be applied directly in the original feature space. In addition, the invention proposes to fully exploit the potential of graph convolution with a spatial pyramid structure, further increasing the expressive power of the network by performing graph reasoning at multiple scales. The invention performs remarkably on dense prediction tasks in computer vision; the examples and experiments fully verify its effectiveness and potential application value.
Drawings
FIG. 1 is a schematic diagram of the association-matrix computation used in the invention;
FIG. 2 is a schematic diagram of the spatial pyramid graph convolution network of the invention;
FIG. 3 is a schematic diagram of the overall network used with the invention.
Detailed Description
The invention is further described below by way of examples with reference to the accompanying drawings, which in no way limit the scope of the invention.
The invention designs the generation of the Laplacian matrix according to the characteristics of fully convolutional networks, and significantly reduces computation by decomposing it. This change makes it possible to operate directly in the original feature space, avoiding the information loss caused by the mapping and inverse-mapping procedure. The disclosed spatial pyramid graph convolution network comprises the graph convolution module shown in Fig. 1, which is implemented by the following steps:
(1) Input the feature map X ∈ R^{H×W×C} extracted by the backbone network, where H, W, C denote the height, width, and number of channels of the feature map X; H×W is the number of nodes of X and C is the feature dimension of each node.
(2) The association matrix A is obtained as shown in Fig. 1, through the following sub-steps:
(2.1) Transform the X input in step (1) by a 1×1 convolution to obtain φ(X; W_φ), reducing the number of channels from C to M, with M ≪ HW; W_φ are the 1×1 convolution parameters.
(2.2) Spatially pool the X input in step (1) to 1×1×C, obtain a vector of dimension M by a 1×1 convolution, and finally apply a sigmoid to obtain p(X; W_p); W_p are the 1×1 convolution parameters.
(2.3) Compute the association matrix A:

A = φ(X; W_φ) diag(p(X; W_p)) φ(X; W_φ)^T

where diag(·) forms a diagonal matrix from a vector.
(3) Compute the degree matrix D from the association matrix A obtained in step (2):

D = diag(A·1) = diag(φ(X; W_φ) Λ(X) (φ(X; W_φ)^T · 1)), where Λ(X) = diag(p(X; W_p))

and 1 ∈ R^{HW} is the all-ones vector of length HW. Because Λ(X) depends on the input data, a better metric can be learned for the association matrix A. By the associative law of matrix multiplication, the computational complexity of computing D is reduced from O(HWM^2 + (HW)^2·M + (HW)^2) to O(HWM^2 + M^2).
(4) Compute the normalized Laplacian matrix L from the association matrix A of step (2) and the degree matrix D of step (3):

L = I - D^{-1/2} A D^{-1/2}

where I is the identity matrix; L has dimension HW×HW.
(5) The output GR(X) of the graph convolution operation is obtained by:

GR(X) = σ(LXΘ)

where GR denotes the graph convolution proposed by the invention, Θ the parameters of the graph convolution, and σ a nonlinear activation function. With the same decomposition, the computational complexity of computing LX is reduced from O(HWM^2 + (HW)^2·M + (HW)^2·C + (HW)^2) to O(HWM^2 + HWCM).
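The complexity claim can be checked numerically: the factored computation X - P Λ(X)(P^T X) gives exactly the same result as materializing A, D, and L and multiplying L by X directly. A small NumPy check, with random positive φ and p standing in for the learned quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
HW, M, C = 64, 4, 8
phi = rng.uniform(0.1, 1.0, (HW, M))   # stands in for phi(X; W_phi); positive so degrees stay positive
p = rng.uniform(0.1, 0.9, M)           # stands in for p(X; W_p) after the sigmoid
X = rng.normal(size=(HW, C))

# Naive route: materialize A (HW x HW), then D and L, then multiply L @ X.
A = phi @ np.diag(p) @ phi.T
d = A @ np.ones(HW)                    # degree vector
L = np.eye(HW) - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)
LX_naive = L @ X

# Factored route: P = D^{-1/2} phi and L X = X - P (Lambda (P^T X)); no HW x HW matrix is formed.
P = (d ** -0.5)[:, None] * phi
LX_fast = X - P @ (np.diag(p) @ (P.T @ X))

print(np.allclose(LX_naive, LX_fast))  # True
```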
Although graph reasoning captures global context, the same image contains multiple long-range context patterns. For example, a finer-grained representation may carry more detailed long-range context, while a coarser-grained representation may provide more global dependencies. Since our graph reasoning module operates directly in the original feature space, we organize the input features into a spatial pyramid to extend the long-range context patterns our method can capture.
This has a form similar to a feature pyramid network; however, our method is applied to the original features rather than to the multi-scale features inside a convolutional neural network (CNN) backbone. Graph reasoning is performed at each scale obtained by downsampling, and the outputs are then merged by upsampling, as shown in Fig. 2.
In the invention, graph reasoning over the spatial pyramid is expressed as follows:
(1) The image is passed through the convolutional network to obtain the original-scale feature map X^0, which is then downsampled to obtain multi-scale features:

X^{s+1} = Π_down(X^s)

where Π_down denotes downsampling; the superscript denotes the scale level, ranging from 0 to S; X^s and X^{s+1} denote the features at scale levels s and s+1.
(2) Graph convolution is performed on the multi-scale features obtained in step (1); the outputs are then upsampled and merged, finally producing the network output Y^0:

Y^s = GR(X^s) + Π_up(Y^{s+1})

where Y^s denotes the network output for the scale-level-s feature, with Y^S = GR(X^S) and X^S the coarsest-scale feature; Π_up and Π_down denote upsampling and downsampling respectively. Specifically, upsampling is implemented by bilinear interpolation and downsampling by a max pooling layer.
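The recursion Y^s = GR(X^s) + Π_up(Y^{s+1}) can be sketched as follows, using a 2× max pool for Π_down and, for brevity, nearest-neighbour upsampling for Π_up (the invention specifies bilinear interpolation); `gr` is a placeholder for the graph reasoning module:

```python
import numpy as np

def pool_down(x):
    """Pi_down: 2x max pooling over the spatial dims of an (H, W, C) array."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def up(y):
    """Pi_up: 2x nearest-neighbour upsampling (bilinear in the invention)."""
    return y.repeat(2, axis=0).repeat(2, axis=1)

def pyramid_graph_reasoning(x0, gr, S):
    """Y^s = GR(X^s) + Pi_up(Y^{s+1}), with Y^S = GR(X^S) at the coarsest level."""
    xs = [x0]
    for _ in range(S):                  # X^{s+1} = Pi_down(X^s)
        xs.append(pool_down(xs[-1]))
    y = gr(xs[-1])                      # coarsest level: Y^S = GR(X^S)
    for s in range(S - 1, -1, -1):      # merge back up to the original scale
        y = gr(xs[s]) + up(y)
    return y                            # Y^0
```

The sketch assumes the spatial dimensions are divisible by 2 at every level.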
The invention can be applied wherever deep neural networks are used, and can cooperate directly with a deep neural network through a residual connection. The effectiveness of the proposed spatial pyramid graph convolution module is demonstrated on the classical task of semantic segmentation: the invention significantly improves the baseline model and outperforms attention mechanisms widely used in computer vision.
For semantic segmentation, the embodiment specifically includes the following steps:
step 1, collecting images and labeling correct segmentation results:
natural images of different scenes and under the illumination condition are acquired through the camera lens. Semantic information and categories of objects in the image are annotated at the pixel level. And eliminating errors in the marked data by a mode of marking and taking by multiple persons.
Step 2, establishing an objective function of the semantic segmentation problem:
in a specific implementation, cross entropy is typically used as the loss function. Taking the characteristics of semantic segmentation into consideration, cross entropy can be added to different layers of the depth network as an additional loss function.
Step 3, selecting a network structure serving the semantic segmentation task and adding the spatial pyramid graph convolution module (SpyGR), with the overall flow shown in Fig. 3:
The classical ResNet-101 is typically chosen as the backbone network for the semantic segmentation task. Through pre-training on the ImageNet classification task, ResNet acquires strong generalization ability. To meet the needs of the semantic segmentation task, the strides of ResNet stages c4 and c5 can be set to 1 and their dilation rates changed to 2 and 4 respectively, so that the overall spatial downsampling rate of the network is 8. A 3×3 convolution placed on top of the ImageNet-pretrained ResNet-101 reduces the number of channels from 2048 to 512.
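The claim that setting the strides of c4 and c5 to 1 yields an overall downsampling rate of 8 can be checked with a toy stride product; the stage layout below follows the standard ResNet design (stem conv 2, max pool 2, then four residual stages) and is an assumption, not taken from the patent:

```python
def output_stride(stage_strides):
    """Overall spatial downsampling rate: the product of the per-stage strides."""
    rate = 1
    for s in stage_strides:
        rate *= s
    return rate

# Standard ResNet-101: stem(2), pool(2), c2(1), c3(2), c4(2), c5(2) -> 32x downsampling
standard = output_stride([2, 2, 1, 2, 2, 2])
# Segmentation variant: strides of c4 and c5 set to 1 (dilation compensates) -> 8x
dilated = output_stride([2, 2, 1, 2, 1, 1])
```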
On top of the features extracted by the network, the spatial pyramid graph convolution module is placed to obtain better features, complementing deficiencies in the backbone representation and removing redundant information irrelevant to the targets. The representation processed by the module is again reduced to 512 dimensions, and finally the fully convolutional network (FCN) head performs pixel-wise classification and interpolation to recover the original size.
Step 4, preprocessing input data:
for training data sets, the image needs to be transformed to standard size and cropped. Common data enhancements include flipping and multi-scale transforms. In addition, the input data is normalized.
Step 5, determining super parameters of network training:
prior to training, super parameters of the network training, including batch size, learning rate, iteration number, etc., are determined. In the problem of semantic segmentation, different datasets possess different hyper-parameters. For the Cityscapes dataset, the optional superparameter is batch size 8, initial learning rate 0.01, learning rate decay strategy Poly decay, and index 0.9.
Step 6, performing network training:
after the network structure is obtained, the semantic segmentation data for training can be utilized to train the network, and training is stopped after the iteration times are completed. In the implementation example of the invention, the above steps are completed, and the trained deep neural network can be used for executing the semantic segmentation task.
Table 1: network module complexity contrast
Method Floating point calculation amount (GB) Video memory occupation (MB)
Nonlocal 14.60 1072
A 2 -Net 3.11 110
GloRe 3.11 103
SGR 6.24 118
DANet 19.54 1114
SpyGR without pyramid 3.11 120
SpyGR using pyramids 4.12 164
Table 2: cityscapes dataset test set result comparison
Table 1 compares the computational and memory complexity of the spatial pyramid graph convolution module (SpyGR) of the invention with other modules. Table 2 lists the performance of SpyGR on the Cityscapes dataset using mIoU, the standard metric for semantic segmentation tasks; the best results are bolded and the second best underlined. Together, Tables 1 and 2 show that the proposed SpyGR module achieves the highest performance at lower computation and storage cost.
Table 3: PASCAL VOC dataset comparison results
Table 3 lists the comparison results on the PASCAL VOC dataset. The results show that, under a variety of testing protocols, SpyGR is significantly better than DeepLabv3 and DeepLabv3+ on both the Cityscapes and PASCAL VOC datasets. The invention provides a spatial-pyramid-based graph convolution algorithm implemented as a lightweight neural network module, achieving better performance while reducing computation and parameter count.
It should be noted that the disclosed embodiments are intended to aid understanding of the invention, and those skilled in the art will appreciate that various alternatives and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments; rather, its scope is defined by the appended claims.

Claims (5)

1. A method for implementing a spatial pyramid graph convolution network, characterized by comprising the following steps:
(1) passing the image through a convolutional network to obtain the original-scale feature X^0, then downsampling X^0 to obtain multi-scale features, as follows:

X^{s+1} = Π_down(X^s)

where Π_down denotes downsampling; the superscript denotes the scale level, ranging from 0 to S; X^s and X^{s+1} denote the features at scale levels s and s+1;
(2) performing graph convolution on the multi-scale features obtained in step (1), then upsampling the result at each scale and adding it to the graph-convolution output at the next finer scale, finally obtaining the network output Y^0:

Y^s = GR(X^s) + Π_up(Y^{s+1})

where Y^s denotes the network output for the scale-level-s feature, with Y^S = GR(X^S) and X^S the coarsest-scale feature; GR denotes graph convolution; Π_up denotes upsampling;
the spatial pyramid graph convolution network realized by the step (1) and the step (2) is applied to a deep neural network for semantic segmentation, and comprises the following steps:
s1: collecting images and labeling correct segmentation results;
s2: establishing an objective function of the semantic segmentation problem;
s3: selecting a network structure serving a semantic segmentation task, and adding a spatial pyramid graph convolution network;
s4: preprocessing input data;
s5: determining super parameters of network training;
s6: performing network training;
s7: the trained deep neural network can be used for executing semantic segmentation tasks.
2. The method according to claim 1, characterized in that the graph convolution is implemented by the following steps:
(2.1) inputting a feature map X ∈ R^{H×W×C}, where H, W, C denote the height, width, and number of channels of the feature map X;
(2.2) obtaining the association matrix A through the following sub-steps:
(2.1.1) transforming the X input in step (2.1) by a 1×1 convolution to obtain φ(X; W_φ) with M channels, where W_φ denotes the 1×1 convolution parameters;
(2.1.2) spatially pooling the X input in step (2.1) to 1×1×C, obtaining a vector of dimension M by a 1×1 convolution, and finally applying a sigmoid to obtain p(X; W_p), where W_p denotes the 1×1 convolution parameters;
(2.1.3) computing the association matrix A from steps (2.1.1) to (2.1.2):

A = φ(X; W_φ) diag(p(X; W_p)) φ(X; W_φ)^T

(2.3) computing the degree matrix D from the association matrix A obtained in step (2.2):

D = diag(A·1) = diag(φ(X; W_φ) Λ(X) (φ(X; W_φ)^T · 1)), where Λ(X) = diag(p(X; W_p))

and 1 ∈ R^{HW} is the all-ones vector of length HW;
(2.4) computing the normalized Laplacian matrix L from the association matrix A of step (2.2) and the degree matrix D of step (2.3):

L = I - D^{-1/2} A D^{-1/2}

where I is the identity matrix;
(2.5) obtaining the output GR(X) of the graph convolution by:

LX = X - P(Λ(X)(P^T X)), with P = D^{-1/2} φ(X; W_φ)

GR(X) = σ(LXΘ)

where Θ denotes the parameters of the graph convolution and σ is a nonlinear activation function.
3. The method of claim 2, wherein the number of channels M satisfies M ≪ HW.
4. The method of claim 1, wherein the upsampling is implemented by bilinear interpolation.
5. The spatial pyramid graph convolution network implementation method according to claim 1, wherein said downsampling is implemented by a max pooling layer.
CN202010108770.3A 2020-02-21 2020-02-21 Space pyramid graph convolution network implementation method Active CN111340189B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010108770.3A CN111340189B (en) 2020-02-21 2020-02-21 Space pyramid graph convolution network implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010108770.3A CN111340189B (en) 2020-02-21 2020-02-21 Space pyramid graph convolution network implementation method

Publications (2)

Publication Number Publication Date
CN111340189A CN111340189A (en) 2020-06-26
CN111340189B (en) 2023-11-24

Family

ID=71185318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010108770.3A Active CN111340189B (en) 2020-02-21 2020-02-21 Space pyramid graph convolution network implementation method

Country Status (1)

Country Link
CN (1) CN111340189B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101251B (en) * 2020-09-18 2022-06-10 电子科技大学 SAR automatic target recognition method based on variable convolutional neural network
CN113554156B (en) * 2021-09-22 2022-01-11 中国海洋大学 Multitask image processing method based on attention mechanism and deformable convolution
CN116523888B (en) * 2023-05-08 2023-11-03 北京天鼎殊同科技有限公司 Pavement crack detection method, device, equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956532A (en) * 2016-04-25 2016-09-21 大连理工大学 Traffic scene classification method based on multi-scale convolution neural network
CN107392901A (en) * 2017-07-24 2017-11-24 国网山东省电力公司信息通信公司 A kind of method for transmission line part intelligence automatic identification
CN109509192A (en) * 2018-10-18 2019-03-22 天津大学 Merge the semantic segmentation network in Analysis On Multi-scale Features space and semantic space
CN109727249A (en) * 2018-12-10 2019-05-07 南京邮电大学 One of convolutional neural networks semantic image dividing method
CN110533105A (en) * 2019-08-30 2019-12-03 北京市商汤科技开发有限公司 A kind of object detection method and device, electronic equipment and storage medium
CN110674829A (en) * 2019-09-26 2020-01-10 哈尔滨工程大学 Three-dimensional target detection method based on graph convolution attention network


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Maize tassel detection based on a multi-scale feature map convolution method; Wu Jia; Xu Libing; Sun Lixin; Xing Hongyan; Science Technology and Engineering (27); full text *

Also Published As

Publication number Publication date
CN111340189A (en) 2020-06-26

Similar Documents

Publication Publication Date Title
CN111340189B (en) Space pyramid graph convolution network implementation method
CN112750082B (en) Human face super-resolution method and system based on fusion attention mechanism
Lin et al. Hyperspectral image denoising via matrix factorization and deep prior regularization
CN110570351B (en) Image super-resolution reconstruction method based on convolution sparse coding
CN115345866B (en) Building extraction method in remote sensing image, electronic equipment and storage medium
CN113421187B (en) Super-resolution reconstruction method, system, storage medium and equipment
Huang et al. Two-step approach for the restoration of images corrupted by multiplicative noise
CN111899203B (en) Real image generation method based on label graph under unsupervised training and storage medium
CN115019143A (en) Text detection method based on CNN and Transformer mixed model
CN114037888A (en) Joint attention and adaptive NMS (network management System) -based target detection method and system
Peng et al. Building super-resolution image generator for OCR accuracy improvement
Zhuo et al. Ridnet: Recursive information distillation network for color image denoising
CN116246138A (en) Infrared-visible light image target level fusion method based on full convolution neural network
CN115936992A (en) Garbage image super-resolution method and system of lightweight transform
ABAWATEW et al. Attention augmented residual network for tomato disease detection andclassification
Ren et al. Robust low-rank convolution network for image denoising
CN116030495A (en) Low-resolution pedestrian re-identification algorithm based on multiplying power learning
CN116188272B (en) Two-stage depth network image super-resolution reconstruction method suitable for multiple fuzzy cores
CN111275076B (en) Image significance detection method based on feature selection and feature fusion
Xiao et al. Effective PRNU extraction via densely connected hierarchical network
Dong et al. Remote sensing image super-resolution via enhanced back-projection networks
Yu et al. MagConv: Mask-guided convolution for image inpainting
CN116758092A (en) Image segmentation method, device, electronic equipment and storage medium
Jeevan et al. WaveMixSR: Resource-efficient neural network for image super-resolution
CN116309429A (en) Chip defect detection method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant