CN111210432A - Image semantic segmentation method based on multi-scale and multi-level attention mechanism - Google Patents

Image semantic segmentation method based on multi-scale and multi-level attention mechanism Download PDF

Info

Publication number
CN111210432A
CN111210432A
Authority
CN
China
Prior art keywords
image
follows
attention mechanism
feature
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010030667.1A
Other languages
Chinese (zh)
Other versions
CN111210432B (en
Inventor
许海霞
黄云佳
刘用
周维
王帅龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangtan University filed Critical Xiangtan University
Priority to CN202010030667.1A priority Critical patent/CN111210432B/en
Publication of CN111210432A publication Critical patent/CN111210432A/en
Application granted granted Critical
Publication of CN111210432B publication Critical patent/CN111210432B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/60 Rotation of whole images or parts thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20112 Image segmentation details
    • G06T2207/20132 Image cropping
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image semantic segmentation method based on a multi-scale and multi-level attention mechanism. The method comprises the following steps: 1. Preprocess the image and the real label map. 2. Build the neural network structure of the multi-scale attention mechanism model, and extract and fuse image features. 3. Build the neural network structure of the multi-level attention mechanism model, and fuse image features across levels. 4. Train the model: the neural network parameters are trained with the back-propagation algorithm until the network converges. The invention relates to a neural network model for image semantic segmentation, in particular to a unified modeling method that extracts self-attention information from an image at multiple scales and a network structure that fuses image features of different levels, achieving better results in the field of semantic segmentation.

Description

Image semantic segmentation method based on multi-scale and multi-level attention mechanism
Technical Field
The invention belongs to the technical field of computer vision, relates to a deep neural network model for image semantic segmentation, and particularly relates to a method for uniformly modeling image feature data and a method for learning the relevance among pixel points of image features, so as to establish a deep model for image semantic segmentation.
Background
Image semantic segmentation is the automatic segmentation and recognition of image content by a machine. Semantic segmentation of 2D images, video, and even 3D data is a key problem in the field of computer vision. It is a highly difficult task aimed at scene understanding. Scene understanding, as a core problem of computer vision, is particularly important now that the number of applications that extract knowledge from images has increased dramatically. These applications include autonomous driving, human-computer interaction, computational photography, image search engines, and augmented reality. Such problems were solved in the past with a variety of computer vision and machine learning methods. Despite the popularity of those approaches, deep learning has changed the situation, and many computer vision problems, including semantic segmentation, are now being addressed with deep frameworks, typically deep convolutional neural networks, which can significantly improve accuracy and efficiency. Even so, deep learning is still far less mature than other branches of machine learning and computer vision. In view of this, there remains ample research space for semantic segmentation of images under the deep learning framework.
With the rapid development of deep learning in recent years, end-to-end problem modeling with deep Convolutional Neural Networks (CNN) and Fully Convolutional Networks (FCN) has become a mainstream research approach in computer vision. Introducing the idea of end-to-end modeling into the image semantic segmentation algorithm, modeling the feature image end-to-end with a suitable network structure, and directly outputting the predicted semantic map is a problem worthy of deep discussion.
Because the content of images in natural scenes is complex and subjects are diverse, pixel-by-pixel semantic analysis of an image is too laborious and inefficient; finding the relations among pixel points within the feature image is an entry point to several key difficulties of the task.
In summary, introducing attention learning (the relations between pixel points) into an image semantic segmentation method based on end-to-end modeling is necessary and is a direction worthy of in-depth research.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an image semantic segmentation method based on a multi-scale and multi-level attention mechanism.
The technical scheme adopted by the invention for solving the technical problems is as follows:
given an image I, the corresponding real label map Gt constitutes a training set.
Step (1), preprocessing a data set, and extracting the characteristics of image data
Preprocessing an image I: first horizontally flip the image I, randomly scale it, and crop it to a uniform size; then extract features from the image with a full convolutional neural network to obtain the image features If1, If2, If3 and If4.
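The preprocessing just described (random horizontal flip, random rescale, crop to a uniform size) can be sketched in NumPy. The crop size 321 and the scale range [0.5, 2.0] are illustrative assumptions, not values stated in the patent, and nearest-neighbour indexing stands in for a real resampling filter:

```python
import numpy as np

def preprocess(img, crop=321, rng=np.random.default_rng(0)):
    """Hedged sketch of step (1): flip, rescale, crop. `img` is (h, w, 3)."""
    # random horizontal flip with probability 0.5
    if rng.random() < 0.5:
        img = img[:, ::-1]
    # random rescale in [0.5, 2.0], never below the crop size
    s = rng.uniform(0.5, 2.0)
    h, w = img.shape[:2]
    nh, nw = max(int(h * s), crop), max(int(w * s), crop)
    ys = (np.arange(nh) * h // nh).clip(0, h - 1)
    xs = (np.arange(nw) * w // nw).clip(0, w - 1)
    img = img[ys][:, xs]
    # random crop to a uniform size
    top = rng.integers(0, img.shape[0] - crop + 1)
    left = rng.integers(0, img.shape[1] - crop + 1)
    return img[top:top + crop, left:left + crop]
```

Every call yields a tensor of the same uniform size, so batches can be stacked for the feature extractor.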
Step (2), establishing a multi-scale attention mechanism model (MSM) and further extracting characteristics
The input image feature If4 is scaled to different degrees by bilinear interpolation, and finally channel fusion is performed to obtain the feature image If4_att of the specified dimensionality.
Step (3), establishing a multi-stage attention mechanism model (MCM) for feature fusion
The input image features If1, If2 and If4_att are effectively fused by the proposed multi-level attention mechanism model to obtain a feature map IF with strong feature information and good robustness.
Step (4), model training
The input feature maps IF and If2 are used for spatial cross-entropy computation with the real label map Gt to obtain the difference from the true solution, and the model parameters of the full convolutional neural networks defined in steps (2) and (3) are trained with the back-propagation algorithm until the whole network model converges.
The data preprocessing and image feature extraction in step (1) are as follows:
Features are extracted from the image I with an existing full convolutional neural network (FCN), giving the image features If1, If2, If3 and If4, where each If_i ∈ R^(c×h×w); c is the number of channels of the image feature, and h and w are its height and width, respectively.
The multi-scale attention mechanism model (MSM) for image semantic segmentation in step (2) performs feature fusion; the specific formulas are as follows:
2-1. For If4, extract feature information at different scales; the specific formulas are:
x = Conv(If4)    (1)
xs = Attention(bilinear interpolation(x, size(s))), s = 1, 2, 3, 4; size = [48, 32, 16, 8]    (2)
Ys = Concat(bilinear interpolation(xs, 64), If4)    (3)
where Conv is a 1 × 1 convolution that reduces the channel dimension of If4; the bilinear interpolation function refers to scaling (enlarging or shrinking) features by bilinear interpolation; and the Concat function refers to the concatenation operation on features. For an input feature image x, the Attention function is given by:
xquery = Conv(x); xkey = Conv(x); xvalue = Conv(x)    (4)
xattention = Softmax(xquery^t × xkey)    (5)
xcontext = xvalue^t × xattention    (6)
xout = μ × xcontext + x    (7)
where μ denotes a learnable coefficient and the superscript t denotes matrix transposition.
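As a concrete illustration, the Attention function of Eqs. (4)-(7) can be sketched in NumPy on a flattened (channels × pixels) feature map. The weight matrices Wq, Wk and Wv stand in for the three 1 × 1 Conv projections, and the exact transposition convention in Eqs. (5)-(6) is an assumption of this sketch:

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv, mu):
    """Sketch of Eqs. (4)-(7) for x of shape (c, n), n = h*w pixels.
    Wq, Wk, Wv play the role of the three Conv projections; mu is the
    learnable residual coefficient of Eq. (7)."""
    q, k, v = Wq @ x, Wk @ x, Wv @ x      # Eq. (4): query, key, value
    att = softmax(q.T @ k, axis=0)        # Eq. (5): n x n pixel affinity
    context = v @ att                     # Eq. (6), modulo transposition
    return mu * context + x               # Eq. (7): weighted residual
```

With mu learned from zero, the module starts as an identity mapping and gradually mixes in the pixel-to-pixel context.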
2-2. Reduce the dimensionality of the Concat output and extract feature information; the specific formula is:
If4_att = Conv(Ys)    (8)
where Conv is a 1 × 1 convolution that reduces the channel dimension of Ys.
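A minimal NumPy sketch of the MSM pipeline (Eqs. 1-3 and 8) may help. Nearest-neighbour resizing stands in for bilinear interpolation, `attention`, `conv_reduce` and `conv_out` are caller-supplied stand-ins for the modules above, and the assumption that If4 is 64 × 64 (suggested by the size list [48, 32, 16, 8] with upsampling to 64) is this sketch's, not the patent's:

```python
import numpy as np

def resize(x, size):
    """Nearest-neighbour stand-in for bilinear interpolation on (c, h, w)."""
    c, h, w = x.shape
    ys = (np.arange(size) * h // size).clip(0, h - 1)
    xs = (np.arange(size) * w // size).clip(0, w - 1)
    return x[:, ys][:, :, xs]

def msm(i_f4, attention, conv_reduce, conv_out, sizes=(48, 32, 16, 8)):
    """Sketch of the multi-scale attention model: reduce channels,
    attend at each scale, upsample back, concatenate with the input,
    and fuse with a final 1x1-convolution stand-in."""
    x = conv_reduce(i_f4)                                       # Eq. (1)
    ys = [resize(attention(resize(x, s)), 64) for s in sizes]   # Eq. (2)
    y = np.concatenate(ys + [i_f4], axis=0)                     # Eq. (3)
    return conv_out(y)                                          # Eq. (8)
```

With identity stand-ins the scaffold already shows the shape bookkeeping: four attended scales plus the original feature are stacked on the channel axis before the final reduction.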
the multi-stage attention mechanism model (MCM) for image semantic segmentation in the step (3) specifically comprises the following steps:
firstly, a multi-level attention mechanism model for image semantic segmentation is described, and the model is specifically realized as follows:
inputting low-order characteristic image x to multi-stage attention mechanism modellAnd higher order feature image xhThe concrete formula is as follows:
3-1. Unify the dimensionality and size of the two input feature maps:
xl=Conv(xl) (9)
xh=bilinear interpolation(xh,size(xl)) (10)
where the Conv function is a 1 × 1 convolution that reduces the channel dimension of xl; the bilinear interpolation function enlarges xh by bilinear interpolation to the same size as xl.
3-2. Concatenate and normalize the two feature images of unified dimensionality to obtain the attention information:
xlh=Concat(xl,xh) (11)
xatt=Softmax(Normalize(GAP(xlh))) (12)
where GAP denotes global average pooling, and the Softmax formula is:
Softmax(xi) = exp(xi) / Σj exp(xj)    (13)
3-3. Apply the Hadamard product to the attention information image and the low-level feature image; the specific formula is:
fa = xatt ⊙ xl    (14)
3-4. Sum the Hadamard product output with the high-level feature image; the specific formula is:
Fa = fa + xh    (15)
Then If4_att, If2 and If1 are fed into the multi-level attention mechanism model in turn; the specific formulas are:
IF=MCM(If4_att,If2) (16)
IF=MCM(IF,If1) (17)
wherein the MCM function refers to a multi-stage attention mechanism model.
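One MCM step (Eqs. 9-15) can be sketched in NumPy as follows. The `conv` argument stands in for the 1 × 1 channel-reduction convolution, nearest-neighbour resizing replaces bilinear interpolation, and slicing the pooled attention vector back to xl's channel count is an assumption of this sketch, since the patent does not spell out how the 2c-channel attention vector maps onto the c-channel low-level feature:

```python
import numpy as np

def resize(x, hw):
    """Nearest-neighbour stand-in for bilinear interpolation (Eq. 10)."""
    c, h, w = x.shape
    ys = (np.arange(hw[0]) * h // hw[0]).clip(0, h - 1)
    xs = (np.arange(hw[1]) * w // hw[1]).clip(0, w - 1)
    return x[:, ys][:, :, xs]

def mcm(x_l, x_h, conv=lambda t: t):
    """Sketch of one multi-level attention step on (c, h, w) features."""
    x_l = conv(x_l)                                  # Eq. (9): channel reduction
    x_h = resize(x_h, x_l.shape[1:])                 # Eq. (10): match sizes
    x_lh = np.concatenate([x_l, x_h], axis=0)        # Eq. (11): concatenate
    gap = x_lh.mean(axis=(1, 2))                     # GAP in Eq. (12)
    e = np.exp(gap - gap.max())
    x_att = e / e.sum()                              # Softmax, Eqs. (12)-(13)
    f_a = x_att[: x_l.shape[0], None, None] * x_l    # Hadamard product, Eq. (14)
    return f_a + x_h                                 # Eq. (15): add high level
```

Calling it twice, as in Eqs. (16)-(17), chains If4_att with If2 and the result with If1.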
The training model in the step (4) is as follows:
the predicted image I generated in the step (3) is processedFThe characteristic image I generated in the step (1)f3And inputting the real tag graph Gt into a defined Loss function CrossEntropyLoss to obtain a Loss value Loss, wherein the Loss value Loss is specifically disclosed as follows:
Loss=CrossEntropyLoss(IF,If3,Gt) (18)
where CrossEntropyLoss is computed as:
L1 = −(1/B) Σ_{b=1..B} Σ_{c=1..C} Gt_{b,c} log(Softmax(IF)_{b,c})    (19)
L2 = −(1/B) Σ_{b=1..B} Σ_{c=1..C} Gt_{b,c} log(Softmax(If3)_{b,c})    (20)
Loss = L1 + λ × L2    (21)
where B is the number of images input to the neural network, C is the number of channels of the feature images, and λ is the weight between the two loss terms.
The parameters of the network are adjusted with the back-propagation algorithm according to the computed loss value Loss.
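The loss above can be sketched as pixel-wise cross entropy over class logits; the explicit per-pixel form and the value λ = 0.4 are illustrative assumptions of this sketch, not values stated in the patent:

```python
import numpy as np

def spatial_ce(logits, target):
    """Pixel-wise cross entropy, a sketch of L1/L2 in Eqs. (19)-(20).
    logits: (C, h, w) class scores; target: (h, w) integer labels."""
    z = logits - logits.max(axis=0, keepdims=True)          # stability shift
    logp = z - np.log(np.exp(z).sum(axis=0, keepdims=True)) # log-softmax
    h, w = target.shape
    # pick the log-probability of the true class at every pixel
    return -logp[target, np.arange(h)[:, None], np.arange(w)].mean()

def total_loss(i_F, i_f3, gt, lam=0.4):
    """Eq. (21): main loss on I_F plus lambda-weighted auxiliary loss on I_f3."""
    return spatial_ce(i_F, gt) + lam * spatial_ce(i_f3, gt)
```

The auxiliary term on If3 gives the backbone a direct gradient signal, a common deep-supervision design choice.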
The invention has the following beneficial effects:
compared with other methods, the method provided by the invention has relatively better performance in precision aiming at the problem of image semantic segmentation: firstly, the parameter quantity of the model is greatly reduced, the overfitting of the model is effectively prevented, and the training time of the model is reduced; second, it is simpler and easier to implement than other models. According to the invention, an attention mechanism is introduced into the end-to-end-based full convolution neural network, and image features are extracted at multiple scales and multiple levels, so that a better effect in an image semantic segmentation task is obtained.
Drawings
Fig. 1 is a general structural view of the present invention.
FIG. 2 is a multi-scale attention mechanism model of the present invention.
FIG. 3 is a multi-stage attention mechanism model of the present invention.
Fig. 4 is a visualization result of the model experiment of the present invention.
Detailed Description
In order to make the purpose and technical solution of the present invention more clearly understood, the following detailed description is made with reference to the accompanying drawings and examples, and the application principle of the present invention is described in detail.
As shown in fig. 1, fig. 2 and fig. 3, the present invention provides a deep neural network structure for image semantic segmentation; the specific steps are as follows:
the data preprocessing and the feature extraction of the image in the step (1) are specifically as follows:
the Pascal VOC2012 data set is used here as training and testing data.
For image data, the existing 101-layer deep residual network (ResNet-101) model is used to extract image features. Specifically, the image data is uniformly scaled to 513 × 513 and input into the deep residual network; the output of the res2c layer is extracted as image feature If1, the output of the res3c layer as image feature If2, the output of the res4c layer as image feature If3, and the output of the res5c layer as image feature If4.
The multi-scale attention mechanism model (MSM) in the step (2) fuses image features, and the method specifically comprises the following steps:
2-1. Extract feature information from If4 at different scales. First, a convolution operation reduces If4 to 512 channels.
2-2. Apply bilinear interpolation to the dimension-reduced output to obtain feature images xs of sizes 48, 32, 16 and 8, respectively.
2-3. Apply the Attention operation to the feature images at the 4 scales to extract the relevance among pixel points, then upsample the Attention outputs by bilinear interpolation. The Attention operation is given by the following formulas:
for the Attention function input feature image x, the specific formula is as follows:
xquery=Conv(x);xkey=Conv(x);xvalue=Conv(x) (22)
Figure BDA0002363573340000061
xcontext=xt value×xattention(24)
xout=μ×xcontext+x (25)
wherein μ denotes a learnable coefficient and
Figure BDA0002363573340000062
xtrefers to matrix transposition.
2-4. Concatenate the outputs of the 4 multi-scale attentions with If4 and reduce the dimensionality to obtain the feature image If4_att with attention information.
The relevant operation of the multi-scale attention mechanism model is completed.
Fusing the image characteristics by the multi-stage attention mechanism model (MCM) in the step (3), which comprises the following specific steps:
3-1. Unify the input features If4_att and If2 in dimensionality and scale.
3-2. Concatenate the two feature images of unified dimensionality, then apply global average pooling, regularization and normalization in turn to the concatenated output to obtain the feature image xatt with attention information.
3-3. Apply the Hadamard product to the attention information image xatt and the low-level feature image If2 to obtain fa.
3-4. Sum the Hadamard product output fa with the high-level feature image to obtain Fa.
3-5. Take If1 as the low-level feature image and Fa as the high-level feature image, repeat operations 3-1 to 3-4, and obtain the final output feature image IF.
Thus, the multi-stage attention mechanism model operation is completed.
The training model in the step (4) is specifically as follows:
for the prediction characteristic image generated in the step (3)
Figure BDA0002363573340000063
And the characteristic image generated in the step (1)
Figure BDA0002363573340000064
An upsample operation is performed to the original size 513 × 513 and the dimensions are reduced to the number of classes of the Pascal VOC2012 data set by a convolution operation (21). Comparing the loss value with a real tag graph Gt of a data set, calculating to obtain the difference between a predicted value and an actual correct value through a defined loss function Cross EntropyLoss and forming a loss value, and then adjusting the parameter value of the whole network by using a Back-Propagation (BP) algorithm according to the loss value until the network converges.
The following table shows the accuracy of the method of the invention on Pascal VOC 2012. "Ours" is the depth model proposed by the invention; "aero", "bike", etc. denote the class objects to be semantically segmented in the data set, and mIoU denotes the mean accuracy over all classes on the semantic segmentation task.
[Table: per-class accuracy and mIoU on Pascal VOC 2012]

Claims (4)

1. An image semantic segmentation method based on a multi-scale and multi-level attention mechanism is characterized by comprising the following steps:
given an image I, the corresponding real label map Gt, constitutes a training set:
step (1): data set preprocessing, feature extraction of image data
Preprocessing an image I: first horizontally flip the image I, randomly scale it, and crop it to a uniform size; then extract features from the image with a full convolutional neural network to obtain the image features If1, If2, If3 and If4.
Step (2): establishing a multi-scale attention mechanism model (MSM) and further extracting characteristics
The input image feature If4 is scaled to different degrees by bilinear interpolation, and finally channel fusion is performed to obtain the image feature If4_att of the specified dimensionality.
And (3): establishing a multi-level attention mechanism model (MCM) for feature fusion
The input image features If1, If2 and If4_att are effectively fused by the proposed multi-level attention mechanism model to obtain a feature map IF with strong feature information and good robustness.
And (4): model training
The input feature maps IF and If2 are used for spatial cross-entropy computation with the real label map Gt to obtain the difference from the true solution, and the model parameters of the full convolutional neural networks defined in steps (2) and (3) are trained with the back-propagation algorithm until the whole network model converges.
2. The image semantic segmentation method based on the multi-scale and multi-level attention mechanism according to claim 1, characterized in that the image preprocessing of step (1) and the feature fusion of the multi-scale attention mechanism model (MSM) of step (2) are as follows:
2-1. Features are extracted from the image I with an existing full convolutional neural network (FCN), giving the image features If1, If2, If3 and If4, where each If_i ∈ R^(c×h×w); c is the number of channels of the image feature, and h and w are its height and width, respectively.
2-2. Extract feature information from If4 at different scales; the specific formulas are:
x = Conv(If4)    (1)
xs = Attention(bilinear interpolation(x, size(s))), s = 1, 2, 3, 4; size = [48, 32, 16, 8]    (2)
Ys = Concat(bilinear interpolation(xs, 64), If4)    (3)
where Conv is a 1 × 1 convolution that reduces the channel dimension of If4; the bilinear interpolation function refers to scaling features by bilinear interpolation; and the Concat function refers to the concatenation operation on feature images. For an input feature image x, the Attention function is given by:
xquery = Conv(x); xkey = Conv(x); xvalue = Conv(x)    (4)
xattention = Softmax(xquery^t × xkey)    (5)
xcontext = xvalue^t × xattention    (6)
xout = μ × xcontext + x    (7)
where μ denotes a learnable coefficient and the superscript t denotes matrix transposition.
2-3, reducing the dimension of the Concat output result, and extracting characteristic information, wherein the specific formula is as follows:
If4_att=Conv(Ys) (8)
where Conv is a 1 × 1 convolution that reduces the channel dimension of Ys.
3. The image semantic segmentation method based on the multi-scale multi-stage attention mechanism as claimed in claim 1, wherein the multi-stage attention mechanism model (MCM) for image semantic segmentation in step (3) is specifically as follows:
First, the specific implementation of the multi-level attention mechanism model for image semantic segmentation is described as follows. The model takes as input a low-level feature image xl and a high-level feature image xh; the specific formulas are:
3-1. Unify the dimensionality and size of the two input feature maps:
xl=Conv(xl) (9)
xh=bilinear interpolation(xh,size(xl)) (10)
where the Conv function is a 1 × 1 convolution that reduces the channel dimension of xl; the bilinear interpolation function enlarges xh by bilinear interpolation to the same size as xl.
3-2. Concatenate and normalize the two feature images of unified dimensionality to obtain the attention information:
xlh=Concat(xl,xh) (11)
xatt=Softmax(Normalize(GAP(xlh))) (12)
where GAP denotes global average pooling, and the Softmax formula is:
Softmax(xi) = exp(xi) / Σj exp(xj)    (13)
3-3. Apply the Hadamard product to the attention information image xatt and the low-level feature image xl; the specific formula is:
fa = xatt ⊙ xl    (14)
3-4. Sum the Hadamard product output with the high-level feature image xh; the specific formula is:
Fa = fa + xh    (15)
Then If4_att, If2 and If1 are fed into the multi-level attention mechanism model (MCM) in turn; the specific formulas are:
IF=MCM(If4_att,If2) (16)
IF=MCM(IF,If1) (17)
wherein the MCM function refers to a multi-stage attention mechanism model.
4. The image semantic segmentation method based on the multi-scale and multi-level attention mechanism according to claim 1, wherein the training model in the step (4) is as follows:
the predicted image I generated in the step (3) is processedFThe characteristic image I generated in the step (1)f3And inputting the real tag graph Gt into a defined Loss function CrossEntropyLoss to obtain a Loss value Loss, wherein the Loss value Loss is specifically disclosed as follows:
Loss=CrossEntropyLoss(IF,If3,Gt) (18)
where CrossEntropyLoss is computed as:
L1 = −(1/B) Σ_{b=1..B} Σ_{c=1..C} Gt_{b,c} log(Softmax(IF)_{b,c})    (19)
L2 = −(1/B) Σ_{b=1..B} Σ_{c=1..C} Gt_{b,c} log(Softmax(If3)_{b,c})    (20)
Loss = L1 + λ × L2    (21)
where B is the number of images input to the neural network, C is the number of channels of the feature images, and λ is the weight between the two loss terms.
The parameters of the network are adjusted with the back-propagation algorithm according to the computed loss value Loss.
CN202010030667.1A 2020-01-12 2020-01-12 Image semantic segmentation method based on multi-scale multi-level attention mechanism Active CN111210432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010030667.1A CN111210432B (en) 2020-01-12 2020-01-12 Image semantic segmentation method based on multi-scale multi-level attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010030667.1A CN111210432B (en) 2020-01-12 2020-01-12 Image semantic segmentation method based on multi-scale multi-level attention mechanism

Publications (2)

Publication Number Publication Date
CN111210432A true CN111210432A (en) 2020-05-29
CN111210432B CN111210432B (en) 2023-07-25

Family

ID=70786703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010030667.1A Active CN111210432B (en) 2020-01-12 2020-01-12 Image semantic segmentation method based on multi-scale multi-level attention mechanism

Country Status (1)

Country Link
CN (1) CN111210432B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667495A (en) * 2020-06-08 2020-09-15 北京环境特性研究所 Image scene analysis method and device
CN111860517A (en) * 2020-06-28 2020-10-30 广东石油化工学院 Semantic segmentation method under small sample based on decentralized attention network
CN112233129A (en) * 2020-10-20 2021-01-15 湘潭大学 Deep learning-based parallel multi-scale attention mechanism semantic segmentation method and device
CN112465828A (en) * 2020-12-15 2021-03-09 首都师范大学 Image semantic segmentation method and device, electronic equipment and storage medium
WO2022227913A1 (en) * 2021-04-25 2022-11-03 浙江师范大学 Double-feature fusion semantic segmentation system and method based on internet of things perception

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018153322A1 (en) * 2017-02-23 2018-08-30 北京市商汤科技开发有限公司 Key point detection method, neural network training method, apparatus and electronic device
CN110163878A (en) * 2019-05-28 2019-08-23 四川智盈科技有限公司 A kind of image, semantic dividing method based on dual multiple dimensioned attention mechanism
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018153322A1 (en) * 2017-02-23 2018-08-30 北京市商汤科技开发有限公司 Key point detection method, neural network training method, apparatus and electronic device
CN110163878A (en) * 2019-05-28 2019-08-23 四川智盈科技有限公司 A kind of image, semantic dividing method based on dual multiple dimensioned attention mechanism
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张东波; 易良玲; 许海霞; 张莹: "Multi-scale local structure dominant binary pattern learning for image representation" *
赵斐: "Semantic segmentation of remote sensing images based on a pyramid attention mechanism" *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667495A (en) * 2020-06-08 2020-09-15 北京环境特性研究所 Image scene analysis method and device
CN111860517A (en) * 2020-06-28 2020-10-30 广东石油化工学院 Semantic segmentation method under small sample based on decentralized attention network
CN111860517B (en) * 2020-06-28 2023-07-25 广东石油化工学院 Semantic segmentation method under small sample based on distraction network
CN112233129A (en) * 2020-10-20 2021-01-15 湘潭大学 Deep learning-based parallel multi-scale attention mechanism semantic segmentation method and device
CN112465828A (en) * 2020-12-15 2021-03-09 首都师范大学 Image semantic segmentation method and device, electronic equipment and storage medium
CN112465828B (en) * 2020-12-15 2024-05-31 益升益恒(北京)医学技术股份公司 Image semantic segmentation method and device, electronic equipment and storage medium
WO2022227913A1 (en) * 2021-04-25 2022-11-03 浙江师范大学 Double-feature fusion semantic segmentation system and method based on internet of things perception

Also Published As

Publication number Publication date
CN111210432B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN111210432A (en) Image semantic segmentation method based on multi-scale and multi-level attention mechanism
CN111858954B (en) Task-oriented text-generated image network model
Zhang et al. Weakly supervised semantic segmentation for large-scale point cloud
CN111079532B (en) Video content description method based on text self-encoder
US11328172B2 (en) Method for fine-grained sketch-based scene image retrieval
JP7291183B2 (en) Methods, apparatus, devices, media, and program products for training models
CN111242844B (en) Image processing method, device, server and storage medium
CN111340814A (en) Multi-mode adaptive convolution-based RGB-D image semantic segmentation method
US20220270384A1 (en) Method for training adversarial network model, method for building character library, electronic device, and storage medium
CN112990116A (en) Behavior recognition device and method based on multi-attention mechanism fusion and storage medium
Kakillioglu et al. 3D capsule networks for object classification with weight pruning
WO2023173552A1 (en) Establishment method for target detection model, application method for target detection model, and device, apparatus and medium
CN110633706B (en) Semantic segmentation method based on pyramid network
Yang et al. Xception-based general forensic method on small-size images
CN115482387A (en) Weak supervision image semantic segmentation method and system based on multi-scale class prototype
CN114333062A (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN110110775A (en) A kind of matching cost calculation method based on hyper linking network
CN113837290A (en) Unsupervised unpaired image translation method based on attention generator network
CN117036699A (en) Point cloud segmentation method based on Transformer neural network
CN116861022A (en) Image retrieval method based on combination of deep convolutional neural network and local sensitive hash algorithm
CN116778164A (en) Semantic segmentation method for improving deep V < 3+ > network based on multi-scale structure
CN116485892A (en) Six-degree-of-freedom pose estimation method for weak texture object
EP4170547A1 (en) Method for extracting data features, and related apparatus
CN114722902A (en) Unmarked video Hash retrieval method and device based on self-supervision learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant