CN113240683A - Attention mechanism-based lightweight semantic segmentation model construction method - Google Patents

Attention mechanism-based lightweight semantic segmentation model construction method Download PDF

Info

Publication number
CN113240683A
CN113240683A (Application CN202110638043.2A)
Authority
CN
China
Prior art keywords
attention
semantic segmentation
stage
path
construction method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110638043.2A
Other languages
Chinese (zh)
Other versions
CN113240683B (en)
Inventor
张霖
杨源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202110638043.2A priority Critical patent/CN113240683B/en
Publication of CN113240683A publication Critical patent/CN113240683A/en
Application granted granted Critical
Publication of CN113240683B publication Critical patent/CN113240683B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30061Lung
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight semantic segmentation model construction method based on an attention mechanism, applied to the technical field of image processing. Given an image I, the corresponding ground-truth label map GT forms a training set with it: step 1, establish the model; step 2, train the model; step 3, test the model, namely input the test set images into the trained network model to obtain the test results. The invention improves both image segmentation accuracy and segmentation speed; the segmentation process is not prone to overfitting; the method is efficient and convenient for practical deployment; and when annotation data are scarce, the model can be trained quickly to further improve performance.

Description

Attention mechanism-based lightweight semantic segmentation model construction method
Technical Field
The invention relates to the technical field of image processing, in particular to a lightweight semantic segmentation model construction method based on an attention mechanism.
Background
Image segmentation refers to the computer vision task of labeling designated regions according to the content of an image; specifically, the purpose of image semantic segmentation is to label every pixel in an image and associate each pixel with its corresponding class. It has important practical application value in scene understanding, medical imaging, autonomous driving, and other areas.
The classical semantic segmentation model comprises:
the fully convolutional network (FCN) is a classic semantic segmentation network in deep learning. It draws on the traditional classification network structure but, unlike a traditional classification network, converts the fully connected layers into convolutional layers. It then up-samples through deconvolution (transposed convolution), gradually restoring the detail information of the image and enlarging the feature map. In restoring image detail, the FCN relies on learnable deconvolution on the one hand and, on the other, adopts skip connections to fuse the feature information obtained during down-sampling with the corresponding feature maps during up-sampling. However, the FCN has technical drawbacks such as loss of semantic information and a lack of modeling of the correlation between pixels.
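The learnable deconvolution and skip-connection fusion described above can be sketched in PyTorch as follows (shapes and the class count of 21 are illustrative assumptions, not part of the FCN specification):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: a coarse score map from the deep layers and a
# shallower encoder feature map already projected to the same class count.
num_classes = 21
deep = torch.randn(1, num_classes, 16, 16)     # coarse, low-resolution scores
shallow = torch.randn(1, num_classes, 32, 32)  # skip feature from the encoder

# Learnable deconvolution (transposed convolution) doubles the resolution.
deconv = nn.ConvTranspose2d(num_classes, num_classes,
                            kernel_size=4, stride=2, padding=1)
up = deconv(deep)                              # -> (1, 21, 32, 32)

# Skip connection: element-wise sum with the corresponding encoder feature map.
fused = up + shallow
print(tuple(fused.shape))
```

The element-wise sum is what lets the up-sampling path recover detail lost during down-sampling.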
SegNet adopts the FCN encoding-decoding architecture but, unlike FCN, does not use skip connections, and during up-sampling uses an unpooling operation instead of deconvolution. The pooling indices stored during down-sampling are used in the decoder to perform the unpooling operation on the corresponding feature map. This preserves the integrity of high-frequency information, but when unpooling is performed on a low-resolution feature map, the information between neighboring pixels is ignored.
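The unpooling-with-stored-indices mechanism can be demonstrated directly (illustrative shapes); note how every position that did not hold a maximum comes back as zero, which is the neighbor-information loss noted above:

```python
import torch
import torch.nn.functional as F

# Max-pool while recording the argmax indices, SegNet-style.
x = torch.randn(1, 3, 8, 8)
pooled, indices = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)

# Unpooling places each value back at its recorded position; all other
# positions are filled with zeros, discarding neighbor information.
unpooled = F.max_unpool2d(pooled, indices, kernel_size=2, stride=2)
print(tuple(unpooled.shape))
```

The maxima themselves are restored exactly, which is why high-frequency (edge) information survives the round trip.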
The DeepLab series is a family of semantic segmentation network models designed by the Google team, which adopts dilated (atrous) convolution and CRF post-processing. Dilated convolution expands the receptive field without increasing the number of parameters, and the CRF post-processing further improves the accuracy of semantic segmentation. DeepLabv2 adds an ASPP (atrous spatial pyramid pooling) module on the basis of v1.
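The claim that dilation enlarges the receptive field without adding parameters can be checked with simple kernel-span arithmetic (a small illustrative helper, not from the patent):

```python
def dilated_kernel_span(kernel_size: int, dilation: int) -> int:
    """Spatial extent (per side) covered by one dilated convolution kernel."""
    return dilation * (kernel_size - 1) + 1

# A 3x3 kernel with dilation 1, 2, 4 covers 3, 5, 9 pixels per side,
# while the parameter count (9 weights) stays the same -- this is how
# dilated convolution enlarges the receptive field without extra parameters.
print([dilated_kernel_span(3, d) for d in (1, 2, 4)])  # [3, 5, 9]
```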
PSPNet, the Pyramid Scene Parsing Network, uses a pyramid pooling module to fuse the context information of the image, focusing on the relevance between pixels. After a pre-trained model extracts the features, the pyramid pooling module extracts the context information of the image; the context information is stacked with the extracted features, and the final output is obtained through up-sampling. The feature-stacking step fuses detail features with global features: detail features are shallow features, i.e. features extracted by the shallow layers of the network, while global features are deep features, i.e. contextual features extracted by the deep layers.
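A pyramid pooling module of the kind described can be sketched as follows (the bin sizes (1, 2, 3, 6) and channel counts are illustrative assumptions, not the patent's configuration):

```python
import torch
import torch.nn.functional as F

def pyramid_pool(x, bins=(1, 2, 3, 6)):
    """Pool the feature map at several scales, upsample, and stack along channels."""
    h, w = x.shape[2:]
    branches = [x]
    for b in bins:
        pooled = F.adaptive_avg_pool2d(x, b)   # context at a b x b grid
        branches.append(F.interpolate(pooled, size=(h, w), mode="bilinear",
                                      align_corners=False))
    return torch.cat(branches, dim=1)          # stack context with the features

feat = torch.randn(1, 8, 24, 24)
out = pyramid_pool(feat)
print(tuple(out.shape))   # original 8 channels + 4 context branches of 8 each
```

A 1x1 convolution would normally follow the concatenation to compress the stacked channels before up-sampling to the prediction.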
These network models, however, have many layers and large parameter counts. With the development of technology and the continuous improvement of hardware, pixel-level segmentation has become a mainstream direction.
Therefore, introducing a lightweight model into semantic segmentation, and providing an attention-mechanism-based lightweight semantic segmentation model construction method that improves both image segmentation accuracy and segmentation speed, is an urgent technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a lightweight semantic segmentation model construction method based on an attention mechanism, which is used for improving the image segmentation accuracy and the segmentation speed.
In order to achieve the purpose, the invention adopts the following technical scheme:
the attention mechanism-based lightweight semantic segmentation model construction method comprises the following steps:
given an image I, a corresponding real label graph GT forms a training set:
step 1, establishing a model, namely constructing a coding stage using AHSP modules, Channel Attention Sum, Criss-Cross Attention Sum, Channel Split, and Concat, constructing a decoding stage using FFM, Channel Attention Sum, Criss-Cross Attention Sum, the ReLU function, and Final Prediction, and connecting the coding stage and the decoding stage through the Channel Attention Sum to obtain an ultra-lightweight semantic segmentation network based on the attention mechanism;
step 2, model training, namely inputting the training set images I into the attention-mechanism ultra-lightweight semantic segmentation network to obtain predicted images, comparing the predicted images with the ground-truth label maps GT, and computing the cross-entropy function as the loss function to measure the error between the predicted and true values; iterative optimization training of the network model parameters defined in step 1 is performed through the back-propagation algorithm until the whole model converges;
and 3, testing the model, namely inputting the test set image into the trained network model to obtain a test result.
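The training procedure of step 2 — forward pass, pixel-wise cross-entropy against GT, and back-propagation until convergence — can be sketched generically. The tiny stand-in network, shapes, and learning rate below are illustrative assumptions, not the patent's attention-based model:

```python
import torch
import torch.nn as nn

# Stand-in for the segmentation network: any model mapping an image to
# per-pixel class scores would slot in here.
model = nn.Conv2d(3, 2, kernel_size=1)          # 3-channel image -> 2 classes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()               # pixel-wise cross-entropy

image = torch.randn(4, 3, 16, 16)               # training batch I
gt = torch.randint(0, 2, (4, 16, 16))           # ground-truth label maps GT

for _ in range(5):                              # iterate until convergence in practice
    optimizer.zero_grad()
    pred = model(image)                         # predicted score maps
    loss = criterion(pred, gt)                  # error between prediction and GT
    loss.backward()                             # back-propagation
    optimizer.step()

print(loss.item())
```

`CrossEntropyLoss` takes the raw score map of shape (N, C, H, W) and integer labels of shape (N, H, W), matching the comparison of predicted image and GT described above.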
Preferably, in step 1, the coding network includes n stages; with the AHSP module as the basic module, Criss-Cross Attention Sum, Channel Split, and Concat fuse Split are introduced to construct a first path and a second path that are connected with each other; the training set image I is down-sampled n times, and the feature map output at each stage is 1/2, 1/4, ..., 1/2^n of the original size.
Preferably, the first path includes k AHSP modules, and the transfer function of the k-th module at the i-th stage of the first path and its output are each given by a formula rendered as an image in the original [formula images not reproduced], where i ∈ {1, 2, 3, ..., n} and k ∈ {1};
the second path includes j AHSP modules, and the transfer function of the j-th module at the i-th stage of the second path and its output are likewise given as formula images, where i ∈ {1, 2, 3, ..., n}, j ∈ {1, 2}, and C_i is the number of feature channels in the i-th stage.
Preferably, the calculation formulas of the output feature map of the first AHSP module of the first path and of the second path at each stage are given by formulas (1)-(3) [formula images not reproduced], where i ∈ {1, 2, 3, ..., n}, the two operators shown as images denote down-sampling with a stride of 2, F_{1×1}(·) is a convolution function with a 1×1 kernel, and Split(·) divides the received feature map into two parts along the channel dimension and sends them into the first path and the second path respectively, yielding the first-path and second-path feature information.
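The Channel Split and Concat operations used to exchange information between the two paths can be illustrated in isolation (PyTorch; the channel count of 16 and the even split are illustrative assumptions):

```python
import torch

# Channel Split: divide a feature map into two halves along the channel
# dimension, one half per path.
x = torch.randn(1, 16, 32, 32)
path1, path2 = torch.split(x, 8, dim=1)         # two 8-channel paths

# ... each path would pass through its own AHSP modules here ...

# Concat fuse: merge the two paths back along the channel dimension.
merged = torch.cat([path1, path2], dim=1)
print(tuple(merged.shape))
```

With no per-path processing in between, the split followed by concat is lossless, which is what makes it a cheap way to route channels through parallel paths.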
Preferably, the calculation formula of the output feature map of the 2nd AHSP module of the second path at each stage is given by formula (4) [formula image not reproduced], where i ∈ {1, 2, 3, ..., n}.
Preferably, in step 1, the decoding network includes n stages; based on the FFM module, a Channel Attention Sum and a Criss-Cross Attention Sum are introduced to form the decoding network, and the ReLU function is introduced as the activation function for the final output prediction result.
Preferably, the transfer function of the FFM module is D_i(·), and the output feature map is expressed by a formula rendered as an image in the original [formula image not reproduced], where i ∈ {1, 2, 3, ..., n};

S′_i = F_{1×1}(X)  (5)

[formula image (6) not reproduced]

where S′_i is the output of applying the 1×1 convolution function to the down-sampled final output X, F_{1×1}(·) is a convolution function with a 1×1 kernel, the operator shown as an image is a separable convolution transfer function with a 3×3 kernel, and BatchNorm(·) is a batch normalization function.
Preferably, the feature map output obtained through the encoding stage is given by formula (7) [formula image not reproduced]; then D_i is computed as follows:

S″_i = D_i(Upsample(CAM(D_{i+1}), 2))  (8)

[formula image (9) not reproduced]

where Upsample(·, t) denotes sampling the feature map by a factor of t using bilinear interpolation, CAM(·) denotes applying the channel attention mechanism, and S″_i is the next-stage feature map D after the CAM, up-sampling, and FFM operations.
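The CAM(·) and Upsample(·, 2) operations of formula (8) can be sketched as follows; since the patent does not spell out CAM's internals at this point, a squeeze-and-excitation-style channel gate is assumed:

```python
import torch
import torch.nn.functional as F

def cam(x):
    """Assumed channel attention: global average pool -> sigmoid gate per channel."""
    weights = torch.sigmoid(F.adaptive_avg_pool2d(x, 1))  # shape (N, C, 1, 1)
    return x * weights                                    # re-weight the channels

d_next = torch.randn(1, 8, 16, 16)                        # D_{i+1} from the next stage
attended = cam(d_next)                                    # CAM(D_{i+1})
up = F.interpolate(attended, scale_factor=2, mode="bilinear",
                   align_corners=False)                   # Upsample(., 2), bilinear
# The FFM transfer function D_i(.) would then be applied to `up`, cf. formula (8).
print(tuple(up.shape))
```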
Preferably, the test result P_i is obtained from D_i through a 1×1 convolution:

P_i = Softmax(Upsample(F_{1×1}(D_i), 2^i))  (10)

where P_i ∈ R^{H×W} is the predicted class label map, Softmax(·) is the activation function, and i ∈ {1, 2, 3, ..., n}.
Compared with the prior art, the lightweight semantic segmentation model construction method based on an attention mechanism provided by the invention achieves the following: the image segmentation accuracy and segmentation speed are improved; the segmentation process is not prone to overfitting; the method is efficient and convenient for practical deployment; and when annotation data are scarce, the model can be trained quickly to further improve performance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a diagram of an ultra lightweight semantic segmentation network architecture based on an attention mechanism according to the present invention;
FIG. 2 is a block diagram of an FFM module of the present invention;
FIG. 3 shows images of an embodiment of the present invention, wherein 3.1 is the CT image, 3.2 is the prediction map, and 3.3 is the ground-truth label map.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the invention discloses a lightweight semantic segmentation model construction method based on an attention mechanism, which comprises the following steps:
given an image I, a corresponding real label graph GT forms a training set:
step 1, establishing a model, namely constructing a coding stage using AHSP (attentive hierarchical spatial pyramid) modules, Channel Attention Sum, Criss-Cross Attention Sum, Channel Split, and Concat fuse Split, constructing a decoding stage using the FFM (feature fusion module), Channel Attention Sum, Criss-Cross Attention Sum, the ReLU function, and Final Prediction, and connecting the coding stage and the decoding stage through the Channel Attention Sum to obtain an ultra-lightweight semantic segmentation network based on the attention mechanism;
step 2, model training, namely inputting the training set images I into the attention-mechanism ultra-lightweight semantic segmentation network to obtain predicted images, comparing the predicted images with the ground-truth label maps GT, and computing the cross-entropy function as the loss function to measure the error between the predicted and true values; iterative optimization training of the network model parameters defined in step 1 is performed through the back-propagation algorithm until the whole model converges;
and 3, testing the model, namely inputting the test set image into the trained network model to obtain a test result.
In one embodiment, the Channel Attention Sum, the Criss-Cross Attention Sum, and the Concat fuse Split are each denoted in the network architecture by a symbol rendered as an image in the original [symbol images not reproduced].
In a specific embodiment, in step 1, the encoding network comprises n stages; a first path and a second path connected with each other are constructed by introducing Criss-Cross Attention Sum, Channel Split, and Concat fuse Split, with the AHSP module as the basic module; the training set image I is down-sampled n times, and the feature map output at each stage is 1/2, 1/4, ..., 1/2^n of the original size.
In one embodiment, the first path includes k AHSP modules, and the transfer function of the k-th module at the i-th stage of the first path and its output are each given by a formula rendered as an image in the original [formula images not reproduced], where i ∈ {1, 2, 3, ..., n} and k ∈ {1};
the second path includes j AHSP modules, and the transfer function of the j-th module at the i-th stage of the second path and its output are likewise given as formula images, where i ∈ {1, 2, 3, ..., n}, j ∈ {1, 2}, and C_i is the number of feature channels in the i-th stage.
In one specific embodiment, for stage 0, the corresponding output is given by a formula rendered as an image in the original [formula image not reproduced].
In a specific embodiment, the calculation formulas of the output feature map of the first AHSP module of the first path and of the second path at each stage are given by formulas (1)-(3) [formula images not reproduced], where i ∈ {1, 2, 3, ..., n}, the two operators shown as images denote down-sampling with a stride of 2, F_{1×1}(·) is a convolution function with a 1×1 kernel, and Split(·) divides the received feature map into two parts along the channel dimension and sends them into the first path and the second path respectively, yielding the first-path and second-path feature information.
In one embodiment, the calculation formula of the output feature map of the 2nd AHSP module of the second path at each stage is given by formula (4) [formula image not reproduced], where i ∈ {1, 2, 3, ..., n}.
In a specific embodiment, in step 1, the decoding network includes n stages; based on the FFM module, a Channel Attention Sum and a Criss-Cross Attention Sum are introduced to form the decoding network, and the ReLU function is introduced as the activation function for the final output prediction result.
In one embodiment, referring to FIG. 2, the transfer function of the FFM module is D_i(·), and the output feature map is expressed by a formula rendered as an image in the original [formula image not reproduced], where i ∈ {1, 2, 3, ..., n};

S′_i = F_{1×1}(X)  (5)

[formula image (6) not reproduced]

where S′_i is the output of applying the 1×1 convolution function to the down-sampled final output X, F_{1×1}(·) is a convolution function with a 1×1 kernel, the operator shown as an image is a separable convolution function with a 3×3 kernel, and BatchNorm(·) is a batch normalization function.
In one embodiment, the feature map output obtained through the encoding stage is given by formula (7) [formula image not reproduced]; then D_i is computed as follows:

S″_i = D_i(Upsample(CAM(D_{i+1}), 2))  (8)

[formula image (9) not reproduced]

where Upsample(·, t) denotes sampling the feature map by a factor of t using bilinear interpolation, CAM(·) denotes applying the channel attention mechanism, and S″_i is the next-stage feature map D after the CAM, up-sampling, and FFM operations.
In one embodiment, the test result P_i is obtained from D_i through a 1×1 convolution:

P_i = Softmax(Upsample(F_{1×1}(D_i), 2^i))  (10)

where P_i ∈ R^{H×W} is the predicted class label map, Softmax(·) is the activation function, and i ∈ {1, 2, 3, ..., n};
definition of the Softmax function (taking the i-th node output as an example):

Softmax(Z_i) = e^{Z_i} / Σ_{c=1}^{C} e^{Z_c}

where Z_i is the output value of the i-th node, and C is the number of output nodes, i.e. the number of classes.
In one embodiment, a lung image is taken as an example for the experiment. As shown in FIG. 3, 3.1 is the CT image, 3.2 is the prediction map, and 3.3 is the ground-truth label map; Table 1 compares the parameters of this model with those of other models:
TABLE 1
Methods          Backbone   Param.    FLOPs      Dice    Sen.    Spec.
U-Net            VGG16      7.853M    38.116G    0.4     0.5     0.8
Attention-UNet   VGG16      8.727M    31.73G     0.5     0.6     0.9
U-Net++          VGG16      9.163M    65.938G    0.5     0.6     0.9
Minimum-seg      —          36.98K    209.043M   0.663   0.704   0.935
As can be seen from the Param. column of Table 1, the attention-based ultra-lightweight semantic segmentation network has only about 37K parameters, while the other models have at least millions (M-level); the model is therefore small in size, and both the image segmentation accuracy and the segmentation speed are improved; the segmentation process is not prone to overfitting; the method is efficient and convenient for practical deployment; and when annotation data are scarce, the model can be trained quickly to further improve performance.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. The attention mechanism-based lightweight semantic segmentation model construction method is characterized by comprising the following steps of:
given an image I, a corresponding real label graph GT forms a training set:
step 1, establishing a model, namely constructing a coding stage using AHSP modules, Channel Attention Sum, Criss-Cross Attention Sum, Channel Split, and Concat, constructing a decoding stage using FFM, Channel Attention Sum, Criss-Cross Attention Sum, the ReLU function, and Final Prediction, and connecting the coding stage and the decoding stage through the Channel Attention Sum to obtain an ultra-lightweight semantic segmentation network based on the attention mechanism;
step 2, model training, namely inputting a training set image I into an ultra-lightweight semantic segmentation network of an attention mechanism to obtain a predicted image, comparing the predicted image with a real label image GT, calculating a cross entropy function as a loss function, and measuring the error between a predicted value and a real value; performing iterative optimization training on the network model parameters defined in the step 1 through a back propagation algorithm until the whole model converges;
and 3, testing the model, namely inputting the test set image into the trained network model to obtain a test result.
2. The attention-based lightweight semantic segmentation model construction method according to claim 1,
in step 1, the coding network comprises n stages; with the AHSP module as the basic module, Criss-Cross Attention Sum, Channel Split, and Concat fuse Split are introduced to construct a first path and a second path that are connected with each other; the training set image I is down-sampled n times, and the feature map output at each stage is 1/2, 1/4, ..., 1/2^n of the original size.
3. The attention-based lightweight semantic segmentation model construction method according to claim 2,
the first path includes k AHSP modules, and the transfer function of the k-th module at the i-th stage of the first path and its output are each given by a formula rendered as an image in the original [formula images not reproduced], where i ∈ {1, 2, 3, ..., n} and k ∈ {1};
the second path includes j AHSP modules, and the transfer function of the j-th module at the i-th stage of the second path and its output are likewise given as formula images, where i ∈ {1, 2, 3, ..., n}, j ∈ {1, 2}, and C_i is the number of feature channels in the i-th stage.
4. The attention-based lightweight semantic segmentation model construction method according to claim 3,
the calculation formulas of the output feature map of the first AHSP module of the first path and of the second path at each stage are given by formulas (1)-(3) [formula images not reproduced], where i ∈ {1, 2, 3, ..., n}, the two operators shown as images denote down-sampling with a stride of 2, F_{1×1}(·) is a convolution network transfer function with a 1×1 kernel, and Split(·) divides the received feature map into two parts along the channel dimension and sends them into the first path and the second path respectively, yielding the first-path and second-path feature information.
5. The attention-based lightweight semantic segmentation model construction method according to claim 3,
the calculation formula of the output feature map of the 2nd AHSP module of the second path at each stage is given by formula (4) [formula image not reproduced], where i ∈ {1, 2, 3, ..., n}.
6. The attention-based lightweight semantic segmentation model construction method according to claim 1,
in step 1, the decoding network comprises n stages; based on the FFM module, a Channel Attention Sum and a Criss-Cross Attention Sum are introduced to form the decoding network, and the ReLU function is introduced as the activation function for the final output prediction result.
7. The attention-based lightweight semantic segmentation model construction method according to claim 6,
the transfer function of the FFM module is D_i(·), and the output feature map is expressed by a formula rendered as an image in the original [formula image not reproduced], where i ∈ {1, 2, 3, ..., n};

S′_i = F_{1×1}(X)  (5)

[formula image (6) not reproduced]

where S′_i is the output of applying the 1×1 convolution function to the down-sampled final output X, F_{1×1}(·) is a convolution function with a 1×1 kernel, the operator shown as an image is a separable convolution function with a 3×3 kernel, and BatchNorm(·) is a batch normalization function.
8. The attention-based lightweight semantic segmentation model construction method according to claim 7,
the feature map output obtained through the encoding stage is given by formula (7) [formula image not reproduced]; then D_i is computed as follows, where i = 1, 2, ..., n-1:

S″_i = D_i(Upsample(CAM(D_{i+1}), 2))  (8)

[formula image (9) not reproduced]

where Upsample(·, t) denotes sampling the feature map by a factor of t using bilinear interpolation, CAM(·) denotes applying the channel attention mechanism, and S″_i is the next-stage feature map D after the CAM, up-sampling, and FFM operations.
9. The attention-based lightweight semantic segmentation model construction method according to claim 8,
the test result P_i is obtained from D_i through a 1×1 convolution:

P_i = Softmax(Upsample(F_{1×1}(D_i), 2^i))  (10)

where P_i ∈ R^{H×W} is the predicted class label map, Softmax(·) is the activation function, and i ∈ {1, 2, 3, ..., n}.
CN202110638043.2A 2021-06-08 2021-06-08 Attention mechanism-based lightweight semantic segmentation model construction method Active CN113240683B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110638043.2A CN113240683B (en) 2021-06-08 2021-06-08 Attention mechanism-based lightweight semantic segmentation model construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110638043.2A CN113240683B (en) 2021-06-08 2021-06-08 Attention mechanism-based lightweight semantic segmentation model construction method

Publications (2)

Publication Number Publication Date
CN113240683A true CN113240683A (en) 2021-08-10
CN113240683B CN113240683B (en) 2022-09-20

Family

ID=77137265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110638043.2A Active CN113240683B (en) 2021-06-08 2021-06-08 Attention mechanism-based lightweight semantic segmentation model construction method

Country Status (1)

Country Link
CN (1) CN113240683B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114140469A (en) * 2021-12-02 2022-03-04 北京交通大学 Depth hierarchical image semantic segmentation method based on multilayer attention
CN114241203A (en) * 2022-02-24 2022-03-25 科大天工智能装备技术(天津)有限公司 Workpiece length measuring method and system
CN114255350A (en) * 2021-12-23 2022-03-29 四川大学 Method and system for measuring thickness of soft and hard tissues of palate part
CN116721420A (en) * 2023-08-10 2023-09-08 南昌工程学院 Semantic segmentation model construction method and system for ultraviolet image of electrical equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490884A (en) * 2019-08-23 2019-11-22 北京工业大学 A kind of lightweight network semantic segmentation method based on confrontation
CN111079649A (en) * 2019-12-17 2020-04-28 西安电子科技大学 Remote sensing image ground feature classification method based on lightweight semantic segmentation network
CN112183360A (en) * 2020-09-29 2021-01-05 上海交通大学 Lightweight semantic segmentation method for high-resolution remote sensing image
CN112330681A (en) * 2020-11-06 2021-02-05 北京工业大学 Attention mechanism-based lightweight network real-time semantic segmentation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHENGNING ZHANG ET AL: "Bidirectional Parallel Feature Pyramid Network for Object Detection", IEEE Access *
NING Qian et al.: "Aerial image segmentation based on multi-scale features and attention mechanism", Control Theory & Applications *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114140469A (en) * 2021-12-02 2022-03-04 北京交通大学 Depth hierarchical image semantic segmentation method based on multilayer attention
CN114255350A (en) * 2021-12-23 2022-03-29 四川大学 Method and system for measuring thickness of soft and hard tissues of palate part
CN114255350B (en) * 2021-12-23 2023-08-04 四川大学 Method and system for measuring thickness of soft and hard tissues of palate
CN114241203A (en) * 2022-02-24 2022-03-25 科大天工智能装备技术(天津)有限公司 Workpiece length measuring method and system
CN116721420A (en) * 2023-08-10 2023-09-08 南昌工程学院 Semantic segmentation model construction method and system for ultraviolet image of electrical equipment
CN116721420B (en) * 2023-08-10 2023-10-20 南昌工程学院 Semantic segmentation model construction method and system for ultraviolet image of electrical equipment

Also Published As

Publication number Publication date
CN113240683B (en) 2022-09-20

Similar Documents

Publication Publication Date Title
CN113240683B (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN111462126B (en) Semantic image segmentation method and system based on edge enhancement
CN111325165B (en) Urban remote sensing image scene classification method considering spatial relationship information
CN111259904B (en) Semantic image segmentation method and system based on deep learning and clustering
CN112381097A (en) Scene semantic segmentation method based on deep learning
CN110728192A High-resolution remote sensing image classification method based on a novel feature pyramid deep network
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN113435253B (en) Multi-source image combined urban area ground surface coverage classification method
CN112699899A Hyperspectral image feature extraction method based on a generative adversarial network
CN112329801B (en) Convolutional neural network non-local information construction method
CN112733768A (en) Natural scene text recognition method and device based on bidirectional characteristic language model
CN113378938B (en) Edge transform graph neural network-based small sample image classification method and system
CN113642445B (en) Hyperspectral image classification method based on full convolution neural network
CN113516133A (en) Multi-modal image classification method and system
CN114821050A (en) Named image segmentation method based on transformer
CN113807340A (en) Method for recognizing irregular natural scene text based on attention mechanism
CN115761735A (en) Semi-supervised semantic segmentation method based on self-adaptive pseudo label correction
CN112508181A (en) Graph pooling method based on multi-channel mechanism
CN117237559A (en) Digital twin city-oriented three-dimensional model data intelligent analysis method and system
CN117576402B (en) Deep learning-based multi-scale aggregation transducer remote sensing image semantic segmentation method
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN116704506A (en) Cross-environment-attention-based image segmentation method
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
CN112990336B (en) Deep three-dimensional point cloud classification network construction method based on competitive attention fusion
CN113688783B (en) Face feature extraction method, low-resolution face recognition method and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant