CN110853057B - Aerial image segmentation method based on global and multi-scale full-convolution network - Google Patents

Aerial image segmentation method based on global and multi-scale full-convolution network

Info

Publication number
CN110853057B
CN110853057B (application CN201911087534.1A)
Authority
CN
China
Prior art keywords
layer
convolution
global
network
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911087534.1A
Other languages
Chinese (zh)
Other versions
CN110853057A (en)
Inventor
Ma Jingjing
Wu Linlin
Tang Xu
Jiao Licheng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201911087534.1A priority Critical patent/CN110853057B/en
Publication of CN110853057A publication Critical patent/CN110853057A/en
Application granted granted Critical
Publication of CN110853057B publication Critical patent/CN110853057B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The invention discloses an aerial image segmentation method based on a global and multi-scale full convolution network, comprising the following steps: constructing a global and multi-scale full convolution network; generating a training set; training the network; and inputting the aerial image to be segmented into the trained network for binary segmentation to generate a segmentation mask map. The method segments aerial images with a global and multi-scale full convolution network in which a global module and a multi-scale module are embedded, so that a more refined segmentation mask is extracted; the method is highly robust and achieves high segmentation accuracy.

Description

Aerial image segmentation method based on global and multi-scale full-convolution network
Technical Field
The invention belongs to the technical field of image processing, and further relates to an aerial image segmentation method based on a global and multi-scale full convolution network in the field of image segmentation. The invention can be used to detect building targets in high-resolution aerial images and to segment the regions where buildings are located.
Background
With the continuous development of society, urban construction planning has become a topic of wide concern. As building demand grows, ever more buildings complicate the construction of urban infrastructure, such as traffic route planning, drainage system planning and convenience facility planning. Building detection and segmentation in aerial images can help construction planning departments detect and segment town buildings and build municipal infrastructure. However, aerial images contain rich information and complex spatial details: building targets occupy different areas within one aerial image, shooting angles differ, objects are numerous and complex, appearance styles vary, and buildings are occluded by the surrounding environment to different degrees, all of which pose great challenges for building detection and segmentation in aerial images.
A remote sensing image segmentation method based on the fusion of complete residuals and multi-scale features is proposed in the patent document "Remote sensing image segmentation method combining complete residuals and feature fusion" (application number 201811306585.4, publication number CN109447994A) filed by Shaanxi Normal University. The method is implemented as follows: a convolutional encoding-decoding network is adopted as the segmentation backbone and improved by adding a feature pyramid module that aggregates multi-scale context information; residual units are added to the corresponding convolution layers of the encoder and decoder; the encoder features are fused into the corresponding decoder layers by pixel-wise addition; finally, the improved segmentation network combining complete residuals and multi-scale feature fusion is used to segment the remote sensing image. The drawback of this method is that the convolutional encoding-decoding network is built from multiple convolution layers, and because of the limited size of the convolution kernels it can only extract local information and lacks global information, so the segmentation accuracy is low.
A remote sensing image segmentation method based on a fully convolutional recurrent network is proposed in the paper "RiFCN: Recurrent Network in Fully Convolutional Network for Semantic Segmentation of High Resolution Remote Sensing Images" (arXiv, May 2018). The method is implemented as follows: the data are processed to construct a training sample set and a test set; a bidirectional network containing a forward stream and a backward stream is constructed as the backbone for semantic segmentation; the forward stream is a convolutional neural network used for feature extraction, through which images yield multi-level convolutional feature maps from shallow to deep, while the backward stream uses recurrent connections over all the features available from the forward stream to achieve high-resolution prediction. The drawbacks of this method are that only the relation between the encoding and decoding parts of the fully convolutional network is considered; the differing contributions of each convolution layer of the decoding part to the final prediction are not considered; multi-scale features are not considered, making it difficult to recognize same-class objects of different sizes in an image; and the simplicity and efficiency of the network are not considered, so the segmentation performance is not high.
Disclosure of Invention
The invention aims to provide an aerial image segmentation method based on global and multi-scale full convolution networks, aiming at the defects of the prior art.
The idea for realizing the purpose of the invention is to construct a global and multi-scale full convolution network for segmenting the aerial image, and embed a global module and a multi-scale module in the global and multi-scale full convolution network so as to improve the segmentation efficiency and the segmentation precision.
The method comprises the following specific steps:
(1) constructing a global and multi-scale full convolution network:
(1a) a global and multi-scale full convolution network is built, whose structure is, in order: input layer → feature extraction layer → first combination module → fully connected layer → deconvolution layer → second combination module → output layer;
the feature extraction layer consists of the five serially connected convolution modules of the VGG16 model;
the first combination module has 7 layers, whose structure is, in order: first convolution layer → transpose layer → first multiplication layer → softmax layer → second multiplication layer → second convolution layer → addition layer;
the structure of the fully connected layer is, in order: max pooling layer → third convolution layer → first dropout layer → fourth convolution layer → second dropout layer;
the deconvolution layer consists of four serially connected deconvolution modules, each with the structure: first upsampling layer → fifth convolution layer → third dropout layer;
the second combination module consists of three serially connected upsampling modules, each composed of a second upsampling layer and a sixth convolution layer;
the output layer consists of a seventh convolution layer and Argmax connected in series and is used to generate the segmentation mask map;
wherein the outputs of the second, third, fourth and fifth convolution modules of the feature extraction layer are connected by pixel-wise addition to the inputs of the first, second, third and fourth deconvolution modules of the deconvolution layer, respectively;
(1b) the parameters of the global and multi-scale full convolution network are set as follows:
the convolution kernel sizes of the first and second convolution layers are set to 1×1 pixels, with stride 1 pixel; the parameters of the feature extraction layer are the same as the network parameters of VGG16;
the input and output feature maps of the first combination module are set to 512, and the intermediate feature maps to 256;
the convolution kernel sizes of the fully connected layer and the deconvolution layer are set to 3×3 pixels, with stride 1 pixel, and the dropout parameters in both are set to 0.5;
the feature maps of each upsampling layer in the second combination module are set to 2, the convolution kernel size of the sixth convolution layer to 1×1 pixel, with stride 1 pixel;
(2) generating a training set:
(2a) 31 aerial images of size 5000×5000 and their corresponding actual class labels are collected, each image containing a background class and a target class;
(2b) each image is cropped into 256×256 patches and every pixel is divided by 255.0 for normalization to form the training set; the corresponding actual class labels are cropped in the same way to form the actual class labels of the training set;
(3) training the global and multi-scale full convolution network:
(3a) the training set is input into the global and multi-scale full convolution network, and the feature map output by the network is taken as the network-predicted segmentation mask map;
(3b) the network weights are iteratively updated with the Adam optimization algorithm until the loss function converges, yielding the trained global and multi-scale full convolution network;
(4) generating a segmentation mask map:
Each aerial image to be segmented is cropped into 256×256 patches, every pixel is divided by 255.0 for normalization, and the patches are input into the trained global and multi-scale full convolution network for binary segmentation to obtain the final segmentation mask map.
Compared with the prior art, the invention has the following advantages:
First, the invention constructs and uses a global module to obtain global information in the feature extraction layer, so that local information is obtained through the convolution layers while global information is obtained through the global module, and the image is segmented using both. This overcomes the prior-art problem that, owing to the limited size of the convolution kernel, only local information can be extracted and global information is lacking, which leads to low segmentation accuracy; the invention therefore has the advantage of high segmentation accuracy.
Second, the invention constructs and uses a multi-scale module to obtain multi-scale information in the deconvolution layer, and replaces the mask map obtained from only the last-level feature map with a mask map obtained from multi-level feature maps concatenated in series. Multi-scale information is thus obtained and the information of every level is fully used. This overcomes the prior-art problem that the differing contributions of each convolution layer of the decoding part to the final prediction, and multi-scale features, are not considered, which makes it difficult to recognize same-class objects of different sizes in an image; the invention therefore recognizes same-class objects of different shapes more accurately.
Third, the connection between the feature extraction layer and the deconvolution layer makes full use of the information extracted by the feature extraction layer, and adding the feature extraction layer outputs to the corresponding deconvolution layers reduces the loss of high-frequency information caused by pooling without increasing the amount of computation. This overcomes the low segmentation performance of prior-art networks that are not simple and efficient; the invention therefore has excellent segmentation performance and high robustness.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the global and multi-scale full convolution network of the present invention;
FIG. 3 is a block diagram of the invention.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
The implementation steps of the present invention are described in further detail with reference to fig. 1.
Step 1, constructing a global and multi-scale full convolution network.
First, a global and multi-scale full convolution network is built, whose structure is, in order: input layer → feature extraction layer → first combination module → fully connected layer → deconvolution layer → second combination module → output layer.
The feature extraction layer consists of the five serially connected convolution modules of the VGG16 model.
The first combination module has 7 layers, whose structure is, in order: first convolution layer → transpose layer → first multiplication layer → softmax layer → second multiplication layer → second convolution layer → addition layer.
The structure of the fully connected layer is, in order: max pooling layer → third convolution layer → first dropout layer → fourth convolution layer → second dropout layer.
The deconvolution layer consists of four serially connected deconvolution modules, each with the structure: first upsampling layer → fifth convolution layer → third dropout layer.
The second combination module consists of three serially connected upsampling modules, each composed of a second upsampling layer and a sixth convolution layer.
The output layer consists of a seventh convolution layer and Argmax connected in series and is used to generate the segmentation mask map.
The outputs of the second, third, fourth and fifth convolution modules of the feature extraction layer are connected by pixel-wise addition to the inputs of the first, second, third and fourth deconvolution modules of the deconvolution layer, respectively.
The convolution kernels of the first and second of the five serially connected convolution modules in the VGG16 model are all 3×3 pixels, with strides set to 2 pixels and 1 pixel in sequence; the convolution kernels of the third, fourth and fifth convolution modules are all 3×3 pixels, with strides set to 2 pixels, 1 pixel and 1 pixel in sequence; weights pre-trained on the ImageNet dataset are used as the initial values of the model.
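For concreteness, the following minimal Python sketch loads such an ImageNet-pretrained VGG16 feature extractor and checks the shape of the fifth convolution module's output for a 256×256 patch. The patent does not name a framework; the use of PyTorch/torchvision here is an illustrative assumption.

```python
import torch
import torchvision

# Sketch (not the patent's own code) of the feature extraction layer:
# the five convolution modules of VGG16, initialised with weights
# pre-trained on ImageNet as the description specifies.
vgg16 = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1)
feature_extractor = vgg16.features   # conv modules 1-5 with interleaved pooling layers

x = torch.randn(1, 3, 256, 256)      # one normalised 256x256 training patch
features = feature_extractor(x)      # output of the fifth convolution module
print(features.shape)                # torch.Size([1, 512, 8, 8])
```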
The structure of the first combination module in the constructed global and multi-scale full convolution network is further described with reference to fig. 2.
The first combination module in the global and multi-scale full convolution network is composed mainly of 1×1 convolution, transposition, multiplication and addition. The input X is the output feature map of the fifth convolution module of VGG16, and three 1×1 convolution branches θ, φ and g transform X; the θ and φ outputs are combined by transposition and multiplication followed by a softmax operation, the result is multiplied with the g output and passed through a 1×1 convolution, and the input X is then added to obtain the final feature map Z containing global information.
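A possible realisation of this global module, reconstructed from the description above (θ, φ and g as 1×1 convolutions with 512 input and 256 intermediate feature maps, transposition, multiplication, softmax, a final 1×1 convolution and a residual addition of X), is sketched below in PyTorch; the exact wiring is an assumption, not the patent's own code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalModule(nn.Module):
    """Sketch of the first combination (global) module: a non-local-style
    block built from 1x1 convolutions, transposition, matrix multiplication,
    softmax and a residual addition. Channel sizes (512 in/out, 256 internal)
    follow the stated parameter settings; the wiring is an assumption."""
    def __init__(self, in_channels=512, mid_channels=256):
        super().__init__()
        self.theta = nn.Conv2d(in_channels, mid_channels, kernel_size=1, stride=1)
        self.phi = nn.Conv2d(in_channels, mid_channels, kernel_size=1, stride=1)
        self.g = nn.Conv2d(in_channels, mid_channels, kernel_size=1, stride=1)
        self.out = nn.Conv2d(mid_channels, in_channels, kernel_size=1, stride=1)

    def forward(self, x):
        n, c, h, w = x.shape
        theta = self.theta(x).view(n, -1, h * w)   # N x 256 x HW
        phi = self.phi(x).view(n, -1, h * w)       # N x 256 x HW
        g = self.g(x).view(n, -1, h * w)           # N x 256 x HW
        # pairwise similarity between all spatial positions (global context)
        attn = F.softmax(torch.bmm(theta.transpose(1, 2), phi), dim=-1)  # N x HW x HW
        y = torch.bmm(g, attn.transpose(1, 2)).view(n, -1, h, w)
        return self.out(y) + x                     # residual addition of the input X
```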
Second, the parameters of the global and multi-scale full convolution network are set as follows.
The convolution kernel sizes of the first and second convolution layers are set to 1×1 pixels, with stride 1 pixel; the parameters of the feature extraction layer are the same as the network parameters of VGG16.
The input and output feature maps of the first combination module are set to 512, and the intermediate feature maps to 256.
The convolution kernel sizes of the fully connected layer and the deconvolution layer are set to 3×3 pixels, with stride 1 pixel, and the dropout parameters in both are set to 0.5.
The feature maps of each upsampling layer in the second combination module are set to 2, the convolution kernel size of the sixth convolution layer to 1×1 pixel, with stride 1 pixel.
The input feature maps of the first, second, third, fourth and fifth convolution modules in the VGG16 network parameters are set to 3, 64, 128, 256 and 512 in sequence, and the output feature maps to 64, 128, 256, 512 and 512 in sequence.
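The second combination (multi-scale) module described above can be sketched as follows; the input channel counts of the three intermediate decoder outputs and the use of bilinear interpolation for the upsampling layers are illustrative assumptions, while the 1×1 convolutions producing 2 feature maps follow the stated parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleModule(nn.Module):
    """Sketch of the second combination module: three upsampling modules,
    each an upsampling layer followed by a 1x1 convolution producing 2
    feature maps, applied to intermediate outputs of the deconvolution
    layer so the final mask uses multi-level feature maps concatenated
    in series rather than the last level alone."""
    def __init__(self, in_channels=(512, 256, 128), num_maps=2):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(c, num_maps, kernel_size=1, stride=1) for c in in_channels
        ])

    def forward(self, decoder_feats, out_size):
        outs = []
        for feat, conv in zip(decoder_feats, self.convs):
            # second upsampling layer: enlarge to the common output size
            up = F.interpolate(feat, size=out_size, mode='bilinear', align_corners=False)
            # sixth convolution layer: 1x1 convolution, stride 1, 2 feature maps
            outs.append(conv(up))
        return torch.cat(outs, dim=1)   # multi-level maps concatenated in series
```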
And 2, generating a training set.
31 aerial images of size 5000×5000 and their corresponding actual class labels are collected, each image containing a background class and a target class.
Each image is cropped into 256×256 patches and every pixel is divided by 255.0 for normalization to form the training set; the corresponding actual class labels are cropped in the same way to form the actual class labels of the training set.
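A minimal sketch of this training-set generation, assuming non-overlapping row-major tiling (the description fixes only the 256×256 patch size and the division by 255.0):

```python
import numpy as np

def make_training_patches(image, label, patch=256):
    """Crop a 5000x5000 aerial image and its class-label map into 256x256
    patches and normalise pixels by 255.0. The tiling order and the handling
    of the edge remainder are assumptions for illustration."""
    patches, labels = [], []
    h, w = image.shape[:2]
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            patches.append(image[r:r + patch, c:c + patch] / 255.0)
            labels.append(label[r:r + patch, c:c + patch])
    return np.stack(patches), np.stack(labels)
```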
And 3, training the global and multi-scale full convolution network.
The training set is input into the global and multi-scale full convolution network, and the feature map output by the network is taken as the network-predicted segmentation mask map.
The network weights are iteratively updated with the Adam optimization algorithm until the loss function converges, yielding the trained global and multi-scale full convolution network.
The loss function is a sparse softmax cross entropy loss function: it first converts the actual labels from class indices into one-hot codes, then applies softmax to the predicted class labels, and finally computes the cross entropy as the loss value. The cross entropy is calculated as:
H_y'(y) = -∑ y'·log y
where y' is the actual class label of the training set, y is the segmentation mask map predicted for the training set, and log is the base-10 logarithm.
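A NumPy sketch of this loss, following the description's one-hot conversion, softmax, and base-10 logarithm (most frameworks' sparse softmax cross entropy uses the natural logarithm instead):

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels, num_classes=2):
    """Convert class-index labels to one-hot codes, apply softmax to the
    predicted class scores (last axis), and take cross entropy as the loss.
    Averaging over pixels is an assumption for illustration."""
    one_hot = np.eye(num_classes)[labels.reshape(-1)]        # index -> one-hot
    z = logits.reshape(-1, num_classes)
    z = z - z.max(axis=1, keepdims=True)                     # numerically stable softmax
    y = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.mean(np.sum(one_hot * np.log10(y + 1e-12), axis=1))
```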
The steps for iteratively updating the network weight values with the Adam optimization algorithm are as follows:
firstly, the training set is divided into several parts according to the following formula:
G = M / Q
where G is the number of parts into which the training set is divided, M is the total number of images in the training set, and Q is the number of images in each part; Q is set according to the scale of the global and multi-scale full convolution network and the size of the input images, and the deeper the network or the larger each input image, the smaller the value of Q;
secondly, any unselected image is taken from the divided training set and input into the global and multi-scale full convolution network, and the network weight values are updated with the following weight update formula:
W_new = W - L × ∂Loss/∂W
where W_new is the updated weight value, W is the initial weight value of the global and multi-scale full convolution network, L is the learning rate of the training, whose value lies in the range [0.00001, 0.001], × denotes multiplication, and ∂Loss/∂W denotes the partial derivative of the loss with respect to the weights;
thirdly, any unselected image is taken from the divided training set and input into the global and multi-scale full convolution network, and the loss value of the loss function is computed after the weight values are updated.
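A compact PyTorch sketch of this training procedure is given below; the epoch count, the learning rate within the stated range, and the names `network` and `loader` are illustrative assumptions, and `CrossEntropyLoss` plays the role of the sparse softmax cross entropy loss.

```python
import torch

def train(network, loader, epochs=50, lr=1e-4):
    optimizer = torch.optim.Adam(network.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()   # sparse softmax cross entropy
    for _ in range(epochs):
        for patches, labels in loader:      # one of the G = M / Q parts
            optimizer.zero_grad()
            logits = network(patches)       # predicted segmentation mask logits
            loss = loss_fn(logits, labels)  # one-hot + softmax + cross entropy
            loss.backward()                 # dLoss/dW, as in the update formula
            optimizer.step()                # W_new = W - L * dLoss/dW plus Adam moments
    return network
```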
And 4, generating a segmentation mask map.
Each aerial image to be segmented is cropped into 256×256 patches, every pixel is divided by 255.0 for normalization, and the patches are input into the trained global and multi-scale full convolution network for binary segmentation to obtain the final segmentation mask map.
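A sketch of this step, assuming non-overlapping tiling and simple stitching of the per-patch binary masks back into a full-size mask (the description itself ends at the per-patch segmentation):

```python
import numpy as np
import torch

def segment_aerial_image(network, image, patch=256):
    """Crop the aerial image into 256x256 patches, normalise by 255.0,
    run the trained network for binary segmentation, and stitch the
    per-patch masks together. Stitching is an assumption."""
    network.eval()
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=np.int64)
    with torch.no_grad():
        for r in range(0, h - patch + 1, patch):
            for c in range(0, w - patch + 1, patch):
                tile = image[r:r + patch, c:c + patch] / 255.0
                x = torch.from_numpy(tile).float().permute(2, 0, 1).unsqueeze(0)
                logits = network(x)                       # 1 x 2 x 256 x 256
                mask[r:r + patch, c:c + patch] = logits.argmax(1).squeeze(0).numpy()
    return mask
```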
The working steps of the overall invention are further described with reference to fig. 3.
The pictures of the training set are input in sequence into the constructed global and multi-scale full convolution network. A feature map is extracted by the feature extraction layer consisting of the first, second, third, fourth and fifth convolution modules, input into the global module, and then passed through the fully connected layer consisting of the max pooling layer, the third and fourth convolution layers, and the first and second dropout layers. The feature map is enlarged by the deconvolution layer consisting of the first, second, third and fourth deconvolution modules; to use multi-level information, multi-level feature maps are obtained through the first, second and third upsamplings and concatenated in series, and the dimension of the concatenated feature map is reduced by the seventh convolution layer to obtain the output segmentation mask map. The addition symbol in fig. 3 denotes pixel-by-pixel addition.
The effect of the present invention will be further described with reference to simulation experiments.
1. Simulation conditions are as follows:
the hardware platform of the simulation experiment of the invention is as follows: CPU is Intel (R) core (TM) i7-8700X, main frequency is 3.2GHz, memory 64GB, GPU is NVIDIA 1080 Ti.
The software platform of the simulation experiment of the invention is as follows: ubuntu operating system and python 3.6.
2. Simulation content and result analysis:
the simulation experiment of the invention is to train the constructed global and multi-scale full convolution networks respectively by using the training images by adopting the invention and three prior arts (full convolution network method, segmentation network method and bidirectional full convolution network method). And (3) segmenting the image to be segmented by using the trained global and multi-scale full convolution network to obtain 25 (5 in each region) segmentation mask images of the image to be segmented.
The training Image and the Image to be segmented used in the simulation experiment are Aerial Image data sets in an Aerial Image Labeling data set Inria initial Image Labeling data set of a French national computer and an automated research institute. The aerial image dataset is collected from ten regions, five of which have real tags, each region has 36 images with a size of 5000 × 5000 × 3 pixels, the image tags are architectural and non-architectural, and the image format is tiff. The simulation experiment of the invention uses five regions with real labels to verify the effectiveness of the invention, and selects 6 th to 36 th aerial images of each region as training images of each region, and 1 st to 5 th aerial images as images to be segmented of each region.
The three prior-art methods adopted in the simulation experiment are:
the prior art full convolution network method refers to an aerial image Segmentation method proposed in the paper "full convolution Networks for magnetic Segmentation", IEEE Conference on computer Vision and Pattern registration "(CVPR, 2014)" published by Long et al, and the method uses an end-to-end convolution neural network and uses deconvolution to perform upsampling, which is referred to as a full convolution network method for short.
The prior-art segmentation network method refers to the aerial image segmentation method proposed by Vijay et al. in the paper "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation" (arXiv, 2016). The method passes the max-pooling indices to the decoder to recover resolution; it is referred to as the segmentation network method for short.
The prior-art bidirectional fully convolutional network method refers to the aerial image segmentation method proposed by Mou et al. in the paper "RiFCN: Recurrent Network in Fully Convolutional Network for Semantic Segmentation of High Resolution Remote Sensing Images" (arXiv, 2018). The method improves segmentation accuracy through the cyclic action of a forward stream and a backward stream; it is referred to as the bidirectional fully convolutional network method for short.
The segmentation accuracy of the segmentation mask maps of the 25 images to be segmented (5 per region) obtained by the four methods is evaluated with two indexes: accuracy (ACC) and intersection-over-union (IOU). They are calculated with the following formulas, and the results are listed in Table 1:
ACC = number of correctly classified pixels / total number of pixels
IOU = (A ∩ B) / (A ∪ B)
where A denotes the area of the predicted target label, B denotes the area of the real target label, ∩ is the intersection operation, and ∪ is the union operation.
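A small sketch of the two evaluation indexes on binary masks, with ACC as the proportion of correctly classified pixels and IOU computed from the predicted target area A and the real target area B:

```python
import numpy as np

def acc_iou(pred, truth):
    """Compute pixel accuracy and intersection-over-union for binary masks."""
    acc = np.mean(pred == truth)
    a, b = (pred == 1), (truth == 1)          # predicted / real target areas
    iou = np.logical_and(a, b).sum() / np.logical_or(a, b).sum()
    return acc, iou
```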
In Table 1, "Invention" denotes the aerial image segmentation method based on the global and multi-scale full convolution network proposed by the present invention, "FCN" the fully convolutional network method of Long et al., "SegNet" the segmentation network method of Vijay et al., and "RiFCN" the bidirectional fully convolutional network method of Mou et al.; "austin", "chicago", "kitsap", "tyrol_w" and "vienna" are the five regions, each containing 5 segmentation mask maps, and "overall" is the whole set of 25 segmentation mask maps.
TABLE 1 Performance evaluation of the invention and existing aerial remote sensing image semantic segmentation models

Method            austin   chicago   kitsap   tyrol_w   vienna   overall
Invention (IOU)    78.90     69.84    66.87     75.29    80.59     75.97
Invention (ACC)    96.89     92.78    99.27     98.05    94.56     96.35
FCN (IOU)          47.66     53.62    33.70     46.86    60.60     53.82
FCN (ACC)          92.22     88.59    98.58     95.83    88.72     92.79
SegNet (IOU)       74.81     52.83    68.06     65.68    72.90     70.14
SegNet (ACC)       92.52     98.65    97.28     91.36    96.04     95.17
RiFCN (IOU)        76.84     67.45    63.95     73.19    79.18     74.00
RiFCN (ACC)        96.50     91.76    99.14     97.75    93.95     95.82
As can be seen from Table 1, over the whole set of 25 segmentation mask maps the accuracy ACC of the invention is 96.35% and the intersection-over-union is 75.97%; both indexes are higher than those of the 3 prior-art methods, and the accuracy and intersection-over-union in most individual regions are also higher than those of the 3 prior-art methods, proving that the invention can achieve higher aerial image segmentation accuracy.
The above simulation experiments show that: the built global module can extract global information in the feature extraction layer of the aerial image and combine it with local information; the built multi-scale module can extract multi-scale information in the deconvolution layer; and the connection between the feature extraction layer and the deconvolution layer makes full use of the information extracted by the feature extraction layer. The invention thus solves the prior-art problems that only local information can be extracted because of the limited convolution kernel size and global information is lacking, that the differing contributions of each convolution layer of the decoding part to the final prediction and the multi-scale features are not considered, and that network simplicity and efficiency are not considered, which lead to low segmentation accuracy, difficulty in recognizing same-class objects of different sizes, and poor segmentation performance. It is a very practical aerial image segmentation method.

Claims (5)

1. An aerial image segmentation method based on a global and multi-scale full convolution network, characterized in that a global module is constructed and used to obtain global information in a feature extraction layer, a multi-scale module is constructed and used to obtain multi-scale information in a deconvolution layer, and the connection between the feature extraction layer and the deconvolution layer enables the information extracted by the feature extraction layer to be fully used, the method specifically comprising the following steps:
(1) constructing a global and multi-scale full convolution network:
(1a) a global and multi-scale full convolution network is built, whose structure is, in order: input layer → feature extraction layer → first combination module → fully connected layer → deconvolution layer → second combination module → output layer;
the feature extraction layer consists of the five serially connected convolution modules of the VGG16 model;
the first combination module has 7 layers, whose structure is, in order: first convolution layer → transpose layer → first multiplication layer → softmax layer → second multiplication layer → second convolution layer → addition layer;
the structure of the fully connected layer is, in order: max pooling layer → third convolution layer → first dropout layer → fourth convolution layer → second dropout layer;
the deconvolution layer consists of four serially connected deconvolution modules, each with the structure: first upsampling layer → fifth convolution layer → third dropout layer;
the second combination module consists of three serially connected upsampling modules, each composed of a second upsampling layer and a sixth convolution layer;
the output layer consists of a seventh convolution layer and Argmax connected in series and is used to generate the segmentation mask map;
wherein the outputs of the second, third, fourth and fifth convolution modules of the feature extraction layer are connected by pixel-wise addition to the inputs of the first, second, third and fourth deconvolution modules of the deconvolution layer, respectively;
(1b) the parameters of the global and multi-scale full convolution network are set as follows:
the convolution kernel sizes of the first and second convolution layers are set to 1×1 pixels, with stride 1 pixel; the parameters of the feature extraction layer are the same as the network parameters of VGG16;
the input and output feature maps of the first combination module are set to 512, and the intermediate feature maps to 256;
the convolution kernel sizes of the fully connected layer and the deconvolution layer are set to 3×3 pixels, with stride 1 pixel, and the dropout parameters in both are set to 0.5;
the feature maps of each upsampling layer in the second combination module are set to 2, the convolution kernel size of the sixth convolution layer to 1×1 pixel, with stride 1 pixel;
(2) generating a training set:
(2a) 31 aerial images of size 5000×5000 and their corresponding actual class labels are collected, each image containing a background class and a target class;
(2b) each image is cropped into 256×256 patches and every pixel is divided by 255.0 for normalization to form the training set; the corresponding actual class labels are cropped in the same way to form the actual class labels of the training set;
(3) training the global and multi-scale full convolution network:
(3a) the training set is input into the global and multi-scale full convolution network, and the feature map output by the network is taken as the network-predicted segmentation mask map;
(3b) the network weights are iteratively updated with the Adam optimization algorithm until the loss function converges, yielding the trained global and multi-scale full convolution network;
(4) generating a segmentation mask map:
each aerial image to be segmented is cropped into 256×256 patches, every pixel is divided by 255.0 for normalization, and the patches are input into the trained global and multi-scale full convolution network for binary segmentation to obtain the final segmentation mask map.
2. The aerial image segmentation method based on the global and multi-scale full convolution network according to claim 1, wherein the convolution kernels of the first and second of the five serially connected convolution modules in the VGG16 model in step (1a) are all 3×3 pixels, with strides set to 2 pixels and 1 pixel in sequence; the convolution kernels of the third, fourth and fifth convolution modules are all 3×3 pixels, with strides set to 2 pixels, 1 pixel and 1 pixel in sequence; and weights pre-trained on the ImageNet dataset are used as the initial values of the model.
3. The aerial image segmentation method based on the global and multi-scale full convolution network according to claim 1, wherein in step (1b) the input feature maps of the first, second, third, fourth and fifth convolution modules in the VGG16 network parameters are set to 3, 64, 128, 256 and 512 in sequence, and the output feature maps to 64, 128, 256, 512 and 512 in sequence.
4. The aerial image segmentation method based on the global and multi-scale full convolution network according to claim 1, wherein the loss function in step (3b) is a sparse softmax cross entropy loss function: the loss function first converts the actual labels from class indices into one-hot codes, then applies softmax to the predicted class labels, and finally computes the cross entropy as the loss value, the cross entropy being calculated as:
H_y'(y) = -∑ y'·log y
wherein y' is the actual class label of the training set, y is the segmentation mask map predicted for the training set, and log is the base-10 logarithm.
5. The aerial image segmentation method based on the global and multi-scale full convolution network according to claim 1, wherein the steps of iteratively updating the network weight values with the Adam optimization algorithm in step (3b) are as follows:
firstly, the training set is divided into several parts according to the following formula:
G = M / Q
wherein G is the number of parts into which the training set is divided, M is the total number of images in the training set, and Q is the number of images in each part; Q is set according to the scale of the global and multi-scale full convolution network and the size of the input images, and the deeper the network or the larger each input image, the smaller the value of Q;
secondly, any unselected image is taken from the divided training set and input into the global and multi-scale full convolution network, and the network weight values are updated with the following weight update formula:
W_new = W - L × ∂Loss/∂W
wherein W_new is the updated weight value, W is the initial weight value of the global and multi-scale full convolution network, L is the learning rate of the training, whose value lies in the range [0.00001, 0.001], × denotes multiplication, and ∂Loss/∂W denotes the partial derivative of the loss with respect to the weights;
thirdly, any unselected image is taken from the divided training set and input into the global and multi-scale full convolution network, and the loss value of the loss function is computed after the weight values are updated.
CN201911087534.1A 2019-11-08 2019-11-08 Aerial image segmentation method based on global and multi-scale full-convolution network Active CN110853057B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911087534.1A CN110853057B (en) 2019-11-08 2019-11-08 Aerial image segmentation method based on global and multi-scale full-convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911087534.1A CN110853057B (en) 2019-11-08 2019-11-08 Aerial image segmentation method based on global and multi-scale full-convolution network

Publications (2)

Publication Number Publication Date
CN110853057A CN110853057A (en) 2020-02-28
CN110853057B true CN110853057B (en) 2021-10-29

Family

ID=69600177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911087534.1A Active CN110853057B (en) 2019-11-08 2019-11-08 Aerial image segmentation method based on global and multi-scale full-convolution network

Country Status (1)

Country Link
CN (1) CN110853057B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113496159B (en) * 2020-03-20 2022-12-23 昆明理工大学 Multi-scale convolution and dynamic weight cost function smoke target segmentation method
CN111640116B (en) * 2020-05-29 2023-04-18 广西大学 Aerial photography graph building segmentation method and device based on deep convolutional residual error network
CN111784653B (en) * 2020-06-28 2023-08-01 西安电子科技大学 Multi-scale network MRI pancreas contour positioning method based on shape constraint
CN112183448B (en) * 2020-10-15 2023-05-12 中国农业大学 Method for dividing pod-removed soybean image based on three-level classification and multi-scale FCN
CN114419381B (en) * 2022-04-01 2022-06-24 城云科技(中国)有限公司 Semantic segmentation method and road ponding detection method and device applying same
CN114821174B (en) * 2022-04-24 2024-02-27 西北工业大学 Content perception-based transmission line aerial image data cleaning method
CN116071607B (en) * 2023-03-08 2023-08-08 中国石油大学(华东) Reservoir aerial image classification and image segmentation method and system based on residual error network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169974A (en) * 2017-05-26 2017-09-15 中国科学技术大学 It is a kind of based on the image partition method for supervising full convolutional neural networks more
CN107397658A (en) * 2017-07-26 2017-11-28 成都快眼科技有限公司 A kind of multiple dimensioned full convolutional network and vision blind-guiding method and device
CN107944347A (en) * 2017-11-03 2018-04-20 西安电子科技大学 Polarization SAR object detection method based on multiple dimensioned FCN CRF
CN110288613A (en) * 2019-06-12 2019-09-27 中国科学院重庆绿色智能技术研究院 A kind of histopathology image partition method of very-high solution

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169974A (en) * 2017-05-26 2017-09-15 中国科学技术大学 It is a kind of based on the image partition method for supervising full convolutional neural networks more
CN107397658A (en) * 2017-07-26 2017-11-28 成都快眼科技有限公司 A kind of multiple dimensioned full convolutional network and vision blind-guiding method and device
CN107944347A (en) * 2017-11-03 2018-04-20 西安电子科技大学 Polarization SAR object detection method based on multiple dimensioned FCN CRF
CN110288613A (en) * 2019-06-12 2019-09-27 中国科学院重庆绿色智能技术研究院 A kind of histopathology image partition method of very-high solution

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION; Karen Simonyan, Andrew Zisserman; https://arxiv.org/pdf/1409.1556.pdf; 2015-04-10; pp. 1-14 *
Multi-scale face detection with fully convolutional neural networks (in Chinese); Luo Mingzhu, Xiao Yewei; Computer Engineering and Applications; 2018-11-21; Vol. 55, No. 5, pp. 124-128 *

Also Published As

Publication number Publication date
CN110853057A (en) 2020-02-28

Similar Documents

Publication Publication Date Title
CN110853057B (en) Aerial image segmentation method based on global and multi-scale full-convolution network
CN110136170B (en) Remote sensing image building change detection method based on convolutional neural network
CN112070779B (en) Remote sensing image road segmentation method based on convolutional neural network weak supervised learning
CN111898439B (en) Deep learning-based traffic scene joint target detection and semantic segmentation method
CN113780149B (en) Remote sensing image building target efficient extraction method based on attention mechanism
CN111340034B (en) Text detection and identification method and system for natural scene
CN105608454A (en) Text structure part detection neural network based text detection method and system
CN113505842B (en) Automatic urban building extraction method suitable for large-scale regional remote sensing image
CN114821342B (en) Remote sensing image road extraction method and system
CN113256649B (en) Remote sensing image station selection and line selection semantic segmentation method based on deep learning
CN111709387B (en) Building segmentation method and system for high-resolution remote sensing image
CN112633140A (en) Multi-spectral remote sensing image urban village multi-category building semantic segmentation method and system
CN113239753A (en) Improved traffic sign detection and identification method based on YOLOv4
CN110992366A (en) Image semantic segmentation method and device and storage medium
CN112991364A (en) Road scene semantic segmentation method based on convolution neural network cross-modal fusion
CN116206185A (en) Lightweight small target detection method based on improved YOLOv7
CN113052106A (en) Airplane take-off and landing runway identification method based on PSPNet network
CN114820655A (en) Weak supervision building segmentation method taking reliable area as attention mechanism supervision
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN116206112A (en) Remote sensing image semantic segmentation method based on multi-scale feature fusion and SAM
CN114973136A (en) Scene image recognition method under extreme conditions
Thati et al. A systematic extraction of glacial lakes for satellite imagery using deep learning based technique
CN114119621A (en) SAR remote sensing image water area segmentation method based on depth coding and decoding fusion network
CN116778318A (en) Convolutional neural network remote sensing image road extraction model and method
CN114743023B (en) Wheat spider image detection method based on RetinaNet model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant