CN110334765B - Remote sensing image classification method based on attention mechanism multi-scale deep learning - Google Patents


Info

Publication number: CN110334765B (application CN201910603799.6A)
Authority: CN (China)
Prior art keywords: remote sensing; neural network; training; convolutional neural; image library
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN110334765A
Inventors: 唐旭, 马秋硕, 马晶晶, 焦李成
Current and original assignee: Xidian University
Application filed by Xidian University; priority to CN201910603799.6A

Classifications

    • G06F 18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 — Pattern recognition; classification techniques
    • G06N 3/045 — Neural network architectures; combinations of networks
    • G06T 2207/10032 — Image acquisition modality: satellite or aerial image; remote sensing
    • G06T 2207/20081 — Special algorithmic details: training; learning
    • G06T 2207/20084 — Special algorithmic details: artificial neural networks [ANN]
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a remote sensing image classification method based on attention-mechanism multi-scale deep learning, which mainly addresses the low classification accuracy of the prior art. The scheme is as follows: establish a remote sensing image library and its corresponding categories, normalize the images, and randomly select 80% of the remote sensing image samples from each category to construct a training image library; construct a convolutional neural network comprising a convolutional network module, an attention module, SCDA modules and a fully connected layer; input the training samples of the training image library into the convolutional neural network to obtain their classification results and determine the network's loss function; iteratively minimize the loss function by gradient descent until the loss value stabilizes, yielding a trained convolutional neural network; normalize the remote sensing picture to be classified and input it into the trained network to obtain the classification result. The method offers high classification accuracy and strong robustness, and can be applied to the analysis and management of remote sensing image data.

Description

Remote sensing image classification method based on attention mechanism multi-scale deep learning
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a remote sensing image content classification method which can be applied to analysis and management of remote sensing image data.
Background
With the continuous improvement of the resolution of satellite and aerial remote sensing images, more useful data and information can be obtained from them. However, different applications impose different requirements on remote sensing image processing, so to analyze and manage remote sensing image data effectively, semantic tags need to be attached to the images according to their content. Scene classification is an important way to solve this problem: it refers to distinguishing images with similar scene features from a set of images and classifying them correctly. Compared with natural images, remote sensing images have their own characteristics: because of limited spatial resolution and the phenomena, caused by the complexity of remote sensing imagery, of the same object appearing with different spectra and of different objects sharing the same spectrum, classification results are often wrong. How to classify remote sensing images more accurately therefore remains a challenge.
Classification based on a convolutional neural network inputs the pictures to be trained into the network in batches and reduces a target optimization loss function through repeated training on large batches of data, thereby achieving classification. A number of mature, well-known convolutional neural networks have been proposed; for example, in 2012 Alex Krizhevsky proposed the deep convolutional network model "AlexNet".
Although a conventional convolutional neural network can perform picture scene classification, it still has two defects when learning the semantic information of a picture: first, the complexity of remote sensing images causes inaccurate localization of the classification information; second, the network usually falls into a locally salient region during training, as shown in fig. 1. These two defects cause poor robustness and frequent misclassification when classifying actual scenes.
Disclosure of Invention
The invention aims to provide a remote sensing image classification method based on attention-mechanism multi-scale deep learning that addresses the problems in the prior art, so as to reduce the probability that the classification target of a remote sensing image falls into a local area, enlarge the attention region of the convolutional network, and improve the classification accuracy of remote sensing images.
The technical idea of the invention is as follows: the convolution neural network is used for obtaining convolution characteristics of the picture, useful information which is beneficial to classification is obtained through the attention mechanism according to the attention mechanism principle, multi-scale convolution layer characteristics are extracted from the useful information, and image classification is achieved through the full-connection layer network.
According to the above concept, the implementation steps of the invention include the following:
(1) Establish a remote sensing image library {I_1, I_2, …, I_n, …, I_N} and the corresponding categories {Y_1, Y_2, …, Y_n, …, Y_N} of the image library, and normalize the established remote sensing image library, where n denotes the nth sample index in the image library, n ∈ [0, N], and N is the number of pictures in the remote sensing image library;
(2) Randomly select 80% of the samples from each class of normalized images to construct a training image library {T_1, T_2, …, T_j, …, T_M}, where M < N, T_j denotes the jth picture in the training image library, j ∈ [0, M], and M is the total number of training samples;
(3) Constructing a convolutional neural network comprising a convolutional network module, an attention module, an SCDA module and a full connection layer;
(4) Determine a loss function of the convolutional neural network:
(4a) Input the training image library {T_1, T_2, …, T_j, …, T_M} into the convolutional neural network with pre-trained weights and output the last-layer convolutional feature F;
(4b) Input the last-layer feature F into the attention module of the convolutional neural network and output the convolutional feature A; input A into several SCDA modules of the network with different average thresholds and output T groups of mask convolution features {M_1, M_2, …, M_T}, where T is the number of SCDA modules;
(4c) After global average pooling, input the T groups of mask convolution features into the fully connected layer of the convolutional neural network and output the classification results of the training data, obtaining the loss function of the network:

loss_op = λ_r·loss_1 + λ_s·loss_2 + η·‖W‖_2

where loss_1 is the cross entropy between the output classification result and the actual result, loss_2 is the sum of absolute values of the cross entropies between the classification results output after the T groups of mask convolution features pass through the fully connected layer and the actual result, ‖W‖_2 is the L2 norm of the convolutional neural network weight vector, and λ_r, λ_s, η are the hyper-parameters of loss_1, loss_2 and ‖W‖_2 respectively;
(5) Set the iteration number to P and iteratively train the convolutional neural network by gradient-descent optimization until the loss function loss_op no longer decreases or the iteration number is reached, obtaining a trained convolutional neural network;
(6) The user normalizes the remote sensing image I' to be classified and inputs the normalized image into the trained convolutional neural network to obtain the classification result, completing the image classification.
Compared with the prior art, the invention has the following advantages:
1. Based on the attention mechanism principle, the invention can quickly find salient features in a remote sensing image and concentrate the features used for classification into a region with clear semantic information, enhancing the accuracy of remote sensing scene classification;
2. The SCDA module enlarges the receptive field of the convolutional neural network and reduces the probability that the classification target of a remote sensing image falls into a local area, enhancing the accuracy and robustness of remote sensing image classification;
3. The invention designs a loss function that further constrains the classification task and improves the accuracy of remote sensing image classification.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a diagram of a convolutional neural network constructed in the present invention;
FIG. 3 is a sample view of a remote sensing image used in the simulation of the present invention.
Detailed Description
The effects of the embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, the implementation steps of the invention are as follows:
step 1, establishing a remote sensing image library to obtain a training sample and a test sample.
1a) Download the UC Merced images from the official dataset website and establish a remote sensing image library {I_1, I_2, …, I_n, …, I_N} and the corresponding categories {Y_1, Y_2, …, Y_n, …, Y_N} of the image library, where I_n denotes the nth image in the library, Y_n the category corresponding to the nth image, n the sample index in the library, and n ∈ [0, N];
1b) Normalize the established remote sensing image library according to the following formula:

I'_n = (I_n − V_min) / (V_max − V_min)

where V_max is the maximum pixel value over all images in the remote sensing image library, V_min the minimum pixel value, {I'_1, I'_2, …, I'_n, …, I'_N} the normalized remote sensing image library, I'_n the nth sample of the normalized image set, n ∈ [0, N], and N the number of pictures in the normalized library;
1c) Randomly select 80% of the remote sensing images from each class of the normalized library as the training sample set {T_1, T_2, …, T_j, …, T_M} and take the remaining 20% as the test sample set {t_1, t_2, …, t_d, …, t_m}, where T_j denotes the jth training sample, j ∈ [0, M], t_d the dth test sample, d ∈ [0, m], M the total number of training samples, m the total number of test samples, and m < N, M < N.
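The normalization of 1b) and the random split of 1c) can be sketched in a few lines of NumPy. This is a hedged illustration: the array below is a random stand-in for the UC Merced images, and the shapes are illustrative assumptions.

```python
import numpy as np

def normalize_library(images):
    """Min-max normalize every pixel into [0, 1] using the
    library-wide minimum V_min and maximum V_max, as in step 1b)."""
    v_min, v_max = images.min(), images.max()
    return (images - v_min) / (v_max - v_min)

def split_library(images, train_fraction=0.8, seed=0):
    """Randomly select 80% of the samples for training and keep
    the remaining 20% for testing, as in step 1c)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(images))
    cut = int(train_fraction * len(images))
    return images[idx[:cut]], images[idx[cut:]]

# Illustrative stand-in for the remote sensing image library:
library = np.random.default_rng(1).integers(0, 256, size=(10, 8, 8, 3)).astype(float)
norm = normalize_library(library)
train_set, test_set = split_library(norm)
```

Note that the per-class stratification of the real split is omitted here for brevity; the sketch splits the library as a whole.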
And 2, constructing a convolutional neural network.
Referring to fig. 2, this step is implemented as follows:
2a) Set a convolutional network module composed of the five sequentially connected convolutional layers {conv1, conv2, conv3, conv4, conv5} of a pre-trained AlexNet network;
2b) Set an attention module composed of a global average pooling layer, a first fully connected layer, a Relu activation layer, a second fully connected layer and a Sigmoid function; the structure of the attention module is shown in FIG. 3;
the global average pooling layer: given an input convolutional feature of size W×H×C, it averages each of the C feature maps of size W×H and outputs a 1×1×C feature;
the first fully connected layer: its weight size is set to C′ × C′/16, where C′ is the dimension of the feature input to the first fully connected layer;
the second fully connected layer: its weight size is set to C′/16 × C′, restoring the output to the dimension C′ of the feature input to the attention module;
the Relu and Sigmoid activation functions are respectively:

Relu(x) = max(0, x)
Sigmoid(x′) = 1 / (1 + e^(−x′))

where x is the input of the Relu activation function and x′ is the input of the Sigmoid activation function;
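As a hedged sketch, the attention module of 2b) behaves like a squeeze-and-excitation block: global average pooling, a bottleneck fully connected layer to C′/16, ReLU, a second fully connected layer back to C′, and a Sigmoid gate that reweights the channels. The random weight matrices below are illustrative stand-ins for the learned fully connected layers.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_module(feat, w1, w2):
    """Sketch of step 2b): global average pool the WxHxC feature to a
    C-vector, pass it through FC (C -> C/16), ReLU, FC (C/16 -> C) and
    Sigmoid, then reweight each channel of the input feature map."""
    squeeze = feat.mean(axis=(0, 1))           # global average pooling: 1x1xC
    excite = sigmoid(relu(squeeze @ w1) @ w2)  # per-channel weights in (0, 1)
    return feat * excite                       # channel-wise reweighting

rng = np.random.default_rng(0)
feat = rng.standard_normal((6, 6, 32))  # illustrative WxHxC feature
w1 = rng.standard_normal((32, 2))       # C -> C/16
w2 = rng.standard_normal((2, 32))       # C/16 -> C
att_out = attention_module(feat, w1, w2)
```

Since the Sigmoid gate lies strictly in (0, 1), the output feature is an attenuated copy of the input, channel by channel.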
2c) Set an SCDA module for outputting convolution mask features.
Referring to fig. 2, the SCDA module works as follows:
input the three-dimensional convolutional feature output by the second fully connected layer of the attention module into the SCDA module, sum it over the third (channel) dimension to obtain a two-dimensional convolutional feature, and compute the mean of that two-dimensional feature;
build a convolution mask from the mean: compare each value of the two-dimensional convolutional feature with the mean, coding it as 1 if it is greater than the mean and as 0 if it is smaller;
extract the convolution mask feature: multiply the convolution mask by the set average-value threshold E, add 1 to the resulting mask values, and multiply element-wise by the three-dimensional convolutional feature input to the SCDA module to obtain the mask feature; average the mask feature over its first two (spatial) dimensions to obtain and output the convolution mask feature;
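A minimal NumPy sketch of the SCDA masking just described, assuming the (1 + E·mask) reweighting reading of the text; the feature map and threshold value are illustrative.

```python
import numpy as np

def scda_mask_feature(feat, threshold_scale):
    """Sketch of step 2c): sum the WxHxC feature over channels, threshold
    the resulting WxH map at its mean to get a binary mask, reweight the
    input feature by (1 + E*mask), and average over the spatial dims.
    threshold_scale plays the role of the average-value threshold E."""
    agg = feat.sum(axis=2)                   # channel-wise sum -> W x H
    mask = (agg > agg.mean()).astype(float)  # 1 above the mean, 0 otherwise
    weighted = feat * (1.0 + threshold_scale * mask)[..., None]
    return weighted.mean(axis=(0, 1))        # spatial average -> C-vector

rng = np.random.default_rng(0)
feat = rng.standard_normal((6, 6, 32))  # illustrative WxHxC feature
masked = scda_mask_feature(feat, threshold_scale=0.8)
```

Running several such modules with different thresholds (e.g. 1.0, 0.8, 0.6, as in the simulation section) yields the T groups of mask convolution features.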
2d) Set a fully connected block consisting of three fully connected layers with weight sizes of 512×1024, 1024×1024 and 1024×21 respectively;
2e) Sequentially connect the convolutional network module, the attention module, the SCDA module and the fully connected layer to obtain the convolutional neural network.
Step 3, determining a loss function of the convolutional neural network:
3a) Input the training sample set {T_1, T_2, …, T_j, …, T_M} into the convolutional network module of the convolutional neural network and output the last-layer convolutional feature F;
3b) Input the last-layer feature F into the attention module of the convolutional neural network and output the convolutional feature A; input A into several SCDA modules of the network with different average thresholds and output T groups of mask convolution features {M_1, M_2, …, M_T}, where T is the number of SCDA modules;
3c) Input the T groups of mask convolution features into the fully connected layer of the convolutional neural network and output the classification results of the training data, obtaining the loss function loss_op of the network:

loss_op = λ_r·loss_1 + λ_s·loss_2 + η·‖W‖_2

where ‖W‖_2 is the L2 norm of the convolutional neural network weight vector and λ_r, λ_s, η are the hyper-parameters of loss_1, loss_2 and ‖W‖_2 respectively;

loss_1 = −Σ_j o_j · log(y_j)

denotes the cross entropy between the output classification result and the actual result, where y_j is the predicted class probability of T_j in the training image library and o_j is the actual class label of T_j;

loss_2 = Σ_{m=1}^{T} Σ_{n=m+1}^{T} |loss_m − loss_n|

denotes the sum of absolute values over the cross entropies of the classification results output after the T groups of mask convolution features pass through the fully connected layer, where T is the number of SCDA modules, loss_m is the loss_1 of T_j in the training image library under the mth convolution mask feature, and loss_n is the loss_1 of T_j under the nth convolution mask feature.
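A hedged NumPy sketch of loss_op as described in 3c). Two details are assumptions: the pairwise |loss_m − loss_n| form of loss_2 (one reading of claim 5), and using the first branch's cross entropy as a stand-in for loss_1 (in the patent, loss_1 comes from the fused output).

```python
import numpy as np

def cross_entropy(probs, label):
    """loss_1: cross entropy between the predicted class probabilities
    y_j and the one-hot actual class label o_j."""
    return -float(np.sum(label * np.log(probs + 1e-12)))

def total_loss(branch_probs, label, weights, lam_r=0.7, lam_s=0.3, eta=1e-4):
    """Sketch of loss_op = lam_r*loss_1 + lam_s*loss_2 + eta*||W||_2,
    with loss_2 as the sum of pairwise absolute differences between
    the per-mask cross entropies (assumed reading)."""
    losses = [cross_entropy(p, label) for p in branch_probs]
    loss1 = losses[0]  # illustrative stand-in for the fused-output loss
    loss2 = sum(abs(losses[m] - losses[n])
                for m in range(len(losses)) for n in range(m + 1, len(losses)))
    l2 = float(np.sqrt(sum(np.sum(w ** 2) for w in weights)))
    return lam_r * loss1 + lam_s * loss2 + eta * l2

label = np.array([1.0, 0.0, 0.0])               # one-hot actual class
branch_probs = [np.array([0.7, 0.2, 0.1]),      # per-mask-branch predictions
                np.array([0.6, 0.3, 0.1])]
loss_value = total_loss(branch_probs, label, weights=[np.ones((2, 2))])
```

The default hyper-parameters match the simulation section (λ_r = 0.7, λ_s = 0.3, η = 0.0001).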
Step 4, iteratively training the convolutional neural network.
Existing methods for iteratively training a convolutional neural network include the gradient descent optimization algorithm, the Nesterov accelerated gradient method and the Adagrad method. The invention adopts, but is not limited to, the gradient descent algorithm, implemented as follows:
4a) Set the iteration number to P, the initial learning rate of training to L and the decay rate to β, and input the training image library {T_1, T_2, …, T_j, …, T_M} into the convolutional neural network constructed in step 2 in G batches, the number of pictures Q input each time being:

Q = M / G

where M is the total number of samples in the training image library;
4b) Set the learning rate l corresponding to each input batch as:

l = L·β^G

4c) Update the parameters of the convolutional neural network G times by the following formula to obtain the updated weight vector W_new:

W_new = W − l·(∂loss_op/∂W)

where W is the weight vector of the convolutional neural network's parameters;
substitute the updated weight vector W_new into the loss function in 3c) to obtain the loss function loss_op after the weight update;
4d) Input the next training picture into the convolutional neural network and update the loss function loss_op with the updated weight vector, so that the value of loss_op keeps decreasing;
4e) Repeat 4d): if the loss function loss_op stops decreasing while the current training round is still less than the set iteration number P, stop training the network to obtain the trained convolutional neural network; otherwise, stop training when the training round reaches the set iteration number P, obtaining the trained convolutional neural network.
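The per-batch update of step 4 can be sketched on a toy model. Two assumptions are made: the learning rate decays per batch index g as l = L·β^g (the text writes β^G), and the model is a simple linear least-squares fit rather than the patent's network, so the update W_new = W − l·∂loss/∂W stays self-contained.

```python
import numpy as np

def train(X, y, L=1e-2, beta=0.9, epochs=5, batches=2):
    """Gradient-descent sketch of steps 4a)-4e): split the data into G
    batches of Q = M/G samples, decay the learning rate per batch, and
    apply W_new = W - l * dLoss/dW on a toy squared-error loss."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal(X.shape[1]) * 0.01
    per_batch = len(X) // batches           # Q = M / G samples per update
    for _ in range(epochs):                 # iteration rounds, up to P
        for g in range(batches):
            lr = L * beta ** g              # assumed per-batch decay l = L*beta^g
            xb = X[g * per_batch:(g + 1) * per_batch]
            yb = y[g * per_batch:(g + 1) * per_batch]
            grad = xb.T @ (xb @ w - yb) / len(xb)  # squared-error gradient
            w = w - lr * grad                       # W_new = W - l * dLoss/dW
    return w

X = np.random.default_rng(2).standard_normal((20, 3))
y = X @ np.array([1.0, -2.0, 0.5])  # illustrative targets
w_fit = train(X, y, epochs=50)
```

In the patent the same loop runs over loss_op with the convolutional network's weight vector; the structure of the updates is identical.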
and 5, classifying the remote sensing scene pictures input by the user.
5a) The user normalizes the remote sensing image to be classified, namely acquiring the maximum value V 'of pixel points of the remote sensing image to be classified' max And minimum value V 'of pixel point' min And dividing the values of all pixel points of the remote sensing image to be classified by V' max And V' min Obtaining the remote sensing image to be classified after normalization processing;
5b) And inputting the remote sensing image after the normalization processing into a trained convolution network model to obtain a classification result.
The effects of the present invention can be further illustrated by the following simulations:
1. Simulation conditions
This embodiment completed the scene classification simulations of the invention and of the existing remote sensing image classifiers on an HP Z840 Workstation with a Xeon(R) CPU E5-2630, a GeForce TITAN XP, 64 GB RAM, the Ubuntu system and the TensorFlow platform.
The simulation parameters were set as follows: iteration number P = 100, learning rate 0.00001, λ_r = 0.7, λ_s = 0.3, η = 0.0001, number of input batches G = 6, decay rate β = 0.9; three groups of SCDA modules were used, with average thresholds p_1 = 1.0, p_2 = 0.8 and p_3 = 0.6; the training data were augmented to four times the original amount by random rotation. In each training iteration, the class-label discriminator and the classification-difference optimizer were trained together.
2. Simulation content
Download the UC Merced remote sensing image set shown in FIG. 3 and normalize it: obtain the maximum pixel value V″_max and the minimum pixel value V″_min of the UC Merced image set, subtract V″_min from the value of every pixel and divide by (V″_max − V″_min) to obtain the normalized UC Merced image set;
randomly select 80% of the remote sensing images from the normalized UC Merced set as the training sample set D_T and take the remaining 20% as the test sample set D_t.
Under the above simulation conditions, the training sample set D_T was used to train the invention and three currently representative image classification models, and the test sample set D_t was used to compare their classification accuracy; the results are shown in Table 1.
The images in the training and test sample sets belong to 21 classes: agricultural, airplane, baseballdiamond, beach, buildings, chaparral, denseresidential, forest, freeway, golfcourse, harbor, intersection, mediumresidential, mobilehomepark, overpass, parkinglot, river, runway, sparseresidential, storagetanks, tenniscourt.
TABLE 1. Performance evaluation of the invention's classification model and existing remote sensing image classifiers

Method          Test sample accuracy
The invention   0.9849
MSCP            0.9782
SHHTFM          0.9789
DCA             0.9690
In table 1, MSCP is the existing remote sensing image classification method based on multi-stack covariance pooling, SHHTFM is the existing remote sensing image classification method based on isomorphic heterogeneous sparsity, and DCA is the existing remote sensing image classification method based on depth feature fusion.
As can be seen from Table 1, when the training sample set D_T comprises 80% of the UC Merced image set, the convolutional neural network trained by the method of the invention achieves, on the remaining 20% test sample set D_t, a classification accuracy higher than that of the currently representative remote sensing image classification models.
In conclusion, the remote sensing image classification effect of the invention is obviously better than that of other remote sensing image classification models.

Claims (6)

1. A remote sensing image classification method based on attention-mechanism multi-scale deep learning, characterized by comprising the following steps:
(1) Establish a remote sensing image library {I_1, I_2, …, I_n, …, I_N} and the corresponding categories {Y_1, Y_2, …, Y_n, …, Y_N} of the image library, and normalize the established remote sensing image library, where I_n denotes the nth image in the library, Y_n the category corresponding to the nth image, n the sample index in the library, n ∈ [0, N], and N the number of pictures in the remote sensing image library;
(2) Randomly select 80% of the remote sensing image samples from each class of normalized remote sensing images to construct a training image library {T_1, T_2, …, T_j, …, T_M} and take the remaining 20% of the remote sensing images as the test sample set {t_1, t_2, …, t_d, …, t_m}, where T_j denotes the jth picture in the training image library, j ∈ [0, M], t_d the dth test sample, d ∈ [0, m], M the total number of training samples, m the total number of test samples, and m < N, M < N;
(3) Construct a convolutional neural network comprising a convolutional network module, an attention module, an SCDA module and a fully connected layer;
(4) Determine a loss function of the convolutional neural network:
(4a) Input the training image library {T_1, T_2, …, T_j, …, T_M} into the convolutional network module of the convolutional neural network and output the last-layer convolutional feature F;
(4b) Input the last-layer feature F into the attention module of the convolutional neural network and output the convolutional feature A; input A into several SCDA modules of the network with different average thresholds and output T groups of mask convolution features {M_1, M_2, …, M_T}, where T is the number of SCDA modules;
(4c) After global average pooling, input the T groups of mask convolution features into the fully connected layer of the convolutional neural network and output the classification results of the training data, obtaining the loss function of the network:

loss_op = λ_r·loss_1 + λ_s·loss_2 + η·‖W‖_2

where loss_1 is the cross entropy between the output classification result and the actual result, loss_2 is the sum of absolute values of the cross entropies between the classification results output after the T groups of mask convolution features pass through the fully connected layer and the actual result, ‖W‖_2 is the L2 norm of the convolutional neural network weight vector, and λ_r, λ_s, η are the hyper-parameters of loss_1, loss_2 and ‖W‖_2 respectively;
(5) Set the iteration number to P and iteratively train the convolutional neural network by gradient-descent optimization until the loss function loss_op no longer decreases or the iteration number is reached, obtaining a trained convolutional neural network;
(6) The user normalizes the remote sensing image I' to be classified and inputs the normalized image into the trained convolutional neural network to obtain the classification result, completing the image classification.
2. The method of claim 1, wherein the remote sensing image library in (1) is normalized by the following equation:

I'_n = (I_n − V_min) / (V_max − V_min)

where V_max is the maximum pixel value over all images in the remote sensing image library, V_min the minimum pixel value, {I'_1, I'_2, …, I'_n, …, I'_N} the normalized remote sensing image library, I'_n the nth normalized remote sensing image sample, and n ∈ [0, N].
3. The method of claim 1, wherein the convolutional network module, attention module, SCDA module and fully connected layer forming the convolutional neural network in (3) are set as follows:
the convolutional network module is composed of the five sequentially connected convolutional layers {conv1, conv2, conv3, conv4, conv5} of a pre-trained AlexNet network;
the attention module is composed of a global average pooling layer, a first fully connected layer, a Relu activation function, a second fully connected layer and a Sigmoid function;
the SCDA module is composed, in order, of a convolutional-channel summation layer and a mask layer.
4. The method of claim 1, wherein the cross entropy loss_1 between the output classification result and the actual result in (4c) is given by:

loss_1 = −Σ_j o_j · log(y_j)

where y_j is the predicted class probability of T_j in the training image library and o_j is the actual class label of T_j.
5. The method of claim 1, wherein the sum of absolute values over the cross entropies of the classification results output after the T groups of mask convolution features pass through the fully connected layer in (4c) is given by:

loss_2 = Σ_{m=1}^{T} Σ_{n=m+1}^{T} |loss_m − loss_n|

where T denotes the number of SCDA modules, loss_m denotes the loss_1 of T_j in the training image library under the mth convolution mask feature, and loss_n denotes the loss_1 of T_j under the nth convolution mask feature.
6. The method of claim 1, wherein (5) the convolutional neural network is iteratively trained by gradient descent optimization, which is implemented as follows:
(5a) Set the initial training learning rate to L and the decay rate to β, and input the training image library {T_1, T_2, …, T_j, …, T_M} into the constructed convolutional neural network G times, where the number of pictures Q input each time is:

Q = M / G

wherein M is the total number of samples in the training image library;
(5b) Set the learning rate l corresponding to each input picture as:

l = L · β^G
(5c) Update the parameters of the convolutional neural network G times through the following formula to obtain the updated weight vector W_new:

W_new = W − l · ∂loss_op/∂W

wherein W is the weight vector of the parameters of the convolutional neural network;
(5d) Input the next training picture into the convolutional neural network and update the weight vector according to the loss function loss_op, so that the value of the loss function loss_op keeps decreasing;
(5e) Repeat (5d): if the loss function loss_op converges while the current number of training rounds is less than the set iteration number P, stop training the network to obtain the trained convolutional neural network; otherwise, stop training when the number of training rounds reaches the set iteration number P, obtaining the trained convolutional neural network.
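Steps (5a)-(5e) can be sketched as plain gradient descent with an exponentially decayed learning rate. The toy quadratic objective and its gradient below are illustrative assumptions standing in for loss_op:

```python
# Sketch of claim 6: learning rate l = L * beta**g decayed per round and the
# update W_new = W - l * d(loss_op)/dW, stopping at convergence or after P
# rounds. The toy objective loss_op(w) = ||w||^2 (gradient 2w) is an assumption.
import numpy as np

def train(w, grad_fn, L=0.1, beta=0.9, P=100, tol=1e-8):
    for g in range(P):                  # at most P training rounds, step (5e)
        lr = L * beta ** g              # decayed learning rate, step (5b)
        step = lr * grad_fn(w)
        w = w - step                    # weight update W_new, step (5c)
        if np.linalg.norm(step) < tol:  # stop early once loss_op stabilizes
            break
    return w

w_final = train(np.array([1.0, -2.0]), lambda w: 2.0 * w)
```

Because the step size shrinks geometrically, the updates are guaranteed to stop changing the weights even if the iteration cap P is never reached.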
CN201910603799.6A 2019-07-05 2019-07-05 Remote sensing image classification method based on attention mechanism multi-scale deep learning Active CN110334765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910603799.6A CN110334765B (en) 2019-07-05 2019-07-05 Remote sensing image classification method based on attention mechanism multi-scale deep learning

Publications (2)

Publication Number Publication Date
CN110334765A CN110334765A (en) 2019-10-15
CN110334765B true CN110334765B (en) 2023-03-24

Family

ID=68144267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910603799.6A Active CN110334765B (en) 2019-07-05 2019-07-05 Remote sensing image classification method based on attention mechanism multi-scale deep learning

Country Status (1)

Country Link
CN (1) CN110334765B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866494B (en) * 2019-11-14 2022-09-06 三亚中科遥感研究所 Urban group extraction method and system based on optical remote sensing image
CN111046962B (en) * 2019-12-16 2022-10-04 中国人民解放军战略支援部队信息工程大学 Sparse attention-based feature visualization method and system for convolutional neural network model
CN111178304B (en) * 2019-12-31 2021-11-05 江苏省测绘研究所 High-resolution remote sensing image pixel level interpretation method based on full convolution neural network
CN111339862B (en) * 2020-02-17 2021-04-27 中国地质大学(武汉) Remote sensing scene classification method and device based on channel attention mechanism
CN111275192B (en) * 2020-02-28 2023-05-02 交叉信息核心技术研究院(西安)有限公司 Auxiliary training method for improving accuracy and robustness of neural network simultaneously
CN111723674B (en) * 2020-05-26 2022-08-05 河海大学 Remote sensing image scene classification method based on Markov chain Monte Carlo and variation deduction and semi-Bayesian deep learning
CN111861880B (en) * 2020-06-05 2022-08-30 昆明理工大学 Image super-fusion method based on regional information enhancement and block self-attention
CN111738124B (en) * 2020-06-15 2023-08-22 西安电子科技大学 Remote sensing image cloud detection method based on Gabor transformation and attention
CN111797941A (en) * 2020-07-20 2020-10-20 中国科学院长春光学精密机械与物理研究所 Image classification method and system carrying spectral information and spatial information
CN112101190B (en) * 2020-09-11 2023-11-03 西安电子科技大学 Remote sensing image classification method, storage medium and computing device
CN112580557A (en) * 2020-12-25 2021-03-30 深圳市优必选科技股份有限公司 Behavior recognition method and device, terminal equipment and readable storage medium
CN112926380B (en) * 2021-01-08 2022-06-24 浙江大学 Novel underwater laser target intelligent recognition system
CN113191285B (en) * 2021-05-08 2023-01-20 山东大学 River and lake remote sensing image segmentation method and system based on convolutional neural network and Transformer
CN113177523A (en) * 2021-05-27 2021-07-27 青岛杰瑞工控技术有限公司 Fish behavior image identification method based on improved AlexNet
CN113505651A (en) * 2021-06-15 2021-10-15 杭州电子科技大学 Mosquito identification method based on convolutional neural network
CN113435531B (en) * 2021-07-07 2022-06-21 中国人民解放军国防科技大学 Zero sample image classification method and system, electronic equipment and storage medium
CN113449712B (en) * 2021-09-01 2021-12-07 武汉方芯科技有限公司 Goat face identification method based on improved Alexnet network
CN114286113B (en) * 2021-12-24 2023-05-30 国网陕西省电力有限公司西咸新区供电公司 Image compression recovery method and system based on multi-head heterogeneous convolution self-encoder
CN114092832B (en) * 2022-01-20 2022-04-15 武汉大学 High-resolution remote sensing image classification method based on parallel hybrid convolutional network

Citations (3)

Publication number Priority date Publication date Assignee Title
CN106909924A (en) * 2017-02-18 2017-06-30 北京工业大学 A kind of remote sensing image method for quickly retrieving based on depth conspicuousness
CN108830296A (en) * 2018-05-18 2018-11-16 河海大学 A kind of improved high score Remote Image Classification based on deep learning
WO2018214195A1 (en) * 2017-05-25 2018-11-29 中国矿业大学 Remote sensing imaging bridge detection method based on convolutional neural network


Similar Documents

Publication Publication Date Title
CN110334765B (en) Remote sensing image classification method based on attention mechanism multi-scale deep learning
CN108647742B (en) Rapid target detection method based on lightweight neural network
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN110516596B (en) Octave convolution-based spatial spectrum attention hyperspectral image classification method
CN107066559B (en) Three-dimensional model retrieval method based on deep learning
CN109344736B (en) Static image crowd counting method based on joint learning
CN110348399B (en) Hyperspectral intelligent classification method based on prototype learning mechanism and multidimensional residual error network
CN108108751B (en) Scene recognition method based on convolution multi-feature and deep random forest
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN109840560B (en) Image classification method based on clustering in capsule network
CN111753828B (en) Natural scene horizontal character detection method based on deep convolutional neural network
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
CN110598029A (en) Fine-grained image classification method based on attention transfer mechanism
CN107832797B (en) Multispectral image classification method based on depth fusion residual error network
CN110716792B (en) Target detector and construction method and application thereof
CN113033520A (en) Tree nematode disease wood identification method and system based on deep learning
CN111833322B (en) Garbage multi-target detection method based on improved YOLOv3
CN113221956B (en) Target identification method and device based on improved multi-scale depth model
CN112364747B (en) Target detection method under limited sample
CN112364974B (en) YOLOv3 algorithm based on activation function improvement
CN110427819A (en) The method and relevant device of PPT frame in a kind of identification image
CN111914902A (en) Traditional Chinese medicine identification and surface defect detection method based on deep neural network
CN111079837A (en) Method for detecting, identifying and classifying two-dimensional gray level images
CN111222545B (en) Image classification method based on linear programming incremental learning
CN110516700B (en) Fine-grained image classification method based on metric learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant