CN113408594B - Remote sensing scene classification method based on attention network scale feature fusion - Google Patents

Remote sensing scene classification method based on attention network scale feature fusion

Info

Publication number
CN113408594B
CN113408594B
Authority
CN
China
Prior art keywords
network
image
classification
remote sensing
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110622695.7A
Other languages
Chinese (zh)
Other versions
CN113408594A (en)
Inventor
郑禄
肖鹏飞
帖军
吴立锋
刘振宇
田莎莎
张潇
于舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South Central Minzu University
Original Assignee
South Central University for Nationalities
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South Central University for Nationalities filed Critical South Central University for Nationalities
Priority to CN202110622695.7A priority Critical patent/CN113408594B/en
Publication of CN113408594A publication Critical patent/CN113408594A/en
Application granted granted Critical
Publication of CN113408594B publication Critical patent/CN113408594B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing scene classification method based on attention network scale feature fusion, comprising the following steps: inputting a training data set containing multiple types of remote sensing scene images; preprocessing the training data set; extracting multi-scale features of the remote sensing image through a convolutional neural network, obtaining attention regions of the image at different scales with a multi-box attention network model, and cropping, scaling, and inputting the attention regions into a three-layer network structure; fusing the features of the original image at different scales with the image features of its attention regions, expressing them with LBP global features, and inputting them into the fully connected layer of the network to complete classification prediction; inputting the images of the training data set into the multi-box attention network model MS-APN for learning and training; and performing scene classification of remote sensing images with the trained multi-box attention network model MS-APN. The method can extract multi-scale, multi-angle features of remote sensing images and achieves better remote sensing scene classification and recognition performance.

Description

Remote sensing scene classification method based on attention network scale feature fusion
Technical Field
The invention relates to the technical field of remote sensing image processing, in particular to a remote sensing scene classification method based on attention network scale feature fusion.
Background
With the continuous development of satellite sensors and remote sensing technology, a large number of high-spatial-resolution (HSR) remote sensing images have become available. These images often contain rich spatial and semantic information and are widely applied in land-use planning, intelligent agriculture, key target detection, military applications, and other fields. Scene classification of high-resolution images, which aims to assign a reasonable semantic label to each image, is an important research topic. Because high-resolution scenes typically contain rich semantic information and complex spatial patterns, classifying them accurately is a challenging task.
Because remote sensing images are acquired at different times and positions, scenes of the same type may exhibit inconsistent textures and, owing to inconsistent orientations, clearly different shapes and sizes. Meanwhile, environmental factors such as illumination cause large color differences among ground-object types of the same category. Remote sensing images of ground-object types within the same category therefore differ in texture, shape, and color. Moreover, since high-resolution remote sensing images are captured from high altitude, the scenes they contain appear in different orientations, and the large variation in the distance between the sensor and the observed object causes the scale of the remote sensing images to vary.
When classifying remote sensing image scenes, the information that best distinguishes different categories is usually concentrated in local feature regions, whereas traditional convolutional neural networks focus more on processing the global information of the image and easily lose the local detail information. Recent work has introduced the attention mechanism into fine-grained image classification, proposing a recurrent attention convolutional neural network that applies a box-regression mechanism at three different levels to regress the most discriminative feature region of the image and finally fuses the image features extracted at the three levels to complete classification, further improving fine-grained classification performance. However, that research is based on depth features at a single scale and angle of the image, while remote sensing images are becoming increasingly diverse in target types, ground-object scales, and other respects, so the scale and angle of remote sensing scenes vary greatly.
Remote sensing scene images contain targets of diverse scales and angles, with large differences in shape, texture, and color. Traditional convolutional neural networks focus more on processing the global information of the image, and previous research relies only on depth features at a single scale of the image, ignoring the differences in extracted features caused by rotation angle.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects in the prior art, a remote sensing scene classification method based on attention network scale feature fusion.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the invention provides a remote sensing scene classification method based on attention network scale feature fusion, which comprises the following steps:
a training stage:
step 1, inputting a training data set comprising various types of remote sensing scene images;
step 2, preprocessing the training data set to keep the sizes of the input images consistent, and normalizing the pixel values of the images;
step 3, constructing a multi-box attention network model MS-APN as the image classification model: extracting multi-scale features of the remote sensing image through a convolutional neural network, obtaining attention regions of the image at different scales with the multi-box attention network model, and cropping, scaling, and inputting the attention regions into a three-layer network structure; fusing the features of the original image at different scales with the image features of its attention regions, expressing them with LBP global features, and inputting them into the fully connected layer of the network to complete classification prediction;
step 4, inputting the images of the training data set into the multi-box attention network model MS-APN for learning and training, alternately training two loss functions until both converge, and then fixing the parameters of the whole model to obtain the multi-box attention network model MS-APN finally used for classification;
and (3) a testing stage:
and 5, inputting the remote sensing image to be recognized, and outputting a scene classification result in the remote sensing image through the trained multi-box attention network model MS-APN.
Further, the remote sensing scene image input in step 1 of the present invention includes a plurality of different public data sets, each data set includes a plurality of types of remote sensing scenes, and each remote sensing scene includes a plurality of images.
Further, the method for performing normalization processing on the pixel values of the image in step 2 of the present invention is as follows:
the input image is an RGB three-channel color image whose pixel values range from 0 to 255; all pixel values are normalized from [0, 255] to [0, 1], which accelerates convergence in the initial iterations of network training, with the formula:

$y_{i,j} = x_{i,j}/255$

where $x_{i,j}$ denotes the image before preprocessing, $y_{i,j}$ denotes the preprocessed image, and $i \in [0, 223]$, $j \in [0, 223]$.
Further, the step 3 of the present invention includes:
step 3.1, multi-scale feature extraction: VGG-16 is selected as the image classification sub-network to extract multi-scale features of the remote sensing scene image, the VGG-16 network consisting of 13 convolution layers and 3 fully connected layers; the scene image input to the network first passes through two blocks of two 3×3 convolution layers, each block followed by max-pooling downsampling, then through three blocks of three 3×3 convolution layers, each block followed by max-pooling downsampling, and finally through the 3 fully connected layers, after which a softmax classifier outputs the probability of the category to which the scene image belongs;
step 3.2, attention region extraction: the attention region is selected using different prior rectangular boxes, and joint identification finally locates the extracted feature regions to the attention regions at different scales;
step 3.3, multi-scale feature fusion: the features of the original image at different scales are fused with the image features of its attention regions, expressed with LBP global features, and input into the fully connected layer of the network to complete classification prediction.
Further, the specific method of step 3.2 of the present invention is:
if the APN output is the center coordinates $(t_a, t_b)$ of the square candidate box, $t_h$ is half the side length of the square candidate box, $N$ is the number of pixels in the square candidate box, and $W_i$ and $H_i$ respectively denote half the length and half the width of the i-th prior rectangular box, the maximum value of i being set to 3 by comparison, and the length-to-width ratio of the i-th prior rectangular box being defined as $K_i$, then:

$N = (2t_h)^2 = 4t_h^2, \quad K_i = W_i/H_i$

the area of the prior rectangular box is specified to equal the area of the output square box, so that:

$N = 2W_i \times 2H_i = 4K_iH_i^2$

new expressions for $W_i$ and $H_i$ are obtained as follows:

$W_i = \mathrm{int}(t_h\sqrt{K_i}), \quad H_i = \mathrm{int}(t_h/\sqrt{K_i})$

where int(·) is the round-down function; the rectangular box is represented by the two vertices at its upper-left and lower-right corners, giving the coordinate values of the upper-left and lower-right corners of the target region:

$t_{a(ul)} = t_a - W_i, \quad t_{b(ul)} = t_b - H_i$

$t_{a(br)} = t_a + W_i, \quad t_{b(br)} = t_b + H_i$
after the coordinates are determined, the cropped region is expressed as $X_{att} = X \odot M(t_a, t_b, t_h, K_i)$, where $\odot$ denotes element-wise multiplication and $X_{att}$ denotes the cropped target region; the M(·) function is a differentiable cropping function which, being continuous, is easier to optimize in the backpropagation of the neural network:

$M(\cdot) = [\sigma(x - t_{a(ul)}) - \sigma(x - t_{a(br)})] \cdot [\sigma(y - t_{b(ul)}) - \sigma(y - t_{b(br)})]$

where $\sigma(\cdot)$ is the sigmoid function, expressed as:

$\sigma(x) = \frac{1}{1 + \exp(-kx)}$

whose output lies in the open interval (0, 1); when k is large enough, M(·) approaches 1 only for points whose x value lies between $t_{a(ul)}$ and $t_{a(br)}$ and whose y value lies between $t_{b(ul)}$ and $t_{b(br)}$, i.e., points inside the feature region, and approaches 0 otherwise; finally, the selected target region is enlarged by bilinear interpolation to obtain the scale of the next network input:

$X^{amp}_{(i,j)} = \sum_{\alpha=0}^{1}\sum_{\beta=0}^{1} |1-\alpha-\{m\}| \cdot |1-\beta-\{n\}| \cdot X^{att}_{([m]+\alpha,\,[n]+\beta)}$

where the values of m and n are expressed as:

$m = i/S, \quad n = j/S$

where (m, n) denotes any point in the attention region of the original image, (i, j) denotes the point corresponding to (m, n) after enlargement, S denotes the picture enlargement factor, and [·] and {·} denote the integer part and the fractional part, respectively.
Further, said step 3.1 of the present invention comprises:
1) the multi-scale image input to the network model enters the classification sub-network of the first scale layer for feature extraction and classification, yielding the probability $P_t^{(1)}$ that the first scale layer's classification sub-network predicts the correct classification label;
2) the features extracted by the first scale layer's classification sub-network are input into the multi-box attention network d1 to obtain the target region, which is cropped and enlarged to serve as the input of the second scale layer;
3) the multi-scale feature image output by the first scale layer is input into the second scale layer's classification sub-network for feature extraction and classification, yielding the probability $P_t^{(2)}$ that the second scale layer's classification sub-network predicts the correct label;
4) the features extracted by the second scale layer's classification sub-network are input into the multi-box attention network d2 to obtain the target region, which is cropped and enlarged to serve as the input of the third scale layer;
5) the multi-scale feature image output by the second scale layer is input into the third scale layer's classification sub-network for feature extraction and classification, yielding the probability $P_t^{(3)}$ that the third scale layer's classification sub-network predicts the correct label;
6) by constraining the network model's correct-label probabilities so that $P_t^{(3)}$ is greater than $P_t^{(2)}$ and $P_t^{(2)}$ is greater than $P_t^{(1)}$, the target region extracted by the multi-box attention network becomes progressively more accurate;
7) finally, the features extracted by the three scale layers' classification sub-networks are input into the LBP operator for feature fusion, completing the scene classification task.
Further, in the step 4 of the present invention:
the MSA-CNN network model loss function consists of two parts, $L_{cls}$ and $L_{rank}$, where $L_{cls}$ denotes the classification loss, i.e., the loss of the predicted remote sensing image class against the true class label in each of the three classification layers, and $L_{rank}$ denotes the loss incurred when, between two adjacent layers of the network, the higher-level network predicts with lower probability than the lower-level network; the joint loss function is trained by alternating the two losses, and the model's joint loss function L(x) is:

$L(x) = \sum_{s=1}^{3} L_{cls}(Y^{(s)}, Y^{*}) + \sum_{s=1}^{2} L_{rank}(P_t^{(s)}, P_t^{(s+1)})$

where $Y^{(s)}$ denotes the class predicted by the network model at layer s, $Y^{*}$ denotes the true class of the remote sensing image, and $P_t^{(s)}$ denotes the probability of the correct label predicted by layer s of the network structure; $L_{rank}$ is computed as:

$L_{rank}(P_t^{(s)}, P_t^{(s+1)}) = \max\{0,\; P_t^{(s)} - P_t^{(s+1)} + 0.05\}$

the loss generated when the probability of the true class predicted by layer s+1 is less than that predicted by layer s is updated by the maximum-value method, so that the probability with which the network model predicts the true class increases layer by layer; when $P_t^{(s)} - P_t^{(s+1)} + 0.05 > 0$, the loss between adjacent layers is updated, the extra margin of 0.05 preventing the loss from becoming 0 and halting updates when the two layers' probabilities are equal.
Further, in the step 4 of the present invention:
the MSA-CNN network model adopts cyclic cross-training of the VGG-16 network and the MS-APN: first, the pre-trained parameters of the VGG-16 network are used to realize feature extraction of the multi-scale remote sensing image and initialization of the classification sub-networks, and the MS-APN parameters $\{t_a, t_b, t_h, K_i\}$ are initialized with the region of highest response in the last convolution layer of the VGG-16 network; then the MS-APN parameters are fixed and the VGG-16 sub-networks are trained until their loss function $L_{cls}$ converges; next the VGG-16 network parameters are fixed and the MS-APN is trained until its loss function $L_{rank}$ converges; finally the MS-APN and VGG-16 sub-networks are trained alternately in cycles until the joint loss functions $L_{cls}$ and $L_{rank}$ both converge, at which point the parameters of the whole model are fixed, yielding the model finally used for classification, with which the final scene classification prediction is performed.
The invention has the following beneficial effects: the remote sensing scene classification method based on attention network scale feature fusion is realized on the basis of a multi-scale feature-fusion attention model. Because the category diversity of remote sensing images is high, the method combines a multi-box attention model with LBP feature fusion to achieve better results and can effectively overcome the shortcomings of comparable techniques. 1) The invention can extract multi-scale, multi-angle features of the image; 2) the invention focuses more on classifying the target feature region of the image; 3) the method has substantial research significance and is important for land-use planning, intelligent agriculture, key target detection, and the like.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of remote sensing image scene classification according to an embodiment of the present invention;
FIG. 2 illustrates the principle of the MS-APN operation according to an embodiment of the present invention;
FIG. 3 is an LBP feature extraction of an embodiment of the present invention;
fig. 4 is a MSA-CNN network model structure of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The remote sensing scene classification method based on attention network scale feature fusion (MSA-CNN) first extracts multi-scale features of the remote sensing image through a convolutional neural network, obtains attention regions of the image at different scales with the multi-box attention model, and crops, scales, and inputs the attention regions into a three-layer network structure. The features of the original image at different scales are then fused with the image features of its attention regions, LBP global feature expression is used to overcome the differences among remote sensing images caused by different shooting angles, and the result is finally input into the fully connected layer of the network to complete classification prediction. The method mainly comprises data set preprocessing, multi-scale feature extraction, attention region extraction, and feature fusion and classification; the remote sensing image classification process is shown in FIG. 1.
1) Multi-box attention model
The multi-box attention model is mainly used to locate the attention region of an image more accurately when classifying remote sensing scenes. It gradually focuses on the key region during feature training of the three-layer network, becoming increasingly precise, and its alternating training with the classification network allows the two to promote each other, making it an effective unsupervised network.
2) Classification sub-networks
The convolution layers of the network's 3 sub-networks at different levels extract features from the input image. From the extracted features, region information is obtained through the multi-box attention model on the one hand, and on the other hand the features are passed to the fully connected (fc) and softmax layers to predict the class probability of the image; the task here is classification.
The method comprises the following specific steps:
a training stage:
step 1, inputting a training data set;
the UC Mercded Land-Use data set is obtained by manual extraction of data downloaded from national maps of the national survey bureau of geological survey, which is released in 2010, by California university and is now the most widely used data set for remote sensing image scene classification research. The UC Merced Land-Use data set comprises 2100 remote sensing images, including 21 types of Land utilization remote sensing scenes, each type comprises 100 images, and the resolution of each image is 256 multiplied by 256.
The NWPU-RESISC45 data set was obtained by Northwestern Polytechnical University by manually cropping Google Earth remote sensing data and was released in 2017; it is currently the public data set containing the most remote sensing scene types and images and serves as a benchmark for remote sensing image scene classification. The NWPU-RESISC45 data set contains 31500 remote sensing images covering 45 classes of remote sensing scenes, 700 images per class, each with a resolution of 256 × 256, the same as the UC Merced Land-Use data set.
The SuperView-1 (GaoJing-1) image data set has a size of 24631 × 23945 pixels and was acquired over the Jiangxia and Xixia areas of Wuhan City, Hubei Province, China. The terrain is plain, with an elevation of 20 m and an area of 85.5 km²; the panchromatic band has a resolution of 0.5 m and the multispectral bands a resolution of 2 m, covering the four bands of red, green, blue, and near-infrared. The remote sensing images were preprocessed with radiometric, atmospheric, and orthorectification corrections before the experiments.
Step 2, preprocessing a data set;
since the fully connected layer exists in the overall classification network and the input images are required to be consistent in size, the sizes of the images of the training set and the test set are firstly uniformly adjusted to the specification of 224 × 224 to meet the network input requirements. Because the input image is an RGB color three-channel image, the value range of each pixel point of the image is between 0 and 255, in order to meet the requirement of the network on input data, the range of all pixel values is normalized from [0,255] to [0,1], the convergence rate of the initial iteration of the training network is accelerated, and the formula is as follows:
yi,j=2xi,j/255 (1)
x in the formulai,jRepresenting the image before preprocessing, yi,jRepresenting the preprocessed image, where i e [0,223 ]],j∈[0,223]。
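As a minimal sketch of this preprocessing step (Python with NumPy and Pillow; the function name is our own):

```python
import numpy as np
from PIL import Image

def preprocess(path: str) -> np.ndarray:
    """Resize an RGB image to 224 x 224 and normalize pixels to [0, 1] per (1)."""
    img = Image.open(path).convert("RGB").resize((224, 224), Image.BILINEAR)
    x = np.asarray(img, dtype=np.float32)  # x[i, j] in [0, 255], i, j in [0, 223]
    return x / 255.0                       # y[i, j] = x[i, j] / 255, in [0, 1]
```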
Step 3, designing a picture classification model;
the MSA-CNN model extracts the characteristics of the remote sensing image in multiple scales through a convolutional neural network (VGG-16), obtains the attention area of the image in different scales by using a multi-box attention model (MS-APN), cuts and scales the attention area and inputs the cut and scaled attention area into a three-layer network structure. And then, fusing the features of the original image in different scales and the image features of the attention area of the original image, utilizing LBP global feature expression to overcome the difference of the remote sensing image caused by different shooting angles, and finally inputting the remote sensing image into a network full-connection layer to finish classification prediction.
Step 3.1, multi-scale feature extraction;
the invention selects VGG-16 as an image classification sub-network to extract the multi-scale characteristics of the remote sensing scene image, because the VGG-16 network has stable network structure and characteristic extraction capability, the characteristic extraction capability of the remote sensing image with multi-scale targets and complex background information is stronger, and the completeness of characteristic information and result accuracy are ensured. The VGG-16 network is composed of 13 layers of convolution layers and 3 layers of full connection layers, firstly, a scene image input into the network is subjected to 2 times of 2-layer 3 x 3 convolution and maximum pooling downsampling operation, then is subjected to 3 times of 3-layer 3 x 3 convolution and maximum pooling downsampling operation, and finally is subjected to 3 layers of full connection layers, and the probability of the category to which the scene image belongs is output by a softmax classifier.
Step 3.2, extracting attention area;
an attention subnetwork (APN) of RA-CNN consists of 2 full-connection layers, and an APN module is used for extracting a region of interest of a remote sensing scene image, but the method for extracting the region of interest is to fixedly use a square framing target, so that the extracted region of interest contains parts which do not belong to features or parts which originally belong to a plurality of features are easily extracted incompletely. For high-resolution remote sensing images, the region of interest in the picture is usually irregular in geometric shape, so that the background region is more easily counted by using only a square candidate box. Therefore, the target area is selected by adopting different prior rectangular frames, and finally the extracted characteristic areas can be more accurately positioned to the target area by adopting combined identification. Therefore, the improvement is made on the basis of the original APN attention model, such as a multi-box attention network model MS-APN provided in FIG. 2.
If the channel output by the APN is the coordinate (t) of the center point of the square candidate boxa,tb),thIs half of the square candidate frame, N is the number of pixel points of the square candidate frame, WiAnd HiRespectively representing half of the length and the width of the ith prior rectangular frame, and setting the maximum value of i to be 3 through comparison, the ratio of the length to the width of the ith prior rectangular frame can be defined as KiThen, there are:
N=(2th)2=4th 2,Ki=Wi/Hi (2)
it is specified that the area of the a priori rectangular box is equal to the area of the square box of the output, so that:
N=2Wi×2Hi=4KiHi 2 (3)
a new relation W can be obtained by combining the formula (2) and the formula (3)iAnd HiThe expression of (a) is as follows:
Figure BDA0003100538660000091
int () in the formula is a rounded down function. The rectangular box is represented using the two vertices of the upper left and lower right corners of the a priori rectangular box. Coordinate values of the upper left corner and the lower right corner of the target area can be obtained:
Figure BDA0003100538660000107
After the coordinate values are determined, the final cropped region $X_{att}$ can be expressed as:

$X_{att} = X \odot M(t_a, t_b, t_h, K_i) \quad (6)$

where $\odot$ denotes element-wise multiplication and $X_{att}$ denotes the cropped target region. The M(·) function is a differentiable cropping function; being continuous, it is easier to optimize in the backpropagation of the neural network:

$M(\cdot) = [\sigma(x - t_{a(ul)}) - \sigma(x - t_{a(br)})] \cdot [\sigma(y - t_{b(ul)}) - \sigma(y - t_{b(br)})] \quad (7)$

where $\sigma(\cdot)$ is the sigmoid function, which can be expressed as:

$\sigma(x) = \frac{1}{1 + \exp(-kx)} \quad (8)$

Its output lies in the open interval (0, 1). When k is large enough, M(·) is close to 1 only for points whose x value lies between $t_{a(ul)}$ and $t_{a(br)}$ and whose y value lies between $t_{b(ul)}$ and $t_{b(br)}$, i.e., points inside the feature region, and is close to 0 otherwise. Finally, the selected target region is amplified by bilinear interpolation to obtain the scale of the next network input:

$X^{amp}_{(i,j)} = \sum_{\alpha=0}^{1}\sum_{\beta=0}^{1} |1-\alpha-\{m\}| \cdot |1-\beta-\{n\}| \cdot X^{att}_{([m]+\alpha,\,[n]+\beta)} \quad (9)$

The values of m and n in the above formula can be expressed as:

$m = i/S, \quad n = j/S \quad (10)$
where (m, n) denotes any point in the attention region of the original image, (i, j) denotes the point corresponding to (m, n) after enlargement, S denotes the picture enlargement factor, and [·] and {·} denote the integer part and the fractional part, respectively. The MS-APN algorithm can be interpreted simply as follows: when the candidate feature region lies toward the upper right, the center point moves toward the upper right corner; when it lies toward the lower left, the center point moves toward the lower left corner (and likewise for the other positions); after the candidate feature region is obtained, it is enlarged to the original input size by bilinear interpolation.
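The box geometry and the differentiable cropping mask of formulas (2)-(8) can be sketched as follows (a minimal PyTorch illustration; the function names and the choice k = 10 are our own assumptions):

```python
import math
import torch

def prior_box(ta: float, tb: float, th: float, K: float):
    """Formulas (2)-(5): convert the APN square (center (ta, tb), half-side th)
    into an equal-area prior rectangle with aspect ratio K = W/H, returning
    its upper-left and lower-right corners."""
    W = int(th * math.sqrt(K))   # half-width,  W_i = int(t_h * sqrt(K_i))
    H = int(th / math.sqrt(K))   # half-height, H_i = int(t_h / sqrt(K_i))
    return (ta - W, tb - H), (ta + W, tb + H)

def box_mask(height: int, width: int, ul, br, k: float = 10.0) -> torch.Tensor:
    """Formulas (7)-(8): differentiable mask built from shifted sigmoids,
    close to 1 inside the box and to 0 outside, so that cropping stays
    differentiable during backpropagation."""
    ys = torch.arange(height, dtype=torch.float32)
    xs = torch.arange(width, dtype=torch.float32)
    sig = lambda v: torch.sigmoid(k * v)
    mx = sig(xs - ul[0]) - sig(xs - br[0])   # boxcar along x
    my = sig(ys - ul[1]) - sig(ys - br[1])   # boxcar along y
    return my[:, None] * mx[None, :]         # (height, width) mask M
```

Multiplying this mask element-wise with the input image gives $X_{att}$ as in formula (6).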
Step 3.3, multi-scale feature fusion;
the Local Binary Pattern (LBP) is an operator for describing local texture characteristics of an image, and has the advantages of gray scale invariance, rotation invariance and the like. The original LBP operator is defined as that in a window of 3 × 3, the central pixel of the window is used as a threshold value, the gray values of the adjacent 8 pixels are compared with the central pixel, if the values of the surrounding pixels are greater than the value of the central pixel, the position of the pixel is marked as 1, otherwise, the position is 0. Firstly, taking the upper left corner of a rectangle as a starting point, connecting the gray values of adjacent 8 pixels clockwise to generate an 8-bit binary number, namely obtaining the LBP value of the pixel point in the center of the window, and reflecting the texture information of the area by using the value. The specific process is shown in fig. 3.
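A minimal NumPy sketch of this basic LBP operator (assuming a 2-D grayscale array; the clockwise neighbour order starting at the top-left corner follows the description above):

```python
import numpy as np

def lbp_basic(gray: np.ndarray) -> np.ndarray:
    """Original 3x3 LBP: mark each of the 8 neighbours 1 if it exceeds the
    centre pixel, else 0, then read the marks clockwise from the top-left
    neighbour to form an 8-bit code for each centre pixel."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),   # clockwise from top-left
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = gray.shape
    code = np.zeros((h - 2, w - 2), dtype=np.uint8)
    centre = gray[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (neigh > centre).astype(np.uint8) << (7 - bit)
    return code
```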
The whole MSA-CNN network model structure is shown in FIG. 4. The algorithm for processing the image by the network model comprises the following steps:
1) The multi-scale image input to the network model enters the classification sub-network of the first scale layer for feature extraction and classification, yielding the probability $P_t^{(1)}$ that the first scale layer's classification sub-network predicts the correct classification label;
2) The features extracted by the first scale layer's classification sub-network are input into the multi-box attention network d1 to obtain the target region, which is cropped and enlarged to serve as the input of the second scale layer;
3) The multi-scale feature image output by the first scale layer is input into the second scale layer's classification sub-network for feature extraction and classification, yielding the probability $P_t^{(2)}$ that the second scale layer's classification sub-network predicts the correct label;
4) The features extracted by the second scale layer's classification sub-network are input into the multi-box attention network d2 to obtain the target region, which is cropped and enlarged to serve as the input of the third scale layer;
5) The multi-scale feature image output by the second scale layer is input into the third scale layer's classification sub-network for feature extraction and classification, yielding the probability $P_t^{(3)}$ that the third scale layer's classification sub-network predicts the correct label;
6) By constraining the network model's correct-label probabilities so that $P_t^{(3)}$ is greater than $P_t^{(2)}$ and $P_t^{(2)}$ is greater than $P_t^{(1)}$, the target region extracted by the multi-box attention network becomes progressively more accurate;
7) Finally, the features extracted by the three scale layers' classification sub-networks are input into the LBP operator for feature fusion, completing the scene classification task (a sketch of this three-scale forward pass follows this list).
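A sketch of how steps 1)-7) might be wired together (here `subnets`, `apns`, and `lbp_fuse` are assumed callables standing in for the three classification sub-networks, the multi-box attention networks d1/d2, and the LBP fusion step):

```python
import torch
import torch.nn.functional as F

def crop_and_zoom(x: torch.Tensor, box, size: int = 224) -> torch.Tensor:
    """Crop the attention region given as ((x0, y0), (x1, y1)) pixel corners,
    then enlarge it back to the input size by bilinear interpolation (9)."""
    (x0, y0), (x1, y1) = box
    region = x[:, :, int(y0):int(y1), int(x0):int(x1)]
    return F.interpolate(region, size=(size, size), mode="bilinear",
                         align_corners=False)

def msa_cnn_forward(x: torch.Tensor, subnets, apns, lbp_fuse):
    """Three-scale pass: each scale layer classifies, MS-APN proposes the
    next attention region, and LBP fusion combines the three scales."""
    feats, probs = [], []
    for s in range(3):
        f, p = subnets[s](x)           # scale-s classification sub-network
        feats.append(f)
        probs.append(p)                # per-scale class probabilities P_t^(s)
        if s < 2:
            box = apns[s](f)           # multi-box attention network d1 / d2
            x = crop_and_zoom(x, box)  # input to the next scale layer
    return lbp_fuse(feats), probs      # fused LBP features + per-scale probs
```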
Step 4, learning and training a network model;
the MSA-CNN network model loss function contains two LclsAnd LrankTwo moieties, wherein LclsA function for representing class loss, including the loss generated by predicting the class of the remote sensing image relative to the real class label in the three-layer classification network, LrankThe loss generated when the probability of the middle-high level network prediction in the front layer and the rear layer of the network is lower than that of the low level network prediction is represented, the combined loss function adopts a mode of alternately training two loss functions, and the formula of the combined loss function L (x) of the model is as follows:
Figure BDA0003100538660000121
wherein, Y(s)Indicates the class to which the network model predicted image belongs, Y*Representing the true category of the remote sensing image, Pt (s)Probability value, L, representing the correct label of S-layer prediction in a network structurerankThe calculation formula of (2) is as follows:
Lrank(Pt (s),Pt (s+1))=max{0,Pt (s)-Pt (s+1)+0.05} (12)
updating a loss value generated when the probability of the predicted real category of the S +1 layer in the network model is less than the probability of the predicted real category of the S layer by a maximum value method, so that the probability of the predicted real category of the network model is gradually increased along with the increasing of the layers, and when P is the maximum valuet (s)-Pt (s+1)When +0.05 > 0, the loss between adjacent layers will be updated, wherein the extra 0.05 is to prevent the loss from stopping and not updating due to 0 of both layers.
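In code, this joint loss might look like the following sketch (per-scale logits and true-class probabilities are assumed as inputs; using cross-entropy for $L_{cls}$ is our assumption):

```python
import torch
import torch.nn.functional as F

def joint_loss(logits_per_scale, target, p_true_per_scale, margin: float = 0.05):
    """Joint loss (11)-(12): classification loss summed over the three scale
    layers plus the pairwise rank loss max{0, P_t^(s) - P_t^(s+1) + 0.05}."""
    l_cls = sum(F.cross_entropy(lg, target) for lg in logits_per_scale)
    l_rank = sum(
        torch.clamp(p_true_per_scale[s] - p_true_per_scale[s + 1] + margin,
                    min=0).mean()
        for s in range(len(p_true_per_scale) - 1))
    return l_cls + l_rank
```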
The MSA-CNN network model adopts cyclic cross-training of the VGG-16 network and the MS-APN. First, the pre-trained parameters of the VGG-16 network are used to realize feature extraction of the multi-scale remote sensing image and initialization of the classification sub-networks, and the MS-APN parameters $\{t_a, t_b, t_h, K_i\}$ are initialized with the region of highest response in the last convolution layer of the VGG-16 network. Then the MS-APN parameters are fixed and the VGG-16 sub-networks are trained until their loss function $L_{cls}$ converges; next the VGG-16 network parameters are fixed and the MS-APN is trained until its loss function $L_{rank}$ converges. Finally the MS-APN and VGG-16 sub-networks are trained alternately in cycles until the joint loss functions $L_{cls}$ and $L_{rank}$ both converge, at which point the parameters of the whole model are fixed, yielding the model finally used for classification, with which the final scene classification prediction is performed.
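The cyclic cross-training could be sketched as below (the attributes `model.subnets`, `model.apns`, `model.cls_loss`, and `model.rank_loss` are hypothetical stand-ins for the VGG-16 sub-networks, the MS-APN, and the two losses; convergence tests are replaced by a fixed cycle count):

```python
import torch

def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def alternate_train(model, loader, opt_cls, opt_rank, cycles: int = 5):
    """Phase 1 fixes MS-APN and minimizes L_cls over the VGG-16 sub-networks;
    phase 2 fixes VGG-16 and minimizes L_rank over MS-APN; the phases cycle."""
    for _ in range(cycles):
        set_trainable(model.apns, False)
        set_trainable(model.subnets, True)
        for x, y in loader:                 # phase 1: VGG-16 on L_cls
            opt_cls.zero_grad()
            model.cls_loss(x, y).backward()
            opt_cls.step()
        set_trainable(model.subnets, False)
        set_trainable(model.apns, True)
        for x, y in loader:                 # phase 2: MS-APN on L_rank
            opt_rank.zero_grad()
            model.rank_loss(x, y).backward()
            opt_rank.step()
```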
And (3) a testing stage:
step 5, inputting the image to be recognized;
The test set is recognized by the trained network model, and the scene classification results in the remote sensing images are output.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (6)

1. A remote sensing scene classification method based on attention network scale feature fusion is characterized by comprising the following steps:
a training stage:
step 1, inputting a training data set comprising various types of remote sensing scene images;
step 2, preprocessing the training data set to keep the sizes of the input images consistent, and normalizing the pixel values of the images;
step 3, constructing a multi-box attention network model MS-APN as the image classification model: extracting multi-scale features of the remote sensing image through a convolutional neural network, obtaining attention regions of the image at different scales with the multi-box attention network model, and cropping, scaling, and inputting the attention regions into a three-layer network structure; fusing the features of the original image at different scales with the image features of its attention regions, expressing them with LBP global features, and inputting them into the fully connected layer of the network to complete classification prediction;
the step 3 comprises the following steps:
step 3.1, multi-scale feature extraction: VGG-16 is selected as the image classification sub-network to extract multi-scale features of the remote sensing scene image, the VGG-16 network consisting of 13 convolution layers and 3 fully connected layers; the scene image input to the network first passes through two blocks of two 3×3 convolution layers, each block followed by max-pooling downsampling, then through three blocks of three 3×3 convolution layers, each block followed by max-pooling downsampling, and finally through the 3 fully connected layers, after which a softmax classifier outputs the probability of the category to which the scene image belongs;
step 3.2, attention region extraction: the attention region is selected using different prior rectangular boxes, and joint identification finally locates the extracted feature regions to the attention regions at different scales;
step 3.3, multi-scale feature fusion: the features of the original image at different scales are fused with the image features of its attention regions, expressed with LBP global features, and input into the fully connected layer of the network to complete classification prediction;
the specific method of the step 3.2 comprises the following steps:
if the APN output is the center coordinates $(t_a, t_b)$ of the square candidate box, $t_h$ is half the side length of the square candidate box, $N$ is the number of pixels in the square candidate box, and $W_i$ and $H_i$ respectively denote half the length and half the width of the i-th prior rectangular box, the maximum value of i being set to 3 by comparison, and the length-to-width ratio of the i-th prior rectangular box being defined as $K_i$, then:

$N = (2t_h)^2 = 4t_h^2, \quad K_i = W_i/H_i$

the area of the prior rectangular box is specified to equal the area of the output square box, so that:

$N = 2W_i \times 2H_i = 4K_iH_i^2$

new expressions for $W_i$ and $H_i$ are obtained as follows:

$W_i = \mathrm{int}(t_h\sqrt{K_i}), \quad H_i = \mathrm{int}(t_h/\sqrt{K_i})$

wherein int(·) is the round-down function; the rectangular box is represented by the two vertices at its upper-left and lower-right corners, giving the coordinate values of the upper-left and lower-right corners of the target region:

$t_{a(ul)} = t_a - W_i, \quad t_{b(ul)} = t_b - H_i$

$t_{a(br)} = t_a + W_i, \quad t_{b(br)} = t_b + H_i$
wherein the cropped region is expressed as $X_{att} = X \odot M(t_a, t_b, t_h, K_i)$, $\odot$ denoting element-wise multiplication and $X_{att}$ denoting the cropped target region; the M(·) function is a differentiable cropping function which, being continuous, is easier to optimize in the backpropagation of the neural network:

$M(\cdot) = [\sigma(x - t_{a(ul)}) - \sigma(x - t_{a(br)})] \cdot [\sigma(y - t_{b(ul)}) - \sigma(y - t_{b(br)})]$

wherein $\sigma(\cdot)$ is the sigmoid function, expressed as:

$\sigma(x) = \frac{1}{1 + \exp(-kx)}$

whose output lies in the open interval (0, 1); when k is large enough, M(·) approaches 1 only for points whose x value lies between $t_{a(ul)}$ and $t_{a(br)}$ and whose y value lies between $t_{b(ul)}$ and $t_{b(br)}$, i.e., points inside the feature region, and approaches 0 otherwise; finally, the selected target region is enlarged by bilinear interpolation to obtain the scale of the next network input:

$X^{amp}_{(i,j)} = \sum_{\alpha=0}^{1}\sum_{\beta=0}^{1} |1-\alpha-\{m\}| \cdot |1-\beta-\{n\}| \cdot X^{att}_{([m]+\alpha,\,[n]+\beta)}$

wherein the values of m and n are expressed as:

$m = i/S, \quad n = j/S$
wherein (m, n) denotes any point in the attention region of the original image, (i, j) denotes the point corresponding to (m, n) after enlargement, S denotes the picture enlargement factor, and [·] and {·} denote the integer part and the fractional part, respectively;
step 4, inputting the images of the training data set into the multi-box attention network model MS-APN for learning and training, alternately training two loss functions until both converge, and then fixing the parameters of the whole model to obtain the multi-box attention network model MS-APN finally used for classification;
and (3) a testing stage:
and 5, inputting the remote sensing image to be recognized, and outputting a scene classification result in the remote sensing image through the trained multi-box attention network model MS-APN.
2. The remote sensing scene classification method based on attention network scale feature fusion of claim 1, characterized in that the remote sensing scene image input in step 1 comprises a plurality of different public data sets, each data set comprises a plurality of types of remote sensing scenes, and each remote sensing scene comprises a plurality of images.
3. The remote sensing scene classification method based on attention network scale feature fusion according to claim 1, wherein the method for normalizing the pixel values of the images in the step 2 comprises the following steps:
the input image is an RGB three-channel color image whose pixel values range from 0 to 255; all pixel values are normalized from [0, 255] to [0, 1], accelerating convergence in the initial iterations of network training, with the formula:

$y_{i,j} = x_{i,j}/255$

where $x_{i,j}$ denotes the image before preprocessing, $y_{i,j}$ denotes the preprocessed image, and $i \in [0, 223]$, $j \in [0, 223]$.
4. The remote sensing scene classification method based on attention network scale feature fusion of claim 1, characterized in that the step 3.1 comprises:
1) the multi-scale image input to the network model enters the classification sub-network of the first scale layer for feature extraction and classification, yielding the probability $P_t^{(1)}$ that the first scale layer's classification sub-network predicts the correct classification label;
2) the features extracted by the first scale layer's classification sub-network are input into the multi-box attention network d1 to obtain the target region, which is cropped and enlarged to serve as the input of the second scale layer;
3) the multi-scale feature image output by the first scale layer is input into the second scale layer's classification sub-network for feature extraction and classification, yielding the probability $P_t^{(2)}$ that the second scale layer's classification sub-network predicts the correct label;
4) the features extracted by the second scale layer's classification sub-network are input into the multi-box attention network d2 to obtain the target region, which is cropped and enlarged to serve as the input of the third scale layer;
5) the multi-scale feature image output by the second scale layer is input into the third scale layer's classification sub-network for feature extraction and classification, yielding the probability $P_t^{(3)}$ that the third scale layer's classification sub-network predicts the correct label;
6) by constraining the network model's correct-label probabilities so that $P_t^{(3)}$ is greater than $P_t^{(2)}$ and $P_t^{(2)}$ is greater than $P_t^{(1)}$, the target region extracted by the multi-box attention network becomes more accurate;
7) finally, the features extracted by the three scale layers' classification sub-networks are input into the LBP operator for feature fusion, completing the scene classification task.
5. The remote sensing scene classification method based on attention network scale feature fusion according to claim 1, characterized in that in the step 4:
the MSA-CNN network model loss function consists of two parts, $L_{cls}$ and $L_{rank}$, wherein $L_{cls}$ denotes the classification loss, i.e., the loss of the predicted remote sensing image class against the true class label in each of the three classification layers, and $L_{rank}$ denotes the loss incurred when, between two adjacent layers of the network, the higher-level network predicts with lower probability than the lower-level network; the joint loss function is trained by alternating the two losses, and the model's joint loss function L(x) is:

$L(x) = \sum_{s=1}^{3} L_{cls}(Y^{(s)}, Y^{*}) + \sum_{s=1}^{2} L_{rank}(P_t^{(s)}, P_t^{(s+1)})$

wherein $Y^{(s)}$ denotes the class predicted by the network model at layer s, $Y^{*}$ denotes the true class of the remote sensing image, and $P_t^{(s)}$ denotes the probability of the correct label predicted by layer s of the network structure; $L_{rank}$ is computed as:

$L_{rank}(P_t^{(s)}, P_t^{(s+1)}) = \max\{0,\; P_t^{(s)} - P_t^{(s+1)} + 0.05\}$

the loss generated when the probability of the true class predicted by layer s+1 is less than that predicted by layer s is updated by the maximum-value method, so that the probability with which the network model predicts the true class increases layer by layer; when $P_t^{(s)} - P_t^{(s+1)} + 0.05 > 0$, the loss between adjacent layers is updated, the extra 0.05 preventing the loss from becoming 0 and halting updates when the two layers' probabilities are equal.
6. The remote sensing scene classification method based on attention network scale feature fusion according to claim 5, characterized in that in the step 4:
the MSA-CNN network model adopts cyclic cross-training of the VGG-16 network and the MS-APN: first, the pre-trained parameters of the VGG-16 network are used to realize feature extraction of the multi-scale remote sensing image and initialization of the classification sub-networks, and the MS-APN parameters $\{t_a, t_b, t_h, K_i\}$ are initialized with the region of highest response in the last convolution layer of the VGG-16 network; then the MS-APN parameters are fixed and the VGG-16 sub-networks are trained until their loss function $L_{cls}$ converges; next the VGG-16 network parameters are fixed and the MS-APN is trained until its loss function $L_{rank}$ converges; finally the MS-APN and VGG-16 sub-networks are trained alternately in cycles until the joint loss functions $L_{cls}$ and $L_{rank}$ both converge, at which point the parameters of the whole model are fixed, yielding the model finally used for classification, with which the final scene classification prediction is performed.
CN202110622695.7A 2021-06-04 2021-06-04 Remote sensing scene classification method based on attention network scale feature fusion Active CN113408594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110622695.7A CN113408594B (en) 2021-06-04 2021-06-04 Remote sensing scene classification method based on attention network scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110622695.7A CN113408594B (en) 2021-06-04 2021-06-04 Remote sensing scene classification method based on attention network scale feature fusion

Publications (2)

Publication Number Publication Date
CN113408594A CN113408594A (en) 2021-09-17
CN113408594B (en) 2022-04-29

Family

ID=77676282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110622695.7A Active CN113408594B (en) 2021-06-04 2021-06-04 Remote sensing scene classification method based on attention network scale feature fusion

Country Status (1)

Country Link
CN (1) CN113408594B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114078230B (en) * 2021-11-19 2023-08-25 西南交通大学 Small target detection method for self-adaptive feature fusion redundancy optimization
CN113836850A (en) * 2021-11-26 2021-12-24 成都数之联科技有限公司 Model obtaining method, system and device, medium and product defect detection method
CN114463646B (en) * 2022-04-13 2022-07-05 齐鲁工业大学 Remote sensing scene classification method based on multi-head self-attention convolution neural network
CN115270405B (en) * 2022-06-22 2024-01-16 中国气象局广州热带海洋气象研究所(广东省气象科学研究所) Convection scale set forecasting method and system based on multisource and multisype disturbance combination
CN115100509B (en) * 2022-07-15 2022-11-29 山东建筑大学 Image identification method and system based on multi-branch block-level attention enhancement network
CN115115939B (en) * 2022-07-28 2023-04-07 北京卫星信息工程研究所 Remote sensing image target fine-grained identification method based on characteristic attention mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292339A (en) * 2017-06-16 2017-10-24 重庆大学 The unmanned plane low altitude remote sensing image high score Geomorphological Classification method of feature based fusion
CN110414377A (en) * 2019-07-09 2019-11-05 武汉科技大学 A kind of remote sensing images scene classification method based on scale attention network
CN110555446A (en) * 2019-08-19 2019-12-10 北京工业大学 Remote sensing image scene classification method based on multi-scale depth feature fusion and transfer learning
CN112766083A (en) * 2020-12-30 2021-05-07 中南民族大学 Remote sensing scene classification method and system based on multi-scale feature fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition";Jianlong Fu,et.al;《2017 IEEE Conference on Computer Vision and Pattern Recognition》;20171209;第4476-4484页 *
"结合金字塔和局部二值模式的遥感图像分类";吴庆岗等;《现代电子技术》;20190701;第42卷(第13期);第56-60、64页 *
联合多尺度多特征的高分遥感图像场景分类;黄鸿等;《电子学报》;20200915(第09期);全文 *

Also Published As

Publication number Publication date
CN113408594A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN113408594B (en) Remote sensing scene classification method based on attention network scale feature fusion
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN109829398B (en) Target detection method in video based on three-dimensional convolution network
CN109741331B (en) Image foreground object segmentation method
CN106909902B (en) Remote sensing target detection method based on improved hierarchical significant model
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN108596108B (en) Aerial remote sensing image change detection method based on triple semantic relation learning
Liu et al. Bipartite differential neural network for unsupervised image change detection
CN111415316A (en) Defect data synthesis algorithm based on generation of countermeasure network
CN112801015B (en) Multi-mode face recognition method based on attention mechanism
CN109543632A (en) A kind of deep layer network pedestrian detection method based on the guidance of shallow-layer Fusion Features
CN108960404B (en) Image-based crowd counting method and device
CN108629368B (en) Multi-modal foundation cloud classification method based on joint depth fusion
CN111027497B (en) Weak and small target rapid detection method based on high-resolution optical remote sensing image
CN114187450A (en) Remote sensing image semantic segmentation method based on deep learning
CN112733614B (en) Pest image detection method with similar size enhanced identification
CN113159043B (en) Feature point matching method and system based on semantic information
CN111539422B (en) Flight target cooperative identification method based on fast RCNN
CN111814771A (en) Image processing method and device
CN113095371B (en) Feature point matching method and system for three-dimensional reconstruction
Fan et al. Registration of multiresolution remote sensing images based on L2-siamese model
CN114022408A (en) Remote sensing image cloud detection method based on multi-scale convolution neural network
CN107766810B (en) Cloud and shadow detection method
CN115995039A (en) Enhanced semantic graph embedding for omni-directional location identification
CN116052016A (en) Fine segmentation detection method for remote sensing image cloud and cloud shadow based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant