CN113408594B - Remote sensing scene classification method based on attention network scale feature fusion - Google Patents

Remote sensing scene classification method based on attention network scale feature fusion

Info

Publication number
CN113408594B
CN113408594B
Authority
CN
China
Prior art keywords
network
image
classification
remote sensing
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110622695.7A
Other languages
Chinese (zh)
Other versions
CN113408594A (en)
Inventor
郑禄
肖鹏飞
帖军
吴立锋
刘振宇
田莎莎
张潇
于舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South Central Minzu University
Original Assignee
South Central University for Nationalities
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South Central University for Nationalities filed Critical South Central University for Nationalities
Priority to CN202110622695.7A priority Critical patent/CN113408594B/en
Publication of CN113408594A publication Critical patent/CN113408594A/en
Application granted granted Critical
Publication of CN113408594B publication Critical patent/CN113408594B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing scene classification method based on attention network scale feature fusion, comprising the following steps: inputting a training data set containing multiple types of remote sensing scene images; preprocessing the training data set; extracting multi-scale features of the remote sensing image through a convolutional neural network, obtaining attention regions of the image at different scales with a multi-box attention network model, and cropping, scaling, and inputting the attention regions into a three-layer network structure; fusing the features of the original image at different scales with the image features of its attention regions, expressing them with LBP global features, and inputting them into the fully connected layer of the network to complete classification prediction; inputting the images of the training data set into the multi-box attention network model MS-APN for learning and training; and performing scene classification of remote sensing images with the trained multi-box attention network model MS-APN. The method can extract multi-scale, multi-angle features of remote sensing images and achieves better remote sensing scene classification and recognition performance.

Description

Remote sensing scene classification method based on attention network scale feature fusion
Technical Field
The invention relates to the technical field of remote sensing image processing, in particular to a remote sensing scene classification method based on attention network scale feature fusion.
Background
With the continuous development of satellite sensors and remote sensing technology, a large number of high-spatial-resolution (HSR) remote sensing images have become available. These images often contain rich spatial and semantic information and are widely applied in land-use planning, intelligent agriculture, key target detection, military applications, and other fields. Scene classification of high-resolution images, which aims to assign a reasonable semantic label to each image, is an important research topic. Because high-resolution scenes typically contain rich semantic information and complex spatial patterns, classifying them accurately is a challenging task.
Because remote sensing images are acquired at different times and positions, scenes of the same type may exhibit inconsistent textures and, owing to inconsistent orientations, clearly different shapes and sizes. Meanwhile, environmental factors such as illumination cause large color differences among ground-object types of the same category. Remote sensing images of ground-object types within the same category therefore differ in texture, shape, and color. Moreover, since high-resolution remote sensing images are captured from high altitude, the scenes they contain appear in different orientations, and the large variation in the distance between the sensor and the observed object causes the scale of the remote sensing images to vary.
When classifying remote sensing image scenes, the information that best distinguishes different categories is usually concentrated in local feature regions, whereas traditional convolutional neural networks focus more on processing the global information of the image and easily lose the local detail information. Recent work has introduced the attention mechanism into fine-grained image classification, proposing a recurrent attention convolutional neural network that applies a box-regression mechanism at three different levels to regress the most discriminative feature region of the image and finally fuses the image features extracted at the three levels to complete classification, further improving fine-grained classification performance. However, that research is based on depth features at a single scale and angle of the image, while remote sensing images are becoming increasingly diverse in target types, ground-object scales, and other respects, so the scale and angle of remote sensing scenes vary greatly.
Remote sensing scene images contain targets of diverse scales and angles, with large differences in shape, texture, and color. Traditional convolutional neural networks focus more on processing the global information of the image, and previous research relies only on depth features at a single scale of the image, ignoring the differences in extracted features caused by rotation angle.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects in the prior art, a remote sensing scene classification method based on attention network scale feature fusion.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the invention provides a remote sensing scene classification method based on attention network scale feature fusion, which comprises the following steps:
a training stage:
step 1, inputting a training data set comprising various types of remote sensing scene images;
step 2, preprocessing the training data set to keep the sizes of the input images consistent, and normalizing the pixel values of the images;
step 3, constructing a multi-box attention network model MS-APN as the image classification model: extracting multi-scale features of the remote sensing image through a convolutional neural network, obtaining attention regions of the image at different scales with the multi-box attention network model, and cropping, scaling, and inputting the attention regions into a three-layer network structure; fusing the features of the original image at different scales with the image features of its attention regions, expressing them with LBP global features, and inputting them into the fully connected layer of the network to complete classification prediction;
step 4, inputting the images of the training data set into the multi-box attention network model MS-APN for learning and training, alternately training two loss functions until both converge, and then fixing the parameters of the whole model to obtain the multi-box attention network model MS-APN finally used for classification;
and (3) a testing stage:
and 5, inputting the remote sensing image to be recognized, and outputting a scene classification result in the remote sensing image through the trained multi-box attention network model MS-APN.
Further, the remote sensing scene image input in step 1 of the present invention includes a plurality of different public data sets, each data set includes a plurality of types of remote sensing scenes, and each remote sensing scene includes a plurality of images.
Further, the method for performing normalization processing on the pixel values of the image in step 2 of the present invention is as follows:
the input image is an RGB three-channel color image whose pixel values range from 0 to 255; all pixel values are normalized from [0, 255] to [0, 1], which accelerates convergence in the initial iterations of network training, with the formula:

$y_{i,j} = x_{i,j}/255$

where $x_{i,j}$ denotes the image before preprocessing, $y_{i,j}$ denotes the preprocessed image, and $i \in [0, 223]$, $j \in [0, 223]$.
Further, the step 3 of the present invention includes:
step 3.1, multi-scale feature extraction: VGG-16 is selected as the image classification sub-network to extract multi-scale features of the remote sensing scene image, the VGG-16 network consisting of 13 convolution layers and 3 fully connected layers; the scene image input to the network first passes through two blocks of two 3×3 convolution layers, each block followed by max-pooling downsampling, then through three blocks of three 3×3 convolution layers, each block followed by max-pooling downsampling, and finally through the 3 fully connected layers, after which a softmax classifier outputs the probability of the category to which the scene image belongs;
step 3.2, attention region extraction: the attention region is selected using different prior rectangular boxes, and joint identification finally locates the extracted feature regions to the attention regions at different scales;
step 3.3, multi-scale feature fusion: the features of the original image at different scales are fused with the image features of its attention regions, expressed with LBP global features, and input into the fully connected layer of the network to complete classification prediction.
Further, the specific method of step 3.2 of the present invention is:
if the APN output is the center coordinates $(t_a, t_b)$ of the square candidate box, $t_h$ is half the side length of the square candidate box, $N$ is the number of pixels in the square candidate box, and $W_i$ and $H_i$ respectively denote half the length and half the width of the i-th prior rectangular box, the maximum value of i being set to 3 by comparison, and the length-to-width ratio of the i-th prior rectangular box being defined as $K_i$, then:

$N = (2t_h)^2 = 4t_h^2, \quad K_i = W_i/H_i$

the area of the prior rectangular box is specified to equal the area of the output square box, so that:

$N = 2W_i \times 2H_i = 4K_iH_i^2$

new expressions for $W_i$ and $H_i$ are obtained as follows:

$W_i = \mathrm{int}(t_h\sqrt{K_i}), \quad H_i = \mathrm{int}(t_h/\sqrt{K_i})$

where int(·) is the round-down function; the rectangular box is represented by the two vertices at its upper-left and lower-right corners, giving the coordinate values of the upper-left and lower-right corners of the target region:

$t_{a(ul)} = t_a - W_i, \quad t_{b(ul)} = t_b - H_i$

$t_{a(br)} = t_a + W_i, \quad t_{b(br)} = t_b + H_i$
after the coordinates are determined, the cropped region is expressed as $X_{att} = X \odot M(t_a, t_b, t_h, K_i)$, where $\odot$ denotes element-wise multiplication and $X_{att}$ denotes the cropped target region; the M(·) function is a differentiable cropping function which, being continuous, is easier to optimize in the backpropagation of the neural network:

$M(\cdot) = [\sigma(x - t_{a(ul)}) - \sigma(x - t_{a(br)})] \cdot [\sigma(y - t_{b(ul)}) - \sigma(y - t_{b(br)})]$

where $\sigma(\cdot)$ is the sigmoid function, expressed as:

$\sigma(x) = \frac{1}{1 + \exp(-kx)}$

whose output lies in the open interval (0, 1); when k is large enough, M(·) approaches 1 only for points whose x value lies between $t_{a(ul)}$ and $t_{a(br)}$ and whose y value lies between $t_{b(ul)}$ and $t_{b(br)}$, i.e., points inside the feature region, and approaches 0 otherwise; finally, the selected target region is enlarged by bilinear interpolation to obtain the scale of the next network input:

$X^{amp}_{(i,j)} = \sum_{\alpha=0}^{1}\sum_{\beta=0}^{1} |1-\alpha-\{m\}| \cdot |1-\beta-\{n\}| \cdot X^{att}_{([m]+\alpha,\,[n]+\beta)}$

where the values of m and n are expressed as:

$m = i/S, \quad n = j/S$

where (m, n) denotes any point in the attention region of the original image, (i, j) denotes the point corresponding to (m, n) after enlargement, S denotes the picture enlargement factor, and [·] and {·} denote the integer part and the fractional part, respectively.
Further, said step 3.1 of the present invention comprises:
1) the multi-scale image input to the network model enters the classification sub-network of the first scale layer for feature extraction and classification, yielding the probability $P_t^{(1)}$ that the first scale layer's classification sub-network predicts the correct classification label;
2) the features extracted by the first scale layer's classification sub-network are input into the multi-box attention network d1 to obtain the target region, which is cropped and enlarged to serve as the input of the second scale layer;
3) the multi-scale feature image output by the first scale layer is input into the second scale layer's classification sub-network for feature extraction and classification, yielding the probability $P_t^{(2)}$ that the second scale layer's classification sub-network predicts the correct label;
4) the features extracted by the second scale layer's classification sub-network are input into the multi-box attention network d2 to obtain the target region, which is cropped and enlarged to serve as the input of the third scale layer;
5) the multi-scale feature image output by the second scale layer is input into the third scale layer's classification sub-network for feature extraction and classification, yielding the probability $P_t^{(3)}$ that the third scale layer's classification sub-network predicts the correct label;
6) by constraining the network model's correct-label probabilities so that $P_t^{(3)}$ is greater than $P_t^{(2)}$ and $P_t^{(2)}$ is greater than $P_t^{(1)}$, the target region extracted by the multi-box attention network becomes progressively more accurate;
7) finally, the features extracted by the three scale layers' classification sub-networks are input into the LBP operator for feature fusion, completing the scene classification task.
Further, in the step 4 of the present invention:
the MSA-CNN network model loss function consists of two parts, $L_{cls}$ and $L_{rank}$, where $L_{cls}$ denotes the classification loss, i.e., the loss of the predicted remote sensing image class against the true class label in each of the three classification layers, and $L_{rank}$ denotes the loss incurred when, between two adjacent layers of the network, the higher-level network predicts with lower probability than the lower-level network; the joint loss function is trained by alternating the two losses, and the model's joint loss function L(x) is:

$L(x) = \sum_{s=1}^{3} L_{cls}(Y^{(s)}, Y^{*}) + \sum_{s=1}^{2} L_{rank}(P_t^{(s)}, P_t^{(s+1)})$

where $Y^{(s)}$ denotes the class predicted by the network model at layer s, $Y^{*}$ denotes the true class of the remote sensing image, and $P_t^{(s)}$ denotes the probability of the correct label predicted by layer s of the network structure; $L_{rank}$ is computed as:

$L_{rank}(P_t^{(s)}, P_t^{(s+1)}) = \max\{0,\; P_t^{(s)} - P_t^{(s+1)} + 0.05\}$

the loss generated when the probability of the true class predicted by layer s+1 is less than that predicted by layer s is updated by the maximum-value method, so that the probability with which the network model predicts the true class increases layer by layer; when $P_t^{(s)} - P_t^{(s+1)} + 0.05 > 0$, the loss between adjacent layers is updated, the extra margin of 0.05 preventing the loss from becoming 0 and halting updates when the two layers' probabilities are equal.
Further, in the step 4 of the present invention:
the MSA-CNN network model adopts cyclic cross-training of the VGG-16 network and the MS-APN: first, the pre-trained parameters of the VGG-16 network are used to realize feature extraction of the multi-scale remote sensing image and initialization of the classification sub-networks, and the MS-APN parameters $\{t_a, t_b, t_h, K_i\}$ are initialized with the region of highest response in the last convolution layer of the VGG-16 network; then the MS-APN parameters are fixed and the VGG-16 sub-networks are trained until their loss function $L_{cls}$ converges; next the VGG-16 network parameters are fixed and the MS-APN is trained until its loss function $L_{rank}$ converges; finally the MS-APN and VGG-16 sub-networks are trained alternately in cycles until the joint loss functions $L_{cls}$ and $L_{rank}$ both converge, at which point the parameters of the whole model are fixed, yielding the model finally used for classification, with which the final scene classification prediction is performed.
The invention has the following beneficial effects: the remote sensing scene classification method based on attention network scale feature fusion is realized on the basis of a multi-scale feature-fusion attention model. Because the category diversity of remote sensing images is high, the method combines a multi-box attention model with LBP feature fusion to achieve better results and can effectively overcome the shortcomings of comparable techniques. 1) The invention can extract multi-scale, multi-angle features of the image; 2) the invention focuses more on classifying the target feature region of the image; 3) the method has substantial research significance and is important for land-use planning, intelligent agriculture, key target detection, and the like.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of remote sensing image scene classification according to an embodiment of the present invention;
FIG. 2 illustrates the principle of the MS-APN operation according to an embodiment of the present invention;
FIG. 3 is an LBP feature extraction of an embodiment of the present invention;
fig. 4 is a MSA-CNN network model structure of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The remote sensing scene classification method based on attention network scale feature fusion (MSA-CNN) first extracts multi-scale features of the remote sensing image through a convolutional neural network, obtains attention regions of the image at different scales with the multi-box attention model, and crops, scales, and inputs the attention regions into a three-layer network structure. The features of the original image at different scales are then fused with the image features of its attention regions, LBP global feature expression is used to overcome the differences among remote sensing images caused by different shooting angles, and the result is finally input into the fully connected layer of the network to complete classification prediction. The method mainly comprises data set preprocessing, multi-scale feature extraction, attention region extraction, and feature fusion and classification; the remote sensing image classification process is shown in FIG. 1.
1) Multi-box attention model
The multi-box attention model is mainly used to locate the attention region of an image more accurately when classifying remote sensing scenes. It gradually focuses on the key region during feature training of the three-layer network, becoming increasingly precise, and its alternating training with the classification network allows the two to promote each other, making it an effective unsupervised network.
2) Classification sub-networks
The convolution layers of the network's 3 sub-networks at different levels extract features from the input image. From the extracted features, region information is obtained through the multi-box attention model on the one hand, and on the other hand the features are passed to the fully connected (fc) and softmax layers to predict the class probability of the image; the task here is classification.
The method comprises the following specific steps:
a training stage:
step 1, inputting a training data set;
the UC Mercded Land-Use data set is obtained by manual extraction of data downloaded from national maps of the national survey bureau of geological survey, which is released in 2010, by California university and is now the most widely used data set for remote sensing image scene classification research. The UC Merced Land-Use data set comprises 2100 remote sensing images, including 21 types of Land utilization remote sensing scenes, each type comprises 100 images, and the resolution of each image is 256 multiplied by 256.
The NWPU-RESISC45 data set was obtained by Northwestern Polytechnical University by manually cropping Google Earth remote sensing data and was released in 2017; it is currently the public data set containing the most remote sensing scene types and images and serves as a benchmark for remote sensing image scene classification. The NWPU-RESISC45 data set contains 31500 remote sensing images covering 45 classes of remote sensing scenes, 700 images per class, each with a resolution of 256 × 256, the same as the UC Merced Land-Use data set.
The SuperView-1 (GaoJing-1) image data set has a size of 24631 × 23945 pixels and was acquired over the Jiangxia and Xixia areas of Wuhan City, Hubei Province, China. The terrain is plain, with an elevation of 20 m and an area of 85.5 km²; the panchromatic band has a resolution of 0.5 m and the multispectral bands a resolution of 2 m, covering the four bands of red, green, blue, and near-infrared. The remote sensing images were preprocessed with radiometric, atmospheric, and orthorectification corrections before the experiments.
Step 2, preprocessing a data set;
since the fully connected layer exists in the overall classification network and the input images are required to be consistent in size, the sizes of the images of the training set and the test set are firstly uniformly adjusted to the specification of 224 × 224 to meet the network input requirements. Because the input image is an RGB color three-channel image, the value range of each pixel point of the image is between 0 and 255, in order to meet the requirement of the network on input data, the range of all pixel values is normalized from [0,255] to [0,1], the convergence rate of the initial iteration of the training network is accelerated, and the formula is as follows:
yi,j=2xi,j/255 (1)
x in the formulai,jRepresenting the image before preprocessing, yi,jRepresenting the preprocessed image, where i e [0,223 ]],j∈[0,223]。
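As a minimal sketch of this preprocessing step (Python with NumPy and Pillow; the function name is our own):

```python
import numpy as np
from PIL import Image

def preprocess(path: str) -> np.ndarray:
    """Resize an RGB image to 224 x 224 and normalize pixels to [0, 1] per (1)."""
    img = Image.open(path).convert("RGB").resize((224, 224), Image.BILINEAR)
    x = np.asarray(img, dtype=np.float32)  # x[i, j] in [0, 255], i, j in [0, 223]
    return x / 255.0                       # y[i, j] = x[i, j] / 255, in [0, 1]
```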
Step 3, designing a picture classification model;
the MSA-CNN model extracts the characteristics of the remote sensing image in multiple scales through a convolutional neural network (VGG-16), obtains the attention area of the image in different scales by using a multi-box attention model (MS-APN), cuts and scales the attention area and inputs the cut and scaled attention area into a three-layer network structure. And then, fusing the features of the original image in different scales and the image features of the attention area of the original image, utilizing LBP global feature expression to overcome the difference of the remote sensing image caused by different shooting angles, and finally inputting the remote sensing image into a network full-connection layer to finish classification prediction.
Step 3.1, multi-scale feature extraction;
the invention selects VGG-16 as an image classification sub-network to extract the multi-scale characteristics of the remote sensing scene image, because the VGG-16 network has stable network structure and characteristic extraction capability, the characteristic extraction capability of the remote sensing image with multi-scale targets and complex background information is stronger, and the completeness of characteristic information and result accuracy are ensured. The VGG-16 network is composed of 13 layers of convolution layers and 3 layers of full connection layers, firstly, a scene image input into the network is subjected to 2 times of 2-layer 3 x 3 convolution and maximum pooling downsampling operation, then is subjected to 3 times of 3-layer 3 x 3 convolution and maximum pooling downsampling operation, and finally is subjected to 3 layers of full connection layers, and the probability of the category to which the scene image belongs is output by a softmax classifier.
Step 3.2, extracting attention area;
an attention subnetwork (APN) of RA-CNN consists of 2 full-connection layers, and an APN module is used for extracting a region of interest of a remote sensing scene image, but the method for extracting the region of interest is to fixedly use a square framing target, so that the extracted region of interest contains parts which do not belong to features or parts which originally belong to a plurality of features are easily extracted incompletely. For high-resolution remote sensing images, the region of interest in the picture is usually irregular in geometric shape, so that the background region is more easily counted by using only a square candidate box. Therefore, the target area is selected by adopting different prior rectangular frames, and finally the extracted characteristic areas can be more accurately positioned to the target area by adopting combined identification. Therefore, the improvement is made on the basis of the original APN attention model, such as a multi-box attention network model MS-APN provided in FIG. 2.
If the channel output by the APN is the coordinate (t) of the center point of the square candidate boxa,tb),thIs half of the square candidate frame, N is the number of pixel points of the square candidate frame, WiAnd HiRespectively representing half of the length and the width of the ith prior rectangular frame, and setting the maximum value of i to be 3 through comparison, the ratio of the length to the width of the ith prior rectangular frame can be defined as KiThen, there are:
N=(2th)2=4th 2,Ki=Wi/Hi (2)
it is specified that the area of the a priori rectangular box is equal to the area of the square box of the output, so that:
N=2Wi×2Hi=4KiHi 2 (3)
a new relation W can be obtained by combining the formula (2) and the formula (3)iAnd HiThe expression of (a) is as follows:
Figure BDA0003100538660000091
int () in the formula is a rounded down function. The rectangular box is represented using the two vertices of the upper left and lower right corners of the a priori rectangular box. Coordinate values of the upper left corner and the lower right corner of the target area can be obtained:
Figure BDA0003100538660000107
After the coordinate values are determined, the final cropped region $X_{att}$ can be expressed as:

$X_{att} = X \odot M(t_a, t_b, t_h, K_i) \quad (6)$

where $\odot$ denotes element-wise multiplication and $X_{att}$ denotes the cropped target region. The M(·) function is a differentiable cropping function; being continuous, it is easier to optimize in the backpropagation of the neural network:

$M(\cdot) = [\sigma(x - t_{a(ul)}) - \sigma(x - t_{a(br)})] \cdot [\sigma(y - t_{b(ul)}) - \sigma(y - t_{b(br)})] \quad (7)$

where $\sigma(\cdot)$ is the sigmoid function, which can be expressed as:

$\sigma(x) = \frac{1}{1 + \exp(-kx)} \quad (8)$

Its output lies in the open interval (0, 1). When k is large enough, M(·) is close to 1 only for points whose x value lies between $t_{a(ul)}$ and $t_{a(br)}$ and whose y value lies between $t_{b(ul)}$ and $t_{b(br)}$, i.e., points inside the feature region, and is close to 0 otherwise. Finally, the selected target region is amplified by bilinear interpolation to obtain the scale of the next network input:

$X^{amp}_{(i,j)} = \sum_{\alpha=0}^{1}\sum_{\beta=0}^{1} |1-\alpha-\{m\}| \cdot |1-\beta-\{n\}| \cdot X^{att}_{([m]+\alpha,\,[n]+\beta)} \quad (9)$

The values of m and n in the above formula can be expressed as:

$m = i/S, \quad n = j/S \quad (10)$
where (m, n) denotes any point in the attention region of the original image, (i, j) denotes the point corresponding to (m, n) after enlargement, S denotes the picture enlargement factor, and [·] and {·} denote the integer part and the fractional part, respectively. The MS-APN algorithm can be interpreted simply as follows: when the candidate feature region lies toward the upper right, the center point moves toward the upper right corner; when it lies toward the lower left, the center point moves toward the lower left corner (and likewise for the other positions); after the candidate feature region is obtained, it is enlarged to the original input size by bilinear interpolation.
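The box geometry and the differentiable cropping mask of formulas (2)-(8) can be sketched as follows (a minimal PyTorch illustration; the function names and the choice k = 10 are our own assumptions):

```python
import math
import torch

def prior_box(ta: float, tb: float, th: float, K: float):
    """Formulas (2)-(5): convert the APN square (center (ta, tb), half-side th)
    into an equal-area prior rectangle with aspect ratio K = W/H, returning
    its upper-left and lower-right corners."""
    W = int(th * math.sqrt(K))   # half-width,  W_i = int(t_h * sqrt(K_i))
    H = int(th / math.sqrt(K))   # half-height, H_i = int(t_h / sqrt(K_i))
    return (ta - W, tb - H), (ta + W, tb + H)

def box_mask(height: int, width: int, ul, br, k: float = 10.0) -> torch.Tensor:
    """Formulas (7)-(8): differentiable mask built from shifted sigmoids,
    close to 1 inside the box and to 0 outside, so that cropping stays
    differentiable during backpropagation."""
    ys = torch.arange(height, dtype=torch.float32)
    xs = torch.arange(width, dtype=torch.float32)
    sig = lambda v: torch.sigmoid(k * v)
    mx = sig(xs - ul[0]) - sig(xs - br[0])   # boxcar along x
    my = sig(ys - ul[1]) - sig(ys - br[1])   # boxcar along y
    return my[:, None] * mx[None, :]         # (height, width) mask M
```

Multiplying this mask element-wise with the input image gives $X_{att}$ as in formula (6).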
Step 3.3, multi-scale feature fusion;
the Local Binary Pattern (LBP) is an operator for describing local texture characteristics of an image, and has the advantages of gray scale invariance, rotation invariance and the like. The original LBP operator is defined as that in a window of 3 × 3, the central pixel of the window is used as a threshold value, the gray values of the adjacent 8 pixels are compared with the central pixel, if the values of the surrounding pixels are greater than the value of the central pixel, the position of the pixel is marked as 1, otherwise, the position is 0. Firstly, taking the upper left corner of a rectangle as a starting point, connecting the gray values of adjacent 8 pixels clockwise to generate an 8-bit binary number, namely obtaining the LBP value of the pixel point in the center of the window, and reflecting the texture information of the area by using the value. The specific process is shown in fig. 3.
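A minimal NumPy sketch of this basic LBP operator (assuming a 2-D grayscale array; the clockwise neighbour order starting at the top-left corner follows the description above):

```python
import numpy as np

def lbp_basic(gray: np.ndarray) -> np.ndarray:
    """Original 3x3 LBP: mark each of the 8 neighbours 1 if it exceeds the
    centre pixel, else 0, then read the marks clockwise from the top-left
    neighbour to form an 8-bit code for each centre pixel."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),   # clockwise from top-left
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = gray.shape
    code = np.zeros((h - 2, w - 2), dtype=np.uint8)
    centre = gray[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (neigh > centre).astype(np.uint8) << (7 - bit)
    return code
```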
The whole MSA-CNN network model structure is shown in FIG. 4. The algorithm for processing the image by the network model comprises the following steps:
1) The multi-scale image input to the network model enters the classification sub-network of the first scale layer for feature extraction and classification, yielding the probability $P_t^{(1)}$ that the first scale layer's classification sub-network predicts the correct classification label;
2) The features extracted by the first scale layer's classification sub-network are input into the multi-box attention network d1 to obtain the target region, which is cropped and enlarged to serve as the input of the second scale layer;
3) The multi-scale feature image output by the first scale layer is input into the second scale layer's classification sub-network for feature extraction and classification, yielding the probability $P_t^{(2)}$ that the second scale layer's classification sub-network predicts the correct label;
4) The features extracted by the second scale layer's classification sub-network are input into the multi-box attention network d2 to obtain the target region, which is cropped and enlarged to serve as the input of the third scale layer;
5) The multi-scale feature image output by the second scale layer is input into the third scale layer's classification sub-network for feature extraction and classification, yielding the probability $P_t^{(3)}$ that the third scale layer's classification sub-network predicts the correct label;
6) By constraining the network model's correct-label probabilities so that $P_t^{(3)}$ is greater than $P_t^{(2)}$ and $P_t^{(2)}$ is greater than $P_t^{(1)}$, the target region extracted by the multi-box attention network becomes progressively more accurate;
7) Finally, the features extracted by the three scale layers' classification sub-networks are input into the LBP operator for feature fusion, completing the scene classification task (a sketch of this three-scale forward pass follows this list).
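A sketch of how steps 1)-7) might be wired together (here `subnets`, `apns`, and `lbp_fuse` are assumed callables standing in for the three classification sub-networks, the multi-box attention networks d1/d2, and the LBP fusion step):

```python
import torch
import torch.nn.functional as F

def crop_and_zoom(x: torch.Tensor, box, size: int = 224) -> torch.Tensor:
    """Crop the attention region given as ((x0, y0), (x1, y1)) pixel corners,
    then enlarge it back to the input size by bilinear interpolation (9)."""
    (x0, y0), (x1, y1) = box
    region = x[:, :, int(y0):int(y1), int(x0):int(x1)]
    return F.interpolate(region, size=(size, size), mode="bilinear",
                         align_corners=False)

def msa_cnn_forward(x: torch.Tensor, subnets, apns, lbp_fuse):
    """Three-scale pass: each scale layer classifies, MS-APN proposes the
    next attention region, and LBP fusion combines the three scales."""
    feats, probs = [], []
    for s in range(3):
        f, p = subnets[s](x)           # scale-s classification sub-network
        feats.append(f)
        probs.append(p)                # per-scale class probabilities P_t^(s)
        if s < 2:
            box = apns[s](f)           # multi-box attention network d1 / d2
            x = crop_and_zoom(x, box)  # input to the next scale layer
    return lbp_fuse(feats), probs      # fused LBP features + per-scale probs
```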
Step 4, learning and training a network model;
the MSA-CNN network model loss function contains two LclsAnd LrankTwo moieties, wherein LclsA function for representing class loss, including the loss generated by predicting the class of the remote sensing image relative to the real class label in the three-layer classification network, LrankThe loss generated when the probability of the middle-high level network prediction in the front layer and the rear layer of the network is lower than that of the low level network prediction is represented, the combined loss function adopts a mode of alternately training two loss functions, and the formula of the combined loss function L (x) of the model is as follows:
Figure BDA0003100538660000121
wherein, Y(s)Indicates the class to which the network model predicted image belongs, Y*Representing the true category of the remote sensing image, Pt (s)Probability value, L, representing the correct label of S-layer prediction in a network structurerankThe calculation formula of (2) is as follows:
Lrank(Pt (s),Pt (s+1))=max{0,Pt (s)-Pt (s+1)+0.05} (12)
updating a loss value generated when the probability of the predicted real category of the S +1 layer in the network model is less than the probability of the predicted real category of the S layer by a maximum value method, so that the probability of the predicted real category of the network model is gradually increased along with the increasing of the layers, and when P is the maximum valuet (s)-Pt (s+1)When +0.05 > 0, the loss between adjacent layers will be updated, wherein the extra 0.05 is to prevent the loss from stopping and not updating due to 0 of both layers.
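In code, this joint loss might look like the following sketch (per-scale logits and true-class probabilities are assumed as inputs; using cross-entropy for $L_{cls}$ is our assumption):

```python
import torch
import torch.nn.functional as F

def joint_loss(logits_per_scale, target, p_true_per_scale, margin: float = 0.05):
    """Joint loss (11)-(12): classification loss summed over the three scale
    layers plus the pairwise rank loss max{0, P_t^(s) - P_t^(s+1) + 0.05}."""
    l_cls = sum(F.cross_entropy(lg, target) for lg in logits_per_scale)
    l_rank = sum(
        torch.clamp(p_true_per_scale[s] - p_true_per_scale[s + 1] + margin,
                    min=0).mean()
        for s in range(len(p_true_per_scale) - 1))
    return l_cls + l_rank
```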
The MSA-CNN network model adopts cyclic cross-training of the VGG-16 network and the MS-APN. First, the pre-trained parameters of the VGG-16 network are used to realize feature extraction of the multi-scale remote sensing image and initialization of the classification sub-networks, and the MS-APN parameters $\{t_a, t_b, t_h, K_i\}$ are initialized with the region of highest response in the last convolution layer of the VGG-16 network. Then the MS-APN parameters are fixed and the VGG-16 sub-networks are trained until their loss function $L_{cls}$ converges; next the VGG-16 network parameters are fixed and the MS-APN is trained until its loss function $L_{rank}$ converges. Finally the MS-APN and VGG-16 sub-networks are trained alternately in cycles until the joint loss functions $L_{cls}$ and $L_{rank}$ both converge, at which point the parameters of the whole model are fixed, yielding the model finally used for classification, with which the final scene classification prediction is performed.
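The cyclic cross-training could be sketched as below (the attributes `model.subnets`, `model.apns`, `model.cls_loss`, and `model.rank_loss` are hypothetical stand-ins for the VGG-16 sub-networks, the MS-APN, and the two losses; convergence tests are replaced by a fixed cycle count):

```python
import torch

def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def alternate_train(model, loader, opt_cls, opt_rank, cycles: int = 5):
    """Phase 1 fixes MS-APN and minimizes L_cls over the VGG-16 sub-networks;
    phase 2 fixes VGG-16 and minimizes L_rank over MS-APN; the phases cycle."""
    for _ in range(cycles):
        set_trainable(model.apns, False)
        set_trainable(model.subnets, True)
        for x, y in loader:                 # phase 1: VGG-16 on L_cls
            opt_cls.zero_grad()
            model.cls_loss(x, y).backward()
            opt_cls.step()
        set_trainable(model.subnets, False)
        set_trainable(model.apns, True)
        for x, y in loader:                 # phase 2: MS-APN on L_rank
            opt_rank.zero_grad()
            model.rank_loss(x, y).backward()
            opt_rank.step()
```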
And (3) a testing stage:
step 5, inputting the image to be recognized;
The test set is recognized by the trained network model, and the scene classification results in the remote sensing images are output.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (6)

1. A remote sensing scene classification method based on attention network scale feature fusion is characterized by comprising the following steps:
a training stage:
step 1, inputting a training data set comprising various types of remote sensing scene images;
step 2, preprocessing the training data set to keep the sizes of the input images consistent, and normalizing the pixel values of the images;
step 3, constructing a multi-box attention network model MS-APN as the image classification model: extracting multi-scale features of the remote sensing image through a convolutional neural network, obtaining attention regions of the image at different scales with the multi-box attention network model, and cropping, scaling, and inputting the attention regions into a three-layer network structure; fusing the features of the original image at different scales with the image features of its attention regions, expressing them with LBP global features, and inputting them into the fully connected layer of the network to complete classification prediction;
the step 3 comprises the following steps:
step 3.1, multi-scale feature extraction: VGG-16 is selected as the image classification sub-network to extract multi-scale features of the remote sensing scene image, the VGG-16 network consisting of 13 convolution layers and 3 fully connected layers; the scene image input to the network first passes through two blocks of two 3×3 convolution layers, each block followed by max-pooling downsampling, then through three blocks of three 3×3 convolution layers, each block followed by max-pooling downsampling, and finally through the 3 fully connected layers, after which a softmax classifier outputs the probability of the category to which the scene image belongs;
step 3.2, attention region extraction: the attention region is selected using different prior rectangular boxes, and joint identification finally locates the extracted feature regions to the attention regions at different scales;
step 3.3, multi-scale feature fusion: the features of the original image at different scales are fused with the image features of its attention regions, expressed with LBP global features, and input into the fully connected layer of the network to complete classification prediction;
the specific method of the step 3.2 comprises the following steps:
if the APN output is the center coordinates $(t_a, t_b)$ of the square candidate box, $t_h$ is half the side length of the square candidate box, $N$ is the number of pixels in the square candidate box, and $W_i$ and $H_i$ respectively denote half the length and half the width of the i-th prior rectangular box, the maximum value of i being set to 3 by comparison, and the length-to-width ratio of the i-th prior rectangular box being defined as $K_i$, then:

$N = (2t_h)^2 = 4t_h^2, \quad K_i = W_i/H_i$

the area of the prior rectangular box is specified to equal the area of the output square box, so that:

$N = 2W_i \times 2H_i = 4K_iH_i^2$

new expressions for $W_i$ and $H_i$ are obtained as follows:

$W_i = \mathrm{int}(t_h\sqrt{K_i}), \quad H_i = \mathrm{int}(t_h/\sqrt{K_i})$

wherein int(·) is the round-down function; the rectangular box is represented by the two vertices at its upper-left and lower-right corners, giving the coordinate values of the upper-left and lower-right corners of the target region:

$t_{a(ul)} = t_a - W_i, \quad t_{b(ul)} = t_b - H_i$

$t_{a(br)} = t_a + W_i, \quad t_{b(br)} = t_b + H_i$
wherein the cropped region is expressed as $X_{att} = X \odot M(t_a, t_b, t_h, K_i)$, $\odot$ denoting element-wise multiplication and $X_{att}$ denoting the cropped target region; the M(·) function is a differentiable cropping function which, being continuous, is easier to optimize in the backpropagation of the neural network:

$M(\cdot) = [\sigma(x - t_{a(ul)}) - \sigma(x - t_{a(br)})] \cdot [\sigma(y - t_{b(ul)}) - \sigma(y - t_{b(br)})]$

wherein $\sigma(\cdot)$ is the sigmoid function, expressed as:

$\sigma(x) = \frac{1}{1 + \exp(-kx)}$

whose output lies in the open interval (0, 1); when k is large enough, M(·) approaches 1 only for points whose x value lies between $t_{a(ul)}$ and $t_{a(br)}$ and whose y value lies between $t_{b(ul)}$ and $t_{b(br)}$, i.e., points inside the feature region, and approaches 0 otherwise; finally, the selected target region is enlarged by bilinear interpolation to obtain the scale of the next network input:

$X^{amp}_{(i,j)} = \sum_{\alpha=0}^{1}\sum_{\beta=0}^{1} |1-\alpha-\{m\}| \cdot |1-\beta-\{n\}| \cdot X^{att}_{([m]+\alpha,\,[n]+\beta)}$

wherein the values of m and n are expressed as:

$m = i/S, \quad n = j/S$
wherein (m, n) denotes any point in the attention region of the original image, (i, j) denotes the point corresponding to (m, n) after enlargement, S denotes the picture enlargement factor, and [·] and {·} denote the integer part and the fractional part, respectively;
step 4, inputting the images of the training data set into the multi-box attention network model MS-APN for learning and training, alternately training two loss functions until both converge, and then fixing the parameters of the whole model to obtain the multi-box attention network model MS-APN finally used for classification;
and (3) a testing stage:
and 5, inputting the remote sensing image to be recognized, and outputting a scene classification result in the remote sensing image through the trained multi-box attention network model MS-APN.
2. The remote sensing scene classification method based on attention network scale feature fusion of claim 1, characterized in that the remote sensing scene image input in step 1 comprises a plurality of different public data sets, each data set comprises a plurality of types of remote sensing scenes, and each remote sensing scene comprises a plurality of images.
3. The remote sensing scene classification method based on attention network scale feature fusion according to claim 1, wherein the method for normalizing the pixel values of the images in the step 2 comprises the following steps:
the input image is an RGB three-channel color image whose pixel values range from 0 to 255; all pixel values are normalized from [0, 255] to [0, 1], accelerating convergence in the initial iterations of network training, with the formula:

$y_{i,j} = x_{i,j}/255$

where $x_{i,j}$ denotes the image before preprocessing, $y_{i,j}$ denotes the preprocessed image, and $i \in [0, 223]$, $j \in [0, 223]$.
4. The remote sensing scene classification method based on attention network scale feature fusion of claim 1, characterized in that the step 3.1 comprises:
1) the multi-scale image input to the network model enters the classification sub-network of the first scale layer for feature extraction and classification, yielding the probability $P_t^{(1)}$ that the first scale layer's classification sub-network predicts the correct classification label;
2) the features extracted by the first scale layer's classification sub-network are input into the multi-box attention network d1 to obtain the target region, which is cropped and enlarged to serve as the input of the second scale layer;
3) the multi-scale feature image output by the first scale layer is input into the second scale layer's classification sub-network for feature extraction and classification, yielding the probability $P_t^{(2)}$ that the second scale layer's classification sub-network predicts the correct label;
4) the features extracted by the second scale layer's classification sub-network are input into the multi-box attention network d2 to obtain the target region, which is cropped and enlarged to serve as the input of the third scale layer;
5) the multi-scale feature image output by the second scale layer is input into the third scale layer's classification sub-network for feature extraction and classification, yielding the probability $P_t^{(3)}$ that the third scale layer's classification sub-network predicts the correct label;
6) by constraining the network model's correct-label probabilities so that $P_t^{(3)}$ is greater than $P_t^{(2)}$ and $P_t^{(2)}$ is greater than $P_t^{(1)}$, the target region extracted by the multi-box attention network becomes more accurate;
7) finally, the features extracted by the three scale layers' classification sub-networks are input into the LBP operator for feature fusion, completing the scene classification task.
5. The remote sensing scene classification method based on attention network scale feature fusion according to claim 1, characterized in that in the step 4:
the MSA-CNN network model loss function consists of two parts, $L_{cls}$ and $L_{rank}$, wherein $L_{cls}$ denotes the classification loss, i.e., the loss of the predicted remote sensing image class against the true class label in each of the three classification layers, and $L_{rank}$ denotes the loss incurred when, between two adjacent layers of the network, the higher-level network predicts with lower probability than the lower-level network; the joint loss function is trained by alternating the two losses, and the model's joint loss function L(x) is:

$L(x) = \sum_{s=1}^{3} L_{cls}(Y^{(s)}, Y^{*}) + \sum_{s=1}^{2} L_{rank}(P_t^{(s)}, P_t^{(s+1)})$

wherein $Y^{(s)}$ denotes the class predicted by the network model at layer s, $Y^{*}$ denotes the true class of the remote sensing image, and $P_t^{(s)}$ denotes the probability of the correct label predicted by layer s of the network structure; $L_{rank}$ is computed as:

$L_{rank}(P_t^{(s)}, P_t^{(s+1)}) = \max\{0,\; P_t^{(s)} - P_t^{(s+1)} + 0.05\}$

the loss generated when the probability of the true class predicted by layer s+1 is less than that predicted by layer s is updated by the maximum-value method, so that the probability with which the network model predicts the true class increases layer by layer; when $P_t^{(s)} - P_t^{(s+1)} + 0.05 > 0$, the loss between adjacent layers is updated, the extra 0.05 preventing the loss from becoming 0 and halting updates when the two layers' probabilities are equal.
6. The remote sensing scene classification method based on attention network scale feature fusion according to claim 5, characterized in that in the step 4:
the MSA-CNN network model adopts cyclic cross-training of the VGG-16 network and the MS-APN: first, the pre-trained parameters of the VGG-16 network are used to realize feature extraction of the multi-scale remote sensing image and initialization of the classification sub-networks, and the MS-APN parameters $\{t_a, t_b, t_h, K_i\}$ are initialized with the region of highest response in the last convolution layer of the VGG-16 network; then the MS-APN parameters are fixed and the VGG-16 sub-networks are trained until their loss function $L_{cls}$ converges; next the VGG-16 network parameters are fixed and the MS-APN is trained until its loss function $L_{rank}$ converges; finally the MS-APN and VGG-16 sub-networks are trained alternately in cycles until the joint loss functions $L_{cls}$ and $L_{rank}$ both converge, at which point the parameters of the whole model are fixed, yielding the model finally used for classification, with which the final scene classification prediction is performed.
CN202110622695.7A 2021-06-04 2021-06-04 Remote sensing scene classification method based on attention network scale feature fusion Active CN113408594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110622695.7A CN113408594B (en) 2021-06-04 2021-06-04 Remote sensing scene classification method based on attention network scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110622695.7A CN113408594B (en) 2021-06-04 2021-06-04 Remote sensing scene classification method based on attention network scale feature fusion

Publications (2)

Publication Number Publication Date
CN113408594A CN113408594A (en) 2021-09-17
CN113408594B (en) 2022-04-29

Family

ID=77676282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110622695.7A Active CN113408594B (en) 2021-06-04 2021-06-04 Remote sensing scene classification method based on attention network scale feature fusion

Country Status (1)

Country Link
CN (1) CN113408594B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114078230B (en) * 2021-11-19 2023-08-25 西南交通大学 Small target detection method for self-adaptive feature fusion redundancy optimization
CN113836850A (en) * 2021-11-26 2021-12-24 成都数之联科技有限公司 Model obtaining method, system and device, medium and product defect detection method
CN114463646B (en) * 2022-04-13 2022-07-05 齐鲁工业大学 Remote sensing scene classification method based on multi-head self-attention convolution neural network
CN115270405B (en) * 2022-06-22 2024-01-16 中国气象局广州热带海洋气象研究所(广东省气象科学研究所) Convection scale set forecasting method and system based on multisource and multisype disturbance combination
CN115100509B (en) * 2022-07-15 2022-11-29 山东建筑大学 Image identification method and system based on multi-branch block-level attention enhancement network
CN115115939B (en) * 2022-07-28 2023-04-07 北京卫星信息工程研究所 Remote sensing image target fine-grained identification method based on characteristic attention mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292339A (en) * 2017-06-16 2017-10-24 重庆大学 The unmanned plane low altitude remote sensing image high score Geomorphological Classification method of feature based fusion
CN110414377A (en) * 2019-07-09 2019-11-05 武汉科技大学 A kind of remote sensing images scene classification method based on scale attention network
CN110555446A (en) * 2019-08-19 2019-12-10 北京工业大学 Remote sensing image scene classification method based on multi-scale depth feature fusion and transfer learning
CN112766083A (en) * 2020-12-30 2021-05-07 中南民族大学 Remote sensing scene classification method and system based on multi-scale feature fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition";Jianlong Fu,et.al;《2017 IEEE Conference on Computer Vision and Pattern Recognition》;20171209;第4476-4484页 *
"结合金字塔和局部二值模式的遥感图像分类";吴庆岗等;《现代电子技术》;20190701;第42卷(第13期);第56-60、64页 *
联合多尺度多特征的高分遥感图像场景分类;黄鸿等;《电子学报》;20200915(第09期);全文 *

Also Published As

Publication number Publication date
CN113408594A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN113408594B (en) Remote sensing scene classification method based on attention network scale feature fusion
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN109829398B (en) Target detection method in video based on three-dimensional convolution network
CN109741331B (en) Image foreground object segmentation method
CN106909902B (en) Remote sensing target detection method based on improved hierarchical significant model
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN108596108B (en) Aerial remote sensing image change detection method based on triple semantic relation learning
Liu et al. Bipartite differential neural network for unsupervised image change detection
CN111415316A (en) Defect data synthesis algorithm based on generation of countermeasure network
CN112801015B (en) Multi-mode face recognition method based on attention mechanism
CN109543632A (en) A kind of deep layer network pedestrian detection method based on the guidance of shallow-layer Fusion Features
CN108960404B (en) Image-based crowd counting method and device
CN108629368B (en) Multi-modal foundation cloud classification method based on joint depth fusion
CN111027497B (en) Weak and small target rapid detection method based on high-resolution optical remote sensing image
CN114187450A (en) Remote sensing image semantic segmentation method based on deep learning
CN112733614B (en) Pest image detection method with similar size enhanced identification
CN113159043B (en) Feature point matching method and system based on semantic information
CN111539422B (en) Flight target cooperative identification method based on fast RCNN
CN111814771A (en) Image processing method and device
CN113095371B (en) Feature point matching method and system for three-dimensional reconstruction
Fan et al. Registration of multiresolution remote sensing images based on L2-siamese model
CN114022408A (en) Remote sensing image cloud detection method based on multi-scale convolution neural network
CN107766810B (en) Cloud and shadow detection method
CN115995039A (en) Enhanced semantic graph embedding for omni-directional location identification
CN116052016A (en) Fine segmentation detection method for remote sensing image cloud and cloud shadow based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant