CN115601584A - Remote sensing scene image multi-label classification method and device and storage medium - Google Patents

Remote sensing scene image multi-label classification method and device and storage medium

Info

Publication number
CN115601584A
CN115601584A (application CN202211113132.6A)
Authority
CN
China
Prior art keywords
label
remote sensing
class
scene image
inter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211113132.6A
Other languages
Chinese (zh)
Inventor
刘宏哲
吴宏俊
刘力铭
徐成
代松银
潘卫国
徐冰心
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Union University
Original Assignee
Beijing Union University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Union University filed Critical Beijing Union University
Priority to CN202211113132.6A priority Critical patent/CN115601584A/en
Publication of CN115601584A publication Critical patent/CN115601584A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
        • G06 — COMPUTING; CALCULATING OR COUNTING
            • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 10/00 — Arrangements for image or video recognition or understanding
                    • G06V 10/40 — Extraction of image or video features
                    • G06V 10/70 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning
                        • G06V 10/764 — using classification, e.g. of video objects
                        • G06V 10/77 — Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                        • G06V 10/82 — using neural networks
                • G06V 20/00 — Scenes; Scene-specific elements
                    • G06V 20/10 — Terrestrial scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing scene image multi-label classification method and device and a storage medium, wherein the method comprises the following steps: extracting remote sensing scene image features; converting the remote sensing scene image features into label embeddings corresponding to each category label; obtaining a first inter-class relation matrix according to the correlation between the label embeddings; constructing a mask according to the first inter-class relation matrix to obtain a second inter-class relation matrix; updating the label embeddings according to the second inter-class relation matrix to obtain the prediction score of each class label; and determining the labels of the remote sensing scene image according to the prediction score of each class label. The technical scheme of the invention solves the prior-art problem that the bias introduced by classes absent from an image is not eliminated when modeling inter-class relationships.

Description

Remote sensing scene image multi-label classification method and device and storage medium
Technical Field
The invention belongs to the technical field of remote sensing image processing, and particularly relates to a multi-label classification method and device for remote sensing scene images based on a mask attention mechanism, and a storage medium.
Background
In recent years, with the continuous development of remote sensing technology, airborne and spaceborne remote sensing images have been widely used for land cover mapping and monitoring. Generally, since the land cover depicted in a high-resolution remote sensing image is of many types, the content of the image cannot be accurately described with only a single label. A multi-label remote sensing image classification method can assign multiple land-cover labels to each remote sensing image, thereby expressing the image more accurately and better matching the actual requirements of remote sensing image understanding.
Recently, visual feature extractors based on deep learning have made great progress in the field of image recognition, such as ResNet (Deep Residual Network) among DCNNs (Deep Convolutional Neural Networks) and the Swin Transformer (a hierarchical Vision Transformer using shifted windows) among Vision Transformers. These feature extractors can extract high-level semantic features that are easier to distinguish, which greatly helps single-label image classification. However, multi-label classification of remote sensing images is a more complex task than single-label classification. On the one hand, a single remote sensing image contains multiple surface covers at different spatial scales. For example, "cars" are much smaller in size than "courses", so "cars" is one of the inconspicuous categories. On the other hand, since land-cover objects generally coexist in a remote sensing image, the inter-class relationship is another key to classification. Therefore, the multi-label classification task of remote sensing images must consider not only accurate spatial feature extraction but also the correlation among multiple classes.
In typical multi-label image classification, the utilization of spatial information and of inter-class relationships are two important issues. Methods for processing spatial information mainly include the introduction of region proposals, implicit spatial attention, or multi-scale features. Introducing region proposals requires additional bounding-box labeling, which can be labor-intensive. Using implicit spatial attention enables automatic localization of objects of each class in an image through classification-loss supervision alone, without manually labeled bounding-box supervision. Using multi-scale features can increase the model's ability to recognize objects of different scales to some extent, but increases the amount of computation.
On the other hand, modeling of relationships between classes is also widely studied. Early methods used an RNN (Recurrent Neural Network) or LSTM (Long Short-Term Memory) to predict the multiple tags of an image in a sequential manner and learned the sequential relevance of the tags. However, the performance of RNN- or LSTM-based methods is affected by the preset or learned order. Other studies formulate the multi-label image classification task as a structural inference problem based on probabilistic graphical models, but their utility is limited by high computational complexity. Inspired by the GCN (Graph Convolutional Network) in multivariate relational representation, some researchers use GCNs to explicitly model tag relevance. However, the performance of such convolutional networks is limited by the receptive field of convolution, so their long-range relational modeling is poor. The Transformer, based on the attention mechanism, learns the relationship between each pair of elements in a long sequence using self-attention, which gives it an advantage over convolutional networks in long-range relationship modeling. Currently, Transformers are widely used in the fields of natural language processing and computer vision.
Multi-label classification thus faces two broad categories of problems: more accurate spatial information and inter-class relationship modeling are both needed, yet existing multi-label classification methods for remote sensing images mainly fall into two types — methods that process spatial information and methods that process inter-class relationships — and methods that consider both problems jointly are lacking. Meanwhile, existing inter-class relationship modeling methods generally model the overall label dependency among all classes directly. However, only some class objects exist in any single image, and most of the visual features extracted from the image relate to the true tags, while features related to the absent classes are lacking. Because of this, inter-class relationships computed between absent classes are inaccurate. These inaccurate inter-tag dependencies introduce noise into the classification task.
Disclosure of Invention
The invention aims to solve the technical problem of providing a multi-label classification method and device for remote sensing scene images based on a mask attention mechanism and a storage medium, so as to solve the problem that the prior art does not eliminate the deviation caused by the categories which do not exist in the images when modeling the relationship between the categories.
In order to realize the purpose, the invention adopts the following technical scheme:
a remote sensing scene image multi-label classification method comprises the following steps:
s1, extracting the characteristics of a remote sensing scene image;
s2, converting the remote sensing scene image characteristics into label embedding corresponding to each class label;
s3, obtaining a first inter-class relation matrix according to the correlation between the label embedding;
s4, constructing a mask according to the first inter-class relation matrix to obtain a second inter-class relation matrix;
s5, updating the label embedding according to the second inter-class relation matrix to obtain the prediction score of each class label;
and S6, determining the label of the remote sensing scene image according to the prediction score of each class label.
Preferably, step S2 includes:
converting the remote sensing scene image characteristics into category specific activation;
and obtaining label embedding corresponding to each category label according to the remote sensing scene image characteristics and the category specific activation.
Preferably, step S3 is specifically: learning the correlation between the label embeddings through a multi-head dot-product self-attention mechanism to obtain a first inter-class relation matrix. First, the label embedding E is divided into h subsequences [e_1, e_2, …, e_h], e_i ∈ R^{C×(d/h)}, i = 1, 2, …, h; then, for each subsequence e_i, three weight matrices W_i^Q, W_i^K, W_i^V are learned, and the subsequence e_i is converted into the vectors Q_i, K_i, V_i using the following formula:

Q_i = e_i W_i^Q,  K_i = e_i W_i^K,  V_i = e_i W_i^V;

the dot product of the vectors Q_i and K_i is calculated and mapped into the (0, 1) interval to obtain the first inter-class relation matrix.
Preferably, step S4 is specifically: converting the category-specific activation into a category prediction score 1 of the remote sensing scene image using a global max-pooling function; according to category prediction score 1, selecting the indices of the top k classes with the highest scores and adding them to a set I;

the mask is constructed using the following formula:

M(x, y) = 0 if x ∈ I and y ∈ I, and M(x, y) = −∞ otherwise;

inaccurate inter-class relationships are filtered using the following formula:

A_i = A'_i + M, i = 1, 2, …, h,

obtaining the second inter-class relation matrices [A_1, A_2, …, A_h].
Preferably, step S5 is specifically:
updating the tag embedding E using the following formulas:

E = E + A(E, E, E),
E = σ(E P_1 + b_1) P_2 + b_2 + E,

where σ(·) is a nonlinear activation function and P_1, P_2, b_1, b_2 are learnable parameters;
obtaining a category prediction score 2 according to the updated label embedding;
and selecting the mean value of the category prediction score 1 and the category prediction score 2 to obtain the final prediction score of each category label.
Preferably, step S6 determines the label of the remote sensing scene image using the following method:

ŷ_i = 1 if Y_i > t, and ŷ_i = 0 otherwise,

wherein Y_i represents the i-th element in the final class prediction score Y and t is a decision threshold.
The invention also provides a multi-label classification device for remote sensing scene images, which comprises:
the extraction module is used for extracting the remote sensing scene image characteristics;
the conversion module is used for converting the remote sensing scene image characteristics into label embedding corresponding to each category label;
the first calculation module is used for obtaining a first inter-class relation matrix according to the correlation between the label embedding;
the second calculation module is used for constructing a mask according to the first inter-class relation matrix to obtain a second inter-class relation matrix;
the third calculation module is used for updating the label embedding according to the second inter-class relation matrix to obtain the prediction score of each class label;
and the classification module is used for determining the label of the remote sensing scene image according to the prediction score of each class label.
The invention also provides a storage medium storing machine executable instructions which, when invoked and executed by a processor, cause the processor to implement the remote sensing scene image multi-label classification method.
The invention jointly considers spatial information and inter-class relationship modeling: it uses implicit spatial attention to locate the position of each class of ground cover in the image, and replaces the standard self-attention mechanism in the Transformer with a mask attention mechanism. The mask attention mechanism can filter out the inaccurate inter-class dependencies, thereby improving multi-label classification performance.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments are briefly introduced below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a remote sensing scene image multi-label classification method of the present invention;
FIG. 2 is a schematic diagram illustrating the principle of the remote sensing scene image multi-label classification method of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail with reference to the accompanying drawings and the detailed description thereof.
Example 1:
as shown in fig. 1 and 2, the invention provides a remote sensing scene image multi-label classification method based on a mask attention mechanism, which comprises the following steps:
s1, extracting the remote sensing scene image features by using a feature extractor;
s2, converting the remote sensing scene image characteristics into label embedding corresponding to each class label;
s3, obtaining a first inter-class relation matrix according to the correlation between the label embedding;
s4, constructing a mask according to the first inter-class relation matrix to obtain a second inter-class relation matrix;
s5, updating the label embedding according to the second inter-class relation matrix to obtain the prediction score of each class label;
and S6, determining the label of the remote sensing scene image according to the prediction score of each class label.
As an implementation manner of the embodiment of the present invention, step S1 obtains the remote sensing scene image features by the following method, comprising the following steps:
S11, preprocessing the remote sensing scene images, including horizontally flipping the images and cropping and resizing them to a uniform size; in this embodiment, all remote sensing scene images are uniformly resized to 224 × 224 pixels;
S12, extracting the features of the remote sensing scene image using a deep convolutional neural network, from which the final pooling layer is removed; preferably, to balance the accuracy and real-time performance of the network, the deep residual network ResNet50 is adopted as the backbone for extracting image features, obtaining the remote sensing scene image feature

X ∈ R^{D×H×W},

where H, W and D are the three dimensions of the remote sensing scene image feature, namely the length, width and number of channels of the feature, respectively. The dimension of the remote sensing scene image feature X obtained in this embodiment is 2048 × 7 × 7.
As an implementation manner of the embodiment of the present invention, the step S2 of converting the remote sensing scene image features into tag embedding corresponding to each category tag includes:
converting the remote sensing scene image characteristics into category specific activation; and obtaining label embedding corresponding to each class label according to the remote sensing scene image characteristics and the class specific activation.
Further, the remote sensing scene image features are converted into the category-specific activation by the following method, specifically:

the remote sensing scene image feature X ∈ R^{D×H×W} is converted into the category-specific activation G ∈ R^{C×H×W} using the following formula:

G = sigmoid(f_{1×1}(X)),

where f_{1×1}(·) is a two-dimensional convolution function with a 1 × 1 convolution kernel, H, W and D are the three dimensions of the remote sensing scene image feature (namely the length, width and number of channels of the feature), and C is the total number of label categories of the dataset. The category-specific activation G represents the degree to which each position in the remote sensing scene image feature X belongs to a class in {c_1, c_2, …, c_C}, so the position of each category of ground cover in the image can be located.
The label embedding corresponding to each category label is obtained from the remote sensing scene image features and the category-specific activation by the following method, specifically:

according to the remote sensing scene image feature X and the category-specific activation G, the label embedding E is obtained using the following formula:

E = r(G) · r(g(X))^T,

where the function r(·) is a matrix dimension-change (reshape) function, which changes the dimensions of the remote sensing scene image feature X ∈ R^{D×H×W} and the category-specific activation G ∈ R^{C×H×W} to R^{D×HW} and R^{C×HW} respectively, i.e., the H and W dimensions are merged; g(·) is a one-dimensional convolution function with kernel size 1, which is used to change the channel dimension of the remote sensing scene image feature.
As another implementation manner of the embodiment of the present invention, step S2 obtains the tag embedding using the following method, comprising the following steps:
S21, converting the remote sensing scene image feature X into the class-specific activation G using a convolution layer and a sigmoid activation layer. Specifically, in this embodiment, the convolution layer is a two-dimensional convolution layer, the convolution kernel size is 1 × 1, the stride is 1, the input dimension is 2048, and the output dimension is the number of ground-cover categories labeled in the remote sensing scene image dataset, i.e., the number of label categories C. The dimension of the class-specific activation G obtained in this embodiment is C × 7 × 7;
S22, changing the dimension of the remote sensing scene image feature X using a convolution layer. Specifically, in this embodiment, the convolution layer is a two-dimensional convolution layer, the convolution kernel size is 1 × 1, the stride is 1, the input dimension is 2048, and the output dimension is 1024. The dimension of the remote sensing scene image feature X becomes 1024 × 7 × 7;
S23, merging the H and W dimensions of the class-specific activation G and of the remote sensing scene image feature X; in this embodiment, after merging, G ∈ R^{C×49} and X ∈ R^{1024×49};
S24, multiplying the class-specific activation G and the remote sensing scene image feature X using a multiplier to obtain the label embedding E. In this embodiment, E ∈ R^{C×1024}.
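Steps S21–S24 can be sketched with plain NumPy: a 1 × 1 convolution is a per-pixel linear map, so matrix multiplications over the flattened spatial positions reproduce it exactly. All weights below are random stand-ins for the learned convolution parameters, and C = 17 (the UC-Merced label count) is used purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, W, C, d = 2048, 7, 7, 17, 1024       # dimensions from the embodiment; C = 17 illustrative

X = rng.standard_normal((D, H, W))         # backbone feature map (random stand-in)

W_g = rng.standard_normal((C, D)) * 0.01   # class-activation 1x1 conv weights (hypothetical)
W_x = rng.standard_normal((d, D)) * 0.01   # dimension-reduction 1x1 conv weights (hypothetical)

X_flat = X.reshape(D, H * W)               # S23: merge the H and W dimensions -> D x HW
G = 1.0 / (1.0 + np.exp(-(W_g @ X_flat)))  # S21: class-specific activation, C x HW
X_red = W_x @ X_flat                       # S22: channel reduction, d x HW

E = G @ X_red.T                            # S24: label embedding, C x d
print(G.shape, E.shape)                    # (17, 49) (17, 1024)
```

Each row of E is the embedding of one category, pooled from the spatial positions that its activation map highlights.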
as another implementation manner of the embodiment of the present invention, in step S3, a first inter-class relationship matrix is obtained according to the correlation between the embedded tags, and specifically:
and learning the correlation between the label embedding through a multi-head dot product self-attention mechanism to obtain a first inter-class relation matrix. The multi-head dot product self-attention mechanism consists of h dot product self-attention heads and can learn the relevance of different aspects of characteristics. First E is divided into h subsequences [ E 1 ,e 2 ,…,e h ],
Figure BDA0003844371250000101
i =1,2, \8230;, h. Then for each subsequence e i Learning three weight matrices
Figure BDA0003844371250000102
Sub-sequence e using the following formula i Conversion to vector Q i ,K i ,V i
Figure BDA0003844371250000103
Calculating the vector Q i And K i And mapping the dot product of (1) to the interval of (0) to obtain the first inter-class relationship matrix.
As another implementation manner of the embodiment of the present invention, in step S3 the following method is used to obtain the first inter-class relation matrices, comprising the following steps:
S31, dividing the label embedding E into h subsequences [e_1, e_2, …, e_h], i = 1, 2, …, h; in this embodiment, h = 4;
S32, converting each subsequence into three vectors Q_i, K_i, V_i using three fully connected layers respectively; in this embodiment, the input and output dimensions of each fully connected layer are both 1024;
S33, calculating the dot product of the vectors Q_i and K_i and normalizing it with a softmax function to obtain the first inter-class relation matrices [A'_1, A'_2, …, A'_h];
As an implementation manner of the embodiment of the present invention, in step S4 a mask is constructed by the following method, and the low-confidence entries of the first inter-class relation matrix are filtered out to obtain the second inter-class relation matrix, specifically:

converting the category-specific activation G into the category prediction score 1 of the remote sensing scene image using a global max-pooling function (MaxPooling2D); according to category prediction score 1, selecting the indices of the top k classes with the highest scores and adding them to the set I.

The mask is constructed using the following formula:

M(x, y) = 0 if x ∈ I and y ∈ I, and M(x, y) = −∞ otherwise,

where x and y are the row and column indices of the first relation matrix, respectively.

Inaccurate inter-class relationships are filtered using the following formula:

A_i = A'_i + M, i = 1, 2, …, h,

obtaining the second inter-class relation matrices [A_1, A_2, …, A_h].
As another implementation manner of the embodiment of the present invention, in step S4 the following method is used to obtain the second inter-class relation matrices, comprising the following steps:
S41, according to category prediction score 1, selecting the indices of the top k classes with the highest scores using a topK function and adding them to the set I; in this embodiment, k = 20;
S42, constructing the mask using the following formula:

M(x, y) = 0 if x ∈ I and y ∈ I, and M(x, y) = −∞ otherwise;

S43, using an adder to add the mask M to each of the first inter-class relation matrices respectively, obtaining the second inter-class relation matrices [A_1, A_2, …, A_h].
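Steps S41–S43 can be sketched as follows. The scores and the first relation matrix are random placeholders, and k = 5 (rather than the embodiment's k = 20) keeps the toy example small:

```python
import numpy as np

rng = np.random.default_rng(0)
C, k = 17, 5                            # k = 5 for readability; the embodiment uses k = 20
score1 = rng.random(C)                  # category prediction score 1 (random stand-in)

# S41: indices of the top-k highest-scoring classes
I = set(np.argsort(score1)[-k:].tolist())

# S42: mask is 0 where both row and column classes are confidently present, -inf elsewhere
M = np.full((C, C), -np.inf)
for x in range(C):
    for y in range(C):
        if x in I and y in I:
            M[x, y] = 0.0

# S43: adding the mask suppresses every relation entry that involves an absent class
A_prime = rng.random((C, C))            # a first inter-class relation matrix (illustrative)
A = A_prime + M
```

After the addition, only the k × k block of entries between confidently present classes remains finite; the rest are −∞ and contribute nothing after any subsequent softmax-style normalization.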
As an implementation manner of the embodiment of the present invention, step S5 updates the label embedding by the following method to obtain the prediction score of each category label, specifically:

the tag embedding E is updated using the following formulas:

E = E + A(E, E, E),
E = σ(E P_1 + b_1) P_2 + b_2 + E,

where σ(·) is a nonlinear activation function and P_1, P_2, b_1, b_2 are learnable parameters.
Obtaining a category prediction score 2 according to the updated label embedding;
and taking the average value of the category prediction score 1 and the category prediction score 2 to obtain the final prediction score of each category label.
As another implementation manner of the embodiment of the present invention, in step S5 the following method is used to obtain the prediction score of each category label, comprising the following steps:
S51, using a multiplier to multiply each element A_i of the second inter-class relation matrices [A_1, A_2, …, A_h] with the corresponding vector V_i, obtaining a new label embedding E_1;
S52, updating the label embedding E_1 to the label embedding E_2 using two fully connected layers and one GELU activation layer. In this embodiment, the input dimension of fully connected layer 1 is 1024 and its output dimension is 2048, while the input dimension of fully connected layer 2 is 2048 and its output dimension is 1024;
S53, converting the label embedding E_2 into category prediction score 2 using a fully connected layer and a sigmoid activation layer. In this embodiment, the input dimension of the fully connected layer is 1024 and the output dimension is the number of label categories C;
S54, combining category prediction score 1 and category prediction score 2 element-wise (taking their mean) to obtain the final category prediction score Y.
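Steps S51–S54 can be sketched in NumPy as below. The attention weights, feed-forward parameters and scoring head all use random stand-in values; the single shared scoring vector `Wc` is a simplification of the per-class fully connected layer described above, labeled as such:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gelu(z):  # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

rng = np.random.default_rng(0)
C, d = 17, 1024
E = rng.standard_normal((C, d)) * 0.1
V = rng.standard_normal((C, d)) * 0.1                      # value vectors from the attention step
A = rng.random((C, C)); A /= A.sum(axis=1, keepdims=True)  # stand-in masked attention weights

E1 = E + A @ V                                    # S51 plus residual: attention update
P1 = rng.standard_normal((d, 2 * d)) * 0.01; b1 = np.zeros(2 * d)
P2 = rng.standard_normal((2 * d, d)) * 0.01; b2 = np.zeros(d)
E2 = gelu(E1 @ P1 + b1) @ P2 + b2 + E1            # S52: feed-forward with residual

Wc = rng.standard_normal((d,)) * 0.01             # S53: scoring head (simplified, hypothetical)
score2 = sigmoid(E2 @ Wc)                         # category prediction score 2, shape (C,)
score1 = rng.random(C)                            # category prediction score 1 (stand-in)
Y = (score1 + score2) / 2.0                       # S54: final score as the mean of the two
```

The 1024 → 2048 → 1024 feed-forward dimensions mirror those stated for fully connected layers 1 and 2.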
As an implementation manner of the embodiment of the present invention, step S6 determines the label of the remote sensing scene image using the following method:

ŷ_i = 1 if Y_i > t, and ŷ_i = 0 otherwise,

where Y_i represents the i-th element in the final class prediction score Y and t is a decision threshold.
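The final thresholding step can be illustrated directly; the threshold value 0.5 below is an assumption for illustration, since the text does not state the value used:

```python
import numpy as np

Y = np.array([0.91, 0.12, 0.55, 0.49])  # illustrative final class prediction scores
t = 0.5                                 # decision threshold (assumed; not stated in the text)
labels = (Y > t).astype(int)            # 1 = label assigned to the image, 0 = not assigned
print(labels)                           # [1 0 1 0]
```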
By adopting the above technical scheme, namely the multi-label image classification network based on the mask attention mechanism, the corresponding position of each ground cover in the image can be automatically located, the self-attention mechanism assigns a weight to each pair of labels, and the constructed mask filters out the low-confidence, inaccurate part of the inter-class relationships, so that more accurate inter-class relationships can be obtained. Combining the spatial information with the inter-class relationships makes the classification labels of the remote sensing scene image more accurate.
The embodiment of the invention uses implicit spatial attention to position the position of each type of ground covering on the image, automatically extracts the relevant regional characteristics of the ground covering of the types, and fully utilizes the spatial information in the remote sensing image.
The embodiment of the invention uses a mask attention mechanism to replace a standard self-attention mechanism and automatically learns the relationship between classes. The masking attention mechanism can filter out the partially inaccurate inter-class dependency relationship, so that the accuracy of multi-label classification is improved.
To verify the validity of the present application, the following comparative experiments were performed:
data set
The UC-Merced multi-label land use dataset is a multi-label remote sensing image dataset. The dataset contains a total of 2100 remote sensing images covering 17 land-cover types. Each image is 256 × 256 pixels and carries 1 to 7 land-cover labels. We randomly chose 80% of the images for training the model and used the remaining images for validation and testing.
Evaluation index
Five common multi-label remote sensing image classification evaluation indices are selected: overall precision (OP), overall recall (OR), per-class precision (CP), per-class recall (CR) and per-class F1 score (CF1). For all indices, a larger value indicates a better classification effect.
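Under common definitions of these indices (which this sketch assumes match those used in the experiments), the five values can be computed from binary ground-truth and prediction matrices as follows:

```python
import numpy as np

def multilabel_metrics(y_true, y_pred):
    """OP, OR, CP, CR, CF1 for binary multi-label matrices of shape (samples, classes)."""
    t = y_true.astype(bool)
    p = y_pred.astype(bool)
    tp = (t & p).sum(axis=0).astype(float)    # per-class true positives
    fp = (~t & p).sum(axis=0).astype(float)   # per-class false positives
    fn = (t & ~p).sum(axis=0).astype(float)   # per-class false negatives
    eps = 1e-12                               # guards against division by zero
    cp = (tp / (tp + fp + eps)).mean()        # per-class precision, macro-averaged
    cr = (tp / (tp + fn + eps)).mean()        # per-class recall, macro-averaged
    cf1 = 2 * cp * cr / (cp + cr + eps)       # per-class F1, derived from CP and CR
    op = tp.sum() / (tp.sum() + fp.sum() + eps)   # overall (micro) precision
    orr = tp.sum() / (tp.sum() + fn.sum() + eps)  # overall (micro) recall
    return op, orr, cp, cr, cf1

y_true = np.array([[1, 0], [1, 1]])
y_pred = np.array([[1, 0], [0, 1]])
op, orr, cp, cr, cf1 = multilabel_metrics(y_true, y_pred)
```

One caveat of the epsilon guard: a class that is never predicted gets precision 0 rather than being excluded from the average.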
The comparison method comprises the following steps:
The first existing method: ResNet50, the method proposed by He et al. in the document "Deep residual learning for image recognition";
The second existing method: CA-ResNet-BiLSTM, the method proposed by Hua et al. in the document "Recurrently exploring class-wise attention in a hybrid convolutional and bidirectional LSTM network for multi-label aerial image classification";
The third existing method: AL-RN-ResNet50, the method proposed by Hua et al. in the document "Relation network for multi-label aerial image classification";
The results of performing the multi-label image classification task on the UC-Merced multi-label land use dataset using the method of the present application and the existing methods are as follows:

[Results table not reproduced: the original shows the five evaluation indices for each compared method.]
the above experiment illustrates that: the method is superior to the existing method in most evaluation indexes, and the effectiveness of the method is reflected.
Example 2:
the invention also provides a multi-label classification device for remote sensing scene images, which comprises:
the extraction module is used for extracting the remote sensing scene image characteristics;
the conversion module is used for converting the remote sensing scene image characteristics into label embedding corresponding to each category label;
the first calculation module is used for obtaining a first inter-class relation matrix according to the correlation between the label embedding;
the second calculation module is used for constructing a mask according to the first inter-class relation matrix to obtain a second inter-class relation matrix;
the third calculation module is used for updating the label embedding according to the second inter-class relation matrix to obtain the prediction score of each class label;
and the classification module is used for determining the label of the remote sensing scene image according to the prediction score of each class label.
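The cooperation of the six modules above can be sketched as a sequential pipeline. This is an illustrative composition only; the class name, the toy stand-in functions, and the threshold are assumptions, and the real modules would be neural network components.

```python
class MultiLabelClassifier:
    """Illustrative composition of the six modules described above."""
    def __init__(self, extract, convert, relation, mask_relation, update, classify):
        # Each step consumes the previous step's output, mirroring S1-S6
        self.steps = [extract, convert, relation, mask_relation, update, classify]

    def __call__(self, image):
        x = image
        for step in self.steps:
            x = step(x)
        return x

# Toy stand-ins for the six modules (hypothetical, for illustration only)
pipeline = MultiLabelClassifier(
    extract=lambda img: [sum(img)],          # feature extraction
    convert=lambda f: [f[0], f[0] * 2],      # label embeddings per class
    relation=lambda e: e,                    # first inter-class relation matrix
    mask_relation=lambda e: e,               # masked (second) relation matrix
    update=lambda e: [v + 1 for v in e],     # updated prediction scores
    classify=lambda s: [v > 2 for v in s],   # thresholded label decisions
)
labels = pipeline([1, 1])
```

The design point is that each module has a single responsibility and a fixed interface, so any one stage (for instance the relation-learning stage) can be replaced without changing the others.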
Embodiment 3:
the invention also provides a storage medium, which stores machine executable instructions, and when the machine executable instructions are called and executed by a processor, the machine executable instructions cause the processor to realize the remote sensing scene image multi-label classification method.
The above-described embodiments are only intended to describe the preferred embodiments of the present invention, and not to limit the scope of the present invention, and various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims (8)

1. A remote sensing scene image multi-label classification method is characterized by comprising the following steps:
s1, extracting the characteristics of a remote sensing scene image;
s2, converting the remote sensing scene image characteristics into label embedding corresponding to each class label;
s3, obtaining a first inter-class relation matrix according to the correlation between the label embedding;
s4, constructing a mask according to the first inter-class relation matrix to obtain a second inter-class relation matrix;
s5, updating the label embedding according to the second inter-class relation matrix to obtain the prediction score of each class label;
and S6, determining the label of the remote sensing scene image according to the prediction score of each class label.
2. The remote sensing scene image multi-label classification method according to claim 1, wherein the step S2 comprises:
converting the remote sensing scene image characteristics into category specific activation;
and obtaining label embedding corresponding to each category label according to the remote sensing scene image characteristics and the category specific activation.
3. The remote sensing scene image multi-label classification method according to claim 2, characterized in that step S3 specifically comprises: learning the correlation between the label embeddings through a multi-head dot-product self-attention mechanism to obtain the first inter-class relation matrix; first, the label embedding E is divided into h subsequences
[e_1, e_2, …, e_h], where e_i ∈ R^(C×(d/h));
then, for each subsequence e_i, three weight matrices
W_i^Q, W_i^K, W_i^V ∈ R^((d/h)×(d/h))
are learned, and the subsequence e_i is converted into matrices Q_i, K_i and V_i using the following formulas:
Q_i = e_i W_i^Q, K_i = e_i W_i^K, V_i = e_i W_i^V;
the similarity between the matrices Q_i and K_i is then calculated and mapped into the (0,1) interval to obtain the first inter-class relation matrix.
4. The remote sensing scene image multi-label classification method according to claim 3, wherein step S4 specifically comprises: converting the class-specific activation into a category prediction score 1 of the remote sensing scene image by using a global maximum pooling function; according to the category prediction score 1, selecting the indices of the k highest values and adding them to a set I;
the mask M is constructed using the following formula:
M_ij = 0 if j ∈ I, and M_ij = −∞ otherwise;
inaccurate inter-class relationships are filtered using the following formula:
A_i = softmax(Q_i K_i^T / sqrt(d/h) + M),
thereby obtaining the second inter-class relation matrix [A_1, A_2, …, A_h].
5. The remote sensing scene image multi-label classification method according to claim 4, wherein the step S5 specifically comprises:
the label embedding E is updated using the following formulas:
E = E + A(E, E, E),
E = σ(E P_1 + b_1) P_2 + b_2 + E,
where σ(·) denotes a nonlinear activation function, and P_1, P_2, b_1, b_2 are learnable parameters;
obtaining a category prediction score 2 according to the updated label embedding;
and taking the mean of the category prediction score 1 and the category prediction score 2 to obtain the final prediction score of each category label.
6. The remote sensing scene image multi-label classification method according to claim 5, characterized in that step S6 determines the label of the remote sensing scene image using the following method:
y_i = 1 if Y_i > 0.5, and y_i = 0 otherwise,
wherein Y_i represents the i-th element in the final class prediction score Y.
7. A multi-label classification device for remote sensing scene images is characterized by comprising:
the extraction module is used for extracting the remote sensing scene image characteristics;
the conversion module is used for converting the remote sensing scene image characteristics into label embedding corresponding to each category label;
the first calculation module is used for obtaining a first inter-class relation matrix according to the correlation between the label embedding;
the second calculation module is used for constructing a mask according to the first inter-class relation matrix to obtain a second inter-class relation matrix;
the third calculation module is used for updating the label embedding according to the second inter-class relation matrix to obtain the prediction score of each class label;
and the classification module is used for determining the label of the remote sensing scene image according to the prediction score of each class label.
8. A storage medium having stored thereon machine executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of multi-label classification of images of remote sensing scenes according to any one of claims 1 to 6.
CN202211113132.6A 2022-09-14 2022-09-14 Remote sensing scene image multi-label classification method and device and storage medium Pending CN115601584A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211113132.6A CN115601584A (en) 2022-09-14 2022-09-14 Remote sensing scene image multi-label classification method and device and storage medium


Publications (1)

Publication Number Publication Date
CN115601584A true CN115601584A (en) 2023-01-13

Family

ID=84842543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211113132.6A Pending CN115601584A (en) 2022-09-14 2022-09-14 Remote sensing scene image multi-label classification method and device and storage medium

Country Status (1)

Country Link
CN (1) CN115601584A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524258A (en) * 2023-04-25 2023-08-01 云南师范大学 Landslide detection method and system based on multi-label classification

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113222068A (en) * 2021-06-03 2021-08-06 西安电子科技大学 Remote sensing image multi-label classification method based on adjacency matrix guidance label embedding


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HONGJUN WU ET AL.: "S-MAT: Semantic-Driven Masked Attention Transformer for Multi-Label Aerial Image Classification" *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination