CN112949732A - Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion - Google Patents

Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion

Info

Publication number
CN112949732A
Authority
CN
China
Prior art keywords
feature map
convolution
map set
channel
reshaping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110270709.3A
Other languages
Chinese (zh)
Other versions
CN112949732B (en)
Inventor
刘瑜
谭大宁
丁自然
姚力波
徐从安
孙顺
姜乔文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Naval Aeronautical University
Original Assignee
Naval Aeronautical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Naval Aeronautical University filed Critical Naval Aeronautical University
Priority to CN202110270709.3A priority Critical patent/CN112949732B/en
Publication of CN112949732A publication Critical patent/CN112949732A/en
Application granted granted Critical
Publication of CN112949732B publication Critical patent/CN112949732B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G – PHYSICS
    • G06 – COMPUTING; CALCULATING OR COUNTING
    • G06F – ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 – Pattern recognition
    • G06F 18/20 – Analysing
    • G06F 18/25 – Fusion techniques
    • G06F 18/253 – Fusion techniques of extracted features
    • G – PHYSICS
    • G06 – COMPUTING; CALCULATING OR COUNTING
    • G06F – ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 – Pattern recognition
    • G06F 18/20 – Analysing
    • G06F 18/24 – Classification techniques
    • G06F 18/241 – Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 – Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G – PHYSICS
    • G06 – COMPUTING; CALCULATING OR COUNTING
    • G06N – COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 – Computing arrangements based on biological models
    • G06N 3/02 – Neural networks
    • G06N 3/04 – Architecture, e.g. interconnection topology
    • G06N 3/045 – Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a semantic annotation method and system based on self-adaptive multi-modal remote sensing image fusion. First, feature extraction and splicing are carried out in sequence on remote sensing images of multiple modalities. Second, random inactivation (dropout) is applied to the channels of the spliced feature map set to obtain a random inactivation feature map set, which is then convolved to obtain a first and a second convolution feature map set. Next, global semantic annotation is carried out on each pixel in the first convolution feature map set, and channel semantic annotation is carried out on the channel dimension based on the second convolution feature map set. Finally, the annotated position output feature map set and the annotated dimension output feature map set are fused by weighting and convolved to obtain a fused annotation image. The method randomly deactivates the input multi-modal channels to simulate partial modality loss in practical situations, improving the generalization capability and robustness of the model. In addition, the invention combines channel semantic annotation with global semantic annotation, thereby improving the accuracy of fusion annotation of image context information.

Description

Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion
Technical Field
The invention relates to the technical field of semantic annotation, in particular to a semantic annotation method and a semantic annotation system based on self-adaptive multi-mode remote sensing image fusion.
Background
At present, semantic annotation of high-resolution multi-modal remote sensing data is divided into early fusion (e.g., FuseNet) and late fusion (e.g., SegNet-RC) according to when the multi-modal fusion takes place.
In early fusion, several encoders jointly encode the multi-source remote sensing information, and the encoder outputs are added after each convolution block. A decoder then resamples the encoded joint representation back into the label probability space. However, in this architecture the auxiliary-branch data are treated as second-class data, i.e., the branches are not completely symmetric. Furthermore, only the indices of the main branch are used during upsampling. There is therefore a conceptual imbalance in the way the multiple sources are handled: one must choose which source is the primary source and which is the auxiliary data.
Late fusion is similar to early fusion, except that several encoders encode the multi-source remote sensing information separately, the branches are then decoded separately, and the different branches are fused after decoding. This improves the accuracy of semantic annotation, but it requires the multi-source remote sensing images to be non-heterogeneous, so the applicability of the model is limited, especially when the source data are electro-optical and radar images.
Current semantic annotation methods for multi-source remote sensing image fusion mainly have the following defects and shortcomings:
1) Remote sensing images from multiple sources often exhibit heterogeneity, redundancy and complementarity, yet existing models are not designed around the characteristics of the different modalities; the fusion strategy simply takes a weighted average, so the algorithm can hardly reach the optimum.
2) Multi-source remote sensing images include electro-optical images (such as hyperspectral, multispectral, panchromatic and infrared) and SAR images, and it is difficult to guarantee that images from different sources are acquired simultaneously around the clock and in all weather (for example, electro-optical images are hard to acquire in cloudy or rainy weather). Existing methods do not consider the temporary loss of some modal information; when the modal information is incomplete, the final labeling performance degrades, which reduces the robustness of the model.
3) Both the existing early fusion and late fusion are based on the traditional fully convolutional network (FCN) approach, and the local features produced by the inherent characteristics of the convolutional network can cause misclassification, since the contextual relations among the local features are ignored.
Disclosure of Invention
The invention aims to provide a semantic annotation method and a semantic annotation system based on self-adaptive multi-mode remote sensing image fusion so as to improve the accuracy of image context information fusion annotation.
In order to achieve the aim, the invention provides a semantic annotation method based on self-adaptive multi-mode remote sensing image fusion, which comprises the following steps:
step S1: acquiring remote sensing images of a plurality of modalities;
step S2: respectively carrying out feature extraction processing on the remote sensing images of a plurality of modes to obtain output feature maps of the plurality of modes;
step S3: splicing the output characteristic graphs of the plurality of modes to obtain a spliced characteristic graph set;
step S4: based on the splicing characteristic diagram set, carrying out random inactivation treatment on the channel to obtain a random inactivation characteristic diagram set;
step S5: respectively carrying out convolution processing on the random inactivation characteristic image sets to respectively obtain a first convolution characteristic image set and a second convolution characteristic image set;
step S6: carrying out global semantic annotation on each pixel point in the first convolution feature map set to obtain a position output feature map set;
step S7: performing channel semantic annotation on the channel dimension based on the second convolution feature map set to obtain a channel output feature map set;
Step S8: carrying out weighted fusion on the position output characteristic graph set and the dimension output characteristic graph set to obtain an initial fusion annotation image;
step S9: and performing convolution processing on the initial fusion labeling image to obtain a fusion labeling image.
Optionally, the step S4 specifically includes:
step S41: grading each channel based on pixel values in the spliced feature map set to obtain a grading score corresponding to each channel;
step S42: calculating a probability value corresponding to each channel according to the rating score corresponding to each channel;
step S43: calculating a channel reservation number according to M = N × wrs_ratio; where M is the channel reservation number, N is the total number of input channels, and wrs_ratio is a constant;
step S44: and selecting M channels corresponding to the maximum probability value.
Optionally, the step S6 specifically includes:
step S61: performing convolution processing on the first convolution feature map set respectively to obtain a third convolution feature map set and a fourth convolution feature map set;
step S62: reshaping and transposing the third convolution feature map set to obtain a first transposing feature map set;
step S63: reshaping the fourth convolution feature map set to obtain a first reshaping feature map set;
step S64: multiplying the first transposed feature map set and the first reshaping feature map set, and obtaining a spatial attention map set through a softmax layer;
step S65: performing convolution processing on the first convolution feature map set to obtain a fifth convolution feature map set;
step S66: reshaping the fifth convolution characteristic graph set to obtain a second reshaping characteristic graph set;
step S67: multiplying the second shaping feature map set and the spatial attention map set and reshaping to obtain a third shaping feature map set;
step S68: and carrying out pixel-level addition processing on the first convolution feature map set and the third shaping feature map set to obtain a position output feature map set.
Optionally, the step S7 specifically includes:
step S71: reshaping and transposing the second convolution feature map set to obtain a second transposed feature map set;
step S72: reshaping the second convolution characteristic graph set to obtain a fourth reshaping characteristic graph set;
step S73: multiplying the second transposed feature map set and the fourth shaping feature map set, and obtaining a channel attention map set through a softmax layer;
step S74: multiplying the second convolution characteristic diagram set and the channel attention diagram set and reshaping to obtain a fifth reshaping characteristic diagram set;
step S75: and performing dimension addition processing on the second convolution feature map set and the fifth shaping feature map set to obtain a dimension output feature map set.
The invention also provides a semantic annotation system based on self-adaptive multi-mode remote sensing image fusion, which comprises the following components:
the system comprises a plurality of characteristic extraction processing modules, a plurality of image processing modules and a plurality of image processing modules, wherein the plurality of characteristic extraction processing modules are used for respectively carrying out characteristic extraction processing on the obtained remote sensing images in a plurality of modes to obtain output characteristic graphs of the plurality of modes;
the splicing module is used for splicing the output characteristic graphs of the plurality of modes to obtain a spliced characteristic graph set;
the random inactivation processing module is used for carrying out random inactivation processing on the channel based on the splicing characteristic diagram set to obtain a random inactivation characteristic diagram set;
the first convolution layer is used for performing convolution processing on the random inactivation feature map set to obtain a first convolution feature map set;
the second convolution layer is used for performing convolution processing on the random inactivation characteristic image set to obtain a second convolution characteristic image set;
the position attention module is used for carrying out global semantic annotation on each pixel point in the first convolution feature map set to obtain a position output feature map set;
a channel attention module for performing channel semantic annotation on the channel dimension based on the second convolution feature map set to obtain a channel output feature map set;
The weighted fusion module is used for carrying out weighted fusion on the position output characteristic graph set and the dimension output characteristic graph set to obtain an initial fusion annotation image;
and the third convolution layer is used for performing convolution processing on the initial fusion labeling image to obtain a fusion labeling image.
Optionally, the random inactivation processing module specifically includes:
the ranking score determining unit is used for ranking each channel based on the pixel values in the splicing feature map set to obtain a ranking score corresponding to each channel;
the probability value determining unit is used for calculating the probability value corresponding to each channel according to the rating score corresponding to each channel;
a channel reservation number determining unit for calculating a channel reservation number according to M = N × wrs_ratio; where M is the channel reservation number, N is the total number of input channels, and wrs_ratio is a constant;
and the selecting unit is used for selecting the M channels corresponding to the maximum probability value.
Optionally, the position attention module specifically includes:
the first convolution processing unit is used for respectively carrying out convolution processing on the first convolution feature map set to obtain a third convolution feature map set and a fourth convolution feature map set;
the first transposition processing unit is used for reshaping and transposing the third convolution feature map set to obtain a first transposition feature map set;
the first reshaping processing unit is used for reshaping the fourth convolution feature map set to obtain a first reshaping feature map set;
the first multiplication processing unit is used for multiplying the first transposed feature map set and the first reshaping feature map set, and obtaining a spatial attention map set through a softmax layer;
the second convolution processing unit is used for performing convolution processing on the first convolution feature map set to obtain a fifth convolution feature map set;
the second reshaping processing unit is used for reshaping the fifth convolution feature map set to obtain a second reshaping feature map set;
the third reshaping processing unit is used for multiplying the second reshaping characteristic diagram set and the space attention diagram set and reshaping the spatial attention diagram set to obtain a third reshaping characteristic diagram set;
and the first addition processing unit is used for carrying out pixel-level addition processing on the first convolution characteristic diagram set and the third shaping characteristic diagram set to obtain a position output characteristic diagram set.
Optionally, the channel attention module specifically includes:
the second transposition processing unit is used for reshaping and transposing the second convolution characteristic graph set to obtain a second transposition characteristic graph set;
the fourth reshaping processing unit is used for reshaping the second convolution feature map set to obtain a fourth reshaping feature map set;
the second multiplication processing unit is used for multiplying the second transposed feature map set and the fourth shaping feature map set, and obtaining a channel attention map set through a softmax layer;
a fifth reshaping processing unit, configured to multiply the second convolution feature map set and the channel attention map set and perform reshaping processing to obtain a fifth reshaping feature map set;
and the second addition processing unit is used for carrying out dimension addition processing on the second convolution characteristic diagram set and the fifth shaping characteristic diagram set to obtain a dimension output characteristic diagram set.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention relates to a semantic annotation method and a semantic annotation system based on self-adaptive multi-mode remote sensing image fusion, which comprises the steps of firstly, sequentially carrying out feature extraction and splicing processing on remote sensing images of multiple modes; secondly, performing random inactivation treatment on the channel based on the splicing characteristic diagram set to obtain a random inactivation characteristic diagram set; carrying out global semantic annotation based on each pixel point in the first convolution feature map set again; then, performing channel semantic annotation on the channel dimension based on the second convolution feature map set; and finally, performing weighted fusion and convolution processing on the labeled position output characteristic graph set and the labeled dimension output characteristic graph set to obtain a fused labeled image. The method randomly inactivates the input multi-modal channels to simulate the condition of partial modal loss under the actual condition, and improves the generalization capability and robustness of the model. In addition, the invention combines the channel semantic annotation with the global semantic annotation, thereby improving the accuracy of the fusion annotation of the context information of the image.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a semantic annotation method based on self-adaptive multi-mode remote sensing image fusion of the invention;
FIG. 2 is a flow chart of global semantic annotation according to the present invention;
FIG. 3 is a flow chart of channel semantic annotation according to the present invention;
FIG. 4 is a structural diagram of a semantic annotation system based on self-adaptive multi-mode remote sensing image fusion.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a semantic annotation method and a semantic annotation system based on self-adaptive multi-mode remote sensing image fusion so as to improve the accuracy of image context information fusion annotation.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the invention provides a semantic annotation method based on self-adaptive multi-modal remote sensing image fusion, which comprises the following steps:
step S1: remote sensing images of multiple modalities are acquired.
Step S2: and respectively carrying out feature extraction processing on the remote sensing images in the plurality of modes to obtain output feature maps in the plurality of modes.
Step S3: and splicing the output characteristic graphs of the plurality of modes to obtain a spliced characteristic graph set. The spliced feature map set comprises a plurality of spliced feature maps.
Step S4: and carrying out random inactivation treatment on the channel based on the splicing characteristic diagram set to obtain a random inactivation characteristic diagram set.
Step S5: and respectively carrying out convolution processing on the random inactivation characteristic image sets to respectively obtain a first convolution characteristic image set and a second convolution characteristic image set.
Step S6: and carrying out global semantic annotation on each pixel point in the first convolution feature map set to obtain a position output feature map set.
Step S7: and performing channel semantic annotation on the channel dimension based on the second convolution feature graph set to obtain a channel output feature graph set.
Step S8: and performing weighted fusion on the position output characteristic graph set and the dimension output characteristic graph set to obtain an initial fusion annotation image.
Step S9: and performing convolution processing on the initial fusion labeling image to obtain a fusion labeling image.
The individual steps are discussed in detail below:
step S1: acquiring remote sensing images of a plurality of modalities; the remote sensing images of the plurality of modalities include: a multispectral remote sensing image MS, a panchromatic remote sensing image PAN, a pan-sharpened multispectral remote sensing image PS-MS, a pan-sharpened RGB remote sensing image PS-RGB, and an SAR remote sensing image.
Step S4: based on the splicing feature map set, performing random inactivation treatment on the channel to obtain a random inactivation feature map set X', which specifically comprises the following steps:
step S41: ranking each channel based on pixel values in the spliced feature map set to obtain a ranking score corresponding to each channel, wherein the specific formula is as follows:
score_i = (1 / (W × H)) · Σ_{j=1}^{W} Σ_{k=1}^{H} x_i(j, k)
wherein score_i is the rating score of the i-th channel, W and H are the maximum width and height of the i-th channel, respectively, and x_i(j, k) is the pixel value of the i-th channel at width position j and height position k.
Step S42: calculating a probability value corresponding to each channel according to the rating score corresponding to each channel, wherein the specific formula is as follows:
key_i = r_i^(1 / score_i)
wherein r_i is a random number in (0, 1) generated by a random number generator, score_i is the rating score of the i-th channel, and key_i is the probability value corresponding to the i-th channel.
Step S43: according to M-Nx wrsratioCalculating a channel reservation number; where M is the channel reservation number, N is the total number of input channels, wrsratioIs a constant.
Step S44: the M channels corresponding to the maximum probability values are selected and the corresponding masks are set to 1.
Step S5: and respectively carrying out convolution processing on the random inactivation characteristic image sets to respectively obtain a first convolution characteristic image set A and a second convolution characteristic image set A'.
As shown in fig. 2, step S6 specifically includes:
step S61: performing convolution processing on the first convolution feature map set A respectively to obtain a third convolution feature map set B and a fourth convolution feature map set C;
where B, C ∈ ℝ^(N'×H×W); M = H × W denotes the number of pixels, H denotes the feature map height, W denotes the feature map width, and N' denotes the number of feature map channels output by the dropout module.
Step S62: reshaping and transposing the third convolution characteristic image set B to obtain a first transposing characteristic image set U;
Figure BDA0002974264740000083
the dimensions are represented.
Step S63: reshaping the fourth convolution characteristic diagram set C to obtain a first reshaping characteristic diagram set I;
Figure BDA0002974264740000084
the dimensions are represented.
Step S64: multiplying the first transfer feature map set U with the first shaping feature map set I, obtaining a space attention map set S through a softmax layer,
Figure BDA0002974264740000085
the dimensions are represented.
Step S65: performing convolution processing on the first convolution characteristic image set A to obtain a fifth convolution characteristic image set D,
Figure BDA0002974264740000086
the dimensions are represented.
Step S66: reshaping the fifth convolution characteristic map set D to obtain a second reshaping characteristic map set Q,
Figure BDA0002974264740000087
the dimensions are represented.
Step S67: combining the second shaping feature map set Q with spaceNote that the set S is multiplied and reshaped, a third set R of reshape features is obtained,
Figure BDA0002974264740000091
the dimensions are represented.
Step S68: carrying out pixel-level addition processing on the first convolution characteristic diagram set A and the third shaping characteristic diagram set R to obtain a position output characteristic diagram set E,
Figure BDA0002974264740000092
the dimensions are represented.
The output of the position attention module at the j-th position is calculated by the following formula:

E_j = α · Σ_{i=1}^{M} (s_ji · D_i) + A_j

wherein E_j denotes the feature of the position output feature map set at the j-th position, α denotes a parameter value learned by training the network on semantic annotation data, M denotes the number of pixels, s_ji indicates the degree of correlation between the i-th position and the j-th position, D_i denotes the feature of the fifth convolution feature map set at the i-th position, and A_j denotes the feature of the first convolution feature map set at the j-th position.

s_ji = exp(B_i · C_j) / Σ_{i=1}^{M} exp(B_i · C_j)

wherein B_i denotes the N'-dimensional vector at the i-th position of the reshaped third convolution feature map set B, C_j denotes the N'-dimensional vector at the j-th position of the reshaped fourth convolution feature map set C, and M indicates the number of pixels.
The position output feature map set produced in this way has a globally context-aware semantic receptive field and selectively aggregates contextual semantic information according to the spatial attention map set S.
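A minimal PyTorch sketch of the position attention module (steps S61 to S68) is given below, following the dimensions reconstructed above. The 1×1 convolutions, the zero initialization of the learned weight α, and the explicit batch dimension are assumptions of this sketch rather than details stated in the patent text.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Sketch of the position attention module (steps S61-S68)."""

    def __init__(self, channels):
        super().__init__()
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=1)  # S61: A -> B
        self.conv_c = nn.Conv2d(channels, channels, kernel_size=1)  # S61: A -> C
        self.conv_d = nn.Conv2d(channels, channels, kernel_size=1)  # S65: A -> D
        self.alpha = nn.Parameter(torch.zeros(1))  # α, learned during training
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):
        # a: first convolution feature map set A, shape (batch, N', H, W)
        n, c, h, w = a.shape
        m = h * w
        u = self.conv_b(a).view(n, c, m).permute(0, 2, 1)      # S62: U, shape (M, N')
        i_map = self.conv_c(a).view(n, c, m)                   # S63: I, shape (N', M)
        s = self.softmax(torch.bmm(u, i_map))                  # S64: spatial attention S, (M, M)
        q = self.conv_d(a).view(n, c, m)                       # S65/S66: Q, shape (N', M)
        r = torch.bmm(q, s.permute(0, 2, 1)).view(n, c, h, w)  # S67: R, back to (N', H, W)
        return self.alpha * r + a                              # S68: E = α·R + A
```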
As shown in fig. 3, step S7 specifically includes:
step S71: reforming and transposing the second convolution characteristic diagram set A 'to obtain a second transposed characteristic diagram set B',
Figure BDA0002974264740000095
the dimensions are represented.
Step S72: reforming the second convolution characteristic diagram set A 'to obtain a fourth shaping characteristic diagram set C',
Figure BDA0002974264740000096
the dimension is expressed, and M × W represents the number of pixels.
Step S73: multiplying the second feature map set B 'and the fourth shaping feature map set C', obtaining a channel attention map set X through the softmax layer,
Figure BDA0002974264740000097
the dimensions are represented.
Step S74: multiplying and reshaping the second convolution characteristic diagram set A 'and the channel attention diagram set X to obtain a fifth reshaping characteristic diagram set D';
Figure BDA0002974264740000098
the dimensions are represented.
Step S75: performing dimension addition processing on the second convolution characteristic diagram set A ' and the fifth shaping characteristic diagram set D ' to obtain a dimension output characteristic diagram set E ',
Figure BDA0002974264740000101
the dimensions are represented.
The output corresponding to the j-th channel is calculated by the following formula:

E'_j = β · Σ_{i=1}^{N'} (x_ji · A'_i) + A'_j

wherein A'_i denotes the i-th channel feature map, A'_j denotes the j-th channel feature map, E'_j denotes the output corresponding to the j-th channel, β denotes a parameter value learned by training the network on semantic annotation data, and N' denotes the number of channels after Dropout.
x_ji = exp(C'_i · C'_j) / Σ_{i=1}^{N'} exp(C'_i · C'_j)

wherein x_ji denotes the influence of the i-th channel on the j-th channel, C'_i denotes the 1×M vector of the i-th channel after reshaping, C'_j denotes the M×1 vector of the j-th channel after reshaping, and i, j = 1, 2, …, N'.
The dimension output feature map set E' finally output is a weighted fusion over all channels, which helps improve the discriminability of cross-channel features.
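Similarly, steps S71 to S75 (the channel attention module) can be sketched as follows; the zero initialization of β and the batch handling are again assumptions of the sketch, and no extra normalization of the channel energy matrix is applied here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention module (steps S71-S75)."""

    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # β, learned during training
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):
        # a: second convolution feature map set A', shape (batch, N', H, W)
        n, c, h, w = a.shape
        m = h * w
        c_r = a.view(n, c, m)                   # S72: C', shape (N', M)
        b_t = c_r.permute(0, 2, 1)              # S71: B', shape (M, N')
        x = self.softmax(torch.bmm(c_r, b_t))   # S73: channel attention X, (N', N'); x[j, i] = x_ji
        d = torch.bmm(x, c_r).view(n, c, h, w)  # S74: D', reshaped back to (N', H, W)
        return self.beta * d + a                # S75: E' = β·D' + A'
```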
Because the position attention and the channel attention capture the interdependencies between different positions and between channel maps, respectively, the ability of the feature maps to characterize multi-modal semantics is effectively enhanced.
As shown in fig. 4, the present invention further provides a semantic annotation system based on adaptive multi-modal remote sensing image fusion, wherein the system comprises:
and the plurality of feature extraction processing modules are used for respectively carrying out feature extraction processing on the obtained remote sensing images of the plurality of modalities to obtain output feature maps of the plurality of modalities.
And the splicing module is used for splicing the output characteristic graphs of the plurality of modes to obtain a spliced characteristic graph set.
And the random inactivation processing module is used for carrying out random inactivation processing on the channels based on the splicing characteristic diagram set to obtain a random inactivation characteristic diagram set.
And the first convolution layer is used for performing convolution processing on the random inactivation feature map set to obtain a first convolution feature map set.
And the second convolution layer is used for performing convolution processing on the random inactivation characteristic map set to obtain a second convolution characteristic map set.
And the position attention module is used for carrying out global semantic annotation on each pixel point in the first convolution feature map set to obtain a position output feature map set.
And the channel attention module is used for carrying out channel semantic annotation on the channel dimension based on the second convolution feature map set to obtain a channel output feature map set.
And the weighted fusion module is used for carrying out weighted fusion on the position output characteristic graph set and the dimension output characteristic graph set to obtain an initial fusion annotation image.
And the third convolution layer is used for performing convolution processing on the initial fusion labeling image to obtain a fusion labeling image.
As an optional implementation manner, the random inactivation processing module of the present invention specifically includes:
and the rating score determining unit is used for rating each channel based on the pixel values in the splicing feature map set to obtain the rating score corresponding to each channel.
And the probability value determining unit is used for calculating the probability value corresponding to each channel according to the rating score corresponding to each channel.
A channel reservation number determining unit for determining a channel reservation number according to M-Nx wrsratioCalculating a channel reservation number; where M is the channel reservation number, N is the total number of input channels, wrsratioIs a constant.
And the selecting unit is used for selecting the M channels corresponding to the maximum probability value.
In fig. 4, the random inactivation processing module is a Dropout module; the channels of the five modalities are randomly deactivated by the Dropout module at the modality level to simulate partial modality loss in practical situations. For ease of explanation of the Dropout module, the following conventions are made:
The output after splicing in fig. 4 (i.e., the input of the Dropout module) is denoted X = [x_1, x_2, …, x_N], and the output of the Dropout module is denoted X' = [x'_1, x'_2, …, x'_{N'}], where N denotes the total number of input channels of the Dropout module, N' denotes the total number of channels of the modal feature maps after Dropout, x_i denotes the feature map of the i-th channel input to the Dropout module, and x'_i denotes the feature map of the i-th channel output by the Dropout module. In most cases X = X' (i.e., N = N' and x_i = x'_i for i = 1, 2, …, N). The specific steps are as follows:
a) channel rating
When the spliced feature map enters the Dropout module, channel rating is first performed, i.e., a score is assigned to each channel. This step is completed through GAP (global average pooling); for each channel i, the corresponding score is:

score_i = (1 / (W × H)) · Σ_{j=1}^{W} Σ_{k=1}^{H} x_i(j, k)

wherein score_i is the rating score of the i-th channel, W and H are the maximum width and height of the i-th channel, respectively, and x_i(j, k) is the pixel value of the i-th channel at width position j and height position k.
b) Channel selection
A 0-1 binary mask is constructed from the scores by the weighted random selection (WRS) method; the algorithm flow is given in Table 1 of the original publication (the table is rendered there as an image and is not reproduced here). A random number generator produces a random number r_i in (0, 1) for each channel, from which

key_i = r_i^(1 / score_i)

is obtained; the channels with the M largest key_i values are retained, and the corresponding mask_i is set to 1.
c) Random selection
Channel selection is carried out on the basis of the channel rating, and a 0-1 binary mask is constructed. When mask_i of the i-th layer is 1/0, the corresponding channel is selected/unselected. Before the binary mask is constructed, a retention probability p_i is calculated and assigned to each channel (the expression for p_i appears as an image in the original publication and is not reproduced here). Since P(mask_i = 1) = p_i, the higher a channel's rating score, the more likely that channel is to be retained. Owing to the heterogeneous differences between modalities, after passing through the preceding neural network some modality channels may be assigned much larger scores than others; if channels were selected purely by score, the selected channel sequence could remain the same at every forward pass for every image. By adding the random number generator, a channel whose mask_i would otherwise be set to 1, i.e., the corresponding x_i, may still go unselected.
As an optional implementation manner, the location attention module specifically includes:
and the first convolution processing unit is used for respectively carrying out convolution processing on the first convolution feature map set to obtain a third convolution feature map set and a fourth convolution feature map set.
And the first transposition processing unit is used for reshaping and transposing the third convolution feature map set to obtain a first transposition feature map set.
And the first reshaping processing unit is used for reshaping the fourth convolution feature map set to obtain a first reshaping feature map set.
And the first multiplication processing unit is used for multiplying the first transfer characteristic map set and the first shaping characteristic map set, and obtaining a spatial attention map set through the softmax layer.
And the second convolution processing unit is used for performing convolution processing on the first convolution feature map set to obtain a fifth convolution feature map set.
And the second reshaping processing unit is used for reshaping the fifth convolution feature map set to obtain a second reshaping feature map set.
And the third reshaping processing unit is used for multiplying the second reshaping characteristic diagram set and the spatial attention diagram set and performing reshaping processing to obtain a third reshaping characteristic diagram set.
And the first addition processing unit is used for carrying out pixel-level addition processing on the first convolution characteristic diagram set and the third shaping characteristic diagram set to obtain a position output characteristic diagram set.
As an optional implementation manner, the channel attention module of the present invention specifically includes:
and the second transposition processing unit is used for reshaping and transposing the second convolution characteristic graph set to obtain a second transposition characteristic graph set.
And the fourth reshaping processing unit is used for reshaping the second convolution feature map set to obtain a fourth reshaping feature map set.
And the second multiplication processing unit is used for multiplying the second transposed feature map set and the fourth shaping feature map set, and obtaining a channel attention map set through the softmax layer.
And the fifth reshaping processing unit is used for multiplying the second convolution characteristic diagram set and the channel attention diagram set and reshaping the second convolution characteristic diagram set and the channel attention diagram set to obtain a fifth reshaping characteristic diagram set.
And the second addition processing unit is used for carrying out dimension addition processing on the second convolution characteristic diagram set and the fifth shaping characteristic diagram set to obtain a dimension output characteristic diagram set.
Compared with the prior art, the technical scheme disclosed by the invention has the following advantages:
1. The random inactivation processing module of the invention provides a modality-level Dropout mechanism that simulates missing modalities: the input multi-modal channels are randomly deactivated to simulate partial modality loss in practical situations, which improves the generalization capability and robustness of the model.
2. For the multi-modal heterogeneous fusion mechanism, the system is provided with a position attention module and a channel attention module. The position attention module captures global semantic information (the contextual relations among pixels) to attend to image context information, while the channel attention module captures cross-modal heterogeneous information (the relations among different modalities), improving the adaptive fusion benefit of remote sensing images of different modalities and the feature weighting over the channel and image dimensions. To make better use of the global semantic information output by the two attention modules, the outputs of the two modules are passed through a convolution layer and then combined by weighted feature fusion, which improves the fusion capability of semantic information labeling. A sketch of this fusion head is given below.
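The following sketch composes the two attention sketches above into a fusion head that performs the weighted fusion and final convolution of steps S8 and S9. It assumes the PositionAttention and ChannelAttention classes from the earlier sketches are in scope; the learnable scalar fusion weights, the 3×3 and 1×1 convolutions, and the class name AttentionFusionHead are assumptions of the example, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class AttentionFusionHead(nn.Module):
    """Sketch of steps S5-S9: two parallel convolutions, position/channel attention,
    weighted fusion, and a final convolution producing the fusion annotation map."""

    def __init__(self, channels, num_classes):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # first convolution layer
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # second convolution layer
        self.pam = PositionAttention(channels)        # sketch defined above
        self.cam = ChannelAttention()                 # sketch defined above
        self.w_pos = nn.Parameter(torch.tensor(0.5))  # fusion weights (assumed learnable)
        self.w_ch = nn.Parameter(torch.tensor(0.5))
        self.conv_out = nn.Conv2d(channels, num_classes, kernel_size=1)       # third convolution layer

    def forward(self, x):
        # x: random inactivation feature map set, shape (batch, N', H, W)
        e_pos = self.pam(self.conv1(x))  # position output feature map set (step S6)
        e_ch = self.cam(self.conv2(x))   # dimension/channel output feature map set (step S7)
        fused = self.w_pos * e_pos + self.w_ch * e_ch  # step S8: weighted fusion
        return self.conv_out(fused)                    # step S9: fusion annotation map
```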
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A semantic annotation method based on self-adaptive multi-mode remote sensing image fusion is characterized by comprising the following steps:
step S1: acquiring remote sensing images of a plurality of modalities;
step S2: respectively carrying out feature extraction processing on the remote sensing images of a plurality of modes to obtain output feature maps of the plurality of modes;
step S3: splicing the output characteristic graphs of the plurality of modes to obtain a spliced characteristic graph set;
step S4: based on the splicing characteristic diagram set, carrying out random inactivation treatment on the channel to obtain a random inactivation characteristic diagram set;
step S5: respectively carrying out convolution processing on the random inactivation characteristic image sets to respectively obtain a first convolution characteristic image set and a second convolution characteristic image set;
step S6: carrying out global semantic annotation on each pixel point in the first convolution feature map set to obtain a position output feature map set;
step S7: performing channel semantic annotation on the channel dimension based on the second convolution feature map set to obtain a channel output feature map set;
step S8: carrying out weighted fusion on the position output characteristic graph set and the dimension output characteristic graph set to obtain an initial fusion annotation image;
step S9: and performing convolution processing on the initial fusion labeling image to obtain a fusion labeling image.
2. The method for semantic annotation based on fusion of the adaptive multi-modal remote sensing images according to claim 1, wherein the step S4 specifically comprises:
step S41: grading each channel based on pixel values in the spliced feature map set to obtain a grading score corresponding to each channel;
step S42: calculating a probability value corresponding to each channel according to the rating score corresponding to each channel;
step S43: calculating a channel reservation number according to M = N × wrs_ratio; where M is the channel reservation number, N is the total number of input channels, and wrs_ratio is a constant;
step S44: and selecting M channels corresponding to the maximum probability value.
3. The method for semantic annotation based on fusion of the adaptive multi-modal remote sensing images according to claim 1, wherein the step S6 specifically comprises:
step S61: performing convolution processing on the first convolution feature map set respectively to obtain a third convolution feature map set and a fourth convolution feature map set;
step S62: reshaping and transposing the third convolution feature map set to obtain a first transposing feature map set;
step S63: reshaping the fourth convolution feature map set to obtain a first reshaping feature map set;
step S64: multiplying the first transposed feature map set and the first reshaping feature map set, and obtaining a spatial attention map set through a softmax layer;
step S65: performing convolution processing on the first convolution feature map set to obtain a fifth convolution feature map set;
step S66: reshaping the fifth convolution characteristic graph set to obtain a second reshaping characteristic graph set;
step S67: multiplying the second shaping feature map set and the spatial attention map set and reshaping to obtain a third shaping feature map set;
step S68: and carrying out pixel-level addition processing on the first convolution feature map set and the third shaping feature map set to obtain a position output feature map set.
4. The method for semantic annotation based on fusion of the adaptive multi-modal remote sensing images according to claim 1, wherein the step S7 specifically comprises:
step S71: reshaping and transposing the second convolution feature map set to obtain a second transposed feature map set;
step S72: reshaping the second convolution characteristic graph set to obtain a fourth reshaping characteristic graph set;
step S73: multiplying the second transposed feature map set and the fourth shaping feature map set, and obtaining a channel attention map set through a softmax layer;
step S74: multiplying the second convolution characteristic diagram set and the channel attention diagram set and reshaping to obtain a fifth reshaping characteristic diagram set;
step S75: and performing dimension addition processing on the second convolution feature map set and the fifth shaping feature map set to obtain a dimension output feature map set.
5. A semantic annotation system based on self-adaptive multi-mode remote sensing image fusion is characterized by comprising:
the system comprises a plurality of characteristic extraction processing modules, a plurality of image processing modules and a plurality of image processing modules, wherein the plurality of characteristic extraction processing modules are used for respectively carrying out characteristic extraction processing on the obtained remote sensing images in a plurality of modes to obtain output characteristic graphs of the plurality of modes;
the splicing module is used for splicing the output characteristic graphs of the plurality of modes to obtain a spliced characteristic graph set;
the random inactivation processing module is used for carrying out random inactivation processing on the channel based on the splicing characteristic diagram set to obtain a random inactivation characteristic diagram set;
the first convolution layer is used for performing convolution processing on the random inactivation feature map set to obtain a first convolution feature map set;
the second convolution layer is used for performing convolution processing on the random inactivation characteristic image set to obtain a second convolution characteristic image set;
the position attention module is used for carrying out global semantic annotation on each pixel point in the first convolution feature map set to obtain a position output feature map set;
the channel attention module is used for carrying out channel semantic annotation on channel dimensions based on the second convolution feature map set to obtain a channel output feature map set;
the weighted fusion module is used for carrying out weighted fusion on the position output characteristic graph set and the dimension output characteristic graph set to obtain an initial fusion annotation image;
and the third convolution layer is used for performing convolution processing on the initial fusion labeling image to obtain a fusion labeling image.
6. The semantic annotation system based on self-adaptive multi-mode remote sensing image fusion according to claim 5, wherein the random inactivation processing module specifically comprises:
the ranking score determining unit is used for ranking each channel based on the pixel values in the splicing feature map set to obtain a ranking score corresponding to each channel;
the probability value determining unit is used for calculating the probability value corresponding to each channel according to the rating score corresponding to each channel;
a channel reservation number determining unit for calculating a channel reservation number according to M = N × wrs_ratio; where M is the channel reservation number, N is the total number of input channels, and wrs_ratio is a constant;
and the selecting unit is used for selecting the M channels corresponding to the maximum probability value.
7. The system for semantic annotation based on adaptive multi-modal remote sensing image fusion according to claim 5, wherein the location attention module specifically comprises:
the first convolution processing unit is used for respectively carrying out convolution processing on the first convolution feature map set to obtain a third convolution feature map set and a fourth convolution feature map set;
the first transposition processing unit is used for reshaping and transposing the third convolution feature map set to obtain a first transposition feature map set;
the first reshaping processing unit is used for reshaping the fourth convolution feature map set to obtain a first reshaping feature map set;
the first multiplication processing unit is used for multiplying the first transposed feature map set and the first reshaping feature map set, and obtaining a spatial attention map set through a softmax layer;
the second convolution processing unit is used for performing convolution processing on the first convolution feature map set to obtain a fifth convolution feature map set;
the second reshaping processing unit is used for reshaping the fifth convolution feature map set to obtain a second reshaping feature map set;
the third reshaping processing unit is used for multiplying the second reshaping characteristic diagram set and the space attention diagram set and reshaping the spatial attention diagram set to obtain a third reshaping characteristic diagram set;
and the first addition processing unit is used for carrying out pixel-level addition processing on the first convolution characteristic diagram set and the third shaping characteristic diagram set to obtain a position output characteristic diagram set.
8. The semantic annotation system based on self-adaptive multi-mode remote sensing image fusion according to claim 5, wherein the channel attention module specifically comprises:
the second transposition processing unit is used for reshaping and transposing the second convolution characteristic graph set to obtain a second transposition characteristic graph set;
the fourth reshaping processing unit is used for reshaping the second convolution feature map set to obtain a fourth reshaping feature map set;
the second multiplication processing unit is used for multiplying the second transposed feature map set and the fourth shaping feature map set, and obtaining a channel attention map set through a softmax layer;
a fifth reshaping processing unit, configured to multiply the second convolution feature map set and the channel attention map set and perform reshaping processing to obtain a fifth reshaping feature map set;
and the second addition processing unit is used for carrying out dimension addition processing on the second convolution characteristic diagram set and the fifth shaping characteristic diagram set to obtain a dimension output characteristic diagram set.
CN202110270709.3A 2021-03-12 2021-03-12 Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion Active CN112949732B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110270709.3A CN112949732B (en) 2021-03-12 2021-03-12 Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110270709.3A CN112949732B (en) 2021-03-12 2021-03-12 Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion

Publications (2)

Publication Number Publication Date
CN112949732A true CN112949732A (en) 2021-06-11
CN112949732B CN112949732B (en) 2022-04-22

Family

ID=76229690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110270709.3A Active CN112949732B (en) 2021-03-12 2021-03-12 Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion

Country Status (1)

Country Link
CN (1) CN112949732B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993236A (en) * 2017-11-27 2018-05-04 上海交通大学 A kind of method and platform of multi-modality images processing
CN108537192A (en) * 2018-04-17 2018-09-14 福州大学 A kind of remote sensing image ground mulching sorting technique based on full convolutional network
US20190087726A1 (en) * 2017-08-30 2019-03-21 The Board Of Regents Of The University Of Texas System Hypercomplex deep learning methods, architectures, and apparatus for multimodal small, medium, and large-scale data representation, analysis, and applications
CN110287354A (en) * 2019-05-16 2019-09-27 中国科学院西安光学精密机械研究所 A kind of high score remote sensing images semantic understanding method based on multi-modal neural network
CN110781895A (en) * 2019-10-10 2020-02-11 湖北工业大学 Image semantic segmentation method based on convolutional neural network
CN111340047A (en) * 2020-02-28 2020-06-26 江苏实达迪美数据处理有限公司 Image semantic segmentation method and system based on multi-scale feature and foreground and background contrast
CN111461130A (en) * 2020-04-10 2020-07-28 视研智能科技(广州)有限公司 High-precision image semantic segmentation algorithm model and segmentation method
US20200334819A1 (en) * 2018-09-30 2020-10-22 Boe Technology Group Co., Ltd. Image segmentation apparatus, method and relevant computing device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190087726A1 (en) * 2017-08-30 2019-03-21 The Board Of Regents Of The University Of Texas System Hypercomplex deep learning methods, architectures, and apparatus for multimodal small, medium, and large-scale data representation, analysis, and applications
CN107993236A (en) * 2017-11-27 2018-05-04 上海交通大学 A kind of method and platform of multi-modality images processing
CN108537192A (en) * 2018-04-17 2018-09-14 福州大学 A kind of remote sensing image ground mulching sorting technique based on full convolutional network
US20200334819A1 (en) * 2018-09-30 2020-10-22 Boe Technology Group Co., Ltd. Image segmentation apparatus, method and relevant computing device
CN110287354A (en) * 2019-05-16 2019-09-27 中国科学院西安光学精密机械研究所 A kind of high score remote sensing images semantic understanding method based on multi-modal neural network
CN110781895A (en) * 2019-10-10 2020-02-11 湖北工业大学 Image semantic segmentation method based on convolutional neural network
CN111340047A (en) * 2020-02-28 2020-06-26 江苏实达迪美数据处理有限公司 Image semantic segmentation method and system based on multi-scale feature and foreground and background contrast
CN111461130A (en) * 2020-04-10 2020-07-28 视研智能科技(广州)有限公司 High-precision image semantic segmentation algorithm model and segmentation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHENG PENG et al.: "Densely Based Multi-Scale and Multi-Modal Fully Convolutional Networks for High-Resolution Remote-Sensing Image Semantic Segmentation", IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing *
王子锋: "Annotation algorithm for high-resolution remote sensing images based on active clustering", China Masters' Theses Full-text Database (Basic Sciences Series) *

Also Published As

Publication number Publication date
CN112949732B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN111783705B (en) Character recognition method and system based on attention mechanism
CN109919174A (en) A kind of character recognition method based on gate cascade attention mechanism
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN114821342B (en) Remote sensing image road extraction method and system
CN113240683B (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN112991350A (en) RGB-T image semantic segmentation method based on modal difference reduction
CN113284100A (en) Image quality evaluation method based on recovery image to mixed domain attention mechanism
CN115620010A (en) Semantic segmentation method for RGB-T bimodal feature fusion
CN115457043A (en) Image segmentation network based on overlapped self-attention deformer framework U-shaped network
CN112418235A (en) Point cloud semantic segmentation method based on expansion nearest neighbor feature enhancement
CN115965789A (en) Scene perception attention-based remote sensing image semantic segmentation method
Li et al. Maskformer with improved encoder-decoder module for semantic segmentation of fine-resolution remote sensing images
CN113888399A (en) Face age synthesis method based on style fusion and domain selection structure
Jiang et al. Cross-level reinforced attention network for person re-identification
CN117422978A (en) Grounding visual question-answering method based on dynamic two-stage visual information fusion
CN112949732B (en) Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion
CN116543338A (en) Student classroom behavior detection method based on gaze target estimation
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
CN116152263A (en) CM-MLP network-based medical image segmentation method
CN115578638A (en) Method for constructing multi-level feature interactive defogging network based on U-Net
CN115131563A (en) Interactive image segmentation method based on weak supervised learning
CN114331894A (en) Face image restoration method based on potential feature reconstruction and mask perception
CN114693951A (en) RGB-D significance target detection method based on global context information exploration
CN115936073B (en) Language-oriented convolutional neural network and visual question-answering method
CN117392392B (en) Rubber cutting line identification and generation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant