CN112949732A - Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion - Google Patents

Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion

Info

Publication number
CN112949732A
Authority
CN
China
Prior art keywords
feature map
convolution
map set
channel
reshaping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110270709.3A
Other languages
Chinese (zh)
Other versions
CN112949732B (en)
Inventor
刘瑜
谭大宁
丁自然
姚力波
徐从安
孙顺
姜乔文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Naval Aeronautical University
Original Assignee
Naval Aeronautical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Naval Aeronautical University filed Critical Naval Aeronautical University
Priority to CN202110270709.3A priority Critical patent/CN112949732B/en
Publication of CN112949732A publication Critical patent/CN112949732A/en
Application granted granted Critical
Publication of CN112949732B publication Critical patent/CN112949732B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G – PHYSICS
    • G06 – COMPUTING; CALCULATING OR COUNTING
    • G06F – ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 – Pattern recognition
    • G06F 18/20 – Analysing
    • G06F 18/25 – Fusion techniques
    • G06F 18/253 – Fusion techniques of extracted features
    • G – PHYSICS
    • G06 – COMPUTING; CALCULATING OR COUNTING
    • G06F – ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 – Pattern recognition
    • G06F 18/20 – Analysing
    • G06F 18/24 – Classification techniques
    • G06F 18/241 – Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 – Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G – PHYSICS
    • G06 – COMPUTING; CALCULATING OR COUNTING
    • G06N – COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 – Computing arrangements based on biological models
    • G06N 3/02 – Neural networks
    • G06N 3/04 – Architecture, e.g. interconnection topology
    • G06N 3/045 – Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a semantic annotation method and system based on self-adaptive multi-modal remote sensing image fusion. First, feature extraction and splicing are carried out in sequence on remote sensing images of multiple modalities. Second, random inactivation (dropout) is applied to the channels of the spliced feature map set to obtain a random inactivation feature map set, which is then convolved to obtain a first and a second convolution feature map set. Next, global semantic annotation is carried out on each pixel in the first convolution feature map set, and channel semantic annotation is carried out on the channel dimension based on the second convolution feature map set. Finally, the annotated position output feature map set and the annotated dimension output feature map set are fused by weighting and convolved to obtain a fused annotation image. The method randomly deactivates the input multi-modal channels to simulate partial modality loss in practical situations, improving the generalization capability and robustness of the model. In addition, the invention combines channel semantic annotation with global semantic annotation, thereby improving the accuracy of fusion annotation of image context information.

Description

Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion
Technical Field
The invention relates to the technical field of semantic annotation, in particular to a semantic annotation method and a semantic annotation system based on self-adaptive multi-mode remote sensing image fusion.
Background
At present, semantic annotation of high-resolution multi-modal remote sensing data is divided into early fusion (e.g., FuseNet) and late fusion (e.g., SegNet-RC) according to when the multi-modal fusion takes place.
In early fusion, several encoders jointly encode the multi-source remote sensing information, and the encoder outputs are added after each convolution block. A decoder then resamples the encoded joint representation back into the label probability space. However, in this architecture the auxiliary-branch data are treated as second-class data, i.e., the branches are not completely symmetric. Furthermore, only the indices of the main branch are used during upsampling. There is therefore a conceptual imbalance in the way the multiple sources are handled: one must choose which source is the primary source and which is the auxiliary data.
Late fusion is similar to early fusion, except that several encoders encode the multi-source remote sensing information separately, the branches are then decoded separately, and the different branches are fused after decoding. This improves the accuracy of semantic annotation, but it requires the multi-source remote sensing images to be non-heterogeneous, so the applicability of the model is limited, especially when the source data are electro-optical and radar images.
Current semantic annotation methods for multi-source remote sensing image fusion mainly have the following defects and shortcomings:
1) Remote sensing images from multiple sources often exhibit heterogeneity, redundancy and complementarity, yet existing models are not designed around the characteristics of the different modalities; the fusion strategy simply takes a weighted average, so the algorithm can hardly reach the optimum.
2) Multi-source remote sensing images include electro-optical images (such as hyperspectral, multispectral, panchromatic and infrared) and SAR images, and it is difficult to guarantee that images from different sources are acquired simultaneously around the clock and in all weather (for example, electro-optical images are hard to acquire in cloudy or rainy weather). Existing methods do not consider the temporary loss of some modal information; when the modal information is incomplete, the final labeling performance degrades, which reduces the robustness of the model.
3) Both the existing early fusion and late fusion are based on the traditional fully convolutional network (FCN) approach, and the local features produced by the inherent characteristics of the convolutional network can cause misclassification, since the contextual relations among the local features are ignored.
Disclosure of Invention
The invention aims to provide a semantic annotation method and a semantic annotation system based on self-adaptive multi-mode remote sensing image fusion so as to improve the accuracy of image context information fusion annotation.
In order to achieve the aim, the invention provides a semantic annotation method based on self-adaptive multi-mode remote sensing image fusion, which comprises the following steps:
step S1: acquiring remote sensing images of a plurality of modalities;
step S2: respectively carrying out feature extraction processing on the remote sensing images of a plurality of modes to obtain output feature maps of the plurality of modes;
step S3: splicing the output characteristic graphs of the plurality of modes to obtain a spliced characteristic graph set;
step S4: based on the splicing characteristic diagram set, carrying out random inactivation treatment on the channel to obtain a random inactivation characteristic diagram set;
step S5: respectively carrying out convolution processing on the random inactivation characteristic image sets to respectively obtain a first convolution characteristic image set and a second convolution characteristic image set;
step S6: carrying out global semantic annotation on each pixel point in the first convolution feature map set to obtain a position output feature map set;
step S7: performing channel semantic annotation on the channel dimension based on the second convolution feature map set to obtain a channel output feature map set;
Step S8: carrying out weighted fusion on the position output characteristic graph set and the dimension output characteristic graph set to obtain an initial fusion annotation image;
step S9: and performing convolution processing on the initial fusion labeling image to obtain a fusion labeling image.
Optionally, the step S4 specifically includes:
step S41: grading each channel based on pixel values in the spliced feature map set to obtain a grading score corresponding to each channel;
step S42: calculating a probability value corresponding to each channel according to the rating score corresponding to each channel;
step S43: calculating a channel reservation number according to M = N × wrs_ratio; where M is the channel reservation number, N is the total number of input channels, and wrs_ratio is a constant;
step S44: and selecting M channels corresponding to the maximum probability value.
Optionally, the step S6 specifically includes:
step S61: performing convolution processing on the first convolution feature map set respectively to obtain a third convolution feature map set and a fourth convolution feature map set;
step S62: reshaping and transposing the third convolution feature map set to obtain a first transposing feature map set;
step S63: reshaping the fourth convolution feature map set to obtain a first reshaping feature map set;
step S64: multiplying the first transposed feature map set and the first reshaping feature map set, and obtaining a spatial attention map set through a softmax layer;
step S65: performing convolution processing on the first convolution feature map set to obtain a fifth convolution feature map set;
step S66: reshaping the fifth convolution characteristic graph set to obtain a second reshaping characteristic graph set;
step S67: multiplying the second shaping feature map set and the spatial attention map set and reshaping to obtain a third shaping feature map set;
step S68: and carrying out pixel-level addition processing on the first convolution feature map set and the third shaping feature map set to obtain a position output feature map set.
Optionally, the step S7 specifically includes:
step S71: reshaping and transposing the second convolution feature map set to obtain a second transposed feature map set;
step S72: reshaping the second convolution characteristic graph set to obtain a fourth reshaping characteristic graph set;
step S73: multiplying the second transposed feature map set and the fourth shaping feature map set, and obtaining a channel attention map set through a softmax layer;
step S74: multiplying the second convolution characteristic diagram set and the channel attention diagram set and reshaping to obtain a fifth reshaping characteristic diagram set;
step S75: and performing dimension addition processing on the second convolution feature map set and the fifth shaping feature map set to obtain a dimension output feature map set.
The invention also provides a semantic annotation system based on self-adaptive multi-mode remote sensing image fusion, which comprises the following components:
the system comprises a plurality of characteristic extraction processing modules, a plurality of image processing modules and a plurality of image processing modules, wherein the plurality of characteristic extraction processing modules are used for respectively carrying out characteristic extraction processing on the obtained remote sensing images in a plurality of modes to obtain output characteristic graphs of the plurality of modes;
the splicing module is used for splicing the output characteristic graphs of the plurality of modes to obtain a spliced characteristic graph set;
the random inactivation processing module is used for carrying out random inactivation processing on the channel based on the splicing characteristic diagram set to obtain a random inactivation characteristic diagram set;
the first convolution layer is used for performing convolution processing on the random inactivation feature map set to obtain a first convolution feature map set;
the second convolution layer is used for performing convolution processing on the random inactivation characteristic image set to obtain a second convolution characteristic image set;
the position attention module is used for carrying out global semantic annotation on each pixel point in the first convolution feature map set to obtain a position output feature map set;
a channel attention module for performing channel semantic annotation on the channel dimension based on the second convolution feature map set to obtain a channel output feature map set;
The weighted fusion module is used for carrying out weighted fusion on the position output characteristic graph set and the dimension output characteristic graph set to obtain an initial fusion annotation image;
and the third convolution layer is used for performing convolution processing on the initial fusion labeling image to obtain a fusion labeling image.
Optionally, the random inactivation processing module specifically includes:
the ranking score determining unit is used for ranking each channel based on the pixel values in the splicing feature map set to obtain a ranking score corresponding to each channel;
the probability value determining unit is used for calculating the probability value corresponding to each channel according to the rating score corresponding to each channel;
a channel reservation number determining unit for calculating a channel reservation number according to M = N × wrs_ratio; where M is the channel reservation number, N is the total number of input channels, and wrs_ratio is a constant;
and the selecting unit is used for selecting the M channels corresponding to the maximum probability value.
Optionally, the position attention module specifically includes:
the first convolution processing unit is used for respectively carrying out convolution processing on the first convolution feature map set to obtain a third convolution feature map set and a fourth convolution feature map set;
the first transposition processing unit is used for reshaping and transposing the third convolution feature map set to obtain a first transposition feature map set;
the first reshaping processing unit is used for reshaping the fourth convolution feature map set to obtain a first reshaping feature map set;
the first multiplication processing unit is used for multiplying the first transposed feature map set and the first reshaping feature map set, and obtaining a spatial attention map set through a softmax layer;
the second convolution processing unit is used for performing convolution processing on the first convolution feature map set to obtain a fifth convolution feature map set;
the second reshaping processing unit is used for reshaping the fifth convolution feature map set to obtain a second reshaping feature map set;
the third reshaping processing unit is used for multiplying the second reshaping characteristic diagram set and the space attention diagram set and reshaping the spatial attention diagram set to obtain a third reshaping characteristic diagram set;
and the first addition processing unit is used for carrying out pixel-level addition processing on the first convolution characteristic diagram set and the third shaping characteristic diagram set to obtain a position output characteristic diagram set.
Optionally, the channel attention module specifically includes:
the second transposition processing unit is used for reshaping and transposing the second convolution characteristic graph set to obtain a second transposition characteristic graph set;
the fourth reshaping processing unit is used for reshaping the second convolution feature map set to obtain a fourth reshaping feature map set;
the second multiplication processing unit is used for multiplying the second transposed feature map set and the fourth shaping feature map set, and obtaining a channel attention map set through a softmax layer;
a fifth reshaping processing unit, configured to multiply the second convolution feature map set and the channel attention map set and perform reshaping processing to obtain a fifth reshaping feature map set;
and the second addition processing unit is used for carrying out dimension addition processing on the second convolution characteristic diagram set and the fifth shaping characteristic diagram set to obtain a dimension output characteristic diagram set.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention relates to a semantic annotation method and a semantic annotation system based on self-adaptive multi-mode remote sensing image fusion, which comprises the steps of firstly, sequentially carrying out feature extraction and splicing processing on remote sensing images of multiple modes; secondly, performing random inactivation treatment on the channel based on the splicing characteristic diagram set to obtain a random inactivation characteristic diagram set; carrying out global semantic annotation based on each pixel point in the first convolution feature map set again; then, performing channel semantic annotation on the channel dimension based on the second convolution feature map set; and finally, performing weighted fusion and convolution processing on the labeled position output characteristic graph set and the labeled dimension output characteristic graph set to obtain a fused labeled image. The method randomly inactivates the input multi-modal channels to simulate the condition of partial modal loss under the actual condition, and improves the generalization capability and robustness of the model. In addition, the invention combines the channel semantic annotation with the global semantic annotation, thereby improving the accuracy of the fusion annotation of the context information of the image.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a semantic annotation method based on self-adaptive multi-mode remote sensing image fusion of the invention;
FIG. 2 is a flow chart of global semantic annotation according to the present invention;
FIG. 3 is a flow chart of channel semantic annotation according to the present invention;
FIG. 4 is a structural diagram of a semantic annotation system based on self-adaptive multi-mode remote sensing image fusion.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a semantic annotation method and a semantic annotation system based on self-adaptive multi-mode remote sensing image fusion so as to improve the accuracy of image context information fusion annotation.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the invention provides a semantic annotation method based on self-adaptive multi-modal remote sensing image fusion, which comprises the following steps:
step S1: remote sensing images of multiple modalities are acquired.
Step S2: and respectively carrying out feature extraction processing on the remote sensing images in the plurality of modes to obtain output feature maps in the plurality of modes.
Step S3: and splicing the output characteristic graphs of the plurality of modes to obtain a spliced characteristic graph set. The spliced feature map set comprises a plurality of spliced feature maps.
Step S4: and carrying out random inactivation treatment on the channel based on the splicing characteristic diagram set to obtain a random inactivation characteristic diagram set.
Step S5: and respectively carrying out convolution processing on the random inactivation characteristic image sets to respectively obtain a first convolution characteristic image set and a second convolution characteristic image set.
Step S6: and carrying out global semantic annotation on each pixel point in the first convolution feature map set to obtain a position output feature map set.
Step S7: and performing channel semantic annotation on the channel dimension based on the second convolution feature graph set to obtain a channel output feature graph set.
Step S8: and performing weighted fusion on the position output characteristic graph set and the dimension output characteristic graph set to obtain an initial fusion annotation image.
Step S9: and performing convolution processing on the initial fusion labeling image to obtain a fusion labeling image.
The individual steps are discussed in detail below:
step S1: acquiring remote sensing images of a plurality of modalities; the remote sensing images of the plurality of modalities include: a multispectral remote sensing image MS, a panchromatic remote sensing image PAN, a pan-sharpened multispectral remote sensing image PS-MS, a pan-sharpened RGB remote sensing image PS-RGB, and an SAR remote sensing image.
Step S4: based on the splicing feature map set, performing random inactivation treatment on the channel to obtain a random inactivation feature map set X', which specifically comprises the following steps:
step S41: ranking each channel based on pixel values in the spliced feature map set to obtain a ranking score corresponding to each channel, wherein the specific formula is as follows:
score_i = (1 / (W × H)) · Σ_{j=1}^{W} Σ_{k=1}^{H} x_i(j, k)
wherein score_i is the rating score of the i-th channel, W and H are the maximum width and height of the i-th channel, respectively, and x_i(j, k) is the pixel value of the i-th channel at width position j and height position k.
Step S42: calculating a probability value corresponding to each channel according to the rating score corresponding to each channel, wherein the specific formula is as follows:
key_i = r_i^(1 / score_i)
wherein r_i is a random number in (0, 1) generated by a random number generator, score_i is the rating score of the i-th channel, and key_i is the probability value corresponding to the i-th channel.
Step S43: according to M-Nx wrsratioCalculating a channel reservation number; where M is the channel reservation number, N is the total number of input channels, wrsratioIs a constant.
Step S44: the M channels corresponding to the maximum probability values are selected and the corresponding masks are set to 1.
Step S5: and respectively carrying out convolution processing on the random inactivation characteristic image sets to respectively obtain a first convolution characteristic image set A and a second convolution characteristic image set A'.
As shown in fig. 2, step S6 specifically includes:
step S61: performing convolution processing on the first convolution feature map set A respectively to obtain a third convolution feature map set B and a fourth convolution feature map set C;
where B, C ∈ ℝ^(N'×H×W); M = H × W denotes the number of pixels, H denotes the feature map height, W denotes the feature map width, and N' denotes the number of feature map channels output by the dropout module.
Step S62: reshaping and transposing the third convolution characteristic image set B to obtain a first transposing characteristic image set U;
Figure BDA0002974264740000083
the dimensions are represented.
Step S63: reshaping the fourth convolution characteristic diagram set C to obtain a first reshaping characteristic diagram set I;
Figure BDA0002974264740000084
the dimensions are represented.
Step S64: multiplying the first transfer feature map set U with the first shaping feature map set I, obtaining a space attention map set S through a softmax layer,
Figure BDA0002974264740000085
the dimensions are represented.
Step S65: performing convolution processing on the first convolution characteristic image set A to obtain a fifth convolution characteristic image set D,
Figure BDA0002974264740000086
the dimensions are represented.
Step S66: reshaping the fifth convolution characteristic map set D to obtain a second reshaping characteristic map set Q,
Figure BDA0002974264740000087
the dimensions are represented.
Step S67: combining the second shaping feature map set Q with spaceNote that the set S is multiplied and reshaped, a third set R of reshape features is obtained,
Figure BDA0002974264740000091
the dimensions are represented.
Step S68: carrying out pixel-level addition processing on the first convolution characteristic diagram set A and the third shaping characteristic diagram set R to obtain a position output characteristic diagram set E,
Figure BDA0002974264740000092
the dimensions are represented.
The output of the position attention module at the j-th position is calculated by the following formula:

E_j = α · Σ_{i=1}^{M} (s_ji · D_i) + A_j

wherein E_j denotes the feature of the position output feature map set at the j-th position, α denotes a parameter value learned by training the network on semantic annotation data, M denotes the number of pixels, s_ji indicates the degree of correlation between the i-th position and the j-th position, D_i denotes the feature of the fifth convolution feature map set at the i-th position, and A_j denotes the feature of the first convolution feature map set at the j-th position.

s_ji = exp(B_i · C_j) / Σ_{i=1}^{M} exp(B_i · C_j)

wherein B_i denotes the N'-dimensional vector at the i-th position of the reshaped third convolution feature map set B, C_j denotes the N'-dimensional vector at the j-th position of the reshaped fourth convolution feature map set C, and M indicates the number of pixels.
The position output feature map set produced in this way has a globally context-aware semantic receptive field and selectively aggregates contextual semantic information according to the spatial attention map set S.
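A minimal PyTorch sketch of the position attention module (steps S61 to S68) is given below, following the dimensions reconstructed above. The 1×1 convolutions, the zero initialization of the learned weight α, and the explicit batch dimension are assumptions of this sketch rather than details stated in the patent text.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Sketch of the position attention module (steps S61-S68)."""

    def __init__(self, channels):
        super().__init__()
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=1)  # S61: A -> B
        self.conv_c = nn.Conv2d(channels, channels, kernel_size=1)  # S61: A -> C
        self.conv_d = nn.Conv2d(channels, channels, kernel_size=1)  # S65: A -> D
        self.alpha = nn.Parameter(torch.zeros(1))  # α, learned during training
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):
        # a: first convolution feature map set A, shape (batch, N', H, W)
        n, c, h, w = a.shape
        m = h * w
        u = self.conv_b(a).view(n, c, m).permute(0, 2, 1)      # S62: U, shape (M, N')
        i_map = self.conv_c(a).view(n, c, m)                   # S63: I, shape (N', M)
        s = self.softmax(torch.bmm(u, i_map))                  # S64: spatial attention S, (M, M)
        q = self.conv_d(a).view(n, c, m)                       # S65/S66: Q, shape (N', M)
        r = torch.bmm(q, s.permute(0, 2, 1)).view(n, c, h, w)  # S67: R, back to (N', H, W)
        return self.alpha * r + a                              # S68: E = α·R + A
```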
As shown in fig. 3, step S7 specifically includes:
step S71: reforming and transposing the second convolution characteristic diagram set A 'to obtain a second transposed characteristic diagram set B',
Figure BDA0002974264740000095
the dimensions are represented.
Step S72: reforming the second convolution characteristic diagram set A 'to obtain a fourth shaping characteristic diagram set C',
Figure BDA0002974264740000096
the dimension is expressed, and M × W represents the number of pixels.
Step S73: multiplying the second feature map set B 'and the fourth shaping feature map set C', obtaining a channel attention map set X through the softmax layer,
Figure BDA0002974264740000097
the dimensions are represented.
Step S74: multiplying and reshaping the second convolution characteristic diagram set A 'and the channel attention diagram set X to obtain a fifth reshaping characteristic diagram set D';
Figure BDA0002974264740000098
the dimensions are represented.
Step S75: performing dimension addition processing on the second convolution characteristic diagram set A ' and the fifth shaping characteristic diagram set D ' to obtain a dimension output characteristic diagram set E ',
Figure BDA0002974264740000101
the dimensions are represented.
The output corresponding to the j-th channel is calculated by the following formula:

E'_j = β · Σ_{i=1}^{N'} (x_ji · A'_i) + A'_j

wherein A'_i denotes the i-th channel feature map, A'_j denotes the j-th channel feature map, E'_j denotes the output corresponding to the j-th channel, β denotes a parameter value learned by training the network on semantic annotation data, and N' denotes the number of channels after Dropout.
x_ji = exp(C'_i · C'_j) / Σ_{i=1}^{N'} exp(C'_i · C'_j)

wherein x_ji denotes the influence of the i-th channel on the j-th channel, C'_i denotes the 1×M vector of the i-th channel after reshaping, C'_j denotes the M×1 vector of the j-th channel after reshaping, and i, j = 1, 2, …, N'.
The dimension output feature map set E' finally output is a weighted fusion over all channels, which helps improve the discriminability of cross-channel features.
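Similarly, steps S71 to S75 (the channel attention module) can be sketched as follows; the zero initialization of β and the batch handling are again assumptions of the sketch, and no extra normalization of the channel energy matrix is applied here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention module (steps S71-S75)."""

    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # β, learned during training
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):
        # a: second convolution feature map set A', shape (batch, N', H, W)
        n, c, h, w = a.shape
        m = h * w
        c_r = a.view(n, c, m)                   # S72: C', shape (N', M)
        b_t = c_r.permute(0, 2, 1)              # S71: B', shape (M, N')
        x = self.softmax(torch.bmm(c_r, b_t))   # S73: channel attention X, (N', N'); x[j, i] = x_ji
        d = torch.bmm(x, c_r).view(n, c, h, w)  # S74: D', reshaped back to (N', H, W)
        return self.beta * d + a                # S75: E' = β·D' + A'
```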
Because the position attention and the channel attention capture the interdependencies between different positions and between channel maps, respectively, the ability of the feature maps to characterize multi-modal semantics is effectively enhanced.
As shown in fig. 4, the present invention further provides a semantic annotation system based on adaptive multi-modal remote sensing image fusion, wherein the system comprises:
and the plurality of feature extraction processing modules are used for respectively carrying out feature extraction processing on the obtained remote sensing images of the plurality of modalities to obtain output feature maps of the plurality of modalities.
And the splicing module is used for splicing the output characteristic graphs of the plurality of modes to obtain a spliced characteristic graph set.
And the random inactivation processing module is used for carrying out random inactivation processing on the channels based on the splicing characteristic diagram set to obtain a random inactivation characteristic diagram set.
And the first convolution layer is used for performing convolution processing on the random inactivation feature map set to obtain a first convolution feature map set.
And the second convolution layer is used for performing convolution processing on the random inactivation characteristic map set to obtain a second convolution characteristic map set.
And the position attention module is used for carrying out global semantic annotation on each pixel point in the first convolution feature map set to obtain a position output feature map set.
And the channel attention module is used for carrying out channel semantic annotation on the channel dimension based on the second convolution feature map set to obtain a channel output feature map set.
And the weighted fusion module is used for carrying out weighted fusion on the position output characteristic graph set and the dimension output characteristic graph set to obtain an initial fusion annotation image.
And the third convolution layer is used for performing convolution processing on the initial fusion labeling image to obtain a fusion labeling image.
As an optional implementation manner, the random inactivation processing module of the present invention specifically includes:
and the rating score determining unit is used for rating each channel based on the pixel values in the splicing feature map set to obtain the rating score corresponding to each channel.
And the probability value determining unit is used for calculating the probability value corresponding to each channel according to the rating score corresponding to each channel.
A channel reservation number determining unit for determining a channel reservation number according to M-Nx wrsratioCalculating a channel reservation number; where M is the channel reservation number, N is the total number of input channels, wrsratioIs a constant.
And the selecting unit is used for selecting the M channels corresponding to the maximum probability value.
In fig. 4, the random inactivation processing module is a Dropout module; the channels of the five modalities are randomly deactivated by the Dropout module at the modality level to simulate partial modality loss in practical situations. For ease of explanation of the Dropout module, the following conventions are made:
The output after splicing in fig. 4 (i.e., the input of the Dropout module) is denoted X = [x_1, x_2, …, x_N], and the output of the Dropout module is denoted X' = [x'_1, x'_2, …, x'_{N'}], where N denotes the total number of input channels of the Dropout module, N' denotes the total number of channels of the modal feature maps after Dropout, x_i denotes the feature map of the i-th channel input to the Dropout module, and x'_i denotes the feature map of the i-th channel output by the Dropout module. In most cases X = X' (i.e., N = N' and x_i = x'_i for i = 1, 2, …, N). The specific steps are as follows:
a) channel rating
When the spliced feature map enters the Dropout module, channel rating is first performed, i.e., a score is assigned to each channel. This step is completed through GAP (global average pooling); for each channel i, the corresponding score is:

score_i = (1 / (W × H)) · Σ_{j=1}^{W} Σ_{k=1}^{H} x_i(j, k)

wherein score_i is the rating score of the i-th channel, W and H are the maximum width and height of the i-th channel, respectively, and x_i(j, k) is the pixel value of the i-th channel at width position j and height position k.
b) Channel selection
A 0-1 binary mask is constructed from the scores by the weighted random selection (WRS) method; the algorithm flow is given in Table 1 of the original publication (the table is rendered there as an image and is not reproduced here). A random number generator produces a random number r_i in (0, 1) for each channel, from which

key_i = r_i^(1 / score_i)

is obtained; the channels with the M largest key_i values are retained, and the corresponding mask_i is set to 1.
c) Random selection
Channel selection is carried out on the basis of the channel rating, and a 0-1 binary mask is constructed. When mask_i of the i-th layer is 1/0, the corresponding channel is selected/unselected. Before the binary mask is constructed, a retention probability p_i is calculated and assigned to each channel (the expression for p_i appears as an image in the original publication and is not reproduced here). Since P(mask_i = 1) = p_i, the higher a channel's rating score, the more likely that channel is to be retained. Owing to the heterogeneous differences between modalities, after passing through the preceding neural network some modality channels may be assigned much larger scores than others; if channels were selected purely by score, the selected channel sequence could remain the same at every forward pass for every image. By adding the random number generator, a channel whose mask_i would otherwise be set to 1, i.e., the corresponding x_i, may still go unselected.
As an optional implementation manner, the location attention module specifically includes:
and the first convolution processing unit is used for respectively carrying out convolution processing on the first convolution feature map set to obtain a third convolution feature map set and a fourth convolution feature map set.
And the first transposition processing unit is used for reshaping and transposing the third convolution feature map set to obtain a first transposition feature map set.
And the first reshaping processing unit is used for reshaping the fourth convolution feature map set to obtain a first reshaping feature map set.
And the first multiplication processing unit is used for multiplying the first transfer characteristic map set and the first shaping characteristic map set, and obtaining a spatial attention map set through the softmax layer.
And the second convolution processing unit is used for performing convolution processing on the first convolution feature map set to obtain a fifth convolution feature map set.
And the second reshaping processing unit is used for reshaping the fifth convolution feature map set to obtain a second reshaping feature map set.
And the third reshaping processing unit is used for multiplying the second reshaping characteristic diagram set and the spatial attention diagram set and performing reshaping processing to obtain a third reshaping characteristic diagram set.
And the first addition processing unit is used for carrying out pixel-level addition processing on the first convolution characteristic diagram set and the third shaping characteristic diagram set to obtain a position output characteristic diagram set.
As an optional implementation manner, the channel attention module of the present invention specifically includes:
and the second transposition processing unit is used for reshaping and transposing the second convolution characteristic graph set to obtain a second transposition characteristic graph set.
And the fourth reshaping processing unit is used for reshaping the second convolution feature map set to obtain a fourth reshaping feature map set.
And the second multiplication processing unit is used for multiplying the second transposed feature map set and the fourth shaping feature map set, and obtaining a channel attention map set through the softmax layer.
And the fifth reshaping processing unit is used for multiplying the second convolution characteristic diagram set and the channel attention diagram set and reshaping the second convolution characteristic diagram set and the channel attention diagram set to obtain a fifth reshaping characteristic diagram set.
And the second addition processing unit is used for carrying out dimension addition processing on the second convolution characteristic diagram set and the fifth shaping characteristic diagram set to obtain a dimension output characteristic diagram set.
Compared with the prior art, the technical scheme disclosed by the invention has the following advantages:
1. The random inactivation processing module of the invention provides a modality-level Dropout mechanism that simulates missing modalities: the input multi-modal channels are randomly deactivated to simulate partial modality loss in practical situations, which improves the generalization capability and robustness of the model.
2. For the multi-modal heterogeneous fusion mechanism, the system is provided with a position attention module and a channel attention module. The position attention module captures global semantic information (the contextual relations among pixels) to attend to image context information, while the channel attention module captures cross-modal heterogeneous information (the relations among different modalities), improving the adaptive fusion benefit of remote sensing images of different modalities and the feature weighting over the channel and image dimensions. To make better use of the global semantic information output by the two attention modules, the outputs of the two modules are passed through a convolution layer and then combined by weighted feature fusion, which improves the fusion capability of semantic information labeling. A sketch of this fusion head is given below.
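The following sketch composes the two attention sketches above into a fusion head that performs the weighted fusion and final convolution of steps S8 and S9. It assumes the PositionAttention and ChannelAttention classes from the earlier sketches are in scope; the learnable scalar fusion weights, the 3×3 and 1×1 convolutions, and the class name AttentionFusionHead are assumptions of the example, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class AttentionFusionHead(nn.Module):
    """Sketch of steps S5-S9: two parallel convolutions, position/channel attention,
    weighted fusion, and a final convolution producing the fusion annotation map."""

    def __init__(self, channels, num_classes):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # first convolution layer
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # second convolution layer
        self.pam = PositionAttention(channels)        # sketch defined above
        self.cam = ChannelAttention()                 # sketch defined above
        self.w_pos = nn.Parameter(torch.tensor(0.5))  # fusion weights (assumed learnable)
        self.w_ch = nn.Parameter(torch.tensor(0.5))
        self.conv_out = nn.Conv2d(channels, num_classes, kernel_size=1)       # third convolution layer

    def forward(self, x):
        # x: random inactivation feature map set, shape (batch, N', H, W)
        e_pos = self.pam(self.conv1(x))  # position output feature map set (step S6)
        e_ch = self.cam(self.conv2(x))   # dimension/channel output feature map set (step S7)
        fused = self.w_pos * e_pos + self.w_ch * e_ch  # step S8: weighted fusion
        return self.conv_out(fused)                    # step S9: fusion annotation map
```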
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A semantic annotation method based on self-adaptive multi-mode remote sensing image fusion is characterized by comprising the following steps:
step S1: acquiring remote sensing images of a plurality of modalities;
step S2: respectively carrying out feature extraction processing on the remote sensing images of a plurality of modes to obtain output feature maps of the plurality of modes;
step S3: splicing the output characteristic graphs of the plurality of modes to obtain a spliced characteristic graph set;
step S4: based on the splicing characteristic diagram set, carrying out random inactivation treatment on the channel to obtain a random inactivation characteristic diagram set;
step S5: respectively carrying out convolution processing on the random inactivation characteristic image sets to respectively obtain a first convolution characteristic image set and a second convolution characteristic image set;
step S6: carrying out global semantic annotation on each pixel point in the first convolution feature map set to obtain a position output feature map set;
step S7: performing channel semantic annotation on the channel dimension based on the second convolution feature map set to obtain a channel output feature map set;
step S8: carrying out weighted fusion on the position output characteristic graph set and the dimension output characteristic graph set to obtain an initial fusion annotation image;
step S9: and performing convolution processing on the initial fusion labeling image to obtain a fusion labeling image.
2. The method for semantic annotation based on fusion of the adaptive multi-modal remote sensing images according to claim 1, wherein the step S4 specifically comprises:
step S41: grading each channel based on pixel values in the spliced feature map set to obtain a grading score corresponding to each channel;
step S42: calculating a probability value corresponding to each channel according to the rating score corresponding to each channel;
step S43: calculating a channel reservation number according to M = N × wrs_ratio; where M is the channel reservation number, N is the total number of input channels, and wrs_ratio is a constant;
step S44: and selecting M channels corresponding to the maximum probability value.
3. The method for semantic annotation based on fusion of the adaptive multi-modal remote sensing images according to claim 1, wherein the step S6 specifically comprises:
step S61: performing convolution processing on the first convolution feature map set respectively to obtain a third convolution feature map set and a fourth convolution feature map set;
step S62: reshaping and transposing the third convolution feature map set to obtain a first transposing feature map set;
step S63: reshaping the fourth convolution feature map set to obtain a first reshaping feature map set;
step S64: multiplying the first transposed feature map set and the first reshaping feature map set, and obtaining a spatial attention map set through a softmax layer;
step S65: performing convolution processing on the first convolution feature map set to obtain a fifth convolution feature map set;
step S66: reshaping the fifth convolution characteristic graph set to obtain a second reshaping characteristic graph set;
step S67: multiplying the second shaping feature map set and the spatial attention map set and reshaping to obtain a third shaping feature map set;
step S68: and carrying out pixel-level addition processing on the first convolution feature map set and the third shaping feature map set to obtain a position output feature map set.
4. The method for semantic annotation based on fusion of the adaptive multi-modal remote sensing images according to claim 1, wherein the step S7 specifically comprises:
step S71: reshaping and transposing the second convolution feature map set to obtain a second transposed feature map set;
step S72: reshaping the second convolution characteristic graph set to obtain a fourth reshaping characteristic graph set;
step S73: multiplying the second transposed feature map set and the fourth shaping feature map set, and obtaining a channel attention map set through a softmax layer;
step S74: multiplying the second convolution characteristic diagram set and the channel attention diagram set and reshaping to obtain a fifth reshaping characteristic diagram set;
step S75: and performing dimension addition processing on the second convolution feature map set and the fifth shaping feature map set to obtain a dimension output feature map set.
5. A semantic annotation system based on self-adaptive multi-mode remote sensing image fusion is characterized by comprising:
the system comprises a plurality of characteristic extraction processing modules, a plurality of image processing modules and a plurality of image processing modules, wherein the plurality of characteristic extraction processing modules are used for respectively carrying out characteristic extraction processing on the obtained remote sensing images in a plurality of modes to obtain output characteristic graphs of the plurality of modes;
the splicing module is used for splicing the output characteristic graphs of the plurality of modes to obtain a spliced characteristic graph set;
the random inactivation processing module is used for carrying out random inactivation processing on the channel based on the splicing characteristic diagram set to obtain a random inactivation characteristic diagram set;
the first convolution layer is used for performing convolution processing on the random inactivation feature map set to obtain a first convolution feature map set;
the second convolution layer is used for performing convolution processing on the random inactivation characteristic image set to obtain a second convolution characteristic image set;
the position attention module is used for carrying out global semantic annotation on each pixel point in the first convolution feature map set to obtain a position output feature map set;
the channel attention module is used for carrying out channel semantic annotation on channel dimensions based on the second convolution feature map set to obtain a channel output feature map set;
the weighted fusion module is used for carrying out weighted fusion on the position output characteristic graph set and the dimension output characteristic graph set to obtain an initial fusion annotation image;
and the third convolution layer is used for performing convolution processing on the initial fusion labeling image to obtain a fusion labeling image.
6. The semantic annotation system based on self-adaptive multi-mode remote sensing image fusion according to claim 5, wherein the random inactivation processing module specifically comprises:
the ranking score determining unit is used for ranking each channel based on the pixel values in the splicing feature map set to obtain a ranking score corresponding to each channel;
the probability value determining unit is used for calculating the probability value corresponding to each channel according to the rating score corresponding to each channel;
a channel reservation number determining unit for calculating a channel reservation number according to M = N × wrs_ratio; where M is the channel reservation number, N is the total number of input channels, and wrs_ratio is a constant;
and the selecting unit is used for selecting the M channels corresponding to the maximum probability value.
7. The system for semantic annotation based on adaptive multi-modal remote sensing image fusion according to claim 5, wherein the location attention module specifically comprises:
the first convolution processing unit is used for respectively carrying out convolution processing on the first convolution feature map set to obtain a third convolution feature map set and a fourth convolution feature map set;
the first transposition processing unit is used for reshaping and transposing the third convolution feature map set to obtain a first transposition feature map set;
the first reshaping processing unit is used for reshaping the fourth convolution feature map set to obtain a first reshaping feature map set;
the first multiplication processing unit is used for multiplying the first transposed feature map set and the first reshaping feature map set, and obtaining a spatial attention map set through a softmax layer;
the second convolution processing unit is used for performing convolution processing on the first convolution feature map set to obtain a fifth convolution feature map set;
the second reshaping processing unit is used for reshaping the fifth convolution feature map set to obtain a second reshaping feature map set;
the third reshaping processing unit is used for multiplying the second reshaping characteristic diagram set and the space attention diagram set and reshaping the spatial attention diagram set to obtain a third reshaping characteristic diagram set;
and the first addition processing unit is used for carrying out pixel-level addition processing on the first convolution characteristic diagram set and the third shaping characteristic diagram set to obtain a position output characteristic diagram set.
8. The semantic annotation system based on self-adaptive multi-mode remote sensing image fusion according to claim 5, wherein the channel attention module specifically comprises:
the second transposition processing unit is used for reshaping and transposing the second convolution characteristic graph set to obtain a second transposition characteristic graph set;
the fourth reshaping processing unit is used for reshaping the second convolution feature map set to obtain a fourth reshaping feature map set;
the second multiplication processing unit is used for multiplying the second transposed feature map set and the fourth shaping feature map set, and obtaining a channel attention map set through a softmax layer;
a fifth reshaping processing unit, configured to multiply the second convolution feature map set and the channel attention map set and perform reshaping processing to obtain a fifth reshaping feature map set;
and the second addition processing unit is used for carrying out dimension addition processing on the second convolution characteristic diagram set and the fifth shaping characteristic diagram set to obtain a dimension output characteristic diagram set.
CN202110270709.3A 2021-03-12 2021-03-12 Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion Active CN112949732B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110270709.3A CN112949732B (en) 2021-03-12 2021-03-12 Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110270709.3A CN112949732B (en) 2021-03-12 2021-03-12 Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion

Publications (2)

Publication Number Publication Date
CN112949732A true CN112949732A (en) 2021-06-11
CN112949732B CN112949732B (en) 2022-04-22

Family

ID=76229690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110270709.3A Active CN112949732B (en) 2021-03-12 2021-03-12 Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion

Country Status (1)

Country Link
CN (1) CN112949732B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993236A (en) * 2017-11-27 2018-05-04 上海交通大学 A kind of method and platform of multi-modality images processing
CN108537192A (en) * 2018-04-17 2018-09-14 福州大学 A kind of remote sensing image ground mulching sorting technique based on full convolutional network
US20190087726A1 (en) * 2017-08-30 2019-03-21 The Board Of Regents Of The University Of Texas System Hypercomplex deep learning methods, architectures, and apparatus for multimodal small, medium, and large-scale data representation, analysis, and applications
CN110287354A (en) * 2019-05-16 2019-09-27 中国科学院西安光学精密机械研究所 A kind of high score remote sensing images semantic understanding method based on multi-modal neural network
CN110781895A (en) * 2019-10-10 2020-02-11 湖北工业大学 Image semantic segmentation method based on convolutional neural network
CN111340047A (en) * 2020-02-28 2020-06-26 江苏实达迪美数据处理有限公司 Image semantic segmentation method and system based on multi-scale feature and foreground and background contrast
CN111461130A (en) * 2020-04-10 2020-07-28 视研智能科技(广州)有限公司 High-precision image semantic segmentation algorithm model and segmentation method
US20200334819A1 (en) * 2018-09-30 2020-10-22 Boe Technology Group Co., Ltd. Image segmentation apparatus, method and relevant computing device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190087726A1 (en) * 2017-08-30 2019-03-21 The Board Of Regents Of The University Of Texas System Hypercomplex deep learning methods, architectures, and apparatus for multimodal small, medium, and large-scale data representation, analysis, and applications
CN107993236A (en) * 2017-11-27 2018-05-04 上海交通大学 A kind of method and platform of multi-modality images processing
CN108537192A (en) * 2018-04-17 2018-09-14 福州大学 A kind of remote sensing image ground mulching sorting technique based on full convolutional network
US20200334819A1 (en) * 2018-09-30 2020-10-22 Boe Technology Group Co., Ltd. Image segmentation apparatus, method and relevant computing device
CN110287354A (en) * 2019-05-16 2019-09-27 中国科学院西安光学精密机械研究所 A kind of high score remote sensing images semantic understanding method based on multi-modal neural network
CN110781895A (en) * 2019-10-10 2020-02-11 湖北工业大学 Image semantic segmentation method based on convolutional neural network
CN111340047A (en) * 2020-02-28 2020-06-26 江苏实达迪美数据处理有限公司 Image semantic segmentation method and system based on multi-scale feature and foreground and background contrast
CN111461130A (en) * 2020-04-10 2020-07-28 视研智能科技(广州)有限公司 High-precision image semantic segmentation algorithm model and segmentation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHENG PENG et al.: "Densely Based Multi-Scale and Multi-Modal Fully Convolutional Networks for High-Resolution Remote-Sensing Image Semantic Segmentation", IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing *
王子锋: "Annotation algorithm for high-resolution remote sensing images based on active clustering", China Masters' Theses Full-text Database (Basic Sciences Series) *

Also Published As

Publication number Publication date
CN112949732B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN111783705B (en) Character recognition method and system based on attention mechanism
CN109919174A (en) A kind of character recognition method based on gate cascade attention mechanism
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN114821342B (en) Remote sensing image road extraction method and system
CN113240683B (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN112991350A (en) RGB-T image semantic segmentation method based on modal difference reduction
CN113284100A (en) Image quality evaluation method based on recovery image to mixed domain attention mechanism
CN115620010A (en) Semantic segmentation method for RGB-T bimodal feature fusion
CN115457043A (en) Image segmentation network based on overlapped self-attention deformer framework U-shaped network
CN112418235A (en) Point cloud semantic segmentation method based on expansion nearest neighbor feature enhancement
CN115965789A (en) Scene perception attention-based remote sensing image semantic segmentation method
Li et al. Maskformer with improved encoder-decoder module for semantic segmentation of fine-resolution remote sensing images
CN113888399A (en) Face age synthesis method based on style fusion and domain selection structure
Jiang et al. Cross-level reinforced attention network for person re-identification
CN117422978A (en) Grounding visual question-answering method based on dynamic two-stage visual information fusion
CN112949732B (en) Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion
CN116543338A (en) Student classroom behavior detection method based on gaze target estimation
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
CN116152263A (en) CM-MLP network-based medical image segmentation method
CN115578638A (en) Method for constructing multi-level feature interactive defogging network based on U-Net
CN115131563A (en) Interactive image segmentation method based on weak supervised learning
CN114331894A (en) Face image restoration method based on potential feature reconstruction and mask perception
CN114693951A (en) RGB-D significance target detection method based on global context information exploration
CN115936073B (en) Language-oriented convolutional neural network and visual question-answering method
CN117392392B (en) Rubber cutting line identification and generation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant