CN115311463B - Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system - Google Patents
- Publication number: CN115311463B (application CN202211223823.1A)
- Authority: CN (China)
- Prior art keywords: image, text, features, remote sensing, decoupling
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/40 — Extraction of image or video features
- G06N3/084 — Learning methods: backpropagation, e.g. using gradient descent
- G06V10/30 — Image preprocessing: noise filtering
- G06V10/74 — Image or video pattern matching; proximity measures in feature spaces
- G06V10/82 — Image or video recognition or understanding using neural networks
Abstract
The invention belongs to the technical field of remote sensing image processing and discloses a category-guided multi-scale decoupling marine remote sensing image text retrieval method and system. First, image features at different scales are extracted from the marine remote sensing image, and text features are extracted from the remote-sensing-related text. A bidirectional multi-scale decoupling module then decouples the multi-scale image features, extracting the latent features specific to each scale while suppressing the redundant features of the other scales, to obtain decoupled features. A category label guidance module guides the decoupled image features and the text features, and the final category-related image and text features are computed by multiplication. Finally, the similarity and the semantic-guided triplet loss are calculated. The invention realizes multi-scale decoupling, introduces effective category information into the decoupling, establishes a marine multi-modal information fusion method with dual decoupling in both scale and semantics, and solves the problems of multi-scale noise redundancy and the difficult fusion of multi-dimensional decoupled representations.
Description
Technical Field
The invention belongs to the technical field of remote sensing image processing, and particularly relates to a category-guided multi-scale decoupling marine remote sensing image text retrieval method and system.
Background
Marine remote sensing image-text retrieval is an important way to address missing and inaccurately described text data in remote sensing datasets. It applies a cross-modal retrieval algorithm to analyze large numbers of satellite remote sensing images and automatically retrieve text data that accurately describes them. Traditional methods struggle to extract effective image features: targets in marine remote sensing images are spatially dispersed and few in number, so the information of the effective targets is diluted during the fusion of global information, which harms subsequent data mining. State-of-the-art marine remote sensing image-text retrieval methods therefore introduce multi-scale feature extraction and attention mechanisms; for example, Yuan et al. proposed a novel fine-grained multi-modal feature matching network that obtains image features at different scales and extracts the key features among them, so as to retrieve more accurate text information.
However, existing methods have the following problems. First, a large amount of redundant noise is generated during multi-scale feature interaction. Multi-scale features often cover repeated regions; when the multi-scale features are fused by addition or concatenation, these repeated regions accumulate, so the utilization of multi-scale content is low. The redundant-feature filtering algorithms used by existing methods are simple and cannot filter out the large amount of noise, and this redundant noise degrades subsequent data fusion and mining. For example, methods that filter redundant features with a gating mechanism cannot effectively remove large amounts of noise and also risk filtering out effective information. Second, existing methods usually perform knowledge decoupling only over the multi-scale features of the image and ignore the disambiguating role that image semantic information and text semantic information play in image-text retrieval. Considering only feature decoupling across scales wastes the value of rich semantic information, and this missing information increases the time and difficulty of extracting effective key features. The low-level semantic information of an image is the expression of its shallow features (such as color, geometry, and texture), while the semantic information of text can be understood as information related to category assignment. Introducing image-text semantic information can therefore express the texture, geometry, and color of the image content as well as the text description and text category, and this semantic information enables the back end of the network to correctly predict category membership.
To address these problems, the invention provides a category-guided bidirectional multi-scale decoupling network that realizes multi-scale decoupling and introduces effective category information (image-text semantic information) into the decoupling. A marine multi-modal information fusion framework with dual decoupling in both scale and semantics is established, solving the problems of multi-scale noise redundancy and the difficult fusion of multi-dimensional decoupled representations.
Disclosure of Invention
To remedy the deficiencies of the prior art, the invention provides a category-guided multi-scale decoupling marine remote sensing image text retrieval method and system: decoupled features at different scales are obtained through bidirectional multi-scale decoupling, and category labels guide and decouple the category features of images and texts, thereby solving the problems of multi-scale noise redundancy and the difficult fusion of multi-dimensional decoupled feature information.
In order to solve the technical problems, the invention adopts the following technical scheme:
First, the invention provides a category-guided multi-scale decoupling marine remote sensing image text retrieval method, comprising the following steps:
S0, obtaining a marine remote sensing image and remote-sensing-related text;
S1, extracting image features of the marine remote sensing image: first, a convolutional neural network is used to embed the features of the image; the resulting basic image features are then sampled with dilated (atrous) convolutions at different sampling rates to obtain image features at different scales;
S2, extracting the text features T of the remote-sensing-related text;
S3, bidirectional multi-scale decoupling: decoupling the image features at different scales obtained in step S1, extracting the corresponding latent features at each scale while suppressing the redundant features of the other scales, to obtain the decoupled image features F;
S4, category label guidance: first generating the category features of the image and the text, then using the generated category features to guide the decoupled image features F and the text features T, and computing the final category-related image features F_c and text features T_c by multiplication;
S5, calculating the similarity and the semantic-guided triplet loss:
first, the category-related image features F_c and text features T_c output in step S4 are matched by category to judge whether the image and the text belong to the same category; the category attribute is input into the downstream task as external knowledge, and dynamic weight selection is performed on the heterogeneous information of matched image-text pairs; the semantic-guided triplet loss is then calculated, steps S1-S5 are iterated, and back-propagation training is performed;
S6, inputting a marine remote sensing image to be retrieved and outputting the remote-sensing-related text data; or inputting remote-sensing-related text data to be retrieved and outputting the marine remote sensing image.
Further, step S3 is divided into two steps:
S31, for the image features X_m of each scale output by the image feature extraction module, constructing an attention map A_m at the current scale based on an attention mechanism to extract latent features, and generating a suppression mask M_m;
S32, for the attention maps A_m and suppression masks M_m extracted at the different feature scales, A_m promotes the salient information at the corresponding scale while M_m suppresses the salient features of the other scales, yielding image features from which redundant information has been filtered and thus realizing scale decoupling; the attention maps are applied to the generation of the decoupled features F_s2l and F_l2s in a step-by-step suppression manner, where F_s2l is the decoupled feature in the small-to-large scale direction and F_l2s is the decoupled feature in the large-to-small scale direction; finally, the decoupled features F_s2l and F_l2s of the various feature scales are combined into the final decoupled image features F through a concat operation.
Further, the decoupled features are calculated by cascading the attention maps A_m and suppression masks M_m through element-wise operations to derive F_s2l and F_l2s, where m indexes the different scales (three scales: large, medium, and small).
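The equation itself is an image that did not survive extraction; one plausible reconstruction, consistent with the verbal description above (each attention map promoting its own scale, the suppression masks applied step by step, and the results cascaded by concatenation), is:

```latex
F^{s2l} = \operatorname{concat}_{m=1}^{3}\!\Big( A_m \odot X_m \odot \prod_{k<m} M_k \Big),
\qquad
F^{l2s} = \operatorname{concat}_{m=3}^{1}\!\Big( A_m \odot X_m \odot \prod_{k>m} M_k \Big),
```

where X_m denotes the image features at scale m, A_m the attention map, M_k the suppression masks, and ⊙ the element-wise product; the exact composition in the original patent may differ.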
Further, step S4 is specifically as follows:
S41, obtaining category semantic labels from the marine remote sensing images obtained in step S0, and obtaining the remote sensing image category features U by training a remote sensing image classifier;
S42, obtaining category semantic labels from the remote-sensing-related texts obtained in step S0, and obtaining the remote-sensing-related text category features V by training a remote-sensing-related text classifier;
S43, multiplying the decoupled image features F obtained in step S3 by the remote sensing image category features U, and multiplying the text features T obtained in step S2 by the remote-sensing-related text category features V; the purpose is to perform attention enhancement on the decoupled image features F and the text features T with the category features U and V of the corresponding modalities, obtaining the final category-related image features F_c and category-related text features T_c.
Further, step S31 specifically comprises: first aggregating the channel information of a feature through average-pooling and max-pooling operations to generate two feature descriptors, and then passing the descriptors through a standard convolution layer and a sigmoid function to generate the attention map.
Further, in step S5, the category features are first converted by softmax into the semantic categories c_I and c_T of the image and the text; then a parameter λ is defined to adjust the loss, with λ taking a larger value when the predicted categories of the image and the text agree.
On the basis of the constant λ, the category-based triplet loss is designed as a sum of hinge terms over both retrieval directions, where α denotes the margin, S(I, T+) the similarity between a sample image and a positive text, S(I, T−) the similarity between the sample image and a negative text, S(T, I+) the similarity between a sample text and a positive image, and S(T, I−) the similarity between the sample text and a negative image. The first summation matches the image features against all text features, including the positive text features and the negative text features; the second summation matches the text features against all image features, including the positive image features and the negative image features. The objective of the triplet loss constructed from the two summations is to maximize the similarity with the positive samples and minimize the similarity with the negative samples.
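The formulas for the parameter λ and the triplet loss are images lost in extraction; one hedged reconstruction consistent with the description (β is a hypothetical weight greater than 1, applied when the softmax categories of image and text agree) is:

```latex
\lambda =
\begin{cases}
\beta, & \text{if } \arg\max c_I = \arg\max c_T,\\
1, & \text{otherwise,}
\end{cases}
\qquad
\mathcal{L} = \sum_{T'} \big[\alpha - \lambda\, S(I, T^{+}) + S(I, T^{-})\big]_{+}
            + \sum_{I'} \big[\alpha - \lambda\, S(T, I^{+}) + S(T, I^{-})\big]_{+}
```

where [·]₊ denotes the hinge max(0, ·), the first sum runs over all candidate texts and the second over all candidate images.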
The invention also provides a category-guided multi-scale decoupling marine remote sensing image text retrieval system for implementing the above method, comprising an input module, an image feature extraction module, a text feature extraction module, a bidirectional multi-scale decoupling module, a category label guidance module, a semantic-guided triplet loss module, and an output module;
the image feature extraction module comprises a deep residual network and an atrous spatial pyramid pooling module and is used for extracting multi-scale image features;
the text feature extraction module extracts text features to obtain the text features T of the remote-sensing-related text;
the bidirectional multi-scale decoupling module decouples the multi-scale image features output by the image feature extraction module to obtain the decoupled features F;
the category label guidance module comprises a remote sensing image classifier and a remote-sensing-related text classifier, used respectively to obtain the remote sensing image category features U and the remote-sensing-related text category features V; the category semantic labels U and V, which are category features annotated by pre-trained models, guide the image and the text as prior knowledge to construct category features and realize feature decoupling in the semantic dimension; the decoupled image features F and the text features T undergo attention enhancement with the category features U and V of the corresponding modalities, yielding the category-related image and text features;
the semantic-guided triplet loss module calculates the semantic-guided triplet loss: it matches the category features by category, judges whether the image and the text belong to the same category, inputs the category attribute into the downstream task as external knowledge, and performs dynamic weight selection on the heterogeneous information of matched image-text pairs;
the input module inputs a marine remote sensing image or remote-sensing-related text data to be retrieved, and the output module outputs the corresponding remote-sensing-related text data or marine remote sensing image.
Compared with the prior art, the invention has the advantages that:
(1) The noise redundancy problem is solved. The invention effectively filters the large amount of redundant noise generated during multi-scale feature interaction. A bidirectional multi-scale decoupling module is constructed that adaptively extracts the latent features of each scale in both directions and suppresses the redundant features of the other scales, so that the effective features of each scale are extracted, the redundant features of each scale are suppressed, and a large amount of redundant noise is filtered out.
(2) Introducing category information (semantic information) improves the robustness of the features. The invention unifies semantic decoupling across the two dimensions. A category label guidance module is constructed that uses category semantic labels as prior knowledge to supervise the images and texts, constructing better category features and realizing feature decoupling in the semantic dimension. The category semantic features emphasize the effective features, and the knowledge from semantic decoupling is mapped into the visual multi-scale sample space through cascading. The category attribute serves as a bridge between the two kinds of modal information: while aligning the multi-modal knowledge, it provides external knowledge to the model, helping it quickly extract effective features and mine the effective targets in the remote sensing image. Meanwhile, aligning and fusing the multi-scale image features, the effective information (text semantic features), and the image semantic features also produces expressions of category information, pixel membership, and scale characteristics, and the semantic information expressed by the image and text lets the back end of the network make correct predictions of category membership.
(3) Prior knowledge is used to solve the problems of difficult effective-feature extraction and low retrieval accuracy. The invention constructs a semantic-guided triplet loss module that matches the category features by category, judges whether the image and the text belong to the same category, inputs the category attribute into the downstream task as external knowledge, and performs dynamic weight selection on the heterogeneous information of matched image-text pairs. For example, a high-accuracy remote sensing image classification model and remote sensing text classification model are trained as prior knowledge and added to the loss function; if the categories of the image and the text are the same, their similarity is increased, which greatly shortens the model convergence time and makes the matching probability of same-category image-text pairs higher than their non-matching probability, greatly increasing the retrieval accuracy of the model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a system architecture diagram of the present invention;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
Example 1
With reference to fig. 1 and 2, the category-guided bidirectional multi-scale decoupling marine remote sensing image text retrieval method first preprocesses the data, including processing the marine remote sensing image; from the preprocessed data, the text features T are extracted by the text feature extraction module on the one hand, and the decoupled image features F are extracted by bidirectional multi-scale decoupling on the other. The decoupled image features F and text features T are then input into the category label guidance module, which uses the category semantic labels (U and V) as prior knowledge to supervise the image and text, construct category features, and realize feature decoupling in the semantic dimension. Finally, the semantic-guided triplet loss is calculated from the similarity of the image and the text, whether the image and the text belong to the same category is judged, and back-propagation is performed.
The method specifically comprises the following steps:
S0, obtaining the marine remote sensing image and the remote-sensing-related text.
S1, extracting image features of the marine remote sensing image: first, a convolutional neural network is used to embed the features of the image; the resulting basic image features are then sampled with dilated (atrous) convolutions at different sampling rates to obtain image features at different scales. This step yields the representation of the image.
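As a hedged sketch of this step (plain NumPy rather than a deep-learning framework; the 3x3 averaging kernel and the dilation rates 1, 2, 4 are illustrative choices, not the patent's), dilated convolution samples the backbone features at different rates to produce features at different scales:

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """Valid-mode 2D convolution of a single-channel map with dilation `rate`."""
    kh, kw = kernel.shape
    eh, ew = (kh - 1) * rate + 1, (kw - 1) * rate + 1   # effective kernel extent
    h, w = x.shape
    out = np.zeros((h - eh + 1, w - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sample the input sparsely: dilation enlarges the receptive
            # field without adding parameters.
            out[i, j] = (x[i:i + eh:rate, j:j + ew:rate] * kernel).sum()
    return out

rng = np.random.default_rng(0)
base = rng.standard_normal((32, 32))   # "basic features" from the CNN backbone
kernel = np.ones((3, 3)) / 9.0         # placeholder for learned weights

# Different sampling (dilation) rates yield image features at different scales.
scales = {r: dilated_conv2d(base, kernel, r) for r in (1, 2, 4)}
for r, f in scales.items():
    print(f"rate {r}: feature map {f.shape}")
```

Larger rates see a wider context per output position, which is the mechanism behind the multi-scale sampling described above.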
S2, extracting the text features T of the remote-sensing-related text. In a specific application, the text features can be extracted with a word-vector embedding (sentence embedding) model and a Skip-Thought text processing model. This step yields the representation of the text.
S3, bidirectional multi-scale decoupling: decoupling the image features at different scales obtained in step S1, extracting the corresponding latent features at each scale while suppressing the redundant features of the other scales, to obtain the decoupled image features F. This comprises the following two steps:
S31, for the image features X_m of each scale output by the image feature extraction module, constructing an attention map A_m at the current scale based on an attention mechanism to extract latent features, and generating a suppression mask M_m.
Specifically, the channel information of a feature is first aggregated through average-pooling and max-pooling operations to generate two feature descriptors; the descriptors are then passed through a standard convolution layer and a sigmoid function to generate the attention map A_m.
The suppression mask M_m is a binary mask that sets the most significant value of A_m to 0 and all other positions to 1; it mitigates the covering effect of A_m on the other scales, making the information shared across the different scales stand out.
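Step S31 can be sketched as follows, assuming a CBAM-style spatial attention; the scalar fusion weights stand in for the learned standard convolution layer, and all shapes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(feat, w_avg=0.5, w_max=0.5, bias=0.0):
    """feat: (C, H, W). Aggregate channel information with average- and
    max-pooling into two descriptors, fuse them (a 1x1 mixing here; w_avg and
    w_max are placeholders for the learned convolution), and squash with a
    sigmoid to obtain the attention map A_m."""
    avg_desc = feat.mean(axis=0)   # (H, W) average-pooled descriptor
    max_desc = feat.max(axis=0)    # (H, W) max-pooled descriptor
    return sigmoid(w_avg * avg_desc + w_max * max_desc + bias)

def suppression_mask(att):
    """Binary mask M_m: 0 at the most significant position of A_m, 1 elsewhere."""
    return (att < att.max()).astype(att.dtype)

rng = np.random.default_rng(1)
feat = rng.standard_normal((8, 16, 16))
A = spatial_attention(feat)
M = suppression_mask(A)
```

Because M zeros only the peak of A, multiplying another scale's features by M removes exactly the region this scale already claims as salient.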
S32, for the attention maps A_m and suppression masks M_m extracted at the different feature scales, A_m promotes the salient information at the corresponding scale while M_m suppresses the salient features of the other scales, yielding image features from which redundant information has been filtered and thus realizing scale decoupling. The attention maps are applied to the generation of the decoupled features F_s2l and F_l2s in a step-by-step suppression manner; finally, the decoupled features of the various feature scales are combined into the final decoupled image features F through a concat operation, where m indexes the different scales (three scales: large, medium, and small), F_s2l is the decoupled feature in the small-to-large scale direction, and F_l2s is the decoupled feature in the large-to-small scale direction.
In particular, since an attention map represents the salient regions of a feature, the suppression mask leverages the attention-map representation to suppress that saliency information on the other scales; this mitigates the covering effect of the attention map across scales and highlights differing information.
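The bidirectional decoupling of step S32 can be sketched as below; the exact way the attention maps and masks are composed is not given by the surviving text, so the composition here (own-scale attention times the accumulated masks of earlier scales) is an assumption:

```python
import numpy as np

def decouple(feats, atts, masks):
    """One direction of scale decoupling. At step m, the scale's own attention
    map promotes its salient information, while the accumulated masks of the
    scales already visited suppress their salient regions (step-by-step
    suppression). The exact composition is an assumption."""
    out, keep = [], np.ones_like(masks[0])
    for f, a, m in zip(feats, atts, masks):
        out.append(f * a * keep)   # promote scale m, suppress earlier peaks
        keep = keep * m            # accumulate suppression step by step
    return out

rng = np.random.default_rng(2)
feats = [rng.standard_normal((4, 8, 8)) for _ in range(3)]   # small/medium/large
atts = [rng.random((8, 8)) for _ in range(3)]
masks = [(a < a.max()).astype(float) for a in atts]

f_s2l = decouple(feats, atts, masks)                         # small -> large
f_l2s = decouple(feats[::-1], atts[::-1], masks[::-1])       # large -> small
F = np.concatenate(f_s2l + f_l2s, axis=0)                    # concat -> final F
```

Running the same routine in both scale orders and concatenating the results mirrors the "bidirectional" aspect: each direction suppresses a different set of already-claimed salient regions.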
S4, category label guidance: first generating the category features of the image and the text, then using the generated category features to guide the decoupled image features F and the text features T, and computing the category-related image features F_c and text features T_c by multiplication. This comprises the following steps:
S41, obtaining category semantic labels from the marine remote sensing images obtained in step S0, and obtaining the remote sensing image category features U by training a remote sensing image classifier;
S42, obtaining category semantic labels from the remote-sensing-related texts obtained in step S0, and obtaining the remote-sensing-related text category features V by training a remote-sensing-related text classifier;
both classifiers are pre-trained models whose prediction accuracy exceeds 80%; the rich semantic knowledge in the pre-trained models can be transferred into the subsequent training process, so the pre-trained models can be regarded as prior-knowledge supervision of the model.
S43, multiplying the decoupled image features F obtained in step S3 by the remote sensing image category features U to guide the retrieval network to detect important and reliable category-related information, and multiplying the text features T obtained in step S2 by the remote-sensing-related text category features V. The purpose is to perform attention enhancement on the decoupled image features F and the text features T with the category features U and V of the corresponding modalities, obtaining the final category-related image features F_c and text features T_c; by fully exploiting multiplication, the relevant features are significantly enhanced during feature combination.
F_c and T_c not only capture identifiable multi-scale semantic information but also highlight reliable category-related knowledge, thereby improving the accuracy of network retrieval. The multiplication of the decoupled image features F by the category features U (and likewise for the text) guides the image and text features with the classification prior knowledge, semantically decouples the knowledge of the pre-trained semantic features, and combines the decoupled semantic information with the original retrieval network to explore meaningful and reliable category-related data. In this way, category supervision is realized while the semantic information is fused and aligned with the scale information across the different modalities through the prior-knowledge guidance module.
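A minimal sketch of the multiplication-based category guidance in step S43; the feature dimension and the random stand-ins for the learned features are illustrative:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
d = 64
F = rng.standard_normal(d)   # decoupled image features (step S3)
T = rng.standard_normal(d)   # text features (step S2)
U = rng.random(d)            # image category features from the pretrained classifier
V = rng.random(d)            # text category features from the pretrained classifier

# The category features act as channel-wise attention: element-wise
# multiplication re-weights (enhances) the class-relevant dimensions.
F_c = F * U                  # category-related image features
T_c = T * V                  # category-related text features
sim = cosine(F_c, T_c)       # similarity used downstream in step S5
```

Dimensions where U (or V) is large are amplified and the rest are damped, which is how the multiplication "guides" each modality toward its category-relevant content.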
S5, calculating the similarity and the semantic-guided triplet loss:
first, the category-related image and text features F_c and T_c output in step S4 are matched by category, and whether the image and the text belong to the same category is judged, so as to raise the retrieval probability of same-category cross-modal data; the category attribute is input into the downstream task as external knowledge, and dynamic weight selection is performed on the heterogeneous information of matched image-text pairs. The semantic-guided triplet loss is then calculated, steps S1-S5 are iterated, and back-propagation training is performed.
Specifically, the category features are first converted by softmax into the semantic categories c_I and c_T of the image and the text; then a parameter λ is defined to adjust the loss, with λ taking a larger value when the predicted categories agree.
On the basis of the constant λ, the category-based triplet loss is designed to enlarge the distance between a sample and its negative samples while minimizing the semantic-space distance between the sample and its positive samples. Here α denotes the margin, S(I, T+) the similarity between a sample image and a positive text, S(I, T−) the similarity between the sample image and a negative text, S(T, I+) the similarity between a sample text and a positive image, and S(T, I−) the similarity between the sample text and a negative image. The first summation matches the image features against all text features (including the positive and negative text features), and the second summation matches the text features against all image features (including the positive and negative image features). The triplet loss constructed from the two summations aims to maximize the similarity with the positive samples and minimize the similarity with the negative samples.
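Since the loss equations themselves are images lost in extraction, the following is a hedged NumPy sketch consistent with the verbal description (hinge terms over both retrieval directions, a margin, and a λ-style boost when the categories of image and text agree); the constants `margin` and `boost` are illustrative:

```python
import numpy as np

def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def category_triplet_loss(img, txt, img_cls, txt_cls, margin=0.2, boost=1.2):
    """Triplet (hinge) loss over image->text and text->image directions.
    `boost` plays the role of lambda: the positive-pair similarity is
    weighted up when the predicted categories agree. The exact placement of
    lambda in the original patent may differ."""
    sims = cosine_sim(img, txt)    # pairwise similarities
    same = img_cls == txt_cls      # category agreement per paired sample
    n, loss = len(img), 0.0
    for i in range(n):
        lam = boost if same[i] else 1.0
        for j in range(n):
            if j == i:
                continue
            loss += max(0.0, margin - lam * sims[i, i] + sims[i, j])  # image -> text
            loss += max(0.0, margin - lam * sims[i, i] + sims[j, i])  # text -> image
    return loss / n

rng = np.random.default_rng(4)
img = rng.standard_normal((4, 32))
txt = img + 0.05 * rng.standard_normal((4, 32))   # near-matching positive pairs
cls = np.array([0, 1, 2, 3])
loss = category_triplet_loss(img, txt, cls, cls)
```

When the paired features nearly match, every hinge term is driven to zero; mismatched or category-disagreeing pairs leave positive hinge terms, which is the behavior the loss is meant to penalize.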
S6, inputting the marine remote sensing image to be retrieved and outputting the remote-sensing-related text data (or inputting the remote-sensing-related text data to be retrieved and outputting the marine remote sensing image).
Example 2
The category-guided bidirectional multi-scale decoupling marine remote sensing image text retrieval system comprises an input module, an image feature extraction module, a text feature extraction module, a bidirectional multi-scale decoupling module, a category label guide module, a semantic guide triple loss module and an output module.
The image feature extraction module comprises a convolutional neural network and an atrous spatial pyramid pooling module and is used for extracting multi-scale image features;
the text feature extraction module extracts text features using a word-vector embedding (sentence embedding) model and a Skip-Thought text processing model to obtain the text features T of the remote-sensing-related text;
the bidirectional multi-scale decoupling module decouples the multi-scale image features output by the image feature extraction module to obtain the decoupled features F;
the category label guidance module comprises a remote sensing image classifier and a remote-sensing-related text classifier, used respectively to obtain the remote sensing image category features U and the remote-sensing-related text category features V; the category semantic labels U and V, which are category features annotated by pre-trained models, guide the image and text as prior knowledge to construct category features and realize feature decoupling in the semantic dimension; the decoupled image features F and the text features T undergo attention enhancement with the category features U and V of the corresponding modalities, and the enhanced information can also be combined with the original retrieval network to fuse semantic and scale features, explore meaningful and reliable category-related data, and obtain the category-related image and text features;
the semantic-guided triplet loss module calculates the semantic-guided triplet loss: it matches the category features by category, judges whether the image and the text belong to the same category, inputs the category attribute into the downstream task as external knowledge, and performs dynamic weight selection on the heterogeneous information of matched image-text pairs;
the input module is used for inputting marine remote sensing images or remote sensing related text data to be retrieved, and the output module is used for outputting remote sensing related text data or marine remote sensing images.
The function implementation and data processing of each module are partially the same as those in embodiment 1, and are not described herein again.
It should be noted that the method of the present invention implements bidirectional cross-modal retrieval between images and texts, with one modality serving as the query to retrieve the other: when the input is a marine remote sensing image, the output retrieval result is the corresponding text data; when the input is marine remote sensing related text data, the output retrieval result is the corresponding marine remote sensing image.
In summary, the present invention uses category information as prior knowledge to guide a more accurate cross-modal information representation. Specifically, compared with existing methods, the bidirectional multi-scale decoupling module constructed in the invention adaptively extracts the potential features on each scale while suppressing the redundant features from the other scales, generating discriminative cues and alleviating the noise-redundancy problem of cascaded scale decoupling. In addition, a category label guide module and a semantic guide triple loss module are constructed. The category label guide module supervises the images and texts with category semantic labels as prior knowledge, constructing stronger category features and realizing feature decoupling in the semantic dimension; the decoupled semantic information is then combined with the original retrieval network, fusing semantic and scale features and exploring meaningful and reliable category-related data. The semantic guide triple loss module performs category matching on the category features, judges whether the image and the text belong to the same category, inputs the category attribute into the downstream task as external knowledge, and dynamically weights the heterogeneous information in cross-modal image-text matching, which raises the retrieval probability of same-category cross-modal data and speeds up model convergence. Finally, by category-matching the generated category features, a category-based triplet loss is designed to further improve the retrieval probability of same-category cross-modal data.
It is understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art should understand that they can make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.
Claims (3)
1. The method for retrieving marine remote sensing image text based on category-guided multi-scale decoupling, characterized by comprising the following steps:
S0, obtaining a marine remote sensing image and a remote sensing related text;
S1, extracting image features of the marine remote sensing image: first, a convolutional neural network is used to embed the image features; the obtained basic image features are then sampled by atrous (dilated) convolutions with different sampling rates to obtain image features at different scales;
S2, extracting text features of the remote sensing related text to obtain the text features T;
S3, bidirectional multi-scale decoupling: decoupling the image features of different scales obtained in step S1, extracting the corresponding potential features on each scale while suppressing the redundant features from the other scales, to obtain the decoupled image features F;
Step S3 is divided into two steps:
S31, for the image features of each scale extracted by the image feature extraction module, constructing an attention map on the current scale based on an attention mechanism to extract the potential features, and generating a suppression mask;
S32, for the attention maps and suppression masks extracted on the different feature scales: the attention map promotes the salient information on its corresponding scale, while the suppression mask suppresses the salient features of the other scales, yielding image features with the redundant information filtered out and thus realizing scale decoupling; through a progressive suppression scheme, the attention maps are applied to the generation of the decoupled features in both scale directions; finally, the decoupled features of all feature scales are merged by a concat operation into the final decoupled image features F;
In step S32, the decoupled features are obtained by cascading the attention maps and suppression masks across scales, where m indexes the different scales, namely the three scales large, medium and small; this cascade yields two decoupled features, one in the small-to-large scale direction and one in the large-to-small scale direction;
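The cascade of attention maps and suppression masks in steps S31-S32 can be sketched as a toy 1-D version (under assumptions of mine: sigmoid-based attention, `1 - attention` as the suppression mask, and elementwise progressive masking; the patent's exact operators are not reproduced):

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def attention_map(feat):
    # Attention map: promotes salient positions on the current scale.
    return [sigmoid(v) for v in feat]

def decouple(feats):
    """Progressive suppression along one scale direction: each scale keeps
    its attention-weighted features, masked by the suppression masks
    (1 - attention) of the scales already processed before it."""
    out, carry = [], [1.0] * len(feats[0])
    for f in feats:
        a = attention_map(f)
        out.append([v * av * c for v, av, c in zip(f, a, carry)])
        carry = [c * (1.0 - av) for c, av in zip(carry, a)]
    return out

# Three scales (small, medium, large), same length after resampling.
feats = [[0.2, 1.5, -0.3], [1.0, 0.1, 0.8], [-0.5, 0.9, 2.0]]
small_to_large = decouple(feats)        # decoupled features, small-to-large
large_to_small = decouple(feats[::-1])  # decoupled features, large-to-small
# concat of both directions gives the final decoupled feature F
F = sum(small_to_large, []) + sum(large_to_small, [])
```

Each direction sees a different suppression order, which is what makes the decoupling bidirectional before the final concat.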
Step S4, category label guidance: first generating the category features of the image and the text, then using the generated category features to guide the decoupled image features F and the text features T, with multiplication used to compute the final category-related image features and text features;
Step S4 is specifically as follows:
S41, obtaining category semantic labels for the marine remote sensing images obtained in step S0, and obtaining the category features U of the remote sensing images through the training of a remote sensing image classifier;
S42, obtaining category semantic labels for the remote sensing related texts obtained in step S0, and obtaining the category features V of the remote sensing related texts through the training of a remote sensing related text classifier;
S43, multiplying the decoupled image features F obtained in step S3 by the remote sensing image category features U, and multiplying the text features T obtained in step S2 by the remote sensing related text category features V; the purpose is to attention-enhance the decoupled image features F and the text features T with the category features U and V of their respective modalities, obtaining the final category-related image features and text features;
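Step S43's multiplication can be sketched in a few lines (illustrative only: the feature and class vectors below are made-up numbers, and the enhancement is assumed to be an elementwise product):

```python
def class_guided_enhance(features, class_features):
    """Attention enhancement of step S43 (sketch): modality features are
    multiplied elementwise by the class features U or V of the same
    modality, amplifying the category-relevant dimensions."""
    return [f * c for f, c in zip(features, class_features)]

F_img = [0.5, 1.2, -0.3, 0.8]   # decoupled image features F (toy values)
U     = [0.9, 0.1, 0.0, 1.0]    # image class features U (toy values)
T_txt = [1.0, 0.4, 0.7, 0.2]    # text features T (toy values)
V     = [0.8, 0.2, 0.1, 0.9]    # text class features V (toy values)

F_cls = class_guided_enhance(F_img, U)  # category-related image features
T_cls = class_guided_enhance(T_txt, V)  # category-related text features
```

Dimensions that the classifier deems irrelevant to the category (near-zero entries of U or V) are damped, which is the "attention enhancement" effect the step describes.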
S5, calculating similarity and semantic guide triple loss:
first, the category-related image features and text features output in step S4 are category-matched to judge whether the image and the text belong to the same category; the category attribute is input into the downstream task as external knowledge, and dynamic weight selection is performed on the heterogeneous information in cross-modal image-text matching; then the semantic-guided triplet loss is calculated, steps S1-S5 are iterated, and back-propagation training is carried out;
in step S5, the category features are first converted by softmax into the semantic categories of the image and of the text; a loss-adjustment parameter is then defined from these semantic categories. With the margin held constant, the category-based triplet loss is designed on this basis as a sum of two hinge terms, in which the margin denotes the edge distance and the four similarity terms denote, respectively, the similarity of the sample image to the positive sample text, of the sample image to the negative sample text, of the sample text to the positive sample image, and of the sample text to the negative sample image. The first summation runs over the matching of an image feature with all text features, including the positive sample text features and the negative sample text features; the second summation runs over the matching of a text feature with all image features, including the positive sample image features and the negative sample image features. The triplet loss constructed from these two summations aims to maximize the similarity to the positive samples and minimize the similarity to the negative samples;
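The category-matched weighting and the hinge structure described above can be sketched as follows (one direction shown; the similarity measure, the two weight values, and the way the category match selects them are assumptions for illustration, not the patent's exact formula):

```python
from math import exp

def softmax(xs):
    m = max(xs)
    es = [exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den

def semantic_triplet_loss(img, txt_pos, txt_neg, cls_img, cls_txt,
                          margin=0.2, lam_same=1.0, lam_diff=2.0):
    """Category-guided triplet loss, image-to-text direction (sketch).
    softmax turns the category features into semantic categories; the
    weight lam performs a simple 'dynamic weight selection': negatives
    from a different category are pushed away harder."""
    c_img = softmax(cls_img)
    c_txt = softmax(cls_txt)
    same_class = c_img.index(max(c_img)) == c_txt.index(max(c_txt))
    lam = lam_same if same_class else lam_diff
    hinge = max(0.0, margin + cosine(img, txt_neg) - cosine(img, txt_pos))
    return lam * hinge
```

The text-to-image direction is symmetric; summing both directions over the batch gives the bidirectional loss of step S5.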
S6, inputting a marine remote sensing image to be retrieved and outputting the corresponding remote sensing related text data; or inputting remote sensing related text data to be retrieved and outputting the corresponding marine remote sensing image.
2. The category-guided multi-scale decoupled marine remote sensing image text retrieval method according to claim 1, characterized in that the specific steps of step S31 are: first, the channel information of the feature is aggregated through average pooling and max pooling operations to generate two feature descriptors; the attention map is then generated from these feature descriptors by a standard convolution layer and a sigmoid function.
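The attention-map construction of claim 2 can be sketched as a toy version (the per-position pooling over channels follows the claim, but the "standard convolution" is replaced here by fixed 1x1 stand-in weights, which is an assumption for illustration):

```python
from math import exp

def spatial_attention(feature_map):
    """Claim-2-style attention map (sketch): aggregate channel info at
    each spatial position with average and max pooling, combine the two
    descriptors with stand-in convolution weights, and squash with a
    sigmoid to obtain one attention weight per position."""
    w_avg, w_max, bias = 1.0, 1.0, 0.0  # fixed stand-in conv weights
    attn = []
    for channels in feature_map:        # channel vector at one position
        avg_desc = sum(channels) / len(channels)  # average-pooling descriptor
        max_desc = max(channels)                  # max-pooling descriptor
        z = w_avg * avg_desc + w_max * max_desc + bias
        attn.append(1.0 / (1.0 + exp(-z)))        # sigmoid
    return attn

fmap = [[0.0, 0.0], [1.0, 3.0], [-2.0, -2.0]]  # 3 positions, 2 channels
A = spatial_attention(fmap)  # attention map: one weight per position
```

In the patent's 2-D setting the two descriptors form a two-channel map that a learned convolution reduces to one channel; the fixed weights above merely stand in for that layer.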
3. The marine remote sensing image text retrieval system based on category-guided multi-scale decoupling, characterized in that it implements the category-guided multi-scale decoupling marine remote sensing image text retrieval method, and comprises an input module, an image feature extraction module, a text feature extraction module, a bidirectional multi-scale decoupling module, a category label guide module, a semantic guide triple loss module and an output module;
the image feature extraction module comprises a convolutional neural network and an atrous (dilated) spatial convolution pooling module and is used for extracting multi-scale image features;
the text feature extraction module extracts text features to obtain the text features T of the remote sensing related text;
the bidirectional multi-scale decoupling module decouples the multi-scale image features output by the image feature extraction module to obtain the decoupled image features F;
the category label guiding module comprises a remote sensing image classifier and a remote sensing related text classifier, used respectively to obtain the category features U of the remote sensing images and the category features V of the remote sensing related texts; the category semantic labels U and V, i.e. category features labeled through a pre-training model, serve as prior knowledge to guide the images and texts, constructing category features and realizing feature decoupling in the semantic dimension; the decoupled image features F and the text features T are attention-enhanced by the category features U and V of their respective modalities, obtaining the category-related image and text features;
the semantic guide triple loss module is used for calculating the semantic-guided triplet loss: it performs category matching on the category features, judges whether the image and the text belong to the same category, inputs the category attribute into the downstream task as external knowledge, and performs dynamic weight selection on the heterogeneous information in cross-modal image-text matching;
the input module is used for inputting a marine remote sensing image or remote sensing related text data to be retrieved, and the output module is used for outputting the remote sensing related text data or the marine remote sensing image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211223823.1A CN115311463B (en) | 2022-10-09 | 2022-10-09 | Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115311463A CN115311463A (en) | 2022-11-08 |
CN115311463B true CN115311463B (en) | 2023-02-03 |
Family
ID=83866005
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||