CN115311463B

CN115311463B - Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system

Info

Publication number: CN115311463B
Application number: CN202211223823.1A
Authority: CN
Inventors: 魏志强; 郑程予; 宋宁; 赵恩源; 聂婕; 刘安安; 宋丹; 李文辉; 孙正雅; 张文生
Original assignee: Ocean University of China
Current assignee: Ocean University of China
Priority date: 2022-10-09
Filing date: 2022-10-09
Publication date: 2023-02-03
Anticipated expiration: 2042-10-09
Also published as: CN115311463A

Abstract

The invention belongs to the technical field of remote sensing image processing, and discloses a method and a system for searching a marine remote sensing image text with category-guided multi-scale decoupling, wherein image features of different scales of a marine remote sensing image and text features of a remote sensing related text are extracted; then, decoupling the obtained image features of different scales by using a bidirectional multi-scale decoupling module, extracting corresponding potential features on each scale, inhibiting complex features on other scales and obtaining decoupling features; guiding the decoupled image features and text features by using the class label guiding module, and calculating final class-related image and text features by using multiplication; and finally, calculating the similarity and the semantic guide triple loss. The invention realizes multi-scale decoupling, introduces effective information for decoupling, establishes a scale and semantic double-decoupling marine multi-modal information fusion method, and solves the problems of multi-scale dimension noise redundancy and difficult multi-dimension decoupling representation information fusion.

Description

Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system

Technical Field

The invention belongs to the technical field of remote sensing image processing, and particularly relates to a method and a system for category-guided multi-scale decoupling text retrieval of ocean remote sensing images.

Background

The ocean remote sensing image text retrieval is an important method for solving the problems of text data deletion and inaccurate text data description in remote sensing data. The ocean remote sensing image text retrieval utilizes a cross-modal retrieval algorithm to analyze a large number of satellite remote sensing images and automatically retrieve a large number of text data accurately describing the images so as to achieve the purposes of solving text data loss and text data inaccurate description. The traditional method mainly faces the problem that the effective image features are difficult to extract, because the space distribution of targets in the ocean remote sensing image is dispersed, and the effective targets in the image are few, the information of the effective targets can be diluted in the fusion process of the global information, and the subsequent data mining is influenced. Therefore, the text retrieval method of the ocean remote sensing image at the leading edge introduces a multi-scale feature extraction and attention mechanism, yuan et al propose a novel fine-grained multi-modal feature matching network, and the method has the advantages that image features under different scales are obtained and key features are extracted, so that more accurate text information is retrieved.

However, the prior method has the following problems: first, a large amount of redundant noise is generated during multi-scale feature interaction. The multi-scale features often comprise repeated regions, when the multi-scale features are fused through addition or cascade, the repeated regions are accumulated continuously, the utilization rate of multi-scale contents is low, a redundant feature filtering algorithm used by the existing method is simple, a large amount of noise cannot be filtered, and the redundant noise can influence subsequent data fusion and mining. For example, the existing method uses a gating idea to filter redundant features, and the method cannot effectively filter a large amount of noise and has the possibility of filtering effective information. Secondly, the existing method usually performs knowledge decoupling based on multi-scale features of the image, and ignores the disambiguation effect of image semantic information and text semantic information in image-text retrieval. On the aspect of the text retrieval of the ocean remote sensing image, only the characteristic decoupling on the dimension is considered, but the waste of the value of rich semantic information is avoided, and the time and difficulty for extracting effective key characteristics from the model are increased due to the lack of value information. The low-order semantic information of the image is the expression of shallow features (such as features of color, geometry, texture and the like), the semantic information of the text can be understood as information related to category division, and the introduction of the image-text semantic information can express the information of texture, geometry, color and the like in the image content, and can also express text description and text type information. The semantic information expressed by the pictures and texts can lead the network back end to correctly predict the category attribution.

Therefore, aiming at the problems, the invention provides a category-guided bidirectional multi-scale decoupling network, which realizes multi-scale decoupling and introduces effective category information (image-text semantic information) for decoupling. A scale and semantic double-decoupling marine multi-modal information fusion framework is established, and the problems of noise redundancy of multi-scale dimensions and difficulty in information fusion of multi-dimensional decoupling representation are solved.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a method and a system for category-guided multi-scale decoupling marine remote sensing image text retrieval, decoupling characteristics on different scales are obtained through bidirectional multi-scale decoupling, and the category characteristics of images and texts are guided and decoupled by category labels, so that the problems of noise redundancy of multi-scale dimensions and difficulty in fusion of multi-dimensional decoupling characteristic information are solved.

In order to solve the technical problems, the invention adopts the following technical scheme:

firstly, the invention provides a category-guided multi-scale decoupled marine remote sensing image text retrieval method, which comprises the following steps:

s0, obtaining a marine remote sensing image and a remote sensing related text;

s1, extracting image characteristics of the ocean remote sensing image: firstly, a convolution neural network is used for embedding the characteristics of an image, the obtained basic characteristics of the image are sampled by cavity convolution with different sampling rates, and the image characteristics with different scales are obtained

；

S2, extracting text features of the remote sensing related textT；

S3, bidirectional multi-scale decoupling: decoupling the image features of different scales obtained in the step S1, extracting corresponding potential features on each scale, inhibiting fussy features on other scales, and obtaining the decoupling features of the imageF；

Step S4, guiding by category labels: firstly, generating class characteristics of the image and the text, and then guiding decoupling characteristics of the image by using the generated class characteristicsFAnd text featuresTUsing multiplication to calculate the final class-dependent image features

And text features

；

S5, calculating similarity and semantic guide triple loss:

firstly, the image characteristics related to the categories output in the step S4

And text features

Performing category matching, judging whether the image and the text belong to the same category, inputting category attributes as external knowledge into a downstream task, and performing dynamic weight selection on heterogeneous information matched with heterogeneous graphics and texts; then calculating the loss of the semantic guide triple, iterating the steps S1-S5, and carrying out back propagation training;

s6, inputting a marine remote sensing image to be retrieved and outputting remote sensing related text data; or inputting remote sensing related text data to be retrieved and outputting the ocean remote sensing image.

Further, step S3 is divided into two steps:

s31, extracting image features of each scale from the image feature extraction module

Constructing an attention map based on an attention mechanism on the current scale

Extracting potential features; and generating a suppression mask

；

S32, aiming at attention diagrams extracted under different feature scales

And suppression mask

By passing

To facilitate significant information on the corresponding scale,

used for suppressing the salient features of other scales and obtainedThe image characteristics after filtering redundant information realize scale decoupling, and the attention is tried out in a step-by-step suppression mode

Application to decoupling features

And

in the production process of, wherein

Is a decoupling feature in the small-to-large dimension direction,

is a decoupling feature in the large to small dimension direction; finally, decoupling characteristics of various characteristic scales through concat operation

And

the decoupling characteristic of the synthesized final imageF。

Further, the calculation formula of the decoupling characteristic is as follows:

wherein m is a number of different scales, namely three scales of large, medium and small, and an attention map

And suppression mask

Deriving decoupling characteristics by arithmetic concatenation

And

。

further, step S4 is specifically as follows:

s41, obtaining category semantic labels from the ocean remote sensing images obtained in the step S0, and obtaining the category characteristics of the remote sensing images through training of a remote sensing image classifierU；

S42, obtaining category semantic labels from the remote sensing related texts obtained in the step S0, and obtaining the category characteristics of the remote sensing related texts through training of a remote sensing related text classifierV，

S43, decoupling characteristics of the image obtained in the step S3FAnd remote sensing image category characteristicsUMultiplying the text characteristics obtained in the step S2TAnd remote sensing related text category featuresVMultiplication, the purpose of which is to decouple features of the imageFText features with related textTClass characteristics respectively corresponding to corresponding modalitiesU&VAttention enhancement is performed to obtain final class-related image features

Text features related to categories

。

Further, the step S31 specifically includes: firstly aggregating channel information of a feature through average pooling and maximum pooling operations to generate two feature descriptors, and then generating an attention map through the feature descriptors by a standard convolution layer and sigmoid function

；

Generation of a suppression mask by binary masking

；

Wherein

Is a binary mask that will be most significant

The value of (b) is 0, and the others are 1.

Further, in step S5, first, the category features are converted into semantic categories of images and texts by softmax

And

(ii) a Then, a parameter is defined

To adjust the loss and parameters

Expressed as:

at a constant value, at a constant value

Based on the above, the triple loss based on the category is designed as follows:

wherein

The distance between the finger edges is equal to the distance between the finger edges,

representing the similarity of the sample image and the positive sample text;

representing the similarity of the sample image and the negative sample text;

representing the similarity of the sample text and the positive sample image;

representing the similarity of the sample text and the negative sample image; the first summation being for image features

Matching with all text features, including the text features of the positive sample

And text features of negative examples

Second summation being over text features

Matching with all image features, including image features of positive samples

And image characteristics of negative examples

(ii) a The objective of the triple loss function constructed by two summations is to maximizeThe similarity with the positive samples is minimized.

The invention also provides a category-guided multi-scale decoupling marine remote sensing image text retrieval system, which is used for realizing the category-guided multi-scale decoupling marine remote sensing image text retrieval method, and comprises an input module, an image feature extraction module, a text feature extraction module, a bidirectional multi-scale decoupling module, a category label guide module, a semantic guide triple loss module and an output module;

the image feature extraction module comprises a depth residual error network and a cavity space convolution pooling pyramid and is used for extracting multi-scale image features

，

The text feature extraction module extracts text features to obtain text features of the remote sensing related textT；

The bidirectional multi-scale decoupling module is used for extracting the multi-scale image features output by the image feature extraction module

Decoupling is carried out to obtain decoupling characteristicsF；

The category label guiding module comprises a remote sensing image classifier and a remote sensing related text classifier which are respectively used for obtaining the category characteristics of the remote sensing imageUAnd remote sensing related text category featuresV(ii) a Utilizing category semantic tagsU&VGuiding the image and the text as priori knowledge to construct class features and realize feature decoupling on semantic dimensions; wherein U is&V, class characteristics marked by a pre-training model; decoupling features of imagesFText features with related textTClass characteristics respectively corresponding to corresponding modalitiesU&VPerforming attention enhancement to obtain image and text characteristics related to categories;

the semantic guide triple loss module is used for calculating the semantic guide triple loss; performing category matching on the category characteristics, judging whether the image and the text belong to the same category, inputting the category attribute serving as external knowledge into a downstream task, and performing dynamic weight selection on heterogeneous information matched with heterogeneous graphics and texts;

the input module is used for inputting a marine remote sensing image or remote sensing related text data to be retrieved, and the output module is used for outputting the remote sensing related text data or the marine remote sensing image.

Compared with the prior art, the invention has the advantages that:

(1) The problem of noise redundancy is solved. The invention effectively filters a large amount of redundant noise generated in the multi-scale feature interaction process. A bidirectional multi-scale decoupling module is constructed, potential features of each scale are extracted in a bidirectional mode in a self-adaptive mode, and tedious features of other scales are suppressed, so that effective features of each scale are extracted, redundant features of each scale are suppressed, a large amount of redundant noise is filtered, and effective features are extracted.

(2) The introduction of category information (semantic information) improves the robustness of the features. The invention unifies the semantic decoupling of the two dimensions. And a category label guide module is constructed, and category semantic labels are used as priori knowledge to monitor images and texts so as to construct more excellent category characteristics and realize characteristic decoupling on semantic dimensions. The category semantic features can emphasize effective features, and the knowledge of semantic decoupling is mapped into a visual multi-scale sample space through cascade connection. The category attribute serves as a bridge of two kinds of modal information, and external knowledge is provided for the model while multi-modal knowledge is aligned, so that the model is helped to quickly extract effective features, and effective objects in the remote sensing image are excavated. Meanwhile, the expressions of category information, pixel attribution and scale characteristics can also be generated by the alignment and fusion of the image multi-scale characteristics, effective information (text semantic characteristics) and image semantic characteristics, and the semantic information expressed by the pictures and texts can make the network rear end make correct prediction on the category attribution.

(3) The problems of difficult extraction of effective features and low retrieval accuracy are solved by using the priori knowledge. The invention constructs a semantic guide triple loss module to perform category matching on category characteristics, judges whether an image and a text belong to the same category, inputs category attributes as external knowledge into a downstream task, and performs dynamic weight selection on heterogeneous information matched with heterogeneous images and texts. For example, a remote sensing image classification model and a remote sensing text classification model with high accuracy are trained as prior knowledge and added into a loss function, if the categories of the images and the texts are the same, the similarity is increased, so that the model convergence time is greatly shortened, and the matching probability of the images and the texts with the same categories is higher than the unmatching probability. So that the retrieval accuracy of the model is greatly increased.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a system architecture diagram of the present invention;

FIG. 2 is a flow chart of the method of the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific embodiments.

Example 1

With reference to fig. 1 and 2, a category-guided bidirectional multi-scale decoupled marine remote sensing image text retrieval method firstly preprocesses data, including processing a marine remote sensing image, and then extracts text features T from the preprocessed data through a text feature extraction module on the one hand, and extracts decoupled image features F through bidirectional multi-scale decoupling on the other hand; then inputting the decoupled image characteristic F and text characteristic T into a category label guide module, and utilizing a category semantic label (F)U&V) Monitoring images and texts as prior knowledge to construct class features and realize feature decoupling on semantic dimensions; and finally, calculating the semantic guide triple loss through the similarity of the image and the text, judging whether the image and the text are the same, and performing reverse propagation.

The method specifically comprises the following steps:

and S0, obtaining the ocean remote sensing image and the remote sensing related text.

. A characterization of the image is obtained by this step.

S2, extracting text features of the remote sensing related textT. In a specific application, text feature extraction can be selected by using a word vector embedding model (sentence embedding) and a Skip-through text processing model. The representation of the text is obtained by this step.

S3, bidirectional multi-scale decoupling: decoupling the image features of different scales obtained in the step S1, extracting corresponding potential features on each scale, inhibiting fussy features on other scales, and obtaining the decoupling features of the imageF. The method comprises the following two steps:

Constructing an attention map based on attention mechanism at the current scale

Extracting potential features; and generating a suppression mask

。

The method comprises the following steps: firstly aggregating channel information of a feature through average pooling and maximum pooling operations to generate two feature descriptors, and then generating an attention map through the feature descriptors by a standard convolution layer and sigmoid function

；

Generation of a suppression mask by binary masking

；

Wherein

Is a binary mask that will be most significant

The value of (b) is taken as 0, and the others are taken as 1; inhibition mask alleviates

The coverage effect on other scales makes the common finger information of different scales stand out.

S32, aiming at attention diagrams extracted under different feature scales

And suppression mask

By passing

To facilitate significant information on the corresponding scale,

the method is used for suppressing the salient features of other scales, obtaining the image features after filtering redundant information to realize scale decoupling, and drawing attention in a gradual suppression mode

Application to decoupling features

And

in the generation process of (1); finally, decoupling characteristics of various characteristic scales through concat operation

And

the decoupling characteristic of the synthesized final imageFThe formula is as follows:

wherein m is the number of different scales, namely three scales of large, medium and small, and the attention is sought

And suppression mask

Deriving decoupling characteristics by operational cascading

And

wherein

Is a decoupling feature in the small to large dimension direction,

is a decoupling feature in the large scale to small scale direction.

In particular, since the attention map represents significant regions of a feature, the suppression mask leverages the attention map representation to suppress significance information on the corresponding scale. The suppression mask mitigation attention seeks to show the effect of the overlay on other scales, highlighting different information.

Step S4, guiding by category labels: firstly, generating class characteristics of the image and the text, and then guiding decoupling characteristics of the image by using the generated class characteristicsFAnd text featuresTMultiplying the resulting class-dependent image and text features

And

the method comprises the following steps:

S42, obtaining category semantic labels from the remote sensing related texts obtained in the step S0, and obtaining the category characteristics of the remote sensing related texts through training of a remote sensing related text classifierV；

The two classifiers are pre-training models, the prediction accuracy rate of the two classifiers reaches over 80 percent, rich semantic knowledge in the pre-training models can be transferred to a subsequent training process, and the pre-training models can be regarded as prior knowledge supervision of the models.

S43, decoupling characteristics of the image obtained in the step S3FAnd remote sensing image category characteristicsUMultiplying to guide the retrieval network to detect important and reliable category related information; the text characteristics obtained in the step S2 are processedTAnd remote sensing related text category featuresVMultiplication, the purpose of which is to decouple the features of the imageFText features with related textTClass characteristics respectively corresponding to corresponding modalitiesU&VAttention enhancement is carried out to obtain final image characteristics related to categories

And text features

By making full use of multiplication, significant enhancement of the correlation features can be achieved in the feature combination process.

And

the method not only captures identifiable multi-scale semantic information, but also highlights reliable knowledge related to categories, thereby improving the accuracy of network retrieval. To guide the retrieval network to probe important and reliable category-related information. Wherein the decoupling characteristic of the imageFAnd remote sensing image category characteristicsUThe image feature and the text feature are guided by using the classified prior knowledge of the image and the text, the knowledge of the pre-trained semantic feature is subjected to semantic decoupling, and the decoupled semantic information is combined with an original retrieval network to explore meaningful and reliable category related data, so that while category supervision is realized, the semantic information is fused and aligned with the scale information on different modal information through a prior knowledge guide module; the formula is as follows:

。

s5, calculating similarity and semantic guide triple loss:

firstly, the image and text characteristics related to the category output in the step S4

And

performing category matching, and judging whether the image and the text belong to the same category so as to improve the retrieval probability of the cross-modal data of the same category; and attribute the categoryThe image data is input into a downstream task as external knowledge, and dynamic weight selection is carried out on heterogeneous information matched with heterogeneous graphics and texts so as to improve the retrieval probability of the same-class cross-modal data; and then calculating the loss of the semantic guide triple, iterating the steps S1-S5, and carrying out back propagation training.

First, class features are converted to semantic classes of images and text by softmax

And

(ii) a Then, a parameter is defined

To adjust the loss and parameters

Expressed as:

at a constant value, at a constant value

the purpose of the triple-loss function is to increase the distance between a sample and the corresponding negative sample while minimizing the semantic spatial distance between the sample and the positive sample. Wherein

representing the similarity of the sample image and the positive sample text;

representing the similarity of the sample image and the negative sample text;

representing the similarity of the sample text and the positive sample image;

Matching with all text features (including text features of positive samples)

And text features of negative examples

) The second summation being for text features

Matching with all image features (including image features of positive samples)

And image characteristics of negative examples

). The purpose of the triplet loss function constructed by the two summations is to maximize the similarity with the positive samples and minimize the similarity with the negative samples.

And S6, inputting the ocean remote sensing image to be retrieved and outputting remote sensing related text data. (or inputting remote sensing related text data to be retrieved and outputting ocean remote sensing images).

Example 2

The category-guided bidirectional multi-scale decoupling marine remote sensing image text retrieval system comprises an input module, an image feature extraction module, a text feature extraction module, a bidirectional multi-scale decoupling module, a category label guide module, a semantic guide triple loss module and an output module.

The image feature extraction module comprises a convolution neural network and a void space convolution pooling module and is used for extracting multi-scale image features

，

The text feature extraction module is used for extracting text features by utilizing a word vector embedding (sentence embedding) model and a Skip-through text processing model to obtain the text features of the remote sensing related textsT；

The bidirectional multi-scale decoupling module is used for extracting multi-scale image features output by the image feature extraction module

Decoupling is carried out to obtain decoupling characteristicsF；

The category label guiding module comprises a remote sensing image classifier and a remote sensing related text classifier which are respectively used for obtaining the category characteristics of the remote sensing imageUText category features related to remote sensingV(ii) a Utilizing category semantic tagsU&VGuiding images and texts as priori knowledge to construct class features and realize feature decoupling on semantic dimensions; wherein U is&V, class characteristics marked by a pre-training model; decoupling features of imagesFText features with related textTClass characteristics U of respective corresponding modalities&V carries out attention enhancement, and can also combine enhancement information with an original retrieval network to realize the fusion of semantics and scale characteristics so as to explore meaningful and reliable category-related data and acquire category-related imagesAnd text features;

the semantic guide triple loss module is used for calculating the semantic guide triple loss; performing category matching on the category characteristics, judging whether the image and the text belong to the same category, inputting the category attribute as external knowledge into a downstream task, and performing dynamic weight selection on heterogeneous information matched with heterogeneous graphics and texts;

the input module is used for inputting marine remote sensing images or remote sensing related text data to be retrieved, and the output module is used for outputting remote sensing related text data or marine remote sensing images.

The function implementation and data processing of each module are partially the same as those in embodiment 1, and are not described herein again.

It should be noted that the method of the present invention can implement two-mode cross-mode retrieval of images and texts, one type of data is used as a query to retrieve the other type of data, when the data is input as an ocean remote sensing image, the output retrieval result is corresponding text data, and when the data is input as ocean remote sensing related text data, the output retrieval result is corresponding ocean remote sensing image.

In summary, the present invention can use the category information as the prior knowledge to guide more accurate cross-modal information representation. Specifically, compared with the existing method, the bidirectional multi-scale decoupling module is constructed in the invention to adaptively extract potential features and inhibit fussy features on other scales, so that discriminative clues are generated and the problem of noise redundancy of cascade scale decoupling is solved. In addition, a category label guide module and a semantic guide triple loss module are constructed, wherein the category label guide module monitors images and texts by using category semantic labels as priori knowledge to construct more excellent category characteristics and realize characteristic decoupling on semantic dimensions. Then, the decoupled semantic information is combined with an original retrieval network, so that the fusion of semantic and scale characteristics is realized, and meaningful and reliable category related data are explored; and the semantic guide triple loss module performs category matching on the category characteristics, judges whether the image and the text belong to the same category, inputs the category attribute as external knowledge into a downstream task, and performs dynamic weight selection on the heterogeneous information matched with the heterogeneous images and texts so as to improve the retrieval probability of the same-category cross-modal data and improve the retrieval probability and the model convergence speed of the same-category cross-modal data. And finally, by carrying out category matching on the generated category characteristics, a category-based triple loss is designed so as to improve the retrieval probability of the similar cross-modal data.

It is understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art should understand that they can make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.

Claims

1. The method for searching the marine remote sensing image text based on category-guided multi-scale decoupling is characterized by comprising the following steps of:

s0, obtaining a marine remote sensing image and a remote sensing related text;

s1, extracting image characteristics of the ocean remote sensing image: firstly, a convolution neural network is used for embedding the characteristics of the image, the obtained basic characteristics of the image are sampled by cavity convolution with different sampling rates, and the image characteristics with different scales are obtained

；

S2, extracting text features of the remote sensing related textT；

Step S3 is divided into two steps:

Based on attention on the current scaleMechanism build attention diagrams

Extracting potential features; and generating a suppression mask

；

S32, aiming at attention diagrams extracted under different feature scales

And suppression mask

By passing

To facilitate significant information on the corresponding scale,

the method is used for suppressing salient features of other scales, obtaining image features after redundant information is filtered to achieve scale decoupling, and performing attention drawing through a gradual suppression mode

Application to decoupling features

And

And

decoupling features of synthesized final imagesF；

In step S32, the calculation formula of the decoupling characteristic is as follows:

And suppression mask

Deriving decoupling characteristics by operational cascading

And

wherein

Is a decoupling feature in the small to large dimension direction,

is a decoupling feature in the large to small dimension direction;

And text features

；

Step S4 is specifically as follows:

s41, obtaining category semantic labels from the ocean remote sensing images obtained in the step S0, and obtaining category characteristics of the remote sensing images through training of a remote sensing image classifierU；

S43, decoupling characteristics of the image obtained in the step S3FAnd remote sensing image category characteristicsUMultiplying the text characteristics obtained in the step S2TAnd remote sensing related text category featuresVMultiplication, the purpose of which is to decouple the features of the imageFText features with related textTClass characteristics respectively corresponding to corresponding modalitiesU&VAttention enhancement is carried out to obtain final image characteristics related to categories

And text features

；

S5, calculating similarity and semantic guide triple loss:

firstly, the image characteristics related to the category output in step S4

And text features

step (ii) ofIn S5, firstly, the category characteristics are converted into semantic categories of images and texts through softmax

And

(ii) a Then, a parameter is defined

To adjust the loss and parameters

Expressed as:

at a constant value, at a constant value

Based on the above, the category-based triple loss is designed as follows:

wherein

representing the similarity of the sample image and the positive sample text;

representing the similarity of the sample image and the negative sample text;

representing the similarity of the sample text and the positive sample image;

representing the similarity of the sample text and the negative sample image; the first summation being over image features

And text features of negative examples

The second summation being for text features

Matching with all image features, including image features of positive samples

And image characteristics of negative examples

(ii) a The purpose of the triple loss function constructed by the two summations is to maximize the similarity between the triple loss function and the positive sample and minimize the similarity between the triple loss function and the negative sample;

2. The category-guided multi-scale decoupled marine remote sensing image text retrieval method according to claim 1, characterized in that the specific steps of step S31 are: firstly aggregating channel information of a feature through average pooling and maximum pooling operations to generate two feature descriptors, and then generating an attention map through the feature descriptors by a standard convolution layer and sigmoid function

；

Generation of a suppression mask by binary masking

；

Wherein

Is a binary mask that will be most significant

The value of (b) is 0, and the others are 1.

3. The marine remote sensing image text retrieval system based on category-guided multi-scale decoupling is characterized in that the marine remote sensing image text retrieval method based on category-guided multi-scale decoupling is used for achieving the category-guided multi-scale decoupling, and comprises an input module, an image feature extraction module, a text feature extraction module, a bidirectional multi-scale decoupling module, a category label guide module, a semantic guide triple loss module and an output module;

the image feature extraction module comprises a convolution neural network and a cavity space convolution pooling module and is used for extracting multi-scale image features

，

Decoupling is carried out to obtain decoupling characteristicsF；

The class label guiding module comprises a remote sensing image classifier and a remote sensing related text classifier which are respectively used for obtaining class characteristics of the remote sensing imageUAnd remote sensing related text category featuresV(ii) a Utilizing category semantic tagsU&VGuiding images and texts as priori knowledge to construct class features and realize feature decoupling on semantic dimensions; whereinU&VClass features labeled through a pre-training model; decoupling features of imagesFText features with related textTClass characteristics respectively corresponding to corresponding modalitiesU& VPerforming attention enhancement to obtain image and text characteristics related to categories;