CN115311463B - Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system - Google Patents


Info

Publication number
CN115311463B
Authority
CN
China
Prior art keywords
image
text
features
remote sensing
decoupling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211223823.1A
Other languages
Chinese (zh)
Other versions
CN115311463A (en)
Inventor
魏志强
郑程予
宋宁
赵恩源
聂婕
刘安安
宋丹
李文辉
孙正雅
张文生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China
Priority to CN202211223823.1A
Publication of CN115311463A
Application granted
Publication of CN115311463B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/30 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention belongs to the technical field of remote sensing image processing and discloses a category-guided multi-scale decoupled marine remote sensing image-text retrieval method and system. Image features at different scales are extracted from a marine remote sensing image, and text features are extracted from the remote sensing related text. A bidirectional multi-scale decoupling module then decouples the multi-scale image features, extracting the corresponding potential features at each scale while suppressing the redundant features of the other scales, to obtain decoupled features. A category label guidance module guides the decoupled image features and the text features, and the final class-related image and text features are calculated by multiplication. Finally, the similarity and the semantic-guided triplet loss are calculated. The invention realizes multi-scale decoupling, introduces effective information for decoupling, establishes a scale-and-semantics double-decoupled marine multi-modal information fusion method, and solves the problems of multi-scale noise redundancy and difficult fusion of multi-dimensional decoupled representation information.

Description

Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system
Technical Field
The invention belongs to the technical field of remote sensing image processing, and particularly relates to a category-guided multi-scale decoupled marine remote sensing image text retrieval method and system.
Background
Marine remote sensing image-text retrieval is an important method for addressing missing and inaccurately described text data in remote sensing datasets. It uses a cross-modal retrieval algorithm to analyze large numbers of satellite remote sensing images and automatically retrieve text data that accurately describe those images. Traditional methods mainly struggle to extract effective image features: because the targets in a marine remote sensing image are spatially dispersed and few, the information of the effective targets is diluted when global information is fused, which hampers subsequent data mining. State-of-the-art marine remote sensing image-text retrieval methods therefore introduce multi-scale feature extraction and attention mechanisms. Yuan et al. propose a novel fine-grained multi-modal feature matching network, whose advantage is that image features at different scales are obtained and key features are extracted, so that more accurate text information is retrieved.
However, existing methods have the following problems. First, a large amount of redundant noise is generated during multi-scale feature interaction. Multi-scale features often contain repeated regions; when the multi-scale features are fused by addition or concatenation, the repeated regions accumulate, the utilization of multi-scale content is low, and the redundant-feature filtering algorithms used by existing methods are too simple to remove the large amount of noise, which then degrades subsequent data fusion and mining. For example, existing methods filter redundant features with a gating idea; this cannot effectively filter a large amount of noise and may also filter out effective information. Second, existing methods usually perform knowledge decoupling based only on the multi-scale features of the image, ignoring the disambiguating role of image semantic information and text semantic information in image-text retrieval. For marine remote sensing image-text retrieval, considering only feature decoupling in the scale dimension wastes the value of the rich semantic information, and the lack of this valuable information increases the time and difficulty of extracting effective key features for the model. The low-order semantic information of an image is the expression of shallow features (such as color, geometry and texture), and the semantic information of a text can be understood as information related to category division. Introducing image-text semantic information can express the texture, geometry and color of the image content as well as the text description and text category. The semantic information expressed by images and texts allows the back end of the network to predict category membership correctly.
Therefore, aiming at these problems, the invention provides a category-guided bidirectional multi-scale decoupling network, which realizes multi-scale decoupling and introduces effective category information (image-text semantic information) for decoupling. A scale-and-semantics double-decoupled marine multi-modal information fusion framework is established, solving the problems of multi-scale noise redundancy and difficult fusion of multi-dimensional decoupled representation information.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a category-guided multi-scale decoupled marine remote sensing image text retrieval method and system: decoupled features at different scales are obtained through bidirectional multi-scale decoupling, and category labels guide and decouple the class features of images and texts, thereby solving the problems of multi-scale noise redundancy and difficult fusion of multi-dimensional decoupled feature information.
In order to solve the technical problems, the invention adopts the following technical scheme:
First, the invention provides a category-guided multi-scale decoupled marine remote sensing image text retrieval method, comprising the following steps:
S0, obtaining a marine remote sensing image and a remote sensing related text;
s1, extracting image characteristics of the ocean remote sensing image: firstly, a convolution neural network is used for embedding the characteristics of an image, the obtained basic characteristics of the image are sampled by cavity convolution with different sampling rates, and the image characteristics with different scales are obtained
Figure 521511DEST_PATH_IMAGE001
S2, extracting the text features T of the remote sensing related text;
S3, bidirectional multi-scale decoupling: decoupling the image features at different scales obtained in step S1, extracting the corresponding potential features at each scale and suppressing the redundant features of the other scales, to obtain the decoupled image features F;
Step S4, category label guidance: first, class features of the image and the text are generated; the generated class features then guide the decoupled image features F and the text features T, and the final class-related image features F̂ and text features T̂ are calculated using multiplication;
S5, calculating similarity and semantic guide triple loss:
First, the class-related image features F̂ and text features T̂ output in step S4 are matched by category to judge whether the image and the text belong to the same class; the category attribute is input into the downstream task as external knowledge, and dynamic weights are selected for the heterogeneous information of heterogeneous image-text matching; then the semantic-guided triplet loss is calculated, steps S1-S5 are iterated, and back-propagation training is performed;
S6, inputting a marine remote sensing image to be retrieved and outputting remote sensing related text data; or inputting remote sensing related text data to be retrieved and outputting the ocean remote sensing image.
Further, step S3 is divided into two steps:
S31, for the image features f_m of each scale extracted by the image feature extraction module, constructing an attention map A_m based on an attention mechanism at the current scale to extract potential features, and generating a suppression mask M_m;
S32, for the attention maps A_m and suppression masks M_m extracted at the different feature scales, A_m is used to promote the salient information at the corresponding scale, while M_m is used to suppress the salient features of the other scales, yielding image features with the redundant information filtered out and realizing scale decoupling. Through step-by-step suppression, the attention maps A_m are applied in the generation of the decoupled features F^{s→l} and F^{l→s}, where F^{s→l} is the decoupled feature in the small-to-large scale direction and F^{l→s} is the decoupled feature in the large-to-small scale direction. Finally, the decoupled features F^{s→l} and F^{l→s} of all feature scales are merged by a concat operation into the final decoupled image feature F.
Further, the decoupled features are calculated as

F^{s→l}_m = A_m ⊙ M_{m−1} ⊙ … ⊙ M_1 ⊙ f_m,  F^{l→s}_m = A_m ⊙ M_{m+1} ⊙ … ⊙ M_3 ⊙ f_m,  F = concat(F^{s→l}, F^{l→s})

where m indexes the different scales (three scales: large, medium and small), ⊙ denotes element-wise multiplication, and the attention maps A_m and suppression masks M_m derive the decoupled features F^{s→l} and F^{l→s} through operational cascading.
further, step S4 is specifically as follows:
s41, obtaining category semantic labels from the ocean remote sensing images obtained in the step S0, and obtaining the category characteristics of the remote sensing images through training of a remote sensing image classifierU
S42, obtaining category semantic labels from the remote sensing related texts obtained in the step S0, and obtaining the category characteristics of the remote sensing related texts through training of a remote sensing related text classifierV
S43, decoupling characteristics of the image obtained in the step S3FAnd remote sensing image category characteristicsUMultiplying the text characteristics obtained in the step S2TAnd remote sensing related text category featuresVMultiplication, the purpose of which is to decouple features of the imageFText features with related textTClass characteristics respectively corresponding to corresponding modalitiesU&VAttention enhancement is performed to obtain final class-related image features
Figure 204326DEST_PATH_IMAGE002
Text features related to categories
Figure 287820DEST_PATH_IMAGE003
Further, step S31 specifically includes: first, the channel information of a feature is aggregated by average-pooling and max-pooling operations to generate two feature descriptors; the descriptors are then passed through a standard convolution layer and a sigmoid function to generate the attention map A_m. The suppression mask M_m is generated by binary masking:

M_m = B(A_m)

where B is a binary mask that sets the most significant values of A_m to 0 and the others to 1.
Further, in step S5, the class features are first converted by softmax into the semantic classes p_img and p_txt of the image and the text. A parameter λ is then defined to adjust the loss, expressed as

λ = μ, if p_img = p_txt; λ = 1, otherwise

where μ is a constant. On this basis, the category-based triplet loss is designed as

L = Σ_{T̂⁻} [α − λ·S(F̂, T̂) + S(F̂, T̂⁻)]₊ + Σ_{F̂⁻} [α − λ·S(T̂, F̂) + S(T̂, F̂⁻)]₊

where [x]₊ = max(0, x), α denotes the margin, S(F̂, T̂) represents the similarity of the sample image and the positive sample text, S(F̂, T̂⁻) the similarity of the sample image and the negative sample text, S(T̂, F̂) the similarity of the sample text and the positive sample image, and S(T̂, F̂⁻) the similarity of the sample text and the negative sample image. The first summation matches the image features F̂ against all text features, including the positive sample text features T̂ and the negative sample text features T̂⁻; the second summation matches the text features T̂ against all image features, including the positive sample image features F̂ and the negative sample image features F̂⁻. The triplet loss constructed by the two summations aims to maximize the similarity with positive samples and minimize the similarity with negative samples.
The invention also provides a category-guided multi-scale decoupling marine remote sensing image text retrieval system, which is used for realizing the category-guided multi-scale decoupling marine remote sensing image text retrieval method, and comprises an input module, an image feature extraction module, a text feature extraction module, a bidirectional multi-scale decoupling module, a category label guide module, a semantic guide triple loss module and an output module;
the image feature extraction module comprises a depth residual error network and a cavity space convolution pooling pyramid and is used for extracting multi-scale image features
Figure 724530DEST_PATH_IMAGE004
The text feature extraction module extracts text features to obtain text features of the remote sensing related textT
The bidirectional multi-scale decoupling module is used for extracting the multi-scale image features output by the image feature extraction module
Figure 249183DEST_PATH_IMAGE004
Decoupling is carried out to obtain decoupling characteristicsF
the category label guidance module comprises a remote sensing image classifier and a remote sensing related text classifier, used respectively to obtain the remote sensing image class features U and the remote sensing related text class features V. The category semantic labels U and V, class features produced by pre-trained models, guide the image and the text as prior knowledge to construct class features and realize feature decoupling in the semantic dimension. The decoupled image features F and the related text features T are attention-enhanced with the class features U and V of the corresponding modalities to obtain the class-related image and text features;
the semantic guide triple loss module is used for calculating the semantic guide triple loss; performing category matching on the category characteristics, judging whether the image and the text belong to the same category, inputting the category attribute serving as external knowledge into a downstream task, and performing dynamic weight selection on heterogeneous information matched with heterogeneous graphics and texts;
the input module is used for inputting a marine remote sensing image or remote sensing related text data to be retrieved, and the output module is used for outputting the remote sensing related text data or the marine remote sensing image.
Compared with the prior art, the invention has the advantages that:
(1) The problem of noise redundancy is solved. The invention effectively filters the large amount of redundant noise generated during multi-scale feature interaction. A bidirectional multi-scale decoupling module is constructed that adaptively extracts the potential features of each scale in both directions and suppresses the redundant features of the other scales, so that the effective features of each scale are extracted, the redundant features of each scale are suppressed, and a large amount of redundant noise is filtered out.
(2) The introduction of category information (semantic information) improves the robustness of the features. The invention unifies semantic decoupling across the two dimensions. A category label guidance module is constructed that uses category semantic labels as prior knowledge to supervise images and texts, so as to construct better class features and realize feature decoupling in the semantic dimension. The category semantic features emphasize the effective features, and the knowledge of semantic decoupling is mapped into the visual multi-scale sample space through concatenation. The category attribute serves as a bridge between the two kinds of modal information, providing external knowledge for the model while aligning multi-modal knowledge, which helps the model quickly extract effective features and mine the effective objects in remote sensing images. Meanwhile, the alignment and fusion of the multi-scale image features, the effective information (text semantic features) and the image semantic features also produce expressions of category information, pixel membership and scale characteristics, and the semantic information expressed by images and texts lets the back end of the network predict category membership correctly.
(3) The problems of difficult extraction of effective features and low retrieval accuracy are solved with prior knowledge. The invention constructs a semantic-guided triplet loss module that matches the class features by category, judges whether the image and the text belong to the same class, inputs the category attribute into the downstream task as external knowledge, and selects dynamic weights for the heterogeneous information of image-text matching. For example, a high-accuracy remote sensing image classification model and remote sensing text classification model are trained as prior knowledge and added into the loss function; if the image and the text have the same category, the similarity is increased. This greatly shortens model convergence time and makes the matching probability of same-category images and texts higher than the non-matching probability, so that the retrieval accuracy of the model is greatly increased.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a system architecture diagram of the present invention;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
Example 1
With reference to figs. 1 and 2, a category-guided bidirectional multi-scale decoupled marine remote sensing image text retrieval method first preprocesses the data, including processing the marine remote sensing image; from the preprocessed data it then extracts, on one hand, the text features T through the text feature extraction module and, on the other hand, the decoupled image features F through bidirectional multi-scale decoupling. The decoupled image features F and text features T are next input into the category label guidance module, where the category semantic labels U and V are used as prior knowledge to supervise the images and texts, construct class features, and realize feature decoupling in the semantic dimension. Finally, the semantic-guided triplet loss is calculated from the similarity of the image and the text, whether the image and the text match is judged, and back propagation is performed.
The method specifically comprises the following steps:
S0, obtaining the ocean remote sensing image and the remote sensing related text.
S1, extracting image features of the ocean remote sensing image: first, a convolutional neural network is used to embed the image features, and the obtained base features of the image are sampled by dilated (atrous) convolutions with different dilation rates to obtain the image features f_m at different scales. A representation of the image is obtained by this step.
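The dilated-convolution sampling described above can be sketched as follows. This is a minimal NumPy illustration of extracting features at three scales with different dilation rates; the function names and the toy all-ones kernel are illustrative, not from the patent, which would use a trained CNN backbone:

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """Valid-mode 2D convolution whose kernel taps are `rate` pixels apart."""
    kh, kw = kernel.shape
    eh, ew = (kh - 1) * rate + 1, (kw - 1) * rate + 1  # effective kernel span
    H, W = x.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + eh:rate, j:j + ew:rate]  # sample with "holes"
            out[i, j] = np.sum(patch * kernel)
    return out

def multi_scale_features(x, kernel, rates=(1, 2, 3)):
    """One branch per dilation rate, ASPP-style: features f_m for m = 1..3."""
    return [dilated_conv2d(x, kernel, r) for r in rates]
```

A larger dilation rate enlarges the receptive field without adding parameters, which is why a single shared kernel can produce both fine and coarse scale features.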
S2, extracting the text features T of the remote sensing related text. In a specific application, a word-vector embedding model (sentence embedding) and the Skip-thought text processing model can be chosen for text feature extraction. A representation of the text is obtained by this step.
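As a stand-in for the sentence-embedding or Skip-thought encoder mentioned above, the sketch below averages toy word embeddings into a single text vector; the vocabulary, embedding dimension, and function name are invented for illustration:

```python
import numpy as np

# Toy vocabulary with fixed random embeddings; a real system would use a
# trained sentence-embedding or Skip-thought encoder here instead.
rng = np.random.default_rng(0)
VOCAB = {w: rng.standard_normal(8) for w in
         ("a", "ship", "near", "the", "harbor", "island")}

def text_feature(sentence, dim=8):
    """Average the embeddings of in-vocabulary words; zeros if none match."""
    vecs = [VOCAB[w] for w in sentence.lower().split() if w in VOCAB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```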
S3, bidirectional multi-scale decoupling: the image features at different scales obtained in step S1 are decoupled, the corresponding potential features at each scale are extracted, and the redundant features of the other scales are suppressed to obtain the decoupled image features F. This comprises the following two steps:
S31, for the image features f_m of each scale extracted by the image feature extraction module, an attention map A_m is constructed at the current scale based on an attention mechanism to extract potential features, and a suppression mask M_m is generated. The method is as follows: the channel information of a feature is first aggregated by average-pooling and max-pooling operations to generate two feature descriptors; the descriptors are then passed through a standard convolution layer and a sigmoid function to generate the attention map A_m, and the suppression mask is generated by binary masking:

M_m = B(A_m)

where B is a binary mask that sets the most significant values of A_m to 0 and the others to 1. The suppression mask mitigates the covering effect of A_m on the other scales, so that the distinct information of the different scales stands out.
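A minimal NumPy sketch of this step, assuming a fixed 50/50 weighted sum in place of the learned convolution layer and a top-fraction threshold in place of the binarization rule (both assumptions, as the patent does not fix these details):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_and_mask(feat, top_frac=0.25):
    """feat: (C, H, W) feature map of one scale -> attention map A, mask M."""
    avg_desc = feat.mean(axis=0)              # channel average pooling
    max_desc = feat.max(axis=0)               # channel max pooling
    # A fixed 50/50 combination stands in for the learned convolution layer.
    A = sigmoid(0.5 * avg_desc + 0.5 * max_desc)
    # Binary suppression mask: the most significant attention values -> 0,
    # everything else -> 1.
    k = max(1, int(top_frac * A.size))
    thresh = np.sort(A.ravel())[-k]
    M = np.where(A >= thresh, 0.0, 1.0)
    return A, M
```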
S32, for the attention maps A_m and suppression masks M_m extracted at the different feature scales, A_m is used to promote the salient information at the corresponding scale, while M_m is used to suppress the salient features of the other scales, yielding image features with the redundant information filtered out and realizing scale decoupling. Through step-by-step suppression, the attention maps A_m are applied in the generation of the decoupled features F^{s→l} and F^{l→s}. Finally, the decoupled features of all feature scales are merged by a concat operation into the final decoupled image feature F, as follows:

F^{s→l}_m = A_m ⊙ M_{m−1} ⊙ … ⊙ M_1 ⊙ f_m,  F^{l→s}_m = A_m ⊙ M_{m+1} ⊙ … ⊙ M_3 ⊙ f_m,  F = concat(F^{s→l}, F^{l→s})

where m indexes the different scales (three scales: large, medium and small) and ⊙ denotes element-wise multiplication; the attention maps A_m and suppression masks M_m derive the decoupled features F^{s→l} and F^{l→s} through operational cascading, F^{s→l} being the decoupled feature in the small-to-large scale direction and F^{l→s} the decoupled feature in the large-to-small scale direction.
In particular, since the attention map represents the salient regions of a feature, the suppression mask leverages the attention-map representation to suppress the saliency information of the corresponding scale; this mitigates the covering effect of the attention map on the other scales and highlights their distinct information.
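The step-by-step suppression in both scale directions can be illustrated as follows; the cascading of masks is one plausible reading of the description above, not necessarily the patent's exact computation:

```python
import numpy as np

def bidirectional_decouple(feats, atts, masks):
    """feats/atts/masks: per-scale (H, W) arrays, ordered small -> large.

    Each scale's attention map promotes its own salient content, while the
    suppression masks of the previously visited scales damp what those
    scales already claimed. Running the cascade in both directions and
    concatenating gives the decoupled feature F.
    """
    def cascade(order):
        out, visited = [], []
        for i in order:
            f = atts[i] * feats[i]
            for j in visited:          # step-by-step suppression
                f = f * masks[j]
            out.append(f)
            visited.append(i)
        return out

    n = len(feats)
    fwd = cascade(range(n))                  # F^{s->l}
    bwd = cascade(range(n - 1, -1, -1))      # F^{l->s}
    return np.concatenate([np.stack(fwd), np.stack(bwd)], axis=0)
```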
Step S4, category label guidance: first, class features of the image and the text are generated; the generated class features then guide the decoupled image features F and the text features T, and multiplication yields the class-related image and text features F̂ and T̂. The method comprises the following steps:
S41, obtaining category semantic labels for the ocean remote sensing images obtained in step S0, and obtaining the remote sensing image class features U by training a remote sensing image classifier.
S42, obtaining category semantic labels for the remote sensing related texts obtained in step S0, and obtaining the remote sensing related text class features V by training a remote sensing related text classifier.
The two classifiers are pre-trained models whose prediction accuracy exceeds 80%. The rich semantic knowledge in the pre-trained models can be transferred to the subsequent training process, so the pre-trained models can be regarded as prior-knowledge supervision of the model.
S43, the decoupled image features F obtained in step S3 are multiplied by the remote sensing image class features U to guide the retrieval network to probe important and reliable category-related information, and the text features T obtained in step S2 are multiplied by the remote sensing related text class features V. The purpose is to apply attention enhancement to the decoupled image features F and the related text features T with the class features U and V of the corresponding modalities, obtaining the final class-related image features F̂ and text features T̂. By making full use of multiplication, significant enhancement of the correlated features is achieved during feature combination. F̂ and T̂ not only capture identifiable multi-scale semantic information but also highlight reliable category-related knowledge, improving the accuracy of network retrieval. Multiplying the decoupled image features F by the remote sensing image class features U guides the image and text features with the classification prior knowledge of image and text: the knowledge of the pre-trained semantic features is semantically decoupled, and the decoupled semantic information is combined with the original retrieval network to explore meaningful and reliable category-related data, so that while category supervision is realized, the semantic information is fused and aligned with the scale information of the different modalities through the prior-knowledge guidance module. The formula is:

F̂ = F ⊙ U,  T̂ = T ⊙ V
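A minimal sketch of this multiplicative attention enhancement, with softmax-normalized toy logits standing in for the classifier-derived class features U and V (the toy values are assumptions for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def class_guided_enhance(F, U, T, V):
    """Attention enhancement by element-wise multiplication:
    F_hat = F * U (image branch), T_hat = T * V (text branch)."""
    return F * U, T * V
```

Dimensions the class features assign high probability are amplified relative to the rest, which is the "significant enhancement of the correlated features" described above.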
s5, calculating similarity and semantic guide triple loss:
First, the class-related image and text features F̂ and T̂ output in step S4 are matched by category to judge whether the image and the text belong to the same class, so as to raise the retrieval probability of same-class cross-modal data; the category attribute is input into the downstream task as external knowledge, and dynamic weights are selected for the heterogeneous information of heterogeneous image-text matching. Then the semantic-guided triplet loss is calculated, steps S1-S5 are iterated, and back-propagation training is performed.
First, the class features are converted by softmax into the semantic classes of the image and the text, c_I and c_T. A parameter λ is then defined to adjust the loss; together with a constant β it determines the weight α applied to each triplet according to whether the semantic classes c_I and c_T agree. On this basis, the category-based triplet loss is designed as follows:

L_tri = Σ_T̂⁻ α·[γ − S(F̂, T̂⁺) + S(F̂, T̂⁻)]₊ + Σ_F̂⁻ α·[γ − S(T̂, F̂⁺) + S(T̂, F̂⁻)]₊,  where [x]₊ = max(0, x).
The purpose of the triplet loss function is to increase the distance between a sample and its negative samples while minimizing the semantic-space distance between the sample and its positive samples. Here γ is the margin; S(F̂, T̂⁺) denotes the similarity between a sample image and the positive text, S(F̂, T̂⁻) that between the sample image and a negative text, S(T̂, F̂⁺) that between a sample text and the positive image, and S(T̂, F̂⁻) that between the sample text and a negative image. The first summation matches the image features F̂ against all text features, including the positive text features T̂⁺ and the negative text features T̂⁻; the second summation matches the text features T̂ against all image features, including the positive image features F̂⁺ and the negative image features F̂⁻. The triplet loss constructed from the two summations maximizes the similarity with the positive samples and minimizes the similarity with the negative samples.
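The bidirectional loss just described can be sketched numerically. This is a hedged toy version: the exact weighting formula is carried in the patent's figures and is not reproduced here, so the form of α below (β plus λ when the anchor's class and the negative's class agree) is only an assumption, and the names S_it, labels_img and labels_txt are hypothetical:

```python
import numpy as np

def semantic_guided_triplet_loss(S_it, labels_img, labels_txt,
                                 margin=0.2, lam=0.5, beta=1.0):
    """Bidirectional triplet loss over a similarity matrix S_it, where
    S_it[i, j] is the similarity of image i and text j and the diagonal
    holds the positive pairs.  alpha re-weights a triplet according to
    whether the anchor's class and the negative's class agree
    (assumed form of the category weighting)."""
    n = S_it.shape[0]
    loss = 0.0
    for i in range(n):
        pos = S_it[i, i]                      # positive-pair similarity
        for j in range(n):
            if j == i:
                continue
            alpha = beta + lam * (labels_img[i] == labels_txt[j])
            # image anchor i against negative text j ...
            loss += alpha * max(0.0, margin - pos + S_it[i, j])
            # ... and text anchor i against negative image j
            loss += alpha * max(0.0, margin - pos + S_it[j, i])
    return loss / n
```

With well-separated similarities every hinge term is zero and the loss vanishes; hard negatives on the off-diagonal contribute positive terms, pushed further apart by at least the margin.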
S6, inputting the ocean remote sensing image to be retrieved and outputting the related remote sensing text data; or inputting the remote-sensing-related text data to be retrieved and outputting the ocean remote sensing image.
Example 2
The category-guided bidirectional multi-scale decoupling marine remote sensing image text retrieval system comprises an input module, an image feature extraction module, a text feature extraction module, a bidirectional multi-scale decoupling module, a category label guide module, a semantic-guided triplet loss module and an output module.
The image feature extraction module comprises a convolutional neural network and an atrous spatial pyramid pooling module, and is used for extracting the multi-scale image features {I_m}.
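As an illustration of how dilated (atrous) sampling yields features at several receptive-field sizes, here is a toy 1-D stand-in — the real module operates on 2-D CNN feature maps, and the function names below are hypothetical:

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """Valid-mode 1-D convolution whose taps are spaced `rate` apart,
    so the receptive field grows without adding parameters."""
    k = len(kernel)
    span = (k - 1) * rate + 1          # receptive field of the kernel
    return np.array([
        sum(kernel[t] * x[start + t * rate] for t in range(k))
        for start in range(len(x) - span + 1)
    ])

def multi_scale_features(x, kernel, rates=(1, 2, 4)):
    """One feature sequence per sampling rate, mimicking the
    multi-scale image features (three scales in the patent)."""
    return [dilated_conv1d(x, kernel, r) for r in rates]

feats = multi_scale_features(np.arange(12.0), [1.0, 1.0, 1.0])
# rate 1 sums x[i], x[i+1], x[i+2]; rate 4 sums x[i], x[i+4], x[i+8]
```

Each rate produces a feature map of the same input, but aggregated over a wider neighborhood, which is exactly the multi-scale input the decoupling module consumes.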
The text feature extraction module extracts text features with a word-vector (sentence embedding) model and a Skip-thought text processing model to obtain the text features T of the remote-sensing-related text.
The bidirectional multi-scale decoupling module decouples the multi-scale image features {I_m} output by the image feature extraction module to obtain the decoupled features F.
The category label guide module comprises a remote sensing image classifier and a remote-sensing-text classifier, used respectively to obtain the category features U of the remote sensing image and the category features V of the remote-sensing-related text. The category semantic labels U and V serve as prior knowledge to guide the images and texts, constructing the class features and realizing feature decoupling in the semantic dimension; U and V are class features labeled through a pre-trained model. The decoupled image features F and the text features T are attention-enhanced by the class features U and V of their respective modalities, and the enhanced information can also be combined with the original retrieval network to fuse the semantic and scale features, exploring meaningful and reliable category-related data and obtaining the category-related image and text features.
The semantic-guided triplet loss module is used for calculating the semantic-guided triplet loss: it performs category matching on the category features, judges whether the image and the text belong to the same category, inputs the category attribute as external knowledge into the downstream task, and performs dynamic weight selection on the heterogeneous information of the matched image-text pairs.
the input module is used for inputting marine remote sensing images or remote sensing related text data to be retrieved, and the output module is used for outputting remote sensing related text data or marine remote sensing images.
The function implementation and data processing of each module are partially the same as those in embodiment 1, and are not described herein again.
It should be noted that the method of the present invention realizes bidirectional cross-modal retrieval between images and texts: one modality is used as the query to retrieve the other. When the input is an ocean remote sensing image, the retrieval result is the corresponding text data; when the input is ocean-remote-sensing-related text data, the retrieval result is the corresponding ocean remote sensing image.
In summary, the present invention uses category information as prior knowledge to guide a more accurate cross-modal information representation. Specifically, compared with existing methods, the bidirectional multi-scale decoupling module adaptively extracts the potential features on each scale and suppresses the redundant features on the other scales, generating discriminative clues and solving the noise-redundancy problem of cascaded scale decoupling. In addition, a category label guide module and a semantic-guided triplet loss module are constructed. The category label guide module supervises the images and texts with category semantic labels as prior knowledge, constructing better class features and realizing feature decoupling in the semantic dimension; the decoupled semantic information is then combined with the original retrieval network to fuse the semantic and scale features and explore meaningful and reliable category-related data. The semantic-guided triplet loss module matches the category features, judges whether the image and the text belong to the same category, inputs the category attribute as external knowledge into the downstream task, and performs dynamic weight selection on the heterogeneous information of the matched image-text pairs, improving both the retrieval probability for same-category cross-modal data and the convergence speed of the model. Finally, a category-based triplet loss is designed by category-matching the generated class features, further raising the retrieval probability of same-category cross-modal data.
It is understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art should understand that they can make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.

Claims (3)

1. The method for searching the marine remote sensing image text based on category-guided multi-scale decoupling is characterized by comprising the following steps of:
s0, obtaining a marine remote sensing image and a remote sensing related text;
S1, extracting the image features of the ocean remote sensing image: first, a convolutional neural network is used to embed the features of the image, and the obtained basic image features are sampled by dilated (atrous) convolutions with different sampling rates to obtain the image features {I_m} at different scales;
S2, extracting the text features T of the remote-sensing-related text;
S3, bidirectional multi-scale decoupling: the image features of the different scales obtained in step S1 are decoupled, the corresponding potential features are extracted on each scale and the redundant features on the other scales are suppressed, obtaining the decoupled image features F;
Step S3 is divided into two steps:
s31, extracting image features of each scale from the image feature extraction module
Figure 63026DEST_PATH_IMAGE002
Based on attention on the current scaleMechanism build attention diagrams
Figure 311605DEST_PATH_IMAGE003
Extracting potential features; and generating a suppression mask
Figure 565869DEST_PATH_IMAGE004
S32, for the attention maps A_m and the suppression masks M_m extracted at the different feature scales, A_m promotes the salient information on its own scale while M_m suppresses the salient features of the other scales, so that image features with the redundant information filtered out are obtained and scale decoupling is achieved; through progressive suppression, the attention maps A_m are applied to the generation of the decoupled features F_s and F_l; finally, the decoupled features F_s and F_l of the two scale directions are combined by a concat operation into the final decoupled image features F;
in step S32, the decoupled features are obtained by operationally cascading the attention maps A_m and the suppression masks M_m over the scales, where m indexes the different scales, namely the large, medium and small scales; F_s is the decoupled feature in the small-to-large scale direction, and F_l is the decoupled feature in the large-to-small scale direction;
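The progressive-suppression idea of steps S31/S32 can be sketched as follows. This is a hedged toy version: the exact cascade is carried in the patent's formula figures, so the one-zero binary mask per scale and the simple multiplicative accumulation below are assumptions for illustration:

```python
import numpy as np

def binary_suppress_mask(A):
    """Suppression mask: 0 at the most salient position of the
    attention map A, 1 everywhere else."""
    M = np.ones_like(A)
    M[np.unravel_index(np.argmax(A), A.shape)] = 0.0
    return M

def bidirectional_decouple(feats, atts):
    """Cascade over the scales in both directions: each scale's
    attention map promotes its own salient information, while the
    masks of the scales already visited suppress their salient
    features.  The two directional results are concatenated into F."""
    def cascade(order):
        out, suppress = [], np.ones_like(feats[0])
        for m in order:
            out.append(feats[m] * atts[m] * suppress)  # promote scale m
            suppress = suppress * binary_suppress_mask(atts[m])
        return out
    small_to_large = cascade(range(len(feats)))
    large_to_small = cascade(range(len(feats) - 1, -1, -1))
    return np.concatenate(small_to_large + large_to_small)

# Two toy scales with 3 spatial positions each.
feats = [np.array([1.0, 1.0, 1.0]), np.array([2.0, 2.0, 2.0])]
atts = [np.array([0.5, 1.0, 0.5]), np.array([1.0, 0.5, 0.5])]
F = bidirectional_decouple(feats, atts)
```

The position that was most salient on the first scale is zeroed out on the second, so each scale contributes complementary rather than redundant information.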
Step S4, guiding by category labels: first, the class features of the image and the text are generated; the generated class features then guide the decoupled image features F and the text features T, and multiplication is used to calculate the final category-related image features F̂ and text features T̂;
Step S4 is specifically as follows:
s41, obtaining category semantic labels from the ocean remote sensing images obtained in the step S0, and obtaining category characteristics of the remote sensing images through training of a remote sensing image classifierU
S42, obtaining category semantic labels from the remote sensing related texts obtained in the step S0, and obtaining the category characteristics of the remote sensing related texts through training of a remote sensing related text classifierV
S43, the decoupled image features F obtained in step S3 are multiplied by the remote sensing image category features U, and the text features T obtained in step S2 are multiplied by the remote-sensing-related text category features V; the purpose is to attention-enhance the decoupled image features F and the text features T with the class features U and V of their respective modalities, obtaining the final category-related image features F̂ and text features T̂;
S5, calculating the similarity and the semantic-guided triplet loss:
first, category matching is performed on the category-related image features F̂ and text features T̂ output in step S4 to judge whether the image and the text belong to the same category; the category attribute is input into the downstream task as external knowledge, and dynamic weight selection is performed on the heterogeneous information of the matched image-text pairs; the semantic-guided triplet loss is then calculated, steps S1-S5 are iterated, and back-propagation training is carried out;
in step S5, first the class features are converted by softmax into the semantic classes of the image and the text, c_I and c_T; a parameter λ is then defined to adjust the loss, and together with a constant β it determines the weight α applied to each triplet according to whether the semantic classes c_I and c_T agree; on this basis, the category-based triplet loss is designed as follows:

L_tri = Σ_T̂⁻ α·[γ − S(F̂, T̂⁺) + S(F̂, T̂⁻)]₊ + Σ_F̂⁻ α·[γ − S(T̂, F̂⁺) + S(T̂, F̂⁻)]₊,  where [x]₊ = max(0, x);
where γ is the margin; S(F̂, T̂⁺) represents the similarity between the sample image and the positive text; S(F̂, T̂⁻) the similarity between the sample image and a negative text; S(T̂, F̂⁺) the similarity between the sample text and the positive image; S(T̂, F̂⁻) the similarity between the sample text and a negative image; the first summation matches the image features F̂ against all text features, including the positive text features T̂⁺ and the negative text features T̂⁻; the second summation matches the text features T̂ against all image features, including the positive image features F̂⁺ and the negative image features F̂⁻; the purpose of the triplet loss function constructed by the two summations is to maximize the similarity with the positive samples and minimize the similarity with the negative samples;
s6, inputting a marine remote sensing image to be retrieved and outputting remote sensing related text data; or inputting remote sensing related text data to be retrieved and outputting the ocean remote sensing image.
2. The category-guided multi-scale decoupled marine remote sensing image text retrieval method according to claim 1, characterized in that the specific steps of step S31 are: first, the channel information of a feature is aggregated through average-pooling and max-pooling operations to generate two feature descriptors; the attention map A_m is then generated from the feature descriptors by a standard convolution layer and a sigmoid function, and the suppression mask M_m is generated by binary masking, M_m = B(A_m), where B is a binary mask that sets the most significant values of A_m to 0 and the others to 1.
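A toy numeric version of claim 2's two steps, with the channel pooling combined by fixed weights standing in for the learned convolution layer (those weights, and the function names, are assumptions):

```python
import numpy as np

def attention_map(feat, w_avg=0.5, w_max=0.5):
    """Channel information is aggregated by average- and max-pooling
    into two descriptors, combined (here by fixed weights in place of
    the standard convolution layer) and squashed by a sigmoid."""
    avg_desc = feat.mean(axis=0)     # channel-average descriptor
    max_desc = feat.max(axis=0)      # channel-max descriptor
    logits = w_avg * avg_desc + w_max * max_desc
    return 1.0 / (1.0 + np.exp(-logits))

def suppression_mask(A):
    """Binary mask over A: 0 at the most significant value, 1 elsewhere."""
    M = np.ones_like(A)
    M[np.unravel_index(np.argmax(A), A.shape)] = 0.0
    return M

feat = np.array([[1.0, 2.0],
                 [3.0, 10.0]])       # 2 channels x 2 spatial positions
A = attention_map(feat)
M = suppression_mask(A)              # zeroes the most salient position
```

The mask leaves every position untouched except the one the attention map rates most salient, which other scales then skip.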
3. A marine remote sensing image text retrieval system based on category-guided multi-scale decoupling, characterized in that it implements the marine remote sensing image text retrieval method based on category-guided multi-scale decoupling according to claim 1, and comprises an input module, an image feature extraction module, a text feature extraction module, a bidirectional multi-scale decoupling module, a category label guide module, a semantic-guided triplet loss module and an output module;
the image feature extraction module comprises a convolutional neural network and an atrous spatial pyramid pooling module, and is used for extracting the multi-scale image features {I_m};
the text feature extraction module extracts text features to obtain the text features T of the remote-sensing-related text;
the bidirectional multi-scale decoupling module decouples the multi-scale image features {I_m} output by the image feature extraction module to obtain the decoupled features F;
the category label guide module comprises a remote sensing image classifier and a remote-sensing-text classifier, used respectively to obtain the category features U of the remote sensing image and the category features V of the remote-sensing-related text; the category semantic labels U and V serve as prior knowledge to guide the images and texts, constructing the class features and realizing feature decoupling in the semantic dimension, where U and V are class features labeled through a pre-trained model; the decoupled image features F and the text features T are attention-enhanced by the class features U and V of their respective modalities to obtain the category-related image and text features;
the semantic-guided triplet loss module is used for calculating the semantic-guided triplet loss: it performs category matching on the category features, judges whether the image and the text belong to the same category, inputs the category attribute as external knowledge into the downstream task, and performs dynamic weight selection on the heterogeneous information of the matched image-text pairs;
the input module is used for inputting a marine remote sensing image or remote sensing related text data to be retrieved, and the output module is used for outputting the remote sensing related text data or the marine remote sensing image.
CN202211223823.1A 2022-10-09 2022-10-09 Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system Active CN115311463B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211223823.1A CN115311463B (en) 2022-10-09 2022-10-09 Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211223823.1A CN115311463B (en) 2022-10-09 2022-10-09 Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system

Publications (2)

Publication Number Publication Date
CN115311463A CN115311463A (en) 2022-11-08
CN115311463B true CN115311463B (en) 2023-02-03

Family

ID=83866005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211223823.1A Active CN115311463B (en) 2022-10-09 2022-10-09 Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system

Country Status (1)

Country Link
CN (1) CN115311463B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127123B (en) * 2023-04-17 2023-07-07 中国海洋大学 Semantic instance relation-based progressive ocean remote sensing image-text retrieval method
CN116186317B (en) * 2023-04-23 2023-06-30 中国海洋大学 Cross-modal cross-guidance-based image-text retrieval method and system
CN117556062B (en) * 2024-01-05 2024-04-16 武汉理工大学三亚科教创新园 Ocean remote sensing image audio retrieval network training method and application method

Citations (4)

Publication number Priority date Publication date Assignee Title
WO2017103035A1 (en) * 2015-12-18 2017-06-22 Ventana Medical Systems, Inc. Systems and methods of unmixing images with varying acquisition properties
US10713794B1 (en) * 2017-03-16 2020-07-14 Facebook, Inc. Method and system for using machine-learning for object instance segmentation
CN111798460A (en) * 2020-06-17 2020-10-20 南京信息工程大学 Satellite image segmentation method
CN113487629A (en) * 2021-07-07 2021-10-08 电子科技大学 Image attribute editing method based on structured scene and text description

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN112766199B (en) * 2021-01-26 2022-04-29 武汉大学 Hyperspectral image classification method based on self-adaptive multi-scale feature extraction model

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
WO2017103035A1 (en) * 2015-12-18 2017-06-22 Ventana Medical Systems, Inc. Systems and methods of unmixing images with varying acquisition properties
US10713794B1 (en) * 2017-03-16 2020-07-14 Facebook, Inc. Method and system for using machine-learning for object instance segmentation
CN111798460A (en) * 2020-06-17 2020-10-20 南京信息工程大学 Satellite image segmentation method
CN113487629A (en) * 2021-07-07 2021-10-08 电子科技大学 Image attribute editing method based on structured scene and text description

Non-Patent Citations (3)

Title
A Semantic-Preserving Deep Hashing Model for multi-label remote sensing image retrieval; Qinmin Cheng et al.; Remote Sensing; 2021-12-07; full text *
Recurrently exploring class-wise attention in a hybrid convolutional and bidirectional LSTM network for multi-label aerial image classification; Yuansheng Hua et al.; ISPRS Journal of Photogrammetry and Remote Sensing; 2019-12-31; full text *
Multi-label classification of remote sensing images based on deep learning and label semantic association; Shan Shouping; China Master's Theses Full-text Database (Electronic Journal); 2022-04-15; full text *

Also Published As

Publication number Publication date
CN115311463A (en) 2022-11-08

Similar Documents

Publication Publication Date Title
CN115311463B (en) Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system
CN112966684B (en) Cooperative learning character recognition method under attention mechanism
CN111815602A (en) Building PDF drawing wall recognition device and method based on deep learning and morphology
CN111932577B (en) Text detection method, electronic device and computer readable medium
US11948078B2 (en) Joint representation learning from images and text
CN115658934A (en) Image-text cross-modal retrieval method based on multi-class attention mechanism
CN113947161A (en) Attention mechanism-based multi-label text classification method and system
CN112348001B (en) Training method, recognition method, device, equipment and medium for expression recognition model
Vu et al. Revising FUNSD dataset for key-value detection in document images
CN113051932A (en) Method for detecting category of network media event of semantic and knowledge extension topic model
CN116579348A (en) False news detection method and system based on uncertain semantic fusion
CN111159411A (en) Knowledge graph fused text position analysis method, system and storage medium
EP4012668A2 (en) Training method for character generation model, character generation method, apparatus and device
CN112800259B (en) Image generation method and system based on edge closure and commonality detection
CN115311598A (en) Video description generation system based on relation perception
CN114637846A (en) Video data processing method, video data processing device, computer equipment and storage medium
Li et al. ViT2CMH: Vision Transformer Cross-Modal Hashing for Fine-Grained Vision-Text Retrieval.
CN113159071A (en) Cross-modal image-text association anomaly detection method
Divya et al. An Empirical Study on Fake News Detection System using Deep and Machine Learning Ensemble Techniques
Fang et al. PiPo-Net: A Semi-automatic and Polygon-based Annotation Method for Pathological Images
US20240028828A1 (en) Machine learning model architecture and user interface to indicate impact of text ngrams
CN111985505B (en) Interest visual relation detection method and device based on interest propagation network
CN115146618B (en) Complex causal relation extraction method based on contrast representation learning
CN112347196B (en) Entity relation extraction method and device based on neural network
Xianlun et al. Deep global-attention based convolutional network with dense connections for text classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant