CN115311463A - Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system - Google Patents


Info

Publication number
CN115311463A
Authority
CN
China
Prior art keywords
image
text
features
remote sensing
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211223823.1A
Other languages
Chinese (zh)
Other versions
CN115311463B (en)
Inventor
魏志强
郑程予
宋宁
赵恩源
聂婕
刘安安
宋丹
李文辉
孙正雅
张文生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China filed Critical Ocean University of China
Priority to CN202211223823.1A priority Critical patent/CN115311463B/en
Publication of CN115311463A publication Critical patent/CN115311463A/en
Application granted granted Critical
Publication of CN115311463B publication Critical patent/CN115311463B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/30 - Noise filtering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of remote sensing image processing, and discloses a category-guided multi-scale decoupled marine remote sensing image text retrieval method and system. Image features at different scales are extracted from a marine remote sensing image, and text features are extracted from a remote-sensing-related text. The image features at the different scales are then decoupled by a bidirectional multi-scale decoupling module, which extracts the corresponding potential features at each scale and suppresses the redundant features of the other scales to obtain the decoupled features. A category label guidance module guides the decoupled image features and the text features, and the final category-related image and text features are computed by multiplication. Finally, the similarity and the semantic-guided triplet loss are calculated. The invention realizes multi-scale decoupling, introduces effective information for the decoupling, establishes a scale-and-semantic double-decoupled marine multi-modal information fusion method, and solves the problems of noise redundancy in the multi-scale dimension and the difficulty of fusing multi-dimensional decoupled representation information.

Description

Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system
Technical Field
The invention belongs to the technical field of remote sensing image processing, and in particular relates to a category-guided multi-scale decoupled marine remote sensing image text retrieval method and system.
Background
Marine remote sensing image text retrieval is an important method for addressing missing text data and inaccurate text descriptions in remote sensing data. It applies a cross-modal retrieval algorithm to analyze large numbers of satellite remote sensing images and automatically retrieve text data that accurately describe the images, thereby remedying missing and inaccurate text data. Traditional methods mainly face the difficulty of extracting effective image features: the spatial distribution of targets in a marine remote sensing image is dispersed and the effective targets in the image are few, so the information of the effective targets is diluted during the fusion of global information, affecting subsequent data mining. State-of-the-art marine remote sensing image text retrieval methods therefore introduce multi-scale feature extraction and attention mechanisms; Yuan et al. proposed a novel fine-grained multi-modal feature matching network whose advantage is that image features at different scales are obtained and the key features are extracted, so that more accurate text information is retrieved.
However, the existing methods have the following problems. First, a large amount of redundant noise is generated during multi-scale feature interaction. Multi-scale features often contain repeated regions; when the multi-scale features are fused by addition or concatenation, the repeated regions accumulate, so the utilization of the multi-scale content is low. The redundant-feature filtering algorithms used by existing methods are simple and cannot filter out the large amount of noise, and this redundant noise affects subsequent data fusion and mining. For example, existing methods use a gating idea to filter redundant features, which not only fails to filter a large amount of noise effectively but may also filter out effective information. Second, existing methods usually perform knowledge decoupling based only on the multi-scale features of the image, ignoring the disambiguating effect of image and text semantic information in image-text retrieval. For marine remote sensing image text retrieval, considering only feature decoupling in the scale dimension wastes the value of the rich semantic information, and this lack of valuable information increases the time and difficulty of extracting effective key features for the model. The low-order semantic information of an image is the expression of shallow features (such as color, geometry and texture), while the semantic information of a text can be understood as information related to category division; introducing image-text semantic information can express the texture, geometry and color of the image content as well as the text description and the text category information. This image-text semantic information allows the back end of the network to make correct predictions of category attribution.
Therefore, aiming at these problems, the invention provides a category-guided bidirectional multi-scale decoupling network that realizes multi-scale decoupling and introduces effective category information (image-text semantic information) for the decoupling. A scale-and-semantic double-decoupled marine multi-modal information fusion framework is established, solving the problems of noise redundancy in the multi-scale dimension and the difficulty of fusing multi-dimensional decoupled representation information.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a category-guided multi-scale decoupled marine remote sensing image text retrieval method and system: decoupled features at different scales are obtained through bidirectional multi-scale decoupling, and category labels guide the category features of the images and texts, solving the problems of noise redundancy in the multi-scale dimension and the difficulty of fusing multi-dimensional decoupled feature information.
In order to solve the technical problems, the invention adopts the following technical scheme:
firstly, the invention provides a category-guided multi-scale decoupled marine remote sensing image text retrieval method, which comprises the following steps:
S0, obtaining a marine remote sensing image and a remote-sensing-related text;
s1, extracting image characteristics of the ocean remote sensing image: firstly, a convolution neural network is used for embedding the characteristics of an image, the obtained basic characteristics of the image are sampled by cavity convolution with different sampling rates, and the image characteristics with different scales are obtained
Figure 521511DEST_PATH_IMAGE001
S2, extracting the text features T of the remote-sensing-related text;
S3, bidirectional multi-scale decoupling: the image features at the different scales obtained in step S1 are decoupled; the corresponding potential features are extracted at each scale and the redundant features of the other scales are suppressed, obtaining the decoupled image features F;
Step S4, category label guidance: first, the class features of the image and the text are generated; the generated class features then guide the decoupled image features F and the text features T, and the final category-related image features F' and text features T' are computed by multiplication;
S5, calculating similarity and the semantic-guided triplet loss: first, the category-related image features F' and text features T' output in step S4 are matched by category, and whether the image and the text belong to the same category is judged; the category attribute is input into the downstream task as external knowledge, and dynamic weight selection is performed on the heterogeneous information of image-text matching; the semantic-guided triplet loss is then calculated, steps S1-S5 are iterated, and back-propagation training is performed;
S6, a marine remote sensing image to be retrieved is input and the related remote sensing text data are output; or remote-sensing-related text data to be retrieved are input and the marine remote sensing image is output.
Further, step S3 is divided into two steps:
S31, for the image features I_m of each scale extracted by the image feature extraction module, an attention map A_m is constructed at the current scale based on an attention mechanism to extract the potential features, and a suppression mask M_m is generated;
S32, for the attention maps A_m and suppression masks M_m extracted at the different feature scales, A_m is used to promote the salient information at the corresponding scale and M_m is used to suppress the salient features of the other scales, obtaining the image features after the redundant information is filtered so as to realize scale decoupling. Through a progressive suppression scheme, the attention maps A_m are applied to the generation of the decoupled features F_s2l and F_l2s, where F_s2l is the decoupled feature in the small-to-large scale direction and F_l2s is the decoupled feature in the large-to-small scale direction; finally, the decoupled features F_s2l and F_l2s of all feature scales are combined by a concat operation into the final decoupled image features F.
Further, the decoupled features are calculated as follows:

F_s2l = concat_m(I_m ⊙ A_m ⊙ Π_{k<m} M_k), F_l2s = concat_m(I_m ⊙ A_m ⊙ Π_{k>m} M_k)

where m indexes the different scales, namely the three scales small, medium and large, and ⊙ denotes element-wise multiplication; cascading the attention maps A_m and suppression masks M_m in this way derives the decoupled features F_s2l and F_l2s.
Further, step S4 is specifically as follows:
S41, category semantic labels are obtained from the marine remote sensing images acquired in step S0, and the remote sensing image class features U are obtained by training a remote sensing image classifier;
S42, category semantic labels are obtained from the remote-sensing-related texts acquired in step S0, and the related-text class features V are obtained by training a remote-sensing-related text classifier;
S43, the decoupled image features F obtained in step S3 are multiplied by the remote sensing image class features U, and the text features T obtained in step S2 are multiplied by the related-text class features V. The purpose of the multiplication is to perform attention enhancement of the decoupled image features F and the text features T with the class features U and V of their respective modalities, obtaining the final category-related image features F' and category-related text features T'.
Further, step S31 is specifically as follows: the channel information of a feature is first aggregated by average pooling and max pooling operations to generate two feature descriptors; the attention map A_m is then generated from the feature descriptors by a standard convolutional layer and a sigmoid function, and the suppression mask M_m is generated by binary masking:

M_m = B(A_m)

where B is a binary mask that sets the most significant values of A_m to 0 and the others to 1.
Further, in step S5, the class features are first converted by softmax into the semantic classes p_I and p_T of the image and the text; then a parameter μ is defined to adjust the loss, expressed as:

μ = λ, if p_I = p_T; μ = 1, otherwise

where λ is a constant. On the basis of λ, the category-based triplet loss is designed as follows:

L = Σ_{T'⁻} [α − μ·S(F', T'⁺) + S(F', T'⁻)]₊ + Σ_{F'⁻} [α − μ·S(T', F'⁺) + S(T', F'⁻)]₊

where α refers to the margin and [·]₊ = max(·, 0); S(F', T'⁺) represents the similarity of the sample image and the positive-sample text; S(F', T'⁻) represents the similarity of the sample image and the negative-sample text; S(T', F'⁺) represents the similarity of the sample text and the positive-sample image; S(T', F'⁻) represents the similarity of the sample text and the negative-sample image. The first summation matches the image features F' with all text features, including the positive-sample text features T'⁺ and the negative-sample text features T'⁻; the second summation matches the text features T' with all image features, including the positive-sample image features F'⁺ and the negative-sample image features F'⁻. The purpose of the triplet loss function constructed from the two summations is to maximize the similarity with the positive samples and minimize the similarity with the negative samples.
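The category-based triplet loss above can be sketched in NumPy as follows. This is an illustrative sketch, not part of the patent: the hinge form, the batch construction (matching row indices as positive pairs, all other rows as negatives) and the rule μ = λ when the predicted classes agree (μ = 1 otherwise) are assumptions inferred from the surrounding description.

```python
import numpy as np

def semantic_triplet_loss(img, txt, img_cls, txt_cls, margin=0.2, lam=2.0):
    """Category-guided triplet loss over a batch.

    img, txt         : (B, D) L2-normalized category-related features F', T'
    img_cls, txt_cls : (B,) predicted semantic classes p_I, p_T
    Positive pairs are the matching row indices; every other row is a negative.
    """
    sim = img @ txt.T                      # (B, B) similarities S
    B = sim.shape[0]
    pos = np.diag(sim)                     # S(F', T'+) for each sample
    # mu = lam when the predicted classes agree, 1 otherwise (assumed form)
    mu = np.where(img_cls == txt_cls, lam, 1.0)
    loss = 0.0
    for i in range(B):
        for j in range(B):
            if i == j:
                continue
            # image anchor vs. negative text
            loss += max(0.0, margin - mu[i] * pos[i] + sim[i, j])
            # text anchor vs. negative image
            loss += max(0.0, margin - mu[i] * pos[i] + sim[j, i])
    return loss / (B * (B - 1))
```

With λ > 1, same-category pairs receive a larger weight on the positive similarity, which matches the description's claim that raising the similarity of same-category image-text pairs shortens convergence.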
The invention also provides a category-guided multi-scale decoupled marine remote sensing image text retrieval system for implementing the above category-guided multi-scale decoupled marine remote sensing image text retrieval method, comprising an input module, an image feature extraction module, a text feature extraction module, a bidirectional multi-scale decoupling module, a category label guidance module, a semantic-guided triplet loss module and an output module.
The image feature extraction module comprises a deep residual network and an atrous spatial pyramid pooling module, and is used to extract the multi-scale image features I_m.
The text feature extraction module extracts text features to obtain the text features T of the remote-sensing-related text.
The bidirectional multi-scale decoupling module decouples the multi-scale image features I_m output by the image feature extraction module to obtain the decoupled features F.
The category label guidance module comprises a remote sensing image classifier and a remote-sensing-related text classifier, used respectively to obtain the remote sensing image class features U and the related-text class features V; the category semantic labels are used as prior knowledge to guide the image and the text, constructing class features and realizing feature decoupling in the semantic dimension, where U and V are class features annotated by a pre-trained model; the decoupled image features F and the text features T are attention-enhanced with the class features U and V of their respective modalities to obtain the category-related image and text features.
The semantic-guided triplet loss module calculates the semantic-guided triplet loss; it performs category matching on the class features, judges whether the image and the text belong to the same category, inputs the category attribute into the downstream task as external knowledge, and performs dynamic weight selection on the heterogeneous information of image-text matching.
The input module is used to input the marine remote sensing image or remote-sensing-related text data to be retrieved, and the output module is used to output the remote-sensing-related text data or the marine remote sensing image.
Compared with the prior art, the invention has the advantages that:
(1) The problem of noise redundancy is solved. The invention effectively filters the large amount of redundant noise generated during multi-scale feature interaction. A bidirectional multi-scale decoupling module is constructed that adaptively extracts the potential features of each scale in both directions and suppresses the redundant features of the other scales; thus the effective features of each scale are extracted, the redundant features of each scale are suppressed, and a large amount of redundant noise is filtered out.
(2) The introduction of category information (semantic information) improves the robustness of the features. The invention unifies the semantic decoupling of the two dimensions. A category label guidance module is constructed, and the category semantic labels are used as prior knowledge to supervise the images and texts so as to construct better class features and realize feature decoupling in the semantic dimension. The category semantic features can emphasize the effective features, and the knowledge of semantic decoupling is mapped into the visual multi-scale sample space through cascading. The category attribute serves as a bridge between the information of the two modalities, providing the model with external knowledge while aligning multi-modal knowledge, helping the model to quickly extract effective features and mine the effective objects in the remote sensing image. Meanwhile, the alignment and fusion of the image multi-scale features, the effective information (text semantic features) and the image semantic features also generate expressions of category information, pixel attribution and scale characteristics, and the semantic information expressed by the image and text enables the back end of the network to make correct predictions of category attribution.
(3) The problems of difficult effective-feature extraction and low retrieval accuracy are solved by using prior knowledge. The invention constructs a semantic-guided triplet loss module to perform category matching on the class features, judges whether the image and the text belong to the same category, inputs the category attribute into the downstream task as external knowledge, and performs dynamic weight selection on the heterogeneous information of image-text matching. For example, remote sensing image and text classification models with high accuracy are trained as prior knowledge and added to the loss function; if the categories of the image and the text are the same, the similarity is increased, which greatly shortens the model convergence time, since the matching probability of an image and a text of the same category is in fact higher than that of a non-matching pair. The retrieval accuracy of the model is thus greatly increased.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a system architecture diagram of the present invention;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
Example 1
With reference to Figs. 1 and 2, a category-guided bidirectional multi-scale decoupled marine remote sensing image text retrieval method first preprocesses the data, including processing the marine remote sensing images. From the preprocessed data, the text features T are extracted by the text feature extraction module on the one hand, and the decoupled image features F are extracted by bidirectional multi-scale decoupling on the other hand. The decoupled image features F and the text features T are then input into the category label guidance module, and the category semantic labels (U and V) are used as prior knowledge to supervise the image and the text, constructing class features and realizing feature decoupling in the semantic dimension. Finally, the semantic-guided triplet loss is calculated from the similarity of the image and the text, whether the image and the text belong to the same category is judged, and back propagation is performed.
The method specifically comprises the following steps:
S0, acquiring a marine remote sensing image and a remote-sensing-related text.
S1, extracting image features of the marine remote sensing image: first, a convolutional neural network embeds the image features, and the resulting basic image features are sampled by dilated (atrous) convolutions with different sampling rates to obtain the image features I_m at the different scales. A representation of the image is obtained by this step.
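The sampling of the basic features with dilated convolutions at different rates in step S1 can be illustrated with a minimal NumPy sketch; the averaging kernel, the rates (1, 2, 4) and the single-channel simplification are illustrative assumptions, not the patent's network.

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """Naive 'same'-padded 2-D convolution with dilation rate `rate`.

    x      : (H, W) single-channel feature map
    kernel : (k, k) weights
    """
    k = kernel.shape[0]
    span = (k - 1) * rate // 2            # half-width of the dilated kernel
    xp = np.pad(x, span)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            # taps spaced `rate` apart, centred on (i, j)
            patch = xp[i:i + (k - 1) * rate + 1:rate,
                       j:j + (k - 1) * rate + 1:rate]
            out[i, j] = (patch * kernel).sum()
    return out

def multi_scale_features(x, rates=(1, 2, 4)):
    """Sample the basic features with dilated convs at different rates -> I_m."""
    kernel = np.full((3, 3), 1.0 / 9.0)   # illustrative averaging kernel
    return [dilated_conv2d(x, kernel, r) for r in rates]
```

A larger rate enlarges the receptive field without adding parameters, which is why different rates yield features at different effective scales.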
S2, extracting the text features T of the remote-sensing-related text. In a specific application, text feature extraction can use a word-vector embedding model (sentence embedding) and the Skip-thought text processing model. A representation of the text is obtained by this step.
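As a minimal stand-in for the sentence-embedding or Skip-thought extractor mentioned above (an assumption for illustration only, not the patent's extractor), text features T can be sketched as deterministic mean-pooled word vectors:

```python
import numpy as np

def embed_text(sentence, dim=8):
    """Mean-pooled word vectors as a stand-in text feature T.

    Each word maps to a deterministic pseudo-random vector seeded by a
    simple hash of the word; the vectors are averaged and L2-normalized.
    """
    vecs = []
    for word in sentence.lower().split():
        seed = sum(ord(c) for c in word)       # deterministic per-word seed
        rng = np.random.default_rng(seed)
        vecs.append(rng.standard_normal(dim))
    t = np.mean(vecs, axis=0)
    return t / np.linalg.norm(t)
```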
S3, bidirectional multi-scale decoupling: the image features at the different scales obtained in step S1 are decoupled; the corresponding potential features are extracted at each scale and the redundant features of the other scales are suppressed, obtaining the decoupled image features F. This comprises the following two steps:
S31, for the image features I_m of each scale extracted by the image feature extraction module, an attention map A_m is constructed at the current scale based on an attention mechanism to extract the potential features, and a suppression mask M_m is generated. Specifically: the channel information of a feature is first aggregated by average pooling and max pooling operations to generate two feature descriptors; the attention map A_m is then generated from the feature descriptors by a standard convolutional layer and a sigmoid function, and the suppression mask M_m is generated by binary masking:

M_m = B(A_m)

where B is a binary mask that sets the most significant values of A_m to 0 and the others to 1. The suppression mask mitigates the coverage effect of A_m at the other scales, making the common reference information at the different scales stand out.
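Step S31 can be sketched as follows; replacing the standard convolutional layer by a plain sum of the two pooled descriptors and choosing the "most significant" values by a quantile threshold are simplifying assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_and_mask(feat, keep_ratio=0.75):
    """Build the per-scale attention map A_m and suppression mask M_m.

    feat : (C, H, W) features at one scale.
    Channel information is aggregated by average and max pooling into two
    descriptors, combined (here simply summed, in place of the standard
    convolutional layer) and squashed by a sigmoid into A_m.  M_m is a
    binary mask that zeroes the most significant values of A_m (the top
    1 - keep_ratio fraction) and sets the rest to 1.
    """
    avg_desc = feat.mean(axis=0)          # (H, W) average-pooled descriptor
    max_desc = feat.max(axis=0)           # (H, W) max-pooled descriptor
    attn = sigmoid(avg_desc + max_desc)   # A_m
    thresh = np.quantile(attn, keep_ratio)
    mask = (attn <= thresh).astype(float) # M_m: suppress most salient cells
    return attn, mask
```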
S32, for the attention maps A_m and suppression masks M_m extracted at the different feature scales, A_m is used to promote the salient information at the corresponding scale and M_m is used to suppress the salient features of the other scales, obtaining the image features after the redundant information is filtered so as to realize scale decoupling. Through a progressive suppression scheme, the attention maps A_m are applied to the generation of the decoupled features F_s2l and F_l2s; finally, the decoupled features F_s2l and F_l2s of all feature scales are combined by a concat operation into the final decoupled image features F. The formula is as follows:

F_s2l = concat_m(I_m ⊙ A_m ⊙ Π_{k<m} M_k), F_l2s = concat_m(I_m ⊙ A_m ⊙ Π_{k>m} M_k), F = concat(F_s2l, F_l2s)

where m indexes the different scales, namely the three scales small, medium and large, and ⊙ denotes element-wise multiplication; cascading the attention maps A_m and suppression masks M_m derives the decoupled features F_s2l and F_l2s, where F_s2l is the decoupled feature in the small-to-large scale direction and F_l2s is the decoupled feature in the large-to-small scale direction.
In particular, since the attention map represents the significant regions of a feature, the suppression mask leverages the attention-map representation to suppress the saliency information at the corresponding scale. The suppression mask mitigates the coverage effect of the attention map at the other scales, highlighting the distinct information.
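The progressive-suppression cascade of step S32 can be sketched as follows; the single-channel features, the common spatial size across scales and the exact cascade order are assumptions inferred from the description, not the patent's definitive implementation.

```python
import numpy as np

def bidirectional_decouple(feats, attns, masks):
    """Progressive suppression in both scale directions, then concat.

    feats, attns, masks : lists ordered small -> large scale, each (H, W)
    (single-channel for brevity; all scales assumed resampled to one size).
    At scale m the attention map promotes that scale's salient information,
    while the suppression masks of the already-visited scales damp their
    salient regions (assumed form of the cascade).
    """
    def cascade(order):
        out, running = [], np.ones_like(feats[0])
        for m in order:
            out.append(feats[m] * attns[m] * running)  # promote scale m
            running = running * masks[m]               # suppress it downstream
        return np.concatenate([o.ravel() for o in out])

    n = len(feats)
    f_s2l = cascade(range(n))             # small -> large direction
    f_l2s = cascade(range(n - 1, -1, -1)) # large -> small direction
    return np.concatenate([f_s2l, f_l2s]) # final decoupled feature F
```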
Step S4, guiding by category labels: firstly, generating class characteristics of the image and the text, and then guiding decoupling characteristics of the image by using the generated class characteristicsFAnd text featuresTMultiplying the resulting class-dependent image and text features
Figure 523412DEST_PATH_IMAGE002
And
Figure 126431DEST_PATH_IMAGE003
the method comprises the following steps:
s41, obtaining category semantic labels from the ocean remote sensing images obtained in the step S0, and obtaining the category characteristics of the remote sensing images through training of a remote sensing image classifierU
S42, obtaining category semantic labels from the remote sensing related texts obtained in the step S0, and obtaining the category characteristics of the remote sensing related texts through training of a remote sensing related text classifierV
The two classifiers are pre-training models, the prediction accuracy rate of the two classifiers reaches over 80 percent, rich semantic knowledge in the pre-training models can be transferred to a subsequent training process, and the pre-training models can be regarded as prior knowledge supervision of the models.
S43, the decoupled image features F obtained in step S3 are multiplied by the remote sensing image category features U to guide the retrieval network to probe important and reliable category-related information, and the text features T obtained in step S2 are multiplied by the remote sensing related text category features V. The purpose of the multiplication is to perform attention enhancement on the decoupled image features F and the text features T with the category features U and V of their corresponding modalities, yielding the final category-related image features F′ and text features T′. By fully exploiting multiplication, the relevant features are significantly enhanced during feature combination: F′ and T′ not only capture identifiable multi-scale semantic information but also highlight reliable category-related knowledge, thereby improving the accuracy of the retrieval network. The multiplication of the decoupled image features F with the image category features U (and of T with V) guides the image and text features with the classification prior knowledge of image and text, semantically decouples the knowledge of the pre-trained semantic features, and combines the decoupled semantic information with the original retrieval network to explore meaningful and reliable category-related data, so that category supervision of the semantic information is achieved while semantic and scale information is fused and aligned across the different modalities through the prior knowledge guide module. The formulas are as follows:

F′ = F ⊙ U,  T′ = T ⊙ V

where ⊙ denotes the multiplication operation described above.
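The guidance step above amounts to weighting each modality's features by its category features. A minimal NumPy sketch of this multiplication-based attention enhancement (the 4-dimensional toy features, their values, and the function name are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def category_guided_enhance(feat, cat_feat):
    """Enhance a modality feature with its category feature by
    element-wise multiplication (a sketch of the guidance step;
    shapes and names are illustrative, not from the patent)."""
    return feat * cat_feat  # element-wise (Hadamard) product

# toy example: 4-dim decoupled image feature F and category feature U
F = np.array([0.2, 0.8, 0.1, 0.5])
U = np.array([0.9, 1.0, 0.1, 0.7])
F_hat = category_guided_enhance(F, U)
```

Entries of the category feature near 1 pass the corresponding components of F through, while entries near 0 suppress them, which is the sense in which the multiplication "highlights reliable category-related knowledge".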
S5, calculating similarity and semantic-guided triplet loss:
Firstly, the category-related image features F′ and text features T′ output in step S4 are matched by category to judge whether the image and the text belong to the same category, so as to improve the retrieval probability of cross-modal data of the same category; the category attribute is input into the downstream task as external knowledge, and dynamic weight selection is performed on the heterogeneous information matched between heterogeneous images and texts. The semantic-guided triplet loss is then calculated, steps S1-S5 are iterated, and back-propagation training is carried out.
First, the category features are converted by softmax into the semantic categories c_I and c_T of the image and the text:

c_I = softmax(U),  c_T = softmax(V)

Then a parameter λ is defined to adjust the loss according to the predicted categories c_I and c_T, with μ a constant. With the constant μ fixed, the category-based triplet loss is designed as:

L = Σ λ · max(0, α − S(I, T⁺) + S(I, T⁻)) + Σ λ · max(0, α − S(T, I⁺) + S(T, I⁻))
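The softmax conversion and category-match check described above can be sketched as follows (a toy 3-category example; the logit values are illustrative assumptions, not from the patent):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of class logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

# class features U (image) and V (text) as 3-category logits (toy values)
U = np.array([2.0, 0.5, 0.1])
V = np.array([1.8, 0.7, 0.2])
c_img, c_txt = softmax(U), softmax(V)
# category match: do image and text predict the same semantic category?
same_category = bool(c_img.argmax() == c_txt.argmax())
```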
The purpose of the triplet loss function is to minimize the semantic-space distance between a sample and its positive sample and to increase the distance between the sample and the corresponding negative samples. Here α denotes the margin; S(I, T⁺) denotes the similarity of the sample image and the positive sample text; S(I, T⁻) the similarity of the sample image and a negative sample text; S(T, I⁺) the similarity of the sample text and the positive sample image; and S(T, I⁻) the similarity of the sample text and a negative sample image. The first summation matches the category-related image features F′ with all text features (the text features T′ of the positive sample and the text features of the negative samples); the second summation matches the text features T′ with all image features (the image features F′ of the positive sample and the image features of the negative samples). The triplet loss constructed by the two summations maximizes the similarity with the positive samples and minimizes the similarity with the negative samples.
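Under the bidirectional hinge form stated above (a margin α and two summations over negatives), the loss can be sketched as follows (the function name, similarity values, and the margin value 0.2 are illustrative assumptions, not from the patent):

```python
import numpy as np

def triplet_loss(s_it_pos, s_it_neg, s_ti_pos, s_ti_neg, alpha=0.2):
    """Bidirectional hinge triplet loss over similarity scores.
    s_it_*: image-anchored similarities to positive / negative texts;
    s_ti_*: text-anchored similarities to positive / negative images.
    alpha is the margin; arrays of negatives are summed over."""
    img_term = np.maximum(0.0, alpha - s_it_pos + s_it_neg).sum()
    txt_term = np.maximum(0.0, alpha - s_ti_pos + s_ti_neg).sum()
    return img_term + txt_term

# one image/text pair, two negative texts and one negative image (toy scores)
loss = triplet_loss(0.9, np.array([0.3, 0.8]), 0.85, np.array([0.2]))
```

Only negatives whose similarity comes within the margin of the positive contribute; well-separated triplets incur zero loss.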
And S6, inputting the ocean remote sensing image to be retrieved and outputting remote sensing related text data; or inputting remote sensing related text data to be retrieved and outputting the ocean remote sensing image.
Example 2
The category-guided bidirectional multi-scale decoupling marine remote sensing image text retrieval system comprises an input module, an image feature extraction module, a text feature extraction module, a bidirectional multi-scale decoupling module, a category label guide module, a semantic-guided triplet loss module and an output module.
The image feature extraction module comprises a convolutional neural network and an atrous spatial pyramid pooling module, and is used for extracting the multi-scale image features F_m.
The text feature extraction module is used for extracting text features by means of a word-vector embedding model and a Skip-thought text processing model, obtaining the text features T of the remote sensing related text.
The bidirectional multi-scale decoupling module decouples the multi-scale image features F_m output by the image feature extraction module to obtain the decoupled features F.
The category label guide module comprises a remote sensing image classifier and a remote sensing related text classifier, which respectively obtain the remote sensing image category features U and the remote sensing related text category features V. The category semantic labels are used as prior knowledge to guide the images and texts, constructing category features and realizing feature decoupling in the semantic dimension; U and V are category features labeled by pre-trained models. The decoupled image features F and the related text features T undergo attention enhancement with the category features U and V of their respective modalities; the enhanced information can also be combined with the original retrieval network, realizing the fusion of semantic and scale features, exploring meaningful and reliable category-related data, and obtaining the category-related image and text features.
The semantic-guided triplet loss module is used for calculating the semantic-guided triplet loss: it performs category matching on the category features, judges whether the image and the text belong to the same category, inputs the category attribute into the downstream task as external knowledge, and performs dynamic weight selection on the heterogeneous information matched between heterogeneous images and texts.
the input module is used for inputting a marine remote sensing image or remote sensing related text data to be retrieved, and the output module is used for outputting the remote sensing related text data or the marine remote sensing image.
The function implementation and data processing of each module are the same as the corresponding parts of Embodiment 1 and are not described here again.
It should be noted that the method of the present invention realizes cross-modal retrieval between the two modalities of image and text, using one type of data as a query to retrieve the other: when the input is an ocean remote sensing image, the output retrieval result is the corresponding text data; when the input is ocean remote sensing related text data, the output retrieval result is the corresponding ocean remote sensing image.
In summary, the present invention uses category information as prior knowledge to guide a more accurate representation of cross-modal information. Specifically, compared with existing methods, the bidirectional multi-scale decoupling module adaptively extracts potential features and suppresses redundant features at other scales, generating discriminative clues and solving the noise-redundancy problem of cascaded scale decoupling. In addition, a category label guide module and a semantic-guided triplet loss module are constructed. The category label guide module supervises images and texts with category semantic labels as prior knowledge, constructing better category features and realizing feature decoupling in the semantic dimension; the decoupled semantic information is then combined with the original retrieval network, realizing the fusion of semantic and scale features and exploring meaningful and reliable category-related data. The semantic-guided triplet loss module performs category matching on the category features, judges whether the image and the text belong to the same category, inputs the category attribute into the downstream task as external knowledge, and performs dynamic weight selection on the heterogeneous information matched between heterogeneous images and texts, improving both the retrieval probability of same-category cross-modal data and the convergence speed of the model. Finally, by performing category matching on the generated category features, a category-based triplet loss is designed to improve the retrieval probability of same-category cross-modal data.
It is understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art should understand that they can make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.

Claims (7)

1. A category-guided multi-scale decoupled marine remote sensing image text retrieval method, characterized by comprising the following steps:
s0, obtaining a marine remote sensing image and a remote sensing related text;
S1, extracting image features of the ocean remote sensing image: firstly, a convolutional neural network is used to embed the features of the image, and the obtained basic image features are sampled by atrous convolutions with different sampling rates to obtain the image features F_m at different scales;
S2, extracting the text features T of the remote sensing related text;
S3, bidirectional multi-scale decoupling: the image features of different scales obtained in step S1 are decoupled, the corresponding potential features at each scale are extracted, redundant features at other scales are suppressed, and the decoupled image features F are obtained;
Step S4, category label guiding: firstly, the category features of the image and the text are generated, and then the generated category features are used to guide the decoupled image features F and the text features T, using multiplication to calculate the final category-related image features F′ and text features T′;
S5, calculating similarity and semantic-guided triplet loss:
firstly, the category-related image features F′ and text features T′ output in step S4 are matched by category to judge whether the image and the text belong to the same category; the category attribute is input into the downstream task as external knowledge, and dynamic weight selection is performed on the heterogeneous information matched between heterogeneous images and texts; then the semantic-guided triplet loss is calculated, steps S1-S5 are iterated, and back-propagation training is carried out;
s6, inputting a marine remote sensing image to be retrieved and outputting remote sensing related text data; or inputting remote sensing related text data to be retrieved and outputting the ocean remote sensing image.
2. The category-guided multi-scale decoupled marine remote sensing image text retrieval method according to claim 1, wherein step S3 is divided into two steps:
S31, for the image features F_m of each scale extracted by the image feature extraction module, an attention map A_m is constructed on the current scale based on an attention mechanism to extract the potential features, and a suppression mask M_m is generated;
S32, for the attention maps A_m and suppression masks M_m extracted at the different feature scales, A_m promotes the significant information at the corresponding scale and M_m suppresses the salient features of the other scales, yielding image features from which redundant information has been filtered so as to achieve scale decoupling; through progressive suppression, the attention maps A_m are applied to the generation of the decoupled features F_s and F_l, where F_s is the decoupled feature in the small-to-large scale direction and F_l is the decoupled feature in the large-to-small scale direction; finally, a concat operation combines the decoupled features F_s and F_l of the various feature scales into the final decoupled image features F;
3. The category-guided multi-scale decoupled marine remote sensing image text retrieval method according to claim 2, wherein in step S32, the decoupled features are calculated as:

F_s(m) = A_m ⊙ F_m ⊙ Π_{j&lt;m} M_j,  F_l(m) = A_m ⊙ F_m ⊙ Π_{j&gt;m} M_j

wherein m indexes the different scales, namely the three scales of large, medium and small; cascading the attention maps A_m with the suppression masks M_m of the other scales yields the decoupled features F_s and F_l.
4. The category-guided multi-scale decoupled marine remote sensing image text retrieval method according to claim 1, characterized in that step S4 is specifically as follows:
S41, obtaining category semantic labels from the ocean remote sensing images obtained in step S0, and obtaining the remote sensing image category features U through training of a remote sensing image classifier;
S42, obtaining category semantic labels from the remote sensing related texts obtained in step S0, and obtaining the remote sensing related text category features V through training of a remote sensing related text classifier;
S43, multiplying the decoupled image features F obtained in step S3 by the remote sensing image category features U, and multiplying the text features T obtained in step S2 by the remote sensing related text category features V, the purpose being to perform attention enhancement on the decoupled image features F and the related text features T with the category features U and V of the respective corresponding modalities, so as to obtain the final category-related image features F′ and text features T′.
5. The category-guided multi-scale decoupled marine remote sensing image text retrieval method according to claim 2, characterized in that the specific steps of step S31 are: the channel information of a feature is first aggregated through average pooling and maximum pooling operations to generate two feature descriptors, and the attention map A_m is then generated from the feature descriptors by a standard convolution layer and a sigmoid function; the suppression mask M_m is generated by binary masking:

M_m = B(A_m)

wherein B is a binary mask that sets the most salient values of A_m to 0 and the others to 1.
6. The category-guided multi-scale decoupled marine remote sensing image text retrieval method according to claim 1, wherein in step S5, the category features are first converted by softmax into the semantic categories c_I and c_T of the image and the text; then a parameter λ is defined to adjust the loss according to the predicted categories c_I and c_T, with μ a constant; with the constant μ fixed, the category-based triplet loss is designed as:

L = Σ λ · max(0, α − S(I, T⁺) + S(I, T⁻)) + Σ λ · max(0, α − S(T, I⁺) + S(T, I⁻))

wherein α denotes the margin; S(I, T⁺) denotes the similarity of the sample image and the positive sample text; S(I, T⁻) the similarity of the sample image and a negative sample text; S(T, I⁺) the similarity of the sample text and the positive sample image; and S(T, I⁻) the similarity of the sample text and a negative sample image; the first summation matches the image features F′ with all text features, including the text features T′ of the positive sample and the text features of the negative samples; the second summation matches the text features T′ with all image features, including the image features F′ of the positive sample and the image features of the negative samples; the objective of the triplet loss constructed by the two summations is to maximize the similarity with the positive samples and minimize the similarity with the negative samples.
7. A category-guided multi-scale decoupled marine remote sensing image text retrieval system, characterized in that, for realizing the category-guided multi-scale decoupled marine remote sensing image text retrieval method of any one of claims 1-6, it comprises an input module, an image feature extraction module, a text feature extraction module, a bidirectional multi-scale decoupling module, a category label guide module, a semantic-guided triplet loss module and an output module;
the image feature extraction module comprises a convolutional neural network and an atrous spatial pyramid pooling module, and is used for extracting the multi-scale image features F_m;
the text feature extraction module extracts text features to obtain the text features T of the remote sensing related text;
the bidirectional multi-scale decoupling module decouples the multi-scale image features F_m output by the image feature extraction module to obtain the decoupled features F;
the category label guide module comprises a remote sensing image classifier and a remote sensing related text classifier, which respectively obtain the remote sensing image category features U and the remote sensing related text category features V; the category semantic labels are used as prior knowledge to guide the images and texts, constructing category features and realizing feature decoupling in the semantic dimension; U and V are category features labeled by pre-trained models; the decoupled image features F and the related text features T undergo attention enhancement with the category features U and V of their respective modalities to obtain the category-related image and text features;
the semantic-guided triplet loss module is used for calculating the semantic-guided triplet loss: performing category matching on the category features, judging whether the image and the text belong to the same category, inputting the category attribute into the downstream task as external knowledge, and performing dynamic weight selection on the heterogeneous information matched between heterogeneous images and texts;
the input module is used for inputting the marine remote sensing image or remote sensing related text data to be retrieved, and the output module is used for outputting the remote sensing related text data or the marine remote sensing image.
CN202211223823.1A 2022-10-09 2022-10-09 Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system Active CN115311463B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211223823.1A CN115311463B (en) 2022-10-09 2022-10-09 Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211223823.1A CN115311463B (en) 2022-10-09 2022-10-09 Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system

Publications (2)

Publication Number Publication Date
CN115311463A true CN115311463A (en) 2022-11-08
CN115311463B CN115311463B (en) 2023-02-03

Family

ID=83866005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211223823.1A Active CN115311463B (en) 2022-10-09 2022-10-09 Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system

Country Status (1)

Country Link
CN (1) CN115311463B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127123A (en) * 2023-04-17 2023-05-16 中国海洋大学 Semantic instance relation-based progressive ocean remote sensing image-text retrieval method
CN116186317A (en) * 2023-04-23 2023-05-30 中国海洋大学 Cross-modal cross-guidance-based image-text retrieval method and system
CN117556062A (en) * 2024-01-05 2024-02-13 武汉理工大学三亚科教创新园 Ocean remote sensing image audio retrieval network training method and application method
CN117573916A (en) * 2024-01-17 2024-02-20 武汉理工大学三亚科教创新园 Retrieval method, device and storage medium for image text of marine unmanned aerial vehicle

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017103035A1 (en) * 2015-12-18 2017-06-22 Ventana Medical Systems, Inc. Systems and methods of unmixing images with varying acquisition properties
US10713794B1 (en) * 2017-03-16 2020-07-14 Facebook, Inc. Method and system for using machine-learning for object instance segmentation
CN111798460A (en) * 2020-06-17 2020-10-20 南京信息工程大学 Satellite image segmentation method
CN113487629A (en) * 2021-07-07 2021-10-08 电子科技大学 Image attribute editing method based on structured scene and text description
WO2022160771A1 (en) * 2021-01-26 2022-08-04 武汉大学 Method for classifying hyperspectral images on basis of adaptive multi-scale feature extraction model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017103035A1 (en) * 2015-12-18 2017-06-22 Ventana Medical Systems, Inc. Systems and methods of unmixing images with varying acquisition properties
US10713794B1 (en) * 2017-03-16 2020-07-14 Facebook, Inc. Method and system for using machine-learning for object instance segmentation
CN111798460A (en) * 2020-06-17 2020-10-20 南京信息工程大学 Satellite image segmentation method
WO2022160771A1 (en) * 2021-01-26 2022-08-04 武汉大学 Method for classifying hyperspectral images on basis of adaptive multi-scale feature extraction model
CN113487629A (en) * 2021-07-07 2021-10-08 电子科技大学 Image attribute editing method based on structured scene and text description

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
QINMIN CHENG等: "A Semantic-Preserving Deep Hashing Model for multi-label remote sensing image retrieval", 《REMOTE SENSING》 *
YUANSHENG HUA等: "Recurrently exploring class-wise attention in a hybrid convolutional and bidirectional LSTM network for multi-label aerial image classification", 《ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING》 *
单守平: "基于深度学习和标签语义关联的遥感影像多标签分类", 《中国优秀硕士学位论文全文数据库(电子期刊)》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127123A (en) * 2023-04-17 2023-05-16 中国海洋大学 Semantic instance relation-based progressive ocean remote sensing image-text retrieval method
CN116127123B (en) * 2023-04-17 2023-07-07 中国海洋大学 Semantic instance relation-based progressive ocean remote sensing image-text retrieval method
CN116186317A (en) * 2023-04-23 2023-05-30 中国海洋大学 Cross-modal cross-guidance-based image-text retrieval method and system
CN116186317B (en) * 2023-04-23 2023-06-30 中国海洋大学 Cross-modal cross-guidance-based image-text retrieval method and system
CN117556062A (en) * 2024-01-05 2024-02-13 武汉理工大学三亚科教创新园 Ocean remote sensing image audio retrieval network training method and application method
CN117556062B (en) * 2024-01-05 2024-04-16 武汉理工大学三亚科教创新园 Ocean remote sensing image audio retrieval network training method and application method
CN117573916A (en) * 2024-01-17 2024-02-20 武汉理工大学三亚科教创新园 Retrieval method, device and storage medium for image text of marine unmanned aerial vehicle
CN117573916B (en) * 2024-01-17 2024-04-26 武汉理工大学三亚科教创新园 Retrieval method, device and storage medium for image text of marine unmanned aerial vehicle

Also Published As

Publication number Publication date
CN115311463B (en) 2023-02-03

Similar Documents

Publication Publication Date Title
CN115311463B (en) Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system
JP7335907B2 (en) Character structuring extraction method and device, electronic device, storage medium, and computer program
CN112966684A (en) Cooperative learning character recognition method under attention mechanism
CN111914107B (en) Instance retrieval method based on multi-channel attention area expansion
CN111815602A (en) Building PDF drawing wall recognition device and method based on deep learning and morphology
TW202207077A (en) Text area positioning method and device
CN111932577B (en) Text detection method, electronic device and computer readable medium
CN114419642A (en) Method, device and system for extracting key value pair information in document image
CN115658934A (en) Image-text cross-modal retrieval method based on multi-class attention mechanism
CN114372475A (en) Network public opinion emotion analysis method and system based on RoBERTA model
CN110245292B (en) Natural language relation extraction method based on neural network noise filtering characteristics
CN114782722A (en) Image-text similarity determining method and device and electronic equipment
CN112348001B (en) Training method, recognition method, device, equipment and medium for expression recognition model
Vu et al. Revising FUNSD dataset for key-value detection in document images
CN111159411B (en) Knowledge graph fused text position analysis method, system and storage medium
CN116579348A (en) False news detection method and system based on uncertain semantic fusion
US20230154077A1 (en) Training method for character generation model, character generation method, apparatus and storage medium
CN112800259B (en) Image generation method and system based on edge closure and commonality detection
CN111652164B (en) Isolated word sign language recognition method and system based on global-local feature enhancement
CN114637846A (en) Video data processing method, video data processing device, computer equipment and storage medium
CN114820885A (en) Image editing method and model training method, device, equipment and medium thereof
Priya et al. Developing an offline and real-time Indian sign language recognition system with machine learning and deep learning
CN113313108A (en) Saliency target detection method based on super-large receptive field characteristic optimization
Li et al. ViT2CMH: Vision Transformer Cross-Modal Hashing for Fine-Grained Vision-Text Retrieval.
CN112347196B (en) Entity relation extraction method and device based on neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant