CN117520589A - Cross-modal remote sensing image-text retrieval method with fusion of local features and global features


Info

Publication number
CN117520589A
Authority
CN
China
Prior art keywords
features
text
local
global
image
Prior art date
Legal status
Granted
Application number
CN202410008193.9A
Other languages
Chinese (zh)
Other versions
CN117520589B (en)
Inventor
赵作鹏
缪小然
刘营
高宇蒙
闵冰冰
胡建峰
贺晨
赵广明
周杰
孙守都
宋喷玉
Current Assignee
Yanyuan Intelligent Technology Xuzhou Co ltd
China University of Mining and Technology CUMT
Original Assignee
Yanyuan Intelligent Technology Xuzhou Co ltd
China University of Mining and Technology CUMT
Priority date
Filing date
Publication date
Application filed by Yanyuan Intelligent Technology Xuzhou Co ltd, China University of Mining and Technology CUMT filed Critical Yanyuan Intelligent Technology Xuzhou Co ltd
Priority to CN202410008193.9A priority Critical patent/CN117520589B/en
Publication of CN117520589A publication Critical patent/CN117520589A/en
Application granted granted Critical
Publication of CN117520589B publication Critical patent/CN117520589B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal remote sensing image-text retrieval method that fuses local features and global features. After global and local features are extracted from an input image, the difference between them is taken into account: weights are dynamically generated by a multi-level information feature fusion module, and the two kinds of features are weighted to better represent the image. The text information is modeled with a recurrent neural network to extract its temporal information. A similarity measure between the weighted, fused image features and the text features is computed, and the retrieval results are ranked in descending order of similarity. Finally, reverse retrieval is performed using the candidate information to obtain the final retrieval result. The method can correct global information with local information and supplement local information with global information, so that the desired remote sensing data can be retrieved more accurately.

Description

Cross-modal remote sensing image-text retrieval method with fusion of local features and global features
Technical Field
The invention relates to a remote sensing image-text retrieval method, in particular to a cross-modal remote sensing image-text retrieval method that fuses local features and global features, and belongs to the technical field of remote sensing image processing.
Background
Remote sensing is an important means of data acquisition and processing: a comprehensive technology that uses various sensing instruments to collect and process the electromagnetic wave information radiated and reflected by distant targets and finally forms images, so that various scenes on the ground can be detected and identified and high-resolution remote sensing images can be queried. Efficient, systematic and well-designed remote sensing image retrieval helps provide reliable support for environmental monitoring, urban planning, agricultural management, disaster response and other scenarios, but effectively exploiting the rapidly expanding mass of remote sensing image data from complex sources has become a challenge that cannot be ignored.
The purpose of remote sensing image-text retrieval is to precisely locate the text description that matches a query image, or to find the corresponding images from a caption description, in a multi-modal remote sensing image database. Efficient and accurate retrieval of the target modality has become a research hotspot and difficulty in the face of multi-source, multi-modal remote sensing image data. Cross-modal remote sensing image-text retrieval based on two-tower neural networks has gradually become the dominant approach; however, the gap between modalities and the difficulty of aligning them make accurate cross-modal retrieval hard to achieve. Designing an efficient and accurate remote sensing image retrieval method is therefore a problem to be solved in the prior art.
Current deep-learning-based image-text retrieval methods mainly focus on the global features of remote sensing images. A limitation of this approach is that it may fail to accurately retrieve certain key regions in the image, and for some specific word semantics no corresponding image region can be found. Furthermore, these methods often neglect the joint learning of local and global features, so the links between regional features and the overall context are not tight enough. This defect leads to a low matching degree between the retrieval results and text words that refer to non-salient objects.
In addition, the retrieval process is reversible: if a sample in modality A is a candidate retrieval object for a sample in modality B, the reverse also holds, and the same principle still applies when text is retrieved from a remote sensing image. Although the correctness of such retrieval is not in doubt, existing retrieval frameworks often ignore this intrinsic relation of bidirectional retrieval and, during inference, ignore the bidirectional ranking information in the cross-modal similarity matrix, which is important for further improving retrieval precision and can be used to optimize the retrieval results a second time. In other words, once a text and an image are matched, each must be retrievable from the other. Developing a practical and efficient image-text retrieval method that solves these problems has therefore become an urgent need.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a cross-modal remote sensing image-text retrieval method that fuses local features and global features, which can correct global information with local information and supplement local information with global information, so as to retrieve the desired remote sensing data more accurately.
In order to achieve the purpose, the cross-modal remote sensing image-text retrieval method for fusing local features and global features specifically comprises the following steps:
Step1, perform global feature extraction on an input image using a convolutional neural network containing a multi-level visual attention mechanism to obtain a global feature vector V_g = F(I), where F denotes the convolutional neural network containing the multi-level visual attention mechanism and I is the input image;
Step2, identify the targets in the remote sensing image with a target detection network to obtain the local targets O = D(I), where o_j denotes the j-th class of local target in the input image and D is the target detection network; then perform deep learning on the detected target regions through a graph neural network to obtain the local feature vectors V_l = G(O), where G is the graph neural network feature extractor;
Step3, first perform internal similarity calculation on the global feature vector V_g and the local feature vectors V_l, then dynamically weight and fuse the processed global and local feature vectors through a cross-attention mechanism CA to obtain the complete image feature vector V;
Step4, model the input text S with a recurrent neural network, then obtain the text feature vector T through a multi-layer perceptron;
Step5, calculate the similarity measure between the weighted, fused image features and the text features, and rank the retrieval results in descending order of similarity;
Step6, perform reverse retrieval using the candidate information to obtain the final retrieval result.
Further, step3 specifically comprises the following steps:
Step3-1, perform internal similarity calculation on the global feature vector and the local feature vectors to obtain finer-grained feature vectors V_g' and V_l': V_g' = MA(V_g), V_l' = MA(V_l), where MA is the multi-head attention mechanism and V_g and V_l are the original global and local vectors;
Step3-2, dynamically weight and fuse the refined global and local feature vectors through the cross-attention mechanism CA to obtain the cross-attended global feature vector V_g'' and local feature vector V_l'';
Step3-3, filter the global features using a mask generated from the local features, while directly supplementing the local features with the global features, where W is a dynamically updatable weight used in generating the mask;
Step3-4, superpose the filtered global features and the supplemented local features to obtain the mixed visual information;
Step3-5, apply a linear transformation to the mixed visual information to obtain a learnable dynamic weight, use it to update the fused features, and finally obtain the visual vector V fused from the global and local feature vectors, where W_1 and W_2 are the weight matrices of the linear transformation.
Further, step4 specifically comprises the following steps:
Embed the input text S at the word level to form the word vector sequence W = {w_1, w_2, ..., w_n}, then feed the sequence into a recurrent neural network for layer-by-layer processing of the text, where the forward and backward recurrent neural networks process the sequence in opposite directions and h_i denotes the hidden state of the i-th layer;
The text feature vector is then formed by the multi-layer perceptron as T = MLP(h), where T denotes the generated text feature vector, MLP denotes the multi-layer perceptron, and h is the hidden representation produced by the recurrent neural network.
Further, in Step5, the retrieval results are ranked in descending order of similarity using a ranking loss with minimum margin α, computed over paired image-text pairs and unpaired image-text pairs.
Further, in Step6, an image-to-text query proceeds as follows:
compute a query component from the k texts most similar to the query image, where β is the ranking coefficient and r denotes the rank of the candidate text;
perform a reverse search with each obtained candidate text; if the query image lies among the L nearest images, compute a second query component from the rank of the candidate image, otherwise this query component is 0;
define a significance component to quantify the model's confidence in the similarity: the higher the similarity ratio between the candidate text and the image in the reverse search, the higher the certainty.
Compared with the prior art, the cross-modal remote sensing image-text retrieval method with the fusion of the local features and the global features has the following advantages:
1. The invention fuses local features and global features, correcting global information with local information and supplementing local information with global information, which improves cross-modal remote sensing image-text retrieval performance and allows the desired remote sensing data to be retrieved more accurately;
2. By filtering redundant features with high similarity, the invention strengthens and highlights the features of salient targets, reduces the burden that redundant target relations place on the model, and improves the model's attention to salient instances, so that better visual representations can be obtained;
3. The invention makes full use of the information in the similarity matrix to optimize the retrieval results a second time; the post-processing algorithm comprehensively considers the various kinds of information in the similarity matrix through a bidirectional retrieval method, so higher retrieval precision can be obtained without additional training.
Drawings
FIG. 1 is a schematic diagram of the architecture of the present invention;
FIG. 2 is a schematic diagram of a multi-level information feature fusion of the present invention;
fig. 3 is a schematic diagram of a reordering module of the present invention.
Detailed Description
The cross-modal remote sensing image-text retrieval method fusing local features and global features uses a convolutional neural network containing a multi-level visual attention mechanism to extract global features from the input image, and a graph neural network to extract local features. Taking the difference between global and local features into account, weights are dynamically generated by a multi-level information feature fusion module, and the two kinds of features are weighted to better characterize the image. The text information is modeled with a recurrent neural network to extract its temporal information. The similarity measure between the weighted, fused image features and the text features is calculated, and the retrieval results are ranked in descending order of similarity. Finally, reverse retrieval is performed using the candidate information, and various ranking factors are considered to obtain the final retrieval result and enhance retrieval reliability. The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the cross-modal remote sensing image-text retrieval method for fusing local features and global features specifically comprises the following steps:
Step1, perform global feature extraction on an input image using a convolutional neural network containing a multi-level visual attention mechanism to obtain a global feature vector V_g = F(I), where F denotes the convolutional neural network containing the multi-level visual attention mechanism and I is the input image.
The convolutional neural network structure containing a multi-level visual attention mechanism can focus on the key information of the image at different levels, so that more global features are captured from the input image and a richer representation is provided for the subsequent deep learning tasks.
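The patent does not give code; as a rough illustration, the following Python (PyTorch) sketch shows one way such a global encoder could look — a ResNet-50 backbone with a squeeze-and-excitation-style attention block after each stage, globally pooled into a single vector. The class and parameter names (GlobalEncoder, ChannelAttention, dim) are hypothetical, and the specific backbone and attention form are assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention, one possible 'visual attention' block."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        return x * self.gate(x)          # re-weight the channels of the feature map

class GlobalEncoder(nn.Module):
    """Hypothetical F(.): ResNet-50 stages each followed by an attention block,
    globally pooled and projected to the global feature vector V_g."""
    def __init__(self, dim=512):
        super().__init__()
        backbone = models.resnet50(weights=None)   # pretrained weights would normally be loaded
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])
        self.attn = nn.ModuleList([ChannelAttention(c) for c in (256, 512, 1024, 2048)])
        self.proj = nn.Linear(2048, dim)
    def forward(self, img):                         # img: (B, 3, H, W)
        x = self.stem(img)
        for stage, attn in zip(self.stages, self.attn):
            x = attn(stage(x))                      # attention applied at every level of the hierarchy
        x = x.mean(dim=(2, 3))                      # global average pooling
        return self.proj(x)                         # V_g, shape (B, dim)
```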
Step2, detecting the target of the remote sensing image by using a target detection network, and extracting local features by using a graph neural network.
The targets in the remote sensing image are accurately identified with a target detection network to obtain the local targets O = D(I), where o_j denotes the j-th class of local target in the input image and D is the Faster R-CNN target detection network. The detected target regions are then deep-learned through a graph neural network to extract finer, local feature information, yielding the local feature vectors V_l = G(O), where G is the graph neural network feature extractor. This combines the advantages of target detection and graph neural networks, so the system can identify targets efficiently at the object level, capture more specific and detailed characteristics of local image regions, and provide a more comprehensive information basis for the subsequent tasks.
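A hedged sketch of this step, assuming torchvision's Faster R-CNN as the detector D and a simple hand-rolled message-passing layer standing in for the graph neural network G; LocalEncoder, max_regions and the similarity-based adjacency are illustrative choices, not the patent's specification.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

class LocalEncoder(nn.Module):
    """Hypothetical D(.) + G(.): Faster R-CNN region proposals followed by one round of
    message passing over a fully connected region graph (a stand-in for the graph network)."""
    def __init__(self, dim=512, max_regions=36):
        super().__init__()
        # Pretrained detection weights would normally be loaded; weights=None keeps the sketch light.
        self.detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
        self.max_regions = max_regions
        self.region_proj = nn.Linear(256 * 7 * 7, dim)   # FPN level "0" has 256 channels
        self.msg = nn.Linear(dim, dim)                    # message-passing transform

    @torch.no_grad()
    def detect(self, img):                                # img: (3, H, W), values in [0, 1]
        self.detector.eval()                              # inference mode returns boxes/labels/scores
        return self.detector([img])[0]["boxes"][: self.max_regions]    # (R, 4)

    def forward(self, img):
        boxes = self.detect(img)
        feats = self.detector.backbone(img.unsqueeze(0))["0"]          # highest-resolution FPN map
        pooled = roi_align(feats, [boxes], output_size=7,
                           spatial_scale=feats.shape[-1] / img.shape[-1])
        nodes = self.region_proj(pooled.flatten(1))                    # (R, dim) region/node features
        adj = torch.softmax(nodes @ nodes.t() / nodes.shape[1] ** 0.5, dim=-1)  # similarity adjacency
        nodes = nodes + torch.relu(self.msg(adj @ nodes))              # one graph update step
        return nodes                                                   # local feature vectors V_l
```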
Step3, dynamically generating weights through a multi-level information feature fusion module, and weighting the two features to better characterize the image.
As shown in fig. 2, internal similarity calculation is performed on the global feature vector and the local feature vectors to find the correlations within the information; then the processed global and local feature vectors are dynamically weighted and fused through a cross-attention mechanism to obtain the complete image feature vector. Specifically:
Step3-1, perform internal similarity calculation on the global feature vector and the local feature vectors to obtain finer-grained feature vectors V_g' and V_l': V_g' = MA(V_g), V_l' = MA(V_l), where MA is the multi-head attention mechanism and V_g and V_l are the original global and local vectors.
Step3-2, dynamically weight and fuse the refined global and local feature vectors through the cross-attention mechanism CA to obtain the cross-attended global feature vector V_g'' and local feature vector V_l''.
Step3-3, filter the global features using a mask generated from the local features, while directly supplementing the local features with the global features, where W is a dynamically updatable weight used in generating the mask.
Step3-4, superpose the filtered global features and the supplemented local features to obtain the mixed visual information.
Step3-5, apply a linear transformation to the mixed visual information to obtain a learnable dynamic weight, use it to update the fused features, and finally obtain the visual vector V fused from the global and local feature vectors, where W_1 and W_2 are the weight matrices of the linear transformation.
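A minimal sketch of this multi-level fusion (Step3-1 to Step3-5), assuming a single global vector and region-level local features; the exact formulas are not legible in this text, so the concrete mask, superposition and dynamic-weight forms below are assumptions, and all names (MultiLevelFusion, mask_proj, w1, w2) are hypothetical.

```python
import torch
import torch.nn as nn

class MultiLevelFusion(nn.Module):
    """Sketch of the fusion module: self-attention refinement, cross-attention,
    mask-based filtering of the global feature by the locals, and a learned dynamic weight."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn_g = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn_l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_proj = nn.Linear(dim, dim)   # generates the gate/mask from the local features
        self.w1 = nn.Linear(dim, dim)          # weight matrices producing the dynamic fusion weight
        self.w2 = nn.Linear(dim, 1)

    def forward(self, v_g, v_l):
        # v_g: (B, 1, dim) global feature; v_l: (B, R, dim) local (region) features
        g, _ = self.self_attn_g(v_g, v_g, v_g)          # internal similarity (MA) on the global feature
        l, _ = self.self_attn_l(v_l, v_l, v_l)          # internal similarity (MA) on the local features
        g_ca, _ = self.cross_attn(g, l, l)              # cross-attention CA: global attends to locals
        mask = torch.sigmoid(self.mask_proj(l.mean(1, keepdim=True)))
        g_f = g_ca * mask                               # locals filter (correct) the global feature
        l_f = l.mean(1, keepdim=True) + g_ca            # global feature directly supplements the locals
        mixed = g_f + l_f                               # feature superposition -> mixed visual information
        alpha = torch.sigmoid(self.w2(torch.tanh(self.w1(mixed))))   # learnable dynamic weight
        v = alpha * g_f + (1 - alpha) * l_f             # fused visual vector V
        return v.squeeze(1)                             # (B, dim)
```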
Step4, embedding word vectors into the input text, inputting the word vectors into a recurrent neural network, and obtaining text feature vectors through a multi-layer perceptron.
The input text S is embedded at the word level to form the word vector sequence W = {w_1, w_2, ..., w_n}, and the sequence is then fed into a recurrent neural network for layer-by-layer processing of the text, where the forward and backward recurrent neural networks process the sequence in opposite directions and h_i denotes the hidden state of the i-th layer. Through the learning of the recurrent neural network, a higher-order representation of the text is finally extracted by a multi-layer perceptron to form the text feature vector T = MLP(h).
this process flow aims at capturing deep semantic information at the text level, providing a richer feature representation for further text analysis and understanding tasks.
Step5, calculate the similarity measure between the weighted, fused image features and the text features, and rank the retrieval results in descending order of similarity using a ranking loss with minimum margin α, computed over paired image-text pairs and unpaired image-text pairs.
The similarity between the weighted, fused image features and the text features is calculated to obtain a similarity measure. The retrieval results are then ranked in descending order of this measure so that the top-ranked results are the most relevant. By comprehensively considering the feature information of both image and text, this process effectively optimizes the retrieval results so that the top-ranked results satisfy the similarity matching requirement.
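A sketch of this step assuming cosine similarity and a standard bidirectional hinge (triplet) ranking loss; the text only states that a minimum margin and paired/unpaired image-text pairs are used, so this exact formulation is an assumption.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(img_feats, txt_feats):
    """Cosine similarity between every fused image feature and every text feature."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    return img @ txt.t()                                  # (N_img, N_txt)

def triplet_ranking_loss(sim, margin=0.2):
    """Bidirectional hinge-based triplet loss over a batch of matched pairs (diagonal = positives).
    One common formulation; assumed here, not quoted from the patent."""
    pos = sim.diag().view(-1, 1)
    cost_txt = (margin + sim - pos).clamp(min=0)          # image -> text direction
    cost_img = (margin + sim - pos.t()).clamp(min=0)      # text -> image direction
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_txt.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()

# Retrieval-time ranking: sort candidate texts for each image by descending similarity, e.g.
# ranks = similarity_matrix(img_feats, txt_feats).argsort(dim=1, descending=True)
```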
Step6, reverse retrieval is carried out by utilizing the candidate information, and a final retrieval result is obtained.
Taking an image-to-text query as an example, as shown in fig. 3, a query component is computed from the k texts most similar to the query image, where β is the ranking coefficient and r denotes the rank of the candidate text.
A reverse search is performed with each obtained candidate text; if the query image lies among the L nearest images, a second query component is computed from the rank of the candidate image, otherwise this query component is 0.
A significance component is defined to quantify the model's confidence in the similarity: the higher the similarity ratio between the candidate text and the image in the reverse search, the higher the certainty. This confidence is used as a weighting term for the final similarity: the two query components are dynamically weighted and summed, the results of the forward and reverse searches are considered simultaneously, and the initial similarity results are re-ranked accordingly, yielding a finer ranking.
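A hedged sketch of this bidirectional re-ranking; the exact component formulas are not recoverable from this text, so simple rank-based forward/reverse query components and a ratio-based significance term are assumed (beta, k and L correspond to the ranking coefficient and cutoffs named above, and the function name is hypothetical).

```python
import torch

def rerank_image_to_text(sim, img_idx, k=10, L=10, beta=0.5):
    """Re-rank the k candidate texts for one query image using reverse (text -> image) search.
    sim: (N_img, N_txt) similarity matrix with at least two images; img_idx: index of the query image."""
    # Forward search: the k texts most similar to the query image, with a rank-based query component.
    txt_rank = sim[img_idx].argsort(descending=True)[:k]
    forward = {t.item(): beta / (r + 1) for r, t in enumerate(txt_rank)}

    scores = {}
    for t, q_fwd in forward.items():
        # Reverse search: rank all images against candidate text t.
        img_rank = sim[:, t].argsort(descending=True)[:L]
        pos = (img_rank == img_idx).nonzero()
        q_rev = (1 - beta) / (pos.item() + 1) if pos.numel() else 0.0   # 0 if not in the L nearest images
        # Significance: how strongly the candidate text prefers the query image over the runner-up image.
        top2 = sim[:, t].topk(2).values
        significance = (sim[img_idx, t] / top2[1].clamp(min=1e-6)).item()
        scores[t] = significance * (q_fwd + q_rev)
    return sorted(scores, key=scores.get, reverse=True)        # re-ranked candidate text indices
```

Because the significance term and both query components are computed directly from the existing similarity matrix, this post-processing requires no additional training, consistent with the advantage stated above.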
The cross-modal remote sensing image-text retrieval method fusing local features and global features corrects global information with local information and supplements local information with global information, which improves cross-modal retrieval performance and allows the desired remote sensing data to be retrieved more accurately. By filtering redundant features with high similarity, the features of salient targets are strengthened and highlighted, the burden that redundant target relations place on the model is reduced, and the model's attention to salient instances is improved, yielding better visual representations. In addition, the information in the similarity matrix is fully used to optimize the retrieval results a second time; the post-processing algorithm comprehensively considers the various kinds of information in the similarity matrix through a bidirectional retrieval method, so higher retrieval precision can be obtained without additional training.

Claims (5)

1. A cross-modal remote sensing image-text retrieval method integrating local features and global features is characterized by comprising the following steps:
Step1, perform global feature extraction on an input image using a convolutional neural network containing a multi-level visual attention mechanism to obtain a global feature vector V_g = F(I), where F denotes the convolutional neural network containing the multi-level visual attention mechanism and I is the input image;
Step2, identify the targets in the remote sensing image with a target detection network to obtain the local targets O = D(I), where o_j denotes the j-th class of local target in the input image and D is the target detection network; then perform deep learning on the detected target regions through a graph neural network to obtain the local feature vectors V_l = G(O), where G is the graph neural network feature extractor;
Step3, first perform internal similarity calculation on the global feature vector V_g and the local feature vectors V_l, then dynamically weight and fuse the processed global and local feature vectors through a cross-attention mechanism CA to obtain the complete image feature vector V;
Step4, model the input text S with a recurrent neural network, then obtain the text feature vector T through a multi-layer perceptron;
Step5, calculate the similarity measure between the weighted, fused image features and the text features, and rank the retrieval results in descending order of similarity;
Step6, perform reverse retrieval using the candidate information to obtain the final retrieval result.
2. The cross-modal remote sensing image-text retrieval method based on fusion of local features and global features as claimed in claim 1, wherein Step3 comprises the following specific processes:
Step3-1, perform internal similarity calculation on the global feature vector and the local feature vectors to obtain finer-grained feature vectors V_g' and V_l': V_g' = MA(V_g), V_l' = MA(V_l), where MA is the multi-head attention mechanism and V_g and V_l are the original global and local vectors;
Step3-2, dynamically weight and fuse the refined global and local feature vectors through the cross-attention mechanism CA to obtain the cross-attended global feature vector V_g'' and local feature vector V_l'';
Step3-3, filter the global features using a mask generated from the local features, while directly supplementing the local features with the global features, where W is a dynamically updatable weight used in generating the mask;
Step3-4, superpose the filtered global features and the supplemented local features to obtain the mixed visual information;
Step3-5, apply a linear transformation to the mixed visual information to obtain a learnable dynamic weight, use it to update the fused features, and finally obtain the visual vector V fused from the global and local feature vectors, where W_1 and W_2 are the weight matrices of the linear transformation.
3. The cross-modal remote sensing image-text retrieval method based on fusion of local features and global features as claimed in claim 2, wherein Step4 comprises the following specific processes:
embed the input text S at the word level to form the word vector sequence W = {w_1, w_2, ..., w_n}, then feed the sequence into a recurrent neural network for layer-by-layer processing of the text, where the forward and backward recurrent neural networks process the sequence in opposite directions and h_i denotes the hidden state of the i-th layer;
the text feature vector is then formed by the multi-layer perceptron as T = MLP(h), where T denotes the generated text feature vector, MLP denotes the multi-layer perceptron, and h is the hidden representation produced by the recurrent neural network.
4. The cross-modal remote sensing image-text retrieval method based on the fusion of local features and global features as claimed in claim 3, wherein in Step5 the retrieval results are ranked in descending order of similarity using a ranking loss with minimum margin α, computed over paired image-text pairs and unpaired image-text pairs.
5. The method for cross-modal remote sensing image-text retrieval with fusion of local features and global features as claimed in claim 4, wherein in Step6 an image-to-text query proceeds as follows:
compute a query component from the k texts most similar to the query image, where β is the ranking coefficient and r denotes the rank of the candidate text;
perform a reverse search with each obtained candidate text; if the query image lies among the L nearest images, compute a second query component from the rank of the candidate image, otherwise this query component is 0;
define a significance component to quantify the model's confidence in the similarity: the higher the similarity ratio between the candidate text and the image in the reverse search, the higher the certainty.
CN202410008193.9A 2024-01-04 2024-01-04 Cross-modal remote sensing image-text retrieval method with fusion of local features and global features Active CN117520589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410008193.9A CN117520589B (en) 2024-01-04 2024-01-04 Cross-modal remote sensing image-text retrieval method with fusion of local features and global features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410008193.9A CN117520589B (en) 2024-01-04 2024-01-04 Cross-modal remote sensing image-text retrieval method with fusion of local features and global features

Publications (2)

Publication Number Publication Date
CN117520589A true CN117520589A (en) 2024-02-06
CN117520589B CN117520589B (en) 2024-03-15

Family

ID=89744230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410008193.9A Active CN117520589B (en) 2024-01-04 2024-01-04 Cross-modal remote sensing image-text retrieval method with fusion of local features and global features

Country Status (1)

Country Link
CN (1) CN117520589B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874262A (en) * 2024-03-12 2024-04-12 北京邮电大学 Text-dynamic picture cross-modal retrieval method based on progressive prototype matching

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784092A (en) * 2021-01-28 2021-05-11 电子科技大学 Cross-modal image text retrieval method of hybrid fusion model
US20220277038A1 (en) * 2019-11-22 2022-09-01 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Image search based on combined local and global information
CN116561365A (en) * 2023-05-16 2023-08-08 中国海洋大学 Remote sensing image cross-modal retrieval method based on layout semantic joint significant characterization
CN116610778A (en) * 2023-03-29 2023-08-18 杭州电子科技大学 Bidirectional image-text matching method based on cross-modal global and local attention mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220277038A1 (en) * 2019-11-22 2022-09-01 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Image search based on combined local and global information
CN112784092A (en) * 2021-01-28 2021-05-11 电子科技大学 Cross-modal image text retrieval method of hybrid fusion model
CN116610778A (en) * 2023-03-29 2023-08-18 杭州电子科技大学 Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN116561365A (en) * 2023-05-16 2023-08-08 中国海洋大学 Remote sensing image cross-modal retrieval method based on layout semantic joint significant characterization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHIPENG ZHAO et al.: "Attention-Enhanced Cross-Modal Localization Between Spherical Images and Point Clouds", IEEE SENSORS JOURNAL, vol. 23, no. 19, 1 October 2023 (2023-10-01), pages 1 - 10 *
杨迪 (Yang Di): "A cross-modal image-text retrieval algorithm fusing an attention mechanism" (一种融合注意力机制的跨模态图文检索算法), Computer Technology and Development (计算机技术与发展), vol. 33, no. 11, 30 November 2023 (2023-11-30), pages 143 - 148 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874262A (en) * 2024-03-12 2024-04-12 北京邮电大学 Text-dynamic picture cross-modal retrieval method based on progressive prototype matching
CN117874262B (en) * 2024-03-12 2024-06-04 北京邮电大学 Text-dynamic picture cross-modal retrieval method based on progressive prototype matching

Also Published As

Publication number Publication date
CN117520589B (en) 2024-03-15

Similar Documents

Publication Publication Date Title
Mithun et al. Weakly supervised video moment retrieval from text queries
Yuan et al. Remote sensing cross-modal text-image retrieval based on global and local information
CN111401201B (en) Aerial image multi-scale target detection method based on spatial pyramid attention drive
Tang et al. Human-centric spatio-temporal video grounding with visual transformers
Cong et al. Does thermal really always matter for RGB-T salient object detection?
CN117520589B (en) Cross-modal remote sensing image-text retrieval method with fusion of local features and global features
CN113989851B (en) Cross-modal pedestrian re-identification method based on heterogeneous fusion graph convolution network
CN116186317A (en) Cross-modal cross-guidance-based image-text retrieval method and system
CN115712740B (en) Method and system for multi-modal implication enhanced image text retrieval
CN116452688A (en) Image description generation method based on common attention mechanism
CN114529751B (en) Automatic screening method for intelligent identification sample data of power scene
Jia et al. STCM-Net: A symmetrical one-stage network for temporal language localization in videos
CN117829243A (en) Model training method, target detection device, electronic equipment and medium
CN109241315A (en) A kind of fast face search method based on deep learning
CN117150069A (en) Cross-modal retrieval method and system based on global and local semantic comparison learning
Liu et al. A survey on natural language video localization
CN116737979A (en) Context-guided multi-modal-associated image text retrieval method and system
CN116052108A (en) Transformer-based traffic scene small sample target detection method and device
Liang et al. Weakly supervised video anomaly detection based on spatial–temporal feature fusion enhancement
Deng et al. Abnormal occupancy grid map recognition using attention network
Zhu et al. Find gold in sand: Fine-grained similarity mining for domain-adaptive crowd counting
Rosso-Mateus et al. Deep Metric Learning for Effective Passage Retrieval in the BioASQ Challenge.
Liang et al. THU-IMG at TRECVID 2009.
Zhang et al. Joint global feature and part-based pyramid features for unsupervised person re-identification
CN118298428A (en) Unbiased scene graph generation method based on remarkable visual context

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant