CN117520589A - Cross-modal remote sensing image-text retrieval method with fusion of local features and global features


Info

Publication number
CN117520589A
Authority
CN
China
Prior art keywords
features
text
local
global
image
Prior art date
Legal status
Granted
Application number
CN202410008193.9A
Other languages
Chinese (zh)
Other versions
CN117520589B (en)
Inventor
赵作鹏
缪小然
刘营
高宇蒙
闵冰冰
胡建峰
贺晨
赵广明
周杰
孙守都
宋喷玉
Current Assignee
Yanyuan Intelligent Technology Xuzhou Co ltd
China University of Mining and Technology CUMT
Original Assignee
Yanyuan Intelligent Technology Xuzhou Co ltd
China University of Mining and Technology CUMT
Priority date
Filing date
Publication date
Application filed by Yanyuan Intelligent Technology Xuzhou Co ltd, China University of Mining and Technology CUMT filed Critical Yanyuan Intelligent Technology Xuzhou Co ltd
Priority to CN202410008193.9A priority Critical patent/CN117520589B/en
Publication of CN117520589A publication Critical patent/CN117520589A/en
Application granted granted Critical
Publication of CN117520589B publication Critical patent/CN117520589B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal remote sensing image-text retrieval method that fuses local features and global features. After global and local features are extracted from an input image, the difference between them is taken into account: weights are dynamically generated by a multi-level information feature fusion module, and the two kinds of features are weighted to better represent the image. The text information is modeled with a recurrent neural network to extract its temporal information. A similarity measure between the weighted, fused image features and the text features is computed, and the retrieval results are ranked in descending order of similarity. Finally, reverse retrieval is performed using the candidate information to obtain the final retrieval result. The method can correct global information with local information and supplement local information with global information, so that the desired remote sensing data can be retrieved more accurately.

Description

Cross-modal remote sensing image-text retrieval method with fusion of local features and global features
Technical Field
The invention relates to a remote sensing image-text retrieval method, in particular to a cross-modal remote sensing image-text retrieval method that fuses local features and global features, and belongs to the technical field of remote sensing image processing.
Background
Remote sensing is an important means of data acquisition and processing: a comprehensive technology that uses various sensing instruments to collect and process the electromagnetic wave information radiated and reflected by distant targets and finally forms images, so that various scenes on the ground can be detected and identified and high-resolution remote sensing images can be queried. Efficient, systematic and well-designed remote sensing image retrieval helps provide reliable support for environmental monitoring, urban planning, agricultural management, disaster response and other scenarios, but effectively exploiting the rapidly expanding mass of remote sensing image data from complex sources has become a challenge that cannot be ignored.
The purpose of remote sensing image-text retrieval is to precisely locate the text description that matches a query image, or to find the corresponding images from a caption description, in a multi-modal remote sensing image database. Efficient and accurate retrieval of the target modality has become a research hotspot and difficulty in the face of multi-source, multi-modal remote sensing image data. Cross-modal remote sensing image-text retrieval based on two-tower neural networks has gradually become the dominant approach; however, the gap between modalities and the difficulty of aligning them make accurate cross-modal retrieval hard to achieve. Designing an efficient and accurate remote sensing image retrieval method is therefore a problem to be solved in the prior art.
Current deep-learning-based image-text retrieval methods mainly focus on the global features of remote sensing images. A limitation of this approach is that it may fail to accurately retrieve certain key regions in the image, and for some specific word semantics no corresponding image region can be found. Furthermore, these methods often neglect the joint learning of local and global features, so the links between regional features and the overall context are not tight enough. This defect leads to a low matching degree between the retrieval results and text words that refer to non-salient objects.
In addition, the retrieval process is reversible: if a sample in modality A is a candidate retrieval object for a sample in modality B, the reverse also holds, and the same principle still applies when text is retrieved from a remote sensing image. Although the correctness of such retrieval is not in doubt, existing retrieval frameworks often ignore this intrinsic relation of bidirectional retrieval and, during inference, ignore the bidirectional ranking information in the cross-modal similarity matrix, which is important for further improving retrieval precision and can be used to optimize the retrieval results a second time. In other words, once a text and an image are matched, each must be retrievable from the other. Developing a practical and efficient image-text retrieval method that solves these problems has therefore become an urgent need.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a cross-modal remote sensing image-text retrieval method that fuses local features and global features, which can correct global information with local information and supplement local information with global information, so as to retrieve the desired remote sensing data more accurately.
In order to achieve the purpose, the cross-modal remote sensing image-text retrieval method for fusing local features and global features specifically comprises the following steps:
Step1, perform global feature extraction on an input image using a convolutional neural network containing a multi-level visual attention mechanism to obtain a global feature vector V_g = F(I), where F denotes the convolutional neural network containing the multi-level visual attention mechanism and I is the input image;
Step2, identify the targets in the remote sensing image with a target detection network to obtain the local targets O = D(I), where o_j denotes the j-th class of local target in the input image and D is the target detection network; then perform deep learning on the detected target regions through a graph neural network to obtain the local feature vectors V_l = G(O), where G is the graph neural network feature extractor;
Step3, first perform internal similarity calculation on the global feature vector V_g and the local feature vectors V_l, then dynamically weight and fuse the processed global and local feature vectors through a cross-attention mechanism CA to obtain the complete image feature vector V;
Step4, model the input text S with a recurrent neural network, then obtain the text feature vector T through a multi-layer perceptron;
Step5, calculate the similarity measure between the weighted, fused image features and the text features, and rank the retrieval results in descending order of similarity;
Step6, perform reverse retrieval using the candidate information to obtain the final retrieval result.
Further, step3 specifically comprises the following steps:
Step3-1, perform internal similarity calculation on the global feature vector and the local feature vectors to obtain finer-grained feature vectors V_g' and V_l': V_g' = MA(V_g), V_l' = MA(V_l), where MA is the multi-head attention mechanism and V_g and V_l are the original global and local vectors;
Step3-2, dynamically weight and fuse the refined global and local feature vectors through the cross-attention mechanism CA to obtain the cross-attended global feature vector V_g'' and local feature vector V_l'';
Step3-3, filter the global features using a mask generated from the local features, while directly supplementing the local features with the global features, where W is a dynamically updatable weight used in generating the mask;
Step3-4, superpose the filtered global features and the supplemented local features to obtain the mixed visual information;
Step3-5, apply a linear transformation to the mixed visual information to obtain a learnable dynamic weight, use it to update the fused features, and finally obtain the visual vector V fused from the global and local feature vectors, where W_1 and W_2 are the weight matrices of the linear transformation.
Further, step4 specifically comprises the following steps:
Embed the input text S at the word level to form the word vector sequence W = {w_1, w_2, ..., w_n}, then feed the sequence into a recurrent neural network for layer-by-layer processing of the text, where the forward and backward recurrent neural networks process the sequence in opposite directions and h_i denotes the hidden state of the i-th layer;
The text feature vector is then formed by the multi-layer perceptron as T = MLP(h), where T denotes the generated text feature vector, MLP denotes the multi-layer perceptron, and h is the hidden representation produced by the recurrent neural network.
Further, in Step5, the retrieval results are ranked in descending order of similarity using a ranking loss with minimum margin α, computed over paired image-text pairs and unpaired image-text pairs.
Further, in Step6, an image-to-text query proceeds as follows:
compute a query component from the k texts most similar to the query image, where β is the ranking coefficient and r denotes the rank of the candidate text;
perform a reverse search with each obtained candidate text; if the query image lies among the L nearest images, compute a second query component from the rank of the candidate image, otherwise this query component is 0;
define a significance component to quantify the model's confidence in the similarity: the higher the similarity ratio between the candidate text and the image in the reverse search, the higher the certainty.
Compared with the prior art, the cross-modal remote sensing image-text retrieval method with the fusion of the local features and the global features has the following advantages:
1. The invention fuses local features and global features, correcting global information with local information and supplementing local information with global information, which improves cross-modal remote sensing image-text retrieval performance and allows the desired remote sensing data to be retrieved more accurately;
2. By filtering redundant features with high similarity, the invention strengthens and highlights the features of salient targets, reduces the burden that redundant target relations place on the model, and improves the model's attention to salient instances, so that better visual representations can be obtained;
3. The invention makes full use of the information in the similarity matrix to optimize the retrieval results a second time; the post-processing algorithm comprehensively considers the various kinds of information in the similarity matrix through a bidirectional retrieval method, so higher retrieval precision can be obtained without additional training.
Drawings
FIG. 1 is a schematic diagram of the architecture of the present invention;
FIG. 2 is a schematic diagram of a multi-level information feature fusion of the present invention;
fig. 3 is a schematic diagram of a reordering module of the present invention.
Detailed Description
The cross-modal remote sensing image-text retrieval method fusing local features and global features uses a convolutional neural network containing a multi-level visual attention mechanism to extract global features from the input image, and a graph neural network to extract local features. Taking the difference between global and local features into account, weights are dynamically generated by a multi-level information feature fusion module, and the two kinds of features are weighted to better characterize the image. The text information is modeled with a recurrent neural network to extract its temporal information. The similarity measure between the weighted, fused image features and the text features is calculated, and the retrieval results are ranked in descending order of similarity. Finally, reverse retrieval is performed using the candidate information, and various ranking factors are considered to obtain the final retrieval result and enhance retrieval reliability. The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the cross-modal remote sensing image-text retrieval method for fusing local features and global features specifically comprises the following steps:
Step1, perform global feature extraction on an input image using a convolutional neural network containing a multi-level visual attention mechanism to obtain a global feature vector V_g = F(I), where F denotes the convolutional neural network containing the multi-level visual attention mechanism and I is the input image.
The convolutional neural network structure containing a multi-level visual attention mechanism can focus on the key information of the image at different levels, so that more global features are captured from the input image and a richer representation is provided for the subsequent deep learning tasks.
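The patent does not give code; as a rough illustration, the following Python (PyTorch) sketch shows one way such a global encoder could look — a ResNet-50 backbone with a squeeze-and-excitation-style attention block after each stage, globally pooled into a single vector. The class and parameter names (GlobalEncoder, ChannelAttention, dim) are hypothetical, and the specific backbone and attention form are assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention, one possible 'visual attention' block."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        return x * self.gate(x)          # re-weight the channels of the feature map

class GlobalEncoder(nn.Module):
    """Hypothetical F(.): ResNet-50 stages each followed by an attention block,
    globally pooled and projected to the global feature vector V_g."""
    def __init__(self, dim=512):
        super().__init__()
        backbone = models.resnet50(weights=None)   # pretrained weights would normally be loaded
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])
        self.attn = nn.ModuleList([ChannelAttention(c) for c in (256, 512, 1024, 2048)])
        self.proj = nn.Linear(2048, dim)
    def forward(self, img):                         # img: (B, 3, H, W)
        x = self.stem(img)
        for stage, attn in zip(self.stages, self.attn):
            x = attn(stage(x))                      # attention applied at every level of the hierarchy
        x = x.mean(dim=(2, 3))                      # global average pooling
        return self.proj(x)                         # V_g, shape (B, dim)
```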
Step2, detecting the target of the remote sensing image by using a target detection network, and extracting local features by using a graph neural network.
The targets in the remote sensing image are accurately identified with a target detection network to obtain the local targets O = D(I), where o_j denotes the j-th class of local target in the input image and D is the Faster R-CNN target detection network. The detected target regions are then deep-learned through a graph neural network to extract finer, local feature information, yielding the local feature vectors V_l = G(O), where G is the graph neural network feature extractor. This combines the advantages of target detection and graph neural networks, so the system can identify targets efficiently at the object level, capture more specific and detailed characteristics of local image regions, and provide a more comprehensive information basis for the subsequent tasks.
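A hedged sketch of this step, assuming torchvision's Faster R-CNN as the detector D and a simple hand-rolled message-passing layer standing in for the graph neural network G; LocalEncoder, max_regions and the similarity-based adjacency are illustrative choices, not the patent's specification.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

class LocalEncoder(nn.Module):
    """Hypothetical D(.) + G(.): Faster R-CNN region proposals followed by one round of
    message passing over a fully connected region graph (a stand-in for the graph network)."""
    def __init__(self, dim=512, max_regions=36):
        super().__init__()
        # Pretrained detection weights would normally be loaded; weights=None keeps the sketch light.
        self.detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
        self.max_regions = max_regions
        self.region_proj = nn.Linear(256 * 7 * 7, dim)   # FPN level "0" has 256 channels
        self.msg = nn.Linear(dim, dim)                    # message-passing transform

    @torch.no_grad()
    def detect(self, img):                                # img: (3, H, W), values in [0, 1]
        self.detector.eval()                              # inference mode returns boxes/labels/scores
        return self.detector([img])[0]["boxes"][: self.max_regions]    # (R, 4)

    def forward(self, img):
        boxes = self.detect(img)
        feats = self.detector.backbone(img.unsqueeze(0))["0"]          # highest-resolution FPN map
        pooled = roi_align(feats, [boxes], output_size=7,
                           spatial_scale=feats.shape[-1] / img.shape[-1])
        nodes = self.region_proj(pooled.flatten(1))                    # (R, dim) region/node features
        adj = torch.softmax(nodes @ nodes.t() / nodes.shape[1] ** 0.5, dim=-1)  # similarity adjacency
        nodes = nodes + torch.relu(self.msg(adj @ nodes))              # one graph update step
        return nodes                                                   # local feature vectors V_l
```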
Step3, dynamically generating weights through a multi-level information feature fusion module, and weighting the two features to better characterize the image.
As shown in fig. 2, internal similarity calculation is performed on the global feature vector and the local feature vectors to find the correlations within the information; then the processed global and local feature vectors are dynamically weighted and fused through a cross-attention mechanism to obtain the complete image feature vector. Specifically:
Step3-1, perform internal similarity calculation on the global feature vector and the local feature vectors to obtain finer-grained feature vectors V_g' and V_l': V_g' = MA(V_g), V_l' = MA(V_l), where MA is the multi-head attention mechanism and V_g and V_l are the original global and local vectors.
Step3-2, dynamically weight and fuse the refined global and local feature vectors through the cross-attention mechanism CA to obtain the cross-attended global feature vector V_g'' and local feature vector V_l''.
Step3-3, filter the global features using a mask generated from the local features, while directly supplementing the local features with the global features, where W is a dynamically updatable weight used in generating the mask.
Step3-4, superpose the filtered global features and the supplemented local features to obtain the mixed visual information.
Step3-5, apply a linear transformation to the mixed visual information to obtain a learnable dynamic weight, use it to update the fused features, and finally obtain the visual vector V fused from the global and local feature vectors, where W_1 and W_2 are the weight matrices of the linear transformation.
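A minimal sketch of this multi-level fusion (Step3-1 to Step3-5), assuming a single global vector and region-level local features; the exact formulas are not legible in this text, so the concrete mask, superposition and dynamic-weight forms below are assumptions, and all names (MultiLevelFusion, mask_proj, w1, w2) are hypothetical.

```python
import torch
import torch.nn as nn

class MultiLevelFusion(nn.Module):
    """Sketch of the fusion module: self-attention refinement, cross-attention,
    mask-based filtering of the global feature by the locals, and a learned dynamic weight."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn_g = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn_l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_proj = nn.Linear(dim, dim)   # generates the gate/mask from the local features
        self.w1 = nn.Linear(dim, dim)          # weight matrices producing the dynamic fusion weight
        self.w2 = nn.Linear(dim, 1)

    def forward(self, v_g, v_l):
        # v_g: (B, 1, dim) global feature; v_l: (B, R, dim) local (region) features
        g, _ = self.self_attn_g(v_g, v_g, v_g)          # internal similarity (MA) on the global feature
        l, _ = self.self_attn_l(v_l, v_l, v_l)          # internal similarity (MA) on the local features
        g_ca, _ = self.cross_attn(g, l, l)              # cross-attention CA: global attends to locals
        mask = torch.sigmoid(self.mask_proj(l.mean(1, keepdim=True)))
        g_f = g_ca * mask                               # locals filter (correct) the global feature
        l_f = l.mean(1, keepdim=True) + g_ca            # global feature directly supplements the locals
        mixed = g_f + l_f                               # feature superposition -> mixed visual information
        alpha = torch.sigmoid(self.w2(torch.tanh(self.w1(mixed))))   # learnable dynamic weight
        v = alpha * g_f + (1 - alpha) * l_f             # fused visual vector V
        return v.squeeze(1)                             # (B, dim)
```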
Step4, embedding word vectors into the input text, inputting the word vectors into a recurrent neural network, and obtaining text feature vectors through a multi-layer perceptron.
The input text S is embedded at the word level to form the word vector sequence W = {w_1, w_2, ..., w_n}, and the sequence is then fed into a recurrent neural network for layer-by-layer processing of the text, where the forward and backward recurrent neural networks process the sequence in opposite directions and h_i denotes the hidden state of the i-th layer. Through the learning of the recurrent neural network, a higher-order representation of the text is finally extracted by a multi-layer perceptron to form the text feature vector T = MLP(h).
this process flow aims at capturing deep semantic information at the text level, providing a richer feature representation for further text analysis and understanding tasks.
Step5, calculate the similarity measure between the weighted, fused image features and the text features, and rank the retrieval results in descending order of similarity using a ranking loss with minimum margin α, computed over paired image-text pairs and unpaired image-text pairs.
The similarity between the weighted, fused image features and the text features is calculated to obtain a similarity measure. The retrieval results are then ranked in descending order of this measure so that the top-ranked results are the most relevant. By comprehensively considering the feature information of both image and text, this process effectively optimizes the retrieval results so that the top-ranked results satisfy the similarity matching requirement.
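A sketch of this step assuming cosine similarity and a standard bidirectional hinge (triplet) ranking loss; the text only states that a minimum margin and paired/unpaired image-text pairs are used, so this exact formulation is an assumption.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(img_feats, txt_feats):
    """Cosine similarity between every fused image feature and every text feature."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    return img @ txt.t()                                  # (N_img, N_txt)

def triplet_ranking_loss(sim, margin=0.2):
    """Bidirectional hinge-based triplet loss over a batch of matched pairs (diagonal = positives).
    One common formulation; assumed here, not quoted from the patent."""
    pos = sim.diag().view(-1, 1)
    cost_txt = (margin + sim - pos).clamp(min=0)          # image -> text direction
    cost_img = (margin + sim - pos.t()).clamp(min=0)      # text -> image direction
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_txt.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()

# Retrieval-time ranking: sort candidate texts for each image by descending similarity, e.g.
# ranks = similarity_matrix(img_feats, txt_feats).argsort(dim=1, descending=True)
```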
Step6, reverse retrieval is carried out by utilizing the candidate information, and a final retrieval result is obtained.
Taking an image-to-text query as an example, as shown in fig. 3, a query component is computed from the k texts most similar to the query image, where β is the ranking coefficient and r denotes the rank of the candidate text.
A reverse search is performed with each obtained candidate text; if the query image lies among the L nearest images, a second query component is computed from the rank of the candidate image, otherwise this query component is 0.
A significance component is defined to quantify the model's confidence in the similarity: the higher the similarity ratio between the candidate text and the image in the reverse search, the higher the certainty. This confidence is used as a weighting term for the final similarity: the two query components are dynamically weighted and summed, the results of the forward and reverse searches are considered simultaneously, and the initial similarity results are re-ranked accordingly, yielding a finer ranking.
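A hedged sketch of this bidirectional re-ranking; the exact component formulas are not recoverable from this text, so simple rank-based forward/reverse query components and a ratio-based significance term are assumed (beta, k and L correspond to the ranking coefficient and cutoffs named above, and the function name is hypothetical).

```python
import torch

def rerank_image_to_text(sim, img_idx, k=10, L=10, beta=0.5):
    """Re-rank the k candidate texts for one query image using reverse (text -> image) search.
    sim: (N_img, N_txt) similarity matrix with at least two images; img_idx: index of the query image."""
    # Forward search: the k texts most similar to the query image, with a rank-based query component.
    txt_rank = sim[img_idx].argsort(descending=True)[:k]
    forward = {t.item(): beta / (r + 1) for r, t in enumerate(txt_rank)}

    scores = {}
    for t, q_fwd in forward.items():
        # Reverse search: rank all images against candidate text t.
        img_rank = sim[:, t].argsort(descending=True)[:L]
        pos = (img_rank == img_idx).nonzero()
        q_rev = (1 - beta) / (pos.item() + 1) if pos.numel() else 0.0   # 0 if not in the L nearest images
        # Significance: how strongly the candidate text prefers the query image over the runner-up image.
        top2 = sim[:, t].topk(2).values
        significance = (sim[img_idx, t] / top2[1].clamp(min=1e-6)).item()
        scores[t] = significance * (q_fwd + q_rev)
    return sorted(scores, key=scores.get, reverse=True)        # re-ranked candidate text indices
```

Because the significance term and both query components are computed directly from the existing similarity matrix, this post-processing requires no additional training, consistent with the advantage stated above.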
The cross-modal remote sensing image-text retrieval method fusing local features and global features corrects global information with local information and supplements local information with global information, which improves cross-modal retrieval performance and allows the desired remote sensing data to be retrieved more accurately. By filtering redundant features with high similarity, the features of salient targets are strengthened and highlighted, the burden that redundant target relations place on the model is reduced, and the model's attention to salient instances is improved, yielding better visual representations. In addition, the information in the similarity matrix is fully used to optimize the retrieval results a second time; the post-processing algorithm comprehensively considers the various kinds of information in the similarity matrix through a bidirectional retrieval method, so higher retrieval precision can be obtained without additional training.

Claims (5)

1. A cross-modal remote sensing image-text retrieval method integrating local features and global features is characterized by comprising the following steps:
Step1, perform global feature extraction on an input image using a convolutional neural network containing a multi-level visual attention mechanism to obtain a global feature vector V_g = F(I), where F denotes the convolutional neural network containing the multi-level visual attention mechanism and I is the input image;
Step2, identify the targets in the remote sensing image with a target detection network to obtain the local targets O = D(I), where o_j denotes the j-th class of local target in the input image and D is the target detection network; then perform deep learning on the detected target regions through a graph neural network to obtain the local feature vectors V_l = G(O), where G is the graph neural network feature extractor;
Step3, first perform internal similarity calculation on the global feature vector V_g and the local feature vectors V_l, then dynamically weight and fuse the processed global and local feature vectors through a cross-attention mechanism CA to obtain the complete image feature vector V;
Step4, model the input text S with a recurrent neural network, then obtain the text feature vector T through a multi-layer perceptron;
Step5, calculate the similarity measure between the weighted, fused image features and the text features, and rank the retrieval results in descending order of similarity;
Step6, perform reverse retrieval using the candidate information to obtain the final retrieval result.
2. The cross-modal remote sensing image-text retrieval method based on fusion of local features and global features as claimed in claim 1, wherein Step3 comprises the following specific processes:
Step3-1, perform internal similarity calculation on the global feature vector and the local feature vectors to obtain finer-grained feature vectors V_g' and V_l': V_g' = MA(V_g), V_l' = MA(V_l), where MA is the multi-head attention mechanism and V_g and V_l are the original global and local vectors;
Step3-2, dynamically weight and fuse the refined global and local feature vectors through the cross-attention mechanism CA to obtain the cross-attended global feature vector V_g'' and local feature vector V_l'';
Step3-3, filter the global features using a mask generated from the local features, while directly supplementing the local features with the global features, where W is a dynamically updatable weight used in generating the mask;
Step3-4, superpose the filtered global features and the supplemented local features to obtain the mixed visual information;
Step3-5, apply a linear transformation to the mixed visual information to obtain a learnable dynamic weight, use it to update the fused features, and finally obtain the visual vector V fused from the global and local feature vectors, where W_1 and W_2 are the weight matrices of the linear transformation.
3. The cross-modal remote sensing image-text retrieval method based on fusion of local features and global features as claimed in claim 2, wherein Step4 comprises the following specific processes:
embed the input text S at the word level to form the word vector sequence W = {w_1, w_2, ..., w_n}, then feed the sequence into a recurrent neural network for layer-by-layer processing of the text, where the forward and backward recurrent neural networks process the sequence in opposite directions and h_i denotes the hidden state of the i-th layer;
the text feature vector is then formed by the multi-layer perceptron as T = MLP(h), where T denotes the generated text feature vector, MLP denotes the multi-layer perceptron, and h is the hidden representation produced by the recurrent neural network.
4. The cross-modal remote sensing image-text retrieval method based on the fusion of local features and global features as claimed in claim 3, wherein in Step5 the retrieval results are ranked in descending order of similarity using a ranking loss with minimum margin α, computed over paired image-text pairs and unpaired image-text pairs.
5. The method for cross-modal remote sensing image-text retrieval with fusion of local features and global features as claimed in claim 4, wherein in Step6 an image-to-text query proceeds as follows:
compute a query component from the k texts most similar to the query image, where β is the ranking coefficient and r denotes the rank of the candidate text;
perform a reverse search with each obtained candidate text; if the query image lies among the L nearest images, compute a second query component from the rank of the candidate image, otherwise this query component is 0;
define a significance component to quantify the model's confidence in the similarity: the higher the similarity ratio between the candidate text and the image in the reverse search, the higher the certainty.
CN202410008193.9A 2024-01-04 2024-01-04 Cross-modal remote sensing image-text retrieval method with fusion of local features and global features Active CN117520589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410008193.9A CN117520589B (en) 2024-01-04 2024-01-04 Cross-modal remote sensing image-text retrieval method with fusion of local features and global features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410008193.9A CN117520589B (en) 2024-01-04 2024-01-04 Cross-modal remote sensing image-text retrieval method with fusion of local features and global features

Publications (2)

Publication Number Publication Date
CN117520589A true CN117520589A (en) 2024-02-06
CN117520589B CN117520589B (en) 2024-03-15

Family

ID=89744230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410008193.9A Active CN117520589B (en) 2024-01-04 2024-01-04 Cross-modal remote sensing image-text retrieval method with fusion of local features and global features

Country Status (1)

Country Link
CN (1) CN117520589B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874262A (en) * 2024-03-12 2024-04-12 北京邮电大学 Text-dynamic picture cross-modal retrieval method based on progressive prototype matching

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784092A (en) * 2021-01-28 2021-05-11 电子科技大学 Cross-modal image text retrieval method of hybrid fusion model
US20220277038A1 (en) * 2019-11-22 2022-09-01 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Image search based on combined local and global information
CN116561365A (en) * 2023-05-16 2023-08-08 中国海洋大学 Remote sensing image cross-modal retrieval method based on layout semantic joint significant characterization
CN116610778A (en) * 2023-03-29 2023-08-18 杭州电子科技大学 Bidirectional image-text matching method based on cross-modal global and local attention mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220277038A1 (en) * 2019-11-22 2022-09-01 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Image search based on combined local and global information
CN112784092A (en) * 2021-01-28 2021-05-11 电子科技大学 Cross-modal image text retrieval method of hybrid fusion model
CN116610778A (en) * 2023-03-29 2023-08-18 杭州电子科技大学 Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN116561365A (en) * 2023-05-16 2023-08-08 中国海洋大学 Remote sensing image cross-modal retrieval method based on layout semantic joint significant characterization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHIPENG ZHAO et al.: "Attention-Enhanced Cross-Modal Localization Between Spherical Images and Point Clouds", IEEE SENSORS JOURNAL, vol. 23, no. 19, 1 October 2023 (2023-10-01), pages 1 - 10 *
杨迪 (Yang Di): "A cross-modal image-text retrieval algorithm fusing an attention mechanism" (一种融合注意力机制的跨模态图文检索算法), Computer Technology and Development (计算机技术与发展), vol. 33, no. 11, 30 November 2023 (2023-11-30), pages 143 - 148 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874262A (en) * 2024-03-12 2024-04-12 北京邮电大学 Text-dynamic picture cross-modal retrieval method based on progressive prototype matching
CN117874262B (en) * 2024-03-12 2024-06-04 北京邮电大学 Text-dynamic picture cross-modal retrieval method based on progressive prototype matching

Also Published As

Publication number Publication date
CN117520589B (en) 2024-03-15

Similar Documents

Publication Publication Date Title
Mithun et al. Weakly supervised video moment retrieval from text queries
Yuan et al. Remote sensing cross-modal text-image retrieval based on global and local information
CN111401201B (en) Aerial image multi-scale target detection method based on spatial pyramid attention drive
Tang et al. Human-centric spatio-temporal video grounding with visual transformers
Cong et al. Does thermal really always matter for RGB-T salient object detection?
CN117520589B (en) Cross-modal remote sensing image-text retrieval method with fusion of local features and global features
CN113989851B (en) Cross-modal pedestrian re-identification method based on heterogeneous fusion graph convolution network
CN116186317A (en) Cross-modal cross-guidance-based image-text retrieval method and system
CN115712740B (en) Method and system for multi-modal implication enhanced image text retrieval
CN116452688A (en) Image description generation method based on common attention mechanism
CN114529751B (en) Automatic screening method for intelligent identification sample data of power scene
Jia et al. STCM-Net: A symmetrical one-stage network for temporal language localization in videos
CN117829243A (en) Model training method, target detection device, electronic equipment and medium
CN109241315A (en) A kind of fast face search method based on deep learning
CN117150069A (en) Cross-modal retrieval method and system based on global and local semantic comparison learning
Liu et al. A survey on natural language video localization
CN116737979A (en) Context-guided multi-modal-associated image text retrieval method and system
CN116052108A (en) Transformer-based traffic scene small sample target detection method and device
Liang et al. Weakly supervised video anomaly detection based on spatial–temporal feature fusion enhancement
Deng et al. Abnormal occupancy grid map recognition using attention network
Zhu et al. Find gold in sand: Fine-grained similarity mining for domain-adaptive crowd counting
Rosso-Mateus et al. Deep Metric Learning for Effective Passage Retrieval in the BioASQ Challenge.
Liang et al. THU-IMG at TRECVID 2009.
Zhang et al. Joint global feature and part-based pyramid features for unsupervised person re-identification
CN118298428A (en) Unbiased scene graph generation method based on remarkable visual context

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant