CN116775922A - Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics - Google Patents

Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics


Publication number
CN116775922A
CN116775922A (application CN202310550653.6A)
Authority
CN
China
Prior art keywords
image
text
encoder
remote sensing
data
Prior art date
Legal status
Pending
Application number
CN202310550653.6A
Other languages
Chinese (zh)
Inventor
何柳
刘姝妍
安然
卓雨东
陶剑
李润岐
王孝天
武铎
孙郁文
Current Assignee
China Aero Polytechnology Establishment
Original Assignee
China Aero Polytechnology Establishment
Priority date
Filing date
Publication date
Application filed by China Aero Polytechnology Establishment filed Critical China Aero Polytechnology Establishment
Priority to CN202310550653.6A priority Critical patent/CN116775922A/en
Publication of CN116775922A publication Critical patent/CN116775922A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/535Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning


Abstract

The invention relates to a remote sensing image cross-modal retrieval method based on the fusion of language and visual detail features, which comprises the following steps: step 1: processing training data of a remote sensing image-text retrieval model; step 2: constructing a multi-detail language and vision fusion model; step 3: training the multi-detail language and vision fusion model with multi-objective optimization; step 4: constructing a remote sensing image-text description feature library; step 5: performing cross-modal retrieval of remote sensing image-text descriptions. According to the invention, single-modal encoders are used to represent image and text features respectively, and a multi-modal encoder is used to fuse the features of the two modalities; feature fusion and multi-task optimization training improve each encoder's expression of the fine-grained semantic features of its modality, and cross-modal retrieval is completed through similarity calculation on the semantic features. By designing a multi-objective optimization strategy for the model, the model acquires multi-detail feature expression capability for remote sensing images and text descriptions.

Description

Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics
Technical Field
The application relates to the technical field of image processing, and in particular to a remote sensing image cross-modal retrieval method based on the fusion of language and visual detail features.
Background
In recent years, remote sensing satellite and unmanned aerial vehicle technologies have developed rapidly, and remote sensing, as their core technology, has shown remarkable value in fields such as geographic positioning, disaster rescue, military reconnaissance and disaster monitoring. With the wide application of remote sensing technology, remote sensing images have grown explosively, which brings great difficulty to tasks such as large-scale remote sensing image recognition, detection, classification and retrieval. The remote sensing image cross-modal retrieval task refers to finding, in a large-scale remote sensing image data set, the images that match or are similar to a given natural language description, and vice versa. Compared with traditional remote sensing image retrieval, image-text cross-modal retrieval offers better human-computer interaction and has stronger application value.
In the application scenario of remote sensing image cross-modal retrieval, one important requirement of a user is to input a description of a scene and retrieve, from a huge remote sensing image library, the images consistent with or similar to that description. In this process the query data and the data stored in the database belong to different modalities, and there is a large representation gap in feature expression between modalities, so connections must be established between samples with the same semantics in different modalities. Current cross-modal retrieval methods for remote sensing images mainly include methods based on image tag retrieval and methods based on image-text feature vector retrieval. The image-tag-based method describes each image with keywords that serve as its feature tags; during retrieval, the description input by the user is decomposed into keywords, which are matched against the keyword tags of the images to find similar target images. The image-text feature vector method uses a trained image-text encoder to encode images and texts with the same or similar semantics into feature vectors that lie close to each other, and dissimilar pairs into vectors that lie far apart. Both current retrieval modes have shortcomings of varying degrees, mainly in the following aspects:
1. The method based on image tag retrieval relies on high-quality tag descriptions of the existing image data, which takes a great deal of time and is therefore not applicable to the large-scale data retrieval process;
2. the retrieval method based on image-text feature vectors needs to align image content with the text description, and because of the difference in data structure between images and text, the process of extracting, aligning and fusing the features of the two types of data is very difficult;
3. current image feature encoders often rely on a high-quality remote sensing image target recognition model to represent the details of the image; the accuracy of this model has a great impact on the overall retrieval results, and training the target recognition model requires additional resources.
In order to solve the above problems, the invention provides a remote sensing image cross-modal retrieval method based on language-visual detail feature fusion. A remote sensing image cross-modal retrieval framework comprising two single-modal encoders (visual and language) and one multi-modal encoder is designed; the single-modal encoders represent image and text features respectively, the multi-modal encoder performs fusion learning on the features of the two modalities, feature fusion and multi-task optimization training improve each encoder's ability to express the fine-grained semantic features of its modality, and the cross-modal retrieval task is completed through similarity calculation on the semantic features. In order to express the detail features of the image, the invention extracts local image features with a shallow visual Transformer model and converts the pipelined 'target detection and retrieval' process into an end-to-end training process; the end-to-end framework closes the gap between the target detector and the retrieval model training process and reduces the training cost of the whole retrieval model. The invention designs a set of multi-objective optimization strategies for the model and trains the whole model under these strategies, so that the model acquires multi-detail feature expression capability for remote sensing images and text descriptions; the converged model completes the end-to-end text-image retrieval task without image tags.
Disclosure of Invention
In order to overcome the defects of the prior art, single-modal encoders are used to represent image and text features respectively, and a multi-modal encoder is used to fuse the features of the two modalities; feature fusion and multi-task optimization training improve each encoder's expression of the fine-grained semantic features of its modality, and cross-modal retrieval is completed through similarity calculation on the semantic features. By designing a multi-objective optimization strategy for the model, the model acquires multi-detail feature expression capability for remote sensing images and text descriptions.
In order to achieve the above purpose, the solution adopted by the invention is to provide a remote sensing image cross-modal retrieval method based on language and visual detail feature fusion, which comprises the following steps:
step 1: processing training data of a remote sensing image-text retrieval model;
the training data of the remote sensing image-text retrieval model is formed from image data and text data, wherein the image data are actual measured images and the text data are the descriptive information corresponding to the remote sensing images; first, the number of words len_words and the number of stop words len_stops in each text are obtained; all image-text pairs satisfying len_words = len_stops or len_words = len_stops + 1 are deleted, which avoids stop words disturbing the training of the retrieval model; then the image data and the corresponding text data are processed; finally, the cleaned remote sensing image-text retrieval training data are used for training the image local encoder and the global encoder;
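A minimal sketch of the stop-word-based cleaning rule of step 1 is given below (in Python); the tokenizer, the stop-word list and the data-pair container are illustrative assumptions rather than part of the described method.

from typing import List, Tuple

# Assumed stop-word list; any standard list could be substituted.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "are", "there", "and"}

def clean_pairs(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """pairs: list of (image_path, caption); returns the cleaned list."""
    kept = []
    for image_path, caption in pairs:
        words = caption.lower().split()                 # simple whitespace tokenizer (assumption)
        len_words = len(words)
        len_stops = sum(w in STOP_WORDS for w in words)
        if len_words in (len_stops, len_stops + 1):     # caption is (almost) only stop words
            continue                                    # drop the image-text pair
        kept.append((image_path, caption))
    return kept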
step 2: constructing a multi-detail language and vision fusion model;
the cross-modal retrieval based on image-text features mainly extracts and expresses the features of the image data and of the text data separately, and, by optimizing the expression capability of each extractor, makes the features of semantically similar image data and text data lie at a minimum distance in the vector space; the overall architecture of the multi-detail language and vision fusion model comprises: a remote sensing image visual encoder F_enc-V, a remote sensing image description language feature encoder F_enc-L, and a multi-modal encoder F_enc-Mul based on a vision-language fusion model, as follows:
wherein: I represents the input remote sensing image data; T represents the input text data; f_IL represents the local features of the remote sensing image; f_IGL represents the local-global fusion features of the remote sensing image; f_T represents the text features of the remote sensing image description; S_distance represents the distance similarity between feature vectors; S_pairwise represents the matching probability value of an image-text pair; F_enc-V represents the remote sensing image visual encoder; F_enc-L represents the remote sensing image description language feature encoder; F_enc-Mul represents the multi-modal encoder based on the vision-language fusion model;
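The formula image for this architecture is not preserved in the text, so the Python sketch below only illustrates one plausible reading of the symbol list above (F_enc-V producing f_IL and f_IGL, F_enc-L producing f_T, and F_enc-Mul producing S_pairwise, with S_distance taken as the cosine similarity); the module classes are placeholders.

import torch
import torch.nn as nn

class MultiDetailRetrievalModel(nn.Module):
    """Illustrative wiring of the three encoders of step 2 (a sketch, not the patented implementation)."""
    def __init__(self, visual_encoder: nn.Module, text_encoder: nn.Module, fusion_encoder: nn.Module):
        super().__init__()
        self.F_enc_V = visual_encoder     # image -> (f_IL, f_IGL)
        self.F_enc_L = text_encoder       # text tokens -> f_T
        self.F_enc_Mul = fusion_encoder   # (f_IGL, f_T) -> S_pairwise (matching probability)

    def forward(self, image: torch.Tensor, text_tokens: torch.Tensor):
        f_IL, f_IGL = self.F_enc_V(image)
        f_T = self.F_enc_L(text_tokens)
        # S_distance: cosine similarity between the fused image feature and the text feature
        s_distance = nn.functional.cosine_similarity(f_IGL, f_T, dim=-1)
        s_pairwise = self.F_enc_Mul(f_IGL, f_T)
        return f_IL, f_IGL, f_T, s_distance, s_pairwise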
step 3: training a detail language and vision fusion model of multi-objective optimization;
constructing a multi-objective comprehensive supervision optimization method comprising four loss functions, and introducing a deep supervision strategy into the intermediate branch; the model training process is completed, mainly optimizing the image encoder, the text encoder and the multi-modal encoder; the image data and text data processed in step 1 are divided into a training set, a validation set and a test set according to a first ratio, and the divided training set is fed into the model constructed in step 2, where the model parameters are initialized from a normal distribution and no pre-trained parameters are used; the image encoder part is frozen while the image-text matching loss is calculated, focusing on optimizing the multi-modal encoder;
step 4: constructing a remote sensing image-text description feature library;
step 41: in the retrieval task, the recall rate represents the proportion of correct samples among the N candidate samples returned by the retrieval algorithm; first, R_i2t denotes the image-to-text retrieval recall and R_t2i the text-to-image retrieval recall; then the top-1, top-5 and top-10 recalls R_i2t@N and R_t2i@N of the two retrieval tasks are calculated on the validation set; finally, the averages mR_i2t@N and mR_t2i@N of the image-to-text recall R_i2t@N and the text-to-image recall R_t2i@N over the test samples are calculated, and the model with the highest recall is stored for the subsequent retrieval task; the specific calculation formula is as follows:
wherein: mR_i2t@N and mR_t2i@N represent the averages of the image-to-text recall R_i2t@N and of the text-to-image recall R_t2i@N over the test samples, respectively; Image_k represents the kth semantically matched image-text pair; Text_k represents the kth semantically matched text-image pair; R_i2t@N(Image_k) denotes R_i2t@N with Image_k as the query; R_t2i@N(Text_k) denotes R_t2i@N with Text_k as the query; k represents the index of the image-text pair; M represents the total number of image-text pairs; N represents the retrieval task number;
step 42: constructing an image feature database, extracting features of all image data by using the trained image encoder in the step 3, and storing the generated image features in the database so as to improve the retrieval efficiency in the subsequent application;
Step 43: constructing a text feature database, extracting features of all text data by using the trained text encoder in the step 3, and storing the generated text features in the database;
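A minimal sketch of steps 42-43 under the assumption that the features are cached as NumPy arrays; the encoder interface and the file names are illustrative.

import numpy as np
import torch

@torch.no_grad()
def build_feature_database(encoder, items, encode_fn, out_path: str):
    """Encode every image (step 42) or text (step 43) once and cache the features.

    encoder   : trained image or text encoder (assumed to return one vector per item)
    items     : list of image tensors or caption strings
    encode_fn : callable(encoder, item) -> 1-D feature tensor
    """
    encoder.eval()
    feats = [encode_fn(encoder, item).cpu().numpy() for item in items]
    feats = np.stack(feats)                                    # shape: (num_items, feature_dim)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)      # L2-normalise for cosine search
    np.save(out_path, feats)                                   # e.g. "image_features.npy" / "text_features.npy"
    return feats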
step 5: completing cross-modal retrieval of remote sensing image-text description;
the cross-modal retrieval includes four major modules: an image coding module, a text coding module, a similarity judgment recall module and a multi-modal reordering module; the image coding module and the text coding module are connected in parallel, followed in cascade by the similarity judgment recall module and the multi-modal reordering module; the cross-modal retrieval of the remote sensing image-text description is completed through these four modules.
Preferably, the processing of the image data and the corresponding text data in step 1 is specifically as follows:
the processing procedure of the image data is as follows:
step 111: uniformly resizing all image data to 278×278×3;
step 112: performing basic data enhancement on the input image data with 50% probability, including random rotation and random flipping, to enhance the generalization capability of the model;
step 113: expanding the amount of input image data with an image data splicing method: according to the category labels, two images I_a and I_b of the same category are randomly selected and superimposed at pixel level, and for the text part the descriptions are directly concatenated, splicing T_b after T_a;
step 114: randomly cropping the image data obtained after steps 112 and 113, with a cropping area of 256×256×3, to fit the input of the subsequent neural network model;
step 115: normalizing the image data obtained after step 114 to transform the gray-scale range of the image to between 0 and 1 (a sketch of this preprocessing pipeline follows this list);
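A hedged sketch of the image preprocessing of steps 111-115; torchvision is an illustrative choice, the rotation angle is assumed, and the splicing function is one plausible reading of step 113 (equal-weight pixel-level overlay of two same-category images).

from PIL import Image
from torchvision import transforms

base_transform = transforms.Compose([
    transforms.Resize((278, 278)),                                    # step 111
    transforms.RandomApply([transforms.RandomRotation(30)], p=0.5),   # step 112 (angle assumed)
    transforms.RandomHorizontalFlip(p=0.5),                           # step 112
    transforms.RandomCrop((256, 256)),                                # step 114
    transforms.ToTensor(),                                            # step 115: pixels scaled to [0, 1]
])

def splice_pair(image_a: Image.Image, image_b: Image.Image, caption_a: str, caption_b: str):
    """Step 113: overlay two same-category images at pixel level and concatenate their captions."""
    to_tensor, resize = transforms.ToTensor(), transforms.Resize((278, 278))
    mixed = 0.5 * to_tensor(resize(image_a)) + 0.5 * to_tensor(resize(image_b))
    return transforms.ToPILImage()(mixed), caption_a + " " + caption_b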
the text data is processed as follows:
step 121: performing de-stop word processing on the text data, setting the maximum word length to 64, cutting off the text data, and discarding the part exceeding the maximum word length;
step 122: building a "random mask" in combination with a "directed mask" policy, as follows:
step 1221: constructing a Boolean type text data mask descriptor S_T, wherein the length of the text data mask descriptor S_T is consistent with the length of words in the processed text data, initializing the text data by False, and defaulting the text data mask descriptor S_T without any masking operation;
step 1222: randomly selecting 15% of the positions in the S_T according to Bernoulli distribution and marking with True;
step 1223: recording class labels imgs_cls of all images, traversing text data, and correspondingly marking the positions containing imgs_cls on S_T with True to focus on class information of targets in the follow-up masking operation;
Step 1224: constructing a quantity information descriptor S_N, storing English digital words in 0-10, traversing text data, and correspondingly marking the position containing S_N on S_T with True to focus on the quantity information of the target in the subsequent masking operation;
step 1225: for all positions marked True in S_T, the text data word is replaced with the [MASK] placeholder with 80% probability; with 10% probability the original word is replaced with a random other word; the remaining 10% of the words are left unchanged (a sketch of this masking procedure follows this list).
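A minimal sketch of the "random mask + directed mask" strategy of steps 1221-1225; the tokenisation, class-label list and number-word set are illustrative.

import random

NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}         # S_N

def build_mask(tokens, image_classes, vocab, p_random=0.15):
    """tokens: caption word list; image_classes: class labels (imgs_cls); vocab: words for random replacement."""
    s_t = [False] * len(tokens)                                  # step 1221
    for i, tok in enumerate(tokens):
        if random.random() < p_random:                           # step 1222: Bernoulli(0.15)
            s_t[i] = True
        if tok in image_classes or tok in NUMBER_WORDS:          # steps 1223-1224: directed mask
            s_t[i] = True
    masked, labels = [], []
    for tok, flag in zip(tokens, s_t):
        if not flag:
            masked.append(tok)
            labels.append(None)
            continue
        labels.append(tok)                                       # prediction target for the MLM loss
        r = random.random()                                      # step 1225
        if r < 0.8:
            masked.append("[MASK]")
        elif r < 0.9:
            masked.append(random.choice(vocab))
        else:
            masked.append(tok)
    return masked, labels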
Preferably, the remote sensing image visual encoder in step 2 specifically includes:
in the remote sensing image-text cross-modal retrieval task, the amount and complexity of the semantic information contained in the image data are far higher than those of the corresponding text data, so in the visual encoder F_enc-V a module M_cnn-mvsa for extracting global image features and a module M_vit for extracting local features are designed, together with a module M_midf for fusing the global and local features, specifically as follows:
wherein: f_IG represents the global features of the remote sensing image; M_cnn-mvsa represents the remote sensing image global feature extraction module; M_vit represents the remote sensing image local feature extraction module; M_midf represents the module fusing the global and local features;
the remote sensing image global feature extraction module M_cnn-mvsa uses a ResNet-50 residual convolutional neural network as the feature extractor and optimizes the feature extraction with a multi-scale self-attention model; the whole image cleaned and enhanced in step 1 is input into M_cnn-mvsa to obtain f_IG;
the remote sensing image local feature extraction module M_vit uses a 6-layer Vision Transformer model as the feature extractor; the serialized image data from step 1 are input into M_vit for feature extraction to obtain f_IL;
the fusion module M_midf of the global and local features is a linear function that adds the features f_IG and f_IL linearly, where the first fusion parameter a and the second fusion parameter b are obtained during training, as follows:
M_midf(f_IG, f_IL) = a·f_IG + b·f_IL;
wherein: M_midf represents the fusion module of the global and local features; a represents the first fusion parameter; b represents the second fusion parameter.
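A minimal sketch of the global-local fusion M_midf with learnable scalars a and b; the initial values are assumptions, and the global (ResNet-50 + multi-scale self-attention) and local (6-layer Vision Transformer) extractors are taken as given modules.

import torch
import torch.nn as nn

class GlobalLocalFusion(nn.Module):
    """M_midf(f_IG, f_IL) = a * f_IG + b * f_IL with a and b learned during training."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(0.5))   # first fusion parameter (initial value assumed)
        self.b = nn.Parameter(torch.tensor(0.5))   # second fusion parameter (initial value assumed)

    def forward(self, f_ig: torch.Tensor, f_il: torch.Tensor) -> torch.Tensor:
        return self.a * f_ig + self.b * f_il       # local-global fusion feature f_IGL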
Preferably, the remote sensing image description language feature encoder in the step 2 specifically includes:
the remote sensing image description language feature encoder F_enc-L comprises a Bert-based text encoder M_bert, as follows:
f_T = M_bert(T);
wherein: M_bert represents a standard Bert model;
during training, M_bert adjusts its feature expression according to the labelled data and thereby strengthens its feature expression capability; the masked text data obtained in step 1 are used as training data and input into the model M_bert. After model training is completed, in the application stage M_bert expresses the user-input text data as a feature vector with semantics.
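A hedged sketch of the text encoder M_bert; the Hugging Face transformers package is an illustrative choice, and since the text states that pre-trained parameters are not used, the model is built from a default configuration and initialised randomly (only the tokenizer vocabulary is borrowed, which is an assumption).

import torch
from transformers import BertConfig, BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")   # vocabulary only (assumption)
m_bert = BertModel(BertConfig())                                     # randomly initialised standard BERT

def encode_text(caption: str) -> torch.Tensor:
    """f_T = M_bert(T): one semantic feature vector per caption (here the [CLS] embedding)."""
    tokens = tokenizer(caption, truncation=True, max_length=64, return_tensors="pt")
    return m_bert(**tokens).last_hidden_state[:, 0]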
Preferably, the multi-mode encoder based on the vision-language fusion model in the step 2 specifically includes:
the multi-modal encoder F_enc-Mul based on the vision-language fusion model comprises two modules: M_cms for calculating the cross-modal vector distance and M_vlf for vision-language fusion, as follows:
F_enc-Mul = {M_cms, M_vlf};
wherein: M_cms represents the module for calculating the distance similarity between image feature vectors and text feature vectors, measured with the cosine distance; M_vlf represents the module used for vision-language fusion; F_enc-Mul represents the multi-modal encoder based on the vision-language fusion model;
the M_vlf module is initialized with the last 6 layers of the Bert-based model and models visual-language interaction with additional cross-attention layers.
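The sketch below is one plausible minimal realisation of M_cms and M_vlf; the hidden size, head count and residual wiring are assumptions (the text only states that M_vlf is initialised from the last 6 layers of the Bert-based model and adds cross-attention layers).

import torch
import torch.nn as nn

class CrossModalSimilarity(nn.Module):
    """M_cms: cosine similarity between image and text feature vectors."""
    def forward(self, f_img: torch.Tensor, f_txt: torch.Tensor) -> torch.Tensor:
        return nn.functional.cosine_similarity(f_img, f_txt, dim=-1)

class VisionLanguageFusion(nn.Module):
    """M_vlf: cross-attention layers in which the text attends to image tokens,
    followed by a fully connected binary (match / no-match) head."""
    def __init__(self, dim: int = 768, num_layers: int = 6, num_heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)
        )
        self.itm_head = nn.Linear(dim, 2)    # image-text matching classifier

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        hidden = text_tokens
        for layer in self.layers:
            attended, _ = layer(hidden, image_tokens, image_tokens)   # text attends to image
            hidden = hidden + attended
        return self.itm_head(hidden[:, 0])   # logits used to derive S_pairwise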
Preferably, in the step 3, a multi-objective comprehensive supervision optimization method including four loss functions is constructed, which specifically includes:
Step 31: the triplet loss is used for learning the representation space of the image features and the text features, and the loss is calculated by comparing the distances between the three comparison sample features; constructing a first triplet loss function L between a text feature encoder and an image feature encoder itt1 The method comprises the steps of carrying out a first treatment on the surface of the The features encoded by the image local encoder contain more image details that are critical to the final image representation and subsequent training of the multi-modal encoder, building a th between the text feature encoder and the image local feature encoderTwo-triplet loss function L itt2 For deep supervision, the following is indicated:
wherein: l (L) itt1 Representing a first triplet loss function; epsilon represents the minimum margin for expanding the gap between the reference sample and the positive/negative sample pair; sim represents a similarity recall module; (I, T) representing the matched image-text pair features, generated by the image encoder and the text encoder, respectively; t≡represents text features that do not match image I; i≡represents image features that do not match text T; l (L) itt2 Representing a second triplet loss function; i loc Representing the image local encoder generation, the lower right hand corner of the equation has the meaning [ []When the internal value is larger than 0, the loss is taken as the loss, and when the internal value is smaller than 0, the loss is 0;
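The triplet-loss formula itself is not preserved in the text; the sketch below uses the standard bidirectional hinge form implied by the symbol definitions above (margin ε, similarity sim, matched pair (I, T), hardest unmatched T^ and I^) and should be read as an assumption. L_itt2 reuses the same form with the local image features I_loc in place of the fused features.

import torch

def triplet_loss(sim_matrix: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Bidirectional hinge triplet loss over a batch similarity matrix.

    sim_matrix[i, j] = sim(image_i, text_j); diagonal entries are the matched pairs;
    margin plays the role of the minimum margin epsilon (value assumed).
    """
    pos = sim_matrix.diag().unsqueeze(1)                       # sim(I, T) for matched pairs
    cost_i2t = (margin - pos + sim_matrix).clamp(min=0)        # [eps - sim(I,T) + sim(I,T^)]_+
    cost_t2i = (margin - pos.t() + sim_matrix).clamp(min=0)    # [eps - sim(I,T) + sim(I^,T)]_+
    mask = torch.eye(sim_matrix.size(0), dtype=torch.bool, device=sim_matrix.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.max(dim=1).values.mean() + cost_t2i.max(dim=0).values.mean()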
step 32: an image-text matching loss function L_itm is constructed between the multi-modal encoder and the image feature encoder; whether a pair of input image and text matches is predicted by a fully connected binary classification layer; in the selection of positive/negative samples, the hard sample closest to the positive sample within a single batch, as computed by the similarity recall module, is taken as the negative sample to enhance the learning ability of the multi-modal module; the loss is calculated as follows:
L_itm = -[y_itm·log p_itm(I, T) + (1 - y_itm)·log(1 - p_itm(I, T))];
wherein: L_itm represents the image-text matching loss function constructed between the multi-modal encoder and the image feature encoder; y_itm represents the matching label of the constructed image-text pair; p_itm(I, T) represents the matching probability of the image-text pair;
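A minimal sketch of the image-text matching objective of step 32: in-batch hard-negative mining followed by cross-entropy over the matching head; the interfaces are illustrative.

import torch
import torch.nn.functional as F

def mine_hard_negatives(sim_matrix: torch.Tensor) -> torch.Tensor:
    """For each image, pick the non-matching text with the highest similarity
    (the in-batch hard negative described in step 32)."""
    sim = sim_matrix.clone()
    sim.fill_diagonal_(float("-inf"))      # exclude the positive pair
    return sim.argmax(dim=1)               # index of the hardest negative text per image

def itm_loss(match_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """L_itm: cross-entropy over the fully connected matching head of the multi-modal encoder.
    match_logits: (N, 2) logits; labels: 1 for matched pairs, 0 for mined hard negatives."""
    return F.cross_entropy(match_logits, labels)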
step 33: constructing a mask language model loss function L between a multi-modal encoder and an image local feature encoder mlm The text after mask processing is represented by T and the model prediction probability by p msk (I, T≡) as follows:
L mlm =-y msk log(p msk (I,T^))+(1-y msk )log(1-p msk (I,T^));
wherein: l (L) mlm Representing a mask language model loss function constructed between the multi-modal encoder and the image local feature encoder; y is msk Representing a predictive probability of an image-text pair; p is p msk Representing model predictive probabilities;
step 34: the combination strategy of the four loss functions affects the final expressive effect of the model, and appropriate weight coefficients must be assigned for multi-objective joint optimization to prevent a single task from dominating the joint learning; therefore a dynamic update strategy for the loss weights is adopted: for each training objective, the ratio of the current loss to the initial loss is considered during each training round, and a hyperparameter λ is introduced to balance the effect of the weights, as follows:
wherein: θ_t represents the weight of task t calculated by the formula; L_i(t) represents the loss value of task t calculated during the current mini-batch iteration; L_0(t) represents the loss value of task t during the initial iteration; λ represents the weight used to balance each task, set to 0.5; i represents the index of each training step.
Preferably, the image-to-text recall R_i2t@N and the text-to-image recall R_t2i@N in step 41 are specifically as follows:
the image-to-text recall is as follows:
wherein: text_1, text_2, …, text_N represent the 1st, 2nd, …, Nth candidate text samples returned by the retrieval algorithm;
the text-to-image recall is as follows:
wherein: image_1, image_2, …, image_N represent the 1st, 2nd, …, Nth candidate image samples returned by the retrieval algorithm.
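The recall formulas are not preserved in the text; the sketch below computes Recall@N for either direction under the standard reading (a hit if the ground-truth item appears among the top-N returned candidates), which is an assumption to that extent. Averaging the per-query values gives mR@N.

import numpy as np

def recall_at_n(query_feats: np.ndarray, gallery_feats: np.ndarray,
                gt_indices: np.ndarray, n: int) -> float:
    """Recall@N for one retrieval direction (image-to-text or text-to-image).

    query_feats  : (Q, D) L2-normalised query features
    gallery_feats: (G, D) L2-normalised gallery features
    gt_indices   : (Q,) index of the ground-truth gallery item for each query
    """
    sims = query_feats @ gallery_feats.T                   # cosine similarities
    top_n = np.argsort(-sims, axis=1)[:, :n]               # indices of the N best candidates
    hits = (top_n == gt_indices[:, None]).any(axis=1)
    return float(hits.mean())

# e.g. R_i2t@5 = recall_at_n(image_feats, text_feats, gt, 5)
#      R_t2i@5 = recall_at_n(text_feats, image_feats, gt, 5)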
Preferably, the cross-modal retrieval of the remote sensing image-text description in step 5 comprises a search-image-by-text process and a search-text-by-image process, specifically:
the search-image-by-text process is as follows: when a text description is input, the constructed cross-modal retrieval model first uses the text coding module to calculate the features of the input text, then uses the similarity judgment recall module to calculate the similarity between the text features and each image feature in the image feature database built in step 4, and returns the top-1, top-5 and top-10 similar features as the preliminary retrieval result; the multi-modal reordering module then performs a secondary correction on the preliminary result by calculating the matching probability between each returned image and the input text, thereby fine-tuning the preliminary retrieval result;
the search-text-by-image process is as follows: when an image is input, the constructed cross-modal retrieval model first uses the image coding module to calculate the features of the input image, then uses the similarity judgment recall module to calculate the similarity between the image features and each text feature in the text feature database built in step 4, and returns the top-1, top-5 and top-10 similar features as the preliminary retrieval result; the multi-modal reordering module then performs a secondary correction on the preliminary result by calculating the matching probability between each returned text and the input image, thereby fine-tuning the preliminary retrieval result.
Compared with the prior art, the invention has the beneficial effects that:
(1) According to the invention, a remote sensing image cross-mode retrieval framework comprising two single-mode encoders and one multi-mode encoder is designed, the single-mode encoder is utilized to respectively express image and text characteristics, the multi-mode encoder is utilized to perform fusion learning on the characteristics of two modes, the expression capacity of each encoder on fine granularity semantic characteristics of corresponding mode data is improved through characteristic fusion and multi-task optimization training, and the cross-mode retrieval task is completed through similarity calculation of the semantic characteristics;
(2) In order to express the detail features of the image and eliminate the training overhead brought by a pre-trained target recognition model, the invention designs a shallow visual Transformer model to extract the local features of the image, and converts the pipelined 'target detection and retrieval' process into an end-to-end training process; the end-to-end framework closes the gap between the target detector and the retrieval model training process, and reduces the training cost of the whole retrieval model;
(3) According to the invention, a set of multi-objective optimization strategies is designed for the model, and the whole model is trained under these strategies, so that the model acquires multi-detail feature expression capability for remote sensing images and text descriptions; the converged model completes the end-to-end text-image retrieval task without image tags.
Drawings
FIG. 1 is a control block diagram of a remote sensing image cross-mode retrieval method based on language and visual detail feature fusion in an embodiment of the invention;
FIG. 2 is a diagram of a model training process according to an embodiment of the present invention;
FIG. 3 is a cross-modal retrieval process diagram of an embodiment of the present invention;
FIG. 4 is a diagram of a text search process according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a text searching process according to an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
According to the embodiment of the invention, a case study is used for the analysis. In the design of the visual encoder, the target detection capability of a visual Transformer module trained together with the multi-modal encoder is introduced, converting the pipelined training process into an end-to-end training process, which greatly reduces the required amount of training data and training time and simplifies model construction. In the model training process, multi-modal fusion learning and the optimization of multi-objective tasks give the model the capability to handle multi-detail tasks, and compared with other related methods the model has better retrieval performance. The whole retrieval flow is optimized: after the recall task is finished, a reordering model based on the language and vision fusion model is introduced, which can further optimize the recall results at a small computational cost and improve the top-1 and top-5 ranking performance. Fig. 1 is a control block diagram of the remote sensing image cross-modal retrieval method based on language and visual detail feature fusion according to an embodiment of the invention.
The embodiment of the invention provides a remote sensing image cross-modal retrieval method based on language and visual detail feature fusion, and a model training process diagram of the embodiment of the invention is shown in fig. 2; to illustrate the applicability of the invention, it is applied to examples, comprising in particular the following steps:
S1: and processing training data of the remote sensing image-text retrieval model.
The training data of the remote sensing image-text retrieval model is composed based on image data and text data, wherein the image data is an actual measurement image, and the text data is descriptive information corresponding to the remote sensing image.
The processing procedure of the image data is as follows:
s111: all image data is uniformly resized to 278x278x 3.
S112: the input image data is subjected to basic data enhancement according to 50% probability, including random rotation and random inversion so as to enhance the generalization capability of the model.
S113: The amount of input image data is expanded with an image data splicing method: according to the category labels, two images I_a and I_b of the same category are randomly selected and superimposed at pixel level, and for the text part the descriptions are directly concatenated, splicing T_b after T_a.
S114: the image data after S112 and S113 is randomly cropped, and the cropping area is 256×256×3 in size to accommodate the input of the subsequent neural network model.
S115: carrying out normalization processing on the image data after the S114 to convert the gray scale range of the image to between 0 and 1;
the text data is processed as follows:
S121: the text data is subjected to de-stop word processing while setting the maximum word length to 64, the text data is truncated, and the portion exceeding the maximum word length is discarded.
S122: building a "random mask" in combination with a "directed mask" policy, as follows:
s1221: the Boolean type text data mask descriptor S_T is constructed, the length is consistent with the length of words in the processed text data, initialization is carried out by False, and no masking operation is carried out by default.
S1222: the 15% position in s_t was randomly selected according to the bernoulli distribution and marked with True.
S1223: the category labels imgs_cls of all images are recorded, text data are traversed, and the positions containing the imgs_cls are correspondingly marked with True on the S_T so as to focus on the category information of the target in the subsequent masking operation.
S1224: constructing a quantity information descriptor S_N, storing English digital words in 0-10, traversing text data, and correspondingly marking the position containing S_N on S_T with True to focus on the quantity information of the target in the subsequent masking operation.
S1225: for the positions marked by True in all S_T, performing [ MASK ] blank replacement on text data words according to the probability of 80%; replacing random words according to 10% probability, and replacing original words with random any other words; the remaining 10% of the words are not replaced.
First, the number of words len_words and the number of stop words len_stops in each text are obtained; all image-text pairs satisfying len_words = len_stops or len_words = len_stops + 1 are deleted, which avoids stop words disturbing the training of the retrieval model; then the image data and the corresponding text data are processed; finally, the cleaned remote sensing image-text retrieval training data are used for training the image local encoder and the global encoder.
S2: and constructing a multi-detail language and vision fusion model.
The cross-modal retrieval based on image-text features mainly extracts and expresses the features of the image data and of the text data separately, and, by optimizing the expression capability of each extractor, makes the features of semantically similar image data and text data lie at a minimum distance in the vector space; fig. 3 is a cross-modal retrieval process diagram of the embodiment of the invention. The overall architecture of the multi-detail language and vision fusion model comprises: a remote sensing image visual encoder F_enc-V, a remote sensing image description language feature encoder F_enc-L, and a multi-modal encoder F_enc-Mul based on a vision-language fusion model, as follows:
wherein: I represents the input remote sensing image data; T represents the input text data; f_IL represents the local features of the remote sensing image; f_IGL represents the local-global fusion features of the remote sensing image; f_T represents the text features of the remote sensing image description; S_distance represents the distance similarity between feature vectors; S_pairwise represents the matching probability value of an image-text pair; F_enc-V represents the remote sensing image visual encoder; F_enc-L represents the remote sensing image description language feature encoder; F_enc-Mul represents the multi-modal encoder based on the vision-language fusion model.
In the remote sensing image-text cross-modal retrieval task, the amount and complexity of the semantic information contained in the image data are far higher than those of the corresponding text data, so in the visual encoder F_enc-V a module M_cnn-mvsa for extracting global image features and a module M_vit for extracting local features are designed, together with a module M_midf for fusing the global and local features, specifically as follows:
wherein: f_IG represents the global features of the remote sensing image; M_cnn-mvsa represents the remote sensing image global feature extraction module; M_vit represents the remote sensing image local feature extraction module; M_midf represents the module fusing the global and local features.
The remote sensing image global feature extraction module M_cnn-mvsa uses a ResNet-50 residual convolutional neural network as the feature extractor and optimizes the feature extraction with a multi-scale self-attention model; the whole image cleaned and enhanced in S1 is input into M_cnn-mvsa to obtain f_IG.
The remote sensing image local feature extraction module M_vit uses a 6-layer Vision Transformer model as the feature extractor; the serialized image data from S1 are input into M_vit for feature extraction to obtain f_IL.
The fusion module M_midf of the global and local features is a linear function that adds the features f_IG and f_IL linearly, where the first fusion parameter a and the second fusion parameter b are obtained during training, as follows:
M_midf(f_IG, f_IL) = a·f_IG + b·f_IL;
wherein: M_midf represents the fusion module of the global and local features; a represents the first fusion parameter; b represents the second fusion parameter.
The remote sensing image description language feature encoder F_enc-L comprises a Bert-based text encoder M_bert, as follows:
f_T = M_bert(T);
wherein: M_bert represents a standard Bert model.
During training, M_bert adjusts its feature expression according to the labelled data and thereby strengthens its feature expression capability; the masked text data obtained in S1 are used as training data and input into the model M_bert. After model training is completed, in the application stage M_bert expresses the user-input text data as a feature vector with semantics.
The multi-modal encoder F_enc-Mul based on the vision-language fusion model comprises two modules: M_cms for calculating the cross-modal vector distance and M_vlf for vision-language fusion, as follows:
F_enc-Mul = {M_cms, M_vlf};
wherein: M_cms represents the module for calculating the distance similarity between image feature vectors and text feature vectors, measured with the cosine distance; M_vlf represents the module used for vision-language fusion; F_enc-Mul represents the multi-modal encoder based on the vision-language fusion model.
The M_vlf module is initialized with the last 6 layers of the Bert-based model and models visual-language interaction with additional cross-attention layers.
S3: and training a detail language and vision fusion model of multi-objective optimization.
A multi-target comprehensive supervision optimization method comprising four loss functions is constructed, and a deep supervision strategy is introduced into the middle branch.
S31: The triplet loss is used to learn the representation space of the image features and the text features; the loss is calculated by comparing the distances between the features of three contrasting samples. A first triplet loss function L_itt1 is constructed between the text feature encoder and the image feature encoder. The features encoded by the image local encoder contain more image details that are critical to the final image representation and to the subsequent training of the multi-modal encoder, so a second triplet loss function L_itt2 is constructed between the text feature encoder and the image local feature encoder for deep supervision, expressed as follows:
wherein: L_itt1 represents the first triplet loss function; ε represents the minimum margin used to enlarge the gap between the reference sample and the positive/negative sample pair; sim represents the similarity recall module; (I, T) represents the features of a matched image-text pair, generated by the image encoder and the text encoder respectively; T^ represents text features that do not match image I; I^ represents image features that do not match text T; L_itt2 represents the second triplet loss function; I_loc represents the features generated by the image local encoder; the subscripted brackets [ ]_+ mean that the value inside the brackets is taken as the loss when it is greater than 0 and the loss is 0 when it is smaller than 0.
S32: An image-text matching loss function L_itm is constructed between the multi-modal encoder and the image feature encoder; whether a pair of input image and text matches is predicted by a fully connected binary classification layer; in the selection of positive/negative samples, the hard sample closest to the positive sample within a single batch, as computed by the similarity recall module, is taken as the negative sample to enhance the learning ability of the multi-modal module; the loss is calculated as follows:
L_itm = -[y_itm·log p_itm(I, T) + (1 - y_itm)·log(1 - p_itm(I, T))];
wherein: L_itm represents the image-text matching loss function constructed between the multi-modal encoder and the image feature encoder; y_itm represents the matching label of the constructed image-text pair; p_itm(I, T) represents the matching probability of the image-text pair.
S33: A masked language model loss function L_mlm is constructed between the multi-modal encoder and the image local feature encoder; the text after masking is denoted by T^ and the model prediction probability by p_msk(I, T^), as follows:
L_mlm = -[y_msk·log p_msk(I, T^) + (1 - y_msk)·log(1 - p_msk(I, T^))];
wherein: L_mlm represents the masked language model loss function constructed between the multi-modal encoder and the image local feature encoder; y_msk represents the ground-truth label of the masked word; p_msk represents the model prediction probability.
S34: The combination strategy of the four loss functions affects the final expressive effect of the model, and appropriate weight coefficients must be assigned for multi-objective joint optimization to prevent a single task from dominating the joint learning; therefore a dynamic update strategy for the loss weights is adopted: for each training objective, the ratio of the current loss to the initial loss is considered during each training round, and a hyperparameter λ is introduced to balance the effect of the weights, as follows:
wherein: θ_t represents the weight of task t calculated by the formula; L_i(t) represents the loss value of task t calculated during the current mini-batch iteration; L_0(t) represents the loss value of task t during the initial iteration; λ represents the weight used to balance each task, set to 0.5; i represents the index of each training step.
The model training process is completed, mainly optimizing the image encoder, the text encoder and the multi-modal encoder; the image data and text data processed in S1 are divided into a training set, a validation set and a test set according to a first ratio, set to 8:1:1 in a preferred embodiment, and the divided training set is fed into the model constructed in S2, where the model parameters are initialized from a normal distribution and no pre-trained parameters are used; the image encoder part is frozen while the image-text matching loss is calculated, focusing on optimizing the multi-modal encoder.
S4: and constructing a remote sensing image-text description feature library.
S41: in the retrieval task, the recall rate is used for representing the proportion of correct samples in the returned N candidate samples by the retrieval algorithm; first using R i2t Representing image-to-text retrieval recall and R t2i Representing a retrieval recall rate of text to images; then calculate the image-to-text recall R to retrieve top1, top5, top10 for two tasks on the validation set i2t @ N and text-to-image recall R t2i @N。
The image-to-text recall is as follows:
wherein: text_1, text_2, …, text_N represent the 1st, 2nd, …, Nth candidate text samples returned by the retrieval algorithm, respectively.
Text-to-image recall is as follows:
wherein: image_1, image_2, …, image_N represent the 1st, 2nd, …, Nth candidate image samples returned by the retrieval algorithm, respectively.
Finally, the averages mR_i2t@N and mR_t2i@N of the image-to-text recall R_i2t@N and the text-to-image recall R_t2i@N over the test samples are calculated, and the model with the highest recall is stored for the subsequent retrieval task; the specific calculation formula is as follows:
wherein: mR_i2t@N and mR_t2i@N represent the averages of the image-to-text recall R_i2t@N and of the text-to-image recall R_t2i@N over the test samples, respectively; Image_k represents the kth semantically matched image-text pair; Text_k represents the kth semantically matched text-image pair; R_i2t@N(Image_k) denotes R_i2t@N with Image_k as the query; R_t2i@N(Text_k) denotes R_t2i@N with Text_k as the query; k represents the index of the image-text pair; M represents the total number of image-text pairs; N represents the retrieval task number.
S42: and (3) constructing an image feature database, extracting features of all image data by using the trained image encoder in the step (S3), and storing the generated image features in the database so as to improve the retrieval efficiency in the subsequent application.
S43: and (3) constructing a text feature database, extracting features of all text data by using the trained text encoder in the step (S3), and storing the generated text features in the database.
S5: and (5) performing cross-modal retrieval of the remote sensing image-text description.
Cross-modal retrieval consists of four major modules: an image coding module, a text coding module, a similarity judgment recall module and a multi-modal reordering module; the image coding module and the text coding module are connected in parallel, followed in cascade by the similarity judgment recall module and the multi-modal reordering module; the cross-modal retrieval of the remote sensing image-text description, completed through these four modules, comprises a search-image-by-text process and a search-text-by-image process, specifically as follows:
The search-image-by-text process is as follows: when a text description is input, the constructed cross-modal retrieval model first uses the text coding module to calculate the features of the input text, then uses the similarity judgment recall module to calculate the similarity between the text features and each image feature in the image feature database built in S4, and returns the top-1, top-5 and top-10 similar features as the preliminary retrieval result; the multi-modal reordering module then performs a secondary correction on the preliminary result by calculating the matching probability between each returned image and the input text, thereby fine-tuning the preliminary retrieval result. Fig. 4 illustrates the search-image-by-text process, taking the return of the Top-5 similar results as an example in the embodiment of the invention.
The search-text-by-image process is as follows: when an image is input, the constructed cross-modal retrieval model first uses the image coding module to calculate the features of the input image, then uses the similarity judgment recall module to calculate the similarity between the image features and each text feature in the text feature database built in S4, and returns the top-1, top-5 and top-10 similar features as the preliminary retrieval result; the multi-modal reordering module then performs a secondary correction on the preliminary result by calculating the matching probability between each returned text and the input image, thereby fine-tuning the preliminary retrieval result. Fig. 5 illustrates the search-text-by-image process according to an embodiment of the invention, taking the return of the Top-5 similar results as an example.
The method has the advantages of three aspects:
in the training stage, the method converts the pipelined training process into an end-to-end training process, converts the traditional two-stage model training into one stage, and can train without depending on an additional remote sensing image high-quality annotation data set based on target recognition, and the model training time is half of that of the two-stage training method.
In the model training process, multi-modal fusion learning and the optimization of multi-objective tasks give the model the capability to handle multi-detail tasks and to align images and texts semantically; in the experimental comparison the method exceeds the majority of the conventional methods on the evaluation indexes mR_i2t@N and mR_t2i@N (N = 1, 5, 10), and as can be seen from the example analysis, the practical application effect of the method is better than that of conventional methods 1 and 2, as shown in Table 1 below.
TABLE 1 comparative analysis results of the present method and the conventional method
After the recall task is completed, a reordering model based on the language and vision fusion model is introduced; it can further optimize the recall results at a small computational cost and further improve the top-1 and top-5 ranking performance. The comparison between the configurations with and without the reordering model is shown in Table 2 below.
TABLE 2 comparative analysis results with and without reordering model
In conclusion, the retrieval results of the remote sensing image cross-modal retrieval method based on the fusion of language and visual detail features prove to have a good effect.
(1) According to the embodiment of the invention, a remote sensing image cross-mode retrieval framework comprising two single-mode encoders and one multi-mode encoder is designed, the single-mode encoder is used for respectively representing image and text characteristics, the multi-mode encoder is used for carrying out fusion learning on the characteristics of two modes, the expression capacity of each encoder on fine granularity semantic characteristics of corresponding mode data is improved through characteristic fusion and multi-task optimization training, and the cross-mode retrieval task is completed through similarity calculation of the semantic characteristics.
(2) In order to express the detail features of the image and eliminate the training overhead brought by a pre-trained target recognition model, the embodiment of the invention extracts the local features of the image by designing a shallow visual Transformer model, and converts the pipelined 'target detection and retrieval' process into an end-to-end training process; the end-to-end framework closes the gap between the target detector and the retrieval model training process, and reduces the training cost of the whole retrieval model.
(3) According to the embodiment of the invention, a set of multi-objective optimization strategies is designed for the model, and the whole model is trained under these strategies, so that the model acquires multi-detail feature expression capability for remote sensing images and text descriptions; the converged model completes the end-to-end text-image retrieval task without image tags.
The above examples are only illustrative of the preferred embodiments of the present invention and are not intended to limit the scope of the present invention, and various modifications and improvements made by those skilled in the art to the technical solution of the present invention should fall within the scope of protection defined by the claims of the present invention without departing from the spirit of the present invention.

Claims (8)

1. A remote sensing image cross-mode retrieval method based on language and visual detail feature fusion is characterized by comprising the following steps:
step 1: processing training data of a remote sensing image-text retrieval model;
the training data of the remote sensing image-text retrieval model comprises image data and text data, wherein the image data are actual measured images and the text data are the descriptive information corresponding to the remote sensing images; first, the number of words len_words and the number of stop words len_stops in each text are obtained; all image-text pairs satisfying len_words = len_stops or len_words = len_stops + 1 are deleted, avoiding stop words disturbing the training of the retrieval model; then the image data and the corresponding text data are processed; finally, the cleaned remote sensing image-text retrieval training data are used for training the image local encoder and the global encoder;
step 2: constructing a multi-detail language and vision fusion model;
the cross-modal retrieval based on the image-text features is to extract and express the features of the image data and the text data respectively, and the features of the image data and the text data with semantic similarity are represented to have the minimum distance in a vector space by optimizing the expression capacity of each extractor; the overall architecture based on the multi-detail language and vision fusion model comprises the following steps: remote sensing image visual encoder F enc-V Remote sensing image description language feature encoder F enc-L And a multimodal encoder F based on a vision-language fusion model enc-Mul The following is shown:
wherein: i represents input remote sensing image data; t represents input text data; f (f) IL Representing local features of the remote sensing image; f (f) IGL Representing local-global fusion characteristics of the remote sensing image; f (f) T Representing the text characteristics of the remote sensing image description; s is S distance Representing distance similarity between feature vectors; s is S pairwise A matching probability value representing an image-text pair; f (F) enc-V Representing a remote sensing image visual encoder; f (F) enc-L Representing a remote sensing image description language feature encoder; f (F) enc-Mul Representing a multimodal encoder based on a vision-language fusion model;
step 3: training a detail language and vision fusion model of multi-objective optimization;
constructing a multi-target comprehensive supervision optimization method comprising four loss functions, and introducing a deep supervision strategy into the middle branch; completing a model training process, wherein the training process is mainly optimized for an image encoder, a text encoder and a multi-mode encoder; dividing the image data and the text data processed in the step 1 into a training set, a verification set and a test set according to a first proportion value, and sending the divided training set into a model constructed in the step 2, wherein model parameters are initialized by adopting normal distribution, and pre-training parameters are not used; freezing the image encoder portion while calculating the image-text match loss, focusing on optimizing the multi-modal encoder; 8:1:1 mode
Step 4: constructing a remote sensing image-text description feature library;
step 41: in the retrieval task, the recall rate is used for representing the proportion of correct samples in the returned N candidate samples by the retrieval algorithm; first using R i2t Representing image-to-text retrieval recall and R t2i Representing a retrieval recall rate of text to images; then calculate the image-to-text recall R to retrieve top1, top5, top10 for two tasks on the validation set i2t @ N and text-to-image recall R t2i @N; finally, calculating the recall rate R from the image to the text i2t @ N and text-to-image recall R t2i @ N in test samplesAverage value mR of (2) i2t @N and mR i2t And @ N, and storing a model with the highest recall rate for a subsequent retrieval task, wherein a specific calculation formula is as follows:
wherein: mR (mR) i2t @N and mR t2i @N represents the image-to-text recall rate R, respectively i2t @ N and text-to-image recall R t2i Average value of @ N in test samples; image k Representing the kth image-text pair with similar semantics; text k Representing a kth text-image pair having similar semantics; r is R i2t @N(Image k ) Representing an input Image k ;R t2i @N(Text k ) Representing Text of input Text k The method comprises the steps of carrying out a first treatment on the surface of the k represents the image and text pair number; m represents the total number of image and text pairs; n represents the search task number;
step 42: constructing an image feature database, extracting features of all image data by using the trained image encoder in the step 3, and storing the generated image features in the database so as to improve the retrieval efficiency in the subsequent application;
Step 43: constructing a text feature database, extracting features of all text data by using the trained text encoder in the step 3, and storing the generated text features in the database;
step 5: completing cross-modal retrieval of remote sensing image-text description;
the cross-modal retrieval includes four modules: the device comprises an image coding module, a text coding module, a similarity judging recall module and a multi-mode reordering module; the image coding module is connected with the text coding module in parallel, and then is connected with the similarity judging recall module and the multi-mode reordering module in a cascading manner; and the cross-modal retrieval of the remote sensing image-text description is completed through the four modules.
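The following is a minimal, non-limiting Python sketch of the stop-word cleaning rule of step 1, assuming a whitespace tokenizer and a small illustrative stop-word list (the actual stop-word vocabulary and tokenizer are not specified by the claim):

    # Illustrative sketch: drop image-text pairs whose caption is (almost) all stop words.
    # The stop-word set and the whitespace tokenizer are assumptions for illustration only.
    STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "are", "and", "there"}

    def keep_pair(caption: str) -> bool:
        words = caption.lower().split()
        len_words = len(words)
        len_stops = sum(1 for w in words if w in STOP_WORDS)
        # delete pairs satisfying len_words == len_stops or len_words == len_stops + 1
        return not (len_words == len_stops or len_words == len_stops + 1)

    pairs = [("img_001.tif", "there is a the of"),
             ("img_002.tif", "two planes are parked near the terminal")]
    cleaned = [(img, txt) for img, txt in pairs if keep_pair(txt)]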
2. The remote sensing image cross-modal retrieval method based on language and visual detail feature fusion according to claim 1, wherein the processing of the image data and the corresponding text data in step 1 is specifically as follows:
the processing procedure of the image data is as follows:
step 111: uniformly adjusting all image data to 278×278×3;
step 112: applying basic data enhancement to the input image data with 50% probability, comprising random rotation and random flipping, to enhance the generalization capability of the model;
step 113: expanding the data amount of the input image data by an image data splicing method: two images I_a and I_b of the same category are randomly selected according to the category labels and superimposed at the pixel level; for the text part, the text descriptions are directly spliced, with T_b appended directly after T_a;
step 114: randomly cropping the image data after steps 112 and 113, wherein the cropped area is 256×256×3 to fit the input of the subsequent neural network model;
step 115: normalizing the image data after step 114 to transform the gray scale range of the image to between 0 and 1;
the text data are processed as follows:
step 121: performing stop-word removal on the text data, setting the maximum word length to 64, truncating the text data and discarding the part exceeding the maximum word length;
step 122: building a "random mask" combined with a "directed mask" policy, as follows:
step 1221: constructing a Boolean text data mask descriptor S_T whose length is consistent with the number of words in the processed text data; S_T is initialized with False, meaning that by default no masking operation is applied;
step 1222: randomly selecting 15% of the positions in S_T according to a Bernoulli distribution and marking them with True;
step 1223: recording the class labels imgs_cls of all images, traversing the text data, and marking the corresponding positions containing imgs_cls on S_T with True, so that the subsequent masking operation focuses on the class information of the targets;
step 1224: constructing a quantity information descriptor S_N storing the English number words from 0 to 10, traversing the text data, and marking the corresponding positions containing S_N on S_T with True, so that the subsequent masking operation focuses on the quantity information of the targets;
step 1225: for all positions marked True in S_T, the text data words are replaced with the [MASK] placeholder with 80% probability; with 10% probability the word is replaced with a random word, i.e., the original word is replaced with any other random word; the remaining 10% of the words are left unchanged.
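A hedged Python sketch of the combined "random mask" and "directed mask" strategy of step 122, assuming whitespace-tokenized captions; the class-label set, the replacement vocabulary and the [MASK] handling of a real tokenizer are simplified assumptions:

    import random

    NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five",
                    "six", "seven", "eight", "nine", "ten"}          # S_N from step 1224

    def build_mask(words, imgs_cls, p_random=0.15):
        """Return the Boolean mask descriptor S_T for one caption (True = mask candidate)."""
        s_t = [False] * len(words)                       # step 1221: initialised with False
        for i, w in enumerate(words):
            if random.random() < p_random:               # step 1222: Bernoulli(0.15) positions
                s_t[i] = True
            if w in imgs_cls or w in NUMBER_WORDS:       # steps 1223/1224: directed masking
                s_t[i] = True
        return s_t

    def apply_mask(words, s_t, vocab):
        """Step 1225: 80% [MASK], 10% random word, 10% keep, for the True positions."""
        out = list(words)
        for i, flagged in enumerate(s_t):
            if not flagged:
                continue
            r = random.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = random.choice(vocab)
            # else: keep the original word unchanged
        return out

    caption = "two white planes parked near the airport terminal".split()
    masked = apply_mask(caption, build_mask(caption, imgs_cls={"airport", "planes"}), vocab=caption)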
3. The method for cross-modal retrieval of remote sensing images based on fusion of language and visual detail features according to claim 1, wherein the remote sensing image visual encoder in step 2 specifically comprises:
in the remote sensing image-text cross-modal retrieval task, the amount and complexity of the semantic information contained in the image data are higher than those of the corresponding text data; therefore, a module M_cnn-mvsa for extracting global image features and a module M_vit for extracting local features are designed in the visual encoder F_enc-V, and a module M_midf for fusing the global features and the local features is designed at the same time, specifically as follows:
f_IG = M_cnn-mvsa(I)
f_IL = M_vit(I)
f_IGL = M_midf(f_IG, f_IL)
wherein: f_IG represents the global features of the remote sensing image; M_cnn-mvsa represents the remote sensing image global feature extraction module; M_vit represents the remote sensing image local feature extraction module; M_midf represents the module for fusing the global features and the local features;
the remote sensing image global feature extraction module M_cnn-mvsa takes a ResNet-50 residual convolutional neural network as the feature extractor and optimizes the feature extraction effect with a multi-scale self-attention model; the whole image cleaned and enhanced in step 1 is input into M_cnn-mvsa to obtain f_IG;
the remote sensing image local feature extraction module M_vit uses a 6-layer Vision Transformer model as the feature extractor; the image data serialized in step 1 are input into M_vit for feature extraction to obtain f_IL;
the fusion module M_midf of the global features and the local features is a linear function that adds the features f_IG and f_IL linearly, wherein a first fusion parameter a and a second fusion parameter b are learned during training, as follows:
M_midf(f_IG, f_IL) = a·f_IG + b·f_IL
wherein: M_midf represents the fusion module of the global features and the local features; a represents the first fusion parameter; b represents the second fusion parameter.
4. The method for cross-modal retrieval of remote sensing images based on fusion of language and visual detail features according to claim 1, wherein the remote sensing image description language feature encoder in step 2 specifically comprises:
the remote sensing image description language feature encoder F_enc-L comprises a BERT-based text encoder M_bert, as follows:
f_T = M_bert(T);
wherein: M_bert represents a standard BERT model;
during training, M_bert adjusts its feature expression characteristics according to the labeled data, which enhances the feature expression capability of the model; the text data masked in step 1 are used as training data and input into the model M_bert; after model training is completed, in the application stage M_bert expresses user-input text data as semantic feature vectors.
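A hedged sketch of how the BERT-based text encoder M_bert could express a caption as a semantic feature vector using the Hugging Face transformers library; the checkpoint name and the use of the [CLS] hidden state as f_T are assumptions, not requirements of the claim:

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint
    bert = BertModel.from_pretrained("bert-base-uncased")

    def encode_text(caption: str) -> torch.Tensor:
        """f_T = M_bert(T): return one semantic feature vector for a caption."""
        inputs = tokenizer(caption, return_tensors="pt", truncation=True, max_length=64)
        with torch.no_grad():
            outputs = bert(**inputs)
        return outputs.last_hidden_state[:, 0]           # [CLS] token feature (assumption)

    f_t = encode_text("many planes are parked next to a long airport terminal")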
5. The remote sensing image cross-modal retrieval method based on the fusion of language and visual detail features according to claim 1, wherein the multi-modal encoder based on the visual-language fusion model in step 2 specifically comprises:
the multi-modal encoder F_enc-Mul based on the vision-language fusion model comprises two modules, M_cms for calculating the cross-modal vector distance and M_vlf for vision-language fusion, as follows:
F_enc-Mul = {M_cms, M_vlf};
wherein: M_cms represents the module for calculating the distance similarity between the image feature vectors and the text feature vectors, with the distance measured by cosine distance; M_vlf represents the module used for vision-language fusion; F_enc-Mul represents the multi-modal encoder based on the vision-language fusion model;
the M_vlf module is initialized with the last 6 layers of the BERT-based model and models vision-language interaction with additional cross-attention layers.
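A minimal sketch of the cross-modal distance module M_cms, which scores image-text pairs by cosine similarity between L2-normalised feature vectors; batch sizes and dimensions are illustrative:

    import torch
    import torch.nn.functional as F

    def cosine_similarity_matrix(img_feats, txt_feats):
        """M_cms: pairwise cosine similarity between image and text feature vectors."""
        img = F.normalize(img_feats, dim=-1)
        txt = F.normalize(txt_feats, dim=-1)
        return img @ txt.t()                  # (num_images, num_texts) similarity matrix

    s = cosine_similarity_matrix(torch.randn(4, 768), torch.randn(6, 768))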
6. The remote sensing image cross-modal retrieval method based on language and visual detail feature fusion according to claim 1, wherein the multi-objective comprehensive supervision optimization method comprising four loss functions constructed in step 3 specifically comprises the following steps:
step 31: the triplet loss is used to learn the representation space of image features and text features, and the loss is calculated by comparing the distances between the features of three contrasted samples; a first triplet loss function L_itt1 is constructed between the text feature encoder and the image feature encoder; the features encoded by the image local encoder contain more image details that are critical to the final image representation and to the subsequent training of the multi-modal encoder, so a second triplet loss function L_itt2 is constructed between the text feature encoder and the image local feature encoder for deep supervision, expressed as follows:
L_itt1 = [ε + sim(I, T^) − sim(I, T)]_+ + [ε + sim(I^, T) − sim(I, T)]_+
L_itt2 = [ε + sim(I_loc, T^) − sim(I_loc, T)]_+ + [ε + sim(I^, T) − sim(I_loc, T)]_+
wherein: L_itt1 represents the first triplet loss function; ε represents the minimum margin used to widen the gap between the reference sample and the positive/negative sample pair; sim represents the similarity recall module; (I, T) represents the matched image-text pair features, generated by the image encoder and the text encoder respectively; T^ represents text features that do not match image I; I^ represents image features that do not match text T; L_itt2 represents the second triplet loss function; I_loc represents the features generated by the image local encoder; [·]_+ means that the value inside the brackets is taken as the loss when it is greater than 0 and the loss is 0 when it is less than 0;
step 32: an image-text matching loss function L_itm is constructed between the multi-modal encoder and the image feature encoder, and a fully-connected binary classification layer is appended to predict whether a pair of input image and text is matched; when selecting positive/negative samples, the hard samples closest to the positive samples within a single batch are computed by the similarity recall module and used as negative samples to enhance the learning ability of the multi-modal module; the loss is calculated in the following form:
L_itm = −y_itm·log(p_itm(I, T)) − (1 − y_itm)·log(1 − p_itm(I, T));
wherein: L_itm represents the image-text matching loss function constructed between the multi-modal encoder and the image feature encoder; y_itm represents the matching label of the constructed image-text pair; p_itm(I, T) represents the matching probability of the image-text pair;
step 33: a masked language model loss function L_mlm is constructed between the multi-modal encoder and the image local feature encoder; T^ denotes the text after mask processing, and the model prediction probability is denoted p_msk(I, T^), as follows:
L_mlm = −y_msk·log(p_msk(I, T^)) − (1 − y_msk)·log(1 − p_msk(I, T^));
wherein: L_mlm represents the masked language model loss function constructed between the multi-modal encoder and the image local feature encoder; y_msk represents the ground-truth label of the masked word; p_msk represents the model prediction probability;
step 34: the combination strategy of the four loss functions affects the final expression effect of the model, and appropriate weight coefficients must be assigned for multi-objective co-optimization so that no single task dominates the joint learning; therefore, a dynamic update strategy for the loss weights is adopted: for each training objective, the ratio of the current loss to the initial loss is considered during each round of training, and a hyper-parameter λ is introduced to balance the effect of the weights, as follows:
θ_t = λ · L_i(t) / L_0(t)
wherein: θ_t represents the weight of task t calculated by the above formula; L_i(t) represents the loss value of task t calculated during the current mini-batch iteration; L_0(t) represents the loss value of task t during the initial iteration; λ represents the weight used to balance each task, set to 0.5; i represents the index of the current training iteration.
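A hedged sketch of the dynamic loss-weight update of step 34, assuming the weight of each task is the λ-scaled ratio of its current loss to its initial loss (the reconstructed update rule above is itself an assumption); the loss names and values are illustrative:

    LAMBDA = 0.5   # balancing hyper-parameter, set to 0.5 in step 34

    def task_weights(current_losses, initial_losses, lam=LAMBDA):
        """theta_t = lam * L_i(t) / L_0(t) for every task t (assumed form of the update)."""
        return {t: lam * current_losses[t] / max(initial_losses[t], 1e-8)
                for t in current_losses}

    def total_loss(current_losses, initial_losses):
        weights = task_weights(current_losses, initial_losses)
        return sum(weights[t] * current_losses[t] for t in current_losses)

    initial = {"itt1": 2.0, "itt2": 2.1, "itm": 0.7, "mlm": 5.3}   # illustrative values
    current = {"itt1": 1.2, "itt2": 1.5, "itm": 0.4, "mlm": 3.0}
    loss = total_loss(current, initial)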
7. The method for cross-modal retrieval of remote sensing images based on fusion of language and visual detail features as claimed in claim 1, wherein the image-to-text recall R_i2t@N and the text-to-image recall R_t2i@N in step 41 are specifically:
the image-to-text recall is as follows:
R_i2t@N(Image_k) = 1 if the matching text Text_k is contained in {text_1, text_2, …, text_N}, and 0 otherwise
wherein: text_1, text_2, …, text_N respectively represent the 1st, 2nd through Nth candidate text samples in the set returned by the retrieval algorithm;
the text-to-image recall is as follows:
R_t2i@N(Text_k) = 1 if the matching image Image_k is contained in {image_1, image_2, …, image_N}, and 0 otherwise
wherein: image_1, image_2, …, image_N respectively represent the 1st, 2nd through Nth candidate image samples in the set returned by the retrieval algorithm.
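A minimal sketch of the recall@N computation of claim 7, assuming each query has exactly one matching item and that retrieval results are given as ranked index lists:

    def recall_at_n(ranked_lists, ground_truth, n):
        """mR@N: fraction of queries whose ground-truth item appears in the top-N results."""
        hits = sum(1 for k, ranked in enumerate(ranked_lists) if ground_truth[k] in ranked[:n])
        return hits / len(ranked_lists)

    # Illustrative data: ranked text indices returned for each of three image queries.
    ranked_texts = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
    gt_text_for_image = [0, 1, 2]                 # image k matches text k
    mr_i2t_at_1 = recall_at_n(ranked_texts, gt_text_for_image, 1)
    mr_i2t_at_2 = recall_at_n(ranked_texts, gt_text_for_image, 2)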
8. The method for cross-modal retrieval of remote sensing images based on fusion of language and visual detail features according to claim 1, wherein the cross-modal retrieval of remote sensing image-text description in step 5 comprises a text-to-image retrieval process and an image-to-text retrieval process, specifically as follows:
the text-to-image retrieval process: when a text description is input, the constructed cross-modal retrieval model first uses the text encoding module to calculate the features of the input text, then uses the similarity judgment recall module to calculate the similarity between each image feature in the image feature database constructed in step 4 and the text feature, and returns the top1, top5 and top10 similar features as the preliminary retrieval result; the multi-modal reordering module then performs a secondary correction of the preliminary retrieval result by calculating the matching probability between each returned image and the input text, thereby fine-tuning the preliminary retrieval result;
the image-to-text retrieval process: when an image is input, the constructed cross-modal retrieval model first uses the image encoding module to calculate the features of the input image, then uses the similarity judgment recall module to calculate the similarity between each text feature in the text feature database constructed in step 4 and the image feature, and returns the top1, top5 and top10 similar features as the preliminary retrieval result; the multi-modal reordering module then performs a secondary correction of the preliminary retrieval result by calculating the matching probability between each returned text and the input image, thereby fine-tuning the preliminary retrieval result.
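A hedged end-to-end sketch of the text-to-image retrieval flow of claim 8: encode the query text, recall top-k candidates by cosine similarity against the pre-computed image feature database of step 4, then rerank the candidates with an image-text matching probability from the multi-modal encoder; text_encoder and rerank_fn are placeholders standing in for the trained F_enc-L and F_enc-Mul modules:

    import numpy as np

    def search_images_by_text(query, text_encoder, image_db_feats, rerank_fn, top_k=10):
        """Similarity recall followed by multi-modal reordering (illustrative sketch)."""
        f_t = text_encoder(query)                                   # query text feature
        f_t = f_t / np.linalg.norm(f_t)
        db = image_db_feats / np.linalg.norm(image_db_feats, axis=1, keepdims=True)
        sims = db @ f_t                                             # similarity recall module
        candidates = np.argsort(-sims)[:top_k]                      # preliminary top-k result
        # secondary correction: reorder candidates by image-text matching probability
        match_probs = [rerank_fn(int(idx), query) for idx in candidates]
        order = np.argsort(match_probs)[::-1]
        return [int(candidates[i]) for i in order]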
CN202310550653.6A 2023-05-16 2023-05-16 Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics Pending CN116775922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310550653.6A CN116775922A (en) 2023-05-16 2023-05-16 Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310550653.6A CN116775922A (en) 2023-05-16 2023-05-16 Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics

Publications (1)

Publication Number Publication Date
CN116775922A true CN116775922A (en) 2023-09-19

Family

ID=87993958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310550653.6A Pending CN116775922A (en) 2023-05-16 2023-05-16 Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics

Country Status (1)

Country Link
CN (1) CN116775922A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116977796B (en) * 2023-09-25 2024-02-23 中国科学技术大学 Zero sample image recognition method, system, equipment and storage medium
CN116977796A (en) * 2023-09-25 2023-10-31 中国科学技术大学 Zero sample image recognition method, system, equipment and storage medium
CN117292146A (en) * 2023-10-27 2023-12-26 中科苏州智能计算技术研究院 Industrial scene-oriented method, system and application method for constructing multi-mode large language model
CN117557414A (en) * 2023-11-30 2024-02-13 重庆欣荣土地房屋勘测技术研究所有限责任公司 Cultivated land supervision method, device, equipment and storage medium based on automatic interpretation of remote sensing image
CN117556079B (en) * 2024-01-12 2024-04-16 航天宏图信息技术股份有限公司 Remote sensing image content retrieval method, remote sensing image content retrieval device, electronic equipment and medium
CN117556079A (en) * 2024-01-12 2024-02-13 航天宏图信息技术股份有限公司 Remote sensing image content retrieval method, remote sensing image content retrieval device, electronic equipment and medium
CN117609527A (en) * 2024-01-16 2024-02-27 合肥人工智能与大数据研究院有限公司 Cross-modal data retrieval optimization method based on vector database
CN117648459A (en) * 2024-01-29 2024-03-05 中国海洋大学 Image-text cross-modal retrieval method and system for high-similarity marine remote sensing data
CN117648459B (en) * 2024-01-29 2024-04-26 中国海洋大学 Image-text cross-modal retrieval method and system for high-similarity marine remote sensing data
CN117690031A (en) * 2024-02-04 2024-03-12 中科星图数字地球合肥有限公司 SAM model-based small sample learning remote sensing image detection method
CN117690031B (en) * 2024-02-04 2024-04-26 中科星图数字地球合肥有限公司 SAM model-based small sample learning remote sensing image detection method
CN117909535A (en) * 2024-03-15 2024-04-19 中国科学技术大学 Combined understanding method, system, equipment and medium based on visual language model
CN117909535B (en) * 2024-03-15 2024-05-31 中国科学技术大学 Combined understanding method, system, equipment and medium based on visual language model

Similar Documents

Publication Publication Date Title
CN116775922A (en) Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics
CN110909673B (en) Pedestrian re-identification method based on natural language description
Li et al. Truncation cross entropy loss for remote sensing image captioning
CN112905827B (en) Cross-modal image-text matching method, device and computer readable storage medium
CN113220919B (en) Dam defect image text cross-modal retrieval method and model
CN110738146B (en) Target re-recognition neural network and construction method and application thereof
CN115033670A (en) Cross-modal image-text retrieval method with multi-granularity feature fusion
Hoxha et al. A new CNN-RNN framework for remote sensing image captioning
CN109684928B (en) Chinese document identification method based on internet retrieval
WO2021088935A1 (en) Adversarial network architecture optimization method and system, and image description generation method and system
CN113033438B (en) Data feature learning method for modal imperfect alignment
CN111598183A (en) Multi-feature fusion image description method
CN112948601B (en) Cross-modal hash retrieval method based on controlled semantic embedding
CN112148831B (en) Image-text mixed retrieval method and device, storage medium and computer equipment
CN116226785A (en) Target object recognition method, multi-mode recognition model training method and device
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN112988970A (en) Text matching algorithm serving intelligent question-answering system
CN113032601A (en) Zero sample sketch retrieval method based on discriminant improvement
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN110197213B (en) Image matching method, device and equipment based on neural network
CN117453859A (en) Agricultural pest and disease damage image-text retrieval method, system and electronic equipment
CN115640418A (en) Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
CN115098646A (en) Multilevel relation analysis and mining method for image-text data
CN110609961A (en) Collaborative filtering recommendation method based on word embedding
CN113139378B (en) Image description method based on visual embedding and condition normalization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination