CN116775922A - Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics - Google Patents

Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics


Publication number
CN116775922A
CN116775922A (application CN202310550653.6A)
Authority
CN
China
Prior art keywords
image
text
encoder
remote sensing
data
Prior art date
Legal status
Pending
Application number
CN202310550653.6A
Other languages
Chinese (zh)
Inventor
何柳
刘姝妍
安然
卓雨东
陶剑
李润岐
王孝天
武铎
孙郁文
Current Assignee
China Aero Polytechnology Establishment
Original Assignee
China Aero Polytechnology Establishment
Priority date
Filing date
Publication date
Application filed by China Aero Polytechnology Establishment filed Critical China Aero Polytechnology Establishment
Priority to CN202310550653.6A priority Critical patent/CN116775922A/en
Publication of CN116775922A publication Critical patent/CN116775922A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/535Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning


Abstract

The invention relates to a remote sensing image cross-modal retrieval method based on the fusion of language and visual detail features, which comprises the following steps: step 1: processing training data of a remote sensing image-text retrieval model; step 2: constructing a multi-detail language and vision fusion model; step 3: training the multi-detail language and vision fusion model with multi-objective optimization; step 4: constructing a remote sensing image-text description feature library; step 5: performing cross-modal retrieval of remote sensing image-text descriptions. According to the invention, single-modal encoders are used to represent image and text features respectively, and a multi-modal encoder is used to fuse the features of the two modalities; feature fusion and multi-task optimization training improve each encoder's expression of the fine-grained semantic features of its modality, and cross-modal retrieval is completed through similarity calculation on the semantic features. By designing a multi-objective optimization strategy for the model, the model acquires multi-detail feature expression capability for remote sensing images and text descriptions.

Description

Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics
Technical Field
The application relates to the technical field of image processing, and in particular to a remote sensing image cross-modal retrieval method based on the fusion of language and visual detail features.
Background
In recent years, remote sensing satellite and unmanned aerial vehicle technologies have developed rapidly, and remote sensing, as their core technology, has shown remarkable value in fields such as geographic positioning, disaster rescue, military reconnaissance and disaster monitoring. With the wide application of remote sensing technology, remote sensing images have grown explosively, which brings great difficulty to tasks such as large-scale remote sensing image recognition, detection, classification and retrieval. The remote sensing image cross-modal retrieval task refers to finding, in a large-scale remote sensing image data set, the images that match or are similar to a given natural language description, and vice versa. Compared with traditional remote sensing image retrieval, image-text cross-modal retrieval offers better human-computer interaction and has stronger application value.
In the application scenario of remote sensing image cross-modal retrieval, one important requirement of a user is to input a description of a scene and retrieve, from a huge remote sensing image library, the images consistent with or similar to that description. In this process the query data and the data stored in the database belong to different modalities, and there is a large representation gap in feature expression between modalities, so connections must be established between samples with the same semantics in different modalities. Current cross-modal retrieval methods for remote sensing images mainly include methods based on image tag retrieval and methods based on image-text feature vector retrieval. The image-tag-based method describes each image with keywords that serve as its feature tags; during retrieval, the description input by the user is decomposed into keywords, which are matched against the keyword tags of the images to find similar target images. The image-text feature vector method uses a trained image-text encoder to encode images and texts with the same or similar semantics into feature vectors that lie close to each other, and dissimilar pairs into vectors that lie far apart. Both current retrieval modes have shortcomings of varying degrees, mainly in the following aspects:
1. The method based on image tag retrieval relies on high-quality tag descriptions of the existing image data, which takes a great deal of time and is therefore not applicable to the large-scale data retrieval process;
2. the retrieval method based on image-text feature vectors needs to align image content with the text description, and because of the difference in data structure between images and text, the process of extracting, aligning and fusing the features of the two types of data is very difficult;
3. current image feature encoders often rely on a high-quality remote sensing image target recognition model to represent the details of the image; the accuracy of this model has a great impact on the overall retrieval results, and training the target recognition model requires additional resources.
In order to solve the above problems, the invention provides a remote sensing image cross-modal retrieval method based on language-visual detail feature fusion. A remote sensing image cross-modal retrieval framework comprising two single-modal encoders (visual and language) and one multi-modal encoder is designed; the single-modal encoders represent image and text features respectively, the multi-modal encoder performs fusion learning on the features of the two modalities, feature fusion and multi-task optimization training improve each encoder's ability to express the fine-grained semantic features of its modality, and the cross-modal retrieval task is completed through similarity calculation on the semantic features. In order to express the detail features of the image, the invention extracts local image features with a shallow visual Transformer model and converts the pipelined 'target detection and retrieval' process into an end-to-end training process; the end-to-end framework closes the gap between the target detector and the retrieval model training process and reduces the training cost of the whole retrieval model. The invention designs a set of multi-objective optimization strategies for the model and trains the whole model under these strategies, so that the model acquires multi-detail feature expression capability for remote sensing images and text descriptions; the converged model completes the end-to-end text-image retrieval task without image tags.
Disclosure of Invention
In order to overcome the defects of the prior art, single-modal encoders are used to represent image and text features respectively, and a multi-modal encoder is used to fuse the features of the two modalities; feature fusion and multi-task optimization training improve each encoder's expression of the fine-grained semantic features of its modality, and cross-modal retrieval is completed through similarity calculation on the semantic features. By designing a multi-objective optimization strategy for the model, the model acquires multi-detail feature expression capability for remote sensing images and text descriptions.
In order to achieve the above purpose, the solution adopted by the invention is to provide a remote sensing image cross-modal retrieval method based on language and visual detail feature fusion, which comprises the following steps:
step 1: processing training data of a remote sensing image-text retrieval model;
the training data of the remote sensing image-text retrieval model is formed from image data and text data, wherein the image data are actual measured images and the text data are the descriptive information corresponding to the remote sensing images; first, the number of words len_words and the number of stop words len_stops in each text are obtained; all image-text pairs satisfying len_words = len_stops or len_words = len_stops + 1 are deleted, which avoids stop words disturbing the training of the retrieval model; then the image data and the corresponding text data are processed; finally, the cleaned remote sensing image-text retrieval training data are used for training the image local encoder and the global encoder;
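A minimal sketch of the stop-word-based cleaning rule of step 1 is given below (in Python); the tokenizer, the stop-word list and the data-pair container are illustrative assumptions rather than part of the described method.

from typing import List, Tuple

# Assumed stop-word list; any standard list could be substituted.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "are", "there", "and"}

def clean_pairs(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """pairs: list of (image_path, caption); returns the cleaned list."""
    kept = []
    for image_path, caption in pairs:
        words = caption.lower().split()                 # simple whitespace tokenizer (assumption)
        len_words = len(words)
        len_stops = sum(w in STOP_WORDS for w in words)
        if len_words in (len_stops, len_stops + 1):     # caption is (almost) only stop words
            continue                                    # drop the image-text pair
        kept.append((image_path, caption))
    return kept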
step 2: constructing a multi-detail language and vision fusion model;
the cross-modal retrieval based on image-text features mainly extracts and expresses the features of the image data and of the text data separately, and, by optimizing the expression capability of each extractor, makes the features of semantically similar image data and text data lie at a minimum distance in the vector space; the overall architecture of the multi-detail language and vision fusion model comprises: a remote sensing image visual encoder F_enc-V, a remote sensing image description language feature encoder F_enc-L, and a multi-modal encoder F_enc-Mul based on a vision-language fusion model, as follows:
wherein: I represents the input remote sensing image data; T represents the input text data; f_IL represents the local features of the remote sensing image; f_IGL represents the local-global fusion features of the remote sensing image; f_T represents the text features of the remote sensing image description; S_distance represents the distance similarity between feature vectors; S_pairwise represents the matching probability value of an image-text pair; F_enc-V represents the remote sensing image visual encoder; F_enc-L represents the remote sensing image description language feature encoder; F_enc-Mul represents the multi-modal encoder based on the vision-language fusion model;
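The formula image for this architecture is not preserved in the text, so the Python sketch below only illustrates one plausible reading of the symbol list above (F_enc-V producing f_IL and f_IGL, F_enc-L producing f_T, and F_enc-Mul producing S_pairwise, with S_distance taken as the cosine similarity); the module classes are placeholders.

import torch
import torch.nn as nn

class MultiDetailRetrievalModel(nn.Module):
    """Illustrative wiring of the three encoders of step 2 (a sketch, not the patented implementation)."""
    def __init__(self, visual_encoder: nn.Module, text_encoder: nn.Module, fusion_encoder: nn.Module):
        super().__init__()
        self.F_enc_V = visual_encoder     # image -> (f_IL, f_IGL)
        self.F_enc_L = text_encoder       # text tokens -> f_T
        self.F_enc_Mul = fusion_encoder   # (f_IGL, f_T) -> S_pairwise (matching probability)

    def forward(self, image: torch.Tensor, text_tokens: torch.Tensor):
        f_IL, f_IGL = self.F_enc_V(image)
        f_T = self.F_enc_L(text_tokens)
        # S_distance: cosine similarity between the fused image feature and the text feature
        s_distance = nn.functional.cosine_similarity(f_IGL, f_T, dim=-1)
        s_pairwise = self.F_enc_Mul(f_IGL, f_T)
        return f_IL, f_IGL, f_T, s_distance, s_pairwise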
step 3: training a detail language and vision fusion model of multi-objective optimization;
constructing a multi-objective comprehensive supervision optimization method comprising four loss functions, and introducing a deep supervision strategy into the intermediate branch; the model training process is completed, mainly optimizing the image encoder, the text encoder and the multi-modal encoder; the image data and text data processed in step 1 are divided into a training set, a validation set and a test set according to a first ratio, and the divided training set is fed into the model constructed in step 2, where the model parameters are initialized from a normal distribution and no pre-trained parameters are used; the image encoder part is frozen while the image-text matching loss is calculated, focusing on optimizing the multi-modal encoder;
step 4: constructing a remote sensing image-text description feature library;
step 41: in the retrieval task, the recall rate represents the proportion of correct samples among the N candidate samples returned by the retrieval algorithm; first, R_i2t denotes the image-to-text retrieval recall and R_t2i the text-to-image retrieval recall; then the top-1, top-5 and top-10 recalls R_i2t@N and R_t2i@N of the two retrieval tasks are calculated on the validation set; finally, the averages mR_i2t@N and mR_t2i@N of the image-to-text recall R_i2t@N and the text-to-image recall R_t2i@N over the test samples are calculated, and the model with the highest recall is stored for the subsequent retrieval task; the specific calculation formula is as follows:
wherein: mR_i2t@N and mR_t2i@N represent the averages of the image-to-text recall R_i2t@N and of the text-to-image recall R_t2i@N over the test samples, respectively; Image_k represents the kth semantically matched image-text pair; Text_k represents the kth semantically matched text-image pair; R_i2t@N(Image_k) denotes R_i2t@N with Image_k as the query; R_t2i@N(Text_k) denotes R_t2i@N with Text_k as the query; k represents the index of the image-text pair; M represents the total number of image-text pairs; N represents the retrieval task number;
step 42: constructing an image feature database, extracting features of all image data by using the trained image encoder in the step 3, and storing the generated image features in the database so as to improve the retrieval efficiency in the subsequent application;
Step 43: constructing a text feature database, extracting features of all text data by using the trained text encoder in the step 3, and storing the generated text features in the database;
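A minimal sketch of steps 42-43 under the assumption that the features are cached as NumPy arrays; the encoder interface and the file names are illustrative.

import numpy as np
import torch

@torch.no_grad()
def build_feature_database(encoder, items, encode_fn, out_path: str):
    """Encode every image (step 42) or text (step 43) once and cache the features.

    encoder   : trained image or text encoder (assumed to return one vector per item)
    items     : list of image tensors or caption strings
    encode_fn : callable(encoder, item) -> 1-D feature tensor
    """
    encoder.eval()
    feats = [encode_fn(encoder, item).cpu().numpy() for item in items]
    feats = np.stack(feats)                                    # shape: (num_items, feature_dim)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)      # L2-normalise for cosine search
    np.save(out_path, feats)                                   # e.g. "image_features.npy" / "text_features.npy"
    return feats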
step 5: completing cross-modal retrieval of remote sensing image-text description;
the cross-modal retrieval includes four major modules: an image coding module, a text coding module, a similarity judgment recall module and a multi-modal reordering module; the image coding module and the text coding module are connected in parallel, followed in cascade by the similarity judgment recall module and the multi-modal reordering module; the cross-modal retrieval of the remote sensing image-text description is completed through these four modules.
Preferably, the processing of the image data and the corresponding text data in step 1 is specifically as follows:
the processing procedure of the image data is as follows:
step 111: uniformly resizing all image data to 278×278×3;
step 112: performing basic data enhancement on the input image data with 50% probability, including random rotation and random flipping, to enhance the generalization capability of the model;
step 113: expanding the amount of input image data with an image data splicing method: according to the category labels, two images I_a and I_b of the same category are randomly selected and superimposed at pixel level, and for the text part the descriptions are directly concatenated, splicing T_b after T_a;
step 114: randomly cropping the image data obtained after steps 112 and 113, with a cropping area of 256×256×3, to fit the input of the subsequent neural network model;
step 115: normalizing the image data obtained after step 114 to transform the gray-scale range of the image to between 0 and 1 (a sketch of this preprocessing pipeline follows this list);
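A hedged sketch of the image preprocessing of steps 111-115; torchvision is an illustrative choice, the rotation angle is assumed, and the splicing function is one plausible reading of step 113 (equal-weight pixel-level overlay of two same-category images).

from PIL import Image
from torchvision import transforms

base_transform = transforms.Compose([
    transforms.Resize((278, 278)),                                    # step 111
    transforms.RandomApply([transforms.RandomRotation(30)], p=0.5),   # step 112 (angle assumed)
    transforms.RandomHorizontalFlip(p=0.5),                           # step 112
    transforms.RandomCrop((256, 256)),                                # step 114
    transforms.ToTensor(),                                            # step 115: pixels scaled to [0, 1]
])

def splice_pair(image_a: Image.Image, image_b: Image.Image, caption_a: str, caption_b: str):
    """Step 113: overlay two same-category images at pixel level and concatenate their captions."""
    to_tensor, resize = transforms.ToTensor(), transforms.Resize((278, 278))
    mixed = 0.5 * to_tensor(resize(image_a)) + 0.5 * to_tensor(resize(image_b))
    return transforms.ToPILImage()(mixed), caption_a + " " + caption_b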
the text data is processed as follows:
step 121: performing de-stop word processing on the text data, setting the maximum word length to 64, cutting off the text data, and discarding the part exceeding the maximum word length;
step 122: building a "random mask" in combination with a "directed mask" policy, as follows:
step 1221: constructing a Boolean type text data mask descriptor S_T, wherein the length of the text data mask descriptor S_T is consistent with the length of words in the processed text data, initializing the text data by False, and defaulting the text data mask descriptor S_T without any masking operation;
step 1222: randomly selecting 15% of the positions in the S_T according to Bernoulli distribution and marking with True;
step 1223: recording class labels imgs_cls of all images, traversing text data, and correspondingly marking the positions containing imgs_cls on S_T with True to focus on class information of targets in the follow-up masking operation;
Step 1224: constructing a quantity information descriptor S_N, storing English digital words in 0-10, traversing text data, and correspondingly marking the position containing S_N on S_T with True to focus on the quantity information of the target in the subsequent masking operation;
step 1225: for all positions marked True in S_T, the text data word is replaced with the [MASK] placeholder with 80% probability; with 10% probability the original word is replaced with a random other word; the remaining 10% of the words are left unchanged (a sketch of this masking procedure follows this list).
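A minimal sketch of the "random mask + directed mask" strategy of steps 1221-1225; the tokenisation, class-label list and number-word set are illustrative.

import random

NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}         # S_N

def build_mask(tokens, image_classes, vocab, p_random=0.15):
    """tokens: caption word list; image_classes: class labels (imgs_cls); vocab: words for random replacement."""
    s_t = [False] * len(tokens)                                  # step 1221
    for i, tok in enumerate(tokens):
        if random.random() < p_random:                           # step 1222: Bernoulli(0.15)
            s_t[i] = True
        if tok in image_classes or tok in NUMBER_WORDS:          # steps 1223-1224: directed mask
            s_t[i] = True
    masked, labels = [], []
    for tok, flag in zip(tokens, s_t):
        if not flag:
            masked.append(tok)
            labels.append(None)
            continue
        labels.append(tok)                                       # prediction target for the MLM loss
        r = random.random()                                      # step 1225
        if r < 0.8:
            masked.append("[MASK]")
        elif r < 0.9:
            masked.append(random.choice(vocab))
        else:
            masked.append(tok)
    return masked, labels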
Preferably, the remote sensing image visual encoder in step 2 specifically includes:
in the remote sensing image-text cross-modal retrieval task, the amount and complexity of the semantic information contained in the image data are far higher than those of the corresponding text data, so in the visual encoder F_enc-V a module M_cnn-mvsa for extracting global image features and a module M_vit for extracting local features are designed, together with a module M_midf for fusing the global and local features, specifically as follows:
wherein: f_IG represents the global features of the remote sensing image; M_cnn-mvsa represents the remote sensing image global feature extraction module; M_vit represents the remote sensing image local feature extraction module; M_midf represents the module fusing the global and local features;
the remote sensing image global feature extraction module M_cnn-mvsa uses a ResNet-50 residual convolutional neural network as the feature extractor and optimizes the feature extraction with a multi-scale self-attention model; the whole image cleaned and enhanced in step 1 is input into M_cnn-mvsa to obtain f_IG;
the remote sensing image local feature extraction module M_vit uses a 6-layer Vision Transformer model as the feature extractor; the serialized image data from step 1 are input into M_vit for feature extraction to obtain f_IL;
the fusion module M_midf of the global and local features is a linear function that adds the features f_IG and f_IL linearly, where the first fusion parameter a and the second fusion parameter b are obtained during training, as follows:
M_midf(f_IG, f_IL) = a·f_IG + b·f_IL;
wherein: M_midf represents the fusion module of the global and local features; a represents the first fusion parameter; b represents the second fusion parameter.
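A minimal sketch of the global-local fusion M_midf with learnable scalars a and b; the initial values are assumptions, and the global (ResNet-50 + multi-scale self-attention) and local (6-layer Vision Transformer) extractors are taken as given modules.

import torch
import torch.nn as nn

class GlobalLocalFusion(nn.Module):
    """M_midf(f_IG, f_IL) = a * f_IG + b * f_IL with a and b learned during training."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(0.5))   # first fusion parameter (initial value assumed)
        self.b = nn.Parameter(torch.tensor(0.5))   # second fusion parameter (initial value assumed)

    def forward(self, f_ig: torch.Tensor, f_il: torch.Tensor) -> torch.Tensor:
        return self.a * f_ig + self.b * f_il       # local-global fusion feature f_IGL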
Preferably, the remote sensing image description language feature encoder in the step 2 specifically includes:
the remote sensing image description language feature encoder F_enc-L comprises a Bert-based text encoder M_bert, as follows:
f_T = M_bert(T);
wherein: M_bert represents a standard Bert model;
during training, M_bert adjusts its feature expression according to the labelled data and thereby strengthens its feature expression capability; the masked text data obtained in step 1 are used as training data and input into the model M_bert. After model training is completed, in the application stage M_bert expresses the user-input text data as a feature vector with semantics.
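A hedged sketch of the text encoder M_bert; the Hugging Face transformers package is an illustrative choice, and since the text states that pre-trained parameters are not used, the model is built from a default configuration and initialised randomly (only the tokenizer vocabulary is borrowed, which is an assumption).

import torch
from transformers import BertConfig, BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")   # vocabulary only (assumption)
m_bert = BertModel(BertConfig())                                     # randomly initialised standard BERT

def encode_text(caption: str) -> torch.Tensor:
    """f_T = M_bert(T): one semantic feature vector per caption (here the [CLS] embedding)."""
    tokens = tokenizer(caption, truncation=True, max_length=64, return_tensors="pt")
    return m_bert(**tokens).last_hidden_state[:, 0]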
Preferably, the multi-mode encoder based on the vision-language fusion model in the step 2 specifically includes:
the multi-modal encoder F_enc-Mul based on the vision-language fusion model comprises two modules: M_cms for calculating the cross-modal vector distance and M_vlf for vision-language fusion, as follows:
F_enc-Mul = {M_cms, M_vlf};
wherein: M_cms represents the module for calculating the distance similarity between image feature vectors and text feature vectors, measured with the cosine distance; M_vlf represents the module used for vision-language fusion; F_enc-Mul represents the multi-modal encoder based on the vision-language fusion model;
the M_vlf module is initialized with the last 6 layers of the Bert-based model and models visual-language interaction with additional cross-attention layers.
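The sketch below is one plausible minimal realisation of M_cms and M_vlf; the hidden size, head count and residual wiring are assumptions (the text only states that M_vlf is initialised from the last 6 layers of the Bert-based model and adds cross-attention layers).

import torch
import torch.nn as nn

class CrossModalSimilarity(nn.Module):
    """M_cms: cosine similarity between image and text feature vectors."""
    def forward(self, f_img: torch.Tensor, f_txt: torch.Tensor) -> torch.Tensor:
        return nn.functional.cosine_similarity(f_img, f_txt, dim=-1)

class VisionLanguageFusion(nn.Module):
    """M_vlf: cross-attention layers in which the text attends to image tokens,
    followed by a fully connected binary (match / no-match) head."""
    def __init__(self, dim: int = 768, num_layers: int = 6, num_heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)
        )
        self.itm_head = nn.Linear(dim, 2)    # image-text matching classifier

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        hidden = text_tokens
        for layer in self.layers:
            attended, _ = layer(hidden, image_tokens, image_tokens)   # text attends to image
            hidden = hidden + attended
        return self.itm_head(hidden[:, 0])   # logits used to derive S_pairwise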
Preferably, in the step 3, a multi-objective comprehensive supervision optimization method including four loss functions is constructed, which specifically includes:
Step 31: the triplet loss is used for learning the representation space of the image features and the text features, and the loss is calculated by comparing the distances between the three comparison sample features; constructing a first triplet loss function L between a text feature encoder and an image feature encoder itt1 The method comprises the steps of carrying out a first treatment on the surface of the The features encoded by the image local encoder contain more image details that are critical to the final image representation and subsequent training of the multi-modal encoder, building a th between the text feature encoder and the image local feature encoderTwo-triplet loss function L itt2 For deep supervision, the following is indicated:
wherein: l (L) itt1 Representing a first triplet loss function; epsilon represents the minimum margin for expanding the gap between the reference sample and the positive/negative sample pair; sim represents a similarity recall module; (I, T) representing the matched image-text pair features, generated by the image encoder and the text encoder, respectively; t≡represents text features that do not match image I; i≡represents image features that do not match text T; l (L) itt2 Representing a second triplet loss function; i loc Representing the image local encoder generation, the lower right hand corner of the equation has the meaning [ []When the internal value is larger than 0, the loss is taken as the loss, and when the internal value is smaller than 0, the loss is 0;
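The triplet-loss formula itself is not preserved in the text; the sketch below uses the standard bidirectional hinge form implied by the symbol definitions above (margin ε, similarity sim, matched pair (I, T), hardest unmatched T^ and I^) and should be read as an assumption. L_itt2 reuses the same form with the local image features I_loc in place of the fused features.

import torch

def triplet_loss(sim_matrix: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Bidirectional hinge triplet loss over a batch similarity matrix.

    sim_matrix[i, j] = sim(image_i, text_j); diagonal entries are the matched pairs;
    margin plays the role of the minimum margin epsilon (value assumed).
    """
    pos = sim_matrix.diag().unsqueeze(1)                       # sim(I, T) for matched pairs
    cost_i2t = (margin - pos + sim_matrix).clamp(min=0)        # [eps - sim(I,T) + sim(I,T^)]_+
    cost_t2i = (margin - pos.t() + sim_matrix).clamp(min=0)    # [eps - sim(I,T) + sim(I^,T)]_+
    mask = torch.eye(sim_matrix.size(0), dtype=torch.bool, device=sim_matrix.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.max(dim=1).values.mean() + cost_t2i.max(dim=0).values.mean()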
step 32: an image-text matching loss function L_itm is constructed between the multi-modal encoder and the image feature encoder; whether a pair of input image and text matches is predicted by a fully connected binary classification layer; in the selection of positive/negative samples, the hard sample closest to the positive sample within a single batch, as computed by the similarity recall module, is taken as the negative sample to enhance the learning ability of the multi-modal module; the loss is calculated as follows:
L_itm = -[y_itm·log p_itm(I, T) + (1 - y_itm)·log(1 - p_itm(I, T))];
wherein: L_itm represents the image-text matching loss function constructed between the multi-modal encoder and the image feature encoder; y_itm represents the matching label of the constructed image-text pair; p_itm(I, T) represents the matching probability of the image-text pair;
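A minimal sketch of the image-text matching objective of step 32: in-batch hard-negative mining followed by cross-entropy over the matching head; the interfaces are illustrative.

import torch
import torch.nn.functional as F

def mine_hard_negatives(sim_matrix: torch.Tensor) -> torch.Tensor:
    """For each image, pick the non-matching text with the highest similarity
    (the in-batch hard negative described in step 32)."""
    sim = sim_matrix.clone()
    sim.fill_diagonal_(float("-inf"))      # exclude the positive pair
    return sim.argmax(dim=1)               # index of the hardest negative text per image

def itm_loss(match_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """L_itm: cross-entropy over the fully connected matching head of the multi-modal encoder.
    match_logits: (N, 2) logits; labels: 1 for matched pairs, 0 for mined hard negatives."""
    return F.cross_entropy(match_logits, labels)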
step 33: constructing a mask language model loss function L between a multi-modal encoder and an image local feature encoder mlm The text after mask processing is represented by T and the model prediction probability by p msk (I, T≡) as follows:
L mlm =-y msk log(p msk (I,T^))+(1-y msk )log(1-p msk (I,T^));
wherein: l (L) mlm Representing a mask language model loss function constructed between the multi-modal encoder and the image local feature encoder; y is msk Representing a predictive probability of an image-text pair; p is p msk Representing model predictive probabilities;
step 34: the combination strategy of the four loss functions affects the final expressive effect of the model, and appropriate weight coefficients must be assigned for multi-objective joint optimization to prevent a single task from dominating the joint learning; therefore a dynamic update strategy for the loss weights is adopted: for each training objective, the ratio of the current loss to the initial loss is considered during each training round, and a hyperparameter λ is introduced to balance the effect of the weights, as follows:
wherein: θ_t represents the weight of task t calculated by the formula; L_i(t) represents the loss value of task t calculated during the current mini-batch iteration; L_0(t) represents the loss value of task t during the initial iteration; λ represents the weight used to balance each task, set to 0.5; i represents the index of each training step.
Preferably, the image-to-text recall R_i2t@N and the text-to-image recall R_t2i@N in step 41 are specifically as follows:
the image-to-text recall is as follows:
wherein: text_1, text_2, …, text_N represent the 1st, 2nd, …, Nth candidate text samples returned by the retrieval algorithm;
the text-to-image recall is as follows:
wherein: image_1, image_2, …, image_N represent the 1st, 2nd, …, Nth candidate image samples returned by the retrieval algorithm.
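The recall formulas are not preserved in the text; the sketch below computes Recall@N for either direction under the standard reading (a hit if the ground-truth item appears among the top-N returned candidates), which is an assumption to that extent. Averaging the per-query values gives mR@N.

import numpy as np

def recall_at_n(query_feats: np.ndarray, gallery_feats: np.ndarray,
                gt_indices: np.ndarray, n: int) -> float:
    """Recall@N for one retrieval direction (image-to-text or text-to-image).

    query_feats  : (Q, D) L2-normalised query features
    gallery_feats: (G, D) L2-normalised gallery features
    gt_indices   : (Q,) index of the ground-truth gallery item for each query
    """
    sims = query_feats @ gallery_feats.T                   # cosine similarities
    top_n = np.argsort(-sims, axis=1)[:, :n]               # indices of the N best candidates
    hits = (top_n == gt_indices[:, None]).any(axis=1)
    return float(hits.mean())

# e.g. R_i2t@5 = recall_at_n(image_feats, text_feats, gt, 5)
#      R_t2i@5 = recall_at_n(text_feats, image_feats, gt, 5)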
Preferably, the cross-modal retrieval of the remote sensing image-text description in step 5 comprises a search-image-by-text process and a search-text-by-image process, specifically:
the search-image-by-text process is as follows: when a text description is input, the constructed cross-modal retrieval model first uses the text coding module to calculate the features of the input text, then uses the similarity judgment recall module to calculate the similarity between the text features and each image feature in the image feature database built in step 4, and returns the top-1, top-5 and top-10 similar features as the preliminary retrieval result; the multi-modal reordering module then performs a secondary correction on the preliminary result by calculating the matching probability between each returned image and the input text, thereby fine-tuning the preliminary retrieval result;
the search-text-by-image process is as follows: when an image is input, the constructed cross-modal retrieval model first uses the image coding module to calculate the features of the input image, then uses the similarity judgment recall module to calculate the similarity between the image features and each text feature in the text feature database built in step 4, and returns the top-1, top-5 and top-10 similar features as the preliminary retrieval result; the multi-modal reordering module then performs a secondary correction on the preliminary result by calculating the matching probability between each returned text and the input image, thereby fine-tuning the preliminary retrieval result.
Compared with the prior art, the invention has the beneficial effects that:
(1) According to the invention, a remote sensing image cross-mode retrieval framework comprising two single-mode encoders and one multi-mode encoder is designed, the single-mode encoder is utilized to respectively express image and text characteristics, the multi-mode encoder is utilized to perform fusion learning on the characteristics of two modes, the expression capacity of each encoder on fine granularity semantic characteristics of corresponding mode data is improved through characteristic fusion and multi-task optimization training, and the cross-mode retrieval task is completed through similarity calculation of the semantic characteristics;
(2) In order to express the detail features of the image and eliminate the training overhead brought by a pre-trained target recognition model, the invention designs a shallow visual Transformer model to extract the local features of the image, and converts the pipelined 'target detection and retrieval' process into an end-to-end training process; the end-to-end framework closes the gap between the target detector and the retrieval model training process, and reduces the training cost of the whole retrieval model;
(3) According to the invention, a set of multi-objective optimization strategies is designed for the model, and the whole model is trained under these strategies, so that the model acquires multi-detail feature expression capability for remote sensing images and text descriptions; the converged model completes the end-to-end text-image retrieval task without image tags.
Drawings
FIG. 1 is a control block diagram of a remote sensing image cross-mode retrieval method based on language and visual detail feature fusion in an embodiment of the invention;
FIG. 2 is a diagram of a model training process according to an embodiment of the present invention;
FIG. 3 is a cross-modal retrieval process diagram of an embodiment of the present invention;
FIG. 4 is a diagram of a text search process according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a text searching process according to an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
According to the embodiment of the invention, a case study is used for the analysis. In the design of the visual encoder, the target detection capability of a visual Transformer module trained together with the multi-modal encoder is introduced, converting the pipelined training process into an end-to-end training process, which greatly reduces the required amount of training data and training time and simplifies model construction. In the model training process, multi-modal fusion learning and the optimization of multi-objective tasks give the model the capability to handle multi-detail tasks, and compared with other related methods the model has better retrieval performance. The whole retrieval flow is optimized: after the recall task is finished, a reordering model based on the language and vision fusion model is introduced, which can further optimize the recall results at a small computational cost and improve the top-1 and top-5 ranking performance. Fig. 1 is a control block diagram of the remote sensing image cross-modal retrieval method based on language and visual detail feature fusion according to an embodiment of the invention.
The embodiment of the invention provides a remote sensing image cross-modal retrieval method based on language and visual detail feature fusion, and a model training process diagram of the embodiment of the invention is shown in fig. 2; to illustrate the applicability of the invention, it is applied to examples, comprising in particular the following steps:
S1: and processing training data of the remote sensing image-text retrieval model.
The training data of the remote sensing image-text retrieval model is composed based on image data and text data, wherein the image data is an actual measurement image, and the text data is descriptive information corresponding to the remote sensing image.
The processing procedure of the image data is as follows:
s111: all image data is uniformly resized to 278x278x 3.
S112: the input image data is subjected to basic data enhancement according to 50% probability, including random rotation and random inversion so as to enhance the generalization capability of the model.
S113: The amount of input image data is expanded with an image data splicing method: according to the category labels, two images I_a and I_b of the same category are randomly selected and superimposed at pixel level, and for the text part the descriptions are directly concatenated, splicing T_b after T_a.
S114: the image data after S112 and S113 is randomly cropped, and the cropping area is 256×256×3 in size to accommodate the input of the subsequent neural network model.
S115: carrying out normalization processing on the image data after the S114 to convert the gray scale range of the image to between 0 and 1;
the text data is processed as follows:
S121: the text data is subjected to de-stop word processing while setting the maximum word length to 64, the text data is truncated, and the portion exceeding the maximum word length is discarded.
S122: building a "random mask" in combination with a "directed mask" policy, as follows:
s1221: the Boolean type text data mask descriptor S_T is constructed, the length is consistent with the length of words in the processed text data, initialization is carried out by False, and no masking operation is carried out by default.
S1222: the 15% position in s_t was randomly selected according to the bernoulli distribution and marked with True.
S1223: the category labels imgs_cls of all images are recorded, text data are traversed, and the positions containing the imgs_cls are correspondingly marked with True on the S_T so as to focus on the category information of the target in the subsequent masking operation.
S1224: constructing a quantity information descriptor S_N, storing English digital words in 0-10, traversing text data, and correspondingly marking the position containing S_N on S_T with True to focus on the quantity information of the target in the subsequent masking operation.
S1225: for the positions marked by True in all S_T, performing [ MASK ] blank replacement on text data words according to the probability of 80%; replacing random words according to 10% probability, and replacing original words with random any other words; the remaining 10% of the words are not replaced.
First, the number of words len_words and the number of stop words len_stops in each text are obtained; all image-text pairs satisfying len_words = len_stops or len_words = len_stops + 1 are deleted, which avoids stop words disturbing the training of the retrieval model; then the image data and the corresponding text data are processed; finally, the cleaned remote sensing image-text retrieval training data are used for training the image local encoder and the global encoder.
S2: and constructing a multi-detail language and vision fusion model.
The cross-modal retrieval based on image-text features mainly extracts and expresses the features of the image data and of the text data separately, and, by optimizing the expression capability of each extractor, makes the features of semantically similar image data and text data lie at a minimum distance in the vector space; fig. 3 is a cross-modal retrieval process diagram of the embodiment of the invention. The overall architecture of the multi-detail language and vision fusion model comprises: a remote sensing image visual encoder F_enc-V, a remote sensing image description language feature encoder F_enc-L, and a multi-modal encoder F_enc-Mul based on a vision-language fusion model, as follows:
wherein: I represents the input remote sensing image data; T represents the input text data; f_IL represents the local features of the remote sensing image; f_IGL represents the local-global fusion features of the remote sensing image; f_T represents the text features of the remote sensing image description; S_distance represents the distance similarity between feature vectors; S_pairwise represents the matching probability value of an image-text pair; F_enc-V represents the remote sensing image visual encoder; F_enc-L represents the remote sensing image description language feature encoder; F_enc-Mul represents the multi-modal encoder based on the vision-language fusion model.
In the remote sensing image-text cross-modal retrieval task, the amount and complexity of the semantic information contained in the image data are far higher than those of the corresponding text data, so in the visual encoder F_enc-V a module M_cnn-mvsa for extracting global image features and a module M_vit for extracting local features are designed, together with a module M_midf for fusing the global and local features, specifically as follows:
wherein: f_IG represents the global features of the remote sensing image; M_cnn-mvsa represents the remote sensing image global feature extraction module; M_vit represents the remote sensing image local feature extraction module; M_midf represents the module fusing the global and local features.
The remote sensing image global feature extraction module M_cnn-mvsa uses a ResNet-50 residual convolutional neural network as the feature extractor and optimizes the feature extraction with a multi-scale self-attention model; the whole image cleaned and enhanced in S1 is input into M_cnn-mvsa to obtain f_IG.
The remote sensing image local feature extraction module M_vit uses a 6-layer Vision Transformer model as the feature extractor; the serialized image data from S1 are input into M_vit for feature extraction to obtain f_IL.
The fusion module M_midf of the global and local features is a linear function that adds the features f_IG and f_IL linearly, where the first fusion parameter a and the second fusion parameter b are obtained during training, as follows:
M_midf(f_IG, f_IL) = a·f_IG + b·f_IL;
wherein: M_midf represents the fusion module of the global and local features; a represents the first fusion parameter; b represents the second fusion parameter.
The remote sensing image description language feature encoder F_enc-L comprises a Bert-based text encoder M_bert, as follows:
f_T = M_bert(T);
wherein: M_bert represents a standard Bert model.
During training, M_bert adjusts its feature expression according to the labelled data and thereby strengthens its feature expression capability; the masked text data obtained in S1 are used as training data and input into the model M_bert. After model training is completed, in the application stage M_bert expresses the user-input text data as a feature vector with semantics.
The multi-modal encoder F_enc-Mul based on the vision-language fusion model comprises two modules: M_cms for calculating the cross-modal vector distance and M_vlf for vision-language fusion, as follows:
F_enc-Mul = {M_cms, M_vlf};
wherein: M_cms represents the module for calculating the distance similarity between image feature vectors and text feature vectors, measured with the cosine distance; M_vlf represents the module used for vision-language fusion; F_enc-Mul represents the multi-modal encoder based on the vision-language fusion model.
The M_vlf module is initialized with the last 6 layers of the Bert-based model and models visual-language interaction with additional cross-attention layers.
S3: and training a detail language and vision fusion model of multi-objective optimization.
A multi-target comprehensive supervision optimization method comprising four loss functions is constructed, and a deep supervision strategy is introduced into the middle branch.
S31: The triplet loss is used to learn the representation space of the image features and the text features; the loss is calculated by comparing the distances between the features of three contrasting samples. A first triplet loss function L_itt1 is constructed between the text feature encoder and the image feature encoder. The features encoded by the image local encoder contain more image details that are critical to the final image representation and to the subsequent training of the multi-modal encoder, so a second triplet loss function L_itt2 is constructed between the text feature encoder and the image local feature encoder for deep supervision, expressed as follows:
wherein: L_itt1 represents the first triplet loss function; ε represents the minimum margin used to enlarge the gap between the reference sample and the positive/negative sample pair; sim represents the similarity recall module; (I, T) represents the features of a matched image-text pair, generated by the image encoder and the text encoder respectively; T^ represents text features that do not match image I; I^ represents image features that do not match text T; L_itt2 represents the second triplet loss function; I_loc represents the features generated by the image local encoder; the subscripted brackets [ ]_+ mean that the value inside the brackets is taken as the loss when it is greater than 0 and the loss is 0 when it is smaller than 0.
S32: An image-text matching loss function L_itm is constructed between the multi-modal encoder and the image feature encoder; whether a pair of input image and text matches is predicted by a fully connected binary classification layer; in the selection of positive/negative samples, the hard sample closest to the positive sample within a single batch, as computed by the similarity recall module, is taken as the negative sample to enhance the learning ability of the multi-modal module; the loss is calculated as follows:
L_itm = -[y_itm·log p_itm(I, T) + (1 - y_itm)·log(1 - p_itm(I, T))];
wherein: L_itm represents the image-text matching loss function constructed between the multi-modal encoder and the image feature encoder; y_itm represents the matching label of the constructed image-text pair; p_itm(I, T) represents the matching probability of the image-text pair.
S33: A masked language model loss function L_mlm is constructed between the multi-modal encoder and the image local feature encoder; the text after masking is denoted by T^ and the model prediction probability by p_msk(I, T^), as follows:
L_mlm = -[y_msk·log p_msk(I, T^) + (1 - y_msk)·log(1 - p_msk(I, T^))];
wherein: L_mlm represents the masked language model loss function constructed between the multi-modal encoder and the image local feature encoder; y_msk represents the ground-truth label of the masked word; p_msk represents the model prediction probability.
S34: The combination strategy of the four loss functions affects the final expressive effect of the model, and appropriate weight coefficients must be assigned for multi-objective joint optimization to prevent a single task from dominating the joint learning; therefore a dynamic update strategy for the loss weights is adopted: for each training objective, the ratio of the current loss to the initial loss is considered during each training round, and a hyperparameter λ is introduced to balance the effect of the weights, as follows:
wherein: θ_t represents the weight of task t calculated by the formula; L_i(t) represents the loss value of task t calculated during the current mini-batch iteration; L_0(t) represents the loss value of task t during the initial iteration; λ represents the weight used to balance each task, set to 0.5; i represents the index of each training step.
The model training process is completed, mainly optimizing the image encoder, the text encoder and the multi-modal encoder; the image data and text data processed in S1 are divided into a training set, a validation set and a test set according to a first ratio, set to 8:1:1 in a preferred embodiment, and the divided training set is fed into the model constructed in S2, where the model parameters are initialized from a normal distribution and no pre-trained parameters are used; the image encoder part is frozen while the image-text matching loss is calculated, focusing on optimizing the multi-modal encoder.
S4: and constructing a remote sensing image-text description feature library.
S41: in the retrieval task, the recall rate is used for representing the proportion of correct samples in the returned N candidate samples by the retrieval algorithm; first using R i2t Representing image-to-text retrieval recall and R t2i Representing a retrieval recall rate of text to images; then calculate the image-to-text recall R to retrieve top1, top5, top10 for two tasks on the validation set i2t @ N and text-to-image recall R t2i @N。
The image-to-text recall is as follows:
wherein: text_1, text_2, …, text_N represent the 1st, 2nd, …, Nth candidate text samples returned by the retrieval algorithm, respectively.
Text-to-image recall is as follows:
wherein: image_1, image_2, …, image_N represent the 1st, 2nd, …, Nth candidate image samples returned by the retrieval algorithm, respectively.
Finally, the averages mR_i2t@N and mR_t2i@N of the image-to-text recall R_i2t@N and the text-to-image recall R_t2i@N over the test samples are calculated, and the model with the highest recall is stored for the subsequent retrieval task; the specific calculation formula is as follows:
wherein: mR_i2t@N and mR_t2i@N represent the averages of the image-to-text recall R_i2t@N and of the text-to-image recall R_t2i@N over the test samples, respectively; Image_k represents the kth semantically matched image-text pair; Text_k represents the kth semantically matched text-image pair; R_i2t@N(Image_k) denotes R_i2t@N with Image_k as the query; R_t2i@N(Text_k) denotes R_t2i@N with Text_k as the query; k represents the index of the image-text pair; M represents the total number of image-text pairs; N represents the retrieval task number.
S42: and (3) constructing an image feature database, extracting features of all image data by using the trained image encoder in the step (S3), and storing the generated image features in the database so as to improve the retrieval efficiency in the subsequent application.
S43: and (3) constructing a text feature database, extracting features of all text data by using the trained text encoder in the step (S3), and storing the generated text features in the database.
S5: and (5) performing cross-modal retrieval of the remote sensing image-text description.
Cross-modal retrieval consists of four major modules: an image coding module, a text coding module, a similarity judgment recall module and a multi-modal reordering module; the image coding module and the text coding module are connected in parallel, followed in cascade by the similarity judgment recall module and the multi-modal reordering module; the cross-modal retrieval of the remote sensing image-text description, completed through these four modules, comprises a search-image-by-text process and a search-text-by-image process, specifically as follows:
The search-image-by-text process is as follows: when a text description is input, the constructed cross-modal retrieval model first uses the text coding module to calculate the features of the input text, then uses the similarity judgment recall module to calculate the similarity between the text features and each image feature in the image feature database built in S4, and returns the top-1, top-5 and top-10 similar features as the preliminary retrieval result; the multi-modal reordering module then performs a secondary correction on the preliminary result by calculating the matching probability between each returned image and the input text, thereby fine-tuning the preliminary retrieval result. Fig. 4 illustrates the search-image-by-text process, taking the return of the Top-5 similar results as an example in the embodiment of the invention.
The search-text-by-image process is as follows: when an image is input, the constructed cross-modal retrieval model first uses the image coding module to calculate the features of the input image, then uses the similarity judgment recall module to calculate the similarity between the image features and each text feature in the text feature database built in S4, and returns the top-1, top-5 and top-10 similar features as the preliminary retrieval result; the multi-modal reordering module then performs a secondary correction on the preliminary result by calculating the matching probability between each returned text and the input image, thereby fine-tuning the preliminary retrieval result. Fig. 5 illustrates the search-text-by-image process according to an embodiment of the invention, taking the return of the Top-5 similar results as an example.
The method has the advantages of three aspects:
in the training stage, the method converts the pipelined training process into an end-to-end training process, converts the traditional two-stage model training into one stage, and can train without depending on an additional remote sensing image high-quality annotation data set based on target recognition, and the model training time is half of that of the two-stage training method.
In the model training process, multi-modal fusion learning and the optimization of multi-objective tasks give the model the capability to handle multi-detail tasks and to align images and texts semantically; in the experimental comparison the method exceeds the majority of the conventional methods on the evaluation indexes mR_i2t@N and mR_t2i@N (N = 1, 5, 10), and as can be seen from the example analysis, the practical application effect of the method is better than that of conventional methods 1 and 2, as shown in Table 1 below.
TABLE 1 comparative analysis results of the present method and the conventional method
After the recall task is completed, a reordering model based on the language and vision fusion model is introduced; it can further optimize the recall results at a small computational cost and further improve the top-1 and top-5 ranking performance. The comparison between the configurations with and without the reordering model is shown in Table 2 below.
TABLE 2 comparative analysis results with and without reordering model
In conclusion, the retrieval results of the remote sensing image cross-modal retrieval method based on the fusion of language and visual detail features prove to have a good effect.
(1) According to the embodiment of the invention, a remote sensing image cross-mode retrieval framework comprising two single-mode encoders and one multi-mode encoder is designed, the single-mode encoder is used for respectively representing image and text characteristics, the multi-mode encoder is used for carrying out fusion learning on the characteristics of two modes, the expression capacity of each encoder on fine granularity semantic characteristics of corresponding mode data is improved through characteristic fusion and multi-task optimization training, and the cross-mode retrieval task is completed through similarity calculation of the semantic characteristics.
(2) In order to express the detail features of the image and eliminate the training overhead brought by a pre-trained target recognition model, the embodiment of the invention extracts the local features of the image by designing a shallow visual Transformer model, and converts the pipelined 'target detection and retrieval' process into an end-to-end training process; the end-to-end framework closes the gap between the target detector and the retrieval model training process, and reduces the training cost of the whole retrieval model.
(3) According to the embodiment of the invention, a set of multi-objective optimization strategies is designed for the model, and the whole model is trained under these strategies, so that the model acquires multi-detail feature expression capability for remote sensing images and text descriptions; the converged model completes the end-to-end text-image retrieval task without image tags.
The above examples are only illustrative of the preferred embodiments of the present invention and are not intended to limit the scope of the present invention, and various modifications and improvements made by those skilled in the art to the technical solution of the present invention should fall within the scope of protection defined by the claims of the present invention without departing from the spirit of the present invention.

Claims (8)

1. A remote sensing image cross-mode retrieval method based on language and visual detail feature fusion is characterized by comprising the following steps:
step 1: processing training data of a remote sensing image-text retrieval model;
the training data of the remote sensing image-text retrieval model comprises image data and text data, wherein the image data are actual measured images and the text data are the descriptive information corresponding to the remote sensing images; first, the number of words len_words and the number of stop words len_stops in each text are obtained; all image-text pairs satisfying len_words = len_stops or len_words = len_stops + 1 are deleted, avoiding stop words disturbing the training of the retrieval model; then the image data and the corresponding text data are processed; finally, the cleaned remote sensing image-text retrieval training data are used for training the image local encoder and the global encoder;
step 2: constructing a multi-detail language and vision fusion model;
the cross-modal retrieval based on the image-text features is to extract and express the features of the image data and the text data respectively, and the features of the image data and the text data with semantic similarity are represented to have the minimum distance in a vector space by optimizing the expression capacity of each extractor; the overall architecture based on the multi-detail language and vision fusion model comprises the following steps: remote sensing image visual encoder F enc-V Remote sensing image description language feature encoder F enc-L And a multimodal encoder F based on a vision-language fusion model enc-Mul The following is shown:
wherein: i represents input remote sensing image data; t represents input text data; f (f) IL Representing local features of the remote sensing image; f (f) IGL Representing local-global fusion characteristics of the remote sensing image; f (f) T Representing the text characteristics of the remote sensing image description; s is S distance Representing distance similarity between feature vectors; s is S pairwise A matching probability value representing an image-text pair; f (F) enc-V Representing a remote sensing image visual encoder; f (F) enc-L Representing a remote sensing image description language feature encoder; f (F) enc-Mul Representing a multimodal encoder based on a vision-language fusion model;
step 3: training a detail language and vision fusion model of multi-objective optimization;
constructing a multi-target comprehensive supervision optimization method comprising four loss functions, and introducing a deep supervision strategy into the middle branch; completing a model training process, wherein the training process is mainly optimized for an image encoder, a text encoder and a multi-mode encoder; dividing the image data and the text data processed in the step 1 into a training set, a verification set and a test set according to a first proportion value, and sending the divided training set into a model constructed in the step 2, wherein model parameters are initialized by adopting normal distribution, and pre-training parameters are not used; freezing the image encoder portion while calculating the image-text match loss, focusing on optimizing the multi-modal encoder; 8:1:1 mode
Step 4: constructing a remote sensing image-text description feature library;
step 41: in the retrieval task, the recall rate is used for representing the proportion of correct samples in the returned N candidate samples by the retrieval algorithm; first using R i2t Representing image-to-text retrieval recall and R t2i Representing a retrieval recall rate of text to images; then calculate the image-to-text recall R to retrieve top1, top5, top10 for two tasks on the validation set i2t @ N and text-to-image recall R t2i @N; finally, calculating the recall rate R from the image to the text i2t @ N and text-to-image recall R t2i @ N in test samplesAverage value mR of (2) i2t @N and mR i2t And @ N, and storing a model with the highest recall rate for a subsequent retrieval task, wherein a specific calculation formula is as follows:
wherein: mR (mR) i2t @N and mR t2i @N represents the image-to-text recall rate R, respectively i2t @ N and text-to-image recall R t2i Average value of @ N in test samples; image k Representing the kth image-text pair with similar semantics; text k Representing a kth text-image pair having similar semantics; r is R i2t @N(Image k ) Representing an input Image k ;R t2i @N(Text k ) Representing Text of input Text k The method comprises the steps of carrying out a first treatment on the surface of the k represents the image and text pair number; m represents the total number of image and text pairs; n represents the search task number;
step 42: constructing an image feature database, extracting features of all image data by using the trained image encoder in the step 3, and storing the generated image features in the database so as to improve the retrieval efficiency in the subsequent application;
Step 43: constructing a text feature database, extracting features of all text data by using the trained text encoder in the step 3, and storing the generated text features in the database;
step 5: completing cross-modal retrieval of remote sensing image-text description;
the cross-modal retrieval includes four modules: the device comprises an image coding module, a text coding module, a similarity judging recall module and a multi-mode reordering module; the image coding module is connected with the text coding module in parallel, and then is connected with the similarity judging recall module and the multi-mode reordering module in a cascading manner; and the cross-modal retrieval of the remote sensing image-text description is completed through the four modules.
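The following is a minimal, non-limiting Python sketch of the stop-word cleaning rule of step 1, assuming a whitespace tokenizer and a small illustrative stop-word list (the actual stop-word vocabulary and tokenizer are not specified by the claim):

    # Illustrative sketch: drop image-text pairs whose caption is (almost) all stop words.
    # The stop-word set and the whitespace tokenizer are assumptions for illustration only.
    STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "are", "and", "there"}

    def keep_pair(caption: str) -> bool:
        words = caption.lower().split()
        len_words = len(words)
        len_stops = sum(1 for w in words if w in STOP_WORDS)
        # delete pairs satisfying len_words == len_stops or len_words == len_stops + 1
        return not (len_words == len_stops or len_words == len_stops + 1)

    pairs = [("img_001.tif", "there is a the of"),
             ("img_002.tif", "two planes are parked near the terminal")]
    cleaned = [(img, txt) for img, txt in pairs if keep_pair(txt)]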
2. The remote sensing image cross-modal retrieval method based on language and visual detail feature fusion according to claim 1, wherein the processing of the image data and the corresponding text data in step 1 is specifically as follows:
the processing procedure of the image data is as follows:
step 111: uniformly adjusting all image data to 278×278×3;
step 112: applying basic data enhancement to the input image data with 50% probability, comprising random rotation and random flipping, to enhance the generalization capability of the model;
step 113: expanding the data amount of the input image data by an image data splicing method: two images I_a and I_b of the same category are randomly selected according to the category labels and superimposed at the pixel level; for the text part, the text descriptions are directly spliced, with T_b appended directly after T_a;
step 114: randomly cropping the image data after steps 112 and 113, wherein the cropped area is 256×256×3 to fit the input of the subsequent neural network model;
step 115: normalizing the image data after step 114 to transform the gray scale range of the image to between 0 and 1;
the text data are processed as follows:
step 121: performing stop-word removal on the text data, setting the maximum word length to 64, truncating the text data and discarding the part exceeding the maximum word length;
step 122: building a "random mask" combined with a "directed mask" policy, as follows:
step 1221: constructing a Boolean text data mask descriptor S_T whose length is consistent with the number of words in the processed text data; S_T is initialized with False, meaning that by default no masking operation is applied;
step 1222: randomly selecting 15% of the positions in S_T according to a Bernoulli distribution and marking them with True;
step 1223: recording the class labels imgs_cls of all images, traversing the text data, and marking the corresponding positions containing imgs_cls on S_T with True, so that the subsequent masking operation focuses on the class information of the targets;
step 1224: constructing a quantity information descriptor S_N storing the English number words from 0 to 10, traversing the text data, and marking the corresponding positions containing S_N on S_T with True, so that the subsequent masking operation focuses on the quantity information of the targets;
step 1225: for all positions marked True in S_T, the text data words are replaced with the [MASK] placeholder with 80% probability; with 10% probability the word is replaced with a random word, i.e., the original word is replaced with any other random word; the remaining 10% of the words are left unchanged.
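A hedged Python sketch of the combined "random mask" and "directed mask" strategy of step 122, assuming whitespace-tokenized captions; the class-label set, the replacement vocabulary and the [MASK] handling of a real tokenizer are simplified assumptions:

    import random

    NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five",
                    "six", "seven", "eight", "nine", "ten"}          # S_N from step 1224

    def build_mask(words, imgs_cls, p_random=0.15):
        """Return the Boolean mask descriptor S_T for one caption (True = mask candidate)."""
        s_t = [False] * len(words)                       # step 1221: initialised with False
        for i, w in enumerate(words):
            if random.random() < p_random:               # step 1222: Bernoulli(0.15) positions
                s_t[i] = True
            if w in imgs_cls or w in NUMBER_WORDS:       # steps 1223/1224: directed masking
                s_t[i] = True
        return s_t

    def apply_mask(words, s_t, vocab):
        """Step 1225: 80% [MASK], 10% random word, 10% keep, for the True positions."""
        out = list(words)
        for i, flagged in enumerate(s_t):
            if not flagged:
                continue
            r = random.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = random.choice(vocab)
            # else: keep the original word unchanged
        return out

    caption = "two white planes parked near the airport terminal".split()
    masked = apply_mask(caption, build_mask(caption, imgs_cls={"airport", "planes"}), vocab=caption)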
3. The method for cross-modal retrieval of remote sensing images based on fusion of language and visual detail features according to claim 1, wherein the remote sensing image visual encoder in step 2 specifically comprises:
in the remote sensing image-text cross-modal retrieval task, the amount and complexity of the semantic information contained in the image data are higher than those of the corresponding text data; therefore, a module M_cnn-mvsa for extracting global image features and a module M_vit for extracting local features are designed in the visual encoder F_enc-V, and a module M_midf for fusing the global features and the local features is designed at the same time, specifically as follows:
f_IG = M_cnn-mvsa(I)
f_IL = M_vit(I)
f_IGL = M_midf(f_IG, f_IL)
wherein: f_IG represents the global features of the remote sensing image; M_cnn-mvsa represents the remote sensing image global feature extraction module; M_vit represents the remote sensing image local feature extraction module; M_midf represents the module for fusing the global features and the local features;
the remote sensing image global feature extraction module M_cnn-mvsa takes a ResNet-50 residual convolutional neural network as the feature extractor and optimizes the feature extraction effect with a multi-scale self-attention model; the whole image cleaned and enhanced in step 1 is input into M_cnn-mvsa to obtain f_IG;
the remote sensing image local feature extraction module M_vit uses a 6-layer Vision Transformer model as the feature extractor; the image data serialized in step 1 are input into M_vit for feature extraction to obtain f_IL;
the fusion module M_midf of the global features and the local features is a linear function that adds the features f_IG and f_IL linearly, wherein a first fusion parameter a and a second fusion parameter b are learned during training, as follows:
M_midf(f_IG, f_IL) = a·f_IG + b·f_IL
wherein: M_midf represents the fusion module of the global features and the local features; a represents the first fusion parameter; b represents the second fusion parameter.
4. The method for cross-modal retrieval of remote sensing images based on fusion of language and visual detail features according to claim 1, wherein the remote sensing image description language feature encoder in step 2 specifically comprises:
the remote sensing image description language feature encoder F_enc-L comprises a BERT-based text encoder M_bert, as follows:
f_T = M_bert(T);
wherein: M_bert represents a standard BERT model;
during training, M_bert adjusts its feature expression characteristics according to the labeled data, which enhances the feature expression capability of the model; the text data masked in step 1 are used as training data and input into the model M_bert; after model training is completed, in the application stage M_bert expresses user-input text data as semantic feature vectors.
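A hedged sketch of how the BERT-based text encoder M_bert could express a caption as a semantic feature vector using the Hugging Face transformers library; the checkpoint name and the use of the [CLS] hidden state as f_T are assumptions, not requirements of the claim:

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint
    bert = BertModel.from_pretrained("bert-base-uncased")

    def encode_text(caption: str) -> torch.Tensor:
        """f_T = M_bert(T): return one semantic feature vector for a caption."""
        inputs = tokenizer(caption, return_tensors="pt", truncation=True, max_length=64)
        with torch.no_grad():
            outputs = bert(**inputs)
        return outputs.last_hidden_state[:, 0]           # [CLS] token feature (assumption)

    f_t = encode_text("many planes are parked next to a long airport terminal")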
5. The remote sensing image cross-modal retrieval method based on the fusion of language and visual detail features according to claim 1, wherein the multi-modal encoder based on the visual-language fusion model in step 2 specifically comprises:
the multi-modal encoder F_enc-Mul based on the vision-language fusion model comprises two modules, M_cms for calculating the cross-modal vector distance and M_vlf for vision-language fusion, as follows:
F_enc-Mul = {M_cms, M_vlf};
wherein: M_cms represents the module for calculating the distance similarity between the image feature vectors and the text feature vectors, with the distance measured by cosine distance; M_vlf represents the module used for vision-language fusion; F_enc-Mul represents the multi-modal encoder based on the vision-language fusion model;
the M_vlf module is initialized with the last 6 layers of the BERT-based model and models vision-language interaction with additional cross-attention layers.
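A minimal sketch of the cross-modal distance module M_cms, which scores image-text pairs by cosine similarity between L2-normalised feature vectors; batch sizes and dimensions are illustrative:

    import torch
    import torch.nn.functional as F

    def cosine_similarity_matrix(img_feats, txt_feats):
        """M_cms: pairwise cosine similarity between image and text feature vectors."""
        img = F.normalize(img_feats, dim=-1)
        txt = F.normalize(txt_feats, dim=-1)
        return img @ txt.t()                  # (num_images, num_texts) similarity matrix

    s = cosine_similarity_matrix(torch.randn(4, 768), torch.randn(6, 768))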
6. The remote sensing image cross-modal retrieval method based on language and visual detail feature fusion according to claim 1, wherein the multi-objective comprehensive supervision optimization method comprising four loss functions constructed in step 3 specifically comprises the following steps:
step 31: the triplet loss is used to learn the representation space of image features and text features, and the loss is calculated by comparing the distances between the features of three contrasted samples; a first triplet loss function L_itt1 is constructed between the text feature encoder and the image feature encoder; the features encoded by the image local encoder contain more image details that are critical to the final image representation and to the subsequent training of the multi-modal encoder, so a second triplet loss function L_itt2 is constructed between the text feature encoder and the image local feature encoder for deep supervision, expressed as follows:
L_itt1 = [ε + sim(I, T^) − sim(I, T)]_+ + [ε + sim(I^, T) − sim(I, T)]_+
L_itt2 = [ε + sim(I_loc, T^) − sim(I_loc, T)]_+ + [ε + sim(I^, T) − sim(I_loc, T)]_+
wherein: L_itt1 represents the first triplet loss function; ε represents the minimum margin used to widen the gap between the reference sample and the positive/negative sample pair; sim represents the similarity recall module; (I, T) represents the matched image-text pair features, generated by the image encoder and the text encoder respectively; T^ represents text features that do not match image I; I^ represents image features that do not match text T; L_itt2 represents the second triplet loss function; I_loc represents the features generated by the image local encoder; [·]_+ means that the value inside the brackets is taken as the loss when it is greater than 0 and the loss is 0 when it is less than 0;
step 32: an image-text matching loss function L_itm is constructed between the multi-modal encoder and the image feature encoder, and a fully-connected binary classification layer is appended to predict whether a pair of input image and text is matched; when selecting positive/negative samples, the hard samples closest to the positive samples within a single batch are computed by the similarity recall module and used as negative samples to enhance the learning ability of the multi-modal module; the loss is calculated in the following form:
L_itm = −y_itm·log(p_itm(I, T)) − (1 − y_itm)·log(1 − p_itm(I, T));
wherein: L_itm represents the image-text matching loss function constructed between the multi-modal encoder and the image feature encoder; y_itm represents the matching label of the constructed image-text pair; p_itm(I, T) represents the matching probability of the image-text pair;
step 33: a masked language model loss function L_mlm is constructed between the multi-modal encoder and the image local feature encoder; T^ denotes the text after mask processing, and the model prediction probability is denoted p_msk(I, T^), as follows:
L_mlm = −y_msk·log(p_msk(I, T^)) − (1 − y_msk)·log(1 − p_msk(I, T^));
wherein: L_mlm represents the masked language model loss function constructed between the multi-modal encoder and the image local feature encoder; y_msk represents the ground-truth label of the masked word; p_msk represents the model prediction probability;
step 34: the combination strategy of the four loss functions affects the final expression effect of the model, and appropriate weight coefficients must be assigned for multi-objective co-optimization so that no single task dominates the joint learning; therefore, a dynamic update strategy for the loss weights is adopted: for each training objective, the ratio of the current loss to the initial loss is considered during each round of training, and a hyper-parameter λ is introduced to balance the effect of the weights, as follows:
θ_t = λ · L_i(t) / L_0(t)
wherein: θ_t represents the weight of task t calculated by the above formula; L_i(t) represents the loss value of task t calculated during the current mini-batch iteration; L_0(t) represents the loss value of task t during the initial iteration; λ represents the weight used to balance each task, set to 0.5; i represents the index of the current training iteration.
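A hedged sketch of the dynamic loss-weight update of step 34, assuming the weight of each task is the λ-scaled ratio of its current loss to its initial loss (the reconstructed update rule above is itself an assumption); the loss names and values are illustrative:

    LAMBDA = 0.5   # balancing hyper-parameter, set to 0.5 in step 34

    def task_weights(current_losses, initial_losses, lam=LAMBDA):
        """theta_t = lam * L_i(t) / L_0(t) for every task t (assumed form of the update)."""
        return {t: lam * current_losses[t] / max(initial_losses[t], 1e-8)
                for t in current_losses}

    def total_loss(current_losses, initial_losses):
        weights = task_weights(current_losses, initial_losses)
        return sum(weights[t] * current_losses[t] for t in current_losses)

    initial = {"itt1": 2.0, "itt2": 2.1, "itm": 0.7, "mlm": 5.3}   # illustrative values
    current = {"itt1": 1.2, "itt2": 1.5, "itm": 0.4, "mlm": 3.0}
    loss = total_loss(current, initial)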
7. The method for cross-modal retrieval of remote sensing images based on fusion of language and visual detail features as claimed in claim 1, wherein the image-to-text recall R_i2t@N and the text-to-image recall R_t2i@N in step 41 are specifically:
the image-to-text recall is as follows:
R_i2t@N(Image_k) = 1 if the matching text Text_k is contained in {text_1, text_2, …, text_N}, and 0 otherwise
wherein: text_1, text_2, …, text_N respectively represent the 1st, 2nd through Nth candidate text samples in the set returned by the retrieval algorithm;
the text-to-image recall is as follows:
R_t2i@N(Text_k) = 1 if the matching image Image_k is contained in {image_1, image_2, …, image_N}, and 0 otherwise
wherein: image_1, image_2, …, image_N respectively represent the 1st, 2nd through Nth candidate image samples in the set returned by the retrieval algorithm.
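A minimal sketch of the recall@N computation of claim 7, assuming each query has exactly one matching item and that retrieval results are given as ranked index lists:

    def recall_at_n(ranked_lists, ground_truth, n):
        """mR@N: fraction of queries whose ground-truth item appears in the top-N results."""
        hits = sum(1 for k, ranked in enumerate(ranked_lists) if ground_truth[k] in ranked[:n])
        return hits / len(ranked_lists)

    # Illustrative data: ranked text indices returned for each of three image queries.
    ranked_texts = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
    gt_text_for_image = [0, 1, 2]                 # image k matches text k
    mr_i2t_at_1 = recall_at_n(ranked_texts, gt_text_for_image, 1)
    mr_i2t_at_2 = recall_at_n(ranked_texts, gt_text_for_image, 2)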
8. The method for cross-modal retrieval of remote sensing images based on fusion of language and visual detail features according to claim 1, wherein the cross-modal retrieval of remote sensing image-text description in step 5 comprises a text-to-image retrieval process and an image-to-text retrieval process, specifically as follows:
the text-to-image retrieval process: when a text description is input, the constructed cross-modal retrieval model first uses the text encoding module to calculate the features of the input text, then uses the similarity judgment recall module to calculate the similarity between each image feature in the image feature database constructed in step 4 and the text feature, and returns the top1, top5 and top10 similar features as the preliminary retrieval result; the multi-modal reordering module then performs a secondary correction of the preliminary retrieval result by calculating the matching probability between each returned image and the input text, thereby fine-tuning the preliminary retrieval result;
the image-to-text retrieval process: when an image is input, the constructed cross-modal retrieval model first uses the image encoding module to calculate the features of the input image, then uses the similarity judgment recall module to calculate the similarity between each text feature in the text feature database constructed in step 4 and the image feature, and returns the top1, top5 and top10 similar features as the preliminary retrieval result; the multi-modal reordering module then performs a secondary correction of the preliminary retrieval result by calculating the matching probability between each returned text and the input image, thereby fine-tuning the preliminary retrieval result.
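A hedged end-to-end sketch of the text-to-image retrieval flow of claim 8: encode the query text, recall top-k candidates by cosine similarity against the pre-computed image feature database of step 4, then rerank the candidates with an image-text matching probability from the multi-modal encoder; text_encoder and rerank_fn are placeholders standing in for the trained F_enc-L and F_enc-Mul modules:

    import numpy as np

    def search_images_by_text(query, text_encoder, image_db_feats, rerank_fn, top_k=10):
        """Similarity recall followed by multi-modal reordering (illustrative sketch)."""
        f_t = text_encoder(query)                                   # query text feature
        f_t = f_t / np.linalg.norm(f_t)
        db = image_db_feats / np.linalg.norm(image_db_feats, axis=1, keepdims=True)
        sims = db @ f_t                                             # similarity recall module
        candidates = np.argsort(-sims)[:top_k]                      # preliminary top-k result
        # secondary correction: reorder candidates by image-text matching probability
        match_probs = [rerank_fn(int(idx), query) for idx in candidates]
        order = np.argsort(match_probs)[::-1]
        return [int(candidates[i]) for i in order]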
CN202310550653.6A 2023-05-16 2023-05-16 Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics Pending CN116775922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310550653.6A CN116775922A (en) 2023-05-16 2023-05-16 Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310550653.6A CN116775922A (en) 2023-05-16 2023-05-16 Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics

Publications (1)

Publication Number Publication Date
CN116775922A true CN116775922A (en) 2023-09-19

Family

ID=87993958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310550653.6A Pending CN116775922A (en) 2023-05-16 2023-05-16 Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics

Country Status (1)

Country Link
CN (1) CN116775922A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116977796B (en) * 2023-09-25 2024-02-23 中国科学技术大学 Zero sample image recognition method, system, equipment and storage medium
CN116977796A (en) * 2023-09-25 2023-10-31 中国科学技术大学 Zero sample image recognition method, system, equipment and storage medium
CN117292146A (en) * 2023-10-27 2023-12-26 中科苏州智能计算技术研究院 Industrial scene-oriented method, system and application method for constructing multi-mode large language model
CN117557414A (en) * 2023-11-30 2024-02-13 重庆欣荣土地房屋勘测技术研究所有限责任公司 Cultivated land supervision method, device, equipment and storage medium based on automatic interpretation of remote sensing image
CN117556079B (en) * 2024-01-12 2024-04-16 航天宏图信息技术股份有限公司 Remote sensing image content retrieval method, remote sensing image content retrieval device, electronic equipment and medium
CN117556079A (en) * 2024-01-12 2024-02-13 航天宏图信息技术股份有限公司 Remote sensing image content retrieval method, remote sensing image content retrieval device, electronic equipment and medium
CN117609527A (en) * 2024-01-16 2024-02-27 合肥人工智能与大数据研究院有限公司 Cross-modal data retrieval optimization method based on vector database
CN117648459A (en) * 2024-01-29 2024-03-05 中国海洋大学 Image-text cross-modal retrieval method and system for high-similarity marine remote sensing data
CN117648459B (en) * 2024-01-29 2024-04-26 中国海洋大学 Image-text cross-modal retrieval method and system for high-similarity marine remote sensing data
CN117690031A (en) * 2024-02-04 2024-03-12 中科星图数字地球合肥有限公司 SAM model-based small sample learning remote sensing image detection method
CN117690031B (en) * 2024-02-04 2024-04-26 中科星图数字地球合肥有限公司 SAM model-based small sample learning remote sensing image detection method
CN117909535A (en) * 2024-03-15 2024-04-19 中国科学技术大学 Combined understanding method, system, equipment and medium based on visual language model
CN117909535B (en) * 2024-03-15 2024-05-31 中国科学技术大学 Combined understanding method, system, equipment and medium based on visual language model

Similar Documents

Publication Publication Date Title
CN116775922A (en) Remote sensing image cross-modal retrieval method based on fusion of language and visual detail characteristics
CN110909673B (en) Pedestrian re-identification method based on natural language description
Li et al. Truncation cross entropy loss for remote sensing image captioning
CN112905827B (en) Cross-modal image-text matching method, device and computer readable storage medium
CN113220919B (en) Dam defect image text cross-modal retrieval method and model
CN110738146B (en) Target re-recognition neural network and construction method and application thereof
CN115033670A (en) Cross-modal image-text retrieval method with multi-granularity feature fusion
Hoxha et al. A new CNN-RNN framework for remote sensing image captioning
CN109684928B (en) Chinese document identification method based on internet retrieval
WO2021088935A1 (en) Adversarial network architecture optimization method and system, and image description generation method and system
CN113033438B (en) Data feature learning method for modal imperfect alignment
CN111598183A (en) Multi-feature fusion image description method
CN112948601B (en) Cross-modal hash retrieval method based on controlled semantic embedding
CN112148831B (en) Image-text mixed retrieval method and device, storage medium and computer equipment
CN116226785A (en) Target object recognition method, multi-mode recognition model training method and device
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN112988970A (en) Text matching algorithm serving intelligent question-answering system
CN113032601A (en) Zero sample sketch retrieval method based on discriminant improvement
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN110197213B (en) Image matching method, device and equipment based on neural network
CN117453859A (en) Agricultural pest and disease damage image-text retrieval method, system and electronic equipment
CN115640418A (en) Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
CN115098646A (en) Multilevel relation analysis and mining method for image-text data
CN110609961A (en) Collaborative filtering recommendation method based on word embedding
CN113139378B (en) Image description method based on visual embedding and condition normalization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination