CN114201621B - Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention - Google Patents

Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention

Info

Publication number
CN114201621B
Authority
CN
China
Prior art keywords
sample
text
image
cross
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111406136.9A
Other languages
Chinese (zh)
Other versions
CN114201621A (en)
Inventor
单丽莉
苏宇
孙承杰
林磊
刘秉权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Konami Sports Club Co Ltd
Original Assignee
Harbin Institute of Technology
People Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology, People Co Ltd filed Critical Harbin Institute of Technology
Priority to CN202111406136.9A priority Critical patent/CN114201621B/en
Publication of CN114201621A publication Critical patent/CN114201621A/en
Application granted granted Critical
Publication of CN114201621B publication Critical patent/CN114201621B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention, which comprises the following steps: acquiring a training image and a training text, and respectively extracting local features of an image sample and a text sample; mapping all local image features of the image sample and all local text features of the text sample into feature vectors respectively, and representing the feature vectors of the image sample and the text sample into matrixes respectively to obtain Key matrixes, query matrixes and Value matrixes respectively; based on the matrixes, calculating the cross-modal attention characteristics and intra-modal attention characteristics of the image sample and the text sample; fusing the cross-modal attention features and intra-modal attention features to obtain a global feature representation of the image sample and a global feature representation of the text sample; and training to obtain a cross-modal retrieval model based on the global feature representation. The invention can directly carry out similarity matching on the data of different modes and has higher matching accuracy.

Description

Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention
Technical Field
The invention relates to the technical field of cross-modal retrieval of image texts, in particular to a cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention.
Background
With the explosive growth of multimedia data of different modalities such as text and images on the Internet, single-modality retrieval can no longer meet users' needs, and cross-modal retrieval has emerged. Cross-modal retrieval is the mutual retrieval of data of at least two modalities; generally, related data of one modality is retrieved using data of another modality as the query condition. For example, goods displayed on an e-commerce website usually comprise text descriptions such as the commodity category, name, attributes and detailed description, as well as commodity images; when a user searches for content of interest on the website, the user hopes to retrieve data of multiple modalities, such as text and images, related to a commodity so as to obtain more information about it.
However, data of different modalities have inconsistent representation forms, so their similarity cannot be measured directly, which leads to the low matching accuracy of existing cross-modal retrieval methods.
Disclosure of Invention
The invention addresses the problem that data of different modalities have inconsistent representation forms, so their similarity cannot be measured directly, resulting in the low matching accuracy of existing cross-modal retrieval methods.
The invention provides a cross-modal retrieval model construction method based on graphic and text cooperative attention, which comprises the following steps:
acquiring a training image and a training text, wherein the training image is an image sample with a class label, and the training text is a text sample with a class label;
extracting local image features of the image sample and extracting local text features of the text sample;
mapping all local image features of the image sample and all local text features of the text sample into feature vectors respectively, and representing the feature vectors of the image sample and the feature vectors of the text sample into matrixes respectively, and obtaining respective Key matrixes, query matrixes and Value matrixes respectively through a full connection layer;
calculating the cross-modal attention scores of the image sample and the text sample based on the Key matrix, the Query matrix and the Value matrix of the image sample and the text sample, and respectively generating the cross-modal attention features of the image sample and the text sample based on the cross-modal attention scores;
calculating intra-modal attention scores of the image sample and the text sample based on a Key matrix, a Query matrix and a Value matrix of each of the image sample and the text sample, and respectively generating intra-modal attention features of the image sample and the text sample based on the intra-modal attention scores;
fusing the cross-modal attention feature and the intra-modal attention feature to obtain a global feature representation of the image sample and a global feature representation of the text sample respectively;
and training to obtain a cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample.
Optionally, the calculating the cross-modal attention score of the image sample and the text sample based on the Key matrix, the Query matrix and the Value matrix of each of the image sample and the text sample, and generating the cross-modal attention features of the image sample and the text sample based on the cross-modal attention score respectively includes:
respectively performing inner product operation on the Key matrix of the image sample and the Query matrix of the text sample, the Key matrix of the text sample and the Query matrix of the image sample, normalizing, and respectively calculating the weight matrix of the image sample to the text sample and the weight matrix of the text sample to the image sample through softmax, wherein the cross-modal attention score comprises the weight matrix of the text sample to the image sample and the weight matrix of the image sample to the text sample;
taking the weight matrix of the text sample to the image sample as a score, and carrying out weighted summation operation on the Value matrix of the image sample to obtain the cross-modal attention characteristic of the text sample;
and taking the weight matrix of the image sample to the text sample as a score, and carrying out weighted summation operation on the Value matrix of the text sample to obtain the cross-modal attention characteristic of the image sample.
Optionally, the calculating intra-modal attention scores of the image sample and the text sample based on the Key matrix, the Query matrix, and the Value matrix of each of the image sample and the text sample, and generating intra-modal attention features of the image sample and the text sample based on the intra-modal attention scores respectively includes:
respectively carrying out inner product operation on the Key matrix and the Query matrix of the image sample and the Key matrix and the Query matrix of the text sample, normalizing, and respectively calculating the weight matrix of the image sample and the weight matrix of the text sample through softmax, wherein the intra-mode attention score comprises the weight matrix of the image sample and the weight matrix of the text sample;
taking the weight matrix of the image sample as a score, and carrying out weighted summation operation on the Value matrix of the image sample to obtain intra-mode attention characteristics of the image sample;
and taking the weight matrix of the text sample as a score, and carrying out weighted summation operation on the Value matrix of the text sample to obtain the intra-modal attention characteristic of the text sample.
Optionally, the training to obtain the cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample includes a label prediction training task, which specifically includes:
inputting the global feature representation of the image sample or the global feature representation of the text sample into a full connection layer, outputting the probability of each label by using softmax, and taking the category label with the highest probability as a prediction label of the image sample or the text sample input into the full connection layer;
and calculating a loss function of label prediction based on the prediction label and the real class label carried by the image sample or the text sample.
Optionally, the training to obtain the cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample further includes a metric learning training task, which specifically includes:
constructing a training sample set for metric learning, wherein one piece of training data in the training sample set comprises an anchor sample, a first preset number of cross-modal positive samples of the anchor sample, and a second preset number of cross-modal negative samples of the anchor sample, the anchor sample being the image sample or the text sample;
and calculating the distances between the anchor sample and all positive samples and between the anchor sample and all negative samples from the global feature representation of the anchor sample, the global feature representations of its cross-modal positive samples and the global feature representations of its cross-modal negative samples, and calculating the loss function of metric learning based on the distances.
Optionally, the training to obtain the cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample further includes:
and training to obtain the cross-modal retrieval model by adopting a multi-task learning mode based on the label predicted loss function and the metric learned loss function.
Optionally, the extracting the local image feature of the image sample includes:
inputting the image sample into a pre-trained Faster RCNN model and extracting local image features of the image sample, wherein the Faster RCNN model comprises a feature extraction network, an RPN network and a region-of-interest pooling network; the feature extraction network extracts features of the image sample and inputs the extracted feature map into the RPN network, the RPN network selects a preset number of regions of interest and marks them with rectangular boxes, and the region-of-interest pooling network extracts the features of the regions of interest marked by the RPN network as the local image features of the image sample.
Optionally, the extracting the local text feature of the text sample includes:
word segmentation is carried out on the text sample;
word2Vec is used for obtaining Word vectors of each Word after Word segmentation;
and inputting the word vector into a Bi-LSTM network, and acquiring a feature vector representation of each word as a local text feature of the text sample.
The invention also provides a cross-modal retrieval method based on graphic and text cooperative attention, which comprises the following steps:
acquiring a given retrieval condition element, wherein the retrieval condition element is a retrieval image or a retrieval text;
inputting the retrieval condition element and the elements in a preset retrieval range into the cross-modal retrieval model constructed by any of the above cross-modal retrieval model construction methods based on graphic and text cooperative attention, the cross-modal retrieval model outputting, as the retrieval result, a third preset number of elements in the retrieval range with the highest similarity to the retrieval condition element, wherein the elements in the retrieval range comprise images and/or texts.
The invention also provides a cross-modal retrieval device based on the graphic collaborative attention, which comprises a computer readable storage medium and a processor, wherein the computer readable storage medium stores a computer program, and the computer program realizes the cross-modal retrieval model construction method based on the graphic collaborative attention or the cross-modal retrieval method based on the graphic collaborative attention when being read and run by the processor.
According to the invention, the local image features of the image sample and the local text features of the text sample are extracted and mapped into feature vectors. A cross-modal attention mechanism captures the fine-grained interaction between data of the two modalities, while an intra-modal attention mechanism captures the associations between image regions and the semantic associations of the text context. Finally, the cross-modal attention features and the intra-modal attention features are fused to obtain global feature representations of the image and the text with a consistent form, so that data of the two different modalities can be compared directly by similarity measurement; the trained cross-modal retrieval model can therefore perform similarity matching on data of different modalities directly, with high matching accuracy.
Drawings
FIG. 1 is a schematic diagram of a cross-modal retrieval model construction method based on graphic collaborative attention in an embodiment of the invention;
FIG. 2 is a schematic diagram of an example of a text sample in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a cross-modal retrieval model construction method based on graphic collaborative attention according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a collaborative attention mechanism according to an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
Referring to fig. 1, in an embodiment of the present invention, the method for constructing a cross-modal search model based on graphic collaborative attention includes:
step S10, a training image and a training text are obtained, wherein the training image is an image sample with a class label, and the training text is a text sample with a class label.
The training data comprises a plurality of training images and training texts; each training image can be assigned an image ID and each training text a text ID so as to distinguish different training images and training texts. Both the training images and the training texts carry category labels. As shown in fig. 2, besides the text category label and the text description, a training text can also contain the image ID corresponding to the text, so that positive samples for the metric learning training task can be constructed later.
Step S20, extracting local image features of the image sample, and extracting local text features of the text sample.
Here, the local image features of the image sample refer to the region features of the image sample, specifically the features of a plurality of regions of the image sample. The local image features of the image sample can be extracted with an R-CNN, Fast R-CNN or Faster R-CNN algorithm.
Here, the local text features of the text sample refer to the feature representation of each word of the text sample. The local text features of the text sample can be extracted with Word2Vec, Bi-LSTM and similar algorithms.
And step S30, mapping all local image features of the image sample and all local text features of the text sample into feature vectors, respectively representing the feature vectors of the image sample and the feature vectors of the text sample into matrixes, and respectively obtaining respective Key matrixes, query matrixes and Value matrixes through a full connection layer.
One image sample has one or more local image features and one text sample has one or more local text features. Each local image feature of an image sample is mapped to a feature vector, which can be done by feeding the local image feature into a fully connected layer; likewise, each local text feature of a text sample is mapped to its corresponding feature vector through a fully connected layer.
Optionally, the feature vector of the image sample and the feature vector of the text sample have the same dimension.
Referring to fig. 4, if one image sample has k local image features and one text sample has j local text features, and the feature vector dimension obtained by mapping the k local image features of the image sample is d, k d-dimensional feature vectors can be obtained after mapping the local image features of the image sample, and j d-dimensional feature vectors can be obtained after mapping the local text features of the text sample.
The local image features of the image samples and the local text features of the text samples are mapped into vectors with the same dimension, so that data interaction between the subsequent image samples and the text samples is realized, and similarity calculation between the image samples and the text samples is facilitated.
Referring to fig. 4, the k d-dimensional feature vectors of the image sample are represented as a matrix P, and Key matrix, query matrix and Value matrix of the matrix P are obtained through the Linear full connection layer, which may be specifically represented as:
P_K = Linear(P; θ_PK); P_Q = Linear(P; θ_PQ); P_V = Linear(P; θ_PV),
where P_K denotes the Key matrix of matrix P, P_Q the Query matrix of matrix P, P_V the Value matrix of matrix P, and θ_PK, θ_PQ, θ_PV are the network weight parameters of the fully connected layers.
Referring to fig. 4, j d-dimensional feature vectors of a text sample are represented as a matrix T, and Key matrix, query matrix and Value matrix of the matrix T are obtained through a Linear full connection layer, which may be specifically represented as:
T_K = Linear(T; θ_TK); T_Q = Linear(T; θ_TQ); T_V = Linear(T; θ_TV),
where T_K denotes the Key matrix of matrix T, T_Q the Query matrix of matrix T, T_V the Value matrix of matrix T, and θ_TK, θ_TQ, θ_TV are the network weight parameters of the fully connected layers.
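As an informal illustration of this step, the sketch below projects the image feature matrix P and the text feature matrix T into Key, Query and Value spaces with fully connected layers. It is a minimal PyTorch sketch; the module name, the example sizes k=36, j=20 and d=512, and the variable names are our assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class KQVProjection(nn.Module):
    """Maps a matrix of local features (rows = regions or words) to Key/Query/Value matrices."""
    def __init__(self, d: int):
        super().__init__()
        self.key = nn.Linear(d, d)    # Linear(.; theta_K)
        self.query = nn.Linear(d, d)  # Linear(.; theta_Q)
        self.value = nn.Linear(d, d)  # Linear(.; theta_V)

    def forward(self, x: torch.Tensor):
        # x: (num_local_features, d), e.g. P (k x d) or T (j x d)
        return self.key(x), self.query(x), self.value(x)

# Example: k = 36 image regions, j = 20 words, shared dimension d = 512
d = 512
P = torch.randn(36, d)   # image sample: k local image features mapped to d dimensions
T = torch.randn(20, d)   # text sample: j local text features mapped to d dimensions
proj_img, proj_txt = KQVProjection(d), KQVProjection(d)
P_K, P_Q, P_V = proj_img(P)
T_K, T_Q, T_V = proj_txt(T)
```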
Step S40, calculating a cross-modal attention score of the image sample and the text sample based on the Key matrix, the Query matrix and the Value matrix of each of the image sample and the text sample, and generating cross-modal attention features of the image sample and the text sample based on the cross-modal attention score, respectively.
Further, the step S40 includes:
respectively performing inner product operation on the Key matrix of the image sample and the Query matrix of the text sample, the Key matrix of the text sample and the Query matrix of the image sample, normalizing, and respectively calculating the weight matrix of the image sample to the text sample and the weight matrix of the text sample to the image sample through softmax, wherein the cross-modal attention score comprises the weight matrix of the text sample to the image sample and the weight matrix of the image sample to the text sample.
Specifically, the Key matrix of the image sample and the Query matrix of the text sample are combined by an inner product (P_K·T_Q^T), the result is normalized by √d, and softmax gives the weight matrix of the image sample over the text sample:
W_PT = softmax(P_K·T_Q^T / √d).
Likewise, the Key matrix of the text sample and the Query matrix of the image sample are combined by an inner product (T_K·P_Q^T), the result is normalized by √d, and softmax gives the weight matrix of the text sample over the image sample:
W_TP = softmax(T_K·P_Q^T / √d),
where P_K is the Key matrix of the image sample, T_Q the Query matrix of the text sample, T_K the Key matrix of the text sample, P_Q the Query matrix of the image sample, and d the feature vector dimension of the image and text samples.
The weight matrix of the image sample over the text sample is then taken as the score in a weighted sum over the Value matrix of the text sample, which yields the cross-modal attention feature of the image sample:
P_inter = W_PT × T_V,
where P_inter is the cross-modal attention feature of the image sample, W_PT the weight matrix of the image sample over the text sample, and T_V the Value matrix of the text sample.
Similarly, the weight matrix of the text sample over the image sample is taken as the score in a weighted sum over the Value matrix of the image sample, which yields the cross-modal attention feature of the text sample:
T_inter = W_TP × P_V,
where T_inter is the cross-modal attention feature of the text sample, W_TP the weight matrix of the text sample over the image sample, and P_V the Value matrix of the image sample.
Taking the inner product of the Key matrix of the image sample and the Query matrix of the text sample produces attention weights between each local image feature and each local text feature, which determine which parts of the input to attend to and allocate the limited processing capacity to the important parts; normalizing the inner-product result keeps the final attention score independent of the feature vector dimension d.
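Continuing the sketch above, the cross-modal attention step described here can be written as follows. This is a hedged PyTorch sketch that follows the scaled inner product, softmax and weighted-sum description; the function name is ours.

```python
import math
import torch

def cross_modal_attention(P_K, P_Q, P_V, T_K, T_Q, T_V, d: int):
    # Weight matrix of the image sample over the text sample, and vice versa:
    # softmax over the inner products scaled by sqrt(d).
    W_PT = torch.softmax(P_K @ T_Q.t() / math.sqrt(d), dim=-1)  # (k, j)
    W_TP = torch.softmax(T_K @ P_Q.t() / math.sqrt(d), dim=-1)  # (j, k)
    # Weighted sums over the other modality's Value matrix.
    P_inter = W_PT @ T_V  # cross-modal attention feature of the image sample, (k, d)
    T_inter = W_TP @ P_V  # cross-modal attention feature of the text sample, (j, d)
    return P_inter, T_inter

# P_inter, T_inter = cross_modal_attention(P_K, P_Q, P_V, T_K, T_Q, T_V, d)
```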
Step S50, calculating intra-mode attention scores of the image sample and the text sample based on Key matrix, query matrix and Value matrix of each of the image sample and the text sample, and generating intra-mode attention features of the image sample and the text sample based on the intra-mode attention scores, respectively.
Further, the step S50 includes:
and respectively carrying out inner product operation on the Key matrix and the Query matrix of the image sample and the Key matrix and the Query matrix of the text sample, normalizing, and respectively calculating the weight matrix of the image sample and the weight matrix of the text sample through softmax, wherein the intra-mode attention score comprises the weight matrix of the image sample and the weight matrix of the text sample.
The Key matrix and the Query matrix of the image sample are combined by an inner product (P_K·P_Q^T), the result is normalized by √d, and softmax gives the weight matrix of the image sample:
W_PP = softmax(P_K·P_Q^T / √d),
where W_PP is the weight matrix of the image sample, P_K its Key matrix, P_Q its Query matrix, and d the feature vector dimension of the image sample.
Likewise, the Key matrix and the Query matrix of the text sample are combined by an inner product (T_K·T_Q^T), the result is normalized by √d, and softmax gives the weight matrix of the text sample:
W_TT = softmax(T_K·T_Q^T / √d),
where W_TT is the weight matrix of the text sample, T_K its Key matrix, T_Q its Query matrix, and d the feature vector dimension of the text sample.
The weight matrix of the image sample is then taken as the score in a weighted sum over the Value matrix of the image sample, which yields the intra-modal attention feature of the image sample:
P_intra = W_PP × P_V,
where P_intra is the intra-modal attention feature of the image sample and P_V the Value matrix of the image sample.
Similarly, the weight matrix of the text sample is taken as the score in a weighted sum over the Value matrix of the text sample, which yields the intra-modal attention feature of the text sample:
T_intra = W_TT × T_V,
where T_intra is the intra-modal attention feature of the text sample and T_V the Value matrix of the text sample.
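The intra-modal branch uses the same pattern within a single modality. A minimal sketch, assuming the same tensors as in the previous sketches:

```python
import math
import torch

def intra_modal_attention(K: torch.Tensor, Q: torch.Tensor, V: torch.Tensor, d: int) -> torch.Tensor:
    # Self-attention within one modality: W = softmax(K Q^T / sqrt(d)), then a weighted sum over V.
    W = torch.softmax(K @ Q.t() / math.sqrt(d), dim=-1)
    return W @ V

# P_intra = intra_modal_attention(P_K, P_Q, P_V, d)  # intra-modal feature of the image sample
# T_intra = intra_modal_attention(T_K, T_Q, T_V, d)  # intra-modal feature of the text sample
```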
And step S60, fusing the cross-modal attention feature and the intra-modal attention feature to respectively obtain a global feature representation of the image sample and a global feature representation of the text sample.
Specifically, intra-modal attention features and inter-modal attention features of the image sample are fused, that is, the intra-modal attention features and inter-modal attention features are spliced, and then a plurality of local features of the image sample are reduced to 1 global feature through a maximum pooling layer, for example, k d-dimensional local features are reduced to 1 d-dimensional global feature representation. And fusing intra-modal attention features and inter-modal attention features of the text sample, and obtaining global feature representation of the text sample through a maximum pooling layer, wherein the maximum pooling layer has the same effect as the maximum pooling layer at the image sample, and the description is omitted here. It can be expressed as:
P_final = MaxPooling([P_inter, P_intra]),
T_final = MaxPooling([T_inter, T_intra]),
where P_final is the global feature representation of the image sample and T_final the global feature representation of the text sample. FIG. 4 shows how the computed P_intra and P_inter are concatenated and pooled to obtain this representation.
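A minimal sketch of the fusion step follows. We read the splicing as stacking the cross-modal and intra-modal local features along the local-feature axis, so that max pooling returns a single d-dimensional global vector as in the example above (k d-dimensional local features reduced to one d-dimensional global feature); that reading and the function name are our assumptions.

```python
import torch

def fuse_and_pool(inter: torch.Tensor, intra: torch.Tensor) -> torch.Tensor:
    # Stack the cross-modal and intra-modal local features, then max-pool over the
    # local-feature axis so a single d-dimensional global vector remains.
    fused = torch.cat([inter, intra], dim=0)  # (2 * num_local, d)
    return fused.max(dim=0).values            # (d,)

# P_final = fuse_and_pool(P_inter, P_intra)  # global feature of the image sample
# T_final = fuse_and_pool(T_inter, T_intra)  # global feature of the text sample
```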
Step S70, training to obtain a cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample.
After the global feature representations of the image and text samples are obtained, the cross-modal retrieval model is trained using a label prediction task and a metric learning task.
The method extracts the local image features of the image sample and the local text features of the text sample and maps them into feature vectors. A cross-modal attention mechanism captures the fine-grained interaction between data of the two modalities, and an intra-modal attention mechanism captures the associations between image regions and the semantic associations of the text context. Finally, the cross-modal and intra-modal attention features are fused into global feature representations of the image and the text with a consistent form, so that data of the two modalities can be compared directly by similarity measurement; the trained cross-modal retrieval model can thus perform similarity matching on data of different modalities directly, with high matching accuracy.
Optionally, as shown in fig. 3, step S70 includes a label prediction training task, specifically including:
inputting the global feature representation of the image sample or the global feature representation of the text sample into a full connection layer, outputting the probability of each label by using softmax, and taking the category label with the highest probability as a prediction label of the image sample or the text sample input into the full connection layer;
and calculating a loss function of label prediction based on the prediction label and the real class label carried by the image sample or the text sample.
The cross-entropy loss is used as the loss function for label prediction, which can be expressed as
L_label = -(1/n) Σ_{i=1}^{n} [ y_i·log(p_vi) + y_i·log(p_ti) ],
where L_label denotes the label prediction loss, n the number of samples in one batch, y_i the true label of each sample, p_vi the label prediction generated for the image sample, and p_ti the label prediction generated for the text sample.
The label prediction task ensures that, within a modality, samples with the same label have similar feature representations and samples with different labels have different feature representations.
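A hedged sketch of the label prediction task is given below. It uses a fully connected layer over the global feature and PyTorch's CrossEntropyLoss, which folds the softmax into the loss and is therefore equivalent to the softmax + cross-entropy formulation above; the class name and the way the image and text terms are summed are our choices.

```python
import torch
import torch.nn as nn

class LabelPredictionHead(nn.Module):
    """Fully connected layer over a global feature; softmax is folded into the loss below."""
    def __init__(self, d: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(d, num_classes)

    def forward(self, global_feature: torch.Tensor) -> torch.Tensor:
        return self.fc(global_feature)  # class logits

criterion = nn.CrossEntropyLoss()  # log-softmax + negative log-likelihood
# logits_img = head(P_final.unsqueeze(0)); logits_txt = head(T_final.unsqueeze(0))
# L_label = criterion(logits_img, y_true) + criterion(logits_txt, y_true)
```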
Optionally, as shown in fig. 3, step S70 includes a metric learning training task, specifically including:
constructing a training sample set for metric learning, wherein one piece of training data in the training sample set comprises an anchor sample, a first preset number of cross-modal positive samples of the anchor sample, and a second preset number of cross-modal negative samples of the anchor sample, the anchor sample being the image sample or the text sample; and calculating the distances between the anchor sample and all positive samples and between the anchor sample and all negative samples from the global feature representation of the anchor sample, the global feature representations of its cross-modal positive samples and the global feature representations of its cross-modal negative samples, and calculating the loss function of metric learning based on the distances.
When the anchor sample is an image sample, the cross-modal positive sample refers to a text-form positive sample, and the cross-modal negative sample refers to a text-form negative sample; when the anchor sample is a text sample, its cross-modal positive sample refers to the positive sample in image form and its cross-modal negative sample refers to the negative sample in image form.
The first preset number may be greater than or equal to 1, with a preferred value of 1. The second preset number may be greater than or equal to 1, with a preferred value of m-1, where m is the number of class labels: for an anchor sample, only one class among all class labels (namely the class label of the anchor sample) provides its positive samples, the other m-1 classes provide negative samples, and one sample is selected from each of those classes for computing the loss function.
In one embodiment, for one batch of image and text data with semantic labels from m different classes, an image sample is taken as the anchor sample, a text sample with the same class label as the image sample is randomly sampled as the positive sample, and one text sample is sampled from each class different from the anchor's, giving m-1 negative samples. Likewise, a text sample is taken as the anchor sample, an image sample with the same semantic label as the text sample is randomly sampled as the positive sample, and image samples from the classes different from the anchor's are sampled as negative samples, giving m-1 negative samples.
The loss function of metric learning is calculated from the distances between the anchor sample and all positive samples and all negative samples; specifically, the distances between the anchor sample and the negative samples are subtracted from the distances between the anchor sample and the positive samples, and training on this loss shrinks the distance between positive pairs and enlarges the distance between negative pairs. Optionally, the distance between samples is defined using cosine similarity.
Further, the loss function of metric learning is defined as follows:
L_metric = L(v) + L(t),
where L(v) is the metric learning loss of the image samples, v is a selected image sample, t^+ is a text sample of the same category as the image sample (positive sample), M is the number of categories, and t_i is a text sample of a different category from the image sample (negative sample); L(t) is the metric learning loss of the text samples, t is a selected text sample, v^+ is an image sample of the same category as the text sample (positive sample), and v_i is an image sample of a different category from the text sample (negative sample). L_metric is the total metric learning loss.
The metric learning task ensures that samples with similar semantics in different modalities have similar feature representations and samples with different semantics in different modalities have different feature representations.
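Below is a hedged sketch of the metric learning loss for a single anchor. The patent defines sample distance via cosine similarity and subtracts the negative-pair distances from the positive-pair distance; here we take 1 − cos as that distance and average over the negatives, both of which are our assumptions, as are the helper names.

```python
import torch
import torch.nn.functional as F

def metric_loss_for_anchor(anchor: torch.Tensor,
                           positive: torch.Tensor,
                           negatives: torch.Tensor) -> torch.Tensor:
    """anchor, positive: (d,) global features; negatives: (m-1, d) cross-modal negatives."""
    dist_pos = 1.0 - F.cosine_similarity(anchor, positive, dim=0)
    dist_neg = 1.0 - F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1)  # (m-1,)
    # Positive-pair distance minus the (mean) negative-pair distance: minimizing this
    # shrinks positive distances and enlarges negative distances.
    return dist_pos - dist_neg.mean()

# L_v = metric_loss_for_anchor(P_final, T_pos, T_negs)  # image anchor, text positives/negatives
# L_t = metric_loss_for_anchor(T_final, P_pos, P_negs)  # text anchor, image positives/negatives
# L_metric = L_v + L_t
```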
The step S70 further includes: and training to obtain the cross-modal retrieval model by adopting a multi-task learning mode based on the label predicted loss function and the metric learned loss function.
Specifically, the definition of the loss function for the multitasking learning is as follows:
L = α·L_label + β·L_metric,
where α and β are hyperparameters that balance the weights of the two task loss functions.
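A trivial sketch of the combined objective; the default weights of 1.0 are placeholders, not values given in the patent.

```python
import torch

def multitask_loss(L_label: torch.Tensor, L_metric: torch.Tensor,
                   alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    # L = alpha * L_label + beta * L_metric; alpha and beta balance the two tasks.
    return alpha * L_label + beta * L_metric
```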
Training the cross-modal retrieval model in a multi-task manner ensures that, within each modality, samples with the same label have similar feature representations and samples with different labels have different representations, while across modalities, samples with similar semantics have similar feature representations and samples with different semantics have different ones.
Optionally, as shown in fig. 3, the extracting the local image feature of the image sample includes:
inputting the image sample into a pre-trained Faster RCNN model, extracting local image characteristics of the image sample, wherein the Faster RCNN model comprises a characteristic extraction network, an RPN network and a region of interest pooling network, the characteristic extraction network is used for extracting characteristics of the image sample, inputting an extracted characteristic image into the RPN network, selecting a preset number of regions of interest by the RPN network, marking the regions of interest by using rectangular frames, and the region of interest pooling network is used for extracting characteristics of the regions of interest based on the regions of interest marked by the RPN network to serve as local image characteristics of the image sample.
The feature extraction network consists of a set of convolution layers, ReLU activation layers and pooling layers.
After the extracted feature map is input into the RPN network, the RPN processes it in two branches: the first branch selects k regions of interest through classification, specifically softmax classification; the second branch marks the approximate locations of these regions of interest with rectangular boxes.
Based on the region-of-interest positions marked by the RPN network, the region-of-interest pooling network extracts the features of the k regions from the feature map produced by the feature extraction network, and these serve as the local image features of the image sample.
Extracting the local image features of the image sample with the Faster RCNN model allows local image features to be extracted from the image sample effectively and quickly.
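Purely as an illustration, region features of this kind can be obtained by reusing the backbone, RPN and RoI pooling head of torchvision's Faster R-CNN. The sketch below relies on torchvision's internal module layout and a COCO-pretrained model, neither of which the patent specifies, and the function name and top_k value are ours.

```python
import torch
import torchvision

# A COCO-pretrained Faster R-CNN stands in for the patent's pre-trained model.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

@torch.no_grad()
def region_features(image: torch.Tensor, top_k: int = 36) -> torch.Tensor:
    """image: (3, H, W) float tensor in [0, 1]; returns (top_k, 1024) region features."""
    images, _ = model.transform([image])                   # resize and normalize
    features = model.backbone(images.tensors)              # feature extraction network
    proposals, _ = model.rpn(images, features)             # RPN marks regions of interest
    boxes = [proposals[0][:top_k]]                         # keep the top-k proposals
    pooled = model.roi_heads.box_roi_pool(features, boxes, images.image_sizes)
    return model.roi_heads.box_head(pooled)                # pooled per-region features
```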
Optionally, as shown in fig. 3, the extracting the local text feature of the text sample includes:
and word segmentation is carried out on the text sample. The specific word segmentation method can select hidden Markov models, jieba word segmentation algorithms and the like, and the algorithms are all in the prior art and are not repeated here.
The Word vector of each Word after Word segmentation is obtained by using Word2Vec, and specifically, the Word vector of each Word after Word segmentation can be obtained through a Skip-Gram model in Word2 Vec. The related content is the prior art and is not described herein.
And inputting the word vector into a Bi-LSTM network, and acquiring a feature vector representation of each word as a local text feature of the text sample. Among them, bi-LSTM is one of RNN networks, and is suitable for modeling time series data, such as text data here, which can better capture long-distance dependency relationships by learning and memorizing which information and forgetting which information, and in addition, can better capture Bi-directional semantic dependencies by combining forward LSTM and backward LSTM.
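A rough end-to-end sketch of this pipeline (jieba segmentation, skip-gram Word2Vec from gensim, then a Bi-LSTM) is shown below; the toy corpus, vector sizes and variable names are illustrative assumptions only.

```python
import jieba
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# Toy corpus standing in for the training texts (illustrative only).
corpus = ["红色连衣裙夏季新款", "男士运动跑步鞋轻便透气"]
tokenized = [jieba.lcut(doc) for doc in corpus]            # word segmentation

# Skip-gram (sg=1) Word2Vec word vectors; the dimension 128 is arbitrary here.
w2v = Word2Vec(sentences=tokenized, vector_size=128, sg=1, min_count=1)

# Bi-LSTM over the word vectors; each output step is one local text feature.
bilstm = nn.LSTM(input_size=128, hidden_size=256, bidirectional=True, batch_first=True)
words = tokenized[0]
vectors = torch.from_numpy(np.stack([w2v.wv[w] for w in words])).unsqueeze(0)  # (1, seq, 128)
local_text_features, _ = bilstm(vectors)                   # (1, seq, 512)
```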
The invention also provides a cross-modal retrieval method based on graphic and text cooperative attention, which comprises the following steps:
acquiring a given retrieval condition element, wherein the retrieval condition element is a retrieval image or a retrieval text; inputting the retrieval condition element and the elements in a preset retrieval range into the cross-modal retrieval model constructed by the above cross-modal retrieval model construction method based on graphic and text cooperative attention, the cross-modal retrieval model outputting, as the retrieval result, a third preset number of elements in the retrieval range with the highest similarity to the retrieval condition element, wherein the elements in the retrieval range comprise images and/or texts.
After acquiring a given retrieval condition element, inputting the given retrieval condition element into a cross-modal retrieval model constructed/trained by a cross-modal retrieval model construction method based on graphic-text cooperative attention, simultaneously, inputting elements in a preset retrieval range into the cross-modal retrieval model, and outputting a third preset number of elements with highest similarity with the retrieval condition element in the retrieval range by the cross-modal retrieval model.
The preset retrieval range may be defined to include only image elements, only text elements, or both image and text elements. In one embodiment, the given retrieval condition element is an image and the preset retrieval range contains only text elements, so the cross-modal retrieval model outputs the preset number of texts in the retrieval range with the highest similarity to the retrieval condition element; in another embodiment, the given retrieval condition element is a text and the preset retrieval range contains only image elements, so the cross-modal retrieval model outputs the image in the retrieval range with the highest similarity to the retrieval condition element.
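A minimal sketch of the retrieval step, assuming the similarity between global features is measured with cosine similarity (the measure used by the metric learning task); the function name and the top-k default are ours.

```python
import torch
import torch.nn.functional as F

def retrieve(query_feature: torch.Tensor,
             candidate_features: torch.Tensor,
             top_k: int = 5):
    """query_feature: (d,) global feature of the query image or text;
    candidate_features: (N, d) global features of the elements in the retrieval range."""
    sims = F.cosine_similarity(query_feature.unsqueeze(0), candidate_features, dim=1)  # (N,)
    scores, indices = sims.topk(min(top_k, candidate_features.size(0)))
    return indices.tolist(), scores.tolist()  # most similar candidates first
```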
The method comprises the steps of inputting elements in a search condition and elements in a preset search range into a cross-modal search model constructed by a cross-modal search model construction method based on graphic-text collaborative attention, capturing fine-grained interaction relation of data among modes by using a cross-modal attention mechanism through the cross-modal search model, capturing association among image areas and semantic association of text context by using the intra-modal attention mechanism, and finally fusing the cross-modal attention feature and the intra-modal attention feature to obtain global feature representation with consistent representation forms of images and texts, so that the images and texts of different modes can be directly subjected to similarity measurement, and further accuracy of the cross-modal search is improved.
In an embodiment of the present invention, a cross-modal retrieval device based on graphic and text cooperative attention comprises a computer readable storage medium and a processor; the computer readable storage medium stores a computer program which, when read and executed by the processor, implements the above cross-modal retrieval model construction method based on graphic and text cooperative attention or the above cross-modal retrieval method based on graphic and text cooperative attention. Compared with the prior art, the advantages of this cross-modal retrieval device are the same as those of the cross-modal retrieval method based on graphic and text cooperative attention and are not repeated here.
The reader will appreciate that in the description of this specification, a description of terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (10)

1. The method for constructing the cross-modal retrieval model based on the graphic and text cooperative attention is characterized by comprising the following steps of:
acquiring a training image and a training text, wherein the training image is an image sample with a class label, and the training text is a text sample with a class label;
extracting local image features of the image sample and extracting local text features of the text sample;
mapping all local image features of the image sample and all local text features of the text sample into feature vectors respectively, and representing the feature vectors of the image sample and the feature vectors of the text sample into matrixes respectively, and obtaining respective Key matrixes, query matrixes and Value matrixes respectively through a full connection layer;
calculating the cross-modal attention scores of the image sample and the text sample based on the Key matrix, the Query matrix and the Value matrix of the image sample and the text sample, and respectively generating the cross-modal attention features of the image sample and the text sample based on the cross-modal attention scores;
calculating intra-modal attention scores of the image sample and the text sample based on a Key matrix, a Query matrix and a Value matrix of each of the image sample and the text sample, and respectively generating intra-modal attention features of the image sample and the text sample based on the intra-modal attention scores;
fusing the cross-modal attention feature and the intra-modal attention feature to obtain a global feature representation of the image sample and a global feature representation of the text sample respectively;
and training to obtain a cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample.
2. The method for constructing a cross-modal retrieval model based on graphic co-attention as claimed in claim 1, wherein the calculating the cross-modal attention score of the image sample and the text sample based on the Key matrix, the Query matrix and the Value matrix of each of the image sample and the text sample, and the generating the cross-modal attention features of the image sample and the text sample based on the cross-modal attention score respectively comprises:
respectively performing inner product operation on the Key matrix of the image sample and the Query matrix of the text sample, the Key matrix of the text sample and the Query matrix of the image sample, normalizing, and respectively calculating the weight matrix of the image sample to the text sample and the weight matrix of the text sample to the image sample through softmax, wherein the cross-modal attention score comprises the weight matrix of the text sample to the image sample and the weight matrix of the image sample to the text sample;
taking the weight matrix of the text sample to the image sample as a score, and carrying out weighted summation operation on the Value matrix of the image sample to obtain the cross-modal attention characteristic of the text sample;
and taking the weight matrix of the image sample to the text sample as a score, and carrying out weighted summation operation on the Value matrix of the text sample to obtain the cross-modal attention characteristic of the image sample.
3. The method for constructing a cross-modal retrieval model based on graphic co-attention as claimed in claim 1, wherein the calculating intra-modal attention scores of the image sample and the text sample based on Key matrix, query matrix and Value matrix of each of the image sample and the text sample, and generating intra-modal attention features of the image sample and the text sample based on the intra-modal attention scores respectively comprises:
respectively carrying out inner product operation on the Key matrix and the Query matrix of the image sample and the Key matrix and the Query matrix of the text sample, normalizing, and respectively calculating the weight matrix of the image sample and the weight matrix of the text sample through softmax, wherein the intra-mode attention score comprises the weight matrix of the image sample and the weight matrix of the text sample;
taking the weight matrix of the image sample as a score, and carrying out weighted summation operation on the Value matrix of the image sample to obtain intra-mode attention characteristics of the image sample;
and taking the weight matrix of the text sample as a score, and carrying out weighted summation operation on the Value matrix of the text sample to obtain the intra-modal attention characteristic of the text sample.
4. The method for constructing a cross-modal retrieval model based on graphic co-attention as claimed in claim 1, wherein the training to obtain the cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample includes a label prediction training task, specifically including:
inputting the global feature representation of the image sample or the global feature representation of the text sample into a full connection layer, outputting the probability of each label by using softmax, and taking the category label with the highest probability as a prediction label of the image sample or the text sample input into the full connection layer;
and calculating a loss function of label prediction based on the prediction label and the real class label carried by the image sample or the text sample.
5. The method for constructing a cross-modal retrieval model based on graphic co-attention as set forth in claim 4, wherein the training to obtain the cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample further includes a metric learning training task, and specifically includes:
constructing a training sample set for metric learning, wherein one piece of training data in the training sample set comprises an anchor sample, a first preset number of cross-modal positive samples of the anchor sample and a second preset number of cross-modal negative samples of the anchor sample, and the anchor sample is the image sample or the text sample;
and calculating the distances between the anchor sample and all positive samples and between the anchor sample and all negative samples from the global feature representation of the anchor sample, the global feature representations of its cross-modal positive samples and the global feature representations of its cross-modal negative samples, and calculating the loss function of metric learning based on the distances.
6. The method for constructing a cross-modal retrieval model based on graphic co-attention as recited in claim 5, wherein training the cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample further comprises:
and training to obtain the cross-modal retrieval model by adopting a multi-task learning mode based on the label predicted loss function and the metric learned loss function.
7. A method of constructing a cross-modal retrieval model based on graphic and text cooperative attention as claimed in any one of claims 1 to 6, wherein said extracting local image features of the image sample includes:
inputting the image sample into a pre-trained Faster RCNN model, extracting local image characteristics of the image sample, wherein the Faster RCNN model comprises a characteristic extraction network, an RPN network and a region of interest pooling network, the characteristic extraction network is used for extracting characteristics of the image sample, inputting an extracted characteristic image into the RPN network, selecting a preset number of regions of interest by the RPN network, marking the regions of interest by using rectangular frames, and the region of interest pooling network is used for extracting characteristics of the regions of interest based on the regions of interest marked by the RPN network to serve as local image characteristics of the image sample.
8. A method of constructing a cross-modal retrieval model based on graphic and text cooperative attention as claimed in any one of claims 1 to 6, wherein said extracting local text features of the text sample includes:
word segmentation is carried out on the text sample;
word2Vec is used for obtaining Word vectors of each Word after Word segmentation;
and inputting the word vector into a Bi-LSTM network, and acquiring a feature vector representation of each word as a local text feature of the text sample.
9. The cross-modal retrieval method based on graphic and text cooperative attention is characterized by comprising the following steps of:
acquiring a given retrieval condition element, wherein the retrieval condition element is a retrieval image or a retrieval text;
inputting the retrieval condition element and the elements in a preset retrieval range into a cross-modal retrieval model constructed by the cross-modal retrieval model construction method based on graphic and text cooperative attention according to any one of claims 1 to 8, the cross-modal retrieval model outputting, as the retrieval result, a third preset number of elements in the retrieval range with the highest similarity to the retrieval condition element, wherein the elements in the retrieval range comprise images and/or texts.
10. A cross-modal retrieval device based on graphic and text cooperative attention, comprising a computer readable storage medium and a processor, the computer readable storage medium storing a computer program which, when read and executed by the processor, implements the cross-modal retrieval model construction method based on graphic and text cooperative attention as claimed in any one of claims 1 to 8, or the cross-modal retrieval method based on graphic and text cooperative attention as claimed in claim 9.
CN202111406136.9A 2021-11-24 2021-11-24 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention Active CN114201621B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111406136.9A CN114201621B (en) 2021-11-24 2021-11-24 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111406136.9A CN114201621B (en) 2021-11-24 2021-11-24 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention

Publications (2)

Publication Number Publication Date
CN114201621A CN114201621A (en) 2022-03-18
CN114201621B true CN114201621B (en) 2024-04-02

Family

ID=80648805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111406136.9A Active CN114201621B (en) 2021-11-24 2021-11-24 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention

Country Status (1)

Country Link
CN (1) CN114201621B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841243B (en) * 2022-04-02 2023-04-07 中国科学院上海高等研究院 Cross-modal retrieval model training method, cross-modal retrieval method, device and medium
CN114969405B (en) * 2022-04-30 2024-01-26 苏州浪潮智能科技有限公司 Cross-modal image-text mutual detection method
CN114663737B (en) * 2022-05-20 2022-12-02 浪潮电子信息产业股份有限公司 Object identification method and device, electronic equipment and computer readable storage medium
CN114691907B (en) * 2022-05-31 2022-09-16 上海蜜度信息技术有限公司 Cross-modal retrieval method, device and medium
CN114707007B (en) * 2022-06-07 2022-08-30 苏州大学 Image text retrieval method and device and computer storage medium
CN115909317A (en) * 2022-07-15 2023-04-04 广东工业大学 Learning method and system for three-dimensional model-text joint expression
CN114973294B (en) * 2022-07-28 2022-10-21 平安科技(深圳)有限公司 Image-text matching method, device, equipment and storage medium
CN115017358B (en) * 2022-08-09 2022-11-04 南京理工大学 Cross-modal retrieval method and system for multi-modal interaction
CN115238130B (en) * 2022-09-21 2022-12-06 之江实验室 Time sequence language positioning method and device based on modal customization collaborative attention interaction
CN115658955B (en) * 2022-11-08 2023-03-14 苏州浪潮智能科技有限公司 Cross-media retrieval and model training method, device, equipment and menu retrieval system
CN115861995B (en) * 2023-02-08 2023-05-23 山东海量信息技术研究院 Visual question-answering method and device, electronic equipment and storage medium
CN116433727B (en) * 2023-06-13 2023-10-27 北京科技大学 Scalable single-stream tracking method based on staged continuous learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
CN113095415A (en) * 2021-04-15 2021-07-09 齐鲁工业大学 Cross-modal hashing method and system based on multi-modal attention mechanism
CN113220919A (en) * 2021-05-17 2021-08-06 河海大学 Dam defect image text cross-modal retrieval method and model
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast
CN113392254A (en) * 2021-03-29 2021-09-14 西安理工大学 Image text retrieval method based on context awareness

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11526808B2 (en) * 2019-05-29 2022-12-13 The Board Of Trustees Of The Leland Stanford Junior University Machine learning based generation of ontology for structural and functional mapping

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
CN113392254A (en) * 2021-03-29 2021-09-14 西安理工大学 Image text retrieval method based on context awareness
CN113095415A (en) * 2021-04-15 2021-07-09 齐鲁工业大学 Cross-modal hashing method and system based on multi-modal attention mechanism
CN113220919A (en) * 2021-05-17 2021-08-06 河海大学 Dam defect image text cross-modal retrieval method and model
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Co-attention Network Model for Cross-modal Retrieval; 邓一姣, 张凤荔, 陈学勤, 艾擎, 余苏喆; Computer Science; 2020-12-31 (04); 60-65 *
Research on Audio Database Content Matching Method for Cross-modal Retrieval; 张天, 靳聪, 帖云, 李小兵; Journal of Signal Processing; 2020-12-31 (06); 180-190 *

Also Published As

Publication number Publication date
CN114201621A (en) 2022-03-18

Similar Documents

Publication Publication Date Title
CN114201621B (en) Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention
US11222055B2 (en) System, computer-implemented method and computer program product for information retrieval
US11301732B2 (en) Processing image-bearing electronic documents using a multimodal fusion framework
CN112364204B (en) Video searching method, device, computer equipment and storage medium
CN113360701B (en) Sketch processing method and system based on knowledge distillation
US11663280B2 (en) Search engine using joint learning for multi-label classification
CN110083729B (en) Image searching method and system
CN113657425A (en) Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN114840705B (en) Combined commodity retrieval method and system based on multi-mode pre-training model
CN112528053A (en) Multimedia library classified retrieval management system
CN114332680A (en) Image processing method, video searching method, image processing device, video searching device, computer equipment and storage medium
US20210248425A1 (en) Reinforced text representation learning
CN112487199A (en) User characteristic prediction method based on user purchasing behavior
CN111651577B (en) Cross-media data association analysis model training and data association analysis method and system
CN110674388A (en) Mapping method and device for push item, storage medium and terminal equipment
CN117453859A (en) Agricultural pest and disease damage image-text retrieval method, system and electronic equipment
CN112182451A (en) Webpage content abstract generation method, equipment, storage medium and device
CN115797795A (en) Remote sensing image question-answering type retrieval system and method based on reinforcement learning
CN114969439A (en) Model training and information retrieval method and device
CN115292530A (en) Remote sensing image overall management system
CN115660695A (en) Customer service personnel label portrait construction method and device, electronic equipment and storage medium
CN114781390A (en) Aspect-level emotion analysis method and device
CN117938951B (en) Information pushing method, device, computer equipment and storage medium
Bastida et al. Multimodal object recognition using deep learning representations extracted from images and smartphone sensors
Horváth Object recognition based on Google's reverse image search and image similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant