CN114201621A - Cross-modal retrieval model construction and retrieval method based on image-text cooperative attention - Google Patents

Cross-modal retrieval model construction and retrieval method based on image-text cooperative attention

Info

Publication number
CN114201621A
Authority
CN
China
Prior art keywords
sample
text
image
modal
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111406136.9A
Other languages
Chinese (zh)
Other versions
CN114201621B (en)
Inventor
单丽莉
苏宇
孙承杰
林磊
刘秉权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Konami Sports Club Co Ltd
Original Assignee
Harbin Institute of Technology
People Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology, People Co Ltd filed Critical Harbin Institute of Technology
Priority to CN202111406136.9A priority Critical patent/CN114201621B/en
Publication of CN114201621A publication Critical patent/CN114201621A/en
Application granted granted Critical
Publication of CN114201621B publication Critical patent/CN114201621B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a cross-modal retrieval model construction and retrieval method based on image-text cooperative attention, comprising the following steps: acquiring training images and training texts, and extracting the local features of the image samples and text samples; mapping all local image features of an image sample and all local text features of a text sample into feature vectors, representing the feature vectors of each sample as a matrix, and obtaining the Key matrix, Query matrix and Value matrix of each; calculating cross-modal attention features and intra-modal attention features of the image sample and the text sample from these matrices; fusing the cross-modal and intra-modal attention features to obtain the global feature representation of the image sample and the global feature representation of the text sample; and training a cross-modal retrieval model based on the global feature representations. The method can perform similarity matching directly on data of different modalities and achieves high matching accuracy.

Description

Cross-modal retrieval model construction and retrieval method based on image-text cooperative attention
Technical Field
The invention relates to the technical field of cross-modal retrieval of image texts, in particular to a cross-modal retrieval model construction and retrieval method based on image-text cooperative attention.
Background
With the explosive growth of multimedia data of different modalities, such as text and images, on the Internet, single-modality retrieval can no longer meet users' needs, and cross-modal retrieval has emerged. Cross-modal retrieval uses data of one modality as the query to retrieve related data of at least one other modality. For example, a product displayed on an e-commerce website usually includes both a product image and textual descriptions such as the product category, name, attributes and detailed description; when a user searches for content of interest on the website, the user may wish to retrieve data of multiple modalities, such as the text and images related to the product, to obtain more information about it.
However, the representation forms of data in different modalities are inconsistent, so the similarity between such data cannot be measured directly, which leads to the low matching accuracy of conventional cross-modal retrieval methods.
Disclosure of Invention
The problem solved by the invention is that the inconsistent representation forms of data in different modalities prevent direct similarity measurement between them, which results in the low matching accuracy of conventional cross-modal retrieval methods.
The invention provides a cross-modal retrieval model construction method based on image-text cooperative attention, which comprises the following steps:
acquiring a training image and a training text, wherein the training image is an image sample with a class label, and the training text is a text sample with the class label;
extracting local image features of the image samples and extracting local text features of the text samples;
respectively mapping all local image features of the image sample and all local text features of the text sample into feature vectors, respectively representing the feature vectors of the image sample and the feature vectors of the text sample into matrixes, and respectively obtaining a Key matrix, a Query matrix and a Value matrix of each through a full connection layer;
calculating cross-modal attention scores of the image sample and the text sample based on a Key matrix, a Query matrix and a Value matrix of the image sample and the text sample, and generating cross-modal attention features of the image sample and the text sample based on the cross-modal attention scores;
calculating intra-modality attention scores of the image sample and the text sample based on a Key matrix, a Query matrix and a Value matrix of the image sample and the text sample, and generating intra-modality attention features of the image sample and the text sample based on the intra-modality attention scores;
fusing the cross-modal attention feature and the intra-modal attention feature to respectively obtain a global feature representation of the image sample and a global feature representation of the text sample;
and training to obtain a cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample.
Optionally, the calculating cross-modal attention scores for the image sample and the text sample based on the Key matrix, the Query matrix, and the Value matrix of the image sample and the text sample, respectively, and the generating cross-modal attention characteristics for the image sample and the text sample based on the cross-modal attention scores comprises:
performing inner product operations between the Key matrix of the image sample and the Query matrix of the text sample, and between the Key matrix of the text sample and the Query matrix of the image sample, normalizing the results, and calculating through softmax a weight matrix of the image sample to the text sample and a weight matrix of the text sample to the image sample, wherein the cross-modal attention score comprises the weight matrix of the text sample to the image sample and the weight matrix of the image sample to the text sample;
taking the weight matrix of the text sample to the image sample as a score, and performing weighted summation operation on the Value matrix of the image sample to obtain the cross-modal attention feature of the text sample;
and taking the weight matrix of the image sample to the text sample as a score, and performing weighted summation operation on the Value matrix of the text sample to obtain the cross-modal attention feature of the image sample.
Optionally, the calculating intra-modality attention scores for the image sample and the text sample based on the Key matrix, the Query matrix, and the Value matrix of the image sample and the text sample, respectively, and the generating intra-modality attention features for the image sample and the text sample based on the intra-modality attention scores includes:
performing inner product operations between the Key matrix and the Query matrix of the image sample, and between the Key matrix and the Query matrix of the text sample, normalizing the results, and calculating through softmax a weight matrix of the image sample and a weight matrix of the text sample, wherein the intra-modal attention score comprises the weight matrix of the image sample and the weight matrix of the text sample;
taking the weight matrix of the image sample as a score, and performing weighted summation operation on the Value matrix of the image sample to obtain the intra-modal attention feature of the image sample;
and taking the weight matrix of the text sample as a score, and performing weighted summation operation on the Value matrix of the text sample to obtain the intra-modal attention feature of the text sample.
Optionally, the training to obtain the cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample includes a label prediction training task, which specifically includes:
inputting a global feature representation of the image sample or a global feature representation of the text sample into a fully-connected layer, outputting a probability of each label by utilizing softmax, and taking a category label with the highest probability as a prediction label of the image sample or the text sample input into the fully-connected layer;
calculating a loss function for label prediction based on the prediction label and the true class label carried by the image sample or the text sample.
Optionally, the training to obtain the cross-modal search model based on the global feature representation of the image sample and the global feature representation of the text sample further includes a metric learning training task, which specifically includes:
constructing a training sample set for metric learning, wherein one piece of training data in the training sample set comprises anchor samples, a first preset number of cross-modal positive samples of the anchor samples and a second preset number of cross-modal negative samples of the anchor samples, and the anchor samples are the image samples or the text samples;
and respectively calculating the distances between the anchor sample and all positive samples and all negative samples according to the global feature representation of the anchor sample, the global feature representation of the positive sample of the cross-modal anchor sample and the global feature representation of the negative sample of the cross-modal anchor sample, and calculating a loss function of metric learning based on the distances.
Optionally, the training to obtain the cross-modal search model based on the global feature representation of the image sample and the global feature representation of the text sample further includes:
and training the cross-modal retrieval model in a multi-task learning manner based on the loss function of label prediction and the loss function of metric learning.
Optionally, the extracting the local image feature of the image sample includes:
inputting the image sample into a pre-trained Faster R-CNN model and extracting the local image features of the image sample, wherein the Faster R-CNN model comprises a feature extraction network, an RPN network and a region-of-interest pooling network; the feature extraction network extracts features of the image sample and inputs the extracted feature map into the RPN network, the RPN network selects a preset number of regions of interest and marks them with rectangular boxes, and the region-of-interest pooling network extracts the features of the regions of interest marked by the RPN network as the local image features of the image sample.
Optionally, the extracting the local text feature of the text sample includes:
performing word segmentation on the text sample;
obtaining a word vector of each word after word segmentation by using Word2Vec;
and inputting the word vector into a Bi-LSTM network, and acquiring the feature vector representation of each word as the local text feature of the text sample.
The invention also provides a cross-modal retrieval method based on image-text cooperative attention, which comprises the following steps:
acquiring a given retrieval condition element, wherein the retrieval condition element is a retrieval image or a retrieval text;
inputting the retrieval condition elements and elements in a preset retrieval range into a cross-modal retrieval model constructed by the image-text cooperative attention based cross-modal retrieval model construction method, and outputting a third preset number of elements with the highest similarity to the retrieval condition elements in the retrieval range as retrieval results by the cross-modal retrieval model, wherein the elements in the retrieval range comprise images and/or texts.
The invention further provides a cross-modal retrieval device based on image-text cooperative attention, which comprises a computer readable storage medium and a processor, wherein the computer readable storage medium stores a computer program, and the computer program is read by the processor and runs to realize the above cross-modal retrieval model construction method based on image-text cooperative attention or the above cross-modal retrieval method based on image-text cooperative attention.
According to the method, the local image features of the image sample and the local text features of the text sample are extracted and mapped to feature vectors. A cross-modal attention mechanism captures fine-grained interactions between data of the two modalities, while an intra-modal attention mechanism captures the correlation between image regions and the semantic correlation of the text context. The cross-modal and intra-modal attention features are then fused into global feature representations in which images and texts have a consistent form, so that the similarity between image data and text data can be measured directly. The trained cross-modal retrieval model can therefore match data of different modalities directly and with high accuracy.
Drawings
FIG. 1 is a schematic diagram of a cross-modal retrieval model construction method flow based on image-text cooperative attention in an embodiment of the present invention;
FIG. 2 is a diagram illustrating an example of a text sample in an embodiment of the invention;
FIG. 3 is a schematic diagram of an architecture of a cross-modal search model construction method based on image-text cooperative attention according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
As shown in fig. 1, in an embodiment of the present invention, the method for constructing a cross-modal search model based on image-text cooperative attention includes:
step S10, acquiring a training image and a training text, wherein the training image is an image sample with a category label, and the training text is a text sample with a category label.
The training data includes a plurality of training images and training texts, and an image ID may be assigned to each training image and a text ID may be assigned to each training text to distinguish between different training images and different training texts. The training images and the training texts are both provided with class labels, as shown in fig. 2, the training texts can also include image IDs corresponding to the texts in addition to the text class labels and the text descriptions, so as to facilitate the subsequent construction of positive samples in the metric learning training task.
Step S20, extracting local image features of the image sample, and extracting local text features of the text sample.
The local image features of an image sample here refer to regional features, i.e. the features of a plurality of regions of the image sample. They can be extracted with algorithms such as R-CNN, Fast R-CNN or Faster R-CNN.
The local text features of a text sample here refer to the feature representation of each word of the text sample. They can be extracted with algorithms such as Word2Vec and Bi-LSTM.
Step S30, mapping all local image features of the image sample and all local text features of the text sample into feature vectors, respectively representing the feature vectors of the image sample and the feature vectors of the text sample into matrices, and then obtaining a Key matrix, a Query matrix, and a Value matrix of each via a full connection layer.
An image sample has one or more local image features, and a text sample has one or more local text features. Each local image feature of an image sample is mapped to a feature vector, specifically by feeding it into a fully connected layer; likewise, each local text feature of a text sample is mapped to a feature vector through a fully connected layer.
Optionally, the feature vector of the image sample and the feature vector of the text sample have the same dimension.
As shown in fig. 4, if an image sample has k local image features, a text sample has j local text features, and the mapped feature vectors have dimension d, then the local image features of the image sample are mapped to k d-dimensional feature vectors and the local text features of the text sample are mapped to j d-dimensional feature vectors.
Mapping the local image features of the image sample and the local text features of the text sample into vectors of the same dimension facilitates the subsequent interaction between the image sample and the text sample and the computation of their similarity.
As shown in fig. 4, the k d-dimensional feature vectors of an image sample are represented as a matrix P, and the Key matrix, Query matrix and Value matrix of P are obtained through fully connected (Linear) layers, which can be expressed as:
P_K = Linear(P; θ_PK); P_Q = Linear(P; θ_PQ); P_V = Linear(P; θ_PV),
where P_K is the Key matrix of P, P_Q the Query matrix of P, P_V the Value matrix of P, and θ_PK, θ_PQ, θ_PV are the network weight parameters of the fully connected layers.
As shown in fig. 4, the j d-dimensional feature vectors of a text sample are represented as a matrix T, and the Key matrix, Query matrix and Value matrix of T are obtained through fully connected (Linear) layers, which can be expressed as:
T_K = Linear(T; θ_TK); T_Q = Linear(T; θ_TQ); T_V = Linear(T; θ_TV),
where T_K is the Key matrix of T, T_Q the Query matrix of T, T_V the Value matrix of T, and θ_TK, θ_TQ, θ_TV are the network weight parameters of the fully connected layers.
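A minimal PyTorch sketch of step S30 is given below for illustration; it assumes the local features have already been mapped into a shared dimension d, and the concrete sizes d, k and j as well as the variable names are placeholders rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

d = 512                       # shared feature dimension (assumed value)
k, j = 36, 20                 # number of image regions / text words (assumed values)

# One linear projection per matrix and per modality, as in step S30.
proj_img = nn.ModuleDict({name: nn.Linear(d, d) for name in ("K", "Q", "V")})
proj_txt = nn.ModuleDict({name: nn.Linear(d, d) for name in ("K", "Q", "V")})

P = torch.randn(k, d)         # matrix P: the k local image feature vectors of one image sample
T = torch.randn(j, d)         # matrix T: the j local text feature vectors of one text sample

P_K, P_Q, P_V = (proj_img[name](P) for name in ("K", "Q", "V"))
T_K, T_Q, T_V = (proj_txt[name](T) for name in ("K", "Q", "V"))
```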
Step S40, calculating cross-modal attention scores of the image sample and the text sample based on the Key matrix, the Query matrix, and the Value matrix of the image sample and the text sample, and generating cross-modal attention features of the image sample and the text sample based on the cross-modal attention scores, respectively.
Further, the step S40 includes:
performing inner product operations between the Key matrix of the image sample and the Query matrix of the text sample, and between the Key matrix of the text sample and the Query matrix of the image sample, normalizing the results, and calculating through softmax the weight matrix of the image sample to the text sample and the weight matrix of the text sample to the image sample, wherein the cross-modal attention scores comprise the weight matrix of the text sample to the image sample and the weight matrix of the image sample to the text sample.
The inner product of the Key matrix of the image sample and the Query matrix of the text sample (P_K T_Q^T) is computed, scaled by 1/√d, and passed through softmax, so the weight matrix of the image sample to the text sample can be expressed as:
W_PT = softmax(P_K T_Q^T / √d),
performing inner product operation on a Key matrix of a text sample and a Query matrix of an image sample (the formula is represented as T)KPQ) And normalizing the obtained inner product operation result (formula is expressed as
Figure BDA0003372880830000081
) The weight matrix of text samples to image samples can be expressed as:
Figure BDA0003372880830000082
where P_K is the Key matrix of the image sample, T_Q the Query matrix of the text sample, T_K the Key matrix of the text sample, P_Q the Query matrix of the image sample, and d the feature vector dimension of the image sample and the text sample.
Taking the weight matrix of the image sample to the text sample as the score, a weighted summation over the Value matrix of the text sample gives the cross-modal attention feature of the image sample:
P_inter = W_PT × T_V,
where P_inter is the cross-modal attention feature of the image sample, W_PT the weight matrix of the image sample to the text sample, and T_V the Value matrix of the text sample.
Taking the weight matrix of the text sample to the image sample as the score, a weighted summation over the Value matrix of the image sample gives the cross-modal attention feature of the text sample:
T_inter = W_TP × P_V,
where T_inter is the cross-modal attention feature of the text sample, W_TP the weight matrix of the text sample to the image sample, and P_V the Value matrix of the image sample.
The inner product between the Key matrix of the image sample and the Query matrix of the text sample yields attention weights between every local image feature and every local text feature, which determine which parts of the input deserve attention so that limited information-processing resources are allocated to the important parts. Scaling the inner product keeps the resulting attention scores from being influenced by the feature vector dimension d.
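Continuing the sketch above with the same assumed tensors P_K, P_Q, P_V, T_K, T_Q, T_V and dimension d, the cross-modal attention of step S40 can be written as:

```python
import torch.nn.functional as F

scale = d ** 0.5              # √d, so the scores do not grow with the feature dimension

# Cross-modal attention scores (weight matrices)
W_PT = F.softmax(P_K @ T_Q.t() / scale, dim=-1)   # image-to-text weights, shape (k, j)
W_TP = F.softmax(T_K @ P_Q.t() / scale, dim=-1)   # text-to-image weights, shape (j, k)

# Cross-modal attention features
P_inter = W_PT @ T_V          # (k, d) cross-modal attention feature of the image sample
T_inter = W_TP @ P_V          # (j, d) cross-modal attention feature of the text sample
```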
Step S50, calculating intra-modal attention scores of the image sample and the text sample based on the Key matrix, the Query matrix, and the Value matrix of the image sample and the text sample, respectively, and generating intra-modal attention features of the image sample and the text sample based on the intra-modal attention scores.
Further, the step S50 includes:
and performing inner product operation on the Key matrix and the Query matrix of the image sample and the Key matrix and the Query matrix of the text sample respectively, normalizing the inner product operation, and calculating the weight matrix of the image sample and the weight matrix of the text sample respectively through softmax, wherein the intra-modal attention score comprises the weight matrix of the image sample and the weight matrix of the text sample.
The inner product of the Key matrix and the Query matrix of the image sample (P_K P_Q^T) is computed, scaled by 1/√d, and passed through softmax, so the weight matrix of the image sample can be expressed as:
W_PP = softmax(P_K P_Q^T / √d),
where W_PP is the weight matrix of the image sample, P_K the Key matrix of the image sample, P_Q the Query matrix of the image sample, and d the feature vector dimension of the image sample.
The inner product of the Key matrix and the Query matrix of the text sample (T_K T_Q^T) is computed, scaled by 1/√d, and passed through softmax, so the weight matrix of the text sample can be expressed as:
W_TT = softmax(T_K T_Q^T / √d),
where W_TT is the weight matrix of the text sample, T_Q the Query matrix of the text sample, T_K the Key matrix of the text sample, and d the feature vector dimension of the text sample.
Taking the weight matrix of the image sample as the score, a weighted summation over the Value matrix of the image sample gives the intra-modal attention feature of the image sample:
P_intra = W_PP × P_V,
where P_intra is the intra-modal attention feature of the image sample and P_V the Value matrix of the image sample.
Taking the weight matrix of the text sample as the score, a weighted summation over the Value matrix of the text sample gives the intra-modal attention feature of the text sample:
T_intra = W_TT × T_V,
where T_intra is the intra-modal attention feature of the text sample and T_V the Value matrix of the text sample.
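Under the same assumptions as the sketches above, the intra-modal attention of step S50 follows the same pattern within each modality:

```python
# Intra-modal attention scores and features, computed within each modality
W_PP = F.softmax(P_K @ P_Q.t() / scale, dim=-1)   # (k, k) image weight matrix
W_TT = F.softmax(T_K @ T_Q.t() / scale, dim=-1)   # (j, j) text weight matrix

P_intra = W_PP @ P_V          # (k, d) intra-modal attention feature of the image sample
T_intra = W_TT @ T_V          # (j, d) intra-modal attention feature of the text sample
```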
And step S60, fusing the cross-modal attention feature and the intra-modal attention feature to respectively obtain a global feature representation of the image sample and a global feature representation of the text sample.
Specifically, the intra-modal and cross-modal attention features of the image sample are fused by concatenating the two, and a max-pooling layer then reduces the multiple local features of the image sample to a single global feature; for example, k d-dimensional local features are reduced to one d-dimensional global feature representation. The intra-modal and cross-modal attention features of the text sample are fused in the same way, and the global feature representation of the text sample is obtained through a max-pooling layer that plays the same role as the one used for the image sample, which is not repeated here. This can be expressed as:
P_final = MaxPooling([P_inter, P_intra]),
T_final = MaxPooling([T_inter, T_intra]),
where P_final is the global feature representation of the image sample and T_final the global feature representation of the text sample. Fig. 4 shows an example of fusing P_intra (computed from W_PP and P_V) and P_inter (computed from W_PT and T_V).
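A sketch of the fusion in step S60, again under the assumptions above; stacking the two attention features along the local-feature axis before max pooling is one plausible reading of the concatenation described here, chosen so that the pooled global feature keeps dimension d:

```python
# Stack cross-modal and intra-modal features along the local-feature axis,
# then max-pool over that axis to obtain one d-dimensional global feature.
P_final = torch.cat([P_inter, P_intra], dim=0).max(dim=0).values   # (d,)
T_final = torch.cat([T_inter, T_intra], dim=0).max(dim=0).values   # (d,)
```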
And step S70, training to obtain a cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample.
After global feature representations of the image samples and the text samples are obtained, a cross-modal search model is trained by using a label prediction task and a metric learning task.
By extracting the local image features of the image sample and the local text features of the text sample and mapping them to feature vectors, the cross-modal attention mechanism captures fine-grained interactions between data of the two modalities and the intra-modal attention mechanism captures the correlation between image regions and the semantic correlation of the text context. Fusing the cross-modal and intra-modal attention features yields global feature representations in which images and texts have a consistent form, so that the similarity between image data and text data can be measured directly; the trained cross-modal retrieval model can therefore match data of different modalities directly and with high matching accuracy.
Optionally, as shown in fig. 3, step S70 includes a label prediction training task, which specifically includes:
inputting a global feature representation of the image sample or a global feature representation of the text sample into a fully-connected layer, outputting a probability of each label by utilizing softmax, and taking a category label with the highest probability as a prediction label of the image sample or the text sample input into the fully-connected layer;
calculating a loss function for label prediction based on the prediction label and the true class label carried by the image sample or the text sample.
Using the cross entropy loss function as the loss function for label prediction, it can be expressed as:
L_label = -(1/n) Σ_{i=1}^{n} ( y_i · log p_vi + y_i · log p_ti ),
where L_label is the label prediction loss, n the number of samples in a batch, y_i the true label of each sample, p_vi the predicted label distribution generated for the image sample, and p_ti the predicted label distribution generated for the text sample.
The label prediction task ensures that, within a modality, samples with the same label have similar feature representations and samples with different labels have different feature representations.
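A hedged sketch of the label prediction task, reusing the pooled global features from above; the number of classes, the shared classification head and the dummy label are illustrative assumptions:

```python
num_classes = 10                              # assumed number of category labels
classifier = nn.Linear(d, num_classes)        # label-prediction head (shared here for brevity)

y = torch.tensor([3])                         # dummy true class label of this image/text pair

# F.cross_entropy applies softmax internally and compares against the true label
loss_label = F.cross_entropy(classifier(P_final).unsqueeze(0), y) \
           + F.cross_entropy(classifier(T_final).unsqueeze(0), y)
```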
Optionally, as shown in fig. 3, step S70 includes a metric learning training task, which specifically includes:
constructing a training sample set for metric learning, wherein one piece of training data in the training sample set comprises anchor samples, a first preset number of cross-modal positive samples of the anchor samples and a second preset number of cross-modal negative samples of the anchor samples, and the anchor samples are the image samples or the text samples; and respectively calculating the distances between the anchor sample and all positive samples and all negative samples according to the global feature representation of the anchor sample, the global feature representation of the positive sample of the cross-modal anchor sample and the global feature representation of the negative sample of the cross-modal anchor sample, and calculating a loss function of metric learning based on the distances.
When the anchor sample is an image sample, the cross-modal positive sample refers to a text-form positive sample, and the cross-modal negative sample refers to a text-form negative sample; when the anchor sample is a text sample, a positive sample across modalities thereof refers to a positive sample in the form of an image, and a negative sample across modalities thereof refers to a negative sample in the form of an image.
The first preset number may be greater than or equal to 1, with a preferred value of 1. The second preset number may be greater than or equal to 1, with a preferred value of m-1, where m is the number of category labels. For the anchor sample, among all the class labels only one class (the class label of the anchor sample) is positive and the other m-1 classes are negative, and one sample from each class is selected for calculating the loss function.
In one embodiment, for the image and text data of a batch, an image sample is taken as the anchor sample and a text sample with the same class label is randomly sampled as the positive sample. Assuming there are m different class semantic labels, text samples whose classes differ from the anchor's class are sampled as negative samples, one per class, giving m-1 negative samples. Likewise, a text sample is taken as the anchor sample, an image sample with the same semantic label is randomly sampled as the positive sample, and image samples of the other classes are sampled as negative samples, again giving m-1 negative samples.
The metric learning loss is obtained from the distances between the anchor sample and all positive samples and the distances between the anchor sample and all negative samples; specifically, the distance from the anchor sample to the positive samples minus the distance from the anchor sample to the negative samples is used as the metric learning loss, and training with this loss reduces the distance between positive pairs while enlarging the distance between negative pairs. Optionally, the distance between samples is defined using cosine similarity.
Further, the metric learning loss is defined as follows:
L(v) = Σ_{i=1}^{M-1} [ d(v_T, t^+) - d(v_T, t_i) ],
L(t) = Σ_{i=1}^{M-1} [ d(t_T, v^+) - d(t_T, v_i) ],
L_metric = L(v) + L(t),
where d(·,·) is the cosine-similarity-based distance between samples; L(v) is the metric learning loss for image samples, v_T the selected image sample, t^+ the text sample of the same class as the image sample (positive sample), M the number of classes, and t_i the text samples of different classes (negative samples); L(t) is the metric learning loss for text samples, t_T the selected text sample, v^+ the image sample of the same class as the text sample (positive sample), and v_i the image samples of different classes (negative samples). L_metric is the overall metric learning loss.
Samples with similar semantics in different modalities have similar feature representations, while samples with different semantics in different modalities have different feature representations, which are guaranteed by a metric learning task.
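A sketch of the metric learning loss under the same assumptions, using cosine-similarity-based distances; summing "distance to the positive minus distance to each negative" over the negatives follows the description above and is an assumption where the patent's exact formula is not shown:

```python
def cosine_distance(a, b):
    # distance between two global feature vectors, defined via cosine similarity
    return 1.0 - F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))[0]

def metric_loss(anchor, positive, negatives):
    # distance to the positive minus distance to each negative, summed over the negatives
    return sum(cosine_distance(anchor, positive) - cosine_distance(anchor, neg)
               for neg in negatives)

# image anchor with a text positive and text negatives, plus the symmetric text anchor
neg_txt = [torch.randn(d) for _ in range(4)]   # dummy cross-modal negatives (m-1 of them)
neg_img = [torch.randn(d) for _ in range(4)]
loss_metric = metric_loss(P_final, T_final, neg_txt) + metric_loss(T_final, P_final, neg_img)
```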
The step S70 further includes: training the cross-modal retrieval model in a multi-task learning manner based on the loss function of label prediction and the loss function of metric learning.
Specifically, the loss function of the multitask learning is defined as follows:
L = α·L_label + β·L_metric,
where α and β are hyper-parameters used to balance the weights of the two task loss functions.
Training the cross-modal retrieval model in a multi-task learning manner ensures that, within each modality, samples with the same label have similar feature representations and samples with different labels have different feature representations, while across modalities samples with similar semantics have similar feature representations and samples with different semantics have different feature representations.
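Putting the two losses together, one multi-task training step might look like the following; alpha and beta are assumed values of the balancing hyper-parameters:

```python
alpha, beta = 1.0, 1.0                          # assumed loss weights (α and β)
loss = alpha * loss_label + beta * loss_metric  # L = αL_label + βL_metric
loss.backward()                                 # one multi-task training step (optimizer step omitted)
```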
Optionally, as shown in fig. 3, the extracting the local image feature of the image sample includes:
inputting the image sample into a pre-trained Faster R-CNN model and extracting the local image features of the image sample, wherein the Faster R-CNN model comprises a feature extraction network, an RPN network and a region-of-interest pooling network; the feature extraction network extracts features of the image sample and inputs the extracted feature map into the RPN network, the RPN network selects a preset number of regions of interest and marks them with rectangular boxes, and the region-of-interest pooling network extracts the features of the regions of interest marked by the RPN network as the local image features of the image sample.
The feature extraction network consists of a group of convolution layers, a relu activation function layer and a pooling layer.
After the extracted feature map is input into the RPN network, the RPN works in two parts: the first part selects k regions of interest through binary classification, specifically through softmax classification, and the second part marks the approximate positions of these regions of interest with rectangular boxes.
The region-of-interest pooling network then extracts, based on the region positions marked by the RPN network and the feature map produced by the feature extraction network, the features of the k regions, i.e. the local image features of the image sample.
Using the Faster R-CNN model allows the local image features to be extracted from the image sample efficiently and quickly.
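One way to obtain such region features from a pre-trained Faster R-CNN is sketched below with torchvision; the hook-based extraction is an assumption of this write-up rather than the patent's stated implementation, and a recent torchvision release is assumed:

```python
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

captured = {}
# Capture the pooled per-region features produced inside the ROI heads.
handle = detector.roi_heads.box_head.register_forward_hook(
    lambda module, inputs, output: captured.update(regions=output))

with torch.no_grad():
    detector([torch.rand(3, 480, 640)])        # dummy image; a real image sample would go here

region_features = captured["regions"]          # (num_proposals, feature_dim) local image features
handle.remove()
```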
Optionally, as shown in fig. 3, the extracting the local text feature of the text sample includes:
and performing word segmentation on the text sample. The specific word segmentation method can select algorithms such as hidden markov models, jieba word segmentation and the like, which are all in the prior art and are not described herein.
A word vector of each word after segmentation is obtained using Word2Vec, specifically using the Skip-Gram model in Word2Vec. The related content belongs to the prior art and is not detailed here.
The word vectors are input into a Bi-LSTM network, and the feature vector representation of each word is obtained as the local text features of the text sample. Bi-LSTM is a type of RNN suitable for modeling sequential data such as the text data here: by learning which information to remember and which to forget it captures longer-distance dependencies, and by combining a forward LSTM with a backward LSTM it captures bidirectional semantic dependencies.
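A sketch of the text branch under these choices, using jieba for segmentation, gensim's Skip-Gram Word2Vec and a PyTorch Bi-LSTM; the corpus, vector size and hidden size are placeholder assumptions:

```python
import jieba
import torch
import torch.nn as nn
from gensim.models import Word2Vec

corpus = ["这是一段商品描述", "这是另一段商品描述"]                  # placeholder corpus
tokenized = [jieba.lcut(sentence) for sentence in corpus]

# Skip-Gram word vectors (sg=1 selects the Skip-Gram model in gensim's Word2Vec)
w2v = Word2Vec(tokenized, vector_size=300, sg=1, min_count=1)

# Bi-LSTM whose per-word outputs serve as the local text features
bilstm = nn.LSTM(input_size=300, hidden_size=150, bidirectional=True, batch_first=True)

words = tokenized[0]
vectors = torch.tensor([w2v.wv[word] for word in words]).unsqueeze(0)   # (1, j, 300)
local_text_features, _ = bilstm(vectors)                                # (1, j, 300)
```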
The invention also provides a cross-modal retrieval method based on image-text cooperative attention, which comprises the following steps:
acquiring a given retrieval condition element, wherein the retrieval condition element is a retrieval image or a retrieval text; inputting the retrieval condition elements and elements in a preset retrieval range into a cross-modal retrieval model constructed by the cross-modal retrieval model construction method based on image-text cooperative attention, and outputting a third preset number of elements with the highest similarity to the retrieval condition elements in the retrieval range as retrieval results by the cross-modal retrieval model, wherein the elements in the retrieval range comprise images and/or texts.
After a given retrieval condition element is obtained, it is input into a cross-modal retrieval model constructed/trained by the above construction method based on image-text cooperative attention, the elements within a preset retrieval range are input into the same model, and the cross-modal retrieval model outputs the third preset number of elements within the retrieval range that have the highest similarity to the retrieval condition element.
The preset search range may be limited to include only image elements, only text elements, or both image elements and text elements. In one embodiment, the given search condition element is an image, and the preset search range only includes text elements, then the cross-modal search model outputs a preset number of texts with the highest similarity to the search condition element within the search range; in another embodiment, if the given search condition element is a text and the preset search range includes only image elements, the cross-modal search model outputs an image having the highest similarity to the search condition element within the search range.
The retrieval condition element and the elements within the preset retrieval range are input into a cross-modal retrieval model constructed by the image-text cooperative attention based construction method. The model uses the cross-modal attention mechanism to capture fine-grained interactions between data of the two modalities and the intra-modal attention mechanism to capture the correlation between image regions and the semantic correlation of the text context, and fuses the cross-modal and intra-modal attention features into global feature representations in which images and texts have a consistent form, so that the similarity between images and texts of different modalities can be measured directly and the accuracy of cross-modal retrieval is improved.
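At inference time, retrieval then reduces to ranking candidate elements by the similarity of their global features to the query's global feature; the sketch below reuses names from the sketches above, with cosine similarity and top-k selection as assumptions consistent with the distance used during training:

```python
def retrieve(query_feature, candidate_features, top_k=5):
    # rank candidate elements (images and/or texts) by cosine similarity to the query
    sims = F.cosine_similarity(query_feature.unsqueeze(0), candidate_features, dim=-1)
    return sims.topk(min(top_k, candidate_features.size(0))).indices

candidates = torch.randn(100, d)            # global features of the elements in the search range
top_elements = retrieve(P_final, candidates)  # e.g. an image query retrieving the most similar elements
```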
In an embodiment of the present invention, the cross-modal retrieval apparatus based on image-text cooperative attention includes a computer-readable storage medium storing a computer program and a processor, wherein the computer program is read and executed by the processor to implement the above cross-modal retrieval model construction method based on image-text cooperative attention or the above cross-modal retrieval method based on image-text cooperative attention. Compared with the prior art, the advantages of the apparatus are the same as those of the cross-modal retrieval method based on image-text cooperative attention and are not repeated here.
The reader should understand that in the description of this specification, reference to the description of the terms "one embodiment," "some embodiments," "an example," "a specific example" or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A cross-modal retrieval model construction method based on image-text cooperative attention is characterized by comprising the following steps:
acquiring a training image and a training text, wherein the training image is an image sample with a class label, and the training text is a text sample with the class label;
extracting local image features of the image samples and extracting local text features of the text samples;
respectively mapping all local image features of the image sample and all local text features of the text sample into feature vectors, respectively representing the feature vectors of the image sample and the feature vectors of the text sample into matrixes, and respectively obtaining a Key matrix, a Query matrix and a Value matrix of each through a full connection layer;
calculating cross-modal attention scores of the image sample and the text sample based on a Key matrix, a Query matrix and a Value matrix of the image sample and the text sample, and generating cross-modal attention features of the image sample and the text sample based on the cross-modal attention scores;
calculating intra-modality attention scores of the image sample and the text sample based on a Key matrix, a Query matrix and a Value matrix of the image sample and the text sample, and generating intra-modality attention features of the image sample and the text sample based on the intra-modality attention scores;
fusing the cross-modal attention feature and the intra-modal attention feature to respectively obtain a global feature representation of the image sample and a global feature representation of the text sample;
and training to obtain a cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample.
2. The method for constructing a cross-modal search model based on image-text cooperative attention according to claim 1, wherein the calculating cross-modal attention scores of the image sample and the text sample based on the Key matrix, the Query matrix, and the Value matrix of the image sample and the text sample, respectively, and the generating the cross-modal attention characteristics of the image sample and the text sample based on the cross-modal attention scores comprises:
performing inner product operations between the Key matrix of the image sample and the Query matrix of the text sample, and between the Key matrix of the text sample and the Query matrix of the image sample, normalizing the results, and calculating through softmax a weight matrix of the image sample to the text sample and a weight matrix of the text sample to the image sample, wherein the cross-modal attention score comprises the weight matrix of the text sample to the image sample and the weight matrix of the image sample to the text sample;
taking the weight matrix of the text sample to the image sample as a score, and performing weighted summation operation on the Value matrix of the image sample to obtain the cross-modal attention feature of the text sample;
and taking the weight matrix of the image sample to the text sample as a score, and performing weighted summation operation on the Value matrix of the text sample to obtain the cross-modal attention feature of the image sample.
3. The method for constructing a cross-modal search model based on image-text cooperative attention according to claim 1, wherein the calculating intra-modal attention scores of the image sample and the text sample based on the Key matrix, Query matrix and Value matrix of the image sample and the text sample, respectively, and the generating intra-modal attention features of the image sample and the text sample based on the intra-modal attention scores comprises:
performing inner product operations between the Key matrix and the Query matrix of the image sample, and between the Key matrix and the Query matrix of the text sample, normalizing the results, and calculating through softmax a weight matrix of the image sample and a weight matrix of the text sample, wherein the intra-modal attention score comprises the weight matrix of the image sample and the weight matrix of the text sample;
taking the weight matrix of the image sample as a score, and performing weighted summation operation on the Value matrix of the image sample to obtain the intra-modal attention feature of the image sample;
and taking the weight matrix of the text sample as a score, and performing weighted summation operation on the Value matrix of the text sample to obtain the intra-modal attention feature of the text sample.
4. The method for constructing a cross-modal search model based on image-text cooperative attention as claimed in claim 1, wherein the training to obtain the cross-modal search model based on the global feature representation of the image sample and the global feature representation of the text sample includes a label prediction training task, specifically including:
inputting a global feature representation of the image sample or a global feature representation of the text sample into a fully-connected layer, outputting a probability of each label by utilizing softmax, and taking a category label with the highest probability as a prediction label of the image sample or the text sample input into the fully-connected layer;
calculating a loss function for label prediction based on the prediction label and the true class label carried by the image sample or the text sample.
5. The method for constructing a cross-modal search model based on image-text cooperative attention as claimed in claim 4, wherein the training to obtain the cross-modal search model based on the global feature representation of the image sample and the global feature representation of the text sample further comprises a metric learning training task, specifically comprising:
constructing a training sample set for metric learning, wherein one piece of training data in the training sample set comprises anchor samples, a first preset number of cross-modal positive samples of the anchor samples and a second preset number of cross-modal negative samples of the anchor samples, and the anchor samples are the image samples or the text samples;
and respectively calculating the distances between the anchor sample and all positive samples and all negative samples according to the global feature representation of the anchor sample, the global feature representation of the positive sample of the cross-modal anchor sample and the global feature representation of the negative sample of the cross-modal anchor sample, and calculating a loss function of metric learning based on the distances.
6. The method as claimed in claim 5, wherein the training of the cross-modal search model based on the global feature representation of the image sample and the global feature representation of the text sample further comprises:
and training the cross-modal retrieval model in a multi-task learning manner based on the loss function of label prediction and the loss function of metric learning.
7. The method for constructing a cross-modal search model based on image-text cooperative attention according to any one of claims 1 to 6, wherein the extracting the local image features of the image sample comprises:
inputting the image sample into a pre-trained Faster R-CNN model and extracting the local image features of the image sample, wherein the Faster R-CNN model comprises a feature extraction network, an RPN network and a region-of-interest pooling network; the feature extraction network extracts features of the image sample and inputs the extracted feature map into the RPN network, the RPN network selects a preset number of regions of interest and marks them with rectangular boxes, and the region-of-interest pooling network extracts the features of the regions of interest marked by the RPN network as the local image features of the image sample.
8. The method for constructing a cross-modal search model based on image-text cooperative attention according to any one of claims 1 to 6, wherein the extracting the local text features of the text sample comprises:
performing word segmentation on the text sample;
obtaining a word vector of each word after word segmentation by using Word2Vec;
and inputting the word vector into a Bi-LSTM network, and acquiring the feature vector representation of each word as the local text feature of the text sample.
9. A cross-modal retrieval method based on image-text cooperative attention is characterized by comprising the following steps:
acquiring a given retrieval condition element, wherein the retrieval condition element is a retrieval image or a retrieval text;
inputting the retrieval condition elements and elements in a preset retrieval range into a cross-modal retrieval model constructed by the image-text cooperative attention based cross-modal retrieval model construction method according to any one of claims 1 to 8, and outputting a third preset number of elements with the highest similarity to the retrieval condition elements in the retrieval range as a retrieval result by the cross-modal retrieval model, wherein the elements in the retrieval range comprise images and/or texts.
10. A cross-modal retrieval apparatus based on image-text cooperative attention, comprising a computer-readable storage medium storing a computer program and a processor, wherein the computer program is read by the processor and executed to implement the cross-modal retrieval model construction method based on image-text cooperative attention according to any one of claims 1 to 8 or the cross-modal retrieval method based on image-text cooperative attention according to claim 9.
CN202111406136.9A 2021-11-24 2021-11-24 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention Active CN114201621B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111406136.9A CN114201621B (en) 2021-11-24 2021-11-24 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention

Publications (2)

Publication Number Publication Date
CN114201621A true CN114201621A (en) 2022-03-18
CN114201621B CN114201621B (en) 2024-04-02

Family

ID=80648805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111406136.9A Active CN114201621B (en) 2021-11-24 2021-11-24 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention

Country Status (1)

Country Link
CN (1) CN114201621B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
US20200401938A1 (en) * 2019-05-29 2020-12-24 The Board Of Trustees Of The Leland Stanford Junior University Machine learning based generation of ontology for structural and functional mapping
CN113392254A (en) * 2021-03-29 2021-09-14 西安理工大学 Image text retrieval method based on context awareness
CN113095415A (en) * 2021-04-15 2021-07-09 齐鲁工业大学 Cross-modal hashing method and system based on multi-modal attention mechanism
CN113220919A (en) * 2021-05-17 2021-08-06 河海大学 Dam defect image text cross-modal retrieval method and model
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张天; 靳聪; 帖云; 李小兵: "Research on a Content Matching Method for Audio Databases Oriented to Cross-Modal Retrieval", Journal of Signal Processing (信号处理), no. 06, 31 December 2020 (2020-12-31), pages 180-190 *
邓一姣; 张凤荔; 陈学勤; 艾擎; 余苏?: "A Collaborative Attention Network Model for Cross-Modal Retrieval", Computer Science (计算机科学), no. 04, 31 December 2020 (2020-12-31), pages 60-65 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841243A (en) * 2022-04-02 2022-08-02 中国科学院上海高等研究院 Cross-modal retrieval model training method, cross-modal retrieval method, device and medium
CN114969405B (en) * 2022-04-30 2024-01-26 苏州浪潮智能科技有限公司 Cross-modal image-text mutual detection method
CN114969405A (en) * 2022-04-30 2022-08-30 苏州浪潮智能科技有限公司 Cross-modal image-text mutual inspection method
CN114663737A (en) * 2022-05-20 2022-06-24 浪潮电子信息产业股份有限公司 Object identification method and device, electronic equipment and computer readable storage medium
CN114691907A (en) * 2022-05-31 2022-07-01 上海蜜度信息技术有限公司 Cross-modal retrieval method, device and medium
CN114707007A (en) * 2022-06-07 2022-07-05 苏州大学 Image text retrieval method and device and computer storage medium
CN114707007B (en) * 2022-06-07 2022-08-30 苏州大学 Image text retrieval method and device and computer storage medium
CN115909317A (en) * 2022-07-15 2023-04-04 广东工业大学 Learning method and system for three-dimensional model-text joint expression
CN114973294B (en) * 2022-07-28 2022-10-21 平安科技(深圳)有限公司 Image-text matching method, device, equipment and storage medium
CN114973294A (en) * 2022-07-28 2022-08-30 平安科技(深圳)有限公司 Image-text matching method, device, equipment and storage medium
CN115017358B (en) * 2022-08-09 2022-11-04 南京理工大学 Cross-modal retrieval method and system for multi-modal interaction
CN115017358A (en) * 2022-08-09 2022-09-06 南京理工大学 Cross-modal retrieval method and system for multi-modal interaction
CN115238130A (en) * 2022-09-21 2022-10-25 之江实验室 Time sequence language positioning method and device based on modal customization cooperative attention interaction
CN115238130B (en) * 2022-09-21 2022-12-06 之江实验室 Time sequence language positioning method and device based on modal customization collaborative attention interaction
CN115658955A (en) * 2022-11-08 2023-01-31 苏州浪潮智能科技有限公司 Cross-media retrieval and model training method, device, equipment and menu retrieval system
CN115861995A (en) * 2023-02-08 2023-03-28 山东海量信息技术研究院 Visual question-answering method and device, electronic equipment and storage medium
CN115861995B (en) * 2023-02-08 2023-05-23 山东海量信息技术研究院 Visual question-answering method and device, electronic equipment and storage medium
CN116433727A (en) * 2023-06-13 2023-07-14 北京科技大学 Scalable single-stream tracking method based on staged continuous learning
CN116433727B (en) * 2023-06-13 2023-10-27 北京科技大学 Scalable single-stream tracking method based on staged continuous learning

Also Published As

Publication number Publication date
CN114201621B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN114201621B (en) Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN109388807B (en) Method, device and storage medium for identifying named entities of electronic medical records
WO2020034849A1 (en) Music recommendation method and apparatus, and computing device and medium
US9633045B2 (en) Image ranking based on attribute correlation
US11222055B2 (en) System, computer-implemented method and computer program product for information retrieval
US11514244B2 (en) Structured knowledge modeling and extraction from images
EP3143523B1 (en) Visual interactive search
WO2022116537A1 (en) News recommendation method and apparatus, and electronic device and storage medium
US10606883B2 (en) Selection of initial document collection for visual interactive search
CN110362723B (en) Topic feature representation method, device and storage medium
US9037600B1 (en) Any-image labeling engine
JP6884116B2 (en) Information processing equipment, information processing methods, and programs
CN112364204B (en) Video searching method, device, computer equipment and storage medium
CN108846097B (en) User interest tag representation method, article recommendation device and equipment
CN113657425A (en) Multi-label image classification method based on multi-scale and cross-modal attention mechanism
TW201220099A (en) Multi-modal approach to search query input
Yuan et al. Mining compositional features from GPS and visual cues for event recognition in photo collections
CN111539197A (en) Text matching method and device, computer system and readable storage medium
CN112528053A (en) Multimedia library classified retrieval management system
CN114332680A (en) Image processing method, video searching method, image processing device, video searching device, computer equipment and storage medium
Zhu et al. Multimodal sparse linear integration for content-based item recommendation
JP2012194691A (en) Re-learning method and program of discriminator, image recognition device
CN112182451A (en) Webpage content abstract generation method, equipment, storage medium and device
Aygun et al. Multimedia retrieval that works

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant