CN114201621B - Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention - Google Patents

Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention

Info

Publication number
CN114201621B
Authority
CN
China
Prior art keywords
sample
text
image
cross
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111406136.9A
Other languages
Chinese (zh)
Other versions
CN114201621A (en)
Inventor
单丽莉
苏宇
孙承杰
林磊
刘秉权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Konami Sports Club Co Ltd
Original Assignee
Harbin Institute of Technology
People Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology, People Co Ltd filed Critical Harbin Institute of Technology
Priority to CN202111406136.9A priority Critical patent/CN114201621B/en
Publication of CN114201621A publication Critical patent/CN114201621A/en
Application granted granted Critical
Publication of CN114201621B publication Critical patent/CN114201621B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention, which comprises the following steps: acquiring a training image and a training text, and respectively extracting local features of an image sample and a text sample; mapping all local image features of the image sample and all local text features of the text sample into feature vectors respectively, and representing the feature vectors of the image sample and the text sample into matrixes respectively to obtain Key matrixes, query matrixes and Value matrixes respectively; based on the matrixes, calculating the cross-modal attention characteristics and intra-modal attention characteristics of the image sample and the text sample; fusing the cross-modal attention features and intra-modal attention features to obtain a global feature representation of the image sample and a global feature representation of the text sample; and training to obtain a cross-modal retrieval model based on the global feature representation. The invention can directly carry out similarity matching on the data of different modes and has higher matching accuracy.

Description

Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention
Technical Field
The invention relates to the technical field of cross-modal retrieval of image texts, in particular to a cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention.
Background
With the explosive growth of multimedia data of different modalities such as text and images on the Internet, single-modality retrieval can no longer meet users' needs, and cross-modal retrieval has emerged. Cross-modal retrieval is the mutual retrieval of data of at least two modalities; generally, related data of one modality is retrieved using data of another modality as the query condition. For example, goods displayed on an e-commerce website usually comprise text descriptions such as the commodity category, name, attributes and detailed description, as well as commodity images; when a user searches for content of interest on the website, the user hopes to retrieve data of multiple modalities, such as text and images, related to a commodity so as to obtain more information about it.
However, data of different modalities have inconsistent representation forms, so their similarity cannot be measured directly, which leads to the low matching accuracy of existing cross-modal retrieval methods.
Disclosure of Invention
The invention addresses the problem that data of different modalities have inconsistent representation forms, so their similarity cannot be measured directly, resulting in the low matching accuracy of existing cross-modal retrieval methods.
The invention provides a cross-modal retrieval model construction method based on graphic and text cooperative attention, which comprises the following steps:
acquiring a training image and a training text, wherein the training image is an image sample with a class label, and the training text is a text sample with a class label;
extracting local image features of the image sample and extracting local text features of the text sample;
mapping all local image features of the image sample and all local text features of the text sample into feature vectors respectively, and representing the feature vectors of the image sample and the feature vectors of the text sample into matrixes respectively, and obtaining respective Key matrixes, query matrixes and Value matrixes respectively through a full connection layer;
calculating the cross-modal attention scores of the image sample and the text sample based on the Key matrix, the Query matrix and the Value matrix of the image sample and the text sample, and respectively generating the cross-modal attention features of the image sample and the text sample based on the cross-modal attention scores;
calculating intra-modal attention scores of the image sample and the text sample based on a Key matrix, a Query matrix and a Value matrix of each of the image sample and the text sample, and respectively generating intra-modal attention features of the image sample and the text sample based on the intra-modal attention scores;
fusing the cross-modal attention feature and the intra-modal attention feature to obtain a global feature representation of the image sample and a global feature representation of the text sample respectively;
and training to obtain a cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample.
Optionally, the calculating the cross-modal attention score of the image sample and the text sample based on the Key matrix, the Query matrix and the Value matrix of each of the image sample and the text sample, and generating the cross-modal attention features of the image sample and the text sample based on the cross-modal attention score respectively includes:
respectively performing inner product operation on the Key matrix of the image sample and the Query matrix of the text sample, the Key matrix of the text sample and the Query matrix of the image sample, normalizing, and respectively calculating the weight matrix of the image sample to the text sample and the weight matrix of the text sample to the image sample through softmax, wherein the cross-modal attention score comprises the weight matrix of the text sample to the image sample and the weight matrix of the image sample to the text sample;
taking the weight matrix of the text sample to the image sample as a score, and carrying out weighted summation operation on the Value matrix of the image sample to obtain the cross-modal attention characteristic of the text sample;
and taking the weight matrix of the image sample to the text sample as a score, and carrying out weighted summation operation on the Value matrix of the text sample to obtain the cross-modal attention characteristic of the image sample.
Optionally, the calculating intra-modal attention scores of the image sample and the text sample based on the Key matrix, the Query matrix, and the Value matrix of each of the image sample and the text sample, and generating intra-modal attention features of the image sample and the text sample based on the intra-modal attention scores respectively includes:
respectively carrying out inner product operation on the Key matrix and the Query matrix of the image sample and the Key matrix and the Query matrix of the text sample, normalizing, and respectively calculating the weight matrix of the image sample and the weight matrix of the text sample through softmax, wherein the intra-mode attention score comprises the weight matrix of the image sample and the weight matrix of the text sample;
taking the weight matrix of the image sample as a score, and carrying out weighted summation operation on the Value matrix of the image sample to obtain intra-mode attention characteristics of the image sample;
and taking the weight matrix of the text sample as a score, and carrying out weighted summation operation on the Value matrix of the text sample to obtain the intra-modal attention characteristic of the text sample.
Optionally, the training to obtain the cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample includes a label prediction training task, which specifically includes:
inputting the global feature representation of the image sample or the global feature representation of the text sample into a full connection layer, outputting the probability of each label by using softmax, and taking the category label with the highest probability as a prediction label of the image sample or the text sample input into the full connection layer;
and calculating a loss function of label prediction based on the prediction label and the real class label carried by the image sample or the text sample.
Optionally, the training to obtain the cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample further includes a metric learning training task, which specifically includes:
constructing a training sample set for metric learning, wherein one piece of training data in the training sample set comprises an anchor sample, a first preset number of cross-modal positive samples of the anchor sample, and a second preset number of cross-modal negative samples of the anchor sample, the anchor sample being the image sample or the text sample;
and calculating the distances between the anchor sample and all positive samples and between the anchor sample and all negative samples from the global feature representation of the anchor sample, the global feature representations of its cross-modal positive samples and the global feature representations of its cross-modal negative samples, and calculating the loss function of metric learning based on the distances.
Optionally, the training to obtain the cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample further includes:
and training to obtain the cross-modal retrieval model by adopting a multi-task learning mode based on the label predicted loss function and the metric learned loss function.
Optionally, the extracting the local image feature of the image sample includes:
inputting the image sample into a pre-trained Faster RCNN model and extracting local image features of the image sample, wherein the Faster RCNN model comprises a feature extraction network, an RPN network and a region-of-interest pooling network; the feature extraction network extracts features of the image sample and inputs the extracted feature map into the RPN network, the RPN network selects a preset number of regions of interest and marks them with rectangular boxes, and the region-of-interest pooling network extracts the features of the regions of interest marked by the RPN network as the local image features of the image sample.
Optionally, the extracting the local text feature of the text sample includes:
word segmentation is carried out on the text sample;
word2Vec is used for obtaining Word vectors of each Word after Word segmentation;
and inputting the word vector into a Bi-LSTM network, and acquiring a feature vector representation of each word as a local text feature of the text sample.
The invention also provides a cross-modal retrieval method based on graphic and text cooperative attention, which comprises the following steps:
acquiring a given retrieval condition element, wherein the retrieval condition element is a retrieval image or a retrieval text;
inputting the retrieval condition element and the elements in a preset retrieval range into the cross-modal retrieval model constructed by any of the above cross-modal retrieval model construction methods based on graphic and text cooperative attention, the cross-modal retrieval model outputting, as the retrieval result, a third preset number of elements in the retrieval range with the highest similarity to the retrieval condition element, wherein the elements in the retrieval range comprise images and/or texts.
The invention also provides a cross-modal retrieval device based on the graphic collaborative attention, which comprises a computer readable storage medium and a processor, wherein the computer readable storage medium stores a computer program, and the computer program realizes the cross-modal retrieval model construction method based on the graphic collaborative attention or the cross-modal retrieval method based on the graphic collaborative attention when being read and run by the processor.
According to the invention, the local image features of the image sample and the local text features of the text sample are extracted and mapped into feature vectors. A cross-modal attention mechanism captures the fine-grained interaction between data of the two modalities, while an intra-modal attention mechanism captures the associations between image regions and the semantic associations of the text context. Finally, the cross-modal attention features and the intra-modal attention features are fused to obtain global feature representations of the image and the text with a consistent form, so that data of the two different modalities can be compared directly by similarity measurement; the trained cross-modal retrieval model can therefore perform similarity matching on data of different modalities directly, with high matching accuracy.
Drawings
FIG. 1 is a schematic diagram of a cross-modal retrieval model construction method based on graphic collaborative attention in an embodiment of the invention;
FIG. 2 is a schematic diagram of an example of a text sample in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a cross-modal retrieval model construction method based on graphic collaborative attention according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a collaborative attention mechanism according to an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
Referring to fig. 1, in an embodiment of the present invention, the method for constructing a cross-modal search model based on graphic collaborative attention includes:
step S10, a training image and a training text are obtained, wherein the training image is an image sample with a class label, and the training text is a text sample with a class label.
The training data comprises a plurality of training images and training texts; each training image can be assigned an image ID and each training text a text ID so as to distinguish different training images and training texts. Both the training images and the training texts carry category labels. As shown in fig. 2, besides the text category label and the text description, a training text can also contain the image ID corresponding to the text, so that positive samples for the metric learning training task can be constructed later.
Step S20, extracting local image features of the image sample, and extracting local text features of the text sample.
Here, the local image features of the image sample refer to the region features of the image sample, specifically the features of a plurality of regions of the image sample. The local image features of the image sample can be extracted with an R-CNN, Fast R-CNN or Faster R-CNN algorithm.
Here, the local text features of the text sample refer to the feature representation of each word of the text sample. The local text features of the text sample can be extracted with Word2Vec, Bi-LSTM and similar algorithms.
And step S30, mapping all local image features of the image sample and all local text features of the text sample into feature vectors, respectively representing the feature vectors of the image sample and the feature vectors of the text sample into matrixes, and respectively obtaining respective Key matrixes, query matrixes and Value matrixes through a full connection layer.
One image sample has one or more local image features and one text sample has one or more local text features. Each local image feature of an image sample is mapped to a feature vector, which can be done by feeding the local image feature into a fully connected layer; likewise, each local text feature of a text sample is mapped to its corresponding feature vector through a fully connected layer.
Optionally, the feature vector of the image sample and the feature vector of the text sample have the same dimension.
Referring to fig. 4, if one image sample has k local image features and one text sample has j local text features, and the feature vector dimension obtained by mapping the k local image features of the image sample is d, k d-dimensional feature vectors can be obtained after mapping the local image features of the image sample, and j d-dimensional feature vectors can be obtained after mapping the local text features of the text sample.
The local image features of the image samples and the local text features of the text samples are mapped into vectors with the same dimension, so that data interaction between the subsequent image samples and the text samples is realized, and similarity calculation between the image samples and the text samples is facilitated.
Referring to fig. 4, the k d-dimensional feature vectors of the image sample are represented as a matrix P, and Key matrix, query matrix and Value matrix of the matrix P are obtained through the Linear full connection layer, which may be specifically represented as:
P_K = Linear(P; θ_PK); P_Q = Linear(P; θ_PQ); P_V = Linear(P; θ_PV),
where P_K denotes the Key matrix of matrix P, P_Q the Query matrix of matrix P, P_V the Value matrix of matrix P, and θ_PK, θ_PQ, θ_PV are the network weight parameters of the fully connected layers.
Referring to fig. 4, j d-dimensional feature vectors of a text sample are represented as a matrix T, and Key matrix, query matrix and Value matrix of the matrix T are obtained through a Linear full connection layer, which may be specifically represented as:
T_K = Linear(T; θ_TK); T_Q = Linear(T; θ_TQ); T_V = Linear(T; θ_TV),
where T_K denotes the Key matrix of matrix T, T_Q the Query matrix of matrix T, T_V the Value matrix of matrix T, and θ_TK, θ_TQ, θ_TV are the network weight parameters of the fully connected layers.
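As an informal illustration of this step, the sketch below projects the image feature matrix P and the text feature matrix T into Key, Query and Value spaces with fully connected layers. It is a minimal PyTorch sketch; the module name, the example sizes k=36, j=20 and d=512, and the variable names are our assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class KQVProjection(nn.Module):
    """Maps a matrix of local features (rows = regions or words) to Key/Query/Value matrices."""
    def __init__(self, d: int):
        super().__init__()
        self.key = nn.Linear(d, d)    # Linear(.; theta_K)
        self.query = nn.Linear(d, d)  # Linear(.; theta_Q)
        self.value = nn.Linear(d, d)  # Linear(.; theta_V)

    def forward(self, x: torch.Tensor):
        # x: (num_local_features, d), e.g. P (k x d) or T (j x d)
        return self.key(x), self.query(x), self.value(x)

# Example: k = 36 image regions, j = 20 words, shared dimension d = 512
d = 512
P = torch.randn(36, d)   # image sample: k local image features mapped to d dimensions
T = torch.randn(20, d)   # text sample: j local text features mapped to d dimensions
proj_img, proj_txt = KQVProjection(d), KQVProjection(d)
P_K, P_Q, P_V = proj_img(P)
T_K, T_Q, T_V = proj_txt(T)
```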
Step S40, calculating a cross-modal attention score of the image sample and the text sample based on the Key matrix, the Query matrix and the Value matrix of each of the image sample and the text sample, and generating cross-modal attention features of the image sample and the text sample based on the cross-modal attention score, respectively.
Further, the step S40 includes:
respectively performing inner product operation on the Key matrix of the image sample and the Query matrix of the text sample, the Key matrix of the text sample and the Query matrix of the image sample, normalizing, and respectively calculating the weight matrix of the image sample to the text sample and the weight matrix of the text sample to the image sample through softmax, wherein the cross-modal attention score comprises the weight matrix of the text sample to the image sample and the weight matrix of the image sample to the text sample.
Specifically, the Key matrix of the image sample and the Query matrix of the text sample are combined by an inner product (P_K·T_Q^T), the result is normalized by √d, and softmax gives the weight matrix of the image sample over the text sample:
W_PT = softmax(P_K·T_Q^T / √d).
Likewise, the Key matrix of the text sample and the Query matrix of the image sample are combined by an inner product (T_K·P_Q^T), the result is normalized by √d, and softmax gives the weight matrix of the text sample over the image sample:
W_TP = softmax(T_K·P_Q^T / √d),
where P_K is the Key matrix of the image sample, T_Q the Query matrix of the text sample, T_K the Key matrix of the text sample, P_Q the Query matrix of the image sample, and d the feature vector dimension of the image and text samples.
The weight matrix of the image sample over the text sample is then taken as the score in a weighted sum over the Value matrix of the text sample, which yields the cross-modal attention feature of the image sample:
P_inter = W_PT × T_V,
where P_inter is the cross-modal attention feature of the image sample, W_PT the weight matrix of the image sample over the text sample, and T_V the Value matrix of the text sample.
Similarly, the weight matrix of the text sample over the image sample is taken as the score in a weighted sum over the Value matrix of the image sample, which yields the cross-modal attention feature of the text sample:
T_inter = W_TP × P_V,
where T_inter is the cross-modal attention feature of the text sample, W_TP the weight matrix of the text sample over the image sample, and P_V the Value matrix of the image sample.
Taking the inner product of the Key matrix of the image sample and the Query matrix of the text sample produces attention weights between each local image feature and each local text feature, which determine which parts of the input to attend to and allocate the limited processing capacity to the important parts; normalizing the inner-product result keeps the final attention score independent of the feature vector dimension d.
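Continuing the sketch above, the cross-modal attention step described here can be written as follows. This is a hedged PyTorch sketch that follows the scaled inner product, softmax and weighted-sum description; the function name is ours.

```python
import math
import torch

def cross_modal_attention(P_K, P_Q, P_V, T_K, T_Q, T_V, d: int):
    # Weight matrix of the image sample over the text sample, and vice versa:
    # softmax over the inner products scaled by sqrt(d).
    W_PT = torch.softmax(P_K @ T_Q.t() / math.sqrt(d), dim=-1)  # (k, j)
    W_TP = torch.softmax(T_K @ P_Q.t() / math.sqrt(d), dim=-1)  # (j, k)
    # Weighted sums over the other modality's Value matrix.
    P_inter = W_PT @ T_V  # cross-modal attention feature of the image sample, (k, d)
    T_inter = W_TP @ P_V  # cross-modal attention feature of the text sample, (j, d)
    return P_inter, T_inter

# P_inter, T_inter = cross_modal_attention(P_K, P_Q, P_V, T_K, T_Q, T_V, d)
```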
Step S50, calculating intra-mode attention scores of the image sample and the text sample based on Key matrix, query matrix and Value matrix of each of the image sample and the text sample, and generating intra-mode attention features of the image sample and the text sample based on the intra-mode attention scores, respectively.
Further, the step S50 includes:
and respectively carrying out inner product operation on the Key matrix and the Query matrix of the image sample and the Key matrix and the Query matrix of the text sample, normalizing, and respectively calculating the weight matrix of the image sample and the weight matrix of the text sample through softmax, wherein the intra-mode attention score comprises the weight matrix of the image sample and the weight matrix of the text sample.
The Key matrix and the Query matrix of the image sample are combined by an inner product (P_K·P_Q^T), the result is normalized by √d, and softmax gives the weight matrix of the image sample:
W_PP = softmax(P_K·P_Q^T / √d),
where W_PP is the weight matrix of the image sample, P_K its Key matrix, P_Q its Query matrix, and d the feature vector dimension of the image sample.
Likewise, the Key matrix and the Query matrix of the text sample are combined by an inner product (T_K·T_Q^T), the result is normalized by √d, and softmax gives the weight matrix of the text sample:
W_TT = softmax(T_K·T_Q^T / √d),
where W_TT is the weight matrix of the text sample, T_K its Key matrix, T_Q its Query matrix, and d the feature vector dimension of the text sample.
The weight matrix of the image sample is then taken as the score in a weighted sum over the Value matrix of the image sample, which yields the intra-modal attention feature of the image sample:
P_intra = W_PP × P_V,
where P_intra is the intra-modal attention feature of the image sample and P_V the Value matrix of the image sample.
Similarly, the weight matrix of the text sample is taken as the score in a weighted sum over the Value matrix of the text sample, which yields the intra-modal attention feature of the text sample:
T_intra = W_TT × T_V,
where T_intra is the intra-modal attention feature of the text sample and T_V the Value matrix of the text sample.
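The intra-modal branch uses the same pattern within a single modality. A minimal sketch, assuming the same tensors as in the previous sketches:

```python
import math
import torch

def intra_modal_attention(K: torch.Tensor, Q: torch.Tensor, V: torch.Tensor, d: int) -> torch.Tensor:
    # Self-attention within one modality: W = softmax(K Q^T / sqrt(d)), then a weighted sum over V.
    W = torch.softmax(K @ Q.t() / math.sqrt(d), dim=-1)
    return W @ V

# P_intra = intra_modal_attention(P_K, P_Q, P_V, d)  # intra-modal feature of the image sample
# T_intra = intra_modal_attention(T_K, T_Q, T_V, d)  # intra-modal feature of the text sample
```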
And step S60, fusing the cross-modal attention feature and the intra-modal attention feature to respectively obtain a global feature representation of the image sample and a global feature representation of the text sample.
Specifically, intra-modal attention features and inter-modal attention features of the image sample are fused, that is, the intra-modal attention features and inter-modal attention features are spliced, and then a plurality of local features of the image sample are reduced to 1 global feature through a maximum pooling layer, for example, k d-dimensional local features are reduced to 1 d-dimensional global feature representation. And fusing intra-modal attention features and inter-modal attention features of the text sample, and obtaining global feature representation of the text sample through a maximum pooling layer, wherein the maximum pooling layer has the same effect as the maximum pooling layer at the image sample, and the description is omitted here. It can be expressed as:
P_final = MaxPooling([P_inter, P_intra]),
T_final = MaxPooling([T_inter, T_intra]),
where P_final is the global feature representation of the image sample and T_final the global feature representation of the text sample. FIG. 4 shows how the computed P_intra and P_inter are concatenated and pooled to obtain this representation.
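A minimal sketch of the fusion step follows. We read the splicing as stacking the cross-modal and intra-modal local features along the local-feature axis, so that max pooling returns a single d-dimensional global vector as in the example above (k d-dimensional local features reduced to one d-dimensional global feature); that reading and the function name are our assumptions.

```python
import torch

def fuse_and_pool(inter: torch.Tensor, intra: torch.Tensor) -> torch.Tensor:
    # Stack the cross-modal and intra-modal local features, then max-pool over the
    # local-feature axis so a single d-dimensional global vector remains.
    fused = torch.cat([inter, intra], dim=0)  # (2 * num_local, d)
    return fused.max(dim=0).values            # (d,)

# P_final = fuse_and_pool(P_inter, P_intra)  # global feature of the image sample
# T_final = fuse_and_pool(T_inter, T_intra)  # global feature of the text sample
```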
Step S70, training to obtain a cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample.
After the global feature representations of the image and text samples are obtained, the cross-modal retrieval model is trained using a label prediction task and a metric learning task.
The method extracts the local image features of the image sample and the local text features of the text sample and maps them into feature vectors. A cross-modal attention mechanism captures the fine-grained interaction between data of the two modalities, and an intra-modal attention mechanism captures the associations between image regions and the semantic associations of the text context. Finally, the cross-modal and intra-modal attention features are fused into global feature representations of the image and the text with a consistent form, so that data of the two modalities can be compared directly by similarity measurement; the trained cross-modal retrieval model can thus perform similarity matching on data of different modalities directly, with high matching accuracy.
Optionally, as shown in fig. 3, step S70 includes a label prediction training task, specifically including:
inputting the global feature representation of the image sample or the global feature representation of the text sample into a full connection layer, outputting the probability of each label by using softmax, and taking the category label with the highest probability as a prediction label of the image sample or the text sample input into the full connection layer;
and calculating a loss function of label prediction based on the prediction label and the real class label carried by the image sample or the text sample.
The cross-entropy loss is used as the loss function for label prediction, which can be expressed as
L_label = -(1/n) Σ_{i=1}^{n} [ y_i·log(p_vi) + y_i·log(p_ti) ],
where L_label denotes the label prediction loss, n the number of samples in one batch, y_i the true label of each sample, p_vi the label prediction generated for the image sample, and p_ti the label prediction generated for the text sample.
The label prediction task ensures that, within a modality, samples with the same label have similar feature representations and samples with different labels have different feature representations.
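A hedged sketch of the label prediction task is given below. It uses a fully connected layer over the global feature and PyTorch's CrossEntropyLoss, which folds the softmax into the loss and is therefore equivalent to the softmax + cross-entropy formulation above; the class name and the way the image and text terms are summed are our choices.

```python
import torch
import torch.nn as nn

class LabelPredictionHead(nn.Module):
    """Fully connected layer over a global feature; softmax is folded into the loss below."""
    def __init__(self, d: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(d, num_classes)

    def forward(self, global_feature: torch.Tensor) -> torch.Tensor:
        return self.fc(global_feature)  # class logits

criterion = nn.CrossEntropyLoss()  # log-softmax + negative log-likelihood
# logits_img = head(P_final.unsqueeze(0)); logits_txt = head(T_final.unsqueeze(0))
# L_label = criterion(logits_img, y_true) + criterion(logits_txt, y_true)
```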
Optionally, as shown in fig. 3, step S70 includes a metric learning training task, specifically including:
constructing a training sample set for metric learning, wherein one piece of training data in the training sample set comprises an anchor sample, a first preset number of cross-modal positive samples of the anchor sample, and a second preset number of cross-modal negative samples of the anchor sample, the anchor sample being the image sample or the text sample; and calculating the distances between the anchor sample and all positive samples and between the anchor sample and all negative samples from the global feature representation of the anchor sample, the global feature representations of its cross-modal positive samples and the global feature representations of its cross-modal negative samples, and calculating the loss function of metric learning based on the distances.
When the anchor sample is an image sample, the cross-modal positive sample refers to a text-form positive sample, and the cross-modal negative sample refers to a text-form negative sample; when the anchor sample is a text sample, its cross-modal positive sample refers to the positive sample in image form and its cross-modal negative sample refers to the negative sample in image form.
The first preset number may be greater than or equal to 1, with a preferred value of 1. The second preset number may be greater than or equal to 1, with a preferred value of m-1, where m is the number of class labels: for an anchor sample, only one class among all class labels (namely the class label of the anchor sample) provides its positive samples, the other m-1 classes provide negative samples, and one sample is selected from each of those classes for computing the loss function.
In one embodiment, for one batch of image and text data with semantic labels from m different classes, an image sample is taken as the anchor sample, a text sample with the same class label as the image sample is randomly sampled as the positive sample, and one text sample is sampled from each class different from the anchor's, giving m-1 negative samples. Likewise, a text sample is taken as the anchor sample, an image sample with the same semantic label as the text sample is randomly sampled as the positive sample, and image samples from the classes different from the anchor's are sampled as negative samples, giving m-1 negative samples.
The loss function of metric learning is calculated from the distances between the anchor sample and all positive samples and all negative samples; specifically, the distances between the anchor sample and the negative samples are subtracted from the distances between the anchor sample and the positive samples, and training on this loss shrinks the distance between positive pairs and enlarges the distance between negative pairs. Optionally, the distance between samples is defined using cosine similarity.
Further, the loss function of metric learning is defined as follows:
L_metric = L(v) + L(t),
where L(v) is the metric learning loss of the image samples, v is a selected image sample, t^+ is a text sample of the same category as the image sample (positive sample), M is the number of categories, and t_i is a text sample of a different category from the image sample (negative sample); L(t) is the metric learning loss of the text samples, t is a selected text sample, v^+ is an image sample of the same category as the text sample (positive sample), and v_i is an image sample of a different category from the text sample (negative sample). L_metric is the total metric learning loss.
The metric learning task ensures that samples with similar semantics in different modalities have similar feature representations and samples with different semantics in different modalities have different feature representations.
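Below is a hedged sketch of the metric learning loss for a single anchor. The patent defines sample distance via cosine similarity and subtracts the negative-pair distances from the positive-pair distance; here we take 1 − cos as that distance and average over the negatives, both of which are our assumptions, as are the helper names.

```python
import torch
import torch.nn.functional as F

def metric_loss_for_anchor(anchor: torch.Tensor,
                           positive: torch.Tensor,
                           negatives: torch.Tensor) -> torch.Tensor:
    """anchor, positive: (d,) global features; negatives: (m-1, d) cross-modal negatives."""
    dist_pos = 1.0 - F.cosine_similarity(anchor, positive, dim=0)
    dist_neg = 1.0 - F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1)  # (m-1,)
    # Positive-pair distance minus the (mean) negative-pair distance: minimizing this
    # shrinks positive distances and enlarges negative distances.
    return dist_pos - dist_neg.mean()

# L_v = metric_loss_for_anchor(P_final, T_pos, T_negs)  # image anchor, text positives/negatives
# L_t = metric_loss_for_anchor(T_final, P_pos, P_negs)  # text anchor, image positives/negatives
# L_metric = L_v + L_t
```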
The step S70 further includes: and training to obtain the cross-modal retrieval model by adopting a multi-task learning mode based on the label predicted loss function and the metric learned loss function.
Specifically, the definition of the loss function for the multitasking learning is as follows:
L = α·L_label + β·L_metric,
where α and β are hyperparameters that balance the weights of the two task loss functions.
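A trivial sketch of the combined objective; the default weights of 1.0 are placeholders, not values given in the patent.

```python
import torch

def multitask_loss(L_label: torch.Tensor, L_metric: torch.Tensor,
                   alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    # L = alpha * L_label + beta * L_metric; alpha and beta balance the two tasks.
    return alpha * L_label + beta * L_metric
```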
Training the cross-modal retrieval model in a multi-task manner ensures that, within each modality, samples with the same label have similar feature representations and samples with different labels have different representations, while across modalities, samples with similar semantics have similar feature representations and samples with different semantics have different ones.
Optionally, as shown in fig. 3, the extracting the local image feature of the image sample includes:
inputting the image sample into a pre-trained Faster RCNN model, extracting local image characteristics of the image sample, wherein the Faster RCNN model comprises a characteristic extraction network, an RPN network and a region of interest pooling network, the characteristic extraction network is used for extracting characteristics of the image sample, inputting an extracted characteristic image into the RPN network, selecting a preset number of regions of interest by the RPN network, marking the regions of interest by using rectangular frames, and the region of interest pooling network is used for extracting characteristics of the regions of interest based on the regions of interest marked by the RPN network to serve as local image characteristics of the image sample.
The feature extraction network consists of a set of convolution layers, ReLU activation layers and pooling layers.
After the extracted feature map is input into the RPN network, the RPN processes it in two branches: the first branch selects k regions of interest through classification, specifically softmax classification; the second branch marks the approximate locations of these regions of interest with rectangular boxes.
Based on the region-of-interest positions marked by the RPN network, the region-of-interest pooling network extracts the features of the k regions from the feature map produced by the feature extraction network, and these serve as the local image features of the image sample.
Extracting the local image features of the image sample with the Faster RCNN model allows local image features to be extracted from the image sample effectively and quickly.
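Purely as an illustration, region features of this kind can be obtained by reusing the backbone, RPN and RoI pooling head of torchvision's Faster R-CNN. The sketch below relies on torchvision's internal module layout and a COCO-pretrained model, neither of which the patent specifies, and the function name and top_k value are ours.

```python
import torch
import torchvision

# A COCO-pretrained Faster R-CNN stands in for the patent's pre-trained model.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

@torch.no_grad()
def region_features(image: torch.Tensor, top_k: int = 36) -> torch.Tensor:
    """image: (3, H, W) float tensor in [0, 1]; returns (top_k, 1024) region features."""
    images, _ = model.transform([image])                   # resize and normalize
    features = model.backbone(images.tensors)              # feature extraction network
    proposals, _ = model.rpn(images, features)             # RPN marks regions of interest
    boxes = [proposals[0][:top_k]]                         # keep the top-k proposals
    pooled = model.roi_heads.box_roi_pool(features, boxes, images.image_sizes)
    return model.roi_heads.box_head(pooled)                # pooled per-region features
```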
Optionally, as shown in fig. 3, the extracting the local text feature of the text sample includes:
and word segmentation is carried out on the text sample. The specific word segmentation method can select hidden Markov models, jieba word segmentation algorithms and the like, and the algorithms are all in the prior art and are not repeated here.
The Word vector of each Word after Word segmentation is obtained by using Word2Vec, and specifically, the Word vector of each Word after Word segmentation can be obtained through a Skip-Gram model in Word2 Vec. The related content is the prior art and is not described herein.
And inputting the word vector into a Bi-LSTM network, and acquiring a feature vector representation of each word as a local text feature of the text sample. Among them, bi-LSTM is one of RNN networks, and is suitable for modeling time series data, such as text data here, which can better capture long-distance dependency relationships by learning and memorizing which information and forgetting which information, and in addition, can better capture Bi-directional semantic dependencies by combining forward LSTM and backward LSTM.
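A rough end-to-end sketch of this pipeline (jieba segmentation, skip-gram Word2Vec from gensim, then a Bi-LSTM) is shown below; the toy corpus, vector sizes and variable names are illustrative assumptions only.

```python
import jieba
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# Toy corpus standing in for the training texts (illustrative only).
corpus = ["红色连衣裙夏季新款", "男士运动跑步鞋轻便透气"]
tokenized = [jieba.lcut(doc) for doc in corpus]            # word segmentation

# Skip-gram (sg=1) Word2Vec word vectors; the dimension 128 is arbitrary here.
w2v = Word2Vec(sentences=tokenized, vector_size=128, sg=1, min_count=1)

# Bi-LSTM over the word vectors; each output step is one local text feature.
bilstm = nn.LSTM(input_size=128, hidden_size=256, bidirectional=True, batch_first=True)
words = tokenized[0]
vectors = torch.from_numpy(np.stack([w2v.wv[w] for w in words])).unsqueeze(0)  # (1, seq, 128)
local_text_features, _ = bilstm(vectors)                   # (1, seq, 512)
```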
The invention also provides a cross-modal retrieval method based on graphic and text cooperative attention, which comprises the following steps:
acquiring a given retrieval condition element, wherein the retrieval condition element is a retrieval image or a retrieval text; inputting the retrieval condition element and the elements in a preset retrieval range into the cross-modal retrieval model constructed by the above cross-modal retrieval model construction method based on graphic and text cooperative attention, the cross-modal retrieval model outputting, as the retrieval result, a third preset number of elements in the retrieval range with the highest similarity to the retrieval condition element, wherein the elements in the retrieval range comprise images and/or texts.
After acquiring a given retrieval condition element, inputting the given retrieval condition element into a cross-modal retrieval model constructed/trained by a cross-modal retrieval model construction method based on graphic-text cooperative attention, simultaneously, inputting elements in a preset retrieval range into the cross-modal retrieval model, and outputting a third preset number of elements with highest similarity with the retrieval condition element in the retrieval range by the cross-modal retrieval model.
The preset retrieval range may be defined to include only image elements, only text elements, or both image and text elements. In one embodiment, the given retrieval condition element is an image and the preset retrieval range contains only text elements, so the cross-modal retrieval model outputs the preset number of texts in the retrieval range with the highest similarity to the retrieval condition element; in another embodiment, the given retrieval condition element is a text and the preset retrieval range contains only image elements, so the cross-modal retrieval model outputs the image in the retrieval range with the highest similarity to the retrieval condition element.
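A minimal sketch of the retrieval step, assuming the similarity between global features is measured with cosine similarity (the measure used by the metric learning task); the function name and the top-k default are ours.

```python
import torch
import torch.nn.functional as F

def retrieve(query_feature: torch.Tensor,
             candidate_features: torch.Tensor,
             top_k: int = 5):
    """query_feature: (d,) global feature of the query image or text;
    candidate_features: (N, d) global features of the elements in the retrieval range."""
    sims = F.cosine_similarity(query_feature.unsqueeze(0), candidate_features, dim=1)  # (N,)
    scores, indices = sims.topk(min(top_k, candidate_features.size(0)))
    return indices.tolist(), scores.tolist()  # most similar candidates first
```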
The method comprises the steps of inputting elements in a search condition and elements in a preset search range into a cross-modal search model constructed by a cross-modal search model construction method based on graphic-text collaborative attention, capturing fine-grained interaction relation of data among modes by using a cross-modal attention mechanism through the cross-modal search model, capturing association among image areas and semantic association of text context by using the intra-modal attention mechanism, and finally fusing the cross-modal attention feature and the intra-modal attention feature to obtain global feature representation with consistent representation forms of images and texts, so that the images and texts of different modes can be directly subjected to similarity measurement, and further accuracy of the cross-modal search is improved.
In an embodiment of the present invention, a cross-modal retrieval device based on graphic and text cooperative attention comprises a computer readable storage medium and a processor; the computer readable storage medium stores a computer program which, when read and executed by the processor, implements the above cross-modal retrieval model construction method based on graphic and text cooperative attention or the above cross-modal retrieval method based on graphic and text cooperative attention. Compared with the prior art, the advantages of this cross-modal retrieval device are the same as those of the cross-modal retrieval method based on graphic and text cooperative attention and are not repeated here.
The reader will appreciate that in the description of this specification, a description of terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (10)

1. The method for constructing the cross-modal retrieval model based on the graphic and text cooperative attention is characterized by comprising the following steps of:
acquiring a training image and a training text, wherein the training image is an image sample with a class label, and the training text is a text sample with a class label;
extracting local image features of the image sample and extracting local text features of the text sample;
mapping all local image features of the image sample and all local text features of the text sample into feature vectors respectively, and representing the feature vectors of the image sample and the feature vectors of the text sample into matrixes respectively, and obtaining respective Key matrixes, query matrixes and Value matrixes respectively through a full connection layer;
calculating the cross-modal attention scores of the image sample and the text sample based on the Key matrix, the Query matrix and the Value matrix of the image sample and the text sample, and respectively generating the cross-modal attention features of the image sample and the text sample based on the cross-modal attention scores;
calculating intra-modal attention scores of the image sample and the text sample based on a Key matrix, a Query matrix and a Value matrix of each of the image sample and the text sample, and respectively generating intra-modal attention features of the image sample and the text sample based on the intra-modal attention scores;
fusing the cross-modal attention feature and the intra-modal attention feature to obtain a global feature representation of the image sample and a global feature representation of the text sample respectively;
and training to obtain a cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample.
2. The method for constructing a cross-modal retrieval model based on graphic co-attention as claimed in claim 1, wherein the calculating the cross-modal attention score of the image sample and the text sample based on the Key matrix, the Query matrix and the Value matrix of each of the image sample and the text sample, and the generating the cross-modal attention features of the image sample and the text sample based on the cross-modal attention score respectively comprises:
respectively performing inner product operation on the Key matrix of the image sample and the Query matrix of the text sample, the Key matrix of the text sample and the Query matrix of the image sample, normalizing, and respectively calculating the weight matrix of the image sample to the text sample and the weight matrix of the text sample to the image sample through softmax, wherein the cross-modal attention score comprises the weight matrix of the text sample to the image sample and the weight matrix of the image sample to the text sample;
taking the weight matrix of the text sample to the image sample as a score, and carrying out weighted summation operation on the Value matrix of the image sample to obtain the cross-modal attention characteristic of the text sample;
and taking the weight matrix of the image sample to the text sample as a score, and carrying out weighted summation operation on the Value matrix of the text sample to obtain the cross-modal attention characteristic of the image sample.
3. The method for constructing a cross-modal retrieval model based on graphic co-attention as claimed in claim 1, wherein the calculating intra-modal attention scores of the image sample and the text sample based on Key matrix, query matrix and Value matrix of each of the image sample and the text sample, and generating intra-modal attention features of the image sample and the text sample based on the intra-modal attention scores respectively comprises:
respectively carrying out inner product operation on the Key matrix and the Query matrix of the image sample and the Key matrix and the Query matrix of the text sample, normalizing, and respectively calculating the weight matrix of the image sample and the weight matrix of the text sample through softmax, wherein the intra-mode attention score comprises the weight matrix of the image sample and the weight matrix of the text sample;
taking the weight matrix of the image sample as a score, and carrying out weighted summation operation on the Value matrix of the image sample to obtain intra-mode attention characteristics of the image sample;
and taking the weight matrix of the text sample as a score, and carrying out weighted summation operation on the Value matrix of the text sample to obtain the intra-modal attention characteristic of the text sample.
4. The method for constructing a cross-modal retrieval model based on graphic co-attention as claimed in claim 1, wherein the training to obtain the cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample includes a label prediction training task, specifically including:
inputting the global feature representation of the image sample or the global feature representation of the text sample into a full connection layer, outputting the probability of each label by using softmax, and taking the category label with the highest probability as a prediction label of the image sample or the text sample input into the full connection layer;
and calculating a loss function of label prediction based on the prediction label and the real class label carried by the image sample or the text sample.
5. The method for constructing a cross-modal retrieval model based on graphic co-attention as set forth in claim 4, wherein the training to obtain the cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample further includes a metric learning training task, and specifically includes:
constructing a training sample set for metric learning, wherein one piece of training data in the training sample set comprises an anchor sample, a first preset number of cross-modal positive samples of the anchor sample and a second preset number of cross-modal negative samples of the anchor sample, and the anchor sample is the image sample or the text sample;
and calculating the distances between the anchor sample and all positive samples and between the anchor sample and all negative samples from the global feature representation of the anchor sample, the global feature representations of its cross-modal positive samples and the global feature representations of its cross-modal negative samples, and calculating the loss function of metric learning based on the distances.
6. The method for constructing a cross-modal retrieval model based on graphic co-attention as recited in claim 5, wherein training the cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample further comprises:
and training to obtain the cross-modal retrieval model by adopting a multi-task learning mode based on the label predicted loss function and the metric learned loss function.
7. A method of constructing a cross-modal retrieval model based on graphic and text cooperative attention as claimed in any one of claims 1 to 6, wherein said extracting local image features of the image sample includes:
inputting the image sample into a pre-trained Faster RCNN model, extracting local image characteristics of the image sample, wherein the Faster RCNN model comprises a characteristic extraction network, an RPN network and a region of interest pooling network, the characteristic extraction network is used for extracting characteristics of the image sample, inputting an extracted characteristic image into the RPN network, selecting a preset number of regions of interest by the RPN network, marking the regions of interest by using rectangular frames, and the region of interest pooling network is used for extracting characteristics of the regions of interest based on the regions of interest marked by the RPN network to serve as local image characteristics of the image sample.
8. A method of constructing a cross-modal retrieval model based on graphic and text cooperative attention as claimed in any one of claims 1 to 6, wherein said extracting local text features of the text sample includes:
word segmentation is carried out on the text sample;
word2Vec is used for obtaining Word vectors of each Word after Word segmentation;
and inputting the word vector into a Bi-LSTM network, and acquiring a feature vector representation of each word as a local text feature of the text sample.
9. The cross-modal retrieval method based on graphic and text cooperative attention is characterized by comprising the following steps of:
acquiring a given retrieval condition element, wherein the retrieval condition element is a retrieval image or a retrieval text;
inputting the retrieval condition element and the elements in a preset retrieval range into a cross-modal retrieval model constructed by the cross-modal retrieval model construction method based on graphic and text cooperative attention according to any one of claims 1 to 8, the cross-modal retrieval model outputting, as the retrieval result, a third preset number of elements in the retrieval range with the highest similarity to the retrieval condition element, wherein the elements in the retrieval range comprise images and/or texts.
10. A cross-modal retrieval device based on graphic and text cooperative attention, comprising a computer readable storage medium and a processor, the computer readable storage medium storing a computer program which, when read and executed by the processor, implements the cross-modal retrieval model construction method based on graphic and text cooperative attention as claimed in any one of claims 1 to 8, or the cross-modal retrieval method based on graphic and text cooperative attention as claimed in claim 9.
CN202111406136.9A 2021-11-24 2021-11-24 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention Active CN114201621B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111406136.9A CN114201621B (en) 2021-11-24 2021-11-24 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111406136.9A CN114201621B (en) 2021-11-24 2021-11-24 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention

Publications (2)

Publication Number Publication Date
CN114201621A CN114201621A (en) 2022-03-18
CN114201621B true CN114201621B (en) 2024-04-02

Family

ID=80648805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111406136.9A Active CN114201621B (en) 2021-11-24 2021-11-24 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention

Country Status (1)

Country Link
CN (1) CN114201621B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841243B (en) * 2022-04-02 2023-04-07 中国科学院上海高等研究院 Cross-modal retrieval model training method, cross-modal retrieval method, device and medium
CN114969405B (en) * 2022-04-30 2024-01-26 苏州浪潮智能科技有限公司 Cross-modal image-text mutual detection method
CN114663737B (en) * 2022-05-20 2022-12-02 浪潮电子信息产业股份有限公司 Object identification method and device, electronic equipment and computer readable storage medium
CN114691907B (en) * 2022-05-31 2022-09-16 上海蜜度信息技术有限公司 Cross-modal retrieval method, device and medium
CN114707007B (en) * 2022-06-07 2022-08-30 苏州大学 Image text retrieval method and device and computer storage medium
CN115909317A (en) * 2022-07-15 2023-04-04 广东工业大学 Learning method and system for three-dimensional model-text joint expression
CN114973294B (en) * 2022-07-28 2022-10-21 平安科技(深圳)有限公司 Image-text matching method, device, equipment and storage medium
CN115017358B (en) * 2022-08-09 2022-11-04 南京理工大学 Cross-modal retrieval method and system for multi-modal interaction
CN115238130B (en) * 2022-09-21 2022-12-06 之江实验室 Time sequence language positioning method and device based on modal customization collaborative attention interaction
CN115658955B (en) * 2022-11-08 2023-03-14 苏州浪潮智能科技有限公司 Cross-media retrieval and model training method, device, equipment and menu retrieval system
CN115861995B (en) * 2023-02-08 2023-05-23 山东海量信息技术研究院 Visual question-answering method and device, electronic equipment and storage medium
CN116433727B (en) * 2023-06-13 2023-10-27 北京科技大学 Scalable single-stream tracking method based on staged continuous learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
CN113095415A (en) * 2021-04-15 2021-07-09 齐鲁工业大学 Cross-modal hashing method and system based on multi-modal attention mechanism
CN113220919A (en) * 2021-05-17 2021-08-06 河海大学 Dam defect image text cross-modal retrieval method and model
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast
CN113392254A (en) * 2021-03-29 2021-09-14 西安理工大学 Image text retrieval method based on context awareness

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11526808B2 (en) * 2019-05-29 2022-12-13 The Board Of Trustees Of The Leland Stanford Junior University Machine learning based generation of ontology for structural and functional mapping

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
CN113392254A (en) * 2021-03-29 2021-09-14 西安理工大学 Image text retrieval method based on context awareness
CN113095415A (en) * 2021-04-15 2021-07-09 齐鲁工业大学 Cross-modal hashing method and system based on multi-modal attention mechanism
CN113220919A (en) * 2021-05-17 2021-08-06 河海大学 Dam defect image text cross-modal retrieval method and model
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Co-attention Network Model for Cross-modal Retrieval; 邓一姣, 张凤荔, 陈学勤, 艾擎, 余苏喆; Computer Science; 2020-12-31 (04); 60-65 *
Research on Audio Database Content Matching Method for Cross-modal Retrieval; 张天, 靳聪, 帖云, 李小兵; Journal of Signal Processing; 2020-12-31 (06); 180-190 *

Also Published As

Publication number Publication date
CN114201621A (en) 2022-03-18

Similar Documents

Publication Publication Date Title
CN114201621B (en) Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention
US11222055B2 (en) System, computer-implemented method and computer program product for information retrieval
US11301732B2 (en) Processing image-bearing electronic documents using a multimodal fusion framework
CN112364204B (en) Video searching method, device, computer equipment and storage medium
CN113360701B (en) Sketch processing method and system based on knowledge distillation
US11663280B2 (en) Search engine using joint learning for multi-label classification
CN110083729B (en) Image searching method and system
CN113657425A (en) Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN114840705B (en) Combined commodity retrieval method and system based on multi-mode pre-training model
CN112528053A (en) Multimedia library classified retrieval management system
CN114332680A (en) Image processing method, video searching method, image processing device, video searching device, computer equipment and storage medium
US20210248425A1 (en) Reinforced text representation learning
CN112487199A (en) User characteristic prediction method based on user purchasing behavior
CN111651577B (en) Cross-media data association analysis model training and data association analysis method and system
CN110674388A (en) Mapping method and device for push item, storage medium and terminal equipment
CN117453859A (en) Agricultural pest and disease damage image-text retrieval method, system and electronic equipment
CN112182451A (en) Webpage content abstract generation method, equipment, storage medium and device
CN115797795A (en) Remote sensing image question-answering type retrieval system and method based on reinforcement learning
CN114969439A (en) Model training and information retrieval method and device
CN115292530A (en) Remote sensing image overall management system
CN115660695A (en) Customer service personnel label portrait construction method and device, electronic equipment and storage medium
CN114781390A (en) Aspect-level emotion analysis method and device
CN117938951B (en) Information pushing method, device, computer equipment and storage medium
Bastida et al. Multimodal object recognition using deep learning representations extracted from images and smartphone sensors
Horváth Object recognition based on Google's reverse image search and image similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant