CN114201621A - Cross-modal retrieval model construction and retrieval method based on image-text cooperative attention - Google Patents

Cross-modal retrieval model construction and retrieval method based on image-text cooperative attention

Info

Publication number
CN114201621A
Authority
CN
China
Prior art keywords
sample
text
image
modal
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111406136.9A
Other languages
Chinese (zh)
Other versions
CN114201621B (en)
Inventor
单丽莉
苏宇
孙承杰
林磊
刘秉权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Konami Sports Club Co Ltd
Original Assignee
Harbin Institute of Technology
People Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology, People Co Ltd filed Critical Harbin Institute of Technology
Priority to CN202111406136.9A priority Critical patent/CN114201621B/en
Publication of CN114201621A publication Critical patent/CN114201621A/en
Application granted granted Critical
Publication of CN114201621B publication Critical patent/CN114201621B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a cross-modal retrieval model construction and retrieval method based on image-text cooperative attention, comprising the following steps: acquiring training images and training texts, and extracting the local features of the image samples and text samples; mapping all local image features of an image sample and all local text features of a text sample into feature vectors, representing the feature vectors of each sample as a matrix, and obtaining the Key matrix, Query matrix and Value matrix of each; calculating cross-modal attention features and intra-modal attention features of the image sample and the text sample from these matrices; fusing the cross-modal and intra-modal attention features to obtain the global feature representation of the image sample and the global feature representation of the text sample; and training a cross-modal retrieval model based on the global feature representations. The method can perform similarity matching directly on data of different modalities and achieves high matching accuracy.

Description

Cross-modal retrieval model construction and retrieval method based on image-text cooperative attention
Technical Field
The invention relates to the technical field of cross-modal retrieval of image texts, in particular to a cross-modal retrieval model construction and retrieval method based on image-text cooperative attention.
Background
With the explosive growth of multimedia data of different modalities, such as text and images, on the Internet, single-modality retrieval can no longer meet users' needs, and cross-modal retrieval has emerged. Cross-modal retrieval uses data of one modality as the query to retrieve related data of at least one other modality. For example, a product displayed on an e-commerce website usually includes both a product image and textual descriptions such as the product category, name, attributes and detailed description; when a user searches for content of interest on the website, the user may wish to retrieve data of multiple modalities, such as the text and images related to the product, to obtain more information about it.
However, the representation forms of data in different modalities are inconsistent, so the similarity between such data cannot be measured directly, which leads to the low matching accuracy of conventional cross-modal retrieval methods.
Disclosure of Invention
The problem solved by the invention is that the inconsistent representation forms of data in different modalities prevent direct similarity measurement between them, which results in the low matching accuracy of conventional cross-modal retrieval methods.
The invention provides a cross-modal retrieval model construction method based on image-text cooperative attention, which comprises the following steps:
acquiring a training image and a training text, wherein the training image is an image sample with a class label, and the training text is a text sample with the class label;
extracting local image features of the image samples and extracting local text features of the text samples;
respectively mapping all local image features of the image sample and all local text features of the text sample into feature vectors, respectively representing the feature vectors of the image sample and the feature vectors of the text sample into matrixes, and respectively obtaining a Key matrix, a Query matrix and a Value matrix of each through a full connection layer;
calculating cross-modal attention scores of the image sample and the text sample based on a Key matrix, a Query matrix and a Value matrix of the image sample and the text sample, and generating cross-modal attention features of the image sample and the text sample based on the cross-modal attention scores;
calculating intra-modality attention scores of the image sample and the text sample based on a Key matrix, a Query matrix and a Value matrix of the image sample and the text sample, and generating intra-modality attention features of the image sample and the text sample based on the intra-modality attention scores;
fusing the cross-modal attention feature and the intra-modal attention feature to respectively obtain a global feature representation of the image sample and a global feature representation of the text sample;
and training to obtain a cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample.
Optionally, the calculating cross-modal attention scores for the image sample and the text sample based on the Key matrix, the Query matrix, and the Value matrix of the image sample and the text sample, respectively, and the generating cross-modal attention characteristics for the image sample and the text sample based on the cross-modal attention scores comprises:
performing inner product operations between the Key matrix of the image sample and the Query matrix of the text sample, and between the Key matrix of the text sample and the Query matrix of the image sample, normalizing the results, and calculating through softmax a weight matrix of the image sample to the text sample and a weight matrix of the text sample to the image sample, wherein the cross-modal attention score comprises the weight matrix of the text sample to the image sample and the weight matrix of the image sample to the text sample;
taking the weight matrix of the text sample to the image sample as a score, and performing weighted summation operation on the Value matrix of the image sample to obtain the cross-modal attention feature of the text sample;
and taking the weight matrix of the image sample to the text sample as a score, and performing weighted summation operation on the Value matrix of the text sample to obtain the cross-modal attention feature of the image sample.
Optionally, the calculating intra-modality attention scores for the image sample and the text sample based on the Key matrix, the Query matrix, and the Value matrix of the image sample and the text sample, respectively, and the generating intra-modality attention features for the image sample and the text sample based on the intra-modality attention scores includes:
performing inner product operations between the Key matrix and the Query matrix of the image sample, and between the Key matrix and the Query matrix of the text sample, normalizing the results, and calculating through softmax a weight matrix of the image sample and a weight matrix of the text sample, wherein the intra-modal attention score comprises the weight matrix of the image sample and the weight matrix of the text sample;
taking the weight matrix of the image sample as a score, and performing weighted summation operation on the Value matrix of the image sample to obtain the intra-modal attention feature of the image sample;
and taking the weight matrix of the text sample as a score, and performing weighted summation operation on the Value matrix of the text sample to obtain the intra-modal attention feature of the text sample.
Optionally, the training to obtain the cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample includes a label prediction training task, which specifically includes:
inputting a global feature representation of the image sample or a global feature representation of the text sample into a fully-connected layer, outputting a probability of each label by utilizing softmax, and taking a category label with the highest probability as a prediction label of the image sample or the text sample input into the fully-connected layer;
calculating a loss function for label prediction based on the prediction label and the true class label carried by the image sample or the text sample.
Optionally, the training to obtain the cross-modal search model based on the global feature representation of the image sample and the global feature representation of the text sample further includes a metric learning training task, which specifically includes:
constructing a training sample set for metric learning, wherein one piece of training data in the training sample set comprises anchor samples, a first preset number of cross-modal positive samples of the anchor samples and a second preset number of cross-modal negative samples of the anchor samples, and the anchor samples are the image samples or the text samples;
and respectively calculating the distances between the anchor sample and all positive samples and all negative samples according to the global feature representation of the anchor sample, the global feature representation of the positive sample of the cross-modal anchor sample and the global feature representation of the negative sample of the cross-modal anchor sample, and calculating a loss function of metric learning based on the distances.
Optionally, the training to obtain the cross-modal search model based on the global feature representation of the image sample and the global feature representation of the text sample further includes:
and training the cross-modal retrieval model in a multi-task learning manner based on the loss function of label prediction and the loss function of metric learning.
Optionally, the extracting the local image feature of the image sample includes:
inputting the image sample into a pre-trained Faster R-CNN model and extracting the local image features of the image sample, wherein the Faster R-CNN model comprises a feature extraction network, an RPN network and a region-of-interest pooling network; the feature extraction network extracts features of the image sample and inputs the extracted feature map into the RPN network, the RPN network selects a preset number of regions of interest and marks them with rectangular boxes, and the region-of-interest pooling network extracts the features of the regions of interest marked by the RPN network as the local image features of the image sample.
Optionally, the extracting the local text feature of the text sample includes:
performing word segmentation on the text sample;
obtaining a word vector of each word after word segmentation by using Word2Vec;
and inputting the word vector into a Bi-LSTM network, and acquiring the feature vector representation of each word as the local text feature of the text sample.
The invention also provides a cross-modal retrieval method based on image-text cooperative attention, which comprises the following steps:
acquiring a given retrieval condition element, wherein the retrieval condition element is a retrieval image or a retrieval text;
inputting the retrieval condition elements and elements in a preset retrieval range into a cross-modal retrieval model constructed by the image-text cooperative attention based cross-modal retrieval model construction method, and outputting a third preset number of elements with the highest similarity to the retrieval condition elements in the retrieval range as retrieval results by the cross-modal retrieval model, wherein the elements in the retrieval range comprise images and/or texts.
The invention further provides a cross-modal retrieval device based on image-text cooperative attention, which comprises a computer readable storage medium and a processor, wherein the computer readable storage medium stores a computer program, and the computer program is read by the processor and runs to realize the above cross-modal retrieval model construction method based on image-text cooperative attention or the above cross-modal retrieval method based on image-text cooperative attention.
According to the method, the local image features of the image sample and the local text features of the text sample are extracted and mapped to feature vectors. A cross-modal attention mechanism captures fine-grained interactions between data of the two modalities, while an intra-modal attention mechanism captures the correlation between image regions and the semantic correlation of the text context. The cross-modal and intra-modal attention features are then fused into global feature representations in which images and texts have a consistent form, so that the similarity between image data and text data can be measured directly. The trained cross-modal retrieval model can therefore match data of different modalities directly and with high accuracy.
Drawings
FIG. 1 is a schematic diagram of a cross-modal retrieval model construction method flow based on image-text cooperative attention in an embodiment of the present invention;
FIG. 2 is a diagram illustrating an example of a text sample in an embodiment of the invention;
FIG. 3 is a schematic diagram of an architecture of a cross-modal search model construction method based on image-text cooperative attention according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
As shown in fig. 1, in an embodiment of the present invention, the method for constructing a cross-modal search model based on image-text cooperative attention includes:
step S10, acquiring a training image and a training text, wherein the training image is an image sample with a category label, and the training text is a text sample with a category label.
The training data includes a plurality of training images and training texts, and an image ID may be assigned to each training image and a text ID may be assigned to each training text to distinguish between different training images and different training texts. The training images and the training texts are both provided with class labels, as shown in fig. 2, the training texts can also include image IDs corresponding to the texts in addition to the text class labels and the text descriptions, so as to facilitate the subsequent construction of positive samples in the metric learning training task.
Step S20, extracting local image features of the image sample, and extracting local text features of the text sample.
The local image features of an image sample here refer to regional features, i.e. the features of a plurality of regions of the image sample. They can be extracted with algorithms such as R-CNN, Fast R-CNN or Faster R-CNN.
The local text features of a text sample here refer to the feature representation of each word of the text sample. They can be extracted with algorithms such as Word2Vec and Bi-LSTM.
Step S30, mapping all local image features of the image sample and all local text features of the text sample into feature vectors, respectively representing the feature vectors of the image sample and the feature vectors of the text sample into matrices, and then obtaining a Key matrix, a Query matrix, and a Value matrix of each via a full connection layer.
An image sample has one or more local image features, and a text sample has one or more local text features. Each local image feature of an image sample is mapped to a feature vector, specifically by feeding it into a fully connected layer; likewise, each local text feature of a text sample is mapped to a feature vector through a fully connected layer.
Optionally, the feature vector of the image sample and the feature vector of the text sample have the same dimension.
As shown in fig. 4, if an image sample has k local image features, a text sample has j local text features, and the mapped feature vectors have dimension d, then the local image features of the image sample are mapped to k d-dimensional feature vectors and the local text features of the text sample are mapped to j d-dimensional feature vectors.
Mapping the local image features of the image sample and the local text features of the text sample into vectors of the same dimension facilitates the subsequent interaction between the image sample and the text sample and the computation of their similarity.
As shown in fig. 4, the k d-dimensional feature vectors of an image sample are represented as a matrix P, and the Key matrix, Query matrix and Value matrix of P are obtained through fully connected (Linear) layers, which can be expressed as:
P_K = Linear(P; θ_PK); P_Q = Linear(P; θ_PQ); P_V = Linear(P; θ_PV),
where P_K is the Key matrix of P, P_Q the Query matrix of P, P_V the Value matrix of P, and θ_PK, θ_PQ, θ_PV are the network weight parameters of the fully connected layers.
As shown in fig. 4, the j d-dimensional feature vectors of a text sample are represented as a matrix T, and the Key matrix, Query matrix and Value matrix of T are obtained through fully connected (Linear) layers, which can be expressed as:
T_K = Linear(T; θ_TK); T_Q = Linear(T; θ_TQ); T_V = Linear(T; θ_TV),
where T_K is the Key matrix of T, T_Q the Query matrix of T, T_V the Value matrix of T, and θ_TK, θ_TQ, θ_TV are the network weight parameters of the fully connected layers.
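A minimal PyTorch sketch of step S30 is given below for illustration; it assumes the local features have already been mapped into a shared dimension d, and the concrete sizes d, k and j as well as the variable names are placeholders rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

d = 512                       # shared feature dimension (assumed value)
k, j = 36, 20                 # number of image regions / text words (assumed values)

# One linear projection per matrix and per modality, as in step S30.
proj_img = nn.ModuleDict({name: nn.Linear(d, d) for name in ("K", "Q", "V")})
proj_txt = nn.ModuleDict({name: nn.Linear(d, d) for name in ("K", "Q", "V")})

P = torch.randn(k, d)         # matrix P: the k local image feature vectors of one image sample
T = torch.randn(j, d)         # matrix T: the j local text feature vectors of one text sample

P_K, P_Q, P_V = (proj_img[name](P) for name in ("K", "Q", "V"))
T_K, T_Q, T_V = (proj_txt[name](T) for name in ("K", "Q", "V"))
```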
Step S40, calculating cross-modal attention scores of the image sample and the text sample based on the Key matrix, the Query matrix, and the Value matrix of the image sample and the text sample, and generating cross-modal attention features of the image sample and the text sample based on the cross-modal attention scores, respectively.
Further, the step S40 includes:
performing inner product operations between the Key matrix of the image sample and the Query matrix of the text sample, and between the Key matrix of the text sample and the Query matrix of the image sample, normalizing the results, and calculating through softmax the weight matrix of the image sample to the text sample and the weight matrix of the text sample to the image sample, wherein the cross-modal attention scores comprise the weight matrix of the text sample to the image sample and the weight matrix of the image sample to the text sample.
The inner product of the Key matrix of the image sample and the Query matrix of the text sample (P_K T_Q^T) is computed, scaled by 1/√d, and passed through softmax, so the weight matrix of the image sample to the text sample can be expressed as:
W_PT = softmax(P_K T_Q^T / √d),
performing inner product operation on a Key matrix of a text sample and a Query matrix of an image sample (the formula is represented as T)KPQ) And normalizing the obtained inner product operation result (formula is expressed as
Figure BDA0003372880830000081
) The weight matrix of text samples to image samples can be expressed as:
Figure BDA0003372880830000082
where P_K is the Key matrix of the image sample, T_Q the Query matrix of the text sample, T_K the Key matrix of the text sample, P_Q the Query matrix of the image sample, and d the feature vector dimension of the image sample and the text sample.
Taking the weight matrix of the image sample to the text sample as the score, a weighted summation over the Value matrix of the text sample gives the cross-modal attention feature of the image sample:
P_inter = W_PT × T_V,
where P_inter is the cross-modal attention feature of the image sample, W_PT the weight matrix of the image sample to the text sample, and T_V the Value matrix of the text sample.
Taking the weight matrix of the text sample to the image sample as the score, a weighted summation over the Value matrix of the image sample gives the cross-modal attention feature of the text sample:
T_inter = W_TP × P_V,
where T_inter is the cross-modal attention feature of the text sample, W_TP the weight matrix of the text sample to the image sample, and P_V the Value matrix of the image sample.
The inner product between the Key matrix of the image sample and the Query matrix of the text sample yields attention weights between every local image feature and every local text feature, which determine which parts of the input deserve attention so that limited information-processing resources are allocated to the important parts. Scaling the inner product keeps the resulting attention scores from being influenced by the feature vector dimension d.
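Continuing the sketch above with the same assumed tensors P_K, P_Q, P_V, T_K, T_Q, T_V and dimension d, the cross-modal attention of step S40 can be written as:

```python
import torch.nn.functional as F

scale = d ** 0.5              # √d, so the scores do not grow with the feature dimension

# Cross-modal attention scores (weight matrices)
W_PT = F.softmax(P_K @ T_Q.t() / scale, dim=-1)   # image-to-text weights, shape (k, j)
W_TP = F.softmax(T_K @ P_Q.t() / scale, dim=-1)   # text-to-image weights, shape (j, k)

# Cross-modal attention features
P_inter = W_PT @ T_V          # (k, d) cross-modal attention feature of the image sample
T_inter = W_TP @ P_V          # (j, d) cross-modal attention feature of the text sample
```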
Step S50, calculating intra-modal attention scores of the image sample and the text sample based on the Key matrix, the Query matrix, and the Value matrix of the image sample and the text sample, respectively, and generating intra-modal attention features of the image sample and the text sample based on the intra-modal attention scores.
Further, the step S50 includes:
and performing inner product operation on the Key matrix and the Query matrix of the image sample and the Key matrix and the Query matrix of the text sample respectively, normalizing the inner product operation, and calculating the weight matrix of the image sample and the weight matrix of the text sample respectively through softmax, wherein the intra-modal attention score comprises the weight matrix of the image sample and the weight matrix of the text sample.
The inner product of the Key matrix and the Query matrix of the image sample (P_K P_Q^T) is computed, scaled by 1/√d, and passed through softmax, so the weight matrix of the image sample can be expressed as:
W_PP = softmax(P_K P_Q^T / √d),
where W_PP is the weight matrix of the image sample, P_K the Key matrix of the image sample, P_Q the Query matrix of the image sample, and d the feature vector dimension of the image sample.
The inner product of the Key matrix and the Query matrix of the text sample (T_K T_Q^T) is computed, scaled by 1/√d, and passed through softmax, so the weight matrix of the text sample can be expressed as:
W_TT = softmax(T_K T_Q^T / √d),
where W_TT is the weight matrix of the text sample, T_Q the Query matrix of the text sample, T_K the Key matrix of the text sample, and d the feature vector dimension of the text sample.
Taking the weight matrix of the image sample as the score, a weighted summation over the Value matrix of the image sample gives the intra-modal attention feature of the image sample:
P_intra = W_PP × P_V,
where P_intra is the intra-modal attention feature of the image sample and P_V the Value matrix of the image sample.
Taking the weight matrix of the text sample as the score, a weighted summation over the Value matrix of the text sample gives the intra-modal attention feature of the text sample:
T_intra = W_TT × T_V,
where T_intra is the intra-modal attention feature of the text sample and T_V the Value matrix of the text sample.
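Under the same assumptions as the sketches above, the intra-modal attention of step S50 follows the same pattern within each modality:

```python
# Intra-modal attention scores and features, computed within each modality
W_PP = F.softmax(P_K @ P_Q.t() / scale, dim=-1)   # (k, k) image weight matrix
W_TT = F.softmax(T_K @ T_Q.t() / scale, dim=-1)   # (j, j) text weight matrix

P_intra = W_PP @ P_V          # (k, d) intra-modal attention feature of the image sample
T_intra = W_TT @ T_V          # (j, d) intra-modal attention feature of the text sample
```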
And step S60, fusing the cross-modal attention feature and the intra-modal attention feature to respectively obtain a global feature representation of the image sample and a global feature representation of the text sample.
Specifically, the intra-modal and cross-modal attention features of the image sample are fused by concatenating the two, and a max-pooling layer then reduces the multiple local features of the image sample to a single global feature; for example, k d-dimensional local features are reduced to one d-dimensional global feature representation. The intra-modal and cross-modal attention features of the text sample are fused in the same way, and the global feature representation of the text sample is obtained through a max-pooling layer that plays the same role as the one used for the image sample, which is not repeated here. This can be expressed as:
P_final = MaxPooling([P_inter, P_intra]),
T_final = MaxPooling([T_inter, T_intra]),
where P_final is the global feature representation of the image sample and T_final the global feature representation of the text sample. Fig. 4 shows an example of fusing P_intra (computed from W_PP and P_V) and P_inter (computed from W_PT and T_V).
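A sketch of the fusion in step S60, again under the assumptions above; stacking the two attention features along the local-feature axis before max pooling is one plausible reading of the concatenation described here, chosen so that the pooled global feature keeps dimension d:

```python
# Stack cross-modal and intra-modal features along the local-feature axis,
# then max-pool over that axis to obtain one d-dimensional global feature.
P_final = torch.cat([P_inter, P_intra], dim=0).max(dim=0).values   # (d,)
T_final = torch.cat([T_inter, T_intra], dim=0).max(dim=0).values   # (d,)
```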
And step S70, training to obtain a cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample.
After global feature representations of the image samples and the text samples are obtained, a cross-modal search model is trained by using a label prediction task and a metric learning task.
By extracting the local image features of the image sample and the local text features of the text sample and mapping them to feature vectors, the cross-modal attention mechanism captures fine-grained interactions between data of the two modalities and the intra-modal attention mechanism captures the correlation between image regions and the semantic correlation of the text context. Fusing the cross-modal and intra-modal attention features yields global feature representations in which images and texts have a consistent form, so that the similarity between image data and text data can be measured directly; the trained cross-modal retrieval model can therefore match data of different modalities directly and with high matching accuracy.
Optionally, as shown in fig. 3, step S70 includes a label prediction training task, which specifically includes:
inputting a global feature representation of the image sample or a global feature representation of the text sample into a fully-connected layer, outputting a probability of each label by utilizing softmax, and taking a category label with the highest probability as a prediction label of the image sample or the text sample input into the fully-connected layer;
calculating a loss function for label prediction based on the prediction label and the true class label carried by the image sample or the text sample.
Using the cross entropy loss function as the loss function for label prediction, it can be expressed as:
L_label = -(1/n) Σ_{i=1}^{n} ( y_i · log p_vi + y_i · log p_ti ),
where L_label is the label prediction loss, n the number of samples in a batch, y_i the true label of each sample, p_vi the predicted label distribution generated for the image sample, and p_ti the predicted label distribution generated for the text sample.
The label prediction task ensures that, within a modality, samples with the same label have similar feature representations and samples with different labels have different feature representations.
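A hedged sketch of the label prediction task, reusing the pooled global features from above; the number of classes, the shared classification head and the dummy label are illustrative assumptions:

```python
num_classes = 10                              # assumed number of category labels
classifier = nn.Linear(d, num_classes)        # label-prediction head (shared here for brevity)

y = torch.tensor([3])                         # dummy true class label of this image/text pair

# F.cross_entropy applies softmax internally and compares against the true label
loss_label = F.cross_entropy(classifier(P_final).unsqueeze(0), y) \
           + F.cross_entropy(classifier(T_final).unsqueeze(0), y)
```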
Optionally, as shown in fig. 3, step S70 includes a metric learning training task, which specifically includes:
constructing a training sample set for metric learning, wherein one piece of training data in the training sample set comprises anchor samples, a first preset number of cross-modal positive samples of the anchor samples and a second preset number of cross-modal negative samples of the anchor samples, and the anchor samples are the image samples or the text samples; and respectively calculating the distances between the anchor sample and all positive samples and all negative samples according to the global feature representation of the anchor sample, the global feature representation of the positive sample of the cross-modal anchor sample and the global feature representation of the negative sample of the cross-modal anchor sample, and calculating a loss function of metric learning based on the distances.
When the anchor sample is an image sample, the cross-modal positive sample refers to a text-form positive sample, and the cross-modal negative sample refers to a text-form negative sample; when the anchor sample is a text sample, a positive sample across modalities thereof refers to a positive sample in the form of an image, and a negative sample across modalities thereof refers to a negative sample in the form of an image.
The first preset number may be greater than or equal to 1, with a preferred value of 1. The second preset number may be greater than or equal to 1, with a preferred value of m-1, where m is the number of category labels. For the anchor sample, among all the class labels only one class (the class label of the anchor sample) is positive and the other m-1 classes are negative, and one sample from each class is selected for calculating the loss function.
In one embodiment, for the image and text data of a batch, an image sample is taken as the anchor sample and a text sample with the same class label is randomly sampled as the positive sample. Assuming there are m different class semantic labels, text samples whose classes differ from the anchor's class are sampled as negative samples, one per class, giving m-1 negative samples. Likewise, a text sample is taken as the anchor sample, an image sample with the same semantic label is randomly sampled as the positive sample, and image samples of the other classes are sampled as negative samples, again giving m-1 negative samples.
The metric learning loss is obtained from the distances between the anchor sample and all positive samples and the distances between the anchor sample and all negative samples; specifically, the distance from the anchor sample to the positive samples minus the distance from the anchor sample to the negative samples is used as the metric learning loss, and training with this loss reduces the distance between positive pairs while enlarging the distance between negative pairs. Optionally, the distance between samples is defined using cosine similarity.
Further, the metric learning loss is defined as follows:
L(v) = Σ_{i=1}^{M-1} [ d(v_T, t^+) - d(v_T, t_i) ],
L(t) = Σ_{i=1}^{M-1} [ d(t_T, v^+) - d(t_T, v_i) ],
L_metric = L(v) + L(t),
where d(·,·) is the cosine-similarity-based distance between samples; L(v) is the metric learning loss for image samples, v_T the selected image sample, t^+ the text sample of the same class as the image sample (positive sample), M the number of classes, and t_i the text samples of different classes (negative samples); L(t) is the metric learning loss for text samples, t_T the selected text sample, v^+ the image sample of the same class as the text sample (positive sample), and v_i the image samples of different classes (negative samples). L_metric is the overall metric learning loss.
Samples with similar semantics in different modalities have similar feature representations, while samples with different semantics in different modalities have different feature representations, which are guaranteed by a metric learning task.
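A sketch of the metric learning loss under the same assumptions, using cosine-similarity-based distances; summing "distance to the positive minus distance to each negative" over the negatives follows the description above and is an assumption where the patent's exact formula is not shown:

```python
def cosine_distance(a, b):
    # distance between two global feature vectors, defined via cosine similarity
    return 1.0 - F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))[0]

def metric_loss(anchor, positive, negatives):
    # distance to the positive minus distance to each negative, summed over the negatives
    return sum(cosine_distance(anchor, positive) - cosine_distance(anchor, neg)
               for neg in negatives)

# image anchor with a text positive and text negatives, plus the symmetric text anchor
neg_txt = [torch.randn(d) for _ in range(4)]   # dummy cross-modal negatives (m-1 of them)
neg_img = [torch.randn(d) for _ in range(4)]
loss_metric = metric_loss(P_final, T_final, neg_txt) + metric_loss(T_final, P_final, neg_img)
```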
The step S70 further includes: training the cross-modal retrieval model in a multi-task learning manner based on the loss function of label prediction and the loss function of metric learning.
Specifically, the loss function of the multitask learning is defined as follows:
L = α·L_label + β·L_metric,
where α and β are hyper-parameters used to balance the weights of the two task loss functions.
Training the cross-modal retrieval model in a multi-task learning manner ensures that, within each modality, samples with the same label have similar feature representations and samples with different labels have different feature representations, while across modalities samples with similar semantics have similar feature representations and samples with different semantics have different feature representations.
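Putting the two losses together, one multi-task training step might look like the following; alpha and beta are assumed values of the balancing hyper-parameters:

```python
alpha, beta = 1.0, 1.0                          # assumed loss weights (α and β)
loss = alpha * loss_label + beta * loss_metric  # L = αL_label + βL_metric
loss.backward()                                 # one multi-task training step (optimizer step omitted)
```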
Optionally, as shown in fig. 3, the extracting the local image feature of the image sample includes:
inputting the image sample into a pre-trained Faster R-CNN model and extracting the local image features of the image sample, wherein the Faster R-CNN model comprises a feature extraction network, an RPN network and a region-of-interest pooling network; the feature extraction network extracts features of the image sample and inputs the extracted feature map into the RPN network, the RPN network selects a preset number of regions of interest and marks them with rectangular boxes, and the region-of-interest pooling network extracts the features of the regions of interest marked by the RPN network as the local image features of the image sample.
The feature extraction network consists of a group of convolution layers, a relu activation function layer and a pooling layer.
After the extracted feature map is input into the RPN network, the RPN works in two parts: the first part selects k regions of interest through binary classification, specifically through softmax classification, and the second part marks the approximate positions of these regions of interest with rectangular boxes.
The region-of-interest pooling network then extracts, based on the region positions marked by the RPN network and the feature map produced by the feature extraction network, the features of the k regions, i.e. the local image features of the image sample.
Using the Faster R-CNN model allows the local image features to be extracted from the image sample efficiently and quickly.
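One way to obtain such region features from a pre-trained Faster R-CNN is sketched below with torchvision; the hook-based extraction is an assumption of this write-up rather than the patent's stated implementation, and a recent torchvision release is assumed:

```python
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

captured = {}
# Capture the pooled per-region features produced inside the ROI heads.
handle = detector.roi_heads.box_head.register_forward_hook(
    lambda module, inputs, output: captured.update(regions=output))

with torch.no_grad():
    detector([torch.rand(3, 480, 640)])        # dummy image; a real image sample would go here

region_features = captured["regions"]          # (num_proposals, feature_dim) local image features
handle.remove()
```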
Optionally, as shown in fig. 3, the extracting the local text feature of the text sample includes:
and performing word segmentation on the text sample. The specific word segmentation method can select algorithms such as hidden markov models, jieba word segmentation and the like, which are all in the prior art and are not described herein.
A word vector of each word after segmentation is obtained using Word2Vec, specifically using the Skip-Gram model in Word2Vec. The related content belongs to the prior art and is not detailed here.
The word vectors are input into a Bi-LSTM network, and the feature vector representation of each word is obtained as the local text features of the text sample. Bi-LSTM is a type of RNN suitable for modeling sequential data such as the text data here: by learning which information to remember and which to forget it captures longer-distance dependencies, and by combining a forward LSTM with a backward LSTM it captures bidirectional semantic dependencies.
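A sketch of the text branch under these choices, using jieba for segmentation, gensim's Skip-Gram Word2Vec and a PyTorch Bi-LSTM; the corpus, vector size and hidden size are placeholder assumptions:

```python
import jieba
import torch
import torch.nn as nn
from gensim.models import Word2Vec

corpus = ["这是一段商品描述", "这是另一段商品描述"]                  # placeholder corpus
tokenized = [jieba.lcut(sentence) for sentence in corpus]

# Skip-Gram word vectors (sg=1 selects the Skip-Gram model in gensim's Word2Vec)
w2v = Word2Vec(tokenized, vector_size=300, sg=1, min_count=1)

# Bi-LSTM whose per-word outputs serve as the local text features
bilstm = nn.LSTM(input_size=300, hidden_size=150, bidirectional=True, batch_first=True)

words = tokenized[0]
vectors = torch.tensor([w2v.wv[word] for word in words]).unsqueeze(0)   # (1, j, 300)
local_text_features, _ = bilstm(vectors)                                # (1, j, 300)
```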
The invention also provides a cross-modal retrieval method based on image-text cooperative attention, which comprises the following steps:
acquiring a given retrieval condition element, wherein the retrieval condition element is a retrieval image or a retrieval text; inputting the retrieval condition elements and elements in a preset retrieval range into a cross-modal retrieval model constructed by the cross-modal retrieval model construction method based on image-text cooperative attention, and outputting a third preset number of elements with the highest similarity to the retrieval condition elements in the retrieval range as retrieval results by the cross-modal retrieval model, wherein the elements in the retrieval range comprise images and/or texts.
After a given retrieval condition element is obtained, it is input into a cross-modal retrieval model constructed/trained by the above construction method based on image-text cooperative attention, the elements within a preset retrieval range are input into the same model, and the cross-modal retrieval model outputs the third preset number of elements within the retrieval range that have the highest similarity to the retrieval condition element.
The preset search range may be limited to include only image elements, only text elements, or both image elements and text elements. In one embodiment, the given search condition element is an image, and the preset search range only includes text elements, then the cross-modal search model outputs a preset number of texts with the highest similarity to the search condition element within the search range; in another embodiment, if the given search condition element is a text and the preset search range includes only image elements, the cross-modal search model outputs an image having the highest similarity to the search condition element within the search range.
The retrieval condition element and the elements within the preset retrieval range are input into a cross-modal retrieval model constructed by the image-text cooperative attention based construction method. The model uses the cross-modal attention mechanism to capture fine-grained interactions between data of the two modalities and the intra-modal attention mechanism to capture the correlation between image regions and the semantic correlation of the text context, and fuses the cross-modal and intra-modal attention features into global feature representations in which images and texts have a consistent form, so that the similarity between images and texts of different modalities can be measured directly and the accuracy of cross-modal retrieval is improved.
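At inference time, retrieval then reduces to ranking candidate elements by the similarity of their global features to the query's global feature; the sketch below reuses names from the sketches above, with cosine similarity and top-k selection as assumptions consistent with the distance used during training:

```python
def retrieve(query_feature, candidate_features, top_k=5):
    # rank candidate elements (images and/or texts) by cosine similarity to the query
    sims = F.cosine_similarity(query_feature.unsqueeze(0), candidate_features, dim=-1)
    return sims.topk(min(top_k, candidate_features.size(0))).indices

candidates = torch.randn(100, d)            # global features of the elements in the search range
top_elements = retrieve(P_final, candidates)  # e.g. an image query retrieving the most similar elements
```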
In an embodiment of the present invention, the cross-modal retrieval apparatus based on image-text cooperative attention includes a computer-readable storage medium storing a computer program and a processor, wherein the computer program is read and executed by the processor to implement the above cross-modal retrieval model construction method based on image-text cooperative attention or the above cross-modal retrieval method based on image-text cooperative attention. Compared with the prior art, the advantages of the apparatus are the same as those of the cross-modal retrieval method based on image-text cooperative attention and are not repeated here.
The reader should understand that in the description of this specification, reference to the description of the terms "one embodiment," "some embodiments," "an example," "a specific example" or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A cross-modal retrieval model construction method based on image-text cooperative attention is characterized by comprising the following steps:
acquiring a training image and a training text, wherein the training image is an image sample with a class label, and the training text is a text sample with the class label;
extracting local image features of the image samples and extracting local text features of the text samples;
respectively mapping all local image features of the image sample and all local text features of the text sample into feature vectors, respectively representing the feature vectors of the image sample and the feature vectors of the text sample into matrixes, and respectively obtaining a Key matrix, a Query matrix and a Value matrix of each through a full connection layer;
calculating cross-modal attention scores of the image sample and the text sample based on a Key matrix, a Query matrix and a Value matrix of the image sample and the text sample, and generating cross-modal attention features of the image sample and the text sample based on the cross-modal attention scores;
calculating intra-modality attention scores of the image sample and the text sample based on a Key matrix, a Query matrix and a Value matrix of the image sample and the text sample, and generating intra-modality attention features of the image sample and the text sample based on the intra-modality attention scores;
fusing the cross-modal attention feature and the intra-modal attention feature to respectively obtain a global feature representation of the image sample and a global feature representation of the text sample;
and training to obtain a cross-modal retrieval model based on the global feature representation of the image sample and the global feature representation of the text sample.
2. The method for constructing a cross-modal search model based on image-text cooperative attention according to claim 1, wherein the calculating cross-modal attention scores of the image sample and the text sample based on the Key matrix, the Query matrix, and the Value matrix of the image sample and the text sample, respectively, and the generating the cross-modal attention characteristics of the image sample and the text sample based on the cross-modal attention scores comprises:
performing inner product operations between the Key matrix of the image sample and the Query matrix of the text sample, and between the Key matrix of the text sample and the Query matrix of the image sample, normalizing the results, and calculating through softmax a weight matrix of the image sample to the text sample and a weight matrix of the text sample to the image sample, wherein the cross-modal attention score comprises the weight matrix of the text sample to the image sample and the weight matrix of the image sample to the text sample;
taking the weight matrix of the text sample to the image sample as a score, and performing weighted summation operation on the Value matrix of the image sample to obtain the cross-modal attention feature of the text sample;
and taking the weight matrix of the image sample to the text sample as a score, and performing weighted summation operation on the Value matrix of the text sample to obtain the cross-modal attention feature of the image sample.
3. The method for constructing a cross-modal search model based on image-text cooperative attention according to claim 1, wherein the calculating intra-modal attention scores of the image sample and the text sample based on the Key matrix, Query matrix and Value matrix of the image sample and the text sample, respectively, and the generating intra-modal attention features of the image sample and the text sample based on the intra-modal attention scores comprises:
performing inner product operations between the Key matrix and the Query matrix of the image sample, and between the Key matrix and the Query matrix of the text sample, normalizing the results, and calculating through softmax a weight matrix of the image sample and a weight matrix of the text sample, wherein the intra-modal attention score comprises the weight matrix of the image sample and the weight matrix of the text sample;
taking the weight matrix of the image sample as a score, and performing weighted summation operation on the Value matrix of the image sample to obtain the intra-modal attention feature of the image sample;
and taking the weight matrix of the text sample as a score, and performing weighted summation operation on the Value matrix of the text sample to obtain the intra-modal attention feature of the text sample.
4. The method for constructing a cross-modal search model based on image-text cooperative attention as claimed in claim 1, wherein the training to obtain the cross-modal search model based on the global feature representation of the image sample and the global feature representation of the text sample includes a label prediction training task, specifically including:
inputting a global feature representation of the image sample or a global feature representation of the text sample into a fully-connected layer, outputting a probability of each label by utilizing softmax, and taking a category label with the highest probability as a prediction label of the image sample or the text sample input into the fully-connected layer;
calculating a loss function for label prediction based on the prediction label and the true class label carried by the image sample or the text sample.
5. The method for constructing a cross-modal search model based on image-text cooperative attention as claimed in claim 4, wherein the training to obtain the cross-modal search model based on the global feature representation of the image sample and the global feature representation of the text sample further comprises a metric learning training task, specifically comprising:
constructing a training sample set for metric learning, wherein one piece of training data in the training sample set comprises anchor samples, a first preset number of cross-modal positive samples of the anchor samples and a second preset number of cross-modal negative samples of the anchor samples, and the anchor samples are the image samples or the text samples;
and respectively calculating the distances between the anchor sample and all positive samples and all negative samples according to the global feature representation of the anchor sample, the global feature representation of the positive sample of the cross-modal anchor sample and the global feature representation of the negative sample of the cross-modal anchor sample, and calculating a loss function of metric learning based on the distances.
6. The method as claimed in claim 5, wherein the training of the cross-modal search model based on the global feature representation of the image sample and the global feature representation of the text sample further comprises:
and training the cross-modal retrieval model in a multi-task learning manner based on the loss function of label prediction and the loss function of metric learning.
7. The method for constructing a cross-modal search model based on image-text cooperative attention according to any one of claims 1 to 6, wherein the extracting the local image features of the image sample comprises:
inputting the image sample into a pre-trained Faster R-CNN model and extracting the local image features of the image sample, wherein the Faster R-CNN model comprises a feature extraction network, an RPN network and a region-of-interest pooling network; the feature extraction network extracts features of the image sample and inputs the extracted feature map into the RPN network, the RPN network selects a preset number of regions of interest and marks them with rectangular boxes, and the region-of-interest pooling network extracts the features of the regions of interest marked by the RPN network as the local image features of the image sample.
8. The method for constructing a cross-modal search model based on image-text cooperative attention according to any one of claims 1 to 6, wherein the extracting the local text features of the text sample comprises:
performing word segmentation on the text sample;
obtaining a word vector of each word after word segmentation by using Word2Vec;
and inputting the word vector into a Bi-LSTM network, and acquiring the feature vector representation of each word as the local text feature of the text sample.
9. A cross-modal retrieval method based on image-text cooperative attention is characterized by comprising the following steps:
acquiring a given retrieval condition element, wherein the retrieval condition element is a retrieval image or a retrieval text;
inputting the retrieval condition elements and elements in a preset retrieval range into a cross-modal retrieval model constructed by the image-text cooperative attention based cross-modal retrieval model construction method according to any one of claims 1 to 8, and outputting a third preset number of elements with the highest similarity to the retrieval condition elements in the retrieval range as a retrieval result by the cross-modal retrieval model, wherein the elements in the retrieval range comprise images and/or texts.
10. A cross-modal retrieval apparatus based on image-text cooperative attention, comprising a computer-readable storage medium storing a computer program and a processor, wherein the computer program is read by the processor and executed to implement the cross-modal retrieval model construction method based on image-text cooperative attention according to any one of claims 1 to 8 or the cross-modal retrieval method based on image-text cooperative attention according to claim 9.
CN202111406136.9A 2021-11-24 2021-11-24 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention Active CN114201621B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111406136.9A CN114201621B (en) 2021-11-24 2021-11-24 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention

Publications (2)

Publication Number Publication Date
CN114201621A true CN114201621A (en) 2022-03-18
CN114201621B CN114201621B (en) 2024-04-02

Family

ID=80648805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111406136.9A Active CN114201621B (en) 2021-11-24 2021-11-24 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention

Country Status (1)

Country Link
CN (1) CN114201621B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
US20200401938A1 (en) * 2019-05-29 2020-12-24 The Board Of Trustees Of The Leland Stanford Junior University Machine learning based generation of ontology for structural and functional mapping
CN113392254A (en) * 2021-03-29 2021-09-14 西安理工大学 Image text retrieval method based on context awareness
CN113095415A (en) * 2021-04-15 2021-07-09 齐鲁工业大学 Cross-modal hashing method and system based on multi-modal attention mechanism
CN113220919A (en) * 2021-05-17 2021-08-06 河海大学 Dam defect image text cross-modal retrieval method and model
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张天; 靳聪; 帖云; 李小兵: "Research on a Content Matching Method for Audio Databases Oriented to Cross-Modal Retrieval", Journal of Signal Processing (信号处理), no. 06, 31 December 2020 (2020-12-31), pages 180-190 *
邓一姣; 张凤荔; 陈学勤; 艾擎; 余苏?: "A Collaborative Attention Network Model for Cross-Modal Retrieval", Computer Science (计算机科学), no. 04, 31 December 2020 (2020-12-31), pages 60-65 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841243A (en) * 2022-04-02 2022-08-02 中国科学院上海高等研究院 Cross-modal retrieval model training method, cross-modal retrieval method, device and medium
CN114969405B (en) * 2022-04-30 2024-01-26 苏州浪潮智能科技有限公司 Cross-modal image-text mutual detection method
CN114969405A (en) * 2022-04-30 2022-08-30 苏州浪潮智能科技有限公司 Cross-modal image-text mutual inspection method
CN114663737A (en) * 2022-05-20 2022-06-24 浪潮电子信息产业股份有限公司 Object identification method and device, electronic equipment and computer readable storage medium
CN114691907A (en) * 2022-05-31 2022-07-01 上海蜜度信息技术有限公司 Cross-modal retrieval method, device and medium
CN114707007A (en) * 2022-06-07 2022-07-05 苏州大学 Image text retrieval method and device and computer storage medium
CN114707007B (en) * 2022-06-07 2022-08-30 苏州大学 Image text retrieval method and device and computer storage medium
CN115909317A (en) * 2022-07-15 2023-04-04 广东工业大学 Learning method and system for three-dimensional model-text joint expression
CN114973294B (en) * 2022-07-28 2022-10-21 平安科技(深圳)有限公司 Image-text matching method, device, equipment and storage medium
CN114973294A (en) * 2022-07-28 2022-08-30 平安科技(深圳)有限公司 Image-text matching method, device, equipment and storage medium
CN115017358B (en) * 2022-08-09 2022-11-04 南京理工大学 Cross-modal retrieval method and system for multi-modal interaction
CN115017358A (en) * 2022-08-09 2022-09-06 南京理工大学 Cross-modal retrieval method and system for multi-modal interaction
CN115238130A (en) * 2022-09-21 2022-10-25 之江实验室 Time sequence language positioning method and device based on modal customization cooperative attention interaction
CN115238130B (en) * 2022-09-21 2022-12-06 之江实验室 Time sequence language positioning method and device based on modal customization collaborative attention interaction
CN115658955A (en) * 2022-11-08 2023-01-31 苏州浪潮智能科技有限公司 Cross-media retrieval and model training method, device, equipment and menu retrieval system
CN115861995A (en) * 2023-02-08 2023-03-28 山东海量信息技术研究院 Visual question-answering method and device, electronic equipment and storage medium
CN115861995B (en) * 2023-02-08 2023-05-23 山东海量信息技术研究院 Visual question-answering method and device, electronic equipment and storage medium
CN116433727A (en) * 2023-06-13 2023-07-14 北京科技大学 Scalable single-stream tracking method based on staged continuous learning
CN116433727B (en) * 2023-06-13 2023-10-27 北京科技大学 Scalable single-stream tracking method based on staged continuous learning

Also Published As

Publication number Publication date
CN114201621B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN114201621B (en) Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN109388807B (en) Method, device and storage medium for identifying named entities of electronic medical records
WO2020034849A1 (en) Music recommendation method and apparatus, and computing device and medium
US9633045B2 (en) Image ranking based on attribute correlation
US11222055B2 (en) System, computer-implemented method and computer program product for information retrieval
US11514244B2 (en) Structured knowledge modeling and extraction from images
EP3143523B1 (en) Visual interactive search
WO2022116537A1 (en) News recommendation method and apparatus, and electronic device and storage medium
US10606883B2 (en) Selection of initial document collection for visual interactive search
CN110362723B (en) Topic feature representation method, device and storage medium
US9037600B1 (en) Any-image labeling engine
JP6884116B2 (en) Information processing equipment, information processing methods, and programs
CN112364204B (en) Video searching method, device, computer equipment and storage medium
CN108846097B (en) User interest tag representation method, article recommendation device and equipment
CN113657425A (en) Multi-label image classification method based on multi-scale and cross-modal attention mechanism
TW201220099A (en) Multi-modal approach to search query input
Yuan et al. Mining compositional features from GPS and visual cues for event recognition in photo collections
CN111539197A (en) Text matching method and device, computer system and readable storage medium
CN112528053A (en) Multimedia library classified retrieval management system
CN114332680A (en) Image processing method, video searching method, image processing device, video searching device, computer equipment and storage medium
Zhu et al. Multimodal sparse linear integration for content-based item recommendation
JP2012194691A (en) Re-learning method and program of discriminator, image recognition device
CN112182451A (en) Webpage content abstract generation method, equipment, storage medium and device
Aygun et al. Multimedia retrieval that works

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant