CN110647904A - Cross-modal retrieval method and system based on unmarked data migration - Google Patents

Cross-modal retrieval method and system based on unmarked data migration

Info

Publication number
CN110647904A
Authority
CN
China
Prior art keywords
modal
cross
loss
data
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910707010.1A
Other languages
Chinese (zh)
Other versions
CN110647904B (en)
Inventor
朱福庆
王雪如
张卫博
戴娇
虎嵩林
韩冀中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201910707010.1A priority Critical patent/CN110647904B/en
Publication of CN110647904A publication Critical patent/CN110647904A/en
Application granted granted Critical
Publication of CN110647904B publication Critical patent/CN110647904B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23211Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with adaptive number of clusters

Abstract

The invention provides a cross-modal retrieval method and system based on unlabeled data migration. The invention effectively addresses the small scale of cross-modal datasets and better matches the practical situation in which user queries fall outside the predefined category range; at the same time, it better extracts the high-level semantic information of data in different modalities, overcomes the heterogeneity gap between modalities, increases inter-modal similarity, and improves cross-modal retrieval accuracy.

Description

Cross-modal retrieval method and system based on unmarked data migration
Technical Field
The invention relates to the technical field of cross-modal data retrieval, and in particular to a cross-modal retrieval method and system based on unlabeled data migration.
Background
Data of different modalities, such as images and text, are ubiquitous on the internet and show a trend of fusing with one another. The cross-modal retrieval task attempts to break the boundaries between modalities and retrieve information across them: given a query sample in one modality, it retrieves semantically similar samples in other modalities. It is widely applied in search engines and big-data management. Existing cross-modal retrieval methods map the feature representations of different modalities into a common space to learn a unified representation, and measure similarity by computing the distance between the corresponding unified representations. However, owing to the heterogeneity of different modalities, their data distributions and representations are inconsistent, semantic association is difficult to establish, and cross-modal similarity remains hard to measure.
Although the internet contains abundant image and text data, most of it is unlabeled and hard to exploit, even though it carries rich semantic information. On the one hand, data annotation is expensive; on the other hand, internet content is constantly updated, and every new trending event brings large amounts of images and texts of new categories, so data of all categories cannot be annotated. Making full use of unlabeled data is therefore a major challenge for traditional cross-modal retrieval methods.
For the above reasons, in real scenarios the query submitted by a user does not necessarily fall within the predefined category range, and the training set and test set sometimes do not share the same categories. Existing cross-modal retrieval methods generally address only the case where training data and test data share the same categories (non-extensible cross-modal retrieval). Constructing a better cross-modal common space, so that for any input sample, whether its category is known or unknown, the related multi-modal data can be retrieved, is of great significance in practical applications.
Disclosure of Invention
In order to solve the problems of data heterogeneity across modalities, the abundance of unlabeled data, insufficient training data, and the lack of extensibility, the invention provides a cross-modal retrieval method and system based on unlabeled data migration.
The technical scheme of the invention is as follows:
a cross-modal retrieval method based on unmarked data migration comprises the following steps:
inputting a sample to be retrieved into a trained cross-modal data retrieval model to obtain its feature representation;
calculating the Euclidean distances between each sample to be retrieved and all samples of the other modality and then sorting, wherein the samples of the other modality whose distance is smaller than a specified threshold are the retrieval results;
the training process of the cross-modal data retrieval model is as follows:
(1) setting pseudo labels for the unmarked images and the texts respectively by a clustering method;
(2) respectively migrating the knowledge contained in the pseudo-labeled unlabeled images and texts to the image and text parts of the cross-modal dataset, and learning separate representations of the images and texts of the cross-modal dataset;
(3) feeding the separate representations of the images and texts into the same network, and learning a common representation of the images and texts in the same semantic space.
Further, the method for determining the threshold is as follows: during training, the value of Loss_cross-modal is the distance between paired images and texts; according to this loss value, 10-20 initial thresholds are set, and the retrieval mAP value (mAP, mean average precision, measures how well the learned model performs over all queries; AP, average precision, measures how well it performs on a single query) is calculated under each threshold; the threshold that maximizes the mAP value is the retrieval threshold. Here Loss_cross-modal is the loss function of cross-modal knowledge:

$$Loss_{cross\text{-}modal}=\sum_{l\in\{6,7\}}\sum_{p=1}^{n_l}\left\|g\!\left(i_p^{(l)}\right)-g\!\left(t_p^{(l)}\right)\right\|_2^2$$

where l6 and l7 refer to the two fully connected layers connected to the images and texts of the cross-modal dataset, n_l is the number of input image-text pairs, (i_p^(l), t_p^(l)) denotes the p-th image-text pair at layer l, and g(·) maps the images and texts into feature vectors.
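As a rough illustration, the threshold search could be sketched as follows; the relevance criterion (shared category label), the candidate range around the observed loss value, and the helper names are assumptions, since the text specifies only that 10-20 candidate thresholds are derived from the Loss_cross-modal value and the one maximizing mAP is kept.

```python
import numpy as np

def average_precision(ranked_relevance: np.ndarray) -> float:
    """AP for one query; ranked_relevance is a 0/1 vector in ranked order."""
    hits = np.cumsum(ranked_relevance)
    if hits[-1] == 0:
        return 0.0
    precision_at_k = hits / (np.arange(len(ranked_relevance)) + 1)
    return float((precision_at_k * ranked_relevance).sum() / hits[-1])

def mean_average_precision(dist, query_labels, gallery_labels, threshold):
    """mAP over all queries, keeping only gallery items closer than `threshold`."""
    aps = []
    for q in range(dist.shape[0]):
        keep = dist[q] < threshold                    # candidate results
        order = np.argsort(dist[q][keep])             # rank by distance
        relevant = (gallery_labels[keep][order] == query_labels[q]).astype(int)
        if relevant.size:
            aps.append(average_precision(relevant))
    return float(np.mean(aps)) if aps else 0.0

def select_threshold(dist, query_labels, gallery_labels, base_loss, n_candidates=15):
    """Scan 10-20 candidates derived from the observed Loss_cross-modal value
    (`base_loss`) and keep the threshold with the highest mAP."""
    candidates = np.linspace(0.5 * base_loss, 2.0 * base_loss, n_candidates)
    scores = [mean_average_precision(dist, query_labels, gallery_labels, t)
              for t in candidates]
    return candidates[int(np.argmax(scores))]
```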
A cross-modal retrieval system based on unmarked data migration, comprising:
the system comprises an unlabeled-data clustering module, a data migration module, and a common space learning module; a migration dataset is constructed by the unlabeled-data clustering module and serves as the migration source domain of the data migration module; finally, the common space learning module learns a unified representation of the images and texts produced by the data migration module and establishes a similarity-measurement basis for cross-modal data, thereby realizing cross-modal retrieval.
Further, the unlabeled-data clustering module comprises an image clustering submodule and a text clustering submodule. The module extracts the features of all unlabeled images/texts and then performs unsupervised clustering to obtain a set of cluster centers; the image/text samples under the same cluster center are grouped into one class and assigned the same label, which completes the construction of the migration dataset.
Further, the data migration module comprises an image migration submodule and a text migration submodule, and migration occurs only within the same submodule. For each submodule, the migration source domain is the unlabeled data of the corresponding modality after clustering, and the target domain is the data of the corresponding modality in the cross-modal dataset. Transfer learning is achieved by minimizing the distribution loss between the source domain and the target domain. The cross-modal dataset is input in pairs belonging to the same category, so the representations finally generated should be similar; by minimizing the pairwise Euclidean distance between the two modalities, images and texts with the same semantic information are drawn as close as possible and those with different semantics pushed as far apart as possible, independent of modality.
Furthermore, the common space learning module feeds the separate representations of images and texts obtained by the data migration module into the same network to learn a unified representation of data from different modalities. The network comprises several shared fully connected layers; word embedding vectors of the cross-modal dataset categories are added to the network, which increases the semantic association among different modalities and further strengthens the semantic information.
The method has the beneficial effects that:
According to the method, a large number of unlabeled single-modal datasets are clustered and assigned pseudo labels, and the clustered unlabeled data are migrated to the cross-modal dataset, which effectively mitigates the small scale of cross-modal datasets and better matches the case where actual user queries fall outside the predefined category range. The method better extracts the high-level semantic information of data in different modalities, overcomes the heterogeneity gap between modalities, increases inter-modal similarity, and improves cross-modal retrieval accuracy. It achieves good results both on public datasets and in practical applications.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a data migration flow diagram;
FIG. 3 is a flow diagram of a feature extraction system.
Detailed Description
This section mainly introduces the modeling of the transfer-learning-based cross-modal retrieval network, unlabeled-data clustering, data migration, common-representation learning, and the testing process.
The method will be further described with reference to the accompanying drawings.
Modeling of a cross-modal retrieval network based on transfer learning:
Clustering of unlabeled data: given an unlabeled dataset S, an image clustering algorithm C_i clusters the unlabeled images S_i into k_i classes, and a text clustering algorithm C_t clusters the unlabeled texts S_t into k_t classes; within each class, all images and texts under the same cluster center are marked with the same pseudo label y_i. A transfer learning algorithm T then migrates the clustered unlabeled dataset S to the cross-modal dataset D, and joint training generates separate vector representations R_i, R_t for the images and texts of the cross-modal dataset. Finally, the separate representations R_i, R_t and the word embedding vector V of the categories are fed into the same fully connected network F, which generates a common representation R of the images and texts in the same space. Wherein:
Unlabeled dataset S = {S_i, S_t}: the source domain of transfer learning, where S_i is the unlabeled image dataset and S_t is the unlabeled text dataset.
Cross-modal dataset D = {D_i, D_t}: D_i and D_t are the images and texts of the cross-modal dataset; the images and texts are input in pairs and are correlated, and for each image/text pair the image and text come from the same article, or the text is a description of the image.
Word embedding vector V: all known categories of the cross-modal dataset are converted into 300-dimensional word vectors by the Word2vec model.
Text input: a text is a description of an image and may be an article, a paragraph, a sentence, a word, etc. Text vectors are extracted with Bert and have 768 dimensions.
Image input: in this network, the input of the image branch is the original 224 x 224 image.
Clustering algorithms C = {C_i, C_t}: C_i is the image clustering algorithm and C_t is the text clustering algorithm.
Numbers of clusters k_i, k_t: obtained through experience and repeated calculation.
Migration algorithm T: an algorithm that acquires knowledge from a source domain to promote a target task, where the source domain differs from the target domain or the source task differs from the target task.
Common representation vector R: the finally generated vector representation of the images and texts.
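As a point of reference, the text and category inputs above might be produced as in the following sketch; the specific checkpoints (bert-base-uncased, the GoogleNews Word2vec file) and the use of the CLS token as the sentence vector are assumptions, the text only fixing the dimensionalities (768 for Bert text vectors, 300 for Word2vec category vectors).

```python
import numpy as np
import torch
from transformers import BertModel, BertTokenizer
from gensim.models import KeyedVectors

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def text_vector(text: str) -> np.ndarray:
    """768-dimensional Bert text vector (CLS-token pooling, an assumption)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[0, 0].numpy()        # shape (768,)

# 300-dimensional Word2vec category vectors; the pretrained file below is
# a hypothetical choice.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def category_vector(category: str) -> np.ndarray:
    return w2v[category]                              # shape (300,)
```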
A label-free data clustering module:
For unlabeled images, which contain rich semantic information, a pre-trained VGG network is first used to extract a feature vector for each image, and the images are then clustered with the KMeans method. The specific procedure is as follows: set the initial number of cluster centers (namely k_i) according to the number and distribution of the unlabeled images, and randomly select k_i images as the initial cluster centers. Traverse all images, assign each image to the nearest cluster center, update the mean of each cluster as the new cluster center, and iterate until no cluster changes any more or the maximum number of iterations is reached. All samples of the same cluster are grouped into one class and assigned the same label, which builds the source-domain dataset for image migration.
For the unlabeled text, firstly using Bert to extract the characteristics of each text, then adopting the same unsupervised clustering method as the images to classify similar texts into the same cluster and marking the same labels for constructing a source domain data set of text migration.
Method for determining a suitable number of cluster centers: set initial values of k from 5 to 15 according to the amount of unlabeled data, cluster for each value of k, and record the corresponding SSE (sum of squared errors: the sum of the distances between each sample point and its cluster center). As the number of clusters increases, the sample division becomes finer, the cohesion of each cluster gradually improves, and the SSE gradually decreases. When k is smaller than the optimal number of clusters, increasing k greatly increases the cohesion of each cluster, so the SSE drops sharply; once k reaches the optimal number, the cohesion gained by further increasing k falls off rapidly, so the decrease of the SSE slows abruptly and then flattens as k keeps growing. Plotting the relationship between k and the SSE, the point where the slope changes is the optimal value of k.
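The elbow search might look like the following sketch; the second-difference heuristic for locating the slope change is an assumption, the text only prescribing plotting k against SSE and reading off the elbow.

```python
import numpy as np
from sklearn.cluster import KMeans

features = np.random.rand(500, 768)       # dummy stand-in for extracted features

sse = {}
for k in range(5, 16):                    # initial k values of 5-15, as in the text
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    sse[k] = km.inertia_                  # SSE: squared distances to cluster centers

# Locate the "elbow": the k where the per-step SSE drop flattens the most
# (largest second difference), a simple stand-in for reading the plot.
ks = sorted(sse)
drops = [sse[ks[i]] - sse[ks[i + 1]] for i in range(len(ks) - 1)]
curvature = [drops[i] - drops[i + 1] for i in range(len(drops) - 1)]
best_k = ks[int(np.argmax(curvature)) + 1]
```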
A data migration module:
The data migration module comprises two parts: single-modal knowledge migration and cross-modal knowledge sharing.
Single-modal migration refers to migrating the clustered unlabeled images to the image part of the cross-modal dataset and the clustered unlabeled texts to the text part. The module therefore comprises two single-modal migration submodules, one for images and one for texts.
Referring to fig. 2, for image migration, the migration source domain is the clustered unlabeled images and the target domain is the image part of the cross-modal data. The pictures of the source domain and the target domain are first fed into the network, pass through the first five convolutional layers of the AlexNet network, and then through three added fully connected layers fc6, fc7 and fc8, where the loss function of the source domain is the SoftMax loss. Knowledge migration for the image modality is achieved by minimizing the MMD (Maximum Mean Discrepancy, used to measure the difference between two different but related distributions) between the source domain and the target domain. Defining the distribution of the image target domain as X_i and the distribution of the source domain as Y_i, the migration loss of the image modality is:
$$Loss_{img}=\left\|\frac{1}{m}\sum_{a=1}^{m}f(y_a)-\frac{1}{n}\sum_{b=1}^{n}f(x_b)\right\|_{\mathcal{H}}^2$$

where the distance is measured by mapping the data into the reproducing kernel Hilbert space (RKHS) through f(·), y_a are samples of the source domain Y_i, x_b are samples of the target domain X_i, m is the number of samples of the source-domain data, and n is the number of samples of the target-domain data.
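A minimal sketch of an MMD penalty between source-domain and target-domain feature batches follows; the Gaussian (RBF) kernel and its bandwidth are assumptions, since the text names MMD but not a kernel.

```python
import torch

def rbf_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Gaussian kernel matrix between two batches of features."""
    squared_dist = torch.cdist(x, y) ** 2
    return torch.exp(-squared_dist / (2 * sigma ** 2))

def mmd_loss(source: torch.Tensor, target: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Squared MMD between source (m x d) and target (n x d) feature batches."""
    k_ss = rbf_kernel(source, source, sigma).mean()
    k_tt = rbf_kernel(target, target, sigma).mean()
    k_st = rbf_kernel(source, target, sigma).mean()
    return k_ss + k_tt - 2 * k_st
```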
The text migration process is similar to image migration: the migration source domain is the clustered unlabeled texts and the target domain is the text part of the cross-modal data. The text feature vectors of the source domain and the target domain are respectively extracted with the NLP model Bert released by GOOGLE and then passed through three fully connected layers fc6, fc7 and fc8, where the loss function of the source domain is the SoftMax loss and the migration loss is the MMD loss. Defining the distribution of the text target domain as X_t and the distribution of the source domain as Y_t, the migration loss of the text modality is:

$$Loss_{txt}=\left\|\frac{1}{m}\sum_{a=1}^{m}f(y_a)-\frac{1}{n}\sum_{b=1}^{n}f(x_b)\right\|_{\mathcal{H}}^2$$

where y_a are samples of the source domain Y_t and x_b are samples of the target domain X_t.
the goal of setting the cross-modal knowledge sharing layer is to fully utilize similar semantic information among different modalities, overcome the heterogeneity difference among the modalities, and no matter which modality the data comes from, as long as the data contains the same semantic information, the data should have similar feature vectors, contain different semantic information, and the distance of the feature vectors should be longer. The similarity of vectors is measured using Euclidean distances (fc6-img/fc6-txt and fc7-img/fc7txt), the Euclidean distance of their features should be as small as possible for each pair of similar images and text input. The loss function of cross-modal knowledge is:
$$Loss_{cross\text{-}modal}=\sum_{l\in\{6,7\}}\sum_{p=1}^{n_l}\left\|g\!\left(i_p^{(l)}\right)-g\!\left(t_p^{(l)}\right)\right\|_2^2$$

where l6 and l7 refer to the two fully connected layers connected to the images and texts of the cross-modal dataset, n_l is the number of input image-text pairs, (i_p^(l), t_p^(l)) denotes the p-th image-text pair at layer l, and g(·) maps the images and texts into feature vectors.
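A minimal sketch of this pairwise penalty; summing over fc6 and fc7 follows the text, while averaging over pairs (rather than summing) is an assumption.

```python
import torch

def cross_modal_loss(img_feats: dict, txt_feats: dict) -> torch.Tensor:
    """Mean squared Euclidean distance between paired image and text
    features, summed over the two shared layers; row p of each tensor
    holds the p-th image-text pair."""
    loss = torch.zeros(())
    for layer in ("fc6", "fc7"):
        diff = img_feats[layer] - txt_feats[layer]    # (n_pairs x d)
        loss = loss + (diff ** 2).sum(dim=1).mean()
    return loss
```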
After the two single-modal knowledge migration modules and the cross-modal knowledge sharing module, the model makes full use of unlabeled data, has stronger semantic discrimination ability, and generates a separate representation for each sample in the cross-modal dataset.
The final loss function of the migration module is:
$$Loss_{transfer}=Loss_{img}+Loss_{txt}+Loss_{cross\text{-}modal}$$
a common space learning module:
the cross-modal target domain internal semantic association also provides key semantic information for the construction of a cross-modal common space, and in order to further enhance the semantic correlation of image and text features, a common space learning module is further designed to enhance the correlation. The module is a simple and efficient structure comprising two fully connected layers and a common classification layer. Word embedding (word embedding) vectors of image features, text features and categories are introduced into the module, and since the parameters of fc8 and fc9 are shared by two modalities, the semantic relevance of different modalities can be guaranteed by using supervision information in a cross-modality target domain. Considering the labels of two paired modalities in the target domain, the correlation penalty is:
$$Loss_{common}=\sum_{p=1}^{n}\left[f_s\!\left(i_p,l_p\right)+f_s\!\left(t_p,l_p\right)\right]$$

where f_s is the SoftMax loss function, (i_p, t_p) is the p-th related image-text pair input, l_p is the category label of the pair, and n is the number of image-text pairs.
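A minimal sketch of this correlation loss with a shared classification layer; scoring both modalities with one cross-entropy criterion follows the description, and the summed reduction is an assumption.

```python
import torch
import torch.nn as nn

softmax_loss = nn.CrossEntropyLoss(reduction="sum")   # f_s, summed over pairs

def common_loss(img_logits: torch.Tensor, txt_logits: torch.Tensor,
                labels: torch.Tensor) -> torch.Tensor:
    """img_logits / txt_logits: (n_pairs x n_classes) outputs of the shared
    classification layer; labels: (n_pairs,) category indices l_p."""
    return softmax_loss(img_logits, labels) + softmax_loss(txt_logits, labels)
```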
The migration module and the common space learning module form one unified network structure; the two modules are trained together and promote each other. The overall network loss is therefore:

$$Loss=Loss_{transfer}+Loss_{common}$$
example (b):
the invention comprises a training system, a feature extraction system and a retrieval three parts: the three modules are combined to form the overall structure (figure 1) of the invention, and training data are transmitted into a training system for training and are stored to obtain a training model. The parameters of the feature extraction system (fig. 3) are the same as those of the training system, but structures such as data migration and word embedding of categories are not required, and the test set is transmitted to the feature extraction system to obtain vector representation of each sample of the test set. And during retrieval, calculating the distance between the sample to be retrieved and all samples in other modes, wherein the distance smaller than a specified threshold value is a retrieval result.
A training system:
as shown in fig. 1, the three modules (the unlabeled data clustering module, the data migration module, and the co-expression learning module) are combined to form a training system. The specific training steps are as follows:
1. Image source-domain preprocessing: for each image in the unlabeled image set, extract image features with a pre-trained VGG network, select k_i images as initial cluster centers, assign each image to the nearest cluster center, update the mean of each cluster as the new cluster center, and iterate until no cluster changes or the maximum number of iterations is reached. Group all samples of the same cluster into one class and assign them the same label l_i (l_i ranges from 0 to k_i - 1) to construct the migration dataset. Store the image path and the pseudo label in the same txt file, each line representing one image in the format "image path l_i".
2. Text source-domain preprocessing: for each text in the unlabeled text set, extract its features with Bert and set the number of clusters to k_t; then group similar texts into the same cluster with the same unsupervised clustering method as for images and mark them with the same label l_t (l_t ranges from 0 to k_t - 1). Store the text path and the pseudo label in the same txt file, each line representing one text in the format "text path l_t".
3. Cross-modal dataset preprocessing: the images and texts of the cross-modal dataset correspond one-to-one and are input in pairs. The images are stored in a txt document in the format "image path label", each line representing an image. The texts are first converted into vectors, and the vectors and category labels are stored in an lmdb file.
4. Fix the network learning rate, with a base learning rate of 0.01; iterate for 500 rounds and update the network parameters using a stochastic gradient descent algorithm.
5. Feed the image source domain, the text source domain, and the cross-modal dataset into the model and begin training. After the images and texts pass through the migration module and the common space learning module, the representation R of the images and texts in the common space is obtained.
The test system:
the invention features extraction process block diagram is shown in fig. 3, the system has fewer word embedding vectors and SoftMax loss functions for migration source domains and classes than the training system, and no pair-wise input is required across modal datasets. The feature extraction system firstly extracts feature representation of the image/text, wherein the input mode of the image/text is consistent with the training process, the image/text is sent into a CNN model after learning optimization in the training process, and the response of the last but one full connection layer is taken as the feature representation of the image/text. And after the characteristic representation of the image/text is obtained, cross-modal retrieval is carried out.
And (3) retrieval:
1. transmitting the images and texts of all the test sets into a feature extraction system to obtain feature representations of the images and the texts;
2. Realize "image-to-text search" and "text-to-image search": compute the Euclidean distances between each image and all texts and sort them; the several texts closest to the image are the retrieval results, and vice versa for text queries.
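A minimal sketch of this ranking step in the learned common space (dummy features stand in for the extractor's output):

```python
import numpy as np

img_feats = np.random.rand(100, 512)   # dummy common-space image representations
txt_feats = np.random.rand(200, 512)   # dummy common-space text representations

def retrieve(query: np.ndarray, gallery: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Indices of the top_k gallery samples closest to the query."""
    dist = np.linalg.norm(gallery - query, axis=1)
    return np.argsort(dist)[:top_k]

nearest_texts = retrieve(img_feats[0], txt_feats)   # "image-to-text search"
nearest_images = retrieve(txt_feats[0], img_feats)  # "text-to-image search"
```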

Claims (10)

1. A cross-modal retrieval method based on unmarked data migration comprises the following steps:
inputting a sample to be retrieved into a trained cross-modal data retrieval model to obtain its feature representation;
calculating the Euclidean distances between each sample to be retrieved and all samples of the other modality and then sorting, wherein the samples of the other modality whose distance is smaller than a specified threshold are the retrieval results;
the training process of the cross-modal data retrieval model is as follows:
(1) setting pseudo labels for the unmarked images and the texts respectively by a clustering method;
(2) respectively migrating the knowledge contained in the pseudo-labeled unlabeled images and texts to the image and text parts of the cross-modal dataset to generate separate representations of the images and texts of the cross-modal dataset;
(3) feeding the separate representations of the images and texts into the same network, and learning a common representation of the images and texts in the same semantic space.
2. The cross-modal retrieval method based on unmarked data migration of claim 1, wherein the clustering is an unsupervised clustering method, including the KMeans method.
3. The cross-modal retrieval method based on unmarked data migration of claim 1, wherein the migration comprises single-modal knowledge migration and cross-modal knowledge sharing.
4. The cross-modal retrieval method based on unmarked data migration of claim 3, wherein the migration loss function Loss_transfer is:

$$Loss_{transfer}=Loss_{img}+Loss_{txt}+Loss_{cross\text{-}modal}$$

wherein Loss_img is the migration loss function of the image modality, Loss_txt is the migration loss function of the text modality, and Loss_cross-modal is the loss function of cross-modal knowledge.
5. The cross-modal retrieval method based on unmarked data migration according to claim 4, wherein the knowledge migration of the image modality is realized as follows: first, the pictures of the source domain and the target domain are fed into the network, pass through the first five convolutional layers of the AlexNet network, and then through three added fully connected layers, the loss function of the source domain being the SoftMax loss; the knowledge migration of the image modality is realized by minimizing the MMD loss between the source domain and the target domain;
the migration loss Loss_img of the image modality is:

$$Loss_{img}=\left\|\frac{1}{m}\sum_{a=1}^{m}f(y_a)-\frac{1}{n}\sum_{b=1}^{n}f(x_b)\right\|_{\mathcal{H}}^2$$

wherein the distance is measured by mapping the data into the reproducing kernel Hilbert space (RKHS) through f(·); X_i is the distribution of the image target domain with samples x_b, Y_i is the distribution of the source domain with samples y_a, k is the number of cluster centers, m is the number of samples of the source-domain data, and n is the number of samples of the target-domain data.
6. The cross-modal retrieval method based on unmarked data migration according to claim 4, wherein the knowledge migration of the text modality is realized as follows: the text feature vectors of the source domain and the target domain are respectively extracted with Bert and then passed through three fully connected layers, the loss function of the source domain being the SoftMax loss and the migration loss being the MMD loss;
the migration loss Loss_txt of the text modality is:

$$Loss_{txt}=\left\|\frac{1}{m}\sum_{a=1}^{m}f(y_a)-\frac{1}{n}\sum_{b=1}^{n}f(x_b)\right\|_{\mathcal{H}}^2$$

wherein the distance is measured by mapping the data into the reproducing kernel Hilbert space (RKHS) through f(·); X_t is the distribution of the text target domain with samples x_b, Y_t is the distribution of the source domain with samples y_a, k is the number of cluster centers, m is the number of samples of the source-domain data, and n is the number of samples of the target-domain data.
7. The cross-modal retrieval method based on unmarked data migration of claim 4, wherein the loss function Loss_cross-modal of cross-modal knowledge is:

$$Loss_{cross\text{-}modal}=\sum_{l\in\{6,7\}}\sum_{p=1}^{n_l}\left\|g\!\left(i_p^{(l)}\right)-g\!\left(t_p^{(l)}\right)\right\|_2^2$$

wherein l6 and l7 refer to the two fully connected layers connected to the images and texts of the cross-modal dataset, n_l is the number of input image-text pairs, (i_p^(l), t_p^(l)) denotes the p-th image-text pair, and g(·) maps the images and the texts into feature vectors.
8. The cross-modal retrieval method based on unmarked data migration of claim 1, wherein the loss function Loss_common of common space learning is:

$$Loss_{common}=\sum_{p=1}^{n}\left[f_s\!\left(i_p,l_p\right)+f_s\!\left(t_p,l_p\right)\right]$$

wherein f_s is the SoftMax loss function, (i_p, t_p) is the p-th related image-text pair input, l_p is the category label of the image-text pair, and n is the number of image-text pairs.
9. The cross-modal retrieval method based on unmarked data migration as claimed in claim 1, wherein the threshold is determined as follows: during training, the value of the loss function Loss_cross-modal of cross-modal knowledge is the distance between paired images and texts; according to this loss value, 10-20 initial thresholds are set, the retrieval mAP value is calculated under each threshold, and the threshold that maximizes the mAP value is the retrieval threshold.
10. A cross-modal retrieval system based on unmarked data migration, comprising:
the system comprises a label-free data clustering module, a data migration module and a common space learning module;
and finally, a common space learning module is used for learning and uniformly expressing images and texts obtained by the data migration module, and a similarity measurement basis of cross-modal data is established, so that cross-modal retrieval is realized.
CN201910707010.1A 2019-08-01 2019-08-01 Cross-modal retrieval method and system based on unmarked data migration Active CN110647904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910707010.1A CN110647904B (en) 2019-08-01 2019-08-01 Cross-modal retrieval method and system based on unmarked data migration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910707010.1A CN110647904B (en) 2019-08-01 2019-08-01 Cross-modal retrieval method and system based on unmarked data migration

Publications (2)

Publication Number Publication Date
CN110647904A true CN110647904A (en) 2020-01-03
CN110647904B CN110647904B (en) 2022-09-23

Family

ID=68989992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910707010.1A Active CN110647904B (en) 2019-08-01 2019-08-01 Cross-modal retrieval method and system based on unmarked data migration

Country Status (1)

Country Link
CN (1) CN110647904B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353076A (en) * 2020-02-21 2020-06-30 华为技术有限公司 Method for training cross-modal retrieval model, cross-modal retrieval method and related device
CN111898663A (en) * 2020-07-20 2020-11-06 武汉大学 Cross-modal remote sensing image matching method based on transfer learning
CN112016523A (en) * 2020-09-25 2020-12-01 北京百度网讯科技有限公司 Cross-modal face recognition method, device, equipment and storage medium
CN112669331A (en) * 2020-12-25 2021-04-16 上海交通大学 Target data migration iterative learning method and target data migration iterative learning system
CN112732956A (en) * 2020-12-24 2021-04-30 江苏智水智能科技有限责任公司 Efficient query method based on perception multi-mode big data
CN113515657A (en) * 2021-07-06 2021-10-19 天津大学 Cross-modal multi-view target retrieval method and device
CN114120074A (en) * 2021-11-05 2022-03-01 北京百度网讯科技有限公司 Training method and training device of image recognition model based on semantic enhancement
CN116777896A (en) * 2023-07-07 2023-09-19 浙江大学 Negative migration inhibition method for cross-domain classification and identification of apparent defects
CN117636100A (en) * 2024-01-25 2024-03-01 北京航空航天大学杭州创新研究院 Pre-training task model adjustment processing method and device, electronic equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102881019A (en) * 2012-10-08 2013-01-16 江南大学 Fuzzy clustering image segmenting method with transfer learning function
CN103020122A (en) * 2012-11-16 2013-04-03 哈尔滨工程大学 Transfer learning method based on semi-supervised clustering
CN107220337A (en) * 2017-05-25 2017-09-29 北京大学 A kind of cross-media retrieval method based on mixing migration network
CN107273517A (en) * 2017-06-21 2017-10-20 复旦大学 Picture and text cross-module state search method based on the embedded study of figure
CN108460134A (en) * 2018-03-06 2018-08-28 云南大学 The text subject disaggregated model and sorting technique of transfer learning are integrated based on multi-source domain
CN109784405A (en) * 2019-01-16 2019-05-21 山东建筑大学 Cross-module state search method and system based on pseudo label study and semantic consistency

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102881019A (en) * 2012-10-08 2013-01-16 江南大学 Fuzzy clustering image segmenting method with transfer learning function
CN103020122A (en) * 2012-11-16 2013-04-03 哈尔滨工程大学 Transfer learning method based on semi-supervised clustering
CN107220337A (en) * 2017-05-25 2017-09-29 北京大学 A kind of cross-media retrieval method based on mixing migration network
CN107273517A (en) * 2017-06-21 2017-10-20 复旦大学 Picture and text cross-module state search method based on the embedded study of figure
CN108460134A (en) * 2018-03-06 2018-08-28 云南大学 The text subject disaggregated model and sorting technique of transfer learning are integrated based on multi-source domain
CN109784405A (en) * 2019-01-16 2019-05-21 山东建筑大学 Cross-module state search method and system based on pseudo label study and semantic consistency

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XIN HUANG ET AL.: "Cross-modal Common Representation Learning by Hybrid Transfer Network", ARXIV *
季鼎承 et al.: "Multi-source transfer learning method based on domain and instance balance", Acta Electronica Sinica (电子学报) *
李晓雨 et al.: "Image retrieval algorithm based on transfer learning", Computer Science (计算机科学) *
贾刚 et al.: "Application of hybrid transfer learning in medical image retrieval", Journal of Harbin Engineering University (哈尔滨工程大学学报) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353076B (en) * 2020-02-21 2023-10-10 华为云计算技术有限公司 Method for training cross-modal retrieval model, cross-modal retrieval method and related device
CN111353076A (en) * 2020-02-21 2020-06-30 华为技术有限公司 Method for training cross-modal retrieval model, cross-modal retrieval method and related device
CN111898663B (en) * 2020-07-20 2022-05-13 武汉大学 Cross-modal remote sensing image matching method based on transfer learning
CN111898663A (en) * 2020-07-20 2020-11-06 武汉大学 Cross-modal remote sensing image matching method based on transfer learning
CN112016523A (en) * 2020-09-25 2020-12-01 北京百度网讯科技有限公司 Cross-modal face recognition method, device, equipment and storage medium
CN112016523B (en) * 2020-09-25 2023-08-29 北京百度网讯科技有限公司 Cross-modal face recognition method, device, equipment and storage medium
CN112732956A (en) * 2020-12-24 2021-04-30 江苏智水智能科技有限责任公司 Efficient query method based on perception multi-mode big data
CN112669331B (en) * 2020-12-25 2023-04-18 上海交通大学 Target data migration iterative learning method and target data migration iterative learning system
CN112669331A (en) * 2020-12-25 2021-04-16 上海交通大学 Target data migration iterative learning method and target data migration iterative learning system
CN113515657B (en) * 2021-07-06 2022-06-14 天津大学 Cross-modal multi-view target retrieval method and device
CN113515657A (en) * 2021-07-06 2021-10-19 天津大学 Cross-modal multi-view target retrieval method and device
CN114120074A (en) * 2021-11-05 2022-03-01 北京百度网讯科技有限公司 Training method and training device of image recognition model based on semantic enhancement
CN114120074B (en) * 2021-11-05 2023-12-12 北京百度网讯科技有限公司 Training method and training device for image recognition model based on semantic enhancement
CN116777896A (en) * 2023-07-07 2023-09-19 浙江大学 Negative migration inhibition method for cross-domain classification and identification of apparent defects
CN116777896B (en) * 2023-07-07 2024-03-19 浙江大学 Negative migration inhibition method for cross-domain classification and identification of apparent defects
CN117636100A (en) * 2024-01-25 2024-03-01 北京航空航天大学杭州创新研究院 Pre-training task model adjustment processing method and device, electronic equipment and medium
CN117636100B (en) * 2024-01-25 2024-04-30 北京航空航天大学杭州创新研究院 Pre-training task model adjustment processing method and device, electronic equipment and medium

Also Published As

Publication number Publication date
CN110647904B (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN110647904B (en) Cross-modal retrieval method and system based on unmarked data migration
CN109918532B (en) Image retrieval method, device, equipment and computer readable storage medium
CN106202256B (en) Web image retrieval method based on semantic propagation and mixed multi-instance learning
CN112905822B (en) Deep supervision cross-modal counterwork learning method based on attention mechanism
CN111753101B (en) Knowledge graph representation learning method integrating entity description and type
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN107220337B (en) Cross-media retrieval method based on hybrid migration network
CN111753189A (en) Common characterization learning method for few-sample cross-modal Hash retrieval
CN108038492A (en) A kind of perceptual term vector and sensibility classification method based on deep learning
JP2013519138A (en) Join embedding for item association
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN113177132A (en) Image retrieval method based on depth cross-modal hash of joint semantic matrix
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN113961666B (en) Keyword recognition method, apparatus, device, medium, and computer program product
CN115309930A (en) Cross-modal retrieval method and system based on semantic identification
CN113011172A (en) Text processing method and device, computer equipment and storage medium
CN114329051B (en) Data information identification method, device, apparatus, storage medium and program product
CN116610831A (en) Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system
CN113764034B (en) Method, device, equipment and medium for predicting potential BGC in genome sequence
Tian et al. Automatic image annotation with real-world community contributed data set
CN113742488B (en) Embedded knowledge graph completion method and device based on multitask learning
CN113779287A (en) Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
Su et al. Deep supervised hashing with hard example pairs optimization for image retrieval
Mercy Rajaselvi Beaulah et al. Categorization of images using autoencoder hashing and training of intra bin classifiers for image classification and annotation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant