CN113537304A

CN113537304A - Cross-modal semantic clustering method based on bidirectional CNN

Info

Publication number: CN113537304A
Application number: CN202110718799.8A
Authority: CN
Inventors: 颜成钢; 王超怡; 孙垚棋; 张继勇; 李宗鹏
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2021-06-28
Filing date: 2021-06-28
Publication date: 2021-10-22

Abstract

The invention discloses a cross-modal semantic clustering method based on bidirectional CNN, which comprises the steps of firstly preprocessing data and pre-training text samples of a training set; then constructing a cross-modal retrieval network, training the cross-modal retrieval network through a training set, and calculating a loss function of the network; performing back propagation, and optimizing the connection weight through the selected optimizer and the corresponding parameters; training for multiple rounds to obtain a final network model; and finally, testing the trained model through the test set, and calculating each evaluation index. The method of the invention improves the accuracy and efficiency of cross-modal retrieval by clustering the semantic information. The invention designs the loss of a sample and a clustering center in a target space, the distribution difference loss of categories in different modes and the discrimination loss to help semantic clustering, thereby not only enhancing the identification capability among different categories, but also enhancing the correlation among different modes.

Description

Cross-modal semantic clustering method based on bidirectional CNN

Technical Field

The invention relates to the field of computer vision, in particular to a cross-modal retrieval method based on deep learning.

Background

In the era of new media information explosion, each new media user can publish various multimedia information with different modalities, such as pictures, music, videos or texts, anytime and anywhere. With the rapid development of multimedia information, it is difficult for a user to accurately acquire information desired by the user as the number and types of multimedia information increase, and information acquisition is always accompanied by other information having different degrees of correlation. The data is not only huge in quantity, but also mostly data without labels, and different modalities have 'heterogeneous gaps' among data, so that the main technical problem of cross-modality retrieval is to cross the 'gaps' among the data of the different modalities to extract the precision and accuracy of the retrieved data.

The core of the cross-modal retrieval technology is to measure the similarity between different data. Due to the existence of a "heterogeneous gap," the key to cross-modality retrieval is how to match information of different modalities. So far, most cross-modal retrieval maps samples of different modalities to the same subspace. It is also possible to classify into an unsupervised method and a supervised method according to the usage information. The supervision method uses the label information carried by the sample.

Although the cross-modal search is based on mapping to the same subspace, it has different efficiency and accuracy depending on the choice and arrangement of the loss function. In the invention, the loss of samples and clustering centers in a target space, the distribution difference loss of categories in different modes and the discriminant loss are designed to help semantic clustering, so that the identification capability among different categories is enhanced, and the correlation among different modes is enhanced.

Disclosure of Invention

The invention provides a cross-modal semantic clustering method based on bidirectional CNN. The method can effectively improve the efficiency and accuracy of cross-modal retrieval.

The method is respectively carried out by utilizing two CNN network structures, and one deep CNN is used for extracting the characteristic vector of the picture sample. And a shallow CNN, extracting feature vectors of the text samples by using multi-core convolution with different sizes.

In the traditional cross-modal retrieval by using label information, only the content similarity between the modalities is considered, and the invention provides a cross-modal retrieval mode and a novel semantic clustering mode. According to the fact that samples with the same category should have uniform distribution, in order to enable the samples to correspond to the corresponding category distribution in the target space, the clustering center of the target space is calculated. The loss function is defined as the loss of the sample and the cluster center in the target space, the distribution difference loss of the category in different modes and the discriminant loss.

The method specifically comprises the following steps:

step 1: and preprocessing the data, namely performing pre-training on the text samples of the training set.

The existing data set is divided into a training set and a testing set according to a set proportion, and text samples of the training set are pre-trained.

Step 2: and constructing a cross-modal retrieval network.

The cross-modal retrieval network is performed simultaneously using dual CNNs. And extracting the feature vector of the picture sample through a ResNet-50 network. For a text sample, Word2Vec is used for pre-training Word vectors, and then feature vectors of the text are extracted through TextCNN.

And step 3: the cross-modal search network is trained through a training set.

And 4, step 4: a loss function of the network is calculated. And performing back propagation, and optimizing the connection weight through the selected optimizer and the corresponding parameter. And training for multiple rounds to obtain a final network model.

Finding an effective transition matrix

The sample is projected from a source space to a target space. And after the samples are transferred, clustering in a target space corresponding to the category clustering center of the samples. Defining a loss function as being in a target spaceLoss of sample and cluster centers, loss of distribution variance of classes in different modalities, and discriminant loss. The loss of samples and cluster centers in the target space learns a dimensional invariant matrix to minimize the variance of the class distribution. The category distribution difference between different modalities is narrowed down by minimizing the MMD of the category distribution. And (4) judging the loss, namely label prediction loss, and predicting the class label of the public space seed sample by applying a classifier.

And 5: and (3) testing the network model:

and testing the trained model through the test set, and calculating each evaluation index.

The specific method of the step 2 is as follows:

the cross-modal retrieval network adopts a double-layer CNN structure, and comprises a ResNet-50 network and a text CNN network, namely textCNN. The network structure is performed simultaneously by using double CNNs. And extracting the feature vector of the picture sample through a ResNet-50 network. For a text sample, Word2Vec is used for pre-training Word vectors, and then feature vectors of the text are extracted through TextCNN.

And extracting the information characteristic vector of the picture sample by adopting ResNet-50, and then carrying out common representation learning to obtain common representation of each picture.

Word embedding is a method of converting words in text into numeric vectors. The TextCNN comprises an embedding layer, a convolution layer, a pooling layer and a fully connected softmax layer. For each sentence, a two-dimensional sentence matrix is obtained according to the word vector, then filters with different sizes are selected to carry out convolution operation to obtain a plurality of features, then the maximum pooling is carried out, the features are spliced together, and finally classification is carried out through a softmax full-connected layer. Also, multiple fully connected layers are employed to learn a common representation of text.

With U ═ U₁,u₂,…,u_n],V＝[v₁,v₂,…,v_n]And Y ═ Y₁,y₂,…,y_n]An image representation matrix, a text representation matrix and a label matrix representing all instances, respectively, where n is the number of categories.

A transition matrix is represented that represents the transition matrix,

representing a dimension-invariant matrix for optimizing a loss function, wherein D_SIs the dimension of the source space, D_τIs the dimension of the target space.

The specific method in step 3 is as follows:

the bi-directional CNN network is trained through a training set, using the SGD optimizer, with a momentum of 0.9.

The specific method of the step 4 is as follows:

the loss function is set to a combination of the loss of the sample and the cluster center in the target space, the distribution difference loss of the class in different modalities, and the discriminant loss. In order to reduce the overlapping of different types of distribution in a target space, a matrix with invariable dimension is learned, the variance of the type distribution is reduced, and the loss of semantic information and the difficulty of dimension selection can be effectively reduced.

Firstly, calculating target centers of semantic clustering, and calculating c clustering centers by average class samples

Thus:

wherein X^τIs a set of cluster centers, N₀Is the number of samples, D_τIs the dimension of the target space and n is the number of classes.

It follows that the loss of samples and cluster centers in the target space is as follows:

represents the loss of samples and cluster centers in the target space, X samples in the target space.

Narrowing the category distribution difference between different modes by minimizing MMD of the category distribution by minimizing X^SW and X^τSquare of maximum mean difference between H

Wherein,

is N₀Vector of X1, X^SA sample representing the source domain.

And finally, calculating the prediction loss by using the cross entropy, namely the difference between the obtained result and the true value:

p_*,iis the probability distribution, y, generated for each image or text_iIs his true tag value.

The final common loss function is therefore expressed as:

where θ is a variable of the model to be optimized and λ is a weight coefficient.

And 5: testing the network model;

and inputting the image texts of the test set into the trained model to obtain high-level semantic representation of the predicted image texts, and evaluating the model through the calculated average precision average (mAP). And finally, storing the trained model, testing through a test set pair, and calculating each evaluation index.

The invention has the following beneficial effects:

the method of the invention improves the accuracy and efficiency of cross-modal retrieval by clustering the semantic information. The invention designs the loss of a sample and a clustering center in a target space, the distribution difference loss of categories in different modes and the discrimination loss to help semantic clustering, thereby not only enhancing the identification capability among different categories, but also enhancing the correlation among different modes.

Drawings

FIG. 1 is a schematic structural diagram of a cross-modal search network;

Detailed Description

The objects and effects of the present invention will become more apparent from the following detailed description of the present invention with reference to the accompanying drawings.

Step 2: and constructing a cross-modal retrieval network.

As shown in FIG. 1, the cross-modal search network adopts a two-layer CNN structure, including a ResNet-50 network and a text CNN network, i.e., TextCNN. The network structure is performed simultaneously by using double CNNs. And extracting the feature vector of the picture sample through a ResNet-50 network. For a text sample, Word2Vec is used for pre-training Word vectors, and then feature vectors of the text are extracted through TextCNN.

The ResNet-50 has the main idea that a direct connection channel is added in a network, original input information is allowed to be directly transmitted to a later layer, and the problems of information loss or gradient explosion and incapability of calculating due to too deep network are improved to a certain extent, so that the ResNet-50 is adopted for extracting information characteristic vectors of picture samples, and then common representation learning is carried out to obtain common representation of each picture.

A transition matrix is represented that represents the transition matrix,

And step 3: the bi-directional CNN network is trained through a training set, using the SGD optimizer, with a momentum of 0.9.

And 4, step 4: and constructing a loss function, calculating the error of each forward propagation, and updating the weight of the network through a back propagation algorithm.

First the target centers of the semantic clusters are calculated, the samples with the same concept should have a uniform distribution, but c cluster centers are calculated by averaging the class samples

Thus:

The distribution of samples of the same category but different modalities is not exactly the same, and the MMD can construct a statistical test to determine if the two samples are from different distributions. Therefore, the difference in the category distribution between different modes is reduced by minimizing the MMD of the category distribution, by minimizing X^SW and X^τSquare of maximum mean difference between H

Wherein,

is N₀Vector of X1, X^SA sample representing the source domain.

p_*,iis the probability distribution, y, generated for each image or text_iIs his true tagThe value is obtained.

The final common loss function is therefore expressed as:

And 5: testing the network model;

The dataset implemented in this example of implementation is the Pascal Senntence dataset. The data set consisted of 1000 images, divided into 20 categories, each image having 5 corresponding sentences. We selected 40 image-text sample pairs for training, 5 for testing, and 5 for validation.

The evaluation index adopted in the implementation process is the average precision average (mAP), which is a performance metric of the algorithm of predicting the target position and the category.

Claims

1. A cross-modal semantic clustering method based on bidirectional CNN is characterized by comprising the following steps:

step 1: preprocessing data, namely performing pre-training on a text sample of a training set;

dividing the existing data set into a training set and a testing set according to a set proportion, and pre-training text samples of the training set;

step 2: constructing a cross-modal retrieval network;

performing cross-modal retrieval network simultaneously by adopting double CNNs; extracting a feature vector of the picture sample through a ResNet-50 network; for a text sample, Word2Vec is used for pre-training Word vectors, and then feature vectors of the text are extracted through textCNN;

and step 3: training a cross-modal retrieval network through a training set;

and 4, step 4: calculating a loss function of the network; performing back propagation, and optimizing the connection weight through the selected optimizer and the corresponding parameters; training for multiple rounds to obtain a final network model;

finding an effective transition matrix

Projecting a sample from a source space to a target space; after the samples are transferred, clustering is carried out in a target space corresponding to the category clustering center of the samples; defining a loss function as the loss of a sample and a clustering center in a target space, the distribution difference loss of a category in different modes and the discrimination loss; the loss of the sample and the clustering center in the target space learns a dimensional invariant matrix, so that the variance of the category distribution is minimum; narrowing category distribution differences between different modalities by minimizing MMD of the category distribution; judging whether the loss is label prediction loss, and predicting the class label of the public space seed sample by applying a classifier;

and 5: and (3) testing the network model:

2. The bi-directional CNN-based cross-modal semantic clustering method according to claim 1, wherein the specific method in step 2 is as follows:

the cross-modal retrieval network adopts a double-layer CNN structure and comprises a ResNet-50 network and a text CNN network, namely textCNN; the network structure adopts double CNNs to carry out simultaneously; extracting a feature vector of the picture sample through a ResNet-50 network; for a text sample, Word2Vec is used for pre-training Word vectors, and then feature vectors of the text are extracted through textCNN;

extracting information characteristic vectors of the picture samples by adopting ResNet-50, and then carrying out public representation learning to obtain public representation of each picture;

word embedding is a method of converting words in text into digital vectors; the TextCNN comprises an embedded layer, a convolutional layer, a pooling layer and a fully-connected softmax layer; for each sentence, obtaining a two-dimensional sentence matrix according to the word vector, selecting filters with different sizes to perform convolution operation to obtain a plurality of characteristics, performing maximum pooling, splicing the characteristics, and finally classifying the characteristics through a softmax full-link layer; also, multiple fully connected layers are employed to learn a common representation of text;

with U ═ U₁,u₂,…,u_n],V＝[v₁,v₂,…,v_n]And Y ═ Y₁,y₂,…,y_n]An image representation matrix, a text representation matrix and a label matrix respectively representing all instances, wherein n is the number of categories;

a transition matrix is represented that represents the transition matrix,

3. The bi-directional CNN-based cross-modal semantic clustering method according to claim 2, wherein the specific method in step 3 is as follows:

4. The bi-directional CNN-based cross-modal semantic clustering method according to claim 3, wherein the specific method in step 4 is as follows:

setting a loss function as the combination of the loss of the sample and the clustering center in the target space, the distribution difference loss of the category in different modes and the discrimination loss; in order to reduce the overlapping of different categories in a target space, a matrix with invariable dimension is learned, the variance of category distribution is reduced, and the loss of semantic information and the difficulty of dimension selection can be effectively reduced;

Thus:

wherein X^τIs a set of cluster centers, N₀Is the number of samples, D_τIs the dimension of the target space, n is the number of classes;

representing the loss of samples and cluster centers in target space, X samples in target space;

Wherein,

is N₀Vector of X1, X^SA sample representing a source domain;

p_*,iis the probability distribution, y, generated for each image or text_iIs his true tag value;

the final common loss function is therefore expressed as:

5. The bi-directional CNN-based cross-modal semantic clustering method according to claim 4, wherein the specific method of step 5 is as follows;

inputting the image texts of the test set into the trained model to obtain high-level semantic representation of the predicted image texts, and evaluating the model through the average accuracy average (mAP) obtained through calculation; and finally, storing the trained model, testing through a test set pair, and calculating each evaluation index.