CN111753189A

CN111753189A - Common characterization learning method for few-sample cross-modal Hash retrieval

Info

Publication number: CN111753189A
Application number: CN202010476647.7A
Authority: CN
Inventors: 王少英; 赖韩江
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2020-05-29
Filing date: 2020-05-29
Publication date: 2020-10-09
Anticipated expiration: 2040-05-29

Abstract

The invention provides a few-sample cross-modal Hash retrieval common characterization learning method, which designs a Jixiei-Jixiei network. Jiehi-Jie-Bi networks primarily involve two major modules: an informed module and an informed module. The informed module can fully utilize information hidden in data, fuse features of different layers and extract features with more global property; on the basis of the known module, the known module models the correlation of all samples and captures the nonlinear dependence relationship between data so as to better learn the common characterization of different modal data. And finally, establishing a loss function for keeping intra-modal and inter-modal similarity, and training and optimizing the network. The invention can effectively solve the problem of data imbalance under the condition of few samples, and can learn more representative common representation, thereby greatly improving the cross-modal retrieval precision.

Description

Common characterization learning method for few-sample cross-modal Hash retrieval

Technical Field

The invention relates to the field of computer visual information retrieval, in particular to a common characterization learning method for cross-modal hash retrieval of few samples.

Background

Data of various different modes on the internet is increasing day by day, so that cross-mode retrieval is more and more widely applied. The cross-modal retrieval is to take data of one modality as a query item, perform retrieval on a database consisting of data of another modality, and return similar data. Since images and texts are two most common multimedia data, and in addition, the hash method maps high-dimensional data into low-dimensional binary codes, which can improve the retrieval speed and save the storage space, only the hash retrieval across images and texts is discussed.

In recent years, various cross-modal Hash retrieval algorithms based on deep learning are proposed by the academic community, and better retrieval performance is achieved. In the overview of the algorithms, a deep network is designed for data of one modality, training and learning are respectively performed, and data of different modalities are independently mapped to a common space. However, this approach treats each data sample as an independent individual, extracts feature representations from the corresponding data sample only, ignores the correlation information between different data, and when the number of samples in certain categories is small, the information of these small samples may be covered by other information with enough samples, so that when there are insufficient training samples of different modalities, the existing algorithm may make it difficult for the model to learn a better common characterization. Data of different modes have heterogeneity, and when the model can extract powerful common representation of the data of different modes, retrieval accuracy can be improved. Therefore, how to effectively utilize the information contained between different modality data and learn representative common characterization is a problem that needs to be solved by the sample-less cross-modality retrieval task.

The level of cross-modal retrieval accuracy is directly related to the common representation of data, and the fact that people know the knowledge and can never end up is inspired by the ancient sentence, and a more powerful feature representation for learning the knowledge-people network is provided. Depth feature extraction is decomposed into two subtasks: 1) a better representation is learned directly from the sample itself using the knowledge network. Since different network layers can encode different information, for example, the lower layers of the convolutional neural network tend to encode structural information, and the higher layers tend to extract semantic information. In addition, the reception field of the high-level network is larger, the characteristics of the large target can be better extracted, the reception field of the low-level network is smaller, and the characteristics of the small target are mainly extracted. If the features extracted from different layers are fused, not only more global information can be obtained, but also the problem of multiple scales can be solved. Based on the method, a cognitive network with self-perception capability is designed, so that multi-layer abstract features can be captured, and global information of each layer in the deep neural network is fully utilized; 2) the feature representation is further improved by using that network to know other samples as context information. When a human being is learning a new thing, the learning speed is faster if the new thing is similar to the thing learned before. The human thinking mode is fused into the model design, and the network has the relevance perception capability.

The patent specification with the application number of 201910983514.6 discloses a text hash retrieval method based on deep learning, which comprises the steps of extracting semantic codes corresponding to each original vocabulary data of a word embedded matrix by using a bidirectional LSTM model, then connecting a text convolutional neural network in parallel behind the bidirectional LSTM model, adding an attention mechanism, converting an output value of a second full-connection layer into corresponding hash codes by using a sign function, reconstructing category labels by using the hash codes, and finally searching vector data closest to the Hamming distance of the hash codes of the retrieved text in the hash codes of a text library to finish the hash retrieval process of the retrieved text data. However, this patent does not achieve the correlation of the captured data effectively, extracting representative common tokens.

Disclosure of Invention

The invention provides a common characterization learning method for few-sample cross-modal Hash retrieval with high cross-modal retrieval precision.

In order to achieve the technical effects, the technical scheme of the invention is as follows:

a few-sample cross-modal Hash retrieval common characterization learning method comprises the following steps:

s1: dividing a data set and preprocessing original images and text data;

s2: establishing two parallel deep network structures, and respectively extracting characteristic representations of the preprocessed image and the preprocessed text;

s3: establishing a hash layer, and mapping the characteristics of the image and the text to a public space to obtain hash codes of data in different modes;

s4: training, optimizing and testing the model by utilizing the triple loss function;

s5: and establishing a retrieval interface, inputting data of one mode, retrieving in a database formed by data of another mode, and returning the most similar top k sample as a retrieval result.

Further, the specific process of step S1 is:

s11: dividing the data set: forming image text pairs by images and texts with consistent label categories in the cross-modal dataset, and randomly selecting a plurality of categories, wherein training samples corresponding to the categories are few and are called as few sample categories; other classes in the data set are called basic classes, and the training samples of the classes are enough; dividing the data set into a training set and a testing set, wherein the number of samples of a few sample classes in the training set is much smaller than that of samples of a basic class, and the number of samples of different classes in the testing set is relatively balanced;

s12: respectively preprocessing the image and the text: unifying the image size and normalizing the image; for text, if the text form given by the data set is an independent word, converting the text into a word vector by using a word bag model; if the given text is a sentence or an article, namely words have time sequence, extracting the characteristics of the text by using a pre-trained Bert model, and converting the text into a vector.

Further, the specific process of step S2 is:

s21: respectively extracting features of the image and the text by using two parallel deep network frameworks, wherein each deep network framework comprises three parts: the device comprises a primary feature extraction module, a knowing module and a knowing module; because the convolutional neural network can extract rich characteristic information of the image, the VGG19 model with the last full connection layer removed is taken as a primary characteristic extraction module of the image; taking the image obtained after the preprocessing of S12 as the input of a primary feature extraction module, and respectively recording the outputs of the last three rolling blocks in the module after passing through a global mean pooling layer as x₁,x₂,x₃(ii) a For the text, three full-connection layers are used as a primary feature extraction module of the text, and features of different layers extracted by different full-connection layers are respectively marked as y₁,y₂,y₃；

S22: because the features obtained by different network layers contain different information, for example, the lower-layer convolutional layer encodes visual information, and the higher-layer convolutional layer tends to encode semantic information, a knowledge module is designed to fuse the features of different layers, capture global information and obtain more representative features; the self-learning module is composed of a plurality of fully connected layers and non-local blocks, wherein the fully connected layers are used for mapping the feature vectors of different layers obtained at S21 into vectors with the same dimension, and the non-local blocks are used for calculating the correlation among the vectors with the same dimension;

s23: the training samples of few sample classes are few, which is not enough for the deep network to effectively learn better feature representation; when people are learning new things, if the new things are similar to the already known things, the learning is faster; inspired by human thinking mode, a mutual understanding module is designed, other samples are used as context information, and the learned feature representation is further improved(ii) a For an input image q, after S22, a feature I containing the global information of the image itself is obtained_selfAnd then reducing the dimension of the features by using the full connection layer to obtain s^qIts dimension and text feature T_selfThe dimensions of the data are consistent; to integrate information from other samples, one naturally thinks of learning features using all samples as context information; however, the use of all samples wastes time and memory, so that all samples are divided by using the category information, the average characteristic of each type of sample is calculated, the obtained average characteristic is called a category vector, and the category vectors represent all sample characteristics;

s24: computing input image features s^qAnd

the correlation vector of (c):

wherein

Is a nerve tensor comprising m slices, which is updated during the training process, σ (·) is the ReLU activation function; next, a full connection layer is used to map the correlation vector into the correlation coefficient

Wherein W_zAnd b_zIs a parameter of the fully-connected layer;

after the correlation coefficients of the image q and all the category vectors are processed by a softmax function, the normalized correlation coefficient is obtained

And finally, integrating other sample information into the image characteristics in a weighted summation mode to obtain a final characteristic representation:

the obtained feature representation not only contains the feature information of the feature representation, but also contains experience information of other samples; similarly, the obtained text is characterized by the following features after passing through the known module:

further, in step S23, if there are n categories in total, there are n category vectors, and the ith image category vector is recorded as

In the training process, the category vectors are updated at intervals, so that all sample information can be utilized, and the calculation amount is not too large.

Further, in step S22, the non-local block is implemented as follows: for an image, the vector of the same dimension is still marked as x₁,x₂,x₃And x ═ x₁,x₂,x₃) Then the response of the informed module at the ith position is:

wherein G (x)_j) a non-linear mapping function, for the convenience of training, using 1 × 1 convolutional layer, N (x) is a normalization factor;

for calculating x_iAnd x_jThe correlation of (c); then the features are fused to obtain the output of an image knowledge module as

Wherein mean (-) refers to the mean operation, the output of the informed module in the text network is

Further, the specific process of step S3 is:

s31: taking a full connection layer as a hash layer, wherein the dimension of the full connection layer is consistent with the bit number of the hash code; mapping the image and text features obtained in step S24 to a common hash space, to obtain hash codes of the image and text, respectively;

wherein

And

and

parameters of a hash layer of an image network and a hash layer of a text network are respectively set;

s32: and respectively converting the hash codes of the image and the text into binary codes by utilizing a tanh function:

B^I＝tanh(H^I)

B^T＝tanh(H^T)。

further, the specific process of step S4 is:

s41: the objective of the triplet loss function is to make the distance between homogeneous samples smaller than the distance between heterogeneous samples, and the calculation formula is:

wherein e, e⁺,e^-Is a triplet, e and e, consisting of hash codes⁺Belong to the same class, eAnd e^-α is a threshold parameter;

calculating the Euclidean distance;

in the cross-modal retrieval task, not only the semantic similarity between the same-modal data but also the semantic similarity between different modalities are required to be kept, that is, if the labels of the data of different modalities are consistent, the learned feature representations are required to be similar as much as possible, so that the triple loss function between the same modality and the different modalities is designed as follows:

the overall loss function is: l ═ L_intra+l_interDuring training, taking a small batch of image text pairs as input each time, calculating a loss function value L, solving a gradient by using a chain method and reversely transmitting and updating network parameters until the model converges;

s42: and storing the trained model, and testing the model.

Further, the specific process of step S5 is:

establishing an input interface, taking data of one mode as input, and forming a database by a binary code obtained by model coding data of the other mode; and mapping the input data into binary codes by using the model trained in S41, calculating the Hamming distance between the binary codes and the binary codes in the database, sequencing the samples of the database according to the Hamming distance, and returning the first k most similar samples as a retrieval result.

Further, in step S42, the process of testing the model is as follows: taking a test set of one mode as a query set, and taking a training set of the other mode as a database; mapping the samples in the query set and the database into binary codes by using the trained model; sequencing the samples in the database according to the Hamming distance between the binary codes of the samples in the query set and the binary codes of the samples in the database, wherein the samples are more similar to the queried samples when the samples are closer to the front; the mAP is used as an evaluation index of the model, the value range of the mAP is [0,1], the index simultaneously considers the retrieval accuracy and the recall condition, and if the mAP is higher, the retrieval effect of the model is better.

Further, in step S11, the ratio of the number of the less sample classes to the number of the basic classes is about 1: 4.

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

the Chi-Chi network provided by the invention can fully utilize the information of the samples and other samples, and combine the information in each sample and all data, rather than treating each sample as a discrete unit so as to learn more powerful feature representation. Even under the condition that the number of samples in certain categories is small, the technical scheme can effectively capture the data correlation and extract representative common characteristics, so that the cross-modal retrieval accuracy is greatly improved.

Drawings

FIG. 1 is a network framework diagram of the present invention;

FIG. 2 is a flow chart of the steps of the present invention;

FIG. 3 is a graph comparing experimental results of the method of the present invention with those of the prior art.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent;

for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;

it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

As shown in fig. 1, a method for learning a common characterization by using a few-sample cross-modal hash search includes the following steps:

s1: dividing a data set and preprocessing original images and text data;

The specific process of step S1 is:

The specific process of step S2 is:

s21: respectively extracting features of the image and the text by using two parallel deep network frameworks, wherein each deep network framework comprises three parts: the device comprises a primary feature extraction module, a knowing module and a knowing module; due to the fact thatThe convolutional neural network can extract rich characteristic information of the image, so that the VGG19 model with the last full connection layer removed is used as a primary characteristic extraction module of the image; taking the image obtained after the preprocessing of S12 as the input of a primary feature extraction module, and respectively recording the outputs of the last three rolling blocks in the module after passing through a global mean pooling layer as x₁,x₂,x₃(ii) a For the text, three full-connection layers are used as a primary feature extraction module of the text, and features of different layers extracted by different full-connection layers are respectively marked as y₁,y₂,y₃；

s23: the training samples of few sample classes are few, which is not enough for the deep network to effectively learn better feature representation; when people are learning new things, if the new things are similar to the already known things, the learning is faster; inspired by a human thinking mode, a learning module is designed, and other samples are used as context information to further improve the learned feature representation; for an input image q, after S22, a feature I containing the global information of the image itself is obtained_selfAnd then reducing the dimension of the features by using the full connection layer to obtain s^qIts dimension and text feature T_selfThe dimensions of the data are consistent; to integrate information from other samples, one naturally thinks of learning features using all samples as context information; however, using all samples wastes time and memory, so that all samples are divided by using the class information, the average feature of each class of sample is calculated, and the obtained average feature is calledThe category vectors are used for representing all sample characteristics;

s24: computing input image features s^qAnd

the correlation vector of (c):

wherein

Wherein W_zAnd b_zIs a parameter of the fully-connected layer;

further, in step S23, if there are n categories in total, there are n category vectors, and the ith image category vector is recordedIs composed of

The specific process of step S3 is:

wherein

And

and

B^I＝tanh(H^I)

B^T＝tanh(H^T)。

the specific process of step S4 is:

wherein e, e⁺,e^-Is a triplet, e and e, consisting of hash codes⁺Belong to the same class, e and e^-α is a threshold parameter;

calculating the Euclidean distance;

s42: and storing the trained model, and testing the model.

The specific process of step S5 is:

In step S42, the process of testing the model is as follows: taking a test set of one mode as a query set, and taking a training set of the other mode as a database; mapping the samples in the query set and the database into binary codes by using the trained model; sequencing the samples in the database according to the Hamming distance between the binary codes of the samples in the query set and the binary codes of the samples in the database, wherein the samples are more similar to the queried samples when the samples are closer to the front; the mAP is used as an evaluation index of the model, the value range of the mAP is [0,1], the index simultaneously considers the retrieval accuracy and the recall condition, and if the mAP is higher, the retrieval effect of the model is better.

In step S11, the ratio of the number of the less sample classes to the number of the basic classes is about 1: 4.

The scheme of the invention adopts two parallel deep networks (called image network and text network) to respectively process the image and the text. Each deep network contains four parts: the primary feature extractor is used for extracting primary features of the sample, the image is in a VGG19 model, and the text is in a bag-of-words model or a Bert model; the knowing module fuses the characteristics of different layers to obtain more global information; the known module takes other samples as context information, calculates the correlation among the samples to capture the nonlinear dependence information among different samples, and thus obtains more representative feature representation; and the Hash layer maps the obtained image and text features to a common space to learn common characteristics. And finally, training the model by utilizing the triple loss function, and keeping the similarity in the modes and among the modes. During training, small batches of image text pairs are input each time, and Adam is used as an optimizer to perform optimization. And (5) carrying out iterative training for multiple times until the model is converged, and storing the model.

After training the model, the model performance is tested, and the flow is shown in fig. 2. Firstly, mapping samples in training sets of images and texts to be Hash codes by using a trained image network and a trained text network respectively, and then performing binarization operation by using a tanh function to obtain binary codes which are respectively used as an image database and a text database. If the performance of the image retrieval text is to be tested, samples in the image test set are used as query images, after the query images are mapped into binary codes, the Hamming distance between the binary codes and the text database is calculated, the samples corresponding to the text database are sequenced according to the Hamming distance, and the smaller the Hamming distance, the more similar the results are. And finally, calculating the mAP of the image retrieval text according to the sorting condition. The performance process of testing the text retrieval image is similar to the process of image retrieval text, but the text test set is used as a query set, and binary codes corresponding to the image training set are used as a database.

FIG. 3 shows mAP results on the Wikipedia dataset for the present invention and other methods. Image → Text in the table represents the Image retrieval Text task, Text → Image represents the Text retrieval Image task, K represents the number of samples of each class of small sample class in the training set, and 16bits represents that the number of bits of the binary code is 16 bits. From the table, it can be seen that the search performance of the invention on two tasks is higher than that of the other two methods, and the effectiveness of the invention is illustrated.

The same or similar reference numerals correspond to the same or similar parts;

the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;

it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims

1. A few-sample cross-modal Hash retrieval common characterization learning method is characterized by comprising the following steps:

s1: dividing a data set and preprocessing original images and text data;

2. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 1, wherein the specific process of step S1 is:

3. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 2, wherein the specific process of step S2 is:

s23: the training samples of few sample classes are few, which is not enough for the deep network to effectively learn better feature representation; when people are learning new things, if the new things are similar to the already known things, the learning is faster; inspired by a human thinking mode, a learning module is designed, and other samples are used as context information to further improve the learned feature representation; for an input image q, after S22, a feature I containing the global information of the image itself is obtained_selｆAnd then reducing the dimension of the features by using the full connection layer to obtain s^qIts dimension and text feature T_selfThe dimensions of the data are consistent; to integrate information from other samples, one naturally thinks of learning features using all samples as context information; however, the use of all samples wastes time and memory, so that all samples are divided by using the category information, the average characteristic of each type of sample is calculated, the obtained average characteristic is called a category vector, and the category vectors represent all sample characteristics;

s24: computing input image features s^qAnd

the correlation vector of (c):

wherein

Is a nerve tensor containing m slices, which will be in the training processUpdate it, σ () is the ReLU activation function; next, a full connection layer is used to map the correlation vector into the correlation coefficient

Wherein W_zAnd b_zIs a parameter of the fully-connected layer;

4. the method for learning common features for low-sample cross-modal hash search as claimed in claim 3, wherein in step S23, if there are n classes, there are n class vectors, and the ith image class vector is recorded as

5. The method for learning common features for low-sample cross-modal hash retrieval as claimed in claim 4, wherein in step S22, the method is not localThe block is implemented as follows: for an image, the vector of the same dimension is still marked as x₁,x₂，x₃And x ═ x₁,x₂,x₃) Then the response of the informed module at the ith position is:

6. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 5, wherein the specific process of step S3 is:

wherein

And

and

B^I＝tanh(H^I)

B^T＝tanh(H^T)。

7. the method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 6, wherein the specific process of step S4 is:

calculating the Euclidean distance;

s42: and storing the trained model, and testing the model.

8. The method for learning common characteristics of few-sample cross-modal hash retrieval according to claim 7, wherein the specific process of step S5 is:

9. The method for learning common characteristics of few-sample cross-modal hash search as claimed in claim 8, wherein in step S42, the model testing process is: taking a test set of one mode as a query set, and taking a training set of the other mode as a database; mapping the samples in the query set and the database into binary codes by using the trained model; sequencing the samples in the database according to the Hamming distance between the binary codes of the samples in the query set and the binary codes of the samples in the database, wherein the samples are more similar to the queried samples when the samples are closer to the front; the mAP is used as an evaluation index of the model, the value range of the mAP is [0,1], the index simultaneously considers the retrieval accuracy and the recall condition, and if the mAP is higher, the retrieval effect of the model is better.

10. The low-sample cross-modal hash retrieval common characterization learning method of claim 9, wherein in step S11, the ratio of the number of low-sample classes and the number of basic classes is about 1: 4.