CN111985548A

CN111985548A - Label-guided cross-modal deep hashing method

Info

Publication number: CN111985548A
Application number: CN202010802092.0A
Authority: CN
Inventors: 曾焕强; 阮海涛; 朱建清; 陈婧; 曹九稳; 廖昀
Original assignee: Huaqiao University
Current assignee: Huaqiao University
Priority date: 2020-08-11
Filing date: 2020-08-11
Publication date: 2020-11-24

Abstract

The invention discloses a label-guided cross-modal deep hash method, which comprises the following steps: the method comprises the steps of constructing an image, a text and a feature extraction network corresponding to label information, designing a loss function, and carrying out common representation space learning and label space learning on two modes of the input text and the input image so as to eliminate semantic gaps among different modes. The invention especially considers the difficulty in the cross-modal retrieval field, the data of different modes has semantic gaps, namely, the data are expressed as high-level semantic correlation, and the bottom-level characteristics are heterogeneous, and the invention can effectively improve the cross-modal retrieval precision.

Description

Label-guided cross-modal deep hashing method

Technical Field

The invention relates to the field of multi-modal learning and multi-modal fusion, in particular to a label-guided cross-modal deep hash method.

Background

With the explosive growth of multi-modal, multimedia data, cross-modal retrieval has become an urgent issue. The cross-modality retrieval aims at retrieval among data of different modalities (images, texts, voice, video and the like), such as image retrieval texts, text retrieval audios, audio retrieval videos and the like, and has very important application value. The application scenarios of the cross-modal retrieval are very wide, such as highlight retrieval of video websites, personalized semantic short video retrieval and the like.

However, data of different modalities often have the characteristics that the underlying features are heterogeneous and the high-level semantics are related. For example, the semantic of tiger has SIFT, LBP, etc. on the representation of image features, but the representation of text features is dictionary vectors, etc., and it can be seen that the expression types of the same semantic on different modality data are completely different from the description of the features. Therefore, the search across modal search is very challenging.

Disclosure of Invention

The invention provides a label-guided cross-modal deep hash method aiming at the problem that different modal data have semantic gaps in cross-modal retrieval.

The technical scheme adopted by the invention for solving the technical problem is as follows:

a label-guided cross-modal deep hashing method comprises a training process and a retrieval process, and comprises the following steps:

training process S1: inputting image text pairs with the same semantics and class label information of the image text pairs into a label-guided deep hash network model for training until the model converges, thereby obtaining a network model M;

the retrieval process S2: and respectively extracting the feature vectors of the image to be queried and each text in the candidate library by using the network model M obtained by training in the S1, thereby calculating the similarity between the image to be queried and the text in the candidate library, sequencing according to the similarity, and returning a retrieval result.

Preferably, the training process S1 includes the following steps:

step S11): image data v of different classes_iInputting the image characteristics into an image modal characteristic extraction network to extract image characteristics;

step S12): will be compared with the image data v in S11)_iCorresponding text data t_iInputting the text data into a text modal feature extraction network to extract text features;

step S13): from image data v_iAnd text data t_iExtracting a common subspace characteristic B and a characteristic L of a label space from the category information;

step S14): class label information l labeled on image data_i＝[l_i1,...,l_ic]Inputting the character into label character extracting network to extract character H in Hamming space^lAnd features L of the tag space^l；

Step S15): the feature vector B, L, H obtained above^l、L^lAnd respectively sending the label space and the common representation space to carry out joint learning, optimizing the label network loss by adopting an error back propagation algorithm to obtain a convergent label network, so as to guide updating of the image text network, and carrying out iteration to form a label-guided cross-mode deep hash network model M.

Preferably, in step S11), the image feature extraction network is composed of five convolutional layers, a pooling layer and three fully-connected layers, where the number of fully-connected layer hidden units in the last layer is N-K + c, that is, the number is composed of a hash code length and an image data class number, where K denotes the length of the hash code and c denotes the data class number.

Preferably, in step S12), the text feature extraction network is formed by an MS model and a three-layer feedforward neural network, and the whole is T → MS → 4096 → 512 → N, where T denotes an input layer of the text network, MS denotes a multi-scale model, 4096 and 512 denote the number of hidden neurons of the two previous layers of feedforward networks, respectively, and N ═ K + c is formed by a hash code length K and a text data class number c, where K denotes the length of the hash code and c denotes the data class number; the MS model consists of five pooling layers, and the size of the MS model is (1 × 1, 2 × 2, 3 × 3, 5 × 5 and 10 × 10).

Preferably, in step S13), from the image data v_iAnd text data t_iExtracting common subspace characteristics B and characteristics L of the label space from the category information of (A), wherein B is through a similarity matrix S_ijConstructed when S_ijWhen 1 denotes O_iAnd O_jSimilarly; when S is_ijWhen 0 denotes O_iAnd O_jDissimilar, L-shaped as_i＝[l_i1,...,l_ic]When the data sample O_iBelonging to class j, L_ij1, otherwise L_ij＝0。

Preferably, in step S14), the tag feature extraction network is composed of two four-layer feedforward neural networks, where the number of cells in an implicit layer in the feedforward neural network is (L → 4096 → 512 → N), where L denotes an input layer of the tag network, 4096 and 512 denote the number of implicit neurons in the feedforward neural network in the first two layers, N ═ K + c is composed of a hash code length K and a text data class number c, and L ═ K + c is_i＝[l_i1,...,l_ic]Represents a data sample O_iBelongs to class j, then L_ij1, otherwise L _ij0; and for the extracted Hamming spatial feature H^lAnd tag space characteristics L^l，H^lGenerated by the sign function: namely, it is

Wherein the symbol functionThe numerical formula is:

and L is^lThe method is generated by introducing a sigmod function into an active layer, wherein the function formula of the sigmod is as follows:

wherein the content of the first and second substances,

semantic features learned for the hamming space; f. of^v,t,lRepresenting a hash function, θ^v,t,lRepresenting network parameters to be learned;

preferably, in step S15), the label network is updated first, and the target function formula of the label network is:

wherein S is_ijIs a similarity matrix when S_ijWhen 1 denotes O_iAnd O_jSimilarly; when S is_ijWhen 0 denotes O_iAnd O_jAre not similar to each other, wherein

And H^lAnd L^lRespectively representing a predicted hash code and a predicted class label, where₁To preserve the similarity of semantic features, λ₂For ensuring that data instances having the same class label have similar hash codes, λ₃Hash code loss, λ, to optimize learning₄Is to optimize the loss of label space;

and then guiding the updating of the image text network and an objective function formula in the image and text feature learning process:

Alpha, gamma, mu, beta are hyper-parameters, the values are all 1, and H^lAnd L^lRespectively representing a predicted hash code and a predicted class label, where ξ₁To preserve the similarity, ξ, of semantic features₂To ensure that data instances with the same class label have similar hash codes, ξ₃Hash code loss, ξ, to optimize learning₄Optimizing the loss of the label space and optimizing the objective function to obtain the final model M.

Preferably, the step of the retrieving process S2 is as follows:

step S21): respectively extracting the hash code vector of the image to be inquired and the hash code vector of each text in the candidate library in the image retrieval text task by using the basic network model M obtained by training in the S1;

step S22): by Hamming distance dist_H＝(b_i·b_j) Calculating the similarity between the feature vector of the image to be queried and the feature vector of each text in the candidate library, b_i,b_jThe hash codes respectively represent the hash code of the query image i and the hash code of the jth text data in the candidate library, and (phi) represents the inner product operation.

Step S23): and performing descending sorting according to the obtained similarity, and returning a retrieval result.

The invention has the following beneficial effects:

the invention constructs a label-guided cross-modal deep hash network, obtains the depth characteristics of each modality by constructing the deep learning network of each modality, introduces a common subspace-Hamming space to measure heterogeneous data, and fully utilizes the pure semantics of label information to supervise and learn the image and text modalities. The cross-modal deep hash model obtained by training has higher accuracy in text detection by image and text detection by image; in the retrieval process, a network model obtained based on the method training is utilized to perform feature extraction and Hamming distance calculation on an image (text) to be queried and texts (images) in a candidate library, so that the similarity between the image to be queried and the text data in the candidate library is obtained, and cross-mode retrieval is realized. According to the method, the original features are mapped to the Hamming space, so that the calculation speed is greatly increased, the storage capacity is reduced, and the retrieval precision is improved.

The present invention is described in further detail with reference to the drawings and the embodiments, but the label-guided cross-modal deep hash network of the present invention is not limited to the embodiments.

Drawings

FIG. 1 is a block flow diagram of the method of the present invention.

Detailed Description

The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.

Referring to fig. 1, the invention relates to a label-guided cross-modal deep hash method, where the model includes a training process and a retrieval process, and specifically,

the training process S1 includes the following steps:

step S12): will be compared with the image data v in S11)_iCorresponding text data t_iInputting the data into a text modal characteristic extraction network to extract text data characteristics;

The steps of the retrieval process S2 are as follows:

step S22): by Hamming distance dist_H＝(b_i·b_j) Calculating the similarity between the feature vector of the image to be queried and the feature vector of each text in the candidate library, b_i,b_jRespectively representing the hash code of the query image i and the hash code of the jth text data in the candidate library, and (phi) representing inner product operation;

Further, the image feature extraction network in step S11) for adapting to the proposed idea principle is composed of five convolutional layers, a pooling layer and three fully-connected layers, where the number of fully-connected layer hidden units in the last layer is (N ═ K + c), that is, the network is composed of a hash code length and an image data class number, where K denotes the length of the hash code and c denotes the data class number;

further, in step S12), the text feature extraction network is formed by an MS model and a three-layer feedforward neural network, and the whole is (T → MS → 4096 → 512 → N), where T denotes an input layer of the text network, MS denotes a multi-scale model, 4096 and 512 denote the number of implicit neurons of the preceding two-layer feedforward network, respectively, N ═ K + c is formed by the hash code length K and the number of text data classes c, and the MS model is formed by five layers of pooling layers, and has a size of (1 × 1, 2 × 2, 3 × 3, 5 × 5, 10 × 10).

Further, extracting the reference common subspace feature B and the label space feature L from the training data in step S13), wherein B is through the similarity matrix S_ijConstructed when S_ijWhen 1 denotes O_iAnd O_jSimilarly; when S is_ijWhen 0 denotes O_iAnd O_jDissimilar, L-shaped as_i＝[l_i1,...,l_ic]When the data sample O_iBelonging to class j, L_ij1, otherwise L_ij＝0。

Further, in step S14), the tag feature extraction network is composed of two four-layer feedforward neural networks, where the number of cells in an implicit layer in the feedforward neural network is (L → 4096 → 512 → N), where L denotes an input layer of the tag network, 4096 and 512 denote the number of implicit neurons in the feedforward neural network in the first two layers, N ═ K + c is composed of a hash code length K and a text data class number c, and L ═ K + c is calculated by calculating a hash code length K and a text data class number c, where L is a function of the number of the implicit neurons in_i＝[l_i1,...,l_ic]Represents a data sample O_iBelongs to class j, then L_ij1, otherwise L _ij0. And for the extracted Hamming spatial feature H^lAnd tag space characteristics L^l，H^lGenerated by the sign function: namely, it is

Wherein the symbolic function formula is:

wherein

Semantic features learned for the hamming space: wherein f is^v,t,lIs a hash function, θ^v,t,lIs a network parameter to be learned;

further, in step S15), the label network is updated first, and the target function formula of the label network is:

And H^lAnd L^lRespectively representing a predicted hash code and a predicted class label, where₁To preserve the similarity of semantic features, λ₂For ensuring that data instances having the same class label have similar hash codes, λ₃Hash code loss, λ, to optimize learning₄Is to optimize the loss of label space.

Alpha, gamma, mu, beta are hyperparameters which all have a value of 1, and H^lAnd L^lRespectively representing a predicted hash code and a predicted class label, where ξ₁To preserve the similarity, ξ, of semantic features₂To ensure that data instances with the same class label have similar hash codes, ξ₃Hash code loss, ξ, to optimize learning₄Is to optimize the loss of label space.

The above is only one preferred embodiment of the present invention. However, the present invention is not limited to the above embodiments, and any equivalent changes and modifications made according to the present invention, which do not bring out the functional effects beyond the scope of the present invention, belong to the protection scope of the present invention.

Claims

1. A label-guided cross-modal deep hash method is characterized by comprising a training process and a retrieval process, and comprises the following steps:

2. The label-guided cross-modal deep hash method of claim 1, wherein the training procedure S1 comprises the following steps:

step S14): class label information l labeled on image data_i＝[l_i1,...,l_ic]Extracting features in hamming space from input into label feature extraction networkSign H^lAnd features L of the tag space^l；

3. The label-guided cross-modal depth hashing method according to claim 2, wherein in step S11), the image feature extraction network is composed of five convolutional layers, a pooling layer and three fully-connected layers, wherein the number of fully-connected layer hidden units in the last layer is N ═ K + c, that is, the number is composed of a hash code length and an image data class number, where K denotes the length of the hash code and c denotes the data class number.

4. The label-guided cross-mode deep hash method according to claim 2, wherein in step S12), the text feature extraction network is formed by an MS model and a three-layer feedforward neural network, which is T → MS → 4096 → 512 → N as a whole, where T denotes a text network input layer, MS denotes a multi-scale model, 4096 and 512 denote the number of hidden neurons of the two previous layers of feedforward networks, respectively, and N ═ K + c is composed of a hash code length K and a text data class number c, where K denotes a hash code length and c denotes a data class number; the MS model consists of five pooling layers, and the size of the MS model is (1 × 1, 2 × 2, 3 × 3, 5 × 5 and 10 × 10).

5. The label-guided cross-modal depth hashing method according to claim 2, wherein in step S13), the image data v is selected from_iAnd text data t_iExtracting common subspace characteristics B and characteristics L of the label space from the category information of (A), wherein B is through a similarity matrix S_ijConstructed when S_ijWhen 1 denotes O_iAnd O_jSimilarly; when S is_ijWhen 0 denotes O_iAnd O_jDissimilar, L-shaped as_i＝[l_i1,...,l_ic]When the data sample O_iBelonging to class j, L_ij1, otherwise L_ij＝0。

6. The label-guided cross-mode deep hash method according to claim 2, wherein in step S14), the label feature extraction network is composed of two four-layer feedforward neural networks, where the number of hidden layer units in the feedforward neural network is (L → 4096 → 512 → N), where L represents the input layer of the label network, 4096 and 512 represent the number of hidden neurons in the two previous layers of feedforward networks, N ═ K + c is composed of the hash code length K and the number of text data classes c, and L ═ K + c is_i＝[l_i1,...,l_ic]Represents a data sample O_iBelongs to class j, then L_ij1, otherwise L_ij0; and for the extracted Hamming spatial feature H^lAnd tag space characteristics L^l，H^lGenerated by the sign function: namely, it is

Wherein the symbolic function formula is:

wherein the content of the first and second substances,

semantic features learned for the hamming space; f. of^v,t,lRepresenting a hash function, θ^v,t,lIndicating the network parameters that need to be learned.

7. The label-oriented cross-modal deep hash method according to claim 2, wherein in step S15), the label network is updated first, and an objective function formula of the label network is:

Alpha, gamma, mu, beta are hyper-parameters, the values are all 1, and H^lAnd L^lRespectively representing a predicted hash code and a predicted class label, where ξ₁For preserving similarity of semantic featuresSex xi₂To ensure that data instances with the same class label have similar hash codes, ξ₃Hash code loss, ξ, to optimize learning₄Optimizing the loss of the label space and optimizing the objective function to obtain the final model M.

8. The label-guided cross-modal deep hash method of claim 1, wherein the retrieving process S2 comprises the following steps: