CN104462489B - Cross-modal retrieval method based on a deep model - Google Patents

Cross-modal retrieval method based on a deep model

Info

Publication number
CN104462489B
Authority
CN
China
Prior art keywords
mode
modality
rbm
corr
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410800393.4A
Other languages
Chinese (zh)
Other versions
CN104462489A (en)
Inventor
李睿凡 (Li Ruifan)
张光卫 (Zhang Guangwei)
鲁鹏 (Lu Peng)
芦效峰 (Lu Xiaofeng)
冯方向 (Feng Fangxiang)
李蕾 (Li Lei)
刘咏彬 (Liu Yongbin)
王小捷 (Wang Xiaojie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201410800393.4A priority Critical patent/CN104462489B/en
Publication of CN104462489A publication Critical patent/CN104462489A/en
Application granted granted Critical
Publication of CN104462489B publication Critical patent/CN104462489B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention proposes a cross-modal retrieval method based on a deep model. The method includes: respectively obtaining, by a feature extraction method, a low-level expression vector of a target retrieval modality and of each retrieved modality in a retrieval library; pairing the low-level expression vector of the target retrieval modality with the low-level expression vector of each retrieved modality in the retrieval library, and obtaining the high-level expression vector of the target retrieval modality and the high-level expression vector of each retrieved modality through a deep model formed by stacking corresponding restricted Boltzmann machines (Corr-RBMs); calculating the distance between the target retrieval modality and each retrieved modality in the retrieval library using the high-level expression vector of the target retrieval modality and the high-level expression vector of each retrieved modality; and determining at least one retrieved modality in the retrieval library closest to the target retrieval modality as an object matching the target retrieval modality.

Description

Cross-modal retrieval method based on a deep model
Technical Field
The invention relates to a multimedia retrieval technology, in particular to a cross-modal retrieval method based on a deep model.
Background
The development of the internet in recent years has led to explosive growth in multimodal data. For example, products on an e-commerce website typically contain main text, a short textual description, and related pictures; pictures shared on social networking sites are often accompanied by tag descriptors; and some online news contains picture and video information that is more attractive than a plain text report. The rapid growth of multi-modal data has created a huge demand for cross-modal retrieval.
Unlike traditional single-modality retrieval, cross-modality retrieval focuses on relationships between different modalities. The cross-modal retrieval problem therefore poses two challenges: first, data from different modalities have completely different statistical characteristics, so the association between data of different modalities is difficult to obtain directly; second, the features extracted from different modal data usually have high dimensionality and the data sets are very large, which makes efficient retrieval difficult to achieve.
Disclosure of Invention
In view of this, the present invention provides a deep model-based cross-modal retrieval method, which applies a deep model to the processing of cross-modal data so that distance calculation can be performed efficiently on the processed cross-modal data, thereby obtaining better retrieval results. The technical scheme provided by the invention is as follows:
a cross-modal retrieval method based on a deep model comprises the following steps:
respectively obtaining a low-level expression vector of a target retrieval modality and of each retrieved modality in a retrieval library by using a feature extraction method;
pairing the low-level expression vector of the target retrieval modality with the low-level expression vector of each retrieved modality in the retrieval library, and obtaining the high-level expression vector of the target retrieval modality and the high-level expression vector of each retrieved modality in the retrieval library through a stacked corresponding restricted Boltzmann machine (Corr-RBMs) deep model;
calculating the distance between the target retrieval modality and each retrieved modality in the retrieval library by using the high-level expression vector of the target retrieval modality and the high-level expression vector of each retrieved modality in the retrieval library;
determining at least one retrieved modality in the retrieval library which is closest to the target retrieval modality as an object matching the target retrieval modality.
In summary, the technical solution of the present invention provides a deep model-based cross-modal retrieval method, in which the low-level expressions obtained by feature extraction from the cross-modal raw data are processed by a deep model formed by stacking corresponding restricted Boltzmann machines (Corr-RBMs) to obtain low-dimensional high-level expressions of the cross-modal data in the same representation space; distance calculation is then performed on these low-dimensional high-level expressions, and the retrieval result is determined according to the distances.
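The following sketch illustrates the four-step pipeline just described, assuming hypothetical extract_features() and deep_map() callables standing in for the feature extraction step and the trained stacked Corr-RBMs mapping; it is an illustration of the scheme, not the patented implementation itself.

```python
import numpy as np

def retrieve(query_raw, library_raw, extract_features, deep_map, k=10):
    """Return the indices of the k retrieved modalities closest to the query."""
    # Step 1: low-level expression vectors via feature extraction.
    q_low = extract_features(query_raw)
    lib_low = [extract_features(item) for item in library_raw]
    # Step 2: high-level expression vectors in the shared representation space.
    q_high = deep_map(q_low)
    lib_high = np.stack([deep_map(x) for x in lib_low])
    # Step 3: Euclidean distance from the query to every retrieved modality.
    dists = np.linalg.norm(lib_high - q_high, axis=1)
    # Step 4: the k nearest retrieved modalities form the retrieval result.
    return np.argsort(dists)[:k]
```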
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a deep neural network model of Corr-RBMs according to the present invention;
FIG. 3 is a diagram of a neural network structure of a Corr-RBM model according to the present invention;
FIG. 4 is a block diagram of a restricted Boltzmann machine RBM model;
FIG. 5 is a flowchart of a method for determining Θ based on an objective function F;
FIG. 6 is a flow chart of an embodiment of the present invention.
Detailed Description
In order to solve the problem of cross-modal retrieval, the invention provides a cross-modal retrieval method based on a Corr-RBMs deep model. A flow chart of the technical scheme of the invention is shown in Fig. 1, and the scheme comprises the following steps:
Step 101: respectively obtaining a low-level expression vector of the target retrieval modality and of each retrieved modality in the retrieval library by using a feature extraction method.
In this step, in order to find objects matching the target retrieval modality in the retrieval library, a low-level expression vector is needed for the target retrieval modality and for each retrieved modality in the retrieval library. The low-level expression vectors obtained by feature extraction generally have high dimensionality, and the vector elements of different modalities differ, so they generally cannot be used directly for the retrieval operation.
Step 102: obtaining the high-level expression vector of the target retrieval modality and the high-level expression vector of each retrieved modality in the retrieval library through the stacked corresponding restricted Boltzmann machine (Corr-RBMs) deep model.
In this step, the low-level expression vector of the target retrieval modality is paired with the low-level expression vector of each retrieved modality in the retrieval library, and the high-level expression vector of the target retrieval modality and the high-level expression vector of each retrieved modality are obtained through the stacked corresponding restricted Boltzmann machine (Corr-RBMs) deep model. The high-level expression vectors obtained through the Corr-RBMs deep model have low dimensionality and lie in a consistent representation space, so they can be used for efficient retrieval operations.
Step 103: calculating the distance between the target retrieval modality and each retrieved modality in the retrieval library by using the high-level expression vector of the target retrieval modality and the high-level expression vector of each retrieved modality in the retrieval library.
Specifically, the distance between the target retrieval modality and each retrieved modality in the retrieval library may be measured by the Euclidean distance.
Step 104: determining at least one retrieved modality in the retrieval library which is closest to the target retrieval modality as an object matching the target retrieval modality.
In this step, the distances between each retrieved modality and the target retrieval modality in the retrieval library are sorted, and at least one retrieved modality closest to the target retrieval modality is selected and determined as an object matched with the target retrieval modality.
The invention proposes a method of performing cross-modal retrieval using a stacked Corr-RBMs deep model. Fig. 2 is the neural network structure diagram of the Corr-RBMs deep model obtained by stacking Corr-RBMs; as shown in Fig. 2, the Corr-RBMs deep model is stacked from at least two layers of Corr-RBMs and can obtain the high-level expressions of the raw data of two different modalities from their low-level expressions. The neural network structure of each layer's Corr-RBM model is shown in Fig. 3; the Corr-RBM model is built on the basis of the restricted Boltzmann machine (RBM), whose neural network structure is shown in Fig. 4. The RBM model, the Corr-RBM model and the deep Corr-RBMs model are described in turn below.
(I) The RBM model:
FIG. 4 is a diagram of the neural network structure of an RBM. As shown in FIG. 4, the visible layer V of the RBM includes m neural units v_1~v_m; each neural unit v_i has a bias b_i, and there are no connections among the visible-layer neural units. The hidden layer H includes s neural units h_1~h_s; each neural unit h_j has a bias c_j, and there are no connections among the hidden-layer neural units. The connection weight between visible-layer neural unit v_i and hidden-layer neural unit h_j is w_ij. For ease of understanding, only some of the connection weights between visible-layer and hidden-layer neural units are drawn in FIG. 4.
The RBM has an undirected graph structure with the Logistic activation function δ(x) = 1/(1 + exp(−x)); the joint probability distribution of the neural units of visible layer V and hidden layer H is then

p(v, h) = exp(−E(v, h)) / Z,
wherein Z is a normalization constant (the partition function) and E(v, h) is an energy function defined by the configurations of the visible-layer and hidden-layer neural units of the RBM. E(v, h) has different forms for different configurations of the visible-layer and hidden-layer neural units; that is, once the configurations of the RBM's visible-layer and hidden-layer neural units are determined, a corresponding energy function exists, which is not described in detail herein.
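As one concrete case (an assumption for illustration; the patent leaves E(v, h) configuration-dependent), the sketch below computes the common binary-binary energy E(v, h) = −bᵀv − cᵀh − vᵀWh and the corresponding unnormalized joint probability.

```python
import numpy as np

def energy(v, h, W, b, c):
    """E(v, h) of a binary RBM with visible bias b, hidden bias c, weights W."""
    return -(b @ v) - (c @ h) - (v @ W @ h)

def unnormalized_p(v, h, W, b, c):
    """exp(-E(v, h)); dividing by the partition function Z yields p(v, h)."""
    return np.exp(-energy(v, h, W, b, c))
```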
The bias b_i of each visible-layer neural unit v_i of the RBM, the bias c_j of each hidden-layer neural unit h_j, and the connection weight w_ij between visible-layer neural unit v_i and hidden-layer neural unit h_j can be learned by the contrastive divergence estimation algorithm, which is a mature prior art and is not described in detail herein.
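For reference, a minimal sketch of one step of CD-1, the simplest form of the contrastive divergence algorithm referred to above, is given here under the binary-RBM assumption; the learning rate and sampling details are illustrative choices, not the patent's specification.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.01, rng=None):
    """One CD-1 update from a batch v0 of visible vectors (shape: batch x m)."""
    rng = np.random.default_rng() if rng is None else rng
    # Positive phase: hidden probabilities given the data, then a sample.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step down to the visible layer and back up.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    n = v0.shape[0]
    # <.>_data - <.>_model, with the model term approximated by the reconstruction.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```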
(II) The corresponding restricted Boltzmann machine (Corr-RBM) model:
Fig. 3 is a structural diagram of the Corr-RBM model of the present invention. As shown in Fig. 3, the Corr-RBM model includes a first-modality RBM and a second-modality RBM; the two RBMs include the same number m of visible-layer neural units and the same number s of hidden-layer neural units, and there is a dependency constraint between the hidden layers of the first-modality RBM and the second-modality RBM.
Let Θ denote the parameter set of the Corr-RBM model, i.e., Θ = {W^I, C^I, B^I, W^T, C^T, B^T}, where superscript I denotes the first modality and superscript T denotes the second modality. Specifically, W^I is the set of connection weight parameters between the visible-layer and hidden-layer neural units of the first-modality RBM, C^I is the set of visible-layer neural unit bias parameters of the first-modality RBM, and B^I is the set of hidden-layer neural unit bias parameters of the first-modality RBM; W^T is the set of connection weight parameters between the visible-layer and hidden-layer neural units of the second-modality RBM, C^T is the set of visible-layer neural unit bias parameters of the second-modality RBM, and B^T is the set of hidden-layer neural unit bias parameters of the second-modality RBM.
The parameter set Θ of the Corr-RBM model is determined by the following parameter learning algorithm:
the objective function F is defined according to the following principle: the set of parameters Θ of the Corr-RBM model can minimize the distance of the first modality from the second modality over the shared representation space and minimize the negative log-likelihood function of the first modality and the second modality. The objective function F is F = l D +αl I +βl T I.e. Θ is the set of parameters that minimizes F.
Wherein the content of the first and second substances,
wherein l D Is the distance between the first mode and the second mode in the nesting space, l I Negative log-likelihood function of the first modality, l T A negative log likelihood function for the second modality, α and β being constants, α ∈ (0,1), β ∈ (0,1); f. of I (. Is a first modality RBM visible layer to hidden layer mapping function, f T () is a second modality RBM visible layer to hidden layer mapping function; p is a radical of I (.) a joint probability distribution of the RBM visible layer and hidden layer neural units in the first modality, p T (.) is the joint probability distribution of the second modality RBM visible layer and hidden layer neural units, | | | | is a two-norm map.
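A sketch of the mapping f(·) and of the embedding-distance term l_D follows; the likelihood terms l_I and l_T are not computed explicitly here, since the partition function Z is intractable and those terms are instead handled by contrastive divergence during learning. Variable names follow the patent's notation (W weights, B hidden biases).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f(v, W, B):
    """Visible-to-hidden mapping of one modality's RBM (Logistic activation)."""
    return sigmoid(v @ W + B)

def l_D(v_I, v_T, W_I, B_I, W_T, B_T):
    """Squared Euclidean distance between the two modalities' hidden codes."""
    diff = f(v_I, W_I, B_I) - f(v_T, W_T, B_T)
    return float(diff @ diff)
```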
To determine Θ from the objective function F, an alternating iterative optimization procedure can be used: the two likelihood terms l_I and l_T are first updated with the contrastive divergence estimation algorithm, and l_D is then updated by gradient descent; convergence can be detected on a validation set using cross-modal retrieval. Specifically, Fig. 5 is a flowchart for determining Θ from the objective function F, which includes the following steps:
Step 501: updating the parameters of the first-modality RBM using the contrastive divergence estimation algorithm.
The set W^I of connection weight parameters between the visible-layer and hidden-layer neural units of the first-modality RBM, the visible-layer neural unit biases C^I, and the hidden-layer neural unit biases B^I are denoted collectively by θ^I and updated according to θ^I ← θ^I + τ·α·Δθ^I, where τ is the learning rate, τ ∈ (0,1), and α ∈ (0,1); and

ΔW^I = <v·h^T>_data − <v·h^T>_model, ΔC^I = <v>_data − <v>_model, ΔB^I = <h>_data − <h>_model,

where <·>_data is the mathematical expectation under the empirical distribution and <·>_model is the mathematical expectation under the model distribution.
Step 502: updating the parameters of the second-modality RBM using the contrastive divergence estimation algorithm.
The set W^T of connection weight parameters between the visible-layer and hidden-layer neural units of the second-modality RBM, the visible-layer neural unit biases C^T, and the hidden-layer neural unit biases B^T are denoted collectively by θ^T and updated according to θ^T ← θ^T + τ·β·Δθ^T, where β ∈ (0,1); and

ΔW^T = <v·h^T>_data − <v·h^T>_model, ΔC^T = <v>_data − <v>_model, ΔB^T = <h>_data − <h>_model.
Step 503: updating the distance between the first modality and the second modality in the embedding space using a gradient descent method.
Specifically, the distance l_D between the first modality and the second modality in the embedding space is updated by a gradient descent step θ ← θ − τ·∇_θ l_D, where the gradient of l_D = ||f^I(v^I) − f^T(v^T)||^2 is obtained through the chain rule using δ′(·) = δ(·)(1 − δ(·)), δ(·) being the Logistic activation function δ(x) = 1/(1 + exp(−x)).
Step 504: repeating steps 501–503 until the algorithm converges.
The parameter set Θ of the Corr-RBM model can thus be obtained.
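Putting steps 501–504 together, a compact sketch of the alternating optimization follows, reusing the cd1_step() and sigmoid() helpers sketched earlier; the fixed iteration count stands in for the validation-set convergence test, and the constant factor 2 in the l_D gradient is absorbed into the learning rate τ.

```python
import numpy as np

def train_corr_rbm(V_I, V_T, W_I, B_I, C_I, W_T, B_T, C_T,
                   tau=0.01, alpha=0.5, beta=0.5, n_iters=100):
    """V_I, V_T: paired low-level vectors (batch x m); B hidden, C visible biases."""
    for _ in range(n_iters):
        # Steps 501/502: CD updates of the two modality RBMs (scaled by alpha, beta).
        W_I, C_I, B_I = cd1_step(V_I, W_I, C_I, B_I, lr=tau * alpha)
        W_T, C_T, B_T = cd1_step(V_T, W_T, C_T, B_T, lr=tau * beta)
        # Step 503: gradient-descent step on l_D = ||f_I(v_I) - f_T(v_T)||^2.
        H_I = sigmoid(V_I @ W_I + B_I)
        H_T = sigmoid(V_T @ W_T + B_T)
        diff = H_I - H_T
        g_I = diff * H_I * (1.0 - H_I)       # chain rule via delta'(x)
        g_T = -diff * H_T * (1.0 - H_T)
        n = V_I.shape[0]
        W_I -= tau * (V_I.T @ g_I) / n
        B_I -= tau * g_I.mean(axis=0)
        W_T -= tau * (V_T.T @ g_T) / n
        B_T -= tau * g_T.mean(axis=0)
    # Step 504 corresponds to running this loop until convergence.
    return W_I, B_I, C_I, W_T, B_T, C_T
```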
(III) The deep model of Corr-RBMs:
Fig. 2 is a diagram of the neural network structure of the Corr-RBMs deep model. As shown in Fig. 2, the Corr-RBMs deep model is formed by stacking at least two corresponding restricted Boltzmann machine (Corr-RBM) models. Each Corr-RBMs deep model includes first-modality Corr-RBMs and second-modality Corr-RBMs: the first-modality Corr-RBMs process the low-level expression of the target retrieval modality, and the second-modality Corr-RBMs process the low-level expression of any retrieved modality in the retrieval library.
The input of the first-modality RBM visible-layer neural units of the bottom-layer Corr-RBM is the low-level expression of the first modality obtained by feature extraction from the first-modality raw data, and the input of the second-modality RBM visible-layer neural units of the bottom-layer Corr-RBM is the low-level expression of the second modality obtained by feature extraction from the second-modality raw data; extracting low-level expressions from raw data is prior art and is not described in detail herein.
The first-modality RBM hidden layer of the top-layer Corr-RBM outputs the high-level expression of the first modality, and the second-modality RBM hidden layer of the top-layer Corr-RBM outputs the high-level expression of the second modality.
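A sketch of this stacked mapping for one modality's pathway follows: each layer's hidden activations feed the next layer's visible units, and the top layer's hidden activations are the high-level expression. Here `layers`, a list of per-layer (W, B) pairs, is a hypothetical container for the trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def high_level_expression(v_low, layers):
    """Propagate a low-level vector through one modality's stacked layers."""
    h = v_low
    for W, B in layers:
        h = sigmoid(h @ W + B)   # visible-to-hidden mapping of this layer
    return h                     # top layer's hidden output: the high-level expression
```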
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments.
In this embodiment, it is assumed that the retrieval library includes N retrieved modalities, and the technical solution of the present invention is described by taking the retrieval of objects related to a picture P in the retrieval library as an example. Fig. 6 is a flowchart of this embodiment which, as shown in Fig. 6, includes the following steps:
Step 601: acquiring the low-level expression of each retrieved modality in the retrieval library and the low-level expression of the picture P by a feature extraction method.
In this step, the modality type of the retrieved modalities in the retrieval library is not limited and may be an image modality, a text modality, or a voice modality; mature feature extraction methods exist for the raw data of each modality. For example, the image modality may apply MPEG-7 and Gist descriptors for feature extraction, and the text modality may apply a bag-of-words model. The process of obtaining the low-level expressions of the picture P and of each retrieved modality in the retrieval library is therefore not described in detail here.
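As one concrete illustration of such prior-art extraction (scikit-learn's CountVectorizer is an assumed stand-in, not named in the patent), a bag-of-words representation of the text modality can be produced as follows; image descriptors such as Gist would come from their own extractors.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a dog playing in the park", "a cat sleeping on a sofa"]
vectorizer = CountVectorizer()
low_level_text = vectorizer.fit_transform(docs).toarray()  # one row per document
```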
Step 602: processing the low-level expression of the picture P and the low-level expression of each retrieved modality in the retrieval library through the Corr-RBMs deep model to obtain the high-level expression of the picture P and the high-level expression of each retrieved modality, and then calculating the Euclidean distance between the picture P and each retrieved modality in the retrieval library from these high-level expressions.
In this step, each retrieved modality in the retrieval library is taken together with the picture P as a pair; the low-level expression of the retrieved modality and the low-level expression of the picture P in the pair are processed through the Corr-RBMs deep model to obtain their high-level expressions, and the Euclidean distance between the picture P and the retrieved modality is then calculated according to the Euclidean distance formula.
In general, for two points x and y in an n-dimensional Euclidean space, their distance d is calculated as d(x, y) = sqrt(Σ_{i=1}^{n} (x_i − y_i)^2); the Euclidean distance between the picture P and each retrieved modality is calculated accordingly.
Step 603: sorting the retrieved modalities in the retrieval library from low to high by their Euclidean distance from the picture P, and outputting the top K retrieved modalities as the retrieval result (see the sketch below).
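A minimal sketch of steps 602–603, assuming the high-level expressions have already been computed (for instance with the high_level_expression() helper above):

```python
import numpy as np

def top_k(p_high, library_high, k):
    """p_high: (d,) query code; library_high: (N, d) codes; return top-K matches."""
    dists = np.sqrt(((library_high - p_high) ** 2).sum(axis=1))
    order = np.argsort(dists)          # sort from low to high distance
    return order[:k], dists[order[:k]]
```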
In this embodiment, the low-level expression of the picture modality and the low-level expression of each retrieved modality in the retrieval library are processed through the Corr-RBMs deep model to obtain their respective high-level expressions, and Euclidean distances are then calculated on the high-level expressions, so the retrieval result is obtained efficiently.
The above description covers only preferred embodiments of the present invention and is not intended to limit the scope of the present invention; any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention shall fall within the scope of the present invention.

Claims (3)

1. A cross-modal retrieval method based on a deep model, wherein the deep model is a stacked corresponding restricted Boltzmann machine (Corr-RBMs) deep model, and the method comprises the following steps:
respectively obtaining a low-level expression vector of a target retrieval modality and of each retrieved modality in a retrieval library by using a feature extraction method;
pairing the low-level expression vector of the target retrieval modality with the low-level expression vector of each retrieved modality in the retrieval library, and obtaining the high-level expression vector of the target retrieval modality and the high-level expression vector of each retrieved modality in the retrieval library through the stacked corresponding restricted Boltzmann machine (Corr-RBMs) deep model;
calculating the distance between the target retrieval modality and each retrieved modality in the retrieval library by using the high-level expression vector of the target retrieval modality and the high-level expression vector of each retrieved modality in the retrieval library;
determining at least one retrieved modality in the retrieval library which is closest to the target retrieval modality as an object matched with the target retrieval modality;
wherein:
the deep layer models of the Corr-RBMs are formed by stacking at least two layers of corresponding limited Boltzmann machine Corr-RBMs, each deep layer model of the Corr-RBMs comprises first-mode Corr-RBMs and second-mode Corr-RBMs, the first-mode Corr-RBMs process the low-level expression vectors of the target retrieval modes, and the second-mode Corr-RBMs process the low-level expression vectors of any retrieved mode in the retrieval library;
the Corr-RBM includes a first-mode restricted boltzmann machine RBM and a second-mode restricted boltzmann machine RBM, the first-mode RBM and the second-mode RBM include the same number of visible layer neural units and the same number of hidden layer neural units, and a hidden layer of the first-mode RBM and the second-mode RBM has a dependency constraint therebetween.
2. The method of claim 1, further comprising:
the configuration parameters of the Corr-RBM are Θ = {W^I, C^I, B^I, W^T, C^T, B^T}, where superscript I denotes the first modality and superscript T denotes the second modality; specifically, W^I is the set of connection weight parameters between the visible-layer and hidden-layer neural units of the first-modality RBM, C^I is the set of visible-layer neural unit bias parameters of the first-modality RBM, and B^I is the set of hidden-layer neural unit bias parameters of the first-modality RBM; W^T is the set of connection weight parameters between the visible-layer and hidden-layer neural units of the second-modality RBM, C^T is the set of visible-layer neural unit bias parameters of the second-modality RBM, and B^T is the set of hidden-layer neural unit bias parameters of the second-modality RBM;
the configuration parameter set Θ of the corresponding restricted Boltzmann machine Corr-RBM is the configuration that minimizes the objective function F = l_D + α·l_I + β·l_T,

where l_D = ||f^I(v^I) − f^T(v^T)||^2 is the distance between the first modality and the second modality in the embedding space, l_I = −log p^I(v^I) is the negative log-likelihood function of the first modality, and l_T = −log p^T(v^T) is the negative log-likelihood function of the second modality; α and β are constants, with α ∈ (0,1) and β ∈ (0,1); f^I(·) is the visible-to-hidden mapping function of the first-modality RBM, and f^T(·) is the visible-to-hidden mapping function of the second-modality RBM; p^I(·) is the joint probability distribution of the visible-layer and hidden-layer neural units of the first-modality RBM, and p^T(·) is that of the second-modality RBM; ||·|| is a two-norm mapping; v refers to a visible unit in the RBM, corresponding to a visible variable; m is the number of modal samples.
3. The method according to claim 2, wherein the algorithm for determining Θ from the objective function F is:
A. the set W^I of connection weight parameters between the visible-layer and hidden-layer neural units of the first-modality RBM, the visible-layer neural unit biases C^I, and the hidden-layer neural unit biases B^I are denoted collectively by θ^I and updated according to θ^I ← θ^I + τ·α·Δθ^I, where τ is the learning rate, τ ∈ (0,1), and α ∈ (0,1); and

ΔW^I = <v·h^T>_data − <v·h^T>_model, ΔC^I = <v>_data − <v>_model, ΔB^I = <h>_data − <h>_model,

where <·>_data is the mathematical expectation under the empirical distribution and <·>_model is the mathematical expectation under the model distribution;
B. the set W^T of connection weight parameters between the visible-layer and hidden-layer neural units of the second-modality RBM, the visible-layer neural unit biases C^T, and the hidden-layer neural unit biases B^T are denoted collectively by θ^T and updated according to θ^T ← θ^T + τ·β·Δθ^T, where β ∈ (0,1); and

ΔW^T = <v·h^T>_data − <v·h^T>_model, ΔC^T = <v>_data − <v>_model, ΔB^T = <h>_data − <h>_model;
C. updating l_D using a gradient descent step θ ← θ − τ·∇_θ l_D, the gradient being obtained through the chain rule, wherein δ′(·) = δ(·)(1 − δ(·)) and δ(·) is the Logistic activation function δ(x) = 1/(1 + exp(−x));
and repeating the steps A to C until the algorithm converges.
CN201410800393.4A 2014-12-18 2014-12-18 Cross-modal retrieval method based on a deep model Active CN104462489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410800393.4A CN104462489B (en) 2014-12-18 2014-12-18 Cross-modal retrieval method based on a deep model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410800393.4A CN104462489B (en) 2014-12-18 2014-12-18 Cross-modal retrieval method based on a deep model

Publications (2)

Publication Number Publication Date
CN104462489A CN104462489A (en) 2015-03-25
CN104462489B true CN104462489B (en) 2018-02-23

Family

ID=52908524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410800393.4A Active CN104462489B (en) 2014-12-18 2014-12-18 Cross-modal retrieval method based on a deep model

Country Status (1)

Country Link
CN (1) CN104462489B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866596B (en) * 2015-05-29 2018-09-14 Beijing University of Posts and Telecommunications Video classification method and device based on an autoencoder
US9984772B2 (en) * 2016-04-07 2018-05-29 Siemens Healthcare Gmbh Image analytics question answering
CN106250878B (en) * 2016-08-19 2019-12-31 中山大学 Multi-modal target tracking method combining visible light and infrared images
CN107832351A (en) * 2017-10-21 2018-03-23 Guilin University of Electronic Technology Cross-modal retrieval method based on a deep correlation network
CN109189968B (en) * 2018-08-31 2020-07-03 深圳大学 Cross-modal retrieval method and system
CN109783657B (en) * 2019-01-07 2022-12-30 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method and system based on limited text space

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102693316A (en) * 2012-05-29 2012-09-26 中国科学院自动化研究所 Linear generalization regression model based cross-media retrieval method
CN103488713A (en) * 2013-09-10 2014-01-01 浙江大学 Cross-modal search method capable of directly measuring similarity of different modal data
CN103793507A (en) * 2014-01-26 2014-05-14 北京邮电大学 Method for obtaining bimodal similarity measure with deep structure

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8566260B2 (en) * 2010-09-30 2013-10-22 Nippon Telegraph And Telephone Corporation Structured prediction model learning apparatus, method, program, and recording medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102693316A (en) * 2012-05-29 2012-09-26 中国科学院自动化研究所 Linear generalization regression model based cross-media retrieval method
CN103488713A (en) * 2013-09-10 2014-01-01 浙江大学 Cross-modal search method capable of directly measuring similarity of different modal data
CN103793507A (en) * 2014-01-26 2014-05-14 北京邮电大学 Method for obtaining bimodal similarity measure with deep structure

Also Published As

Publication number Publication date
CN104462489A (en) 2015-03-25

Similar Documents

Publication Publication Date Title
CN104462489B (en) Cross-modal retrieval method based on a deep model
CN108132968B (en) Weak supervision learning method for associated semantic elements in web texts and images
CN110046656B (en) Multi-mode scene recognition method based on deep learning
Caicedo et al. Multimodal representation, indexing, automated annotation and retrieval of image collections via non-negative matrix factorization
CN105022754B (en) Object classification method and device based on social network
CN107683469A (en) A kind of product classification method and device based on deep learning
CN107832663A (en) A kind of multi-modal sentiment analysis method based on quantum theory
TW201324378A (en) Image Classification
Huang et al. Large-scale heterogeneous feature embedding
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
CN112256965A (en) Neural collaborative filtering model recommendation method based on lambdamat
CN111159473A (en) Deep learning and Markov chain based connection recommendation method
CN111026887B (en) Cross-media retrieval method and system
CN108805280B (en) Image retrieval method and device
CN112632984A (en) Graph model mobile application classification method based on description text word frequency
CN104462485B (en) Cross-modal retrieval method based on a corresponding deep belief network
CN111523586A (en) Noise-aware-based full-network supervision target detection method
CN111079011A (en) Deep learning-based information recommendation method
US20230410465A1 (en) Real time salient object detection in images and videos
CN113553975A (en) Pedestrian re-identification method, system, equipment and medium based on sample pair relation distillation
CN107169830B (en) Personalized recommendation method based on clustering PU matrix decomposition
Wu et al. Content embedding regularized matrix factorization for recommender systems
CN111695570A (en) Variational prototype reasoning-based semantic segmentation method under small sample
CN110085292A (en) Drug recommended method, device and computer readable storage medium
CN109885758A (en) A kind of recommended method of the novel random walk based on bigraph (bipartite graph)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Li Ruifan

Inventor after: Zhang Guangwei

Inventor after: Lu Peng

Inventor after: Lu Xiaofeng

Inventor after: Feng Fangxiang

Inventor after: Li Lei

Inventor after: Liu Yongbin

Inventor after: Wang Xiaojie

Inventor before: Li Ruifan

Inventor before: Lu Peng

Inventor before: Lu Xiaofeng

Inventor before: Feng Fangxiang

Inventor before: Li Lei

Inventor before: Liu Yongbin

Inventor before: Wang Xiaojie

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant