CN106649715A - Cross-media retrieval method based on local sensitive hash algorithm and neural network - Google Patents
- Publication number
- CN106649715A CN106649715A CN201611190238.0A CN201611190238A CN106649715A CN 106649715 A CN106649715 A CN 106649715A CN 201611190238 A CN201611190238 A CN 201611190238A CN 106649715 A CN106649715 A CN 106649715A
- Authority
- CN
- China
- Prior art keywords
- hash
- function
- text
- locality-sensitive hashing
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention discloses a cross-media retrieval method based on a locality-sensitive hashing algorithm and a neural network, and relates to the technical field of cross-media retrieval. The method comprises two stages: locality-sensitive hashing and hash-function learning. In the locality-sensitive hashing stage, the image data are mapped by the locality-sensitive hashing algorithm into the hash buckets of m hash tables G=[g1,g2,...,gm]∈R^(k×m), where G is the set of m hash tables, gj is the j-th hash table, and k is the length of the hash code corresponding to a hash bucket. In the hash-function learning stage, neural-network learning maps the text data to the corresponding hash buckets in the m hash tables via hash functions Ht=(Ht^(1),Ht^(2),...,Ht^(m)), where Ht^(j) (1≤j≤m) denotes the learned hash function corresponding to the j-th hash table. After the functions of these two stages are obtained, all images and documents are further coded and indexed for more accurate retrieval.
Description
Technical field
The present invention relates to the technical field of cross-media retrieval, and in particular to a cross-media retrieval method based on a locality-sensitive hashing algorithm and a neural network.
Background technology
In the era of cross-media big data, the massive multi-modal information produced at every moment brings huge cross-media search demands, such as searching for images or videos with text, and vice versa. For example, an entry on Wikipedia usually contains a text description and example images, and retrieving such information requires building cross-media indexing and learning methods. Compared with traditional single-media retrieval, the key problem of cross-media retrieval is how to mine the associations between identical or related semantic objects in different media representations.
At present, numerous solutions have been proposed worldwide for this key problem of cross-media retrieval. Existing cross-media retrieval methods fall broadly into two classes. One class is topic-based: document [1] models the correlation between data of different modalities via topic proportion distributions; document [2] uses Corr-LDA to mine topic-level relations between images and text annotations; document [3] combines Markov random fields with the traditional LDA method and proposes a combined directed/undirected probabilistic graphical model (MDRF) for retrieving images with brief word queries; document [4] proposes a multimedia social-event summarization framework that uses microblog information of multiple media types to obtain visual summaries of social events. The other class is subspace-based: the core of such methods is to seek a subspace that maximizes the correlation between data of different modalities [5]. Sharma et al. propose a general multi-modal feature extraction framework called Generalized Multi-view Analysis (GMA) [6]. The T-V CCA model proposed in document [7] introduces a semantic viewpoint to improve the classification accuracy of the subspace on multi-modal data of different categories. Document [8] proposes the Bi-CMSRM method, which constructs a computational model suitable for cross-media retrieval from the angle of optimizing a bidirectional listwise ranking problem.
[1] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
[2] Blei D M, Jordan M I. Modeling annotated data[C]//Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2003: 127-134.
[3] Jia Y, Salzmann M, Darrell T. Learning cross-modality similarity for multinomial data[C]//Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011: 2407-2414.
[4] Bian J, Yang Y, Zhang H, et al. Multimedia summarization for social events in microblog stream[J]. IEEE Transactions on Multimedia, 2015, 17(2): 216-228.
[5] Hardoon D R, Szedmak S, Shawe-Taylor J. Canonical correlation analysis: An overview with application to learning methods[J]. Neural Computation, 2004, 16(12): 2639-2664.
[6] Sharma A, Kumar A, Daume H, Jacobs D W. Generalized multiview analysis: A discriminative latent space[C]//IEEE Conference on Computer Vision and Pattern Recognition. 2012: 2160-2167.
[7] Gong Y, Ke Q, Isard M, Lazebnik S. A multi-view embedding space for modeling internet images, tags, and their semantics[J]. International Journal of Computer Vision, 2013: 1-24.
[8] Wu F, Lu X, Zhang Z, et al. Cross-media semantic representation via bi-directional learning to rank[C]//Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013: 877-886.
Existing cross-media retrieval methods share a common technical deficiency: they consider only the retrieval method itself and ignore feasible optimizations of the document set. Since the document set contains a large number of documents irrelevant to the query, preprocessing the document set before precise querying, so as to raise the proportion of relevant documents, is of great significance for improving retrieval efficiency.
Content of the invention
Aiming at the technical problems of existing cross-media retrieval methods, the present invention proposes a cross-media retrieval method based on a locality-sensitive hashing algorithm and a neural network that can improve retrieval accuracy.
The concrete technical scheme of the present invention is:
A cross-media retrieval method based on a locality-sensitive hashing algorithm and a neural network, comprising the following steps:
1) establishing an FCMR (Fast Cross-Media Retrieval) model, the training process of which includes a locality-sensitive hashing stage and a hash-function learning stage;
2) using the locality-sensitive hash functions and the hash functions learned by the neural networks, mapping all texts and images to Hamming space and building an index;
3) performing cross-media retrieval queries, including text queries and image queries.
As a preferred technical solution of the present invention, in step 1), the locality-sensitive hashing stage maps the image data to hash buckets using the locality-sensitive hashing algorithm; specifically, the image data are mapped by the LSH algorithm into the hash buckets of m hash tables G=[g1,g2,...,gm]∈R^(k×m), where G is the set of m hash tables, gj denotes the j-th hash table, and k is the length of the hash code corresponding to a hash bucket.
As a preferred technical solution of the present invention, in step 1), the hash-function learning stage uses neural-network learning to map the text data to the hash functions Ht for the hash buckets; specifically, neural-network learning maps the text data to the corresponding hash buckets in the m hash tables via hash functions Ht=(Ht^(1),Ht^(2),...,Ht^(m)), where Ht^(j) (1≤j≤m) denotes the learned hash function corresponding to the j-th hash table.
As a preferred technical solution of the present invention, in step 3):
For the text query, given a query text, the hash functions Ht^(j) map the query text into hash buckets in the m hash tables; the image files stored in these hash buckets constitute the nearest neighbors of the query text. The image samples falling into the same hash buckets as the query text serve as the candidate result set, and precise retrieval is carried out within the nearest-neighbor range of the query text by computing the distances between the query text and the images in the candidate set and ranking them.
For the image query, given a query image, the locality-sensitive hash functions map the query image into hash buckets in the m hash tables; the texts stored in these hash buckets constitute the nearest neighbors of the query image, and precise retrieval is then carried out within the nearest-neighbor range of the query image.
As a preferred technical solution of the present invention, the locality-sensitive hash function is defined as follows:
h_w(x) = sign(w·x)    (1)
where the hyperplane vector w follows a multivariate Gaussian N(0,1) distribution.
A series of hash functions h1,h2,...,hn is defined, and k of them are randomly selected to form the composite function g(x); if h1 to hk are chosen, then g(x)=(h1(x),h2(x),...,hk(x)). Choosing m g(x) functions g1(x),g2(x),...,gm(x), each g(x) function corresponds to one hash table. The m g(x) functions map each image sample pi in the image space into the m hash tables respectively, so each image sample pi appears in some hash bucket of each of the m hash tables. The hash bucket corresponding to pi in the j-th hash table can thus be expressed as:
gj(pi) = <h1(pi), h2(pi), ..., hk(pi)>, (0 < j ≤ m, 0 < i ≤ n)    (2)
As a preferred technical solution of the present invention, the m neural networks NN^(j) (j ∈ {1,2,...,m}) used in the FCMR model have identical structure. Each neural network NN^(j) has L layers: the input layer has dt neurons corresponding to the dimension of the text feature, the output layer has k neurons corresponding to the k bits of the hash code, and the remaining L-2 layers besides the input and output layers are used to learn the hash function. Taking each ti ∈ T as the input of NN^(j), the output of each layer of the neural network can be obtained; layer l+1 takes z^(l) as input and outputs
z^(l+1) = f^(l+1)(W^(l+1) · z^(l))    (3)
where z^(l) and z^(l+1) are the feature representations of layers l and l+1 respectively, W^(l+1) is the transition matrix, and f^(l+1) is the activation function.
The hash function Ht^(j) learned by the neural network takes ti as input and outputs a hash code of length k:
Ht^(j)(ti) = sign(z_i^(L))    (4)
where z_i^(L) is a k-dimensional real-valued vector and the sign function converts it into a hash code.
For a training sample (ti, c_i^(j)), Ht^(j)(ti) should be identical to c_i^(j), that is, z_i^(L) and c_i^(j) should be as equal as possible.
The loss function based on minimum variance is defined as
J_i^(j) = || ẑ_i^(j) − c_i^(j) ||^2    (5)
where ẑ_i^(j) is the prediction of the network (without the sign function) for ti, and c_i^(j) denotes the hash code of the hash bucket of pi in the j-th hash table (0 < j ≤ m).
The training samples (ti, c_i^(j)) (i ∈ {1,...,nt}, j ∈ {1,...,m}) needed to train the neural networks are obtained from the locality-sensitive hashing stage; training the neural network NN^(j) makes it learn the hash function mapping ti to c_i^(j).
As a preferred technical solution of the present invention, the training of the neural networks is divided into pre-training and parameter adjustment, specifically including:
(1) applying a stacked autoencoder (Stacked AutoEncoder, SAE) to the FCMR model to train each layer of the neural network NN^(j) in order and initialize the network parameters;
(2) based on the loss function of formula (5), training the neural network by the BP algorithm to adjust the network parameters;
(3) designing, based on the variance over all text samples, the overall loss function (sum of squared errors, SSE) as shown in formula (6):
J^(j) = Σ_{i=1}^{nt} || ẑ_i^(j) − c_i^(j) ||^2    (6)
Compared with the prior art, the beneficial effects of the invention are:
Based on the locality-sensitive hashing algorithm and neural networks, the present invention eliminates large numbers of documents irrelevant to the query, obtains the nearest neighbors of a query, and finally carries out retrieval tasks more efficiently within the nearest-neighbor range of the query document.
Description of the drawings
Fig. 1 is a schematic diagram of the FCMR framework of the present invention.
Fig. 2 is a schematic diagram of FCMR retrieval of the present invention.
Specific embodiment
The present invention is elaborated below in conjunction with the accompanying drawings.
The specific embodiment of the invention provides a cross-media retrieval method based on a locality-sensitive hashing algorithm and a neural network (Fast Cross-Media Retrieval, FCMR), which mainly comprises the following steps:
1) establishing an FCMR model, the training process of which includes a locality-sensitive hashing stage and a hash-function learning stage;
2) using the locality-sensitive hash functions and the hash functions learned by the neural networks, mapping all texts and images to Hamming space and building an index;
3) performing cross-media retrieval queries, including text queries and image queries.
To make the notation and algorithm statements more concise, the FCMR model proposed below is described using the two modalities of text and image as an example; the model can easily be extended to other modalities. The FCMR model includes two stages: locality-sensitive hashing and hash-function learning.
In the locality-sensitive hashing stage, the image data are mapped to hash buckets using the LSH algorithm; specifically, the image data are mapped by the LSH algorithm into the hash buckets of m hash tables G=[g1,g2,...,gm]∈R^(k×m), where R denotes the real number field, G is the set of m hash tables, gj denotes the j-th hash table, and k is the length of the hash code corresponding to a hash bucket.
In the hash-function learning stage, the text data are mapped to the hash functions Ht for the hash buckets using neural-network learning; specifically, neural-network learning maps the text data to the corresponding hash buckets in the m hash tables via hash functions Ht=(Ht^(1),Ht^(2),...,Ht^(m)), where Ht^(j) (1≤j≤m) denotes the learned hash function corresponding to the j-th hash table.
The matrix description of the text data is T=[t1,t2,...,tnt]∈R^(dt×nt). Correspondingly, P=[p1,p2,...,pnp]∈R^(dp×np) is the matrix description of the image data. Here ti and pi correspond one to one; the number of image-text pairs is n, i.e. nt = np = n, and n replaces nt and np in the following.
If m hash tables are obtained with the LSH algorithm, then m neural networks corresponding to the hash tables need to be designed to map the text data to the hash buckets corresponding to those texts in the m hash tables. With the hash functions learned by the neural networks and the locality-sensitive hash functions used in the LSH stage, an index of the multi-modal data can be built, enabling efficient cross-media retrieval tasks.
After the index is built, given a query text, the hash functions Ht^(j) map the query text into hash buckets in the m hash tables; the image files stored in these hash buckets constitute the nearest neighbors of the query text, and precise retrieval is then carried out within the nearest-neighbor range of the query text. Given a query image, the locality-sensitive hash functions map the query image into hash buckets in the m hash tables; the texts stored in these hash buckets constitute the nearest neighbors of the query image, and precise retrieval is then carried out within the nearest-neighbor range of the query image.
The locality-sensitive hashing algorithm in this embodiment is now described in detail. The algorithm is mainly used to solve the approximate nearest-neighbor search problem for points in high-dimensional space. The locality-sensitive hash function is defined as follows:
h_w(x) = sign(w·x)    (1)
where the hyperplane vector w follows a multivariate Gaussian N(0,1) distribution.
A series of hash functions h1,h2,...,hn is defined, and k of them are randomly selected to form the composite function g(x); if h1 to hk are chosen, then g(x)=(h1(x),h2(x),...,hk(x)). Choosing m g(x) functions g1(x),g2(x),...,gm(x), each g(x) function corresponds to one hash table. The m g(x) functions map each image sample pi in the image space into the m hash tables respectively, so each image sample pi appears in some hash bucket of each of the m hash tables. The hash bucket corresponding to pi in the j-th hash table can thus be expressed as:
gj(pi) = <h1(pi), h2(pi), ..., hk(pi)>, (0 < j ≤ m, 0 < i ≤ n)    (2)
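The bucketing described by formulas (1)-(2) can be sketched in NumPy as follows, assuming the random-hyperplane form h_w(x) = sign(w·x) with w drawn from N(0,1); the function and variable names are illustrative, not from the patent.

```python
import numpy as np

def build_lsh_tables(P, k, m, seed=0):
    """Hash each image feature row of P (n x d) into m hash tables, each
    keyed by a k-bit bucket code g_j(p_i) = <h_1(p_i), ..., h_k(p_i)>."""
    rng = np.random.default_rng(seed)
    n, d = P.shape
    # Each table j uses its own k random hyperplanes w ~ N(0, 1)
    W = rng.standard_normal((m, k, d))
    tables = []
    for j in range(m):
        codes = (P @ W[j].T >= 0).astype(np.uint8)   # n x k sign bits
        buckets = {}
        for i, code in enumerate(map(tuple, codes)):
            buckets.setdefault(code, []).append(i)    # bucket code -> image ids
        tables.append(buckets)
    return W, tables

W, tables = build_lsh_tables(np.random.randn(200, 64), k=8, m=4)
```

Every image lands in exactly one bucket per table, so each table partitions the whole image set, matching the statement that each p_i appears in some bucket of all m tables.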
At query time, given a query text, the functions Ht^(j) map the query text; the image samples falling into the same hash buckets as the query text serve as the candidate result set, and the distances between the query text and the images in the candidate set are computed for precise retrieval and ranking.
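The text-query procedure above can be sketched as follows. The learned hash functions `Ht` and the distance function are stand-ins (a toy fixed hash and Euclidean distance); in the patent, comparable text/image features and the learned networks would fill these roles.

```python
import numpy as np

def query_text(t, Ht, tables, images, dist):
    """Map query text t through each learned hash function Ht[j] to a bucket,
    pool the images stored there as the candidate set, then rank candidates
    by an exact distance for precise retrieval."""
    candidates = set()
    for j, buckets in enumerate(tables):
        candidates.update(buckets.get(tuple(Ht[j](t)), []))
    return sorted(candidates, key=lambda i: dist(t, images[i]))

# Toy run: one table, a fixed "learned" hash, Euclidean distance as a stand-in
images = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
tables = [{(0, 1): [0, 1], (1, 0): [2]}]
Ht = [lambda t: (0, 1)]
dist = lambda t, p: float(np.linalg.norm(t - p))
ranked = query_text(np.array([0.9, 0.9]), Ht, tables, images, dist)
```

Only the two images sharing the query's bucket are ranked; the third image is never compared, which is exactly the pruning effect the patent attributes to the candidate set.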
Through the LSH algorithm, the samples pi (0 < i ≤ n) of the image space are mapped into the m hash tables, and each pi appears together with its similar samples in some hash bucket of each of the m hash tables. Thus each image sample pi is associated with some hash bucket of the j-th hash table (0 < j ≤ m). Meanwhile, as mentioned above, since pi and ti in the model are descriptions of the same semantics in different modalities, image samples correspond one to one with text samples; therefore each text sample ti is also associated with some hash bucket of the j-th hash table (0 < j ≤ m). This yields the training samples (ti, c_i^(j)) (i ∈ {1,...,n}, j ∈ {1,...,m}) used to train the neural networks to learn the function mapping ti to its corresponding hash bucket in the j-th hash table, where c_i^(j) denotes the hash code of the hash bucket of pi in the j-th hash table (0 < j ≤ m).
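The construction of the training samples (ti, c_i^(j)) can be sketched as follows, again assuming the random-hyperplane hash for the image bucket codes; all names are illustrative.

```python
import numpy as np

def make_training_sets(T, P, W):
    """For each table j, the target code c_i^(j) for text t_i is the bucket
    code of its paired image p_i, since (t_i, p_i) share the same semantics."""
    m = W.shape[0]
    # targets[j][i] is the k-bit bucket code of image i in table j
    targets = [(P @ W[j].T >= 0).astype(np.float32) for j in range(m)]
    return [(T, targets[j]) for j in range(m)]   # one (input, target) set per NN^(j)

T = np.random.randn(50, 30)      # 50 texts, dt = 30
P = np.random.randn(50, 64)      # the 50 paired images, dp = 64
W = np.random.randn(4, 8, 64)    # m = 4 tables, k = 8 bits each
train_sets = make_training_sets(T, P, W)
```

Each of the m networks gets its own target codes, because the same image generally falls into different buckets in different tables.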
The hash-function learning stage in this embodiment is now described in detail. As shown in Fig. 1, which gives the neural network structure of the hash-function learning stage, the m neural networks NN^(j) (j ∈ {1,2,...,m}) used in the model have identical structure. Each neural network NN^(j) has L layers: the input layer has dt neurons corresponding to the dimension of the text feature, the output layer has k neurons corresponding to the k bits of the hash code, and the remaining L-2 layers are used to learn the hash function. Taking each ti ∈ T as the input of NN^(j), the output of each layer of the neural network can be obtained; layer l+1 takes z^(l) as input and outputs
z^(l+1) = f^(l+1)(W^(l+1) · z^(l))    (3)
where z^(l) and z^(l+1) are the feature representations of layers l and l+1 respectively, W^(l+1) is the transition matrix, and f^(l+1) is the activation function.
The hash function Ht^(j) learned by the neural network takes ti as input and outputs a hash code of length k:
Ht^(j)(ti) = sign(z_i^(L))    (4)
where z_i^(L) is a k-dimensional real-valued vector and the sign function converts it into a hash code.
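A minimal sketch of the forward pass of formulas (3)-(4): the layer sizes and the tanh activation are illustrative assumptions, not specified by the patent.

```python
import numpy as np

def nn_forward(t, weights, act=np.tanh):
    """Forward pass of one NN^(j): each layer computes
    z^(l+1) = f^(l+1)(W^(l+1) z^(l)); returns the real-valued z^(L)."""
    z = t
    for Wl in weights:
        z = act(Wl @ z)
    return z

def hash_code(t, weights):
    """Sign of z^(L) gives the k-bit hash code (applied at test time)."""
    return (nn_forward(t, weights) >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
dt, hidden, k = 30, 16, 8                      # illustrative layer sizes
weights = [rng.standard_normal((hidden, dt)),
           rng.standard_normal((k, hidden))]
code = hash_code(rng.standard_normal(dt), weights)
```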
Since the sign function is non-differentiable and difficult to optimize, it is omitted during the neural-network learning of the hash function and added back at the test phase.
For a training sample (ti, c_i^(j)), Ht^(j)(ti) should be identical to c_i^(j), that is, z_i^(L) and c_i^(j) should be as equal as possible.
The loss function based on minimum variance is defined as
J_i^(j) = || ẑ_i^(j) − c_i^(j) ||^2    (5)
where ẑ_i^(j) is the prediction of the network (without the sign function) for ti.
The training samples (ti, c_i^(j)) (i ∈ {1,...,nt}, j ∈ {1,...,m}) needed to train the neural networks are obtained from the locality-sensitive hashing stage; training the neural network NN^(j) makes it learn the hash function mapping ti to c_i^(j).
The training of the neural networks is divided into pre-training and parameter adjustment; pre-training initializes the network parameters well and prevents the network from falling into a local optimum. The training specifically includes the following steps:
(1) applying a stacked autoencoder (Stacked AutoEncoder, SAE) to the FCMR model to train each layer of the neural network NN^(j) in order and initialize the network parameters;
(2) based on the loss function of formula (5), training the network by the BP algorithm (back-propagation) to adjust the network parameters;
(3) designing, based on the variance over all text samples, the overall loss function (sum of squared errors, SSE) as shown in formula (6):
J^(j) = Σ_{i=1}^{nt} || ẑ_i^(j) − c_i^(j) ||^2    (6)
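The per-sample loss (5) and the overall SSE loss (6) can be written directly; this sketch only evaluates the losses (the BP parameter updates themselves are standard and omitted).

```python
import numpy as np

def sample_loss(z_hat, c):
    """Formula (5): squared error between the network's real-valued output
    z_hat (sign function omitted during training) and the target code c."""
    return float(np.sum((z_hat - c) ** 2))

def total_loss(Z_hat, C):
    """Formula (6): sum of squared errors over all nt text samples for one
    network NN^(j)."""
    return sum(sample_loss(z, c) for z, c in zip(Z_hat, C))

C = np.array([[1.0, 0.0], [0.0, 1.0]])          # target bucket codes
Z_hat = np.array([[0.9, 0.1], [0.2, 0.8]])      # network outputs before sign
loss = total_loss(Z_hat, C)                     # 0.01+0.01 + 0.04+0.04 = 0.10
```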
So that the function Ht^(j) learned by the neural network NN^(j) can map the text sample data well to its corresponding hash bucket in the j-th hash table, the embodiment of the present invention trains NN^(j) using the traditional back-propagation algorithm, and the final hash function Ht^(j) is obtained at the test phase via formula (4).
The algorithmic procedure of FCMR in this embodiment is as follows:
Fig. 2 shows a schematic diagram of FCMR retrieval when there is only one hash table; for multiple hash tables, it suffices to use the hash functions learned by all the neural networks to map the text to Hamming space.
Obviously, the above embodiments are merely examples intended to clearly illustrate the present invention and are not limitations on the embodiments of the present invention. For those of ordinary skill in the art, other changes in different forms may be made on the basis of the above description; it is unnecessary to be exhaustive over all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the invention shall be included within the protection scope of the claims of the present invention.
Claims (7)
1. A cross-media retrieval method based on a locality-sensitive hashing algorithm and a neural network, characterized in that the cross-media retrieval method comprises the following steps:
1) establishing an FCMR (Fast Cross-Media Retrieval) model, the training process of which includes a locality-sensitive hashing stage and a hash-function learning stage;
2) using the locality-sensitive hash functions and the hash functions learned by the neural networks, mapping all text data and image data to Hamming space and building an index;
3) performing cross-media retrieval queries, including text queries and image queries.
2. The cross-media retrieval method based on a locality-sensitive hashing algorithm and a neural network according to claim 1, characterized in that, in step 1), the locality-sensitive hashing stage maps the image data to hash buckets using the locality-sensitive hashing algorithm; specifically, the image data are mapped by the LSH algorithm into the hash buckets of m hash tables G=[g1,g2,...,gm]∈R^(k×m), where R denotes the real number field, G is the set of m hash tables, gj denotes the j-th hash table, and k is the length of the hash code corresponding to a hash bucket.
3. The cross-media retrieval method based on a locality-sensitive hashing algorithm and a neural network according to claim 2, characterized in that, in step 1), the hash-function learning stage uses neural-network learning to map the text data to the hash functions Ht for the hash buckets; specifically, neural-network learning maps the text data to the corresponding hash buckets in the m hash tables via hash functions Ht=(Ht^(1),Ht^(2),...,Ht^(m)), where Ht^(j) (1≤j≤m) denotes the learned hash function corresponding to the j-th hash table.
4. The cross-media retrieval method based on a locality-sensitive hashing algorithm and a neural network according to claim 3, characterized in that, in step 3),
for the text query, given a query text, the hash functions Ht^(j) map the query text into hash buckets in the m hash tables; the image files stored in these hash buckets constitute the nearest neighbors of the query text; the image samples falling into the same hash buckets as the query text serve as the candidate result set, and precise retrieval is carried out within the nearest-neighbor range of the query text by computing the distances between the query text and the images in the candidate set and ranking them;
for the image query, given a query image, the locality-sensitive hash functions map the query image into hash buckets in the m hash tables; the texts stored in these hash buckets constitute the nearest neighbors of the query image, and precise retrieval is then carried out within the nearest-neighbor range of the query image.
5. The cross-media retrieval method based on a locality-sensitive hashing algorithm and a neural network according to claim 3, characterized in that the locality-sensitive hash function is defined as follows:
h_w(x) = sign(w·x)    (1)
where the hyperplane vector w follows a multivariate Gaussian N(0,1) distribution;
a series of hash functions h1,h2,...,hn is defined, and k of them are randomly selected to form the composite function g(x); if h1 to hk are chosen, then g(x)=(h1(x),h2(x),...,hk(x)); choosing m g(x) functions g1(x),g2(x),...,gm(x), each g(x) function corresponds to one hash table; the m g(x) functions map each image sample pi in the image space into the m hash tables respectively, so each image sample pi appears in some hash bucket of each of the m hash tables; the hash bucket corresponding to pi in the j-th hash table can thus be expressed as:
gj(pi) = <h1(pi), h2(pi), ..., hk(pi)>, (0 < j ≤ m, 0 < i ≤ n)    (2).
6. The cross-media retrieval method based on a locality-sensitive hashing algorithm and a neural network according to claim 5, characterized in that the m neural networks NN^(j) (j ∈ {1,2,...,m}) used in the FCMR model have identical structure; each neural network NN^(j) has L layers, where the input layer has dt neurons corresponding to the dimension of the text feature, the output layer has k neurons corresponding to the k bits of the hash code, and the remaining L-2 layers besides the input and output layers are used to learn the hash function; taking each ti ∈ T as the input of NN^(j), the output of each layer of the neural network can be obtained; layer l+1 takes z^(l) as input and outputs
z^(l+1) = f^(l+1)(W^(l+1) · z^(l))    (3)
where z^(l) and z^(l+1) are the feature representations of layers l and l+1 respectively, W^(l+1) is the transition matrix, and f^(l+1) is the activation function;
the hash function Ht^(j) learned by the neural network takes ti as input and outputs a hash code of length k:
Ht^(j)(ti) = sign(z_i^(L))    (4)
where z_i^(L) is a k-dimensional real-valued vector and the sign function converts it into a hash code;
the loss function based on minimum variance is defined as
J_i^(j) = || ẑ_i^(j) − c_i^(j) ||^2    (5)
where ẑ_i^(j) is the prediction of the network (without the sign function) for ti, and c_i^(j) denotes the hash code of the hash bucket of pi in the j-th hash table (0 < j ≤ m);
the training samples (ti, c_i^(j)) (i ∈ {1,...,nt}, j ∈ {1,...,m}) needed to train the neural networks are obtained from the locality-sensitive hashing stage; training the neural network NN^(j) makes it learn the hash function mapping ti to c_i^(j).
7. The cross-media retrieval method based on a locality-sensitive hashing algorithm and a neural network according to claim 6, characterized in that the training of the neural networks is divided into pre-training and parameter adjustment, specifically including:
(1) applying a stacked autoencoder (Stacked AutoEncoder, SAE) to the FCMR model to train each layer of NN^(j) in order and initialize the network parameters;
(2) based on the loss function of formula (5), training the network by the BP algorithm to adjust the network parameters;
(3) designing, based on the variance over all text samples, the overall loss function (sum of squared errors, SSE) as shown in formula (6):
J^(j) = Σ_{i=1}^{nt} || ẑ_i^(j) − c_i^(j) ||^2    (6).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611190238.0A CN106649715B (en) | 2016-12-21 | 2016-12-21 | A kind of cross-media retrieval method based on local sensitivity hash algorithm and neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611190238.0A CN106649715B (en) | 2016-12-21 | 2016-12-21 | A kind of cross-media retrieval method based on local sensitivity hash algorithm and neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106649715A true CN106649715A (en) | 2017-05-10 |
CN106649715B CN106649715B (en) | 2019-08-09 |
Family
ID=58834417
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611190238.0A Active CN106649715B (en) | 2016-12-21 | 2016-12-21 | A kind of cross-media retrieval method based on local sensitivity hash algorithm and neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649715B (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107256271A (en) * | 2017-06-27 | 2017-10-17 | 鲁东大学 | Cross-modal hash retrieval method based on mapping dictionary learning
CN107273529A (en) * | 2017-06-28 | 2017-10-20 | 武汉图信科技有限公司 | Efficient hierarchical index construction and retrieval method based on hash functions
CN107729557A (en) * | 2017-11-08 | 2018-02-23 | 北京大学 | Inventory information classification and retrieval method and device
CN107729290A (en) * | 2017-09-21 | 2018-02-23 | 北京大学深圳研究生院 | Representation learning method for very large graphs optimized with locality-sensitive hashing
CN108229693A (en) * | 2018-02-08 | 2018-06-29 | 徐传运 | Machine learning recognition device and method based on contrastive learning
CN108229588A (en) * | 2018-02-08 | 2018-06-29 | 重庆师范大学 | Machine learning recognition method based on deep learning
CN108280207A (en) * | 2018-01-30 | 2018-07-13 | 深圳市茁壮网络股份有限公司 | Method for constructing a perfect hash
CN108319686A (en) * | 2018-02-01 | 2018-07-24 | 北京大学深圳研究生院 | Adversarial cross-media retrieval method based on a restricted text space
CN108345943A (en) * | 2018-02-08 | 2018-07-31 | 重庆理工大学 | Machine learning recognition method based on embedded coding and contrastive learning
CN108345942A (en) * | 2018-02-08 | 2018-07-31 | 重庆理工大学 | Machine learning recognition method based on embedded coding learning
CN108629049A (en) * | 2018-05-14 | 2018-10-09 | 芜湖岭上信息科技有限公司 | Real-time image storage and lookup device and method based on a hash algorithm
CN109947936A (en) * | 2018-08-21 | 2019-06-28 | 北京大学 | Method for dynamically detecting spam based on machine learning
CN110083762A (en) * | 2019-03-15 | 2019-08-02 | 平安科技(深圳)有限公司 | Housing listing search method, apparatus, device, and computer-readable storage medium
CN110674333A (en) * | 2019-08-02 | 2020-01-10 | 杭州电子科技大学 | Large-scale high-speed image retrieval method based on multi-view enhanced deep hashing
CN110998607A (en) * | 2017-08-08 | 2020-04-10 | 三星电子株式会社 | System and method for neural networks |
CN112699676A (en) * | 2020-12-31 | 2021-04-23 | 中国农业银行股份有限公司 | Address similarity relation generation method and device |
CN112784838A (en) * | 2021-01-28 | 2021-05-11 | 佛山市南海区广工大数控装备协同创新研究院 | Hamming OCR recognition method based on locality sensitive hashing network |
CN113393107A (en) * | 2021-06-07 | 2021-09-14 | 东方电气集团科学技术研究院有限公司 | Incremental calculation method for state parameter reference value of power generation equipment |
CN113449849A (en) * | 2021-06-29 | 2021-09-28 | 桂林电子科技大学 | Learned text hashing method based on an autoencoder
CN113515450A (en) * | 2021-05-20 | 2021-10-19 | 广东工业大学 | Environment anomaly detection method and system |
CN114781642A (en) * | 2022-06-17 | 2022-07-22 | 之江实验室 | Cross-media corresponding knowledge generation method and device |
WO2023035362A1 (en) * | 2021-09-07 | 2023-03-16 | 上海观安信息技术股份有限公司 | Polluted sample data detecting method and apparatus for model training |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104199827A (en) * | 2014-07-24 | 2014-12-10 | 北京大学 | Locality-sensitive-hashing-based high-dimensional indexing method for large-scale multimedia data |
CN104346440A (en) * | 2014-10-10 | 2015-02-11 | 浙江大学 | Neural-network-based cross-media Hash indexing method |
CN106202413A (en) * | 2016-07-11 | 2016-12-07 | 北京大学深圳研究生院 | Cross-media retrieval method
CN106227851A (en) * | 2016-07-29 | 2016-12-14 | 汤平 | End-to-end image retrieval method based on deep convolutional neural networks with hierarchical deep search
US9760588B2 (en) * | 2007-02-20 | 2017-09-12 | Invention Science Fund I, Llc | Cross-media storage coordination |
- 2016-12-21: CN application CN201611190238.0A filed; granted as CN106649715B (status: Active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9760588B2 (en) * | 2007-02-20 | 2017-09-12 | Invention Science Fund I, Llc | Cross-media storage coordination |
CN104199827A (en) * | 2014-07-24 | 2014-12-10 | 北京大学 | Locality-sensitive-hashing-based high-dimensional indexing method for large-scale multimedia data |
CN104346440A (en) * | 2014-10-10 | 2015-02-11 | 浙江大学 | Neural-network-based cross-media Hash indexing method |
CN106202413A (en) * | 2016-07-11 | 2016-12-07 | 北京大学深圳研究生院 | Cross-media retrieval method
CN106227851A (en) * | 2016-07-29 | 2016-12-14 | 汤平 | End-to-end image retrieval method based on deep convolutional neural networks with hierarchical deep search
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107256271A (en) * | 2017-06-27 | 2017-10-17 | 鲁东大学 | Cross-modal hash retrieval method based on mapping dictionary learning
CN107256271B (en) * | 2017-06-27 | 2020-04-03 | 鲁东大学 | Cross-modal Hash retrieval method based on mapping dictionary learning |
CN107273529B (en) * | 2017-06-28 | 2020-02-07 | 武汉图信科技有限公司 | Efficient hierarchical index construction and retrieval method based on hash function |
CN107273529A (en) * | 2017-06-28 | 2017-10-20 | 武汉图信科技有限公司 | Efficient hierarchical index construction and retrieval method based on hash functions
CN110998607B (en) * | 2017-08-08 | 2024-03-08 | 三星电子株式会社 | System and method for neural networks |
CN110998607A (en) * | 2017-08-08 | 2020-04-10 | 三星电子株式会社 | System and method for neural networks |
CN107729290A (en) * | 2017-09-21 | 2018-02-23 | 北京大学深圳研究生院 | Representation learning method for very large graphs optimized with locality-sensitive hashing
CN107729557A (en) * | 2017-11-08 | 2018-02-23 | 北京大学 | Inventory information classification and retrieval method and device
CN108280207A (en) * | 2018-01-30 | 2018-07-13 | 深圳市茁壮网络股份有限公司 | Method for constructing a perfect hash
CN108319686B (en) * | 2018-02-01 | 2021-07-30 | 北京大学深圳研究生院 | Adversarial cross-media retrieval method based on a restricted text space
CN108319686A (en) * | 2018-02-01 | 2018-07-24 | 北京大学深圳研究生院 | Adversarial cross-media retrieval method based on a restricted text space
CN108345942B (en) * | 2018-02-08 | 2020-04-07 | 重庆理工大学 | Machine learning identification method based on embedded code learning |
CN108229588A (en) * | 2018-02-08 | 2018-06-29 | 重庆师范大学 | Machine learning recognition method based on deep learning
CN108229693A (en) * | 2018-02-08 | 2018-06-29 | 徐传运 | Machine learning recognition device and method based on contrastive learning
CN108345943A (en) * | 2018-02-08 | 2018-07-31 | 重庆理工大学 | Machine learning recognition method based on embedded coding and contrastive learning
CN108229693B (en) * | 2018-02-08 | 2020-04-07 | 徐传运 | Machine learning recognition device and method based on contrastive learning
CN108229588B (en) * | 2018-02-08 | 2020-04-07 | 重庆师范大学 | Machine learning recognition method based on deep learning
CN108345943B (en) * | 2018-02-08 | 2020-04-07 | 重庆理工大学 | Machine learning recognition method based on embedded coding and contrastive learning
CN108345942A (en) * | 2018-02-08 | 2018-07-31 | 重庆理工大学 | Machine learning recognition method based on embedded coding learning
CN108629049A (en) * | 2018-05-14 | 2018-10-09 | 芜湖岭上信息科技有限公司 | Real-time image storage and lookup device and method based on a hash algorithm
CN109947936A (en) * | 2018-08-21 | 2019-06-28 | 北京大学 | Method for dynamically detecting spam based on machine learning
CN109947936B (en) * | 2018-08-21 | 2021-03-02 | 北京大学 | Method for dynamically detecting spam based on machine learning
CN110083762B (en) * | 2019-03-15 | 2023-01-24 | 平安科技(深圳)有限公司 | Housing listing search method, apparatus, device, and computer-readable storage medium
CN110083762A (en) * | 2019-03-15 | 2019-08-02 | 平安科技(深圳)有限公司 | Housing listing search method, apparatus, device, and computer-readable storage medium
CN110674333A (en) * | 2019-08-02 | 2020-01-10 | 杭州电子科技大学 | Large-scale high-speed image retrieval method based on multi-view enhanced deep hashing
CN110674333B (en) * | 2019-08-02 | 2022-04-01 | 杭州电子科技大学 | Large-scale high-speed image retrieval method based on multi-view enhanced deep hashing
CN112699676A (en) * | 2020-12-31 | 2021-04-23 | 中国农业银行股份有限公司 | Address similarity relation generation method and device |
CN112699676B (en) * | 2020-12-31 | 2024-04-12 | 中国农业银行股份有限公司 | Address similarity relation generation method and device |
CN112784838A (en) * | 2021-01-28 | 2021-05-11 | 佛山市南海区广工大数控装备协同创新研究院 | Hamming OCR recognition method based on locality sensitive hashing network |
CN113515450A (en) * | 2021-05-20 | 2021-10-19 | 广东工业大学 | Environment anomaly detection method and system |
CN113393107B (en) * | 2021-06-07 | 2022-08-12 | 东方电气集团科学技术研究院有限公司 | Incremental calculation method for state parameter reference value of power generation equipment |
CN113393107A (en) * | 2021-06-07 | 2021-09-14 | 东方电气集团科学技术研究院有限公司 | Incremental calculation method for state parameter reference value of power generation equipment |
CN113449849B (en) * | 2021-06-29 | 2022-05-27 | 桂林电子科技大学 | Learned text hashing method based on an autoencoder
CN113449849A (en) * | 2021-06-29 | 2021-09-28 | 桂林电子科技大学 | Learned text hashing method based on an autoencoder
WO2023035362A1 (en) * | 2021-09-07 | 2023-03-16 | 上海观安信息技术股份有限公司 | Polluted sample data detecting method and apparatus for model training |
CN114781642A (en) * | 2022-06-17 | 2022-07-22 | 之江实验室 | Cross-media corresponding knowledge generation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN106649715B (en) | 2019-08-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106649715A (en) | Cross-media retrieval method based on local sensitive hash algorithm and neural network | |
Kafle et al. | Dvqa: Understanding data visualizations via question answering | |
CN107832353B (en) | False information identification method for social media platform | |
CN109492099A (en) | Cross-domain text sentiment classification method based on domain-adversarial adaptation
CN106855853A (en) | Entity relation extraction system based on deep neural network | |
CN106446526A (en) | Electronic medical record entity relation extraction method and apparatus | |
CN105393265A (en) | Active featuring in computer-human interactive learning | |
CN108959522B (en) | Migration retrieval method based on semi-supervised countermeasure generation network | |
CN111680484B (en) | Answer model generation method and system for visual general knowledge reasoning question and answer | |
CN110928961B (en) | Multi-mode entity linking method, equipment and computer readable storage medium | |
Zhou et al. | Multi-label image classification via category prototype compositional learning | |
CN113761208A (en) | Scientific and technological innovation information classification method and storage device based on knowledge graph | |
Ou et al. | Semantic consistent adversarial cross-modal retrieval exploiting semantic similarity | |
Yusuf et al. | Evaluation of graph convolutional networks performance for visual question answering on reasoning datasets | |
CN116578738B (en) | Graph-text retrieval method and device based on graph attention and generating countermeasure network | |
CN111382243A (en) | Text category matching method, text category matching device and terminal | |
CN111737470A (en) | Text classification method | |
CN113158878B (en) | Heterogeneous migration fault diagnosis method, system and model based on subspace | |
PASBOLA | Text Classification Using Deep learning Methods | |
Nemade | Refinement of CNN based Multi-label image annotation | |
Jony et al. | Domain specific fine tuning of pre-trained language model in NLP | |
Kamal et al. | Geometry-Based Machining Feature Retrieval with Inductive Transfer Learning | |
Zhang et al. | Deep Normalization Cross-Modal Retrieval for Trajectory and Image Matching | |
Yu et al. | Chinese Electronic Medical Record Retrieval Method Using Fine-Tuned RoBERTa and Hybrid Features | |
Alsharif et al. | Remote Sensing Image Retrieval Using Multilingual Texts |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||