CN109299216B - Cross-modal hashing retrieval method and system fusing supervised information - Google Patents
Application number: CN201811269037.9A
Publication number: CN109299216B
Authority: CN (China)
Legal status: Expired - Fee Related (status as listed by Google Patents; not a legal conclusion)
Classifications
- G06F18/25 - Pattern recognition; analysing; fusion techniques
- G06N3/045 - Neural networks; architecture; combinations of networks
- G06N3/08 - Neural networks; learning methods
Abstract
The invention discloses a cross-modal hashing retrieval method and system that fuse supervised information. The method comprises: constructing an image network, a text network, and a fusion network; obtaining paired image and text feature training samples and feeding them to the image network and the text network respectively; taking the output features of the image network and the text network as the input of the fusion network, and defining the output of the fusion network; constructing an objective function for learning unified hash codes from the output of the fusion network and the pairwise similarity; solving the objective function to obtain the unified hash codes; and, using the unified hash codes as supervised information combined with semantic information, training modality-specific hash networks. Based on an end-to-end deep learning framework that learns feature representations and hash codes simultaneously, the invention captures the correlations between data of different modalities more effectively and thereby improves cross-modal retrieval precision.
Description
Technical field
This disclosure relates to cross-modal retrieval methods, and more specifically to a cross-modal hashing retrieval method and system fusing supervised information.
Background art
In recent years, with the sharp increase of heterogeneous data on the web, approximate nearest neighbor (ANN) search plays an increasingly important role in applications such as information retrieval, data mining, and computer vision. Owing to its low computational cost and high storage efficiency, hashing has become one of the most popular techniques for ANN search. The basic idea of hashing is to learn hash functions that map high-dimensional data to compact binary codes in Hamming space while preserving the similarity structure of the original space as far as possible. Many hashing methods for the single-modality setting have been proposed, but in the real world, data with the same semantics often exist in multiple modalities, e.g., image, text, and video. To fully exploit the relationships between such heterogeneous data, it is necessary to develop cross-modal hashing (CMH) methods for ANN search. Specifically, in cross-modal similarity search the modality of the query differs from the modality of the retrieved data. This disclosure takes image-to-text (I2T) and text-to-image (T2I) retrieval as the tasks for analysis and experiments, while the method extends to retrieval between any other modalities.
Most existing cross-modal hashing (CMH) methods are based on hand-crafted features, so feature extraction and hash-code learning proceed independently. This may limit the discriminative power of the sample representations and in turn hurt the accuracy of the learned hash codes. Recently, deep-learning-based hashing methods have proposed end-to-end frameworks that learn feature representations and hash codes simultaneously, capturing the nonlinear correlations between modalities more effectively than shallow learning methods. As a classical method, deep cross-modal hashing (DCMH) extends traditional deep models to cross-modal retrieval and runs an end-to-end learning framework with a deep neural network for each modality. Pairwise-relationship-guided deep hashing (PRDH) further integrates multiple kinds of pairwise constraints to enhance the similarity of hash codes both across and within modalities.
In the deep cross-modal hashing frameworks mentioned above, the hash codes of a paired sample from two different modalities are usually forced to be identical. Moreover, these methods learn the feature representation of each single sample through the deep neural network of its own modality and then establish cross-modal relationships by minimizing the loss between features of different modalities. They therefore share a drawback: simply imposing constraints on the last layer of each modality's network cannot fully exploit the complex relationships among multi-modal data.
Summary of the invention
To overcome the above deficiencies of the prior art, the present disclosure provides a cross-modal hashing retrieval method and system that fuse supervised information. The method is based on an end-to-end deep learning framework that learns feature representations and hash codes simultaneously, captures the correlations between data of different modalities more effectively than conventional learning algorithms, and thereby improves cross-modal retrieval precision.
To achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
A cross-modal hashing retrieval method fusing supervised information, comprising the following steps:
constructing an image network, a text network, and a fusion network;
obtaining paired image and text feature training samples and feeding them to the image network and the text network respectively;
taking the output features of the image network and the text network as the input of the fusion network, and defining the output of the fusion network;
constructing an objective function for learning unified hash codes from the output of the fusion network and the pairwise similarity;
solving the objective function to obtain the unified hash codes;
using the unified hash codes as supervised information, combined with semantic information, training modality-specific hash networks.
Further, the image network comprises 5 convolutional layers and 3 fully connected layers; the text network comprises two fully connected layers; the fusion network comprises two fully connected layers. The numbers of hidden units in the last layers of the image network and the text network are equal, the second layer of the fusion network is the hash layer, and its activation function is the sign function.
Further, the output features of the image network and the text network are passed through a nonlinear activation function to obtain the input of the fusion network.
Further, the objective function for learning the unified hash codes is:
min_{B,θ} J = -Σ_{s_ij∈S} ( s_ij Θ_ij - log(1 + e^{Θ_ij}) ) + λ||B - H||_F^2 + η||H·1||_F^2, with Θ_ij = (1/2) H_{*i}^T H_{*j},
where the first term is the pairwise embedding constraint: H_{*i} and H_{*j} denote the fusion-network outputs of different training sample pairs, S = {s_ij} is the pairwise similarity matrix, B ∈ {-1, 1}^{k×n} is the unified hash-code matrix, p(s_ij | B) denotes the conditional probability distribution of s_ij given the hash codes B, and λ is a hyper-parameter; the second term minimizes the loss between the fusion-network output and the binary codes, with H = h(Z; θ_z) ∈ R^{k×n} the output of the fusion network; the third term is the balance constraint, used to maximize the information carried by each hash code, with η a hyper-parameter and ||·||_F the Frobenius norm.
Further, solving the objective function comprises:
initializing the image, text and fusion network parameters θ = {θ_v, θ_t, θ_z} and the batch size;
fixing the network parameters θ = {θ_v, θ_t, θ_z} and updating the unified hash codes B;
then fixing B and updating the parameters θ = {θ_v, θ_t, θ_z} with mini-batch stochastic gradient descent;
alternating the updates until convergence.
Further, in the modality-specific hash networks, the image network comprises 5 convolutional layers, 2 fully connected layers and 1 hash layer, and the text network comprises 1 fully connected layer and 1 hash layer; the activation function of the hash layer in both the image network and the text network is the sign function.
Further, training the modality-specific hash networks comprises: solving an overall objective function to obtain the parameters of the image network and the text network. The overall objective function is:
min J = J_1 + αJ_2 + βJ_3 + γJ_4,
where α, β, γ denote hyper-parameters. J_1 = -Σ_{s_ij∈S} ( s_ij Θ_ij - log(1 + e^{Θ_ij}) ) with Θ_ij = (1/2) F_{*i}^T G_{*j} is the pairwise cross-modal embedding constraint, where F_{*i} = f(v_i; θ_v) denotes the feature representation of the i-th sample output by the image network and G_{*j} = g(t_j; θ_t) denotes that of the j-th sample output by the text network. J_2 = ||B - F||_F^2 + ||B - G||_F^2 uses the unified hash codes obtained in the first stage as supervised information to train the modality-specific hash networks, where B ∈ {-1, 1}^{k×n} is the unified hash-code matrix, F is the image feature output and G is the text feature output. J_3 = ||W_1^T F - Y||_F^2 + ||W_2^T G - Y||_F^2 linearly maps the label information into the modality-specific networks, where W_1 and W_2 denote the mapping matrices of the image and text modalities respectively and Y denotes the semantic label matrix. J_4 = ||F·1||_F^2 + ||G·1||_F^2 is the balance constraint, used to maximize the information carried by each bit.
Further, solving the overall objective function comprises:
initializing the image network parameters θ_v, the text network parameters θ_t and the batch size;
fixing the parameters θ_v and θ_t, and solving the objective function to update W_1 and W_2;
then fixing W_1 and W_2 and updating the image parameters θ_v and the text parameters θ_t respectively with mini-batch stochastic gradient descent;
alternating the updates until convergence.
One or more embodiments provide a computer system comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor; when executing the program, the processor implements the above cross-modal hashing retrieval method fusing supervised information.
One or more embodiments provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the above cross-modal hashing retrieval method fusing supervised information.
One or more of the above technical solutions have the following beneficial effects:
1. In traditional cross-modal hashing methods, feature extraction and hash-code learning are independent of each other. The present disclosure is based on an end-to-end deep learning framework that learns feature representations and hash codes simultaneously, and can therefore capture the correlations between data of different modalities more effectively.
2. The present disclosure feeds the features of different modalities into the fusion network in pairs to explore the correlations among multi-modal data through nonlinear transformations, and obtains high-quality hash codes to supervise the training of the modality-specific hash networks. The optimization problem is solved with an iterative updating strategy that keeps the hash codes discrete without relaxing them, which reduces the quantization error. The pairwise affinity information and the category information are embedded into the hash networks under the same framework, which well preserves the cross-modal similarity and the semantic consistency.
Detailed description of the invention
The accompanying drawings, which constitute a part of this application, are used to provide a further understanding of the application; the illustrative embodiments of the application and their explanations are used to explain the application and do not constitute an undue limitation on it.
Fig. 1 is a flow diagram of the cross-modal hashing retrieval method fusing supervised information in embodiment one;
Fig. 2 is a flow diagram of the cross-modal hashing retrieval method fusing supervised information in embodiment one.
Specific embodiment
It is noted that the following detailed description is exemplary and intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art to which the application belongs.
It should be noted that the terms used herein are merely for describing specific embodiments and are not intended to limit the exemplary embodiments of the application. As used herein, unless the context clearly indicates otherwise, the singular forms are also intended to include the plural forms; in addition, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
In the absence of conflict, the embodiments in the application and the features of the embodiments may be combined with each other.
Embodiment one
This embodiment discloses a cross-modal hashing retrieval method fusing supervised information which, as shown in Figs. 1-2, comprises the following steps:
First stage: unified hash-code learning
Step 1: construct three networks: an image network, a text network and a fusion network. (1) The image network adopts the CNN-F network. The original CNN-F model has 8 layers in total, including 5 convolutional layers and 3 fully connected layers. (2) For the text modality, each text sample is first represented as a bag-of-words (BOW) vector, and the BOW vector is then fed to a text network with two fully connected layers. In particular, the numbers of hidden units in the last layers of the image and text networks are equal, with different values set according to the code length and the dataset. (3) The fusion network consists of two fully connected layers and combines the outputs of the image and text networks in pairs. To obtain the unified hash codes, the second layer of the fusion network is designed as a hash layer with k hidden units, and its activation function is the sign function.
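The three-network layout above can be sketched numerically. This is a minimal sketch, not the patented implementation: the CNN-F image branch is replaced by a single random fully connected layer, and the intermediate layer widths (512, 2048, 256) are illustrative assumptions; only the pairing of the two branch outputs into a two-layer fusion network ending in a k-unit hash layer follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 16                      # code length (illustrative)
d_img, d_txt = 4096, 1386   # CNN-F feature size / MIRFLICKR BOW size from the text

def fc(x, w):
    """A fully connected layer with tanh activation."""
    return np.tanh(x @ w)

# image branch: stand-in for CNN-F (the hidden width 512 is a placeholder)
w_img = rng.standard_normal((d_img, 512)) * 0.01
# text branch: two fully connected layers, same final width as the image branch
w_txt1 = rng.standard_normal((d_txt, 2048)) * 0.01
w_txt2 = rng.standard_normal((2048, 512)) * 0.01
# fusion network: two fully connected layers, the second being the hash layer
w_fus1 = rng.standard_normal((1024, 256)) * 0.01
w_fus2 = rng.standard_normal((256, k)) * 0.01

v = rng.standard_normal((8, d_img))   # a mini-batch of image features
t = rng.standard_normal((8, d_txt))   # the paired text BOW vectors

f = fc(v, w_img)                      # image-network output
g = fc(fc(t, w_txt1), w_txt2)         # text-network output
z = np.tanh(np.concatenate([f, g], axis=1))   # paired outputs fused nonlinearly
h = fc(z, w_fus1) @ w_fus2            # fusion-network output H (pre-activation)
b = np.where(h >= 0, 1, -1)           # hash layer: sign yields k-bit codes
```

The sign nonlinearity at the end is what makes the second fusion layer a hash layer: every sample pair in the batch comes out as a k-dimensional vector over {-1, 1}.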
Step 2: given a dataset O = {o_i = (v_i, t_i, y_i)}_{i=1}^n, where n is the number of training sample pairs, v_i denotes the image feature, t_i denotes the text feature and y_i denotes the semantic label vector. In addition, S = {s_ij} denotes the pairwise similarity matrix. The goal of this stage is to learn a compact binary code b_i ∈ {-1, 1}^k for each sample; B ∈ {-1, 1}^{k×n} denotes the unified hash-code matrix.
Step 3: let F_{*i} = f(v_i; θ_v) denote the feature representation output by the image network and G_{*i} = g(t_i; θ_t) that output by the text network. The outputs of the two modalities are combined through a nonlinear activation function (the tanh function), z_i = tanh([F_{*i}; G_{*i}]), to obtain the input of the fusion network. Further, the output of the fusion network is defined as H = h(Z; θ_z) ∈ R^{k×n}. To learn the unified hash codes, the objective function is constructed as:
min_{B,θ} J = -Σ_{s_ij∈S} ( s_ij Θ_ij - log(1 + e^{Θ_ij}) ) + λ||B - H||_F^2 + η||H·1||_F^2, with Θ_ij = (1/2) H_{*i}^T H_{*j}   (1)
where the first term is the pairwise embedding constraint: H_{*i} and H_{*j} denote the fusion-network outputs of different training sample pairs, S = {s_ij} is the pairwise similarity matrix, B ∈ {-1, 1}^{k×n} is the unified hash-code matrix, and p(s_ij | B) denotes the conditional probability distribution of s_ij given the hash codes B. Minimizing this negative log-likelihood preserves the similarities in matrix S, i.e., it makes the similarity (inner product) between two similar samples as large as possible and that between dissimilar samples as small as possible. The second term minimizes the loss between the fusion-network output and the binary codes, so that the learned unified hash codes preserve the nonlinear correlations among the training samples well. The third term is the balance constraint, used to maximize the information carried by each hash code, i.e., requiring each bit to take 1 or -1 with equal probability. λ and η are hyper-parameters (λ > 0, η > 0), and ||·||_F denotes the Frobenius norm.
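The three-term stage-one loss described above can be evaluated directly on small random stand-ins. The sketch below assumes the standard DCMH-style negative log-likelihood form for the pairwise term (Θ_ij = ½ H_{*i}ᵀ H_{*j}), which matches the inner-product description in the text; the sizes and hyper-parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 8, 6
H = rng.standard_normal((k, n))    # fusion-network outputs, one column per pair
B = np.where(rng.standard_normal((k, n)) >= 0, 1.0, -1.0)  # current binary codes
S = (rng.random((n, n)) > 0.5).astype(float)               # pairwise similarity labels
lam, eta = 1.0, 1.0                # the hyper-parameters λ and η

theta = 0.5 * H.T @ H              # Θ_ij = ½ H_{*i}ᵀ H_{*j}
# first term: negative log-likelihood of the pairwise similarities
j1 = -np.sum(S * theta - np.logaddexp(0.0, theta))  # logaddexp(0,x) = log(1 + e^x)
# second term: quantization loss between fusion output and binary codes
j2 = lam * np.sum((B - H) ** 2)
# third term: balance constraint, pushing each bit toward equal +1/-1 usage
j3 = eta * np.sum((H @ np.ones((n, 1))) ** 2)
J = j1 + j2 + j3
```

Using `np.logaddexp(0, θ)` instead of `np.log(1 + np.exp(θ))` keeps the likelihood term numerically stable when the inner products grow large.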
Step 4: the optimization problem of formula (1) is solved with an iterative updating strategy: fix the network parameters θ = {θ_v, θ_t, θ_z} to learn the unified hash codes B, then fix B and update the parameters θ = {θ_v, θ_t, θ_z} with mini-batch stochastic gradient descent (SGD); the two updates alternate until convergence, yielding the optimal unified hash codes B. Specifically, the steps are:
initialize the image, text and fusion network parameters θ = {θ_v, θ_t, θ_z} and the batch size;
fix the network parameters θ = {θ_v, θ_t, θ_z} and update the unified hash codes according to
B = sign(λH);
then fix B and update the parameters θ = {θ_v, θ_t, θ_z} with mini-batch stochastic gradient descent, computing the gradients by back-propagation;
alternate the updates until convergence.
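The alternation can be illustrated with the network SGD step replaced by plain gradient descent on a free matrix H; this is only a stand-in for back-propagation through the three networks, and the step size and iteration count are arbitrary. The B-step is exactly the discrete update B = sign(λH), with no relaxation.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 8, 20
H = rng.standard_normal((k, n))    # stands in for the fusion-network output
lam, eta, lr = 1.0, 0.1, 0.05
ones = np.ones((n, 1))

for _ in range(200):
    # B-step: with θ fixed, the discrete minimizer of λ||B - H||² is B = sign(λH)
    B = np.where(lam * H >= 0, 1.0, -1.0)
    # θ-step stand-in: descend the quantization and balance terms w.r.t. H
    grad = 2.0 * lam * (H - B) + 2.0 * eta * (H @ ones) @ ones.T
    H -= lr * grad

quant_err = np.mean((B - H) ** 2)  # small once the alternation has settled
```

Because B stays in {-1, 1} throughout, the quantization error measured at the end comes only from the balance term pulling the row sums of H toward zero, not from any continuous relaxation of the codes.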
Second stage: modality-specific hash network training
Step 1: redesign the image network and the text network for training the modality-specific hash networks. Except that the last fully connected layer of each of the image and text networks is replaced with a hash layer (with k hidden units) whose activation function is the sign function, the settings of the other layers are the same as in the previous stage.
Step 2: this stage mainly trains the image network f(V; θ_v) and the text network g(T; θ_t) to obtain the corresponding hash functions h_v(·) and h_t(·), which encode samples outside the training data.
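Once h_v(·) and h_t(·) are available, cross-modal retrieval reduces to Hamming ranking of database codes against a query code. A sketch with the hash networks mocked as random linear maps followed by sign (stand-ins only; the actual networks are the CNN-F and fully connected models described above):

```python
import numpy as np

rng = np.random.default_rng(3)
k = 16

def hash_fn(x, w):
    """Mock modality-specific hash function: sign of a linear map (stand-in)."""
    return np.where(x @ w >= 0, 1, -1)

w_img = rng.standard_normal((64, k))   # mock parameters of h_v
w_txt = rng.standard_normal((32, k))   # mock parameters of h_t

db = hash_fn(rng.standard_normal((100, 32)), w_txt)  # text database codes
q = hash_fn(rng.standard_normal((1, 64)), w_img)     # one image query (I2T task)

# Hamming distance from the code inner product: d = (k - <b_q, b_i>) / 2
ham = (k - q @ db.T) // 2
ranking = np.argsort(ham[0], kind="stable")          # nearest texts first
```

The inner-product identity avoids any bitwise loop: since both codes lie in {-1, 1}^k, agreeing bits contribute +1 and disagreeing bits -1, so the distance follows directly from the dot product.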
Step 3: define the overall objective function:
min J = J_1 + αJ_2 + βJ_3 + γJ_4   (2)
where J_1 is the pairwise cross-modal embedding constraint, used to preserve the cross-modal similarity between the outputs of the image and text networks; J_2 uses the unified hash codes obtained in the first stage as supervised information to train the modality-specific hash networks; J_3 directly maps the label information linearly into the modality-specific networks, to fully exploit the semantic information; J_4 is the balance constraint, used to maximize the information carried by each bit; α, β, γ denote hyper-parameters. The terms are defined as follows:
J_1 = -Σ_{s_ij∈S} ( s_ij Θ_ij - log(1 + e^{Θ_ij}) ), with Θ_ij = (1/2) F_{*i}^T G_{*j}, where F_{*i} = f(v_i; θ_v) denotes the feature representation of the i-th sample output by the image network and G_{*j} = g(t_j; θ_t) denotes that of the j-th sample output by the text network;
J_2 = ||B - F||_F^2 + ||B - G||_F^2, where B ∈ {-1, 1}^{k×n} is the unified hash-code matrix, F is the image feature output and G is the text feature output;
J_3 = ||W_1^T F - Y||_F^2 + ||W_2^T G - Y||_F^2, where W_1 and W_2 denote the mapping matrices of the image and text modalities respectively and Y denotes the semantic label matrix;
J_4 = ||F·1||_F^2 + ||G·1||_F^2.
Step 4: the optimization problem of formula (2) is likewise solved with an iterative updating strategy: each parameter is updated while the others are fixed. In particular, mini-batch stochastic gradient descent with the back-propagation (BP) algorithm is used to update the parameters θ_v and θ_t. Specifically, the steps are:
initialize the image network parameters θ_v, the text network parameters θ_t and the batch size;
fix the parameters θ_v and θ_t, and solve the objective function to update W_1 and W_2;
then fix W_1 and W_2 and update the image parameters θ_v and the text parameters θ_t respectively with mini-batch stochastic gradient descent, computing the gradients by back-propagation;
alternate the updates until convergence.
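With θ_v and θ_t fixed, updating a mapping matrix is a linear least-squares problem with a closed-form solution. The sketch below adds a small ridge term μ (an assumption, included only so the normal-equation matrix is guaranteed invertible) and verifies the stationarity condition for W1; W2 follows symmetrically with G in place of F.

```python
import numpy as np

rng = np.random.default_rng(5)
k, n, c = 8, 50, 4
F = rng.standard_normal((k, n))               # fixed image-network outputs
Y = (rng.random((c, n)) > 0.5).astype(float)  # label matrix

# minimize ||W1.T @ F - Y||_F^2 + mu * ||W1||_F^2  (mu is an assumed regularizer)
mu = 1e-3
W1 = np.linalg.solve(F @ F.T + mu * np.eye(k), F @ Y.T)

# stationarity check: the gradient w.r.t. W1 vanishes at the solution
grad = 2.0 * F @ (F.T @ W1 - Y.T) + 2.0 * mu * W1
assert np.allclose(grad, 0.0, atol=1e-8)
```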
Experiments were conducted on the MIRFLICKR-25K and NUS-WIDE datasets respectively.
The MIRFLICKR-25K dataset contains 25,000 samples collected from the Flickr website; each sample comprises one picture and several textual tags. There are 24 labels in total, and each sample is annotated with at least one of them. Samples with at least 20 textual tags were selected for the experiments, giving 20,015 image-text pairs in total. The text modality is represented as a 1,386-dimensional BOW vector, and for the image modality the raw pixels are used directly as input. In the experiments, 2,000 samples were taken at random as queries, and the rest served as the retrieval database. To reduce the computational cost, 5,000 samples were taken from the database for training.
NUS-WIDE is a real-world web image database containing 269,648 samples annotated with 81 concept labels; each sample comprises one picture and its associated textual tags. In the experiments, the 10 largest classes were chosen to form a subset containing 186,577 image-text pairs in total. For each sample, the text modality is represented as a 1,000-dimensional BOW vector, and the image modality directly uses the raw pixels as input. On this dataset, 2,000 samples were randomly sampled as queries and the rest served as the database; similarly, 5,000 data points were taken at random from the database for training.
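The query/database/training split described for both datasets amounts to the following simple protocol (sizes taken from the MIRFLICKR-25K numbers in the text):

```python
import numpy as np

rng = np.random.default_rng(6)
n_total, n_query, n_train = 20015, 2000, 5000  # MIRFLICKR-25K sizes from the text

idx = rng.permutation(n_total)
query_idx = idx[:n_query]     # 2,000 random samples form the query set
db_idx = idx[n_query:]        # the remaining samples form the retrieval database
train_idx = rng.choice(db_idx, size=n_train, replace=False)  # 5,000 for training
```

Sampling the training points from the database (not from the full set) keeps the query set strictly unseen during training, as the protocol requires.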
This embodiment is implemented under the MatConvNet framework. For the image network, the CNN-F network pre-trained on the ImageNet dataset is used for initialization; the parameters of the other deep neural networks are initialized randomly. For the text network with two fully connected layers, the dimensions are set to [8192 → 2500] on the MIRFLICKR-25K dataset; on the NUS-WIDE dataset, the dimensions are set to [8192 → 1000] for code lengths 16 and 32, and to [8192 → 600] for code length 64. For the fusion network, which combines the outputs of the image and text networks in pairs, the dimensions of its fully connected layers are set to [4096 → k] on all datasets. In the experiments, all hyper-parameters are empirically set to 1, the learning rate is varied from 10^{-1.5} to 10^{-3}, and the number of outer-loop iterations in the algorithm is set to 500. The algorithm proceeds as follows.
Stage 1: unified hash-code learning
Input: image set V and text set T; pairwise similarity matrix S; hyper-parameters λ, η; code length k.
Output: unified hash-code matrix B.
Initialization: initialize the image, text and fusion network parameters θ = {θ_v, θ_t, θ_z}; batch size N_v = N_t = 128; iteration counts.
Repeat:
1. Fix the parameters θ = {θ_v, θ_t, θ_z} and update B according to B = sign(λH).
2. For iter = 1, 2, ..., t_z:
(a) randomly sample N_v and N_t data points from V and T respectively to construct a mini-batch;
(b) for each paired sample v_i and t_i in the mini-batch, compute f(v_i; θ_v), g(t_i; θ_t) and h(z_i; θ_z) by forward propagation;
(c) compute the gradient of the top layer;
(d) back-propagate through the image, text and fusion networks and update the parameters θ = {θ_v, θ_t, θ_z}.
Until convergence.
Stage 2: modality-specific hash network training
Input: image set V and text set T; pairwise similarity matrix S; label matrix Y; the learned hash-code matrix B; hyper-parameters α, β, γ; code length k.
Output: modality-specific hash network parameters θ_v and θ_t.
Initialization: initialize the image and text network parameters θ_v and θ_t; batch size N_v = N_t = 128; iteration counts.
Repeat:
1. Fix the parameters θ_v and θ_t, and update W_1 and W_2 by solving the objective function.
2. For iter = 1, 2, ..., t_v:
(a) randomly sample N_v data points from V to construct a mini-batch;
(b) for each sample v_i, compute f(v_i; θ_v) by forward propagation;
(c) back-propagate the derivatives and update θ_v.
3. For iter = 1, 2, ..., t_t:
(a) randomly sample N_t data points from T to construct a mini-batch;
(b) for each sample t_i, compute g(t_i; θ_t) by forward propagation;
(c) back-propagate the derivatives and update θ_t.
Until convergence.
The method was tested on both datasets and compared with 6 popular existing methods (LSSH, CMFH, DCH, SCM, SePHkm, DCMH). To ensure a fair comparison, the CNN features extracted from the 7th layer of this method's image network were used for the shallow baselines. As can be seen from Tables 1-2, the method provided in this embodiment shows better retrieval performance than the other methods on all datasets.
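Retrieval quality in such comparisons is commonly reported as mean average precision (MAP) over the Hamming ranking. The metric itself (general background, not this patent's reported numbers) can be computed as:

```python
import numpy as np

def average_precision(relevance_ranked):
    """AP for one query: `relevance_ranked` is the 0/1 relevance of database
    items in ranked (nearest-first) order. MAP averages this over all queries."""
    rel = np.asarray(relevance_ranked, dtype=float)
    if rel.sum() == 0:
        return 0.0
    hits = np.flatnonzero(rel)                       # 0-based ranks of relevant items
    precision_at_hits = np.cumsum(rel)[hits] / (hits + 1)
    return float(precision_at_hits.mean())

# toy check: relevant items returned at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2
ap = average_precision([1, 0, 1, 0])
assert abs(ap - 5.0 / 6.0) < 1e-12
```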
Table 1
Table 2
Embodiment two
The purpose of this embodiment is to provide a computing device.
A computer system comprises a memory, a processor, and a computer program stored in the memory and runnable on the processor; when executing the program, the processor implements:
constructing an image network, a text network, and a fusion network;
obtaining paired image and text feature training samples and feeding them to the image network and the text network respectively;
taking the output features of the image network and the text network as the input of the fusion network, and defining the output of the fusion network;
constructing an objective function for learning unified hash codes from the output of the fusion network and the pairwise similarity;
solving the objective function to obtain the unified hash codes;
using the unified hash codes as supervised information, combined with semantic information, training modality-specific hash networks.
Embodiment three
The purpose of this embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium has a computer program stored thereon; when executed by a processor, the program implements the following steps:
constructing an image network, a text network, and a fusion network;
obtaining paired image and text feature training samples and feeding them to the image network and the text network respectively;
taking the output features of the image network and the text network as the input of the fusion network, and defining the output of the fusion network;
constructing an objective function for learning unified hash codes from the output of the fusion network and the pairwise similarity;
solving the objective function to obtain the unified hash codes;
using the unified hash codes as supervised information, combined with semantic information, training modality-specific hash networks.
The steps involved in embodiments two and three above correspond to method embodiment one; for specific implementations, refer to the relevant description of embodiment one. The term "computer-readable storage medium" should be understood as a single medium or multiple media including one or more instruction sets, and as any medium capable of storing, encoding or carrying an instruction set for execution by a processor that causes the processor to perform any method of the present disclosure.
Those skilled in the art will understand that the modules or steps of the application described above can be implemented by a general-purpose computing device; optionally, they can be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device, or fabricated into individual integrated circuit modules, or multiple modules or steps among them fabricated into a single integrated circuit module. The application is not limited to any specific combination of hardware and software.
The above descriptions are merely preferred embodiments of the application and are not intended to limit it; for those skilled in the art, various modifications and changes may be made to the application. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the application shall be included within its scope of protection.
Although the specific embodiments of the application have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the application; those skilled in the art should understand that, on the basis of the technical solutions of the application, various modifications or variations that can be made without creative effort still fall within the protection scope of the application.
Claims (8)
1. a kind of cross-module state Hash search method for merging supervision message, which comprises the following steps:
Construct image network, text network and converged network;
Image and text feature training sample pair are obtained, respectively input picture network and text network;
Using image network and the output feature of text network as the input of the converged network, and define the converged network
Output;
According to the output of the converged network and pair between similitude building learn the objective functions of unified Hash codes;
The objective function is solved, unified Hash codes are obtained;
Using the unified Hash codes as supervision message, in conjunction with semantic information, the Hash network of training modality-specific;
The objective function for learning unified Hash codes are as follows:
Wherein, embedded constraint item between first item is pair, and
Wherein H*i、H*jRespectively indicate the converged network output of different training samples pair, S={ sijExpression pair
Between similarity matrix, B ∈ { -1,1 }k×nIndicate unified Hash codes matrix, p (sij| B) when indicating given Hash codes B, sijItem
Part probability distribution, λ indicate super ginseng;Section 2 minimizes the loss between the output and binary code of converged network, H=h (Z;
θz)∈Rk×nFor the output of converged network;Section 3 is Constraints of Equilibrium item, for maximizing the information of each Hash codes, η table
Show super ginseng,Indicate F norm;
the training of the modality-specific hash networks comprises: solving an overall objective function to obtain the parameters of the image network and the text network; the overall objective function combines four terms J1, J2, J3 and J4, weighted by hyperparameters α, β and γ;
wherein J1 is the inter-modal pairwise embedding constraint, in which F*i = f(vi; θv) denotes the feature representation of the i-th sample output by the image network, and G*j = g(tj; θt) denotes the feature representation of the j-th sample output by the text network; J2 trains the modality-specific hash networks using the unified hash codes obtained in the first stage as supervision information, where B ∈ {−1, 1}^(k×n) denotes the unified hash code matrix, F denotes the image feature output and G denotes the text feature output; J3 linearly maps the label information to the modality-specific networks, where W1 and W2 respectively denote the mapping matrices of the image and text modalities, and Y denotes the semantic matrix; J4 is the balance constraint, used to maximize the information of each hash bit.
2. The cross-modal hash retrieval method fusing supervision information according to claim 1, characterized in that the image network comprises 5 convolutional layers and 3 fully connected layers; the text network comprises two fully connected layers; the fusion network comprises two fully connected layers; wherein the numbers of hidden units in the last layers of the image network and the text network are equal, the second layer of the fusion network is a hash layer, and its activation function is a discriminant function.
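As a concrete reading of the fusion network just described, its forward pass might be sketched as follows; the layer sizes, the ReLU in the first layer, and the use of sign(·) as the "discriminant" activation of the hash layer are all illustrative assumptions rather than details given in the claim.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hidden, k = 16, 32, 8   # illustrative sizes; k is the hash code length
W1 = rng.normal(scale=0.1, size=(d_hidden, d_in))
W2 = rng.normal(scale=0.1, size=(k, d_hidden))

def fusion_forward(z):
    """Two fully connected layers; the second is the hash layer."""
    h = np.maximum(W1 @ z, 0.0)   # first fully connected layer (ReLU assumed)
    return np.sign(W2 @ h)        # hash layer with a sign-type discriminant activation
```

The input z here stands for the concatenated output features of the image and text networks, whose last-layer widths are equal per the claim.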
3. The cross-modal hash retrieval method fusing supervision information according to claim 1, characterized in that the output features of the image network and the text network are passed through a nonlinear activation function to obtain the input of the fusion network.
4. The cross-modal hash retrieval method fusing supervision information according to claim 1, characterized in that solving the objective function comprises:
initializing the image, text and fusion network parameters θ = {θv, θt, θz} and the batch size;
fixing the network parameters θ = {θv, θt, θz}, and updating the unified hash codes B;
then fixing B, and updating the parameters θ = {θv, θt, θz} using mini-batch stochastic gradient descent;
alternating the two updates until convergence.
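The alternating scheme of claim 4 can be sketched with toy stand-ins. Below, the fusion output is modelled as h(θ) = tanh(θ) with θ a k×n matrix, the B-step uses the closed form B = sign(H) implied by a quantization term λ‖B − H‖F², and the θ-step takes a full-batch gradient step on that term only, as a stand-in for mini-batch SGD on the full objective; all of these choices are illustrative assumptions, not the patent's actual networks.

```python
import numpy as np

def alternate_optimize(theta0, lam=1.0, lr=0.1, iters=50):
    """Alternate between a closed-form B-update and a gradient theta-update."""
    theta = theta0.copy()
    losses = []
    for _ in range(iters):
        H = np.tanh(theta)                 # toy fusion-network output h(theta)
        B = np.sign(H)                     # theta fixed: closed-form update of B
        losses.append(lam * np.sum((B - H) ** 2))
        # B fixed: gradient of lam*||B - H||^2 through tanh (1 - H^2 is tanh')
        grad = lam * 2.0 * (H - B) * (1.0 - H ** 2)
        theta -= lr * grad
    return B, losses
```

With a small learning rate the quantization loss decreases monotonically, illustrating why the "constantly alternate until convergence" step is well behaved.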
5. The cross-modal hash retrieval method fusing supervision information according to claim 1, characterized in that, in the modality-specific hash networks, the image network comprises 5 convolutional layers, 2 fully connected layers and 1 hash layer, and the text network comprises 1 fully connected layer and 1 hash layer; wherein the activation functions of the hash layers in the image network and the text network are discriminant functions.
6. The cross-modal hash retrieval method fusing supervision information according to claim 1, characterized in that solving the overall objective function comprises:
initializing the image network parameters θv, the text network parameters θt and the batch size;
fixing the parameters θv and θt, and solving the objective function to obtain W1 and W2;
then fixing W1 and W2, and updating the network parameters using mini-batch stochastic gradient descent;
alternating the two updates until convergence.
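The "solve the objective function to obtain W1 and W2" step of claim 6 admits a closed form if the label-mapping term J3 is a regularized least-squares expression of the form ‖Y − W F‖F² + reg·‖W‖F² (an assumption; the claim does not give the exact expression). Under that assumption, each mapping matrix has the ridge-regression solution sketched below.

```python
import numpy as np

def solve_mapping(F, Y, reg=1.0):
    """Closed-form update for a linear label-mapping matrix W (assumed form).

    Minimizes ||Y - W F||_F^2 + reg * ||W||_F^2 for fixed features F
    (k x n) and label matrix Y (c x n), giving
    W = Y F^T (F F^T + reg * I)^{-1}.
    """
    k = F.shape[0]
    return Y @ F.T @ np.linalg.inv(F @ F.T + reg * np.eye(k))
```

The same formula would be applied once with the image features (for W1) and once with the text features (for W2) in each outer iteration, before the SGD step on the network parameters.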
7. A computer system comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the cross-modal hash retrieval method fusing supervision information according to any one of claims 1 to 6.
8. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the cross-modal hash retrieval method fusing supervision information according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811269037.9A CN109299216B (en) | 2018-10-29 | 2018-10-29 | A kind of cross-module state Hash search method and system merging supervision message |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109299216A CN109299216A (en) | 2019-02-01 |
CN109299216B true CN109299216B (en) | 2019-07-23 |
Family
ID=65158169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811269037.9A Expired - Fee Related CN109299216B (en) | 2018-10-29 | 2018-10-29 | A kind of cross-module state Hash search method and system merging supervision message |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109299216B (en) |
Families Citing this family (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109960732B (en) * | 2019-03-29 | 2023-04-18 | 广东石油化工学院 | Deep discrete hash cross-modal retrieval method and system based on robust supervision |
CN110059198B (en) * | 2019-04-08 | 2021-04-13 | 浙江大学 | Discrete hash retrieval method of cross-modal data based on similarity maintenance |
CN110059154B (en) * | 2019-04-10 | 2022-04-15 | 山东师范大学 | Cross-modal migration hash retrieval method based on inheritance mapping |
CN110083532B (en) * | 2019-04-12 | 2023-05-23 | 中科寒武纪科技股份有限公司 | Method and device for positioning operation errors in fusion mode based on deep learning framework |
CN110222140B (en) * | 2019-04-22 | 2021-07-13 | 中国科学院信息工程研究所 | Cross-modal retrieval method based on counterstudy and asymmetric hash |
CN110188209B (en) * | 2019-05-13 | 2021-06-04 | 山东大学 | Cross-modal Hash model construction method based on hierarchical label, search method and device |
CN110188223B (en) * | 2019-06-06 | 2022-10-04 | 腾讯科技(深圳)有限公司 | Image processing method and device and computer equipment |
CN111127385B (en) * | 2019-06-06 | 2023-01-13 | 昆明理工大学 | Medical information cross-modal Hash coding learning method based on generative countermeasure network |
CN110298395B (en) * | 2019-06-18 | 2023-04-18 | 天津大学 | Image-text matching method based on three-modal confrontation network |
CN110647804A (en) * | 2019-08-09 | 2020-01-03 | 中国传媒大学 | Violent video identification method, computer system and storage medium |
CN110597878B (en) * | 2019-09-16 | 2023-09-15 | 广东工业大学 | Cross-modal retrieval method, device, equipment and medium for multi-modal data |
CN110750660B (en) * | 2019-10-08 | 2023-03-10 | 西北工业大学 | Half-pairing multi-mode data hash coding method |
CN110765281A (en) * | 2019-11-04 | 2020-02-07 | 山东浪潮人工智能研究院有限公司 | Multi-semantic depth supervision cross-modal Hash retrieval method |
CN113064959B (en) * | 2020-01-02 | 2022-09-23 | 南京邮电大学 | Cross-modal retrieval method based on deep self-supervision sorting Hash |
CN111241310A (en) * | 2020-01-10 | 2020-06-05 | 济南浪潮高新科技投资发展有限公司 | Deep cross-modal Hash retrieval method, equipment and medium |
CN111353076B (en) * | 2020-02-21 | 2023-10-10 | 华为云计算技术有限公司 | Method for training cross-modal retrieval model, cross-modal retrieval method and related device |
CN111460201B (en) * | 2020-03-04 | 2022-09-23 | 南京邮电大学 | Cross-modal retrieval method for modal consistency based on generative countermeasure network |
CN111782921A (en) * | 2020-03-25 | 2020-10-16 | 北京沃东天骏信息技术有限公司 | Method and device for searching target |
CN111599438B (en) * | 2020-04-02 | 2023-07-28 | 浙江工业大学 | Real-time diet health monitoring method for diabetics based on multi-mode data |
CN111753190A (en) * | 2020-05-29 | 2020-10-09 | 中山大学 | Meta learning-based unsupervised cross-modal Hash retrieval method |
CN111914156B (en) * | 2020-08-14 | 2023-01-20 | 中国科学院自动化研究所 | Cross-modal retrieval method and system for self-adaptive label perception graph convolution network |
CN111914950B (en) * | 2020-08-20 | 2021-04-16 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Unsupervised cross-modal retrieval model training method based on depth dual variational hash |
CN112559820B (en) * | 2020-12-17 | 2022-08-30 | 中国科学院空天信息创新研究院 | Sample data set intelligent question setting method, device and equipment based on deep learning |
CN112667841B (en) * | 2020-12-28 | 2023-03-24 | 山东建筑大学 | Weak supervision depth context-aware image characterization method and system |
CN112817604B (en) * | 2021-02-18 | 2022-08-05 | 北京邮电大学 | Android system control intention identification method and device, electronic equipment and storage medium |
CN112989097A (en) * | 2021-03-23 | 2021-06-18 | 北京百度网讯科技有限公司 | Model training and picture retrieval method and device |
CN113095415B (en) * | 2021-04-15 | 2022-06-14 | 齐鲁工业大学 | Cross-modal hashing method and system based on multi-modal attention mechanism |
CN113157739B (en) * | 2021-04-23 | 2024-01-09 | 平安科技(深圳)有限公司 | Cross-modal retrieval method and device, electronic equipment and storage medium |
CN113449849B (en) * | 2021-06-29 | 2022-05-27 | 桂林电子科技大学 | Learning type text hash method based on self-encoder |
CN113763441B (en) * | 2021-08-25 | 2024-01-26 | 中国科学院苏州生物医学工程技术研究所 | Medical image registration method and system without supervision learning |
CN114329109B (en) * | 2022-03-15 | 2022-06-03 | 山东建筑大学 | Multimodal retrieval method and system based on weakly supervised Hash learning |
CN114942984B (en) * | 2022-05-26 | 2023-11-21 | 北京百度网讯科技有限公司 | Pre-training and image-text retrieval method and device for visual scene text fusion model |
CN115687571B (en) * | 2022-10-28 | 2024-01-26 | 重庆师范大学 | Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash |
CN115840827B (en) * | 2022-11-07 | 2023-09-19 | 重庆师范大学 | Deep unsupervised cross-modal hash retrieval method |
CN115982403B (en) * | 2023-01-12 | 2024-02-02 | 之江实验室 | Multi-mode hash retrieval method and device |
CN115880556B (en) * | 2023-02-21 | 2023-05-02 | 北京理工大学 | Multi-mode data fusion processing method, device, equipment and storage medium |
CN116049459B (en) * | 2023-03-30 | 2023-07-14 | 浪潮电子信息产业股份有限公司 | Cross-modal mutual retrieval method, device, server and storage medium |
CN116594994B (en) * | 2023-03-30 | 2024-02-23 | 重庆师范大学 | Application method of visual language knowledge distillation in cross-modal hash retrieval |
CN116244484B (en) * | 2023-05-11 | 2023-08-08 | 山东大学 | Federal cross-modal retrieval method and system for unbalanced data |
CN116431847B (en) * | 2023-06-14 | 2023-11-14 | 北京邮电大学 | Cross-modal hash retrieval method and device based on multiple contrast and double-way countermeasure |
CN116825210B (en) * | 2023-08-28 | 2023-11-17 | 山东大学 | Hash retrieval method, system, equipment and medium based on multi-source biological data |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107273505A (en) * | 2017-06-20 | 2017-10-20 | 西安电子科技大学 | Supervision cross-module state Hash search method based on nonparametric Bayes model |
CN107402993A (en) * | 2017-07-17 | 2017-11-28 | 山东师范大学 | The cross-module state search method for maximizing Hash is associated based on identification |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8131556B2 (en) * | 2007-04-03 | 2012-03-06 | Microsoft Corporation | Communications using different modalities |
CN104317837B (en) * | 2014-10-10 | 2017-06-23 | 浙江大学 | A kind of cross-module state search method based on topic model |
CN104899253B (en) * | 2015-05-13 | 2018-06-26 | 复旦大学 | Towards the society image across modality images-label degree of correlation learning method |
JP6656570B2 (en) * | 2015-07-13 | 2020-03-04 | 国立大学法人 筑波大学 | Cross-modal sensory analysis system, presentation information determination system, information presentation system, cross-modal sensory analysis program, presentation information determination program, and information presentation program |
CN107256271B (en) * | 2017-06-27 | 2020-04-03 | 鲁东大学 | Cross-modal Hash retrieval method based on mapping dictionary learning |
CN107871014A (en) * | 2017-11-23 | 2018-04-03 | 清华大学 | A kind of big data cross-module state search method and system based on depth integration Hash |
- 2018-10-29: CN CN201811269037.9A, granted as CN109299216B, status not active (Expired - Fee Related)
Also Published As
Publication number | Publication date |
---|---|
CN109299216A (en) | 2019-02-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109299216B (en) | A kind of cross-module state Hash search method and system merging supervision message | |
CN110334219B (en) | Knowledge graph representation learning method based on attention mechanism integrated with text semantic features | |
CN109165306B (en) | Image retrieval method based on multitask Hash learning | |
CN113707235B (en) | Drug micromolecule property prediction method, device and equipment based on self-supervision learning | |
CN109299341A (en) | One kind confrontation cross-module state search method dictionary-based learning and system | |
CN111753189A (en) | Common characterization learning method for few-sample cross-modal Hash retrieval | |
CN110826303A (en) | Joint information extraction method based on weak supervised learning | |
CN114418954A (en) | Mutual learning-based semi-supervised medical image segmentation method and system | |
CN110516095A (en) | Weakly supervised depth Hash social activity image search method and system based on semanteme migration | |
CN109840322A (en) | It is a kind of based on intensified learning cloze test type reading understand analysis model and method | |
CN112561064B (en) | Knowledge base completion method based on OWKBC model | |
CN112000770B (en) | Semantic feature graph-based sentence semantic matching method for intelligent question and answer | |
CN112199532B (en) | Zero sample image retrieval method and device based on Hash coding and graph attention machine mechanism | |
CN113051914A (en) | Enterprise hidden label extraction method and device based on multi-feature dynamic portrait | |
CN114896434B (en) | Hash code generation method and device based on center similarity learning | |
CN111460824A (en) | Unmarked named entity identification method based on anti-migration learning | |
CN116932722A (en) | Cross-modal data fusion-based medical visual question-answering method and system | |
CN109960732A (en) | A kind of discrete Hash cross-module state search method of depth and system based on robust supervision | |
CN110598022B (en) | Image retrieval system and method based on robust deep hash network | |
CN116258990A (en) | Cross-modal affinity-based small sample reference video target segmentation method | |
CN115827954A (en) | Dynamically weighted cross-modal fusion network retrieval method, system and electronic equipment | |
CN114021584A (en) | Knowledge representation learning method based on graph convolution network and translation model | |
CN116720519B (en) | Seedling medicine named entity identification method | |
CN112668633B (en) | Adaptive graph migration learning method based on fine granularity field | |
CN109978013A (en) | A kind of depth clustering method for figure action identification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20190723 Termination date: 20211029 |