CN109299342A - Cross-modal retrieval method based on a cycle generative adversarial network - Google Patents
Cross-modal retrieval method based on a cycle generative adversarial network
- Publication number
- CN109299342A (application number CN201811455802.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- network
- loss function
- cross-modal
- generator
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a cross-modal retrieval method based on a cycle generative adversarial network. The method designs a novel dual-channel cycle generative adversarial neural network and establishes semantic correlations across modalities by training it. Data of different modalities flow bidirectionally through the network: each modality is translated into the other by one generative adversarial pair, and the generated data in turn serve as input to the next generative adversarial pair, so that data are generated in a bidirectional cycle and the network continuously learns the semantic relations between modalities. To improve retrieval efficiency, the method also approximates the outputs of the generators' intermediate layer as binary hash codes using a threshold function and an approximation function, and designs several constraints that preserve the similarity of data within the same modality and of same-class data across modalities, and the dissimilarity of data across classes, further improving the stability and accuracy of retrieval.
Description
Technical field
The invention belongs to the technical field of multimedia information retrieval, and in particular relates to a cross-modal retrieval method based on a cycle generative adversarial network.
Technical background
With the arrival of the Internet era, people can access, anytime and anywhere, massive amounts of information in multiple modalities, including pictures, video, text, and audio. How to obtain the content one needs from this flood of information has become a central concern of Internet users, who frequently rely on search engines such as Google, Baidu, and Bing for accurate retrieval services. However, traditional Internet search services still largely remain at the level of single-modality retrieval; retrieval across modalities is rarely applied, and its efficiency, accuracy, and stability all leave room for improvement. Moreover, such services mostly depend on existing data labels and cannot perform cross-modal retrieval of unlabeled data. Studying novel cross-modal retrieval methods therefore has strong practical significance and value. The key is to establish the semantic relations between multimodal heterogeneous data so that similar data of other modalities can be retrieved directly, realizing direct retrieval across modalities without having to label all modality data, and ultimately further improving retrieval performance.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a cross-modal retrieval method based on a cycle generative adversarial network that can effectively improve the performance of existing cross-modal retrieval techniques.
To achieve the above goal, the cross-modal retrieval method based on a cycle generative adversarial network designed by the present invention comprises the following steps:
Two cycle modules are designed. The two cycle modules share two generators with identical network structure, and hash coding is performed on the output data of the generators' intermediate layer; the purpose of each generator is to generate, through training, cross-modal data that are as realistic as possible.
One of the cycle modules realizes the process modality m → modality t → modality m through the two generators; the other cycle module realizes the process modality t → modality m → modality t through the same two generators.
A dedicated discriminator is designed for each modality in each cycle module. The discriminator attempts to classify the generated data and the original data of that modality, engaging in a dynamic adversarial game with the generator; under the given training conditions, the generator and discriminator finally reach a dynamic equilibrium.
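The two cycle modules above can be sketched as plain function composition; the generators here are hypothetical toy stand-ins (simple invertible maps), not the patent's convolutional networks:

```python
def g_m_to_t(m):          # toy stand-in for generator G_m->t
    return [x * 2.0 for x in m]

def g_t_to_m(t):          # toy stand-in for generator G_t->m
    return [x / 2.0 for x in t]

def cycle_m(m):
    """Cycle module 1: modality m -> t_gen -> m_cyc."""
    t_gen = g_m_to_t(m)
    return g_t_to_m(t_gen)

def cycle_t(t):
    """Cycle module 2: modality t -> m_gen -> t_cyc."""
    m_gen = g_t_to_m(t)
    return g_m_to_t(m_gen)

m_ori = [1.0, 2.0, 3.0]
m_cyc = cycle_m(m_ori)    # with these toy inverse maps, m_cyc == m_ori
```

The point of the cycle losses described later is to push the real, learned generators toward exactly this round-trip behavior.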
Further, to handle the multimodal, multi-class nature of the data flow, a manifold constraint is used under unsupervised conditions to guarantee the similarity and dissimilarity of data across modalities and classes. Under supervised conditions, where class labels are given, a triplet constraint is used to minimize the feature distance between same-class data of different modalities and to maximize the feature distance between data of different classes, whether of the same or of different modalities.
Further, the loss function of the discriminators is defined over the original data and the finally produced cycle data of each modality; in addition, a cycle loss function is obtained by comparing the finally produced cycle data of each modality with the original data. Here i denotes the data of the i-th computation among a total of n training samples, and during training the discriminators iteratively learn in the direction that reduces L_disc. D_img and D_txt denote the two discriminators, (m_ori, t_ori) denote the original feature vectors of modality m and modality t, and (m_cyc, t_cyc) denote the feature vectors of modality m and modality t produced by the cycle network.
Still further, the loss function of the generators is defined accordingly, where θ1 is a hyperparameter of the network and ||·||₂ denotes the L2 distance.
Further, let the feature vectors output by the intermediate layers of the two generators be m_com and t_com; the hash codes are then generated by:
m_hash = sgn(m_com − 0.5)
t_hash = sgn(t_com − 0.5)
where sgn is a threshold function. The formulas mean that, for each floating-point number in the intermediate-layer feature vector, the corresponding hash bit is set to +1 when the value is greater than 0.5 and to −1 when the value is less than 0.5.
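A minimal NumPy sketch of this thresholding (the function name is illustrative):

```python
import numpy as np

def to_hash(com):
    """Binarize an intermediate-layer feature vector:
    components > 0.5 map to +1, all others to -1."""
    return np.where(com > 0.5, 1, -1)

m_com = np.array([0.9, 0.2, 0.51, 0.4])
m_hash = to_hash(m_com)   # -> [ 1, -1,  1, -1]
```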
Still further, to quantify the approximation error between the feature vectors and the generated hash codes, the method designs a corresponding loss function as a constraint. Specifically, it uses the likelihood of a hash code conditioned on the feature vector, taking the j-th hash bit and the j-th feature component of the i-th sample as an example (the sample may be either an image or a text), where the likelihood is built from a sigmoid function of the feature vector. Based on this likelihood, a loss function is further designed to evaluate the approximation error between the feature vectors and the generated hash codes, where n is the total number of samples and d_hash is the number of bits in the vector.
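Since the exact formulas are not reproduced above, the following is only a plausible sketch of such a likelihood-based approximation loss, assuming a mean negative log-likelihood over ±1 bits under a sigmoid of the shifted feature components (the 0.5 shift mirrors the thresholding and is an assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hash_approx_loss(com, hashes):
    """Hedged sketch: mean negative log-likelihood of +1/-1 hash bits
    under a sigmoid of the shifted feature components."""
    p = sigmoid(com - 0.5)            # P(bit = +1 | feature); shift is an assumption
    is_pos = (hashes + 1) / 2.0       # map {-1, +1} -> {0, 1}
    ll = is_pos * np.log(p) + (1 - is_pos) * np.log(1 - p)
    return -ll.mean()
```

Features that agree with their hash bits yield a smaller loss than features that contradict them, which is the behavior the constraint needs.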
Still further, in the present invention a classification constraint is imposed on the intermediate-layer feature vectors of the generators, for which a classification loss function is designed, where the predicted class of the i-th sample's feature vector is obtained through a small classification network and c_i is the actual class label of that sample; the classification loss function in fact computes the L2 distance between the two.
To constrain the similarity of same-class data across modalities, the method links each training image sample with its semantically similar text sample and designs a loss function that constrains cross-modal same-class data. In this loss, the feature vectors produced in the common subspace by generators G_t→m and G_m→t for images and text are compared, and the loss computes the L2 distance between semantically similar corresponding cross-modal same-class data.
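Both constraints above reduce to L2 distances; a minimal sketch with illustrative toy vectors (names and dimensions are not from the patent):

```python
import numpy as np

def l2_loss(a, b):
    """L2 distance, the form both the classification loss and the
    cross-modal pair loss take in the text."""
    return np.linalg.norm(a - b)

# classification constraint: predicted class vector vs. one-hot label
pred = np.array([0.8, 0.1, 0.1])
label = np.array([1.0, 0.0, 0.0])
cls_loss = l2_loss(pred, label)

# cross-modal pair constraint: common-subspace image vs. text features
m_tilde = np.array([0.2, 0.4])
t_tilde = np.array([0.2, 0.4])
pair_loss = l2_loss(m_tilde, t_tilde)   # identical features -> 0
```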
Under supervised training, since the data all carry class labels, a triplet constraint is used to minimize the distance between cross-modal data vectors under the same semantic label; in the designed triplet loss function, m and t represent image and text data respectively, α and β represent two class labels, * denotes generated data, and i denotes the data of the i-th computation. For unsupervised training, the method designs a manifold constraint to guarantee the similarity of semantically similar data within a modality and across modalities: after computing a kNN matrix, a similarity matrix is established for the data to be constrained, and the manifold constraint is then applied to the feature vectors in the common subspace. In the designed manifold constraint loss function, neib and non represent neighboring and non-neighboring data respectively, and the other symbols have the same meanings as before.
Further, summarizing the loss function designs above, the generator loss function in the supervised case and the generator loss function in the unsupervised case are assembled from the respective terms, where θ2, θ3, θ4, θ5 are weight hyperparameters of the network. The whole network is trained iteratively using the RMSProp stochastic gradient descent optimization algorithm. Since in practice the discriminators' gradients descend quickly, the method trains the discriminators for one iteration only after every S generator iterations, and uses hyperparameters c_gen and c_disc to clip the network weights, preventing them from becoming too large.
The present invention has the following advantages: by constructing a cycle generative adversarial network from two pairs of generators and discriminators, it better establishes the semantic relations between multimodal data; it designs several constraints to improve the stability and accuracy of retrieval; and it substitutes binary hash codes for the original features to improve retrieval efficiency. It thereby studies and explores a novel cross-modal retrieval method based on a cycle generative adversarial network, applied specifically to cross-modal retrieval between images and text.
Detailed description of the invention
Fig. 1 is the neural network general frame figure of the embodiment of the present invention.
Fig. 2 is the triple constraint schematic diagram of the embodiment of the present invention.
Fig. 3 is the manifold constraint schematic diagram of the embodiment of the present invention.
Specific embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments:
In recent years, along with the surge of artificial intelligence, deep learning technology has gradually risen and influenced every field of computer science, and more and more people in the field of multimedia information retrieval are using deep learning to improve the stability and accuracy of existing retrieval. The generative adversarial network (GAN) used in this method is a new kind of neural network, widely applied in recent years, that estimates a generative model through an adversarial process. The network simultaneously trains a generator that learns the data distribution and a discriminator that judges whether data are real or fake; the generator and discriminator oppose each other during training and finally reach a dynamic equilibrium. Generative adversarial networks are widely used in fields such as image generation, semantic segmentation, and data augmentation; guided by a loss function, they can learn the distribution of the training samples well and generate new data similar to the training samples. This method uses two generative adversarial networks to form a novel cycle network, and improves, through hash codes and several constraints, the network's efficiency, stability, and accuracy when used for multimodal retrieval.
The cross-modal retrieval method based on a cycle generative adversarial network provided by the invention mainly designs a novel neural network, whose overall structure is shown in Fig. 1. The embodiment takes mutual retrieval between image and text data as an example to describe the neural network architecture and the data-processing flow of the invention in detail, as follows:
First, in the embodiment, the original two-dimensional image data need preliminary processing. The present embodiment selects the 19-layer VGGNet popular in the deep learning field and uses the 4096-dimensional feature vector output by VGGNet's fc7 layer as the original input image feature m_ori, i.e. the image feature dimension d_img is 4096. At the same time, the input raw text data are also processed into preliminary feature vectors; the present embodiment uses the conventional Bag-of-Words (BoW) model to process the text data. The length of the resulting BoW vector depends on the text data and on the specific processing method selected; for ease of reference in this implementation, the BoW vector dimension in the present embodiment is set to 2000, i.e. the text feature dimension d_txt is 2000, and this vector serves as the original input text feature t_ori.
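A minimal Bag-of-Words featurizer can be sketched as follows (the three-word vocabulary is purely illustrative; the embodiment uses a 2000-dimensional vocabulary):

```python
def bow_vector(text, vocab):
    """Count occurrences of each vocabulary word in the text;
    out-of-vocabulary words are ignored."""
    counts = [0] * len(vocab)
    index = {w: i for i, w in enumerate(vocab)}
    for word in text.lower().split():
        if word in index:
            counts[index[word]] += 1
    return counts

vocab = ["dog", "cat", "runs"]
vec = bow_vector("Dog runs and runs", vocab)   # -> [1, 0, 2]
```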
Step 1: design the first generative adversarial network, containing generator G_m→t and discriminator D_txt. From the input original image-text pair (m_ori, t_ori) it obtains the generated text data t_gen, thereby extracting the mapping that generates text data from image data and obtaining the semantic relation between image and text data. The specific implementation process is as follows:
As shown in Fig. 1, the upper half of the network can be regarded as the first generative adversarial network, mainly containing generator G_m→t and discriminator D_txt; the input at this point is the original image-text pair (m_ori, t_ori). As data flow through the network, the original image m_ori passes through generator G_m→t to obtain the generated text t_gen, i.e. t_gen = G_m→t(m_ori), and the generated text t_gen is expected to be as similar as possible to the original text t_ori. Generator G_m→t is composed of multiple one-dimensional convolutional layers, with feature dimensions changing as d_img → 512 → d_hash → 100 → d_txt. d_img denotes the dimension of the input original image features, 4096 in this embodiment; d_hash is the dimension of the intermediate-layer features used for hash code generation, whose size is determined by the required hash code length and can be 64, 128, 256, etc.; d_txt is the dimension of the original text features input to the network and also the length of the generated text features, 2000 in this embodiment. Meanwhile, discriminator D_txt engages in a dynamic adversarial game with generator G_m→t, attempting to distinguish the original text features t_ori from the generated text features t_gen. Discriminator D_txt is a feed-forward neural network of fully connected layers, with feature dimensions changing as d_txt → 512 → 16. When generator and discriminator reach dynamic equilibrium under the given training conditions, generator G_m→t can extract well the mapping that generates text data from image data, and thereby obtain the semantic relation between the original image and the generated text data.
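The stated layer dimensions can be sketched as a plain multilayer perceptron (the patent uses one-dimensional convolutions; dense layers, the tanh activation, and the random weights here are simplifying assumptions, shown only to make the dimension flow concrete):

```python
import numpy as np

# Dimensions stated for generator G_m->t: d_img -> 512 -> d_hash -> 100 -> d_txt
d_img, d_hash, d_txt = 4096, 64, 2000
dims = [d_img, 512, d_hash, 100, d_txt]

rng = np.random.default_rng(0)
weights = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(dims, dims[1:])]

def g_m_to_t(m_ori):
    """Forward pass; returns the generated text feature and the
    d_hash-dimensional intermediate (common-subspace) layer."""
    h = m_ori
    acts = []
    for w in weights:
        h = np.tanh(h @ w)        # activation choice is an assumption
        acts.append(h)
    return h, acts[1]

t_gen, m_com = g_m_to_t(rng.standard_normal(d_img))
# t_gen has dimension d_txt = 2000; m_com has dimension d_hash = 64
```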
Step 2: design the second generative adversarial network, containing generator G_t→m and discriminator D_img. Its input is the original image-generated text pair (m_ori, t_gen) obtained in the previous step; it obtains the cycled image m_cyc and extracts the mapping that generates image data from text data, thereby obtaining the semantic relation between text and image data. The specific implementation process is as follows:
As shown in Fig. 1, the lower half of the network can be regarded as the second generative adversarial network, mainly containing generator G_t→m and discriminator D_img; the input at this point is the original image-generated text pair (m_ori, t_gen). As data flow through the network, the generated text t_gen passes through generator G_t→m to obtain the cycled image m_cyc, i.e. m_cyc = G_t→m(t_gen) = G_t→m(G_m→t(m_ori)), and the cycled image features m_cyc are expected to be as similar as possible to the original image features m_ori. Generator G_t→m is composed of multiple one-dimensional deconvolutional layers, with feature dimensions changing as d_txt → 100 → d_hash → 512 → d_img. d_txt is the dimension of the original text features input to the network, 2000 in this embodiment; d_hash is the dimension of the intermediate-layer features used for hash code generation, whose size is determined by the required hash code length, can be 64, 128, 256, etc., and must equal the hash code length in the first generative adversarial network; d_img denotes the dimension of the input original image features and also the length of the finally produced cycled image features, 4096 in this embodiment. Meanwhile, discriminator D_img engages in a dynamic adversarial game with generator G_t→m, attempting to distinguish the cycled image features m_cyc from the original image features m_ori. Discriminator D_img is a feed-forward neural network of fully connected layers, with feature dimensions changing as d_img → 512 → 100 → 16. When generator and discriminator reach dynamic equilibrium under the given training conditions, the network can extract well the mapping that generates image data from text data, and thereby obtain the semantic relation between the generated text and the cycled image data.
Step 3: with the two generative adversarial networks designed in the previous two steps, the direction of data flow can likewise be reversed, finally realizing the generation mappings in the opposite direction and obtaining the semantic relations between image and text data both ways. That is, combining the first two steps, the second generative adversarial network first turns the input original text feature t_ori into the generated image feature m_gen, obtaining the semantic relation between text and image data; the first generative adversarial network then turns the generated image feature m_gen into the cycled text feature t_cyc, obtaining the semantic relation between image and text data. This finally achieves the goal that, during training, image data and text data cycle through the two generative adversarial networks, generating and opposing each other and continuously optimizing the network. The specific implementation process is as follows:
The input data are still the original image-text pair (m_ori, t_ori), and the order of execution is opposite to that of the two steps above. First, the generator G_t→m of the second generative adversarial network turns the input original text feature t_ori into the generated image feature m_gen, i.e. m_gen = G_t→m(t_ori); the feature dimensions in generator G_t→m change as before, d_txt → 100 → d_hash → 512 → d_img. Meanwhile, discriminator D_img engages in a dynamic adversarial game with generator G_t→m, attempting to distinguish the original image features m_ori from the generated image features m_gen. After this adversarial game reaches dynamic equilibrium, generator G_t→m has learned the semantic relation between the original text and the generated image data. Then the generator G_m→t of the first generative adversarial network turns the generated image feature m_gen into the cycled text feature t_cyc, i.e. t_cyc = G_m→t(m_gen) = G_m→t(G_t→m(t_ori)); the feature dimensions in generator G_m→t change as before, d_img → 512 → d_hash → 100 → d_txt. Meanwhile, discriminator D_txt engages in a dynamic adversarial game with generator G_m→t, attempting to distinguish the original text features t_ori from the cycled text features t_cyc. After this adversarial game reaches dynamic equilibrium, generator G_m→t has learned the semantic relation between the generated image and the cycled text data.
Through steps 1, 2, and 3, the bidirectional cycle flow channels of image and text data in the network of the embodiment are established. In one channel, the original image feature data m_ori pass through the first generative adversarial network to obtain the generated text feature t_gen, and t_gen then passes through the second generative adversarial network to produce the cycled image feature m_cyc. In the other channel, the original text data t_ori first pass through the second generative adversarial network to obtain the generated image feature m_gen, and m_gen then passes through the first generative adversarial network to produce the cycled text feature t_cyc. In this way, image and text data are generated in a bidirectional cycle through the two networks, while discriminators D_img and D_txt simultaneously oppose the generators, improving the network's ability to learn the semantic relations between modalities. In the designed loss function of discriminators D_img and D_txt, i denotes the data of the i-th computation among a total of n training samples, and during training the discriminators iteratively learn in the direction that reduces L_disc. Once the bidirectional cycle generative adversarial network is constructed, one of its advantages is that the finally obtained cycle data can be compared with the original data to obtain the cycle loss function, which is also an important component of the generator loss function, where θ1 is a network hyperparameter, set to 0.001 in the present embodiment, and ||·||₂ denotes the L2 distance.
Step 4: to improve cross-modal retrieval efficiency in practice, this method applies a threshold function to extract, from the common subspace of the two generative adversarial networks' generators, hash codes m_hash and t_hash that can represent the image and text features respectively, and designs a likelihood function to evaluate the approximation error between the two kinds of hash codes. The specific implementation process is as follows:
In the two generative adversarial networks, since the generators' inputs and outputs are feature data of different modalities, this example treats the generators' intermediate layer as the common subspace across modalities (as shown in Fig. 1), and in the steps above the feature length of this layer was designed to be the required hash code length d_hash. Let the feature vectors of the intermediate layer be m_com and t_com; the generation formulas are m_hash = sgn(m_com − 0.5) and t_hash = sgn(t_com − 0.5), where sgn is a threshold function. The formulas mean that, for each floating-point number in the intermediate-layer feature vector, the corresponding hash bit is set to +1 when the value is greater than 0.5 and to −1 when the value is less than 0.5. This threshold transformation applies to every component of each training sample's feature vector, so every training sample obtains a hash code of the same length as its feature vector. In the embodiment, the hash codes m_hash, t_hash are used in place of the common-subspace feature vectors m_com, t_com for retrieval, so that the distance computations between floating-point feature vectors in the original retrieval can be replaced by Hamming distance computations between hash codes, greatly improving the retrieval speed.
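The speedup comes from replacing floating-point distance computations with Hamming distances between ±1 codes; a minimal retrieval sketch (the toy database is illustrative):

```python
import numpy as np

def hamming(h1, h2):
    """Hamming distance between two +1/-1 hash codes."""
    return int(np.sum(h1 != h2))

# toy database of text hash codes and one image query hash code
db = np.array([[ 1, -1,  1,  1],
               [-1, -1,  1, -1],
               [ 1,  1,  1,  1]])
query = np.array([1, -1, 1, 1])

dists = [hamming(query, row) for row in db]   # -> [0, 2, 1]
best = int(np.argmin(dists))                  # index of the closest item
```

In practice the Hamming distance over packed bit codes can be computed with XOR and popcount, which is far cheaper than L2 distances over 4096-dimensional float vectors.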
To quantify the approximation error between the feature vectors and the generated hash codes, the present embodiment designs a corresponding loss function as a constraint. The example uses the likelihood of a hash code conditioned on the feature vector, taking the j-th hash bit and the j-th feature component of the i-th sample as an example (the sample may be either an image or a text), where the likelihood is built from a sigmoid function of the feature vector. Based on this likelihood, the embodiment designs a loss function to evaluate the approximation error between the feature vectors and the generated hash codes, where n is the total number of samples and d_hash is the number of bits in the vector. This loss function for assessing the hash-code approximation error acts as one of the network's constraints during training.
Step 5: to build a better-performing network model, the present embodiment uses several constraints to restrict the data features generated during network training, so that they retain strong class characteristics and thereby improve retrieval accuracy. For the multimodal, multi-class nature of the data flow, a manifold constraint is used under unsupervised conditions to guarantee the similarity and dissimilarity of data across modalities and classes; under supervised conditions, where sample class labels are given, a triplet constraint is used to minimize the feature distance between same-class data of different modalities and to maximize the feature distance between data of different classes, whether of the same or of different modalities. The specific implementation process is as follows:
In the supervised case, another small classification network is introduced to impose a classification constraint on the feature vectors obtained from the generators' common subspace. For supervised cross-modal datasets, i.e. when the training data samples carry class labels, in order to make fuller use of the class labels, the present embodiment uses the small classification network to express classes over the common subspace and designs a classification loss function to constrain the generation of the common-subspace feature vectors, so that they differ from the vectors of other layers, carry stronger class information, and can also be classified correctly when predicting classes. In the classification loss function, the predicted class of the i-th sample's feature vector is obtained through the small classification network and c_i is the actual class label of that sample; the classification loss function in fact computes the L2 distance between the two.
A constraint is imposed on the similarity of cross-modal same-class data pairs. Among cross-modal data there are many semantically similar training pairs; for example, an image sample and a text sample in the training data may be highly similar in semantics and share similar class attributes. To exploit this characteristic, the present embodiment links each training image sample with its semantically similar text sample and designs a loss function to constrain cross-modal same-class data. In this loss function, the feature vectors produced in the common subspace by generators G_t→m and G_m→t for images and text are compared, and the loss computes the L2 distance between semantically similar corresponding cross-modal data.
Extending this further, the present embodiment simultaneously considers similarity constraints between same-class data across modalities and within a modality: the distance between the feature vectors of a semantically similar cross-modal training pair should be smaller than the distance to semantically dissimilar feature vectors of the same modality. Under supervised training, since the data all carry class labels, a triplet constraint is used to minimize the distance between cross-modal data vectors under the same semantic label. The triplet constraint is illustrated in Fig. 2, where icons of different shapes represent data of different classes and different textures represent different modalities; in the feature space, data lie close to same-class data of the same modality or of the other modality, and farther from cross-modal data of different classes. In the embodiment, taking a generated image datum as an example (the feature label of a generated datum is simply the class label of its original input), a text datum t_α,i with the same label is chosen first, and a text datum t_β,i of a different class is chosen at random, where α, β represent the two class labels, * denotes generated data, and i denotes the data of the i-th computation. The triplet constraint for the generated image seeks to minimize the distance to t_α,i while maximizing the distance to t_β,i. Likewise, for a generated text datum, its triplet constraint involves m_α,i and m_β,i. The triplet constraint loss function is designed accordingly.
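A hedged sketch of such a triplet objective (the hinge form and the margin value are assumptions not stated in the text):

```python
import numpy as np

def triplet_loss(anchor, pos, neg, margin=1.0):
    """Pull the same-label cross-modal pair together and push the
    different-label sample away, up to a margin."""
    d_pos = np.linalg.norm(anchor - pos)
    d_neg = np.linalg.norm(anchor - neg)
    return max(0.0, d_pos - d_neg + margin)

m_gen = np.array([0.0, 0.0])     # generated image feature, label alpha
t_same = np.array([0.1, 0.0])    # text feature with the same label
t_diff = np.array([3.0, 4.0])    # text feature with a different label
loss = triplet_loss(m_gen, t_same, t_diff)   # constraint already satisfied
```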
For unsupervised training, the present embodiment designs a manifold constraint to guarantee the similarity of semantically similar data within a modality and across modalities. Since data trained without supervision carry no class labels, the present embodiment constructs a k-nearest-neighbor matrix to ensure that semantically similar data are aggregated and semantically different data are separated. As shown in Fig. 3, after computing the kNN matrix the embodiment establishes a similarity matrix for the data to be constrained, and then applies the manifold constraint to the feature vectors in the common subspace. Taking the generated image data obtained from text data t_α as an example, according to the kNN computation for t_α, its k nearest data (k is set to 2 in the present embodiment) are recorded as 1 in the similarity matrix and the non-neighboring data as 0. After the text data are generated into image feature vectors, generated image feature vectors whose corresponding text data are marked 1 in the similarity matrix are randomly selected as neighbors, and those whose corresponding text data are marked 0 as non-neighbors. The manifold constraint then minimizes the distance between a generated feature vector and its neighbors, guaranteeing high similarity among the generated feature vectors of semantically close data, and maximizes the distance to its non-neighbors, guaranteeing low similarity among the generated feature vectors of semantically different data. The same is done symmetrically for the generated text data. The manifold constraint loss function is designed accordingly.
In conclusion, the generator loss function is composed of the loss functions of the constraints above. Under supervised data training, the generator loss function consists of the cycle loss function, the hash-code loss function, the triplet constraint loss function, the cross-modal same-semantics data loss function, and the classification loss function, with the formula as follows:

wherein θ2, θ3, θ4 and θ5 are adjustable hyperparameters of the network, set to 5, 5, 0.001 and 20 respectively in this embodiment. Under unsupervised data training, the generator loss function consists of the cycle loss function, the hash-code loss function, the manifold constraint loss function, and the cross-modal same-semantics data loss function, with the formula as follows:

The hyperparameter values are the same as set above.
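As an illustration of the weighted combination above, the sketch below uses the embodiment's values θ2..θ5 = 5, 5, 0.001, 20; the assignment of each θ to a loss term in the listed order is an assumption, and the individual loss values are stand-ins computed elsewhere:

```python
# Hyperparameters θ2..θ5 as set in this embodiment.
THETA2, THETA3, THETA4, THETA5 = 5.0, 5.0, 0.001, 20.0

def generator_loss(l_cyc, l_hash, l_constraint, l_cm, l_cls=None):
    """Weighted sum of the generator's loss terms.

    l_constraint is the triplet loss (supervised) or the manifold loss
    (unsupervised); l_cls is only supplied under supervised training.
    The theta-to-term assignment follows the listed order (assumption).
    """
    loss = l_cyc + THETA2 * l_hash + THETA3 * l_constraint + THETA4 * l_cm
    if l_cls is not None:                  # supervised case adds classification
        loss += THETA5 * l_cls
    return loss
```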
Having designed the discriminator loss function and the generator loss function in the five steps above, a common minimax algorithm is used to iteratively minimize the network losses, thereby establishing the semantic relations among the multi-modal data. In this embodiment the minimax algorithm uses stochastic gradient descent optimization, specifically the more stable RMSProp optimizer. Because the discriminator and the generator confront each other, their update directions are opposite: in every round of iteration each side counters the other side's result from the previous round, and in this mutual confrontation they reach a dynamic equilibrium. The calculation method is as follows:
Because the discriminator trains relatively fast in practice, this method trains the discriminator for one iteration after every S generator iterations. The training hyperparameter S is set to 10 in this embodiment, the learning rate μ of the network is set to 0.0001, and the sample size of each training batch (batch size) is set to 64. Meanwhile, the weights learned by the network are clipped: after each training step, weights in the generator greater than c_gen are set to c_gen, and weights in the discriminator greater than c_disc are set to c_disc, so that the learned weights do not grow too large.
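The alternating schedule with weight clipping can be sketched as follows (illustrative only; the gen_step/disc_step update functions are placeholders, the clipping thresholds C_GEN and C_DISC are assumed values, and symmetric magnitude clipping is an assumption where the text only mentions the upper bound):

```python
import numpy as np

S, LR, BATCH = 10, 1e-4, 64        # schedule hyperparameters from this embodiment
C_GEN, C_DISC = 0.5, 0.5           # clipping thresholds (illustrative values)

def clip_weights(w, c):
    """Clip weight magnitudes to c (symmetric clipping is an assumption)."""
    return np.clip(w, -c, c)

def train(gen_step, disc_step, gen_w, disc_w, iters):
    """Update the generator every step; update the discriminator once per S steps."""
    for t in range(iters):
        gen_w = clip_weights(gen_step(gen_w), C_GEN)
        if (t + 1) % S == 0:               # discriminator learns faster, so it
            disc_w = clip_weights(disc_step(disc_w), C_DISC)   # trains less often
    return gen_w, disc_w
```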
Step 6: the trained neural network is used for cross-modal data retrieval. The feature vectors obtained by the generators in the common subspace are compressed into hash codes, and the Hamming distance between the hash codes of different data is then used for retrieval. The specific implementation process is described as follows:
After the network training and learning described above, the generators in this embodiment have acquired the ability to extract the semantic-relation information between the image and text modalities. The embodiment can now perform bidirectional cross-modal retrieval. First, the trained weight parameters of the network are fixed. The image and text data to be retrieved, m_test and t_test, are passed through the trained generators G_m→t and G_t→m to obtain the feature vectors m_com and t_com in the common subspace, which are then converted into the hash codes m_hash and t_hash for use. When retrieving text with an image, the hash code of that image is taken out and its Hamming distance to the hash codes of all texts is computed; the text represented by the nearest hash code is the image→text cross-modal retrieval result. When retrieving images with text, the hash code of that text is taken out and its Hamming distance to the hash codes of all images is computed; the image represented by the nearest hash code is the text→image cross-modal retrieval result.
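The quantization and Hamming-distance retrieval steps can be sketched as follows (a minimal NumPy sketch of the sgn(x − 0.5) coding and nearest-neighbor lookup; function names are illustrative):

```python
import numpy as np

def to_hash(features):
    """Quantize common-subspace features: bit = +1 if value > 0.5, else -1."""
    return np.where(features > 0.5, 1, -1)

def hamming(a, b):
    """Hamming distance between two ±1 hash codes of equal length."""
    return int(np.sum(a != b))

def retrieve(query_hash, db_hashes):
    """Index of the database hash code nearest to the query in Hamming distance."""
    return int(np.argmin([hamming(query_hash, h) for h in db_hashes]))
```

For image→text retrieval the query is an image hash code and the database holds all text hash codes; for text→image retrieval the roles are swapped.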
The above embodiments merely illustrate the design philosophy and features of the present invention; their purpose is to enable those skilled in the art to understand the content of the present invention and implement it accordingly, and the protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent variations or modifications made according to the principles and design ideas disclosed by the present invention fall within the protection scope of the present invention.
Claims (7)
1. A cross-modal retrieval method based on a cycle generative adversarial network, characterized by comprising the following steps:
designing two cycle modules, wherein the two cycle modules share two generators with an identical network structure, and hash coding is applied to the output data of the generators' middle layers; the purpose of a generator is to generate, through training, cross-modal data that are as realistic as possible;
one of the cycle modules realizes the process modality m → modality t → modality m through the two generators, and the other cycle module realizes the process modality t → modality m → modality t through the same two generators;
designing a discriminator in each cycle module, wherein the discriminator classifies the generated data and the original data of the same modality and carries out dynamic confrontation with the generator, so that the generator and the discriminator finally reach a dynamic equilibrium under the given training conditions.
2. The cross-modal retrieval method based on a cycle generative adversarial network according to claim 1, characterized in that: for the multi-modal, multi-class characteristics of the data stream, a manifold constraint is used under unsupervised conditions to guarantee the similarity and difference of data between modalities and between classes; since class labels are given under supervised conditions, a triplet constraint is used to minimize the feature distance between data of the same class but different modalities, and to maximize the feature distance between data of different classes.
3. The cross-modal retrieval method based on a cycle generative adversarial network according to claim 2, characterized in that the loss function of the discriminator is specifically:

obtained by comparing the finally generated data of the same modality with the original data, namely:

wherein i denotes the data of the i-th computation and there are n training sample data in total; during training the discriminator iteratively learns in the direction of reducing L_disc; D_img and D_txt denote the two discriminators, (m_ori, t_ori) denote the original modality-m and original modality-t data, m_cyc denotes the generated modality-m features, and t_cyc denotes the generated modality-t features.
4. The cross-modal retrieval method based on a cycle generative adversarial network according to claim 3, characterized in that the loss function of the generator is specifically:

wherein θ1 is a hyperparameter of the network and ||*||2 denotes the L2 distance.
5. The cross-modal retrieval method based on a cycle generative adversarial network according to claim 4, characterized in that: letting the middle-layer feature vectors of the two generators be m_com and t_com, the hash codes are generated by the formulas:
m_hash = sgn(m_com - 0.5)
t_hash = sgn(t_com - 0.5)
wherein sgn is a threshold function; the formulas mean that, for each floating-point number in the middle-layer floating-point feature vector, the corresponding hash code bit is set to +1 when the value is greater than 0.5 and to -1 when the value is less than 0.5.
6. The cross-modal retrieval method based on a cycle generative adversarial network according to claim 5, characterized in that: to quantify the approximation error between the feature vectors and the generated hash codes, this method designs a corresponding loss function as a constraint, specifically using the likelihood function of the hash codes conditioned on the feature vectors; taking the j-th bit of the hash code of the i-th sample and the j-th bit of its feature vector as an example (the sample may be either an image or a text):

wherein the sigmoid function of the feature vector is:

a loss function is further designed from the likelihood function to evaluate the approximation error between the feature vectors and the generated hash codes:

wherein n is the total number of samples and d_hash is the number of bits of the vector.
7. The cross-modal retrieval method based on a cycle generative adversarial network according to claim 6, characterized in that: a classification constraint is applied to the middle-layer feature vectors of the generators, and the classification loss function is designed as:
wherein the predicted class of the i-th sample is obtained from its feature vector by a small classification network, c_i is the actual class label of that sample, and the classification loss function actually computes the L2 distance between the two; to constrain the similarity of cross-modal data with the same semantics, this method links the training image sample data with their semantically similar text sample data, and designs a loss function to constrain the cross-modal same-semantics data; the loss function formula is as follows:
wherein the feature vectors of the generated images and texts in the common subspace are produced by the generators G_t→m and G_m→t respectively, and the loss function computes the L2 distance between the semantically similar corresponding cross-modal data; under supervised data training, since all data have class labels, a triplet constraint is used to minimize the distance between the cross-modal data vectors under the same semantic label; the designed triplet loss function is:
wherein m and t denote the image and text data respectively, α and β denote two class labels, * denotes generated data, and i denotes the data of the i-th computation; for unsupervised training, this method designs a manifold constraint to guarantee the similarity of semantically similar data within the same modality and across modalities: after computing the kNN matrix, a similarity matrix is established for the data to be constrained, and the manifold constraint is then applied to the feature vectors in the common subspace; the manifold constraint loss function is designed as follows:
wherein neib and non denote neighboring and non-neighboring data respectively, and the other symbols have the same meanings as before; combining the above functions, the generator loss function under supervised data training is designed as:

and the generator loss function under unsupervised data training is designed as:
wherein θ2, θ3, θ4 and θ5 are weight hyperparameters of the network; the whole network is trained iteratively using the RMSProp stochastic gradient descent optimization algorithm, with the iteration formula:

since the discriminator's gradient descends quickly in practice, this method iterates the discriminator once for every S generator training iterations, and uses the hyperparameters c_gen and c_disc to clip the network weights, preventing the network weights from growing too large.
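As a hedged illustration of the likelihood-based quantization loss of claim 6, the sketch below assumes p(bit = +1 | f) = sigmoid(f) and averages the negative log-likelihood over n samples and d_hash bits; the exact patented formula is not reproduced in the text, so this form is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hash_likelihood_loss(features, hash_codes):
    """Average negative log-likelihood of the ±1 hash bits given the features.

    Assumes p(bit = +1 | f) = sigmoid(f); 'features' is (n, d_hash),
    'hash_codes' holds ±1 entries of the same shape.
    """
    n, d_hash = features.shape
    p = sigmoid(features)                      # probability each bit is +1
    p_bit = np.where(hash_codes > 0, p, 1.0 - p)
    return float(-np.sum(np.log(p_bit)) / (n * d_hash))
```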
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811455802.6A CN109299342B (en) | 2018-11-30 | 2018-11-30 | Cross-modal retrieval method based on cycle generation type countermeasure network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811455802.6A CN109299342B (en) | 2018-11-30 | 2018-11-30 | Cross-modal retrieval method based on cycle generation type countermeasure network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109299342A true CN109299342A (en) | 2019-02-01 |
CN109299342B CN109299342B (en) | 2021-12-17 |
Family
ID=65142338
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811455802.6A Active CN109299342B (en) | 2018-11-30 | 2018-11-30 | Cross-modal retrieval method based on cycle generation type countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109299342B (en) |
- 2018-11-30: CN CN201811455802.6A patent/CN109299342B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140168077A1 (en) * | 2012-12-14 | 2014-06-19 | Barnesandnoble.Com Llc | Multi-touch navigation mode |
CN103473307A (en) * | 2013-09-10 | 2013-12-25 | 浙江大学 | Cross-media sparse Hash indexing method |
CN106547826A (en) * | 2016-09-30 | 2017-03-29 | 西安电子科技大学 | A kind of cross-module state search method, device and computer-readable medium |
CN108510559A (en) * | 2017-07-19 | 2018-09-07 | 哈尔滨工业大学深圳研究生院 | It is a kind of based on have supervision various visual angles discretization multimedia binary-coding method |
CN107871014A (en) * | 2017-11-23 | 2018-04-03 | 清华大学 | A kind of big data cross-module state search method and system based on depth integration Hash |
CN108256627A (en) * | 2017-12-29 | 2018-07-06 | 中国科学院自动化研究所 | The mutual generating apparatus of audio-visual information and its training system that generation network is fought based on cycle |
Non-Patent Citations (1)
Title |
---|
OU Weihua et al., "A Survey of Cross-modal Retrieval Research", Journal of Guizhou Normal University (Natural Science Edition) * |
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110019652A (en) * | 2019-03-14 | 2019-07-16 | 九江学院 | A kind of cross-module state Hash search method based on deep learning |
CN110019652B (en) * | 2019-03-14 | 2022-06-03 | 九江学院 | Cross-modal Hash retrieval method based on deep learning |
CN110032734B (en) * | 2019-03-18 | 2023-02-28 | 百度在线网络技术(北京)有限公司 | Training method and device for similar meaning word expansion and generation of confrontation network model |
CN110032734A (en) * | 2019-03-18 | 2019-07-19 | 百度在线网络技术(北京)有限公司 | Near synonym extension and generation confrontation network model training method and device |
CN110059157A (en) * | 2019-03-18 | 2019-07-26 | 华南师范大学 | A kind of picture and text cross-module state search method, system, device and storage medium |
CN110222140A (en) * | 2019-04-22 | 2019-09-10 | 中国科学院信息工程研究所 | A kind of cross-module state search method based on confrontation study and asymmetric Hash |
CN110222140B (en) * | 2019-04-22 | 2021-07-13 | 中国科学院信息工程研究所 | Cross-modal retrieval method based on counterstudy and asymmetric hash |
CN111127385A (en) * | 2019-06-06 | 2020-05-08 | 昆明理工大学 | Medical information cross-modal Hash coding learning method based on generative countermeasure network |
CN110309861A (en) * | 2019-06-10 | 2019-10-08 | 浙江大学 | A kind of multi-modal mankind's activity recognition methods based on generation confrontation network |
CN110309861B (en) * | 2019-06-10 | 2021-05-25 | 浙江大学 | Multi-modal human activity recognition method based on generation of confrontation network |
US11823429B2 (en) | 2019-07-03 | 2023-11-21 | Institute Of Automation, Chinese Academy Of Sciences | Method, system and device for difference automatic calibration in cross modal target detection |
WO2021000664A1 (en) * | 2019-07-03 | 2021-01-07 | 中国科学院自动化研究所 | Method, system, and device for automatic calibration of differences in cross-modal target detection |
CN110443309A (en) * | 2019-08-07 | 2019-11-12 | 浙江大学 | A kind of electromyography signal gesture identification method of combination cross-module state association relation model |
CN112487217A (en) * | 2019-09-12 | 2021-03-12 | 腾讯科技(深圳)有限公司 | Cross-modal retrieval method, device, equipment and computer-readable storage medium |
CN110909181A (en) * | 2019-09-30 | 2020-03-24 | 中国海洋大学 | Cross-modal retrieval method and system for multi-type ocean data |
CN110930469A (en) * | 2019-10-25 | 2020-03-27 | 北京大学 | Text image generation method and system based on transition space mapping |
CN110930469B (en) * | 2019-10-25 | 2021-11-16 | 北京大学 | Text image generation method and system based on transition space mapping |
CN110990595A (en) * | 2019-12-04 | 2020-04-10 | 成都考拉悠然科技有限公司 | Zero sample cross-mode retrieval method for cross-domain alignment embedding space |
CN110990595B (en) * | 2019-12-04 | 2023-05-05 | 成都考拉悠然科技有限公司 | Cross-domain alignment embedded space zero sample cross-modal retrieval method |
CN111104982B (en) * | 2019-12-20 | 2021-09-24 | 电子科技大学 | Label-independent cross-task confrontation sample generation method |
CN111104982A (en) * | 2019-12-20 | 2020-05-05 | 电子科技大学 | Label-independent cross-task confrontation sample generation method |
CN111353076B (en) * | 2020-02-21 | 2023-10-10 | 华为云计算技术有限公司 | Method for training cross-modal retrieval model, cross-modal retrieval method and related device |
CN111353076A (en) * | 2020-02-21 | 2020-06-30 | 华为技术有限公司 | Method for training cross-modal retrieval model, cross-modal retrieval method and related device |
WO2021189383A1 (en) * | 2020-03-26 | 2021-09-30 | 深圳先进技术研究院 | Training and generation methods for generating high-energy ct image model, device, and storage medium |
CN111523663A (en) * | 2020-04-22 | 2020-08-11 | 北京百度网讯科技有限公司 | Model training method and device and electronic equipment |
CN111523663B (en) * | 2020-04-22 | 2023-06-23 | 北京百度网讯科技有限公司 | Target neural network model training method and device and electronic equipment |
CN111581405B (en) * | 2020-04-26 | 2021-10-26 | 电子科技大学 | Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning |
CN111581405A (en) * | 2020-04-26 | 2020-08-25 | 电子科技大学 | Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning |
CN111783980A (en) * | 2020-06-28 | 2020-10-16 | 大连理工大学 | Ranking learning method based on dual cooperation generation type countermeasure network |
CN111881884A (en) * | 2020-08-11 | 2020-11-03 | 中国科学院自动化研究所 | Cross-modal transformation assistance-based face anti-counterfeiting detection method, system and device |
CN112199462A (en) * | 2020-09-30 | 2021-01-08 | 三维通信股份有限公司 | Cross-modal data processing method and device, storage medium and electronic device |
WO2022068195A1 (en) * | 2020-09-30 | 2022-04-07 | 三维通信股份有限公司 | Cross-modal data processing method and device, storage medium and electronic device |
CN112364192A (en) * | 2020-10-13 | 2021-02-12 | 中山大学 | Zero sample Hash retrieval method based on ensemble learning |
WO2022104540A1 (en) * | 2020-11-17 | 2022-05-27 | 深圳大学 | Cross-modal hash retrieval method, terminal device, and storage medium |
CN113706646A (en) * | 2021-06-30 | 2021-11-26 | 酷栈(宁波)创意科技有限公司 | Data processing method for generating landscape painting |
CN113204522A (en) * | 2021-07-05 | 2021-08-03 | 中国海洋大学 | Large-scale data retrieval method based on Hash algorithm combined with generation countermeasure network |
CN113779283A (en) * | 2021-11-11 | 2021-12-10 | 南京码极客科技有限公司 | Fine-grained cross-media retrieval method with deep supervision and feature fusion |
CN116524420A (en) * | 2023-07-03 | 2023-08-01 | 武汉大学 | Key target detection method and system in traffic scene |
CN116524420B (en) * | 2023-07-03 | 2023-09-12 | 武汉大学 | Key target detection method and system in traffic scene |
Also Published As
Publication number | Publication date |
---|---|
CN109299342B (en) | 2021-12-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109299342A (en) | Cross-modal retrieval method based on a cycle generative adversarial network | |
Yu et al. | Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering | |
Li et al. | Factorizable net: an efficient subgraph-based framework for scene graph generation | |
CN110298037B (en) | Convolutional neural network matching text recognition method based on enhanced attention mechanism | |
Yan et al. | Image classification by cross-media active learning with privileged information | |
Lai et al. | Instance-aware hashing for multi-label image retrieval | |
Qu et al. | Joint hierarchical category structure learning and large-scale image classification | |
CN109918528A (en) | A kind of compact Hash code learning method based on semanteme protection | |
CN109558487A (en) | Document Classification Method based on the more attention networks of hierarchy | |
CN111753189A (en) | Common characterization learning method for few-sample cross-modal Hash retrieval | |
Wang et al. | Facilitating image search with a scalable and compact semantic mapping | |
Zhang et al. | Deep relation embedding for cross-modal retrieval | |
Liu et al. | Improving cross-modal image-text retrieval with teacher-student learning | |
CN108427740B (en) | Image emotion classification and retrieval algorithm based on depth metric learning | |
CN109271539A (en) | A kind of image automatic annotation method and device based on deep learning | |
Islam et al. | InceptB: a CNN based classification approach for recognizing traditional bengali games | |
CN110688502A (en) | Image retrieval method and storage medium based on depth hash and quantization | |
Kim et al. | Exploiting web images for video highlight detection with triplet deep ranking | |
Shen et al. | Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description. | |
Feng et al. | Learning to rank image tags with limited training examples | |
CN109960732A (en) | A kind of discrete Hash cross-module state search method of depth and system based on robust supervision | |
Wang et al. | A deep clustering via automatic feature embedded learning for human activity recognition | |
CN115827954A (en) | Dynamically weighted cross-modal fusion network retrieval method, system and electronic equipment | |
Lu et al. | A sustainable solution for IoT semantic interoperability: Dataspaces model via distributed approaches | |
CN113779283B (en) | Fine-grained cross-media retrieval method with deep supervision and feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||