CN108319686A - Adversarial cross-media retrieval method based on a limited text space - Google Patents
Info
- Publication number
- CN108319686A CN108319686A CN201810101127.0A CN201810101127A CN108319686A CN 108319686 A CN108319686 A CN 108319686A CN 201810101127 A CN201810101127 A CN 201810101127A CN 108319686 A CN108319686 A CN 108319686A
- Authority
- CN
- China
- Prior art keywords
- feature
- text
- image
- network
- space
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 52
- 230000008485 antagonism Effects 0.000 title claims abstract description 28
- 238000013507 mapping Methods 0.000 claims abstract description 86
- 238000000605 extraction Methods 0.000 claims abstract description 57
- 238000012549 training Methods 0.000 claims abstract description 54
- 239000000284 extract Substances 0.000 claims abstract description 16
- 230000008569 process Effects 0.000 claims abstract description 16
- 230000007246 mechanism Effects 0.000 claims abstract description 12
- 230000003042 antagonistic effect Effects 0.000 claims abstract description 7
- 238000013461 design Methods 0.000 claims abstract description 7
- 230000006870 function Effects 0.000 claims description 63
- 239000013598 vector Substances 0.000 claims description 30
- 238000013528 artificial neural network Methods 0.000 claims description 11
- 230000000306 recurrent effect Effects 0.000 claims description 8
- 230000004927 fusion Effects 0.000 claims description 7
- 239000011159 matrix material Substances 0.000 claims description 7
- 238000005457 optimization Methods 0.000 claims description 7
- 230000004913 activation Effects 0.000 claims description 4
- 238000010606 normalization Methods 0.000 claims description 4
- 238000005259 measurement Methods 0.000 claims description 3
- 230000000052 comparative effect Effects 0.000 claims description 2
- 230000007774 longterm Effects 0.000 claims description 2
- 230000001360 synchronised effect Effects 0.000 claims description 2
- 230000008859 change Effects 0.000 claims 1
- 230000007787 long-term memory Effects 0.000 claims 1
- 230000006399 behavior Effects 0.000 abstract description 4
- 230000000694 effects Effects 0.000 description 13
- 238000012360 testing method Methods 0.000 description 10
- 230000001149 cognitive effect Effects 0.000 description 7
- 238000013527 convolutional neural network Methods 0.000 description 7
- 230000007423 decrease Effects 0.000 description 5
- 238000010219 correlation analysis Methods 0.000 description 4
- 230000002452 interceptive effect Effects 0.000 description 4
- 230000001537 neural effect Effects 0.000 description 4
- 210000004556 brain Anatomy 0.000 description 3
- 238000005520 cutting process Methods 0.000 description 3
- 230000007547 defect Effects 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 2
- 230000008901 benefit Effects 0.000 description 2
- 230000000295 complement effect Effects 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000002708 enhancing effect Effects 0.000 description 2
- 238000002474 experimental method Methods 0.000 description 2
- 239000012634 fragment Substances 0.000 description 2
- 238000010348 incorporation Methods 0.000 description 2
- 230000015654 memory Effects 0.000 description 2
- 239000000047 product Substances 0.000 description 2
- 230000006403 short-term memory Effects 0.000 description 2
- 230000000007 visual effect Effects 0.000 description 2
- 238000009825 accumulation Methods 0.000 description 1
- 230000009471 action Effects 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000008676 import Effects 0.000 description 1
- 238000003780 insertion Methods 0.000 description 1
- 230000037431 insertion Effects 0.000 description 1
- 238000007689 inspection Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000004088 simulation Methods 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 239000013589 supplement Substances 0.000 description 1
- 230000002459 sustained effect Effects 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5846—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Library & Information Science (AREA)
- Fuzzy Systems (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an adversarial cross-media retrieval method based on a limited text space. A feature extraction network, a feature mapping network and a modality classifier are designed; the limited text space is obtained by learning; image and text features suited to cross-media retrieval are extracted, and image features are mapped from the image space into the text space. An adversarial training mechanism continually reduces the difference in feature distribution between data of different modalities during learning, thereby realizing cross-media retrieval. The invention better fits human behavior in cross-media retrieval tasks; it learns image and text features better suited to the cross-media retrieval task, compensating for the limited expressive power of pre-trained features; and it introduces an adversarial learning mechanism that further improves retrieval accuracy through a minimax game between the modality classifier and the feature mapping network.
Description
Technical field
The present invention relates to the technical field of computer vision, and in particular to an adversarial cross-media retrieval method based on a limited text space.
Background technology
With the arrival of the Web 2.0 era, large amounts of multimedia data (images, text, video, audio, etc.) have begun to accumulate and propagate on the Internet. Unlike traditional single-modality retrieval tasks, cross-media retrieval performs bidirectional retrieval between data of different modalities, such as retrieving images with text and retrieving text with images. However, because multimedia data are inherently heterogeneous, their similarity cannot be measured directly. The key problem of this task is therefore to find an isomorphic mapping space in which the similarity between heterogeneous multimedia data can be measured directly. In the field of cross-media retrieval, extensive research has been carried out on this problem, and a series of representative cross-media retrieval algorithms have been proposed, such as CCA (Canonical Correlation Analysis), DeViSE (Deep Visual-Semantic Embedding) and DSPE (Deep Structure-Preserving Image-Text Embeddings). However, these methods suffer from certain drawbacks.
The first drawback concerns the feature representation of multimedia data. Existing methods mostly extract image features with pre-trained CNN (convolutional neural network) models, such as VGG (the network architecture proposed by the Visual Geometry Group). However, these models are usually pre-trained on image classification tasks, so the extracted image features contain only the category information of objects and lose information that may be critical for cross-media retrieval, such as the actions of objects and the interactions between them. For text, Word2Vec, LDA (Latent Dirichlet Allocation) and FV (Fisher Vector) are mainstream text features. However, they too are pre-trained on datasets that differ from cross-media retrieval data, so the extracted features are not well suited to cross-media retrieval.
The second drawback concerns the choice of the isomorphic feature space. There are essentially three choices: a common space, the text space, and the image space. From the perspective of human cognition, the brain understands text and images differently. For text, the brain extracts features and understands them directly; for an image, the brain subconsciously describes it in words before understanding it, i.e., it first converts from the image space to the text space. Performing cross-media retrieval in the text space therefore better simulates human cognition. Existing text-space methods mostly use the Word2Vec space as the final text space, representing an image in that space by combining the category information of the objects it contains. Such features again lose the rich action and interaction information contained in the image, which shows that for cross-media retrieval the Word2Vec space is not an effective text feature space.
The third drawback concerns the difference in feature distribution between data of different modalities. Although existing methods map the features of different modalities into some isomorphic feature space, the modality gap between them still exists and their feature distributions differ noticeably, which degrades cross-media retrieval performance.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides an adversarial cross-media retrieval method based on a limited text space. It first learns image and text feature representations matched to the cross-media retrieval task; it then finds a limited text space by simulating human cognition, in which the similarity between images and text is measured. The method also introduces an adversarial training mechanism intended to reduce the difference in feature distribution between data of different modalities while the text space is being learned, thereby increasing retrieval accuracy.
The principle of the invention is as follows. As described in the background, the key problem of cross-media retrieval is to find an isomorphic mapping space in which the similarity between heterogeneous multimedia data can be measured directly. More precisely, this core problem can be divided into two sub-problems. The first is how to learn effective feature representations of multimedia data. The second is how to find a suitable isomorphic feature space. The cross-media retrieval method based on a limited text space proposed by the present invention comprises a feature extraction network, a feature mapping network and a modality classifier. For the first sub-problem, the invention uses the feature extraction network to learn effective image and text feature representations. Building on the image captioning task, the invention learns a new kind of image feature by combining an image captioning algorithm with a CNN; this feature contains not only the category information of the objects in an image but also rich interaction information between them. For text features, a recurrent neural network (RNN) is trained from scratch to learn text features suited to the cross-media retrieval task. For the second sub-problem, the invention uses the feature mapping network to learn a limited text space. To further reduce the difference between features of different modalities, a modality classifier is designed to play a minimax game with the feature mapping network: the modality classifier tries to distinguish the modality of the current limited-text-space feature, while the feature mapping network tries to learn modality-invariant features that confuse the classifier. During training, in addition to the conventional triplet loss, an additional adversarial loss is back-propagated from the modality classifier to the feature mapping network to further reduce the difference between features of different modalities. "Limited text space" indicates that the text space learned by this method is spanned by a set of basis vectors, which can be regarded as the words of a dictionary; the expressive power of the space is therefore restricted by the number of words in the dictionary, hence "limited". The method mainly learns the limited text space to realize similarity measurement between images and text. Based on this space, it simulates human cognition, extracts image and text features suited to cross-media retrieval, maps image features from the image space into the text space, and introduces an adversarial training mechanism to continually reduce the difference in feature distribution between data of different modalities during learning. The method achieves accurate retrieval results on classic cross-media retrieval datasets.
The technical solution provided by the invention is:
An adversarial cross-media retrieval method based on a limited text space. Using a feature extraction network, a feature mapping network and a modality classifier, a limited text space is obtained by learning; image and text features suited to cross-media retrieval are extracted, and image features are mapped from the image space into the text space. An adversarial training mechanism continually reduces the difference in feature distribution between data of different modalities during learning. The invention first trains the feature extraction network, the feature mapping network and the modality classifier on a dataset D, and then applies the trained networks to retrieval request data to realize adversarial cross-media retrieval. The steps are as follows:
Assume the training dataset D = {D1, D2, …, Dn} contains n samples. Each sample Di comprises one image Ii and one piece of descriptive text Ti, i.e., Di = (Ii, Ti). Each piece of text consists of several (here, 5) sentences, each of which independently describes the matching image; every image thus has 5 descriptive sentences that are similar in meaning but differ in wording.
1) Extract the features of the images and text in D with the feature extraction network.
For images, features are extracted by combining an existing VGG model with an image captioning algorithm (Neural Image Captioning, NIC). For text, features are extracted with an LSTM (Long Short-Term Memory recurrent neural network). Since the LSTM network is not pre-trained, its parameters are updated synchronously with those of the feature mapping network.
The computation of image feature extraction is expressed as formula 1:
I_Concat = Concatenate(VGGNet(I), NIC(I)) (formula 1)
where VGGNet(·) is the 19-layer VGG model, which extracts the 4096-dimensional feature I_VGG of the input image; NIC(·) is the image captioning algorithm, which extracts the 512-dimensional image feature I_NIC; and Concatenate(·) is a feature concatenation layer that joins I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
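As a minimal numpy sketch of formula 1, the concatenation step can be written as below; random arrays stand in for the real VGGNet and NIC outputs, which the patent's networks would produce:

```python
import numpy as np

def extract_image_feature(i_vgg: np.ndarray, i_nic: np.ndarray) -> np.ndarray:
    """Join the 4096-d VGG feature and the 512-d NIC feature into a
    single 4608-d image descriptor, as in formula 1."""
    assert i_vgg.shape == (4096,) and i_nic.shape == (512,)
    return np.concatenate([i_vgg, i_nic])

rng = np.random.default_rng(0)
i_vgg = rng.random(4096)   # stand-in for VGGNet(image)
i_nic = rng.random(512)    # stand-in for NIC(image)
i_concat = extract_image_feature(i_vgg, i_nic)
print(i_concat.shape)      # (4608,)
```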
Text feature extraction executes the following steps:
Given a text S = (s0, s1, …, sT) of length T, each word st in S is represented by a 1-of-k encoding, where k is the number of words in the dictionary. Before being fed into the LSTM network, each word st is first mapped into a denser space, as in formula 2:
x_t = W_e s_t, t ∈ {0, …, T} (formula 2)
where W_e is the word embedding matrix, used to encode the 1-of-k vector s_t into a d-dimensional word vector.
The word vectors in the dense space are fed into the LSTM network, expressed as formula 3 (the standard LSTM recurrence):
i_t = σ(W_ix x_t + W_ih h_{t−1} + b_i)
f_t = σ(W_fx x_t + W_fh h_{t−1} + b_f)
o_t = σ(W_ox x_t + W_oh h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_cx x_t + W_ch h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t) (formula 3)
where i_t, f_t, o_t, c_t and h_t denote the input gate, forget gate, output gate, memory cell and hidden-layer output of the LSTM unit at time t; x_t is the word-vector input at the current time; h_{t−1} is the hidden-layer output of the LSTM unit at the previous time; σ denotes the sigmoid function; ⊙ denotes element-wise multiplication; and tanh denotes the hyperbolic tangent activation function.
The hidden-layer output h_T of the LSTM network at time T is the feature representation of the text S.
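The embedding and recurrence of formulas 2 and 3 can be sketched in numpy as follows; this is an illustrative toy (random weights, tiny dictionary), not the trained network, and the stacked-gate layout is one common way to organize the standard LSTM parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b, d):
    """One step of the standard LSTM recurrence (formula 3):
    gates i, f, o and candidate g; c_t = f*c_{t-1} + i*g; h_t = o*tanh(c_t)."""
    z = W @ x_t + U @ h_prev + b          # stacked pre-activations, shape (4d,)
    i = sigmoid(z[:d])                    # input gate
    f = sigmoid(z[d:2*d])                 # forget gate
    o = sigmoid(z[2*d:3*d])               # output gate
    g = np.tanh(z[3*d:])                  # candidate memory
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

def encode_text(word_ids, W_e, W, U, b, d):
    """Embed 1-of-k words (formula 2: x_t = W_e s_t) and run the LSTM;
    the final hidden state h_T is the text feature."""
    h, c = np.zeros(d), np.zeros(d)
    for t in word_ids:
        x_t = W_e[:, t]                   # column t equals W_e @ one_hot(t)
        h, c = lstm_step(x_t, h, c, W, U, b, d)
    return h

k, d = 50, 8                              # toy dictionary size and feature dim
W_e = rng.standard_normal((d, k))
W = rng.standard_normal((4 * d, d))
U = rng.standard_normal((4 * d, d))
b = np.zeros(4 * d)
h_T = encode_text([3, 17, 42], W_e, W, U, b, d)
print(h_T.shape)                          # (8,)
```

Since h_t = o_t ⊙ tanh(c_t) with o_t ∈ (0, 1), every component of the text feature lies strictly inside (−1, 1).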
2) A feature fusion layer is designed at the top of the feature mapping network, which fuses I_VGG_txt and I_NIC_txt into I_final, the d-dimensional feature representation of the input image in the limited text space; the dimensionality of the limited text space is d. The feature mapping network maps the image features obtained in step 1) into the limited text space (initially untrained). Then, first, a similarity measurement function compares feature vectors (computing the distance between them) to obtain the current triplet loss; second, the feature vectors of data of different modalities are fed into the modality classifier for classification, obtaining the current adversarial loss; finally, the limited text space is trained by optimizing the combined loss function of the triplet loss and the adversarial loss.
The text features are not fed into the feature mapping network here, because the feature extraction network (the LSTM network) already realizes the mapping of text into the feature space during feature extraction.
The feature fusion layer at the top of the feature mapping network is obtained by the processing of formula 5, where I_VGG is the 4096-dimensional image feature extracted by VGGNet; I_NIC is the 512-dimensional image feature extracted by the image captioning algorithm NIC; I_final is the d-dimensional feature representation of the input image in the limited text space; f(·) and g(·) denote two feature mapping functions; and I_VGG_txt and I_NIC_txt are the mappings of I_VGG and I_NIC into the d-dimensional text space, respectively, which the fusion layer combines into I_final.
The similarity measurement function is expressed as s(v, t) = v·t, where v and t represent the image feature and the text feature respectively. Before comparison, v and t are first passed through an L2 normalization layer, so that s is equivalent to the cosine distance.
The feature mapping network is trained by optimizing the triplet loss function and the adversarial loss function; specifically:
Let the distance between an input image (or text) and its matching text (or image) be d1, and the distance to a mismatched text (or image) be d2; d1 must be smaller than d2 by at least a margin m, a hyper-parameter set externally. The triplet loss function is expressed as formula 6 (the standard bidirectional hinge formulation):
L_emb(θ_f) = Σ_k max(0, m − s(v, t) + s(v, t_k)) + Σ_k max(0, m − s(v, t) + s(v_k, t)) (formula 6)
where t_k is the k-th mismatched text for the input image v; v_k is the k-th mismatched image for the input text t; m is the minimum distance margin; s(v, t) is the similarity measurement function; and θ_f are the parameters of the feature mapping network. Mismatched samples are randomly drawn from the dataset in each training cycle.
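The bidirectional hinge of formula 6 can be sketched as below; the similarity function is passed in, and the toy vectors are illustrative only:

```python
import numpy as np

def triplet_loss(v, t, neg_texts, neg_images, m, sim):
    """Formula 6: the matched pair (v, t) must score at least margin m
    above every mismatched pair, in both the image->text and
    text->image directions (hinge at zero)."""
    pos = sim(v, t)
    loss = sum(max(0.0, m - pos + sim(v, t_k)) for t_k in neg_texts)
    loss += sum(max(0.0, m - pos + sim(v_k, t)) for v_k in neg_images)
    return loss

cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
v = np.array([1.0, 0.0])        # image feature (toy)
t = np.array([1.0, 0.1])        # matching text feature (toy)
t_bad = np.array([0.0, 1.0])    # mismatched text feature (toy)
```

With a small margin the well-separated negative contributes nothing (loss 0); with a margin larger than the positive-negative gap the hinge activates and the loss becomes positive.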
The adversarial loss L_adv from the modality classifier is simultaneously back-propagated to the feature mapping network.
The total loss function L is defined as formula 7:
L = L_emb − λ·L_adv (formula 7)
where λ is an adaptive parameter whose value ranges from 0 to 1; L_emb is the triplet loss function; and L_adv is the additional adversarial loss function.
To suppress the noisy signal from the modality classifier in the early stage of training, the update of the parameter λ is realized by formula 8, where p is the percentage of the total number of iterations completed so far and λ is the adaptive parameter.
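Formula 8 itself is not reproduced in this text; a DANN-style schedule of the kind commonly used for exactly this purpose (rising smoothly from 0 at p = 0 to 1 at p = 1, so the classifier's noisy early signal is suppressed) would look like the following, with the functional form and the steepness constant γ = 10 being assumptions:

```python
import numpy as np

def adaptive_lambda(p: float, gamma: float = 10.0) -> float:
    """Hypothetical realization of formula 8: monotone schedule from
    0 to 1 over training progress p in [0, 1]. The exact form
    2/(1 + exp(-gamma*p)) - 1 is an assumption, not the patent's."""
    return 2.0 / (1.0 + np.exp(-gamma * p)) - 1.0
```

Any schedule with the same endpoints and monotonicity would serve the stated purpose.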
The feature mapping network is trained with the above loss function L, and its parameters θ_f are updated by formula 9 (gradient descent):
θ_f ← θ_f − μ ∂L/∂θ_f (formula 9)
where μ is the learning rate of the optimization algorithm, L is the total loss function of the feature mapping network, and θ_f are the parameters of the feature mapping network.
3) The image and text features obtained in step 2), now located in the same limited text space, are fed into the modality classifier for classification, and the modality classifier is trained with a cross-entropy loss. Specifically:
The text-space feature label of an image is [0 1], and that of a text is [1 0]. The modality classifier is trained by optimizing a two-class cross-entropy loss function, expressed as formula 4:
L_adv(θ_d) = −(1/N) Σ_{i=1}^{N} y_i · log D(x_i; θ_d) (formula 4)
where x_i and y_i denote the i-th input text-space feature and its corresponding label; N is the total number of currently input feature samples; θ_d are the training parameters of the modality classifier; D(·) is the function predicting the modality of the current text-space feature, i.e., text or image; and L_adv is the two-class cross-entropy loss of the modality classifier, which also serves as the additional adversarial loss function of the feature mapping network.
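A minimal numpy sketch of the two-class cross-entropy of formula 4, assuming the classifier outputs raw two-class scores (logits) that are softmax-normalized:

```python
import numpy as np

def modality_cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Two-class cross-entropy (formula 4). `labels` are one-hot rows:
    [0, 1] for image-derived text-space features, [1, 0] for text
    features; `logits` are the classifier's raw two-class scores."""
    z = logits - logits.max(axis=1, keepdims=True)        # stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-(labels * np.log(p + 1e-12)).sum(axis=1).mean())

labels = np.array([[1.0, 0.0], [0.0, 1.0]])               # text, image
confident_right = np.array([[5.0, -5.0], [-5.0, 5.0]])
confident_wrong = np.array([[-5.0, 5.0], [5.0, -5.0]])
```

A classifier that predicts the modality correctly with high confidence incurs a near-zero loss; when the feature mapping network succeeds in confusing it, this loss rises, which is exactly the signal exploited by the minimax game.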
The parameters θ_d of the modality classifier are updated by formula 10:
θ_d ← θ_d − μ ∂L_adv/∂θ_d (formula 10)
where μ is the learning rate of the optimization algorithm, L_adv is the loss function of the modality classifier, and θ_d are the parameters of the modality classifier.
4) Repeat steps 2) and 3) until the feature mapping network converges.
5) For a retrieval request, compute the distance in the limited text space between the retrieval request data (an image or a text) and the data of the other modality in dataset D, and sort the retrieval results by distance to obtain the most similar results. The distance is computed as the dot product between the feature vectors of the different-modality data in the limited text space.
Through the above steps, adversarial cross-media retrieval based on the limited text space is realized.
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention provides an adversarial cross-media retrieval method based on a limited text space, which mainly learns the limited text space to realize similarity measurement between images and text. Based on the limited text space, the method simulates human cognition, extracts image and text features suited to cross-media retrieval, maps image features from the image space into the text space, and introduces an adversarial training mechanism to continually reduce the difference in feature distribution between data of different modalities during learning. The method achieves accurate retrieval results on classic cross-media retrieval datasets. Specifically, the invention uses the feature extraction network to learn effective image and text feature representations; the image features are further fed into the feature mapping network to realize the mapping from the image space to the text space. Finally, to further reduce the difference in feature distribution between data of different modalities, the adversarial loss produced by the modality classifier is back-propagated to the feature mapping network, further improving the retrieval results. In particular, the invention has the following technical advantages:
(1) The invention performs cross-media retrieval in a limited text space by simulating human cognition. Compared with existing methods based on a common space or the image space, it better fits human behavior in cross-media retrieval tasks.
(2) The feature extraction network can learn image and text features better suited to the cross-media retrieval task, compensating for the limited expressive power of pre-trained features.
(3) To further reduce the difference in feature distribution between data of different modalities, the invention introduces an adversarial learning mechanism that further improves retrieval accuracy through the minimax game between the modality classifier and the feature mapping network.
Description of the drawings
Fig. 1 is a flow diagram of the method of the invention, where (a) shows that the invention comprises three parts: the feature extraction network, the feature mapping network and the modality classifier; (b) and (c) are structural block diagrams of the feature mapping network and the modality classifier, respectively.
Fig. 2 is a schematic diagram of the structure of the feature extraction network of the invention, where (a) is the image feature extraction network, which extracts image features by combining the 19-layer VGG model VGGNet with the image captioning algorithm NIC; (b) is the recurrent neural network (LSTM) for extracting text features.
Fig. 3 is a screenshot of cross-media retrieval results of an embodiment of the invention on the Flickr8K test dataset.
Detailed description of the embodiments
The invention is further described below by way of embodiments with reference to the accompanying drawings, without limiting the scope of the invention in any way.
The invention provides an adversarial cross-media retrieval method based on a limited text space, which mainly learns the limited text space to realize similarity measurement between images and text. Based on the limited text space, the method simulates human cognition, extracts image and text features suited to cross-media retrieval, maps image features from the image space into the text space, and introduces an adversarial training mechanism to continually reduce the difference in feature distribution between data of different modalities during learning. The feature extraction network, feature mapping network and modality classifier of the invention, their implementation, and the training steps of the networks are described in detail below.
1. Feature extraction network
The feature extraction network mainly comprises two branches, an image feature extraction network and a text feature extraction network, corresponding respectively to feature extraction for images and for text.
1) Image feature extraction. The image feature extraction network learns an image feature I_Concat, comprising the 4096-dimensional feature I_VGG and the image feature I_NIC extracted by the image captioning algorithm.
The image feature extraction network can be regarded as a combination of VGGNet (a neural network architecture proposed by the Visual Geometry Group) and NIC (Neural Image Caption, a neural-network-based image captioning model); VGGNet is a 19-layer VGG model, and NIC is an image captioning algorithm. VGGNet is pre-trained on an image classification task and extracts image features rich in object category information; in contrast, NIC is pre-trained on an image captioning task and extracts image features rich in information about interactions between objects. The image features extracted by the two are therefore complementary.
Specifically, after an image of size 224 × 224 is fed into VGGNet, the network outputs a 4096-dimensional feature I_VGG. At the same time, to avoid the information loss of image features during translation, the output of the image embedding layer (Image Embedding Layer) in NIC is taken as the image feature I_NIC extracted by the image captioning algorithm. Finally, the image feature I_Concat is the combination of I_VGG and I_NIC. The calculation is expressed as formula 1:
I_VGG = VGGNet(I), I_NIC = NIC(I), I_Concat = Concatenate(I_VGG, I_NIC) (formula 1)
Wherein, VGGNet(·) is the 19-layer VGG model, which extracts the 4096-dimensional feature I_VGG of the input image; NIC(·) is the image captioning algorithm, which extracts the 512-dimensional image feature I_NIC; Concatenate(·) is a feature concatenation layer, which joins I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
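As an illustrative sketch (not part of the claimed method), the concatenation of formula 1 can be written as follows; the `vggnet` and `nic` functions are stand-ins for the real pre-trained networks, and only the dimensions (4096, 512, 4608) come from the text:

```python
import random

def vggnet(image):
    """Stand-in for the 19-layer VGGNet: returns a 4096-dim feature."""
    random.seed(0)
    return [random.random() for _ in range(4096)]

def nic(image):
    """Stand-in for the NIC image-embedding-layer output: 512-dim."""
    random.seed(1)
    return [random.random() for _ in range(512)]

def concatenate(i_vgg, i_nic):
    """Feature concatenation layer: joins the two features (formula 1)."""
    return i_vgg + i_nic

i_vgg = vggnet("example.jpg")
i_nic = nic("example.jpg")
i_concat = concatenate(i_vgg, i_nic)
print(len(i_concat))  # 4608-dimensional I_Concat
```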
2) Text feature extraction
The text feature extraction network uses a long short-term memory recurrent neural network (LSTM) to extract a d-dimensional text feature; at the same time, d is also the dimension of the limited text space. Given a text S = (s_0, s_1, …, s_T) of length T, each word s_t in S is represented with a 1-of-k encoding, where k is the number of words in the dictionary. Before being fed into the LSTM network, the word s_t must first be mapped into a denser space:
x_t = W_e·s_t, t ∈ {0, …, T}, (formula 2)
Wherein, W_e is a word embedding matrix used to encode the 1-of-k vector s_t into a d-dimensional word vector.
After obtaining the word vector representations in the dense space, we feed them into the LSTM network, expressed as formula 3:
i_t = σ(W_i·x_t + U_i·h_{t-1} + b_i)
f_t = σ(W_f·x_t + U_f·h_{t-1} + b_f)
o_t = σ(W_o·x_t + U_o·h_{t-1} + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c·x_t + U_c·h_{t-1} + b_c)
h_t = o_t ⊙ tanh(c_t) (formula 3)
Wherein, i_t, f_t, o_t, c_t, h_t denote respectively the input gate, forget gate, output gate, memory cell, and hidden layer output of the LSTM unit at time t; W_*, U_*, b_* are the learned weight matrices and biases of the LSTM; x_t is the word vector input at the current time; h_{t-1} is the hidden layer output of the LSTM unit at the previous time; σ denotes the sigmoid function; ⊙ denotes element-wise multiplication; tanh denotes the hyperbolic tangent activation function. The feature representation of the text S is the hidden layer output of the LSTM network at time T, i.e., h_T.
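A minimal sketch of one LSTM time step of formula 3, using toy dimensions and randomly initialized weights in place of learned parameters (the dictionary names `W`, `U`, `b` are illustrative, not from the patent):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: gates i, f, o, candidate memory, cell c, hidden h."""
    d = len(h_prev)
    def affine(gate):  # W_g x_t + U_g h_{t-1} + b_g for gate g
        Wg, Ug, bg = W[gate], U[gate], b[gate]
        return [sum(Wg[j][k] * x_t[k] for k in range(len(x_t)))
                + sum(Ug[j][k] * h_prev[k] for k in range(d)) + bg[j]
                for j in range(d)]
    i = [sigmoid(v) for v in affine('i')]    # input gate
    f = [sigmoid(v) for v in affine('f')]    # forget gate
    o = [sigmoid(v) for v in affine('o')]    # output gate
    g = [math.tanh(v) for v in affine('c')]  # candidate memory
    c = [f[j] * c_prev[j] + i[j] * g[j] for j in range(d)]  # new cell state
    h = [o[j] * math.tanh(c[j]) for j in range(d)]          # hidden output
    return h, c

random.seed(0)
dx, d = 4, 3  # toy input and hidden sizes (the patent uses 1024 hidden nodes)
W = {k: [[random.uniform(-1, 1) for _ in range(dx)] for _ in range(d)] for k in 'ifoc'}
U = {k: [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)] for k in 'ifoc'}
b = {k: [0.0] * d for k in 'ifoc'}
h, c = [0.0] * d, [0.0] * d
for x_t in [[1, 0, 0, 0], [0, 1, 0, 0]]:  # two toy 1-of-k word vectors
    h, c = lstm_step(x_t, h, c, W, U, b)
print(len(h))  # the final h is the text representation h_T
```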
Fig. 2 shows the network structure of the feature extraction network of the present invention. During training, the parameters of VGGNet are kept fixed, and NIC is pre-trained on the image captioning task using the Flickr30K or MSCOCO training set. Specifically, we first resize all images in the dataset to 256 × 256, then obtain 224 × 224 image blocks by a single center crop, and finally feed them into the feature extraction network to extract image features. For text, we use LSTM and bidirectional LSTM networks to extract text features, where the number of hidden layer nodes of the LSTM unit is 1024.
2. Modality classifier
To further reduce the difference between the feature distributions of different modalities, we design a modality classifier that plays the role of the discriminator in a generative adversarial network. The text space feature label of an image is [0 1] and that of a text is [1 0]; the modality classifier is trained by optimizing a two-class cross-entropy loss function, expressed as formula 4:
L_adv(θ_d) = -(1/N) Σ_{i=1}^{N} y_i·log D(x_i; θ_d) (formula 4)
Wherein, x_i and y_i denote the i-th input text space feature and its corresponding label respectively; N is the total number of currently input feature samples; θ_d denotes the training parameters of the modality classifier; the classifier function D(·; θ_d) predicts the modality of the current text space feature, i.e., text or image; L_adv denotes the two-class cross-entropy loss of the modality classifier, which is also the additional adversarial loss function of the feature mapping network.
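As an illustrative sketch of the two-class cross-entropy of formula 4 (the `toy_classifier` and its weights are invented for the example; only the one-hot label convention comes from the text):

```python
import math

def modality_bce_loss(features, labels, classifier):
    """Two-class cross-entropy loss of the modality classifier (formula 4).
    Labels are one-hot: [1, 0] for text features, [0, 1] for image features."""
    total = 0.0
    for x, y in zip(features, labels):
        p = classifier(x)  # predicted probabilities [P(text), P(image)]
        total -= sum(yj * math.log(max(pj, 1e-12)) for yj, pj in zip(y, p))
    return total / len(features)

def toy_classifier(x):
    """Toy discriminator: softmax over a fixed linear score (illustrative)."""
    z = [sum(x), -sum(x)]
    e = [math.exp(v) for v in z]
    s = sum(e)
    return [v / s for v in e]

feats = [[0.8, 0.1], [-0.5, -0.2]]  # one "text-like", one "image-like" feature
labels = [[1, 0], [0, 1]]           # text label, image label
print(round(modality_bce_loss(feats, labels, toy_classifier), 4))
```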
3. Feature mapping network
The present invention learns a limited text space through the parameters θ_f of the feature mapping network. The feature extraction network has learned the image feature I_Concat, comprising the two parts I_VGG and I_NIC. For these two parts we design two mapping functions f(·) and g(·) in the feature mapping network, used respectively to map I_VGG and I_NIC to the d-dimensional text space features I_VGG_txt and I_NIC_txt. Like I_VGG and I_NIC, I_VGG_txt and I_NIC_txt are also complementary, so we design a feature fusion layer at the top of the feature mapping network to realize the complementary advantages of the two. The process is defined as formula 5:
I_VGG_txt = f(I_VGG), I_NIC_txt = g(I_NIC), I_final = I_VGG_txt + I_NIC_txt (formula 5)
Wherein, I_VGG is the 4096-dimensional image feature extracted by VGGNet, I_NIC is the 512-dimensional image feature extracted by the image captioning algorithm NIC, I_final is the d-dimensional feature representation of the input image in the limited text space, f(·) and g(·) denote the two feature mapping functions, and I_VGG_txt and I_NIC_txt are the mappings of I_VGG and I_NIC to the d-dimensional text space features respectively. It is worth noting that the feature extraction process for text is equivalent to mapping the text into the limited text space; the parameters θ_f of the feature mapping network (see formula 9) therefore include the parameters of the LSTM network.
Parts (b) and (c) of Fig. 1 show the network structures of the feature mapping network and the modality classifier respectively. The feature mapping network comprises the two mapping functions f(·) and g(·), a fusion layer, and an L2 normalization layer (L2 Norm). f(·) comprises two fully connected layers with 2048 and 1024 hidden nodes respectively; ReLU is used as the activation function between the fully connected layers, and a Dropout layer with rate 0.5 is added after the ReLU to prevent over-fitting. g(·) comprises one fully connected layer with 1024 hidden nodes. The fusion layer performs element-wise addition; the L2 normalization layer allows the similarity between the learned features to be measured directly by dot product, which accelerates model convergence and increases training stability.
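A minimal sketch of the mapping pipeline f(·), g(·), fusion, and L2 normalization, with toy dimensions and random weights standing in for the trained fully connected layers (dropout and biases omitted for brevity):

```python
import math
import random

def fc(x, W):
    """Fully connected layer (no bias, for brevity)."""
    return [sum(w_jk * x_k for w_jk, x_k in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def l2_normalize(x):
    """L2 Norm layer: makes dot products equivalent to cosine similarity."""
    n = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / n for v in x]

def f_map(i_vgg, W1, W2):
    """f(.): two FC layers with ReLU in between."""
    return fc(relu(fc(i_vgg, W1)), W2)

def g_map(i_nic, W):
    """g(.): one FC layer."""
    return fc(i_nic, W)

def fuse(a, b):
    """Fusion layer: element-wise addition (formula 5)."""
    return [x + y for x, y in zip(a, b)]

random.seed(1)
rand_mat = lambda rows, cols: [[random.uniform(-0.1, 0.1) for _ in range(cols)]
                               for _ in range(rows)]
d = 8  # toy text-space dimension (the patent uses 1024)
i_vgg = [random.random() for _ in range(16)]
i_nic = [random.random() for _ in range(6)]
i_final = l2_normalize(fuse(f_map(i_vgg, rand_mat(12, 16), rand_mat(d, 12)),
                            g_map(i_nic, rand_mat(d, 6))))
print(len(i_final))  # d-dimensional, unit-norm representation in text space
```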
After images and text are mapped into the limited text space, which is initially in an untrained state, the next step is to compare the similarity between features and compute the corresponding triplet loss. We define a similarity measure function s(v, t) = v·t, where v and t represent the image and text features respectively. To make s equivalent to the cosine distance, v and t must first pass through the L2 normalization layer before comparison. The triplet loss function is widely used in the field of cross-media retrieval. Given an input image (text), let the distance between the input image (text) and its matched text (image) be d_1, and the distance to a mismatched text (image) be d_2; we want d_1 to be smaller than d_2 by at least a margin m. The margin m is a hyper-parameter determined externally; for ease of optimization, we fix m = 0.3 on all datasets. In the present invention, the triplet loss function is therefore expressed as formula 6:
L_emb(θ_f) = Σ_k max(0, m - s(v, t) + s(v, t_k)) + Σ_k max(0, m - s(v, t) + s(v_k, t)) (formula 6)
Wherein, t_k is the k-th mismatched text for the input image v; v_k is the k-th mismatched image for the input text t; m is the minimum distance margin; s(v, t) is the similarity measure function; θ_f denotes the parameters of the feature mapping network. The mismatched samples are randomly drawn from the dataset in each training epoch.
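The bidirectional triplet loss of formula 6 can be sketched as follows for a single matched pair; the toy feature values are invented for illustration, while the margin m = 0.3 and the dot-product similarity come from the text:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triplet_loss(v, t, neg_texts, neg_images, m=0.3):
    """Bidirectional triplet loss (formula 6) for one matched pair (v, t).
    v and t are assumed L2-normalized, so s(v, t) is the dot product."""
    s_pos = dot(v, t)
    loss = sum(max(0.0, m - s_pos + dot(v, t_k)) for t_k in neg_texts)
    loss += sum(max(0.0, m - s_pos + dot(v_k, t)) for v_k in neg_images)
    return loss

v = [1.0, 0.0]        # toy image feature
t = [1.0, 0.0]        # matched text feature (identical, so s_pos = 1)
t_neg = [[0.0, 1.0]]  # one mismatched text (s = 0)
v_neg = [[0.6, 0.8]]  # one mismatched image (s = 0.6)
print(triplet_loss(v, t, t_neg, v_neg))  # max(0, 0.3-1+0) + max(0, 0.3-1+0.6) = 0.0
```

When the positive similarity already exceeds every negative similarity by the margin, both hinge terms vanish; a poorly matched pair yields a positive loss.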
Next, the feature vectors of the different modalities are fed into the modality classifier for classification, yielding the current adversarial loss. In addition to the triplet loss, the adversarial loss L_adv from the modality classifier is also back-propagated synchronously to the feature mapping network. Finally, the limited text space is trained by optimizing a combined loss function of the triplet loss L_emb and the adversarial loss L_adv. Since L_emb and L_adv have opposite optimization objectives, the total loss function L is defined as formula 7:
L = L_emb - λ·L_adv (formula 7)
Wherein, λ is an adaptive parameter whose value ranges from 0 to 1; L_emb denotes the triplet loss function; L_adv is the additional adversarial loss function. To suppress noisy signals from the modality classifier at the beginning of training, the update of the parameter λ is realized by the mathematical expression shown in formula 8, wherein p denotes the percentage of the current iteration count relative to the total iteration count and λ is the adaptive parameter.
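The body of formula 8 is not reproduced in this text. A common choice for such a warm-up schedule, used in domain-adversarial training (DANN) and consistent with the stated 0-to-1 range, is λ(p) = 2/(1 + exp(-10·p)) - 1; the sketch below assumes that form:

```python
import math

def adaptive_lambda(p, gamma=10.0):
    """Assumed warm-up schedule for lambda (the patent omits formula 8's body).
    Rises smoothly from 0 at p = 0 toward 1 as p -> 1, suppressing the noisy
    adversarial signal at the start of training."""
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0

for p in (0.0, 0.5, 1.0):  # p = fraction of total iterations completed
    print(round(adaptive_lambda(p), 4))
```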
Fig. 3 illustrates actual cross-media retrieval results of the present invention on the Flickr8K test set. The first column lists the image and text queries used for retrieval; the second to fourth columns show, for each query, the top-5 retrieval results of LTS-A (VGG+BLSTM), LTS-A (NIC+BLSTM), and LTS-A (VGG+NIC+BLSTM) respectively. For image-to-text retrieval, correctly retrieved texts are shown in red font; for text-to-image retrieval, correctly retrieved images are marked with a check mark. From the left of the table to the right, the retrieval results improve markedly, especially from LTS-A (VGG+BLSTM) to LTS-A (NIC+BLSTM); moreover, even the incorrectly retrieved samples match the query well to some degree.
4. Training method
The training process of the present invention comprises four stages.
Stage one: in the initial training stage, we fix the parameters of VGGNet and pre-train NIC using the Flickr30K training set (30,000 images from Yahoo's photo-sharing website Flickr) or the MSCOCO training set (a dataset created by Microsoft using Amazon's Mechanical Turk service). After this training is complete, image features can be extracted through the feature extraction network.
Stage two: after the features of all images in the dataset have been extracted, the second training stage is mainly used to learn the limited text space. Given the loss function L of the feature mapping network, we fix the parameters θ_d of the modality classifier and update the parameters θ_f of the feature mapping network by the following expression, formula 9:
θ_f ← θ_f - μ·∂L/∂θ_f (formula 9)
Wherein, μ denotes the learning rate of the optimization algorithm, L denotes the total loss function of the feature mapping network, and θ_f denotes the parameters of the feature mapping network.
Stage three: after the second training stage, the third training stage is mainly used to enhance the discriminative power of the modality classifier. Given the loss function L_adv of the modality classifier, we fix the parameters θ_f of the feature mapping network and update the parameters θ_d of the modality classifier by the following expression, formula 10:
θ_d ← θ_d - μ·∂L_adv/∂θ_d (formula 10)
Wherein, μ denotes the learning rate of the optimization algorithm, L_adv denotes the loss function of the modality classifier, and θ_d denotes the parameters of the modality classifier.
Stage four: for each batch of training data, the second and third training stages are repeated alternately until the model converges.
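The alternation of stages two through four can be sketched as follows; `grad_L` and `grad_L_adv` are placeholders for the real back-propagated gradients of formulas 9 and 10, and the flat parameter lists are purely illustrative:

```python
def train(batches, theta_f, theta_d, grad_L, grad_L_adv, mu=0.01, epochs=2):
    """Alternating updates of stages two and three (formulas 9 and 10)."""
    for _ in range(epochs):
        for batch in batches:
            # Stage two: fix theta_d, update the feature mapping network.
            g = grad_L(batch, theta_f, theta_d)
            theta_f = [w - mu * gw for w, gw in zip(theta_f, g)]
            # Stage three: fix theta_f, update the modality classifier.
            g = grad_L_adv(batch, theta_f, theta_d)
            theta_d = [w - mu * gw for w, gw in zip(theta_d, g)]
    return theta_f, theta_d

# Toy gradients that pull every parameter toward zero.
grad = lambda batch, tf, td: tf
grad_adv = lambda batch, tf, td: td
tf, td = train([[0], [1]], [1.0, -1.0], [0.5], grad, grad_adv, mu=0.1, epochs=3)
print(tf, td)  # both parameter sets shrink over the alternating updates
```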
Table 1 gives the experimental results of the present invention for cross-media retrieval on the Flickr8K test set. To evaluate retrieval performance, we follow the standard ranking metrics Recall@K and Median Rank. Recall@K measures retrieval accuracy as the probability that the correctly matched item appears in the top K (K = 1, 5, 10) retrieval results; Median Rank is the median rank of the correctly matched item. Higher Recall@K and lower Median Rank indicate more accurate retrieval. The table compares the present invention with other existing state-of-the-art algorithms, including DeViSE (Deep Visual-Semantic Embedding), m-RNN (Deep Captioning with Multimodal Recurrent Neural Networks), Deep Fragment (Deep Fragment Embedding), DCCA (Deep Canonical Correlation Analysis), VSE (Unifying Visual-Semantic Embedding with Multimodal Neural Language Models), m-CNN_ENS (Multimodal Convolutional Neural Networks, ensemble), NIC (Neural Image Captioning), and HM-LSTM (Hierarchical Multimodal LSTM). In addition, we design four variants based on the above method:
● LTS-A (VGG+LSTM): the image captioning algorithm NIC is removed from image feature extraction; the rest is unchanged;
● LTS-A (NIC+LSTM): the convolutional neural network VGGNet is removed from image feature extraction; the rest is unchanged;
● LTS-A (VGG+NIC+LSTM): the network structure shown in Fig. 2;
● LTS-A (VGG+NIC+BLSTM): the network structure shown in Fig. 2, with the LSTM network replaced by a bidirectional LSTM network (BLSTM).
Table 1. Cross-media retrieval results of the embodiment on the Flickr8K test set.
In Table 1, Img2Txt denotes image-to-text retrieval and Txt2Img denotes text-to-image retrieval. From Table 1 we can see that LTS-A (VGG+NIC+BLSTM) surpasses HM-LSTM on the image-to-text retrieval task and achieves the best retrieval performance to date. However, on the text-to-image retrieval task, LTS-A (VGG+NIC+BLSTM) is not as good as HM-LSTM. The most likely reason is that HM-LSTM uses a tree-structured LSTM architecture that can better model the hierarchical structure of text, whereas the present invention adopts a chain LSTM architecture that cannot capture the hierarchical semantic information in text. In addition, from the changes in results among the four variants of the present invention, it can be seen that when the image feature extraction network changes from VGGNet to NIC, the accuracy of image-to-text retrieval improves by 22% and that of text-to-image retrieval improves by 17%, which indicates that NIC extracts more effective image features than the traditional VGGNet. When the image feature extraction network changes from NIC to VGG+NIC, the accuracy of cross-media retrieval further improves by 6%, demonstrating that the feature extraction network now extracts not only fine-grained object category information in the image but also rich information about interactions between objects. Finally, replacing the LSTM network with a bidirectional LSTM network (BLSTM) brings an additional 2% improvement in retrieval accuracy.
Table 2 shows the cross-media retrieval results of the embodiment on the Flickr30K test set. In addition to the state-of-the-art algorithms mentioned for Flickr8K, we add DAN (Dual Attention Networks), DSPE (Deep Structure-Preserving Image-Text Embeddings), and VSE++ (an improved model of VSE). Here DAN and DSPE achieve the best retrieval results, with DAN performing better than DSPE. Owing to its attention mechanism, DAN can continually attend to fine-grained information in the data, which is highly beneficial to cross-media retrieval; in contrast, we use only global features to represent images and text, and can therefore be disturbed by noisy information in the image or text. Besides DAN, DSPE also performs better than our method, because DSPE uses a more complex text feature (the Fisher Vector) and loss function. As for the four variants of the present invention, their experimental behavior is similar to that on Flickr8K.
Table 2. Cross-media retrieval results of the embodiment on the Flickr30K test set.
Table 3. Cross-media retrieval results of the embodiment on the MSCOCO test set.
Table 3 shows the cross-media retrieval results of the embodiment on the MSCOCO test set. In addition to the state-of-the-art algorithms mentioned for Flickr8K and Flickr30K, we add Order (Order-Embeddings of Images and Language). Here LTS-A (VGG+NIC+LSTM) achieves the best results on the image-to-text retrieval task, improving retrieval accuracy by about 2%, and falls below DSPE only on the R@1 metric; on the text-to-image retrieval task, DSPE outperforms us on Recall@K, but LTS-A (VGG+NIC+LSTM) achieves the best result on the Median Rank metric. This is because the chain LSTM network of the present invention cannot fully capture the hierarchical semantic information in text, so its text representation ability falls short of the Fisher Vector (FV). As for the four variants of the present invention, their experimental behavior is similar to that on Flickr8K and Flickr30K.
Table 4. Cross-media retrieval results of the two variants LTS-A and LTS of the embodiment.
Table 4 illustrates the influence of the adversarial learning mechanism on the experimental results. We design two variants: LTS-A and LTS. LTS-A is the aforementioned LTS-A (VGG+NIC+LSTM); LTS is LTS-A (VGG+NIC+LSTM) with the adversarial learning mechanism removed. From the table we can see that LTS-A clearly improves cross-media retrieval accuracy compared with LTS; LTS exceeds LTS-A only on the R@1 metric of image-to-text retrieval. The experimental results show that adversarial learning is markedly effective in reducing the difference between the feature distributions of different modalities.
Table 5. Retrieval results of the embodiment on the MSCOCO test set.
Table 6 reports retrieval results on the MSCOCO test set when image features are extracted from a single crop versus the mean of ten crops.
In the implementation above, we extract image features using a single crop (1-crop) of the image region. To verify the validity of using the mean feature of ten different image regions (10-crops) as the image feature, we design LTS-A (10-crops), where LTS-A here refers to LTS-A (VGG+NIC+BLSTM) and 10-crops means the image feature is described by the mean feature of ten different regions of the image. As can be seen from Table 6, the retrieval accuracy of LTS-A (10-crops) is clearly better than that of LTS-A (1-crop), which demonstrates the feasibility of using the mean feature of ten different image regions as the image feature.
It should be noted that the purpose of disclosing the embodiments is to aid further understanding of the present invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The present invention should therefore not be limited to the disclosed embodiments, and its scope of protection is defined by the claims.
Claims (7)
1. An adversarial cross-media retrieval method based on a limited text space, which designs a feature extraction network, a feature mapping network, and a modality classifier, learns a limited text space, extracts image and text features suitable for cross-media retrieval, and realizes the mapping of image features from the image space to the text space; the difference in feature distributions between data of different modalities is continually reduced during learning through an adversarial training mechanism, thereby realizing cross-media retrieval; specifically:
A. The feature extraction network comprises an image feature extraction network and a text feature extraction network, used respectively for image feature extraction and text feature extraction; the image feature extraction network learns the image feature I_Concat through one or both of VGGNet and NIC, comprising one or both of the 4096-dimensional feature I_VGG and the image feature I_NIC extracted by the image captioning algorithm; the text feature extraction network uses the long short-term memory recurrent neural network LSTM or the bidirectional LSTM network BLSTM to extract d-dimensional text features;
B. The modality classifier serves as the discriminator in an adversarial network; the training of the modality classifier is realized by optimizing a two-class cross-entropy loss function, which is also the additional adversarial loss function of the feature mapping network;
C. The feature mapping network learns a limited text space through its parameters θ_f; for the image feature I_Concat learned by the feature extraction network, comprising I_VGG and I_NIC, two mapping functions f(·) and g(·) are designed in the feature mapping network, used respectively to map I_VGG and I_NIC to the d-dimensional text space features I_VGG_txt and I_NIC_txt; a feature fusion layer is designed at the top of the feature mapping network to fuse I_VGG_txt and I_NIC_txt into I_final, the d-dimensional feature representation of the input image in the limited text space; the dimension of the limited text space is d;
Suppose the training dataset D = {D_1, D_2, …, D_n} contains n samples, each sample D_i comprising a picture I_i and a descriptive text T_i, i.e., D_i = (I_i, T_i), each text consisting of 5 sentences, each of which independently describes the matched picture; for the dataset D, the following steps 1)-4) are executed to train the feature extraction network, feature mapping network, and modality classifier:
1) Extract the features of the images and texts in D through the feature extraction network: for the images in D, image features are extracted using the VGG model and the image captioning algorithm NIC; for the texts in D, text features are extracted using the long short-term memory recurrent neural network LSTM, which realizes the mapping of text to the feature space; the parameters of the LSTM network must be updated synchronously with the parameters of the feature mapping network;
2) Map the text features and the image features obtained in step 1) through the feature mapping network into the limited text space, which is initially in an untrained state; first compute the distances between feature vectors with the similarity measure function, and compare the similarities between feature vectors to obtain the current triplet loss; then feed the feature vectors of the different modalities into the modality classifier for classification to obtain the current adversarial loss; finally train the limited text space by optimizing the combined loss function of the triplet loss and the adversarial loss;
3) Feed the image and text features obtained in step 2), which lie in the same limited text space, into the modality classifier for classification, and train the modality classifier through the cross-entropy loss;
4) Repeat steps 2)-3) until the feature mapping network converges;
5) For a retrieval request, compute the distance in the limited text space between the request data, an image or a text, and data of the other modality in the dataset D, rank the retrieval results by distance, and thereby obtain the most similar retrieval results; specifically, the distance is computed by the dot product between the feature vectors of the different-modality data in the space;
Through the above steps, adversarial cross-media retrieval based on a limited text space is realized.
2. The adversarial cross-media retrieval method according to claim 1, characterized in that the calculation of image feature extraction is expressed as formula 1:
I_VGG = VGGNet(I), I_NIC = NIC(I), I_Concat = Concatenate(I_VGG, I_NIC) (formula 1)
wherein VGGNet(·) is the 19-layer VGG model, which extracts the 4096-dimensional feature I_VGG of the input image; NIC(·) is the image captioning algorithm, which extracts the 512-dimensional image feature I_NIC; Concatenate(·) is a feature concatenation layer, which joins I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
3. The adversarial cross-media retrieval method according to claim 1, characterized in that text feature extraction specifically executes the following steps:
Given a text S = (s_0, s_1, …, s_T) of length T, each word s_t in S is represented with a 1-of-k encoding, where k is the number of words in the dictionary; before being fed into the LSTM network, the word s_t must first be mapped into a denser space, expressed as formula 2:
x_t = W_e·s_t, t ∈ {0, …, T}, (formula 2)
wherein W_e is a word embedding matrix used to encode the 1-of-k vector s_t into a d-dimensional word vector;
The word vectors in the dense space thus obtained are fed into the LSTM network, expressed as formula 3:
i_t = σ(W_i·x_t + U_i·h_{t-1} + b_i)
f_t = σ(W_f·x_t + U_f·h_{t-1} + b_f)
o_t = σ(W_o·x_t + U_o·h_{t-1} + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c·x_t + U_c·h_{t-1} + b_c)
h_t = o_t ⊙ tanh(c_t) (formula 3)
wherein i_t, f_t, o_t, c_t, h_t denote respectively the input gate, forget gate, output gate, memory cell, and hidden layer output of the LSTM unit at time t; W_*, U_*, b_* are the learned weight matrices and biases of the LSTM; x_t is the word vector input at the current time; h_{t-1} is the hidden layer output of the LSTM unit at the previous time; σ denotes the sigmoid function; ⊙ denotes element-wise multiplication; tanh denotes the hyperbolic tangent activation function; the hidden layer output h_T of the LSTM network at time T is the feature representation of the text S.
4. The adversarial cross-media retrieval method according to claim 1, characterized in that the training of the modality classifier specifically executes the following operations:
The text space feature label of an image is [0 1] and that of a text is [1 0]; the modality classifier is trained by optimizing a two-class cross-entropy loss function, expressed as formula 4:
L_adv(θ_d) = -(1/N) Σ_{i=1}^{N} y_i·log D(x_i; θ_d) (formula 4)
wherein x_i and y_i denote the i-th input text space feature and its corresponding label respectively; N is the total number of currently input feature samples; θ_d denotes the training parameters of the modality classifier; the classifier function D(·; θ_d) predicts the modality of the current text space feature, i.e., text or image; L_adv denotes the two-class cross-entropy loss of the modality classifier, which is also the additional adversarial loss function of the feature mapping network;
The parameters θ_d of the modality classifier are updated by formula 10:
θ_d ← θ_d - μ·∂L_adv/∂θ_d (formula 10)
wherein μ denotes the learning rate of the optimization algorithm, L_adv denotes the two-class cross-entropy loss of the modality classifier, and θ_d denotes the parameters of the modality classifier.
5. The adversarial cross-media retrieval method according to claim 1, characterized in that the feature fusion layer at the top of the feature mapping network processes its inputs according to formula 5:
I_VGG_txt = f(I_VGG), I_NIC_txt = g(I_NIC), I_final = I_VGG_txt + I_NIC_txt (formula 5)
wherein I_VGG is the 4096-dimensional image feature extracted by VGGNet, I_NIC is the 512-dimensional image feature extracted by the image captioning algorithm NIC, I_final is the d-dimensional feature representation of the input image in the limited text space, f(·) and g(·) denote the two feature mapping functions, and I_VGG_txt and I_NIC_txt are the mappings of I_VGG and I_NIC to the d-dimensional text space features respectively.
6. The adversarial cross-media retrieval method according to claim 1, characterized in that step 2) trains the feature mapping network by optimizing the triplet loss function and the adversarial loss function, specifically executing the following operations:
Let the distance between the input image or text and its matched text or image be d_1, and the distance to a mismatched text or image be d_2; d_1 is to be smaller than d_2 by at least a margin m, a hyper-parameter determined externally; the triplet loss function is expressed as formula 6:
L_emb(θ_f) = Σ_k max(0, m - s(v, t) + s(v, t_k)) + Σ_k max(0, m - s(v, t) + s(v_k, t)) (formula 6)
wherein t_k is the k-th mismatched text for the input image v; v_k is the k-th mismatched image for the input text t; m is the minimum distance margin; s(v, t) is the similarity measure function; θ_f denotes the parameters of the feature mapping network; the mismatched samples are randomly drawn from the dataset in each training epoch;
The adversarial loss L_adv of the modality classifier is back-propagated synchronously to the feature mapping network;
The total loss function L is defined as formula 7:
L = L_emb - λ·L_adv (formula 7)
wherein λ is an adaptive parameter whose value ranges from 0 to 1; L_emb denotes the triplet loss function; L_adv is the additional adversarial loss function;
To suppress noisy signals from the modality classifier at the beginning of training, the update of the parameter λ is realized by formula 8, wherein p denotes the percentage of the current iteration count relative to the total iteration count and λ is the adaptive parameter;
The feature mapping network is trained with the above loss function L, and its parameters θ_f are updated by formula 9:
θ_f ← θ_f - μ·∂L/∂θ_f (formula 9)
wherein μ denotes the learning rate of the optimization algorithm, L denotes the total loss function of the feature mapping network, and θ_f denotes the parameters of the feature mapping network.
7. The adversarial cross-media retrieval method according to claim 1, characterized in that the similarity measure function s(v, t) of step 2) is expressed as:
s(v, t) = v·t
wherein v and t represent the image feature and the text feature respectively; before comparison, v and t first pass through the normalization layer to be normalized, so that s is equivalent to the cosine distance.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810101127.0A CN108319686B (en) | 2018-02-01 | 2018-02-01 | Antagonism cross-media retrieval method based on limited text space |
PCT/CN2018/111327 WO2019148898A1 (en) | 2018-02-01 | 2018-10-23 | Adversarial cross-media retrieving method based on restricted text space |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810101127.0A CN108319686B (en) | 2018-02-01 | 2018-02-01 | Antagonism cross-media retrieval method based on limited text space |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108319686A true CN108319686A (en) | 2018-07-24 |
CN108319686B CN108319686B (en) | 2021-07-30 |
Family
ID=62888861
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810101127.0A Expired - Fee Related CN108319686B (en) | 2018-02-01 | 2018-02-01 | Antagonism cross-media retrieval method based on limited text space |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108319686B (en) |
WO (1) | WO2019148898A1 (en) |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344266A (en) * | 2018-06-29 | 2019-02-15 | 北京大学深圳研究生院 | A kind of antagonism cross-media retrieval method based on dual semantics space |
CN109508400A (en) * | 2018-10-09 | 2019-03-22 | 中国科学院自动化研究所 | Picture and text abstraction generating method |
CN109783657A (en) * | 2019-01-07 | 2019-05-21 | 北京大学深圳研究生院 | Multistep based on limited text space is from attention cross-media retrieval method and system |
CN109783655A (en) * | 2018-12-07 | 2019-05-21 | 西安电子科技大学 | A kind of cross-module state search method, device, computer equipment and storage medium |
CN109919162A (en) * | 2019-01-25 | 2019-06-21 | 武汉纺织大学 | For exporting the model and its method for building up of MR image characteristic point description vectors symbol |
CN110059217A (en) * | 2019-04-29 | 2019-07-26 | 广西师范大学 | A kind of image text cross-media retrieval method of two-level network |
WO2019148898A1 (en) * | 2018-02-01 | 2019-08-08 | 北京大学深圳研究生院 | Adversarial cross-media retrieving method based on restricted text space |
CN110175256A (en) * | 2019-05-30 | 2019-08-27 | 上海联影医疗科技有限公司 | A kind of image data retrieval method, apparatus, equipment and storage medium |
CN110189249A (en) * | 2019-05-24 | 2019-08-30 | 深圳市商汤科技有限公司 | A kind of image processing method and device, electronic equipment and storage medium |
CN110502743A (en) * | 2019-07-12 | 2019-11-26 | 北京邮电大学 | Social networks based on confrontation study and semantic similarity is across media search method |
CN110674688A (en) * | 2019-08-19 | 2020-01-10 | 深圳力维智联技术有限公司 | Face recognition model acquisition method, system and medium for video monitoring scene |
CN110866129A (en) * | 2019-11-01 | 2020-03-06 | 中电科大数据研究院有限公司 | Cross-media retrieval method based on cross-media uniform characterization model |
CN111259851A (en) * | 2020-01-23 | 2020-06-09 | 清华大学 | Multi-mode event detection method and device |
CN111651660A (en) * | 2020-05-28 | 2020-09-11 | 拾音智能科技有限公司 | Method for cross-media retrieval of difficult samples |
CN111782921A (en) * | 2020-03-25 | 2020-10-16 | 北京沃东天骏信息技术有限公司 | Method and device for searching target |
CN112182281A (en) * | 2019-07-05 | 2021-01-05 | 腾讯科技(深圳)有限公司 | Audio recommendation method and device and storage medium |
CN112818157A (en) * | 2021-02-10 | 2021-05-18 | 浙江大学 | Combined query image retrieval method based on multi-order confrontation characteristic learning |
CN113094550A (en) * | 2020-01-08 | 2021-07-09 | 百度在线网络技术(北京)有限公司 | Video retrieval method, device, equipment and medium |
CN113159071A (en) * | 2021-04-20 | 2021-07-23 | 复旦大学 | Cross-modal image-text association anomaly detection method |
CN113254678A (en) * | 2021-07-14 | 2021-08-13 | 北京邮电大学 | Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof |
CN113379603A (en) * | 2021-06-10 | 2021-09-10 | 大连海事大学 | Ship target detection method based on deep learning |
CN113946710A (en) * | 2021-10-12 | 2022-01-18 | 浙江大学 | Video retrieval method based on multi-mode and self-supervision characterization learning |
CN115114395A (en) * | 2022-04-15 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Content retrieval and model training method and device, electronic equipment and storage medium |
CN117312592A (en) * | 2023-11-28 | 2023-12-29 | 云南联合视觉科技有限公司 | Text-pedestrian image retrieval method based on modal invariant feature learning |
Families Citing this family (42)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111105013B (en) * | 2019-11-05 | 2023-08-11 | 中国科学院深圳先进技术研究院 | Optimization method of countermeasure network architecture, image description generation method and system |
CN111179254B (en) * | 2019-12-31 | 2023-05-30 | 复旦大学 | Domain adaptive medical image segmentation method based on feature function and countermeasure learning |
CN111198964B (en) * | 2020-01-10 | 2023-04-25 | 中国科学院自动化研究所 | Image retrieval method and system |
CN111259152A (en) * | 2020-01-20 | 2020-06-09 | 刘秀萍 | Deep multilayer network driven feature aggregation category divider |
CN111325319B (en) * | 2020-02-02 | 2023-11-28 | 腾讯云计算(北京)有限责任公司 | Neural network model detection method, device, equipment and storage medium |
CN111353076B (en) * | 2020-02-21 | 2023-10-10 | 华为云计算技术有限公司 | Method for training cross-modal retrieval model, cross-modal retrieval method and related device |
CN111368176B (en) * | 2020-03-02 | 2023-08-18 | 南京财经大学 | Cross-modal hash retrieval method and system based on supervision semantic coupling consistency |
CN111597810B (en) * | 2020-04-13 | 2024-01-05 | 广东工业大学 | Named entity identification method for semi-supervised decoupling |
CN113673635B (en) * | 2020-05-15 | 2023-09-01 | 复旦大学 | Hand-drawn sketch understanding deep learning method based on self-supervision learning task |
CN111651577B (en) * | 2020-06-01 | 2023-04-21 | 全球能源互联网研究院有限公司 | Cross-media data association analysis model training and data association analysis method and system |
CN111708745B (en) * | 2020-06-18 | 2023-04-21 | 全球能源互联网研究院有限公司 | Cross-media data sharing representation method and user behavior analysis method and system |
CN111882032B (en) * | 2020-07-13 | 2023-12-01 | 广东石油化工学院 | Neural semantic memory storage method |
CN111984800B (en) * | 2020-08-16 | 2023-11-17 | 西安电子科技大学 | Hash cross-modal information retrieval method based on dictionary pair learning |
CN112256899B (en) * | 2020-09-23 | 2022-05-10 | 华为技术有限公司 | Image reordering method, related device and computer readable storage medium |
CN112466281A (en) * | 2020-10-13 | 2021-03-09 | 讯飞智元信息科技有限公司 | Harmful audio recognition decoding method and device |
CN112214988B (en) * | 2020-10-14 | 2024-01-23 | 哈尔滨福涛科技有限责任公司 | Deep learning and rule combination-based negotiable article structure analysis method |
CN112396091B (en) * | 2020-10-23 | 2024-02-09 | 西安电子科技大学 | Social media image popularity prediction method, system, storage medium and application |
CN112651448B (en) * | 2020-12-29 | 2023-09-15 | 中山大学 | Multi-mode emotion analysis method for social platform expression package |
CN112949384B (en) * | 2021-01-23 | 2024-03-08 | 西北工业大学 | Remote sensing image scene classification method based on antagonistic feature extraction |
CN112861977B (en) * | 2021-02-19 | 2024-01-26 | 中国人民武装警察部队工程大学 | Migration learning data processing method, system, medium, equipment, terminal and application |
CN113052311B (en) * | 2021-03-16 | 2024-01-19 | 西北工业大学 | Feature extraction network with layer jump structure and method for generating features and descriptors |
CN113420166A (en) * | 2021-03-26 | 2021-09-21 | 阿里巴巴新加坡控股有限公司 | Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment |
CN113537272B (en) * | 2021-03-29 | 2024-03-19 | 之江实验室 | Deep learning-based semi-supervised social network abnormal account detection method |
CN113536013B (en) * | 2021-06-03 | 2024-02-23 | 国家电网有限公司大数据中心 | Cross-media image retrieval method and system |
CN113656616B (en) * | 2021-06-23 | 2024-02-27 | 同济大学 | Three-dimensional model sketch retrieval method based on heterogeneous twin neural network |
CN113360683B (en) * | 2021-06-30 | 2024-04-19 | 北京百度网讯科技有限公司 | Method for training cross-modal retrieval model and cross-modal retrieval method and device |
CN113362416B (en) * | 2021-07-01 | 2024-05-17 | 中国科学技术大学 | Method for generating image based on text of target detection |
CN113610128B (en) * | 2021-07-28 | 2024-02-13 | 西北大学 | Aesthetic attribute retrieval-based picture aesthetic description modeling and describing method and system |
CN114022687B (en) * | 2021-09-24 | 2024-05-10 | 之江实验室 | Image description countermeasure generation method based on reinforcement learning |
CN114022372B (en) * | 2021-10-25 | 2024-04-16 | 大连理工大学 | Mask image patching method for introducing semantic loss context encoder |
CN114241517B (en) * | 2021-12-02 | 2024-02-27 | 河南大学 | Cross-mode pedestrian re-recognition method based on image generation and shared learning network |
CN114298159B (en) * | 2021-12-06 | 2024-04-09 | 湖南工业大学 | Image similarity detection method based on text fusion under label-free sample |
CN114443916B (en) * | 2022-01-25 | 2024-02-06 | 中国人民解放军国防科技大学 | Supply and demand matching method and system for test data |
CN114677569B (en) * | 2022-02-17 | 2024-05-10 | 之江实验室 | Character-image pair generation method and device based on feature decoupling |
CN115129917B (en) * | 2022-06-06 | 2024-04-09 | 武汉大学 | optical-SAR remote sensing image cross-modal retrieval method based on modal common characteristics |
CN115131613B (en) * | 2022-07-01 | 2024-04-02 | 中国科学技术大学 | Small sample image classification method based on multidirectional knowledge migration |
CN115909317A (en) * | 2022-07-15 | 2023-04-04 | 广东工业大学 | Learning method and system for three-dimensional model-text joint expression |
CN115840827B (en) * | 2022-11-07 | 2023-09-19 | 重庆师范大学 | Deep unsupervised cross-modal hash retrieval method |
CN116108215A (en) * | 2023-02-21 | 2023-05-12 | 湖北工业大学 | Cross-modal big data retrieval method and system based on depth fusion |
CN116821408B (en) * | 2023-08-29 | 2023-12-01 | 南京航空航天大学 | Multi-task consistency countermeasure retrieval method and system |
CN116935329B (en) * | 2023-09-19 | 2023-12-01 | 山东大学 | Weak supervision text pedestrian retrieval method and system for class-level comparison learning |
CN117611924B (en) * | 2024-01-17 | 2024-04-09 | 贵州大学 | Plant leaf phenotype disease classification method based on graphic subspace joint learning |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1211769A (en) * | 1997-06-26 | 1999-03-24 | 香港中文大学 | Method and equipment for file retrieval based on Bayesian network |
CN1920818A (en) * | 2006-09-14 | 2007-02-28 | 浙江大学 | Transmedia search method based on multi-mode information convergence analysis |
US20120303628A1 (en) * | 2011-05-24 | 2012-11-29 | Brian Silvola | Partitioned database model to increase the scalability of an information system |
CN103914711A (en) * | 2014-03-26 | 2014-07-09 | 中国科学院计算技术研究所 | Improved top speed learning model and method for classifying modes of improved top speed learning model |
CN105512289A (en) * | 2015-12-07 | 2016-04-20 | 郑州金惠计算机系统工程有限公司 | Image retrieval method based on deep learning and Hash |
CN105718532A (en) * | 2016-01-15 | 2016-06-29 | 北京大学 | Cross-media sequencing method based on multi-depth network structure |
CN106202413A (en) * | 2016-07-11 | 2016-12-07 | 北京大学深圳研究生院 | A kind of cross-media retrieval method |
CN106649715A (en) * | 2016-12-21 | 2017-05-10 | 中国人民解放军国防科学技术大学 | Cross-media retrieval method based on local sensitive hash algorithm and neural network |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104346440B (en) * | 2014-10-10 | 2017-06-23 | 浙江大学 | A kind of across media hash indexing methods based on neutral net |
CN106095893B (en) * | 2016-06-06 | 2018-11-20 | 北京大学深圳研究生院 | A kind of cross-media retrieval method |
CN108319686B (en) * | 2018-02-01 | 2021-07-30 | 北京大学深圳研究生院 | Antagonism cross-media retrieval method based on limited text space |
- 2018
  - 2018-02-01 CN CN201810101127.0A patent/CN108319686B/en not_active Expired - Fee Related
  - 2018-10-23 WO PCT/CN2018/111327 patent/WO2019148898A1/en active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1211769A (en) * | 1997-06-26 | 1999-03-24 | 香港中文大学 | Method and equipment for file retrieval based on Bayesian network |
CN1920818A (en) * | 2006-09-14 | 2007-02-28 | 浙江大学 | Transmedia search method based on multi-mode information convergence analysis |
US20120303628A1 (en) * | 2011-05-24 | 2012-11-29 | Brian Silvola | Partitioned database model to increase the scalability of an information system |
CN103914711A (en) * | 2014-03-26 | 2014-07-09 | 中国科学院计算技术研究所 | Improved top speed learning model and method for classifying modes of improved top speed learning model |
CN105512289A (en) * | 2015-12-07 | 2016-04-20 | 郑州金惠计算机系统工程有限公司 | Image retrieval method based on deep learning and Hash |
CN105718532A (en) * | 2016-01-15 | 2016-06-29 | 北京大学 | Cross-media sequencing method based on multi-depth network structure |
CN106202413A (en) * | 2016-07-11 | 2016-12-07 | 北京大学深圳研究生院 | A kind of cross-media retrieval method |
CN106649715A (en) * | 2016-12-21 | 2017-05-10 | 中国人民解放军国防科学技术大学 | Cross-media retrieval method based on local sensitive hash algorithm and neural network |
Non-Patent Citations (1)
Title |
---|
LI HUI ET AL.: "Design and Implementation of an Efficient Multi-Pattern Matching Algorithm", Journal of Beijing Technology and Business University (Natural Science Edition) * |
Cited By (40)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019148898A1 (en) * | 2018-02-01 | 2019-08-08 | 北京大学深圳研究生院 | Adversarial cross-media retrieving method based on restricted text space |
CN109344266A (en) * | 2018-06-29 | 2019-02-15 | 北京大学深圳研究生院 | A kind of antagonism cross-media retrieval method based on dual semantics space |
CN109344266B (en) * | 2018-06-29 | 2021-08-06 | 北京大学深圳研究生院 | Dual-semantic-space-based antagonistic cross-media retrieval method |
CN109508400A (en) * | 2018-10-09 | 2019-03-22 | 中国科学院自动化研究所 | Picture and text abstraction generating method |
CN109783655B (en) * | 2018-12-07 | 2022-12-30 | 西安电子科技大学 | Cross-modal retrieval method and device, computer equipment and storage medium |
CN109783655A (en) * | 2018-12-07 | 2019-05-21 | 西安电子科技大学 | A kind of cross-module state search method, device, computer equipment and storage medium |
WO2020143137A1 (en) * | 2019-01-07 | 2020-07-16 | 北京大学深圳研究生院 | Multi-step self-attention cross-media retrieval method based on restricted text space and system |
CN109783657B (en) * | 2019-01-07 | 2022-12-30 | 北京大学深圳研究生院 | Multi-step self-attention cross-media retrieval method and system based on limited text space |
CN109783657A (en) * | 2019-01-07 | 2019-05-21 | 北京大学深圳研究生院 | Multistep based on limited text space is from attention cross-media retrieval method and system |
CN109919162A (en) * | 2019-01-25 | 2019-06-21 | 武汉纺织大学 | For exporting the model and its method for building up of MR image characteristic point description vectors symbol |
CN110059217B (en) * | 2019-04-29 | 2022-11-04 | 广西师范大学 | Image text cross-media retrieval method for two-stage network |
CN110059217A (en) * | 2019-04-29 | 2019-07-26 | 广西师范大学 | A kind of image text cross-media retrieval method of two-level network |
CN110189249A (en) * | 2019-05-24 | 2019-08-30 | 深圳市商汤科技有限公司 | A kind of image processing method and device, electronic equipment and storage medium |
CN110189249B (en) * | 2019-05-24 | 2022-02-18 | 深圳市商汤科技有限公司 | Image processing method and device, electronic equipment and storage medium |
CN110175256A (en) * | 2019-05-30 | 2019-08-27 | 上海联影医疗科技有限公司 | A kind of image data retrieval method, apparatus, equipment and storage medium |
CN112182281B (en) * | 2019-07-05 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Audio recommendation method, device and storage medium |
CN112182281A (en) * | 2019-07-05 | 2021-01-05 | 腾讯科技(深圳)有限公司 | Audio recommendation method and device and storage medium |
CN110502743A (en) * | 2019-07-12 | 2019-11-26 | 北京邮电大学 | Social networks based on confrontation study and semantic similarity is across media search method |
CN110674688B (en) * | 2019-08-19 | 2023-10-31 | 深圳力维智联技术有限公司 | Face recognition model acquisition method, system and medium for video monitoring scene |
CN110674688A (en) * | 2019-08-19 | 2020-01-10 | 深圳力维智联技术有限公司 | Face recognition model acquisition method, system and medium for video monitoring scene |
CN110866129A (en) * | 2019-11-01 | 2020-03-06 | 中电科大数据研究院有限公司 | Cross-media retrieval method based on cross-media uniform characterization model |
CN113094550A (en) * | 2020-01-08 | 2021-07-09 | 百度在线网络技术(北京)有限公司 | Video retrieval method, device, equipment and medium |
CN113094550B (en) * | 2020-01-08 | 2023-10-24 | 百度在线网络技术(北京)有限公司 | Video retrieval method, device, equipment and medium |
CN111259851A (en) * | 2020-01-23 | 2020-06-09 | 清华大学 | Multi-mode event detection method and device |
CN111782921A (en) * | 2020-03-25 | 2020-10-16 | 北京沃东天骏信息技术有限公司 | Method and device for searching target |
WO2021190115A1 (en) * | 2020-03-25 | 2021-09-30 | 北京沃东天骏信息技术有限公司 | Method and apparatus for searching for target |
CN111651660B (en) * | 2020-05-28 | 2023-05-02 | 拾音智能科技有限公司 | Method for cross-media retrieval of difficult samples |
CN111651660A (en) * | 2020-05-28 | 2020-09-11 | 拾音智能科技有限公司 | Method for cross-media retrieval of difficult samples |
CN112818157A (en) * | 2021-02-10 | 2021-05-18 | 浙江大学 | Combined query image retrieval method based on multi-order confrontation characteristic learning |
CN113159071B (en) * | 2021-04-20 | 2022-06-21 | 复旦大学 | Cross-modal image-text association anomaly detection method |
CN113159071A (en) * | 2021-04-20 | 2021-07-23 | 复旦大学 | Cross-modal image-text association anomaly detection method |
CN113379603B (en) * | 2021-06-10 | 2024-03-15 | 大连海事大学 | Ship target detection method based on deep learning |
CN113379603A (en) * | 2021-06-10 | 2021-09-10 | 大连海事大学 | Ship target detection method based on deep learning |
CN113254678A (en) * | 2021-07-14 | 2021-08-13 | 北京邮电大学 | Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof |
CN113254678B (en) * | 2021-07-14 | 2021-10-01 | 北京邮电大学 | Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof |
CN113946710A (en) * | 2021-10-12 | 2022-01-18 | 浙江大学 | Video retrieval method based on multi-mode and self-supervision characterization learning |
CN115114395A (en) * | 2022-04-15 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Content retrieval and model training method and device, electronic equipment and storage medium |
CN115114395B (en) * | 2022-04-15 | 2024-03-19 | 腾讯科技(深圳)有限公司 | Content retrieval and model training method and device, electronic equipment and storage medium |
CN117312592A (en) * | 2023-11-28 | 2023-12-29 | 云南联合视觉科技有限公司 | Text-pedestrian image retrieval method based on modal invariant feature learning |
CN117312592B (en) * | 2023-11-28 | 2024-02-09 | 云南联合视觉科技有限公司 | Text-pedestrian image retrieval method based on modal invariant feature learning |
Also Published As
Publication number | Publication date |
---|---|
CN108319686B (en) | 2021-07-30 |
WO2019148898A1 (en) | 2019-08-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108319686A (en) | Antagonism cross-media retrieval method based on limited text space | |
Spinde et al. | Automated identification of bias inducing words in news articles using linguistic and context-oriented features | |
CN109753566A (en) | The model training method of cross-cutting sentiment analysis based on convolutional neural networks | |
CN103229168B (en) | The method and system that evidence spreads between multiple candidate answers during question and answer | |
CN107076567A (en) | Multilingual image question and answer | |
US20160350288A1 (en) | Multilingual embeddings for natural language processing | |
CN110390018A (en) | A kind of social networks comment generation method based on LSTM | |
CN107688870B (en) | Text stream input-based hierarchical factor visualization analysis method and device for deep neural network | |
CN109614487A (en) | A method of the emotional semantic classification based on tensor amalgamation mode | |
CN113254678B (en) | Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof | |
Barua et al. | F-NAD: an application for fake news article detection using machine learning techniques | |
Liu et al. | Learning to predict population-level label distributions | |
CN109101490B (en) | Factual implicit emotion recognition method and system based on fusion feature representation | |
CN112597302B (en) | False comment detection method based on multi-dimensional comment representation | |
CN108763211A (en) | The automaticabstracting and system of knowledge are contained in fusion | |
CN113722474A (en) | Text classification method, device, equipment and storage medium | |
CN111639176A (en) | Real-time event summarization method based on consistency monitoring | |
Nasrullah et al. | Detection of types of mental illness through the social network using ensembled deep learning model | |
CN113821587B (en) | Text relevance determining method, model training method, device and storage medium | |
CN113934835A (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation | |
Wang et al. | A meta-learning based stress category detection framework on social media | |
Yoon et al. | Image classification and captioning model considering a CAM‐based disagreement loss | |
Wijaya et al. | Hate Speech Detection Using Convolutional Neural Network and Gated Recurrent Unit with FastText Feature Expansion on Twitter | |
CN109325096A (en) | A kind of knowledge resource search system of knowledge based resource classification | |
Oak et al. | Generating clinically relevant texts: A case study on life-changing events |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20210730 |