CN109299342B - Cross-modal retrieval method based on cycle generation type countermeasure network - Google Patents

Cross-modal retrieval method based on cycle generation type countermeasure network

Info

Publication number
CN109299342B
Authority
CN
China
Prior art keywords
data
network
modal
cross
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811455802.6A
Other languages
Chinese (zh)
Other versions
CN109299342A (en)
Inventor
倪立昊 (Ni Lihao)
王骞 (Wang Qian)
邹勤 (Zou Qin)
李明慧 (Li Minghui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201811455802.6A priority Critical patent/CN109299342B/en
Publication of CN109299342A publication Critical patent/CN109299342A/en
Application granted granted Critical
Publication of CN109299342B publication Critical patent/CN109299342B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches, based on distances to training or reference patterns
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods


Abstract

The invention discloses a cross-modal retrieval method based on a cycle generative adversarial network. Data of different modalities flow bidirectionally in the network: each modality is translated into the other by one generative adversarial network, and the generated data serve as the input of the next generative adversarial network, so that data are generated in a bidirectional cycle and the network continuously learns the semantic relationships among cross-modal data. To improve retrieval efficiency, the method also maps the output of the generator intermediate layer to binary hash codes by means of a threshold function and an approximation function, and designs several constraint conditions to guarantee the similarity of same-modality, same-class data and the dissimilarity of data across modalities and classes, thereby further improving retrieval accuracy and stability.

Description

Cross-modal retrieval method based on cycle generation type countermeasure network
Technical Field
The invention belongs to the technical field of multimedia information retrieval, and particularly relates to a cross-modal retrieval method based on a cycle generative adversarial network.
Background Art
With the arrival of the internet era, people can access massive information of various modalities, including pictures, videos, texts and audio, anytime and anywhere. How to obtain the content a user needs from this mass of information has become a central concern of internet users, who typically rely on the precise retrieval services of search engines such as Google, Baidu and Bing. However, most existing internet retrieval services still remain at the level of single-modality retrieval: retrieval applications for cross-modal data are few, their efficiency, accuracy and stability all leave room for improvement, and most of them depend on existing data labels and cannot perform cross-modal retrieval on unlabeled data. Research on novel cross-modal retrieval methods therefore has strong practical significance and value. The key is to directly retrieve similar data of other modalities by establishing the semantic relationships among multi-modal heterogeneous data, so that direct cross-modal retrieval is achieved without labeling all modal data and retrieval performance is further improved.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a cross-modal retrieval method based on a cycle generative adversarial network, which can effectively improve the performance of existing cross-modal retrieval techniques.
In order to achieve the above object, the cross-modal retrieval method based on a cycle generative adversarial network according to the present invention comprises the following steps:
designing two circulation modules which share two generators of identical network structure, and hash-coding the output data of the generator intermediate layer, the purpose of each generator being to learn, through training, to generate cross-modal data that are as realistic as possible;
one circulation module realizes the process modality m → modality t → modality m through the two generators, and the other circulation module likewise realizes the process modality t → modality m → modality t through the same two generators;
designing a respective discriminator for each modality in each circulation module, the discriminator attempting to classify the generated data and the original data of that modality and thereby dynamically opposing the generator, until generator and discriminator reach dynamic balance under the given training conditions.
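For illustration, the following minimal sketch (PyTorch-style; the layer sizes, the Sigmoid middle layer and the names G_m2t, G_t2m are this document's assumptions, not the patent's reference implementation) shows how two shared generators can realize both circulation modules:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """One of the two shared generators; its middle-layer output is the
    common-subspace feature later used for hash coding."""
    def __init__(self, d_in, d_hash, d_out):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(d_in, 512), nn.ReLU(),
                                    nn.Linear(512, d_hash), nn.Sigmoid())
        self.decode = nn.Sequential(nn.Linear(d_hash, 100), nn.ReLU(),
                                    nn.Linear(100, d_out))

    def forward(self, x):
        common = self.encode(x)            # middle layer, values in (0, 1)
        return self.decode(common), common

# the two circulation modules share this single pair of generators
G_m2t = Generator(d_in=4096, d_hash=128, d_out=2000)   # modality m -> t
G_t2m = Generator(d_in=2000, d_hash=128, d_out=4096)   # modality t -> m

m_ori = torch.randn(4, 4096)               # a batch of modality-m features
t_ori = torch.randn(4, 2000)               # a batch of modality-t features

t_gen, _ = G_m2t(m_ori); m_cyc, _ = G_t2m(t_gen)   # module 1: m -> t -> m
m_gen, _ = G_t2m(t_ori); t_cyc, _ = G_m2t(m_gen)   # module 2: t -> m -> t
```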
Further, aiming at the characteristics of multi-modality, multi-class data, manifold constraints are adopted under the unsupervised condition to guarantee the similarity and dissimilarity of data across modalities and classes; under the supervised condition, since class labels are given, a triplet constraint is adopted to minimize the feature distance between same-class data of different modalities and to maximize the feature distance between different-class data across modalities.
Further, the loss function of the discriminators is specifically:

$$L_{disc}=\frac{1}{n}\sum_{i=1}^{n}\Big[D_{img}\big(m_{cyc}^{(i)}\big)-D_{img}\big(m_{ori}^{(i)}\big)+D_{txt}\big(t_{cyc}^{(i)}\big)-D_{txt}\big(t_{ori}^{(i)}\big)\Big]$$
the cyclic loss function, obtained by comparing the finally generated data of a modality with the original data of the same modality, is:

$$L_{cyc}=\frac{1}{n}\sum_{i=1}^{n}\Big(\big\|m_{cyc}^{(i)}-m_{ori}^{(i)}\big\|_2+\big\|t_{cyc}^{(i)}-t_{ori}^{(i)}\big\|_2\Big)$$

wherein i indexes the data used in the i-th calculation and n is the total number of training samples; during training the discriminators iterate in the direction that continuously reduces $L_{disc}$; $D_{img}$ and $D_{txt}$ denote the two discriminators, $(m_{ori}, t_{ori})$ the original feature vectors of modality m and modality t, and $(m_{cyc}, t_{cyc})$ the feature vectors of modality m and modality t produced by one pass through the cyclic network.
Still further, the cyclic term of the generator loss function is specifically:

$$L_{gen}^{cyc}=\theta_1\cdot L_{cyc}$$

wherein $\theta_1$ is a hyper-parameter of the network, and $\|\cdot\|_2$ (as used in $L_{cyc}$ above) denotes the L2 distance.
Further, let the feature vectors output by the intermediate layers of the two generators be $m_{com}$ and $t_{com}$; the hash codes are generated as:

$$m_{hash}=\mathrm{sgn}(m_{com}-0.5)$$
$$t_{hash}=\mathrm{sgn}(t_{com}-0.5)$$

wherein sgn is a threshold function: for each floating-point entry of the intermediate-layer feature vector, the corresponding hash-code bit is set to +1 when the value is greater than 0.5 and to -1 when it is less than 0.5.
Still further, in order to quantify the approximation error between the feature vector and the generated hash code, the method designs a corresponding loss function as a constraint, specifically using the likelihood of the hash code conditioned on the feature vector. Let $h_j^{(i)}$ be the j-th bit of the hash code of the i-th sample and $f_j^{(i)}$ the j-th entry of its feature vector (a sample may be either an image or a text); then:

$$p\big(h_j^{(i)}\,\big|\,f_j^{(i)}\big)=\sigma\big(f_j^{(i)}\big)^{\frac{1+h_j^{(i)}}{2}}\Big(1-\sigma\big(f_j^{(i)}\big)\Big)^{\frac{1-h_j^{(i)}}{2}}$$

wherein $\sigma$ is a sigmoid function of the feature entry:

$$\sigma\big(f_j^{(i)}\big)=\frac{1}{1+e^{-f_j^{(i)}}}$$

A loss function is further designed from this likelihood to evaluate the approximation error between the feature vector and the generated hash code:

$$L_{hash}=-\frac{1}{n\,d_{hash}}\sum_{i=1}^{n}\sum_{j=1}^{d_{hash}}\log p\big(h_j^{(i)}\,\big|\,f_j^{(i)}\big)$$

where n is the total number of samples and $d_{hash}$ is the number of bits of the vector.
Furthermore, the invention applies a class constraint to the feature vector of the generator intermediate layer, for which a class loss function is designed as:

$$L_{class}=\frac{1}{n}\sum_{i=1}^{n}\big\|\hat{c}_i-c_i\big\|_2$$

wherein $\hat{c}_i$ is the predicted class of the i-th sample, obtained by passing its feature vector $f^{(i)}$ through a small classification network, and $c_i$ is the actual class label of the sample; the class loss function thus computes the L2 distance between the two.
In order to apply a similarity constraint to cross-modal pairs of same-class data, the method links each training image sample with its similar text sample and designs a loss function to constrain the cross-modal homogeneous data:

$$L_{pair}=\frac{1}{n}\sum_{i=1}^{n}\big\|\tilde{m}^{(i)}-\tilde{t}^{(i)}\big\|_2$$

wherein $\tilde{m}^{(i)}$ and $\tilde{t}^{(i)}$ are the feature vectors produced in the common subspace of image and text by the generators $G_{t\to m}$ and $G_{m\to t}$ respectively, and the loss function computes the L2 distance between corresponding semantically similar cross-modal data.
Under supervised data training, since all data carry class labels, a triplet constraint is used to minimize the distance between cross-modal data vectors under the same semantic label, and the triplet loss function is designed as:

$$L_{tri}=\frac{1}{n}\sum_{i=1}^{n}\Big[\big\|m^{*}_{\alpha,i}-t_{\alpha,i}\big\|_2-\big\|m^{*}_{\alpha,i}-t_{\beta,i}\big\|_2+\big\|t^{*}_{\alpha,i}-m_{\alpha,i}\big\|_2-\big\|t^{*}_{\alpha,i}-m_{\beta,i}\big\|_2\Big]$$

wherein m and t denote image and text data respectively, α and β denote two class labels, * marks generated data, and i indexes the data used in the i-th calculation. For the unsupervised training case, the method designs manifold constraints to preserve the similarity of semantically similar data within and across modalities: after a kNN matrix is computed, a similarity matrix is built for the data to be constrained, and the manifold constraint is applied to the feature vectors in the common subspace; the manifold constraint loss function is designed as:

$$L_{mani}=\frac{1}{n}\sum_{i=1}^{n}\Big[\big\|m^{*}_{i}-m^{*}_{neib,i}\big\|_2-\big\|m^{*}_{i}-m^{*}_{non,i}\big\|_2+\big\|t^{*}_{i}-t^{*}_{neib,i}\big\|_2-\big\|t^{*}_{i}-t^{*}_{non,i}\big\|_2\Big]$$

where neib and non denote adjacent and non-adjacent data respectively, and the other symbols have the same meanings as before.
Further, integrating the loss functions designed above, the generator loss function under supervised training is designed as:

$$L_{gen}^{sup}=\theta_1 L_{cyc}+\theta_2 L_{hash}+\theta_3 L_{tri}+\theta_4 L_{pair}+\theta_5 L_{class}$$

and under unsupervised training as:

$$L_{gen}^{unsup}=\theta_1 L_{cyc}+\theta_2 L_{hash}+\theta_3 L_{mani}+\theta_4 L_{pair}$$

wherein $\theta_2,\theta_3,\theta_4,\theta_5$ are all weighting hyper-parameters of the network. The whole network is trained iteratively with the RMSProp stochastic gradient descent optimization algorithm:

$$w_{disc}\leftarrow w_{disc}-\mu\cdot\mathrm{RMSProp}\big(w_{disc},\nabla_{w_{disc}}L_{disc}\big)$$

$$w_{gen}\leftarrow w_{gen}-\mu\cdot\mathrm{RMSProp}\big(w_{gen},\nabla_{w_{gen}}L_{gen}\big)$$

Because the gradient of the discriminator decreases quickly in practice, the designed network iterates the discriminators once for every S iterations of the generators during training, and the hyper-parameters $c_{gen}$, $c_{disc}$ are used to prune (clip) the network weights so that they do not become too large.
The invention has the advantages that:
the invention better establishes the semantic relation among multi-mode data by utilizing the cycle generation type antagonistic network constructed by two groups of generators and discriminators, designs various constraint conditions to improve the accuracy and stability of retrieval, adopts binary hash codes to replace original characteristics to improve the retrieval efficiency, and researches and explores a novel cross-modal retrieval method based on the cycle generation type antagonistic network, particularly aiming at the cross-modal retrieval between images and texts.
Drawings
FIG. 1 is an overall architecture diagram of a neural network according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of triple constraints according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of the manifold constraint according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and specific examples:
in recent years, with the heat of artificial intelligence, deep learning technology is gradually emerging and affects many fields of computer science, and more people in the technical field of multimedia information retrieval use deep learning to improve the accuracy and stability of the existing retrieval. The generation type confrontation network (generating adaptive network) adopted in the method is a novel neural network which is widely used in recent years and estimates a generation model through the confrontation process, a generator (generator) for learning data distribution and a discriminator (discriminator) for discriminating the authenticity of data are trained in the network at the same time, and the generator and the discriminator mutually confront in the training process to finally achieve dynamic balance. The generative countermeasure network is widely applied to the fields of image generation, semantic segmentation, data enhancement and the like, and can well learn the data distribution rule of the training sample according to the loss function and generate new data similar to the training sample. The method utilizes two groups of generation type countermeasure networks to form a novel circulation network, and improves the efficiency, accuracy and stability of the network when the network is used for multi-mode retrieval through the hash code and various constraint conditions.
The invention provides a cross-modal retrieval method based on a cycle generative adversarial network, whose core is a newly designed neural network; its overall structure is shown in FIG. 1. The embodiment takes mutual retrieval between image and text data as an example to describe the neural network framework and the data processing flow of the invention, as follows:
firstly, in the embodiment, the original two-dimensional image data actually needs to be subjected to preliminary processing, in the embodiment, 19-layer VGGNet popular in the deep learning field is selected, and 4096-dimensional feature vector output by fc7 layer of VGGNet is used as input original image feature moriI.e. the image characteristic dimension dimg4096. Meanwhile, the input original text data is processed into a preliminary feature vector, the embodiment uses a conventional Bag-of-Words (Bag-of-Words) model to process the text data, the length of the obtained BoW vector is related to the text data and a specific selected processing method, and for reference, the BoW vector dimension in the embodiment is set to 2000 dimensions, namely the text feature dimension dtxt2000, and the vector is used as the input original text feature tori
Step 1, designing a first generative adversarial network comprising a generator $G_{m\to t}$ and a discriminator $D_{txt}$, which, from the input original image-original text data pair $(m_{ori}, t_{ori})$, obtains generated text data $t_{gen}$, thereby extracting a transformation mode for generating text data from image data and obtaining the semantic relationship between image and text data. The specific implementation is described as follows:
as shown in FIG. 1, the upper half of the network can be regarded as a first group of generative confrontation networks, which mainly comprise a generator Gm→tAnd a discriminator DtxtWhen the input is the original image-original text dataTo (m)ori,tori). Data flows in the network, original image moriThrough generator Gm→tObtaining a generated text tgenI.e. tgen=Gm→t(mori) And wishes to generate a text tgenAs much as possible with the original text toriSimilarly. Generator Gm→tIs composed of multiple layers of one-dimensional convolution layers, in which the feature vector dimension is changed to dimg→512→dhash→100→dtxt。dimgA dimension representing the input original image feature, 4096 in this embodiment; dhashThe dimension of the intermediate layer features to be used for hash code generation is determined by the required hash code length, and the dimension can be 64, 128, 256 and the like; dtxtThe dimension of the original text feature input in the network, which is also the feature length of the generated text, is 2000 in this embodiment. At the same time, the discriminator DtxtAnd generator Gm→tDynamic confrontation is carried out, and original text characteristics t are tried to be distinguishedgenAnd generating text feature tori. Discriminator DtxtIs a feedforward neural network composed of fully-connected layers, in which the characteristic dimension is changed to dtxt→ 512 → 16. Generator G when the generator and the arbiter reach dynamic balance under given training conditionsm→tThe transformation mode of generating the text data according to the image data can be well extracted, so that the semantic relation between the original image and the generated text data is obtained.
Step 2, designing a second generative adversarial network comprising a generator $G_{t\to m}$ and a discriminator $D_{img}$, whose input is the original image-generated text data pair $(m_{ori}, t_{gen})$ obtained in the previous step, and which produces the cyclic image $m_{cyc}$, thereby extracting a transformation mode for generating image data from text data and obtaining the semantic relationship between text and image data. The specific implementation is described as follows:
as shown in FIG. 1, the lower half of the network can be regarded as a second group of generative countermeasure networks, mainly comprising a generator Gt→mAnd a discriminator DimgWhen the input is the original image-generated textData pair (m)ori,tgen). Data flows in the network, generating text tgenThrough generator Gt→mObtaining a cyclic image mcycI.e. mcyc=Gt→m(tgen)=Gt→m(Gm→t(mori) And hopefully loop image feature m)cycAnd original image features moriAs similar as possible. Generator Gt→mIs composed of multiple layers of one-dimensional reverse convolution layers, in which the feature vector dimension is changed to dtxt→100→dhash→512→dimg。dtxtThe dimension of the original text features input in the network, which is 2000 in this embodiment; dhashThe dimension of the intermediate layer features to be used for hash code generation is determined by the required hash code length, and can be 64, 128, 256 and the like, and is the same as the hash code length in the first group of generative countermeasure network; dimgThe dimension representing the input original image feature, which is also the length of the last generated loop image feature, is 4096 in this embodiment. At the same time, the discriminator DimgAnd generator Gt→mDynamic countermeasure is performed in an attempt to distinguish the cyclic image features mcycAnd original image features mori. Discriminator DimgIs a feedforward neural network composed of fully-connected layers, in which the characteristic dimension is changed to dimg→ 512 → 100 → 16. When the generator and the discriminator reach dynamic balance under given training conditions, the transformation mode of generating the image data according to the text data can be well extracted, and the semantic relation between the generated text and the circular image data is obtained.
Step 3, using the two generative adversarial networks designed in the previous two steps with the data flow direction reversed, so that the transformation between text and image data is likewise realized in the opposite direction and the semantic relationship between image and text data is obtained. That is, the first two steps are combined in reverse order: first, the second generative adversarial network maps the input original text features $t_{ori}$ to generated image features $m_{gen}$, obtaining the semantic relationship between text and image data; then the first generative adversarial network maps the generated image features $m_{gen}$ to cyclic text features $t_{cyc}$, obtaining the semantic relationship between image and text data. In this way image and text data flow cyclically through the two generative adversarial networks, adversarial generation takes place, and the network is continuously optimized during training. The specific implementation is described as follows:
the input data is still the original image-original text data pair (m)ori,tori) First of all, in contrast to the sequence in which the two steps are carried out above, the generator G of the antagonistic network is confronted with the second set of generating equationst→mOriginal text feature t to be inputoriGenerating to generate image features Gt→mI.e. mgen=Gt→m(tori) Generator Gt→mThe feature vector dimension change in (2) is the same as before, and is dtxt→100→dhash→512→dimg. At the same time, the discriminator DimgAnd generator Gt→mPerforming dynamic countermeasure to try to distinguish the original image features moriAnd generating image features mgen. Generator G after the confrontation reaches dynamic balancet→mThe semantic relationship between the original text-generated image data can be learned. Then, the generator G of the first group of generation type countermeasure network is usedm→tWill generate image features mgenGenerated as a circular text feature tcycI.e. tcyc=Gm→t(mten)=Gm→t(Gt→m(tori) Generator G)m→tThe feature vector dimension change in (2) is the same as before, and is dimg→512→dhash→100→dtxt. At the same time, the discriminator DtxtAnd generator Gm→tDynamic confrontation is carried out, and original text characteristics t are tried to be distinguishedoriAnd a recurring text feature tcyc. Generator G after the confrontation reaches dynamic balancem→tThe semantic relationship between the generated image-loop text data can be learned.
Through steps 1, 2 and 3, a bidirectional cyclic flow channel for the image and text data of this embodiment is established in the network. In one channel, the original image features $m_{ori}$ first pass through the first generative adversarial network to give generated text features $t_{gen}$, and $t_{gen}$ then passes through the second generative adversarial network to give cyclic image features $m_{cyc}$. In the other channel, the original text features $t_{ori}$ first pass through the second generative adversarial network to give generated image features $m_{gen}$, and $m_{gen}$ then passes through the first generative adversarial network to give cyclic text features $t_{cyc}$. Image and text data are thus generated in a bidirectional cycle through the two networks, while the discriminators $D_{img}$ and $D_{txt}$ oppose the generators, improving how well the network learns the semantic relationships among cross-modal data. The loss function of the discriminators $D_{img}$ and $D_{txt}$ is designed as:
$$L_{disc}=\frac{1}{n}\sum_{i=1}^{n}\Big[D_{img}\big(m_{cyc}^{(i)}\big)-D_{img}\big(m_{ori}^{(i)}\big)+D_{txt}\big(t_{cyc}^{(i)}\big)-D_{txt}\big(t_{ori}^{(i)}\big)\Big]$$

wherein i indexes the data used in the i-th calculation and n is the total number of training samples; during training the discriminators iterate in the direction that continuously reduces $L_{disc}$. After the bidirectional cyclic generative adversarial network is constructed, one of its advantages is that the finally obtained cyclic data can be compared with the original data to give a cyclic loss function, which is also an important component of the generator loss:

$$L_{cyc}=\frac{1}{n}\sum_{i=1}^{n}\Big(\big\|m_{cyc}^{(i)}-m_{ori}^{(i)}\big\|_2+\big\|t_{cyc}^{(i)}-t_{ori}^{(i)}\big\|_2\Big)$$

which enters the generator loss weighted as $\theta_1\cdot L_{cyc}$, wherein $\theta_1$ is a hyper-parameter of the network, 0.001 in this embodiment, and $\|\cdot\|_2$ denotes the L2 distance.
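Given the cycle outputs of the sketches above, the two reconstructed losses can be computed as follows; the critic-style form of the discriminator loss mirrors the reconstruction given here and is therefore an assumption:

```python
import torch

def disc_loss(D_img, D_txt, m_ori, t_ori, m_cyc, t_cyc):
    # discriminators learn to score original features high, cyclic ones low
    return (D_img(m_cyc).mean() - D_img(m_ori).mean()
            + D_txt(t_cyc).mean() - D_txt(t_ori).mean())

def cyc_loss(m_ori, t_ori, m_cyc, t_cyc):
    # mean L2 distance between cyclic and original features (L_cyc);
    # it enters the generator loss weighted by theta_1 = 0.001
    return ((m_cyc - m_ori).norm(dim=1).mean()
            + (t_cyc - t_ori).norm(dim=1).mean())
```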
Step 4, in order to improve the efficiency of cross-modal retrieval in practical applications, the method applies a threshold function to extract, from the common subspaces of the two generative adversarial networks' generators, hash codes $m_{hash}$ and $t_{hash}$ that represent image and text features respectively, and designs a likelihood function to evaluate the approximation error between features and hash codes. The specific implementation is described as follows:
in the two sets of generative countermeasure networks, since the input and output of the generator are respectively the feature data of different modalities, the present example uses the middle layer of the generator as the common subspace of the cross-modality data (as shown in fig. 1), and in the above steps, the feature length of the layer is designed to be the length d of the required hash codehash. Let the feature vector of the middle layer be mcomAnd tcomThe generated formula is mhash=sgn(mcom-0.5) and thash=sgn(tcom-0.5), where sgn is a threshold function, the formula means that for each floating point number in the intermediate layer floating point type feature vector, the corresponding hash code bit is set to +1 if the value is greater than 0.5 and to-1 if the value is less than 0.5. Such thresholding will apply to each bit of the feature vector of each training sample, and each training sample will have a hash code of the same length as the feature vector. Hash code m is used in the embodimentshash、thashSurrogate common subspace feature vector mcom、tcomBy searching, the distance calculation between different floating point type characteristic vectors in the original searching can be replaced by the Hamming distance calculation between Hash codes, and the calculating speed of searching is greatly improved.
In order to quantify the approximation error between the feature vector and the generated hash code, this embodiment designs a corresponding loss function as a constraint, using the likelihood of the hash code conditioned on the feature vector. Let $h_j^{(i)}$ be the j-th bit of the hash code of the i-th sample and $f_j^{(i)}$ the j-th entry of its feature vector (a sample may be either an image or a text); then:

$$p\big(h_j^{(i)}\,\big|\,f_j^{(i)}\big)=\sigma\big(f_j^{(i)}\big)^{\frac{1+h_j^{(i)}}{2}}\Big(1-\sigma\big(f_j^{(i)}\big)\Big)^{\frac{1-h_j^{(i)}}{2}}$$

wherein $\sigma$ is a sigmoid function of the feature entry:

$$\sigma\big(f_j^{(i)}\big)=\frac{1}{1+e^{-f_j^{(i)}}}$$

From this likelihood the embodiment designs a loss function to evaluate the approximation error between the feature vector and the generated hash code:

$$L_{hash}=-\frac{1}{n\,d_{hash}}\sum_{i=1}^{n}\sum_{j=1}^{d_{hash}}\log p\big(h_j^{(i)}\,\big|\,f_j^{(i)}\big)$$

where n is the total number of samples and $d_{hash}$ is the number of bits of the vector. This loss function evaluating the hash code approximation error acts as one of the network's constraints during training.
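Under the reconstructed likelihood above, the constraint could be implemented as follows (the use of torch.sigmoid for σ matches that reconstruction and is therefore an assumption):

```python
import torch

def hash_loss(common: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of h = sgn(f - 0.5) given the
    common-subspace features f, averaged over samples and bits."""
    h = torch.where(common > 0.5,
                    torch.ones_like(common), -torch.ones_like(common))
    p_plus = torch.sigmoid(common)               # p(h = +1 | f)
    p = torch.where(h > 0, p_plus, 1.0 - p_plus)
    return -torch.log(p).mean()
```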
Step 5, in order to build a network model with better performance, this embodiment constrains the data features generated during network training with several constraint conditions so that more class information is retained, improving retrieval accuracy. Aiming at the characteristics of multi-modality, multi-class data, manifold constraints are adopted under the unsupervised condition to guarantee the similarity and dissimilarity of data across modalities and classes; under the supervised condition, since sample class labels are given, a triplet constraint is adopted to minimize the feature distance between same-class data of different modalities and to maximize the feature distance between different-class data across modalities. The specific implementation is described as follows:
another small classification network is introduced under supervision to perform class constraints on the feature vectors obtained by the generator common subspace. For a supervised cross-modal data set, that is, when a data sample for training has a class label, in order to make full use of the data class label, the present embodiment utilizes a small classification network to perform class representation on a common subspace, and designs a class loss function to constrain generation of a common subspace feature vector, so that the common subspace feature vector is different from other layer vectors, carries stronger class information, and can be correctly classified during prediction classification. The class loss function is formulated as:
Figure BDA0001887718790000121
wherein
Figure BDA0001887718790000122
Is the feature vector of the ith sample
Figure BDA0001887718790000123
Predicted classes of samples obtained via a small classification network, ciIs the actual class label of the sample and the class loss function actually calculates the L2 distance between the two.
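A sketch of this class constraint; the hidden width of the small classification network and the one-hot label encoding are assumptions:

```python
import torch
import torch.nn as nn

d_hash, n_classes = 128, 10
classifier = nn.Sequential(nn.Linear(d_hash, 64), nn.ReLU(),
                           nn.Linear(64, n_classes))  # the 'small' network

def class_loss(common: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """L2 distance between predicted class vectors and one-hot labels."""
    onehot = torch.eye(n_classes)[labels]     # (n,) int labels -> one-hot
    return (classifier(common) - onehot).norm(dim=1).mean()
```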
Similarity constraints are applied to cross-modal pairs of same-class data. Among cross-modal data there are many semantically similar training pairs; for example, an image sample and a text sample in the training data may have high semantic similarity and share similar class attributes. To exploit this property, the embodiment links each training image sample with its similar text sample and designs a loss function to constrain such cross-modal homogeneous data, as sketched after the formula:

$$L_{pair}=\frac{1}{n}\sum_{i=1}^{n}\big\|\tilde{m}^{(i)}-\tilde{t}^{(i)}\big\|_2$$

wherein $\tilde{m}^{(i)}$ and $\tilde{t}^{(i)}$ are the feature vectors produced in the common subspace of image and text by the generators $G_{t\to m}$ and $G_{m\to t}$ respectively, and the loss function computes the L2 distance between corresponding semantically similar cross-modal data.
$L_{pair}$ is extended further: the embodiment simultaneously considers the similarity of same-class data across modalities and within the same modality, i.e. the distance between the feature vectors of paired cross-modal training data, and of same-modality data with similar semantics, should be smaller than the distance between feature vectors with dissimilar semantics. In the supervised training case, since all data carry class labels, a triplet constraint is used to minimize the distance between cross-modal data vectors under the same semantic label. The triplet constraint is illustrated in FIG. 2, where different icon shapes represent different data classes and different textures represent different modalities: in the feature space, data should lie closer to same-modality data and to same-class data of other modalities, and farther from different-class data of other modalities. Taking generated image data $m^{*}_{\alpha,i}$ as an example (the generated data inherit the class label of the original input data), text data $t_{\alpha,i}$ with the same class label is selected, and text data $t_{\beta,i}$ of a different class is selected at random, where α and β denote the two class labels, * marks generated data, and i indexes the data used in the i-th calculation; the triplet constraint for the generated image minimizes the distance between $m^{*}_{\alpha,i}$ and $t_{\alpha,i}$ while maximizing the distance between $m^{*}_{\alpha,i}$ and $t_{\beta,i}$. Similarly, for generated text $t^{*}_{\alpha,i}$, the triplet constraint involves $m_{\alpha,i}$ and $m_{\beta,i}$. The triplet constraint loss function is therefore designed as:

$$L_{tri}=\frac{1}{n}\sum_{i=1}^{n}\Big[\big\|m^{*}_{\alpha,i}-t_{\alpha,i}\big\|_2-\big\|m^{*}_{\alpha,i}-t_{\beta,i}\big\|_2+\big\|t^{*}_{\alpha,i}-m_{\alpha,i}\big\|_2-\big\|t^{*}_{\alpha,i}-m_{\beta,i}\big\|_2\Big]$$
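A sketch of this triplet term under the reconstruction above (the plain pull-minus-push form, without a margin, is an assumption):

```python
import torch

def triplet_loss(m_gen_a, t_same, t_diff, t_gen_a, m_same, m_diff):
    """Pull generated features toward same-class features of the other
    modality, push them away from different-class ones."""
    pull = (m_gen_a - t_same).norm(dim=1) + (t_gen_a - m_same).norm(dim=1)
    push = (m_gen_a - t_diff).norm(dim=1) + (t_gen_a - m_diff).norm(dim=1)
    return (pull - push).mean()
```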
for the unsupervised training situation, the embodiment designs manifold constraints to ensure the similarity of semantically similar data in the homomodal and cross-modal data. Because the data does not contain class labels when unsupervised data training is adopted, the k-neighbor matrix is constructed in the embodiment to ensure that the data with similar semantics are aggregated and the data with different semantics are separated. As shown in fig. 3, in this embodiment, after the kNN matrix is calculated, a similarity matrix is established for the data to be constrained, and then the feature vector is subjected to manifold constraint in the common subspace. To be composed of text data tαThe obtained generated image data
Figure BDA0001887718790000136
For example, according to tαCalculating the k NN matrix of (1), and calculating tαK (k is set to 2 in this embodiment) nearest data are denoted as 1 in the similarity matrix, and data that are not adjacent are denoted as 0 in the similarity matrix. After the text data are generated to obtain the image characteristic vectors, randomly selecting the generated image characteristic vectors corresponding to the text data with 1 in the similarity matrix as the image characteristic vectors
Figure BDA0001887718790000141
Generating image feature vectors corresponding to the text data of 0 in the similarity matrix as
Figure BDA0001887718790000142
In the stream constraint, minimize
Figure BDA0001887718790000143
And
Figure BDA0001887718790000144
the distance between the semantic similar data and the semantic similar data is ensured to be high in similarity of generated feature vectors, and the maximum is achieved
Figure BDA0001887718790000145
And
Figure BDA0001887718790000146
the distance between the semantic data and the semantic data ensures that the similarity of generated feature vectors of different semantic data is low. For generating text data, the same holds
Figure BDA0001887718790000147
Figure BDA0001887718790000148
To perform manifold constraints. The manifold constraint loss function is therefore designed as follows:
Figure BDA0001887718790000149
in summary, we can obtain a generator loss function composed of loss functions with various constraints. In the case of supervised data training, the generator loss function consists of a cyclic loss function
Figure BDA00018877187900001410
Hash code loss function
Figure BDA00018877187900001411
Triplet constrained loss function
Figure BDA00018877187900001412
Cross-modal homogeneous data loss function
Figure BDA00018877187900001413
And class loss function
Figure BDA00018877187900001414
The formula is as follows:
Figure BDA00018877187900001415
wherein theta is2,θ3,θ4,θ5The adjustable hyper-parameters of the network are set to 5, 5, 0.001, 20, respectively, in this embodiment. On unsupervised dataIn the training case, the generator loss function consists of a cyclic loss function
Figure BDA00018877187900001416
Hash code loss function
Figure BDA00018877187900001417
Manifold constraint loss function
Figure BDA00018877187900001418
Cross-modal homogeneous data loss function
Figure BDA00018877187900001419
Composition, the formula is as follows:
Figure BDA00018877187900001420
the value of the hyper-parameter is the same as previously set.
Combining the above five steps, after designing the discriminator loss function and the generator loss function, a standard minimax procedure is used to iteratively minimize the network losses and thereby establish the semantic relationships among the multi-modal data. The minimax procedure in the embodiment uses stochastic gradient descent optimization, specifically the more stable RMSProp optimization algorithm. Since the discriminators and the generators oppose each other, their update computations work in opposite directions: in each round of iteration each side counters the other's previous round, until dynamic balance is reached in the opposition. The updates are computed as:

$$w_{disc}\leftarrow w_{disc}-\mu\cdot\mathrm{RMSProp}\big(w_{disc},\nabla_{w_{disc}}L_{disc}\big)$$

$$w_{gen}\leftarrow w_{gen}-\mu\cdot\mathrm{RMSProp}\big(w_{gen},\nabla_{w_{gen}}L_{gen}\big)$$

Because the discriminator trains quickly in practice, the designed network iterates the generators S times for each single iteration of the discriminators. In this embodiment the training hyper-parameter S is set to 10, the learning rate μ of the network to 0.0001, and the batch size per training step to 64. Meanwhile, the weights learned in the network are pruned: at each training step, generator weights larger than $c_{gen}$ are set to $c_{gen}$, and discriminator weights larger than $c_{disc}$ are set to $c_{disc}$, preventing the learned weights from becoming too large.
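A compressed sketch of this training schedule with torch.optim.RMSprop; the linear stand-in modules, the cycle-only generator loss and the clipping bound 0.01 are placeholders for the full networks, losses and (unstated) clipping values described above:

```python
import torch
import torch.nn as nn

# linear stand-ins so this sketch runs; the real modules are those above
G_m2t, G_t2m = nn.Linear(4096, 2000), nn.Linear(2000, 4096)
D_img, D_txt = nn.Linear(4096, 1), nn.Linear(2000, 1)

S, mu, batch_size = 10, 1e-4, 64          # embodiment values
c_gen = c_disc = 0.01                     # clipping bounds (assumed values)

opt_g = torch.optim.RMSprop([*G_m2t.parameters(), *G_t2m.parameters()], lr=mu)
opt_d = torch.optim.RMSprop([*D_img.parameters(), *D_txt.parameters()], lr=mu)

for step in range(100):
    m_ori = torch.randn(batch_size, 4096)     # would come from the data loader
    t_ori = torch.randn(batch_size, 2000)
    t_gen = G_m2t(m_ori); m_cyc = G_t2m(t_gen)
    m_gen = G_t2m(t_ori); t_cyc = G_m2t(m_gen)

    # generator step (constraint terms of step 5 omitted for brevity)
    loss_g = ((m_cyc - m_ori).norm(dim=1).mean()
              + (t_cyc - t_ori).norm(dim=1).mean())
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    for p in [*G_m2t.parameters(), *G_t2m.parameters()]:
        p.data.clamp_(-c_gen, c_gen)          # prune oversized weights

    if (step + 1) % S == 0:      # one discriminator step per S generator steps
        loss_d = (D_img(m_cyc.detach()).mean() - D_img(m_ori).mean()
                  + D_txt(t_cyc.detach()).mean() - D_txt(t_ori).mean())
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        for p in [*D_img.parameters(), *D_txt.parameters()]:
            p.data.clamp_(-c_disc, c_disc)
```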
Step 6, using the trained neural network for cross-modal data retrieval: the feature vectors obtained in the generator common subspace are compressed into hash codes, and retrieval uses the Hamming distance between the hash codes of different data. The specific implementation is described as follows:
After the image and text data of the embodiment have been trained and learned by the network, the generators have acquired a way of extracting the semantic-relationship information among cross-modal data, and bidirectional cross-modal retrieval can now be performed. First the weight parameters of the trained network are fixed; the image and text data to be retrieved, $m_{test}$ and $t_{test}$, are passed through the trained generators $G_{m\to t}$ and $G_{t\to m}$ to obtain the feature vectors $m_{com}$ and $t_{com}$ on the common subspace, which are then converted into the hash codes $m_{hash}$ and $t_{hash}$ and stored. When retrieving text with an image, the hash code $m_{hash}^{test}$ of the image is taken and its Hamming distance to all text hash codes is computed; the text represented by the nearest hash code $t_{hash}^{near}$ is the result of the image → text cross-modal retrieval. When retrieving images with a text, the hash code $t_{hash}^{test}$ of the text is taken and its Hamming distance to all image hash codes is computed; the image represented by the nearest hash code $m_{hash}^{near}$ is the result of the text → image cross-modal retrieval.
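As a usage sketch, image → text retrieval then reduces to a nearest-neighbour search in Hamming space (the top-k interface and the random stand-in codes are illustrative):

```python
import numpy as np

def retrieve(query_hash: np.ndarray, db_hashes: np.ndarray, topk: int = 5):
    """Indices of the topk database codes nearest to the query in
    Hamming distance; codes are {-1, +1} vectors."""
    d = np.count_nonzero(db_hashes != query_hash, axis=1)
    return np.argsort(d)[:topk]

text_db = np.sign(np.random.rand(1000, 128) - 0.5).astype(np.int8)  # t_hash
m_query = np.sign(np.random.rand(128) - 0.5).astype(np.int8)  # query m_hash
print(retrieve(m_query, text_db))   # nearest texts: the image -> text result
```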
The above embodiments are only used for illustrating the design idea and features of the present invention, and the purpose of the present invention is to enable those skilled in the art to understand the content of the present invention and implement the present invention accordingly, and the protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes and modifications made in accordance with the principles and concepts disclosed herein are intended to be included within the scope of the present invention.

Claims (6)

1. A cross-modal retrieval method based on a cycle generation type countermeasure network is characterized by comprising the following steps:
designing two circulation modules, wherein one circulation module realizes the process from the image to the text to the image through two generators, and the other circulation module realizes the process from the text to the image to the text through the two generators; the two circulation modules share two generators with the same network structure, and the output data of the middle layer of the generators are subjected to Hash coding;
and designing a discriminator in each cycle module, wherein the discriminator classifies the generated data and the original data of the same mode and performs dynamic confrontation with the generator, and finally the generator and the discriminator reach dynamic balance under given training conditions.
2. The cross-modal retrieval method based on the recurrent countermeasure network of claim 1, wherein:
aiming at the characteristics of multi-mode and multi-class data streams, manifold constraints are adopted to ensure the similarity and difference of data among modes and classes under an unsupervised condition; under supervision conditions, due to the fact that class labels are given, the characteristic distance between data of the same type and different modes is minimized by adopting the triple constraint, and the characteristic distance between data of neither the same type nor the different modes is maximized.
3. The cross-modal retrieval method based on the recurrent countermeasure network of claim 2, wherein:
the loss function of the discriminators is specifically:

$$L_{disc}=\frac{1}{n}\sum_{i=1}^{n}\Big[D_{img}\big(m_{cyc}^{(i)}\big)-D_{img}\big(m_{ori}^{(i)}\big)+D_{txt}\big(t_{cyc}^{(i)}\big)-D_{txt}\big(t_{ori}^{(i)}\big)\Big]$$

the cyclic loss function, obtained by comparing the finally generated data of a modality with the original data of the same modality, is:

$$L_{cyc}=\frac{1}{n}\sum_{i=1}^{n}\Big(\big\|m_{cyc}^{(i)}-m_{ori}^{(i)}\big\|_2+\big\|t_{cyc}^{(i)}-t_{ori}^{(i)}\big\|_2\Big)$$

wherein i indexes the data used in the i-th calculation and n is the total number of training samples; during training the discriminators iterate in the direction that continuously reduces $L_{disc}$; $D_{img}$ and $D_{txt}$ denote the two discriminators; $(m_{ori}, t_{ori})$ denote the original features of modality m and modality t; $m_{cyc}$ and $t_{cyc}$ denote the generated modality-m and modality-t features; $\theta_1$ is a hyper-parameter of the network weighting $L_{cyc}$ in the generator loss; and $\|\cdot\|_2$ denotes the L2 distance.
4. The cross-modal retrieval method based on the recurrent countermeasure network of claim 3, wherein:
let the intermediate-layer feature vectors of the two generators be $m_{com}$ and $t_{com}$; the hash codes are generated as:

$$m_{hash}=\mathrm{sgn}(m_{com}-0.5)$$
$$t_{hash}=\mathrm{sgn}(t_{com}-0.5)$$

wherein sgn is a threshold function: for each floating-point entry of the intermediate-layer feature vector, the corresponding hash-code bit is set to +1 when the value is greater than 0.5 and to -1 when it is less than 0.5.
5. The cross-modal retrieval method based on the recurrent countermeasure network of claim 4, wherein: in order to quantify the approximation error between the feature vector and the generated hash code, a corresponding loss function is designed as a constraint, specifically using the likelihood of the hash code conditioned on the feature vector, where a sample may be an image or a text; let $h_j^{(i)}$ be the j-th bit of the hash code of the i-th sample and $f_j^{(i)}$ the j-th entry of its feature vector; then:

$$p\big(h_j^{(i)}\,\big|\,f_j^{(i)}\big)=\sigma\big(f_j^{(i)}\big)^{\frac{1+h_j^{(i)}}{2}}\Big(1-\sigma\big(f_j^{(i)}\big)\Big)^{\frac{1-h_j^{(i)}}{2}}$$

wherein $\sigma$ is a sigmoid function of the feature entry:

$$\sigma\big(f_j^{(i)}\big)=\frac{1}{1+e^{-f_j^{(i)}}}$$

A loss function is further designed from the likelihood to evaluate the approximation error between the feature vector and the generated hash code:

$$L_{hash}=-\frac{1}{n\,d_{hash}}\sum_{i=1}^{n}\sum_{j=1}^{d_{hash}}\log p\big(h_j^{(i)}\,\big|\,f_j^{(i)}\big)$$

where n is the total number of samples and $d_{hash}$ is the number of bits of the vector.
6. The cross-modal retrieval method based on the recurrent countermeasure network of claim 5, wherein: a class constraint is applied to the feature vector of the generator intermediate layer, for which a class loss function is designed as:

$$L_{class}=\frac{1}{n}\sum_{i=1}^{n}\big\|\hat{c}_i-c_i\big\|_2$$

wherein $\hat{c}_i$ is the predicted class of the i-th sample, obtained by passing its feature vector $f^{(i)}$ through a small classification network, and $c_i$ is the actual class label of the sample; the class loss function thus computes the L2 distance between the two. A similarity constraint is applied to cross-modal homogeneous data pairs: each training image sample is linked with its similar text sample, and a loss function is designed to constrain the cross-modal homogeneous data:

$$L_{pair}=\frac{1}{n}\sum_{i=1}^{n}\big\|\tilde{m}^{(i)}-\tilde{t}^{(i)}\big\|_2$$

wherein $\tilde{m}^{(i)}$ and $\tilde{t}^{(i)}$ are the feature vectors produced in the common subspace of image and text by the generators $G_{t\to m}$ and $G_{m\to t}$ respectively, and the loss function computes the L2 distance between corresponding semantically similar cross-modal data. In the supervised data training case, since all data carry class labels, a triplet constraint is used to minimize the distance between cross-modal data vectors under the same semantic label, and the triplet loss function is designed as:

$$L_{tri}=\frac{1}{n}\sum_{i=1}^{n}\Big[\big\|m^{*}_{\alpha,i}-t_{\alpha,i}\big\|_2-\big\|m^{*}_{\alpha,i}-t_{\beta,i}\big\|_2+\big\|t^{*}_{\alpha,i}-m_{\alpha,i}\big\|_2-\big\|t^{*}_{\alpha,i}-m_{\beta,i}\big\|_2\Big]$$

wherein m and t denote image and text data respectively, α and β denote two class labels, * marks generated data, and i indexes the data used in the i-th calculation. For the unsupervised training case, manifold constraints are designed to preserve the similarity of semantically similar data within and across modalities: after a kNN matrix is computed, a similarity matrix is built for the data to be constrained, and the manifold constraint is applied to the feature vectors in the common subspace; the manifold constraint loss function is designed as:

$$L_{mani}=\frac{1}{n}\sum_{i=1}^{n}\Big[\big\|m^{*}_{i}-m^{*}_{neib,i}\big\|_2-\big\|m^{*}_{i}-m^{*}_{non,i}\big\|_2+\big\|t^{*}_{i}-t^{*}_{neib,i}\big\|_2-\big\|t^{*}_{i}-t^{*}_{non,i}\big\|_2\Big]$$

wherein neib and non denote adjacent and non-adjacent data respectively, and the other symbols have the same meanings as before. Combining the above functions, the generator loss function in the supervised data training case is designed as:

$$L_{gen}^{sup}=\theta_1 L_{cyc}+\theta_2 L_{hash}+\theta_3 L_{tri}+\theta_4 L_{pair}+\theta_5 L_{class}$$

and in the unsupervised data training case as:

$$L_{gen}^{unsup}=\theta_1 L_{cyc}+\theta_2 L_{hash}+\theta_3 L_{mani}+\theta_4 L_{pair}$$

wherein $\theta_2,\theta_3,\theta_4,\theta_5$ are all weighting hyper-parameters of the network. The whole network is trained iteratively with the RMSProp stochastic gradient descent optimization algorithm:

$$w_{disc}\leftarrow w_{disc}-\mu\cdot\mathrm{RMSProp}\big(w_{disc},\nabla_{w_{disc}}L_{disc}\big)$$

$$w_{gen}\leftarrow w_{gen}-\mu\cdot\mathrm{RMSProp}\big(w_{gen},\nabla_{w_{gen}}L_{gen}\big)$$

Because the gradient of the discriminator decreases quickly in practice, the designed network iterates the discriminators once for every S iterations of the generators during training, and the hyper-parameters $c_{gen}$, $c_{disc}$ are used to prune the network weights to prevent them from becoming too large.
CN201811455802.6A 2018-11-30 2018-11-30 Cross-modal retrieval method based on cycle generation type countermeasure network Active CN109299342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811455802.6A CN109299342B (en) 2018-11-30 2018-11-30 Cross-modal retrieval method based on cycle generation type countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811455802.6A CN109299342B (en) 2018-11-30 2018-11-30 Cross-modal retrieval method based on cycle generation type countermeasure network

Publications (2)

Publication Number Publication Date
CN109299342A CN109299342A (en) 2019-02-01
CN109299342B true CN109299342B (en) 2021-12-17

Family

ID=65142338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811455802.6A Active CN109299342B (en) 2018-11-30 2018-11-30 Cross-modal retrieval method based on cycle generation type countermeasure network

Country Status (1)

Country Link
CN (1) CN109299342B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019652B (en) * 2019-03-14 2022-06-03 九江学院 Cross-modal Hash retrieval method based on deep learning
CN110032734B (en) * 2019-03-18 2023-02-28 百度在线网络技术(北京)有限公司 Training method and device for similar meaning word expansion and generation of confrontation network model
CN110059157A (en) * 2019-03-18 2019-07-26 华南师范大学 A kind of picture and text cross-module state search method, system, device and storage medium
CN110222140B (en) * 2019-04-22 2021-07-13 中国科学院信息工程研究所 Cross-modal retrieval method based on counterstudy and asymmetric hash
CN111127385B (en) * 2019-06-06 2023-01-13 昆明理工大学 Medical information cross-modal Hash coding learning method based on generative countermeasure network
CN110309861B (en) * 2019-06-10 2021-05-25 浙江大学 Multi-modal human activity recognition method based on generation of confrontation network
CN110334708A (en) 2019-07-03 2019-10-15 中国科学院自动化研究所 Difference automatic calibrating method, system, device in cross-module state target detection
CN110443309A (en) * 2019-08-07 2019-11-12 浙江大学 A kind of electromyography signal gesture identification method of combination cross-module state association relation model
CN112487217A (en) * 2019-09-12 2021-03-12 腾讯科技(深圳)有限公司 Cross-modal retrieval method, device, equipment and computer-readable storage medium
CN110909181A (en) * 2019-09-30 2020-03-24 中国海洋大学 Cross-modal retrieval method and system for multi-type ocean data
CN110930469B (en) * 2019-10-25 2021-11-16 北京大学 Text image generation method and system based on transition space mapping
CN110990595B (en) * 2019-12-04 2023-05-05 成都考拉悠然科技有限公司 Cross-domain alignment embedded space zero sample cross-modal retrieval method
CN111104982B (en) * 2019-12-20 2021-09-24 电子科技大学 Label-independent cross-task confrontation sample generation method
CN111353076B (en) * 2020-02-21 2023-10-10 华为云计算技术有限公司 Method for training cross-modal retrieval model, cross-modal retrieval method and related device
WO2021189383A1 (en) * 2020-03-26 2021-09-30 深圳先进技术研究院 Training and generation methods for generating high-energy ct image model, device, and storage medium
CN111523663B (en) * 2020-04-22 2023-06-23 北京百度网讯科技有限公司 Target neural network model training method and device and electronic equipment
CN111581405B (en) * 2020-04-26 2021-10-26 电子科技大学 Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
CN111783980B (en) * 2020-06-28 2023-04-07 大连理工大学 Ranking learning method based on dual cooperation generation type countermeasure network
CN111881884B (en) * 2020-08-11 2021-05-28 中国科学院自动化研究所 Cross-modal transformation assistance-based face anti-counterfeiting detection method, system and device
CN112199462A (en) * 2020-09-30 2021-01-08 三维通信股份有限公司 Cross-modal data processing method and device, storage medium and electronic device
CN112364192A (en) * 2020-10-13 2021-02-12 中山大学 Zero sample Hash retrieval method based on ensemble learning
WO2022104540A1 (en) * 2020-11-17 2022-05-27 深圳大学 Cross-modal hash retrieval method, terminal device, and storage medium
CN113706646A (en) * 2021-06-30 2021-11-26 酷栈(宁波)创意科技有限公司 Data processing method for generating landscape painting
CN113204522B (en) * 2021-07-05 2021-09-24 中国海洋大学 Large-scale data retrieval method based on Hash algorithm combined with generation countermeasure network
CN113779283B (en) * 2021-11-11 2022-04-01 南京码极客科技有限公司 Fine-grained cross-media retrieval method with deep supervision and feature fusion
CN116524420B (en) * 2023-07-03 2023-09-12 武汉大学 Key target detection method and system in traffic scene
CN117133024A (en) * 2023-10-12 2023-11-28 湖南工商大学 Palm print image recognition method integrating multi-scale features and dynamic learning rate

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473307A (en) * 2013-09-10 2013-12-25 浙江大学 Cross-media sparse Hash indexing method
CN106547826A (en) * 2016-09-30 2017-03-29 西安电子科技大学 A kind of cross-module state search method, device and computer-readable medium
CN107871014A (en) * 2017-11-23 2018-04-03 清华大学 A kind of big data cross-module state search method and system based on depth integration Hash
CN108256627A (en) * 2017-12-29 2018-07-06 中国科学院自动化研究所 The mutual generating apparatus of audio-visual information and its training system that generation network is fought based on cycle
CN108510559A (en) * 2017-07-19 2018-09-07 哈尔滨工业大学深圳研究生院 It is a kind of based on have supervision various visual angles discretization multimedia binary-coding method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9030430B2 (en) * 2012-12-14 2015-05-12 Barnesandnoble.Com Llc Multi-touch navigation mode

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473307A (en) * 2013-09-10 2013-12-25 浙江大学 Cross-media sparse Hash indexing method
CN106547826A (en) * 2016-09-30 2017-03-29 西安电子科技大学 A kind of cross-module state search method, device and computer-readable medium
CN108510559A (en) * 2017-07-19 2018-09-07 哈尔滨工业大学深圳研究生院 It is a kind of based on have supervision various visual angles discretization multimedia binary-coding method
CN107871014A (en) * 2017-11-23 2018-04-03 清华大学 A kind of big data cross-module state search method and system based on depth integration Hash
CN108256627A (en) * 2017-12-29 2018-07-06 中国科学院自动化研究所 The mutual generating apparatus of audio-visual information and its training system that generation network is fought based on cycle

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Survey of Cross-modal Retrieval Research (跨模态检索研究综述); Ou Weihua et al.; Journal of Guizhou Normal University (Natural Science Edition); 2018-03-31; Vol. 36, No. 2; pp. 114-120 *

Also Published As

Publication number Publication date
CN109299342A (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN109299342B (en) Cross-modal retrieval method based on cycle generation type countermeasure network
Deng et al. Unsupervised semantic-preserving adversarial hashing for image search
Wu et al. Cycle-consistent deep generative hashing for cross-modal retrieval
Wu et al. Unsupervised Deep Hashing via Binary Latent Factor Models for Large-scale Cross-modal Retrieval.
Makhzani et al. Adversarial autoencoders
Lai et al. Instance-aware hashing for multi-label image retrieval
CN109241317B (en) Pedestrian Hash retrieval method based on measurement loss in deep learning network
CN111581405A (en) Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
CN110222218B (en) Image retrieval method based on multi-scale NetVLAD and depth hash
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
CN110598022B (en) Image retrieval system and method based on robust deep hash network
CN111461175B (en) Label recommendation model construction method and device of self-attention and cooperative attention mechanism
CN109960732B (en) Deep discrete hash cross-modal retrieval method and system based on robust supervision
CN114896434B (en) Hash code generation method and device based on center similarity learning
Song et al. A weighted topic model learned from local semantic space for automatic image annotation
CN113779283B (en) Fine-grained cross-media retrieval method with deep supervision and feature fusion
CN114168773A (en) Semi-supervised sketch image retrieval method based on pseudo label and reordering
Xiao et al. ANE: Network embedding via adversarial autoencoders
Yang et al. Graph regularized encoder-decoder networks for image representation learning
Lai Transductive zero-shot hashing via coarse-to-fine similarity mining
CN113204522B (en) Large-scale data retrieval method based on Hash algorithm combined with generation countermeasure network
Zhang et al. Enhanced semantic similarity learning framework for image-text matching
Li et al. Otcmr: Bridging heterogeneity gap with optimal transport for cross-modal retrieval
Zheng et al. Robust representation learning with reliable pseudo-labels generation via self-adaptive optimal transport for short text clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant