CN109299342B - Cross-modal retrieval method based on cycle generation type countermeasure network - Google Patents
Cross-modal retrieval method based on cycle generation type countermeasure network Download PDFInfo
- Publication number
- CN109299342B CN109299342B CN201811455802.6A CN201811455802A CN109299342B CN 109299342 B CN109299342 B CN 109299342B CN 201811455802 A CN201811455802 A CN 201811455802A CN 109299342 B CN109299342 B CN 109299342B
- Authority
- CN
- China
- Prior art keywords
- data
- network
- modal
- cross
- loss function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a cross-modal retrieval method based on a cycle generation type confrontation network. Given different modal data can flow bidirectionally in the network, each modal data generates another modal data through a group of generative confrontation networks, the generated data is used as the input of the next group of generative confrontation networks, thereby realizing the bidirectional cycle generation of the data, and the network continuously learns the semantic relationship among the cross-modal data. In order to improve the retrieval efficiency, the method also approximates the result of the middle layer of the generator to the corresponding binary hash code by utilizing a threshold function and an approximation function, and designs various constraint conditions to ensure the similarity of the same-mode and same-class data and the difference of the data among different modes and classes, thereby further improving the retrieval accuracy and stability.
Description
Technical Field
The invention belongs to the technical field of multimedia information retrieval, and particularly relates to a cross-mode retrieval method based on a cycle generation type countermeasure network.
Technical Field
With the coming of the internet era, people can contact massive information of various modalities including pictures, videos, texts, audios and the like anytime and anywhere, how to acquire content required by the users from the massive information becomes a key point of interest of internet users, and the users often depend on precise retrieval services provided by retrieval engines such as google, Baidu and must. However, most of the conventional internet retrieval services still stay in the single-modal retrieval degree, the retrieval application for cross-modal data is less, the retrieval efficiency, accuracy and stability are all to be improved, most of the conventional internet retrieval services depend on the existing data tags, and the cross-modal retrieval of non-tag data cannot be achieved. Therefore, the research on the novel cross-modal retrieval method has strong practical significance and practical value, and the key point is that similar other modal data are directly retrieved by establishing the semantic relation among the multi-modal heterogeneous data, so that the cross-modal direct retrieval is realized under the condition of not marking all the modal data, and the retrieval performance is finally further improved.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a cross-modal retrieval method based on a cycle generation type countermeasure network, which can effectively submit the performance of the prior cross-modal retrieval technology.
In order to achieve the above object, the cross-modal search method based on a cyclic generation type countermeasure network according to the present invention is characterized by comprising the following steps:
designing two circulation modules, wherein the two circulation modules share two generators with the same network structure, and hash coding is carried out on output data of a generator middle layer, and the purpose of the generators is to generate cross-modal data which is as real as possible through training;
one loop module realizes the process of the mode m → the mode t → the mode m through the two generators, and the other loop module also realizes the process of the mode t → the mode m → the mode t through the two generators;
designing respective discriminators for different modes in each loop module, wherein the discriminators try to classify generated data and original data of the mode and dynamically confront the generator, and finally the generator and the discriminators reach dynamic balance under given training conditions.
Further, aiming at the characteristics of multi-mode and multi-class data streams, manifold constraints are adopted under the unsupervised condition to ensure the similarity and difference of data among modes and classes; under supervision conditions, due to the fact that class labels are given, the characteristic distance between data of the same type and different modes is minimized by adopting the triple constraint, and the characteristic distance between data of neither the same type nor the different modes is maximized.
Further, the loss function of the discriminator is specifically:
the cyclic loss function obtained by comparing the finally generated data of the same modality with the original data is as follows:
wherein i represents the data of the ith calculation, n training sample data are provided in total, and the discriminator can continuously reduce L in the training processdiscDirection iterative learning of DimgAnd DtxtRespectively representing two discriminators, (m)ori,tori) Original feature vectors representing modes m and t, respectively, (m)cyc,tcyc) Respectively representing the feature vectors generated by the modality m and the modality t through the cyclic network.
Still further, the loss function of the generator is specifically:
wherein theta is1Is a hyper-parameter of the network, | × | | non-conducting phosphor2Indicating that the L2 distance is calculated.
Further, let the feature vector output by the middle layer of two generators be mcomAnd tcomThe formula for generating the hash code is as follows:
mhash=sgn(mcom-0.5)
thash=sgn(tcom-0.5)
wherein sgn is a threshold function, the meaning of the formula is that each bit of floating point number in the intermediate layer floating point type characteristic vector is set to +1 when the value is greater than 0.5, and the corresponding hash code bit is set to-1 when the value is less than 0.5.
Still further, in order to quantify the approximate error between the feature vector and the generated hash code, the method designs a related loss function as a constraint, specifically uses a likelihood function of the hash code under the condition of the feature vector, and takes the jth bit of the hash code of the ith sample as the j bitAnd j th position of characteristic vectorFor example (samples can be both images and text):
a loss function is further designed according to the likelihood function to evaluate the approximate error between the feature vector and the generated hash code:
where n is the total number of samples, dhashIs the vector bit number.
Furthermore, the invention carries out category constraint on the characteristic vector of the generator middle layer, thereby designing a category loss function formula as follows:
whereinIs the feature vector of the ith samplePredicted classes of samples obtained via a small classification network, ciIs the actual class label of the sample and the class loss function actually calculates the L2 distance between the two.
In order to carry out similarity constraint on cross-modal homogeneous data pairs, the method establishes connection between training image sample data and similar text sample data, designs a loss function to carry out constraint on the cross-modal homogeneous data, and the loss function formula is as follows:
andare respectively a generator Gt→m,Gm→tFeature vectors are generated for the common subspace of the image and text, and a loss function computes the L2 distance between corresponding cross-modal homogeneous data that are semantically similar.
Under the condition of supervised data training, because data all have class labels, the distance between cross-modal data vectors under the same semantic label is minimized by using a triple constraint, and a designed ternary loss function is as follows:
wherein m and t respectively represent image and text data, alpha and beta represent two kinds of class labels, the x represents generated data, and i represents data used for the ith calculation; aiming at the unsupervised training condition, the method designs manifold constraint to ensure the similarity of semantically similar data in homomodal and cross-modal data, establishes a similarity matrix for the data to be constrained after calculating a kNN matrix, and then performs manifold constraint on a feature vector in a public subspace, wherein a manifold constraint loss function is designed as follows:
where neib and non represent adjacent and non-adjacent data, respectively, and the other symbols have the same meanings as before.
Further, by integrating the above loss function design, the generator loss function under the supervision training condition is designed as follows:
the generator loss function in the unsupervised training case is designed to:
θ2,θ3,θ4,θ5all are weight override parameters of the network. The whole network is trained and iterated by using a RMSProp random gradient descent optimization algorithm, and the iteration formula is as follows:
due to the fact thatIn the actual process, the gradient of the discriminator is reduced quickly, the discriminator is iterated once when the network designed by the method is trained to iterate the generators for S times, and the hyper-parameter c is usedgen,cdiscAnd pruning the network weight to prevent the network weight from being overlarge.
The invention has the advantages that:
the invention better establishes the semantic relation among multi-mode data by utilizing the cycle generation type antagonistic network constructed by two groups of generators and discriminators, designs various constraint conditions to improve the accuracy and stability of retrieval, adopts binary hash codes to replace original characteristics to improve the retrieval efficiency, and researches and explores a novel cross-modal retrieval method based on the cycle generation type antagonistic network, particularly aiming at the cross-modal retrieval between images and texts.
Drawings
FIG. 1 is an overall architecture diagram of a neural network according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of triple constraints according to an embodiment of the present invention.
FIG. 3 is a manifold restriction book intent of an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and specific examples:
in recent years, with the heat of artificial intelligence, deep learning technology is gradually emerging and affects many fields of computer science, and more people in the technical field of multimedia information retrieval use deep learning to improve the accuracy and stability of the existing retrieval. The generation type confrontation network (generating adaptive network) adopted in the method is a novel neural network which is widely used in recent years and estimates a generation model through the confrontation process, a generator (generator) for learning data distribution and a discriminator (discriminator) for discriminating the authenticity of data are trained in the network at the same time, and the generator and the discriminator mutually confront in the training process to finally achieve dynamic balance. The generative countermeasure network is widely applied to the fields of image generation, semantic segmentation, data enhancement and the like, and can well learn the data distribution rule of the training sample according to the loss function and generate new data similar to the training sample. The method utilizes two groups of generation type countermeasure networks to form a novel circulation network, and improves the efficiency, accuracy and stability of the network when the network is used for multi-mode retrieval through the hash code and various constraint conditions.
The invention provides a cross-modal retrieval method based on a cycle generation type confrontation network, which mainly designs a novel neural network, and the main overall structure of the neural network is shown in figure 1. The embodiment specifically describes the neural network framework and the data processing flow of the invention by taking mutual retrieval between image and text data as an example, and the following steps are carried out:
firstly, in the embodiment, the original two-dimensional image data actually needs to be subjected to preliminary processing, in the embodiment, 19-layer VGGNet popular in the deep learning field is selected, and 4096-dimensional feature vector output by fc7 layer of VGGNet is used as input original image feature moriI.e. the image characteristic dimension dimg4096. Meanwhile, the input original text data is processed into a preliminary feature vector, the embodiment uses a conventional Bag-of-Words (Bag-of-Words) model to process the text data, the length of the obtained BoW vector is related to the text data and a specific selected processing method, and for reference, the BoW vector dimension in the embodiment is set to 2000 dimensions, namely the text feature dimension dtxt2000, and the vector is used as the input original text feature tori。
as shown in FIG. 1, the upper half of the network can be regarded as a first group of generative confrontation networks, which mainly comprise a generator Gm→tAnd a discriminator DtxtWhen the input is the original image-original text dataTo (m)ori,tori). Data flows in the network, original image moriThrough generator Gm→tObtaining a generated text tgenI.e. tgen=Gm→t(mori) And wishes to generate a text tgenAs much as possible with the original text toriSimilarly. Generator Gm→tIs composed of multiple layers of one-dimensional convolution layers, in which the feature vector dimension is changed to dimg→512→dhash→100→dtxt。dimgA dimension representing the input original image feature, 4096 in this embodiment; dhashThe dimension of the intermediate layer features to be used for hash code generation is determined by the required hash code length, and the dimension can be 64, 128, 256 and the like; dtxtThe dimension of the original text feature input in the network, which is also the feature length of the generated text, is 2000 in this embodiment. At the same time, the discriminator DtxtAnd generator Gm→tDynamic confrontation is carried out, and original text characteristics t are tried to be distinguishedgenAnd generating text feature tori. Discriminator DtxtIs a feedforward neural network composed of fully-connected layers, in which the characteristic dimension is changed to dtxt→ 512 → 16. Generator G when the generator and the arbiter reach dynamic balance under given training conditionsm→tThe transformation mode of generating the text data according to the image data can be well extracted, so that the semantic relation between the original image and the generated text data is obtained.
Step 2, designing a second group of generation type countermeasure network, which comprises a generator Gt→mAnd a discriminator DimgInputting the original image obtained for the previous step-generating text data pairs (m)ori,tgen) Obtaining a cyclic image mcycAnd extracting a transformation mode for generating image data according to the text data so as to obtain the semantic relation between the text and the image data. The specific implementation process is described as follows:
as shown in FIG. 1, the lower half of the network can be regarded as a second group of generative countermeasure networks, mainly comprising a generator Gt→mAnd a discriminator DimgWhen the input is the original image-generated textData pair (m)ori,tgen). Data flows in the network, generating text tgenThrough generator Gt→mObtaining a cyclic image mcycI.e. mcyc=Gt→m(tgen)=Gt→m(Gm→t(mori) And hopefully loop image feature m)cycAnd original image features moriAs similar as possible. Generator Gt→mIs composed of multiple layers of one-dimensional reverse convolution layers, in which the feature vector dimension is changed to dtxt→100→dhash→512→dimg。dtxtThe dimension of the original text features input in the network, which is 2000 in this embodiment; dhashThe dimension of the intermediate layer features to be used for hash code generation is determined by the required hash code length, and can be 64, 128, 256 and the like, and is the same as the hash code length in the first group of generative countermeasure network; dimgThe dimension representing the input original image feature, which is also the length of the last generated loop image feature, is 4096 in this embodiment. At the same time, the discriminator DimgAnd generator Gt→mDynamic countermeasure is performed in an attempt to distinguish the cyclic image features mcycAnd original image features mori. Discriminator DimgIs a feedforward neural network composed of fully-connected layers, in which the characteristic dimension is changed to dimg→ 512 → 100 → 16. When the generator and the discriminator reach dynamic balance under given training conditions, the transformation mode of generating the image data according to the text data can be well extracted, and the semantic relation between the generated text and the circular image data is obtained.
And 3, utilizing the two groups of generation type countermeasure networks designed in the two steps, similarly reversing the data flow direction, and finally realizing a conversion mode of generating text data by the image data so as to obtain the semantic relation between the image and the text data. I.e. the first two steps are combined, firstly, the original text characteristics t to be input are resisted by using a second group of generating formulasoriGenerating to generate an image feature mgenObtaining semantic relation between text and image data; then the first group of generation type countermeasure network is used to generate imageFeature mgenGenerated as a circular text feature tcycAnd obtaining the semantic relation between the image and the text data. Finally, the purposes of circularly flowing image data and text data in two groups of generation type countermeasure networks, generating countermeasures and continuously optimizing the networks during training are achieved, and the specific implementation process is explained as follows:
the input data is still the original image-original text data pair (m)ori,tori) First of all, in contrast to the sequence in which the two steps are carried out above, the generator G of the antagonistic network is confronted with the second set of generating equationst→mOriginal text feature t to be inputoriGenerating to generate image features Gt→mI.e. mgen=Gt→m(tori) Generator Gt→mThe feature vector dimension change in (2) is the same as before, and is dtxt→100→dhash→512→dimg. At the same time, the discriminator DimgAnd generator Gt→mPerforming dynamic countermeasure to try to distinguish the original image features moriAnd generating image features mgen. Generator G after the confrontation reaches dynamic balancet→mThe semantic relationship between the original text-generated image data can be learned. Then, the generator G of the first group of generation type countermeasure network is usedm→tWill generate image features mgenGenerated as a circular text feature tcycI.e. tcyc=Gm→t(mten)=Gm→t(Gt→m(tori) Generator G)m→tThe feature vector dimension change in (2) is the same as before, and is dimg→512→dhash→100→dtxt. At the same time, the discriminator DtxtAnd generator Gm→tDynamic confrontation is carried out, and original text characteristics t are tried to be distinguishedoriAnd a recurring text feature tcyc. Generator G after the confrontation reaches dynamic balancem→tThe semantic relationship between the generated image-loop text data can be learned.
Through steps 1, 2 and 3, a bidirectional circulation flow channel of the image data and the text data in the embodiment in the network is established, wherein one channel is original image characteristic data moriObtaining production text features t through a first set of generative confrontation networksgenThen, t is further determinedgenGeneration of recurring image features m by second set of generative countermeasure networkscyc(ii) a The other channel, original text data toriFirstly, obtaining the generated image feature m through a second group to the antibiotic webgenThen m is addedgenProducing recurring textual features t through a first set of generative confrontation networkscyc. Thus, image and text data can be generated in a bidirectional loop in both networks, while there is a discriminator DimgAnd DtxtAnd participating in a countermeasure generator to improve the effect of network learning on the semantic relationship among cross-modal data. Wherein the discriminator DimgAnd DtxtThe loss function of (a) is designed to:
wherein i represents the data of the ith calculation, n training sample data are provided in total, and the discriminator can continuously reduce L in the training processdiscAnd (4) iteratively learning the direction. After the bidirectional cyclic generative countermeasure network is constructed, one of the advantages is that the finally obtained cyclic data can be compared with the original data to obtain a cyclic loss function, and the cyclic loss function is also an important component of the generator loss function:
wherein theta is1Is a hyper-parameter of the network, 0.001, | \ | | | non-conducting phosphor in the embodiment2Indicating that the L2 distance is calculated.
Step 4, in order to improve the efficiency of cross-modal retrieval in practical application, the method applies a threshold function to respectively extract hash codes m capable of representing image and text characteristics from the public subspaces of the two groups of generation type confrontation network generatorshashAnd thashAnd a likelihood function is designed to evaluate the approximation error between the two hash codes. The specific implementation process is described as follows:
in the two sets of generative countermeasure networks, since the input and output of the generator are respectively the feature data of different modalities, the present example uses the middle layer of the generator as the common subspace of the cross-modality data (as shown in fig. 1), and in the above steps, the feature length of the layer is designed to be the length d of the required hash codehash. Let the feature vector of the middle layer be mcomAnd tcomThe generated formula is mhash=sgn(mcom-0.5) and thash=sgn(tcom-0.5), where sgn is a threshold function, the formula means that for each floating point number in the intermediate layer floating point type feature vector, the corresponding hash code bit is set to +1 if the value is greater than 0.5 and to-1 if the value is less than 0.5. Such thresholding will apply to each bit of the feature vector of each training sample, and each training sample will have a hash code of the same length as the feature vector. Hash code m is used in the embodimentshash、thashSurrogate common subspace feature vector mcom、tcomBy searching, the distance calculation between different floating point type characteristic vectors in the original searching can be replaced by the Hamming distance calculation between Hash codes, and the calculating speed of searching is greatly improved.
In order to quantify the approximation error between the feature vector and the generated hash code, the present embodiment designs a relevant loss function as a constraint. Example uses the likelihood function of the hash code under the condition of the feature vector, and takes the jth bit of the hash code of the ith sampleAnd j th position of characteristic vectorFor example (samples can be both images and text):
embodiments design a loss function from the likelihood function to evaluate the approximation error between the feature vector and the generated hash code:
where n is the total number of samples, dhashIs the vector bit number. The loss function that evaluates the hash code approximation error will play a role in training as one of the constraints of the network.
Step 5, in order to construct a network model with better effect, the present embodiment utilizes multiple constraint conditions to constrain the data features generated during network training, so that more category features are retained, thereby improving the accuracy during retrieval. Aiming at the characteristics of multi-mode and multi-class data streams, manifold constraints are adopted to ensure the similarity and difference of data among modes and classes under an unsupervised condition; under the supervision condition, because the sample class label is given, the characteristic distance between the data of the same type and different modes is minimized by adopting the triple constraint, and the characteristic distance between the data of neither the same type nor different modes is maximized. The specific implementation process is described as follows:
another small classification network is introduced under supervision to perform class constraints on the feature vectors obtained by the generator common subspace. For a supervised cross-modal data set, that is, when a data sample for training has a class label, in order to make full use of the data class label, the present embodiment utilizes a small classification network to perform class representation on a common subspace, and designs a class loss function to constrain generation of a common subspace feature vector, so that the common subspace feature vector is different from other layer vectors, carries stronger class information, and can be correctly classified during prediction classification. The class loss function is formulated as:
whereinIs the feature vector of the ith samplePredicted classes of samples obtained via a small classification network, ciIs the actual class label of the sample and the class loss function actually calculates the L2 distance between the two.
Similarity constraints are applied to pairs of homogeneous data across modalities. In the cross-modal data, there are many semantic similar paired training data, for example, a certain image data sample and another text data sample in the training data have high semantic similarity and have similar category attributes. To take advantage of this property, in this embodiment, training image sample data is connected with its similar text sample data, and a loss function is designed to constrain the homogeneous data across modalities. The loss function is formulated as follows:
andare respectively a generator Gt→m,Gm→tFeature vectors for the common subspace of the image and text are generated, and a loss function computes the L2 distance of corresponding cross-modal data that are semantically similar.
Further extendIn the embodiment, the similarity constraint of the same-class data among the cross-modalities and the same-class data in the same modality is considered at the same time, that is, the distance between the feature vectors of the paired cross-modality training data and the same-modality data with similar semantics should be smaller than that of other feature vectors with dissimilar semantics. In the supervised training case, since the data all have class labels, the triplet constraint is used to minimize the distance between the cross-modal data vectors under the same semantic label. The triplet constraint is illustrated schematically in fig. 2, where different shapes of the icons represent different classes of data, and different textures represent different modalities of data, with data in the feature space being closer to the same-modality data or across the same class of data in the modalities, and further from the different classes of data across the modalities. In an embodiment, to generate image dataFor example (the feature label of the generated data is the category label of the original input data), first, the text data t with the same category label is selectedα,iSimultaneously randomly selecting different types of text data tβ,iWhere α, β represent two class labels, where x represents the generated data, i represents the data for the ith calculation, and the triplet constraint to generate the image is minimizedtα,iWhile maximizing the distance betweentβ,i. Similarly, for generating textIts triplet constraint and mα,i,mβ,iIt is related. The triplet constraint loss function is therefore designed as follows:
for the unsupervised training situation, the embodiment designs manifold constraints to ensure the similarity of semantically similar data in the homomodal and cross-modal data. Because the data does not contain class labels when unsupervised data training is adopted, the k-neighbor matrix is constructed in the embodiment to ensure that the data with similar semantics are aggregated and the data with different semantics are separated. As shown in fig. 3, in this embodiment, after the kNN matrix is calculated, a similarity matrix is established for the data to be constrained, and then the feature vector is subjected to manifold constraint in the common subspace. To be composed of text data tαThe obtained generated image dataFor example, according to tαCalculating the k NN matrix of (1), and calculating tαK (k is set to 2 in this embodiment) nearest data are denoted as 1 in the similarity matrix, and data that are not adjacent are denoted as 0 in the similarity matrix. After the text data are generated to obtain the image characteristic vectors, randomly selecting the generated image characteristic vectors corresponding to the text data with 1 in the similarity matrix as the image characteristic vectorsGenerating image feature vectors corresponding to the text data of 0 in the similarity matrix asIn the stream constraint, minimizeAndthe distance between the semantic similar data and the semantic similar data is ensured to be high in similarity of generated feature vectors, and the maximum is achievedAndthe distance between the semantic data and the semantic data ensures that the similarity of generated feature vectors of different semantic data is low. For generating text data, the same holds To perform manifold constraints. The manifold constraint loss function is therefore designed as follows:
in summary, we can obtain a generator loss function composed of loss functions with various constraints. In the case of supervised data training, the generator loss function consists of a cyclic loss functionHash code loss functionTriplet constrained loss functionCross-modal homogeneous data loss functionAnd class loss functionThe formula is as follows:
wherein theta is2,θ3,θ4,θ5The adjustable hyper-parameters of the network are set to 5, 5, 0.001, 20, respectively, in this embodiment. On unsupervised dataIn the training case, the generator loss function consists of a cyclic loss functionHash code loss functionManifold constraint loss functionCross-modal homogeneous data loss functionComposition, the formula is as follows:
the value of the hyper-parameter is the same as previously set.
And (3) integrating the 5 steps, designing a discriminator loss function and a generator loss function, and then utilizing a common minimum maximum algorithm to iterate to minimize network loss so as to achieve the purpose of establishing the semantic relationship among the multi-modal data. The infinitesimal maximum algorithm in the embodiment uses a random gradient descent optimization algorithm, and particularly uses a more stable RMSProp optimization algorithm. Since the discriminator and the generator are confronted with each other, the calculation methods of the discriminator and the generator are opposite, and both of them confront the iteration result of the previous round at each round of iteration and reach dynamic balance in the confrontation. The calculation method is as follows:
because the discriminantor is trained quickly in the actual process, the network designed by the method is every timeThe training iterates the generator S times and iterates the discriminator once. In the present embodiment, the network training-related hyper-parameter S is set to 10, the learning rate μ of the network is set to 0.0001, and the batch sample size (batch size) per training is set to 64; meanwhile, the weight learned in the network is pruned, and the weight in the generator is larger than c in each traininggenIs given a weight of cgenGreater than c in the discriminatordiscIs given a weight of cdiscSo as to avoid the learned weight from being too large.
And step 6, using the trained neural network for cross-modal data search, compressing the feature vectors obtained by the data through a generator public subspace into hash codes, and retrieving by utilizing the Hamming distance between different data hash codes. The specific implementation process is described as follows:
after the image and text data in the embodiment are trained and learned through the network, the generator obtains an extraction mode of semantic relation related information among cross-modal data. The embodiment can perform the bidirectional retrieval of cross-modal data at this time, firstly fix the weight parameters in the trained network, and retrieve the image and text data m to be retrievedtest,ttestBy trained generator Gm→t,Gt→mObtaining a feature vector m on a common subspacecom,tcomThen, the feature vector is generated as a hash code mhash,thashAnd (5) standby. When searching text by using image, taking out hash code of the imageCalculating the Hamming distance between the hash code and all text hash codes, and calculating the hash code with the nearest distanceThe represented text is the result of image → text cross-modality retrieval; when searching image by text, taking out hash code of the textCalculating its Hamming distance to all image Hash codes, the Hash with the nearest distanceCodeThe represented image is the result of the text → image cross modality search.
The above embodiments are only used for illustrating the design idea and features of the present invention, and the purpose of the present invention is to enable those skilled in the art to understand the content of the present invention and implement the present invention accordingly, and the protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes and modifications made in accordance with the principles and concepts disclosed herein are intended to be included within the scope of the present invention.
Claims (6)
1. A cross-modal retrieval method based on a cycle generation type countermeasure network is characterized by comprising the following steps:
designing two circulation modules, wherein one circulation module realizes the process from the image to the text to the image through two generators, and the other circulation module realizes the process from the text to the image to the text through the two generators; the two circulation modules share two generators with the same network structure, and the output data of the middle layer of the generators are subjected to Hash coding;
and designing a discriminator in each cycle module, wherein the discriminator classifies the generated data and the original data of the same mode and performs dynamic confrontation with the generator, and finally the generator and the discriminator reach dynamic balance under given training conditions.
2. The cross-modal retrieval method based on the recurrent countermeasure network of claim 1, wherein:
aiming at the characteristics of multi-mode and multi-class data streams, manifold constraints are adopted to ensure the similarity and difference of data among modes and classes under an unsupervised condition; under supervision conditions, due to the fact that class labels are given, the characteristic distance between data of the same type and different modes is minimized by adopting the triple constraint, and the characteristic distance between data of neither the same type nor the different modes is maximized.
3. The cross-modal retrieval method based on the recurrent countermeasure network of claim 2, wherein:
the loss function of the discriminator is specifically as follows:
the cyclic loss function obtained by comparing the finally generated data of the same modality with the original data is as follows:
wherein i represents the data of the ith calculation, n training sample data are provided in total, and the discriminator can continuously reduce L in the training processdiscDirection iterative learning of DimgAnd DtxtRespectively representing two discriminators, (m)ori,tori) Respectively representing an original mode m and an original mode t, mcycGenerating a modal m-feature, tcycGenerating a modal t characteristic; theta1Is a hyper-parameter of the network, | × | | non-conducting phosphor2Indicating that the L2 distance is calculated.
4. The cross-modal retrieval method based on the recurrent countermeasure network of claim 3, wherein:
let the intermediate layer eigenvector of the two generators be mcomAnd tcomThe formula for generating the hash code is as follows:
mhash=sgn(mcom-0.5)
t hash=sgn(t com-0.5)
wherein sgn is a threshold function, the meaning of the formula is that each bit of floating point number in the intermediate layer floating point type characteristic vector is set to +1 when the value is greater than 0.5, and the corresponding hash code bit is set to-1 when the value is less than 0.5.
5. The cross-modal retrieval method based on the recurrent countermeasure network of claim 4, wherein: in order to quantify approximate error between the feature vector and the generated hash code, a related loss function is designed as a constraint, a likelihood function of the hash code under the condition of the feature vector is specifically used, a sample can be an image or a text, the j th bit of the hash code of the ith sample isAnd j th position of characteristic vectorThen:
a loss function is further designed according to the likelihood function to evaluate the approximate error between the feature vector and the generated hash code:
where n is the total number of samples, dhashIs the vector bit number.
6. The cross-modal retrieval method based on the recurrent countermeasure network of claim 5, wherein: and (3) carrying out class constraint on the characteristic vector of the generator intermediate layer, thereby designing a class loss function formula as follows:
whereinIs the feature vector of the ith samplePredicted classes of samples obtained via a small classification network, ciIs the actual class label of the sample, and the class loss function actually calculates the L2 distance between the two; the cross-modal homogeneous data pair is subjected to similarity constraint, training image sample data is connected with similar text sample data, and a loss function is designed to constrain the cross-modal homogeneous data; the loss function is formulated as follows:
andare respectively a generator Gt→m,Gm→tGenerating a feature vector of a common subspace of the image and the text, and calculating an L2 distance of corresponding cross-modal data with similar semantics by using a loss function; in the case of supervised data training, triple constraints are used to minimize the same semantics since the data all have class labelsThe distance between cross-modal data vectors under the label is designed as the ternary loss function:
wherein m and t respectively represent image and text data, alpha and beta represent two kinds of class labels, the x represents generated data, and i represents data used for the ith calculation; aiming at the unsupervised training condition, manifold constraint is designed to ensure the similarity of semantically similar data in homomodal and cross-modal data, after a kNN matrix is calculated, a similarity matrix is established for the data to be constrained, and then manifold constraint is carried out on the feature vector in a public subspace; the manifold constraint loss function is designed as follows:
wherein neib and non respectively represent adjacent data and non-adjacent data, and other symbols have the same meanings as before; synthesizing various functions, the generator loss function in the case of supervised data training is designed to:
the generator loss function in the case of unsupervised data training is designed to:
θ2,θ3,θ4,θ5all are weight value over parameters of the network; the whole network is trained and iterated by using a RMSProp random gradient descent optimization algorithm, and the iteration formula is as follows:
because the gradient of the discriminator is reduced quickly in the actual process, the discriminator is iterated once when the designed network is trained to iterate the generators for S times, and the hyper-parameter c is usedgen,cdiscAnd pruning the network weight to prevent the network weight from being overlarge.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811455802.6A CN109299342B (en) | 2018-11-30 | 2018-11-30 | Cross-modal retrieval method based on cycle generation type countermeasure network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811455802.6A CN109299342B (en) | 2018-11-30 | 2018-11-30 | Cross-modal retrieval method based on cycle generation type countermeasure network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109299342A CN109299342A (en) | 2019-02-01 |
CN109299342B true CN109299342B (en) | 2021-12-17 |
Family
ID=65142338
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811455802.6A Active CN109299342B (en) | 2018-11-30 | 2018-11-30 | Cross-modal retrieval method based on cycle generation type countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109299342B (en) |
Families Citing this family (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110019652B (en) * | 2019-03-14 | 2022-06-03 | 九江学院 | Cross-modal Hash retrieval method based on deep learning |
US11475308B2 (en) * | 2019-03-15 | 2022-10-18 | Samsung Electronics Co., Ltd. | Jointly pruning and quantizing deep neural networks |
CN110059157A (en) * | 2019-03-18 | 2019-07-26 | 华南师范大学 | A kind of picture and text cross-module state search method, system, device and storage medium |
CN110032734B (en) * | 2019-03-18 | 2023-02-28 | 百度在线网络技术(北京)有限公司 | Training method and device for similar meaning word expansion and generation of confrontation network model |
CN110222140B (en) * | 2019-04-22 | 2021-07-13 | 中国科学院信息工程研究所 | Cross-modal retrieval method based on counterstudy and asymmetric hash |
CN111127385B (en) * | 2019-06-06 | 2023-01-13 | 昆明理工大学 | Medical information cross-modal Hash coding learning method based on generative countermeasure network |
CN110309861B (en) * | 2019-06-10 | 2021-05-25 | 浙江大学 | Multi-modal human activity recognition method based on generation of confrontation network |
CN110334708A (en) * | 2019-07-03 | 2019-10-15 | 中国科学院自动化研究所 | Difference automatic calibrating method, system, device in cross-module state target detection |
CN110443309A (en) * | 2019-08-07 | 2019-11-12 | 浙江大学 | A kind of electromyography signal gesture identification method of combination cross-module state association relation model |
CN112487217A (en) * | 2019-09-12 | 2021-03-12 | 腾讯科技(深圳)有限公司 | Cross-modal retrieval method, device, equipment and computer-readable storage medium |
CN110909181A (en) * | 2019-09-30 | 2020-03-24 | 中国海洋大学 | Cross-modal retrieval method and system for multi-type ocean data |
CN110930469B (en) * | 2019-10-25 | 2021-11-16 | 北京大学 | Text image generation method and system based on transition space mapping |
CN110990595B (en) * | 2019-12-04 | 2023-05-05 | 成都考拉悠然科技有限公司 | Cross-domain alignment embedded space zero sample cross-modal retrieval method |
CN111104982B (en) * | 2019-12-20 | 2021-09-24 | 电子科技大学 | Label-independent cross-task confrontation sample generation method |
CN111353076B (en) * | 2020-02-21 | 2023-10-10 | 华为云计算技术有限公司 | Method for training cross-modal retrieval model, cross-modal retrieval method and related device |
WO2021189383A1 (en) * | 2020-03-26 | 2021-09-30 | 深圳先进技术研究院 | Training and generation methods for generating high-energy ct image model, device, and storage medium |
CN111523663B (en) * | 2020-04-22 | 2023-06-23 | 北京百度网讯科技有限公司 | Target neural network model training method and device and electronic equipment |
CN111581405B (en) * | 2020-04-26 | 2021-10-26 | 电子科技大学 | Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning |
CN111783980B (en) * | 2020-06-28 | 2023-04-07 | 大连理工大学 | Ranking learning method based on dual cooperation generation type countermeasure network |
CN111881884B (en) * | 2020-08-11 | 2021-05-28 | 中国科学院自动化研究所 | Cross-modal transformation assistance-based face anti-counterfeiting detection method, system and device |
CN112199462A (en) * | 2020-09-30 | 2021-01-08 | 三维通信股份有限公司 | Cross-modal data processing method and device, storage medium and electronic device |
CN112364192B (en) * | 2020-10-13 | 2024-10-01 | 中山大学 | Zero sample hash retrieval method based on ensemble learning |
WO2022104540A1 (en) * | 2020-11-17 | 2022-05-27 | 深圳大学 | Cross-modal hash retrieval method, terminal device, and storage medium |
CN113706646A (en) * | 2021-06-30 | 2021-11-26 | 酷栈(宁波)创意科技有限公司 | Data processing method for generating landscape painting |
CN113204522B (en) * | 2021-07-05 | 2021-09-24 | 中国海洋大学 | Large-scale data retrieval method based on Hash algorithm combined with generation countermeasure network |
CN113779283B (en) * | 2021-11-11 | 2022-04-01 | 南京码极客科技有限公司 | Fine-grained cross-media retrieval method with deep supervision and feature fusion |
CN116524420B (en) * | 2023-07-03 | 2023-09-12 | 武汉大学 | Key target detection method and system in traffic scene |
CN117133024B (en) * | 2023-10-12 | 2024-07-26 | 湖南工商大学 | Palm print image recognition method integrating multi-scale features and dynamic learning rate |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103473307A (en) * | 2013-09-10 | 2013-12-25 | 浙江大学 | Cross-media sparse Hash indexing method |
CN106547826A (en) * | 2016-09-30 | 2017-03-29 | 西安电子科技大学 | A kind of cross-module state search method, device and computer-readable medium |
CN107871014A (en) * | 2017-11-23 | 2018-04-03 | 清华大学 | A kind of big data cross-module state search method and system based on depth integration Hash |
CN108256627A (en) * | 2017-12-29 | 2018-07-06 | 中国科学院自动化研究所 | The mutual generating apparatus of audio-visual information and its training system that generation network is fought based on cycle |
CN108510559A (en) * | 2017-07-19 | 2018-09-07 | 哈尔滨工业大学深圳研究生院 | It is a kind of based on have supervision various visual angles discretization multimedia binary-coding method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9030430B2 (en) * | 2012-12-14 | 2015-05-12 | Barnesandnoble.Com Llc | Multi-touch navigation mode |
-
2018
- 2018-11-30 CN CN201811455802.6A patent/CN109299342B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103473307A (en) * | 2013-09-10 | 2013-12-25 | 浙江大学 | Cross-media sparse Hash indexing method |
CN106547826A (en) * | 2016-09-30 | 2017-03-29 | 西安电子科技大学 | A kind of cross-module state search method, device and computer-readable medium |
CN108510559A (en) * | 2017-07-19 | 2018-09-07 | 哈尔滨工业大学深圳研究生院 | It is a kind of based on have supervision various visual angles discretization multimedia binary-coding method |
CN107871014A (en) * | 2017-11-23 | 2018-04-03 | 清华大学 | A kind of big data cross-module state search method and system based on depth integration Hash |
CN108256627A (en) * | 2017-12-29 | 2018-07-06 | 中国科学院自动化研究所 | The mutual generating apparatus of audio-visual information and its training system that generation network is fought based on cycle |
Non-Patent Citations (1)
Title |
---|
跨模态检索研究综述;欧卫华等;《贵州师范大学学报(自然科学版)》;20180331;第36卷(第2期);第114-120页 * |
Also Published As
Publication number | Publication date |
---|---|
CN109299342A (en) | 2019-02-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109299342B (en) | Cross-modal retrieval method based on cycle generation type countermeasure network | |
Deng et al. | Unsupervised semantic-preserving adversarial hashing for image search | |
Wu et al. | Cycle-consistent deep generative hashing for cross-modal retrieval | |
Wu et al. | Unsupervised Deep Hashing via Binary Latent Factor Models for Large-scale Cross-modal Retrieval. | |
CN110222140B (en) | Cross-modal retrieval method based on counterstudy and asymmetric hash | |
Makhzani et al. | Adversarial autoencoders | |
CN109241317B (en) | Pedestrian Hash retrieval method based on measurement loss in deep learning network | |
Lai et al. | Instance-aware hashing for multi-label image retrieval | |
CN111581405A (en) | Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning | |
CN109271522A (en) | Comment sensibility classification method and system based on depth mixed model transfer learning | |
CN112749274B (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
CN110222218B (en) | Image retrieval method based on multi-scale NetVLAD and depth hash | |
CN112434159B (en) | Method for classifying paper multi-labels by using deep neural network | |
CN111080551B (en) | Multi-label image complement method based on depth convolution feature and semantic neighbor | |
CN113204522B (en) | Large-scale data retrieval method based on Hash algorithm combined with generation countermeasure network | |
CN110598022B (en) | Image retrieval system and method based on robust deep hash network | |
CN109960732B (en) | Deep discrete hash cross-modal retrieval method and system based on robust supervision | |
CN114896434B (en) | Hash code generation method and device based on center similarity learning | |
CN111461175A (en) | Label recommendation model construction method and device of self-attention and cooperative attention mechanism | |
Song et al. | A weighted topic model learned from local semantic space for automatic image annotation | |
CN113779283B (en) | Fine-grained cross-media retrieval method with deep supervision and feature fusion | |
Yang et al. | Graph regularized encoder-decoder networks for image representation learning | |
Zheng et al. | Robust representation learning with reliable pseudo-labels generation via self-adaptive optimal transport for short text clustering | |
CN114168773A (en) | Semi-supervised sketch image retrieval method based on pseudo label and reordering | |
Xiao et al. | ANE: Network embedding via adversarial autoencoders |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |