CN109299342B - Cross-modal retrieval method based on cycle generation type countermeasure network - Google Patents

Cross-modal retrieval method based on cycle generation type countermeasure network Download PDF

Info

Publication number
CN109299342B
CN109299342B CN201811455802.6A CN201811455802A CN109299342B CN 109299342 B CN109299342 B CN 109299342B CN 201811455802 A CN201811455802 A CN 201811455802A CN 109299342 B CN109299342 B CN 109299342B
Authority
CN
China
Prior art keywords
data
network
modal
cross
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811455802.6A
Other languages
Chinese (zh)
Other versions
CN109299342A (en
Inventor
倪立昊
王骞
邹勤
李明慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201811455802.6A priority Critical patent/CN109299342B/en
Publication of CN109299342A publication Critical patent/CN109299342A/en
Application granted granted Critical
Publication of CN109299342B publication Critical patent/CN109299342B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal retrieval method based on a cycle generation type confrontation network. Given different modal data can flow bidirectionally in the network, each modal data generates another modal data through a group of generative confrontation networks, the generated data is used as the input of the next group of generative confrontation networks, thereby realizing the bidirectional cycle generation of the data, and the network continuously learns the semantic relationship among the cross-modal data. In order to improve the retrieval efficiency, the method also approximates the result of the middle layer of the generator to the corresponding binary hash code by utilizing a threshold function and an approximation function, and designs various constraint conditions to ensure the similarity of the same-mode and same-class data and the difference of the data among different modes and classes, thereby further improving the retrieval accuracy and stability.

Description

Cross-modal retrieval method based on cycle generation type countermeasure network
Technical Field
The invention belongs to the technical field of multimedia information retrieval, and particularly relates to a cross-mode retrieval method based on a cycle generation type countermeasure network.
Technical Field
With the coming of the internet era, people can contact massive information of various modalities including pictures, videos, texts, audios and the like anytime and anywhere, how to acquire content required by the users from the massive information becomes a key point of interest of internet users, and the users often depend on precise retrieval services provided by retrieval engines such as google, Baidu and must. However, most of the conventional internet retrieval services still stay in the single-modal retrieval degree, the retrieval application for cross-modal data is less, the retrieval efficiency, accuracy and stability are all to be improved, most of the conventional internet retrieval services depend on the existing data tags, and the cross-modal retrieval of non-tag data cannot be achieved. Therefore, the research on the novel cross-modal retrieval method has strong practical significance and practical value, and the key point is that similar other modal data are directly retrieved by establishing the semantic relation among the multi-modal heterogeneous data, so that the cross-modal direct retrieval is realized under the condition of not marking all the modal data, and the retrieval performance is finally further improved.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a cross-modal retrieval method based on a cycle generation type countermeasure network, which can effectively submit the performance of the prior cross-modal retrieval technology.
In order to achieve the above object, the cross-modal search method based on a cyclic generation type countermeasure network according to the present invention is characterized by comprising the following steps:
designing two circulation modules, wherein the two circulation modules share two generators with the same network structure, and hash coding is carried out on output data of a generator middle layer, and the purpose of the generators is to generate cross-modal data which is as real as possible through training;
one loop module realizes the process of the mode m → the mode t → the mode m through the two generators, and the other loop module also realizes the process of the mode t → the mode m → the mode t through the two generators;
designing respective discriminators for different modes in each loop module, wherein the discriminators try to classify generated data and original data of the mode and dynamically confront the generator, and finally the generator and the discriminators reach dynamic balance under given training conditions.
Further, aiming at the characteristics of multi-mode and multi-class data streams, manifold constraints are adopted under the unsupervised condition to ensure the similarity and difference of data among modes and classes; under supervision conditions, due to the fact that class labels are given, the characteristic distance between data of the same type and different modes is minimized by adopting the triple constraint, and the characteristic distance between data of neither the same type nor the different modes is maximized.
Further, the loss function of the discriminator is specifically:
Figure BDA0001887718790000021
the cyclic loss function obtained by comparing the finally generated data of the same modality with the original data is as follows:
Figure BDA0001887718790000022
wherein i represents the data of the ith calculation, n training sample data are provided in total, and the discriminator can continuously reduce L in the training processdiscDirection iterative learning of DimgAnd DtxtRespectively representing two discriminators, (m)ori,tori) Original feature vectors representing modes m and t, respectively, (m)cyc,tcyc) Respectively representing the feature vectors generated by the modality m and the modality t through the cyclic network.
Still further, the loss function of the generator is specifically:
Figure BDA0001887718790000023
wherein theta is1Is a hyper-parameter of the network, | × | | non-conducting phosphor2Indicating that the L2 distance is calculated.
Further, let the feature vector output by the middle layer of two generators be mcomAnd tcomThe formula for generating the hash code is as follows:
mhash=sgn(mcom-0.5)
thash=sgn(tcom-0.5)
wherein sgn is a threshold function, the meaning of the formula is that each bit of floating point number in the intermediate layer floating point type characteristic vector is set to +1 when the value is greater than 0.5, and the corresponding hash code bit is set to-1 when the value is less than 0.5.
Still further, in order to quantify the approximate error between the feature vector and the generated hash code, the method designs a related loss function as a constraint, specifically uses a likelihood function of the hash code under the condition of the feature vector, and takes the jth bit of the hash code of the ith sample as the j bit
Figure BDA0001887718790000031
And j th position of characteristic vector
Figure BDA0001887718790000032
For example (samples can be both images and text):
Figure BDA0001887718790000033
wherein
Figure BDA0001887718790000034
Is a sigmoid function associated with the feature vector:
Figure BDA0001887718790000035
a loss function is further designed according to the likelihood function to evaluate the approximate error between the feature vector and the generated hash code:
Figure BDA0001887718790000036
where n is the total number of samples, dhashIs the vector bit number.
Furthermore, the invention carries out category constraint on the characteristic vector of the generator middle layer, thereby designing a category loss function formula as follows:
Figure BDA0001887718790000041
wherein
Figure BDA0001887718790000042
Is the feature vector of the ith sample
Figure BDA0001887718790000043
Predicted classes of samples obtained via a small classification network, ciIs the actual class label of the sample and the class loss function actually calculates the L2 distance between the two.
In order to carry out similarity constraint on cross-modal homogeneous data pairs, the method establishes connection between training image sample data and similar text sample data, designs a loss function to carry out constraint on the cross-modal homogeneous data, and the loss function formula is as follows:
Figure BDA0001887718790000044
Figure BDA0001887718790000045
and
Figure BDA0001887718790000046
are respectively a generator Gt→m,Gm→tFeature vectors are generated for the common subspace of the image and text, and a loss function computes the L2 distance between corresponding cross-modal homogeneous data that are semantically similar.
Under the condition of supervised data training, because data all have class labels, the distance between cross-modal data vectors under the same semantic label is minimized by using a triple constraint, and a designed ternary loss function is as follows:
Figure BDA0001887718790000047
wherein m and t respectively represent image and text data, alpha and beta represent two kinds of class labels, the x represents generated data, and i represents data used for the ith calculation; aiming at the unsupervised training condition, the method designs manifold constraint to ensure the similarity of semantically similar data in homomodal and cross-modal data, establishes a similarity matrix for the data to be constrained after calculating a kNN matrix, and then performs manifold constraint on a feature vector in a public subspace, wherein a manifold constraint loss function is designed as follows:
Figure BDA0001887718790000051
where neib and non represent adjacent and non-adjacent data, respectively, and the other symbols have the same meanings as before.
Further, by integrating the above loss function design, the generator loss function under the supervision training condition is designed as follows:
Figure BDA0001887718790000052
the generator loss function in the unsupervised training case is designed to:
Figure BDA0001887718790000053
θ2,θ3,θ4,θ5all are weight override parameters of the network. The whole network is trained and iterated by using a RMSProp random gradient descent optimization algorithm, and the iteration formula is as follows:
Figure BDA0001887718790000054
Figure BDA0001887718790000055
due to the fact thatIn the actual process, the gradient of the discriminator is reduced quickly, the discriminator is iterated once when the network designed by the method is trained to iterate the generators for S times, and the hyper-parameter c is usedgen,cdiscAnd pruning the network weight to prevent the network weight from being overlarge.
The invention has the advantages that:
the invention better establishes the semantic relation among multi-mode data by utilizing the cycle generation type antagonistic network constructed by two groups of generators and discriminators, designs various constraint conditions to improve the accuracy and stability of retrieval, adopts binary hash codes to replace original characteristics to improve the retrieval efficiency, and researches and explores a novel cross-modal retrieval method based on the cycle generation type antagonistic network, particularly aiming at the cross-modal retrieval between images and texts.
Drawings
FIG. 1 is an overall architecture diagram of a neural network according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of triple constraints according to an embodiment of the present invention.
FIG. 3 is a manifold restriction book intent of an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and specific examples:
in recent years, with the heat of artificial intelligence, deep learning technology is gradually emerging and affects many fields of computer science, and more people in the technical field of multimedia information retrieval use deep learning to improve the accuracy and stability of the existing retrieval. The generation type confrontation network (generating adaptive network) adopted in the method is a novel neural network which is widely used in recent years and estimates a generation model through the confrontation process, a generator (generator) for learning data distribution and a discriminator (discriminator) for discriminating the authenticity of data are trained in the network at the same time, and the generator and the discriminator mutually confront in the training process to finally achieve dynamic balance. The generative countermeasure network is widely applied to the fields of image generation, semantic segmentation, data enhancement and the like, and can well learn the data distribution rule of the training sample according to the loss function and generate new data similar to the training sample. The method utilizes two groups of generation type countermeasure networks to form a novel circulation network, and improves the efficiency, accuracy and stability of the network when the network is used for multi-mode retrieval through the hash code and various constraint conditions.
The invention provides a cross-modal retrieval method based on a cycle generation type confrontation network, which mainly designs a novel neural network, and the main overall structure of the neural network is shown in figure 1. The embodiment specifically describes the neural network framework and the data processing flow of the invention by taking mutual retrieval between image and text data as an example, and the following steps are carried out:
firstly, in the embodiment, the original two-dimensional image data actually needs to be subjected to preliminary processing, in the embodiment, 19-layer VGGNet popular in the deep learning field is selected, and 4096-dimensional feature vector output by fc7 layer of VGGNet is used as input original image feature moriI.e. the image characteristic dimension dimg4096. Meanwhile, the input original text data is processed into a preliminary feature vector, the embodiment uses a conventional Bag-of-Words (Bag-of-Words) model to process the text data, the length of the obtained BoW vector is related to the text data and a specific selected processing method, and for reference, the BoW vector dimension in the embodiment is set to 2000 dimensions, namely the text feature dimension dtxt2000, and the vector is used as the input original text feature tori
Step 1, designing a first group of generation type countermeasure networks including a generator Gm→tAnd a discriminator DtxtFrom the input original image-original text data pair (m)ori,tori) Obtaining the generated text data tgenThereby extracting a conversion mode for generating text data according to the image data, and obtaining the semantic relation between the image and the text data. The specific implementation process is described as follows:
as shown in FIG. 1, the upper half of the network can be regarded as a first group of generative confrontation networks, which mainly comprise a generator Gm→tAnd a discriminator DtxtWhen the input is the original image-original text dataTo (m)ori,tori). Data flows in the network, original image moriThrough generator Gm→tObtaining a generated text tgenI.e. tgen=Gm→t(mori) And wishes to generate a text tgenAs much as possible with the original text toriSimilarly. Generator Gm→tIs composed of multiple layers of one-dimensional convolution layers, in which the feature vector dimension is changed to dimg→512→dhash→100→dtxt。dimgA dimension representing the input original image feature, 4096 in this embodiment; dhashThe dimension of the intermediate layer features to be used for hash code generation is determined by the required hash code length, and the dimension can be 64, 128, 256 and the like; dtxtThe dimension of the original text feature input in the network, which is also the feature length of the generated text, is 2000 in this embodiment. At the same time, the discriminator DtxtAnd generator Gm→tDynamic confrontation is carried out, and original text characteristics t are tried to be distinguishedgenAnd generating text feature tori. Discriminator DtxtIs a feedforward neural network composed of fully-connected layers, in which the characteristic dimension is changed to dtxt→ 512 → 16. Generator G when the generator and the arbiter reach dynamic balance under given training conditionsm→tThe transformation mode of generating the text data according to the image data can be well extracted, so that the semantic relation between the original image and the generated text data is obtained.
Step 2, designing a second group of generation type countermeasure network, which comprises a generator Gt→mAnd a discriminator DimgInputting the original image obtained for the previous step-generating text data pairs (m)ori,tgen) Obtaining a cyclic image mcycAnd extracting a transformation mode for generating image data according to the text data so as to obtain the semantic relation between the text and the image data. The specific implementation process is described as follows:
as shown in FIG. 1, the lower half of the network can be regarded as a second group of generative countermeasure networks, mainly comprising a generator Gt→mAnd a discriminator DimgWhen the input is the original image-generated textData pair (m)ori,tgen). Data flows in the network, generating text tgenThrough generator Gt→mObtaining a cyclic image mcycI.e. mcyc=Gt→m(tgen)=Gt→m(Gm→t(mori) And hopefully loop image feature m)cycAnd original image features moriAs similar as possible. Generator Gt→mIs composed of multiple layers of one-dimensional reverse convolution layers, in which the feature vector dimension is changed to dtxt→100→dhash→512→dimg。dtxtThe dimension of the original text features input in the network, which is 2000 in this embodiment; dhashThe dimension of the intermediate layer features to be used for hash code generation is determined by the required hash code length, and can be 64, 128, 256 and the like, and is the same as the hash code length in the first group of generative countermeasure network; dimgThe dimension representing the input original image feature, which is also the length of the last generated loop image feature, is 4096 in this embodiment. At the same time, the discriminator DimgAnd generator Gt→mDynamic countermeasure is performed in an attempt to distinguish the cyclic image features mcycAnd original image features mori. Discriminator DimgIs a feedforward neural network composed of fully-connected layers, in which the characteristic dimension is changed to dimg→ 512 → 100 → 16. When the generator and the discriminator reach dynamic balance under given training conditions, the transformation mode of generating the image data according to the text data can be well extracted, and the semantic relation between the generated text and the circular image data is obtained.
And 3, utilizing the two groups of generation type countermeasure networks designed in the two steps, similarly reversing the data flow direction, and finally realizing a conversion mode of generating text data by the image data so as to obtain the semantic relation between the image and the text data. I.e. the first two steps are combined, firstly, the original text characteristics t to be input are resisted by using a second group of generating formulasoriGenerating to generate an image feature mgenObtaining semantic relation between text and image data; then the first group of generation type countermeasure network is used to generate imageFeature mgenGenerated as a circular text feature tcycAnd obtaining the semantic relation between the image and the text data. Finally, the purposes of circularly flowing image data and text data in two groups of generation type countermeasure networks, generating countermeasures and continuously optimizing the networks during training are achieved, and the specific implementation process is explained as follows:
the input data is still the original image-original text data pair (m)ori,tori) First of all, in contrast to the sequence in which the two steps are carried out above, the generator G of the antagonistic network is confronted with the second set of generating equationst→mOriginal text feature t to be inputoriGenerating to generate image features Gt→mI.e. mgen=Gt→m(tori) Generator Gt→mThe feature vector dimension change in (2) is the same as before, and is dtxt→100→dhash→512→dimg. At the same time, the discriminator DimgAnd generator Gt→mPerforming dynamic countermeasure to try to distinguish the original image features moriAnd generating image features mgen. Generator G after the confrontation reaches dynamic balancet→mThe semantic relationship between the original text-generated image data can be learned. Then, the generator G of the first group of generation type countermeasure network is usedm→tWill generate image features mgenGenerated as a circular text feature tcycI.e. tcyc=Gm→t(mten)=Gm→t(Gt→m(tori) Generator G)m→tThe feature vector dimension change in (2) is the same as before, and is dimg→512→dhash→100→dtxt. At the same time, the discriminator DtxtAnd generator Gm→tDynamic confrontation is carried out, and original text characteristics t are tried to be distinguishedoriAnd a recurring text feature tcyc. Generator G after the confrontation reaches dynamic balancem→tThe semantic relationship between the generated image-loop text data can be learned.
Through steps 1, 2 and 3, a bidirectional circulation flow channel of the image data and the text data in the embodiment in the network is established, wherein one channel is original image characteristic data moriObtaining production text features t through a first set of generative confrontation networksgenThen, t is further determinedgenGeneration of recurring image features m by second set of generative countermeasure networkscyc(ii) a The other channel, original text data toriFirstly, obtaining the generated image feature m through a second group to the antibiotic webgenThen m is addedgenProducing recurring textual features t through a first set of generative confrontation networkscyc. Thus, image and text data can be generated in a bidirectional loop in both networks, while there is a discriminator DimgAnd DtxtAnd participating in a countermeasure generator to improve the effect of network learning on the semantic relationship among cross-modal data. Wherein the discriminator DimgAnd DtxtThe loss function of (a) is designed to:
Figure BDA0001887718790000101
wherein i represents the data of the ith calculation, n training sample data are provided in total, and the discriminator can continuously reduce L in the training processdiscAnd (4) iteratively learning the direction. After the bidirectional cyclic generative countermeasure network is constructed, one of the advantages is that the finally obtained cyclic data can be compared with the original data to obtain a cyclic loss function, and the cyclic loss function is also an important component of the generator loss function:
Figure BDA0001887718790000102
wherein theta is1Is a hyper-parameter of the network, 0.001, | \ | | | non-conducting phosphor in the embodiment2Indicating that the L2 distance is calculated.
Step 4, in order to improve the efficiency of cross-modal retrieval in practical application, the method applies a threshold function to respectively extract hash codes m capable of representing image and text characteristics from the public subspaces of the two groups of generation type confrontation network generatorshashAnd thashAnd a likelihood function is designed to evaluate the approximation error between the two hash codes. The specific implementation process is described as follows:
in the two sets of generative countermeasure networks, since the input and output of the generator are respectively the feature data of different modalities, the present example uses the middle layer of the generator as the common subspace of the cross-modality data (as shown in fig. 1), and in the above steps, the feature length of the layer is designed to be the length d of the required hash codehash. Let the feature vector of the middle layer be mcomAnd tcomThe generated formula is mhash=sgn(mcom-0.5) and thash=sgn(tcom-0.5), where sgn is a threshold function, the formula means that for each floating point number in the intermediate layer floating point type feature vector, the corresponding hash code bit is set to +1 if the value is greater than 0.5 and to-1 if the value is less than 0.5. Such thresholding will apply to each bit of the feature vector of each training sample, and each training sample will have a hash code of the same length as the feature vector. Hash code m is used in the embodimentshash、thashSurrogate common subspace feature vector mcom、tcomBy searching, the distance calculation between different floating point type characteristic vectors in the original searching can be replaced by the Hamming distance calculation between Hash codes, and the calculating speed of searching is greatly improved.
In order to quantify the approximation error between the feature vector and the generated hash code, the present embodiment designs a relevant loss function as a constraint. Example uses the likelihood function of the hash code under the condition of the feature vector, and takes the jth bit of the hash code of the ith sample
Figure BDA0001887718790000111
And j th position of characteristic vector
Figure BDA0001887718790000112
For example (samples can be both images and text):
Figure BDA0001887718790000113
wherein
Figure BDA0001887718790000114
Is a sigmoid function associated with the feature vector:
Figure BDA0001887718790000115
embodiments design a loss function from the likelihood function to evaluate the approximation error between the feature vector and the generated hash code:
Figure BDA0001887718790000116
where n is the total number of samples, dhashIs the vector bit number. The loss function that evaluates the hash code approximation error will play a role in training as one of the constraints of the network.
Step 5, in order to construct a network model with better effect, the present embodiment utilizes multiple constraint conditions to constrain the data features generated during network training, so that more category features are retained, thereby improving the accuracy during retrieval. Aiming at the characteristics of multi-mode and multi-class data streams, manifold constraints are adopted to ensure the similarity and difference of data among modes and classes under an unsupervised condition; under the supervision condition, because the sample class label is given, the characteristic distance between the data of the same type and different modes is minimized by adopting the triple constraint, and the characteristic distance between the data of neither the same type nor different modes is maximized. The specific implementation process is described as follows:
another small classification network is introduced under supervision to perform class constraints on the feature vectors obtained by the generator common subspace. For a supervised cross-modal data set, that is, when a data sample for training has a class label, in order to make full use of the data class label, the present embodiment utilizes a small classification network to perform class representation on a common subspace, and designs a class loss function to constrain generation of a common subspace feature vector, so that the common subspace feature vector is different from other layer vectors, carries stronger class information, and can be correctly classified during prediction classification. The class loss function is formulated as:
Figure BDA0001887718790000121
wherein
Figure BDA0001887718790000122
Is the feature vector of the ith sample
Figure BDA0001887718790000123
Predicted classes of samples obtained via a small classification network, ciIs the actual class label of the sample and the class loss function actually calculates the L2 distance between the two.
Similarity constraints are applied to pairs of homogeneous data across modalities. In the cross-modal data, there are many semantic similar paired training data, for example, a certain image data sample and another text data sample in the training data have high semantic similarity and have similar category attributes. To take advantage of this property, in this embodiment, training image sample data is connected with its similar text sample data, and a loss function is designed to constrain the homogeneous data across modalities. The loss function is formulated as follows:
Figure BDA0001887718790000124
Figure BDA0001887718790000125
and
Figure BDA0001887718790000126
are respectively a generator Gt→m,Gm→tFeature vectors for the common subspace of the image and text are generated, and a loss function computes the L2 distance of corresponding cross-modal data that are semantically similar.
Further extend
Figure BDA0001887718790000127
In the embodiment, the similarity constraint of the same-class data among the cross-modalities and the same-class data in the same modality is considered at the same time, that is, the distance between the feature vectors of the paired cross-modality training data and the same-modality data with similar semantics should be smaller than that of other feature vectors with dissimilar semantics. In the supervised training case, since the data all have class labels, the triplet constraint is used to minimize the distance between the cross-modal data vectors under the same semantic label. The triplet constraint is illustrated schematically in fig. 2, where different shapes of the icons represent different classes of data, and different textures represent different modalities of data, with data in the feature space being closer to the same-modality data or across the same class of data in the modalities, and further from the different classes of data across the modalities. In an embodiment, to generate image data
Figure BDA0001887718790000131
For example (the feature label of the generated data is the category label of the original input data), first, the text data t with the same category label is selectedα,iSimultaneously randomly selecting different types of text data tβ,iWhere α, β represent two class labels, where x represents the generated data, i represents the data for the ith calculation, and the triplet constraint to generate the image is minimized
Figure BDA0001887718790000132
tα,iWhile maximizing the distance between
Figure BDA0001887718790000133
tβ,i. Similarly, for generating text
Figure BDA0001887718790000134
Its triplet constraint and mα,i,mβ,iIt is related. The triplet constraint loss function is therefore designed as follows:
Figure BDA0001887718790000135
for the unsupervised training situation, the embodiment designs manifold constraints to ensure the similarity of semantically similar data in the homomodal and cross-modal data. Because the data does not contain class labels when unsupervised data training is adopted, the k-neighbor matrix is constructed in the embodiment to ensure that the data with similar semantics are aggregated and the data with different semantics are separated. As shown in fig. 3, in this embodiment, after the kNN matrix is calculated, a similarity matrix is established for the data to be constrained, and then the feature vector is subjected to manifold constraint in the common subspace. To be composed of text data tαThe obtained generated image data
Figure BDA0001887718790000136
For example, according to tαCalculating the k NN matrix of (1), and calculating tαK (k is set to 2 in this embodiment) nearest data are denoted as 1 in the similarity matrix, and data that are not adjacent are denoted as 0 in the similarity matrix. After the text data are generated to obtain the image characteristic vectors, randomly selecting the generated image characteristic vectors corresponding to the text data with 1 in the similarity matrix as the image characteristic vectors
Figure BDA0001887718790000141
Generating image feature vectors corresponding to the text data of 0 in the similarity matrix as
Figure BDA0001887718790000142
In the stream constraint, minimize
Figure BDA0001887718790000143
And
Figure BDA0001887718790000144
the distance between the semantic similar data and the semantic similar data is ensured to be high in similarity of generated feature vectors, and the maximum is achieved
Figure BDA0001887718790000145
And
Figure BDA0001887718790000146
the distance between the semantic data and the semantic data ensures that the similarity of generated feature vectors of different semantic data is low. For generating text data, the same holds
Figure BDA0001887718790000147
Figure BDA0001887718790000148
To perform manifold constraints. The manifold constraint loss function is therefore designed as follows:
Figure BDA0001887718790000149
in summary, we can obtain a generator loss function composed of loss functions with various constraints. In the case of supervised data training, the generator loss function consists of a cyclic loss function
Figure BDA00018877187900001410
Hash code loss function
Figure BDA00018877187900001411
Triplet constrained loss function
Figure BDA00018877187900001412
Cross-modal homogeneous data loss function
Figure BDA00018877187900001413
And class loss function
Figure BDA00018877187900001414
The formula is as follows:
Figure BDA00018877187900001415
wherein theta is2,θ3,θ4,θ5The adjustable hyper-parameters of the network are set to 5, 5, 0.001, 20, respectively, in this embodiment. On unsupervised dataIn the training case, the generator loss function consists of a cyclic loss function
Figure BDA00018877187900001416
Hash code loss function
Figure BDA00018877187900001417
Manifold constraint loss function
Figure BDA00018877187900001418
Cross-modal homogeneous data loss function
Figure BDA00018877187900001419
Composition, the formula is as follows:
Figure BDA00018877187900001420
the value of the hyper-parameter is the same as previously set.
And (3) integrating the 5 steps, designing a discriminator loss function and a generator loss function, and then utilizing a common minimum maximum algorithm to iterate to minimize network loss so as to achieve the purpose of establishing the semantic relationship among the multi-modal data. The infinitesimal maximum algorithm in the embodiment uses a random gradient descent optimization algorithm, and particularly uses a more stable RMSProp optimization algorithm. Since the discriminator and the generator are confronted with each other, the calculation methods of the discriminator and the generator are opposite, and both of them confront the iteration result of the previous round at each round of iteration and reach dynamic balance in the confrontation. The calculation method is as follows:
Figure BDA0001887718790000151
Figure BDA0001887718790000152
because the discriminantor is trained quickly in the actual process, the network designed by the method is every timeThe training iterates the generator S times and iterates the discriminator once. In the present embodiment, the network training-related hyper-parameter S is set to 10, the learning rate μ of the network is set to 0.0001, and the batch sample size (batch size) per training is set to 64; meanwhile, the weight learned in the network is pruned, and the weight in the generator is larger than c in each traininggenIs given a weight of cgenGreater than c in the discriminatordiscIs given a weight of cdiscSo as to avoid the learned weight from being too large.
And step 6, using the trained neural network for cross-modal data search, compressing the feature vectors obtained by the data through a generator public subspace into hash codes, and retrieving by utilizing the Hamming distance between different data hash codes. The specific implementation process is described as follows:
after the image and text data in the embodiment are trained and learned through the network, the generator obtains an extraction mode of semantic relation related information among cross-modal data. The embodiment can perform the bidirectional retrieval of cross-modal data at this time, firstly fix the weight parameters in the trained network, and retrieve the image and text data m to be retrievedtest,ttestBy trained generator Gm→t,Gt→mObtaining a feature vector m on a common subspacecom,tcomThen, the feature vector is generated as a hash code mhash,thashAnd (5) standby. When searching text by using image, taking out hash code of the image
Figure BDA0001887718790000153
Calculating the Hamming distance between the hash code and all text hash codes, and calculating the hash code with the nearest distance
Figure BDA0001887718790000154
The represented text is the result of image → text cross-modality retrieval; when searching image by text, taking out hash code of the text
Figure BDA0001887718790000155
Calculating its Hamming distance to all image Hash codes, the Hash with the nearest distanceCode
Figure BDA0001887718790000156
The represented image is the result of the text → image cross modality search.
The above embodiments are only used for illustrating the design idea and features of the present invention, and the purpose of the present invention is to enable those skilled in the art to understand the content of the present invention and implement the present invention accordingly, and the protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes and modifications made in accordance with the principles and concepts disclosed herein are intended to be included within the scope of the present invention.

Claims (6)

1. A cross-modal retrieval method based on a cycle generation type countermeasure network is characterized by comprising the following steps:
designing two circulation modules, wherein one circulation module realizes the process from the image to the text to the image through two generators, and the other circulation module realizes the process from the text to the image to the text through the two generators; the two circulation modules share two generators with the same network structure, and the output data of the middle layer of the generators are subjected to Hash coding;
and designing a discriminator in each cycle module, wherein the discriminator classifies the generated data and the original data of the same mode and performs dynamic confrontation with the generator, and finally the generator and the discriminator reach dynamic balance under given training conditions.
2. The cross-modal retrieval method based on the recurrent countermeasure network of claim 1, wherein:
aiming at the characteristics of multi-mode and multi-class data streams, manifold constraints are adopted to ensure the similarity and difference of data among modes and classes under an unsupervised condition; under supervision conditions, due to the fact that class labels are given, the characteristic distance between data of the same type and different modes is minimized by adopting the triple constraint, and the characteristic distance between data of neither the same type nor the different modes is maximized.
3. The cross-modal retrieval method based on the recurrent countermeasure network of claim 2, wherein:
the loss function of the discriminator is specifically as follows:
Figure FDA0003324964020000011
the cyclic loss function obtained by comparing the finally generated data of the same modality with the original data is as follows:
Figure FDA0003324964020000012
wherein i represents the data of the ith calculation, n training sample data are provided in total, and the discriminator can continuously reduce L in the training processdiscDirection iterative learning of DimgAnd DtxtRespectively representing two discriminators, (m)ori,tori) Respectively representing an original mode m and an original mode t, mcycGenerating a modal m-feature, tcycGenerating a modal t characteristic; theta1Is a hyper-parameter of the network, | × | | non-conducting phosphor2Indicating that the L2 distance is calculated.
4. The cross-modal retrieval method based on the recurrent countermeasure network of claim 3, wherein:
let the intermediate layer eigenvector of the two generators be mcomAnd tcomThe formula for generating the hash code is as follows:
mhash=sgn(mcom-0.5)
t hash=sgn(t com-0.5)
wherein sgn is a threshold function, the meaning of the formula is that each bit of floating point number in the intermediate layer floating point type characteristic vector is set to +1 when the value is greater than 0.5, and the corresponding hash code bit is set to-1 when the value is less than 0.5.
5. The cross-modal retrieval method based on the recurrent countermeasure network of claim 4, wherein: in order to quantify approximate error between the feature vector and the generated hash code, a related loss function is designed as a constraint, a likelihood function of the hash code under the condition of the feature vector is specifically used, a sample can be an image or a text, the j th bit of the hash code of the ith sample is
Figure FDA0003324964020000021
And j th position of characteristic vector
Figure FDA0003324964020000022
Then:
Figure FDA0003324964020000023
wherein
Figure FDA0003324964020000024
Is a sigmoid function associated with the feature vector:
Figure FDA0003324964020000025
a loss function is further designed according to the likelihood function to evaluate the approximate error between the feature vector and the generated hash code:
Figure FDA0003324964020000026
Figure FDA0003324964020000031
where n is the total number of samples, dhashIs the vector bit number.
6. The cross-modal retrieval method based on the recurrent countermeasure network of claim 5, wherein: and (3) carrying out class constraint on the characteristic vector of the generator intermediate layer, thereby designing a class loss function formula as follows:
Figure FDA0003324964020000032
wherein
Figure FDA0003324964020000033
Is the feature vector of the ith sample
Figure FDA0003324964020000034
Predicted classes of samples obtained via a small classification network, ciIs the actual class label of the sample, and the class loss function actually calculates the L2 distance between the two; the cross-modal homogeneous data pair is subjected to similarity constraint, training image sample data is connected with similar text sample data, and a loss function is designed to constrain the cross-modal homogeneous data; the loss function is formulated as follows:
Figure FDA0003324964020000035
Figure FDA0003324964020000036
and
Figure FDA0003324964020000037
are respectively a generator Gt→m,Gm→tGenerating a feature vector of a common subspace of the image and the text, and calculating an L2 distance of corresponding cross-modal data with similar semantics by using a loss function; in the case of supervised data training, triple constraints are used to minimize the same semantics since the data all have class labelsThe distance between cross-modal data vectors under the label is designed as the ternary loss function:
Figure FDA0003324964020000038
wherein m and t respectively represent image and text data, alpha and beta represent two kinds of class labels, the x represents generated data, and i represents data used for the ith calculation; aiming at the unsupervised training condition, manifold constraint is designed to ensure the similarity of semantically similar data in homomodal and cross-modal data, after a kNN matrix is calculated, a similarity matrix is established for the data to be constrained, and then manifold constraint is carried out on the feature vector in a public subspace; the manifold constraint loss function is designed as follows:
Figure FDA0003324964020000041
wherein neib and non respectively represent adjacent data and non-adjacent data, and other symbols have the same meanings as before; synthesizing various functions, the generator loss function in the case of supervised data training is designed to:
Figure FDA0003324964020000042
the generator loss function in the case of unsupervised data training is designed to:
Figure FDA0003324964020000043
θ2,θ3,θ4,θ5all are weight value over parameters of the network; the whole network is trained and iterated by using a RMSProp random gradient descent optimization algorithm, and the iteration formula is as follows:
Figure FDA0003324964020000044
Figure FDA0003324964020000045
because the gradient of the discriminator is reduced quickly in the actual process, the discriminator is iterated once when the designed network is trained to iterate the generators for S times, and the hyper-parameter c is usedgen,cdiscAnd pruning the network weight to prevent the network weight from being overlarge.
CN201811455802.6A 2018-11-30 2018-11-30 Cross-modal retrieval method based on cycle generation type countermeasure network Active CN109299342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811455802.6A CN109299342B (en) 2018-11-30 2018-11-30 Cross-modal retrieval method based on cycle generation type countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811455802.6A CN109299342B (en) 2018-11-30 2018-11-30 Cross-modal retrieval method based on cycle generation type countermeasure network

Publications (2)

Publication Number Publication Date
CN109299342A CN109299342A (en) 2019-02-01
CN109299342B true CN109299342B (en) 2021-12-17

Family

ID=65142338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811455802.6A Active CN109299342B (en) 2018-11-30 2018-11-30 Cross-modal retrieval method based on cycle generation type countermeasure network

Country Status (1)

Country Link
CN (1) CN109299342B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019652B (en) * 2019-03-14 2022-06-03 九江学院 Cross-modal Hash retrieval method based on deep learning
US11475308B2 (en) * 2019-03-15 2022-10-18 Samsung Electronics Co., Ltd. Jointly pruning and quantizing deep neural networks
CN110059157A (en) * 2019-03-18 2019-07-26 华南师范大学 A kind of picture and text cross-module state search method, system, device and storage medium
CN110032734B (en) * 2019-03-18 2023-02-28 百度在线网络技术(北京)有限公司 Training method and device for similar meaning word expansion and generation of confrontation network model
CN110222140B (en) * 2019-04-22 2021-07-13 中国科学院信息工程研究所 Cross-modal retrieval method based on counterstudy and asymmetric hash
CN111127385B (en) * 2019-06-06 2023-01-13 昆明理工大学 Medical information cross-modal Hash coding learning method based on generative countermeasure network
CN110309861B (en) * 2019-06-10 2021-05-25 浙江大学 Multi-modal human activity recognition method based on generation of confrontation network
CN110334708A (en) * 2019-07-03 2019-10-15 中国科学院自动化研究所 Difference automatic calibrating method, system, device in cross-module state target detection
CN110443309A (en) * 2019-08-07 2019-11-12 浙江大学 A kind of electromyography signal gesture identification method of combination cross-module state association relation model
CN112487217A (en) * 2019-09-12 2021-03-12 腾讯科技(深圳)有限公司 Cross-modal retrieval method, device, equipment and computer-readable storage medium
CN110909181A (en) * 2019-09-30 2020-03-24 中国海洋大学 Cross-modal retrieval method and system for multi-type ocean data
CN110930469B (en) * 2019-10-25 2021-11-16 北京大学 Text image generation method and system based on transition space mapping
CN110990595B (en) * 2019-12-04 2023-05-05 成都考拉悠然科技有限公司 Cross-domain alignment embedded space zero sample cross-modal retrieval method
CN111104982B (en) * 2019-12-20 2021-09-24 电子科技大学 Label-independent cross-task confrontation sample generation method
CN111353076B (en) * 2020-02-21 2023-10-10 华为云计算技术有限公司 Method for training cross-modal retrieval model, cross-modal retrieval method and related device
WO2021189383A1 (en) * 2020-03-26 2021-09-30 深圳先进技术研究院 Training and generation methods for generating high-energy ct image model, device, and storage medium
CN111523663B (en) * 2020-04-22 2023-06-23 北京百度网讯科技有限公司 Target neural network model training method and device and electronic equipment
CN111581405B (en) * 2020-04-26 2021-10-26 电子科技大学 Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
CN111783980B (en) * 2020-06-28 2023-04-07 大连理工大学 Ranking learning method based on dual cooperation generation type countermeasure network
CN111881884B (en) * 2020-08-11 2021-05-28 中国科学院自动化研究所 Cross-modal transformation assistance-based face anti-counterfeiting detection method, system and device
CN112199462A (en) * 2020-09-30 2021-01-08 三维通信股份有限公司 Cross-modal data processing method and device, storage medium and electronic device
CN112364192B (en) * 2020-10-13 2024-10-01 中山大学 Zero sample hash retrieval method based on ensemble learning
WO2022104540A1 (en) * 2020-11-17 2022-05-27 深圳大学 Cross-modal hash retrieval method, terminal device, and storage medium
CN113706646A (en) * 2021-06-30 2021-11-26 酷栈(宁波)创意科技有限公司 Data processing method for generating landscape painting
CN113204522B (en) * 2021-07-05 2021-09-24 中国海洋大学 Large-scale data retrieval method based on Hash algorithm combined with generation countermeasure network
CN113779283B (en) * 2021-11-11 2022-04-01 南京码极客科技有限公司 Fine-grained cross-media retrieval method with deep supervision and feature fusion
CN116524420B (en) * 2023-07-03 2023-09-12 武汉大学 Key target detection method and system in traffic scene
CN117133024B (en) * 2023-10-12 2024-07-26 湖南工商大学 Palm print image recognition method integrating multi-scale features and dynamic learning rate

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473307A (en) * 2013-09-10 2013-12-25 浙江大学 Cross-media sparse Hash indexing method
CN106547826A (en) * 2016-09-30 2017-03-29 西安电子科技大学 A kind of cross-module state search method, device and computer-readable medium
CN107871014A (en) * 2017-11-23 2018-04-03 清华大学 A kind of big data cross-module state search method and system based on depth integration Hash
CN108256627A (en) * 2017-12-29 2018-07-06 中国科学院自动化研究所 The mutual generating apparatus of audio-visual information and its training system that generation network is fought based on cycle
CN108510559A (en) * 2017-07-19 2018-09-07 哈尔滨工业大学深圳研究生院 It is a kind of based on have supervision various visual angles discretization multimedia binary-coding method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9030430B2 (en) * 2012-12-14 2015-05-12 Barnesandnoble.Com Llc Multi-touch navigation mode

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473307A (en) * 2013-09-10 2013-12-25 浙江大学 Cross-media sparse Hash indexing method
CN106547826A (en) * 2016-09-30 2017-03-29 西安电子科技大学 A kind of cross-module state search method, device and computer-readable medium
CN108510559A (en) * 2017-07-19 2018-09-07 哈尔滨工业大学深圳研究生院 It is a kind of based on have supervision various visual angles discretization multimedia binary-coding method
CN107871014A (en) * 2017-11-23 2018-04-03 清华大学 A kind of big data cross-module state search method and system based on depth integration Hash
CN108256627A (en) * 2017-12-29 2018-07-06 中国科学院自动化研究所 The mutual generating apparatus of audio-visual information and its training system that generation network is fought based on cycle

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
跨模态检索研究综述;欧卫华等;《贵州师范大学学报(自然科学版)》;20180331;第36卷(第2期);第114-120页 *

Also Published As

Publication number Publication date
CN109299342A (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN109299342B (en) Cross-modal retrieval method based on cycle generation type countermeasure network
Deng et al. Unsupervised semantic-preserving adversarial hashing for image search
Wu et al. Cycle-consistent deep generative hashing for cross-modal retrieval
Wu et al. Unsupervised Deep Hashing via Binary Latent Factor Models for Large-scale Cross-modal Retrieval.
CN110222140B (en) Cross-modal retrieval method based on counterstudy and asymmetric hash
Makhzani et al. Adversarial autoencoders
CN109241317B (en) Pedestrian Hash retrieval method based on measurement loss in deep learning network
Lai et al. Instance-aware hashing for multi-label image retrieval
CN111581405A (en) Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
CN109271522A (en) Comment sensibility classification method and system based on depth mixed model transfer learning
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN110222218B (en) Image retrieval method based on multi-scale NetVLAD and depth hash
CN112434159B (en) Method for classifying paper multi-labels by using deep neural network
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
CN113204522B (en) Large-scale data retrieval method based on Hash algorithm combined with generation countermeasure network
CN110598022B (en) Image retrieval system and method based on robust deep hash network
CN109960732B (en) Deep discrete hash cross-modal retrieval method and system based on robust supervision
CN114896434B (en) Hash code generation method and device based on center similarity learning
CN111461175A (en) Label recommendation model construction method and device of self-attention and cooperative attention mechanism
Song et al. A weighted topic model learned from local semantic space for automatic image annotation
CN113779283B (en) Fine-grained cross-media retrieval method with deep supervision and feature fusion
Yang et al. Graph regularized encoder-decoder networks for image representation learning
Zheng et al. Robust representation learning with reliable pseudo-labels generation via self-adaptive optimal transport for short text clustering
CN114168773A (en) Semi-supervised sketch image retrieval method based on pseudo label and reordering
Xiao et al. ANE: Network embedding via adversarial autoencoders

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant