CN108319686B - Adversarial cross-media retrieval method based on a limited text space - Google Patents
- Publication number
- CN108319686B (application number CN201810101127.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- image
- network
- space
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5846—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Library & Information Science (AREA)
- Fuzzy Systems (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an adversarial cross-media retrieval method based on a limited text space. The method designs a feature extraction network, a feature mapping network and a modality classifier, learns a limited text space, extracts image and text features suited to cross-media retrieval, and realizes the mapping of image features from the image space to the text space; an adversarial training mechanism continuously reduces the difference in feature distribution between data of different modalities during learning, thereby enabling cross-media retrieval. The invention better fits human behavior in the cross-media retrieval task; it obtains image and text features better suited to cross-media retrieval, making up for the limited expressive power of pre-trained features; and it introduces an adversarial learning mechanism that further improves retrieval accuracy through the minimax game between the modality classifier and the feature mapping network.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to an adversarial cross-media retrieval method based on a limited text space.
Background
With the advent of the Web 2.0 era, a large amount of multimedia data (images, text, video, audio, etc.) began to accumulate and spread over the internet. Unlike traditional single-modality retrieval tasks, cross-media retrieval enables two-way retrieval between data of different modalities, such as retrieving images with text and retrieving text with images. However, due to the inherent heterogeneity of multimedia data, their similarity cannot be measured directly. The core problem of this kind of task is therefore how to find a homogeneous mapping space in which the similarity between heterogeneous multimedia data can be measured directly. In the current cross-media retrieval field, much research has been carried out on this problem, and a series of typical cross-media retrieval algorithms have been proposed, such as CCA (Canonical Correlation Analysis), DeViSE (Deep Visual-Semantic Embedding), and DSPE (Deep Structure-Preserving Image-Text Embedding). However, these methods still have certain drawbacks.
The first drawback concerns the feature representation of multimedia data. Most existing methods adopt a pre-trained CNN (Convolutional Neural Network) model to extract image features, such as the network structure proposed by VGG (Visual Geometry Group). However, these models are usually pre-trained on the image classification task, so the extracted image features only contain the category information of the objects and lose information that may be important for cross-media retrieval, such as the behavior and motion of objects and the interaction between objects. For text, Word2Vec, LDA (Latent Dirichlet Allocation) and FV (Fisher Vector) are mainstream text feature extraction methods, but they are likewise pre-trained on data sets other than cross-media retrieval, so the extracted features are not well suited to cross-media retrieval.
The second drawback concerns the choice of the homogeneous feature space. There are roughly three choices of isomorphic space: a common space, the text space and the image space. From a human cognitive perspective, the brain's process of understanding text differs from that of images. For text, the brain extracts features and understands directly; for an image, the brain subconsciously describes it in text before understanding, i.e. it first converts from the image space to the text space. Therefore, cross-media retrieval in the text space better simulates the human cognitive manner. Existing text-space-based cross-media retrieval methods mostly adopt the Word2Vec space as the final text space and obtain the feature representation of an image in that space by combining the category information of the objects in the image. Such features therefore lose the rich action and interaction information contained in the image, which shows that the Word2Vec space is not an effective text feature space for cross-media retrieval.
The third drawback concerns the difference in feature distribution between the modalities. Although existing methods map the data features of different modalities into some homogeneous feature space, a modality gap still exists between them, together with a significant difference in feature distribution, which can degrade cross-media retrieval performance.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides an adversarial cross-media retrieval method based on a limited text space. First, image and text feature descriptions corresponding to the cross-media retrieval task are obtained through learning; second, a limited text space is found by simulating the human cognitive manner and is used to realize similarity measurement between images and texts. The method also introduces an adversarial training mechanism, which aims to reduce the difference in feature distribution between data of different modalities during the learning of the text space and thus further improves retrieval accuracy.
The principle of the invention is as follows. As described in the background, the core problem of cross-media retrieval is how to find a homogeneous mapping space in which the similarity between heterogeneous multimedia data can be measured directly. More precisely, this core problem can be subdivided into two sub-problems. The first sub-problem is how to learn an effective representation of multimedia data features. The second sub-problem is how to find a suitable isomorphic feature space. The invention provides a cross-media retrieval method based on a limited text space. For the first sub-problem, the invention uses the feature extraction network to learn effective image and text feature representations. Based on the image description (image captioning) task, the invention learns a new image feature by combining a CNN (Convolutional Neural Network) with an image description algorithm. This feature contains not only the category information of the objects in the image but also rich interaction information between the objects. For text features, a Recurrent Neural Network (RNN) learned from scratch is applied to the cross-media retrieval task. For the second sub-problem, the method uses the feature mapping network to learn a limited text space; to further reduce the difference between the features of different modalities, the invention designs a modality classifier that plays a minimax game with the feature mapping network. Specifically, the modality classifier distinguishes the modality of the current limited-text-space feature, while the feature mapping network learns modality-invariant features and thereby confuses the modality classifier. In addition to the conventional triplet loss, an additional adversarial loss is propagated back from the modality classifier to the feature mapping network during training, further reducing the difference between the features of different modalities. "Limited text space" means that the text space learned by this method is made up of a series of basis vectors that can be regarded as different words in a dictionary; the expressive power of this text space is therefore bounded by the number of words in the dictionary, and hence limited. The method mainly learns the limited text space and realizes similarity measurement between images and texts. Based on the limited text space, it extracts image and text features suited to cross-media retrieval by simulating the human cognitive manner, realizes the mapping of image features from the image space to the text space, and introduces an adversarial training mechanism to continuously reduce the difference in feature distribution between data of different modalities during learning. The method obtains more accurate retrieval results on classical cross-media retrieval data sets.
The technical scheme provided by the invention is as follows:
An adversarial cross-media retrieval method based on a limited text space uses a feature extraction network, a feature mapping network and a modality classifier to learn a limited text space, extracts image and text features suited to cross-media retrieval, and realizes the mapping of image features from the image space to the text space; an adversarial training mechanism continuously reduces the difference in feature distribution between data of different modalities during learning. First, the feature extraction network, the feature mapping network and the modality classifier are trained with a data set D; then, the trained networks are used to carry out adversarial cross-media retrieval on retrieval request data. The specific steps are as follows:
Assume the training data set D = {D_1, D_2, …, D_n} contains n samples, where each sample D_i comprises a picture I_i and a piece of descriptive text T_i, i.e. D_i = (I_i, T_i). Each text consists of several (5) sentences, each sentence independently describing the matching picture; thus each image has 5 descriptive sentences with similar but different meanings.
1) Extract the features of the images and texts in D through the feature extraction network.
For an image, image features are extracted by combining the existing VGG model with the image description algorithm NIC; for text, an LSTM (Long Short-Term Memory) network is used to extract text features. Since the LSTM network is not pre-trained, its parameters are updated synchronously with the parameters of the feature mapping network.
The calculation process of image feature extraction is expressed as formula 1:
I_Concat = Concatenate(VGGNet(I), NIC(I))   (formula 1)
where VGGNet(·) is the 19-layer VGG model that extracts the 4096-dimensional feature I_VGG of the input image I; NIC(·) is the image description algorithm that extracts the 512-dimensional feature I_NIC of the image; Concatenate(·) is a feature concatenation layer that joins I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
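As an illustration of formula 1, the following is a minimal PyTorch-style sketch of the concatenation step; the function name and the assumption that the 4096-dimensional VGG feature and the 512-dimensional NIC feature are already available as precomputed tensors are hypothetical, not part of the patent text.

```python
import torch

def concat_image_features(i_vgg: torch.Tensor, i_nic: torch.Tensor) -> torch.Tensor:
    """Join the 4096-d VGGNet feature and the 512-d NIC feature into a 4608-d vector (formula 1).

    i_vgg: (batch, 4096) output of the 19-layer VGG model.
    i_nic: (batch, 512) output of the NIC image embedding layer.
    """
    assert i_vgg.size(1) == 4096 and i_nic.size(1) == 512
    return torch.cat([i_vgg, i_nic], dim=1)  # (batch, 4608) feature I_Concat
```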
The text feature extraction specifically executes the following steps:
Given a text S = (s_0, s_1, …, s_T) of length T, each word s_t in S is represented using 1-of-k encoding, where k is the number of words in the dictionary. Before entering the LSTM network, the word s_t first needs to be mapped to a denser space, represented by formula 2:
x_t = W_e s_t,  t ∈ {0, …, T}   (formula 2)
where W_e is the word-vector mapping matrix that encodes the 1-of-k vector s_t into a d-dimensional word vector.
The resulting dense-space word vectors are fed into the LSTM network, represented as formula 3:
i_t = σ(W_xi x_t + W_hi h_(t-1) + b_i)
f_t = σ(W_xf x_t + W_hf h_(t-1) + b_f)
o_t = σ(W_xo x_t + W_ho h_(t-1) + b_o)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ tanh(W_xc x_t + W_hc h_(t-1) + b_c)
h_t = o_t ⊙ tanh(c_t)   (formula 3)
where i_t, f_t, o_t, c_t, h_t respectively denote the outputs of the input gate, forget gate, output gate, memory cell and hidden layer of the LSTM unit at time t; x_t is the word-vector input at the current time; h_(t-1) is the hidden-layer input of the LSTM unit at the previous time; σ denotes the sigmoid function; ⊙ denotes element-wise multiplication; tanh denotes the hyperbolic tangent activation function. The hidden-layer output h_T of the LSTM network at time T is the feature representation of the text S.
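A minimal sketch of this text branch (formulas 2 and 3) using a standard LSTM is given below; the use of the final hidden state h_T as the text feature follows the description above, while the concrete module layout and parameter names are assumptions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Maps a 1-of-k encoded sentence to a d-dimensional text feature h_T."""

    def __init__(self, vocab_size: int, d: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)     # W_e: 1-of-k word -> d-dim word vector (formula 2)
        self.lstm = nn.LSTM(d, d, batch_first=True)  # gates of formula 3

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        # word_ids: (batch, T+1) integer word indices s_0 .. s_T
        x = self.embed(word_ids)        # (batch, T+1, d)
        _, (h_T, _) = self.lstm(x)      # h_T: (1, batch, d), hidden state at the last time step
        return h_T.squeeze(0)           # (batch, d) text feature
```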
2) Design a feature fusion layer at the top of the feature mapping network that fuses I_VGG_txt and I_NIC_txt into I_final, the d-dimensional feature representation of the input image in the limited text space; d is the dimension of the limited text space. The feature mapping network maps the text and image features obtained in step 1) into the limited text space in its initial state; the similarity between the feature vectors is then compared through a similarity measurement function (i.e. the distance between the two vectors is calculated) to obtain the current triplet loss. Next, the feature vectors of the different modalities are sent to the modality classifier for classification to obtain the current adversarial loss. Finally, the limited text space is trained by optimizing the joint loss function of the triplet loss and the adversarial loss.
The text features are not sent to the feature mapping network here, because the feature extraction network (LSTM network) already realizes the mapping of the text to the feature space in the process of feature extraction;
The feature fusion layer at the top of the feature mapping network is obtained by the processing of formula 5:
I_VGG_txt = f(I_VGG),  I_NIC_txt = g(I_NIC),  I_final = I_VGG_txt ⊕ I_NIC_txt   (formula 5)
where I_VGG is the 4096-dimensional image feature extracted by VGGNet, I_NIC is the 512-dimensional image feature extracted by the image description algorithm NIC, I_final is the d-dimensional feature representation of the input image in the limited text space, f(·) and g(·) are the two feature mapping functions, I_VGG_txt and I_NIC_txt are the d-dimensional text-space feature mappings of I_VGG and I_NIC respectively, and ⊕ denotes the element-wise addition performed by the fusion layer.
The similarity measurement function is expressed as s(v, t) = v · t, where v and t denote the image feature and the text feature respectively; v and t are normalized by the L2 normalization layer prior to comparison, so that s is equivalent to the cosine distance.
The feature mapping network is trained by optimizing the triplet loss function and the adversarial loss function; the following operations are specifically executed:
Set the distance between the input image (or text) and its matching text (or matching image) to d_1, and the distance to an unmatched text (or unmatched image) to d_2; d_1 should be smaller than d_2 by at least a margin m, where the margin m is an externally determined hyper-parameter. The triplet loss function is represented by formula 6:
L_emb(θ_f) = Σ_k max(0, m − s(v, t) + s(v, t_k)) + Σ_k max(0, m − s(v, t) + s(v_k, t))   (formula 6)
where t_k is the k-th unmatched text of the input image v; v_k is the k-th unmatched image of the input text t; m is the minimum distance margin; s(v, t) is the similarity measurement function; θ_f denotes the parameters of the feature mapping network. The unmatched samples are randomly selected from the data set in each training period.
The adversarial loss L_adv of the modality classifier is synchronously back-propagated to the feature mapping network.
The overall loss function L is defined as formula 7:
L = L_emb − λ·L_adv   (formula 7)
where λ is an adaptive parameter whose value ranges from 0 to 1; L_emb denotes the triplet loss function; L_adv is the additional adversarial loss function.
In order to suppress the noisy signal of the modality classifier in the initial stage of training, the update of the parameter λ is implemented by formula 8, where p denotes the percentage of the current iteration count in the total iteration count and λ is the adaptive parameter.
The feature mapping network is trained with the loss function L, and its parameters θ_f are updated through formula 9:
θ_f ← θ_f − μ · ∂L/∂θ_f   (formula 9)
where μ denotes the learning rate of the optimization algorithm, L denotes the total loss function of the feature mapping network, and θ_f denotes the parameters of the feature mapping network.
3) The image and text features obtained in step 2), which lie in the same limited text space, are sent to the modality classifier for classification, and the modality classifier is trained through the cross-entropy loss; the following operations are specifically executed:
Given the text-space feature label [0 1] for images and [1 0] for texts, the training of the modality classifier is realized by optimizing a two-class cross-entropy loss function, expressed as formula 4:
L_adv(θ_d) = −(1/N) Σ_(i=1..N) y_i · log D(x_i; θ_d)   (formula 4)
where x_i and y_i respectively denote the i-th input text-space feature and its corresponding label; N denotes the total number of feature samples currently input; θ_d denotes the training parameters of the modality classifier; the function D(·; θ_d) predicts the modality (text or picture) of the current text-space feature; L_adv is the two-class cross-entropy loss function of the modality classifier and also the additional adversarial loss function of the feature mapping network.
The parameters θ_d of the modality classifier are updated through formula 10:
θ_d ← θ_d − μ · ∂L_adv/∂θ_d   (formula 10)
where μ denotes the learning rate of the optimization algorithm, L_adv denotes the loss function of the modality classifier, and θ_d denotes the parameters of the modality classifier.
4) Repeat step 2) and step 3) until the feature mapping network converges.
5) According to the retrieval request, the distance in the limited text space between the retrieval request data (image or text) and the data of the other modality in the data set D is calculated, and the retrieval results are ranked by this distance to obtain the most similar retrieval results. The distance is calculated through the dot product between the feature vectors of the data of different modalities in the limited text space.
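To make step 5) concrete, a short sketch of ranking by dot product in the limited text space is given below; the function and variable names are illustrative, and it assumes the features have already been L2-normalized by the mapping network.

```python
import torch

def rank_by_similarity(query_feat: torch.Tensor, gallery_feats: torch.Tensor) -> torch.Tensor:
    """Return gallery indices sorted from most to least similar to the query.

    query_feat:    (d,) L2-normalized feature of the query image or text.
    gallery_feats: (N, d) L2-normalized features of the other modality in data set D.
    """
    scores = gallery_feats @ query_feat            # dot product == cosine similarity after L2 norm
    return torch.argsort(scores, descending=True)  # ranking of the retrieval results
```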
Through the above steps, adversarial cross-media retrieval based on the limited text space is realized.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a antagonism cross-media retrieval method based on a limited text space, which mainly obtains the limited text space through learning and realizes similarity measurement between images and texts. The method is based on a limited text space, extracts the image and text features suitable for cross-media retrieval by simulating the cognitive mode of human, realizes the mapping of the image features from the image space to the text space, introduces a antagonism training mechanism and aims to continuously reduce the difference of feature distribution among different modal data in the learning process. The method obtains more accurate retrieval results in the cross-media retrieval classical data set. Specifically, the invention uses the feature extraction network to learn to obtain effective image and text feature representation, and the image features are further sent into the feature mapping network to realize the mapping from the image space to the text space. Finally, in order to further reduce the difference of feature distribution among different modal data, the antagonism loss generated by the modal classifier is reversely propagated to the feature mapping network, so that the retrieval result is further improved. Specifically, the present invention has the following technical advantages:
the invention is directed to cross-media retrieval in a restricted text space by way of simulating human cognition. Compared with the existing method based on public space or image space, the method can better fit the human behavior in the cross-media retrieval task;
(II) The feature extraction network learns image and text features better suited to the cross-media retrieval task, making up for the limited expressive power of pre-trained features;
(III) To further reduce the difference in feature distribution between data of different modalities, the invention introduces an adversarial learning mechanism and further improves retrieval accuracy through the minimax game between the modality classifier and the feature mapping network.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention;
wherein (a) shows that the invention comprises three parts: a feature extraction network, a feature mapping network and a modality classifier; (b) and (c) are the network structure diagrams of the feature mapping network and the modality classifier, respectively.
FIG. 2 is a schematic diagram of a network architecture of the feature extraction network of the present invention;
the method comprises the following steps that (a) an image feature extraction network is used for extracting image features through combination of a 19-layer VGG model VGGNet and an image description algorithm NIC; (b) is a recurrent neural network (LSTM) for extracting text features.
FIG. 3 is a cross-media retrieval effect screenshot implemented on a Flickr8K test data set according to an embodiment of the present invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides an adversarial cross-media retrieval method based on a limited text space, which mainly learns a limited text space and realizes similarity measurement between images and texts. Based on the limited text space, the method extracts image and text features suited to cross-media retrieval by simulating the human cognitive manner, realizes the mapping of image features from the image space to the text space, and introduces an adversarial training mechanism to continuously reduce the difference in feature distribution between data of different modalities during learning. The feature extraction network, the feature mapping network, the modality classifier and their implementation, as well as the training procedure of the networks, are described in detail below.
1. Feature extraction network
The feature extraction network comprises two branches, an image feature extraction network and a text feature extraction network, corresponding to feature extraction for images and for text respectively.
1) Image feature extraction: the image feature extraction network learns the image feature I_Concat, which consists of the 4096-dimensional feature I_VGG and the image feature I_NIC extracted by the image description algorithm;
The image feature extraction network can be regarded as a combination of VGGNet (the 19-layer VGG model, a neural network structure proposed by the Visual Geometry Group) and NIC (Neural Image Caption, an image description algorithm). VGGNet is pre-trained on the image classification task and extracts image features containing rich object category information; in contrast, NIC is pre-trained on the image description task and extracts image features containing rich information about the interaction between objects. The image features extracted by the two are therefore complementary.
Specifically, after an image of size 224 × 224 is fed into VGGNet, the network outputs the 4096-dimensional feature I_VGG; meanwhile, to avoid losing image information in the conversion process, the output of the Image Embedding Layer in the NIC is taken as the image feature I_NIC extracted by the image description algorithm. Finally, the image feature I_Concat is the combination of I_VGG and I_NIC. The calculation procedure is expressed as formula 1:
I_Concat = Concatenate(VGGNet(I), NIC(I))   (formula 1)
where VGGNet(·) is the 19-layer VGG model that extracts the 4096-dimensional feature I_VGG of the input image I; NIC(·) is the image description algorithm that extracts the 512-dimensional feature I_NIC of the image; Concatenate(·) is a feature concatenation layer that joins I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
2) Text feature extraction
The text feature extraction network extracts d-dimensional text features using a long short-term memory recurrent neural network (LSTM); d is also the dimension of the limited text space. Given a text S = (s_0, s_1, …, s_T) of length T, each word s_t in S is represented using 1-of-k encoding, where k is the number of words in the dictionary. Before entering the LSTM network, the word s_t first needs to be mapped to a denser space:
x_t = W_e s_t,  t ∈ {0, …, T}   (formula 2)
where W_e is the word-vector mapping matrix that encodes the 1-of-k vector s_t into a d-dimensional word vector. After obtaining the dense-space word vectors, we feed them into the LSTM network, expressed as formula 3:
i_t = σ(W_xi x_t + W_hi h_(t-1) + b_i)
f_t = σ(W_xf x_t + W_hf h_(t-1) + b_f)
o_t = σ(W_xo x_t + W_ho h_(t-1) + b_o)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ tanh(W_xc x_t + W_hc h_(t-1) + b_c)
h_t = o_t ⊙ tanh(c_t)   (formula 3)
where i_t, f_t, o_t, c_t, h_t respectively denote the outputs of the input gate, forget gate, output gate, memory cell and hidden layer of the LSTM unit at time t; x_t is the word-vector input at the current time; h_(t-1) is the hidden-layer input of the LSTM unit at the previous time; σ denotes the sigmoid function and ⊙ denotes element-wise multiplication; tanh denotes the hyperbolic tangent activation function. The feature representation of the text S is the hidden-layer output of the LSTM network at time T, namely h_T.
FIG. 2 shows the network architecture of the feature extraction network of the present invention. During training, the parameters of VGGNet are kept fixed, and the NIC is pre-trained on the image description task using the Flickr30K or MSCOCO training data set. Specifically, all images in the data set are first resized to 256 × 256, then a single 224 × 224 central crop is taken, and the resulting image block is fed into the feature extraction network to extract image features. For text, LSTM and bidirectional LSTM networks are used to extract text features, where the number of hidden-layer nodes of the LSTM unit is 1024.
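A sketch of the image preprocessing described above (resize to 256 × 256, single 224 × 224 central crop), using torchvision transforms, is shown below; the normalization statistics are the usual ImageNet values and are an assumption not stated in the text.

```python
from torchvision import transforms

# Single-crop preprocessing before the image is fed into the feature extraction network.
single_crop = transforms.Compose([
    transforms.Resize((256, 256)),     # set all images to 256 x 256
    transforms.CenterCrop(224),        # single central 224 x 224 region
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # assumed ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```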
2. Modal classifier
To further reduce the difference between the feature distributions of the different modalities, we design a modality classifier that acts as the discriminator in a generative adversarial network. Given the text-space feature label [0 1] for images and [1 0] for texts, the training of the modality classifier is realized by optimizing a two-class cross-entropy loss function, expressed as formula 4:
L_adv(θ_d) = −(1/N) Σ_(i=1..N) y_i · log D(x_i; θ_d)   (formula 4)
where x_i and y_i respectively denote the i-th input text-space feature and its corresponding label; N denotes the total number of feature samples currently input; θ_d denotes the training parameters of the modality classifier; the function D(·; θ_d) predicts the modality (text or picture) of the current text-space feature; L_adv is the two-class cross-entropy loss function of the modality classifier and also the additional adversarial loss function of the feature mapping network.
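The following is a minimal sketch of such a modality classifier and its two-class cross-entropy loss (formula 4); the layer sizes are illustrative assumptions, since the text only specifies the classifier's role as a discriminator.

```python
import torch
import torch.nn as nn

class ModalityClassifier(nn.Module):
    """Discriminator D(.; theta_d): predicts whether a text-space feature came from an image or a text."""

    def __init__(self, d: int = 1024, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),          # two classes: image = [0 1], text = [1 0]
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats)             # (batch, 2) logits

# Two-class cross-entropy loss L_adv over the current batch of text-space features.
adv_criterion = nn.CrossEntropyLoss()
```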
3. Feature mapping network
The invention learns the limited text space through the parameters θ_f of the feature mapping network. The image feature I_Concat learned by the feature extraction network comprises the two parts I_VGG and I_NIC. For the image feature I_Concat, two mapping functions f(·) and g(·) are designed in the feature mapping network, which respectively realize the mapping of I_VGG and I_NIC to the d-dimensional text-space features I_VGG_txt and I_NIC_txt. Like I_VGG and I_NIC, the features I_VGG_txt and I_NIC_txt are complementary, so a feature fusion layer is designed at the top of the feature mapping network to combine the advantages of both. The process is defined as formula 5:
I_VGG_txt = f(I_VGG),  I_NIC_txt = g(I_NIC),  I_final = I_VGG_txt ⊕ I_NIC_txt   (formula 5)
where I_VGG is the 4096-dimensional image feature extracted by VGGNet, I_NIC is the 512-dimensional image feature extracted by the image description algorithm NIC, I_final is the d-dimensional feature representation of the input image in the limited text space, f(·) and g(·) are the two feature mapping functions, I_VGG_txt and I_NIC_txt are the d-dimensional text-space feature mappings of I_VGG and I_NIC respectively, and ⊕ denotes the element-wise addition performed by the fusion layer. Notably, the feature extraction process for text amounts to mapping the text into the limited text space; thus, the parameters θ_f of the feature mapping network (see formula 9) contain the parameters of the LSTM network.
FIG. 1 (b) and (c) show the network structures of the feature mapping network and the modality classifier, respectively. The feature mapping network comprises the two mapping functions f(·) and g(·), a fusion layer and an L2 normalization layer (L2 Norm). f(·) contains two fully connected layers with 2048 and 1024 hidden nodes respectively; ReLU is used as the activation function between the fully connected layers, and a Dropout layer with rate 0.5 is added after the ReLU to prevent overfitting. g(·) contains one fully connected layer with 1024 hidden nodes. The fusion layer implements element-wise addition. The L2 normalization layer allows the similarity between the learned features to be measured directly through the dot product, speeds up model convergence and improves training stability.
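A sketch of the feature mapping network described above (f(·) with 2048- and 1024-node fully connected layers, ReLU and Dropout(0.5); g(·) with one 1024-node layer; element-wise addition fusion; L2 normalization) follows. The exact module organization is an assumption; the layer sizes follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMappingNetwork(nn.Module):
    """Maps I_VGG and I_NIC into the d-dimensional limited text space (formula 5)."""

    def __init__(self, d: int = 1024):
        super().__init__()
        self.f = nn.Sequential(               # f(.): 4096 -> 2048 -> 1024
            nn.Linear(4096, 2048), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(2048, d), nn.ReLU(), nn.Dropout(0.5),
        )
        self.g = nn.Linear(512, d)            # g(.): 512 -> 1024

    def forward(self, i_vgg: torch.Tensor, i_nic: torch.Tensor) -> torch.Tensor:
        i_final = self.f(i_vgg) + self.g(i_nic)       # fusion layer: element-wise addition
        return F.normalize(i_final, p=2, dim=1)       # L2 normalization layer
```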
After mapping the image and the text into the limited text space in its initial state, the next step is to compare the similarity between the features and compute the corresponding triplet loss. We define a similarity measurement function s(v, t) = v · t, where v and t denote the image feature and the text feature respectively. To make s equivalent to the cosine distance, v and t are normalized by the L2 normalization layer before comparison. The triplet loss function is widely used in the cross-media retrieval field. Given an input image (text), let the distance to the matching text (image) be d_1 and the distance to an unmatched text (image) be d_2; we want d_1 to be smaller than d_2 by at least a margin m. The margin m is an externally determined hyper-parameter; for optimization purposes we fix m = 0.3 and apply it to all data sets. Thus, in the present invention, the triplet loss function is represented by formula 6:
L_emb(θ_f) = Σ_k max(0, m − s(v, t) + s(v, t_k)) + Σ_k max(0, m − s(v, t) + s(v_k, t))   (formula 6)
where t_k is the k-th unmatched text of the input image v; v_k is the k-th unmatched image of the input text t; m is the minimum distance margin; s(v, t) is the similarity measurement function; θ_f denotes the parameters of the feature mapping network. The unmatched samples are randomly selected from the data set in each training period.
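A minimal sketch of the bidirectional triplet ranking loss of formula 6 with margin m = 0.3 is given below; it assumes L2-normalized image and text features so that the dot product implements s(v, t), and uses all non-matching pairs within a batch as the randomly drawn negatives.

```python
import torch

def triplet_loss(v: torch.Tensor, t: torch.Tensor, m: float = 0.3) -> torch.Tensor:
    """Bidirectional ranking loss L_emb (formula 6).

    v, t: (batch, d) L2-normalized image / text features; row i of v matches row i of t.
    """
    scores = v @ t.t()                          # (batch, batch) similarity matrix s(v_i, t_j)
    pos = scores.diag().unsqueeze(1)            # s(v, t) for the matching pairs
    loss_i = (m - pos + scores).clamp(min=0)    # image as anchor, unmatched texts t_k as negatives
    loss_t = (m - pos.t() + scores).clamp(min=0)  # text as anchor, unmatched images v_k as negatives
    mask = torch.eye(scores.size(0), device=scores.device).bool()
    loss_i = loss_i.masked_fill(mask, 0)        # remove the matching-pair terms
    loss_t = loss_t.masked_fill(mask, 0)
    return loss_i.sum() + loss_t.sum()
```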
Next, the feature vectors of the different modalities are sent to the modality classifier for classification to obtain the current adversarial loss. In addition to the triplet loss, the adversarial loss L_adv of the modality classifier is also synchronously back-propagated to the feature mapping network.
Finally, the limited text space is trained by optimizing the joint loss function of the triplet loss L_emb and the adversarial loss L_adv. Since L_emb and L_adv oppose each other, the overall loss function L is defined as:
L = L_emb − λ·L_adv   (formula 7)
where λ is an adaptive parameter whose value ranges from 0 to 1; L_emb denotes the triplet loss function; L_adv is the additional adversarial loss function. In order to suppress the noisy signal of the modality classifier in the initial stage of training, the update of the parameter λ is implemented by the expression shown in formula 8, where p denotes the percentage of the current iteration count in the total iteration count and λ is the adaptive parameter.
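The exact expression of formula 8 is not reproduced above; a common choice matching the stated behaviour (λ rising from 0 toward 1 with the training progress p, suppressing the classifier's noisy signal early on) is sketched below. The constant 10 and the functional form are assumptions, not the patent's confirmed expression.

```python
import math

def adaptive_lambda(p: float) -> float:
    """Adaptive weight for the adversarial loss; p is the fraction of completed iterations in [0, 1]."""
    return 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0   # assumed schedule: 0 at p=0, approaching 1 as p -> 1
```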
FIG. 3 shows the actual cross-media retrieval effect of the present invention on the Flickr8K test data set. The first column of the table lists the image and text queries used for retrieval; the second to fourth columns show the top-5 retrieval results of LTS-A (VGG + BLSTM), LTS-A (NIC + BLSTM) and LTS-A (VGG + NIC + BLSTM) for each query, respectively. For image-to-text retrieval, correctly retrieved text is shown in red font; for text-to-image retrieval, correctly retrieved images are marked with a tick. Reading the table from left to right, and in particular from LTS-A (VGG + BLSTM) to LTS-A (NIC + BLSTM), the retrieval results improve significantly; in addition, the erroneously retrieved samples still match the query to some extent.
4. Training mode
The training process of the present invention includes four stages.
First: in the initial training stage, we fix the parameters of VGGNet and pre-train the NIC using Flickr30K (image data from the photo-sharing website Flickr, 30000 pictures in total) or MSCOCO (a data set created by Microsoft using Amazon's Mechanical Turk service). After this training is completed, image features can be extracted through the feature extraction network.
Second: after extracting the features of all images in the data set, the second training stage is mainly used to learn the limited text space. Given the loss function L of the feature mapping network, we fix the parameters θ_d of the modality classifier and update the parameters θ_f of the feature mapping network through formula 9:
θ_f ← θ_f − μ · ∂L/∂θ_f   (formula 9)
where μ denotes the learning rate of the optimization algorithm, L denotes the total loss function of the feature mapping network, and θ_f denotes the parameters of the feature mapping network.
Third: after the second training stage, the third training stage is mainly used to enhance the discriminative power of the modality classifier. Given the loss function L_adv of the modality classifier, we fix the parameters θ_f of the feature mapping network and update the parameters θ_d of the modality classifier through formula 10:
θ_d ← θ_d − μ · ∂L_adv/∂θ_d   (formula 10)
where μ denotes the learning rate of the optimization algorithm, L_adv denotes the loss function of the modality classifier, and θ_d denotes the parameters of the modality classifier.
Fourth: the second and third training stages are repeated for each batch of training data until the model converges.
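A compact sketch of the alternation between the second and third training stages is given below. It builds on the earlier sketches (TextEncoder, FeatureMappingNetwork, ModalityClassifier, triplet_loss, adv_criterion, adaptive_lambda); the optimizer choice, learning rates, and the data loader yielding (i_vgg, i_nic, word_ids) batches are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

params_f = list(map_net.parameters()) + list(text_enc.parameters())   # theta_f includes the LSTM
opt_f = torch.optim.Adam(params_f, lr=2e-4)        # assumed optimizer and learning rates
opt_d = torch.optim.Adam(clf.parameters(), lr=2e-4)

for it, (i_vgg, i_nic, word_ids) in enumerate(loader):
    lam = adaptive_lambda(it / total_iters)
    img_feat = map_net(i_vgg, i_nic)                          # image features in the limited text space
    txt_feat = F.normalize(text_enc(word_ids), p=2, dim=1)    # text features in the limited text space
    labels = torch.cat([torch.ones(len(img_feat)), torch.zeros(len(txt_feat))]).long()

    # Stage 2: fix theta_d, update theta_f with L = L_emb - lambda * L_adv (formulas 7 and 9)
    feats = torch.cat([img_feat, txt_feat], dim=0)
    loss_f = triplet_loss(img_feat, txt_feat) - lam * adv_criterion(clf(feats), labels)
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()

    # Stage 3: fix theta_f (features detached), update theta_d with L_adv (formula 10)
    loss_d = adv_criterion(clf(feats.detach()), labels)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```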
Table 1 shows the experimental results of cross-media retrieval of the present invention on the Flickr8K test data set. To evaluate retrieval effectiveness, we follow the standard ranking metrics Recall@K and Median Rank. Recall@K measures retrieval accuracy as the probability that the correctly matched data is ranked in the top K (K = 1, 5, 10) retrieval results; Median Rank is the median of the ranks at which the correctly matched data appears. A higher Recall@K and a lower Median Rank indicate more accurate retrieval. The table compares the invention with other advanced algorithms, including DeViSE (Deep Visual-Semantic Embedding), m-RNN (multimodal Recurrent Neural Network for image description), Deep Fragment (Deep Fragment Embedding), DCCA (Deep Canonical Correlation Analysis), VSE (Visual-Semantic Embedding), m-CNN_ENS (ensemble of multimodal Convolutional Neural Networks), NIC (Neural Image Caption) and HM-LSTM (Hierarchical Multimodal LSTM). In addition, we designed four variants of the proposed method:
● LTS-A (VGG + LSTM): the image description algorithm NIC is removed from the image feature extraction process, with the rest unchanged;
● LTS-A (NIC + LSTM): the convolutional neural network VGGNet is removed from the image feature extraction process, with the rest unchanged;
● LTS-A (VGG + NIC + LSTM): the network architecture shown in fig. 2;
● LTS-A (VGG + NIC + BLSTM): the network architecture shown in fig. 2, with the LSTM network replaced by a bidirectional LSTM network (BLSTM).
Table 1. Cross-media retrieval effect of the embodiment on the Flickr8K test data set.
In Table 1, Img2Txt denotes image-to-text retrieval and Txt2Img denotes text-to-image retrieval. As can be seen from Table 1, LTS-A (VGG + NIC + BLSTM) surpasses HM-LSTM on the image-to-text retrieval task and achieves the best retrieval effect to date. However, LTS-A (VGG + NIC + BLSTM) does not perform as well as HM-LSTM on the text-to-image retrieval task. The most probable reason is that HM-LSTM adopts a tree-structured LSTM architecture and can better model the hierarchical structure of the text, whereas the invention adopts a chain LSTM architecture and cannot capture hierarchical semantic information in the text. In addition, as can be seen from the variation in experimental results among the four variants, when the image feature extraction network is changed from VGGNet to NIC, the accuracy of image-to-text retrieval improves by 22% and the accuracy of text-to-image retrieval improves by 17%, indicating that NIC extracts more effective image features than the traditional VGGNet. After the image feature extraction network is changed from NIC to VGG + NIC, cross-media retrieval accuracy improves by a further 6%, showing that the network can extract not only detailed object category information in the image but also rich interaction information between objects. Finally, replacing the LSTM network with a bidirectional LSTM network (BLSTM) brings an additional 2% improvement in retrieval accuracy.
Table 2 shows the cross-media retrieval effect of the embodiment on the Flickr30K test data set. In addition to the advanced algorithms mentioned for Flickr8K, we add DAN (Dual Attention Networks), DSPE (Deep Structure-Preserving Image-Text Embedding) and VSE++ (an enhanced Visual-Semantic Embedding). Here DAN, which performs better than DSPE, achieves the best retrieval effect. Thanks to its attention mechanism, DAN can continuously focus on fine-grained information in the data, which is beneficial for cross-media retrieval; in contrast, we use only global features to represent images and text and are therefore disturbed by noise information in the images or text. Besides DAN, DSPE also performs better than our method, because it uses more complex text features (Fisher Vector) and loss functions. As for the four variants of the invention, their experimental performance is similar to that on Flickr8K.
Table 2. Cross-media retrieval effect of the embodiment on the Flickr30K test data set.
Table 3. Cross-media retrieval effect of the embodiment on the MSCOCO test data set.
Table 3 shows the cross-media retrieval effect of the embodiment on the MSCOCO test data set. In addition to the advanced algorithms mentioned for Flickr8K and Flickr30K, we add Order (Order-Embeddings of Images and Language). Here LTS-A (VGG + NIC + LSTM) achieves the best effect on the image-to-text retrieval task, improving retrieval accuracy by about 2%, although its R@1 index is lower than that of DSPE. On the text-to-image retrieval task, DSPE performs better than our method on Recall@K, but LTS-A (VGG + NIC + LSTM) achieves the best result on the Median Rank index. This is because the chain LSTM network used in the invention does not capture the hierarchical semantic information in the text well, so its text feature representation capability is inferior to that of FV (Fisher Vector). As for the four variants of the invention, their experimental performance is similar to that on Flickr8K and Flickr30K.
Table 4. Cross-media retrieval effect of two variants of the embodiment, LTS-A and LTS.
Table 4 shows the effect of the adversarial learning mechanism on the experimental results. We designed two variants of the original invention: LTS-A and LTS. LTS-A is the previously mentioned LTS-A (VGG + NIC + LSTM); LTS is LTS-A (VGG + NIC + LSTM) with the adversarial learning mechanism removed.
From the table we can see that LTS-A brings a significant improvement in cross-media retrieval accuracy over LTS; LTS exceeds LTS-A only on the R@1 index of image-to-text retrieval. The experimental results show that adversarial learning has an obvious effect on reducing the difference between the feature distributions of data of different modalities.
Table 5. Retrieval effect of the embodiment on the MSCOCO test data set.
Table 6 shows the retrieval effect on the MSCOCO test data set when image features are extracted with a single crop and with ten crops, respectively.
In the above implementation, a single crop (1-crop) of the image region is used to extract image features. To verify the effectiveness of using the feature mean of ten different regions of an image as the image feature (10-crops), we designed LTS-A (10-crops), where LTS-A refers to LTS-A (VGG + NIC + BLSTM) and 10-crops means that the image feature is taken as the mean of the features of ten different regions of the image. As can be seen from Table 6, the retrieval accuracy of LTS-A (10-crops) is significantly improved compared with LTS-A (1-crop), which illustrates the feasibility of using the feature mean of ten different regions of an image as the image feature.
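A sketch of the 10-crop variant follows: the image feature is the mean of the features extracted from ten standard crops (four corners and the center, plus their horizontal flips). The use of torchvision's TenCrop, the exact crop layout, and the helper names are assumptions about implementation details not spelled out in the text.

```python
import torch
from torchvision import transforms

ten_crop = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.TenCrop(224),   # 4 corners + center, plus horizontal flips
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

def ten_crop_feature(image, extract_fn):
    """Average the feature extraction network's output over the ten crops of one image."""
    crops = ten_crop(image)               # (10, 3, 224, 224)
    return extract_fn(crops).mean(dim=0)  # mean of the ten crop features
```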
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.
Claims (7)
1. An adversarial cross-media retrieval method based on a limited text space, comprising designing a feature extraction network, a feature mapping network and a modality classifier, learning a limited text space, extracting image and text features suited to cross-media retrieval, and realizing the mapping of image features from the image space to the text space; continuously reducing the difference in feature distribution between data of different modalities during learning through an adversarial training mechanism; thereby enabling cross-media retrieval; specifically, the method comprises the following steps:
A. the feature extraction network comprises an image feature extraction network and a text feature extraction network, which are respectively used for image feature extraction and text feature extraction; the image feature extraction network learns the image feature I_Concat through one or both of VGGNet and NIC, comprising one or both of the 4096-dimensional feature I_VGG and the image feature I_NIC extracted by the image description algorithm; the text feature extraction network extracts d-dimensional text features using a long short-term memory recurrent neural network (LSTM) or a bidirectional LSTM network (BLSTM);
B. the modality classifier acts as the discriminator in the adversarial network, and its training is realized by optimizing a two-class cross-entropy loss function; this function is also the additional adversarial loss function of the feature mapping network;
C. the feature mapping network learns the limited text space through its parameters θ_f; for the image feature I_Concat learned by the feature extraction network, comprising I_VGG and I_NIC, mapping functions f(·) and g(·) are designed in the feature mapping network to respectively realize the mapping of I_VGG and I_NIC to the d-dimensional text-space features I_VGG_txt and I_NIC_txt; a feature fusion layer is designed at the top of the feature mapping network to fuse I_VGG_txt and I_NIC_txt into I_final as the d-dimensional feature representation of the input image in the limited text space; the dimension of the limited text space is d;
assume the training data set D = {D_1, D_2, …, D_n} contains n samples, where each sample D_i comprises a picture I_i and a piece of descriptive text T_i, i.e. D_i = (I_i, T_i); each text consists of 5 sentences, and each sentence independently describes the matching picture; for the data set D, the following steps 1)-4) are executed to train the feature extraction network, the feature mapping network and the modality classifier:
1) extract the features of the images and texts in D through the feature extraction network: for the images in D, image features are extracted using the VGG model and the image description algorithm NIC; for the texts in D, text features are extracted using the long short-term memory recurrent neural network (LSTM), realizing the mapping of the text to the feature space, wherein the parameters of the LSTM network and the parameters of the feature mapping network need to be updated synchronously;
2) the feature mapping network respectively maps the text and image features obtained in step 1) into the limited text space in its initial state; first, the distance between the feature vectors is calculated through the similarity measurement function and the similarity between the feature vectors is compared to obtain the current triplet loss; then, the feature vectors of the data of different modalities are sent to the modality classifier for classification to obtain the current adversarial loss; finally, the limited text space is trained by optimizing the joint loss function of the triplet loss and the adversarial loss;
3) respectively sending the images and the text features which are obtained in the step 2) and are positioned in the same limited text space into a modal classifier for classification, and training the modal classifier through cross entropy loss;
4) repeating the steps 2) -3) until the feature mapping network is converged;
5) according to the retrieval request, the distance in the limited text space between the image or text of the retrieval request data and the data of the other modality in the data set D is calculated, and the retrieval results are ranked by this distance to obtain the most similar retrieval results; specifically, the distance is calculated through the dot product between the feature vectors of the data of different modalities in this space;
through the above steps, adversarial cross-media retrieval based on the limited text space is realized.
2. The adversarial cross-media retrieval method of claim 1, wherein the computation process of image feature extraction is expressed as formula 1:
I_Concat = Concatenate(VGGNet(I), NIC(I))   (formula 1)
wherein VGGNet(·) is the 19-layer VGG model that extracts the 4096-dimensional feature I_VGG of the input image I; NIC(·) is the image description algorithm that extracts the 512-dimensional feature I_NIC of the image; Concatenate(·) is a feature concatenation layer that joins I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
3. The adversarial cross-media retrieval method as claimed in claim 1, wherein the text feature extraction specifically performs the following steps:
giving a text of length T S ═ (S)0,s1,…,sT) Each word vector S in StAll using 1-of-k coding representation, k representing the number of words in the dictionary; word vector s before entry into the LSTM networktNeeds to be mapped to a more dense space first, represented by equation 2:
x_t = W_e · s_t, t ∈ {0, …, T} (equation 2)
wherein W_e is the word-vector mapping matrix, which encodes the 1-of-k word vector s_t into a d-dimensional word vector;
the resulting dense-space word vectors are fed into the LSTM network, represented as equation 3:
wherein i_t, f_t, o_t, c_t and h_t respectively denote the outputs of the input gate, forget gate, output gate, memory cell and hidden layer of the LSTM unit at time t; x_t denotes the word vector input at the current time; h_{t-1} is the hidden-layer input of the LSTM unit at the previous time; σ denotes the sigmoid function; ⊙ denotes element-wise multiplication; tanh denotes the hyperbolic tangent activation function; the hidden-layer output h_T of the LSTM network at time T is the feature representation of the text S.
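Equation 3 referred to above is not reproduced in this text; the standard LSTM update that matches the quantities just described (input, forget and output gates, memory cell and hidden state, with sigmoid σ, tanh and element-wise product ⊙) reads, as an assumed reconstruction:

```latex
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i)\\
f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f)\\
o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c)\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```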
4. The adversarial cross-media retrieval method of claim 1, wherein the training of the modality classifier specifically performs the following operations:
Given that the text space feature label of an image is [0 1] and the text space feature label of a text is [1 0], the training of the modal classifier is realized by optimizing a two-class cross entropy loss function, expressed as formula 4:
wherein x_i and y_i respectively denote the i-th input text space feature and its corresponding label; N denotes the total number of currently input feature samples; θ_d denotes the training parameters of the modal classifier, whose output function is used to predict the modality of the current text space feature, i.e. text or picture; L_adv denotes the two-class cross entropy loss function of the modal classifier, which also serves as the additional adversarial loss function of the feature mapping network;
the parameter θ_d of the modal classifier is updated by equation 10:
wherein μ denotes the learning rate of the optimization algorithm, and L_adv denotes the two-class cross entropy loss function of the modal classifier defined in formula 4.
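Formulas 4 and 10 are likewise not reproduced above; a standard form consistent with the description (two-class cross entropy over a classifier output D(x_i; θ_d), followed by a plain gradient-descent step with learning rate μ) would be, as an assumed reconstruction:

```latex
L_{adv}(\theta_d) = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log D(x_i;\theta_d) + (1-y_i)\log\big(1 - D(x_i;\theta_d)\big) \Big],
\qquad
\theta_d \leftarrow \theta_d - \mu \,\frac{\partial L_{adv}}{\partial \theta_d}
```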
5. The adversarial cross-media retrieval method of claim 1, wherein the feature fusion layer at the top layer of the feature mapping network is computed as in equation 5:
wherein I_VGG is the 4096-dimensional image feature extracted by VGGNet, I_NIC is the 512-dimensional image feature extracted by the image description algorithm NIC, I_final is the d-dimensional feature representation of the input image in the restricted text space, f(·) and g(·) denote the two feature mapping functions, and I_VGG_txt and I_NIC_txt are the d-dimensional text space feature mappings of I_VGG and I_NIC respectively.
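A minimal sketch of this top layer is given below; f and g are modelled as single fully connected layers, and since the text above does not spell out the concrete fusion operation, an element-wise average of I_VGG_txt and I_NIC_txt is assumed purely for illustration:

```python
import torch
import torch.nn as nn

class TextSpaceMapping(nn.Module):
    """Sketch of the top of the feature mapping network (equation 5).

    f and g are modelled here as single fully connected layers; the fusion of
    I_VGG_txt and I_NIC_txt into I_final is shown as an element-wise average,
    which is an assumption made only for this sketch.
    """
    def __init__(self, d: int = 256):
        super().__init__()
        self.f = nn.Linear(4096, d)   # maps I_VGG -> I_VGG_txt
        self.g = nn.Linear(512, d)    # maps I_NIC -> I_NIC_txt

    def forward(self, i_vgg: torch.Tensor, i_nic: torch.Tensor) -> torch.Tensor:
        i_vgg_txt = self.f(i_vgg)
        i_nic_txt = self.g(i_nic)
        return 0.5 * (i_vgg_txt + i_nic_txt)   # I_final, d-dimensional
```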
6. The adversarial cross-media retrieval method of claim 1, wherein step 2) trains the feature mapping network by optimizing the triplet loss function and the adversarial loss function, and specifically performs the following operations:
Let d_1 be the distance between the input image or text and its matching text or image, and d_2 the distance to an unmatched text or image; d_1 must be smaller than d_2 by at least a margin m; the margin m is an externally specified hyper-parameter; the triplet loss function is represented by equation 6:
wherein t_k is the k-th unmatched text of the input image v; v_k is the k-th unmatched image of the input text t; m is the minimum distance margin; s(v, t) is the similarity measurement function; θ_f is the parameter of the feature mapping network; the unmatched samples are randomly selected from the data set in each training period;
the adversarial loss L_adv of the modal classifier is synchronously back-propagated to the feature mapping network;
defining the overall loss function L as equation 7:
L = L_emb - λ·L_adv (equation 7)
wherein λ is an adaptive parameter whose value varies from 0 to 1; L_emb denotes the triplet loss function; L_adv is the additional adversarial loss function;
in order to suppress the noise signal of the modal classifier at the initial stage of training, the update of the parameter λ can be implemented by equation 8:
wherein p denotes the ratio of the current iteration number to the total number of iterations; λ is the adaptive parameter;
the feature mapping network is trained with the loss function L, and the parameter θ_f of the feature mapping network is updated through formula 9:
wherein μ denotes the learning rate of the optimization algorithm, L denotes the total loss function of the feature mapping network, and θ_f is the parameter of the feature mapping network.
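Formulas 6, 8 and 9 referenced in this claim are not reproduced above. Consistent with the description, a standard bidirectional ranking (triplet) loss, the commonly used adversarial-training schedule for λ (which rises from 0 to 1 with p), and a plain gradient-descent step would read as follows; these are assumed reconstructions, not the patent's exact expressions:

```latex
L_{emb}(\theta_f) = \sum_{k}\max\big(0,\; m - s(v, t) + s(v, t_k)\big)
                  + \sum_{k}\max\big(0,\; m - s(v, t) + s(v_k, t)\big),
\qquad
\lambda = \frac{2}{1 + e^{-10\,p}} - 1,
\qquad
\theta_f \leftarrow \theta_f - \mu \,\frac{\partial L}{\partial \theta_f}
```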
7. The adversarial cross-media retrieval method of claim 1, wherein the similarity measurement function s(v, t) of step 2) is expressed as:
s(v,t)=v·t
wherein v and t represent image features and text features, respectively; v and t are normalized by a normalization layer prior to comparison, so that s is equivalent to the cosine similarity.
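A small sketch of this similarity, with the normalization that makes the dot product act as a cosine score, might look like this (the epsilon guard is an added assumption for numerical safety):

```python
import numpy as np

def similarity(v: np.ndarray, t: np.ndarray, eps: float = 1e-8) -> float:
    """Dot product of L2-normalized image and text features, s(v, t) = v · t."""
    v = v / (np.linalg.norm(v) + eps)
    t = t / (np.linalg.norm(t) + eps)
    return float(np.dot(v, t))
```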
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810101127.0A CN108319686B (en) | 2018-02-01 | 2018-02-01 | Antagonism cross-media retrieval method based on limited text space |
PCT/CN2018/111327 WO2019148898A1 (en) | 2018-02-01 | 2018-10-23 | Adversarial cross-media retrieving method based on restricted text space |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810101127.0A CN108319686B (en) | 2018-02-01 | 2018-02-01 | Antagonism cross-media retrieval method based on limited text space |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108319686A CN108319686A (en) | 2018-07-24 |
CN108319686B true CN108319686B (en) | 2021-07-30 |
Family
ID=62888861
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810101127.0A Expired - Fee Related CN108319686B (en) | 2018-02-01 | 2018-02-01 | Antagonism cross-media retrieval method based on limited text space |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108319686B (en) |
WO (1) | WO2019148898A1 (en) |
Families Citing this family (78)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108319686B (en) * | 2018-02-01 | 2021-07-30 | 北京大学深圳研究生院 | Antagonism cross-media retrieval method based on limited text space |
CN109344266B (en) * | 2018-06-29 | 2021-08-06 | 北京大学深圳研究生院 | Dual-semantic-space-based antagonistic cross-media retrieval method |
CN109508400B (en) * | 2018-10-09 | 2020-08-28 | 中国科学院自动化研究所 | Method for generating image-text abstract |
CN109783655B (en) * | 2018-12-07 | 2022-12-30 | 西安电子科技大学 | Cross-modal retrieval method and device, computer equipment and storage medium |
CN109783657B (en) * | 2019-01-07 | 2022-12-30 | 北京大学深圳研究生院 | Multi-step self-attention cross-media retrieval method and system based on limited text space |
CN109919162B (en) * | 2019-01-25 | 2021-08-10 | 武汉纺织大学 | Model for outputting MR image feature point description vector symbol and establishing method thereof |
CN110059217B (en) * | 2019-04-29 | 2022-11-04 | 广西师范大学 | Image text cross-media retrieval method for two-stage network |
CN110189249B (en) * | 2019-05-24 | 2022-02-18 | 深圳市商汤科技有限公司 | Image processing method and device, electronic equipment and storage medium |
CN110175256B (en) * | 2019-05-30 | 2024-06-07 | 上海联影医疗科技股份有限公司 | Image data retrieval method, device, equipment and storage medium |
CN112182281B (en) * | 2019-07-05 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Audio recommendation method, device and storage medium |
CN110502743A (en) * | 2019-07-12 | 2019-11-26 | 北京邮电大学 | Social networks based on confrontation study and semantic similarity is across media search method |
CN110674688B (en) * | 2019-08-19 | 2023-10-31 | 深圳力维智联技术有限公司 | Face recognition model acquisition method, system and medium for video monitoring scene |
CN110866129A (en) * | 2019-11-01 | 2020-03-06 | 中电科大数据研究院有限公司 | Cross-media retrieval method based on cross-media uniform characterization model |
CN111105013B (en) * | 2019-11-05 | 2023-08-11 | 中国科学院深圳先进技术研究院 | Optimization method of countermeasure network architecture, image description generation method and system |
CN111179254B (en) * | 2019-12-31 | 2023-05-30 | 复旦大学 | Domain adaptive medical image segmentation method based on feature function and countermeasure learning |
CN113094550B (en) * | 2020-01-08 | 2023-10-24 | 百度在线网络技术(北京)有限公司 | Video retrieval method, device, equipment and medium |
CN111198964B (en) * | 2020-01-10 | 2023-04-25 | 中国科学院自动化研究所 | Image retrieval method and system |
CN111259152A (en) * | 2020-01-20 | 2020-06-09 | 刘秀萍 | Deep multilayer network driven feature aggregation category divider |
CN111259851B (en) * | 2020-01-23 | 2021-04-23 | 清华大学 | Multi-mode event detection method and device |
CN111325319B (en) * | 2020-02-02 | 2023-11-28 | 腾讯云计算(北京)有限责任公司 | Neural network model detection method, device, equipment and storage medium |
CN111353076B (en) * | 2020-02-21 | 2023-10-10 | 华为云计算技术有限公司 | Method for training cross-modal retrieval model, cross-modal retrieval method and related device |
CN111368176B (en) * | 2020-03-02 | 2023-08-18 | 南京财经大学 | Cross-modal hash retrieval method and system based on supervision semantic coupling consistency |
CN111782921A (en) * | 2020-03-25 | 2020-10-16 | 北京沃东天骏信息技术有限公司 | Method and device for searching target |
CN111597810B (en) * | 2020-04-13 | 2024-01-05 | 广东工业大学 | Named entity identification method for semi-supervised decoupling |
CN113673635B (en) * | 2020-05-15 | 2023-09-01 | 复旦大学 | Hand-drawn sketch understanding deep learning method based on self-supervision learning task |
CN111651660B (en) * | 2020-05-28 | 2023-05-02 | 拾音智能科技有限公司 | Method for cross-media retrieval of difficult samples |
CN111651577B (en) * | 2020-06-01 | 2023-04-21 | 全球能源互联网研究院有限公司 | Cross-media data association analysis model training and data association analysis method and system |
CN111708745B (en) * | 2020-06-18 | 2023-04-21 | 全球能源互联网研究院有限公司 | Cross-media data sharing representation method and user behavior analysis method and system |
CN111882032B (en) * | 2020-07-13 | 2023-12-01 | 广东石油化工学院 | Neural semantic memory storage method |
CN112001482B (en) * | 2020-08-14 | 2024-05-24 | 佳都科技集团股份有限公司 | Vibration prediction and model training method, device, computer equipment and storage medium |
CN111984800B (en) * | 2020-08-16 | 2023-11-17 | 西安电子科技大学 | Hash cross-modal information retrieval method based on dictionary pair learning |
CN114969417B (en) * | 2020-09-23 | 2023-04-11 | 华为技术有限公司 | Image reordering method, related device and computer readable storage medium |
CN112466281A (en) * | 2020-10-13 | 2021-03-09 | 讯飞智元信息科技有限公司 | Harmful audio recognition decoding method and device |
CN112214988B (en) * | 2020-10-14 | 2024-01-23 | 哈尔滨福涛科技有限责任公司 | Deep learning and rule combination-based negotiable article structure analysis method |
CN112396091B (en) * | 2020-10-23 | 2024-02-09 | 西安电子科技大学 | Social media image popularity prediction method, system, storage medium and application |
CN112651448B (en) * | 2020-12-29 | 2023-09-15 | 中山大学 | Multi-mode emotion analysis method for social platform expression package |
CN112949384B (en) * | 2021-01-23 | 2024-03-08 | 西北工业大学 | Remote sensing image scene classification method based on antagonistic feature extraction |
CN112818157B (en) * | 2021-02-10 | 2022-09-16 | 浙江大学 | Combined query image retrieval method based on multi-order confrontation characteristic learning |
CN112861977B (en) * | 2021-02-19 | 2024-01-26 | 中国人民武装警察部队工程大学 | Migration learning data processing method, system, medium, equipment, terminal and application |
CN113052311B (en) * | 2021-03-16 | 2024-01-19 | 西北工业大学 | Feature extraction network with layer jump structure and method for generating features and descriptors |
CN113420166A (en) * | 2021-03-26 | 2021-09-21 | 阿里巴巴新加坡控股有限公司 | Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment |
CN113537272B (en) * | 2021-03-29 | 2024-03-19 | 之江实验室 | Deep learning-based semi-supervised social network abnormal account detection method |
CN113159071B (en) * | 2021-04-20 | 2022-06-21 | 复旦大学 | Cross-modal image-text association anomaly detection method |
CN113536013B (en) * | 2021-06-03 | 2024-02-23 | 国家电网有限公司大数据中心 | Cross-media image retrieval method and system |
CN113379603B (en) * | 2021-06-10 | 2024-03-15 | 大连海事大学 | Ship target detection method based on deep learning |
CN113656616B (en) * | 2021-06-23 | 2024-02-27 | 同济大学 | Three-dimensional model sketch retrieval method based on heterogeneous twin neural network |
CN113360683B (en) * | 2021-06-30 | 2024-04-19 | 北京百度网讯科技有限公司 | Method for training cross-modal retrieval model and cross-modal retrieval method and device |
CN113362416B (en) * | 2021-07-01 | 2024-05-17 | 中国科学技术大学 | Method for generating image based on text of target detection |
CN113254678B (en) * | 2021-07-14 | 2021-10-01 | 北京邮电大学 | Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof |
CN113610128B (en) * | 2021-07-28 | 2024-02-13 | 西北大学 | Aesthetic attribute retrieval-based picture aesthetic description modeling and describing method and system |
CN114022687B (en) * | 2021-09-24 | 2024-05-10 | 之江实验室 | Image description countermeasure generation method based on reinforcement learning |
CN113946710B (en) * | 2021-10-12 | 2024-06-11 | 浙江大学 | Video retrieval method based on multi-mode and self-supervision characterization learning |
CN114090801B (en) * | 2021-10-19 | 2024-07-19 | 山东师范大学 | Deep countering attention cross-modal hash retrieval method and system |
CN114022372B (en) * | 2021-10-25 | 2024-04-16 | 大连理工大学 | Mask image patching method for introducing semantic loss context encoder |
CN114153969B (en) * | 2021-11-09 | 2024-06-21 | 浙江大学 | Efficient text classification system with high accuracy |
CN114297473B (en) * | 2021-11-25 | 2024-10-15 | 北京邮电大学 | News event searching method and system based on multistage image-text semantic alignment model |
CN114241517B (en) * | 2021-12-02 | 2024-02-27 | 河南大学 | Cross-mode pedestrian re-recognition method based on image generation and shared learning network |
CN114298159B (en) * | 2021-12-06 | 2024-04-09 | 湖南工业大学 | Image similarity detection method based on text fusion under label-free sample |
CN114138995B (en) * | 2021-12-08 | 2024-07-16 | 东北大学 | Small sample cross-modal retrieval method based on countermeasure learning |
CN114443916B (en) * | 2022-01-25 | 2024-02-06 | 中国人民解放军国防科技大学 | Supply and demand matching method and system for test data |
CN114495281A (en) * | 2022-02-10 | 2022-05-13 | 南京邮电大学 | Cross-modal pedestrian re-identification method based on integral and partial constraints |
CN114677569B (en) * | 2022-02-17 | 2024-05-10 | 之江实验室 | Character-image pair generation method and device based on feature decoupling |
CN114676218B (en) * | 2022-03-10 | 2024-08-27 | 清华大学 | Information retrieval method and device, electronic equipment and readable storage medium |
CN114743630B (en) * | 2022-04-01 | 2024-08-02 | 杭州电子科技大学 | Medical report generation method based on cross-modal contrast learning |
CN115114395B (en) * | 2022-04-15 | 2024-03-19 | 腾讯科技(深圳)有限公司 | Content retrieval and model training method and device, electronic equipment and storage medium |
CN114936285B (en) * | 2022-05-25 | 2024-07-12 | 齐鲁工业大学 | Crisis information detection method and system based on antagonistic multi-mode automatic encoder |
CN115129917B (en) * | 2022-06-06 | 2024-04-09 | 武汉大学 | optical-SAR remote sensing image cross-modal retrieval method based on modal common characteristics |
CN115048491B (en) * | 2022-06-18 | 2024-09-06 | 哈尔滨工业大学 | Software cross-modal retrieval method based on hypothesis test in heterogeneous semantic space |
CN115131613B (en) * | 2022-07-01 | 2024-04-02 | 中国科学技术大学 | Small sample image classification method based on multidirectional knowledge migration |
CN115909317B (en) * | 2022-07-15 | 2024-07-05 | 广州珠江在线多媒体信息有限公司 | Learning method and system for three-dimensional model-text joint expression |
CN115840827B (en) * | 2022-11-07 | 2023-09-19 | 重庆师范大学 | Deep unsupervised cross-modal hash retrieval method |
CN116108215A (en) * | 2023-02-21 | 2023-05-12 | 湖北工业大学 | Cross-modal big data retrieval method and system based on depth fusion |
CN116821408B (en) * | 2023-08-29 | 2023-12-01 | 南京航空航天大学 | Multi-task consistency countermeasure retrieval method and system |
CN116935329B (en) * | 2023-09-19 | 2023-12-01 | 山东大学 | Weak supervision text pedestrian retrieval method and system for class-level comparison learning |
CN117312592B (en) * | 2023-11-28 | 2024-02-09 | 云南联合视觉科技有限公司 | Text-pedestrian image retrieval method based on modal invariant feature learning |
CN117611924B (en) * | 2024-01-17 | 2024-04-09 | 贵州大学 | Plant leaf phenotype disease classification method based on graphic subspace joint learning |
CN117688193B (en) * | 2024-02-01 | 2024-05-31 | 湘江实验室 | Picture and text unified coding method, device, computer equipment and medium |
CN118227821B (en) * | 2024-05-24 | 2024-07-26 | 济南大学 | Sketch three-dimensional model retrieval method based on anti-noise network |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9507816B2 (en) * | 2011-05-24 | 2016-11-29 | Nintendo Co., Ltd. | Partitioned database model to increase the scalability of an information system |
CN104346440B (en) * | 2014-10-10 | 2017-06-23 | 浙江大学 | A kind of across media hash indexing methods based on neutral net |
CN106095893B (en) * | 2016-06-06 | 2018-11-20 | 北京大学深圳研究生院 | A kind of cross-media retrieval method |
CN108319686B (en) * | 2018-02-01 | 2021-07-30 | 北京大学深圳研究生院 | Antagonism cross-media retrieval method based on limited text space |
- 2018-02-01: CN CN201810101127.0A patent/CN108319686B/en not_active Expired - Fee Related
- 2018-10-23: WO PCT/CN2018/111327 patent/WO2019148898A1/en active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1211769A (en) * | 1997-06-26 | 1999-03-24 | 香港中文大学 | Method and equipment for file retrieval based on Bayesian network |
CN1920818A (en) * | 2006-09-14 | 2007-02-28 | 浙江大学 | Transmedia search method based on multi-mode information convergence analysis |
CN103914711A (en) * | 2014-03-26 | 2014-07-09 | 中国科学院计算技术研究所 | Improved top speed learning model and method for classifying modes of improved top speed learning model |
CN105512289A (en) * | 2015-12-07 | 2016-04-20 | 郑州金惠计算机系统工程有限公司 | Image retrieval method based on deep learning and Hash |
CN105718532A (en) * | 2016-01-15 | 2016-06-29 | 北京大学 | Cross-media sequencing method based on multi-depth network structure |
CN106202413A (en) * | 2016-07-11 | 2016-12-07 | 北京大学深圳研究生院 | A kind of cross-media retrieval method |
CN106649715A (en) * | 2016-12-21 | 2017-05-10 | 中国人民解放军国防科学技术大学 | Cross-media retrieval method based on local sensitive hash algorithm and neural network |
Non-Patent Citations (1)
Title |
---|
Design and Implementation of an Efficient Multi-Pattern Matching Algorithm; Li Hui et al.; Journal of Beijing Technology and Business University (Natural Science Edition); 2009-05-31; Vol. 27, No. 3; pp. 65-68 *
Also Published As
Publication number | Publication date |
---|---|
CN108319686A (en) | 2018-07-24 |
WO2019148898A1 (en) | 2019-08-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108319686B (en) | Antagonism cross-media retrieval method based on limited text space | |
CN109783657B (en) | Multi-step self-attention cross-media retrieval method and system based on limited text space | |
CN108875807B (en) | Image description method based on multiple attention and multiple scales | |
CN110717017B (en) | Method for processing corpus | |
CN109844743B (en) | Generating responses in automated chat | |
CN112800292B (en) | Cross-modal retrieval method based on modal specific and shared feature learning | |
CN108549658B (en) | Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree | |
Karpathy | Connecting images and natural language | |
CN111291556B (en) | Chinese entity relation extraction method based on character and word feature fusion of entity meaning item | |
CN112749274B (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
CN114565104A (en) | Language model pre-training method, result recommendation method and related device | |
CN110704601A (en) | Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network | |
CN108804677A (en) | In conjunction with the deep learning question classification method and system of multi-layer attention mechanism | |
CN114818691A (en) | Article content evaluation method, device, equipment and medium | |
CN113722474A (en) | Text classification method, device, equipment and storage medium | |
CN111598183A (en) | Multi-feature fusion image description method | |
CN112818889A (en) | Dynamic attention-based method for integrating accuracy of visual question-answer answers by hyper-network | |
CN113380360B (en) | Similar medical record retrieval method and system based on multi-mode medical record map | |
CN112257841A (en) | Data processing method, device and equipment in graph neural network and storage medium | |
CN116975350A (en) | Image-text retrieval method, device, equipment and storage medium | |
Ji et al. | Fusion-attention network for person search with free-form natural language | |
CN113822125A (en) | Processing method and device of lip language recognition model, computer equipment and storage medium | |
CN111538841A (en) | Comment emotion analysis method, device and system based on knowledge mutual distillation | |
CN113934835B (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation | |
KR20240128645A (en) | Method, apparatus and computer program for buildding knowledge graph using qa model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20210730 |