CN108319686A - Adversarial cross-media retrieval method based on limited text space - Google Patents

Adversarial cross-media retrieval method based on limited text space

Info

Publication number
CN108319686A
CN108319686A (application CN201810101127.0A)
Authority
CN
China
Prior art keywords
feature
text
image
network
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810101127.0A
Other languages
Chinese (zh)
Other versions
CN108319686B (en)
Inventor
Wenmin Wang (王文敏)
Zheng Yu (余政)
Ronggang Wang (王荣刚)
Ge Li (李革)
Zhenyu Wang (王振宇)
Hui Zhao (赵辉)
Wen Gao (高文)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN201810101127.0A priority Critical patent/CN108319686B/en
Publication of CN108319686A publication Critical patent/CN108319686A/en
Priority to PCT/CN2018/111327 priority patent/WO2019148898A1/en
Application granted granted Critical
Publication of CN108319686B publication Critical patent/CN108319686B/en
Current legal status: Expired - Fee Related


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Library & Information Science (AREA)
  • Fuzzy Systems (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an adversarial cross-media retrieval method based on a limited text space. A feature extraction network, a feature mapping network and a modality classifier are designed; a limited text space is obtained through learning; image and text features suited to cross-media retrieval are extracted; and image features are mapped from the image space to the text space. An adversarial training mechanism continually reduces the difference in feature distribution between data of different modalities during learning, thereby realizing cross-media retrieval. The present invention better fits human behavior in cross-media retrieval tasks; it obtains image and text features better suited to the cross-media retrieval task, compensating for the limited expressive power of pre-trained features; and it introduces an adversarial learning mechanism that further improves retrieval accuracy through a minimax game between the modality classifier and the feature mapping network.

Description

Adversarial cross-media retrieval method based on limited text space
Technical field
The present invention relates to the technical field of computer vision, and in particular to an adversarial cross-media retrieval method based on a limited text space.
Background art
With the advent of the Web 2.0 era, massive amounts of multimedia data (images, text, video, audio, etc.) have begun to accumulate and propagate on the Internet. Unlike traditional single-modality retrieval tasks, cross-media retrieval realizes bidirectional retrieval between data of different modalities, such as retrieving images with text and retrieving text with images. However, because multimedia data are inherently heterogeneous, their similarity cannot be measured directly. The key problem of this task is therefore to find an isomorphic mapping space in which the similarity between heterogeneous multimedia data can be measured directly. A great deal of research in the cross-media retrieval field has been built on this problem, yielding a series of classic cross-media retrieval algorithms such as CCA (Canonical Correlation Analysis), DeViSE (Deep Visual-Semantic Embedding) and DSPE (Deep Structure-Preserving Image-Text Embeddings). These methods, however, still suffer from several defects.
The first defect concerns the feature representation of multimedia data. Most existing methods extract image features with a pre-trained CNN (convolutional neural network) model, such as VGG (the neural network structure proposed by the Visual Geometry Group). However, these models are usually pre-trained on image classification tasks, so the extracted image features contain only the category information of objects and lose information that may be crucial for cross-media retrieval, such as the actions of objects and the interactions between them. For text, Word2Vec, LDA (Latent Dirichlet Allocation) and FV (Fisher Vector) are mainstream text features, but they too are pre-trained on datasets that differ from those used in cross-media retrieval, so the extracted features are not well suited to cross-media retrieval.
The second defect concerns the choice of isomorphic feature space. There are essentially three choices: a common space, the text space and the image space. From the perspective of human cognition, the brain understands text and images differently. For text, the brain extracts features and understands them directly; for an image, the brain subconsciously describes it with text before understanding it, that is, it first converts from the image space to the text space. Performing cross-media retrieval in the text space therefore better simulates human cognition. Existing text-space cross-media retrieval methods mostly adopt the Word2Vec space as the final text space, representing an image in that space by combining the category information of the objects in the image. Such features again lose the rich action and interaction information contained in the image, which also shows that for cross-media retrieval the Word2Vec space is not an effective text feature space.
The third defect concerns the difference in feature distribution between data of different modalities. Although existing methods map the data features of different modalities into some isomorphic feature space, the modality gap between them still exists and their feature distributions still differ noticeably, which degrades cross-media retrieval performance.
Summary of the invention
In order to overcome the above deficiencies of the prior art, the present invention provides an adversarial cross-media retrieval method based on a limited text space. The method first learns image and text feature descriptions suited to the cross-media retrieval task; it then finds, by simulating human cognition, a limited text space for realizing similarity measurement between images and text. The method also introduces an adversarial training mechanism intended to reduce the difference in feature distribution between data of different modalities during text-space learning, thereby increasing retrieval accuracy.
The principle of the present invention is as follows. As described in the background art, the key problem of cross-media retrieval is to find an isomorphic mapping space in which the similarity between heterogeneous multimedia data can be measured directly. More precisely, this core problem can be divided into two sub-problems. The first sub-problem is how to learn effective feature representations of multimedia data. The second sub-problem is how to find a suitable isomorphic feature space. The cross-media retrieval method based on a limited text space proposed by the present invention comprises a feature extraction network, a feature mapping network and a modality classifier. For the first sub-problem, the invention learns effective image and text feature representations with the feature extraction network. Building on the image captioning task, the invention learns a new kind of image feature by combining a CNN with an image captioning algorithm; such features contain not only the category information of the objects in an image but also rich interaction information between objects. For text features, a recurrent neural network (RNN) is trained from scratch to learn text features suited to the cross-media retrieval task. For the second sub-problem, the invention learns a limited text space with the feature mapping network. To further reduce the difference between features of different modalities, the invention designs a modality classifier that plays a minimax game with the feature mapping network: the modality classifier tries to distinguish the modality of features in the limited text space, while the feature mapping network tries to learn modality-invariant features and thereby confuse the classifier. During training, in addition to the conventional triplet loss, an additional adversarial loss is back-propagated from the modality classifier to the feature mapping network to further reduce the difference between features of different modalities. "Limited text space" indicates that the text space learned by this method is spanned by a set of basis vectors that can be regarded as the words of a dictionary; the expressive power of the space is restricted by the number of words in the dictionary and is therefore limited. The method of the present invention mainly learns a limited text space to realize similarity measurement between images and text. Based on this limited text space and by simulating human cognition, the method extracts image and text features suited to cross-media retrieval, maps image features from the image space to the text space, and introduces an adversarial training mechanism intended to continually reduce the difference in feature distribution between data of different modalities during learning. The method achieves accurate retrieval results on classic cross-media retrieval datasets.
The technical solution provided by the present invention is as follows:
An adversarial cross-media retrieval method based on a limited text space uses a feature extraction network, a feature mapping network and a modality classifier to obtain a limited text space through learning, extracts image and text features suited to cross-media retrieval, and realizes the mapping of image features from the image space to the text space; an adversarial training mechanism continually reduces the difference in feature distribution between data of different modalities during learning. The invention first trains the feature extraction network, the feature mapping network and the modality classifier on a dataset D, and then uses the trained networks to realize adversarial cross-media retrieval for retrieval request data. The steps are as follows.
Suppose a training dataset D = {D1, D2, …, Dn} contains n samples, each sample Di comprising a picture Ii and a piece of descriptive text Ti, i.e. Di = (Ii, Ti). Each piece of text consists of several (here five) sentences, each of which independently describes the matching picture; every image is thus paired with five sentences of similar meaning but different wording.
1) Extract the features of the images and texts in D with the feature extraction network.
For images, image features are extracted by combining an existing VGG model with an image captioning algorithm (Neural Image Captioning, NIC). For text, text features are extracted with an LSTM (Long Short-Term Memory) network. Since the LSTM network is not pre-trained, its parameters are updated synchronously with the parameters of the feature mapping network.
The computation of image feature extraction is expressed as formula 1:
I_VGG = VGGNet(I); I_NIC = NIC(I); I_Concat = Concatenate(I_VGG, I_NIC) (formula 1)
wherein VGGNet(·) is a 19-layer VGG model used to extract the 4096-dimensional feature I_VGG of the input image I; NIC(·) is the image captioning algorithm used to extract the 512-dimensional image feature I_NIC; and Concatenate(·) is a feature concatenation layer used to join I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
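As an illustrative sketch of formula 1 (the patent prescribes no implementation framework; PyTorch and torchvision are assumed here, and nic_image_embed is a hypothetical placeholder for the image embedding layer of a pre-trained NIC captioning model):

```python
import torch
import torch.nn as nn
from torchvision import models

vgg19 = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
vgg19.classifier[6] = nn.Identity()      # drop the 1000-way layer -> 4096-d fc7 feature
vgg19.eval()

nic_image_embed = nn.Linear(4096, 512)   # hypothetical stand-in for NIC's image embedding layer

image = torch.randn(1, 3, 224, 224)      # one 224x224 RGB image
with torch.no_grad():
    i_vgg = vgg19(image)                         # I_VGG: (1, 4096)
    i_nic = nic_image_embed(i_vgg)               # I_NIC: (1, 512), placeholder
    i_concat = torch.cat([i_vgg, i_nic], dim=1)  # I_Concat: (1, 4608), formula 1
```

In the method itself, I_NIC comes from a captioning model pre-trained on Flickr30K or MSCOCO rather than from a fresh linear layer.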
Text feature extraction specifically comprises the following steps:
Given a piece of text S = (s_0, s_1, …, s_T) of length T, each word s_t in S is represented by a 1-of-k encoding, where k is the number of words in the dictionary. Before being fed into the LSTM network, each word s_t is first mapped to a denser space, expressed as formula 2:
x_t = W_e·s_t, t ∈ {0, …, T} (formula 2)
wherein W_e is a word-vector mapping matrix used to encode the 1-of-k vector s_t into a d-dimensional word vector;
The word vectors of the dense space are fed into the LSTM network, expressed as formula 3:
i_t = σ(W_ix·x_t + W_ih·h_(t-1) + b_i)
f_t = σ(W_fx·x_t + W_fh·h_(t-1) + b_f)
o_t = σ(W_ox·x_t + W_oh·h_(t-1) + b_o)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ tanh(W_cx·x_t + W_ch·h_(t-1) + b_c)
h_t = o_t ⊙ tanh(c_t) (formula 3)
wherein i_t, f_t, o_t, c_t, h_t denote the outputs of the input gate, forget gate, output gate, memory cell and hidden layer of the LSTM unit at time t, respectively; x_t denotes the word-vector input at the current time; h_(t-1) is the hidden-layer output of the LSTM unit at the previous time; σ denotes the sigmoid function; ⊙ denotes element-wise multiplication; tanh denotes the hyperbolic tangent activation function; and the hidden-layer output h_T of the LSTM network at time T is the feature representation of the text S.
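A minimal sketch of the text branch of formulas 2 and 3, assuming PyTorch; the vocabulary size k and the use of nn.LSTM in place of hand-written gate equations are illustrative choices, not fixed by the patent:

```python
import torch
import torch.nn as nn

k, d = 10000, 1024                       # dictionary size, text-space dimension (assumed)
word_embed = nn.Embedding(k, d)          # W_e in formula 2
lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)  # gates of formula 3

sentence = torch.randint(0, k, (1, 12))  # word indices s_0..s_T (T + 1 = 12 here)
x = word_embed(sentence)                 # dense word vectors x_t
_, (h_T, _) = lstm(x)                    # final hidden state h_T: (1, 1, d)
text_feature = h_T.squeeze(0)            # feature representation of the text S
```

The final hidden state h_T plays the role of the text feature, so the embedding matrix W_e and the LSTM weights together form part of the parameters θ_f updated during training.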
2) A feature fusion layer is designed at the top of the feature mapping network to fuse I_VGG_txt and I_NIC_txt into I_final, the d-dimensional feature representation of the input image in the limited text space; d is the dimension of the limited text space. The feature mapping network first maps the text features and the image features obtained in step 1) into the limited text space in its initial state, then compares the similarity between feature vectors (i.e. computes the distance between them) with a similarity measurement function to obtain the current triplet loss; next, the feature vectors of the different modalities are fed into the modality classifier for classification to obtain the current adversarial loss; finally, the limited text space is trained by optimizing the combined loss function of the triplet loss and the adversarial loss.
Text features are not fed into the feature mapping network here, because the feature extraction network (the LSTM network) already realizes the mapping of text to the feature space during feature extraction.
The feature fusion layer at the top of the feature mapping network performs the processing of formula 5:
I_VGG_txt = f(I_VGG); I_NIC_txt = g(I_NIC); I_final = I_VGG_txt + I_NIC_txt (formula 5)
wherein I_VGG is the 4096-dimensional image feature extracted by VGGNet; I_NIC is the 512-dimensional image feature extracted by the image captioning algorithm NIC; I_final is the d-dimensional feature representation of the input image in the limited text space; f(·) and g(·) denote two feature mapping functions; and I_VGG_txt and I_NIC_txt are the mappings of I_VGG and I_NIC to the d-dimensional text-space features, respectively.
The similarity measurement function is expressed as s(v, t) = v·t, wherein v and t represent the image feature and the text feature, respectively; v and t are first normalized by an L2 normalization layer before comparison, so that s is equivalent to the cosine distance.
The feature mapping network is trained by optimizing the triplet loss function and the adversarial loss function, specifically as follows:
Let the distance between an input image or text and its matching text or image be d1, and the distance to a mismatched text or image be d2; d1 should be smaller than d2 by at least a margin m, which is a hyper-parameter set externally. The triplet loss function is expressed as formula 6:
L_emb(θ_f) = Σ_k max(0, m - s(v, t) + s(v, t_k)) + Σ_k max(0, m - s(v, t) + s(v_k, t)) (formula 6)
wherein t_k is the k-th mismatched text of the input image v; v_k is the k-th mismatched image of the input text t; m is the minimum distance margin; s(v, t) is the similarity measurement function; θ_f denotes the parameters of the feature mapping network; and the mismatched samples are randomly drawn from the dataset in each training cycle.
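A minimal sketch of the bidirectional triplet loss of formula 6, assuming PyTorch and features already L2-normalized; for simplicity the next sample in the batch stands in for the randomly drawn mismatched sample:

```python
import torch

def triplet_loss(v, t, margin=0.3):
    """v, t: (batch, d) matched image/text features; row i of v matches row i of t."""
    n = v.size(0)
    sim = v @ t.t()                              # s(v_i, t_j) for all pairs
    pos = sim.diag()                             # s(v, t) for matched pairs
    idx = (torch.arange(n) + 1) % n              # one mismatched index per sample
    neg_t = sim[torch.arange(n), idx]            # s(v, t_k): mismatched texts
    neg_v = sim[idx, torch.arange(n)]            # s(v_k, t): mismatched images
    return torch.clamp(margin - pos + neg_t, min=0).mean() + \
           torch.clamp(margin - pos + neg_v, min=0).mean()
```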
The adversarial loss L_adv from the modality classifier is simultaneously back-propagated to the feature mapping network.
The total loss function L is defined as formula 7:
L = L_emb - λ·L_adv (formula 7)
wherein λ is an adaptive parameter whose value ranges from 0 to 1; L_emb denotes the triplet loss function; and L_adv is the additional adversarial loss function;
In order to suppress noisy signals from the modality classifier at the early stage of training, the parameter λ is updated according to formula 8:
λ = 2/(1 + exp(-10·p)) - 1 (formula 8)
wherein p denotes the percentage of the total number of iterations completed so far, and λ is the adaptive parameter;
The feature mapping network is trained with the above loss function L, and the parameters θ_f of the feature mapping network are updated by formula 9:
θ_f ← θ_f - μ·(∂L/∂θ_f) (formula 9)
wherein μ denotes the learning rate of the optimization algorithm, L denotes the total loss function of the feature mapping network, and θ_f denotes the parameters of the feature mapping network.
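The combined objective of formulas 7-9 could be sketched as follows (PyTorch assumed; adaptive_lambda implements the schedule of formula 8, and optimizer_f is assumed to cover only the parameters θ_f, with the classifier parameters θ_d held fixed during this step):

```python
import math

def adaptive_lambda(p):
    """Formula 8: p is the fraction of total iterations completed, in [0, 1]."""
    return 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0

def update_mapping_network(optimizer_f, l_emb, l_adv, p):
    loss = l_emb - adaptive_lambda(p) * l_adv  # formula 7: L = L_emb - λ·L_adv
    optimizer_f.zero_grad()
    loss.backward()
    optimizer_f.step()                         # formula 9: θ_f ← θ_f - μ·∂L/∂θ_f
    return loss
```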
3) The image and text features obtained in step 2), now located in the same limited text space, are respectively fed into the modality classifier for classification, and the modality classifier is trained with a cross-entropy loss; specifically:
The text-space feature label of an image is [0 1] and that of a text is [1 0]; the training of the modality classifier is realized by optimizing the two-class cross-entropy loss function of formula 4:
L_adv(θ_d) = -(1/N)·Σ_(i=1..N) y_i·log D(x_i; θ_d) (formula 4)
wherein x_i and y_i denote the i-th input text-space feature and its corresponding label, respectively; N denotes the total number of currently input feature samples; θ_d denotes the training parameters of the modality classifier; the function D(·) predicts the modality of the current text-space feature, i.e. text or image; and L_adv denotes the two-class cross-entropy loss function of the modality classifier, which is also the additional adversarial loss function of the feature mapping network.
The parameters θ_d of the modality classifier are updated by formula 10:
θ_d ← θ_d - μ·(∂L_adv/∂θ_d) (formula 10)
wherein μ denotes the learning rate of the optimization algorithm, L_adv denotes the loss function of the modality classifier, and θ_d denotes the parameters of the modality classifier.
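A minimal sketch of the modality classifier and its update (formulas 4 and 10), assuming PyTorch; the two-layer architecture is an assumption, since the patent fixes only the two-class cross-entropy objective and the labels ([0 1] for images, [1 0] for texts):

```python
import torch
import torch.nn as nn

d = 1024
classifier = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, 2))
xent = nn.CrossEntropyLoss()  # two-class cross entropy of formula 4

def update_classifier(optimizer_d, img_feat, txt_feat):
    x = torch.cat([img_feat.detach(), txt_feat.detach()])  # θ_f held fixed here
    y = torch.cat([torch.ones(len(img_feat), dtype=torch.long),    # image -> [0 1]
                   torch.zeros(len(txt_feat), dtype=torch.long)])  # text  -> [1 0]
    l_adv = xent(classifier(x), y)       # formula 4
    optimizer_d.zero_grad()
    l_adv.backward()
    optimizer_d.step()                   # formula 10: θ_d ← θ_d - μ·∂L_adv/∂θ_d
    return l_adv
```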
4) Steps 2) and 3) are repeated until the feature mapping network converges.
5) For a retrieval request, the distance in the limited text space between the retrieval request data (an image or a text) and data of the other modality in the dataset D is computed; the retrieval results are ranked by distance to obtain the most similar retrieval results. The distance is computed as the dot product between the feature vectors of the different modalities in the limited text space.
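Step 5) amounts to ranking by dot product, which could be sketched as follows (PyTorch assumed; features are taken as already L2-normalized by the mapping network):

```python
import torch

def retrieve(query_feat, gallery_feats, top_k=5):
    """query_feat: (d,); gallery_feats: (N, d); returns indices of the top-k matches."""
    scores = gallery_feats @ query_feat          # dot-product similarity in text space
    return torch.topk(scores, k=top_k).indices   # most similar items first
```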
Through the above steps, adversarial cross-media retrieval based on the limited text space is realized.
Compared with the prior art, the beneficial effects of the present invention are as follows:
The present invention provides an adversarial cross-media retrieval method based on a limited text space, which mainly learns a limited text space to realize similarity measurement between images and text. Based on this limited text space and by simulating human cognition, the method extracts image and text features suited to cross-media retrieval, realizes the mapping of image features from the image space to the text space, and introduces an adversarial training mechanism intended to continually reduce the difference in feature distribution between data of different modalities during learning. The method achieves accurate retrieval results on classic cross-media retrieval datasets. Specifically, the invention learns effective image and text feature representations with the feature extraction network; the image features are further fed into the feature mapping network to realize the mapping from the image space to the text space; and finally, to further reduce the difference in feature distribution between data of different modalities, the adversarial loss produced by the modality classifier is back-propagated to the feature mapping network, further improving the retrieval results. Specifically, the present invention has the following technical advantages:
(1) The present invention performs cross-media retrieval in a limited text space by simulating human cognition. Compared with existing methods based on a common space or the image space, the present invention better fits human behavior in cross-media retrieval tasks;
(2) The feature extraction network can learn image and text features better suited to the cross-media retrieval task, compensating for the limited expressive power of pre-trained features;
(3) To further reduce the difference in feature distribution between data of different modalities, the invention introduces an adversarial learning mechanism that further improves retrieval accuracy through a minimax game between the modality classifier and the feature mapping network.
Description of the drawings
Fig. 1 is a flow diagram of the method of the present invention;
wherein (a) shows the three parts of the present invention: the feature extraction network, the feature mapping network and the modality classifier; (b) and (c) are structural block diagrams of the feature mapping network and the modality classifier, respectively.
Fig. 2 is a schematic diagram of the network structure of the feature extraction network of the present invention;
wherein (a) is the image feature extraction network, which extracts image features by combining the 19-layer VGG model VGGNet with the image captioning algorithm NIC; (b) is the recurrent neural network (LSTM) used to extract text features.
Fig. 3 is a screenshot of cross-media retrieval results of an embodiment of the present invention on the Flickr8K test set.
Detailed description of embodiments
The present invention is further described below through embodiments with reference to the accompanying drawings, without limiting the scope of the invention in any way.
The present invention provides an adversarial cross-media retrieval method based on a limited text space, which mainly learns a limited text space to realize similarity measurement between images and text. Based on this limited text space and by simulating human cognition, the method extracts image and text features suited to cross-media retrieval, realizes the mapping of image features from the image space to the text space, and introduces an adversarial training mechanism intended to continually reduce the difference in feature distribution between data of different modalities during learning. The feature extraction network, feature mapping network and modality classifier of the present invention, their realization and the training steps of the networks are described in detail below.
1. Feature extraction network
The feature extraction network mainly comprises two branches, an image feature extraction network and a text feature extraction network, corresponding to the feature extraction of images and text, respectively.
1) Image feature extraction. The image feature extraction network learns the image feature I_Concat, comprising the 4096-dimensional feature I_VGG and the image feature I_NIC extracted by the image captioning algorithm.
The image feature extraction network can be regarded as the combination of VGGNet (the neural network structure proposed by the Visual Geometry Group) and NIC (Neural Image Caption, image captioning based on neural networks), where VGGNet is a 19-layer VGG model and NIC is an image captioning algorithm. VGGNet is pre-trained on an image classification task and extracts image features rich in object category information; NIC, by contrast, is pre-trained on an image captioning task and extracts image features rich in interaction information between objects. The image features extracted by the two are therefore complementary.
Specifically, after an image of size 224 × 224 is fed into VGGNet, the network outputs a 4096-dimensional feature I_VGG. At the same time, to avoid losing image information during the translation process, the output of the image embedding layer in NIC is taken as the image feature I_NIC extracted by the image captioning algorithm. The final image feature I_Concat is then the combination of I_VGG and I_NIC. The computation is expressed as formula 1:
I_VGG = VGGNet(I); I_NIC = NIC(I); I_Concat = Concatenate(I_VGG, I_NIC) (formula 1)
wherein VGGNet(·) is the 19-layer VGG model used to extract the 4096-dimensional feature I_VGG of the input image; NIC(·) is the image captioning algorithm used to extract the 512-dimensional image feature I_NIC; and Concatenate(·) is the feature concatenation layer used to join I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
2) Text feature extraction
The text feature extraction network uses a long short-term memory recurrent neural network (LSTM) to extract a d-dimensional text feature; d is also the dimension of the limited text space. Suppose a piece of text S = (s_0, s_1, …, s_T) of length T is given, in which each word s_t is represented by a 1-of-k encoding, k being the number of words in the dictionary. Before being fed into the LSTM network, each word s_t is first mapped to a denser space:
x_t = W_e·s_t, t ∈ {0, …, T} (formula 2)
wherein W_e is a word-vector mapping matrix used to encode the 1-of-k vector s_t into a d-dimensional word vector. After the word-vector representations of the dense space are obtained, they are fed into the LSTM network, expressed as formula 3:
i_t = σ(W_ix·x_t + W_ih·h_(t-1) + b_i)
f_t = σ(W_fx·x_t + W_fh·h_(t-1) + b_f)
o_t = σ(W_ox·x_t + W_oh·h_(t-1) + b_o)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ tanh(W_cx·x_t + W_ch·h_(t-1) + b_c)
h_t = o_t ⊙ tanh(c_t) (formula 3)
wherein i_t, f_t, o_t, c_t, h_t denote the outputs of the input gate, forget gate, output gate, memory cell and hidden layer of the LSTM unit at time t, respectively; x_t denotes the word-vector input at the current time; h_(t-1) is the hidden-layer output of the LSTM unit at the previous time; σ denotes the sigmoid function; ⊙ denotes element-wise multiplication; and tanh denotes the hyperbolic tangent activation function. The feature representation of the text S is the hidden-layer output of the LSTM network at time T, i.e. h_T.
Fig. 2 shows the network structure of the feature extraction network of the present invention. During training, the parameters of VGGNet are always fixed, and NIC is pre-trained on the image captioning task using the Flickr30K or MSCOCO training set. Specifically, all images in the dataset are first resized to 256 × 256, image blocks of size 224 × 224 are then obtained by a single center crop, and these are finally fed into the feature extraction network to extract image features. For text, LSTM and bidirectional LSTM networks are used to extract text features, the hidden layer of each LSTM unit having 1024 nodes.
2. Modality classifier
To further reduce the difference between the feature distributions of different modalities, a modality classifier is designed, functioning like the discriminator in a generative adversarial network. The text-space feature label of an image is [0 1] and that of a text is [1 0]; the training of the modality classifier is realized by optimizing the two-class cross-entropy loss function of formula 4:
L_adv(θ_d) = -(1/N)·Σ_(i=1..N) y_i·log D(x_i; θ_d) (formula 4)
wherein x_i and y_i denote the i-th input text-space feature and its corresponding label, respectively; N denotes the total number of currently input feature samples; θ_d denotes the training parameters of the modality classifier; the function D(·) predicts the modality of the current text-space feature, i.e. text or image; and L_adv denotes the two-class cross-entropy loss function of the modality classifier, which is also the additional adversarial loss function of the feature mapping network.
3. Feature mapping network
The present invention learns a limited text space through the parameters θ_f of the feature mapping network. The feature extraction network has learned the image feature I_Concat, comprising the two parts I_VGG and I_NIC. For these two parts of the image feature I_Concat, two mapping functions f(·) and g(·) are designed in the feature mapping network to realize the mappings of I_VGG and I_NIC to the d-dimensional text-space features I_VGG_txt and I_NIC_txt, respectively. Like I_VGG and I_NIC, the features I_VGG_txt and I_NIC_txt are complementary, so a feature fusion layer is designed at the top of the feature mapping network to exploit the advantages of both. The processing is defined as formula 5:
I_VGG_txt = f(I_VGG); I_NIC_txt = g(I_NIC); I_final = I_VGG_txt + I_NIC_txt (formula 5)
wherein I_VGG is the 4096-dimensional image feature extracted by VGGNet; I_NIC is the 512-dimensional image feature extracted by the image captioning algorithm NIC; I_final is the d-dimensional feature representation of the input image in the limited text space; f(·) and g(·) denote the two feature mapping functions; and I_VGG_txt and I_NIC_txt are the mappings of I_VGG and I_NIC to the d-dimensional text-space features, respectively. It is worth noting that the feature extraction process for text is equivalent to mapping the text into the limited text space; the parameters θ_f of the feature mapping network (see formula 9) therefore include the parameters of the LSTM network.
(b) and (c) in Fig. 1 show the network structures of the feature mapping network and the modality classifier, respectively. The feature mapping network comprises the two feature mapping networks f(·) and g(·), a fusion layer and an L2 normalization layer (L2 Norm). f(·) comprises two fully connected layers with 2048 and 1024 hidden nodes, respectively; ReLU is used as the activation function between the fully connected layers, and a dropout layer with a dropout rate of 0.5 is added after the ReLU to prevent over-fitting. g(·) comprises one fully connected layer with 1024 hidden nodes. The fusion layer realizes element-wise addition. The L2 normalization layer allows the similarity between the learned features to be measured directly by dot product, accelerating model convergence and increasing training stability.
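The mapping network just described could be sketched as follows (PyTorch assumed; layer sizes follow the description above, with d = 1024 matching the LSTM hidden size reported earlier):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMappingNet(nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.f = nn.Sequential(                  # f(): maps I_VGG (4096-d)
            nn.Linear(4096, 2048), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(2048, d),
        )
        self.g = nn.Linear(512, d)               # g(): maps I_NIC (512-d)

    def forward(self, i_vgg, i_nic):
        i_vgg_txt = self.f(i_vgg)                # f(I_VGG)
        i_nic_txt = self.g(i_nic)                # g(I_NIC)
        i_final = i_vgg_txt + i_nic_txt          # fusion layer: element-wise add
        return F.normalize(i_final, p=2, dim=-1) # L2 normalization layer
```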
After the images and texts are mapped into the limited text space in its initial state, the next step is to compare the similarity between features and compute the corresponding triplet loss. A similarity measurement function s(v, t) = v·t is defined, where v and t represent the image feature and the text feature, respectively. So that s is equivalent to the cosine distance, v and t are first normalized by the L2 normalization layer before comparison. Triplet loss functions are widely used in the cross-media retrieval field. Given an input image (text), let the distance between the input image (text) and its matching text (image) be d1, and the distance to a mismatched text (image) be d2; d1 should be smaller than d2 by at least a margin m. The margin m is a hyper-parameter set externally; for ease of optimization, m = 0.3 is fixed and applied to all datasets. In the present invention, the triplet loss function is therefore expressed as formula 6:
L_emb(θ_f) = Σ_k max(0, m - s(v, t) + s(v, t_k)) + Σ_k max(0, m - s(v, t) + s(v_k, t)) (formula 6)
wherein t_k is the k-th mismatched text of the input image v; v_k is the k-th mismatched image of the input text t; m is the minimum distance margin; s(v, t) is the similarity measurement function; and θ_f denotes the parameters of the feature mapping network. The mismatched samples are randomly drawn from the dataset in each training cycle.
Next, the feature vectors of the different modalities are fed into the modality classifier for classification to obtain the current adversarial loss. In addition to the triplet loss, the adversarial loss L_adv from the modality classifier is simultaneously back-propagated to the feature mapping network.
Finally, the limited text space is trained by optimizing the combined loss function of the triplet loss L_emb and the adversarial loss L_adv. Since the optimization objectives of L_emb and L_adv are opposite, the total loss function L is defined as formula 7:
L = L_emb - λ·L_adv (formula 7)
wherein λ is an adaptive parameter whose value ranges from 0 to 1; L_emb denotes the triplet loss function; and L_adv is the additional adversarial loss function. In order to suppress noisy signals from the modality classifier at the early stage of training, the parameter λ is updated according to formula 8:
λ = 2/(1 + exp(-10·p)) - 1 (formula 8)
wherein p denotes the percentage of the total number of iterations completed so far, and λ is the adaptive parameter.
Fig. 3 shows actual cross-media retrieval results of the present invention on the Flickr8K test set. The first column of the table lists the image and text queries used for retrieval; the second to fourth columns show, for each query, the top-5 retrieval results of LTS-A (VGG+BLSTM), LTS-A (NIC+BLSTM) and LTS-A (VGG+NIC+BLSTM), respectively. For image-to-text retrieval, correctly retrieved texts are shown in red; for text-to-image retrieval, correctly retrieved images are marked with a check mark. From left to right across the table, the retrieval results improve noticeably, especially from LTS-A (VGG+BLSTM) to LTS-A (NIC+BLSTM); moreover, even the incorrectly retrieved samples match the queries well to a certain extent.
4. Training method
The training process of the present invention comprises four stages.
Stage one: in the initial training stage, the parameters of VGGNet are fixed, and NIC is pre-trained using the Flickr30K training set (30,000 pictures from the Yahoo photo album website Flickr) or the MSCOCO training set (a dataset created by Microsoft using Amazon's Mechanical Turk service). After training is complete, image features can be extracted through the feature extraction network.
Stage two: after the features of all images in the dataset have been extracted, the second training stage mainly learns the limited text space. Given the loss function L of the feature mapping network, the parameters θ_d of the modality classifier are fixed, and the parameters θ_f of the feature mapping network are updated by formula 9:
θ_f ← θ_f - μ·(∂L/∂θ_f) (formula 9)
wherein μ denotes the learning rate of the optimization algorithm, L denotes the total loss function of the feature mapping network, and θ_f denotes the parameters of the feature mapping network.
Stage three: after the second training stage, the third training stage mainly enhances the discriminative ability of the modality classifier. Given the loss function L_adv of the modality classifier, the parameters θ_f of the feature mapping network are fixed, and the parameters θ_d of the modality classifier are updated by formula 10:
θ_d ← θ_d - μ·(∂L_adv/∂θ_d) (formula 10)
wherein μ denotes the learning rate of the optimization algorithm, L_adv denotes the loss function of the modality classifier, and θ_d denotes the parameters of the modality classifier.
Stage four: for every batch of training data, the second and third training stages are repeated alternately until the model converges.
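A minimal sketch of the alternation of stages two to four, assuming PyTorch and reusing the helpers sketched earlier (triplet_loss, adaptive_lambda, classifier); opt_f is assumed to cover the feature mapping network and the LSTM text encoder, and opt_d the modality classifier:

```python
import torch
import torch.nn as nn

xent = nn.CrossEntropyLoss()

def train_epoch(loader, mapping_net, text_encoder, classifier,
                opt_f, opt_d, step, total_steps, margin=0.3):
    for i_vgg, i_nic, words in loader:
        v = mapping_net(i_vgg, i_nic)            # images in the limited text space
        t = text_encoder(words)                  # texts in the limited text space
        y = torch.cat([torch.ones(len(v), dtype=torch.long),    # image labels
                       torch.zeros(len(t), dtype=torch.long)])  # text labels

        # stage two: update θ_f on L = L_emb - λ·L_adv with θ_d held fixed
        l_emb = triplet_loss(v, t, margin)
        l_adv = xent(classifier(torch.cat([v, t])), y)
        lam = adaptive_lambda(step / total_steps)
        opt_f.zero_grad()
        (l_emb - lam * l_adv).backward()
        opt_f.step()

        # stage three: update θ_d on L_adv with θ_f held fixed (features detached)
        v_d = mapping_net(i_vgg, i_nic).detach()
        t_d = text_encoder(words).detach()
        opt_d.zero_grad()
        xent(classifier(torch.cat([v_d, t_d])), y).backward()
        opt_d.step()
        step += 1
    return step
```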
Table 1 gives the experimental results of the present invention for cross-media retrieval on the Flickr8K test set. To evaluate retrieval performance, the standard ranking metrics Recall@K and Median Rank are followed. Recall@K measures retrieval accuracy as the probability that the correctly matched data item ranks in the top K (K = 1, 5, 10) retrieval results; Median Rank is the median rank of the correctly matched data item. Higher Recall@K and lower Median Rank indicate more accurate retrieval (a sketch of this evaluation protocol is given after the variant list below). The table compares the present invention with other existing advanced algorithms, including DeViSE (Deep Visual-Semantic Embedding), m-RNN (deep captioning with multimodal recurrent neural networks), Deep Fragment (deep fragment embeddings), DCCA (Deep Canonical Correlation Analysis), VSE (unifying visual-semantic embeddings with multimodal neural language models), m-CNN_ENS (multimodal convolutional neural networks), NIC (Neural Image Captioning) and HM-LSTM (hierarchical multimodal LSTM). In addition, four variants of the above method are designed:
● LTS-A (VGG+LSTM): the image captioning algorithm NIC is removed from the image feature extraction process; the rest remains unchanged;
● LTS-A (NIC+LSTM): the convolutional neural network VGGNet is removed from the image feature extraction process; the rest remains unchanged;
● LTS-A (VGG+NIC+LSTM): the network structure shown in Fig. 2;
● LTS-A (VGG+NIC+BLSTM): the network structure shown in Fig. 2, with the LSTM network replaced by a bidirectional LSTM network (BLSTM).
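Before turning to the results, a minimal sketch of the Recall@K and Median Rank protocol described above (NumPy assumed; gallery item i is taken to match query i):

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """sim: (n_queries, n_gallery) similarity matrix; item i matches query i."""
    order = np.argsort(-sim, axis=1)                  # best match first
    ranks = np.array([np.where(order[i] == i)[0][0]   # 0-based rank of the match
                      for i in range(sim.shape[0])])
    recalls = {k: float(np.mean(ranks < k)) for k in ks}
    median_rank = float(np.median(ranks)) + 1         # 1-based median rank
    return recalls, median_rank
```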
Table 1. Cross-media retrieval performance of the embodiment on the Flickr8K test set.
In Table 1, Img2Txt denotes image-to-text retrieval and Txt2Img denotes text-to-image retrieval. As can be seen from Table 1, LTS-A (VGG+NIC+BLSTM) surpasses HM-LSTM on the image-to-text task, achieving the best retrieval performance to date. However, LTS-A (VGG+NIC+BLSTM) is inferior to HM-LSTM on the text-to-image task. The most likely reason is that HM-LSTM uses a tree-structured LSTM network architecture that can better model the hierarchical structure of text, whereas the present invention adopts a chain LSTM network architecture that cannot capture the hierarchical semantic information in text. In addition, the results across the four variants show that when the image feature extraction network is changed from VGGNet to NIC, the accuracy of image-to-text retrieval improves by 22% and that of text-to-image retrieval by 17%, indicating that NIC extracts more effective image features than the traditional VGGNet. When the image feature extraction network is changed from NIC to VGG+NIC, cross-media retrieval accuracy further improves by 6%, demonstrating that the feature extraction network now extracts not only detailed object category information from the image but also rich interaction information between objects. Finally, substituting a bidirectional LSTM network (BLSTM) for the LSTM network brings an additional 2% improvement in retrieval accuracy.
Table 2 shows the cross-media retrieval performance of the embodiment on the Flickr30K test set. In addition to the existing advanced algorithms mentioned for Flickr8K, DAN (Dual Attention Networks), DSPE (Deep Structure-Preserving Image-Text Embeddings) and VSE++ (Improving Visual-Semantic Embeddings, an enhanced model of VSE) are added. Here DAN and DSPE achieve the best retrieval performance, with DAN outperforming DSPE. Owing to its attention mechanism, DAN can continuously attend to fine-grained information in the data, which greatly benefits cross-media retrieval; the present invention, by contrast, uses only global features to represent images and text and can therefore be disturbed by noisy information in the image or text. Besides DAN, DSPE also performs better than the present method, because DSPE uses a more complex text feature (Fisher Vector) and loss function. As for the four variants of the present invention, their experimental behavior is similar to that on Flickr8K.
Table 2. Cross-media retrieval performance of the embodiment on the Flickr30K test set.
Table 3. Cross-media retrieval performance of the embodiment on the MSCOCO test set.
Table 3 shows the cross-media retrieval performance of the embodiment on the MSCOCO test set. In addition to the existing advanced algorithms mentioned for Flickr8K and Flickr30K, Order (order-embeddings of images and language) is added. Here LTS-A (VGG+NIC+LSTM) achieves the best performance on the image-to-text task, improving retrieval accuracy by about 2% and falling behind DSPE only on the R@1 metric. On the text-to-image task, DSPE outperforms the present method on Recall@K, but LTS-A (VGG+NIC+LSTM) achieves the best Median Rank. This is because the chain LSTM network of the present invention cannot fully capture the hierarchical semantic information in text, so its text representation ability falls short of FV (Fisher Vector). As for the four variants of the present invention, their experimental behavior is similar to that on Flickr8K and Flickr30K.
Table 4. Cross-media retrieval performance of the two variants LTS-A and LTS of the embodiment.
Table 4 shows the influence of the adversarial learning mechanism on the experimental results. Two variants are designed: LTS-A and LTS. LTS-A is the aforementioned LTS-A (VGG+NIC+LSTM); LTS is LTS-A (VGG+NIC+LSTM) with the adversarial learning mechanism removed.
As the table shows, LTS-A clearly improves cross-media retrieval accuracy compared with LTS; LTS exceeds LTS-A only on the R@1 metric of image-to-text retrieval. The experimental results demonstrate that adversarial learning is effective at reducing the difference between the feature distributions of data of different modalities.
Table 5. Retrieval performance of the embodiment on the MSCOCO test set.
Table 6 shows, for the MSCOCO test set, the retrieval performance when image features are extracted with a single crop and with the mean of ten crops, respectively.
In the above implementation, image features are extracted with a single crop (1-crop) of the image region. To verify the effectiveness of using the mean feature of ten different regions of the image (10-crops) as the image feature, LTS-A (10-crops) is designed, where LTS-A refers to LTS-A (VGG+NIC+BLSTM) and 10-crops indicates that the image feature is described by the mean feature of ten different regions of the image. As can be seen from Table 6, the retrieval accuracy of LTS-A (10-crops) is clearly improved compared with LTS-A (1-crop), which demonstrates the feasibility of using the mean feature of ten different image regions as the image feature.
It should be noted that the embodiments are disclosed to help further understand the present invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The present invention should therefore not be limited to the content disclosed in the embodiments; the scope of protection of the present invention is defined by the claims.

Claims (7)

1. An adversarial cross-media retrieval method based on a limited text space, wherein a feature extraction network, a feature mapping network and a modality classifier are designed; a limited text space is obtained through learning; image and text features suited to cross-media retrieval are extracted; the mapping of image features from the image space to the text space is realized; an adversarial training mechanism continually reduces the difference in feature distribution between data of different modalities during learning; and cross-media retrieval is thereby realized; specifically:
A. the feature extraction network comprises an image feature extraction network and a text feature extraction network, used for image feature extraction and text feature extraction, respectively; the image feature extraction network learns the image feature I_Concat from one or both of VGGNet and NIC, comprising one or both of the 4096-dimensional feature I_VGG and the image feature I_NIC extracted by the image captioning algorithm; the text feature extraction network uses a long short-term memory recurrent neural network LSTM or a bidirectional LSTM network BLSTM to extract a d-dimensional text feature;
B. the modality classifier serves as the discriminator of an adversarial network; the training of the modality classifier is realized by optimizing a two-class cross-entropy loss function; this function is also the additional adversarial loss function of the feature mapping network;
C. the feature mapping network learns a limited text space through its parameters θ_f; for the image feature I_Concat learned by the feature extraction network, comprising I_VGG and I_NIC, mapping functions f(·) and g(·) are designed in the feature mapping network to realize the mappings I_VGG_txt and I_NIC_txt of I_VGG and I_NIC to d-dimensional text-space features, respectively; a feature fusion layer is designed at the top of the feature mapping network to fuse I_VGG_txt and I_NIC_txt into I_final, the d-dimensional feature representation of the input image in the limited text space; d is the dimension of the limited text space;
suppose a training dataset D = {D1, D2, …, Dn} contains n samples, each sample Di comprising a picture Ii and a piece of descriptive text Ti, i.e. Di = (Ii, Ti), each piece of text consisting of five sentences, each of which independently describes the matching picture; the following steps 1)-4) are executed on the dataset D to train the feature extraction network, the feature mapping network and the modality classifier:
1) extracting the features of the images and texts in D with the feature extraction network: for the images in D, image features are extracted with the VGG model and the image captioning algorithm NIC; for the texts in D, text features are extracted with the long short-term memory recurrent neural network LSTM, which realizes the mapping of text to the feature space; the parameters of the LSTM network are updated synchronously with the parameters of the feature mapping network;
2) mapping, by the feature mapping network, the text features and the image features obtained in step 1) into the limited text space in its initial state; first computing the distance between feature vectors with a similarity measurement function and comparing the similarity between the feature vectors to obtain the current triplet loss; then feeding the feature vectors of the different modalities into the modality classifier for classification to obtain the current adversarial loss; and finally training the limited text space by optimizing the combined loss function of the triplet loss and the adversarial loss;
3) feeding the image and text features obtained in step 2), located in the same limited text space, respectively into the modality classifier for classification, and training the modality classifier with a cross-entropy loss;
4) repeating steps 2)-3) until the feature mapping network converges;
5) for a retrieval request, computing the distance in the limited text space between the retrieval request data, an image or a text, and data of the other modality in the dataset D, and ranking the retrieval results by distance to obtain the most similar retrieval results; specifically, the distance is computed as the dot product between the feature vectors of the different modalities in the space;
through the above steps, adversarial cross-media retrieval based on the limited text space is realized.
2. The adversarial cross-media retrieval method according to claim 1, wherein the computation of image feature extraction is expressed as formula 1:
I_VGG = VGGNet(I); I_NIC = NIC(I); I_Concat = Concatenate(I_VGG, I_NIC) (formula 1)
wherein VGGNet(·) is the 19-layer VGG model used to extract the 4096-dimensional feature I_VGG of the input image; NIC(·) is the image captioning algorithm used to extract the 512-dimensional image feature I_NIC; and Concatenate(·) is the feature concatenation layer used to join I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
3. The adversarial cross-media retrieval method according to claim 1, wherein text feature extraction specifically comprises the following steps:
given a piece of text S = (s_0, s_1, …, s_T) of length T, each word s_t in S is represented by a 1-of-k encoding, where k is the number of words in the dictionary; before being fed into the LSTM network, each word s_t is first mapped to a denser space, expressed as formula 2:
x_t = W_e·s_t, t ∈ {0, …, T} (formula 2)
wherein W_e is a word-vector mapping matrix used to encode the 1-of-k vector s_t into a d-dimensional word vector;
the word vectors of the dense space are fed into the LSTM network, expressed as formula 3:
i_t = σ(W_ix·x_t + W_ih·h_(t-1) + b_i)
f_t = σ(W_fx·x_t + W_fh·h_(t-1) + b_f)
o_t = σ(W_ox·x_t + W_oh·h_(t-1) + b_o)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ tanh(W_cx·x_t + W_ch·h_(t-1) + b_c)
h_t = o_t ⊙ tanh(c_t) (formula 3)
wherein i_t, f_t, o_t, c_t, h_t denote the outputs of the input gate, forget gate, output gate, memory cell and hidden layer of the LSTM unit at time t, respectively; x_t denotes the word-vector input at the current time; h_(t-1) is the hidden-layer output of the LSTM unit at the previous time; σ denotes the sigmoid function; ⊙ denotes element-wise multiplication; tanh denotes the hyperbolic tangent activation function; and the hidden-layer output h_T of the LSTM network at time T is the feature representation of the text S.
4. The adversarial cross-media retrieval method according to claim 1, wherein the training of the modality classifier specifically comprises the following operations:
the text-space feature label of an image is [0 1] and that of a text is [1 0]; the training of the modality classifier is realized by optimizing the two-class cross-entropy loss function of formula 4:
L_adv(θ_d) = -(1/N)·Σ_(i=1..N) y_i·log D(x_i; θ_d) (formula 4)
wherein x_i and y_i denote the i-th input text-space feature and its corresponding label, respectively; N denotes the total number of currently input feature samples; θ_d denotes the training parameters of the modality classifier; the function D(·) predicts the modality of the current text-space feature, i.e. text or image; and L_adv denotes the two-class cross-entropy loss function of the modality classifier, which is also the additional adversarial loss function of the feature mapping network;
the parameters θ_d of the modality classifier are updated by formula 10:
θ_d ← θ_d - μ·(∂L_adv/∂θ_d) (formula 10)
wherein μ denotes the learning rate of the optimization algorithm, L_adv denotes the loss function of the modality classifier, and θ_d denotes the parameters of the modality classifier.
5. The adversarial cross-media retrieval method as described in claim 1, characterized in that the feature fusion layer at the top of the feature mapping network is obtained by the processing of formula 5:
I_VGG_txt = f(I_VGG), I_NIC_txt = g(I_NIC), I_final = Fusion(I_VGG_txt, I_NIC_txt) (formula 5)
wherein I_VGG is the 4096-dimensional image feature extracted by VGGNet; I_NIC is the 512-dimensional image feature extracted by the image captioning algorithm NIC; I_final is the d-dimensional feature representation of the input image in the limited text space; f(·) and g(·) denote two feature mapping functions; Fusion(·) denotes the feature fusion layer; I_VGG_txt and I_NIC_txt are the d-dimensional text-space feature mappings of I_VGG and I_NIC, respectively.
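A sketch of such a fusion mapping, assuming linear layers for f(·) and g(·) and element-wise-sum fusion; the claim itself fixes only that a fusion layer produces I_final:

import torch
import torch.nn as nn

class FusionMapping(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.f = nn.Linear(4096, d)   # f(.): maps I_VGG into the text space
        self.g = nn.Linear(512, d)    # g(.): maps I_NIC into the text space

    def forward(self, i_vgg, i_nic):
        i_vgg_txt = self.f(i_vgg)     # I_VGG_txt
        i_nic_txt = self.g(i_nic)     # I_NIC_txt
        return i_vgg_txt + i_nic_txt  # I_final (sum fusion is an assumption)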
6. The adversarial cross-media retrieval method as described in claim 1, characterized in that step 2) trains the feature mapping network by optimizing a triplet loss function and an adversarial loss function, specifically executing the following operations:
Let the distance between an input image or text and its matching text or matching image be d1, and its distance to a mismatched text or mismatched image be d2; d1 must be smaller than d2 by at least an interval m; the interval m is an externally determined hyper-parameter; the triplet loss function is expressed as formula 6:
L_emb(θ_f) = Σ_k max(0, m - S(v, t) + S(v, t_k)) + Σ_k max(0, m - S(v, t) + S(v_k, t)) (formula 6)
wherein t_k is the k-th mismatched text of the input image v; v_k is the k-th mismatched image of the input text t; m is the minimum distance interval; S(v, t) is the similarity measure function; θ_f are the parameters of the feature mapping network; mismatched samples are randomly drawn from the dataset in each training epoch;
The adversarial loss L_adv of the modality classifier is simultaneously back-propagated to the feature mapping network;
The total loss function L is defined as formula 7:
L = L_emb - λ·L_adv (formula 7)
wherein λ is an adaptive parameter whose value ranges from 0 to 1; L_emb denotes the triplet loss function; L_adv is the additional adversarial loss function;
To suppress the noisy signal of the modality classifier in the early stage of training, the update of parameter λ is realized by formula 8:
λ = 2 / (1 + exp(-10·p)) - 1 (formula 8)
wherein p denotes the percentage of the current number of iterations over the total number of iterations; λ is the adaptive parameter;
The feature mapping network is trained with the above loss function L, and the parameters θ_f of the feature mapping network are updated by formula 9:
θ_f ← θ_f - μ · ∂L/∂θ_f (formula 9)
wherein μ denotes the learning rate of the optimization algorithm, L denotes the total loss function of the feature mapping network, and θ_f are the parameters of the feature mapping network.
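The objective of claim 6 could be sketched as follows; the margin value and the exact form of the λ schedule are assumptions consistent with the definitions above:

import math
import torch

def triplet_loss(v, t, v_neg, t_neg, m=0.2):
    # Formula 6 with S(v, t) = v . t; all inputs are (B, d) features.
    s_pos = (v * t).sum(dim=1)          # S(v, t)
    s_it = (v * t_neg).sum(dim=1)       # S(v, t_k): image vs. mismatched text
    s_ti = (v_neg * t).sum(dim=1)       # S(v_k, t): mismatched image vs. text
    return (torch.clamp(m - s_pos + s_it, min=0)
            + torch.clamp(m - s_pos + s_ti, min=0)).sum()

def adaptive_lambda(p):
    # Formula 8: lambda ramps from 0 to 1 with training progress p in [0, 1],
    # suppressing the noisy adversarial signal early in training.
    return 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0

# Formula 7: total loss for the feature mapping network,
#   L = triplet_loss(...) - adaptive_lambda(p) * L_adv,
# minimized over theta_f by gradient descent (formula 9).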
7. The adversarial cross-media retrieval method as described in claim 1, characterized in that the similarity measure function S(v, t) of step 2) is expressed as:
S(v, t) = v · t
wherein v and t represent the image feature and the text feature, respectively; before comparison, v and t are first passed through a normalization layer, so that S is equivalent to the cosine distance.
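A sketch of the normalized dot-product similarity of claim 7:

import torch
import torch.nn.functional as F

def similarity(v, t):
    # Normalization layer: L2-normalize both features, so that the
    # dot product S(v, t) = v . t is equivalent to cosine similarity.
    v = F.normalize(v, p=2, dim=-1)
    t = F.normalize(t, p=2, dim=-1)
    return (v * t).sum(dim=-1)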
CN201810101127.0A 2018-02-01 2018-02-01 Antagonism cross-media retrieval method based on limited text space Expired - Fee Related CN108319686B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810101127.0A CN108319686B (en) 2018-02-01 2018-02-01 Antagonism cross-media retrieval method based on limited text space
PCT/CN2018/111327 WO2019148898A1 (en) 2018-02-01 2018-10-23 Adversarial cross-media retrieving method based on restricted text space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810101127.0A CN108319686B (en) 2018-02-01 2018-02-01 Antagonism cross-media retrieval method based on limited text space

Publications (2)

Publication Number Publication Date
CN108319686A true CN108319686A (en) 2018-07-24
CN108319686B CN108319686B (en) 2021-07-30

Family

ID=62888861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810101127.0A Expired - Fee Related CN108319686B (en) 2018-02-01 2018-02-01 Antagonism cross-media retrieval method based on limited text space

Country Status (2)

Country Link
CN (1) CN108319686B (en)
WO (1) WO2019148898A1 (en)

Families Citing this family (42)

Publication number Priority date Publication date Assignee Title
CN111105013B (en) * 2019-11-05 2023-08-11 中国科学院深圳先进技术研究院 Optimization method of countermeasure network architecture, image description generation method and system
CN111179254B (en) * 2019-12-31 2023-05-30 复旦大学 Domain adaptive medical image segmentation method based on feature function and countermeasure learning
CN111198964B (en) * 2020-01-10 2023-04-25 中国科学院自动化研究所 Image retrieval method and system
CN111259152A (en) * 2020-01-20 2020-06-09 刘秀萍 Deep multilayer network driven feature aggregation category divider
CN111325319B (en) * 2020-02-02 2023-11-28 腾讯云计算(北京)有限责任公司 Neural network model detection method, device, equipment and storage medium
CN111353076B (en) * 2020-02-21 2023-10-10 华为云计算技术有限公司 Method for training cross-modal retrieval model, cross-modal retrieval method and related device
CN111368176B (en) * 2020-03-02 2023-08-18 南京财经大学 Cross-modal hash retrieval method and system based on supervision semantic coupling consistency
CN111597810B (en) * 2020-04-13 2024-01-05 广东工业大学 Named entity identification method for semi-supervised decoupling
CN113673635B (en) * 2020-05-15 2023-09-01 复旦大学 Hand-drawn sketch understanding deep learning method based on self-supervision learning task
CN111651577B (en) * 2020-06-01 2023-04-21 全球能源互联网研究院有限公司 Cross-media data association analysis model training and data association analysis method and system
CN111708745B (en) * 2020-06-18 2023-04-21 全球能源互联网研究院有限公司 Cross-media data sharing representation method and user behavior analysis method and system
CN111882032B (en) * 2020-07-13 2023-12-01 广东石油化工学院 Neural semantic memory storage method
CN111984800B (en) * 2020-08-16 2023-11-17 西安电子科技大学 Hash cross-modal information retrieval method based on dictionary pair learning
CN112256899B (en) * 2020-09-23 2022-05-10 华为技术有限公司 Image reordering method, related device and computer readable storage medium
CN112466281A (en) * 2020-10-13 2021-03-09 讯飞智元信息科技有限公司 Harmful audio recognition decoding method and device
CN112214988B (en) * 2020-10-14 2024-01-23 哈尔滨福涛科技有限责任公司 Deep learning and rule combination-based negotiable article structure analysis method
CN112396091B (en) * 2020-10-23 2024-02-09 西安电子科技大学 Social media image popularity prediction method, system, storage medium and application
CN112651448B (en) * 2020-12-29 2023-09-15 中山大学 Multi-mode emotion analysis method for social platform expression package
CN112949384B (en) * 2021-01-23 2024-03-08 西北工业大学 Remote sensing image scene classification method based on antagonistic feature extraction
CN112861977B (en) * 2021-02-19 2024-01-26 中国人民武装警察部队工程大学 Migration learning data processing method, system, medium, equipment, terminal and application
CN113052311B (en) * 2021-03-16 2024-01-19 西北工业大学 Feature extraction network with layer jump structure and method for generating features and descriptors
CN113420166A (en) * 2021-03-26 2021-09-21 阿里巴巴新加坡控股有限公司 Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment
CN113537272B (en) * 2021-03-29 2024-03-19 之江实验室 Deep learning-based semi-supervised social network abnormal account detection method
CN113536013B (en) * 2021-06-03 2024-02-23 国家电网有限公司大数据中心 Cross-media image retrieval method and system
CN113656616B (en) * 2021-06-23 2024-02-27 同济大学 Three-dimensional model sketch retrieval method based on heterogeneous twin neural network
CN113360683B (en) * 2021-06-30 2024-04-19 北京百度网讯科技有限公司 Method for training cross-modal retrieval model and cross-modal retrieval method and device
CN113362416B (en) * 2021-07-01 2024-05-17 中国科学技术大学 Method for generating image based on text of target detection
CN113610128B (en) * 2021-07-28 2024-02-13 西北大学 Aesthetic attribute retrieval-based picture aesthetic description modeling and describing method and system
CN114022687B (en) * 2021-09-24 2024-05-10 之江实验室 Image description countermeasure generation method based on reinforcement learning
CN114022372B (en) * 2021-10-25 2024-04-16 大连理工大学 Mask image patching method for introducing semantic loss context encoder
CN114241517B (en) * 2021-12-02 2024-02-27 河南大学 Cross-mode pedestrian re-recognition method based on image generation and shared learning network
CN114298159B (en) * 2021-12-06 2024-04-09 湖南工业大学 Image similarity detection method based on text fusion under label-free sample
CN114443916B (en) * 2022-01-25 2024-02-06 中国人民解放军国防科技大学 Supply and demand matching method and system for test data
CN114677569B (en) * 2022-02-17 2024-05-10 之江实验室 Character-image pair generation method and device based on feature decoupling
CN115129917B (en) * 2022-06-06 2024-04-09 武汉大学 optical-SAR remote sensing image cross-modal retrieval method based on modal common characteristics
CN115131613B (en) * 2022-07-01 2024-04-02 中国科学技术大学 Small sample image classification method based on multidirectional knowledge migration
CN115909317A (en) * 2022-07-15 2023-04-04 广东工业大学 Learning method and system for three-dimensional model-text joint expression
CN115840827B (en) * 2022-11-07 2023-09-19 重庆师范大学 Deep unsupervised cross-modal hash retrieval method
CN116108215A (en) * 2023-02-21 2023-05-12 湖北工业大学 Cross-modal big data retrieval method and system based on depth fusion
CN116821408B (en) * 2023-08-29 2023-12-01 南京航空航天大学 Multi-task consistency countermeasure retrieval method and system
CN116935329B (en) * 2023-09-19 2023-12-01 山东大学 Weak supervision text pedestrian retrieval method and system for class-level comparison learning
CN117611924B (en) * 2024-01-17 2024-04-09 贵州大学 Plant leaf phenotype disease classification method based on graphic subspace joint learning

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN104346440B (en) * 2014-10-10 2017-06-23 浙江大学 A kind of across media hash indexing methods based on neutral net
CN106095893B (en) * 2016-06-06 2018-11-20 北京大学深圳研究生院 A kind of cross-media retrieval method
CN108319686B (en) * 2018-02-01 2021-07-30 北京大学深圳研究生院 Antagonism cross-media retrieval method based on limited text space

Patent Citations (8)

Publication number Priority date Publication date Assignee Title
CN1211769A (en) * 1997-06-26 1999-03-24 香港中文大学 Method and equipment for file retrieval based on Bayesian network
CN1920818A (en) * 2006-09-14 2007-02-28 浙江大学 Transmedia search method based on multi-mode information convergence analysis
US20120303628A1 (en) * 2011-05-24 2012-11-29 Brian Silvola Partitioned database model to increase the scalability of an information system
CN103914711A (en) * 2014-03-26 2014-07-09 中国科学院计算技术研究所 Improved top speed learning model and method for classifying modes of improved top speed learning model
CN105512289A (en) * 2015-12-07 2016-04-20 郑州金惠计算机系统工程有限公司 Image retrieval method based on deep learning and Hash
CN105718532A (en) * 2016-01-15 2016-06-29 北京大学 Cross-media sequencing method based on multi-depth network structure
CN106202413A (en) * 2016-07-11 2016-12-07 北京大学深圳研究生院 A kind of cross-media retrieval method
CN106649715A (en) * 2016-12-21 2017-05-10 中国人民解放军国防科学技术大学 Cross-media retrieval method based on local sensitive hash algorithm and neural network

Non-Patent Citations (1)

Title
Li Hui et al.: "Design and Implementation of an Efficient Multi-Pattern Matching Algorithm", Journal of Beijing Technology and Business University (Natural Science Edition) *

Cited By (40)

Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN109344266A (en) * 2018-06-29 2019-02-15 北京大学深圳研究生院 A kind of antagonism cross-media retrieval method based on dual semantics space
CN109344266B (en) * 2018-06-29 2021-08-06 北京大学深圳研究生院 Dual-semantic-space-based antagonistic cross-media retrieval method
CN109508400A (en) * 2018-10-09 2019-03-22 中国科学院自动化研究所 Picture and text abstraction generating method
CN109783655B (en) * 2018-12-07 2022-12-30 西安电子科技大学 Cross-modal retrieval method and device, computer equipment and storage medium
CN109783655A (en) * 2018-12-07 2019-05-21 西安电子科技大学 A kind of cross-module state search method, device, computer equipment and storage medium
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
CN109783657B (en) * 2019-01-07 2022-12-30 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method and system based on limited text space
CN109783657A (en) * 2019-01-07 2019-05-21 北京大学深圳研究生院 Multistep based on limited text space is from attention cross-media retrieval method and system
CN109919162A (en) * 2019-01-25 2019-06-21 武汉纺织大学 For exporting the model and its method for building up of MR image characteristic point description vectors symbol
CN110059217B (en) * 2019-04-29 2022-11-04 广西师范大学 Image text cross-media retrieval method for two-stage network
CN110059217A (en) * 2019-04-29 2019-07-26 广西师范大学 A kind of image text cross-media retrieval method of two-level network
CN110189249A (en) * 2019-05-24 2019-08-30 深圳市商汤科技有限公司 A kind of image processing method and device, electronic equipment and storage medium
CN110189249B (en) * 2019-05-24 2022-02-18 深圳市商汤科技有限公司 Image processing method and device, electronic equipment and storage medium
CN110175256A (en) * 2019-05-30 2019-08-27 上海联影医疗科技有限公司 A kind of image data retrieval method, apparatus, equipment and storage medium
CN112182281B (en) * 2019-07-05 2023-09-19 腾讯科技(深圳)有限公司 Audio recommendation method, device and storage medium
CN112182281A (en) * 2019-07-05 2021-01-05 腾讯科技(深圳)有限公司 Audio recommendation method and device and storage medium
CN110502743A (en) * 2019-07-12 2019-11-26 北京邮电大学 Social networks based on confrontation study and semantic similarity is across media search method
CN110674688B (en) * 2019-08-19 2023-10-31 深圳力维智联技术有限公司 Face recognition model acquisition method, system and medium for video monitoring scene
CN110674688A (en) * 2019-08-19 2020-01-10 深圳力维智联技术有限公司 Face recognition model acquisition method, system and medium for video monitoring scene
CN110866129A (en) * 2019-11-01 2020-03-06 中电科大数据研究院有限公司 Cross-media retrieval method based on cross-media uniform characterization model
CN113094550A (en) * 2020-01-08 2021-07-09 百度在线网络技术(北京)有限公司 Video retrieval method, device, equipment and medium
CN113094550B (en) * 2020-01-08 2023-10-24 百度在线网络技术(北京)有限公司 Video retrieval method, device, equipment and medium
CN111259851A (en) * 2020-01-23 2020-06-09 清华大学 Multi-mode event detection method and device
CN111782921A (en) * 2020-03-25 2020-10-16 北京沃东天骏信息技术有限公司 Method and device for searching target
WO2021190115A1 (en) * 2020-03-25 2021-09-30 北京沃东天骏信息技术有限公司 Method and apparatus for searching for target
CN111651660B (en) * 2020-05-28 2023-05-02 拾音智能科技有限公司 Method for cross-media retrieval of difficult samples
CN111651660A (en) * 2020-05-28 2020-09-11 拾音智能科技有限公司 Method for cross-media retrieval of difficult samples
CN112818157A (en) * 2021-02-10 2021-05-18 浙江大学 Combined query image retrieval method based on multi-order confrontation characteristic learning
CN113159071B (en) * 2021-04-20 2022-06-21 复旦大学 Cross-modal image-text association anomaly detection method
CN113159071A (en) * 2021-04-20 2021-07-23 复旦大学 Cross-modal image-text association anomaly detection method
CN113379603B (en) * 2021-06-10 2024-03-15 大连海事大学 Ship target detection method based on deep learning
CN113379603A (en) * 2021-06-10 2021-09-10 大连海事大学 Ship target detection method based on deep learning
CN113254678A (en) * 2021-07-14 2021-08-13 北京邮电大学 Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof
CN113254678B (en) * 2021-07-14 2021-10-01 北京邮电大学 Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof
CN113946710A (en) * 2021-10-12 2022-01-18 浙江大学 Video retrieval method based on multi-mode and self-supervision characterization learning
CN115114395A (en) * 2022-04-15 2022-09-27 腾讯科技(深圳)有限公司 Content retrieval and model training method and device, electronic equipment and storage medium
CN115114395B (en) * 2022-04-15 2024-03-19 腾讯科技(深圳)有限公司 Content retrieval and model training method and device, electronic equipment and storage medium
CN117312592A (en) * 2023-11-28 2023-12-29 云南联合视觉科技有限公司 Text-pedestrian image retrieval method based on modal invariant feature learning
CN117312592B (en) * 2023-11-28 2024-02-09 云南联合视觉科技有限公司 Text-pedestrian image retrieval method based on modal invariant feature learning

Also Published As

Publication number Publication date
CN108319686B (en) 2021-07-30
WO2019148898A1 (en) 2019-08-08

Similar Documents

Publication Publication Date Title
CN108319686A (en) Antagonism cross-media retrieval method based on limited text space
Spinde et al. Automated identification of bias inducing words in news articles using linguistic and context-oriented features
CN109753566A (en) The model training method of cross-cutting sentiment analysis based on convolutional neural networks
CN103229168B (en) The method and system that evidence spreads between multiple candidate answers during question and answer
CN107076567A (en) Multilingual image question and answer
US20160350288A1 (en) Multilingual embeddings for natural language processing
CN110390018A (en) A kind of social networks comment generation method based on LSTM
CN107688870B (en) Text stream input-based hierarchical factor visualization analysis method and device for deep neural network
CN109614487A (en) A method of the emotional semantic classification based on tensor amalgamation mode
CN113254678B (en) Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof
Barua et al. F-NAD: an application for fake news article detection using machine learning techniques
Liu et al. Learning to predict population-level label distributions
CN109101490B (en) Factual implicit emotion recognition method and system based on fusion feature representation
CN112597302B (en) False comment detection method based on multi-dimensional comment representation
CN108763211A (en) The automaticabstracting and system of knowledge are contained in fusion
CN113722474A (en) Text classification method, device, equipment and storage medium
CN111639176A (en) Real-time event summarization method based on consistency monitoring
Nasrullah et al. Detection of types of mental illness through the social network using ensembled deep learning model
CN113821587B (en) Text relevance determining method, model training method, device and storage medium
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
Wang et al. A meta-learning based stress category detection framework on social media
Yoon et al. Image classification and captioning model considering a CAM‐based disagreement loss
Wijaya et al. Hate Speech Detection Using Convolutional Neural Network and Gated Recurrent Unit with FastText Feature Expansion on Twitter
CN109325096A (en) A kind of knowledge resource search system of knowledge based resource classification
Oak et al. Generating clinically relevant texts: A case study on life-changing events

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210730