CN108319686A - Adversarial cross-media retrieval method based on a limited text space - Google Patents
Info
- Publication number
- CN108319686A CN108319686A CN201810101127.0A CN201810101127A CN108319686A CN 108319686 A CN108319686 A CN 108319686A CN 201810101127 A CN201810101127 A CN 201810101127A CN 108319686 A CN108319686 A CN 108319686A
- Authority
- CN
- China
- Prior art keywords
- feature
- text
- image
- network
- space
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 52
- 230000008485 antagonism Effects 0.000 title claims abstract description 28
- 238000013507 mapping Methods 0.000 claims abstract description 86
- 238000000605 extraction Methods 0.000 claims abstract description 57
- 238000012549 training Methods 0.000 claims abstract description 54
- 239000000284 extract Substances 0.000 claims abstract description 16
- 230000008569 process Effects 0.000 claims abstract description 16
- 230000007246 mechanism Effects 0.000 claims abstract description 12
- 230000003042 antagonistic effect Effects 0.000 claims abstract description 7
- 238000013461 design Methods 0.000 claims abstract description 7
- 230000006870 function Effects 0.000 claims description 63
- 239000013598 vector Substances 0.000 claims description 30
- 238000013528 artificial neural network Methods 0.000 claims description 11
- 230000000306 recurrent effect Effects 0.000 claims description 8
- 230000004927 fusion Effects 0.000 claims description 7
- 239000011159 matrix material Substances 0.000 claims description 7
- 238000005457 optimization Methods 0.000 claims description 7
- 230000004913 activation Effects 0.000 claims description 4
- 238000010606 normalization Methods 0.000 claims description 4
- 238000005259 measurement Methods 0.000 claims description 3
- 230000000052 comparative effect Effects 0.000 claims description 2
- 230000007774 longterm Effects 0.000 claims description 2
- 230000001360 synchronised effect Effects 0.000 claims description 2
- 230000008859 change Effects 0.000 claims 1
- 230000007787 long-term memory Effects 0.000 claims 1
- 230000006399 behavior Effects 0.000 abstract description 4
- 230000000694 effects Effects 0.000 description 13
- 238000012360 testing method Methods 0.000 description 10
- 230000001149 cognitive effect Effects 0.000 description 7
- 238000013527 convolutional neural network Methods 0.000 description 7
- 230000007423 decrease Effects 0.000 description 5
- 238000010219 correlation analysis Methods 0.000 description 4
- 230000002452 interceptive effect Effects 0.000 description 4
- 230000001537 neural effect Effects 0.000 description 4
- 210000004556 brain Anatomy 0.000 description 3
- 238000005520 cutting process Methods 0.000 description 3
- 230000007547 defect Effects 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 2
- 230000008901 benefit Effects 0.000 description 2
- 230000000295 complement effect Effects 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000002708 enhancing effect Effects 0.000 description 2
- 238000002474 experimental method Methods 0.000 description 2
- 239000012634 fragment Substances 0.000 description 2
- 238000010348 incorporation Methods 0.000 description 2
- 230000015654 memory Effects 0.000 description 2
- 239000000047 product Substances 0.000 description 2
- 230000006403 short-term memory Effects 0.000 description 2
- 230000000007 visual effect Effects 0.000 description 2
- 238000009825 accumulation Methods 0.000 description 1
- 230000009471 action Effects 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000008676 import Effects 0.000 description 1
- 238000003780 insertion Methods 0.000 description 1
- 230000037431 insertion Effects 0.000 description 1
- 238000007689 inspection Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000004088 simulation Methods 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 239000013589 supplement Substances 0.000 description 1
- 230000002459 sustained effect Effects 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5846—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Library & Information Science (AREA)
- Fuzzy Systems (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an adversarial cross-media retrieval method based on a limited text space. A feature extraction network, a feature mapping network and a modality classifier are designed; the limited text space is obtained by learning; image and text features suited to cross-media retrieval are extracted, and image features are mapped from the image space into the text space. An adversarial training mechanism continually reduces the difference in feature distribution between data of different modalities during learning, thereby realizing cross-media retrieval. The invention better fits human behavior in cross-media retrieval tasks; it learns image and text features better suited to the cross-media retrieval task, compensating for the limited expressive power of pre-trained features; and it introduces an adversarial learning mechanism that further improves retrieval accuracy through a minimax game between the modality classifier and the feature mapping network.
Description
Technical field
The present invention relates to the technical field of computer vision, and in particular to an adversarial cross-media retrieval method based on a limited text space.
Background technology
With the arrival of the Web 2.0 era, large amounts of multimedia data (images, text, video, audio, etc.) have begun to accumulate and propagate on the Internet. Unlike traditional single-modality retrieval tasks, cross-media retrieval performs bidirectional retrieval between data of different modalities, such as retrieving images with text and retrieving text with images. However, because multimedia data are inherently heterogeneous, their similarity cannot be measured directly. The key problem of this task is therefore to find an isomorphic mapping space in which the similarity between heterogeneous multimedia data can be measured directly. In the field of cross-media retrieval, extensive research has been carried out on this problem, and a series of representative cross-media retrieval algorithms have been proposed, such as CCA (Canonical Correlation Analysis), DeViSE (Deep Visual-Semantic Embedding) and DSPE (Deep Structure-Preserving Image-Text Embeddings). However, these methods suffer from certain drawbacks.
The first drawback concerns the feature representation of multimedia data. Existing methods mostly extract image features with pre-trained CNN (convolutional neural network) models, such as VGG (the network architecture proposed by the Visual Geometry Group). However, these models are usually pre-trained on image classification tasks, so the extracted image features contain only the category information of objects and lose information that may be critical for cross-media retrieval, such as the actions of objects and the interactions between them. For text, Word2Vec, LDA (Latent Dirichlet Allocation) and FV (Fisher Vector) are mainstream text features. However, they too are pre-trained on datasets that differ from cross-media retrieval data, so the extracted features are not well suited to cross-media retrieval.
The second drawback concerns the choice of the isomorphic feature space. There are essentially three choices: a common space, the text space, and the image space. From the perspective of human cognition, the brain understands text and images differently. For text, the brain extracts features and understands them directly; for an image, the brain subconsciously describes it in words before understanding it, i.e., it first converts from the image space to the text space. Performing cross-media retrieval in the text space therefore better simulates human cognition. Existing text-space methods mostly use the Word2Vec space as the final text space, representing an image in that space by combining the category information of the objects it contains. Such features again lose the rich action and interaction information contained in the image, which shows that for cross-media retrieval the Word2Vec space is not an effective text feature space.
The third drawback concerns the difference in feature distribution between data of different modalities. Although existing methods map the features of different modalities into some isomorphic feature space, the modality gap between them still exists and their feature distributions differ noticeably, which degrades cross-media retrieval performance.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides an adversarial cross-media retrieval method based on a limited text space. It first learns image and text feature representations matched to the cross-media retrieval task; it then finds a limited text space by simulating human cognition, in which the similarity between images and text is measured. The method also introduces an adversarial training mechanism intended to reduce the difference in feature distribution between data of different modalities while the text space is being learned, thereby increasing retrieval accuracy.
The principle of the invention is as follows. As described in the background, the key problem of cross-media retrieval is to find an isomorphic mapping space in which the similarity between heterogeneous multimedia data can be measured directly. More precisely, this core problem can be divided into two sub-problems. The first is how to learn effective feature representations of multimedia data. The second is how to find a suitable isomorphic feature space. The cross-media retrieval method based on a limited text space proposed by the present invention comprises a feature extraction network, a feature mapping network and a modality classifier. For the first sub-problem, the invention uses the feature extraction network to learn effective image and text feature representations. Building on the image captioning task, the invention learns a new kind of image feature by combining an image captioning algorithm with a CNN; this feature contains not only the category information of the objects in an image but also rich interaction information between them. For text features, a recurrent neural network (RNN) is trained from scratch to learn text features suited to the cross-media retrieval task. For the second sub-problem, the invention uses the feature mapping network to learn a limited text space. To further reduce the difference between features of different modalities, a modality classifier is designed to play a minimax game with the feature mapping network: the modality classifier tries to distinguish the modality of the current limited-text-space feature, while the feature mapping network tries to learn modality-invariant features that confuse the classifier. During training, in addition to the conventional triplet loss, an additional adversarial loss is back-propagated from the modality classifier to the feature mapping network to further reduce the difference between features of different modalities. "Limited text space" indicates that the text space learned by this method is spanned by a set of basis vectors, which can be regarded as the words of a dictionary; the expressive power of the space is therefore restricted by the number of words in the dictionary, hence "limited". The method mainly learns the limited text space to realize similarity measurement between images and text. Based on this space, it simulates human cognition, extracts image and text features suited to cross-media retrieval, maps image features from the image space into the text space, and introduces an adversarial training mechanism to continually reduce the difference in feature distribution between data of different modalities during learning. The method achieves accurate retrieval results on classic cross-media retrieval datasets.
The technical solution provided by the invention is:
An adversarial cross-media retrieval method based on a limited text space. Using a feature extraction network, a feature mapping network and a modality classifier, a limited text space is obtained by learning; image and text features suited to cross-media retrieval are extracted, and image features are mapped from the image space into the text space. An adversarial training mechanism continually reduces the difference in feature distribution between data of different modalities during learning. The invention first trains the feature extraction network, the feature mapping network and the modality classifier on a dataset D, and then applies the trained networks to retrieval request data to realize adversarial cross-media retrieval. The steps are as follows:
Assume the training dataset D = {D1, D2, …, Dn} contains n samples. Each sample Di comprises one image Ii and one piece of descriptive text Ti, i.e., Di = (Ii, Ti). Each piece of text consists of several (here, 5) sentences, each of which independently describes the matching image; every image thus has 5 descriptive sentences that are similar in meaning but differ in wording.
1) Extract the features of the images and text in D with the feature extraction network.
For images, features are extracted by combining an existing VGG model with an image captioning algorithm (Neural Image Captioning, NIC). For text, features are extracted with an LSTM (Long Short-Term Memory recurrent neural network). Since the LSTM network is not pre-trained, its parameters are updated synchronously with those of the feature mapping network.
The computation of image feature extraction is expressed as formula 1:
I_Concat = Concatenate(VGGNet(I), NIC(I)) (formula 1)
where VGGNet(·) is the 19-layer VGG model, which extracts the 4096-dimensional feature I_VGG of the input image; NIC(·) is the image captioning algorithm, which extracts the 512-dimensional image feature I_NIC; and Concatenate(·) is a feature concatenation layer that joins I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
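As a minimal numpy sketch of formula 1, the concatenation step can be written as below; random arrays stand in for the real VGGNet and NIC outputs, which the patent's networks would produce:

```python
import numpy as np

def extract_image_feature(i_vgg: np.ndarray, i_nic: np.ndarray) -> np.ndarray:
    """Join the 4096-d VGG feature and the 512-d NIC feature into a
    single 4608-d image descriptor, as in formula 1."""
    assert i_vgg.shape == (4096,) and i_nic.shape == (512,)
    return np.concatenate([i_vgg, i_nic])

rng = np.random.default_rng(0)
i_vgg = rng.random(4096)   # stand-in for VGGNet(image)
i_nic = rng.random(512)    # stand-in for NIC(image)
i_concat = extract_image_feature(i_vgg, i_nic)
print(i_concat.shape)      # (4608,)
```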
Text feature extraction executes the following steps:
Given a text S = (s0, s1, …, sT) of length T, each word st in S is represented by a 1-of-k encoding, where k is the number of words in the dictionary. Before being fed into the LSTM network, each word st is first mapped into a denser space, as in formula 2:
x_t = W_e s_t, t ∈ {0, …, T} (formula 2)
where W_e is the word embedding matrix, used to encode the 1-of-k vector s_t into a d-dimensional word vector.
The word vectors in the dense space are fed into the LSTM network, expressed as formula 3 (the standard LSTM recurrence):
i_t = σ(W_ix x_t + W_ih h_{t−1} + b_i)
f_t = σ(W_fx x_t + W_fh h_{t−1} + b_f)
o_t = σ(W_ox x_t + W_oh h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_cx x_t + W_ch h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t) (formula 3)
where i_t, f_t, o_t, c_t and h_t denote the input gate, forget gate, output gate, memory cell and hidden-layer output of the LSTM unit at time t; x_t is the word-vector input at the current time; h_{t−1} is the hidden-layer output of the LSTM unit at the previous time; σ denotes the sigmoid function; ⊙ denotes element-wise multiplication; and tanh denotes the hyperbolic tangent activation function.
The hidden-layer output h_T of the LSTM network at time T is the feature representation of the text S.
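The embedding and recurrence of formulas 2 and 3 can be sketched in numpy as follows; this is an illustrative toy (random weights, tiny dictionary), not the trained network, and the stacked-gate layout is one common way to organize the standard LSTM parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b, d):
    """One step of the standard LSTM recurrence (formula 3):
    gates i, f, o and candidate g; c_t = f*c_{t-1} + i*g; h_t = o*tanh(c_t)."""
    z = W @ x_t + U @ h_prev + b          # stacked pre-activations, shape (4d,)
    i = sigmoid(z[:d])                    # input gate
    f = sigmoid(z[d:2*d])                 # forget gate
    o = sigmoid(z[2*d:3*d])               # output gate
    g = np.tanh(z[3*d:])                  # candidate memory
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

def encode_text(word_ids, W_e, W, U, b, d):
    """Embed 1-of-k words (formula 2: x_t = W_e s_t) and run the LSTM;
    the final hidden state h_T is the text feature."""
    h, c = np.zeros(d), np.zeros(d)
    for t in word_ids:
        x_t = W_e[:, t]                   # column t equals W_e @ one_hot(t)
        h, c = lstm_step(x_t, h, c, W, U, b, d)
    return h

k, d = 50, 8                              # toy dictionary size and feature dim
W_e = rng.standard_normal((d, k))
W = rng.standard_normal((4 * d, d))
U = rng.standard_normal((4 * d, d))
b = np.zeros(4 * d)
h_T = encode_text([3, 17, 42], W_e, W, U, b, d)
print(h_T.shape)                          # (8,)
```

Since h_t = o_t ⊙ tanh(c_t) with o_t ∈ (0, 1), every component of the text feature lies strictly inside (−1, 1).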
2) A feature fusion layer is designed at the top of the feature mapping network, which fuses I_VGG_txt and I_NIC_txt into I_final, the d-dimensional feature representation of the input image in the limited text space; the dimensionality of the limited text space is d. The feature mapping network maps the image features obtained in step 1) into the limited text space (initially untrained). Then, first, a similarity measurement function compares feature vectors (computing the distance between them) to obtain the current triplet loss; second, the feature vectors of data of different modalities are fed into the modality classifier for classification, obtaining the current adversarial loss; finally, the limited text space is trained by optimizing the combined loss function of the triplet loss and the adversarial loss.
The text features are not fed into the feature mapping network here, because the feature extraction network (the LSTM network) already realizes the mapping of text into the feature space during feature extraction.
The feature fusion layer at the top of the feature mapping network is obtained by the processing of formula 5, where I_VGG is the 4096-dimensional image feature extracted by VGGNet; I_NIC is the 512-dimensional image feature extracted by the image captioning algorithm NIC; I_final is the d-dimensional feature representation of the input image in the limited text space; f(·) and g(·) denote two feature mapping functions; and I_VGG_txt and I_NIC_txt are the mappings of I_VGG and I_NIC into the d-dimensional text space, respectively, which the fusion layer combines into I_final.
The similarity measurement function is expressed as s(v, t) = v·t, where v and t represent the image feature and the text feature respectively. Before comparison, v and t are first passed through an L2 normalization layer, so that s is equivalent to the cosine distance.
The feature mapping network is trained by optimizing the triplet loss function and the adversarial loss function; specifically:
Let the distance between an input image (or text) and its matching text (or image) be d1, and the distance to a mismatched text (or image) be d2; d1 must be smaller than d2 by at least a margin m, a hyper-parameter set externally. The triplet loss function is expressed as formula 6 (the standard bidirectional hinge formulation):
L_emb(θ_f) = Σ_k max(0, m − s(v, t) + s(v, t_k)) + Σ_k max(0, m − s(v, t) + s(v_k, t)) (formula 6)
where t_k is the k-th mismatched text for the input image v; v_k is the k-th mismatched image for the input text t; m is the minimum distance margin; s(v, t) is the similarity measurement function; and θ_f are the parameters of the feature mapping network. Mismatched samples are randomly drawn from the dataset in each training cycle.
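The bidirectional hinge of formula 6 can be sketched as below; the similarity function is passed in, and the toy vectors are illustrative only:

```python
import numpy as np

def triplet_loss(v, t, neg_texts, neg_images, m, sim):
    """Formula 6: the matched pair (v, t) must score at least margin m
    above every mismatched pair, in both the image->text and
    text->image directions (hinge at zero)."""
    pos = sim(v, t)
    loss = sum(max(0.0, m - pos + sim(v, t_k)) for t_k in neg_texts)
    loss += sum(max(0.0, m - pos + sim(v_k, t)) for v_k in neg_images)
    return loss

cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
v = np.array([1.0, 0.0])        # image feature (toy)
t = np.array([1.0, 0.1])        # matching text feature (toy)
t_bad = np.array([0.0, 1.0])    # mismatched text feature (toy)
```

With a small margin the well-separated negative contributes nothing (loss 0); with a margin larger than the positive-negative gap the hinge activates and the loss becomes positive.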
The adversarial loss L_adv from the modality classifier is simultaneously back-propagated to the feature mapping network.
The total loss function L is defined as formula 7:
L = L_emb − λ·L_adv (formula 7)
where λ is an adaptive parameter whose value ranges from 0 to 1; L_emb is the triplet loss function; and L_adv is the additional adversarial loss function.
To suppress the noisy signal from the modality classifier in the early stage of training, the update of the parameter λ is realized by formula 8, where p is the percentage of the total number of iterations completed so far and λ is the adaptive parameter.
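Formula 8 itself is not reproduced in this text; a DANN-style schedule of the kind commonly used for exactly this purpose (rising smoothly from 0 at p = 0 to 1 at p = 1, so the classifier's noisy early signal is suppressed) would look like the following, with the functional form and the steepness constant γ = 10 being assumptions:

```python
import numpy as np

def adaptive_lambda(p: float, gamma: float = 10.0) -> float:
    """Hypothetical realization of formula 8: monotone schedule from
    0 to 1 over training progress p in [0, 1]. The exact form
    2/(1 + exp(-gamma*p)) - 1 is an assumption, not the patent's."""
    return 2.0 / (1.0 + np.exp(-gamma * p)) - 1.0
```

Any schedule with the same endpoints and monotonicity would serve the stated purpose.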
The feature mapping network is trained with the above loss function L, and its parameters θ_f are updated by formula 9 (gradient descent):
θ_f ← θ_f − μ ∂L/∂θ_f (formula 9)
where μ is the learning rate of the optimization algorithm, L is the total loss function of the feature mapping network, and θ_f are the parameters of the feature mapping network.
3) The image and text features obtained in step 2), now located in the same limited text space, are fed into the modality classifier for classification, and the modality classifier is trained with a cross-entropy loss. Specifically:
The text-space feature label of an image is [0 1], and that of a text is [1 0]. The modality classifier is trained by optimizing a two-class cross-entropy loss function, expressed as formula 4:
L_adv(θ_d) = −(1/N) Σ_{i=1}^{N} y_i · log D(x_i; θ_d) (formula 4)
where x_i and y_i denote the i-th input text-space feature and its corresponding label; N is the total number of currently input feature samples; θ_d are the training parameters of the modality classifier; D(·) is the function predicting the modality of the current text-space feature, i.e., text or image; and L_adv is the two-class cross-entropy loss of the modality classifier, which also serves as the additional adversarial loss function of the feature mapping network.
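A minimal numpy sketch of the two-class cross-entropy of formula 4, assuming the classifier outputs raw two-class scores (logits) that are softmax-normalized:

```python
import numpy as np

def modality_cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Two-class cross-entropy (formula 4). `labels` are one-hot rows:
    [0, 1] for image-derived text-space features, [1, 0] for text
    features; `logits` are the classifier's raw two-class scores."""
    z = logits - logits.max(axis=1, keepdims=True)        # stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-(labels * np.log(p + 1e-12)).sum(axis=1).mean())

labels = np.array([[1.0, 0.0], [0.0, 1.0]])               # text, image
confident_right = np.array([[5.0, -5.0], [-5.0, 5.0]])
confident_wrong = np.array([[-5.0, 5.0], [5.0, -5.0]])
```

A classifier that predicts the modality correctly with high confidence incurs a near-zero loss; when the feature mapping network succeeds in confusing it, this loss rises, which is exactly the signal exploited by the minimax game.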
The parameters θ_d of the modality classifier are updated by formula 10:
θ_d ← θ_d − μ ∂L_adv/∂θ_d (formula 10)
where μ is the learning rate of the optimization algorithm, L_adv is the loss function of the modality classifier, and θ_d are the parameters of the modality classifier.
4) Repeat steps 2) and 3) until the feature mapping network converges.
5) For a retrieval request, compute the distance in the limited text space between the retrieval request data (an image or a text) and the data of the other modality in dataset D, and sort the retrieval results by distance to obtain the most similar results. The distance is computed as the dot product between the feature vectors of the different-modality data in the limited text space.
Through the above steps, adversarial cross-media retrieval based on the limited text space is realized.
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention provides an adversarial cross-media retrieval method based on a limited text space, which mainly learns the limited text space to realize similarity measurement between images and text. Based on the limited text space, the method simulates human cognition, extracts image and text features suited to cross-media retrieval, maps image features from the image space into the text space, and introduces an adversarial training mechanism to continually reduce the difference in feature distribution between data of different modalities during learning. The method achieves accurate retrieval results on classic cross-media retrieval datasets. Specifically, the invention uses the feature extraction network to learn effective image and text feature representations; the image features are further fed into the feature mapping network to realize the mapping from the image space to the text space. Finally, to further reduce the difference in feature distribution between data of different modalities, the adversarial loss produced by the modality classifier is back-propagated to the feature mapping network, further improving the retrieval results. In particular, the invention has the following technical advantages:
(1) The invention performs cross-media retrieval in a limited text space by simulating human cognition. Compared with existing methods based on a common space or the image space, it better fits human behavior in cross-media retrieval tasks.
(2) The feature extraction network can learn image and text features better suited to the cross-media retrieval task, compensating for the limited expressive power of pre-trained features.
(3) To further reduce the difference in feature distribution between data of different modalities, the invention introduces an adversarial learning mechanism that further improves retrieval accuracy through the minimax game between the modality classifier and the feature mapping network.
Description of the drawings
Fig. 1 is a flow diagram of the method of the invention, where (a) shows that the invention comprises three parts: the feature extraction network, the feature mapping network and the modality classifier; (b) and (c) are structural block diagrams of the feature mapping network and the modality classifier, respectively.
Fig. 2 is a schematic diagram of the structure of the feature extraction network of the invention, where (a) is the image feature extraction network, which extracts image features by combining the 19-layer VGG model VGGNet with the image captioning algorithm NIC; (b) is the recurrent neural network (LSTM) for extracting text features.
Fig. 3 is a screenshot of cross-media retrieval results of an embodiment of the invention on the Flickr8K test dataset.
Detailed description of the embodiments
The invention is further described below by way of embodiments with reference to the accompanying drawings, without limiting the scope of the invention in any way.
The invention provides an adversarial cross-media retrieval method based on a limited text space, which mainly learns the limited text space to realize similarity measurement between images and text. Based on the limited text space, the method simulates human cognition, extracts image and text features suited to cross-media retrieval, maps image features from the image space into the text space, and introduces an adversarial training mechanism to continually reduce the difference in feature distribution between data of different modalities during learning. The feature extraction network, feature mapping network and modality classifier of the invention, their implementation, and the training steps of the networks are described in detail below.
1. Feature extraction network
The feature extraction network mainly comprises two branches, an image feature extraction network and a text feature extraction network, corresponding respectively to feature extraction for images and for text.
1) Image feature extraction. The image feature extraction network learns an image feature I_Concat, comprising the 4096-dimensional feature I_VGG and the image feature I_NIC extracted by the image captioning algorithm.
The image feature extraction network can be regarded as a combination of VGGNet (a neural network architecture proposed by the Visual Geometry Group) and NIC (Neural Image Caption, a neural-network-based image captioning model); VGGNet is a 19-layer VGG model, and NIC is an image captioning algorithm. VGGNet is pre-trained on an image classification task and extracts image features rich in object category information; in contrast, NIC is pre-trained on an image captioning task and extracts image features rich in information about interactions between objects. The image features extracted by the two are therefore complementary.
Specifically, after an image of size 224 × 224 is fed into VGGNet, the network outputs a 4096-dimensional feature I_VGG. At the same time, to avoid the information loss of image features during translation, the output of the image embedding layer (Image Embedding Layer) in NIC is taken as the image feature I_NIC extracted by the image captioning algorithm. Finally, the image feature I_Concat is the combination of I_VGG and I_NIC. The calculation is expressed as formula 1:
I_VGG = VGGNet(I), I_NIC = NIC(I), I_Concat = Concatenate(I_VGG, I_NIC) (formula 1)
Wherein, VGGNet(·) is the 19-layer VGG model, which extracts the 4096-dimensional feature I_VGG of the input image; NIC(·) is the image captioning algorithm, which extracts the 512-dimensional image feature I_NIC; Concatenate(·) is a feature concatenation layer, which joins I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
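As an illustrative sketch (not part of the claimed method), the concatenation of formula 1 can be written as follows; the `vggnet` and `nic` functions are stand-ins for the real pre-trained networks, and only the dimensions (4096, 512, 4608) come from the text:

```python
import random

def vggnet(image):
    """Stand-in for the 19-layer VGGNet: returns a 4096-dim feature."""
    random.seed(0)
    return [random.random() for _ in range(4096)]

def nic(image):
    """Stand-in for the NIC image-embedding-layer output: 512-dim."""
    random.seed(1)
    return [random.random() for _ in range(512)]

def concatenate(i_vgg, i_nic):
    """Feature concatenation layer: joins the two features (formula 1)."""
    return i_vgg + i_nic

i_vgg = vggnet("example.jpg")
i_nic = nic("example.jpg")
i_concat = concatenate(i_vgg, i_nic)
print(len(i_concat))  # 4608-dimensional I_Concat
```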
2) Text feature extraction
The text feature extraction network uses a long short-term memory recurrent neural network (LSTM) to extract a d-dimensional text feature; at the same time, d is also the dimension of the limited text space. Given a text S = (s_0, s_1, …, s_T) of length T, each word s_t in S is represented with a 1-of-k encoding, where k is the number of words in the dictionary. Before being fed into the LSTM network, the word s_t must first be mapped into a denser space:
x_t = W_e·s_t, t ∈ {0, …, T}, (formula 2)
Wherein, W_e is a word embedding matrix used to encode the 1-of-k vector s_t into a d-dimensional word vector.
After obtaining the word vector representations in the dense space, we feed them into the LSTM network, expressed as formula 3:
i_t = σ(W_i·x_t + U_i·h_{t-1} + b_i)
f_t = σ(W_f·x_t + U_f·h_{t-1} + b_f)
o_t = σ(W_o·x_t + U_o·h_{t-1} + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c·x_t + U_c·h_{t-1} + b_c)
h_t = o_t ⊙ tanh(c_t) (formula 3)
Wherein, i_t, f_t, o_t, c_t, h_t denote respectively the input gate, forget gate, output gate, memory cell, and hidden layer output of the LSTM unit at time t; W_*, U_*, b_* are the learned weight matrices and biases of the LSTM; x_t is the word vector input at the current time; h_{t-1} is the hidden layer output of the LSTM unit at the previous time; σ denotes the sigmoid function; ⊙ denotes element-wise multiplication; tanh denotes the hyperbolic tangent activation function. The feature representation of the text S is the hidden layer output of the LSTM network at time T, i.e., h_T.
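A minimal sketch of one LSTM time step of formula 3, using toy dimensions and randomly initialized weights in place of learned parameters (the dictionary names `W`, `U`, `b` are illustrative, not from the patent):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: gates i, f, o, candidate memory, cell c, hidden h."""
    d = len(h_prev)
    def affine(gate):  # W_g x_t + U_g h_{t-1} + b_g for gate g
        Wg, Ug, bg = W[gate], U[gate], b[gate]
        return [sum(Wg[j][k] * x_t[k] for k in range(len(x_t)))
                + sum(Ug[j][k] * h_prev[k] for k in range(d)) + bg[j]
                for j in range(d)]
    i = [sigmoid(v) for v in affine('i')]    # input gate
    f = [sigmoid(v) for v in affine('f')]    # forget gate
    o = [sigmoid(v) for v in affine('o')]    # output gate
    g = [math.tanh(v) for v in affine('c')]  # candidate memory
    c = [f[j] * c_prev[j] + i[j] * g[j] for j in range(d)]  # new cell state
    h = [o[j] * math.tanh(c[j]) for j in range(d)]          # hidden output
    return h, c

random.seed(0)
dx, d = 4, 3  # toy input and hidden sizes (the patent uses 1024 hidden nodes)
W = {k: [[random.uniform(-1, 1) for _ in range(dx)] for _ in range(d)] for k in 'ifoc'}
U = {k: [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)] for k in 'ifoc'}
b = {k: [0.0] * d for k in 'ifoc'}
h, c = [0.0] * d, [0.0] * d
for x_t in [[1, 0, 0, 0], [0, 1, 0, 0]]:  # two toy 1-of-k word vectors
    h, c = lstm_step(x_t, h, c, W, U, b)
print(len(h))  # the final h is the text representation h_T
```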
Fig. 2 shows the network structure of the feature extraction network of the present invention. During training, the parameters of VGGNet are kept fixed, and NIC is pre-trained on the image captioning task using the Flickr30K or MSCOCO training set. Specifically, we first resize all images in the dataset to 256 × 256, then obtain 224 × 224 image blocks by a single center crop, and finally feed them into the feature extraction network to extract image features. For text, we use LSTM and bidirectional LSTM networks to extract text features, where the number of hidden layer nodes of the LSTM unit is 1024.
2. Modality classifier
To further reduce the difference between the feature distributions of different modalities, we design a modality classifier that plays the role of the discriminator in a generative adversarial network. The text space feature label of an image is [0 1] and that of a text is [1 0]; the modality classifier is trained by optimizing a two-class cross-entropy loss function, expressed as formula 4:
L_adv(θ_d) = -(1/N) Σ_{i=1}^{N} y_i·log D(x_i; θ_d) (formula 4)
Wherein, x_i and y_i denote the i-th input text space feature and its corresponding label respectively; N is the total number of currently input feature samples; θ_d denotes the training parameters of the modality classifier; the classifier function D(·; θ_d) predicts the modality of the current text space feature, i.e., text or image; L_adv denotes the two-class cross-entropy loss of the modality classifier, which is also the additional adversarial loss function of the feature mapping network.
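As an illustrative sketch of the two-class cross-entropy of formula 4 (the `toy_classifier` and its weights are invented for the example; only the one-hot label convention comes from the text):

```python
import math

def modality_bce_loss(features, labels, classifier):
    """Two-class cross-entropy loss of the modality classifier (formula 4).
    Labels are one-hot: [1, 0] for text features, [0, 1] for image features."""
    total = 0.0
    for x, y in zip(features, labels):
        p = classifier(x)  # predicted probabilities [P(text), P(image)]
        total -= sum(yj * math.log(max(pj, 1e-12)) for yj, pj in zip(y, p))
    return total / len(features)

def toy_classifier(x):
    """Toy discriminator: softmax over a fixed linear score (illustrative)."""
    z = [sum(x), -sum(x)]
    e = [math.exp(v) for v in z]
    s = sum(e)
    return [v / s for v in e]

feats = [[0.8, 0.1], [-0.5, -0.2]]  # one "text-like", one "image-like" feature
labels = [[1, 0], [0, 1]]           # text label, image label
print(round(modality_bce_loss(feats, labels, toy_classifier), 4))
```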
3. Feature mapping network
The present invention learns a limited text space through the parameters θ_f of the feature mapping network. The feature extraction network has learned the image feature I_Concat, comprising the two parts I_VGG and I_NIC. For these two parts we design two mapping functions f(·) and g(·) in the feature mapping network, used respectively to map I_VGG and I_NIC to the d-dimensional text space features I_VGG_txt and I_NIC_txt. Like I_VGG and I_NIC, I_VGG_txt and I_NIC_txt are also complementary, so we design a feature fusion layer at the top of the feature mapping network to realize the complementary advantages of the two. The process is defined as formula 5:
I_VGG_txt = f(I_VGG), I_NIC_txt = g(I_NIC), I_final = I_VGG_txt + I_NIC_txt (formula 5)
Wherein, I_VGG is the 4096-dimensional image feature extracted by VGGNet, I_NIC is the 512-dimensional image feature extracted by the image captioning algorithm NIC, I_final is the d-dimensional feature representation of the input image in the limited text space, f(·) and g(·) denote the two feature mapping functions, and I_VGG_txt and I_NIC_txt are the mappings of I_VGG and I_NIC to the d-dimensional text space features respectively. It is worth noting that the feature extraction process for text is equivalent to mapping the text into the limited text space; the parameters θ_f of the feature mapping network (see formula 9) therefore include the parameters of the LSTM network.
Parts (b) and (c) of Fig. 1 show the network structures of the feature mapping network and the modality classifier respectively. The feature mapping network comprises the two mapping functions f(·) and g(·), a fusion layer, and an L2 normalization layer (L2 Norm). f(·) comprises two fully connected layers with 2048 and 1024 hidden nodes respectively; ReLU is used as the activation function between the fully connected layers, and a Dropout layer with rate 0.5 is added after the ReLU to prevent over-fitting. g(·) comprises one fully connected layer with 1024 hidden nodes. The fusion layer performs element-wise addition; the L2 normalization layer allows the similarity between the learned features to be measured directly by dot product, which accelerates model convergence and increases training stability.
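A minimal sketch of the mapping pipeline f(·), g(·), fusion, and L2 normalization, with toy dimensions and random weights standing in for the trained fully connected layers (dropout and biases omitted for brevity):

```python
import math
import random

def fc(x, W):
    """Fully connected layer (no bias, for brevity)."""
    return [sum(w_jk * x_k for w_jk, x_k in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def l2_normalize(x):
    """L2 Norm layer: makes dot products equivalent to cosine similarity."""
    n = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / n for v in x]

def f_map(i_vgg, W1, W2):
    """f(.): two FC layers with ReLU in between."""
    return fc(relu(fc(i_vgg, W1)), W2)

def g_map(i_nic, W):
    """g(.): one FC layer."""
    return fc(i_nic, W)

def fuse(a, b):
    """Fusion layer: element-wise addition (formula 5)."""
    return [x + y for x, y in zip(a, b)]

random.seed(1)
rand_mat = lambda rows, cols: [[random.uniform(-0.1, 0.1) for _ in range(cols)]
                               for _ in range(rows)]
d = 8  # toy text-space dimension (the patent uses 1024)
i_vgg = [random.random() for _ in range(16)]
i_nic = [random.random() for _ in range(6)]
i_final = l2_normalize(fuse(f_map(i_vgg, rand_mat(12, 16), rand_mat(d, 12)),
                            g_map(i_nic, rand_mat(d, 6))))
print(len(i_final))  # d-dimensional, unit-norm representation in text space
```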
After images and text are mapped into the limited text space, which is initially in an untrained state, the next step is to compare the similarity between features and compute the corresponding triplet loss. We define a similarity measure function s(v, t) = v·t, where v and t represent the image and text features respectively. To make s equivalent to the cosine distance, v and t must first pass through the L2 normalization layer before comparison. The triplet loss function is widely used in the field of cross-media retrieval. Given an input image (text), let the distance between the input image (text) and its matched text (image) be d_1, and the distance to a mismatched text (image) be d_2; we want d_1 to be smaller than d_2 by at least a margin m. The margin m is a hyper-parameter determined externally; for ease of optimization, we fix m = 0.3 on all datasets. In the present invention, the triplet loss function is therefore expressed as formula 6:
L_emb(θ_f) = Σ_k max(0, m - s(v, t) + s(v, t_k)) + Σ_k max(0, m - s(v, t) + s(v_k, t)) (formula 6)
Wherein, t_k is the k-th mismatched text for the input image v; v_k is the k-th mismatched image for the input text t; m is the minimum distance margin; s(v, t) is the similarity measure function; θ_f denotes the parameters of the feature mapping network. The mismatched samples are randomly drawn from the dataset in each training epoch.
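The bidirectional triplet loss of formula 6 can be sketched as follows for a single matched pair; the toy feature values are invented for illustration, while the margin m = 0.3 and the dot-product similarity come from the text:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triplet_loss(v, t, neg_texts, neg_images, m=0.3):
    """Bidirectional triplet loss (formula 6) for one matched pair (v, t).
    v and t are assumed L2-normalized, so s(v, t) is the dot product."""
    s_pos = dot(v, t)
    loss = sum(max(0.0, m - s_pos + dot(v, t_k)) for t_k in neg_texts)
    loss += sum(max(0.0, m - s_pos + dot(v_k, t)) for v_k in neg_images)
    return loss

v = [1.0, 0.0]        # toy image feature
t = [1.0, 0.0]        # matched text feature (identical, so s_pos = 1)
t_neg = [[0.0, 1.0]]  # one mismatched text (s = 0)
v_neg = [[0.6, 0.8]]  # one mismatched image (s = 0.6)
print(triplet_loss(v, t, t_neg, v_neg))  # max(0, 0.3-1+0) + max(0, 0.3-1+0.6) = 0.0
```

When the positive similarity already exceeds every negative similarity by the margin, both hinge terms vanish; a poorly matched pair yields a positive loss.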
Next, the feature vectors of the different modalities are fed into the modality classifier for classification, yielding the current adversarial loss. In addition to the triplet loss, the adversarial loss L_adv from the modality classifier is also back-propagated synchronously to the feature mapping network. Finally, the limited text space is trained by optimizing a combined loss function of the triplet loss L_emb and the adversarial loss L_adv. Since L_emb and L_adv have opposite optimization objectives, the total loss function L is defined as formula 7:
L = L_emb - λ·L_adv (formula 7)
Wherein, λ is an adaptive parameter whose value ranges from 0 to 1; L_emb denotes the triplet loss function; L_adv is the additional adversarial loss function. To suppress noisy signals from the modality classifier at the beginning of training, the update of the parameter λ is realized by the mathematical expression shown in formula 8, wherein p denotes the percentage of the current iteration count relative to the total iteration count and λ is the adaptive parameter.
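The body of formula 8 is not reproduced in this text. A common choice for such a warm-up schedule, used in domain-adversarial training (DANN) and consistent with the stated 0-to-1 range, is λ(p) = 2/(1 + exp(-10·p)) - 1; the sketch below assumes that form:

```python
import math

def adaptive_lambda(p, gamma=10.0):
    """Assumed warm-up schedule for lambda (the patent omits formula 8's body).
    Rises smoothly from 0 at p = 0 toward 1 as p -> 1, suppressing the noisy
    adversarial signal at the start of training."""
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0

for p in (0.0, 0.5, 1.0):  # p = fraction of total iterations completed
    print(round(adaptive_lambda(p), 4))
```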
Fig. 3 illustrates actual cross-media retrieval results of the present invention on the Flickr8K test set. The first column lists the image and text queries used for retrieval; the second to fourth columns show, for each query, the top-5 retrieval results of LTS-A (VGG+BLSTM), LTS-A (NIC+BLSTM), and LTS-A (VGG+NIC+BLSTM) respectively. For image-to-text retrieval, correctly retrieved texts are shown in red font; for text-to-image retrieval, correctly retrieved images are marked with a check mark. From the left of the table to the right, the retrieval results improve markedly, especially from LTS-A (VGG+BLSTM) to LTS-A (NIC+BLSTM); moreover, even the incorrectly retrieved samples match the query well to some degree.
4. Training method
The training process of the present invention comprises four stages.
Stage one: in the initial training stage, we fix the parameters of VGGNet and pre-train NIC using the Flickr30K training set (30,000 images from Yahoo's photo-sharing website Flickr) or the MSCOCO training set (a dataset created by Microsoft using Amazon's Mechanical Turk service). After this training is complete, image features can be extracted through the feature extraction network.
Stage two: after the features of all images in the dataset have been extracted, the second training stage is mainly used to learn the limited text space. Given the loss function L of the feature mapping network, we fix the parameters θ_d of the modality classifier and update the parameters θ_f of the feature mapping network by the following expression, formula 9:
θ_f ← θ_f - μ·∂L/∂θ_f (formula 9)
Wherein, μ denotes the learning rate of the optimization algorithm, L denotes the total loss function of the feature mapping network, and θ_f denotes the parameters of the feature mapping network.
Stage three: after the second training stage, the third training stage is mainly used to enhance the discriminative power of the modality classifier. Given the loss function L_adv of the modality classifier, we fix the parameters θ_f of the feature mapping network and update the parameters θ_d of the modality classifier by the following expression, formula 10:
θ_d ← θ_d - μ·∂L_adv/∂θ_d (formula 10)
Wherein, μ denotes the learning rate of the optimization algorithm, L_adv denotes the loss function of the modality classifier, and θ_d denotes the parameters of the modality classifier.
Stage four: for each batch of training data, the second and third training stages are repeated alternately until the model converges.
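The alternation of stages two through four can be sketched as follows; `grad_L` and `grad_L_adv` are placeholders for the real back-propagated gradients of formulas 9 and 10, and the flat parameter lists are purely illustrative:

```python
def train(batches, theta_f, theta_d, grad_L, grad_L_adv, mu=0.01, epochs=2):
    """Alternating updates of stages two and three (formulas 9 and 10)."""
    for _ in range(epochs):
        for batch in batches:
            # Stage two: fix theta_d, update the feature mapping network.
            g = grad_L(batch, theta_f, theta_d)
            theta_f = [w - mu * gw for w, gw in zip(theta_f, g)]
            # Stage three: fix theta_f, update the modality classifier.
            g = grad_L_adv(batch, theta_f, theta_d)
            theta_d = [w - mu * gw for w, gw in zip(theta_d, g)]
    return theta_f, theta_d

# Toy gradients that pull every parameter toward zero.
grad = lambda batch, tf, td: tf
grad_adv = lambda batch, tf, td: td
tf, td = train([[0], [1]], [1.0, -1.0], [0.5], grad, grad_adv, mu=0.1, epochs=3)
print(tf, td)  # both parameter sets shrink over the alternating updates
```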
Table 1 gives the experimental results of the present invention for cross-media retrieval on the Flickr8K test set. To evaluate retrieval performance, we follow the standard ranking metrics Recall@K and Median Rank. Recall@K measures retrieval accuracy as the probability that the correctly matched item appears in the top K (K = 1, 5, 10) retrieval results; Median Rank is the median rank of the correctly matched item. Higher Recall@K and lower Median Rank indicate more accurate retrieval. The table compares the present invention with other existing state-of-the-art algorithms, including DeViSE (Deep Visual-Semantic Embedding), m-RNN (Deep Captioning with Multimodal Recurrent Neural Networks), Deep Fragment (Deep Fragment Embedding), DCCA (Deep Canonical Correlation Analysis), VSE (Unifying Visual-Semantic Embedding with Multimodal Neural Language Models), m-CNN_ENS (Multimodal Convolutional Neural Networks, ensemble), NIC (Neural Image Captioning), and HM-LSTM (Hierarchical Multimodal LSTM). In addition, we design four variants based on the above method:
● LTS-A (VGG+LSTM): the image captioning algorithm NIC is removed from image feature extraction; the rest is unchanged;
● LTS-A (NIC+LSTM): the convolutional neural network VGGNet is removed from image feature extraction; the rest is unchanged;
● LTS-A (VGG+NIC+LSTM): the network structure shown in Fig. 2;
● LTS-A (VGG+NIC+BLSTM): the network structure shown in Fig. 2, with the LSTM network replaced by a bidirectional LSTM network (BLSTM).
Table 1. Cross-media retrieval results of the embodiment on the Flickr8K test set.
In Table 1, Img2Txt denotes image-to-text retrieval and Txt2Img denotes text-to-image retrieval. From Table 1 we can see that LTS-A (VGG+NIC+BLSTM) surpasses HM-LSTM on the image-to-text retrieval task and achieves the best retrieval performance to date. However, on the text-to-image retrieval task, LTS-A (VGG+NIC+BLSTM) is not as good as HM-LSTM. The most likely reason is that HM-LSTM uses a tree-structured LSTM architecture that can better model the hierarchical structure of text, whereas the present invention adopts a chain LSTM architecture that cannot capture the hierarchical semantic information in text. In addition, from the changes in results among the four variants of the present invention, it can be seen that when the image feature extraction network changes from VGGNet to NIC, the accuracy of image-to-text retrieval improves by 22% and that of text-to-image retrieval improves by 17%, which indicates that NIC extracts more effective image features than the traditional VGGNet. When the image feature extraction network changes from NIC to VGG+NIC, the accuracy of cross-media retrieval further improves by 6%, demonstrating that the feature extraction network now extracts not only fine-grained object category information in the image but also rich information about interactions between objects. Finally, replacing the LSTM network with a bidirectional LSTM network (BLSTM) brings an additional 2% improvement in retrieval accuracy.
Table 2 shows the cross-media retrieval results of the embodiment on the Flickr30K test set. In addition to the state-of-the-art algorithms mentioned for Flickr8K, we add DAN (Dual Attention Networks), DSPE (Deep Structure-Preserving Image-Text Embeddings), and VSE++ (an improved model of VSE). Here DAN and DSPE achieve the best retrieval results, with DAN performing better than DSPE. Owing to its attention mechanism, DAN can continually attend to fine-grained information in the data, which is highly beneficial to cross-media retrieval; in contrast, we use only global features to represent images and text, and can therefore be disturbed by noisy information in the image or text. Besides DAN, DSPE also performs better than our method, because DSPE uses a more complex text feature (the Fisher Vector) and loss function. As for the four variants of the present invention, their experimental behavior is similar to that on Flickr8K.
Table 2. Cross-media retrieval results of the embodiment on the Flickr30K test set.
Table 3. Cross-media retrieval results of the embodiment on the MSCOCO test set.
Table 3 shows the cross-media retrieval results of the embodiment on the MSCOCO test set. In addition to the state-of-the-art algorithms mentioned for Flickr8K and Flickr30K, we add Order (Order-Embeddings of Images and Language). Here LTS-A (VGG+NIC+LSTM) achieves the best results on the image-to-text retrieval task, improving retrieval accuracy by about 2%, and falls below DSPE only on the R@1 metric; on the text-to-image retrieval task, DSPE outperforms us on Recall@K, but LTS-A (VGG+NIC+LSTM) achieves the best result on the Median Rank metric. This is because the chain LSTM network of the present invention cannot fully capture the hierarchical semantic information in text, so its text representation ability falls short of the Fisher Vector (FV). As for the four variants of the present invention, their experimental behavior is similar to that on Flickr8K and Flickr30K.
Table 4. Cross-media retrieval results of the two variants LTS-A and LTS of the embodiment.
Table 4 illustrates the influence of the adversarial learning mechanism on the experimental results. We design two variants: LTS-A and LTS. LTS-A is the aforementioned LTS-A (VGG+NIC+LSTM); LTS is LTS-A (VGG+NIC+LSTM) with the adversarial learning mechanism removed. From the table we can see that LTS-A clearly improves cross-media retrieval accuracy compared with LTS; LTS exceeds LTS-A only on the R@1 metric of image-to-text retrieval. The experimental results show that adversarial learning is markedly effective in reducing the difference between the feature distributions of different modalities.
Table 5. Retrieval results of the embodiment on the MSCOCO test set.
Table 6 reports retrieval results on the MSCOCO test set when image features are extracted from a single crop versus the mean of ten crops.
In the implementation above, we extract image features using a single crop (1-crop) of the image region. To verify the validity of using the mean feature of ten different image regions (10-crops) as the image feature, we design LTS-A (10-crops), where LTS-A here refers to LTS-A (VGG+NIC+BLSTM) and 10-crops means the image feature is described by the mean feature of ten different regions of the image. As can be seen from Table 6, the retrieval accuracy of LTS-A (10-crops) is clearly better than that of LTS-A (1-crop), which demonstrates the feasibility of using the mean feature of ten different image regions as the image feature.
It should be noted that the purpose of disclosing the embodiments is to aid further understanding of the present invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The present invention should therefore not be limited to the disclosed embodiments, and its scope of protection is defined by the claims.
Claims (7)
1. An adversarial cross-media retrieval method based on a limited text space, which designs a feature extraction network, a feature mapping network, and a modality classifier, learns a limited text space, extracts image and text features suitable for cross-media retrieval, and realizes the mapping of image features from the image space to the text space; the difference in feature distributions between data of different modalities is continually reduced during learning through an adversarial training mechanism, thereby realizing cross-media retrieval; specifically:
A. The feature extraction network comprises an image feature extraction network and a text feature extraction network, used respectively for image feature extraction and text feature extraction; the image feature extraction network learns the image feature I_Concat through one or both of VGGNet and NIC, comprising one or both of the 4096-dimensional feature I_VGG and the image feature I_NIC extracted by the image captioning algorithm; the text feature extraction network uses the long short-term memory recurrent neural network LSTM or the bidirectional LSTM network BLSTM to extract d-dimensional text features;
B. The modality classifier serves as the discriminator in an adversarial network; the training of the modality classifier is realized by optimizing a two-class cross-entropy loss function, which is also the additional adversarial loss function of the feature mapping network;
C. The feature mapping network learns a limited text space through its parameters θ_f; for the image feature I_Concat learned by the feature extraction network, comprising I_VGG and I_NIC, two mapping functions f(·) and g(·) are designed in the feature mapping network, used respectively to map I_VGG and I_NIC to the d-dimensional text space features I_VGG_txt and I_NIC_txt; a feature fusion layer is designed at the top of the feature mapping network to fuse I_VGG_txt and I_NIC_txt into I_final, the d-dimensional feature representation of the input image in the limited text space; the dimension of the limited text space is d;
Suppose the training dataset D = {D_1, D_2, …, D_n} contains n samples, each sample D_i comprising a picture I_i and a descriptive text T_i, i.e., D_i = (I_i, T_i), each text consisting of 5 sentences, each of which independently describes the matched picture; for the dataset D, the following steps 1)-4) are executed to train the feature extraction network, feature mapping network, and modality classifier:
1) Extract the features of the images and texts in D through the feature extraction network: for the images in D, image features are extracted using the VGG model and the image captioning algorithm NIC; for the texts in D, text features are extracted using the long short-term memory recurrent neural network LSTM, which realizes the mapping of text to the feature space; the parameters of the LSTM network must be updated synchronously with the parameters of the feature mapping network;
2) Map the text features and the image features obtained in step 1) through the feature mapping network into the limited text space, which is initially in an untrained state; first compute the distances between feature vectors with the similarity measure function, and compare the similarities between feature vectors to obtain the current triplet loss; then feed the feature vectors of the different modalities into the modality classifier for classification to obtain the current adversarial loss; finally train the limited text space by optimizing the combined loss function of the triplet loss and the adversarial loss;
3) Feed the image and text features obtained in step 2), which lie in the same limited text space, into the modality classifier for classification, and train the modality classifier through the cross-entropy loss;
4) Repeat steps 2)-3) until the feature mapping network converges;
5) For a retrieval request, compute the distance in the limited text space between the request data, an image or a text, and data of the other modality in the dataset D, rank the retrieval results by distance, and thereby obtain the most similar retrieval results; specifically, the distance is computed by the dot product between the feature vectors of the different-modality data in the space;
Through the above steps, adversarial cross-media retrieval based on a limited text space is realized.
2. The adversarial cross-media retrieval method according to claim 1, characterized in that the calculation of image feature extraction is expressed as formula 1:
I_VGG = VGGNet(I), I_NIC = NIC(I), I_Concat = Concatenate(I_VGG, I_NIC) (formula 1)
wherein VGGNet(·) is the 19-layer VGG model, which extracts the 4096-dimensional feature I_VGG of the input image; NIC(·) is the image captioning algorithm, which extracts the 512-dimensional image feature I_NIC; Concatenate(·) is a feature concatenation layer, which joins I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
3. The adversarial cross-media retrieval method according to claim 1, characterized in that text feature extraction specifically executes the following steps:
Given a text S = (s_0, s_1, …, s_T) of length T, each word s_t in S is represented with a 1-of-k encoding, where k is the number of words in the dictionary; before being fed into the LSTM network, the word s_t must first be mapped into a denser space, expressed as formula 2:
x_t = W_e·s_t, t ∈ {0, …, T}, (formula 2)
wherein W_e is a word embedding matrix used to encode the 1-of-k vector s_t into a d-dimensional word vector;
The word vectors in the dense space thus obtained are fed into the LSTM network, expressed as formula 3:
i_t = σ(W_i·x_t + U_i·h_{t-1} + b_i)
f_t = σ(W_f·x_t + U_f·h_{t-1} + b_f)
o_t = σ(W_o·x_t + U_o·h_{t-1} + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c·x_t + U_c·h_{t-1} + b_c)
h_t = o_t ⊙ tanh(c_t) (formula 3)
wherein i_t, f_t, o_t, c_t, h_t denote respectively the input gate, forget gate, output gate, memory cell, and hidden layer output of the LSTM unit at time t; W_*, U_*, b_* are the learned weight matrices and biases of the LSTM; x_t is the word vector input at the current time; h_{t-1} is the hidden layer output of the LSTM unit at the previous time; σ denotes the sigmoid function; ⊙ denotes element-wise multiplication; tanh denotes the hyperbolic tangent activation function; the hidden layer output h_T of the LSTM network at time T is the feature representation of the text S.
4. The adversarial cross-media retrieval method according to claim 1, characterized in that the training of the modality classifier specifically executes the following operations:
The text space feature label of an image is [0 1] and that of a text is [1 0]; the modality classifier is trained by optimizing a two-class cross-entropy loss function, expressed as formula 4:
L_adv(θ_d) = -(1/N) Σ_{i=1}^{N} y_i·log D(x_i; θ_d) (formula 4)
wherein x_i and y_i denote the i-th input text space feature and its corresponding label respectively; N is the total number of currently input feature samples; θ_d denotes the training parameters of the modality classifier; the classifier function D(·; θ_d) predicts the modality of the current text space feature, i.e., text or image; L_adv denotes the two-class cross-entropy loss of the modality classifier, which is also the additional adversarial loss function of the feature mapping network;
The parameters θ_d of the modality classifier are updated by formula 10:
θ_d ← θ_d - μ·∂L_adv/∂θ_d (formula 10)
wherein μ denotes the learning rate of the optimization algorithm, L_adv denotes the two-class cross-entropy loss of the modality classifier, and θ_d denotes the parameters of the modality classifier.
5. The adversarial cross-media retrieval method according to claim 1, characterized in that the feature fusion layer at the top of the feature mapping network processes its inputs according to formula 5:
I_VGG_txt = f(I_VGG), I_NIC_txt = g(I_NIC), I_final = I_VGG_txt + I_NIC_txt (formula 5)
wherein I_VGG is the 4096-dimensional image feature extracted by VGGNet, I_NIC is the 512-dimensional image feature extracted by the image captioning algorithm NIC, I_final is the d-dimensional feature representation of the input image in the limited text space, f(·) and g(·) denote the two feature mapping functions, and I_VGG_txt and I_NIC_txt are the mappings of I_VGG and I_NIC to the d-dimensional text space features respectively.
6. The adversarial cross-media retrieval method according to claim 1, characterized in that step 2) trains the feature mapping network by optimizing the triplet loss function and the adversarial loss function, specifically executing the following operations:
Let the distance between the input image or text and its matched text or image be d_1, and the distance to a mismatched text or image be d_2; d_1 is to be smaller than d_2 by at least a margin m, a hyper-parameter determined externally; the triplet loss function is expressed as formula 6:
L_emb(θ_f) = Σ_k max(0, m - s(v, t) + s(v, t_k)) + Σ_k max(0, m - s(v, t) + s(v_k, t)) (formula 6)
wherein t_k is the k-th mismatched text for the input image v; v_k is the k-th mismatched image for the input text t; m is the minimum distance margin; s(v, t) is the similarity measure function; θ_f denotes the parameters of the feature mapping network; the mismatched samples are randomly drawn from the dataset in each training epoch;
The adversarial loss L_adv of the modality classifier is back-propagated synchronously to the feature mapping network;
The total loss function L is defined as formula 7:
L = L_emb - λ·L_adv (formula 7)
wherein λ is an adaptive parameter whose value ranges from 0 to 1; L_emb denotes the triplet loss function; L_adv is the additional adversarial loss function;
To suppress noisy signals from the modality classifier at the beginning of training, the update of the parameter λ is realized by formula 8, wherein p denotes the percentage of the current iteration count relative to the total iteration count and λ is the adaptive parameter;
The feature mapping network is trained with the above loss function L, and its parameters θ_f are updated by formula 9:
θ_f ← θ_f - μ·∂L/∂θ_f (formula 9)
wherein μ denotes the learning rate of the optimization algorithm, L denotes the total loss function of the feature mapping network, and θ_f denotes the parameters of the feature mapping network.
7. The adversarial cross-media retrieval method according to claim 1, characterized in that the similarity measure function s(v, t) of step 2) is expressed as:
s(v, t) = v·t
wherein v and t represent the image feature and the text feature respectively; before comparison, v and t first pass through the normalization layer to be normalized, so that s is equivalent to the cosine distance.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810101127.0A CN108319686B (en) | 2018-02-01 | 2018-02-01 | Antagonism cross-media retrieval method based on limited text space |
PCT/CN2018/111327 WO2019148898A1 (en) | 2018-02-01 | 2018-10-23 | Adversarial cross-media retrieving method based on restricted text space |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810101127.0A CN108319686B (en) | 2018-02-01 | 2018-02-01 | Antagonism cross-media retrieval method based on limited text space |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108319686A true CN108319686A (en) | 2018-07-24 |
CN108319686B CN108319686B (en) | 2021-07-30 |
Family
ID=62888861
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810101127.0A Expired - Fee Related CN108319686B (en) | 2018-02-01 | 2018-02-01 | Antagonism cross-media retrieval method based on limited text space |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108319686B (en) |
WO (1) | WO2019148898A1 (en) |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344266A (en) * | 2018-06-29 | 2019-02-15 | 北京大学深圳研究生院 | A kind of antagonism cross-media retrieval method based on dual semantics space |
CN109508400A (en) * | 2018-10-09 | 2019-03-22 | 中国科学院自动化研究所 | Picture and text abstraction generating method |
CN109783657A (en) * | 2019-01-07 | 2019-05-21 | 北京大学深圳研究生院 | Multistep based on limited text space is from attention cross-media retrieval method and system |
CN109783655A (en) * | 2018-12-07 | 2019-05-21 | 西安电子科技大学 | A kind of cross-module state search method, device, computer equipment and storage medium |
CN109919162A (en) * | 2019-01-25 | 2019-06-21 | 武汉纺织大学 | For exporting the model and its method for building up of MR image characteristic point description vectors symbol |
CN110059217A (en) * | 2019-04-29 | 2019-07-26 | 广西师范大学 | A kind of image text cross-media retrieval method of two-level network |
WO2019148898A1 (en) * | 2018-02-01 | 2019-08-08 | 北京大学深圳研究生院 | Adversarial cross-media retrieving method based on restricted text space |
CN110175256A (en) * | 2019-05-30 | 2019-08-27 | 上海联影医疗科技有限公司 | A kind of image data retrieval method, apparatus, equipment and storage medium |
CN110189249A (en) * | 2019-05-24 | 2019-08-30 | 深圳市商汤科技有限公司 | A kind of image processing method and device, electronic equipment and storage medium |
CN110502743A (en) * | 2019-07-12 | 2019-11-26 | 北京邮电大学 | Social networks based on confrontation study and semantic similarity is across media search method |
CN110674688A (en) * | 2019-08-19 | 2020-01-10 | 深圳力维智联技术有限公司 | Face recognition model acquisition method, system and medium for video monitoring scene |
CN110866129A (en) * | 2019-11-01 | 2020-03-06 | 中电科大数据研究院有限公司 | Cross-media retrieval method based on cross-media uniform characterization model |
CN111259851A (en) * | 2020-01-23 | 2020-06-09 | 清华大学 | Multi-mode event detection method and device |
CN111651660A (en) * | 2020-05-28 | 2020-09-11 | 拾音智能科技有限公司 | Method for cross-media retrieval of difficult samples |
CN111782921A (en) * | 2020-03-25 | 2020-10-16 | 北京沃东天骏信息技术有限公司 | Method and device for searching target |
CN112182281A (en) * | 2019-07-05 | 2021-01-05 | 腾讯科技(深圳)有限公司 | Audio recommendation method and device and storage medium |
CN112818157A (en) * | 2021-02-10 | 2021-05-18 | 浙江大学 | Combined query image retrieval method based on multi-order confrontation characteristic learning |
CN113094550A (en) * | 2020-01-08 | 2021-07-09 | 百度在线网络技术(北京)有限公司 | Video retrieval method, device, equipment and medium |
CN113159071A (en) * | 2021-04-20 | 2021-07-23 | 复旦大学 | Cross-modal image-text association anomaly detection method |
CN113254678A (en) * | 2021-07-14 | 2021-08-13 | 北京邮电大学 | Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof |
CN113379603A (en) * | 2021-06-10 | 2021-09-10 | 大连海事大学 | Ship target detection method based on deep learning |
CN113946710A (en) * | 2021-10-12 | 2022-01-18 | 浙江大学 | Video retrieval method based on multi-mode and self-supervision characterization learning |
CN115114395A (en) * | 2022-04-15 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Content retrieval and model training method and device, electronic equipment and storage medium |
CN117312592A (en) * | 2023-11-28 | 2023-12-29 | 云南联合视觉科技有限公司 | Text-pedestrian image retrieval method based on modal invariant feature learning |
Families Citing this family (42)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111105013B (en) * | 2019-11-05 | 2023-08-11 | 中国科学院深圳先进技术研究院 | Optimization method of countermeasure network architecture, image description generation method and system |
CN111179254B (en) * | 2019-12-31 | 2023-05-30 | 复旦大学 | Domain adaptive medical image segmentation method based on feature function and countermeasure learning |
CN111198964B (en) * | 2020-01-10 | 2023-04-25 | 中国科学院自动化研究所 | Image retrieval method and system |
CN111259152A (en) * | 2020-01-20 | 2020-06-09 | 刘秀萍 | Deep multilayer network driven feature aggregation category divider |
CN111325319B (en) * | 2020-02-02 | 2023-11-28 | 腾讯云计算(北京)有限责任公司 | Neural network model detection method, device, equipment and storage medium |
CN111353076B (en) * | 2020-02-21 | 2023-10-10 | 华为云计算技术有限公司 | Method for training cross-modal retrieval model, cross-modal retrieval method and related device |
CN111368176B (en) * | 2020-03-02 | 2023-08-18 | 南京财经大学 | Cross-modal hash retrieval method and system based on supervision semantic coupling consistency |
CN111597810B (en) * | 2020-04-13 | 2024-01-05 | 广东工业大学 | Named entity identification method for semi-supervised decoupling |
CN113673635B (en) * | 2020-05-15 | 2023-09-01 | 复旦大学 | Hand-drawn sketch understanding deep learning method based on self-supervision learning task |
CN111651577B (en) * | 2020-06-01 | 2023-04-21 | 全球能源互联网研究院有限公司 | Cross-media data association analysis model training and data association analysis method and system |
CN111708745B (en) * | 2020-06-18 | 2023-04-21 | 全球能源互联网研究院有限公司 | Cross-media data sharing representation method and user behavior analysis method and system |
CN111882032B (en) * | 2020-07-13 | 2023-12-01 | 广东石油化工学院 | Neural semantic memory storage method |
CN111984800B (en) * | 2020-08-16 | 2023-11-17 | 西安电子科技大学 | Hash cross-modal information retrieval method based on dictionary pair learning |
CN112256899B (en) * | 2020-09-23 | 2022-05-10 | 华为技术有限公司 | Image reordering method, related device and computer readable storage medium |
CN112466281A (en) * | 2020-10-13 | 2021-03-09 | 讯飞智元信息科技有限公司 | Harmful audio recognition decoding method and device |
CN112214988B (en) * | 2020-10-14 | 2024-01-23 | 哈尔滨福涛科技有限责任公司 | Deep learning and rule combination-based negotiable article structure analysis method |
CN112396091B (en) * | 2020-10-23 | 2024-02-09 | 西安电子科技大学 | Social media image popularity prediction method, system, storage medium and application |
CN112651448B (en) * | 2020-12-29 | 2023-09-15 | 中山大学 | Multi-mode emotion analysis method for social platform expression package |
CN112949384B (en) * | 2021-01-23 | 2024-03-08 | 西北工业大学 | Remote sensing image scene classification method based on antagonistic feature extraction |
CN112861977B (en) * | 2021-02-19 | 2024-01-26 | 中国人民武装警察部队工程大学 | Migration learning data processing method, system, medium, equipment, terminal and application |
CN113052311B (en) * | 2021-03-16 | 2024-01-19 | 西北工业大学 | Feature extraction network with layer jump structure and method for generating features and descriptors |
CN113420166A (en) * | 2021-03-26 | 2021-09-21 | 阿里巴巴新加坡控股有限公司 | Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment |
CN113537272B (en) * | 2021-03-29 | 2024-03-19 | 之江实验室 | Deep learning-based semi-supervised social network abnormal account detection method |
CN113536013B (en) * | 2021-06-03 | 2024-02-23 | 国家电网有限公司大数据中心 | Cross-media image retrieval method and system |
CN113656616B (en) * | 2021-06-23 | 2024-02-27 | 同济大学 | Three-dimensional model sketch retrieval method based on heterogeneous twin neural network |
CN113360683B (en) * | 2021-06-30 | 2024-04-19 | 北京百度网讯科技有限公司 | Method for training cross-modal retrieval model and cross-modal retrieval method and device |
CN113362416B (en) * | 2021-07-01 | 2024-05-17 | 中国科学技术大学 | Method for generating image based on text of target detection |
CN113610128B (en) * | 2021-07-28 | 2024-02-13 | 西北大学 | Aesthetic attribute retrieval-based picture aesthetic description modeling and describing method and system |
CN114022687B (en) * | 2021-09-24 | 2024-05-10 | 之江实验室 | Image description countermeasure generation method based on reinforcement learning |
CN114022372B (en) * | 2021-10-25 | 2024-04-16 | 大连理工大学 | Mask image patching method for introducing semantic loss context encoder |
CN114241517B (en) * | 2021-12-02 | 2024-02-27 | 河南大学 | Cross-mode pedestrian re-recognition method based on image generation and shared learning network |
CN114298159B (en) * | 2021-12-06 | 2024-04-09 | 湖南工业大学 | Image similarity detection method based on text fusion under label-free sample |
CN114443916B (en) * | 2022-01-25 | 2024-02-06 | 中国人民解放军国防科技大学 | Supply and demand matching method and system for test data |
CN114677569B (en) * | 2022-02-17 | 2024-05-10 | 之江实验室 | Character-image pair generation method and device based on feature decoupling |
CN115129917B (en) * | 2022-06-06 | 2024-04-09 | 武汉大学 | optical-SAR remote sensing image cross-modal retrieval method based on modal common characteristics |
CN115131613B (en) * | 2022-07-01 | 2024-04-02 | 中国科学技术大学 | Small sample image classification method based on multidirectional knowledge migration |
CN115909317A (en) * | 2022-07-15 | 2023-04-04 | 广东工业大学 | Learning method and system for three-dimensional model-text joint expression |
CN115840827B (en) * | 2022-11-07 | 2023-09-19 | 重庆师范大学 | Deep unsupervised cross-modal hash retrieval method |
CN116108215A (en) * | 2023-02-21 | 2023-05-12 | 湖北工业大学 | Cross-modal big data retrieval method and system based on depth fusion |
CN116821408B (en) * | 2023-08-29 | 2023-12-01 | 南京航空航天大学 | Multi-task consistency countermeasure retrieval method and system |
CN116935329B (en) * | 2023-09-19 | 2023-12-01 | 山东大学 | Weak supervision text pedestrian retrieval method and system for class-level comparison learning |
CN117611924B (en) * | 2024-01-17 | 2024-04-09 | 贵州大学 | Plant leaf phenotype disease classification method based on graphic subspace joint learning |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1211769A (en) * | 1997-06-26 | 1999-03-24 | 香港中文大学 | Method and equipment for file retrieval based on Bayesian network |
CN1920818A (en) * | 2006-09-14 | 2007-02-28 | 浙江大学 | Transmedia search method based on multi-mode information convergence analysis |
US20120303628A1 (en) * | 2011-05-24 | 2012-11-29 | Brian Silvola | Partitioned database model to increase the scalability of an information system |
CN103914711A (en) * | 2014-03-26 | 2014-07-09 | 中国科学院计算技术研究所 | Improved top speed learning model and method for classifying modes of improved top speed learning model |
CN105512289A (en) * | 2015-12-07 | 2016-04-20 | 郑州金惠计算机系统工程有限公司 | Image retrieval method based on deep learning and Hash |
CN105718532A (en) * | 2016-01-15 | 2016-06-29 | 北京大学 | Cross-media sequencing method based on multi-depth network structure |
CN106202413A (en) * | 2016-07-11 | 2016-12-07 | 北京大学深圳研究生院 | A kind of cross-media retrieval method |
CN106649715A (en) * | 2016-12-21 | 2017-05-10 | 中国人民解放军国防科学技术大学 | Cross-media retrieval method based on local sensitive hash algorithm and neural network |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104346440B (en) * | 2014-10-10 | 2017-06-23 | 浙江大学 | A kind of across media hash indexing methods based on neutral net |
CN106095893B (en) * | 2016-06-06 | 2018-11-20 | 北京大学深圳研究生院 | A kind of cross-media retrieval method |
CN108319686B (en) * | 2018-02-01 | 2021-07-30 | 北京大学深圳研究生院 | Antagonism cross-media retrieval method based on limited text space |
- 2018
  - 2018-02-01 CN CN201810101127.0A patent/CN108319686B/en not_active Expired - Fee Related
  - 2018-10-23 WO PCT/CN2018/111327 patent/WO2019148898A1/en active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1211769A (en) * | 1997-06-26 | 1999-03-24 | 香港中文大学 | Method and equipment for file retrieval based on Bayesian network |
CN1920818A (en) * | 2006-09-14 | 2007-02-28 | 浙江大学 | Transmedia search method based on multi-mode information convergence analysis |
US20120303628A1 (en) * | 2011-05-24 | 2012-11-29 | Brian Silvola | Partitioned database model to increase the scalability of an information system |
CN103914711A (en) * | 2014-03-26 | 2014-07-09 | 中国科学院计算技术研究所 | Improved top speed learning model and method for classifying modes of improved top speed learning model |
CN105512289A (en) * | 2015-12-07 | 2016-04-20 | 郑州金惠计算机系统工程有限公司 | Image retrieval method based on deep learning and Hash |
CN105718532A (en) * | 2016-01-15 | 2016-06-29 | 北京大学 | Cross-media sequencing method based on multi-depth network structure |
CN106202413A (en) * | 2016-07-11 | 2016-12-07 | 北京大学深圳研究生院 | A kind of cross-media retrieval method |
CN106649715A (en) * | 2016-12-21 | 2017-05-10 | 中国人民解放军国防科学技术大学 | Cross-media retrieval method based on local sensitive hash algorithm and neural network |
Non-Patent Citations (1)
Title |
---|
LI HUI ET AL.: "Design and Implementation of an Efficient Multi-Pattern Matching Algorithm", Journal of Beijing Technology and Business University (Natural Science Edition) * |
Cited By (40)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019148898A1 (en) * | 2018-02-01 | 2019-08-08 | 北京大学深圳研究生院 | Adversarial cross-media retrieving method based on restricted text space |
CN109344266A (en) * | 2018-06-29 | 2019-02-15 | 北京大学深圳研究生院 | A kind of antagonism cross-media retrieval method based on dual semantics space |
CN109344266B (en) * | 2018-06-29 | 2021-08-06 | 北京大学深圳研究生院 | Dual-semantic-space-based antagonistic cross-media retrieval method |
CN109508400A (en) * | 2018-10-09 | 2019-03-22 | 中国科学院自动化研究所 | Picture and text abstraction generating method |
CN109783655B (en) * | 2018-12-07 | 2022-12-30 | 西安电子科技大学 | Cross-modal retrieval method and device, computer equipment and storage medium |
CN109783655A (en) * | 2018-12-07 | 2019-05-21 | 西安电子科技大学 | A kind of cross-module state search method, device, computer equipment and storage medium |
WO2020143137A1 (en) * | 2019-01-07 | 2020-07-16 | 北京大学深圳研究生院 | Multi-step self-attention cross-media retrieval method based on restricted text space and system |
CN109783657B (en) * | 2019-01-07 | 2022-12-30 | 北京大学深圳研究生院 | Multi-step self-attention cross-media retrieval method and system based on limited text space |
CN109783657A (en) * | 2019-01-07 | 2019-05-21 | 北京大学深圳研究生院 | Multistep based on limited text space is from attention cross-media retrieval method and system |
CN109919162A (en) * | 2019-01-25 | 2019-06-21 | 武汉纺织大学 | For exporting the model and its method for building up of MR image characteristic point description vectors symbol |
CN110059217B (en) * | 2019-04-29 | 2022-11-04 | 广西师范大学 | Image text cross-media retrieval method for two-stage network |
CN110059217A (en) * | 2019-04-29 | 2019-07-26 | 广西师范大学 | A kind of image text cross-media retrieval method of two-level network |
CN110189249A (en) * | 2019-05-24 | 2019-08-30 | 深圳市商汤科技有限公司 | A kind of image processing method and device, electronic equipment and storage medium |
CN110189249B (en) * | 2019-05-24 | 2022-02-18 | 深圳市商汤科技有限公司 | Image processing method and device, electronic equipment and storage medium |
CN110175256A (en) * | 2019-05-30 | 2019-08-27 | 上海联影医疗科技有限公司 | A kind of image data retrieval method, apparatus, equipment and storage medium |
CN112182281B (en) * | 2019-07-05 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Audio recommendation method, device and storage medium |
CN112182281A (en) * | 2019-07-05 | 2021-01-05 | 腾讯科技(深圳)有限公司 | Audio recommendation method and device and storage medium |
CN110502743A (en) * | 2019-07-12 | 2019-11-26 | 北京邮电大学 | Social networks based on confrontation study and semantic similarity is across media search method |
CN110674688B (en) * | 2019-08-19 | 2023-10-31 | 深圳力维智联技术有限公司 | Face recognition model acquisition method, system and medium for video monitoring scene |
CN110674688A (en) * | 2019-08-19 | 2020-01-10 | 深圳力维智联技术有限公司 | Face recognition model acquisition method, system and medium for video monitoring scene |
CN110866129A (en) * | 2019-11-01 | 2020-03-06 | 中电科大数据研究院有限公司 | Cross-media retrieval method based on cross-media uniform characterization model |
CN113094550A (en) * | 2020-01-08 | 2021-07-09 | 百度在线网络技术(北京)有限公司 | Video retrieval method, device, equipment and medium |
CN113094550B (en) * | 2020-01-08 | 2023-10-24 | 百度在线网络技术(北京)有限公司 | Video retrieval method, device, equipment and medium |
CN111259851A (en) * | 2020-01-23 | 2020-06-09 | 清华大学 | Multi-mode event detection method and device |
CN111782921A (en) * | 2020-03-25 | 2020-10-16 | 北京沃东天骏信息技术有限公司 | Method and device for searching target |
WO2021190115A1 (en) * | 2020-03-25 | 2021-09-30 | 北京沃东天骏信息技术有限公司 | Method and apparatus for searching for target |
CN111651660B (en) * | 2020-05-28 | 2023-05-02 | 拾音智能科技有限公司 | Method for cross-media retrieval of difficult samples |
CN111651660A (en) * | 2020-05-28 | 2020-09-11 | 拾音智能科技有限公司 | Method for cross-media retrieval of difficult samples |
CN112818157A (en) * | 2021-02-10 | 2021-05-18 | 浙江大学 | Combined query image retrieval method based on multi-order confrontation characteristic learning |
CN113159071B (en) * | 2021-04-20 | 2022-06-21 | 复旦大学 | Cross-modal image-text association anomaly detection method |
CN113159071A (en) * | 2021-04-20 | 2021-07-23 | 复旦大学 | Cross-modal image-text association anomaly detection method |
CN113379603B (en) * | 2021-06-10 | 2024-03-15 | 大连海事大学 | Ship target detection method based on deep learning |
CN113379603A (en) * | 2021-06-10 | 2021-09-10 | 大连海事大学 | Ship target detection method based on deep learning |
CN113254678A (en) * | 2021-07-14 | 2021-08-13 | 北京邮电大学 | Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof |
CN113254678B (en) * | 2021-07-14 | 2021-10-01 | 北京邮电大学 | Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof |
CN113946710A (en) * | 2021-10-12 | 2022-01-18 | 浙江大学 | Video retrieval method based on multi-mode and self-supervision characterization learning |
CN115114395A (en) * | 2022-04-15 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Content retrieval and model training method and device, electronic equipment and storage medium |
CN115114395B (en) * | 2022-04-15 | 2024-03-19 | 腾讯科技(深圳)有限公司 | Content retrieval and model training method and device, electronic equipment and storage medium |
CN117312592A (en) * | 2023-11-28 | 2023-12-29 | 云南联合视觉科技有限公司 | Text-pedestrian image retrieval method based on modal invariant feature learning |
CN117312592B (en) * | 2023-11-28 | 2024-02-09 | 云南联合视觉科技有限公司 | Text-pedestrian image retrieval method based on modal invariant feature learning |
Also Published As
Publication number | Publication date |
---|---|
CN108319686B (en) | 2021-07-30 |
WO2019148898A1 (en) | 2019-08-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108319686A (en) | Antagonism cross-media retrieval method based on limited text space | |
Spinde et al. | Automated identification of bias inducing words in news articles using linguistic and context-oriented features | |
CN109753566A (en) | The model training method of cross-cutting sentiment analysis based on convolutional neural networks | |
CN103229168B (en) | The method and system that evidence spreads between multiple candidate answers during question and answer | |
CN107076567A (en) | Multilingual image question and answer | |
US20160350288A1 (en) | Multilingual embeddings for natural language processing | |
CN110390018A (en) | A kind of social networks comment generation method based on LSTM | |
CN107688870B (en) | Text stream input-based hierarchical factor visualization analysis method and device for deep neural network | |
CN109614487A (en) | A method of the emotional semantic classification based on tensor amalgamation mode | |
CN113254678B (en) | Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof | |
Barua et al. | F-NAD: an application for fake news article detection using machine learning techniques | |
Liu et al. | Learning to predict population-level label distributions | |
CN109101490B (en) | Factual implicit emotion recognition method and system based on fusion feature representation | |
CN112597302B (en) | False comment detection method based on multi-dimensional comment representation | |
CN108763211A (en) | The automaticabstracting and system of knowledge are contained in fusion | |
CN113722474A (en) | Text classification method, device, equipment and storage medium | |
CN111639176A (en) | Real-time event summarization method based on consistency monitoring | |
Nasrullah et al. | Detection of types of mental illness through the social network using ensembled deep learning model | |
CN113821587B (en) | Text relevance determining method, model training method, device and storage medium | |
CN113934835A (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation | |
Wang et al. | A meta-learning based stress category detection framework on social media | |
Yoon et al. | Image classification and captioning model considering a CAM‐based disagreement loss | |
Wijaya et al. | Hate Speech Detection Using Convolutional Neural Network and Gated Recurrent Unit with FastText Feature Expansion on Twitter | |
CN109325096A (en) | A kind of knowledge resource search system of knowledge based resource classification | |
Oak et al. | Generating clinically relevant texts: A case study on life-changing events |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20210730 |