CN108319686B - Adversarial cross-media retrieval method based on restricted text space - Google Patents

Adversarial cross-media retrieval method based on restricted text space

Info

Publication number
CN108319686B
CN108319686B
Authority
CN
China
Prior art keywords
text
feature
image
network
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810101127.0A
Other languages
Chinese (zh)
Other versions
CN108319686A (en)
Inventor
王文敏
余政
王荣刚
李革
王振宇
赵辉
高文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN201810101127.0A priority Critical patent/CN108319686B/en
Publication of CN108319686A publication Critical patent/CN108319686A/en
Priority to PCT/CN2018/111327 priority patent/WO2019148898A1/en
Application granted granted Critical
Publication of CN108319686B publication Critical patent/CN108319686B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Library & Information Science (AREA)
  • Fuzzy Systems (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an adversarial cross-media retrieval method based on a restricted text space. The method designs a feature extraction network, a feature mapping network and a modality classifier, learns the restricted text space, extracts image and text features suited to cross-media retrieval, and maps image features from the image space to the text space; an adversarial training mechanism continuously reduces the difference in feature distribution between data of different modalities during learning, thereby enabling cross-media retrieval. The invention better fits human behaviour in the cross-media retrieval task; it obtains image and text features better suited to cross-media retrieval, compensating for the limited expressive power of pre-trained features; and it introduces an adversarial learning mechanism in which a minimax game between the modality classifier and the feature mapping network further improves retrieval accuracy.

Description

Adversarial cross-media retrieval method based on restricted text space
Technical Field
The invention relates to the technical field of computer vision, and in particular to an adversarial cross-media retrieval method based on a restricted text space.
Background
With the advent of the Web 2.0 era, a large amount of multimedia data (images, text, video, audio, etc.) has accumulated and spread over the Internet. Unlike traditional single-modality retrieval tasks, cross-media retrieval enables two-way retrieval between data of different modalities, such as retrieving images with text and retrieving text with images. However, owing to the inherently heterogeneous nature of multimedia data, their similarity cannot be measured directly. The core problem of this kind of task is therefore how to find a homogeneous mapping space in which the similarity between heterogeneous multimedia data can be measured directly. A great deal of research in the cross-media retrieval field has addressed this problem, and a series of typical cross-media retrieval algorithms has been proposed, such as CCA (Canonical Correlation Analysis), DeViSE (Deep Visual-Semantic Embedding) and DSPE (Deep Structure-Preserving Image-Text Embedding). However, these methods still have certain drawbacks.
The first drawback concerns the feature representation of multimedia data. Most existing methods use a pre-trained CNN (Convolutional Neural Network) model to extract image features, such as the network structure proposed by VGG (Visual Geometry Group). However, these models are usually pre-trained on image classification, so the extracted image features contain only the category information of the objects and lose information that may be important for cross-media retrieval, such as the behaviour and motion of objects and the interactions between them. For text, Word2Vec, LDA (Latent Dirichlet Allocation) and FV (Fisher Vector) are mainstream feature extraction methods, but they too are pre-trained on datasets other than cross-media retrieval, so the extracted features are not well suited to cross-media retrieval.
The second drawback concerns the choice of homogeneous feature space. There are roughly three choices of homogeneous space: public space, text space and image space. From a human cognitive perspective, the brain does not understand text and images in the same way. For text, the brain can extract features and understand directly; for an image, the brain subconsciously describes it in text before understanding it, i.e. it first converts from the image space to the text space. Cross-media retrieval in a text space therefore better simulates human cognition. Existing cross-media retrieval methods based on a text space mostly adopt the Word2Vec space as the final text space, where the feature representation of an image is obtained by combining the category information of the objects in the image. Such features again lose the rich action and interaction information contained in the image, which shows that the Word2Vec space is not an effective text feature space for cross-media retrieval.
The third drawback concerns the difference in feature distribution between modalities. Although existing methods map the features of different-modality data into some homogeneous feature space, a modality gap still exists between them and their feature distributions still differ significantly, which degrades cross-media retrieval performance.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides an adversarial cross-media retrieval method based on a restricted text space. First, image and text feature descriptions suited to the cross-media retrieval task are obtained through learning; second, a restricted text space is found by simulating the human cognitive mode and is used to measure the similarity between images and texts. The method also introduces an adversarial training mechanism that reduces the difference in feature distribution between data of different modalities while the text space is being learned, further increasing retrieval accuracy.
The principle of the invention is as follows. As described in the Background, the core problem of cross-media retrieval is how to find a homogeneous mapping space in which the similarity between heterogeneous multimedia data can be measured directly. More precisely, this core problem can be divided into two sub-problems. The first is how to learn an effective feature representation of multimedia data. The second is how to find a suitable homogeneous feature space. The invention provides a cross-media retrieval method based on a restricted text space. For the first sub-problem, the invention uses a feature extraction network to learn effective image and text feature representations. Building on the image description (image captioning) task, the invention learns a new image feature by combining a CNN (Convolutional Neural Network) with an image description algorithm; this feature contains not only the category information of the objects in the image but also rich information about the interactions between them. For text features, a Recurrent Neural Network (RNN) trained from scratch is applied to the cross-media retrieval task. For the second sub-problem, the method learns a restricted text space with a feature mapping network; to further reduce the difference between features of different modalities, the invention designs a modality classifier that plays a minimax game with the feature mapping network. Specifically, the modality classifier tries to distinguish the modality of the current restricted-text-space feature, while the feature mapping network tries to learn modality-invariant features and thereby confuse the classifier. In addition to the conventional triplet loss, an extra adversarial loss is back-propagated from the modality classifier to the feature mapping network during training to further reduce the difference between features of different modalities. "Restricted text space" means that the text space learned by this method is made up of a series of basis vectors that can be thought of as different words in a dictionary; the expressive power of this text space is therefore limited by the number of words in the dictionary, and hence restricted. The method mainly learns the restricted text space and uses it to measure the similarity between images and texts. Based on the restricted text space, the method extracts image and text features suited to cross-media retrieval by simulating the human cognitive mode, maps image features from the image space to the text space, and introduces an adversarial training mechanism that continuously reduces the difference in feature distribution between data of different modalities during learning. The method obtains more accurate retrieval results on classical cross-media retrieval datasets.
The technical scheme provided by the invention is as follows:
An adversarial cross-media retrieval method based on a restricted text space uses a feature extraction network, a feature mapping network and a modality classifier to learn the restricted text space, extract image and text features suited to cross-media retrieval, and map image features from the image space to the text space; an adversarial training mechanism continuously reduces the difference in feature distribution between data of different modalities during learning. The feature extraction network, feature mapping network and modality classifier are first trained with a dataset D, and the trained networks are then used to perform adversarial cross-media retrieval on retrieval-request data. The specific steps are as follows:
Assume a training dataset D = {D_1, D_2, …, D_n} with n samples, where each sample D_i comprises a picture I_i and a piece of descriptive text T_i, i.e. D_i = (I_i, T_i). Each piece of text consists of several (5) sentences, each of which independently describes the matching picture; each image therefore has 5 descriptive sentences with similar but different meanings.
1) Extract the features of the images and texts in D through the feature extraction network.
For an image, image features are extracted by combining the existing VGG model with an image description algorithm (NIC); for text, an LSTM (Long Short-Term Memory) network is used to extract text features. Since the LSTM network is not pre-trained, its parameters are updated synchronously with the parameters of the feature mapping network.
The calculation process of image feature extraction is expressed as Equation 1:

I_VGG = VGGNet(I),  I_NIC = NIC(I),  I_Concat = Concatenate(I_VGG, I_NIC)    (Equation 1)

where VGGNet(·) is a 19-layer VGG model that extracts the 4096-dimensional feature I_VGG of the input image I; NIC(·) is an image description algorithm that extracts the 512-dimensional image feature I_NIC; and Concatenate(·) is a feature connection layer that concatenates I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
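By way of illustration only, a minimal sketch of this combined extraction is given below in PyTorch (an assumption; the invention does not prescribe a framework), where ImageFeatureExtractor and nic_image_embedding are hypothetical names standing in for the pre-trained VGG-19 and NIC image-embedding components described above:

import torch
import torch.nn as nn
from torchvision import models

class ImageFeatureExtractor(nn.Module):
    def __init__(self, nic_image_embedding: nn.Module):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.vgg_conv = vgg.features                  # convolutional part of VGG-19
        self.vgg_pool = vgg.avgpool
        # drop the final 1000-way layer to obtain the 4096-dimensional fc feature I_VGG
        self.vgg_fc = nn.Sequential(*list(vgg.classifier.children())[:-1])
        # image-embedding layer of a pre-trained NIC-style captioning model,
        # assumed to output a 512-dimensional vector (placeholder module)
        self.nic_embed = nic_image_embedding

    def forward(self, images):                         # images: (B, 3, 224, 224)
        x = self.vgg_pool(self.vgg_conv(images)).flatten(1)
        i_vgg = self.vgg_fc(x)                         # I_VGG: (B, 4096)
        i_nic = self.nic_embed(images)                 # I_NIC: (B, 512)
        return torch.cat([i_vgg, i_nic], dim=1)        # I_Concat: (B, 4608)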
The text feature extraction specifically executes the following steps:
Given a text S = (s_0, s_1, …, s_T) of length T, each word s_t in S is represented with 1-of-k encoding, where k is the number of words in the dictionary. Before entering the LSTM network, the word s_t first needs to be mapped to a denser space, as expressed by Equation 2:

x_t = W_e · s_t,  t ∈ {0, …, T}    (Equation 2)

where W_e is a word-vector mapping matrix that encodes the 1-of-k vector s_t into a d-dimensional word vector;
the resulting dense-space word vectors are fed into the LSTM network, represented as equation 3:
i_t = σ(W_ix x_t + W_ih h_{t−1} + b_i)
f_t = σ(W_fx x_t + W_fh h_{t−1} + b_f)
o_t = σ(W_ox x_t + W_oh h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_cx x_t + W_ch h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t)    (Equation 3)

where i_t, f_t, o_t, c_t and h_t denote, respectively, the outputs of the input gate, forget gate, output gate, memory cell and hidden layer of the LSTM unit at time t; x_t is the word-vector input at the current time; h_{t−1} is the hidden-layer input of the LSTM unit at the previous time; σ denotes the sigmoid function; ⊙ denotes element-wise multiplication; tanh denotes the hyperbolic tangent activation function. The hidden-layer output h_T of the LSTM network at time T is the feature representation of the text S.
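By way of illustration only, a minimal PyTorch-style sketch of the text encoder of Equations 2-3 is given below (an assumption; the class name TextEncoder and the use of packed sequences are illustrative choices, not part of the invention):

import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size: int, d: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)      # word-vector mapping W_e (Equation 2)
        self.lstm = nn.LSTM(d, d, batch_first=True)   # LSTM unit (Equation 3)

    def forward(self, word_ids, lengths):
        # word_ids: (B, T) integer word indices; lengths: true sentence lengths
        x = self.embed(word_ids)                      # (B, T, d) dense word vectors x_t
        packed = nn.utils.rnn.pack_padded_sequence(
            x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, (h, _) = self.lstm(packed)
        return h[-1]                                  # h_T: (B, d) text feature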
2) A feature fusion layer is designed at the top of the feature mapping network to fuse I_VGG_txt and I_NIC_txt into I_final, the d-dimensional feature representation of the input image in the restricted text space; d is the dimension of the restricted text space. The feature mapping network maps the text and image features obtained in step 1) into the restricted text space in its initial state; the similarity between the feature vectors is then compared with a similarity metric function (i.e. the distance between the two vectors is calculated) to obtain the current triplet loss. Next, the feature vectors of the different-modality data are sent to the modality classifier for classification to obtain the current adversarial loss, and finally the restricted text space is trained by optimizing a combined loss function of the triplet loss and the adversarial loss.
The text features are not sent to the feature mapping network here, because the feature extraction network (the LSTM network) already maps the text into the feature space during feature extraction;
The feature fusion layer at the top of the feature mapping network is obtained by the processing of Equation 5:

I_VGG_txt = f(I_VGG),  I_NIC_txt = g(I_NIC),  I_final = I_VGG_txt + I_NIC_txt    (Equation 5)

where I_VGG is the 4096-dimensional image feature extracted by VGGNet, I_NIC is the 512-dimensional image feature extracted by the image description algorithm NIC, I_final is the d-dimensional feature representation of the input image in the restricted text space, f(·) and g(·) are two feature mapping functions, and I_VGG_txt and I_NIC_txt are the d-dimensional text-space feature mappings of I_VGG and I_NIC, respectively.
The similarity metric function is expressed as s(v, t) = v · t, where v and t represent the image feature and the text feature, respectively; v and t are normalized by an L2 normalization layer before comparison, so that s is equivalent to the cosine similarity.
The feature mapping network is trained by optimizing the triplet loss function and the adversarial loss function; specifically, the following operations are performed:
Let the distance between an input image (or text) and its matching text (or image) be d_1, and the distance to an unmatched text (or image) be d_2; d_1 should be smaller than d_2 by at least a margin m, where the margin m is an externally determined hyper-parameter. The triplet loss function is given by Equation 6:

L_emb(θ_f) = Σ_{v} Σ_{k} max(0, m − s(v, t) + s(v, t_k)) + Σ_{t} Σ_{k} max(0, m − s(v, t) + s(v_k, t))    (Equation 6)

where t_k is the k-th unmatched text of the input image v; v_k is the k-th unmatched image of the input text t; m is the minimum distance margin; s(v, t) is the similarity metric function; and θ_f denotes the parameters of the feature mapping network. The unmatched samples are randomly selected from the dataset in each training period;
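By way of illustration only, a minimal PyTorch-style sketch of this bidirectional triplet loss over a batch is given below (an assumption; here the unmatched texts and images are simply the other items of the same batch, which is one possible realization of the random sampling described above):

import torch
import torch.nn.functional as F

def triplet_loss(v, t, m: float = 0.3):
    # v, t: (B, d) image / text features already mapped into the restricted text space
    v = F.normalize(v, dim=1)
    t = F.normalize(t, dim=1)
    s = v @ t.t()                                     # s[i, j] = s(v_i, t_j)
    pos = s.diag().unsqueeze(1)                       # matched pairs s(v_i, t_i)
    mask = torch.eye(s.size(0), dtype=torch.bool, device=s.device)
    cost_v = (m - pos + s).clamp(min=0).masked_fill(mask, 0)      # image vs. unmatched texts
    cost_t = (m - pos.t() + s).clamp(min=0).masked_fill(mask, 0)  # text vs. unmatched images
    return cost_v.sum() + cost_t.sum()                # L_emb (Equation 6)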
The adversarial loss L_adv from the modality classifier is synchronously back-propagated to the feature mapping network;
The overall loss function L is defined as Equation 7:

L = L_emb − λ · L_adv    (Equation 7)

where λ is an adaptive parameter whose value ranges from 0 to 1; L_emb is the triplet loss function; and L_adv is the additional adversarial loss function;
To suppress the noise signal of the modality classifier in the initial stage of training, the parameter λ is updated according to Equation 8:

λ = 2 / (1 + exp(−10·p)) − 1    (Equation 8)

where p is the fraction of the current iteration count over the total iteration count and λ is the adaptive parameter;
The feature mapping network is trained with the loss function L, and its parameters θ_f are updated according to Equation 9:

θ_f ← θ_f − μ · ∂L/∂θ_f    (Equation 9)

where μ is the learning rate of the optimization algorithm, L is the total loss function of the feature mapping network, and θ_f denotes the parameters of the feature mapping network.
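By way of illustration only, a minimal PyTorch-style sketch of one update of θ_f with the joint loss is given below (an assumption; the 2/(1+exp(−10p))−1 schedule for λ is the common DANN-style schedule matching the description of Equation 8, and optimizer_f is a hypothetical optimizer over the feature-mapping and LSTM parameters only):

import math

def lambda_schedule(p: float) -> float:
    # p: fraction of completed iterations in [0, 1]; lambda grows from 0 towards 1
    return 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0

def update_feature_mapping(optimizer_f, l_emb, l_adv, p):
    loss = l_emb - lambda_schedule(p) * l_adv         # L = L_emb - lambda * L_adv (Equation 7)
    optimizer_f.zero_grad()
    loss.backward()
    optimizer_f.step()                                # gradient step on theta_f (Equation 9)
    return loss.item()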
3) The image and text features obtained in step 2), which now lie in the same restricted text space, are sent to the modality classifier for classification, and the modality classifier is trained with a cross-entropy loss; specifically, the following operations are performed:
Given that the text-space feature label of an image is [0 1] and that of a text is [1 0], the modality classifier is trained by optimizing a two-class cross-entropy loss function, expressed as Equation 4:

L_adv(θ_d) = −(1/n) Σ_{i=1}^{n} y_i · log D(x_i; θ_d)    (Equation 4)

where x_i is the i-th input text-space feature and y_i is its corresponding label; n is the total number of feature samples currently input; θ_d denotes the training parameters of the modality classifier; D(·; θ_d) is the function that predicts the modality of the current text-space feature, i.e. text or image; and L_adv is both the two-class cross-entropy loss function of the modality classifier and the additional adversarial loss function of the feature mapping network;
The parameters θ_d of the modality classifier are updated according to Equation 10:

θ_d ← θ_d − μ · ∂L_adv/∂θ_d    (Equation 10)

where μ is the learning rate of the optimization algorithm, L_adv is the loss function of the modality classifier, and θ_d denotes the parameters of the modality classifier.
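By way of illustration only, a minimal PyTorch-style sketch of the modality classifier and one step of Equations 4 and 10 is given below (an assumption; the two-layer architecture and hidden width are hypothetical, since the invention only fixes the two-way output):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityClassifier(nn.Module):
    def __init__(self, d: int = 1024, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, features):            # features: (B, d) restricted-text-space vectors
        return self.net(features)           # logits over {text = [1 0], image = [0 1]}

def update_classifier(classifier, optimizer_d, img_feat, txt_feat):
    feats = torch.cat([img_feat.detach(), txt_feat.detach()], dim=0)
    labels = torch.cat([torch.ones(len(img_feat), dtype=torch.long),    # image -> class 1
                        torch.zeros(len(txt_feat), dtype=torch.long)])  # text  -> class 0
    l_adv = F.cross_entropy(classifier(feats), labels.to(feats.device)) # Equation 4
    optimizer_d.zero_grad()
    l_adv.backward()
    optimizer_d.step()                                                  # Equation 10
    return l_adv.item()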
4) Repeat steps 2) and 3) until the feature mapping network converges;
5) According to the retrieval request, calculate the distance in the restricted text space between the request data (image or text) and the data of the other modality in dataset D, and rank the retrieval results by distance to obtain the most similar results. The distance is calculated as the dot product between the feature vectors of the different-modality data in the restricted text space.
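By way of illustration only, a minimal PyTorch-style sketch of this ranking step is given below (an assumption; rank_candidates is a hypothetical helper operating on features already mapped into the restricted text space):

import torch
import torch.nn.functional as F

def rank_candidates(query_feat, candidate_feats, top_k: int = 5):
    # query_feat: (d,) feature of the query image or text
    # candidate_feats: (N, d) features of all data of the other modality in D
    q = F.normalize(query_feat, dim=0)
    c = F.normalize(candidate_feats, dim=1)
    scores = c @ q                               # dot product = cosine similarity
    return torch.topk(scores, k=top_k).indices   # indices of the most similar items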
Through the above steps, adversarial cross-media retrieval based on the restricted text space is realized.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a antagonism cross-media retrieval method based on a limited text space, which mainly obtains the limited text space through learning and realizes similarity measurement between images and texts. The method is based on a limited text space, extracts the image and text features suitable for cross-media retrieval by simulating the cognitive mode of human, realizes the mapping of the image features from the image space to the text space, introduces a antagonism training mechanism and aims to continuously reduce the difference of feature distribution among different modal data in the learning process. The method obtains more accurate retrieval results in the cross-media retrieval classical data set. Specifically, the invention uses the feature extraction network to learn to obtain effective image and text feature representation, and the image features are further sent into the feature mapping network to realize the mapping from the image space to the text space. Finally, in order to further reduce the difference of feature distribution among different modal data, the antagonism loss generated by the modal classifier is reversely propagated to the feature mapping network, so that the retrieval result is further improved. Specifically, the present invention has the following technical advantages:
the invention is directed to cross-media retrieval in a restricted text space by way of simulating human cognition. Compared with the existing method based on public space or image space, the method can better fit the human behavior in the cross-media retrieval task;
(2) The feature extraction network learns image and text features better suited to the cross-media retrieval task, compensating for the limited expressive power of pre-trained features;
(3) To further reduce the difference in feature distribution between different-modality data, the invention introduces an adversarial learning mechanism and further improves retrieval accuracy through the minimax game between the modality classifier and the feature mapping network.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention;
wherein (a) shows that the invention comprises three parts: a feature extraction network, a feature mapping network and a modality classifier; (b) and (c) are the network structure diagrams of the feature mapping network and the modality classifier, respectively.
FIG. 2 is a schematic diagram of a network architecture of the feature extraction network of the present invention;
the method comprises the following steps that (a) an image feature extraction network is used for extracting image features through combination of a 19-layer VGG model VGGNet and an image description algorithm NIC; (b) is a recurrent neural network (LSTM) for extracting text features.
FIG. 3 is a cross-media retrieval effect screenshot implemented on a Flickr8K test data set according to an embodiment of the present invention.
Detailed Description
The invention is further described below by way of examples with reference to the accompanying drawings, without in any way limiting the scope of the invention.
The invention provides an adversarial cross-media retrieval method based on a restricted text space, which mainly learns the restricted text space and uses it to measure the similarity between images and texts. Based on the restricted text space, the method extracts image and text features suited to cross-media retrieval by simulating the human cognitive mode, maps image features from the image space to the text space, and introduces an adversarial training mechanism that continuously reduces the difference in feature distribution between data of different modalities during learning. The feature extraction network, the feature mapping network, the modality classifier and their implementation, and the training procedure of the networks are described in detail below.
1. Feature extraction network
The feature extraction network comprises two branches, an image feature extraction network and a text feature extraction network, which perform feature extraction for images and texts respectively.
1) Image feature extraction. The image feature extraction network learns the image feature I_Concat, which comprises the 4096-dimensional feature I_VGG and the image feature I_NIC extracted by the image description algorithm.
The image feature extraction network can be regarded as a combination of VGGNet (the 19-layer VGG model, a neural network structure proposed by the Visual Geometry Group) and NIC (Neural Image Caption, a neural-network-based image description algorithm). VGGNet is pre-trained on an image classification task and extracts image features containing rich object category information; in contrast, NIC is pre-trained on an image description task and extracts image features containing rich information about the interactions between objects. The image features extracted by the two are therefore complementary.
Specifically, after an image of size 224 × 224 is fed into VGGNet, the network outputs the 4096-dimensional feature I_VGG. Meanwhile, to avoid losing image information in the translation process, the output of the Image Embedding Layer in the NIC is taken as the image feature I_NIC extracted by the image description algorithm. Finally, the image feature I_Concat is the combination of I_VGG and I_NIC. The calculation procedure is expressed as Equation 1:

I_VGG = VGGNet(I),  I_NIC = NIC(I),  I_Concat = Concatenate(I_VGG, I_NIC)    (Equation 1)

where VGGNet(·) is the 19-layer VGG model that extracts the 4096-dimensional feature I_VGG of the input image I; NIC(·) is the image description algorithm that extracts the 512-dimensional image feature I_NIC; and Concatenate(·) is the feature connection layer that concatenates I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
2) Text feature extraction
The text feature extraction network extracts d-dimensional text features using a long short-term memory recurrent neural network (LSTM); d is also the dimension of the restricted text space. Given a text S = (s_0, s_1, …, s_T) of length T, each word s_t in S is represented with 1-of-k encoding, where k is the number of words in the dictionary. Before entering the LSTM network, the word s_t first needs to be mapped to a denser space:

x_t = W_e · s_t,  t ∈ {0, …, T}    (Equation 2)

where W_e is a word-vector mapping matrix that encodes the 1-of-k vector s_t into a d-dimensional word vector. The resulting dense-space word vectors are then fed into the LSTM network, expressed as Equation 3:

i_t = σ(W_ix x_t + W_ih h_{t−1} + b_i)
f_t = σ(W_fx x_t + W_fh h_{t−1} + b_f)
o_t = σ(W_ox x_t + W_oh h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_cx x_t + W_ch h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t)    (Equation 3)

where i_t, f_t, o_t, c_t and h_t denote, respectively, the outputs of the input gate, forget gate, output gate, memory cell and hidden layer of the LSTM unit at time t; x_t is the word-vector input at the current time; h_{t−1} is the hidden-layer input of the LSTM unit at the previous time; σ denotes the sigmoid function; ⊙ denotes element-wise multiplication; tanh denotes the hyperbolic tangent activation function. The feature representation of the text S is the hidden-layer output of the LSTM network at time T, namely h_T.
FIG. 2 shows the network architecture of the feature extraction network of the invention. During training, the parameters of VGGNet are kept fixed, and the NIC is pre-trained on the image description task using the Flickr30K or MSCOCO training dataset. Specifically, all images in the dataset are first resized to 256 × 256, an image block of size 224 × 224 is then obtained by a single central-area crop, and the image block is finally fed into the feature extraction network to extract image features. For text, LSTM and bidirectional LSTM networks are used to extract text features, where the number of hidden-layer nodes of the LSTM unit is 1024.
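By way of illustration only, a torchvision-style sketch of the preprocessing just described is given below (an assumption; the ImageNet normalization statistics are those conventionally used with VGGNet and are not stated in the invention):

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),                       # resize all images to 256 x 256
    transforms.CenterCrop(224),                          # single central 224 x 224 crop
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])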
2. Modal classifier
To further reduce the difference between the feature distributions of different modalities, a modality classifier is designed that acts as the discriminator in a generative adversarial network. Given that the text-space feature label of an image is [0 1] and that of a text is [1 0], the modality classifier is trained by optimizing a two-class cross-entropy loss function, expressed as Equation 4:

L_adv(θ_d) = −(1/n) Σ_{i=1}^{n} y_i · log D(x_i; θ_d)    (Equation 4)

where x_i is the i-th input text-space feature and y_i is its corresponding label; n is the total number of feature samples currently input; θ_d denotes the training parameters of the modality classifier; D(·; θ_d) is the function that predicts the modality of the current text-space feature, i.e. text or image; and L_adv is both the two-class cross-entropy loss function of the modality classifier and the additional adversarial loss function of the feature mapping network.
3. Feature mapping network
The invention learns the restricted text space through the parameters θ_f of the feature mapping network. The image feature I_Concat learned by the feature extraction network comprises the two parts I_VGG and I_NIC. For the image feature I_Concat, two mapping functions f(·) and g(·) are designed in the feature mapping network to map I_VGG and I_NIC to the d-dimensional text-space features I_VGG_txt and I_NIC_txt, respectively. Like I_VGG and I_NIC, the features I_VGG_txt and I_NIC_txt are complementary, so a feature fusion layer is designed at the top of the feature mapping network to combine their advantages. The process is defined as Equation 5:

I_VGG_txt = f(I_VGG),  I_NIC_txt = g(I_NIC),  I_final = I_VGG_txt + I_NIC_txt    (Equation 5)

where I_VGG is the 4096-dimensional image feature extracted by VGGNet, I_NIC is the 512-dimensional image feature extracted by the image description algorithm NIC, I_final is the d-dimensional feature representation of the input image in the restricted text space, f(·) and g(·) are two feature mapping functions, and I_VGG_txt and I_NIC_txt are the d-dimensional text-space feature mappings of I_VGG and I_NIC, respectively. Notably, the feature extraction process for text already amounts to mapping the text into the restricted text space; thus the parameters θ_f of the feature mapping network (see Equation 9) include the parameters of the LSTM network.
FIGS. 1(b) and 1(c) show the network structures of the feature mapping network and the modality classifier, respectively. The feature mapping network comprises the two mapping sub-networks f(·) and g(·), a fusion layer and an L2 normalization layer (L2 Norm). f(·) contains two fully connected layers with 2048 and 1024 hidden nodes, respectively; ReLU is used as the activation function between the fully connected layers, and a Dropout layer with rate 0.5 is added after each ReLU to prevent over-fitting. g(·) contains one fully connected layer with 1024 hidden nodes. The fusion layer performs element-wise addition, and the L2 normalization layer allows the similarity between the learned features to be measured directly by dot product, which speeds up model convergence and improves training stability.
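By way of illustration only, a minimal PyTorch-style sketch of this feature mapping network is given below (an assumption; d = 1024 is taken from the hidden-layer sizes stated above, and whether an activation follows the last fully connected layer of f(·) is an implementation choice):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMappingNetwork(nn.Module):
    def __init__(self, d: int = 1024):
        super().__init__()
        self.f = nn.Sequential(                      # f(.): I_VGG (4096-d) -> 2048 -> d
            nn.Linear(4096, 2048), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(2048, d))
        self.g = nn.Linear(512, d)                   # g(.): I_NIC (512-d) -> d

    def forward(self, i_vgg, i_nic):
        i_vgg_txt = self.f(i_vgg)
        i_nic_txt = self.g(i_nic)
        i_final = i_vgg_txt + i_nic_txt              # fusion layer: element-wise addition
        return F.normalize(i_final, dim=1)           # L2 normalization layer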
After mapping the image and the text into the restricted text space in its initial state, the next step is to compare the similarity between the features and compute the corresponding triplet loss. A similarity metric function s(v, t) = v · t is defined, where v and t denote the image feature and the text feature, respectively. To make s equivalent to the cosine similarity, v and t are normalized by the L2 normalization layer before comparison. The triplet loss function is very widely used in the cross-media retrieval field. Given an input image (text), let the distance to the matching text (image) be d_1 and the distance to an unmatched text (image) be d_2; d_1 should be smaller than d_2 by at least a margin m. The margin m is an externally determined hyper-parameter; for optimization purposes it is fixed at m = 0.3 and applied to all datasets. Thus, in the invention, the triplet loss function is given by Equation 6:

L_emb(θ_f) = Σ_{v} Σ_{k} max(0, m − s(v, t) + s(v, t_k)) + Σ_{t} Σ_{k} max(0, m − s(v, t) + s(v_k, t))    (Equation 6)

where t_k is the k-th unmatched text of the input image v; v_k is the k-th unmatched image of the input text t; m is the minimum distance margin; s(v, t) is the similarity metric function; and θ_f denotes the parameters of the feature mapping network. The unmatched samples are randomly selected from the dataset in each training period.
Next, the feature vectors of the different-modality data are sent to the modality classifier for classification to obtain the current adversarial loss. In addition to the triplet loss, the adversarial loss L_adv from the modality classifier is also synchronously back-propagated to the feature mapping network.
Finally, the restricted text space is trained by optimizing the combined loss of the triplet loss L_emb and the adversarial loss L_adv. Since L_emb and L_adv are optimized in opposite directions, the overall loss function L is defined as:

L = L_emb − λ · L_adv    (Equation 7)

where λ is an adaptive parameter whose value ranges from 0 to 1, L_emb is the triplet loss function, and L_adv is the additional adversarial loss function. To suppress the noise signal of the modality classifier in the initial stage of training, the parameter λ is updated according to Equation 8:

λ = 2 / (1 + exp(−10·p)) − 1    (Equation 8)

where p is the fraction of the current iteration count over the total iteration count and λ is the adaptive parameter.
FIG. 3 shows the actual cross-media retrieval results of the invention on the Flickr8K test dataset. The first column of the table lists the image and text queries used for retrieval; the second to fourth columns show the top-5 retrieval results of LTS-A (VGG + BLSTM), LTS-A (NIC + BLSTM) and LTS-A (VGG + NIC + BLSTM) for each query. For image-to-text retrieval, correctly retrieved text is shown in red font; for text-to-image retrieval, correctly retrieved images contain a tick. Reading the table from left to right, the retrieval results improve significantly, particularly from LTS-A (VGG + BLSTM) to LTS-A (NIC + BLSTM); moreover, even the incorrectly retrieved samples match the query to some extent.
4. Training mode
The training process of the present invention includes four phases.
First: in the initial training phase, the parameters of VGGNet are fixed and the NIC is pre-trained using Flickr30K (image data from Yahoo's photo-sharing website Flickr, 30000 pictures in total) or MSCOCO (a dataset created by Microsoft using Amazon's Mechanical Turk service). After this training is completed, image features can be extracted through the feature extraction network.
Second: after the features of all images in the dataset have been extracted, the second training phase is mainly used to learn the restricted text space. Given the loss function L of the feature mapping network, the parameters θ_d of the modality classifier are fixed and the parameters θ_f of the feature mapping network are updated according to Equation 9:

θ_f ← θ_f − μ · ∂L/∂θ_f    (Equation 9)

where μ is the learning rate of the optimization algorithm, L is the total loss function of the feature mapping network, and θ_f denotes the parameters of the feature mapping network.
Third: after the second training phase, the third training phase is mainly used to enhance the discriminative power of the modality classifier. Given the loss function L_adv of the modality classifier, the parameters θ_f of the feature mapping network are fixed and the parameters θ_d of the modality classifier are updated according to Equation 10:

θ_d ← θ_d − μ · ∂L_adv/∂θ_d    (Equation 10)

where μ is the learning rate of the optimization algorithm, L_adv is the loss function of the modality classifier, and θ_d denotes the parameters of the modality classifier.
Fourth: the second and third training phases are repeated for each batch of training data until the model converges.
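By way of illustration only, a minimal PyTorch-style sketch of the alternation of the second and third phases over each batch is given below (an assumption; image_feats, text_enc, mapping_net, classifier, triplet_loss, lambda_schedule and update_classifier refer to the hypothetical helpers sketched earlier, and image_feats is assumed to return the pair (I_VGG, I_NIC)):

import torch
import torch.nn.functional as F

def train_epoch(loader, image_feats, text_enc, mapping_net, classifier,
                opt_f, opt_d, epoch, total_epochs):
    for step, (images, word_ids, lengths) in enumerate(loader):
        p = (epoch * len(loader) + step) / float(total_epochs * len(loader))
        i_vgg, i_nic = image_feats(images)           # phase 1 features (VGGNet fixed, NIC pre-trained)
        v = mapping_net(i_vgg, i_nic)                # image features in the restricted text space
        t = text_enc(word_ids, lengths)              # text features in the restricted text space
        # phase 2: update theta_f with L = L_emb - lambda * L_adv while theta_d stays fixed
        logits = classifier(torch.cat([v, t]))
        labels = torch.cat([torch.ones(len(v)), torch.zeros(len(t))]).long()
        l_adv = F.cross_entropy(logits, labels.to(logits.device))
        loss = triplet_loss(v, t) - lambda_schedule(p) * l_adv
        opt_f.zero_grad(); loss.backward(); opt_f.step()
        # phase 3: update theta_d of the modality classifier on detached features
        update_classifier(classifier, opt_d, v.detach(), t.detach())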
Table 1 shows the experimental results of cross-media retrieval of the invention on the Flickr8K test dataset. To evaluate retrieval effectiveness, the standard ranking metrics Recall@K and Median Rank are used. Recall@K measures retrieval accuracy as the probability that the correctly matched data is ranked in the top K (K = 1, 5, 10) retrieval results; Median Rank is the median of the ranks at which the correctly matched data appear. A higher Recall@K and a lower Median Rank indicate more accurate retrieval. The table compares the invention with other advanced algorithms in the prior art, including DeViSE (Deep Visual-Semantic Embedding), m-RNN (Multimodal Recurrent Neural Network image description), Deep Fragment (Deep Fragment Embedding), DCCA (Deep Canonical Correlation Analysis), VSE (Visual-Semantic Embedding), m-CNN_ENS (Multimodal Convolutional Neural Networks ensemble), NIC (Neural Image Caption, neural-network-based image description) and HM-LSTM (Hierarchical Multimodal LSTM). In addition, four variants of the above method are designed:
● LTS-A (VGG + LSTM): the image description algorithm NIC is removed from the image feature extraction process, with the rest fixed;
● LTS-A (NIC + LSTM): the convolutional neural network VGGNet is removed from the image feature extraction process, with the rest fixed;
● LTS-A (VGG + NIC + LSTM): the network architecture shown in FIG. 2;
● LTS-A (VGG + NIC + BLSTM): the network architecture of FIG. 2 with the LSTM network replaced by a bidirectional LSTM network (BLSTM).
Table 1: cross-media retrieval results of the embodiment on the Flickr8K test dataset.
In Table 1, Img2Txt denotes image-to-text retrieval and Txt2Img denotes text-to-image retrieval. As can be seen from Table 1, LTS-A (VGG + NIC + BLSTM) surpasses HM-LSTM on the image-to-text task and achieves the best retrieval results to date. However, LTS-A (VGG + NIC + BLSTM) does not perform as well as HM-LSTM on the text-to-image task. The most probable reason is that HM-LSTM adopts a tree-structured LSTM architecture and can better model the hierarchical structure of the text, whereas the invention adopts a chain LSTM architecture and cannot capture hierarchical semantic information in the text. In addition, as the results of the four variants show, when the image feature extraction network is changed from VGGNet to NIC, the accuracy of image-to-text retrieval improves by 22% and that of text-to-image retrieval by 17%, indicating that the NIC extracts more effective image features than the traditional VGGNet. After the image feature extraction network is changed from NIC to VGG + NIC, the cross-media retrieval accuracy improves by a further 6%, showing that the combined network not only extracts detailed object category information from the image but also captures rich interaction information between objects. Finally, replacing the LSTM network with a bidirectional LSTM network (BLSTM) brings an additional 2% improvement in retrieval accuracy.
Table 2 shows the cross-media retrieval results of the embodiment on the Flickr30K test dataset. In addition to the advanced algorithms mentioned for Flickr8K, DAN (Dual Attention Networks), DSPE (Deep Structure-Preserving Image-Text Embedding) and VSE++ (an enhanced visual-semantic embedding based on VSE) are added. On this dataset, DAN, which performs better than DSPE, achieves the best retrieval results. Thanks to its attention mechanism, DAN can continuously focus on fine-grained information in the data, which is highly beneficial for cross-media retrieval; in contrast, the invention uses only global features to represent images and texts and is therefore disturbed by noise in the images or texts. Besides DAN, DSPE also performs better than the invention, because it uses more complex text features (Fisher Vector) and loss functions. The experimental performance of the four variants of the invention is similar to that on Flickr8K.
Table 2: cross-media retrieval results of the embodiment on the Flickr30K test dataset.
Table 3: cross-media retrieval results of the embodiment on the MSCOCO test dataset.
Table 3 shows the cross-media retrieval results of the embodiment on the MSCOCO test dataset. In addition to the advanced algorithms mentioned for Flickr8K and Flickr30K, Order (Order-Embeddings of Images and Language) is added. Here, LTS-A (VGG + NIC + LSTM) achieves the best results on the image-to-text task, improving retrieval accuracy by about 2%, with only the R@1 index lower than DSPE. On the text-to-image task, DSPE performs better than the invention on Recall@K, but LTS-A (VGG + NIC + LSTM) achieves the best Median Rank. This is because the chain LSTM network used in the invention cannot fully capture the hierarchical semantic information in the text, so its text feature representation capability is inferior to that of FV (Fisher Vector). The experimental performance of the four variants of the invention is similar to that on Flickr8K and Flickr30K.
Table 4: cross-media retrieval results of the two variants LTS-A and LTS of the embodiment.
Table 4 shows the effect of the adversarial learning mechanism on the experimental results. Two variants of the original invention are designed: LTS-A and LTS. LTS-A is the previously mentioned LTS-A (VGG + NIC + LSTM); LTS is based on LTS-A (VGG + NIC + LSTM) with the adversarial learning mechanism removed.
As the table shows, LTS-A brings a significant improvement in cross-media retrieval accuracy over LTS; LTS exceeds LTS-A only on the R@1 index of image-to-text retrieval. The experimental results show that adversarial learning has an obvious effect in reducing the difference between the feature distributions of different-modality data.
Table 5: retrieval results of the embodiment on the MSCOCO test dataset.
Table 6 shows the search effect of extracting image features by means of single cropping and ten cropping on the MSCOCO test dataset, respectively.
In the above implementation, a single crop (1-crop) of the image region is used to extract image features. To verify the effectiveness of using the feature mean of ten different regions of an image as the image feature (10-crops), LTS-A (10-crops) is designed, where LTS-A refers to LTS-A (VGG + NIC + BLSTM) and 10-crops means that the image feature is described by the feature mean of ten different regions of the image. As can be seen from Table 6, the retrieval accuracy of LTS-A (10-crops) is significantly improved compared with LTS-A (1-crop), which also illustrates the feasibility of using the feature mean of ten different regions of an image as the image feature.
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (7)

1. An adversarial cross-media retrieval method based on a restricted text space, comprising designing a feature extraction network, a feature mapping network and a modality classifier, learning the restricted text space, extracting image and text features suited to cross-media retrieval, and mapping image features from the image space to the text space; continuously reducing the difference in feature distribution between data of different modalities during learning through an adversarial training mechanism; thereby enabling cross-media retrieval; specifically, the method comprises the following steps:
A. the feature extraction network comprises an image feature extraction network and a text feature extraction network, used for image feature extraction and text feature extraction respectively; the image feature extraction network learns, through one or both of VGGNet and NIC, the image feature I_Concat, which comprises one or both of the 4096-dimensional feature I_VGG and the image feature I_NIC extracted by the image description algorithm; the text feature extraction network extracts d-dimensional text features using a long short-term memory recurrent neural network (LSTM) or a bidirectional LSTM network (BLSTM);
B. the modality classifier acts as the discriminator in an adversarial network, and its training is realized by optimizing a two-class cross-entropy loss function; this function is also the additional adversarial loss function of the feature mapping network;
C. the feature mapping network learns the restricted text space through its parameters θ_f; for the image feature I_Concat learned by the feature extraction network, which comprises I_VGG and I_NIC, mapping functions f(·) and g(·) are designed in the feature mapping network to map I_VGG and I_NIC to the d-dimensional text-space features I_VGG_txt and I_NIC_txt, respectively; a feature fusion layer is designed at the top of the feature mapping network to fuse I_VGG_txt and I_NIC_txt into I_final, the d-dimensional feature representation of the input image in the restricted text space; d is the dimension of the restricted text space;
assume a training dataset D = {D_1, D_2, …, D_n} with n samples, where each sample D_i comprises a picture I_i and a piece of descriptive text T_i, i.e. D_i = (I_i, T_i); each piece of text consists of 5 sentences, each of which independently describes the matching picture; for the dataset D, the following steps 1)-4) are executed to train the feature extraction network, the feature mapping network and the modality classifier:
1) extracting the features of the images and texts in D through the feature extraction network: for the images in D, image features are extracted with the VGG model and the image description algorithm NIC; for the texts in D, text features are extracted with a long short-term memory recurrent neural network (LSTM), which realizes the mapping of the text into the feature space, the parameters of the LSTM network being updated synchronously with the parameters of the feature mapping network;
2) the feature mapping network maps the text and image features obtained in step 1) into the restricted text space in its initial state; the distance between the feature vectors is first calculated with a similarity metric function and their similarity compared to obtain the current triplet loss; the feature vectors of the different-modality data are then sent to the modality classifier for classification to obtain the current adversarial loss; finally, the restricted text space is trained by optimizing a combined loss function of the triplet loss and the adversarial loss;
3) the image and text features obtained in step 2), which lie in the same restricted text space, are sent to the modality classifier for classification, and the modality classifier is trained with a cross-entropy loss;
4) repeating steps 2)-3) until the feature mapping network converges;
5) according to the retrieval request, calculating the distance in the restricted text space between the image or text of the retrieval-request data and the data of the other modality in the dataset D, and ranking the retrieval results by distance to obtain the most similar retrieval results; specifically, the distance is calculated as the dot product between the feature vectors of the different-modality data in the space;
through the above steps, adversarial cross-media retrieval based on the restricted text space is realized.
2. The adversarial cross-media retrieval method of claim 1, wherein the computation process of image feature extraction is expressed as formula 1:
I_VGG = VGGNet(I),  I_NIC = NIC(I),  I_Concat = Concatenate(I_VGG, I_NIC)    (Equation 1)

where VGGNet(·) is a 19-layer VGG model that extracts the 4096-dimensional feature I_VGG of the input image I; NIC(·) is an image description algorithm that extracts the 512-dimensional image feature I_NIC; and Concatenate(·) is a feature connection layer that concatenates I_VGG and I_NIC into the 4608-dimensional feature I_Concat.
3. The adversarial cross-media retrieval method as claimed in claim 1, wherein the text feature extraction specifically performs the following steps:
given a text S = (s_0, s_1, …, s_T) of length T, each word s_t in S is represented with 1-of-k encoding, where k is the number of words in the dictionary; before entering the LSTM network, the word vector s_t first needs to be mapped to a denser space, as expressed by Equation 2:

x_t = W_e · s_t,  t ∈ {0, …, T}    (Equation 2)

where W_e is a word-vector mapping matrix that encodes the 1-of-k word vector s_t into a d-dimensional word vector;

the resulting dense-space word vectors are fed into the LSTM network, expressed as Equation 3:

i_t = σ(W_ix x_t + W_ih h_{t−1} + b_i)
f_t = σ(W_fx x_t + W_fh h_{t−1} + b_f)
o_t = σ(W_ox x_t + W_oh h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_cx x_t + W_ch h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t)    (Equation 3)

where i_t, f_t, o_t, c_t and h_t denote, respectively, the outputs of the input gate, forget gate, output gate, memory cell and hidden layer of the LSTM unit at time t; x_t is the word-vector input at the current time; h_{t−1} is the hidden-layer input of the LSTM unit at the previous time; σ denotes the sigmoid function; ⊙ denotes element-wise multiplication; tanh denotes the hyperbolic tangent activation function; the hidden-layer output h_T of the LSTM network at time T is the feature representation of the text S.
4. The adversarial cross-media retrieval method of claim 1, wherein the training of the modality classifier specifically performs the following operations:
given the text spatial feature label of the image as [ 01 ], the text spatial feature label of the text as [ 10 ], the training of the modal classifier is realized by optimizing a two-class cross entropy loss function, which is expressed as formula 4:
Figure FDA0003121352230000031
wherein x isiAnd yiRespectively representing the ith input text space characteristic and a label corresponding to the ith input text space characteristic; n represents the total number of feature samples currently input; thetadTraining parameters representing a modal classifier;
Figure FDA0003121352230000032
the function is used for predicting the mode of the current text space characteristic, namely text or picture; l isadvRepresenting a two-class cross entropy loss function of the modal classifier and an additional countermeasure loss function of the feature mapping network;
updating the parameter θ_d of the modal classifier through formula 10:
θ_d ← θ_d − μ · ∂L_adv/∂θ_d    (formula 10)
wherein μ denotes the learning rate of the optimization algorithm, and L_adv denotes the loss function defined in formula 4.
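A minimal PyTorch sketch of the modal-classifier update in this claim follows; the classifier architecture, the feature dimension, and the SGD optimizer are assumptions, and the two-class labels are expressed as an equivalent scalar target (0 = image, 1 = text):

```python
# Sketch of formula 4 / formula 10 under assumptions: a small MLP classifier
# trained with binary cross-entropy on text-space features of both modalities.
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(512, 128), nn.ReLU(),
                           nn.Linear(128, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-3)  # learning rate mu
bce = nn.BCELoss()

def train_classifier_step(img_feats, txt_feats):
    feats = torch.cat([img_feats, txt_feats], dim=0).detach()   # update only theta_d
    labels = torch.cat([torch.zeros(len(img_feats), 1),         # image -> 0
                        torch.ones(len(txt_feats), 1)], dim=0)  # text  -> 1
    loss = bce(classifier(feats), labels)                       # L_adv for theta_d
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                            # formula 10
    return loss.item()
```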
5. The adversarial cross-media retrieval method of claim 1, wherein the feature fusion layer at the top of the feature mapping network is obtained by the processing of formula 5:
I_VGG_txt = f(I_VGG)
I_NIC_txt = g(I_NIC)
I_final = I_VGG_txt + I_NIC_txt    (formula 5)
wherein I_VGG is the 4096-dimensional image feature extracted by VGGNet, I_NIC is the 512-dimensional image feature extracted by the image captioning algorithm NIC, I_final is the d-dimensional feature representation of the input image in the limited text space, f(·) and g(·) denote two feature mapping functions, and I_VGG_txt and I_NIC_txt are the d-dimensional text-space feature mappings of I_VGG and I_NIC, respectively.
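A minimal PyTorch sketch of formula 5 follows; the single fully-connected layers used for f and g and the element-wise addition used for fusion are assumptions consistent with the dimensions stated above, not necessarily the claimed design:

```python
# Sketch of the feature fusion layer: map I_VGG and I_NIC into the d-dimensional
# text space and fuse them (here by element-wise addition, an assumed choice).
import torch
import torch.nn as nn

class ImageToTextSpace(nn.Module):
    def __init__(self, d: int = 512):
        super().__init__()
        self.f = nn.Linear(4096, d)     # f: I_VGG -> I_VGG_txt
        self.g = nn.Linear(512, d)      # g: I_NIC -> I_NIC_txt

    def forward(self, i_vgg, i_nic):
        i_vgg_txt = self.f(i_vgg)
        i_nic_txt = self.g(i_nic)
        return i_vgg_txt + i_nic_txt    # I_final, d-dimensional
```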
6. The adversarial cross-media retrieval method of claim 1, wherein step 2) trains the feature mapping network by optimizing the triplet loss function and the adversarial loss function, and specifically comprises the following operations:
setting the distance between the input image or text and its matching text or matching image to be d_1, and the distance to an unmatched text or unmatched image to be d_2; d_1 is required to be smaller than d_2 by at least a margin m; the margin m is an externally specified hyper-parameter; the triplet loss function is expressed as formula 6:
L_emb(θ_f) = Σ_k max(0, m − s(v, t) + s(v, t_k)) + Σ_k max(0, m − s(v, t) + s(v_k, t))    (formula 6)
wherein t_k is the k-th unmatched text of the input image v; v_k is the k-th unmatched image of the input text t; m is the minimum distance margin; s(v, t) is the similarity metric function; θ_f denotes the parameters of the feature mapping network; the unmatched samples are randomly re-selected from the data set in each training epoch;
the adversarial loss L_adv from the modal classifier is simultaneously back-propagated to the feature mapping network;
defining the overall loss function L as formula 7:
L = L_emb − λ · L_adv    (formula 7)
wherein λ is an adaptive parameter whose value ranges from 0 to 1; L_emb denotes the triplet loss function; L_adv is the additional adversarial loss function;
in order to suppress the noise signal of the modal classifier at the initial stage of training, the update of the parameter λ can be implemented by formula 8:
λ = 2 / (1 + e^(−10·p)) − 1    (formula 8)
wherein p denotes the fraction of the current iteration number over the total number of iterations; λ is the adaptive parameter;
training the feature mapping network with the loss function L, and updating the parameter θ_f of the feature mapping network through formula 9:
θ_f ← θ_f − μ · ∂L/∂θ_f    (formula 9)
wherein μ denotes the learning rate of the optimization algorithm, L denotes the total loss function of the feature mapping network, and θ_f denotes the parameters of the feature mapping network.
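A minimal PyTorch sketch of the training objective of this claim (formulas 6-9) follows; one negative sample per anchor, plain SGD-style updates, and the specific tensor shapes are assumptions for illustration:

```python
# Sketch: bidirectional triplet ranking loss L_emb, the adaptive weight lambda of
# formula 8, and the combined objective L = L_emb - lambda * L_adv of formula 7.
import math
import torch

def triplet_loss(v, t, v_neg, t_neg, m=0.2):
    """v, t, v_neg, t_neg: (batch, d) features; one unmatched sample per anchor."""
    s = lambda a, b: (a * b).sum(dim=1)                     # dot-product similarity
    pos = s(v, t)
    loss_img = torch.clamp(m - pos + s(v, t_neg), min=0)    # image anchor vs. t_k
    loss_txt = torch.clamp(m - pos + s(v_neg, t), min=0)    # text anchor vs. v_k
    return (loss_img + loss_txt).sum()

def adaptive_lambda(p: float) -> float:
    """Formula 8: lambda rises from 0 toward 1 as training progresses (p in [0, 1])."""
    return 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0

def total_loss(v, t, v_neg, t_neg, adv_loss, p):
    return triplet_loss(v, t, v_neg, t_neg) - adaptive_lambda(p) * adv_loss  # formula 7
```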
7. The adversarial cross-media retrieval method of claim 1, wherein the similarity measure function s(v, t) of step 2) is expressed as:
s(v, t) = v · t
wherein v and t denote the image feature and the text feature, respectively; v and t are normalized by a normalization layer before comparison, so that s is equivalent to the cosine similarity.
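A minimal PyTorch sketch of this similarity measure follows; after L2 normalization the dot product equals the cosine similarity of the two features:

```python
# Sketch of s(v, t): normalize both features, then take their dot product.
import torch
import torch.nn.functional as F

def similarity(v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    v = F.normalize(v, dim=-1)      # normalization layer
    t = F.normalize(t, dim=-1)
    return (v * t).sum(dim=-1)      # s(v, t) = v · t
```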
CN201810101127.0A 2018-02-01 2018-02-01 Antagonism cross-media retrieval method based on limited text space Expired - Fee Related CN108319686B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810101127.0A CN108319686B (en) 2018-02-01 2018-02-01 Antagonism cross-media retrieval method based on limited text space
PCT/CN2018/111327 WO2019148898A1 (en) 2018-02-01 2018-10-23 Adversarial cross-media retrieving method based on restricted text space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810101127.0A CN108319686B (en) 2018-02-01 2018-02-01 Antagonism cross-media retrieval method based on limited text space

Publications (2)

Publication Number Publication Date
CN108319686A CN108319686A (en) 2018-07-24
CN108319686B (en) 2021-07-30

Family

ID=62888861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810101127.0A Expired - Fee Related CN108319686B (en) 2018-02-01 2018-02-01 Antagonism cross-media retrieval method based on limited text space

Country Status (2)

Country Link
CN (1) CN108319686B (en)
WO (1) WO2019148898A1 (en)

Families Citing this family (66)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319686B (en) * 2018-02-01 2021-07-30 北京大学深圳研究生院 Antagonism cross-media retrieval method based on limited text space
CN109344266B (en) * 2018-06-29 2021-08-06 北京大学深圳研究生院 Dual-semantic-space-based antagonistic cross-media retrieval method
CN109508400B (en) * 2018-10-09 2020-08-28 中国科学院自动化研究所 Method for generating image-text abstract
CN109783655B (en) * 2018-12-07 2022-12-30 西安电子科技大学 Cross-modal retrieval method and device, computer equipment and storage medium
CN109783657B (en) * 2019-01-07 2022-12-30 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method and system based on limited text space
CN109919162B (en) * 2019-01-25 2021-08-10 武汉纺织大学 Model for outputting MR image feature point description vector symbol and establishing method thereof
CN110059217B (en) * 2019-04-29 2022-11-04 广西师范大学 Image text cross-media retrieval method for two-stage network
CN110189249B (en) * 2019-05-24 2022-02-18 深圳市商汤科技有限公司 Image processing method and device, electronic equipment and storage medium
CN110175256A (en) * 2019-05-30 2019-08-27 上海联影医疗科技有限公司 A kind of image data retrieval method, apparatus, equipment and storage medium
CN112182281B (en) * 2019-07-05 2023-09-19 腾讯科技(深圳)有限公司 Audio recommendation method, device and storage medium
CN110502743A (en) * 2019-07-12 2019-11-26 北京邮电大学 Social networks based on confrontation study and semantic similarity is across media search method
CN110674688B (en) * 2019-08-19 2023-10-31 深圳力维智联技术有限公司 Face recognition model acquisition method, system and medium for video monitoring scene
CN110866129A (en) * 2019-11-01 2020-03-06 中电科大数据研究院有限公司 Cross-media retrieval method based on cross-media uniform characterization model
CN111105013B (en) * 2019-11-05 2023-08-11 中国科学院深圳先进技术研究院 Optimization method of countermeasure network architecture, image description generation method and system
CN111179254B (en) * 2019-12-31 2023-05-30 复旦大学 Domain adaptive medical image segmentation method based on feature function and countermeasure learning
CN113094550B (en) * 2020-01-08 2023-10-24 百度在线网络技术(北京)有限公司 Video retrieval method, device, equipment and medium
CN111198964B (en) * 2020-01-10 2023-04-25 中国科学院自动化研究所 Image retrieval method and system
CN111259152A (en) * 2020-01-20 2020-06-09 刘秀萍 Deep multilayer network driven feature aggregation category divider
CN111259851B (en) * 2020-01-23 2021-04-23 清华大学 Multi-mode event detection method and device
CN111325319B (en) * 2020-02-02 2023-11-28 腾讯云计算(北京)有限责任公司 Neural network model detection method, device, equipment and storage medium
CN111353076B (en) * 2020-02-21 2023-10-10 华为云计算技术有限公司 Method for training cross-modal retrieval model, cross-modal retrieval method and related device
CN111368176B (en) * 2020-03-02 2023-08-18 南京财经大学 Cross-modal hash retrieval method and system based on supervision semantic coupling consistency
CN111782921A (en) * 2020-03-25 2020-10-16 北京沃东天骏信息技术有限公司 Method and device for searching target
CN111597810B (en) * 2020-04-13 2024-01-05 广东工业大学 Named entity identification method for semi-supervised decoupling
CN113673635B (en) * 2020-05-15 2023-09-01 复旦大学 Hand-drawn sketch understanding deep learning method based on self-supervision learning task
CN111651660B (en) * 2020-05-28 2023-05-02 拾音智能科技有限公司 Method for cross-media retrieval of difficult samples
CN111651577B (en) * 2020-06-01 2023-04-21 全球能源互联网研究院有限公司 Cross-media data association analysis model training and data association analysis method and system
CN111708745B (en) * 2020-06-18 2023-04-21 全球能源互联网研究院有限公司 Cross-media data sharing representation method and user behavior analysis method and system
CN111882032B (en) * 2020-07-13 2023-12-01 广东石油化工学院 Neural semantic memory storage method
CN111984800B (en) * 2020-08-16 2023-11-17 西安电子科技大学 Hash cross-modal information retrieval method based on dictionary pair learning
CN112256899B (en) * 2020-09-23 2022-05-10 华为技术有限公司 Image reordering method, related device and computer readable storage medium
CN112466281A (en) * 2020-10-13 2021-03-09 讯飞智元信息科技有限公司 Harmful audio recognition decoding method and device
CN112214988B (en) * 2020-10-14 2024-01-23 哈尔滨福涛科技有限责任公司 Deep learning and rule combination-based negotiable article structure analysis method
CN112396091B (en) * 2020-10-23 2024-02-09 西安电子科技大学 Social media image popularity prediction method, system, storage medium and application
CN112651448B (en) * 2020-12-29 2023-09-15 中山大学 Multi-mode emotion analysis method for social platform expression package
CN112949384B (en) * 2021-01-23 2024-03-08 西北工业大学 Remote sensing image scene classification method based on antagonistic feature extraction
CN112818157B (en) * 2021-02-10 2022-09-16 浙江大学 Combined query image retrieval method based on multi-order confrontation characteristic learning
CN112861977B (en) * 2021-02-19 2024-01-26 中国人民武装警察部队工程大学 Migration learning data processing method, system, medium, equipment, terminal and application
CN113052311B (en) * 2021-03-16 2024-01-19 西北工业大学 Feature extraction network with layer jump structure and method for generating features and descriptors
CN113420166A (en) * 2021-03-26 2021-09-21 阿里巴巴新加坡控股有限公司 Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment
CN113537272B (en) * 2021-03-29 2024-03-19 之江实验室 Deep learning-based semi-supervised social network abnormal account detection method
CN113159071B (en) * 2021-04-20 2022-06-21 复旦大学 Cross-modal image-text association anomaly detection method
CN113536013B (en) * 2021-06-03 2024-02-23 国家电网有限公司大数据中心 Cross-media image retrieval method and system
CN113379603B (en) * 2021-06-10 2024-03-15 大连海事大学 Ship target detection method based on deep learning
CN113656616B (en) * 2021-06-23 2024-02-27 同济大学 Three-dimensional model sketch retrieval method based on heterogeneous twin neural network
CN113360683B (en) * 2021-06-30 2024-04-19 北京百度网讯科技有限公司 Method for training cross-modal retrieval model and cross-modal retrieval method and device
CN113362416B (en) * 2021-07-01 2024-05-17 中国科学技术大学 Method for generating image based on text of target detection
CN113254678B (en) * 2021-07-14 2021-10-01 北京邮电大学 Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof
CN113610128B (en) * 2021-07-28 2024-02-13 西北大学 Aesthetic attribute retrieval-based picture aesthetic description modeling and describing method and system
CN114022687B (en) * 2021-09-24 2024-05-10 之江实验室 Image description countermeasure generation method based on reinforcement learning
CN113946710A (en) * 2021-10-12 2022-01-18 浙江大学 Video retrieval method based on multi-mode and self-supervision characterization learning
CN114022372B (en) * 2021-10-25 2024-04-16 大连理工大学 Mask image patching method for introducing semantic loss context encoder
CN114241517B (en) * 2021-12-02 2024-02-27 河南大学 Cross-mode pedestrian re-recognition method based on image generation and shared learning network
CN114298159B (en) * 2021-12-06 2024-04-09 湖南工业大学 Image similarity detection method based on text fusion under label-free sample
CN114443916B (en) * 2022-01-25 2024-02-06 中国人民解放军国防科技大学 Supply and demand matching method and system for test data
CN114677569B (en) * 2022-02-17 2024-05-10 之江实验室 Character-image pair generation method and device based on feature decoupling
CN115114395B (en) * 2022-04-15 2024-03-19 腾讯科技(深圳)有限公司 Content retrieval and model training method and device, electronic equipment and storage medium
CN115129917B (en) * 2022-06-06 2024-04-09 武汉大学 optical-SAR remote sensing image cross-modal retrieval method based on modal common characteristics
CN115131613B (en) * 2022-07-01 2024-04-02 中国科学技术大学 Small sample image classification method based on multidirectional knowledge migration
CN115909317A (en) * 2022-07-15 2023-04-04 广东工业大学 Learning method and system for three-dimensional model-text joint expression
CN115840827B (en) * 2022-11-07 2023-09-19 重庆师范大学 Deep unsupervised cross-modal hash retrieval method
CN116108215A (en) * 2023-02-21 2023-05-12 湖北工业大学 Cross-modal big data retrieval method and system based on depth fusion
CN116821408B (en) * 2023-08-29 2023-12-01 南京航空航天大学 Multi-task consistency countermeasure retrieval method and system
CN116935329B (en) * 2023-09-19 2023-12-01 山东大学 Weak supervision text pedestrian retrieval method and system for class-level comparison learning
CN117312592B (en) * 2023-11-28 2024-02-09 云南联合视觉科技有限公司 Text-pedestrian image retrieval method based on modal invariant feature learning
CN117611924B (en) * 2024-01-17 2024-04-09 贵州大学 Plant leaf phenotype disease classification method based on graphic subspace joint learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9507816B2 (en) * 2011-05-24 2016-11-29 Nintendo Co., Ltd. Partitioned database model to increase the scalability of an information system
CN104346440B (en) * 2014-10-10 2017-06-23 浙江大学 A kind of across media hash indexing methods based on neutral net
CN106095893B (en) * 2016-06-06 2018-11-20 北京大学深圳研究生院 A kind of cross-media retrieval method
CN108319686B (en) * 2018-02-01 2021-07-30 北京大学深圳研究生院 Antagonism cross-media retrieval method based on limited text space

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1211769A (en) * 1997-06-26 1999-03-24 香港中文大学 Method and equipment for file retrieval based on Bayesian network
CN1920818A (en) * 2006-09-14 2007-02-28 浙江大学 Transmedia search method based on multi-mode information convergence analysis
CN103914711A (en) * 2014-03-26 2014-07-09 中国科学院计算技术研究所 Improved top speed learning model and method for classifying modes of improved top speed learning model
CN105512289A (en) * 2015-12-07 2016-04-20 郑州金惠计算机系统工程有限公司 Image retrieval method based on deep learning and Hash
CN105718532A (en) * 2016-01-15 2016-06-29 北京大学 Cross-media sequencing method based on multi-depth network structure
CN106202413A (en) * 2016-07-11 2016-12-07 北京大学深圳研究生院 A kind of cross-media retrieval method
CN106649715A (en) * 2016-12-21 2017-05-10 中国人民解放军国防科学技术大学 Cross-media retrieval method based on local sensitive hash algorithm and neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of an Efficient Multi-Pattern Matching Algorithm; Li Hui et al.; Journal of Beijing Technology and Business University (Natural Science Edition); 2009-05-31; Vol. 27, No. 3; pp. 65-68 *

Also Published As

Publication number Publication date
CN108319686A (en) 2018-07-24
WO2019148898A1 (en) 2019-08-08

Similar Documents

Publication Publication Date Title
CN108319686B (en) Antagonism cross-media retrieval method based on limited text space
CN109783657B (en) Multi-step self-attention cross-media retrieval method and system based on limited text space
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN110717017B (en) Method for processing corpus
CN108733742B (en) Global normalized reader system and method
CN109844743B (en) Generating responses in automated chat
CN108549658B (en) Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree
Karpathy Connecting images and natural language
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN111767405A (en) Training method, device and equipment of text classification model and storage medium
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN114565104A (en) Language model pre-training method, result recommendation method and related device
CN110704601A (en) Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
US11645479B1 (en) Method for AI language self-improvement agent using language modeling and tree search techniques
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN111598183A (en) Multi-feature fusion image description method
CN112257841A (en) Data processing method, device and equipment in graph neural network and storage medium
CN113722474A (en) Text classification method, device, equipment and storage medium
CN114818691A (en) Article content evaluation method, device, equipment and medium
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN111538841A (en) Comment emotion analysis method, device and system based on knowledge mutual distillation
Guo et al. Matching visual features to hierarchical semantic topics for image paragraph captioning
Deorukhkar et al. A detailed review of prevailing image captioning methods using deep learning techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210730