CN111782833A - Fine-grained cross-media retrieval method based on multi-model network - Google Patents
Fine-grained cross-media retrieval method based on multi-model network
- Publication number
- CN111782833A CN111782833A CN202010526211.4A CN202010526211A CN111782833A CN 111782833 A CN111782833 A CN 111782833A CN 202010526211 A CN202010526211 A CN 202010526211A CN 111782833 A CN111782833 A CN 111782833A
- Authority
- CN
- China
- Prior art keywords
- media
- cross
- features
- extracting
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/483—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a fine-grained cross-media retrieval method based on a multi-model network, comprising the following steps: acquire a cross-media data set and preprocess it to obtain cross-media data; extract the media-specific features of each medium; extract the common features of each medium; compute a weighted sum of the specific and common features to obtain the final combined feature; measure the similarity between features of different media with the cosine distance and rank the results by similarity. The invention constructs a media-specific network and a common network. The media-specific network contains one feature extractor per medium and extracts each medium's specific features; the common network is a single network that learns from all four media simultaneously and extracts their common features. Combining the two networks preserves each medium's characteristics to the greatest extent while bridging the heterogeneity gap between media, thereby enabling effective cross-media retrieval with broad application prospects.
Description
Technical Field
The invention belongs to the technical fields of computer vision, natural language processing, fine-grained identification, multimedia retrieval and the like, and particularly relates to a fine-grained cross-media retrieval method based on a multi-model network.
Background
In recent years, with the rapid growth of multimedia data, images, text, audio and video have become major forms through which people perceive the world. Research on multimedia data has been ongoing for years, but past work has generally focused on a single media type: queries and results belong to the same medium. As the correlations among massive multimedia data keep growing, users' retrieval needs have become far more flexible and can no longer be satisfied by single-media retrieval, so realizing cross-media retrieval is a key problem to be solved. Cross-media retrieval means that a user submits data of one media type and the system returns the relevant data of other media types that the user desires. For example, a user holding a photo of a bird whose fine-grained category name is unknown can submit the photo and retrieve related video, text and audio information. However, existing cross-media retrieval generally operates at the coarse level, with little research at the fine-grained level. The coarse level distinguishes large classes (e.g., birds, dogs, cats), whereas the fine-grained level distinguishes subclasses within a large class (e.g., particular gull or finch species). With a typical coarse-grained retrieval method, a user who submits a picture of a specific bird species receives audio, video and text about birds in general rather than information related to that species' category, which does not meet users' needs. Research on fine-grained cross-media retrieval therefore has broad practical significance.
Current research on fine-grained cross-media retrieval still has shortcomings, and two problems stand out. The first is the heterogeneity gap between media: the feature representations of data samples from different media types differ greatly, so directly measuring the similarity between them is very difficult. The second is that existing work does not fully address the small inter-class differences (different fine-grained categories can be very similar, such as closely related gull species) and large intra-class differences (objects of the same category vary significantly with pose, illumination, etc.) inherent to the fine-grained level, which makes fine-grained retrieval more challenging than coarse-grained retrieval.
Disclosure of Invention
The invention aims to provide a fine-grained cross-media retrieval method based on a multi-model network.
The technical solution for realizing the purpose of the invention is as follows: a fine-grained cross-media retrieval method based on a multi-model network comprises the following specific steps:
step 1, acquiring a cross-media data set, and preprocessing the cross-media data set to acquire cross-media data;
step 2, respectively extracting the proprietary features of each media data;
step 3, extracting the public characteristics of each media data;
step 4, carrying out weighted summation on the proprietary features and the public features of the cross-media data to obtain final combined features;
and 5, measuring the similarity between different media characteristics by utilizing the cosine distance and sequencing the media characteristics according to the similarity.
Preferably, the cross-media data includes image, video, text and audio data.
Preferably, the specific method for respectively extracting the proprietary features of each media data in step 2 is as follows:
extracting image and video data characteristics by adopting a characteristic extractor based on bilinear CNN;
pre-training word vectors by adopting a word2vec model, and extracting text data features by adopting a feature extractor of a bidirectional long-short term network based on attention;
and extracting audio data features by using a VGG-based feature extractor.
Preferably, the specific process of extracting the image and video data features by using the feature extractor based on bilinear CNN is as follows:
the image or video data passes through two CNN networks to obtain different features, and the bilinear features b(l, i) are obtained through the bilinear operation, with the specific formula:
b(l, i) = E_a(l, i)^T E_b(l, i)
where E_a and E_b are the feature-extraction functions of the two CNN networks;
the bilinear features at all positions l ∈ L are aggregated into one feature through a pooling function, which is specifically:
P(i) = Σ_{l∈L} b(l, i)
the aggregated feature is passed through a fully connected layer to obtain the final image and video features.
Preferably, the specific process of extracting text data features by the feature extractor of the attention-based bidirectional long-short term network is as follows:
the input layer receives an input sentence T = [t_1, t_2, …, t_n], where t_i is the ith word in the sentence and n is the sentence length;
the embedding layer converts each word t_i in the sentence into a word vector e_i through the pre-trained word-vector matrix W;
a deeper feature representation is obtained through a bidirectional LSTM network, where the output for the ith word is the element-wise sum of the forward and backward outputs:
h_i = h_i^fw + h_i^bw
the set of output vectors is denoted H = [h_1, h_2, …, h_n];
the attention layer assigns larger weights to important features, yielding a weight matrix γ over H, where the weight matrix γ is expressed as:
γ = softmax(w^T tanh(H))
where w is a parameter vector obtained by training;
the output vector set H of the LSTM layer is multiplied by the weight matrix γ obtained by the attention layer to obtain the sentence feature representation f, namely:
f = Hγ^T
the final text feature representation f_pro is obtained through a softmax classifier.
Preferably, the specific method for extracting the common features of the media data is as follows:
constructing a public network based on FGCrossNet;
the four media data simultaneously pass through the convolution layer, the pooling layer and the full connection layer, and the common characteristics of the four media are learned at one time through the loss function.
Preferably, the loss function comprises:
the cross-entropy loss function, specifically:
L_cro = l(I) + l(T) + l(V) + l(A)
wherein I, T, V, A represent image, text, video and audio data respectively, and l is the cross-entropy loss function of all samples in a single medium;
the center loss function, specifically:
L_cen = Σ_j ||x_j − c_{y_j}||²
wherein x_j is the feature of the jth sample and c_{y_j} is the feature of the center of the category to which the jth sample belongs;
the quadruplet loss function, specifically:
L_qua = Σ max(0, d(x_a, x_p) − d(x_a, x_m1) + α1) + Σ max(0, d(x_a, x_p) − d(x_m1, x_m2) + α2)
wherein x_a, x_p, x_m1, x_m2 belong to four media types, x_a and x_p belong to the same category, x_m1 and x_m2 belong to different categories, d(·,·) represents the L2 distance, and α1, α2 are set hyper-parameters;
the distribution loss function, specifically:
L_dis = Σ_{c=1}^{C} Σ_{i≠j} g_mmd(h_i^c, h_j^c)
wherein c represents a class, C represents the total number of classes, and g_mmd(h_i^c, h_j^c) indicates the distance between the two distributions.
Compared with the prior art, the invention has the following remarkable advantages:
(1) the invention sets a feature extractor for each media individually, which fully considers the special characteristics of each media;
(2) the invention constructs a network which can process four media simultaneously, thereby reducing the heterogeneous gap problem as much as possible;
(3) the invention introduces four loss functions, comprehensively considers the problems of inter-class difference, intra-class difference and inter-media difference;
(4) the invention adopts a multi-model network method to obviously improve the accuracy of fine-grained cross-media retrieval.
The present invention is described in further detail below with reference to the attached drawing figures.
Drawings
FIG. 1 is a flow chart of a fine-grained cross-media retrieval method based on a multi-model network.
Fig. 2 is a schematic diagram of a bilinear CNN-based feature extractor.
FIG. 3 is a schematic diagram of an attention-based bidirectional long-short term network (Att-BLSTM) feature extractor.
Fig. 4 is a schematic diagram of a public network.
Detailed Description
As shown in fig. 1, a fine-grained cross-media retrieval method based on a multi-model network includes the following steps:
Step 1, acquire the PKU FG-XMedia data set, currently the only fine-grained data set in the cross-media field. It covers 200 fine-grained bird categories across four media types: image, video, text and audio. The data set is preprocessed to obtain the cross-media data.
Specifically, the preprocessing is as follows: pictures and text need no processing; for video, 25 frames are sampled at equal intervals from each clip as the video data; for audio, a short-time Fourier transform is applied to obtain a spectrogram as the audio data.
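The audio preprocessing above can be sketched as follows. The frame length, hop size and the 16 kHz sampling rate are illustrative assumptions, not values specified by the patent:

```python
import numpy as np

def spectrogram(signal, frame_len=512, hop=256):
    """Magnitude spectrogram via a short-time Fourier transform (Hann window)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-redundant half of the spectrum for real input
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq_bins, n_frames)

# 1 second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
```

The resulting 2-D magnitude array is what would be fed to the audio feature extractor as an image-like input.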
Step 2, respectively extracting the special characteristics of each media data, which specifically comprises the following steps:
The image and video data features are extracted with a bilinear-CNN-based feature extractor. The specific process is as follows:
As shown in FIG. 2, the two CNN networks can be viewed as two feature-extraction functions E_a and E_b. The image or video data i passes through both CNN networks to obtain different features, and the bilinear operation computes, at each spatial position l, the outer product of the two features to obtain the bilinear feature b(l, i):
b(l, i) = E_a(l, i)^T E_b(l, i)
The bilinear features at all positions l ∈ L are aggregated into one feature through the pooling function P:
P(i) = Σ_{l∈L} b(l, i)
The final image or video feature f_pro is then obtained through a fully connected layer.
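The bilinear operation and sum pooling can be sketched with NumPy; the feature-map shapes below are illustrative assumptions, since the patent does not specify the output sizes of the two CNN streams:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the outputs of the two CNN streams E_a and E_b for one input i:
# feature maps of shape (L, C) -- L spatial positions, C channels.
L_positions, c_a, c_b = 49, 8, 8
Ea = rng.normal(size=(L_positions, c_a))
Eb = rng.normal(size=(L_positions, c_b))

# b(l, i) = E_a(l, i)^T E_b(l, i): outer product of the two features at each l
bilinear = np.einsum('lc,ld->lcd', Ea, Eb)   # shape (L, c_a, c_b)

# P(i) = sum over positions l (sum pooling), flattened for the FC layer
pooled = bilinear.sum(axis=0).reshape(-1)    # shape (c_a * c_b,)
```

Summing the per-position outer products is equivalent to the single matrix product Ea.T @ Eb, which is how bilinear pooling is usually implemented.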
For text data, the word-vector matrix W is first trained with a word2vec model (which maps word semantics into a vector space so that each word is represented by a specific vector);
Features are then extracted with the attention-based bidirectional long short-term memory feature extractor. As shown in fig. 3, the text feature extractor consists of four layers. The first layer is the input layer, which receives an input sentence T = [t_1, t_2, …, t_n], where t_i is the ith word in the sentence and n is the sentence length;
The second layer is the embedding layer, which converts each word t_i in the sentence into a word vector e_i through the pre-trained word-vector matrix W; the word vectors of the input are denoted E = [e_1, e_2, …, e_n];
The third layer is the LSTM layer, which obtains a deeper feature representation through a bidirectional LSTM network. The output for the ith word, h_i, combines the forward and backward outputs by element-wise summation:
h_i = h_i^fw + h_i^bw
The set of output vectors is denoted H = [h_1, h_2, …, h_n];
The fourth layer is the attention layer, which assigns larger weights to important features, yielding a weight matrix γ over H, where w is a parameter vector obtained by training:
γ = softmax(w^T tanh(H))
The output vector set H of the LSTM layer is multiplied by the weight matrix γ obtained by the attention layer to give the sentence feature representation f:
f = Hγ^T
The final text feature representation f_pro is then obtained through a softmax classifier.
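The attention layer's computation, γ = softmax(w^T tanh(H)) followed by f = Hγ^T, can be sketched as follows; the hidden size, sentence length and the random stand-ins for H and the learned vector w are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
n, d = 6, 4                      # sentence length, BLSTM hidden size (assumed)
H = rng.normal(size=(d, n))      # columns are h_1..h_n from the BLSTM
w = rng.normal(size=d)           # learned parameter vector (random stand-in)

gamma = softmax(w @ np.tanh(H))  # attention weights over the n words
f = H @ gamma                    # f = H gamma^T: attention-weighted sentence feature
```

The weights in gamma sum to 1, so f is a convex combination of the per-word BLSTM outputs.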
For audio data: because the audio contains noise such as the sound of water or wind, and VGG networks are robust to such noise, a VGG-based feature extractor is adopted to extract features and obtain the final audio feature representation f_pro.
Step 3, extracting the public characteristics of each media data based on the public network, wherein the specific method comprises the following steps:
A common network based on FGCrossNet that can model the four media simultaneously is constructed to extract their common features. As shown in fig. 4, the common network passes the four media data through convolutional, pooling and fully connected layers at the same time, and learns the common features of the four media in one pass through an optimized loss function comprising a cross-entropy loss, a center loss, a quadruplet loss and a distribution loss;
The cross-entropy loss function ensures the discriminative ability of the network. Denoting image, text, video and audio data as I, T, V, A, the sum of the cross-entropy losses of all media, L_cro, is:
L_cro = l(I) + l(T) + l(V) + l(A)
where l is the cross-entropy loss over all samples of a single medium. At the start of each training stage the training set is re-sampled randomly so that the number N of selected samples is the same for all four media types, which resolves the imbalance in training data across media.
The center loss function ensures that sample features of the same category stay close to the category's center feature, and is denoted L_cen:
L_cen = Σ_j ||x_j − c_{y_j}||²
where x_j is the feature of the jth sample and c_{y_j} is the feature of the center of the category to which the jth sample belongs.
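The center loss can be sketched as follows; the toy features, labels and class centers are illustrative (during training, the class-center features would themselves be learned alongside the network):

```python
import numpy as np

def center_loss(features, labels, centers):
    """L_cen = sum_j ||x_j - c_{y_j}||^2: pulls same-class features together."""
    diffs = features - centers[labels]   # each sample minus its class center
    return float((diffs ** 2).sum())

feats = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
labels = np.array([0, 0, 1])                    # class index of each sample
centers = np.array([[0.5, 0.5], [2.0, 2.0]])    # one center feature per class
loss = center_loss(feats, labels, centers)
```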
The quadruplet loss function pushes different categories as far apart as possible and is denoted L_qua:
L_qua = Σ max(0, d(x_a, x_p) − d(x_a, x_m1) + α1) + Σ max(0, d(x_a, x_p) − d(x_m1, x_m2) + α2)
where x_a, x_p, x_m1, x_m2 belong to four media types, x_a and x_p belong to the same category, x_m1 and x_m2 belong to different categories, d(·,·) represents the L2 distance, and α1, α2 are manually set hyper-parameters that balance the two terms, typically set to 1 and 0.5 respectively.
Through the quadruplet loss, the distance between samples of different categories is further increased while the positive pair (x_a, x_p) is drawn closer together.
The distribution loss function makes the distributions of the same category across different media as close as possible. The maximum mean discrepancy (MMD) loss is adopted to measure the difference between sample feature distributions: with h_i, h_j denoting the feature distributions of the same category in any two different media i and j, the distance between the two distributions is denoted g_mmd(h_i, h_j):
g_mmd(h_i, h_j) = ||E[φ(h_i)] − E[φ(h_j)]||_H²
where φ(·) is a mapping function and the subscript H indicates that the distance is measured after φ(·) maps the data into a reproducing kernel Hilbert space (RKHS). The distribution loss L_dis is the sum of the distribution differences over all categories for any two different media:
L_dis = Σ_{c=1}^{C} Σ_{i≠j} g_mmd(h_i^c, h_j^c)
where c denotes a certain class and C the total number of classes.
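The MMD distance can be sketched in its simplest form, taking φ as the identity map instead of an RKHS feature map (a simplification for illustration; a kernelized estimator would be used in practice):

```python
import numpy as np

def linear_mmd(x, y):
    """Squared MMD with phi(h) = h: squared distance between sample means.
    x, y: arrays of shape (n_samples, feature_dim) from two media."""
    return float(np.sum((x.mean(axis=0) - y.mean(axis=0)) ** 2))

a = np.arange(12.0).reshape(4, 3)   # toy features of one class in medium i
b = a + 1.0                         # same class in medium j, shifted by 1
```

Identical distributions give zero; a constant shift of 1 in each of the 3 dimensions gives 3.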
Step 4, carrying out weighted summation on the proprietary features and the public features of the cross-media data to obtain the final combined feature f:
f=α*fpro+(1-α)*fcom
wherein f isproIs a characteristic peculiar to each input, fcomIs the common feature of each input and α is the weight corresponding to the unique feature.
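The weighted fusion f = α·f_pro + (1 − α)·f_com is a one-line computation; the value of α and the toy feature vectors below are illustrative assumptions:

```python
import numpy as np

alpha = 0.7                        # weight of the media-specific feature (assumed)
f_pro = np.array([1.0, 0.0, 2.0])  # specific feature from the media network
f_com = np.array([0.0, 1.0, 2.0])  # common feature from the shared network
f = alpha * f_pro + (1 - alpha) * f_com
```

α interpolates between trusting only the specific feature (α = 1) and only the common feature (α = 0).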
Step 5, the cosine distance is used to measure the similarity between features of different media and produce the ranked result. The mean average precision (mAP) is adopted as the retrieval evaluation metric; the higher the mAP, the better the retrieval performance. Tables 1 and 2 compare the fine-grained cross-media retrieval results of the present invention with the existing methods described in documents [1] to [8]:
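Cosine-similarity ranking and the average-precision metric can be sketched as follows; the tiny query and gallery vectors are illustrative stand-ins for learned features:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_precision(relevant):
    """AP of a ranked boolean relevance list; mAP is the mean over all queries."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / max(hits, 1)

# Rank gallery items (of any media type) against a query feature
query = np.array([1.0, 0.0])
gallery = {"a": np.array([0.9, 0.1]),
           "b": np.array([0.0, 1.0]),
           "c": np.array([1.0, 0.05])}
ranked = sorted(gallery, key=lambda k: cosine_sim(query, gallery[k]), reverse=True)
ap = average_precision([1, 0, 1])  # relevant items returned at ranks 1 and 3
```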
[1] Xiangteng He, Yuxin Peng, Liu Xie: A New Benchmark and Approach for Fine-grained Cross-media Retrieval. ACM Multimedia 2019: 1740-1748.
[2] Xin Huang, Yuxin Peng, Mingkuan Yuan: MHTN: Modal-adversarial Hybrid Transfer Network for Cross-modal Retrieval. CoRR abs/1708.04308 (2017).
[3] Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, Heng Tao Shen: Adversarial Cross-Modal Retrieval. ACM Multimedia 2017: 154-162.
[4] Xiaohua Zhai, Yuxin Peng, Jianguo Xiao: Learning Cross-Media Joint Representation With Sparse and Semi-supervised Regularization. IEEE Trans. Circuits Syst. Video Techn. 24(6): 965-978 (2014).
[5] Devraj Mandal, Kunal N. Chaudhury, Soma Biswas: Generalized Semantic Preserving Hashing for N-Label Cross-Modal Retrieval. CVPR 2017: 2633-2641.
[6] Yuxin Peng, Xin Huang, Jinwei Qi: Cross-Media Shared Representation by Hierarchical Learning with Multiple Deep Networks. IJCAI 2016: 3846-3853.
[7] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, Xiaodong He: Stacked Cross Attention for Image-Text Matching. ECCV (4) 2018: 212-228.
[8] Jiuxiang Gu, Jianfei Cai, Shafiq R. Joty, Li Niu, Gang Wang: Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval With Generative Models. CVPR 2018: 7181-7189.
table 1 shows the bimodal fine-grained cross-media search results on a PKU FG-XMedia dataset using the present invention and the existing method, wherein "-" indicates no search result:
table 2 shows the results of a multi-modal fine-grained cross-media search on a PKU FG-XMedia dataset using the methods provided by the present invention and existing methods:
As can be seen from the tables, the invention achieves the best performance in both bimodal and multimodal fine-grained cross-media retrieval, demonstrating the effectiveness of combining the media-specific networks with the common network. Moreover, the four introduced loss functions jointly account for intra-class, inter-class and inter-media differences, fully validating the effectiveness of the designed method.
Claims (7)
1. A fine-grained cross-media retrieval method based on a multi-model network is characterized by comprising the following specific steps:
step 1, acquiring a cross-media data set, and preprocessing the cross-media data set to acquire cross-media data;
step 2, respectively extracting the proprietary features of each media data;
step 3, extracting the public characteristics of each media data;
step 4, carrying out weighted summation on the proprietary features and the public features of the cross-media data to obtain final combined features;
and 5, measuring the similarity between different media characteristics by utilizing the cosine distance and sequencing the media characteristics according to the similarity.
2. The multi-model network based fine-grained cross-media retrieval method of claim 1, wherein the cross-media data comprises image, video, text and audio data.
3. The fine-grained cross-media retrieval method based on the multi-model network as claimed in claim 1, wherein the specific method for respectively extracting the proprietary features of each media data in step 2 is as follows:
extracting image and video data characteristics by adopting a characteristic extractor based on bilinear CNN;
pre-training word vectors by adopting a word2vec model, and extracting text data features by adopting a feature extractor of a bidirectional long-short term network based on attention;
and extracting audio data features by using a VGG-based feature extractor.
4. The fine-grained cross-media retrieval method based on the multi-model network as claimed in claim 3, wherein the specific process of extracting image and video data features by using the feature extractor based on bilinear CNN is as follows:
the image or video data passes through two CNN networks to obtain different features, and the bilinear features b(l, i) are obtained through the bilinear operation, with the specific formula:
b(l, i) = E_a(l, i)^T E_b(l, i)
where E_a and E_b are the feature-extraction functions of the two CNN networks;
the bilinear features at all positions l ∈ L are aggregated into one feature through a pooling function, which is specifically:
P(i) = Σ_{l∈L} b(l, i)
the aggregated feature is passed through a fully connected layer to obtain the final image and video features.
5. The fine-grained cross-media retrieval method based on the multi-model network as claimed in claim 3, wherein the specific process of extracting text data features by the feature extractor of the attention-based bidirectional long-short term network is as follows:
the input layer receives an input sentence T = [t_1, t_2, …, t_n], where t_i is the ith word in the sentence and n is the sentence length;
the embedding layer converts each word t_i in the sentence into a word vector e_i through the pre-trained word-vector matrix W;
a deeper feature representation is obtained through a bidirectional LSTM network, where the output for the ith word is the element-wise sum of the forward and backward outputs:
h_i = h_i^fw + h_i^bw
the set of output vectors is denoted H = [h_1, h_2, …, h_n];
the attention layer assigns larger weights to important features, yielding a weight matrix γ over H, where the weight matrix γ is expressed as:
γ = softmax(w^T tanh(H))
where w is a parameter vector obtained by training;
the output vector set H of the LSTM layer is multiplied by the weight matrix γ obtained by the attention layer to obtain the sentence feature representation f, namely:
f = Hγ^T
the final text feature representation f_pro is obtained through a softmax classifier.
6. The fine-grained cross-media retrieval method based on the multi-model network as claimed in claim 1, wherein the specific method for extracting the common features of each media data is as follows:
constructing a public network based on FGCrossNet;
the four media data simultaneously pass through the convolution layer, the pooling layer and the full connection layer, and the common characteristics of the four media are learned at one time through the loss function.
7. The fine-grained cross-media retrieval method based on multi-model network according to claim 6, wherein the loss function comprises:
the cross-entropy loss function, specifically:
L_cro = l(I) + l(T) + l(V) + l(A)
wherein I, T, V, A represent image, text, video and audio data respectively, and l is the cross-entropy loss function of all samples in a single medium;
the center loss function, specifically:
L_cen = Σ_j ||x_j − c_{y_j}||²
wherein x_j is the feature of the jth sample and c_{y_j} is the feature of the center of the category to which the jth sample belongs;
the quadruplet loss function, specifically:
L_qua = Σ max(0, d(x_a, x_p) − d(x_a, x_m1) + α1) + Σ max(0, d(x_a, x_p) − d(x_m1, x_m2) + α2)
wherein x_a, x_p, x_m1, x_m2 belong to four media types, x_a and x_p belong to the same category, x_m1 and x_m2 belong to different categories, d(·,·) represents the L2 distance, and α1, α2 are set hyper-parameters;
the distribution loss function, specifically:
L_dis = Σ_{c=1}^{C} Σ_{i≠j} g_mmd(h_i^c, h_j^c)
wherein c represents a class, C represents the total number of classes, and g_mmd(h_i^c, h_j^c) indicates the distance between the two distributions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010526211.4A CN111782833B (en) | 2020-06-09 | 2020-06-09 | Fine granularity cross-media retrieval method based on multi-model network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010526211.4A CN111782833B (en) | 2020-06-09 | 2020-06-09 | Fine granularity cross-media retrieval method based on multi-model network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111782833A (application publication) | 2020-10-16 |
CN111782833B (granted publication) | 2023-12-19 |
Family
ID=72755874
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010526211.4A Active CN111782833B (en) | 2020-06-09 | 2020-06-09 | Fine granularity cross-media retrieval method based on multi-model network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111782833B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220337A (en) * | 2017-05-25 | 2017-09-29 | 北京大学 | A kind of cross-media retrieval method based on mixing migration network |
CN107480206A (en) * | 2017-07-25 | 2017-12-15 | 杭州电子科技大学 | A kind of picture material answering method based on multi-modal low-rank bilinearity pond |
2020
- 2020-06-09 CN CN202010526211.4A patent/CN111782833B/en active Active
Non-Patent Citations (2)
Title |
---|
Liu Hu: "Fine-grained recognition of vehicle models from multiple angles based on multi-scale bilinear convolutional neural network", Journal of Computer Applications (《计算机应用》) * |
Luo Jianhao et al.: "A survey of fine-grained image classification based on deep convolutional features", Acta Automatica Sinica (《自动化学报》) * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112101045A (en) * | 2020-11-02 | 2020-12-18 | 北京淇瑀信息科技有限公司 | Multi-mode semantic integrity recognition method and device and electronic equipment |
CN112101045B (en) * | 2020-11-02 | 2021-12-14 | 北京淇瑀信息科技有限公司 | Multi-mode semantic integrity recognition method and device and electronic equipment |
CN113537145A (en) * | 2021-06-28 | 2021-10-22 | 青鸟消防股份有限公司 | Method, device and storage medium for rapidly solving false detection and missed detection in target detection |
CN113537145B (en) * | 2021-06-28 | 2024-02-09 | 青鸟消防股份有限公司 | Method, device and storage medium for rapidly solving false detection and missing detection in target detection |
CN113486833A (en) * | 2021-07-15 | 2021-10-08 | 北京达佳互联信息技术有限公司 | Multi-modal feature extraction model training method and device and electronic equipment |
CN113704537A (en) * | 2021-10-28 | 2021-11-26 | 南京码极客科技有限公司 | Fine-grained cross-media retrieval method based on multi-scale feature union |
CN113779282A (en) * | 2021-11-11 | 2021-12-10 | 南京码极客科技有限公司 | Fine-grained cross-media retrieval method based on self-attention and generation countermeasure network |
CN113792167A (en) * | 2021-11-11 | 2021-12-14 | 南京码极客科技有限公司 | Cross-media cross-retrieval method based on attention mechanism and modal dependence |
Also Published As
Publication number | Publication date |
---|---|
CN111782833B (en) | 2023-12-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111782833B (en) | Fine granularity cross-media retrieval method based on multi-model network | |
Miech et al. | Learning a text-video embedding from incomplete and heterogeneous data | |
Jiang et al. | Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching. | |
CN109960763B (en) | Photography community personalized friend recommendation method based on user fine-grained photography preference | |
CN105718532B (en) | A kind of across media sort methods based on more depth network structures | |
CN110059217A (en) | A kind of image text cross-media retrieval method of two-level network | |
CN107203636B (en) | Multi-video abstract acquisition method based on hypergraph master set clustering | |
CN113268633B (en) | Short video recommendation method | |
CN105701225B (en) | A kind of cross-media retrieval method based on unified association hypergraph specification | |
Guo et al. | Jointly learning of visual and auditory: A new approach for RS image and audio cross-modal retrieval | |
Mei et al. | Patch based video summarization with block sparse representation | |
CN108388639B (en) | Cross-media retrieval method based on subspace learning and semi-supervised regularization | |
Lee et al. | Face image retrieval using sparse representation classifier with gabor-lbp histogram | |
Abdul-Rashid et al. | Shrec’18 track: 2d image-based 3d scene retrieval | |
Li et al. | Exploiting hierarchical activations of neural network for image retrieval | |
Yang et al. | STA-TSN: Spatial-temporal attention temporal segment network for action recognition in video | |
CN108427740A (en) | A kind of Image emotional semantic classification and searching algorithm based on depth measure study | |
Meng et al. | Few-shot image classification algorithm based on attention mechanism and weight fusion | |
Zhang et al. | Exploiting mid-level semantics for large-scale complex video classification | |
CN105701227B (en) | A kind of across media method for measuring similarity and search method based on local association figure | |
CN113779283B (en) | Fine-grained cross-media retrieval method with deep supervision and feature fusion | |
Qin et al. | SHREC’22 track: Sketch-based 3D shape retrieval in the wild | |
Hu et al. | Multimodal learning via exploring deep semantic similarity | |
CN113239159B (en) | Cross-modal retrieval method for video and text based on relational inference network | |
CN113934835A (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||