CN111782833A - Fine-grained cross-media retrieval method based on multi-model network - Google Patents

Fine-grained cross-media retrieval method based on multi-model network

Info

Publication number
CN111782833A
CN111782833A (application CN202010526211.4A)
Authority
CN
China
Prior art keywords: media, cross, features, extracting, data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010526211.4A
Other languages
Chinese (zh)
Other versions
CN111782833B (en)
Inventor
王琼
柏洁咪
姚亚洲
唐振民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202010526211.4A priority Critical patent/CN111782833B/en
Publication of CN111782833A publication Critical patent/CN111782833A/en
Application granted granted Critical
Publication of CN111782833B publication Critical patent/CN111782833B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a fine-grained cross-media retrieval method based on a multi-model network, which comprises the following steps: acquiring a cross-media data set and preprocessing it to obtain cross-media data; extracting the media-specific features of each type of media data; extracting the common features of each type of media data; computing a weighted sum of the media-specific and common features to obtain the final combined features; and measuring the similarity between features of different media with the cosine distance and ranking the results by similarity. The invention constructs a media-specific network and a common network. The media-specific network contains one feature extractor per medium to extract that medium's own features; the common network is a unified network that learns from all four media simultaneously and extracts their common features. Combining the two networks removes the heterogeneous gap between media while preserving the characteristics of each medium to the greatest extent, thereby enabling effective cross-media retrieval, and the method has broad application prospects.

Description

Fine-grained cross-media retrieval method based on multi-model network
Technical Field
The invention belongs to the technical fields of computer vision, natural language processing, fine-grained recognition, and multimedia retrieval, and particularly relates to a fine-grained cross-media retrieval method based on a multi-model network.
Background
In recent years, with the rapid growth of multimedia data, images, text, audio, and video have become major forms through which people perceive the world. Research on multimedia data has been ongoing for many years, but past research has generally focused on a single media type, i.e., both the query and the retrieved results belong to one media type. Today, the correlation among massive multimedia data keeps increasing, and users' retrieval needs have become very flexible and can no longer be satisfied by a single media type, so realizing cross-media retrieval is a key problem to be solved. Cross-media retrieval means that a user submits data of one media type and is returned relevant data of the other media types. For example, a user who has a photo of a bird but does not know its fine-grained category can submit the photo and retrieve video, text, and audio information related to that category. However, existing cross-media retrieval generally works at the coarse-grained level, and research at the fine-grained level is scarce. The coarse-grained level deals with large classes (e.g., birds, dogs, cats), while the fine-grained level focuses on subclasses within a large class (e.g., different gull or cuckoo species). In a typical coarse-grained retrieval method, when a user submits a picture of a particular bird species, audio, video, and text about birds in general are returned instead of information related to that specific species, which does not meet users' needs. For these reasons, research on fine-grained cross-media retrieval methods has wide practical significance.
Current research on fine-grained cross-media retrieval still has shortcomings, and the most important problems are two-fold. The first is the heterogeneous gap between media: the feature representations of data samples of different media types are very different, so measuring their similarity directly is very difficult. The second is that existing research does not fully consider the small inter-class differences (different fine-grained categories, such as different gull or cuckoo species, can be very similar) and large intra-class differences (objects in the same category can differ significantly due to pose, illumination, etc.) that arise at the fine-grained level, which makes fine-grained retrieval more challenging than coarse-grained retrieval.
Disclosure of Invention
The invention aims to provide a fine-grained cross-media retrieval method based on a multi-model network.
The technical solution for realizing the purpose of the invention is as follows: a fine-grained cross-media retrieval method based on a multi-model network comprises the following specific steps:
step 1, acquiring a cross-media data set, and preprocessing the cross-media data set to acquire cross-media data;
step 2, respectively extracting the proprietary features of each media data;
step 3, extracting the public characteristics of each media data;
step 4, carrying out weighted summation on the proprietary features and the public features of the cross-media data to obtain final combined features;
and step 5, measuring the similarity between features of different media using the cosine distance and ranking them by similarity.
Preferably, the cross-media data includes image, video, text and audio data.
Preferably, the specific method for respectively extracting the proprietary features of each media data in step 2 is as follows:
extracting image and video data characteristics by adopting a characteristic extractor based on bilinear CNN;
pre-training word vectors with a word2vec model, and extracting text data features with an attention-based bidirectional long short-term memory (LSTM) feature extractor;
and extracting audio data features by using a VGG-based feature extractor.
Preferably, the specific process of extracting the image and video data features by using the feature extractor based on bilinear CNN is as follows:
the image or video data is passed through two CNN networks respectively to obtain different features, and bilinear features b(l, i) are obtained through a bilinear operation, specifically:
b(l, i) = E_a(l, i)^T E_b(l, i)
where E_a and E_b are the feature extraction functions of the two CNN networks respectively;
the bilinear features of all locations l ∈ L are aggregated into one feature through a pooling function, specifically:
P(i) = Σ_{l∈L} b(l, i)
and the aggregated feature is passed through a fully connected layer to obtain the final image and video features.
Preferably, the specific process of extracting text data features by the feature extractor of the attention-based bidirectional long-short term network is as follows:
an input layer receives an input sentence T = [t_1, t_2, …, t_n], where t_i is the i-th word in the sentence and n is the length of the sentence;
an embedding layer converts each word t_i in the sentence into a word vector e_i using the pre-trained word-vector matrix W;
a bidirectional LSTM network produces a deeper feature representation, where the output for the i-th word is the element-wise sum of the forward and backward hidden states:
h_i = h_i^→ ⊕ h_i^←
the set of output vectors is denoted H = [h_1, h_2, …, h_n];
an attention layer assigns larger weights to important features and produces a weight matrix γ over H, expressed as:
γ = softmax(w^T tanh(H))
where w is a parameter vector learned during training;
the output vector set H of the LSTM layer is multiplied by the weight matrix γ from the attention layer to obtain a feature representation f of the sentence, namely:
f = Hγ^T
and a softmax classifier yields the final text feature representation f_pro.
Preferably, the specific method for extracting the common features of the media data is as follows:
constructing a public network based on FGCrossNet;
the four types of media data simultaneously pass through convolutional layers, pooling layers and fully connected layers, and the common features of the four media are learned jointly through the loss functions.
Preferably, the loss functions comprise:
the cross-entropy loss function, specifically:
L_cro = l(I) + l(T) + l(V) + l(A)
where I, T, V, A denote the image, text, video and audio data respectively, and l(·) is the cross-entropy loss over all samples of a single medium;
the center loss function, specifically:
L_cen = Σ_j || x_j − c_{y_j} ||²_2
where x_j is the feature of the j-th sample and c_{y_j} is the center feature of the category to which the j-th sample belongs;
the quadruplet loss function, specifically:
L_qua = Σ max(0, d(x_a, x_p) − d(x_a, x_m1) + α_1) + Σ max(0, d(x_a, x_p) − d(x_m1, x_m2) + α_2)
where x_a, x_p, x_m1, x_m2 come from the four media types, x_a and x_p belong to the same category, x_m1 and x_m2 belong to different categories, d(·) denotes the L2 distance, and α_1, α_2 are set hyper-parameters;
the distribution loss function, specifically:
L_dis = Σ_{i≠j} Σ_{c=1}^{C} g_mmd(h_i^c, h_j^c)
where c denotes a certain category, C denotes the total number of categories, and g_mmd(h_i^c, h_j^c) denotes the distance between the two distributions.
Compared with the prior art, the invention has the following notable advantages:
(1) the invention provides a separate feature extractor for each medium, fully accounting for each medium's specific characteristics;
(2) the invention constructs a network that can process four media simultaneously, reducing the heterogeneous gap problem as much as possible;
(3) the invention introduces four loss functions, comprehensively considering inter-class differences, intra-class differences and inter-media differences;
(4) the multi-model network approach of the invention significantly improves the accuracy of fine-grained cross-media retrieval.
The present invention is described in further detail below with reference to the attached drawing figures.
Drawings
FIG. 1 is a flow chart of a fine-grained cross-media retrieval method based on a multi-model network.
Fig. 2 is a schematic diagram of a bilinear CNN-based feature extractor.
FIG. 3 is a schematic diagram of an attention-based bidirectional long-short term network (Att-BLSTM) feature extractor.
Fig. 4 is a schematic diagram of a public network.
Detailed Description
As shown in fig. 1, a fine-grained cross-media retrieval method based on a multi-model network includes the following steps:
Step 1, acquiring the PKU FG-XMedia data set, which is currently the only fine-grained data set in the cross-media field; it covers 200 fine-grained bird categories and four media types: images, videos, texts and audios. The data set is preprocessed to obtain the cross-media data.
Specifically, the preprocessing is as follows: pictures and texts require no processing; for video, 25 frames are taken at equal intervals from each video as the video data; for audio, a short-time Fourier transform is used to obtain a spectrogram as the audio data.
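A minimal sketch of this preprocessing, assuming OpenCV and SciPy are available; the file paths, the frame-sampling helper names and the STFT window length are illustrative choices rather than values prescribed by the patent:

```python
# Sketch of the preprocessing step: 25 equally spaced video frames and a
# log-magnitude spectrogram obtained via a short-time Fourier transform.
import cv2
import numpy as np
from scipy import signal
from scipy.io import wavfile

def sample_video_frames(video_path, num_frames=25):
    """Return `num_frames` frames taken at equal intervals over the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def audio_to_spectrogram(wav_path):
    """Compute a log-magnitude spectrogram with a short-time Fourier transform."""
    rate, samples = wavfile.read(wav_path)
    if samples.ndim > 1:                      # mix down stereo to mono
        samples = samples.mean(axis=1)
    _, _, spec = signal.stft(samples.astype(np.float32), fs=rate, nperseg=512)
    return np.log1p(np.abs(spec))             # log scale keeps quiet components visible
```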
Step 2, respectively extracting the special characteristics of each media data, which specifically comprises the following steps:
the image and video data features are extracted by a feature extractor based on bilinear CNN, and the specific process is as follows:
As shown in FIG. 2, the two CNN networks can be viewed as two feature extraction functions E_a and E_b. The image or video data i is passed through the two CNN networks to obtain different features, and a bilinear operation, i.e. computing the outer product of the two streams' features at each spatial location l, yields the bilinear feature b(l, i):
b(l, i) = E_a(l, i)^T E_b(l, i)
The bilinear features of all locations l ∈ L are aggregated into one feature through the pooling function P:
P(i) = Σ_{l∈L} b(l, i)
The final image or video feature f_pro is then obtained through a fully connected layer.
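A minimal PyTorch sketch of a bilinear-CNN extractor of this kind; the choice of VGG-16 for both streams, the 200-class output and the signed-square-root normalisation are assumptions for illustration, not details fixed by the description above:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class BilinearCNN(nn.Module):
    def __init__(self, num_classes=200, feat_dim=512):
        super().__init__()
        # Two CNN "streams" playing the roles of E_a and E_b.
        self.stream_a = models.vgg16(weights=None).features
        self.stream_b = models.vgg16(weights=None).features
        self.fc = nn.Linear(feat_dim * feat_dim, num_classes)

    def forward(self, x):
        fa = self.stream_a(x)                          # (B, C, H, W)
        fb = self.stream_b(x)
        b, c, h, w = fa.shape
        fa = fa.view(b, c, h * w)                      # flatten spatial positions l
        fb = fb.view(b, c, h * w)
        # Outer product at every location, pooled over positions l (scaled sum).
        bilinear = torch.bmm(fa, fb.transpose(1, 2)) / (h * w)     # (B, C, C)
        bilinear = bilinear.view(b, -1)
        # Signed square-root and L2 normalisation, common in bilinear pooling.
        bilinear = torch.sign(bilinear) * torch.sqrt(torch.abs(bilinear) + 1e-10)
        feat = nn.functional.normalize(bilinear)        # media-specific feature f_pro
        return feat, self.fc(feat)                      # feature and class logits
```

In practice the two streams may share weights or use different backbones; only the outer-product-and-pool structure matters for the sketch.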
For text data, firstly, word vector W is trained and obtained by adopting a word2vec model (namely, the semantics of words are mapped into a vector space, and a word is represented by a specific vector);
features are then extracted using a feature extractor for the attention-based two-way long and short term network. As shown in fig. 3, the text feature extractor is composed of four layers, the first layer being an input layer for receiving an input sentence T ═ T1,t2,…,tn]WhereintiIs the ith word in the sentence, and n is the length of the sentence;
the second layer is an embedding layer, each word t in the sentence is pre-trained through a word vector W matrixiConversion into a particular word vector eiThe input corresponding word vector is denoted as E ═ E1,e2,…,en];
The third layer is the LSTM layer, with deeper feature representations obtained through the bi-directional LSTM network. The output of the ith word is hiCombining the forward and backward pass outputs using element summation:
Figure BDA0002531365370000051
the set of output vectors is denoted by H, H ═ H1,h2,…,hn];
The fourth layer is an attention layer and is used for distributing a larger weight to the important features, a weight matrix gamma of H is obtained, w is a parameter vector obtained by training and learning, and the weight matrix gamma is expressed as:
γ=softmax(wTtanh(H))
multiplying the output vector set H of the LSTM layer and the weight matrix gamma obtained by the attention layer to obtain a feature representation f of the sentence, namely:
f=HγT
and then obtaining a final text feature representation f through a softmax classifierpro
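A minimal PyTorch sketch of an Att-BLSTM of the kind described; the vocabulary size, embedding and hidden dimensions are illustrative assumptions, and the embedding layer would normally be initialised from the word2vec matrix W:

```python
import torch
import torch.nn as nn

class AttBLSTM(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=300, hidden_dim=256, num_classes=200):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)            # embedding layer
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)                         # BiLSTM layer
        self.w = nn.Parameter(torch.randn(hidden_dim))                  # attention vector w
        self.classifier = nn.Linear(hidden_dim, num_classes)            # softmax classifier

    def forward(self, tokens):                   # tokens: (B, n) word indices
        e = self.embedding(tokens)               # (B, n, embed_dim)
        out, _ = self.lstm(e)                    # (B, n, 2*hidden_dim)
        half = out.size(-1) // 2
        # Element-wise sum of forward and backward hidden states -> h_i
        h = out[..., :half] + out[..., half:]    # (B, n, hidden_dim)
        # Attention weights: gamma = softmax(w^T tanh(H))
        gamma = torch.softmax(torch.tanh(h) @ self.w, dim=1)            # (B, n)
        # Sentence feature: f = H gamma^T
        f = torch.einsum('bnh,bn->bh', h, gamma)                        # (B, hidden_dim)
        return f, self.classifier(f)             # f_pro and class logits
```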
For audio data, because some noise, such as water noise, wind noise, etc., is contained in the audio data, the VGG16 has strong noise immunity. Extracting features by adopting a feature extractor based on VGG to obtain a final audio feature representation fpro
Step 3, extracting the public characteristics of each media data based on the public network, wherein the specific method comprises the following steps:
A public network based on FGCrossNet that can model the four media simultaneously is constructed to extract their common features. As shown in fig. 4, the public network passes the four types of media data simultaneously through convolutional layers, pooling layers and fully connected layers, and learns the common features of the four media jointly through an optimized loss function that combines a cross-entropy loss, a center loss, a quadruplet loss and a distribution loss;
The cross-entropy loss function ensures that the network discriminates the categories well. Denoting images, text, video and audio as I, T, V, A, L_cro represents the sum of the cross-entropy losses over all media:
L_cro = l(I) + l(T) + l(V) + l(A)
where l(·) is the cross-entropy loss over all samples of a single medium. At the beginning of each training stage, the training set is re-sampled at random so that the number N of selected samples is the same for all four media types, which compensates for the differing amounts of training data across the media.
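A minimal sketch of this summed cross-entropy term, assuming per-medium logits and labels are already available; the dictionary layout is an illustrative assumption:

```python
import torch.nn.functional as F

def cross_media_ce(logits_by_media, labels_by_media):
    """Sum the per-medium cross-entropy losses over the media keys 'I', 'T', 'V', 'A'."""
    return sum(F.cross_entropy(logits_by_media[m], labels_by_media[m])
               for m in ('I', 'T', 'V', 'A'))
```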
The center loss function ensures that sample features of the same category stay close to that category's center feature; it is denoted L_cen:
L_cen = Σ_j || x_j − c_{y_j} ||²_2
where x_j is the feature of the j-th sample and c_{y_j} is the feature of the center of the category to which the j-th sample belongs.
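A minimal sketch of a center loss with learnable class centers; averaging over the batch (rather than summing) is an implementation choice, and the feature dimension is an assumption:

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    def __init__(self, num_classes=200, feat_dim=512):
        super().__init__()
        # One learnable center feature per category.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):         # features: (N, D), labels: (N,)
        # Squared L2 distance between each sample and its category center.
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()
```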
The quadruplet loss function ensures that different categories are kept as far apart as possible; it is denoted L_qua:
L_qua = Σ max(0, d(x_a, x_p) − d(x_a, x_m1) + α_1) + Σ max(0, d(x_a, x_p) − d(x_m1, x_m2) + α_2)
where x_a, x_p, x_m1, x_m2 come from the four media types, x_a and x_p belong to the same category, x_m1 and x_m2 belong to different categories, d(·) denotes the L2 distance, and α_1, α_2 are manually set hyper-parameters used to balance the two terms, typically set to 1 and 0.5 respectively.
Through the quadruplet loss function, the distance between samples of different categories is further increased while the positive pair (x_a, x_p) is drawn closer together.
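A minimal sketch of a quadruplet loss of this general form; the exact pairing and mining strategy used by the patent is not specified here, so this is only one plausible instantiation:

```python
import torch
import torch.nn.functional as F

def quadruplet_loss(xa, xp, xm1, xm2, alpha1=1.0, alpha2=0.5):
    d_ap = F.pairwise_distance(xa, xp)       # distance of the positive pair
    d_an = F.pairwise_distance(xa, xm1)      # anchor vs. a sample of another category
    d_nn = F.pairwise_distance(xm1, xm2)     # two samples from two other categories
    term1 = torch.clamp(d_ap - d_an + alpha1, min=0)
    term2 = torch.clamp(d_ap - d_nn + alpha2, min=0)
    return (term1 + term2).mean()
```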
The distribution loss function ensures that the distributions of the same category across different media are as close as possible. The maximum mean discrepancy (MMD) loss is used to measure the difference between sample feature distributions: h_i and h_j denote the feature distributions of the same category in two different media, with i, j being any two different media types, and the distance between the two distributions is denoted g_mmd(h_i, h_j):
g_mmd(h_i, h_j) = || E[φ(h_i)] − E[φ(h_j)] ||²_H
where φ(·) is a mapping function and the subscript H indicates that the distance is measured after φ(·) maps the data into a reproducing kernel Hilbert space (RKHS). The distribution loss L_dis is the sum of the distribution differences over all categories for every pair of different media:
L_dis = Σ_{i≠j} Σ_{c=1}^{C} g_mmd(h_i^c, h_j^c)
where c denotes a certain category and C denotes the total number of categories.
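A minimal sketch of such a distribution loss using a linear-kernel MMD (i.e., the distance between mean embeddings); a Gaussian-kernel MMD in an RKHS could be substituted, and the dictionary layout of features and labels is an illustrative assumption:

```python
import torch
from itertools import combinations

def mmd_linear(h_i, h_j):
    """Squared distance between the mean embeddings of two feature sets (N_i, D), (N_j, D)."""
    return (h_i.mean(dim=0) - h_j.mean(dim=0)).pow(2).sum()

def distribution_loss(feats, labels):
    """feats/labels: dicts keyed by media type -> (N, D) features and (N,) labels."""
    loss = 0.0
    classes = torch.unique(torch.cat(list(labels.values())))
    for mi, mj in combinations(feats.keys(), 2):      # every pair of different media
        for c in classes:                             # every category c
            hi = feats[mi][labels[mi] == c]
            hj = feats[mj][labels[mj] == c]
            if len(hi) and len(hj):
                loss = loss + mmd_linear(hi, hj)
    return loss
```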
Step 4, carrying out weighted summation on the proprietary features and the public features of the cross-media data to obtain the final combined feature f:
f=α*fpro+(1-α)*fcom
wherein f isproIs a characteristic peculiar to each input, fcomIs the common feature of each input and α is the weight corresponding to the unique feature.
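A minimal sketch of this fusion together with the cosine-similarity ranking used in step 5 below; the weight α = 0.8 is an illustrative value, not one prescribed by the description:

```python
import torch
import torch.nn.functional as F

def fuse(f_pro, f_com, alpha=0.8):
    """f = alpha * f_pro + (1 - alpha) * f_com"""
    return alpha * f_pro + (1 - alpha) * f_com

def rank_by_cosine(query_feat, gallery_feats):
    """Return gallery indices sorted from most to least similar to the query."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), gallery_feats, dim=1)
    return torch.argsort(sims, descending=True)
```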
Step 5, the cosine distance is used to measure the similarity between features of different media to obtain the returned results, and the mean average precision (mAP) is adopted as the evaluation metric for retrieval, where a higher mAP value indicates better retrieval performance. Tables 1 and 2 compare the fine-grained cross-media retrieval results of the present invention with existing methods, which correspond to the methods described in documents [1] to [8], respectively:
[1] Xiangteng He, Yuxin Peng, Liu Xie: A New Benchmark and Approach for Fine-grained Cross-media Retrieval. ACM Multimedia 2019: 1740-1748.
[2] Xin Huang, Yuxin Peng, Mingkuan Yuan: MHTN: Modal-adversarial Hybrid Transfer Network for Cross-modal Retrieval. CoRR abs/1708.04308 (2017).
[3] Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, Heng Tao Shen: Adversarial Cross-Modal Retrieval. ACM Multimedia 2017: 154-162.
[4] Xiaohua Zhai, Yuxin Peng, Jianguo Xiao: Learning Cross-Media Joint Representation With Sparse and Semisupervised Regularization. IEEE Trans. Circuits Syst. Video Techn. 24(6): 965-978 (2014).
[5] Devraj Mandal, Kunal N. Chaudhury, Soma Biswas: Generalized Semantic Preserving Hashing for N-Label Cross-Modal Retrieval. CVPR 2017: 2633-2641.
[6] Yuxin Peng, Xin Huang, Jinwei Qi: Cross-Media Shared Representation by Hierarchical Learning with Multiple Deep Networks. IJCAI 2016: 3846-3853.
[7] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, Xiaodong He: Stacked Cross Attention for Image-Text Matching. ECCV (4) 2018: 212-228.
[8] Jiuxiang Gu, Jianfei Cai, Shafiq R. Joty, Li Niu, Gang Wang: Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval With Generative Models. CVPR 2018: 7181-7189.
Table 1 shows the bimodal fine-grained cross-media retrieval results on the PKU FG-XMedia dataset for the present invention and existing methods, where "-" indicates no retrieval result:
[Table 1 omitted]
Table 2 shows the multimodal fine-grained cross-media retrieval results on the PKU FG-XMedia dataset for the present invention and existing methods:
[Table 2 omitted]
As can be seen from the tables, the invention achieves the best performance in both bimodal and multimodal fine-grained cross-media retrieval, which shows the effectiveness of combining a media-specific network with a public network in the proposed method; at the same time, the four loss functions introduced comprehensively account for intra-class, inter-class and inter-media differences, fully verifying the effectiveness of the designed method.

Claims (7)

1. A fine-grained cross-media retrieval method based on a multi-model network is characterized by comprising the following specific steps:
step 1, acquiring a cross-media data set, and preprocessing the cross-media data set to acquire cross-media data;
step 2, respectively extracting the proprietary features of each media data;
step 3, extracting the public characteristics of each media data;
step 4, carrying out weighted summation on the proprietary features and the public features of the cross-media data to obtain final combined features;
and step 5, measuring the similarity between features of different media using the cosine distance and ranking them by similarity.
2. The multi-model network based fine-grained cross-media retrieval method of claim 1, wherein the cross-media data comprises image, video, text and audio data.
3. The fine-grained cross-media retrieval method based on the multi-model network as claimed in claim 1, wherein the specific method for respectively extracting the proprietary features of each media data in step 2 is as follows:
extracting image and video data characteristics by adopting a characteristic extractor based on bilinear CNN;
pre-training word vectors with a word2vec model, and extracting text data features with an attention-based bidirectional long short-term memory (LSTM) feature extractor;
and extracting audio data features by using a VGG-based feature extractor.
4. The fine-grained cross-media retrieval method based on the multi-model network as claimed in claim 3, wherein the specific process of extracting image and video data features by using the feature extractor based on bilinear CNN is as follows:
passing the image or video data through two CNN networks respectively to obtain different features, and obtaining bilinear features b(l, i) through a bilinear operation, the specific formula being:
b(l, i) = E_a(l, i)^T E_b(l, i)
wherein E_a and E_b are the feature extraction functions of the two CNN networks respectively;
aggregating the bilinear features of all locations l ∈ L into one feature through a pooling function, the pooling function being specifically:
P(i) = Σ_{l∈L} b(l, i)
and passing the aggregated feature through a fully connected layer to obtain the final image and video features.
5. The fine-grained cross-media retrieval method based on the multi-model network as claimed in claim 3, wherein the specific process of extracting text data features by the feature extractor of the attention-based bidirectional long-short term network is as follows:
receiving, by an input layer, an input sentence T = [t_1, t_2, …, t_n], wherein t_i is the i-th word in the sentence and n is the length of the sentence;
converting, in an embedding layer, each word t_i in the sentence into a word vector e_i via the pre-trained word-vector matrix W;
obtaining a deeper feature representation through a bidirectional LSTM network, wherein the output for the i-th word is the element-wise sum of the forward and backward hidden states:
h_i = h_i^→ ⊕ h_i^←
the set of output vectors being denoted H = [h_1, h_2, …, h_n];
assigning larger weights through an attention layer and obtaining a weight matrix γ over H, the weight matrix γ being expressed as:
γ = softmax(w^T tanh(H))
wherein w is a parameter vector obtained by training;
multiplying the output vector set H of the LSTM layer by the weight matrix γ obtained by the attention layer to obtain a feature representation f of the sentence, namely:
f = Hγ^T
and obtaining a final text feature representation f_pro through a softmax classifier.
6. The fine-grained cross-media retrieval method based on the multi-model network as claimed in claim 1, wherein the specific method for extracting the common features of each media data is as follows:
constructing a public network based on FGCrossNet;
the four types of media data simultaneously pass through convolutional layers, pooling layers and fully connected layers, and the common features of the four media are learned jointly through the loss functions.
7. The fine-grained cross-media retrieval method based on multi-model network according to claim 6, wherein the loss function comprises:
the cross-entropy loss function, specifically:
L_cro = l(I) + l(T) + l(V) + l(A)
wherein I, T, V, A represent image, text, video and audio data respectively, and l(·) is the cross-entropy loss over all samples of a single medium;
the center loss function, specifically:
L_cen = Σ_j || x_j − c_{y_j} ||²_2
wherein x_j is the feature of the j-th sample and c_{y_j} is the center feature of the category to which the j-th sample belongs;
the quadruplet loss function, specifically:
L_qua = Σ max(0, d(x_a, x_p) − d(x_a, x_m1) + α_1) + Σ max(0, d(x_a, x_p) − d(x_m1, x_m2) + α_2)
wherein x_a, x_p, x_m1, x_m2 belong to the four media types, x_a and x_p belong to the same category, x_m1 and x_m2 belong to different categories, d(·) represents the L2 distance, and α_1, α_2 are set hyper-parameters;
the distribution loss function, specifically:
L_dis = Σ_{i≠j} Σ_{c=1}^{C} g_mmd(h_i^c, h_j^c)
wherein c represents a certain category, C represents the total number of categories, and g_mmd(h_i^c, h_j^c) represents the distance between the two distributions.
CN202010526211.4A 2020-06-09 2020-06-09 Fine granularity cross-media retrieval method based on multi-model network Active CN111782833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010526211.4A CN111782833B (en) 2020-06-09 2020-06-09 Fine granularity cross-media retrieval method based on multi-model network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010526211.4A CN111782833B (en) 2020-06-09 2020-06-09 Fine granularity cross-media retrieval method based on multi-model network

Publications (2)

Publication Number Publication Date
CN111782833A true CN111782833A (en) 2020-10-16
CN111782833B CN111782833B (en) 2023-12-19

Family

ID=72755874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010526211.4A Active CN111782833B (en) 2020-06-09 2020-06-09 Fine granularity cross-media retrieval method based on multi-model network

Country Status (1)

Country Link
CN (1) CN111782833B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220337A (en) * 2017-05-25 2017-09-29 北京大学 A kind of cross-media retrieval method based on mixing migration network
CN107480206A (en) * 2017-07-25 2017-12-15 杭州电子科技大学 A kind of picture material answering method based on multi-modal low-rank bilinearity pond

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Hu: "Fine-grained vehicle type recognition from multiple views based on multi-scale bilinear convolutional neural networks", Journal of Computer Applications *
Luo Jianhao et al.: "A survey of fine-grained image classification based on deep convolutional features", Acta Automatica Sinica *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101045A (en) * 2020-11-02 2020-12-18 北京淇瑀信息科技有限公司 Multi-mode semantic integrity recognition method and device and electronic equipment
CN112101045B (en) * 2020-11-02 2021-12-14 北京淇瑀信息科技有限公司 Multi-mode semantic integrity recognition method and device and electronic equipment
CN113537145A (en) * 2021-06-28 2021-10-22 青鸟消防股份有限公司 Method, device and storage medium for rapidly solving false detection and missed detection in target detection
CN113537145B (en) * 2021-06-28 2024-02-09 青鸟消防股份有限公司 Method, device and storage medium for rapidly solving false detection and missing detection in target detection
CN113486833A (en) * 2021-07-15 2021-10-08 北京达佳互联信息技术有限公司 Multi-modal feature extraction model training method and device and electronic equipment
CN113704537A (en) * 2021-10-28 2021-11-26 南京码极客科技有限公司 Fine-grained cross-media retrieval method based on multi-scale feature union
CN113779282A (en) * 2021-11-11 2021-12-10 南京码极客科技有限公司 Fine-grained cross-media retrieval method based on self-attention and generation countermeasure network
CN113792167A (en) * 2021-11-11 2021-12-14 南京码极客科技有限公司 Cross-media cross-retrieval method based on attention mechanism and modal dependence

Also Published As

Publication number Publication date
CN111782833B (en) 2023-12-19

Similar Documents

Publication Publication Date Title
CN111782833B (en) Fine granularity cross-media retrieval method based on multi-model network
Miech et al. Learning a text-video embedding from incomplete and heterogeneous data
Jiang et al. Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching.
CN109960763B (en) Photography community personalized friend recommendation method based on user fine-grained photography preference
CN105718532B (en) A kind of across media sort methods based on more depth network structures
CN110059217A (en) A kind of image text cross-media retrieval method of two-level network
CN107203636B (en) Multi-video abstract acquisition method based on hypergraph master set clustering
CN113268633B (en) Short video recommendation method
CN105701225B (en) A kind of cross-media retrieval method based on unified association hypergraph specification
Guo et al. Jointly learning of visual and auditory: A new approach for RS image and audio cross-modal retrieval
Mei et al. Patch based video summarization with block sparse representation
CN108388639B (en) Cross-media retrieval method based on subspace learning and semi-supervised regularization
Lee et al. Face image retrieval using sparse representation classifier with gabor-lbp histogram
Abdul-Rashid et al. Shrec’18 track: 2d image-based 3d scene retrieval
Li et al. Exploiting hierarchical activations of neural network for image retrieval
Yang et al. STA-TSN: Spatial-temporal attention temporal segment network for action recognition in video
CN108427740A (en) A kind of Image emotional semantic classification and searching algorithm based on depth measure study
Meng et al. Few-shot image classification algorithm based on attention mechanism and weight fusion
Zhang et al. Exploiting mid-level semantics for large-scale complex video classification
CN105701227B (en) A kind of across media method for measuring similarity and search method based on local association figure
CN113779283B (en) Fine-grained cross-media retrieval method with deep supervision and feature fusion
Qin et al. SHREC’22 track: Sketch-based 3D shape retrieval in the wild
Hu et al. Multimodal learning via exploring deep semantic similarity
CN113239159B (en) Cross-modal retrieval method for video and text based on relational inference network
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant