CN111782833A - Fine-grained cross-media retrieval method based on multi-model network - Google Patents
Fine-grained cross-media retrieval method based on multi-model network
- Publication number
- CN111782833A CN111782833A CN202010526211.4A CN202010526211A CN111782833A CN 111782833 A CN111782833 A CN 111782833A CN 202010526211 A CN202010526211 A CN 202010526211A CN 111782833 A CN111782833 A CN 111782833A
- Authority
- CN
- China
- Prior art keywords
- media
- cross
- features
- extracting
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/483—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a fine-grained cross-media retrieval method based on a multi-model network, comprising the following steps: acquire a cross-media data set and preprocess it to obtain cross-media data; extract the media-specific features of each medium; extract the common features of each medium; compute a weighted sum of the specific and common features to obtain the final combined feature; measure the similarity between features of different media with the cosine distance and rank the results by similarity. The invention constructs a media-specific network and a common network. The media-specific network contains one feature extractor per medium and extracts each medium's specific features; the common network is a single network that learns from all four media simultaneously and extracts their common features. Combining the two networks preserves each medium's characteristics to the greatest extent while bridging the heterogeneity gap between media, thereby enabling effective cross-media retrieval with broad application prospects.
Description
Technical Field
The invention belongs to the technical fields of computer vision, natural language processing, fine-grained identification, multimedia retrieval and the like, and particularly relates to a fine-grained cross-media retrieval method based on a multi-model network.
Background
In recent years, with the rapid growth of multimedia data, images, text, audio and video have become major forms through which people perceive the world. Research on multimedia data has been ongoing for years, but past work has generally focused on a single media type: queries and results belong to the same medium. As the correlations among massive multimedia data keep growing, users' retrieval needs have become far more flexible and can no longer be satisfied by single-media retrieval, so realizing cross-media retrieval is a key problem to be solved. Cross-media retrieval means that a user submits data of one media type and the system returns the relevant data of other media types that the user desires. For example, a user holding a photo of a bird whose fine-grained category name is unknown can submit the photo and retrieve related video, text and audio information. However, existing cross-media retrieval generally operates at the coarse level, with little research at the fine-grained level. The coarse level distinguishes large classes (e.g., birds, dogs, cats), whereas the fine-grained level distinguishes subclasses within a large class (e.g., particular gull or finch species). With a typical coarse-grained retrieval method, a user who submits a picture of a specific bird species receives audio, video and text about birds in general rather than information related to that species' category, which does not meet users' needs. Research on fine-grained cross-media retrieval therefore has broad practical significance.
Current research on fine-grained cross-media retrieval still has shortcomings, and two problems stand out. The first is the heterogeneity gap between media: the feature representations of data samples from different media types differ greatly, so directly measuring the similarity between them is very difficult. The second is that existing work does not fully address the small inter-class differences (different fine-grained categories can be very similar, such as closely related gull species) and large intra-class differences (objects of the same category vary significantly with pose, illumination, etc.) inherent to the fine-grained level, which makes fine-grained retrieval more challenging than coarse-grained retrieval.
Disclosure of Invention
The invention aims to provide a fine-grained cross-media retrieval method based on a multi-model network.
The technical solution for realizing the purpose of the invention is as follows: a fine-grained cross-media retrieval method based on a multi-model network comprises the following specific steps:
step 1, acquiring a cross-media data set, and preprocessing the cross-media data set to acquire cross-media data;
step 2, respectively extracting the proprietary features of each media data;
step 3, extracting the public characteristics of each media data;
step 4, carrying out weighted summation on the proprietary features and the public features of the cross-media data to obtain final combined features;
and 5, measuring the similarity between different media characteristics by utilizing the cosine distance and sequencing the media characteristics according to the similarity.
Preferably, the cross-media data includes image, video, text and audio data.
Preferably, the specific method for respectively extracting the proprietary features of each media data in step 2 is as follows:
extracting image and video data characteristics by adopting a characteristic extractor based on bilinear CNN;
pre-training word vectors by adopting a word2vec model, and extracting text data features by adopting a feature extractor of a bidirectional long-short term network based on attention;
and extracting audio data features by using a VGG-based feature extractor.
Preferably, the specific process of extracting the image and video data features by using the feature extractor based on bilinear CNN is as follows:
the image or video data passes through two CNN networks to obtain different features, and the bilinear features b(l, i) are obtained through the bilinear operation, with the specific formula:
b(l, i) = E_a(l, i)^T E_b(l, i)
where E_a and E_b are the feature-extraction functions of the two CNN networks;
the bilinear features at all positions l ∈ L are aggregated into one feature through a pooling function, which is specifically:
P(i) = Σ_{l∈L} b(l, i)
the aggregated feature is passed through a fully connected layer to obtain the final image and video features.
Preferably, the specific process of extracting text data features by the feature extractor of the attention-based bidirectional long-short term network is as follows:
the input layer receives an input sentence T = [t_1, t_2, …, t_n], where t_i is the ith word in the sentence and n is the sentence length;
the embedding layer converts each word t_i in the sentence into a word vector e_i through the pre-trained word-vector matrix W;
a deeper feature representation is obtained through a bidirectional LSTM network, where the output for the ith word is the element-wise sum of the forward and backward outputs:
h_i = h_i^fw + h_i^bw
the set of output vectors is denoted H = [h_1, h_2, …, h_n];
the attention layer assigns larger weights to important features, yielding a weight matrix γ over H, where the weight matrix γ is expressed as:
γ = softmax(w^T tanh(H))
where w is a parameter vector obtained by training;
the output vector set H of the LSTM layer is multiplied by the weight matrix γ obtained by the attention layer to obtain the sentence feature representation f, namely:
f = Hγ^T
the final text feature representation f_pro is obtained through a softmax classifier.
Preferably, the specific method for extracting the common features of the media data is as follows:
constructing a public network based on FGCrossNet;
the four media data simultaneously pass through the convolution layer, the pooling layer and the full connection layer, and the common characteristics of the four media are learned at one time through the loss function.
Preferably, the loss function comprises:
the cross-entropy loss function, specifically:
L_cro = l(I) + l(T) + l(V) + l(A)
wherein I, T, V, A represent image, text, video and audio data respectively, and l is the cross-entropy loss function of all samples in a single medium;
the center loss function, specifically:
L_cen = Σ_j ||x_j − c_{y_j}||²
wherein x_j is the feature of the jth sample and c_{y_j} is the feature of the center of the category to which the jth sample belongs;
the quadruplet loss function, specifically:
L_qua = Σ max(0, d(x_a, x_p) − d(x_a, x_m1) + α1) + Σ max(0, d(x_a, x_p) − d(x_m1, x_m2) + α2)
wherein x_a, x_p, x_m1, x_m2 belong to four media types, x_a and x_p belong to the same category, x_m1 and x_m2 belong to different categories, d(·,·) represents the L2 distance, and α1, α2 are set hyper-parameters;
the distribution loss function, specifically:
L_dis = Σ_{c=1}^{C} Σ_{i≠j} g_mmd(h_i^c, h_j^c)
wherein c represents a class, C represents the total number of classes, and g_mmd(h_i^c, h_j^c) indicates the distance between the two distributions.
Compared with the prior art, the invention has the following remarkable advantages:
(1) the invention sets a feature extractor for each media individually, which fully considers the special characteristics of each media;
(2) the invention constructs a network which can process four media simultaneously, thereby reducing the heterogeneous gap problem as much as possible;
(3) the invention introduces four loss functions, comprehensively considers the problems of inter-class difference, intra-class difference and inter-media difference;
(4) the invention adopts a multi-model network method to obviously improve the accuracy of fine-grained cross-media retrieval.
The present invention is described in further detail below with reference to the attached drawing figures.
Drawings
FIG. 1 is a flow chart of a fine-grained cross-media retrieval method based on a multi-model network.
Fig. 2 is a schematic diagram of a bilinear CNN-based feature extractor.
FIG. 3 is a schematic diagram of an attention-based bidirectional long-short term network (Att-BLSTM) feature extractor.
Fig. 4 is a schematic diagram of a public network.
Detailed Description
As shown in fig. 1, a fine-grained cross-media retrieval method based on a multi-model network includes the following steps:
Step 1, acquire the PKU FG-XMedia data set, currently the only fine-grained data set in the cross-media field. It covers 200 fine-grained bird categories across four media types: image, video, text and audio. The data set is preprocessed to obtain the cross-media data.
Specifically, the preprocessing is as follows: pictures and text need no processing; for video, 25 frames are sampled at equal intervals from each clip as the video data; for audio, a short-time Fourier transform is applied to obtain a spectrogram as the audio data.
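The audio preprocessing above can be sketched as follows. The frame length, hop size and the 16 kHz sampling rate are illustrative assumptions, not values specified by the patent:

```python
import numpy as np

def spectrogram(signal, frame_len=512, hop=256):
    """Magnitude spectrogram via a short-time Fourier transform (Hann window)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-redundant half of the spectrum for real input
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq_bins, n_frames)

# 1 second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
```

The resulting 2-D magnitude array is what would be fed to the audio feature extractor as an image-like input.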
Step 2, respectively extracting the special characteristics of each media data, which specifically comprises the following steps:
The image and video data features are extracted with a bilinear-CNN-based feature extractor. The specific process is as follows:
As shown in FIG. 2, the two CNN networks can be viewed as two feature-extraction functions E_a and E_b. The image or video data i passes through both CNN networks to obtain different features, and the bilinear operation computes, at each spatial position l, the outer product of the two features to obtain the bilinear feature b(l, i):
b(l, i) = E_a(l, i)^T E_b(l, i)
The bilinear features at all positions l ∈ L are aggregated into one feature through the pooling function P:
P(i) = Σ_{l∈L} b(l, i)
The final image or video feature f_pro is then obtained through a fully connected layer.
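The bilinear operation and sum pooling can be sketched with NumPy; the feature-map shapes below are illustrative assumptions, since the patent does not specify the output sizes of the two CNN streams:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the outputs of the two CNN streams E_a and E_b for one input i:
# feature maps of shape (L, C) -- L spatial positions, C channels.
L_positions, c_a, c_b = 49, 8, 8
Ea = rng.normal(size=(L_positions, c_a))
Eb = rng.normal(size=(L_positions, c_b))

# b(l, i) = E_a(l, i)^T E_b(l, i): outer product of the two features at each l
bilinear = np.einsum('lc,ld->lcd', Ea, Eb)   # shape (L, c_a, c_b)

# P(i) = sum over positions l (sum pooling), flattened for the FC layer
pooled = bilinear.sum(axis=0).reshape(-1)    # shape (c_a * c_b,)
```

Summing the per-position outer products is equivalent to the single matrix product Ea.T @ Eb, which is how bilinear pooling is usually implemented.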
For text data, the word-vector matrix W is first trained with a word2vec model (which maps word semantics into a vector space so that each word is represented by a specific vector);
Features are then extracted with the attention-based bidirectional long short-term memory feature extractor. As shown in fig. 3, the text feature extractor consists of four layers. The first layer is the input layer, which receives an input sentence T = [t_1, t_2, …, t_n], where t_i is the ith word in the sentence and n is the sentence length;
The second layer is the embedding layer, which converts each word t_i in the sentence into a word vector e_i through the pre-trained word-vector matrix W; the word vectors of the input are denoted E = [e_1, e_2, …, e_n];
The third layer is the LSTM layer, which obtains a deeper feature representation through a bidirectional LSTM network. The output for the ith word, h_i, combines the forward and backward outputs by element-wise summation:
h_i = h_i^fw + h_i^bw
The set of output vectors is denoted H = [h_1, h_2, …, h_n];
The fourth layer is the attention layer, which assigns larger weights to important features, yielding a weight matrix γ over H, where w is a parameter vector obtained by training:
γ = softmax(w^T tanh(H))
The output vector set H of the LSTM layer is multiplied by the weight matrix γ obtained by the attention layer to give the sentence feature representation f:
f = Hγ^T
The final text feature representation f_pro is then obtained through a softmax classifier.
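The attention layer's computation, γ = softmax(w^T tanh(H)) followed by f = Hγ^T, can be sketched as follows; the hidden size, sentence length and the random stand-ins for H and the learned vector w are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
n, d = 6, 4                      # sentence length, BLSTM hidden size (assumed)
H = rng.normal(size=(d, n))      # columns are h_1..h_n from the BLSTM
w = rng.normal(size=d)           # learned parameter vector (random stand-in)

gamma = softmax(w @ np.tanh(H))  # attention weights over the n words
f = H @ gamma                    # f = H gamma^T: attention-weighted sentence feature
```

The weights in gamma sum to 1, so f is a convex combination of the per-word BLSTM outputs.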
For audio data: because the audio contains noise such as the sound of water or wind, and VGG networks are robust to such noise, a VGG-based feature extractor is adopted to extract features and obtain the final audio feature representation f_pro.
Step 3, extracting the public characteristics of each media data based on the public network, wherein the specific method comprises the following steps:
A common network based on FGCrossNet that can model the four media simultaneously is constructed to extract their common features. As shown in fig. 4, the common network passes the four media data through convolutional, pooling and fully connected layers at the same time, and learns the common features of the four media in one pass through an optimized loss function comprising a cross-entropy loss, a center loss, a quadruplet loss and a distribution loss;
The cross-entropy loss function ensures the discriminative ability of the network. Denoting image, text, video and audio data as I, T, V, A, the sum of the cross-entropy losses of all media, L_cro, is:
L_cro = l(I) + l(T) + l(V) + l(A)
where l is the cross-entropy loss over all samples of a single medium. At the start of each training stage the training set is re-sampled randomly so that the number N of selected samples is the same for all four media types, which resolves the imbalance in training data across media.
The center loss function ensures that sample features of the same category stay close to the category's center feature, and is denoted L_cen:
L_cen = Σ_j ||x_j − c_{y_j}||²
where x_j is the feature of the jth sample and c_{y_j} is the feature of the center of the category to which the jth sample belongs.
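The center loss can be sketched as follows; the toy features, labels and class centers are illustrative (during training, the class-center features would themselves be learned alongside the network):

```python
import numpy as np

def center_loss(features, labels, centers):
    """L_cen = sum_j ||x_j - c_{y_j}||^2: pulls same-class features together."""
    diffs = features - centers[labels]   # each sample minus its class center
    return float((diffs ** 2).sum())

feats = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
labels = np.array([0, 0, 1])                    # class index of each sample
centers = np.array([[0.5, 0.5], [2.0, 2.0]])    # one center feature per class
loss = center_loss(feats, labels, centers)
```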
The quadruplet loss function pushes different categories as far apart as possible and is denoted L_qua:
L_qua = Σ max(0, d(x_a, x_p) − d(x_a, x_m1) + α1) + Σ max(0, d(x_a, x_p) − d(x_m1, x_m2) + α2)
where x_a, x_p, x_m1, x_m2 belong to four media types, x_a and x_p belong to the same category, x_m1 and x_m2 belong to different categories, d(·,·) represents the L2 distance, and α1, α2 are manually set hyper-parameters that balance the two terms, typically set to 1 and 0.5 respectively.
Through the quadruplet loss, the distance between samples of different categories is further increased while the positive pair (x_a, x_p) is drawn closer together.
The distribution loss function makes the distributions of the same category across different media as close as possible. The maximum mean discrepancy (MMD) loss is adopted to measure the difference between sample feature distributions: with h_i, h_j denoting the feature distributions of the same category in any two different media i and j, the distance between the two distributions is denoted g_mmd(h_i, h_j):
g_mmd(h_i, h_j) = ||E[φ(h_i)] − E[φ(h_j)]||_H²
where φ(·) is a mapping function and the subscript H indicates that the distance is measured after φ(·) maps the data into a reproducing kernel Hilbert space (RKHS). The distribution loss L_dis is the sum of the distribution differences over all categories for any two different media:
L_dis = Σ_{c=1}^{C} Σ_{i≠j} g_mmd(h_i^c, h_j^c)
where c denotes a certain class and C the total number of classes.
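The MMD distance can be sketched in its simplest form, taking φ as the identity map instead of an RKHS feature map (a simplification for illustration; a kernelized estimator would be used in practice):

```python
import numpy as np

def linear_mmd(x, y):
    """Squared MMD with phi(h) = h: squared distance between sample means.
    x, y: arrays of shape (n_samples, feature_dim) from two media."""
    return float(np.sum((x.mean(axis=0) - y.mean(axis=0)) ** 2))

a = np.arange(12.0).reshape(4, 3)   # toy features of one class in medium i
b = a + 1.0                         # same class in medium j, shifted by 1
```

Identical distributions give zero; a constant shift of 1 in each of the 3 dimensions gives 3.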
Step 4, carrying out weighted summation on the proprietary features and the public features of the cross-media data to obtain the final combined feature f:
f=α*fpro+(1-α)*fcom
wherein f isproIs a characteristic peculiar to each input, fcomIs the common feature of each input and α is the weight corresponding to the unique feature.
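The weighted fusion f = α·f_pro + (1 − α)·f_com is a one-line computation; the value of α and the toy feature vectors below are illustrative assumptions:

```python
import numpy as np

alpha = 0.7                        # weight of the media-specific feature (assumed)
f_pro = np.array([1.0, 0.0, 2.0])  # specific feature from the media network
f_com = np.array([0.0, 1.0, 2.0])  # common feature from the shared network
f = alpha * f_pro + (1 - alpha) * f_com
```

α interpolates between trusting only the specific feature (α = 1) and only the common feature (α = 0).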
Step 5, the cosine distance is used to measure the similarity between features of different media and produce the ranked result. The mean average precision (mAP) is adopted as the retrieval evaluation metric; the higher the mAP, the better the retrieval performance. Tables 1 and 2 compare the fine-grained cross-media retrieval results of the present invention with the existing methods described in documents [1] to [8]:
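Cosine-similarity ranking and the average-precision metric can be sketched as follows; the tiny query and gallery vectors are illustrative stand-ins for learned features:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_precision(relevant):
    """AP of a ranked boolean relevance list; mAP is the mean over all queries."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / max(hits, 1)

# Rank gallery items (of any media type) against a query feature
query = np.array([1.0, 0.0])
gallery = {"a": np.array([0.9, 0.1]),
           "b": np.array([0.0, 1.0]),
           "c": np.array([1.0, 0.05])}
ranked = sorted(gallery, key=lambda k: cosine_sim(query, gallery[k]), reverse=True)
ap = average_precision([1, 0, 1])  # relevant items returned at ranks 1 and 3
```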
[1] Xiangteng He, Yuxin Peng, Liu Xie: A New Benchmark and Approach for Fine-grained Cross-media Retrieval. ACM Multimedia 2019: 1740-1748.
[2] Xin Huang, Yuxin Peng, Mingkuan Yuan: MHTN: Modal-adversarial Hybrid Transfer Network for Cross-modal Retrieval. CoRR abs/1708.04308 (2017).
[3] Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, Heng Tao Shen: Adversarial Cross-Modal Retrieval. ACM Multimedia 2017: 154-162.
[4] Xiaohua Zhai, Yuxin Peng, Jianguo Xiao: Learning Cross-Media Joint Representation With Sparse and Semi-supervised Regularization. IEEE Trans. Circuits Syst. Video Techn. 24(6): 965-978 (2014).
[5] Devraj Mandal, Kunal N. Chaudhury, Soma Biswas: Generalized Semantic Preserving Hashing for N-Label Cross-Modal Retrieval. CVPR 2017: 2633-2641.
[6] Yuxin Peng, Xin Huang, Jinwei Qi: Cross-Media Shared Representation by Hierarchical Learning with Multiple Deep Networks. IJCAI 2016: 3846-3853.
[7] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, Xiaodong He: Stacked Cross Attention for Image-Text Matching. ECCV (4) 2018: 212-228.
[8] Jiuxiang Gu, Jianfei Cai, Shafiq R. Joty, Li Niu, Gang Wang: Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval With Generative Models. CVPR 2018: 7181-7189.
table 1 shows the bimodal fine-grained cross-media search results on a PKU FG-XMedia dataset using the present invention and the existing method, wherein "-" indicates no search result:
table 2 shows the results of a multi-modal fine-grained cross-media search on a PKU FG-XMedia dataset using the methods provided by the present invention and existing methods:
As can be seen from the tables, the invention achieves the best performance in both bimodal and multimodal fine-grained cross-media retrieval, demonstrating the effectiveness of combining the media-specific networks with the common network. Moreover, the four introduced loss functions jointly account for intra-class, inter-class and inter-media differences, fully validating the effectiveness of the designed method.
Claims (7)
1. A fine-grained cross-media retrieval method based on a multi-model network is characterized by comprising the following specific steps:
step 1, acquiring a cross-media data set, and preprocessing the cross-media data set to acquire cross-media data;
step 2, respectively extracting the proprietary features of each media data;
step 3, extracting the public characteristics of each media data;
step 4, carrying out weighted summation on the proprietary features and the public features of the cross-media data to obtain final combined features;
and 5, measuring the similarity between different media characteristics by utilizing the cosine distance and sequencing the media characteristics according to the similarity.
2. The multi-model network based fine-grained cross-media retrieval method of claim 1, wherein the cross-media data comprises image, video, text and audio data.
3. The fine-grained cross-media retrieval method based on the multi-model network as claimed in claim 1, wherein the specific method for respectively extracting the proprietary features of each media data in step 2 is as follows:
extracting image and video data characteristics by adopting a characteristic extractor based on bilinear CNN;
pre-training word vectors by adopting a word2vec model, and extracting text data features by adopting a feature extractor of a bidirectional long-short term network based on attention;
and extracting audio data features by using a VGG-based feature extractor.
4. The fine-grained cross-media retrieval method based on the multi-model network as claimed in claim 3, wherein the specific process of extracting image and video data features by using the feature extractor based on bilinear CNN is as follows:
the image or video data passes through two CNN networks to obtain different features, and the bilinear features b(l, i) are obtained through the bilinear operation, with the specific formula:
b(l, i) = E_a(l, i)^T E_b(l, i)
where E_a and E_b are the feature-extraction functions of the two CNN networks;
the bilinear features at all positions l ∈ L are aggregated into one feature through a pooling function, which is specifically:
P(i) = Σ_{l∈L} b(l, i)
the aggregated feature is passed through a fully connected layer to obtain the final image and video features.
5. The fine-grained cross-media retrieval method based on the multi-model network as claimed in claim 3, wherein the specific process of extracting text data features by the feature extractor of the attention-based bidirectional long-short term network is as follows:
the input layer receives an input sentence T = [t_1, t_2, …, t_n], where t_i is the ith word in the sentence and n is the sentence length;
the embedding layer converts each word t_i in the sentence into a word vector e_i through the pre-trained word-vector matrix W;
a deeper feature representation is obtained through a bidirectional LSTM network, where the output for the ith word is the element-wise sum of the forward and backward outputs:
h_i = h_i^fw + h_i^bw
the set of output vectors is denoted H = [h_1, h_2, …, h_n];
the attention layer assigns larger weights to important features, yielding a weight matrix γ over H, where the weight matrix γ is expressed as:
γ = softmax(w^T tanh(H))
where w is a parameter vector obtained by training;
the output vector set H of the LSTM layer is multiplied by the weight matrix γ obtained by the attention layer to obtain the sentence feature representation f, namely:
f = Hγ^T
the final text feature representation f_pro is obtained through a softmax classifier.
6. The fine-grained cross-media retrieval method based on the multi-model network as claimed in claim 1, wherein the specific method for extracting the common features of each media data is as follows:
constructing a public network based on FGCrossNet;
the four media data simultaneously pass through the convolution layer, the pooling layer and the full connection layer, and the common characteristics of the four media are learned at one time through the loss function.
7. The fine-grained cross-media retrieval method based on multi-model network according to claim 6, wherein the loss function comprises:
the cross-entropy loss function, specifically:
L_cro = l(I) + l(T) + l(V) + l(A)
wherein I, T, V, A represent image, text, video and audio data respectively, and l is the cross-entropy loss function of all samples in a single medium;
the center loss function, specifically:
L_cen = Σ_j ||x_j − c_{y_j}||²
wherein x_j is the feature of the jth sample and c_{y_j} is the feature of the center of the category to which the jth sample belongs;
the quadruplet loss function, specifically:
L_qua = Σ max(0, d(x_a, x_p) − d(x_a, x_m1) + α1) + Σ max(0, d(x_a, x_p) − d(x_m1, x_m2) + α2)
wherein x_a, x_p, x_m1, x_m2 belong to four media types, x_a and x_p belong to the same category, x_m1 and x_m2 belong to different categories, d(·,·) represents the L2 distance, and α1, α2 are set hyper-parameters;
the distribution loss function, specifically:
L_dis = Σ_{c=1}^{C} Σ_{i≠j} g_mmd(h_i^c, h_j^c)
wherein c represents a class, C represents the total number of classes, and g_mmd(h_i^c, h_j^c) indicates the distance between the two distributions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010526211.4A CN111782833B (en) | 2020-06-09 | 2020-06-09 | Fine granularity cross-media retrieval method based on multi-model network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010526211.4A CN111782833B (en) | 2020-06-09 | 2020-06-09 | Fine granularity cross-media retrieval method based on multi-model network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111782833A (application publication) | 2020-10-16 |
CN111782833B (granted publication) | 2023-12-19 |
Family
ID=72755874
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010526211.4A Active CN111782833B (en) | 2020-06-09 | 2020-06-09 | Fine granularity cross-media retrieval method based on multi-model network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111782833B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220337A (en) * | 2017-05-25 | 2017-09-29 | 北京大学 | A kind of cross-media retrieval method based on mixing migration network |
CN107480206A (en) * | 2017-07-25 | 2017-12-15 | 杭州电子科技大学 | A kind of picture material answering method based on multi-modal low-rank bilinearity pond |
2020
- 2020-06-09 CN CN202010526211.4A patent/CN111782833B/en active Active
Non-Patent Citations (2)
Title |
---|
Liu Hu: "Fine-grained recognition of vehicle models from multiple angles based on multi-scale bilinear convolutional neural network", Journal of Computer Applications (《计算机应用》) * |
Luo Jianhao et al.: "A survey of fine-grained image classification based on deep convolutional features", Acta Automatica Sinica (《自动化学报》) * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112101045A (en) * | 2020-11-02 | 2020-12-18 | 北京淇瑀信息科技有限公司 | Multi-mode semantic integrity recognition method and device and electronic equipment |
CN112101045B (en) * | 2020-11-02 | 2021-12-14 | 北京淇瑀信息科技有限公司 | Multi-mode semantic integrity recognition method and device and electronic equipment |
CN113537145A (en) * | 2021-06-28 | 2021-10-22 | 青鸟消防股份有限公司 | Method, device and storage medium for rapidly solving false detection and missed detection in target detection |
CN113537145B (en) * | 2021-06-28 | 2024-02-09 | 青鸟消防股份有限公司 | Method, device and storage medium for rapidly solving false detection and missing detection in target detection |
CN113486833A (en) * | 2021-07-15 | 2021-10-08 | 北京达佳互联信息技术有限公司 | Multi-modal feature extraction model training method and device and electronic equipment |
CN113704537A (en) * | 2021-10-28 | 2021-11-26 | 南京码极客科技有限公司 | Fine-grained cross-media retrieval method based on multi-scale feature union |
CN113779282A (en) * | 2021-11-11 | 2021-12-10 | 南京码极客科技有限公司 | Fine-grained cross-media retrieval method based on self-attention and generation countermeasure network |
CN113792167A (en) * | 2021-11-11 | 2021-12-14 | 南京码极客科技有限公司 | Cross-media cross-retrieval method based on attention mechanism and modal dependence |
Also Published As
Publication number | Publication date |
---|---|
CN111782833B (en) | 2023-12-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111782833B (en) | Fine granularity cross-media retrieval method based on multi-model network | |
Miech et al. | Learning a text-video embedding from incomplete and heterogeneous data | |
Jiang et al. | Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching. | |
CN109960763B (en) | Photography community personalized friend recommendation method based on user fine-grained photography preference | |
CN105718532B (en) | A kind of across media sort methods based on more depth network structures | |
CN110059217A (en) | A kind of image text cross-media retrieval method of two-level network | |
CN107203636B (en) | Multi-video abstract acquisition method based on hypergraph master set clustering | |
CN113268633B (en) | Short video recommendation method | |
CN105701225B (en) | A kind of cross-media retrieval method based on unified association hypergraph specification | |
Guo et al. | Jointly learning of visual and auditory: A new approach for RS image and audio cross-modal retrieval | |
Mei et al. | Patch based video summarization with block sparse representation | |
CN108388639B (en) | Cross-media retrieval method based on subspace learning and semi-supervised regularization | |
Lee et al. | Face image retrieval using sparse representation classifier with gabor-lbp histogram | |
Abdul-Rashid et al. | Shrec’18 track: 2d image-based 3d scene retrieval | |
Li et al. | Exploiting hierarchical activations of neural network for image retrieval | |
Yang et al. | STA-TSN: Spatial-temporal attention temporal segment network for action recognition in video | |
CN108427740A (en) | A kind of Image emotional semantic classification and searching algorithm based on depth measure study | |
Meng et al. | Few-shot image classification algorithm based on attention mechanism and weight fusion | |
Zhang et al. | Exploiting mid-level semantics for large-scale complex video classification | |
CN105701227B (en) | A kind of across media method for measuring similarity and search method based on local association figure | |
CN113779283B (en) | Fine-grained cross-media retrieval method with deep supervision and feature fusion | |
Qin et al. | SHREC’22 track: Sketch-based 3D shape retrieval in the wild | |
Hu et al. | Multimodal learning via exploring deep semantic similarity | |
CN113239159B (en) | Cross-modal retrieval method for video and text based on relational inference network | |
CN113934835A (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||