CN112800249A - Fine-grained cross-media retrieval method based on a generative adversarial network - Google Patents

Fine-grained cross-media retrieval method based on a generative adversarial network

Info

Publication number
CN112800249A
Authority
CN
China
Prior art keywords
media
fine
data
frame
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110133925.3A
Other languages
Chinese (zh)
Inventor
Tang Zhenmin (唐振民)
Hong Jin (洪瑾)
Yao Yazhou (姚亚洲)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202110133925.3A priority Critical patent/CN112800249A/en
Publication of CN112800249A publication Critical patent/CN112800249A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/435 Filtering based on additional data, e.g. user or group profiles
    • G06F16/438 Presentation of query results
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483 Retrieval using metadata automatically derived from the content
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods

Abstract

The invention discloses a fine-grained cross-media retrieval method based on a generative adversarial network, comprising the following steps: (1) performing a noise-frame filtering operation on the video data; (2) constructing a two-channel feature extraction network; (3) constructing a fine-grained cross-media retrieval model based on a generative adversarial network, the model comprising a generator and a media discriminator, with a common feature representation of the cross-media data obtained after adversarial training; (4) performing similarity measurement on the features in the common feature space, ranking by similarity, and returning the data most similar to the input fine-grained data. The invention can accurately learn the discriminative features of fine-grained data, can quickly and accurately retrieve media data of all types with high similarity to the input data, and can be widely applied in many multimedia fields.

Description

Fine-grained cross-media retrieval method based on a generative adversarial network
Technical Field
The invention relates to a multimedia analysis and retrieval method, and in particular to a fine-grained cross-media retrieval method based on a generative adversarial network.
Background
In the big data era, multimedia data in many forms, such as images, videos, audio and text, jointly convey the information to be expressed and bring great convenience to people's work and life. Retrieval confined to a single media type, however, cannot relate these heterogeneous data to one another, so cross-media retrieval has received increasing attention from researchers. Cross-media retrieval takes data of one media type as input and returns data of other media types with the same or similar semantics. For example, a user may retrieve videos or audio of a bird with an image of that bird. Conventional cross-media retrieval, however, is coarse-grained. When a user inputs an image of a California gull, a search engine treats it merely as a "bird" and returns many images, videos and other media data of "birds", including data for other gull sub-species. When the user wants to search only for the California gull and does not want the results adulterated with other bird sub-species, such coarse-grained cross-media retrieval fails to meet the user's original intention. Only fine-grained cross-media retrieval can find, with reasonable accuracy, the precise subcategory the user is looking for and return only the media information relevant to the California gull. As the demand for diversified and specialized retrieval keeps growing, fine-grained cross-media retrieval is therefore replacing traditional coarse-grained and single-media retrieval and has become a hot topic of current research.
The fine-grained cross-media retrieval algorithm can associate the various media information of a precisely classified object, achieving flexible and accurate retrieval. Fine-grained cross-media retrieval remains challenging for two reasons: (1) small inter-class differences: similar subcategories belonging to the same species may have similar global appearance (images or video), similar textual descriptions (text) and similar sounds (audio), which makes similar fine-grained subcategories difficult to distinguish; (2) heterogeneity differences: data of different media types have inconsistent distributions and feature representations, so the similarity between cross-media data is difficult to measure directly.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a fine-grained cross-media retrieval method based on a generative adversarial network, which can accurately learn the discriminative features of fine-grained data and quickly and accurately retrieve the media data of all types that are most similar to the input data.
The technical scheme is as follows: the fine-grained cross-media retrieval method based on a generative adversarial network according to the invention comprises the following steps:
(1) carrying out noise frame filtering operation on the video data;
(2) constructing a dual-channel feature extraction network, extracting features of text data by using a self-attention-based text feature extractor, and extracting features of other media data by using a deep neural network;
(3) constructing a common feature space learning module and a media discriminator, having the generator and the discriminator perform adversarial training, and finally obtaining a common feature representation;
(4) calculating the feature similarity between the media data of each type using a similarity measure, and outputting the results ranked by similarity.
Preferably, the cross-media data includes image, video, audio and text data.
Preferably, in step (1), the noise-frame filtering operation uses a video frame filtering method based on feature-space clustering; the specific process is as follows:
n frames are cut from each video at the same interval to form an original key frame, N is more than or equal to 25 and less than or equal to 50, and then the deep neural network is used as a feature extractor to extract the features of the original key frame, and the features are expressed as follows:
fv={m1,m2,…,mN}
where v denotes the total number of videos in the video data set, miFeatures representing an ith frame of image;
The $\ell_2$ norm is used to compute, for each video frame, the sum of the distances between its features and the features of all other frames:
$$d_j = \sum_{i=1,\, i \neq j}^{N} \| m_i - m_j \|_2$$
where $d_j$ denotes the sum of the distances from all other video frames to $m_j$;
to d1、d2、…、dNSorting is performed, assuming dk(k ∈ 1,2, …, N) is minimum, then the kth frame is set as the center frame; calculating dkAverage value of (a)k
$$a_k = \frac{d_k}{N-1}$$
Let the threshold be $T = \lambda a_k$, with λ taking a value between 0.001 and 0.01. The distance from each frame to the center frame is then examined: if the distance from the current frame to the center frame is greater than T, the frame is discarded; otherwise it is kept as a valid frame.
Video features are extracted from a fixed number of frames sampled at equal intervals, and these frames inevitably include some that are irrelevant to the target object, such as opening and closing credits. Such noise frames distort the feature distribution of the input data, and the network parameters adapt to that distorted distribution, which harms retrieval accuracy. A video frame filtering method based on feature-space clustering can therefore be used to remove the video frames irrelevant to the target object and improve retrieval accuracy.
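To make the procedure concrete, the following is a minimal Python/PyTorch sketch of the feature-space-clustering frame filter described above. The function name, the use of PyTorch, and the reading of $a_k$ as the mean distance from the center frame to the remaining frames are assumptions made for illustration, not details fixed by the patent text.

```python
import torch

def filter_noise_frames(frame_features: torch.Tensor, lam: float = 0.005):
    """Feature-space-clustering frame filter (illustrative sketch).

    frame_features: (N, D) tensor, one feature vector per sampled frame.
    Returns the indices of frames kept as valid frames.
    """
    N = frame_features.size(0)
    # Pairwise L2 distances between frame features: dist[i, j] = ||m_i - m_j||_2
    dist = torch.cdist(frame_features, frame_features, p=2)
    # d_j: sum of distances from frame j to all other frames
    d = dist.sum(dim=0)
    # The frame with the smallest total distance is taken as the center frame
    k = torch.argmin(d).item()
    # One reading of a_k: the mean distance from the center frame to the other frames
    a_k = d[k] / (N - 1)
    T = lam * a_k  # threshold T = lambda * a_k, lambda in [0.001, 0.01]
    # Keep frames whose distance to the center frame does not exceed T
    keep = (dist[k] <= T).nonzero(as_tuple=True)[0]
    return keep
```

In use, `frame_features` would be the N per-frame feature vectors produced by the deep feature extractor, and the returned indices select the valid frames passed on to the retrieval network.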
Preferably, in step (2), the deep neural network is a ResNet50 network pre-trained on ImageNet for extracting image, video and audio data features.
Preferably, in step (2), the text feature extraction method based on the self-attention mechanism is specifically as follows: given a sentence of n words, where n depends on the length of the text, the word-embedding matrix E of the sentence is expressed as:
$$E = (e_1, e_2, \ldots, e_n)$$
where $e_i$ is the word-embedding vector of the i-th word of the sentence;
A bidirectional LSTM (long short-term memory) network is used to capture the dependencies between adjacent words in the sentence; the hidden-layer output $h_t$ at time t is expressed as:
$$h_t = \mathrm{LSTM}(e_t, h_{t-1}, h_{t+1})$$
h is the set of all hidden layer output results for bi-directional LSTM, denoted as:
$$H = (h_1, h_2, \ldots, h_n)$$
Linear superposition is used to reduce the feature dimension; the reduced representation is denoted H';
the self-attention mechanism takes the entire LSTM hidden state H' as input and then outputs a weight matrix M, represented as:
$$M = s(W_1(g(W_2 H'^{T})))$$
where $W_1$ and $W_2$ are the parameter matrices of two fully connected layers, and $s(\cdot)$ and $g(\cdot)$ are activation functions;
the hidden state H' of LSTM is multiplied by the weight matrix M to obtain an embedded text matrix L, which is expressed as:
$$L = H'^{T} M$$
L is the feature of the text data obtained through the text-processing channel; its dimension is then adjusted by several fully connected layers to be consistent with the features of the other three media types.
The invention adopts a text feature extraction method based on a self-attention mechanism: the recurrent neural network overcomes the weak-label characteristic of text, and combining the self-attention mechanism's accuracy in capturing important features with the recurrent network's handling of sequential data makes it possible to find the most important features among many descriptive words.
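The following is a minimal PyTorch sketch of a text feature extractor with the structure described above (word embeddings, a bidirectional LSTM, a two-layer self-attention branch producing the weight matrix M, and a projection aligning the text feature with the other media). The class name and layer sizes are illustrative assumptions rather than values fixed by this paragraph.

```python
import torch
import torch.nn as nn

class SelfAttentionTextExtractor(nn.Module):
    """Bi-LSTM + self-attention text feature extractor (illustrative sketch)."""

    def __init__(self, embed_dim=100, hidden_dim=1024, attn_dim=128, out_dim=2048):
        super().__init__()
        # Bidirectional LSTM over word embeddings: output dim = 2 * hidden_dim
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Two fully connected layers producing the attention weights M
        self.w2 = nn.Linear(2 * hidden_dim, attn_dim)
        self.w1 = nn.Linear(attn_dim, 1)
        # Projection aligning the text feature with the other media features
        self.proj = nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, embeddings):                 # embeddings: (B, n, embed_dim)
        h, _ = self.lstm(embeddings)               # H': (B, n, 2*hidden_dim)
        scores = self.w1(torch.relu(self.w2(h)))   # (B, n, 1)
        m = torch.softmax(scores, dim=1)           # attention weights over the n words
        text_feat = (h * m).sum(dim=1)             # weighted sum: (B, 2*hidden_dim)
        return self.proj(text_feat)                # (B, out_dim), aligned with other media
```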
Preferably, in step (3), the specific process of constructing the common feature space learning module and the media discriminator is as follows: the input of the common feature space learning module, which consists of several fully connected layers, is the output of the two-channel feature extractor, and a series of constraints keeps the features in the common feature space semantically meaningful; the features in this space are then input to a media discriminator, which also consists of several fully connected layers and determines the media type of each feature under a media-label constraint.
In the fine-grained cross-media retrieval task, the intra-class distance between samples of the same sub-species across different media types must be reduced while the inter-class distance between different sub-species of the same media type is enlarged; directly extracting features from each media type with an ordinary convolutional neural network would therefore yield inconsistent features. The invention completes this normalization process with a generative adversarial network, which not only improves classification accuracy but also makes the features of different media data under the same category label as similar as possible, i.e. it yields a common feature representation space for the fine-grained subcategories.
Preferably, said set of constraints comprises in particular the following:
the generator classification constraint specifically includes:
$$L_{cla} = \sum_{i} \mathcal{L}_{ce}(h_i, y_i)$$
where $\mathcal{L}_{ce}(\cdot,\cdot)$ denotes the cross-entropy loss function, $h_i$ denotes a feature, and $y_i$ denotes its semantic category label;
the invention can learn the fine-grained semantic features of various media by adopting generator classification constraints.
The distance constraint specifically includes:
$$L_{dis} = \|S_I - S_V\|_2 + \|S_I - S_A\|_2 + \|S_I - S_T\|_2 + \|S_T - S_V\|_2 + \|S_T - S_A\|_2 + \|S_A - S_V\|_2$$
where S denotes the features obtained by each type of media data through the common feature representation space (the subscripts I, V, A and T denote image, video, audio and text respectively);
the present invention employs distance constraints to ensure that the intra-class sample features are as close as possible.
The ranking constraint is specifically:
[The ranking loss $L_{ran}$ is given as an equation image in the original; it is defined using the Euclidean distance $d(\cdot,\cdot)$ and two boundary thresholds $a_1$ and $a_2$.]
the classification loss of the media discriminator is specifically:
[The classification loss $L_{adv}$ of the media discriminator is given as an equation image in the original.]
In the formula, $\theta_D$ denotes the parameters of the fully connected layers constituting the media discriminator, and m denotes the one-hot media category label.
The invention adopts the ranking constraint to ensure that the features of samples of the same subcategory are closer together, while the features of samples of different subcategories exhibit sparsity.
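The constraints above can be written compactly in code. The sketch below implements the distance constraint as given, and treats the generator classification constraint and the media discriminator loss as cross-entropy losses, which matches the stated use of a cross-entropy function and one-hot media labels; the ranking constraint is omitted because its exact formula is not reproduced in the text. Function names are illustrative.

```python
import torch
import torch.nn.functional as F

def distance_loss(s_i, s_v, s_a, s_t):
    """L_dis: the six pairwise L2 distances between the four media features in the common space."""
    feats = [s_i, s_v, s_a, s_t]
    loss = 0.0
    for x in range(len(feats)):
        for y in range(x + 1, len(feats)):
            loss = loss + torch.norm(feats[x] - feats[y], p=2, dim=1).mean()
    return loss

def classification_loss(logits, labels):
    """L_cla: cross-entropy between classifier outputs and fine-grained semantic category labels."""
    return F.cross_entropy(logits, labels)

def media_discriminator_loss(media_logits, media_labels):
    """L_adv: cross-entropy between discriminator outputs and media-type labels."""
    return F.cross_entropy(media_logits, media_labels)
```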
Preferably, in step (3), the specific process of the adversarial training is as follows: during adversarial training, a min-max game is adopted, minimizing the loss of the generator while maximizing the loss of the media discriminator to obtain the optimal model of the algorithm. Parameters are assigned to each loss function, and the loss function of the adversarial phase, $E(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D)$, is defined as:
$$E(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D) = L_{dis}(\theta_{i,v,a}, \theta_t, \theta_S) + L_{cla}(\theta_{i,v,a}, \theta_t, \theta_S) + L_{ran}(\theta_{i,v,a}, \theta_t, \theta_S) - \lambda L_{adv}(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D)$$
where $\theta_{i,v,a}$ and $\theta_t$ denote the parameters of the two-channel feature extractor and $\theta_S$ denotes the parameters of the fully connected layers constituting the common feature representation space;
the countermeasure training of the minimum and maximum game is specifically represented as:
$$(\hat{\theta}_{i,v,a}, \hat{\theta}_t, \hat{\theta}_S) = \arg\min_{\theta_{i,v,a},\, \theta_t,\, \theta_S} E(\theta_{i,v,a}, \theta_t, \theta_S, \hat{\theta}_D)$$
$$\hat{\theta}_D = \arg\max_{\theta_D} E(\hat{\theta}_{i,v,a}, \hat{\theta}_t, \hat{\theta}_S, \theta_D)$$
where the parameters $\theta_{i,v,a}$, $\theta_t$ and $\theta_S$ minimize the first equation and the parameter $\theta_D$ maximizes the second, which constitutes the adversarial training process of the generative adversarial model.
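One common way to realize this min-max game is an alternating update: a discriminator step that maximizes E with respect to $\theta_D$ (i.e. trains the discriminator to classify media types) and a generator step that minimizes E with respect to $\theta_{i,v,a}$, $\theta_t$ and $\theta_S$. The sketch below assumes the loss helpers from the previous sketch, a batch of paired four-media data, and generator/discriminator modules; the ranking term is again omitted and all names are illustrative.

```python
import torch

def train_step(generator, discriminator, batch, opt_g, opt_d, lam=1.0):
    """One alternating update of the min-max game (illustrative sketch)."""
    # Generator: raw media data -> common-space features + subcategory logits
    s_i, s_v, s_a, s_t, class_logits = generator(batch["image"], batch["video"],
                                                 batch["audio"], batch["text"])
    feats = torch.cat([s_i, s_v, s_a, s_t], dim=0)
    # 0 = image, 1 = video, 2 = audio, 3 = text, in the same order as the concatenation
    media_labels = torch.arange(4, device=feats.device).repeat_interleave(s_i.size(0))
    # paired samples share one fine-grained subcategory label (assumption about the batch)
    class_labels = batch["labels"].repeat(4)

    # Discriminator step: maximize E, i.e. train the discriminator to classify media types
    d_loss = media_discriminator_loss(discriminator(feats.detach()), media_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: minimize L_dis + L_cla - lambda * L_adv (L_ran omitted in this sketch)
    g_loss = (distance_loss(s_i, s_v, s_a, s_t)
              + classification_loss(class_logits, class_labels)
              - lam * media_discriminator_loss(discriminator(feats), media_labels))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return g_loss.item(), d_loss.item()
```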
Preferably, in step (4), the similarity measure is:
$$\mathrm{sim}(\vec{A}, \vec{B}) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
where $\vec{A}$ and $\vec{B}$ denote two feature vectors and $A_i$ and $B_i$ denote their respective components.
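For step (4), the similarity computation and ranking can be illustrated with a short sketch that scores a gallery of common-space features against a query feature using cosine similarity; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def rank_by_cosine(query_feat: torch.Tensor, gallery_feats: torch.Tensor, top_k: int = 10):
    """Return the indices of the top-k gallery items most similar to the query.

    query_feat:    (D,) common-space feature of the query sample.
    gallery_feats: (M, D) common-space features of candidate media data (any media type).
    """
    sims = F.cosine_similarity(query_feat.unsqueeze(0), gallery_feats, dim=1)  # (M,)
    return torch.topk(sims, k=min(top_k, gallery_feats.size(0))).indices
```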
Advantageous effects: compared with the prior art, the invention has the following notable effects: (1) the video frame filtering method based on feature-space clustering removes video frames irrelevant to the target object, improves retrieval accuracy and solves the noise-frame problem of video data; (2) the self-attention mechanism captures effective text features better, and the recurrent neural network extracts the context, overcoming the weak-label characteristic of text; (3) learning a common feature representation space for cross-media data with a generative adversarial network allows the common representation of cross-media data to be learned better; (4) the invention comprehensively considers both the heterogeneity differences and the fine-grained subcategory differences among cross-media data, and designs a fine-grained cross-media retrieval algorithm based on a generative adversarial network that greatly improves retrieval accuracy.
Drawings
FIG. 1 is a flow diagram of the fine-grained cross-media retrieval method based on a generative adversarial network;
FIG. 2 is a schematic diagram of a video frame filtering algorithm based on feature space clustering;
FIG. 3 is a block diagram of the fine-grained cross-media retrieval method based on a generative adversarial network;
FIG. 4 is a schematic diagram of a text feature extractor based on a self-attention mechanism.
DETAILED DESCRIPTION OF EMBODIMENT(S) OF THE INVENTION
The present invention will be described in detail with reference to examples.
As shown in FIG. 1, the fine-grained cross-media retrieval method based on a generative adversarial network comprises the following steps:
(1) Carry out the noise-frame filtering operation on the video data. The experiments use the PKU FG-XMedia dataset, the first dataset to date that contains data of four media types (image, video, audio and text) for the fine-grained cross-media retrieval task; each media type covers 200 fine-grained bird sub-species. The dataset is first preprocessed to obtain the cross-media data.
Specifically, the preprocessing is as follows: each image is resized to 448 × 448; each text is converted into an n × d matrix, where d is the symbol-embedding dimension and equals 100; in addition, the length of all text sentences is fixed to 448 characters, so the matrix of each text is of size 448 × 100, with sentences shorter than 448 characters padded with zero rows and sentences longer than 448 characters truncated at the 448th character; the audio data in the original dataset are processed by the short-time Fourier transform so that each audio clip is represented by a spectrogram; for each video, 40 frames are extracted at equal intervals and these 40 frames are then denoised.
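A minimal sketch of this preprocessing under the stated settings (448 × 448 images, 448-symbol text matrices with 100-dimensional embeddings, STFT spectrograms for audio, and 40 equally spaced video frames). The library choices (PIL, NumPy, torch.stft), the STFT parameters and the padding id are assumptions; the patent does not prescribe a toolkit.

```python
import torch
import numpy as np
from PIL import Image

def preprocess_image(path):
    """Resize an image to 448 x 448 and convert it to a float tensor in [0, 1]."""
    img = Image.open(path).convert("RGB").resize((448, 448))
    return torch.from_numpy(np.asarray(img)).permute(2, 0, 1).float() / 255.0

def preprocess_text(token_ids, embedding, max_len=448):
    """Pad or truncate a sentence to 448 symbols and embed it into a 448 x 100 matrix."""
    token_ids = token_ids[:max_len] + [0] * max(0, max_len - len(token_ids))  # 0 = padding id
    return embedding(torch.tensor(token_ids))      # (448, 100) if the embedding dim is 100

def preprocess_audio(waveform, n_fft=1024, hop=256):
    """Short-time Fourier transform; the magnitude spectrogram represents the audio."""
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs()

def sample_video_frames(frames, num=40):
    """Take `num` frames at equal intervals from the decoded frame list."""
    idx = np.linspace(0, len(frames) - 1, num=num).astype(int)
    return [frames[i] for i in idx]
```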
As shown in FIG. 2, the 40 frame images are input into a ResNet50 network pre-trained on ImageNet to extract the features of each video frame, expressed as:
$$f_v = \{m_1, m_2, \ldots, m_{40}\}$$
where v denotes the total number of videos in the video dataset (8000) and $m_i$ denotes the features of the i-th frame;
The $\ell_2$ norm is used to compute the sum of the distances between the features of each video frame and the features of all other frames:
$$d_j = \sum_{i=1,\, i \neq j}^{40} \| m_i - m_j \|_2$$
where $d_j$ denotes the sum of the distances from all other video frames to $m_j$;
to d1、d2、…、d40Sorting is performed, assuming dk(k ∈ 1,2, …,40) is minimum, then the kth frame is taken as the center frame;
calculating dkAverage value of (a)k
Figure BDA0002926377130000062
Let the threshold be $T = \lambda a_k$, with λ taking a value between 0.001 and 0.01; the distance from each frame to the center frame is examined, and if the distance from the current frame to the center frame is greater than T, the frame is discarded; otherwise it is kept as a valid frame;
(2) as shown in fig. 3, a two-channel feature extraction network is constructed, a self-attention-based text feature extractor is used for extracting features of text data, and a deep neural network is used for extracting features of other media data;
the method comprises the following steps of extracting the features of text data by a text feature extractor based on self attention, wherein the specific process comprises the following steps:
as shown in FIG. 4, given a sentence having n words, where n is 448, the word embedding matrix E for the sentence is represented as:
E=(e1,e2,…,en)
in the formula, eiA word-embedding representation vector representing an ith word of the sentence;
A bidirectional LSTM is adopted to capture the dependencies between adjacent words in the sentence, with a hidden-layer dimension of 2048; the hidden-layer output $h_t$ at time t is expressed as:
$$h_t = \mathrm{LSTM}(e_t, h_{t-1}, h_{t+1})$$
H is the set of all hidden-layer outputs of the bidirectional LSTM, with dimension 448 × 4096:
$$H = (h_1, h_2, \ldots, h_n)$$
Linear superposition is used to reduce the feature dimension; the reduced representation is denoted H', of size 448 × 2048;
The self-attention mechanism takes the entire LSTM hidden state H' as input, and the output weight matrix M is expressed as:
$$M = \mathrm{softmax}(W_1(g(W_2 H'^{T})))$$
where $W_1$ and $W_2$ are the parameter matrices of fully connected layers with dimensions (2048, 128) and (128, 1) respectively, $g(\cdot)$ is the ReLU activation function, and M has dimension 448 × 1 after the fully connected layers and the activation function;
The hidden state H' of the LSTM is multiplied by the weight matrix M to obtain the embedded text matrix $f_T$:
$$f_T = H'^{T} M$$
$f_T$ is the feature of the text data obtained through the text-processing channel, with dimensionality 2048 × 1;
The other channel is the feature extractor for the image, video and audio data, here a ResNet50 network; an average pooling layer with kernel size 14 and stride 1 is appended after the last convolutional layer of ResNet50, yielding features $f_I$, $f_V$ and $f_A$ for the image, video and audio data respectively, each of dimension 2048 × 1;
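A sketch of this image/video/audio channel: a torchvision ResNet50 pre-trained on ImageNet with its classification head removed, followed by average pooling with kernel size 14 and stride 1, which yields a 2048-dimensional feature for a 448 × 448 input. The use of torchvision and the class name are assumptions.

```python
import torch.nn as nn
from torchvision import models

class VisualAudioExtractor(nn.Module):
    """ResNet50 backbone + 14x14 average pooling -> 2048-d feature (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50(pretrained=True)          # pre-trained on ImageNet
        # Keep everything up to (and including) the last convolutional stage
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        # Average pooling with kernel size 14 and stride 1 over the 14x14 feature map
        self.pool = nn.AvgPool2d(kernel_size=14, stride=1)

    def forward(self, x):                    # x: (B, 3, 448, 448) image, video frame or spectrogram
        fmap = self.features(x)              # (B, 2048, 14, 14) for 448x448 inputs
        return self.pool(fmap).flatten(1)    # (B, 2048)
```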
(3) Construct the common feature space learning module and the media discriminator, have the generator and the discriminator perform adversarial training, and finally obtain the common feature representation; the specific process is as follows:
As shown in FIG. 3, after the two-channel feature extraction network, a common feature space learning module is constructed from 2 linear layers, and the semantic information of the features in the common feature space is preserved under a series of constraints; $f_I$, $f_V$, $f_A$ and $f_T$ generate $S_I$, $S_V$, $S_A$ and $S_T$ after passing through the common feature space. The features in this space are then input to the media discriminator, which consists of 6 fully connected layers and judges the media type of each feature under the media-label constraint;
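A minimal sketch of the common feature space module (2 linear layers) and the media discriminator (6 fully connected layers) just described; the hidden widths, activation functions and class names are assumptions not specified in this paragraph.

```python
import torch.nn as nn

class CommonFeatureSpace(nn.Module):
    """Two linear layers projecting a 2048-d media feature into the common space (sketch)."""
    def __init__(self, in_dim=2048, common_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, common_dim), nn.ReLU(),
            nn.Linear(common_dim, common_dim),
        )

    def forward(self, x):
        return self.net(x)

class MediaDiscriminator(nn.Module):
    """Six fully connected layers predicting which of the 4 media types a feature came from (sketch)."""
    def __init__(self, in_dim=2048, hidden=512, num_media=4):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(5):                      # 5 hidden layers + 1 output layer = 6 FC layers
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        layers.append(nn.Linear(dim, num_media))
        self.net = nn.Sequential(*layers)

    def forward(self, s):
        return self.net(s)                      # media-type logits
```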
the constraints used include:
A softmax function is attached to the last fully connected layer of the generator to serve as a classifier, finally outputting a set of probability values of dimension 200, from which the class of the sample predicted by the model can be determined; the generator classification constraint ensures that the network learns semantic information from the class labels, specifically:
$$L_{cla} = \sum_{i} \mathcal{L}_{ce}(\hat{S}_i, y_i)$$
where $\mathcal{L}_{ce}(\cdot,\cdot)$ is the cross-entropy loss function, $\hat{S}_i$ is the representation of $S_I$, $S_V$, $S_A$ and $S_T$ after the classifier, and $y_i$ is the semantic category label;
The distance constraint ensures that the features of samples within a class are as close as possible, specifically:
$$L_{dis} = \|S_I - S_V\|_2 + \|S_I - S_A\|_2 + \|S_I - S_T\|_2 + \|S_T - S_V\|_2 + \|S_T - S_A\|_2 + \|S_A - S_V\|_2$$
The ranking constraint ensures that the features of samples of the same subcategory are closer together while the features of samples of different subcategories exhibit sparsity, specifically:
[The ranking loss is given as an equation image in the original; it is defined using the Euclidean distance $d(\cdot,\cdot)$ and two boundary thresholds $a_1$ and $a_2$, whose values are 1 and 0.5 respectively.]
The classification constraint of the media discriminator is:
[The classification loss of the media discriminator is given as an equation image in the original.]
In the formula, $\theta_D$ denotes the parameters of the fully connected layers constituting the media discriminator, and m denotes the one-hot media category label.
During adversarial training, the loss of the generator is minimized while the loss of the media discriminator is maximized to obtain the optimal model of the algorithm; this process is also called the min-max game. Parameters are assigned to each loss function, and the loss function of the adversarial phase, $E(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D)$, is defined as:
$$E(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D) = L_{dis}(\theta_{i,v,a}, \theta_t, \theta_S) + L_{cla}(\theta_{i,v,a}, \theta_t, \theta_S) + L_{ran}(\theta_{i,v,a}, \theta_t, \theta_S) - \lambda L_{adv}(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D)$$
where $\theta_{i,v,a}$ and $\theta_t$ are the parameters of the two-channel feature extractor and $\theta_S$ is the parameter of the fully connected layers constituting the common feature representation space;
The adversarial training of the min-max game is specifically expressed as:
$$(\hat{\theta}_{i,v,a}, \hat{\theta}_t, \hat{\theta}_S) = \arg\min_{\theta_{i,v,a},\, \theta_t,\, \theta_S} E(\theta_{i,v,a}, \theta_t, \theta_S, \hat{\theta}_D)$$
$$\hat{\theta}_D = \arg\max_{\theta_D} E(\hat{\theta}_{i,v,a}, \hat{\theta}_t, \hat{\theta}_S, \theta_D)$$
where the parameters $\theta_{i,v,a}$, $\theta_t$ and $\theta_S$ minimize the first equation and $\theta_D$ maximizes the second; this is the adversarial training process of the generative adversarial model. After the adversarial training, common features that effectively reduce the heterogeneity differences of the cross-media data while learning the fine-grained features are obtained.
(4) Calculate the feature similarity between the media data of each type using the similarity measure and output the results ranked by similarity. The cosine distance is used to compute the feature similarity between the media data so that they can be ranked; the cosine distance is computed as:
$$\mathrm{sim}(\vec{A}, \vec{B}) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
where $\vec{A}$ and $\vec{B}$ denote two feature vectors and $A_i$ and $B_i$ denote their respective components.
The fine-grained cross-media retrieval method of the invention is compared with 6 existing cross-media retrieval methods, with mean average precision (mAP) as the retrieval evaluation metric; the higher the mAP value, the better the retrieval performance. The 6 cross-media retrieval methods are as follows:
[1] He X, Peng Y, Huang X, et al. A new benchmark and approach for fine-grained cross-media retrieval[C]. ACM International Conference on Multimedia, 2019: 1740-1748.
[2] Huang X, Peng Y, Yuan M. MHTN: Modal-adversarial hybrid transfer network for cross-modal retrieval[J]. IEEE Transactions on Cybernetics, 2018.
[3] Wang B, Yang Y, Xu X, et al. Adversarial Cross-Modal Retrieval[C]. ACM International Conference on Multimedia, 2017: 154-162.
[4] Zhai X, Peng Y, Xiao J. Learning Cross-Media Joint Representation With Sparse and Semisupervised Regularization[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2014, 24(6): 965-978.
[5] Mandal D, Chaudhury K, Biswas S. Generalized semantic preserving hashing for n-label cross-modal retrieval[C]. IEEE Conference on Computer Vision and Pattern Recognition, 2017: 4076-4084.
[6] Peng Y, Huang X, Qi J. Cross-Media Shared Representation by Hierarchical Learning with Multiple Deep Networks[C]. International Joint Conference on Artificial Intelligence, 2016: 3846-3853.
the results of performing dual-media fine-grained cross-media search on a PKU FG-XMedia dataset by using the present invention and 6 cross-media search methods are shown in Table 1, and the conventional methods [1] to [6] in Table 1 respectively correspond to the methods described in the above documents [1] to [6 ].
TABLE 1 Comparison of dual-media fine-grained cross-media retrieval results
[Table 1 is provided as an image in the original publication.]
The results of the multimedia fine-grained cross-media retrieval on the PKU FG-XMedia dataset using the present invention and the 6 cross-media retrieval methods are shown in Table 2, where the existing methods [1] to [6] correspond to the methods described in documents [1] to [6] above.
TABLE 2 Comparison of multimedia fine-grained cross-media retrieval results
Method           I→All   T→All   V→All   A→All   Average
The invention    0.627   0.311   0.491   0.568   0.499
FGCrossNet [1]   0.549   0.196   0.416   0.485   0.412
MHTN [2]         0.208   0.142   0.237   0.341   0.232
GSPH [5]         0.387   0.103   0.075   0.312   0.219
JRL [4]          0.344   0.080   0.069   0.275   0.192
CMDN [6]         0.321   0.071   0.016   0.229   0.159
ACMR [3]         0.245   0.039   0.041   0.279   0.151
As can be seen from Tables 1 and 2, the method of the invention performs best in both dual-media and multimedia fine-grained cross-media retrieval, which demonstrates the effectiveness of using a generative adversarial network to learn the common feature representation of cross-media data in the method of the invention, and also verifies the effectiveness of the video frame denoising algorithm and of the various constraint conditions.

Claims (6)

1. A fine-grained cross-media retrieval method based on a generative adversarial network, characterized in that the method comprises the following steps:
(1) carrying out noise frame filtering operation on the video data;
(2) constructing a dual-channel feature extraction network, extracting features of text data by using a text feature extractor based on a self-attention mechanism, and extracting features of other media data by using a deep neural network;
(3) constructing a common feature space learning module and a media discriminator, having the generator and the discriminator perform adversarial training, and finally obtaining a common feature representation;
(4) calculating the feature similarity between the media data of each type using a similarity measure, and outputting the results ranked by similarity.
2. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 1, characterized in that: in step (1), the noise-frame filtering operation uses a video frame filtering method based on feature-space clustering, the specific process being as follows:
n frames are cut from each video at the same interval to form an original key frame, N is more than or equal to 25 and less than or equal to 50, and then the deep neural network is used as a feature extractor to extract the features of the original key frame, and the features are expressed as follows:
fv={m1,m2,...,mN}
where v denotes the total number of videos in the video data set, miFeatures representing an ith frame of image;
using ζ2Norm calculation between the features of each video frame and the features of all other framesIs given as:
Figure FDA0002926377120000011
in the formula (d)jRepresenting all other video frames to mjThe sum of the distances of (a);
to d1、d2、…、dNSorting is performed, assuming dk(k belongs to 1, 2.. and N) is minimum, then the k frame is taken as a central frame; calculating dkAverage value of (a)k
Figure FDA0002926377120000012
Let λ akIs a threshold value T, and the value of lambda is 0.001-0.01; and judging the distance from each frame to the central frame, if the distance from the current frame to the central frame is more than T, discarding the current frame, otherwise, keeping the current frame as a valid frame.
3. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 1, characterized in that: in step (2), the text feature extraction method based on the self-attention mechanism is specifically as follows:
given a sentence of n words, where n depends on the length of the text, the word-embedding matrix E of the sentence is expressed as:
$$E = (e_1, e_2, \ldots, e_n)$$
where $e_i$ is the word-embedding vector of the i-th word of the sentence;
a bidirectional LSTM (long short-term memory) network is adopted to capture the dependencies between adjacent words in the sentence, and the hidden-layer output $h_t$ at time t is expressed as:
$$h_t = \mathrm{LSTM}(e_t, h_{t-1}, h_{t+1})$$
H is the set of all hidden-layer outputs of the bidirectional LSTM, denoted as:
$$H = (h_1, h_2, \ldots, h_n)$$
linear superposition is used to reduce the feature dimension, the reduced representation being denoted H';
the self-attention mechanism takes the entire LSTM hidden state H' as input and then outputs a weight matrix M, represented as:
$$M = s(W_1(g(W_2 H'^{T})))$$
where $W_1$ and $W_2$ are the parameter matrices of two fully connected layers, and $s(\cdot)$ and $g(\cdot)$ are activation functions;
the hidden state H' of LSTM is multiplied by the weight matrix M to obtain an embedded text matrix L, which is expressed as:
$$L = H'^{T} M$$
L is the feature of the text data obtained through the text-processing channel, and its dimension is then adjusted by several fully connected layers to be consistent with the features of the other three media types.
4. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 1, characterized in that: in step (3), the specific process of constructing the common feature space learning module and the media discriminator is as follows: the input of the common feature space learning module, which consists of several fully connected layers, is the output of the two-channel feature extractor, and a series of constraints keeps the features in the common feature space semantically meaningful; the features in this space are then input to a media discriminator, which also consists of several fully connected layers and determines the media type of each feature under a media-label constraint.
5. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 4, characterized in that: the series of constraints specifically includes the following:
the generator classification constraint specifically includes:
$$L_{cla} = \sum_{i} \mathcal{L}_{ce}(h_i, y_i)$$
where $\mathcal{L}_{ce}(\cdot,\cdot)$ denotes the cross-entropy loss function, $h_i$ denotes a feature, and $y_i$ denotes its semantic category label;
the distance constraint specifically includes:
$$L_{dis} = \|S_I - S_V\|_2 + \|S_I - S_A\|_2 + \|S_I - S_T\|_2 + \|S_T - S_V\|_2 + \|S_T - S_A\|_2 + \|S_A - S_V\|_2$$
where S denotes the features obtained by each type of media data through the common feature representation space;
the ordering constraint specifically includes:
Figure FDA0002926377120000033
wherein d (·,. cndot.) represents the Euclidean distance, a1And a2Represents a boundary threshold;
the classification loss of the media discriminator is specifically:
[The classification loss of the media discriminator is given as an equation image in the original.]
In the formula, $\theta_D$ denotes the parameters of the fully connected layers constituting the media discriminator, and m denotes the one-hot media category label.
6. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 1, characterized in that: in step (3), the specific process of the adversarial training is as follows: during adversarial training, a min-max game is adopted, i.e. the loss of the generator is minimized while the loss of the media discriminator is maximized to obtain the optimal model of the algorithm;
parameters are assigned to each loss function, and the loss function of the adversarial phase, $E(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D)$, is defined as:
$$E(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D) = L_{dis}(\theta_{i,v,a}, \theta_t, \theta_S) + L_{cla}(\theta_{i,v,a}, \theta_t, \theta_S) + L_{ran}(\theta_{i,v,a}, \theta_t, \theta_S) - \lambda L_{adv}(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D)$$
where $\theta_{i,v,a}$ and $\theta_t$ denote the parameters of the two-channel feature extractor and $\theta_S$ denotes the parameters of the fully connected layers constituting the common feature representation space;
the adversarial training of the min-max game is specifically expressed as:
$$(\hat{\theta}_{i,v,a}, \hat{\theta}_t, \hat{\theta}_S) = \arg\min_{\theta_{i,v,a},\, \theta_t,\, \theta_S} E(\theta_{i,v,a}, \theta_t, \theta_S, \hat{\theta}_D)$$
$$\hat{\theta}_D = \arg\max_{\theta_D} E(\hat{\theta}_{i,v,a}, \hat{\theta}_t, \hat{\theta}_S, \theta_D)$$
where the parameters $\theta_{i,v,a}$, $\theta_t$ and $\theta_S$ minimize the first equation and the parameter $\theta_D$ maximizes the second.
CN202110133925.3A 2021-02-01 2021-02-01 Fine-grained cross-media retrieval method based on a generative adversarial network Withdrawn CN112800249A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110133925.3A CN112800249A (en) 2021-02-01 2021-02-01 Fine-grained cross-media retrieval method based on a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110133925.3A CN112800249A (en) 2021-02-01 2021-02-01 Fine-grained cross-media retrieval method based on a generative adversarial network

Publications (1)

Publication Number Publication Date
CN112800249A true CN112800249A (en) 2021-05-14

Family

ID=75813192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110133925.3A Withdrawn CN112800249A (en) 2021-02-01 2021-02-01 Fine-grained cross-media retrieval method based on generation of countermeasure network

Country Status (1)

Country Link
CN (1) CN112800249A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204674A (en) * 2021-07-05 2021-08-03 杭州一知智能科技有限公司 Video-paragraph retrieval method and system based on local-overall graph inference network
CN113641800A (en) * 2021-10-18 2021-11-12 中国铁道科学研究院集团有限公司科学技术信息研究所 Text duplicate checking method, device and equipment and readable storage medium
CN113704537A (en) * 2021-10-28 2021-11-26 南京码极客科技有限公司 Fine-grained cross-media retrieval method based on multi-scale feature union
CN113779282A (en) * 2021-11-11 2021-12-10 南京码极客科技有限公司 Fine-grained cross-media retrieval method based on self-attention and generation countermeasure network
CN113779284A (en) * 2021-11-11 2021-12-10 南京码极客科技有限公司 Method for constructing entity-level public feature space based on fine-grained cross-media retrieval



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20210514