CN112800249A - Fine-grained cross-media retrieval method based on a generative adversarial network - Google Patents
Fine-grained cross-media retrieval method based on a generative adversarial network
- Publication number
- CN112800249A (application number CN202110133925.3A)
- Authority
- CN
- China
- Prior art keywords
- media
- fine
- data
- frame
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
- G06F16/435—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
- G06F16/438—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/483—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a fine-grained cross-media retrieval method based on a generative adversarial network, comprising the following steps: (1) performing a noise-frame filtering operation on the video data; (2) constructing a dual-channel feature extraction network; (3) constructing a fine-grained cross-media retrieval model based on a generative adversarial network, the model comprising a generator and a media discriminator, a common feature representation of the cross-media data being obtained after adversarial training; (4) performing similarity measurement on the features in the common feature space, ranking by similarity, and returning the data most similar to the input fine-grained data. The invention can accurately learn the discriminative features of fine-grained data, can quickly and accurately retrieve media data of every type with high similarity to the input data, and can be widely applied in many multimedia fields.
Description
Technical Field
The invention relates to a multimedia analysis and retrieval method, and in particular to a fine-grained cross-media retrieval method based on a generative adversarial network.
Background
In the big-data era, multimedia data in forms such as images, videos, audio and text jointly convey the information to be expressed, bringing great convenience to people's work and life. Retrieving semantically related content across these heterogeneous media types, however, is difficult, so cross-media retrieval is receiving increasing attention from scholars. Cross-media retrieval takes data of one media type as input and returns data of other media types with the same or similar semantics. For example, a user may retrieve a bird video or audio clip with an image of that bird. Conventional cross-media retrieval, however, is coarse-grained. When a user inputs an image of a California gull, a search engine treats it merely as a "bird" and returns many images, videos and other media data of "birds", including information about other gull subspecies; when the user wants to search only for the California gull and does not want the results adulterated with other bird subspecies, such coarse-grained cross-media retrieval fails to meet the user's original intent. Only fine-grained cross-media retrieval can accurately locate the precise subcategory the user wants and return only the media information relevant to the California gull. As people's demand for diversified and specialized retrieval keeps growing, fine-grained cross-media retrieval is replacing traditional coarse-grained, single-media retrieval and has become a hot spot of current research.
A fine-grained cross-media retrieval algorithm can associate the various media information of a precisely classified object, achieving flexible and accurate retrieval. Fine-grained cross-media retrieval remains challenging, however, for two reasons: (1) small inter-class differences: similar subcategories of the same species may have similar global appearance (images or video), similar textual descriptions (text) and similar sounds (audio), making similar fine-grained subcategories hard to distinguish; (2) heterogeneity differences: data of different media types have inconsistent distributions and feature representations, so similarity between cross-media data is difficult to measure directly.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a fine-grained cross-media retrieval method based on a generative adversarial network that can accurately learn the discriminative features of fine-grained data and quickly and accurately retrieve media data of every type with high similarity to the input data.
The technical scheme is as follows: the fine-grained cross-media retrieval method based on a generative adversarial network comprises the following steps:
(1) carrying out noise frame filtering operation on the video data;
(2) constructing a dual-channel feature extraction network, extracting features of text data by using a self-attention-based text feature extractor, and extracting features of other media data by using a deep neural network;
(3) constructing a common feature space learning module and a media discriminator so that the generator and the discriminator undergo adversarial training, finally obtaining a common feature representation;
(4) calculating the feature similarity between media data of each type with a similarity measurement tool and outputting the results ranked by similarity.
Preferably, the cross-media data includes image, video, audio and text data.
Preferably, in step (1), the noise-frame filtering operation adopts a video frame filtering method based on feature-space clustering; the specific process is as follows:
n frames are cut from each video at the same interval to form an original key frame, N is more than or equal to 25 and less than or equal to 50, and then the deep neural network is used as a feature extractor to extract the features of the original key frame, and the features are expressed as follows:
fv={m1,m2,…,mN}
where v denotes the total number of videos in the video data set, miFeatures representing an ith frame of image;
using ζ2Norm calculation of the characteristics of each video frame and the characteristics of all other framesThe sum of the distances between features, expressed as:
in the formula (d)jRepresenting all other video frames to mjThe sum of the distances of (a);
to d1、d2、…、dNSorting is performed, assuming dk(k ∈ 1,2, …, N) is minimum, then the kth frame is set as the center frame; calculating dkAverage value of (a)k:
Let λ akIs a threshold value T, and the value of lambda is 0.001-0.01; and judging the distance from each frame to the central frame, if the distance from the current frame to the central frame is more than T, discarding the current frame, otherwise, keeping the current frame as a valid frame.
Video data are sampled as a fixed number of equally spaced frames, which inevitably include frames irrelevant to the target object, such as title and credit sequences. Such noise frames distort the feature distribution of the input data, the network parameters adapt to this distorted distribution, and retrieval accuracy suffers. A video frame filtering method based on feature-space clustering can therefore be adopted to remove frames irrelevant to the target object and improve retrieval accuracy.
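The filtering procedure above can be sketched in a few lines of numpy. This is an illustrative reconstruction: it assumes each frame's distance to the center frame is compared against the threshold T = λ·a_k, and the function name and default λ are assumptions, not the patent's exact implementation.

```python
import numpy as np

def filter_noise_frames(frame_feats, lam=0.005):
    """Drop frames far from the cluster centre in feature space.

    frame_feats: (N, D) array, one deep feature per sampled frame.
    lam: threshold coefficient (the patent suggests 0.001-0.01).
    Returns the indices of the frames kept as valid.
    """
    n = len(frame_feats)
    # pairwise L2 distances; d_j = sum of distances from frame j to all others
    dists = np.linalg.norm(frame_feats[:, None, :] - frame_feats[None, :, :], axis=-1)
    d = dists.sum(axis=1)
    k = int(np.argmin(d))        # centre frame: smallest total distance
    a_k = d[k] / (n - 1)         # average distance to the centre frame
    t = lam * a_k                # threshold T = lambda * a_k
    # keep frames whose distance to the centre frame does not exceed T
    return [j for j in range(n) if dists[j, k] <= t or j == k]
```

With five identical frames and one outlier, only the cluster around the center frame survives.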
Preferably, in step (2), the deep neural network is a ResNet50 network pre-trained on ImageNet for extracting image, video and audio data features.
Preferably, in step (2), the self-attention-based text feature extraction method is specifically as follows: given a sentence of n words, the size of n depending on the length of the text, the word embedding matrix E of the sentence is represented as:
E = (e_1, e_2, …, e_n)
where e_i denotes the word-embedding vector of the i-th word of the sentence;
a bidirectional LSTM (long short-term memory) network captures the dependency between adjacent words in the sentence; the hidden-layer output h_t at time t is expressed as:
h_t = LSTM(e_t, h_{t−1}, h_{t+1})
H is the set of all hidden-layer outputs of the bidirectional LSTM, denoted as:
H = (h_1, h_2, …, h_n)
linear superposition is used to reduce the feature dimension; the reduced representation is denoted H′;
the self-attention mechanism takes the entire LSTM hidden state H′ as input and outputs a weight matrix M, represented as:
M = s(W_1(g(W_2 H′^T)))
where W_1 and W_2 denote the parameter matrices of two fully connected layers, and s(·) and g(·) denote activation functions;
the hidden state H′ of the LSTM is multiplied by the weight matrix M to obtain the embedded text matrix L, expressed as:
L = H′^T M
L is the feature of the text data obtained through the text processing channel; several fully connected layers then adjust its dimension to be consistent with the features of the other three media.
The invention adopts a self-attention-based text feature extraction method: the recurrent network handles the weakly labeled nature of text, combining the precision of the self-attention mechanism in locating important features with the recurrent network's handling of sequence data to find the most important features among many descriptive words.
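The attention pooling applied on top of the bi-LSTM states can be sketched as follows. The LSTM itself is omitted; the ReLU and softmax choices follow the embodiment, while the function and variable names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_pool(h, w1, w2):
    """Pool n hidden states into one text feature via self-attention.

    h  : (n, d) hidden states H' of the bi-LSTM (d = 2048 in the patent).
    w2 : (da, d) first projection; w1 : (1, da) second projection.
    Returns the (d,) feature L = H'^T M and the (n,) attention weights M.
    """
    scores = w1 @ np.maximum(w2 @ h.T, 0.0)    # (1, n): W1(ReLU(W2 H'^T))
    m = softmax(scores, axis=1).ravel()        # attention weight per word
    return h.T @ m, m                          # weighted sum of hidden states
```

The weights M sum to one, so the output is a convex combination of the word-level hidden states.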
Preferably, in step (3), the common feature space learning module and the media discriminator are constructed as follows: the common feature space learning module, composed of several fully connected layers, takes the output of the dual-channel feature extractor as input, and a series of constraints make the features in the common feature space retain semantic information; the features in this space are then input to a media discriminator, itself composed of multiple fully connected layers, which judges the media type of each feature using a media-label constraint.
In a fine-grained cross-media retrieval task, the retrieval process must reduce the intra-class distance of a subcategory across media types while enlarging the inter-class distance between different subcategories within the same media type; directly extracting features from each medium with an ordinary convolutional neural network therefore yields inconsistent features. The invention completes this normalization process with a generative adversarial network, which not only improves classification accuracy but also makes the features of different media data under the same category label as similar as possible, i.e., yields a common feature representation space for fine-grained subcategories.
Preferably, the set of constraints specifically comprises the following:
the generator classification constraint specifically includes:
L_cla = (1/n) Σ_{i=1}^{n} ℓ_ce(h_i, y_i)
where ℓ_ce denotes the cross-entropy loss function, h_i denotes a feature, and y_i denotes its semantic category label;
the invention can learn the fine-grained semantic features of various media by adopting generator classification constraints.
The distance constraint specifically includes:
L_dis = ||S_I − S_V||_2 + ||S_I − S_A||_2 + ||S_I − S_T||_2 + ||S_T − S_V||_2 + ||S_T − S_A||_2 + ||S_A − S_V||_2
where S_X denotes the feature obtained by the data of medium X through the common feature representation space;
the present invention employs distance constraints to ensure that the intra-class sample features are as close as possible.
The ordering constraint specifically includes:
where d(·,·) denotes the Euclidean distance and a_1 and a_2 denote boundary thresholds;
the classification loss of the media discriminator is specifically:
where θ_D denotes the parameters of the fully connected layers constituting the media discriminator and m denotes the one-hot media category label.
The invention adopts the ordering constraint to make the features of samples of the same subcategory closer while keeping the features of different subcategories sparse.
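The distance constraint can be written down directly from its formula; the ordering constraint's exact formula is not reproduced in the text, so the sketch below assumes a standard two-margin triplet form built from the Euclidean distance d(·,·) and the boundary thresholds a_1 and a_2 — an assumption, not the patent's exact loss.

```python
import numpy as np
from itertools import combinations

def distance_loss(feats):
    """L_dis: sum of pairwise distances between the four media features.

    feats: dict mapping media name -> (d,) common-space feature of one
    sample of the same subcategory (S_I, S_V, S_A, S_T in the patent).
    """
    return sum(np.linalg.norm(feats[a] - feats[b])
               for a, b in combinations(sorted(feats), 2))

def ranking_loss(anchor, pos, neg, a1=1.0, a2=0.5):
    """Assumed hinge-style sketch of the ordering constraint: pull the
    same-subcategory sample (pos) within margin a2 and push the
    different-subcategory sample (neg) at least a1 further away."""
    d_pos = np.linalg.norm(anchor - pos)
    d_neg = np.linalg.norm(anchor - neg)
    return max(0.0, a1 + d_pos - d_neg) + max(0.0, d_pos - a2)
```

When all four media features of a sample coincide, L_dis vanishes; when the positive coincides with the anchor and the negative is far away, the assumed ranking term is zero as well.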
Preferably, in step (3), the specific process of adversarial training is as follows: during adversarial training a min–max game is adopted, minimizing the loss of the generator while maximizing the loss of the media discriminator to obtain the optimal model; assigning parameters to each loss function, the loss function E(θ_{i,v,a}, θ_t, θ_S, θ_D) of the adversarial phase is defined as:
E(θ_{i,v,a}, θ_t, θ_S, θ_D) = L_dis(θ_{i,v,a}, θ_t, θ_S) + L_cla(θ_{i,v,a}, θ_t, θ_S) + L_ran(θ_{i,v,a}, θ_t, θ_S) − λL_adv(θ_{i,v,a}, θ_t, θ_S, θ_D)
where θ_{i,v,a} and θ_t denote the parameters of the dual-channel feature extractor, and θ_S denotes the parameters of the fully connected layers constituting the common feature representation space;
the adversarial training of the min–max game is specifically represented as:
(θ̂_{i,v,a}, θ̂_t, θ̂_S) = arg min over (θ_{i,v,a}, θ_t, θ_S) of E(θ_{i,v,a}, θ_t, θ_S, θ̂_D)
θ̂_D = arg max over θ_D of E(θ̂_{i,v,a}, θ̂_t, θ̂_S, θ_D)
where the parameters θ_{i,v,a}, θ_t, θ_S minimize the objective and the parameter θ_D maximizes it; this is the adversarial training process of the generative adversarial model.
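The min–max alternation can be illustrated with a toy objective in which one parameter set descends E while the other ascends it. This is a conceptual sketch of the game, not the patent's optimizer; function names and the gradient-callable interface are assumptions.

```python
import numpy as np

def adversarial_step(theta_g, theta_d, grad_e_g, grad_e_d, lr=0.05):
    """One round of the min-max game: the generator-side parameters
    descend E while the discriminator parameters ascend it.

    grad_e_g / grad_e_d: callables returning dE/dtheta at the current
    parameters for the generator and discriminator sides respectively.
    """
    theta_g = theta_g - lr * grad_e_g(theta_g, theta_d)  # minimize E
    theta_d = theta_d + lr * grad_e_d(theta_g, theta_d)  # maximize E
    return theta_g, theta_d
```

On the saddle objective E(g, d) = (g − 1)² − (d − 2)², the iterates converge to the equilibrium g = 1, d = 2.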
Preferably, in step (4), the similarity measurement tool is the cosine similarity:
cos(A, B) = Σ_i A_i·B_i / (√(Σ_i A_i²) · √(Σ_i B_i²))
where A and B denote two feature vectors and A_i and B_i denote their respective components.
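A minimal implementation of the cosine similarity used to rank retrieval results in the common feature space (the function name is illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(A, B) = sum_i A_i * B_i / (||A|| * ||B||)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Identical directions score 1, orthogonal directions score 0, so sorting candidates by this value descending gives the ranked output of step (4).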
Has the advantages that: compared with the prior art, the invention has the following notable effects: (1) the video frame filtering method based on feature-space clustering removes video frames irrelevant to the target object, improving retrieval accuracy and solving the noise-frame problem of video data; (2) the self-attention mechanism better captures effective text features, and the recurrent neural network extracts context, overcoming the weakly labeled nature of text; (3) learning a common feature representation space for cross-media data with a generative adversarial network better learns common representations of cross-media data; (4) by jointly considering the heterogeneity differences and fine-grained subcategory differences among cross-media data, the designed fine-grained cross-media retrieval algorithm based on a generative adversarial network greatly improves retrieval accuracy.
Drawings
FIG. 1 is a flow diagram of the fine-grained cross-media retrieval method based on a generative adversarial network;
FIG. 2 is a schematic diagram of a video frame filtering algorithm based on feature space clustering;
FIG. 3 is a block diagram of the fine-grained cross-media retrieval method based on a generative adversarial network;
FIG. 4 is a schematic diagram of a text feature extractor based on a self-attention mechanism.
DETAILED DESCRIPTION OF EMBODIMENTS
The present invention will be described in detail with reference to examples.
As shown in fig. 1, the fine-grained cross-media retrieval method based on a generative adversarial network comprises the following steps:
(1) carrying out a noise-frame filtering operation on the video data. The PKU FG-XMedia dataset is used — the first dataset to date containing data of four media types (images, video, audio and text) for fine-grained cross-media retrieval tasks, with each media type covering 200 fine-grained bird subcategories. The dataset is first preprocessed to obtain the cross-media data.
Specifically, the preprocessing is as follows: images are resized to 448 × 448; each text is converted into an n × d matrix, where d = 100 is the word-embedding dimension; the length of every text is fixed at 448 tokens, so each text becomes a 448 × 100 matrix — if a sentence has fewer than 448 tokens, zero rows are appended, and if it exceeds 448 tokens it is truncated at the 448th; the audio data in the original dataset are processed by short-time Fourier transform so that each clip is represented as a spectrogram; from each video, 40 frames are extracted at equal intervals and then passed through the denoising operation.
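The text-side preprocessing (fixing every text to a 448 × 100 matrix by zero-padding or truncation) can be sketched as follows; the function name is illustrative.

```python
import numpy as np

def pad_text_matrix(embeddings, length=448, dim=100):
    """Fix a text to a (length, dim) matrix as in the embodiment:
    zero rows are appended to short sentences, and sentences longer
    than `length` tokens are truncated at the `length`-th token."""
    out = np.zeros((length, dim), dtype=np.float32)
    n = min(len(embeddings), length)
    out[:n] = embeddings[:n]
    return out
```

Both a 3-token and a 500-token input come out as 448 × 100.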
As shown in fig. 2, the 40 frame images are input into a ResNet50 network pre-trained on ImageNet to extract the feature of each video frame, expressed as:
f_v = {m_1, m_2, …, m_40}
where v denotes the total number of videos in the video dataset (8000) and m_i denotes the feature of the i-th frame;
the ℓ2 norm is used to compute the sum of distances between each video frame's feature and the features of all other frames, expressed as:
d_j = Σ_{i=1, i≠j}^{40} ||m_i − m_j||_2
where d_j denotes the sum of distances from all other video frames to m_j;
d_1, d_2, …, d_40 are sorted; assuming d_k (k ∈ {1, 2, …, 40}) is the minimum, the k-th frame is taken as the center frame;
the average value a_k of d_k over the remaining frames is computed:
a_k = d_k / 39
λ·a_k is taken as a threshold T, with λ between 0.001 and 0.01; the distance from each frame to the center frame is judged: if the distance from the current frame to the center frame exceeds T, the frame is discarded, otherwise it is kept as a valid frame;
(2) as shown in fig. 3, a dual-channel feature extraction network is constructed; a self-attention-based text feature extractor extracts the features of the text data, and a deep neural network extracts the features of the other media data;
the method comprises the following steps of extracting the features of text data by a text feature extractor based on self attention, wherein the specific process comprises the following steps:
as shown in FIG. 4, given a sentence having n words, where n is 448, the word embedding matrix E for the sentence is represented as:
E=(e1,e2,…,en)
in the formula, eiA word-embedding representation vector representing an ith word of the sentence;
adopting a bidirectional LSTM to obtain the dependency relationship between adjacent words in a sentence, wherein the dimension of the hidden layer is 2048, and the output data h of the hidden layer at the time ttCan be expressed as:
ht=LSTM(et,ht-1,ht+1)
h is the set of all hidden layer output results for bi-directional LSTM, with dimensions 448 × 4096, denoted as:
H=(h1,h2,…,hn)
reducing characteristic dimensionality by adopting linear superposition, and expressing the dimensionality reduced by H ', wherein the size of H' is 448 multiplied by 2048;
the self-attention mechanism takes as input the entire LSTM hidden state H', and then the output weight matrix M is represented as follows:
M=softmax(W1(g(W2H′T)))
W1and W2Parameter matrices representing the fully-connected layers in two dimensions (2048, 128) and (128,1), respectively, g (-) is the ReLU activation function, and the M dimension after passing through the fully-connected layers and the activation function is 448 × 1;
the hidden state H' of the LSTM is multiplied by the weight matrix M to obtain an embedded text matrix fTIt is expressed as:
fT=H′TM
fTis obtained through a text processing channelThe dimensionality of the text data of (1) is 2048 × 1;
the other channel is a feature extractor for extracting image, video and audio data, here a ResNet50 network, followed by an average pooling layer with kernel size of 14 and step size of 1 after the last convolutional layer of ResNet50, resulting in image, video and audio data with features f, respectivelyI、fVAnd fAThe dimensions are 2048 × 1;
(3) constructing a common feature space learning module and a media discriminator so that the generator and the discriminator undergo adversarial training, finally obtaining the common feature representation; the specific process is as follows:
As shown in fig. 3, after the dual-channel feature extraction network, a common feature space learning module is constructed from 2 linear layers, and a series of constraints keep the semantic information of the features in the common feature space; f_I, f_V, f_A and f_T become S_I, S_V, S_A and S_T after passing through the common feature space; the features in this space are then input to a media discriminator composed of 6 fully connected layers, which judges the media type of each feature using the media-label constraint;
the constraints used include:
a softmax function is attached to the last fully connected layer of the generator as a classifier, finally outputting a set of probability values of dimension 200, from which the class the model predicts for the sample can be read; the generator classification constraint ensures that the network learns semantic information through the class labels, specifically:
L_cla = (1/n) Σ_{i=1}^{n} ℓ_ce(ŝ_i, y_i)
where ℓ_ce is the cross-entropy loss function, ŝ_i is the representation of S_I, S_V, S_A or S_T after the classifier, and y_i is the semantic category label;
the distance constraint ensures that intra-class sample features are as close as possible, specifically:
L_dis = ||S_I − S_V||_2 + ||S_I − S_A||_2 + ||S_I − S_T||_2 + ||S_T − S_V||_2 + ||S_T − S_A||_2 + ||S_A − S_V||_2
the ordering constraint ensures that the features of samples of the same subcategory can be closer while the features of different subcategories remain sparse; d(·,·) denotes the Euclidean distance, and the boundary thresholds a_1 and a_2 take the values 1 and 0.5 respectively;
the classification constraint of the media discriminator judges the media type, where θ_D denotes the parameters of the fully connected layers constituting the media discriminator and m denotes the one-hot media category label.
During adversarial training, the loss of the generator is minimized while the loss of the media discriminator is maximized to obtain the optimal model; this process is also called the min–max game. Assigning parameters to each loss function, the loss function E(θ_{i,v,a}, θ_t, θ_S, θ_D) of the adversarial phase is defined as:
E(θ_{i,v,a}, θ_t, θ_S, θ_D) = L_dis(θ_{i,v,a}, θ_t, θ_S) + L_cla(θ_{i,v,a}, θ_t, θ_S) + L_ran(θ_{i,v,a}, θ_t, θ_S) − λL_adv(θ_{i,v,a}, θ_t, θ_S, θ_D)
where θ_{i,v,a} and θ_t are the parameters of the dual-channel feature extractor, and θ_S is the parameter of the fully connected layers constituting the common feature representation space;
the adversarial training of the min–max game is specifically represented as:
(θ̂_{i,v,a}, θ̂_t, θ̂_S) = arg min over (θ_{i,v,a}, θ_t, θ_S) of E(θ_{i,v,a}, θ_t, θ_S, θ̂_D)
θ̂_D = arg max over θ_D of E(θ̂_{i,v,a}, θ̂_t, θ̂_S, θ_D)
where the parameters θ_{i,v,a}, θ_t, θ_S minimize the objective and the parameter θ_D maximizes it; after adversarial training, common features are obtained that effectively reduce the heterogeneity differences of cross-media data while learning fine-grained features.
(4) The feature similarity between media data of each type is calculated with a similarity measurement tool and the results are output ranked by similarity. The cosine distance is used to compute the feature similarity between media data for ranking; the cosine formula is:
cos(A, B) = Σ_i A_i·B_i / (√(Σ_i A_i²) · √(Σ_i B_i²))
where A and B denote two feature vectors and A_i and B_i denote their respective components.
The fine-grained cross-media retrieval method is compared with 6 existing cross-media retrieval methods, using the mean average precision (mAP) as the retrieval evaluation metric; a higher mAP value indicates a better retrieval result. The 6 cross-media retrieval methods are as follows:
[1] He X, Peng Y, Huang X, et al. A new benchmark and approach for fine-grained cross-media retrieval[C]. ACM International Conference on Multimedia, 2019: 1740-1748.
[2] Huang X, Peng Y, Yuan M. MHTN: Modal-adversarial hybrid transfer network for cross-modal retrieval[J]. IEEE Transactions on Cybernetics, 2018.
[3] Wang B, Yang Y, Xu X, et al. Adversarial cross-modal retrieval[C]. ACM International Conference on Multimedia, 2017: 154-162.
[4] Zhai X, Peng Y, Xiao J. Learning cross-media joint representation with sparse and semisupervised regularization[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2014, 24(6): 965-978.
[5] Mandal D, Chaudhury K, Biswas S. Generalized semantic preserving hashing for n-label cross-modal retrieval[C]. IEEE Conference on Computer Vision and Pattern Recognition, 2017: 4076-4084.
[6] Peng Y, Huang X, Qi J. Cross-media shared representation by hierarchical learning with multiple deep networks[C]. International Joint Conference on Artificial Intelligence, 2016: 3846-3853.
the results of performing dual-media fine-grained cross-media search on a PKU FG-XMedia dataset by using the present invention and 6 cross-media search methods are shown in Table 1, and the conventional methods [1] to [6] in Table 1 respectively correspond to the methods described in the above documents [1] to [6 ].
TABLE 1 Dual-media Fine-grained Cross-media search result comparison
The results of multimedia fine-grained cross-media retrieval on the PKU FG-XMedia dataset with the present invention and the 6 cross-media retrieval methods are shown in Table 2, where the existing methods [1]-[6] correspond to the methods described in documents [1]-[6] above.
TABLE 2 comparison of fine-grained multimedia across-media retrieval results
| Method | I→All | T→All | V→All | A→All | Average |
|---|---|---|---|---|---|
| The invention | 0.627 | 0.311 | 0.491 | 0.568 | 0.499 |
| FGCrossNet [1] | 0.549 | 0.196 | 0.416 | 0.485 | 0.412 |
| MHTN [2] | 0.208 | 0.142 | 0.237 | 0.341 | 0.232 |
| GSPH [5] | 0.387 | 0.103 | 0.075 | 0.312 | 0.219 |
| JRL [4] | 0.344 | 0.080 | 0.069 | 0.275 | 0.192 |
| CMDN [6] | 0.321 | 0.071 | 0.016 | 0.229 | 0.159 |
| ACMR [3] | 0.245 | 0.039 | 0.041 | 0.279 | 0.151 |
As can be seen from Tables 1-2, the method of the invention performs best in both dual-media and multimedia fine-grained cross-media retrieval, demonstrating the effectiveness of learning a common feature representation of cross-media data with a generative adversarial network, and verifying the effectiveness of the video-frame denoising algorithm and of the various constraints.
Claims (6)
1. A fine-grained cross-media retrieval method based on a generative adversarial network, characterized by comprising the following steps:
(1) carrying out noise frame filtering operation on the video data;
(2) constructing a dual-channel feature extraction network, extracting features of text data by using a text feature extractor based on a self-attention mechanism, and extracting features of other media data by using a deep neural network;
(3) constructing a common feature space learning module and a media discriminator so that the generator and the discriminator undergo adversarial training, finally obtaining a common feature representation;
(4) calculating the feature similarity between media data of each type with a similarity measurement tool and outputting the results ranked by similarity.
2. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 1, characterized in that in step (1) the noise-frame filtering operation adopts a video frame filtering method based on feature-space clustering, the specific process being as follows:
N frames are sampled from each video at equal intervals as original key frames, with 25 ≤ N ≤ 50; a deep neural network is then used as a feature extractor on the original key frames, the features being expressed as:
f_v = {m_1, m_2, …, m_N}
where v denotes the total number of videos in the video data set and m_i denotes the feature of the i-th frame;
the ℓ2 norm is used to compute the sum of distances between each video frame's feature and the features of all other frames, given as:
d_j = Σ_{i=1, i≠j}^{N} ||m_i − m_j||_2
where d_j denotes the sum of distances from all other video frames to m_j;
d_1, d_2, …, d_N are sorted; assuming d_k (k ∈ {1, 2, …, N}) is the minimum, the k-th frame is taken as the center frame; the average value a_k of d_k over the remaining frames is computed:
a_k = d_k / (N − 1)
λ·a_k is taken as a threshold T, with λ between 0.001 and 0.01; the distance from each frame to the center frame is judged: if the distance from the current frame to the center frame exceeds T, the frame is discarded, otherwise it is kept as a valid frame.
3. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 1, characterized in that in step (2) the self-attention-based text feature extraction method is specifically as follows:
Given a sentence of n words, where n depends on the length of the text, the word embedding matrix E of the sentence is represented as:

E = (e_1, e_2, ..., e_n)

where e_i denotes the word-embedding vector of the i-th word of the sentence;

A bidirectional LSTM network (long short-term memory network) is adopted to capture the dependencies between adjacent words in the sentence; the output h_t of the hidden layer at time t is then expressed as:

h_t = LSTM(e_t, h_{t-1}, h_{t+1})
H is the set of all hidden-layer outputs of the bidirectional LSTM, denoted as:
H=(h1,h2,...,hn)
Linear superposition is used to reduce the feature dimension; the reduced-dimension representation is denoted H';
The self-attention mechanism takes the entire set of LSTM hidden states H' as input and outputs a weight matrix M, represented as:

M = s(W_1(g(W_2 H'^T)))

where W_1 and W_2 denote the parameter matrices of two fully connected layers, and s(·) and g(·) denote activation functions;

The hidden states H' of the LSTM are multiplied by the weight matrix M to obtain the embedded text matrix L, expressed as:

L = H'^T M
L is the feature of the text data obtained by the text-processing channel; its dimensionality is then adjusted by several fully connected layers to be consistent with the features of the other three media types.
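The self-attention pooling in claim 3 can be sketched numerically. The claim names s(·) and g(·) only as activation functions, so softmax and tanh (the usual choices for self-attentive sentence embedding) are assumed here, and the layer sizes are illustrative:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_text_feature(H_prime, W1, W2):
    """Self-attention pooling over bi-LSTM hidden states (claim 3 sketch).

    H_prime: (n, u) reduced hidden states H' for the n words.
    W2: (da, u) and W1: (r, da) are the two fully connected layers.
    Returns the embedded text matrix L = H'^T M of shape (u, r).
    """
    scores = W1 @ np.tanh(W2 @ H_prime.T)   # (r, n): r attention views over n words
    M = softmax(scores, axis=1).T           # (n, r): weight matrix M, columns sum to 1
    return H_prime.T @ M                    # (u, r): embedded text matrix L
```

Each of the r columns of M is a normalized attention distribution over the n words, so L aggregates the hidden states from r complementary views of the sentence.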
4. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 1, characterized in that: in step (3), the specific process of constructing the common feature space learning module and the media discriminator is as follows: the common feature space learning module, composed of several fully connected layers, takes the output of the two-channel feature extractor as input, and a series of constraints keeps the features in the common feature space semantically informative; the features in this space are then input to the media discriminator, which also consists of several fully connected layers and determines the media type of each feature under a media-label constraint.
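A minimal forward-pass sketch of the two modules; all layer sizes (64-d modality feature, 32-d common space, 4 media types) are illustrative assumptions, not values from the claim:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, weights):
    """A stack of fully connected layers with ReLU between them."""
    for i, W in enumerate(weights):
        x = x @ W
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

# Hypothetical sizes: 64-d modality feature, 32-d common space, 4 media types.
proj_layers = [rng.normal(size=(64, 48)) * 0.1, rng.normal(size=(48, 32)) * 0.1]
disc_layers = [rng.normal(size=(32, 16)) * 0.1, rng.normal(size=(16, 4)) * 0.1]

feat = rng.normal(size=(1, 64))             # output of the two-channel extractor
common = mlp(feat, proj_layers)             # feature in the common space
media_logits = mlp(common, disc_layers)     # media-type scores from the discriminator
print(common.shape, media_logits.shape)
```

In training, the projection layers play the generator role while the discriminator tries to tell which media type produced each common-space feature.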
5. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 4, characterized in that: the series of constraints specifically includes the following:
The generator classification constraint is specifically:

L_cla = (1/n) Σ_{i=1}^{n} ℓ(h_i, y_i)

where ℓ(·,·) denotes the cross-entropy loss function, h_i denotes a feature, and y_i denotes the semantic category label;
The distance constraint is specifically:

L_dis = ||S_I − S_V||_2 + ||S_I − S_A||_2 + ||S_I − S_T||_2 + ||S_T − S_V||_2 + ||S_T − S_A||_2 + ||S_A − S_V||_2

where S denotes the feature obtained by each type of media data through the common feature representation space, with subscripts I, V, A, and T denoting image, video, audio, and text respectively;
The ordering constraint is specifically:

where d(·, ·) denotes the Euclidean distance, and a_1 and a_2 denote margin thresholds;
The classification loss of the media discriminator is specifically:

where θ_D denotes the parameters of the fully connected layers constituting the media discriminator, and m denotes the one-hot media category label.
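Two of the constraints above are concrete enough to sketch directly: the distance constraint L_dis over the six media pairs, and the media discriminator's cross-entropy loss. Representative per-media feature vectors and a softmax output head for the discriminator are assumptions of this sketch:

```python
import numpy as np

MEDIA = ('I', 'V', 'A', 'T')   # image, video, audio, text

def distance_loss(S):
    """L_dis: sum of L2 distances between the common-space features of
    every pair of media types, as in the distance constraint."""
    loss = 0.0
    for i in range(len(MEDIA)):
        for j in range(i + 1, len(MEDIA)):
            loss += np.linalg.norm(S[MEDIA[i]] - S[MEDIA[j]])
    return loss

def media_discriminator_loss(logits, onehot):
    """Cross-entropy between the discriminator's media-type predictions
    and the one-hot media labels m (a softmax head is assumed)."""
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -np.mean(np.sum(onehot * np.log(p + 1e-12), axis=1))
```

L_dis reaches zero only when all four media types map to the same point of the common space, which is exactly what the generator side is pushed toward while the discriminator pushes back.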
6. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 1, characterized in that: in step (3), the specific process of the adversarial training is as follows: during the adversarial training, a minimax game is adopted, i.e., the loss of the generator is minimized while the loss of the media discriminator is maximized, so as to obtain the optimal model of the algorithm;
Parameters are assigned to each loss function, and the loss function E(θ_{i,v,a}, θ_t, θ_S, θ_D) of the adversarial phase is defined as:

E(θ_{i,v,a}, θ_t, θ_S, θ_D) = L_dis(θ_{i,v,a}, θ_t, θ_S) + L_cla(θ_{i,v,a}, θ_t, θ_S) + L_ran(θ_{i,v,a}, θ_t, θ_S) − λ·L_adv(θ_{i,v,a}, θ_t, θ_S, θ_D)

where θ_{i,v,a} and θ_t denote the parameters of the two-channel feature extractor, and θ_S denotes the parameters of the fully connected layers constituting the common feature representation space;
The adversarial training of the minimax game is specifically expressed as:

(θ̂_{i,v,a}, θ̂_t, θ̂_S) = arg min_{θ_{i,v,a}, θ_t, θ_S} E(θ_{i,v,a}, θ_t, θ_S, θ̂_D)

θ̂_D = arg max_{θ_D} E(θ̂_{i,v,a}, θ̂_t, θ̂_S, θ_D)

where the parameters θ_{i,v,a}, θ_t, θ_S minimize the objective E, and the parameter θ_D maximizes it.
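The alternating minimax update can be illustrated on a toy scalar objective. This is purely schematic: theta_g stands in for all generator-side parameters (θ_{i,v,a}, θ_t, θ_S), theta_d for θ_D, and toy_E for the full E, whose real form sums L_dis, L_cla, L_ran and −λ·L_adv:

```python
def toy_E(theta_g, theta_d, lam=0.5):
    """Stand-in for E(θ_gen, θ_D): a task term the generator minimizes,
    minus λ times an adversarial term that θ_D sharpens. Illustrative only."""
    task = (theta_g - 2.0) ** 2          # generator's own losses
    adv = (theta_g - theta_d) ** 2       # discriminator's classification loss
    return task - lam * adv

def minimax_train(steps=200, lr=0.05):
    g, d = 0.0, 1.0
    eps = 1e-5
    for _ in range(steps):
        # Generator side: gradient DESCENT on E (finite differences).
        grad_g = (toy_E(g + eps, d) - toy_E(g - eps, d)) / (2 * eps)
        g -= lr * grad_g
        # Discriminator side: gradient ASCENT on E.
        grad_d = (toy_E(g, d + eps) - toy_E(g, d - eps)) / (2 * eps)
        d += lr * grad_d
    return g, d
```

On this toy objective the alternating updates spiral into the equilibrium g = d = 2, mirroring how the real training settles when the discriminator can no longer separate the media types.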
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110133925.3A CN112800249A (en) | 2021-02-01 | 2021-02-01 | Fine-grained cross-media retrieval method based on generation of countermeasure network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112800249A true CN112800249A (en) | 2021-05-14 |
Family
ID=75813192
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110133925.3A Withdrawn CN112800249A (en) | 2021-02-01 | 2021-02-01 | Fine-grained cross-media retrieval method based on generation of countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112800249A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113204674A (en) * | 2021-07-05 | 2021-08-03 | 杭州一知智能科技有限公司 | Video-paragraph retrieval method and system based on local-overall graph inference network |
CN113641800A (en) * | 2021-10-18 | 2021-11-12 | 中国铁道科学研究院集团有限公司科学技术信息研究所 | Text duplicate checking method, device and equipment and readable storage medium |
CN113704537A (en) * | 2021-10-28 | 2021-11-26 | 南京码极客科技有限公司 | Fine-grained cross-media retrieval method based on multi-scale feature union |
CN113779282A (en) * | 2021-11-11 | 2021-12-10 | 南京码极客科技有限公司 | Fine-grained cross-media retrieval method based on self-attention and generation countermeasure network |
CN113779284A (en) * | 2021-11-11 | 2021-12-10 | 南京码极客科技有限公司 | Method for constructing entity-level public feature space based on fine-grained cross-media retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20210514 |