CN112800249A - Fine-grained cross-media retrieval method based on a generative adversarial network - Google Patents

Fine-grained cross-media retrieval method based on a generative adversarial network

Info

Publication number
CN112800249A
Authority
CN
China
Prior art keywords
media
fine
data
frame
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110133925.3A
Other languages
Chinese (zh)
Inventor
Tang Zhenmin (唐振民)
Hong Jin (洪瑾)
Yao Yazhou (姚亚洲)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202110133925.3A priority Critical patent/CN112800249A/en
Publication of CN112800249A publication Critical patent/CN112800249A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/435 Filtering based on additional data, e.g. user or group profiles
    • G06F16/438 Presentation of query results
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483 Retrieval using metadata automatically derived from the content
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods

Abstract

The invention discloses a fine-grained cross-media retrieval method based on a generative adversarial network, comprising the following steps: (1) performing a noise-frame filtering operation on the video data; (2) constructing a two-channel feature extraction network; (3) constructing a fine-grained cross-media retrieval model based on a generative adversarial network, the model comprising a generator and a media discriminator, with a common feature representation of the cross-media data obtained after adversarial training; (4) performing similarity measurement on the features in the common feature space, ranking by similarity, and returning the data most similar to the input fine-grained data. The invention can accurately learn the discriminative features of fine-grained data, can quickly and accurately retrieve media data of all types with high similarity to the input data, and can be widely applied in many multimedia fields.

Description

Fine-grained cross-media retrieval method based on a generative adversarial network
Technical Field
The invention relates to a multimedia analysis and retrieval method, and in particular to a fine-grained cross-media retrieval method based on a generative adversarial network.
Background
In the big data era, multimedia data in many forms, such as images, videos, audio and text, jointly convey the information to be expressed and bring great convenience to people's work and life. Retrieval confined to a single media type, however, cannot relate these heterogeneous data to one another, so cross-media retrieval has received increasing attention from researchers. Cross-media retrieval takes data of one media type as input and returns data of other media types with the same or similar semantics. For example, a user may retrieve videos or audio of a bird with an image of that bird. Conventional cross-media retrieval, however, is coarse-grained. When a user inputs an image of a California gull, a search engine treats it merely as a "bird" and returns many images, videos and other media data of "birds", including data for other gull sub-species. When the user wants to search only for the California gull and does not want the results adulterated with other bird sub-species, such coarse-grained cross-media retrieval fails to meet the user's original intention. Only fine-grained cross-media retrieval can find, with reasonable accuracy, the precise subcategory the user is looking for and return only the media information relevant to the California gull. As the demand for diversified and specialized retrieval keeps growing, fine-grained cross-media retrieval is therefore replacing traditional coarse-grained and single-media retrieval and has become a hot topic of current research.
The fine-grained cross-media retrieval algorithm can associate the various media information of a precisely classified object, achieving flexible and accurate retrieval. Fine-grained cross-media retrieval remains challenging for two reasons: (1) small inter-class differences: similar subcategories belonging to the same species may have similar global appearance (images or video), similar textual descriptions (text) and similar sounds (audio), which makes similar fine-grained subcategories difficult to distinguish; (2) heterogeneity differences: data of different media types have inconsistent distributions and feature representations, so the similarity between cross-media data is difficult to measure directly.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a fine-grained cross-media retrieval method based on a generative adversarial network, which can accurately learn the discriminative features of fine-grained data and quickly and accurately retrieve the media data of all types that are most similar to the input data.
The technical scheme is as follows: the fine-grained cross-media retrieval method based on a generative adversarial network according to the invention comprises the following steps:
(1) carrying out noise frame filtering operation on the video data;
(2) constructing a dual-channel feature extraction network, extracting features of text data by using a self-attention-based text feature extractor, and extracting features of other media data by using a deep neural network;
(3) constructing a common feature space learning module and a media discriminator, having the generator and the discriminator perform adversarial training, and finally obtaining a common feature representation;
(4) calculating the feature similarity between the media data of each type using a similarity measure, and outputting the results ranked by similarity.
Preferably, the cross-media data includes image, video, audio and text data.
Preferably, in step (1), the noise-frame filtering operation uses a video frame filtering method based on feature-space clustering; the specific process is as follows:
n frames are cut from each video at the same interval to form an original key frame, N is more than or equal to 25 and less than or equal to 50, and then the deep neural network is used as a feature extractor to extract the features of the original key frame, and the features are expressed as follows:
fv={m1,m2,…,mN}
where v denotes the total number of videos in the video data set, miFeatures representing an ith frame of image;
The $\ell_2$ norm is used to compute, for each video frame, the sum of the distances between its features and the features of all other frames:
$$d_j = \sum_{i=1,\, i \neq j}^{N} \| m_i - m_j \|_2$$
where $d_j$ denotes the sum of the distances from all other video frames to $m_j$;
to d1、d2、…、dNSorting is performed, assuming dk(k ∈ 1,2, …, N) is minimum, then the kth frame is set as the center frame; calculating dkAverage value of (a)k
$$a_k = \frac{d_k}{N-1}$$
Let the threshold be $T = \lambda a_k$, with λ taking a value between 0.001 and 0.01. The distance from each frame to the center frame is then examined: if the distance from the current frame to the center frame is greater than T, the frame is discarded; otherwise it is kept as a valid frame.
Video features are extracted from a fixed number of frames sampled at equal intervals, and these frames inevitably include some that are irrelevant to the target object, such as opening and closing credits. Such noise frames distort the feature distribution of the input data, and the network parameters adapt to that distorted distribution, which harms retrieval accuracy. A video frame filtering method based on feature-space clustering can therefore be used to remove the video frames irrelevant to the target object and improve retrieval accuracy.
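To make the procedure concrete, the following is a minimal Python/PyTorch sketch of the feature-space-clustering frame filter described above. The function name, the use of PyTorch, and the reading of $a_k$ as the mean distance from the center frame to the remaining frames are assumptions made for illustration, not details fixed by the patent text.

```python
import torch

def filter_noise_frames(frame_features: torch.Tensor, lam: float = 0.005):
    """Feature-space-clustering frame filter (illustrative sketch).

    frame_features: (N, D) tensor, one feature vector per sampled frame.
    Returns the indices of frames kept as valid frames.
    """
    N = frame_features.size(0)
    # Pairwise L2 distances between frame features: dist[i, j] = ||m_i - m_j||_2
    dist = torch.cdist(frame_features, frame_features, p=2)
    # d_j: sum of distances from frame j to all other frames
    d = dist.sum(dim=0)
    # The frame with the smallest total distance is taken as the center frame
    k = torch.argmin(d).item()
    # One reading of a_k: the mean distance from the center frame to the other frames
    a_k = d[k] / (N - 1)
    T = lam * a_k  # threshold T = lambda * a_k, lambda in [0.001, 0.01]
    # Keep frames whose distance to the center frame does not exceed T
    keep = (dist[k] <= T).nonzero(as_tuple=True)[0]
    return keep
```

In use, `frame_features` would be the N per-frame feature vectors produced by the deep feature extractor, and the returned indices select the valid frames passed on to the retrieval network.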
Preferably, in step (2), the deep neural network is a ResNet50 network pre-trained on ImageNet for extracting image, video and audio data features.
Preferably, in step (2), the text feature extraction method based on the self-attention mechanism is specifically as follows: given a sentence of n words, where n depends on the length of the text, the word-embedding matrix E of the sentence is expressed as:
$$E = (e_1, e_2, \ldots, e_n)$$
where $e_i$ is the word-embedding vector of the i-th word of the sentence;
A bidirectional LSTM (long short-term memory) network is used to capture the dependencies between adjacent words in the sentence; the hidden-layer output $h_t$ at time t is expressed as:
$$h_t = \mathrm{LSTM}(e_t, h_{t-1}, h_{t+1})$$
h is the set of all hidden layer output results for bi-directional LSTM, denoted as:
$$H = (h_1, h_2, \ldots, h_n)$$
Linear superposition is used to reduce the feature dimension; the reduced representation is denoted H';
the self-attention mechanism takes the entire LSTM hidden state H' as input and then outputs a weight matrix M, represented as:
$$M = s(W_1(g(W_2 H'^{T})))$$
where $W_1$ and $W_2$ are the parameter matrices of two fully connected layers, and $s(\cdot)$ and $g(\cdot)$ are activation functions;
the hidden state H' of LSTM is multiplied by the weight matrix M to obtain an embedded text matrix L, which is expressed as:
$$L = H'^{T} M$$
L is the feature of the text data obtained through the text-processing channel; its dimension is then adjusted by several fully connected layers to be consistent with the features of the other three media types.
The invention adopts a text feature extraction method based on a self-attention mechanism: the recurrent neural network overcomes the weak-label characteristic of text, and combining the self-attention mechanism's accuracy in capturing important features with the recurrent network's handling of sequential data makes it possible to find the most important features among many descriptive words.
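The following is a minimal PyTorch sketch of a text feature extractor with the structure described above (word embeddings, a bidirectional LSTM, a two-layer self-attention branch producing the weight matrix M, and a projection aligning the text feature with the other media). The class name and layer sizes are illustrative assumptions rather than values fixed by this paragraph.

```python
import torch
import torch.nn as nn

class SelfAttentionTextExtractor(nn.Module):
    """Bi-LSTM + self-attention text feature extractor (illustrative sketch)."""

    def __init__(self, embed_dim=100, hidden_dim=1024, attn_dim=128, out_dim=2048):
        super().__init__()
        # Bidirectional LSTM over word embeddings: output dim = 2 * hidden_dim
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Two fully connected layers producing the attention weights M
        self.w2 = nn.Linear(2 * hidden_dim, attn_dim)
        self.w1 = nn.Linear(attn_dim, 1)
        # Projection aligning the text feature with the other media features
        self.proj = nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, embeddings):                 # embeddings: (B, n, embed_dim)
        h, _ = self.lstm(embeddings)               # H': (B, n, 2*hidden_dim)
        scores = self.w1(torch.relu(self.w2(h)))   # (B, n, 1)
        m = torch.softmax(scores, dim=1)           # attention weights over the n words
        text_feat = (h * m).sum(dim=1)             # weighted sum: (B, 2*hidden_dim)
        return self.proj(text_feat)                # (B, out_dim), aligned with other media
```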
Preferably, in step (3), the specific process of constructing the common feature space learning module and the media discriminator is as follows: the input of the common feature space learning module, which consists of several fully connected layers, is the output of the two-channel feature extractor, and a series of constraints keeps the features in the common feature space semantically meaningful; the features in this space are then input to a media discriminator, which also consists of several fully connected layers and determines the media type of each feature under a media-label constraint.
In the fine-grained cross-media retrieval task, the intra-class distance between samples of the same sub-species across different media types must be reduced while the inter-class distance between different sub-species of the same media type is enlarged; directly extracting features from each media type with an ordinary convolutional neural network would therefore yield inconsistent features. The invention completes this normalization process with a generative adversarial network, which not only improves classification accuracy but also makes the features of different media data under the same category label as similar as possible, i.e. it yields a common feature representation space for the fine-grained subcategories.
Preferably, said set of constraints comprises in particular the following:
the generator classification constraint specifically includes:
$$L_{cla} = \sum_{i} \mathcal{L}_{ce}(h_i, y_i)$$
where $\mathcal{L}_{ce}(\cdot,\cdot)$ denotes the cross-entropy loss function, $h_i$ denotes a feature, and $y_i$ denotes its semantic category label;
the invention can learn the fine-grained semantic features of various media by adopting generator classification constraints.
The distance constraint specifically includes:
$$L_{dis} = \|S_I - S_V\|_2 + \|S_I - S_A\|_2 + \|S_I - S_T\|_2 + \|S_T - S_V\|_2 + \|S_T - S_A\|_2 + \|S_A - S_V\|_2$$
where S denotes the features obtained by each type of media data through the common feature representation space (the subscripts I, V, A and T denote image, video, audio and text respectively);
the present invention employs distance constraints to ensure that the intra-class sample features are as close as possible.
The ranking constraint is specifically:
[The ranking loss $L_{ran}$ is given as an equation image in the original; it is defined using the Euclidean distance $d(\cdot,\cdot)$ and two boundary thresholds $a_1$ and $a_2$.]
the classification loss of the media discriminator is specifically:
[The classification loss $L_{adv}$ of the media discriminator is given as an equation image in the original.]
In the formula, $\theta_D$ denotes the parameters of the fully connected layers constituting the media discriminator, and m denotes the one-hot media category label.
The invention adopts the ranking constraint to ensure that the features of samples of the same subcategory are closer together, while the features of samples of different subcategories exhibit sparsity.
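The constraints above can be written compactly in code. The sketch below implements the distance constraint as given, and treats the generator classification constraint and the media discriminator loss as cross-entropy losses, which matches the stated use of a cross-entropy function and one-hot media labels; the ranking constraint is omitted because its exact formula is not reproduced in the text. Function names are illustrative.

```python
import torch
import torch.nn.functional as F

def distance_loss(s_i, s_v, s_a, s_t):
    """L_dis: the six pairwise L2 distances between the four media features in the common space."""
    feats = [s_i, s_v, s_a, s_t]
    loss = 0.0
    for x in range(len(feats)):
        for y in range(x + 1, len(feats)):
            loss = loss + torch.norm(feats[x] - feats[y], p=2, dim=1).mean()
    return loss

def classification_loss(logits, labels):
    """L_cla: cross-entropy between classifier outputs and fine-grained semantic category labels."""
    return F.cross_entropy(logits, labels)

def media_discriminator_loss(media_logits, media_labels):
    """L_adv: cross-entropy between discriminator outputs and media-type labels."""
    return F.cross_entropy(media_logits, media_labels)
```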
Preferably, in step (3), the specific process of the adversarial training is as follows: during adversarial training, a min-max game is adopted, minimizing the loss of the generator while maximizing the loss of the media discriminator to obtain the optimal model of the algorithm. Parameters are assigned to each loss function, and the loss function of the adversarial phase, $E(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D)$, is defined as:
$$E(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D) = L_{dis}(\theta_{i,v,a}, \theta_t, \theta_S) + L_{cla}(\theta_{i,v,a}, \theta_t, \theta_S) + L_{ran}(\theta_{i,v,a}, \theta_t, \theta_S) - \lambda L_{adv}(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D)$$
where $\theta_{i,v,a}$ and $\theta_t$ denote the parameters of the two-channel feature extractor and $\theta_S$ denotes the parameters of the fully connected layers constituting the common feature representation space;
the countermeasure training of the minimum and maximum game is specifically represented as:
$$(\hat{\theta}_{i,v,a}, \hat{\theta}_t, \hat{\theta}_S) = \arg\min_{\theta_{i,v,a},\, \theta_t,\, \theta_S} E(\theta_{i,v,a}, \theta_t, \theta_S, \hat{\theta}_D)$$
$$\hat{\theta}_D = \arg\max_{\theta_D} E(\hat{\theta}_{i,v,a}, \hat{\theta}_t, \hat{\theta}_S, \theta_D)$$
where the parameters $\theta_{i,v,a}$, $\theta_t$ and $\theta_S$ minimize the first equation and the parameter $\theta_D$ maximizes the second, which constitutes the adversarial training process of the generative adversarial model.
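One common way to realize this min-max game is an alternating update: a discriminator step that maximizes E with respect to $\theta_D$ (i.e. trains the discriminator to classify media types) and a generator step that minimizes E with respect to $\theta_{i,v,a}$, $\theta_t$ and $\theta_S$. The sketch below assumes the loss helpers from the previous sketch, a batch of paired four-media data, and generator/discriminator modules; the ranking term is again omitted and all names are illustrative.

```python
import torch

def train_step(generator, discriminator, batch, opt_g, opt_d, lam=1.0):
    """One alternating update of the min-max game (illustrative sketch)."""
    # Generator: raw media data -> common-space features + subcategory logits
    s_i, s_v, s_a, s_t, class_logits = generator(batch["image"], batch["video"],
                                                 batch["audio"], batch["text"])
    feats = torch.cat([s_i, s_v, s_a, s_t], dim=0)
    # 0 = image, 1 = video, 2 = audio, 3 = text, in the same order as the concatenation
    media_labels = torch.arange(4, device=feats.device).repeat_interleave(s_i.size(0))
    # paired samples share one fine-grained subcategory label (assumption about the batch)
    class_labels = batch["labels"].repeat(4)

    # Discriminator step: maximize E, i.e. train the discriminator to classify media types
    d_loss = media_discriminator_loss(discriminator(feats.detach()), media_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: minimize L_dis + L_cla - lambda * L_adv (L_ran omitted in this sketch)
    g_loss = (distance_loss(s_i, s_v, s_a, s_t)
              + classification_loss(class_logits, class_labels)
              - lam * media_discriminator_loss(discriminator(feats), media_labels))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return g_loss.item(), d_loss.item()
```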
Preferably, in step (4), the similarity measure is:
$$\mathrm{sim}(\vec{A}, \vec{B}) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
where $\vec{A}$ and $\vec{B}$ denote two feature vectors and $A_i$ and $B_i$ denote their respective components.
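For step (4), the similarity computation and ranking can be illustrated with a short sketch that scores a gallery of common-space features against a query feature using cosine similarity; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def rank_by_cosine(query_feat: torch.Tensor, gallery_feats: torch.Tensor, top_k: int = 10):
    """Return the indices of the top-k gallery items most similar to the query.

    query_feat:    (D,) common-space feature of the query sample.
    gallery_feats: (M, D) common-space features of candidate media data (any media type).
    """
    sims = F.cosine_similarity(query_feat.unsqueeze(0), gallery_feats, dim=1)  # (M,)
    return torch.topk(sims, k=min(top_k, gallery_feats.size(0))).indices
```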
Advantageous effects: compared with the prior art, the invention has the following notable effects: (1) the video frame filtering method based on feature-space clustering removes video frames irrelevant to the target object, improves retrieval accuracy and solves the noise-frame problem of video data; (2) the self-attention mechanism captures effective text features better, and the recurrent neural network extracts the context, overcoming the weak-label characteristic of text; (3) learning a common feature representation space for cross-media data with a generative adversarial network allows the common representation of cross-media data to be learned better; (4) the invention comprehensively considers both the heterogeneity differences and the fine-grained subcategory differences among cross-media data, and designs a fine-grained cross-media retrieval algorithm based on a generative adversarial network that greatly improves retrieval accuracy.
Drawings
FIG. 1 is a flow diagram of the fine-grained cross-media retrieval method based on a generative adversarial network;
FIG. 2 is a schematic diagram of a video frame filtering algorithm based on feature space clustering;
FIG. 3 is a block diagram of the fine-grained cross-media retrieval method based on a generative adversarial network;
FIG. 4 is a schematic diagram of a text feature extractor based on a self-attention mechanism.
DETAILED DESCRIPTION OF EMBODIMENT(S) OF THE INVENTION
The present invention will be described in detail with reference to examples.
As shown in FIG. 1, the fine-grained cross-media retrieval method based on a generative adversarial network comprises the following steps:
(1) Carry out the noise-frame filtering operation on the video data. The experiments use the PKU FG-XMedia dataset, the first dataset to date that contains data of four media types (image, video, audio and text) for the fine-grained cross-media retrieval task; each media type covers 200 fine-grained bird sub-species. The dataset is first preprocessed to obtain the cross-media data.
Specifically, the preprocessing is as follows: each image is resized to 448 × 448; each text is converted into an n × d matrix, where d is the symbol-embedding dimension and equals 100; in addition, the length of all text sentences is fixed to 448 characters, so the matrix of each text is of size 448 × 100, with sentences shorter than 448 characters padded with zero rows and sentences longer than 448 characters truncated at the 448th character; the audio data in the original dataset are processed by the short-time Fourier transform so that each audio clip is represented by a spectrogram; for each video, 40 frames are extracted at equal intervals and these 40 frames are then denoised.
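A minimal sketch of this preprocessing under the stated settings (448 × 448 images, 448-symbol text matrices with 100-dimensional embeddings, STFT spectrograms for audio, and 40 equally spaced video frames). The library choices (PIL, NumPy, torch.stft), the STFT parameters and the padding id are assumptions; the patent does not prescribe a toolkit.

```python
import torch
import numpy as np
from PIL import Image

def preprocess_image(path):
    """Resize an image to 448 x 448 and convert it to a float tensor in [0, 1]."""
    img = Image.open(path).convert("RGB").resize((448, 448))
    return torch.from_numpy(np.asarray(img)).permute(2, 0, 1).float() / 255.0

def preprocess_text(token_ids, embedding, max_len=448):
    """Pad or truncate a sentence to 448 symbols and embed it into a 448 x 100 matrix."""
    token_ids = token_ids[:max_len] + [0] * max(0, max_len - len(token_ids))  # 0 = padding id
    return embedding(torch.tensor(token_ids))      # (448, 100) if the embedding dim is 100

def preprocess_audio(waveform, n_fft=1024, hop=256):
    """Short-time Fourier transform; the magnitude spectrogram represents the audio."""
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs()

def sample_video_frames(frames, num=40):
    """Take `num` frames at equal intervals from the decoded frame list."""
    idx = np.linspace(0, len(frames) - 1, num=num).astype(int)
    return [frames[i] for i in idx]
```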
As shown in FIG. 2, the 40 frame images are input into a ResNet50 network pre-trained on ImageNet to extract the features of each video frame, expressed as:
$$f_v = \{m_1, m_2, \ldots, m_{40}\}$$
where v denotes the total number of videos in the video dataset (8000) and $m_i$ denotes the features of the i-th frame;
The $\ell_2$ norm is used to compute the sum of the distances between the features of each video frame and the features of all other frames:
$$d_j = \sum_{i=1,\, i \neq j}^{40} \| m_i - m_j \|_2$$
where $d_j$ denotes the sum of the distances from all other video frames to $m_j$;
to d1、d2、…、d40Sorting is performed, assuming dk(k ∈ 1,2, …,40) is minimum, then the kth frame is taken as the center frame;
calculating dkAverage value of (a)k
Figure BDA0002926377130000062
Let the threshold be $T = \lambda a_k$, with λ taking a value between 0.001 and 0.01; the distance from each frame to the center frame is examined, and if the distance from the current frame to the center frame is greater than T, the frame is discarded; otherwise it is kept as a valid frame;
(2) as shown in fig. 3, a two-channel feature extraction network is constructed, a self-attention-based text feature extractor is used for extracting features of text data, and a deep neural network is used for extracting features of other media data;
the method comprises the following steps of extracting the features of text data by a text feature extractor based on self attention, wherein the specific process comprises the following steps:
as shown in FIG. 4, given a sentence having n words, where n is 448, the word embedding matrix E for the sentence is represented as:
E=(e1,e2,…,en)
in the formula, eiA word-embedding representation vector representing an ith word of the sentence;
A bidirectional LSTM is adopted to capture the dependencies between adjacent words in the sentence, with a hidden-layer dimension of 2048; the hidden-layer output $h_t$ at time t is expressed as:
$$h_t = \mathrm{LSTM}(e_t, h_{t-1}, h_{t+1})$$
H is the set of all hidden-layer outputs of the bidirectional LSTM, with dimension 448 × 4096:
$$H = (h_1, h_2, \ldots, h_n)$$
Linear superposition is used to reduce the feature dimension; the reduced representation is denoted H', of size 448 × 2048;
The self-attention mechanism takes the entire LSTM hidden state H' as input, and the output weight matrix M is expressed as:
$$M = \mathrm{softmax}(W_1(g(W_2 H'^{T})))$$
where $W_1$ and $W_2$ are the parameter matrices of fully connected layers with dimensions (2048, 128) and (128, 1) respectively, $g(\cdot)$ is the ReLU activation function, and M has dimension 448 × 1 after the fully connected layers and the activation function;
The hidden state H' of the LSTM is multiplied by the weight matrix M to obtain the embedded text matrix $f_T$:
$$f_T = H'^{T} M$$
$f_T$ is the feature of the text data obtained through the text-processing channel, with dimensionality 2048 × 1;
The other channel is the feature extractor for the image, video and audio data, here a ResNet50 network; an average pooling layer with kernel size 14 and stride 1 is appended after the last convolutional layer of ResNet50, yielding features $f_I$, $f_V$ and $f_A$ for the image, video and audio data respectively, each of dimension 2048 × 1;
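A sketch of this image/video/audio channel: a torchvision ResNet50 pre-trained on ImageNet with its classification head removed, followed by average pooling with kernel size 14 and stride 1, which yields a 2048-dimensional feature for a 448 × 448 input. The use of torchvision and the class name are assumptions.

```python
import torch.nn as nn
from torchvision import models

class VisualAudioExtractor(nn.Module):
    """ResNet50 backbone + 14x14 average pooling -> 2048-d feature (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50(pretrained=True)          # pre-trained on ImageNet
        # Keep everything up to (and including) the last convolutional stage
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        # Average pooling with kernel size 14 and stride 1 over the 14x14 feature map
        self.pool = nn.AvgPool2d(kernel_size=14, stride=1)

    def forward(self, x):                    # x: (B, 3, 448, 448) image, video frame or spectrogram
        fmap = self.features(x)              # (B, 2048, 14, 14) for 448x448 inputs
        return self.pool(fmap).flatten(1)    # (B, 2048)
```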
(3) Construct the common feature space learning module and the media discriminator, have the generator and the discriminator perform adversarial training, and finally obtain the common feature representation; the specific process is as follows:
As shown in FIG. 3, after the two-channel feature extraction network, a common feature space learning module is constructed from 2 linear layers, and the semantic information of the features in the common feature space is preserved under a series of constraints; $f_I$, $f_V$, $f_A$ and $f_T$ generate $S_I$, $S_V$, $S_A$ and $S_T$ after passing through the common feature space. The features in this space are then input to the media discriminator, which consists of 6 fully connected layers and judges the media type of each feature under the media-label constraint;
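A minimal sketch of the common feature space module (2 linear layers) and the media discriminator (6 fully connected layers) just described; the hidden widths, activation functions and class names are assumptions not specified in this paragraph.

```python
import torch.nn as nn

class CommonFeatureSpace(nn.Module):
    """Two linear layers projecting a 2048-d media feature into the common space (sketch)."""
    def __init__(self, in_dim=2048, common_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, common_dim), nn.ReLU(),
            nn.Linear(common_dim, common_dim),
        )

    def forward(self, x):
        return self.net(x)

class MediaDiscriminator(nn.Module):
    """Six fully connected layers predicting which of the 4 media types a feature came from (sketch)."""
    def __init__(self, in_dim=2048, hidden=512, num_media=4):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(5):                      # 5 hidden layers + 1 output layer = 6 FC layers
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        layers.append(nn.Linear(dim, num_media))
        self.net = nn.Sequential(*layers)

    def forward(self, s):
        return self.net(s)                      # media-type logits
```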
the constraints used include:
A softmax function is attached to the last fully connected layer of the generator to serve as a classifier, finally outputting a set of probability values of dimension 200, from which the class of the sample predicted by the model can be determined; the generator classification constraint ensures that the network learns semantic information from the class labels, specifically:
$$L_{cla} = \sum_{i} \mathcal{L}_{ce}(\hat{S}_i, y_i)$$
where $\mathcal{L}_{ce}(\cdot,\cdot)$ is the cross-entropy loss function, $\hat{S}_i$ is the representation of $S_I$, $S_V$, $S_A$ and $S_T$ after the classifier, and $y_i$ is the semantic category label;
The distance constraint ensures that the features of samples within a class are as close as possible, specifically:
$$L_{dis} = \|S_I - S_V\|_2 + \|S_I - S_A\|_2 + \|S_I - S_T\|_2 + \|S_T - S_V\|_2 + \|S_T - S_A\|_2 + \|S_A - S_V\|_2$$
The ranking constraint ensures that the features of samples of the same subcategory are closer together while the features of samples of different subcategories exhibit sparsity, specifically:
[The ranking loss is given as an equation image in the original; it is defined using the Euclidean distance $d(\cdot,\cdot)$ and two boundary thresholds $a_1$ and $a_2$, whose values are 1 and 0.5 respectively.]
The classification constraint of the media discriminator is:
[The classification loss of the media discriminator is given as an equation image in the original.]
In the formula, $\theta_D$ denotes the parameters of the fully connected layers constituting the media discriminator, and m denotes the one-hot media category label.
During adversarial training, the loss of the generator is minimized while the loss of the media discriminator is maximized to obtain the optimal model of the algorithm; this process is also called the min-max game. Parameters are assigned to each loss function, and the loss function of the adversarial phase, $E(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D)$, is defined as:
$$E(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D) = L_{dis}(\theta_{i,v,a}, \theta_t, \theta_S) + L_{cla}(\theta_{i,v,a}, \theta_t, \theta_S) + L_{ran}(\theta_{i,v,a}, \theta_t, \theta_S) - \lambda L_{adv}(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D)$$
where $\theta_{i,v,a}$ and $\theta_t$ are the parameters of the two-channel feature extractor and $\theta_S$ is the parameter of the fully connected layers constituting the common feature representation space;
The adversarial training of the min-max game is specifically expressed as:
$$(\hat{\theta}_{i,v,a}, \hat{\theta}_t, \hat{\theta}_S) = \arg\min_{\theta_{i,v,a},\, \theta_t,\, \theta_S} E(\theta_{i,v,a}, \theta_t, \theta_S, \hat{\theta}_D)$$
$$\hat{\theta}_D = \arg\max_{\theta_D} E(\hat{\theta}_{i,v,a}, \hat{\theta}_t, \hat{\theta}_S, \theta_D)$$
where the parameters $\theta_{i,v,a}$, $\theta_t$ and $\theta_S$ minimize the first equation and $\theta_D$ maximizes the second; this is the adversarial training process of the generative adversarial model. After the adversarial training, common features that effectively reduce the heterogeneity differences of the cross-media data while learning the fine-grained features are obtained.
(4) Calculate the feature similarity between the media data of each type using the similarity measure and output the results ranked by similarity. The cosine distance is used to compute the feature similarity between the media data so that they can be ranked; the cosine distance is computed as:
$$\mathrm{sim}(\vec{A}, \vec{B}) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
where $\vec{A}$ and $\vec{B}$ denote two feature vectors and $A_i$ and $B_i$ denote their respective components.
The fine-grained cross-media retrieval method of the invention is compared with 6 existing cross-media retrieval methods, with mean average precision (mAP) as the retrieval evaluation metric; the higher the mAP value, the better the retrieval performance. The 6 cross-media retrieval methods are as follows:
[1] He X, Peng Y, Huang X, et al. A new benchmark and approach for fine-grained cross-media retrieval[C]. ACM International Conference on Multimedia, 2019: 1740-1748.
[2] Huang X, Peng Y, Yuan M. MHTN: Modal-adversarial hybrid transfer network for cross-modal retrieval[J]. IEEE Transactions on Cybernetics, 2018.
[3] Wang B, Yang Y, Xu X, et al. Adversarial Cross-Modal Retrieval[C]. ACM International Conference on Multimedia, 2017: 154-162.
[4] Zhai X, Peng Y, Xiao J. Learning Cross-Media Joint Representation With Sparse and Semisupervised Regularization[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2014, 24(6): 965-978.
[5] Mandal D, Chaudhury K, Biswas S. Generalized semantic preserving hashing for n-label cross-modal retrieval[C]. IEEE Conference on Computer Vision and Pattern Recognition, 2017: 4076-4084.
[6] Peng Y, Huang X, Qi J. Cross-Media Shared Representation by Hierarchical Learning with Multiple Deep Networks[C]. International Joint Conference on Artificial Intelligence, 2016: 3846-3853.
the results of performing dual-media fine-grained cross-media search on a PKU FG-XMedia dataset by using the present invention and 6 cross-media search methods are shown in Table 1, and the conventional methods [1] to [6] in Table 1 respectively correspond to the methods described in the above documents [1] to [6 ].
TABLE 1 Comparison of dual-media fine-grained cross-media retrieval results
[Table 1 is provided as an image in the original publication.]
The results of the multimedia fine-grained cross-media retrieval on the PKU FG-XMedia dataset using the present invention and the 6 cross-media retrieval methods are shown in Table 2, where the existing methods [1] to [6] correspond to the methods described in documents [1] to [6] above.
TABLE 2 Comparison of multimedia fine-grained cross-media retrieval results
Method           I→All   T→All   V→All   A→All   Average
The invention    0.627   0.311   0.491   0.568   0.499
FGCrossNet [1]   0.549   0.196   0.416   0.485   0.412
MHTN [2]         0.208   0.142   0.237   0.341   0.232
GSPH [5]         0.387   0.103   0.075   0.312   0.219
JRL [4]          0.344   0.080   0.069   0.275   0.192
CMDN [6]         0.321   0.071   0.016   0.229   0.159
ACMR [3]         0.245   0.039   0.041   0.279   0.151
As can be seen from Tables 1 and 2, the method of the invention performs best in both dual-media and multimedia fine-grained cross-media retrieval, which demonstrates the effectiveness of using a generative adversarial network to learn the common feature representation of cross-media data in the method of the invention, and also verifies the effectiveness of the video frame denoising algorithm and of the various constraint conditions.

Claims (6)

1. A fine-grained cross-media retrieval method based on a generative adversarial network, characterized in that the method comprises the following steps:
(1) carrying out noise frame filtering operation on the video data;
(2) constructing a dual-channel feature extraction network, extracting features of text data by using a text feature extractor based on a self-attention mechanism, and extracting features of other media data by using a deep neural network;
(3) constructing a common feature space learning module and a media discriminator, having the generator and the discriminator perform adversarial training, and finally obtaining a common feature representation;
(4) calculating the feature similarity between the media data of each type using a similarity measure, and outputting the results ranked by similarity.
2. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 1, characterized in that: in step (1), the noise-frame filtering operation uses a video frame filtering method based on feature-space clustering, the specific process being as follows:
n frames are cut from each video at the same interval to form an original key frame, N is more than or equal to 25 and less than or equal to 50, and then the deep neural network is used as a feature extractor to extract the features of the original key frame, and the features are expressed as follows:
fv={m1,m2,...,mN}
where v denotes the total number of videos in the video data set, miFeatures representing an ith frame of image;
using ζ2Norm calculation between the features of each video frame and the features of all other framesIs given as:
Figure FDA0002926377120000011
in the formula (d)jRepresenting all other video frames to mjThe sum of the distances of (a);
to d1、d2、…、dNSorting is performed, assuming dk(k belongs to 1, 2.. and N) is minimum, then the k frame is taken as a central frame; calculating dkAverage value of (a)k
Figure FDA0002926377120000012
Let λ akIs a threshold value T, and the value of lambda is 0.001-0.01; and judging the distance from each frame to the central frame, if the distance from the current frame to the central frame is more than T, discarding the current frame, otherwise, keeping the current frame as a valid frame.
3. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 1, characterized in that: in step (2), the text feature extraction method based on the self-attention mechanism is specifically as follows:
given a sentence of n words, where n depends on the length of the text, the word-embedding matrix E of the sentence is expressed as:
$$E = (e_1, e_2, \ldots, e_n)$$
where $e_i$ is the word-embedding vector of the i-th word of the sentence;
a bidirectional LSTM (long short-term memory) network is adopted to capture the dependencies between adjacent words in the sentence, and the hidden-layer output $h_t$ at time t is expressed as:
$$h_t = \mathrm{LSTM}(e_t, h_{t-1}, h_{t+1})$$
H is the set of all hidden-layer outputs of the bidirectional LSTM, denoted as:
$$H = (h_1, h_2, \ldots, h_n)$$
linear superposition is used to reduce the feature dimension, the reduced representation being denoted H';
the self-attention mechanism takes the entire LSTM hidden state H' as input and then outputs a weight matrix M, represented as:
$$M = s(W_1(g(W_2 H'^{T})))$$
where $W_1$ and $W_2$ are the parameter matrices of two fully connected layers, and $s(\cdot)$ and $g(\cdot)$ are activation functions;
the hidden state H' of LSTM is multiplied by the weight matrix M to obtain an embedded text matrix L, which is expressed as:
$$L = H'^{T} M$$
L is the feature of the text data obtained through the text-processing channel, and its dimension is then adjusted by several fully connected layers to be consistent with the features of the other three media types.
4. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 1, characterized in that: in step (3), the specific process of constructing the common feature space learning module and the media discriminator is as follows: the input of the common feature space learning module, which consists of several fully connected layers, is the output of the two-channel feature extractor, and a series of constraints keeps the features in the common feature space semantically meaningful; the features in this space are then input to a media discriminator, which also consists of several fully connected layers and determines the media type of each feature under a media-label constraint.
5. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 4, characterized in that: the series of constraints specifically includes the following:
the generator classification constraint specifically includes:
$$L_{cla} = \sum_{i} \mathcal{L}_{ce}(h_i, y_i)$$
where $\mathcal{L}_{ce}(\cdot,\cdot)$ denotes the cross-entropy loss function, $h_i$ denotes a feature, and $y_i$ denotes its semantic category label;
the distance constraint specifically includes:
$$L_{dis} = \|S_I - S_V\|_2 + \|S_I - S_A\|_2 + \|S_I - S_T\|_2 + \|S_T - S_V\|_2 + \|S_T - S_A\|_2 + \|S_A - S_V\|_2$$
where S denotes the features obtained by each type of media data through the common feature representation space;
the ordering constraint specifically includes:
Figure FDA0002926377120000033
wherein d (·,. cndot.) represents the Euclidean distance, a1And a2Represents a boundary threshold;
the classification loss of the media discriminator is specifically:
[The classification loss of the media discriminator is given as an equation image in the original.]
In the formula, $\theta_D$ denotes the parameters of the fully connected layers constituting the media discriminator, and m denotes the one-hot media category label.
6. The fine-grained cross-media retrieval method based on a generative adversarial network according to claim 1, characterized in that: in step (3), the specific process of the adversarial training is as follows: during adversarial training, a min-max game is adopted, i.e. the loss of the generator is minimized while the loss of the media discriminator is maximized to obtain the optimal model of the algorithm;
parameters are assigned to each loss function, and the loss function of the adversarial phase, $E(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D)$, is defined as:
$$E(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D) = L_{dis}(\theta_{i,v,a}, \theta_t, \theta_S) + L_{cla}(\theta_{i,v,a}, \theta_t, \theta_S) + L_{ran}(\theta_{i,v,a}, \theta_t, \theta_S) - \lambda L_{adv}(\theta_{i,v,a}, \theta_t, \theta_S, \theta_D)$$
where $\theta_{i,v,a}$ and $\theta_t$ denote the parameters of the two-channel feature extractor and $\theta_S$ denotes the parameters of the fully connected layers constituting the common feature representation space;
the adversarial training of the min-max game is specifically expressed as:
$$(\hat{\theta}_{i,v,a}, \hat{\theta}_t, \hat{\theta}_S) = \arg\min_{\theta_{i,v,a},\, \theta_t,\, \theta_S} E(\theta_{i,v,a}, \theta_t, \theta_S, \hat{\theta}_D)$$
$$\hat{\theta}_D = \arg\max_{\theta_D} E(\hat{\theta}_{i,v,a}, \hat{\theta}_t, \hat{\theta}_S, \theta_D)$$
where the parameters $\theta_{i,v,a}$, $\theta_t$ and $\theta_S$ minimize the first equation and the parameter $\theta_D$ maximizes the second.
CN202110133925.3A 2021-02-01 2021-02-01 Fine-grained cross-media retrieval method based on a generative adversarial network Withdrawn CN112800249A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110133925.3A CN112800249A (en) 2021-02-01 2021-02-01 Fine-grained cross-media retrieval method based on a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110133925.3A CN112800249A (en) 2021-02-01 2021-02-01 Fine-grained cross-media retrieval method based on a generative adversarial network

Publications (1)

Publication Number Publication Date
CN112800249A true CN112800249A (en) 2021-05-14

Family

ID=75813192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110133925.3A Withdrawn CN112800249A (en) 2021-02-01 2021-02-01 Fine-grained cross-media retrieval method based on generation of countermeasure network

Country Status (1)

Country Link
CN (1) CN112800249A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204674A (en) * 2021-07-05 2021-08-03 杭州一知智能科技有限公司 Video-paragraph retrieval method and system based on local-overall graph inference network
CN113641800A (en) * 2021-10-18 2021-11-12 中国铁道科学研究院集团有限公司科学技术信息研究所 Text duplicate checking method, device and equipment and readable storage medium
CN113704537A (en) * 2021-10-28 2021-11-26 南京码极客科技有限公司 Fine-grained cross-media retrieval method based on multi-scale feature union
CN113779282A (en) * 2021-11-11 2021-12-10 南京码极客科技有限公司 Fine-grained cross-media retrieval method based on self-attention and generation countermeasure network
CN113779284A (en) * 2021-11-11 2021-12-10 南京码极客科技有限公司 Method for constructing entity-level public feature space based on fine-grained cross-media retrieval



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20210514