CN113158798A - Short video classification method based on multi-mode feature complete representation - Google Patents

Short video classification method based on multi-mode feature complete representation

Info

Publication number
CN113158798A
CN113158798A (application CN202110282974.3A)
Authority
CN
China
Prior art keywords
representation
visual
modal
label
network
Prior art date
Legal status
Withdrawn
Application number
CN202110282974.3A
Other languages
Chinese (zh)
Inventor
井佩光
张丽娟
苏育挺
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110282974.3A priority Critical patent/CN113158798A/en
Publication of CN113158798A publication Critical patent/CN113158798A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a short video classification method based on a complete representation of multi-modal features, which comprises the following steps: for the content information of the short video itself, taking the visual modality features as primary, four subspaces are constructed from the perspective of missing modalities and a potential feature representation is obtained in each, and the potential feature representations of the four subspaces are further fused with an auto-encoding/decoding network so that a more robust and effective common potential representation is learned; for the label information, inverse covariance estimation and a graph attention network are adopted to explore the correlations among labels and update the label representation, obtaining the label vector representation corresponding to the short video; a cross-modal fusion scheme based on multi-head attention is proposed for the common potential representation and the label vector representation and is used to obtain the label prediction scores of the short video; the overall loss function of the model consists of the conventional multi-label classification loss and the reconstruction loss of the auto-encoding/decoding network, and is used to measure the difference between the network output and the ground truth and to guide the network to find the optimal solution of the model.

Description

Short video classification method based on multi-mode feature complete representation
Technical Field
The invention relates to the field of short video classification, and in particular to a short video classification method based on a complete representation of multi-modal features.
Background
In recent years, with the popularization of intelligent terminals and the boom of social networks, more and more information is presented as multimedia content, and high-definition cameras, large-capacity storage and high-speed network connections have created extremely convenient shooting and sharing conditions for users, producing massive amounts of multimedia data.
As a new form of user-generated content, short video has become hugely popular on social networks thanks to its unique advantages of a low creation threshold, fragmented content and strong social attributes. In particular, since 2011, with the spread of mobile internet terminals, faster networks and lower data tariffs, short video has rapidly gained the support and favour of many parties, including large content platforms, fans and capital. Data show that global mobile video traffic already accounts for more than half of total mobile data traffic and continues to grow at a high rate. The sheer scale of short video data easily overwhelms the information users need and makes it difficult for them to find the short video content they want, so how to process and exploit this information efficiently becomes critical.
Artificial intelligence technology, represented by deep learning, is one of the most popular technologies at present and is widely applied in many fields such as computer vision.
Therefore, introducing the short video classification task helps promote innovation on related topics in the computer vision and multimedia fields, and has important application value and practical significance for improving user experience and developing the industry.
Disclosure of Invention
The invention provides a short video classification method based on a complete representation of multi-modal features, which solves the short video multi-label classification problem and evaluates the results, as described in detail below.
A short video classification method based on a complete representation of multi-modal features, the method comprising:
for the content information of the short video itself, taking the visual modality features as primary, constructing four subspaces from the perspective of missing modalities and obtaining a potential feature representation in each, and further fusing the potential feature representations of the four subspaces with an auto-encoding/decoding network so that a more robust and effective common potential representation is learned;
for the label information, using inverse covariance estimation and a graph attention network to explore the correlations among labels and update the label representation, obtaining the label vector representation corresponding to the short video;
proposing a cross-modal fusion scheme based on multi-head attention for the common potential representation and the label vector representation, which is used to obtain the label prediction scores of the short video;
the overall loss function of the model consists of the conventional multi-label classification loss and the reconstruction loss of the auto-encoding/decoding network, and is used to measure the difference between the network output and the ground truth and to guide the network towards the optimal solution of the model.
The two types of visual modality potential representations are: the unique visual modality potential representation, and the visual modality potential representation complemented by the information of the other modalities.
Further, exploring the correlations among labels and updating the label representation with inverse covariance estimation and a graph attention network to obtain the label vector representation corresponding to the short video specifically comprises:
introducing an inverse covariance estimate and, for a given label matrix V, finding the inverse covariance matrix S^{-1} that characterizes the pairwise relationships of the labels, i.e. defining a graph relationship function to initialize the graph structure S;
converting the label matrix V input to the network into a new label matrix, inputting the new label matrix into the graph relationship function G(·), and computing the graph structure S' under the new label matrix.
The cross-modal fusion scheme based on multi-head attention is as follows: the labels are queried with the common potential representation of the short video visual features, the correlations are computed, and the common potential representation of the short video visual modality is aligned with the label matrix.
The technical solution provided by the invention has the following beneficial effects:
1. The invention studies the multi-modal representation learning problem in short video and proposes a deep multi-modal unified representation learning scheme in which the visual modality information is primary and the other modality information is auxiliary: four subspaces are constructed from the perspective of missing modalities to learn the complementarity of information among the modalities and obtain the two types of potential representations of the visual modality features, and, taking the consistency of the visual modality feature information into account, the two types of potential representations are fused with an auto-encoding/decoding network to obtain the common potential representation of the visual modality features. This process considers the missing-modality problem together with the complementarity and consistency of the modality information, and makes full use of the modality information of the short video;
2. The invention explores the label information space of the short video and provides a new approach to label correlation learning from the two aspects of inverse covariance estimation and a graph attention network;
3. To address the drawbacks of the limited duration and insufficient information of short videos, the invention learns a common potential representation of the visual modality and a label representation from the two perspectives of the content information and the label information of the short video respectively, and proposes a cross-modal fusion strategy based on multi-head attention for the two representations to obtain the final label prediction scores.
The method makes full use of the modality information of the short video to learn the visual modality representation and the label representation that matter most for the multi-label classification task, and helps improve the accuracy of short video multi-label classification.
Drawings
FIG. 1 is the overall network framework diagram of the short video classification method based on a complete representation of multi-modal features;
FIG. 2 is the subspace learning framework diagram;
FIG. 3 shows the experimental results.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
The embodiment of the invention provides a short video classification method based on a complete representation of multi-modal features, which makes full use of the content information and the label information of a short video. As shown in FIG. 1, the method comprises the following steps:
101: For the content information, experience shows that the semantic feature representation of the visual modality is the most important in the short video multi-label classification task, so representation learning is built around the visual modality features: with the visual modality as primary, four subspaces are constructed from the perspective of missing modalities to learn the complementarity of information among the modalities, giving two types of potential representations of the visual modality features. Considering the consistency of the visual modality feature information, and in order to obtain a more compact visual modality feature representation, the two types of potential representations obtained from the four subspaces are fused with an auto-encoding/decoding network to learn the common potential representation of the visual modality features;
102: For the label information, a unique convex formulation (inverse covariance estimation) and a graph attention network are used to explore the correlations among labels and update the label representation, obtaining the label vector representation corresponding to the short video;
The label vector representation is used to explore a label representation suited to the short video dataset; together with the common potential representation of the visual modality features from step 101, it participates in the multi-head cross-modal fusion network of step 103;
103: For the representations of the two information spaces, namely the common potential representation of the visual modality features obtained in step 101 and the label representation obtained in step 102, a cross-modal fusion scheme based on multi-head attention is proposed and used to obtain the label prediction scores of the short video;
The output of the multi-head cross-modal fusion network can be regarded as the label prediction scores of the input short video and is used directly in the classification loss function.
104: The overall loss function consists of the conventional multi-label classification loss and the reconstruction loss of the auto-encoding/decoding network, and is used to measure the difference between the network output and the ground truth and to guide the network towards the optimal solution of the model.
The performance of the scheme is evaluated with five metrics, namely Coverage, Ranking Loss, mean Average Precision (mAP), Hamming Loss and One-error, to ensure the objectivity of the experimental results.
In a specific implementation, before step 101, the method further includes:
inputting the short video and extracting the three modality features of vision, audio and trajectory with classical deep learning networks respectively.
In summary, the embodiment of the invention obtains the label prediction scores of the input short video by using the theory of multi-modal learning and label learning combined with the advantages of deep learning networks, and the classification results are accurate and effective.
Example 2
The scheme of Embodiment 1 is further described below with reference to the calculation formulas and examples:
201: The model takes a complete short video as input and extracts the three modality features of vision, audio and trajectory respectively;
For the visual modality, key frames are extracted, the classical image feature extraction network ResNet (residual network) is applied to all video key frames, and an averaging (AvePooling) operation is then performed to obtain the overall feature z_v of the visual modality features X_v:
z_v = AvePooling(ResNet(X_v; β_v)),
where ResNet(·) denotes the residual network, AvePooling(·) the averaging operation, X_v the original visual features of the short video and β_v the network parameters to be learned; the visual modality feature z_v has dimension d_v.
For the audio modality, the spectrogram of the sound is drawn and a "CNN + LSTM" (convolutional neural network + long short-term memory network) model is applied to the spectrogram to extract the audio feature z_a:
z_a = LSTM(CNN(X_a); β_a),
where CNN(·) denotes the convolutional neural network, LSTM(·) the long short-term memory network, X_a the original audio features of the short video and β_a the network parameters to be learned; the audio modality feature z_a has dimension d_a.
For the trajectory modality, the TDD (trajectory-pooled deep-convolutional descriptor) method is used to extract the trajectory feature z_t jointly from the temporal and spatial domains:
z_t = TDD(X_t; β_t),
where TDD(·) denotes the trajectory depth descriptor network, X_t the original trajectory information of the short video and β_t the network parameters to be learned; the trajectory modality feature z_t has dimension d_t.
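As an illustration of the visual branch of step 201, the following is a minimal PyTorch sketch of z_v = AvePooling(ResNet(X_v)); the frame count, input resolution, backbone variant (ResNet-50) and the resulting dimension d_v = 2048 are illustrative assumptions, not values specified by the patent.

```python
# Minimal sketch of the visual feature extraction z_v = AvePooling(ResNet(X_v)).
# Assumptions: key frames arrive as a tensor of shape (num_frames, 3, 224, 224);
# the backbone and pooling details are placeholders, not the patented implementation.
import torch
import torch.nn as nn
import torchvision.models as models

class VisualFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet50()                                     # backbone (weights omitted)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])   # drop the final fc layer

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, H, W) -> per-frame features (num_frames, 2048)
        feats = self.backbone(frames).flatten(1)
        # average over the key frames gives the overall visual feature z_v (d_v = 2048 here)
        return feats.mean(dim=0)

frames = torch.randn(8, 3, 224, 224)        # 8 hypothetical key frames
z_v = VisualFeatureExtractor()(frames)
print(z_v.shape)                            # torch.Size([2048])
```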
202: Modality subspace learning based on the visual modality;
The model considers the visual, audio and trajectory modalities of the short video. A specific short video generally contains video pictures, i.e. the visual modality feature is present, but whether the other two modalities are missing is uncertain, and there are four missing-modality situations in total. Experience shows that the potential representation of the visual modality is crucial in the short video multi-label classification task, so four subspaces are constructed around potential representation learning of the visual modality, and two main cases are discussed, namely the unique visual modality potential representation and the visual modality potential representation complemented by the information of the other modalities, to ensure that the visual modality potential representation is fully mined. (The visual modality feature z_v, the audio modality feature z_a and the trajectory modality feature z_t are all obtained in step 201.)
(1) Unique visual modality potential representation
The extracted visual modality feature z_v is used to learn its specific potential representation h_v:
h_v = φ_v(z_v; θ_v),
where φ_v(·) denotes the visual-feature-specific mapper and θ_v the network parameters to be learned; the visual modality potential representation h_v has dimension d_h.
(2) Visual modality potential representation under complementation by different modality information
A normalized exponential function is introduced to quantitatively analyse the complementary relationship between the other modality information and the visual modality information: the other modality features are transformed into corresponding features in the visual representation space, added to the visual modality feature, and fed into a feature fusion mapper to obtain the information-complemented visual modality potential representation.
Case I: when only the visual modality feature z_v and the audio modality feature z_a are present, the correlation matrix U_a of the two modality features is computed first:
U_a = z_v^T z_a,
where z_v^T denotes the transpose of the visual modality feature z_v, d_v the dimension of z_v and d_a the dimension of z_a; the correlation matrix U_a has dimension d_v × d_a.
Then the correlation score matrix Ũ_a between the modalities is computed:
Ũ_a = softmax(U_a),
where softmax(·) denotes the normalized exponential function (likewise below); the correlation score matrix Ũ_a has dimension d_v × d_a.
The correlation score matrix Ũ_a is used to transform the audio modality feature z_a into the visual representation space, yielding the audio modality feature representation ẑ_a in the visual representation space:
ẑ_a = Ũ_a z_a^T,
where z_a^T denotes the transpose of the audio modality feature z_a; the audio modality feature representation ẑ_a in the visual representation space has dimension d_v.
Finally, the original visual modality feature z_v and the audio modality feature ẑ_a in the visual representation space are added and fed into the feature fusion mapper φ_a to generate the visual modality potential representation h̃_a supplemented with audio modality information:
h̃_a = φ_a(z_v ⊕ ẑ_a; θ_a),
where θ_a denotes the feature fusion mapper parameters to be learned and ⊕ denotes element-wise addition of the vectors; the visual modality potential representation h̃_a generated by the feature fusion mapper has dimension d_h.
Case II: when only the visual modality feature z_v and the trajectory modality feature z_t are present, the same strategy as Case I is adopted to obtain the visual modality potential representation supplemented with trajectory modality information:
U_t = z_v^T z_t,
where U_t denotes the correlation matrix of the visual modality feature z_v and the trajectory modality feature z_t, with dimension d_v × d_t, and z_v^T the transpose of z_v;
Ũ_t = softmax(U_t),
where Ũ_t denotes the correlation score matrix between the visual and trajectory modalities, with dimension d_v × d_t;
ẑ_t = Ũ_t z_t^T,
where ẑ_t denotes the trajectory modality feature in the visual representation space, z_t^T the transpose of the original trajectory modality feature z_t, and ẑ_t has dimension d_v;
h̃_t = φ_t(z_v ⊕ ẑ_t; θ_t),
where φ_t denotes the feature fusion mapper and θ_t the feature fusion mapper parameters to be learned; the visual modality potential representation h̃_t generated by the feature fusion mapper has dimension d_h.
Case III: when the visual modality feature z_v, the audio modality feature z_a and the trajectory modality feature z_t are all present, the audio information and the trajectory information are used jointly to supplement the visual information.
First, the joint information representation z_at of the audio and trajectory modalities is obtained:
z_at = concat(z_a, z_t),
where concat(·) denotes the concatenation function of feature vectors; the joint information representation z_at has dimension d_a + d_t. The same strategy as Case I is then adopted to obtain the new potential representation of the visual modality when the information of all three modalities is present:
U_at = z_v^T z_at,
where U_at denotes the correlation matrix among the three modalities and z_v^T the transpose of the visual modality feature z_v; the correlation matrix U_at has dimension d_v × (d_a + d_t);
Ũ_at = softmax(U_at),
where Ũ_at denotes the correlation score matrix among the three modalities, with dimension d_v × (d_a + d_t);
ẑ_at = Ũ_at z_at^T,
where ẑ_at denotes the joint information representation of the audio and trajectory modalities in the visual representation space, z_at^T the transpose of the original joint information representation, and ẑ_at has dimension d_v;
h̃_at = φ_at(z_v ⊕ ẑ_at; θ_at),
where φ_at denotes the feature fusion mapper and θ_at the feature fusion mapper parameters to be learned; the visual modality potential representation h̃_at generated by the feature fusion mapper has dimension d_h.
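The missing-modality complementation of Cases I to III can be sketched roughly as follows; the outer-product correlation matrix, softmax score matrix and single-layer fusion mapper mirror the reconstruction given above, and all dimensions are illustrative assumptions rather than values from the patent.

```python
# Minimal sketch of the missing-modality complementation of step 202 (Cases I to III).
# Assumptions: the correlation matrix is the outer product of the feature vectors, the score
# matrix is its softmax, and the fusion mapper phi_m is one linear layer with ReLU.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_v, d_a, d_h = 2048, 128, 512                   # illustrative dimensions

class FusionMapper(nn.Module):
    """phi_m: maps the information-complemented visual feature to a d_h potential representation."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out)

    def forward(self, x):
        return F.relu(self.proj(x))

def complement_visual(z_v: torch.Tensor, z_m: torch.Tensor, mapper: FusionMapper) -> torch.Tensor:
    U = torch.outer(z_v, z_m)                    # correlation matrix, (d_v, d_m)
    U_score = torch.softmax(U, dim=-1)           # correlation score matrix
    z_m_in_v = U_score @ z_m                     # other-modality feature in the visual space, (d_v,)
    return mapper(z_v + z_m_in_v)                # element-wise addition, then fusion mapper -> h~_m

z_v, z_a = torch.randn(d_v), torch.randn(d_a)
h_a = complement_visual(z_v, z_a, FusionMapper(d_v, d_h))   # Case I; Cases II/III reuse the routine
print(h_a.shape)                                 # torch.Size([512])
```

For Case III the same routine would be called with the concatenated audio-trajectory vector z_at in place of z_a.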
203: The auto-encoding/decoding network learns the consistency of the visual modality potential representations;
The visual modality potential representations learned by the subspaces should be similar, since in theory they all characterize the same visual content. The auto-encoding/decoding network is used to project the two types of visual modality potential representations learned in step 202 into a common space as far as possible. This has two advantages: on the one hand, over-fitting of the data is prevented to a certain extent and the data are reduced in dimensionality, giving a more compact visual modality potential representation; on the other hand, the effective connection among the four subspaces is strengthened, making the subspace learning more meaningful. Step 202 yields two types of visual modality potential representations, the unique visual modality potential representation h_v and the modality-complemented visual modality potential representation h̃_m, where m ∈ {a, t, at}; they are concatenated to obtain the vector u, i.e. u = concat(h_v, h̃_m). Then u is input into the auto-encoding/decoding network to obtain the common potential representation h of the visual modality and the reconstructed representation û, giving the reconstruction loss function L_rec:
L_rec = ||u - û||²,
s.t. h = g_ae(u; W_ae), û = g_dg(h; W_dg),
where g_ae(·) denotes the encoding network, g_dg(·) the decoding network, W_ae the parameters to be learned of the encoding network and W_dg the parameters to be learned of the decoding network; the common visual modality potential representation h has dimension d_u and the reconstructed representation û has dimension 2d_h.
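A minimal sketch of this consistency step follows, assuming single-linear-layer encoder and decoder and illustrative dimensions (the patent does not fix the architecture of g_ae and g_dg).

```python
# Minimal sketch of the auto-encoding/decoding consistency step (203).
# u is the concatenation of h_v and one complemented representation h~_m (2*d_h in total).
import torch
import torch.nn as nn

d_h, d_u = 512, 256                              # illustrative dimensions

class AutoCodec(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(2 * d_h, d_u)   # g_ae: u -> common representation h
        self.decoder = nn.Linear(d_u, 2 * d_h)   # g_dg: h -> reconstruction u_hat

    def forward(self, u):
        h = torch.relu(self.encoder(u))
        u_hat = self.decoder(h)
        return h, u_hat

u = torch.cat([torch.randn(d_h), torch.randn(d_h)])   # concat(h_v, h~_m)
h, u_hat = AutoCodec()(u)
rec_loss = torch.sum((u - u_hat) ** 2)                # L_rec = ||u - u_hat||^2
```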
204: Learning the label information space of the short video;
One of the key issues in the multi-label classification task is exploring label relationships. A graph attention network is constructed to explore the label correlations and compute the label matrix. To this end, the concept of a graph is first introduced. For the label set Y = {y_1, y_2, ..., y_C}, consider the graph G(V, E), where V denotes the set of label nodes and E ∈ |V| × |V| the adjacency matrix of label relationships. In particular, for any label node v_i, its neighbourhood is defined as ρ_i(j) = {j | v_j ∈ V} ∪ {v_i}. The original label matrix is V = [v_1, v_2, ..., v_C], where v_C ∈ R^n is the initial vector representation of label node C and n is the original feature dimension of the labels.
(1) Building the initial graph structure
Since the initial relationships between the labels are unknown, an inverse covariance estimate is introduced: for the given label matrix V, the inverse covariance matrix S^{-1} is found to characterize the pairwise relationships of the labels, i.e. the graph relationship function is defined as
G(V) = tr(V S^{-1} V^T)   (19)
s.t. S ≥ 0; tr(S) = 1,
and is used to initialize the graph structure S. The solution of the model is the S that minimizes G(V). The analytical expression for S is
S = (V^T V)^{1/2} / tr((V^T V)^{1/2}),
where tr(·) denotes the trace of a matrix and V^T the transpose of the label matrix.
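Assuming the closed-form minimizer given above, the initial graph structure S can be computed as in the following sketch; the label matrix here is random illustrative data and the helper name init_graph_structure is hypothetical.

```python
# Minimal sketch of the inverse-covariance initialization of the label graph (step 204(1)).
# Assumption: the minimizer of tr(V S^{-1} V^T) s.t. S >= 0, tr(S) = 1 is
# S = (V^T V)^{1/2} / tr((V^T V)^{1/2}).
import numpy as np
from scipy.linalg import sqrtm

def init_graph_structure(V: np.ndarray) -> np.ndarray:
    """V: (n, C) label matrix with one n-dimensional column per label node."""
    root = sqrtm(V.T @ V).real          # (V^T V)^{1/2}, shape (C, C)
    return root / np.trace(root)        # normalize so that tr(S) = 1

V = np.random.randn(300, 10)            # n = 300 features, C = 10 labels (hypothetical)
S = init_graph_structure(V)
print(S.shape, np.trace(S))             # (10, 10) 1.0
```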
(2) Graph attention learning
To learn the label node representations, a dedicated graph attention learning network is proposed, comprising the two steps of node feature learning and node relationship learning.
First, node feature learning. The label matrix V input to the network is converted into a new label matrix V' = [v'_1, ..., v'_C], in which each label node aggregates its neighbourhood, v'_i = Σ_{j∈ρ_i(j)} s_ij M(v_j), where M(·) denotes a feature mapping function applied to each label node, v_j the representation of the j-th label node, s_ij the relationship score of label i and label j, v'_i the new feature of label i and v'_C the new feature of label C.
Second, node relationship learning. The new label matrix V' learned in the first step is input into the graph relationship function G(·) and the graph structure S' under the new label matrix is computed:
S' = (V'^T V')^{1/2} / tr((V'^T V')^{1/2}),
where V'^T denotes the transpose of the new label matrix. Note that V' and S' are the inputs to the next graph attention learning layer (i.e. they are fed back into the node feature learning step). The model thus stacks 2 to 3 graph attention learning layers in total and finally obtains the structured label matrix P, whose dimension is d_u × C.
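A rough sketch of one such graph attention learning layer, under the aggregation form v'_i = Σ_j s_ij M(v_j) assumed above, with a linear mapping M and illustrative dimensions; it repeats the init_graph_structure helper from the previous sketch so that it runs on its own.

```python
# Minimal sketch of one graph attention learning layer of step 204(2).
import numpy as np
from scipy.linalg import sqrtm

def init_graph_structure(V: np.ndarray) -> np.ndarray:
    root = sqrtm(V.T @ V).real
    return root / np.trace(root)

def graph_attention_layer(V: np.ndarray, S: np.ndarray, W: np.ndarray):
    """V: (n, C) node features, S: (C, C) relationship scores, W: (d_out, n) linear mapping M."""
    mapped = W @ V                         # M(v_j) for every node, shape (d_out, C)
    V_new = mapped @ S                     # v'_i = sum_j s_ij M(v_j), shape (d_out, C)
    S_new = init_graph_structure(V_new)    # graph structure S' under the new label matrix
    return V_new, S_new

n, C, d_u = 300, 10, 256
V = np.random.randn(n, C)
S = init_graph_structure(V)
V1, S1 = graph_attention_layer(V, S, np.random.randn(d_u, n))      # layer 1
P, S2 = graph_attention_layer(V1, S1, np.random.randn(d_u, d_u))   # layer 2 -> structured label matrix P
print(P.shape)                             # (256, 10) = (d_u, C)
```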
205: To obtain the label prediction scores of the short video, an information fusion scheme based on multi-head attention is proposed for the common visual modality potential representation h obtained in step 203 and the structured label matrix P obtained in step 204.
Multi-head attention allows the model to jointly process information from different representation subspaces at different positions. First, the query matrix Q, the key matrix K and the value matrix V for this task are determined.
Analysing the characteristics of the short video multi-label classification task, one short video may contain several labels, i.e. the relationship between the visual feature representation and the label representation of a short video is multiply coupled, and explicitly studying this coupling relationship benefits the classification task. A multi-head cross-modal fusion layer is therefore proposed: the labels are queried with the common representation of the short video visual features, the correlations are computed, and the common representation of the short video visual features is aligned with the label matrix.
First, the correlation between the label representation and the visual feature representation is considered. The relationship score r_i of the common visual modality potential representation h and the i-th class label vector p_i is computed as
r_i = cos(h, p_i) = h^T p_i / (||h||_2 ||p_i||_2),
where cos(·) denotes the cosine similarity function, ||·||_2 the 2-norm of a vector and h^T the transpose of the common visual modality potential representation. This yields the relationship vector r = [r_1, ..., r_C] between the short video visual feature representation and the label representation.
Inspired by the multi-head attention mechanism, a multi-head cross-modal fusion layer is proposed to compute the label representation corresponding to the visual feature representation. For the e-th head, the weighted projection H_e of the visual feature representation in the label space is computed:
H_e = (W_h^e h)(W_r^e r)^T,
where W_h^e denotes the projection parameters of the common visual modality potential representation, W_r^e the projection parameters of the relationship score vector and (·)^T the matrix transpose; H_e has dimension d_k × d_k. The visual weighted projection H_e is then fused with the label matrix P to obtain the label representation F_e with the semantic-awareness attribute:
F_e = H_e (W_p^e P),
where W_p^e denotes the projection parameters of the label representation; the label representation F_e has dimension d_k × C. Finally, the outputs of the attention heads are concatenated and linearly projected to obtain the label prediction score ŷ of the short video:
ŷ = W_o concat(F_1; F_2; ...; F_e),
where W_o denotes the linear projection matrix, concat(·) the concatenation function and F_1; F_2; ...; F_e the label representations computed by the heads respectively; the prediction score ŷ has dimension C.
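Under the head-wise forms assumed above (H_e as an outer product of two projections, F_e = H_e W_p^e P, and a final linear projection over the concatenated heads), the multi-head cross-modal fusion layer could be sketched as follows; the head count and all dimensions are illustrative.

```python
# Minimal sketch of the multi-head cross-modal fusion layer of step 205 under assumed forms.
import torch
import torch.nn as nn

class MultiHeadCrossModalFusion(nn.Module):
    def __init__(self, d_u=256, d_k=64, num_labels=10, num_heads=4):
        super().__init__()
        self.W_h = nn.ModuleList(nn.Linear(d_u, d_k, bias=False) for _ in range(num_heads))
        self.W_r = nn.ModuleList(nn.Linear(num_labels, d_k, bias=False) for _ in range(num_heads))
        self.W_p = nn.ModuleList(nn.Linear(d_u, d_k, bias=False) for _ in range(num_heads))
        self.W_o = nn.Linear(num_heads * d_k, 1, bias=False)            # final linear projection

    def forward(self, h, P):
        # h: (d_u,) common visual representation; P: (d_u, C) structured label matrix
        r = torch.cosine_similarity(h.unsqueeze(1), P, dim=0)           # relationship vector, (C,)
        heads = []
        for W_h, W_r, W_p in zip(self.W_h, self.W_r, self.W_p):
            H_e = torch.outer(W_h(h), W_r(r))                           # (d_k, d_k)
            F_e = H_e @ W_p.weight @ P                                  # (d_k, C)
            heads.append(F_e)
        F_cat = torch.cat(heads, dim=0)                                 # (num_heads*d_k, C)
        return self.W_o(F_cat.T).squeeze(-1)                            # label prediction scores, (C,)

h, P = torch.randn(256), torch.randn(256, 10)
scores = MultiHeadCrossModalFusion()(h, P)
print(scores.shape)                                                     # torch.Size([10])
```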
206: The conventional multi-label classification loss is adopted to measure the gap between the predicted label scores and the true label information:
L_cls = - Σ_{c=1}^{C} [ y_c log(ŷ_c) + (1 - y_c) log(1 - ŷ_c) ],
where log(·) denotes the logarithmic function, y the true label information of the short video and ŷ the label prediction score of the short video.
The overall loss function of the model L is therefore
L = L_cls + λ L_rec,
where λ is the trade-off parameter balancing the classification loss L_cls and the reconstruction loss L_rec.
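A minimal sketch of the overall loss L = L_cls + λ L_rec, using the standard multi-label binary cross-entropy for L_cls; the weight lambda_rec is an illustrative placeholder, not a value from the patent.

```python
# Minimal sketch of the overall loss of step 206: multi-label BCE plus weighted reconstruction loss.
import torch
import torch.nn.functional as F

def total_loss(scores, y_true, u, u_hat, lambda_rec=0.1):
    # scores: raw label prediction scores (C,); y_true: binary ground-truth labels (C,)
    cls_loss = F.binary_cross_entropy_with_logits(scores, y_true)   # multi-label classification loss
    rec_loss = torch.sum((u - u_hat) ** 2)                          # autoencoder reconstruction loss
    return cls_loss + lambda_rec * rec_loss

loss = total_loss(torch.randn(10), torch.randint(0, 2, (10,)).float(),
                  torch.randn(1024), torch.randn(1024))
```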
Throughout training and testing, the performance of the model is evaluated with five metrics, namely Coverage, Ranking Loss, mean Average Precision (mAP), Hamming Loss and One-error, to ensure the objectivity of the experimental results, where: (1) Coverage calculates how many labels are needed on average to cover all the correct labels of an instance; it is loosely related to the precision at the level of perfect recall, and the smaller the value, the better the performance; (2) Ranking Loss calculates the average fraction of reversely ordered label pairs of an instance; the smaller the value, the better the performance; (3) mAP represents the mean of the per-category average precision; the larger the value, the better the performance; (4) Hamming Loss measures how often labels are misclassified; the smaller the value, the better the performance; (5) One-error counts how often the label with the highest predicted probability is not in the true label set; the smaller the value, the better the performance. (The results are shown in FIG. 3.)
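For reference, four of the five metrics are available in scikit-learn and One-error is easy to write by hand; note that label_ranking_average_precision_score is a label-ranking average precision and may differ from the per-category mAP described in the patent.

```python
# Minimal sketch of the five evaluation metrics; y_true is binary, y_score holds predicted scores.
import numpy as np
from sklearn.metrics import (coverage_error, label_ranking_loss,
                             label_ranking_average_precision_score, hamming_loss)

def one_error(y_true, y_score):
    top = np.argmax(y_score, axis=1)                           # highest-scoring label per sample
    return np.mean([y_true[i, top[i]] == 0 for i in range(len(top))])

y_true = np.random.randint(0, 2, (100, 10))
y_score = np.random.rand(100, 10)

print(coverage_error(y_true, y_score))                         # Coverage (smaller is better)
print(label_ranking_loss(y_true, y_score))                     # Ranking Loss (smaller is better)
print(label_ranking_average_precision_score(y_true, y_score))  # ranking-based AP (larger is better)
print(hamming_loss(y_true, (y_score > 0.5).astype(int)))       # Hamming Loss (smaller is better)
print(one_error(y_true, y_score))                              # One-error (smaller is better)
```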
In summary, to address the drawbacks of the limited duration and insufficient information of short videos, the invention learns the common potential representation of the visual modality and the label representation from the two perspectives of the content information and the label information respectively, and finally fuses the representations of the two information spaces to obtain the label prediction scores; the whole process makes full use of the information of every modality of the short video. First, the multi-modal representation learning problem in short video is studied and a deep multi-modal unified representation learning scheme is proposed in which the visual modality information is primary and the other modality information is auxiliary; specifically, four subspaces are constructed from the perspective of missing modalities to learn the complementarity of information among the modalities, the consistency of the visual modality feature information is further considered, and an auto-encoding/decoding network is used to learn the common potential representation of the visual modality. Then, the label information of the short video is explored, and a new approach to label correlation learning is proposed from the two aspects of inverse covariance estimation and a graph attention network. Finally, a cross-modal information fusion scheme based on multi-head attention is proposed for the representations of the two information spaces to obtain the final label prediction scores.
In the embodiments of the present invention, unless a specific model of a device is described, the model of each device is not limited, as long as the device can perform the functions described above.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and that the above embodiments of the present invention are provided for description only and do not indicate any order of preference.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A short video classification method based on a complete representation of multi-modal features, the method comprising:
for the content information of the short video itself, taking the visual modality features as primary, constructing four subspaces from the perspective of missing modalities and obtaining a potential feature representation in each, and further fusing the potential feature representations of the four subspaces with an auto-encoding/decoding network so that a more robust and effective common potential representation is learned;
for the label information, using inverse covariance estimation and a graph attention network to explore the correlations among labels and update the label representation, obtaining the label vector representation corresponding to the short video;
proposing a cross-modal fusion scheme based on multi-head attention for the common potential representation and the label vector representation, which is used to obtain the label prediction scores of the short video;
wherein the overall loss function of the model consists of the conventional multi-label classification loss and the reconstruction loss of the auto-encoding/decoding network, and is used to measure the difference between the network output and the ground truth and to guide the network towards the optimal solution of the model.
2. The short video classification method based on a complete representation of multi-modal features according to claim 1, wherein the two types of visual modality potential representations are: the unique visual modality potential representation, and the visual modality potential representation complemented by the information of the other modalities.
3. The short video classification method based on a complete representation of multi-modal features according to claim 2, wherein the unique visual modality potential representation is:
h_v = φ_v(z_v; θ_v),
where φ_v(·) denotes the mapper of the visual features, θ_v denotes the network parameters to be learned, the visual modality potential representation h_v has dimension d_h, and z_v denotes the original visual modality feature.
4. The short video classification method based on a complete representation of multi-modal features according to claim 3, wherein the visual modality potential representation under complementation by different modality information is obtained as follows:
the original visual modality feature z_v and the audio modality feature ẑ_a in the visual representation space are added and fed into the feature fusion mapper φ_a to generate the visual modality potential representation h̃_a supplemented with audio modality information:
h̃_a = φ_a(z_v ⊕ ẑ_a; θ_a),
where θ_a denotes the feature fusion mapper parameters to be learned and ⊕ denotes element-wise addition of the vectors;
the visual modality potential representation h̃_t supplemented with trajectory modality information is
h̃_t = φ_t(z_v ⊕ ẑ_t; θ_t),
where φ_t denotes the feature fusion mapper and θ_t the feature fusion mapper parameters to be learned;
when the original visual modality feature z_v, the audio modality feature z_a and the trajectory modality feature z_t are all present, the audio information and the trajectory information are used jointly to supplement the visual information, giving the new visual modality potential representation h̃_at:
h̃_at = φ_at(z_v ⊕ ẑ_at; θ_at),
where φ_at denotes the feature fusion mapper and θ_at the feature fusion mapper parameters to be learned.
5. The short video classification method based on a complete representation of multi-modal features according to claim 1, wherein the reconstruction loss function is:
L_rec = ||u - û||²,
s.t. h = g_ae(u; W_ae), û = g_dg(h; W_dg),
where u is the concatenated vector, h is the common potential representation of the visual modality, û is the reconstructed representation, g_ae(·) denotes the encoding network, g_dg(·) the decoding network, W_ae the parameters to be learned of the encoding network and W_dg the parameters to be learned of the decoding network; the common visual modality potential representation h has dimension d_u and the reconstructed representation û has dimension 2d_h.
6. The short video classification method based on a complete representation of multi-modal features according to claim 1, wherein exploring the correlations among labels and updating the label representation with inverse covariance estimation and a graph attention network to obtain the label vector representation corresponding to the short video specifically comprises:
introducing an inverse covariance estimate and, for a given label matrix V, finding the inverse covariance matrix S^{-1} that characterizes the pairwise relationships of the labels;
converting the label matrix V input to the network into a new label matrix, inputting the new label matrix into the graph relationship function G(·), and computing the graph structure S' under the new label matrix.
7. The short video classification method based on a complete representation of multi-modal features according to claim 1, wherein the cross-modal fusion scheme based on multi-head attention is as follows:
the labels are queried with the common potential representation of the short video visual features, the correlations are computed, and the common potential representation of the short video visual modality is aligned with the label matrix.
CN202110282974.3A 2021-03-16 2021-03-16 Short video classification method based on multi-mode feature complete representation Withdrawn CN113158798A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110282974.3A CN113158798A (en) 2021-03-16 2021-03-16 Short video classification method based on multi-mode feature complete representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110282974.3A CN113158798A (en) 2021-03-16 2021-03-16 Short video classification method based on multi-mode feature complete representation

Publications (1)

Publication Number Publication Date
CN113158798A true CN113158798A (en) 2021-07-23

Family

ID=76887371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110282974.3A Withdrawn CN113158798A (en) 2021-03-16 2021-03-16 Short video classification method based on multi-mode feature complete representation

Country Status (1)

Country Link
CN (1) CN113158798A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657272A (en) * 2021-08-17 2021-11-16 山东建筑大学 Micro-video classification method and system based on missing data completion
CN113743277A (en) * 2021-08-30 2021-12-03 上海明略人工智能(集团)有限公司 Method, system, equipment and storage medium for short video frequency classification
CN113989697A (en) * 2021-09-24 2022-01-28 天津大学 Short video frequency classification method and device based on multi-mode self-supervision deep countermeasure network


Similar Documents

Publication Publication Date Title
CN112966127B (en) Cross-modal retrieval method based on multilayer semantic alignment
CN113158798A (en) Short video classification method based on multi-mode feature complete representation
CN112287170B (en) Short video classification method and device based on multi-mode joint learning
US7996762B2 (en) Correlative multi-label image annotation
CN106202256B (en) Web image retrieval method based on semantic propagation and mixed multi-instance learning
CN112417097B (en) Multi-modal data feature extraction and association method for public opinion analysis
Ma et al. A weighted KNN-based automatic image annotation method
CN111783903B (en) Text processing method, text model processing method and device and computer equipment
CN115131638B (en) Training method, device, medium and equipment for visual text pre-training model
CN112800292A (en) Cross-modal retrieval method based on modal specificity and shared feature learning
Gao et al. A hierarchical recurrent approach to predict scene graphs from a visual‐attention‐oriented perspective
CN114065048A (en) Article recommendation method based on multi-different-pattern neural network
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN115964482A (en) Multi-mode false news detection method based on user cognitive consistency reasoning
CN115618097A (en) Entity alignment method for prior data insufficient multi-social media platform knowledge graph
Lu et al. Cross-domain structure learning for visual data recognition
Li et al. Deep InterBoost networks for small-sample image classification
Zhou et al. Multi-modal multi-hop interaction network for dialogue response generation
CN116977701A (en) Video classification model training method, video classification method and device
CN116189047A (en) Short video classification method based on multi-mode information aggregation
CN115631008B (en) Commodity recommendation method, device, equipment and medium
Jia et al. An unsupervised person re‐identification approach based on cross‐view distribution alignment
Xu et al. Hierarchical composition learning for composed query image retrieval
CN116414938A (en) Knowledge point labeling method, device, equipment and storage medium
Zuo et al. UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210723

WW01 Invention patent application withdrawn after publication