CN113158798A - Short video classification method based on multi-mode feature complete representation - Google Patents
Short video classification method based on multi-mode feature complete representation
- Publication number
- CN113158798A (application CN202110282974.3A)
- Authority
- CN
- China
- Prior art keywords
- representation
- visual
- modal
- label
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/71—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Multimedia (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Library & Information Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a short video classification method based on a complete representation of multi-modal features, which comprises the following steps: for the content information of the short video itself, with the visual modality features taken as the primary ones, four subspaces are constructed from the perspective of missing modalities and their potential feature representations are obtained respectively, and the potential representations of the four subspaces are further fused by an auto-encoding and decoding network so that a more robust and effective common potential representation is learned; for the label information, inverse covariance estimation and a graph attention network are adopted to explore the correlations among labels and update the label representation, obtaining the label vector representation corresponding to the short video; a cross-modal fusion scheme based on multi-head attention is proposed for the common potential representation and the label vector representation and is used to obtain the label prediction scores of the short video; the overall loss function of the model consists of the conventional multi-label classification loss and the reconstruction loss of the auto-encoding and decoding network, measures the difference between the network output and the ground truth, and guides the network to find the optimal solution of the model.
Description
Technical Field
The invention relates to the field of short video classification, in particular to a short video classification method based on a complete representation of multi-modal features.
Background
In recent years, with the popularization of intelligent terminals and the booming of social networks, more and more information is presented as multimedia content. High-definition cameras, large-capacity storage and high-speed network connections create extremely convenient shooting and sharing conditions for users, producing massive amounts of multimedia data.
As a new form of user-generated content, short videos have become enormously popular on social networks thanks to their unique advantages such as a low creation threshold, fragmented content and strong social attributes. In particular, since 2011, with the spread of mobile Internet terminals, faster networks and lower traffic charges, short videos have rapidly won the support and favor of content platforms, fans, capital and other parties. Statistics show that global mobile video traffic already accounts for more than half of all mobile data traffic and continues to grow rapidly. The sheer scale of short video data easily overwhelms the information users need and makes it difficult for them to find the desired short video content, so how to process and exploit this information efficiently becomes critical.
Artificial intelligence technology represented by deep learning is one of the most popular technologies at present and is widely applied in many fields such as computer vision.
Therefore, studying the short video classification task helps to drive innovation on related topics in the computer vision and multimedia fields, and has important application value and practical significance for improving user experience and developing the industry.
Disclosure of Invention
The invention provides a short video classification method based on a complete representation of multi-modal features, which solves the short video multi-label classification problem and evaluates the result, as described in detail below:
a method for short video classification based on a complete representation of multi-modal features, the method comprising:
for the content information of the short video itself, with the visual modality features taken as the primary ones, constructing four subspaces from the perspective of missing modalities and obtaining their potential feature representations respectively, and further fusing the potential representations of the four subspaces with an auto-encoding and decoding network so that a more robust and effective common potential representation is learned;
for the label information, adopting inverse covariance estimation and a graph attention network to explore the correlations among labels and update the label representation, obtaining the label vector representation corresponding to the short video;
proposing a cross-modal fusion scheme based on multi-head attention for the common potential representation and the label vector representation, which is used to obtain the label prediction scores of the short video;
the overall loss function of the model consists of the conventional multi-label classification loss and the reconstruction loss of the auto-encoding and decoding network, measures the difference between the network output and the ground truth, and guides the network to find the optimal solution of the model.
Wherein the two types of visual modality potential representations are: the potential representation of the visual modality alone, and the potential representation of the visual modality complemented by the information of the other modalities.
Further, the exploring of the correlations between labels and updating of the label representation with inverse covariance estimation and a graph attention network to obtain the label vector representation corresponding to the short video specifically comprises:
introducing an inverse covariance estimate and finding, for a given label matrix V, the inverse covariance matrix S^{-1} that characterizes the pairwise relationships of the labels, i.e. defining a graph relationship function to initialize the graph structure S;
and converting the label matrix V input to the network into a new label matrix, inputting the new label matrix into the graph relation function G(·), and calculating the graph structure S′ under the new label matrix.
Wherein the cross-modal fusion scheme based on multi-head attention is as follows: the labels are queried with the common potential representation of the short video's visual features, the correlations are computed, and the common potential representation of the visual modality and the label matrix are aligned.
The technical scheme provided by the invention has the beneficial effects that:
1. The invention studies the multi-modal representation learning problem in short videos and proposes a deep multi-modal unified representation learning scheme in which the visual modality information is primary and the other modalities are auxiliary. Four subspaces are constructed from the perspective of missing modalities to learn the information complementarity among modalities, yielding potential representations of two types of visual modality features, and, considering the consistency of the visual modality information, these two types of potential representations are fused with an auto-encoding and decoding network to obtain the common potential representation of the visual modality features. This process simultaneously accounts for missing modalities and for the complementarity and consistency of modality information, making full use of the modality information of the short video;
2. The invention explores the label information space of short videos and provides a new approach to label correlation learning from two aspects: inverse covariance estimation and a graph attention network;
3. To address the limited duration and insufficient information of short videos, the invention learns the common potential representation of the visual modality and the label representation from two angles, the content information and the label information of the short video, and proposes a cross-modal fusion strategy based on multi-head attention for these two representations to obtain the final label prediction scores.
The method makes full use of the modality information of the short video to learn the visual modality representation and the label representation that matter most for the multi-label classification task, and helps to improve the accuracy of short video multi-label classification.
Drawings
FIG. 1 is a diagram of the overall network framework of a short video classification method based on a complete representation of multi-modal features;
FIG. 2 is a subspace learning framework diagram;
FIG. 3 shows the experimental result data.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
The embodiment of the invention provides a short video classification method based on a complete multi-modal feature representation, which makes full use of the content information and the label information of the short video. As shown in FIG. 1, the method comprises the following steps:
101: For the content information, experience shows that the semantic feature representation of the visual modality is the most important in the short video multi-label classification task, so representation learning centred on the visual modality features is proposed: with the visual modality features taken as the primary ones, four subspaces are constructed from the perspective of missing modalities, the information complementarity between modalities is learned, and potential representations of two types of visual modality features are obtained. Considering the consistency of the visual modality information, and in order to obtain a more compact visual modality representation, the two types of visual modality potential representations produced by the four subspaces are fused with an auto-encoding and decoding network to learn the common potential representation of the visual modality features;
102: For the label information, a convex formulation (inverse covariance estimation) and a graph attention network are adopted to explore the correlations among labels and update the label representation, obtaining the label vector representation corresponding to the short video;
the label vector representation is used to explore a label representation suited to the short video data set and participates, together with the common potential representation of the visual modality features from step 101, in the multi-head cross-modal fusion network of step 103;
103: For the representations of the two information spaces, namely the common potential representation of the visual modality features obtained in step 101 and the label representation obtained in step 102, a cross-modal fusion scheme based on multi-head attention is proposed and used to obtain the label prediction scores of the short video;
the output of the multi-head cross-modal fusion network can be regarded as the label prediction scores of the input short video and is used directly in the classification loss function.
104: The overall loss function consists of the conventional multi-label classification loss and the reconstruction loss of the auto-encoding and decoding network; it measures the difference between the network output and the ground truth and guides the network to find the optimal solution of the model.
The performance of the scheme is evaluated with five indexes, namely Coverage, Ranking Loss, mean Average Precision, Hamming Loss and One-error, to ensure the objectivity of the experimental results.
In a specific implementation, before step 101, the method further includes:
inputting the short video and extracting the visual, audio and trajectory modality features with classical deep learning networks respectively.
In summary, the embodiment of the invention obtains the label prediction scores of the input short video by applying theories of multi-modal learning and label learning and combining them with the advantages of deep learning networks, and the classification results are accurate and effective.
Example 2
The scheme of Example 1 is further described below in combination with calculation formulas and examples:
201: The model takes a complete short video as input and extracts the visual, audio and trajectory modality features respectively.
For the visual modality, key frames are extracted, the classical image feature extraction network ResNet (residual network) is applied to all video key frames, and an averaging (AvePooling) operation is then performed to obtain the overall feature z_v of the visual modality:

z_v = AvePooling(ResNet(X_v; β_v))

where ResNet(·) is the residual network, AvePooling(·) is the averaging operation, X_v denotes the original visual features of the short video, β_v denotes the network parameters to be learned, and the visual modality feature z_v has dimension d_v.
For the audio modality, the spectrogram is drawn and the audio feature z_a is extracted from the spectrogram with "CNN + LSTM" (convolutional neural network + long short-term memory network):

z_a = LSTM(CNN(X_a; β_a))

where CNN(·) is the convolutional neural network, LSTM(·) is the long short-term memory network, X_a denotes the original audio features of the short video, β_a denotes the network parameters to be learned, and the audio modality feature z_a has dimension d_a.
For the trajectory modality, the TDD (trajectory-pooled deep-convolutional descriptor) method is used to extract the trajectory feature z_t jointly from the temporal and spatial domains:

z_t = TDD(X_t; β_t)

where TDD(·) is the trajectory-pooled deep descriptor network, X_t denotes the original trajectory information of the short video, β_t denotes the network parameters to be learned, and the trajectory modality feature z_t has dimension d_t.
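For illustration, the following is a minimal PyTorch sketch of the three feature extractors of step 201. It assumes a torchvision ResNet-50 backbone for the key frames and a toy CNN + LSTM for the spectrogram; the TDD trajectory descriptor has no standard off-the-shelf implementation, so it is represented only by a hypothetical loading helper. Layer sizes and helper names are illustrative assumptions, not part of the patent.

```python
# Minimal sketch of the step-201 feature extractors (assumptions: torchvision
# ResNet-50 for key frames, a toy CNN + LSTM for spectrograms, and a
# hypothetical loader standing in for the TDD trajectory descriptor).
import torch
import torch.nn as nn
from torchvision import models

class VisualExtractor(nn.Module):
    """z_v = AvePooling(ResNet(X_v)): frame-level ResNet features averaged over key frames."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the final fc layer

    def forward(self, frames):                  # frames: (num_keyframes, 3, 224, 224)
        feats = self.cnn(frames).flatten(1)     # (num_keyframes, 2048)
        return feats.mean(dim=0)                # average over key frames -> z_v, dimension d_v = 2048

class AudioExtractor(nn.Module):
    """z_a = LSTM(CNN(spectrogram)): CNN over the spectrogram, LSTM over its time axis."""
    def __init__(self, d_a=128):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d((32, 32)))
        self.lstm = nn.LSTM(input_size=16 * 32, hidden_size=d_a, batch_first=True)

    def forward(self, spec):                    # spec: (1, freq, time) spectrogram
        x = self.cnn(spec.unsqueeze(0))         # (1, 16, 32, 32)
        x = x.permute(0, 3, 1, 2).flatten(2)    # (1, time=32, 16*32)
        _, (h, _) = self.lstm(x)
        return h[-1].squeeze(0)                 # z_a, dimension d_a

def trajectory_features(tdd_file):              # hypothetical helper: TDD features are
    return torch.load(tdd_file)                 # assumed pre-extracted by an external tool -> z_t
```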
202: Modality subspace learning centred on the visual modality.
The model considers the visual, audio and trajectory modalities of the short video. A specific short video generally contains video pictures, i.e. the visual modality features always exist, but whether the other two modalities are missing is uncertain, giving four possible missing-modality situations in total. Experience shows that the potential representation of the visual modality is crucial in the short video multi-label classification task, so the four subspaces are constructed around learning the visual modality potential representation; that is, two main cases are discussed, the potential representation of the visual modality alone and the potential representation of the visual modality complemented by the information of the other modalities, to ensure that the visual modality potential representation is fully mined. (The visual modality feature z_v, the audio modality feature z_a and the trajectory modality feature z_t are all obtained in step 201.)
(i) Potential representation of the visual modality alone

h_v = φ_v(z_v; θ_v)

where φ_v(·) is the visual-feature-specific mapper, θ_v denotes the network parameters to be learned, and the visual modality potential representation h_v has dimension d_h.
(ii) Potential representation of the visual modality complemented by other modality information
A normalized exponential function is introduced to quantitatively analyse the complementary relationship between the other modality information and the visual modality information; the other modality features are transformed into corresponding features in the visual representation space, added to the visual modality features, and fed into a feature fusion mapper to obtain the visual modality potential representation after information complementation.
① When only the visual modality feature z_v and the audio modality feature z_a exist, the correlation matrix U_a of the two modality features is first computed:

U_a = z_v^T z_a

where z_v^T is the transpose of the visual modality feature z_v, d_v is the dimension of z_v, d_a is the dimension of z_a, and the correlation matrix U_a has dimension d_v × d_a. A normalized exponential function is then applied to obtain the correlation score matrix:

A_a = softmax(U_a)

where softmax(·) is the normalized exponential function (likewise below) and the correlation score matrix A_a has dimension d_v × d_a. The correlation score matrix A_a is used to transform the audio modality feature z_a into the visual representation space, obtaining the audio modality feature representation ẑ_a in the visual representation space:

ẑ_a = A_a z_a^T

where z_a^T is the transpose of the audio modality feature z_a and the audio modality feature representation ẑ_a in the visual representation space has dimension d_v. Finally, the original visual modality feature z_v and the audio modality feature ẑ_a in the visual representation space are added and fed into the feature fusion mapper φ_a, generating the visual modality potential representation h_a supplemented with the audio modality information:

h_a = φ_a(z_v ⊕ ẑ_a; θ_a)

where θ_a denotes the feature fusion mapper parameters to be learned, ⊕ denotes addition of the corresponding elements of the vectors, and the visual modality potential representation h_a generated by the feature fusion mapper has dimension d_h.

② When only the visual modality feature z_v and the trajectory modality feature z_t exist, the same strategy as in ① is adopted to obtain the visual modality potential representation supplemented with the trajectory modality information:

U_t = z_v^T z_t

where U_t is the correlation matrix of the visual modality feature z_v and the trajectory modality feature z_t, with dimension d_v × d_t, and z_v^T is the transpose of z_v.

A_t = softmax(U_t)

where A_t is the correlation score matrix between the visual and trajectory modalities, with dimension d_v × d_t.

ẑ_t = A_t z_t^T

where ẑ_t is the trajectory modality feature in the visual representation space, z_t^T is the transpose of the original trajectory modality feature z_t, and ẑ_t has dimension d_v.

h_t = φ_t(z_v ⊕ ẑ_t; θ_t)

where φ_t is the feature fusion mapper, θ_t denotes its parameters to be learned, and the visual modality potential representation h_t generated by the feature fusion mapper has dimension d_h.

③ When the visual modality feature z_v, the audio modality feature z_a and the trajectory modality feature z_t all exist, the audio information and the trajectory information are used jointly to supplement the visual information. The joint information representation is first formed:

z_at = concat(z_a, z_t)

where concat(·) is the concatenation function of the feature vectors and the joint information representation z_at has dimension d_a + d_t. The same strategy as in ① is then adopted to obtain the new visual modality potential representation when the information of all three modalities exists:

U_at = z_v^T z_at

where U_at is the correlation matrix among the three modalities, z_v^T is the transpose of z_v, and U_at has dimension d_v × (d_a + d_t).

A_at = softmax(U_at)

where A_at is the correlation score matrix among the three modalities, with dimension d_v × (d_a + d_t).

ẑ_at = A_at z_at^T

where ẑ_at is the joint representation of the audio and trajectory modalities in the visual representation space, z_at^T is the transpose of the original joint information representation, and ẑ_at has dimension d_v.

h_at = φ_at(z_v ⊕ ẑ_at; θ_at)

where φ_at is the feature fusion mapper, θ_at denotes its parameters to be learned, and the visual modality potential representation h_at generated by the feature fusion mapper has dimension d_h.
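The following is a minimal sketch of one of the subspaces of step 202 (audio complementing vision). The exact equations appear only as images in the source, so the outer-product correlation, the softmax axis and the tanh nonlinearity used here are assumptions chosen to be consistent with the stated dimensions (U_a is d_v × d_a, the projected audio feature has dimension d_v).

```python
# Minimal sketch of the audio-complemented visual subspace of step 202.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioComplementedVisual(nn.Module):
    def __init__(self, d_v, d_a, d_h):
        super().__init__()
        self.phi_v = nn.Linear(d_v, d_h)   # visual-feature-specific mapper phi_v
        self.phi_a = nn.Linear(d_v, d_h)   # feature fusion mapper phi_a

    def forward(self, z_v, z_a):           # z_v: (d_v,), z_a: (d_a,)
        h_v = torch.tanh(self.phi_v(z_v))  # potential representation of the visual modality alone
        U_a = torch.outer(z_v, z_a)        # correlation matrix, (d_v, d_a)
        A_a = F.softmax(U_a, dim=-1)       # correlation score matrix (softmax axis is an assumption)
        z_a_hat = A_a @ z_a                # audio feature mapped into the visual space, (d_v,)
        h_a = torch.tanh(self.phi_a(z_v + z_a_hat))  # element-wise addition, then fusion mapper
        return h_v, h_a
```

The trajectory-only and audio-plus-trajectory subspaces would follow the same pattern with z_t and concat(z_a, z_t) in place of z_a.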
203: Learning the consistency of the visual modality potential representations with the auto-encoding and decoding network.
The visual modality potential representations learned by the subspaces should be similar since, in theory, they all characterize the same visual content. An auto-encoding and decoding network is therefore used to project the two types of visual modality potential representations learned in step 202 into a common space as far as possible. This has two advantages: on the one hand it prevents over-fitting of the data to some extent and reduces its dimensionality, yielding a more compact visual modality potential representation; on the other hand it strengthens the effective connection among the four subspaces, making the subspace learning more meaningful. Step 202 produces two types of visual modality potential representations: the representation h_v of the visual modality alone, and the representation h_m of the visual modality under modality complementation, where m ∈ {a, t, at}. They are concatenated into the vector u, i.e. u = concat(h_v, h_m); u is then fed into the auto-encoding and decoding network to obtain the common visual modality potential representation h and the reconstructed representation û:

h = g_ae(u; W_ae),  û = g_dg(h; W_dg)

where g_ae(·) is the encoding network, g_dg(·) is the decoding network, W_ae and W_dg are the parameters to be learned of the encoding and decoding networks respectively, the common visual modality potential representation h has dimension d_u, and the reconstructed representation û has dimension 2d_h.
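A minimal sketch of this auto-encoding and decoding step might look as follows; the single-layer encoder/decoder, the ReLU activation and the squared-error reconstruction term are illustrative assumptions.

```python
# Sketch of step 203: fuse the two kinds of visual potential representations
# with an auto-encoding and decoding network.
import torch
import torch.nn as nn

class CommonRepresentationAE(nn.Module):
    def __init__(self, d_h, d_u):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * d_h, d_u), nn.ReLU())  # g_ae
        self.decoder = nn.Linear(d_u, 2 * d_h)                            # g_dg

    def forward(self, h_v, h_m):               # h_v, h_m: (d_h,) each
        u = torch.cat([h_v, h_m], dim=-1)      # concatenated vector u, (2*d_h,)
        h = self.encoder(u)                    # common visual representation h, (d_u,)
        u_hat = self.decoder(h)                # reconstructed representation, (2*d_h,)
        rec_loss = ((u - u_hat) ** 2).sum()    # reconstruction loss term L_rec
        return h, rec_loss
```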
204: Learning the label information space of the short video.
One of the key issues in the multi-label classification task is exploring label relationships. A graph attention network is constructed to explore the label correlations and compute the label matrix. To this end, the concept of a graph is first introduced. For the label set Y = {y_1, y_2, …, y_C}, consider the graph G(V, E), where V represents the set of label nodes and E ∈ |V| × |V| represents the adjacency matrix of label relationships. In particular, for any label node v_i, its neighbourhood is defined as ρ_i(j) = {j | v_j ∈ V} ∪ {v_i}. The original label matrix is V = [v_1, v_2, …, v_C], where v_C is the initial vector representation of label node C and the original feature dimension of a label is n.
(1) Building an initial graph structure
Since the initial relationships between the labels are unknown, an inverse covariance estimate is introduced: for the given label matrix V, the inverse covariance matrix S^{-1} is sought to characterize the pairwise relationships of the labels, i.e. the graph relationship function is defined as

G(V) = tr(V S^{-1} V^T)    (19)
s.t. S ≥ 0; tr(S) = 1

and is used to initialize the graph structure S. The solution of the model is the S that minimizes G(V), which is obtained from the analytical (closed-form) solution of this problem. Here tr(·) denotes the trace of a matrix and V^T denotes the transpose of the label matrix.
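The analytical solution for S is not reproduced in the text above. The sketch below uses the standard closed form S = (V^T V)^{1/2} / tr((V^T V)^{1/2}), which is the known minimizer of tr(V S^{-1} V^T) subject to S ≥ 0 and tr(S) = 1; treating this as the patent's analytical solution is an assumption.

```python
# Sketch of the graph initialisation of step 204 (the closed form is an assumption).
import torch

def init_graph_structure(V):
    """V: (n, C) label matrix whose columns are the label node vectors."""
    M = V.t() @ V                                    # (C, C), symmetric positive semidefinite
    eigvals, eigvecs = torch.linalg.eigh(M)
    sqrt_M = eigvecs @ torch.diag(eigvals.clamp(min=0).sqrt()) @ eigvecs.t()
    return sqrt_M / torch.trace(sqrt_M)              # S with tr(S) = 1, pairwise label relations
```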
(2) Graph attention learning
To learn the label node representations, a dedicated graph attention learning network is proposed, comprising two steps: node feature learning and node relation learning.
first, node feature learning. Consider the conversion of a tag matrix V input into the network into a new tag matrix
Wherein, M (·): feature mapping function applied on each label node, vj: j-th label node representation, sij: relationship score, v 'for tag i and tag j'i: new feature of tag i, v'C: label (R)C, the new characteristics of the alloy material,the dimensions of the new label features.
And secondly, learning the node relation. Inputting the new label matrix V 'learned in the first step into the graph relation function G (-) and calculating a graph structure S' under the new label matrix:
wherein, V'T: transpose of the new tag matrix. Note that: v ', S' are inputs to the next layer of the graph attention learning layer (equation-21). Thus, the model establishes 2 to 3 attention learning layers in total, and finally obtains the structured label matrix The dimension of the label matrix P is du×C。
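A sketch of one graph attention learning layer is given below: node feature learning aggregates each node's neighbourhood weighted by the relation scores s_ij and applies a shared mapping M, and node relation learning re-estimates the graph structure from the new label matrix. The ReLU nonlinearity and the exact aggregation form are assumptions; graph_fn stands for the graph relationship function G(·), for example the inverse-covariance initializer sketched above.

```python
# Sketch of one graph attention learning layer of step 204.
import torch
import torch.nn as nn

class GraphAttentionLayer(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.M = nn.Linear(dim_in, dim_out)    # feature mapping M(.) shared by all label nodes

    def forward(self, V_nodes, S, graph_fn):
        # V_nodes: (C, dim_in) label node features; S: (C, C) relation scores;
        # graph_fn: the graph relationship function G(.), e.g. the inverse-covariance
        # initializer sketched earlier.
        aggregated = S @ V_nodes                     # neighbourhood aggregation weighted by s_ij
        V_new = torch.relu(self.M(aggregated))       # node feature learning -> new label features v'_i
        S_new = graph_fn(V_new.t())                  # node relation learning on the new label matrix
        return V_new, S_new
```

Stacking two or three such layers, as described above, would yield the structured label matrix P.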
205: To obtain the label prediction scores of the short video, an information fusion scheme based on multi-head attention is proposed for the common visual modality potential representation h obtained in step 203 and the structured label matrix P obtained in step 204.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. First, the query matrix Q, the key matrix K and the value matrix V for this task are determined.
Analysing the characteristics of the short video multi-label classification task: a short video may carry several labels, i.e. the relationship between the visual feature representation and the label representation of the short video is multi-coupled, and modelling this coupling explicitly benefits the classification task. A multi-head cross-modal fusion layer is therefore proposed: the labels are queried with the common representation of the short video's visual features, the correlations are computed, and the common visual feature representation and the label matrix are aligned.
First, consider the correlation between the label representation and the visual feature representation. The relation score r_i between the common visual modality potential representation h and the i-th class label vector p_i is computed as

r_i = cos(h, p_i) = h^T p_i / (‖h‖_2 ‖p_i‖_2)

where cos(·) is the cosine similarity function, ‖·‖_2 computes the 2-norm of a vector, and h^T is the transpose of the common visual modality potential representation. This yields the relation vector r = [r_1, r_2, …, r_C] between the short video's visual feature representation and the label representation.
Inspired by the multi-head attention mechanism, a multi-head cross-modal fusion layer is proposed to compute the label representation corresponding to the visual feature representation. For the e-th head, the weighted projection H_e of the visual feature representation in label space is computed as

H_e = (W_h^e h)(W_r^e r)^T

where W_h^e denotes the projection parameters of the common visual modality potential representation, W_r^e denotes the projection parameters of the relation score vector, H_e has dimension d_k × d_k, and (·)^T computes the transpose of a matrix. The visual weighted projection H_e is then fused with the label matrix P to obtain the label representation F_e with semantic-aware attributes:

F_e = H_e (W_p^e P)

where W_p^e denotes the projection parameters of the label representation and the label representation F_e has dimension d_k × C. Finally, the outputs of the multiple attention heads are concatenated and linearly projected to obtain the label prediction scores ŷ of the short video:

ŷ = W_o concat(F_1; F_2; …; F_E)

where W_o is the linear projection matrix, concat(·) is the concatenation function, F_1; F_2; …; F_E are the label representations computed by the E heads respectively, and the prediction score ŷ has dimension C.
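A sketch of this multi-head cross-modal fusion layer is shown below. The cosine relation scores follow the text; the outer-product form of the per-head projection H_e, the shape of the output projection W_o and the final sigmoid are assumptions chosen to match the stated dimensions (H_e is d_k × d_k, F_e is d_k × C, the prediction score has dimension C).

```python
# Sketch of the multi-head cross-modal fusion layer of step 205.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadCrossModalFusion(nn.Module):
    def __init__(self, d_u, num_classes, d_k=64, num_heads=4):
        super().__init__()
        self.W_h = nn.ModuleList([nn.Linear(d_u, d_k, bias=False) for _ in range(num_heads)])
        self.W_r = nn.ModuleList([nn.Linear(num_classes, d_k, bias=False) for _ in range(num_heads)])
        self.W_p = nn.ModuleList([nn.Linear(d_u, d_k, bias=False) for _ in range(num_heads)])
        self.W_o = nn.Linear(num_heads * d_k * num_classes, num_classes)  # output projection (assumed shape)

    def forward(self, h, P):
        # h: (d_u,) common visual representation; P: (d_u, C) structured label matrix
        r = F.cosine_similarity(h.unsqueeze(1), P, dim=0)      # (C,) relation scores r_i = cos(h, p_i)
        heads = []
        for W_h, W_r, W_p in zip(self.W_h, self.W_r, self.W_p):
            H_e = torch.outer(W_h(h), W_r(r))                  # (d_k, d_k) weighted projection
            F_e = H_e @ W_p(P.t()).t()                         # (d_k, C) semantic-aware label representation
            heads.append(F_e.flatten())
        return torch.sigmoid(self.W_o(torch.cat(heads)))       # (C,) label prediction scores
```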
206: The conventional multi-label classification loss is adopted to measure the gap between the predicted label scores and the true label information:

L_cls = − Σ_{i=1}^{C} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]

where log(·) is the logarithmic function, y is the true label information of the short video, and ŷ denotes the label prediction scores of the short video.
The overall loss is L = L_cls + λ L_rec, where λ is the trade-off parameter balancing the classification loss L_cls and the reconstruction loss L_rec of the auto-encoding and decoding network.
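A sketch of the overall objective, treating the "traditional multi-label classification loss" as binary cross-entropy (an assumption consistent with the log terms mentioned above) and adding the reconstruction loss weighted by λ:

```python
# Sketch of the overall objective of step 206.
import torch
import torch.nn.functional as F

def total_loss(y_hat, y, rec_loss, lam=0.1):
    # y_hat: (C,) predicted scores in (0, 1); y: (C,) multi-hot ground-truth labels;
    # rec_loss: reconstruction loss of the auto-encoding and decoding network.
    cls_loss = F.binary_cross_entropy(y_hat, y.float())   # multi-label classification loss
    return cls_loss + lam * rec_loss                       # lambda balances the two terms
```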
Throughout the training and testing process, the performance of the model is evaluated with five indexes: Coverage, Ranking Loss, mean Average Precision (mAP), Hamming Loss and One-error, where: (1) Coverage computes how many labels are needed on average to cover all the correct labels of an instance; it is loosely related to the precision at the level of perfect recall, and a smaller value means better performance; (2) Ranking Loss computes the average fraction of reversely ordered label pairs for an instance; a smaller value means better performance; (3) mAP represents the mean of the per-category average precision; a larger value means better performance; (4) Hamming Loss measures how many times a label is mis-assigned; a smaller value means better performance; (5) One-error counts how often the label with the highest predicted probability is not in the true label set; a smaller value means better performance. (The results are shown in FIG. 3.)
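For reference, a sketch of two of the five evaluation indexes (Hamming Loss and One-error); the 0.5 threshold used for Hamming Loss is an assumption.

```python
# Sketch of two of the five evaluation indexes.
import torch

def hamming_loss(scores, labels, threshold=0.5):
    # scores, labels: (N, C); fraction of label positions predicted incorrectly
    pred = (scores >= threshold).float()
    return (pred != labels).float().mean().item()

def one_error(scores, labels):
    # fraction of samples whose top-scoring label is not in the true label set
    top = scores.argmax(dim=1)
    hit = labels.gather(1, top.unsqueeze(1)).squeeze(1)
    return (hit == 0).float().mean().item()
```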
In summary, addressing the drawbacks of short videos, namely their limited duration and insufficient information, the invention learns the common potential representation of the visual modality and the label representation from the two angles of content information and label information respectively, and finally fuses the representations of the two information spaces to obtain the label prediction scores; the whole process makes full use of the information of every modality of the short video. First, the multi-modal representation learning problem in short videos is studied and a deep multi-modal unified representation learning scheme is proposed in which the visual modality information is primary and the other modalities are auxiliary; specifically, four subspaces are constructed from the perspective of missing modalities to learn the information complementarity among modalities, the consistency of the visual modality information is further considered, and an auto-encoding and decoding network is used to learn the common potential representation of the visual modality. Then, the label information of the short video is explored and a new approach to label correlation learning is proposed from the two aspects of inverse covariance estimation and a graph attention network. Finally, a cross-modal information fusion scheme based on multi-head attention is proposed for the representations of the two information spaces to obtain the final label prediction scores.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (7)
1. A short video classification method based on a multi-modal complete feature representation, the method comprising:
for the content information of the short video itself, with the visual modality features taken as the primary ones, constructing four subspaces from the perspective of missing modalities and obtaining their potential feature representations respectively, and further fusing the potential representations of the four subspaces with an auto-encoding and decoding network so that a more robust and effective common potential representation is learned;
for the label information, adopting inverse covariance estimation and a graph attention network to explore the correlations among labels and update the label representation, obtaining the label vector representation corresponding to the short video;
proposing a cross-modal fusion scheme based on multi-head attention for the common potential representation and the label vector representation, which is used to obtain the label prediction scores of the short video;
the overall loss function of the model consists of the conventional multi-label classification loss and the reconstruction loss of the auto-encoding and decoding network, measures the difference between the network output and the ground truth, and guides the network to find the optimal solution of the model.
2. The short video classification method based on the multi-modal complete feature representation as claimed in claim 1, wherein the two types of visual modality potential representations are: the potential representation of the visual modality alone, and the potential representation of the visual modality complemented by the information of the other modalities.
3. The short video classification method based on multi-modal complete feature representation according to claim 2, wherein the unique visual modality potential representation is:
h_v = φ_v(z_v; θ_v)
wherein φ_v(·) denotes the visual feature mapper, θ_v denotes the network parameters to be learned, the visual modality potential representation h_v has dimension d_h, and z_v denotes the original visual modality features.
4. The method for classifying short videos based on a complete multi-modal feature representation according to claim 3, wherein the visual modality potential representation under the complementation of different modality information is obtained as follows:
the original visual modality feature z_v and the audio modality feature ẑ_a in the visual representation space are added and fed into the feature fusion mapper φ_a, generating the visual modality potential representation supplemented with the audio modality information:
h_a = φ_a(z_v ⊕ ẑ_a; θ_a)
wherein θ_a denotes the feature fusion mapper parameters to be learned and ⊕ denotes addition of the corresponding elements of the vectors;
likewise, the visual modality potential representation supplemented with the trajectory modality information is
h_t = φ_t(z_v ⊕ ẑ_t; θ_t)
wherein φ_t is the feature fusion mapper and θ_t denotes its parameters to be learned;
when the original visual modality feature z_v, the audio modality feature z_a and the trajectory modality feature z_t all exist, the audio information and the trajectory information are used jointly to supplement the visual information, obtaining the new visual modality potential representation
h_at = φ_at(z_v ⊕ ẑ_at; θ_at)
wherein φ_at is the feature fusion mapper and θ_at denotes its parameters to be learned.
5. The method of claim 1, wherein the reconstruction loss function is:
L_rec = ‖ u − û ‖_2^2, with u = concat(h_v, h_m), h = g_ae(u; W_ae) and û = g_dg(h; W_dg),
wherein u is the concatenated vector, h is the common potential representation of the visual modality, û is the reconstructed representation, g_ae(·) is the encoding network, g_dg(·) is the decoding network, W_ae and W_dg are the parameters to be learned of the encoding and decoding networks respectively, the common visual modality potential representation h has dimension d_u, and the reconstructed representation û has dimension 2d_h.
6. The method for classifying short videos based on complete multi-modal feature representation according to claim 1, wherein the exploring the correlation between tags and updating the tag representation by using inverse covariance estimation and a graph attention network to obtain the tag vector representation corresponding to the short videos specifically comprises:
introducing an inverse covariance estimate and finding, for a given label matrix V, the inverse covariance matrix S^{-1} that characterizes the pairwise relationships of the labels;
converting the label matrix V input to the network into a new label matrix, inputting the new label matrix into the graph relation function G(·), and calculating the graph structure S′ under the new label matrix.
7. The method according to claim 1, wherein the cross-modal fusion scheme based on multi-head attention is as follows:
querying the labels with the common potential representation of the short video's visual features, calculating the correlations, and aligning the common potential representation of the visual modality and the label matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110282974.3A CN113158798A (en) | 2021-03-16 | 2021-03-16 | Short video classification method based on multi-mode feature complete representation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110282974.3A CN113158798A (en) | 2021-03-16 | 2021-03-16 | Short video classification method based on multi-mode feature complete representation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113158798A true CN113158798A (en) | 2021-07-23 |
Family
ID=76887371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110282974.3A Withdrawn CN113158798A (en) | 2021-03-16 | 2021-03-16 | Short video classification method based on multi-mode feature complete representation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113158798A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113657272A (en) * | 2021-08-17 | 2021-11-16 | 山东建筑大学 | Micro-video classification method and system based on missing data completion |
CN113743277A (en) * | 2021-08-30 | 2021-12-03 | 上海明略人工智能(集团)有限公司 | Method, system, equipment and storage medium for short video frequency classification |
CN113989697A (en) * | 2021-09-24 | 2022-01-28 | 天津大学 | Short video frequency classification method and device based on multi-mode self-supervision deep countermeasure network |
- 2021
- 2021-03-16 CN CN202110282974.3A patent/CN113158798A/en not_active Withdrawn
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112966127B (en) | Cross-modal retrieval method based on multilayer semantic alignment | |
CN113158798A (en) | Short video classification method based on multi-mode feature complete representation | |
CN112287170B (en) | Short video classification method and device based on multi-mode joint learning | |
US7996762B2 (en) | Correlative multi-label image annotation | |
CN106202256B (en) | Web image retrieval method based on semantic propagation and mixed multi-instance learning | |
CN112417097B (en) | Multi-modal data feature extraction and association method for public opinion analysis | |
Ma et al. | A weighted KNN-based automatic image annotation method | |
CN111783903B (en) | Text processing method, text model processing method and device and computer equipment | |
CN115131638B (en) | Training method, device, medium and equipment for visual text pre-training model | |
CN112800292A (en) | Cross-modal retrieval method based on modal specificity and shared feature learning | |
Gao et al. | A hierarchical recurrent approach to predict scene graphs from a visual‐attention‐oriented perspective | |
CN114065048A (en) | Article recommendation method based on multi-different-pattern neural network | |
CN113239159A (en) | Cross-modal retrieval method of videos and texts based on relational inference network | |
CN115964482A (en) | Multi-mode false news detection method based on user cognitive consistency reasoning | |
CN115618097A (en) | Entity alignment method for prior data insufficient multi-social media platform knowledge graph | |
Lu et al. | Cross-domain structure learning for visual data recognition | |
Li et al. | Deep InterBoost networks for small-sample image classification | |
Zhou et al. | Multi-modal multi-hop interaction network for dialogue response generation | |
CN116977701A (en) | Video classification model training method, video classification method and device | |
CN116189047A (en) | Short video classification method based on multi-mode information aggregation | |
CN115631008B (en) | Commodity recommendation method, device, equipment and medium | |
Jia et al. | An unsupervised person re‐identification approach based on cross‐view distribution alignment | |
Xu et al. | Hierarchical composition learning for composed query image retrieval | |
CN116414938A (en) | Knowledge point labeling method, device, equipment and storage medium | |
Zuo et al. | UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20210723 |