CN116524407A - Short video event detection method and device based on multi-modal representation learning - Google Patents
Short video event detection method and device based on multi-modal representation learning
- Publication number
- CN116524407A
- Authority
- CN
- China
- Prior art keywords
- short video
- representation
- auditory
- visual
- mode
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/44—Event detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/809—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
- G06V10/811—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a short video event detection method and device based on multi-modal representation learning. The method comprises: constructing a potential sequence property acquisition module that explores the potential characteristics of the short video visual and auditory modality information over the preceding and following sequence of the short video; constructing a cyclic interaction information embedding module that builds a cyclic matrix for each modality, fully explores the relations among the multi-modal data and the way they are coupled, and mines the potential correlations among the multi-modal feature elements; obtaining, through a multi-modal attention fusion network, a fused feature representation enhanced by local and global attention characteristics; and performing event detection using the short video multi-modal fusion features obtained by training. The invention uses the visual and auditory modality information of short videos to construct a short video feature representation learning network that fully mines the potential correlation characteristics of the multi-modal information and their attention enhancement, thereby detecting short video events and providing a new idea for solving the short video event detection problem.
Description
Technical Field
The invention belongs to the technical field of multimedia and big data analysis, and particularly relates to a short video event detection method and device based on multi-modal representation learning.
Background
With the rapid development of the short video industry, short video content analysis, typified by short video event detection, has received increasing attention. Short video event detection helps to address the problem of supervising short video content, supporting the continued healthy development of the industry. However, as the number of short videos keeps growing and their information becomes increasingly complex and diverse, how to use the existing short video information to quickly and efficiently find the short videos that users need has become a pressing problem.
At present, artificial intelligence techniques, typified by deep learning, have developed rapidly in many fields and are widely applied to video information processing. Using artificial intelligence technology to solve the short video event detection problem can both promote the development of the field of computer vision and improve user experience, and therefore has research value and practical application value.
Disclosure of Invention
In order to solve the technical problems, the invention provides a short video event detection method and device based on multi-mode representation learning, so as to solve the problem of how to quickly and efficiently search short videos required by users by utilizing the existing short video information.
In order to achieve the above object, the present invention provides a short video event detection method based on multi-modal representation learning, comprising the steps of:
constructing a potential sequence property acquisition module by means of a bidirectional long short-term memory network, and acquiring, through the potential sequence property acquisition module, the potential characteristics of the multi-modal information over the preceding and following sequence of the short video, so as to obtain a visual representation containing potential correlation information between preceding and following frames and an auditory representation containing potential correlation information between preceding and following audio segments;
constructing a cyclic interaction information embedding module based on the visual representation and the auditory representation, constructing a cyclic matrix for each modality through the cyclic interaction information embedding module, and acquiring, based on the cyclic matrices, feature representations embedded with inter-modal potential element interaction information;
mining the local attention and the global attention of the multi-modal information through a multi-modal attention fusion network to obtain a fused modal feature representation weighted by local and global attention characteristics;
and inputting the fused modal feature representation into a classifier to obtain event category scores, and obtaining the short video event detection result based on the scores.
Preferably, the process of constructing the potential sequence property acquisition module includes: inputting the visual modality features and the auditory modality features of the short video, and encoding them through a bidirectional long short-term memory network, the encoding formulas being:
$H_v = \text{Bi-LSTM}(x_v; \theta_v)$
$H_s = \text{Bi-LSTM}(x_s; \theta_s)$
where $x_v \in \mathbb{R}^{l_v \times D_v}$ is the visual feature and $x_s \in \mathbb{R}^{l_s \times D_s}$ is the auditory feature; $D_v$ and $D_s$ denote the dimensions of the visual and auditory features, respectively; Bi-LSTM is a bidirectional long short-term memory network; $H_v \in \mathbb{R}^{l_v \times d_v}$ is the visual encoding feature obtained after Bi-LSTM training on the visual feature $x_v$, with $l_v$ the length of the visual encoding feature sequence and $d_v$ the dimension of the visual encoding feature; $H_s \in \mathbb{R}^{l_s \times d_s}$ is the auditory encoding feature obtained after Bi-LSTM training on the auditory feature $x_s$, with $l_s$ the length of the auditory encoding feature sequence and $d_s$ the dimension of the auditory encoding feature; and $\theta_v$ and $\theta_s$ are the network parameters to be learned.
Preferably, the method for constructing a cyclic matrix for each modality comprises: mapping the visual features and the auditory features into a low-dimensional space, and constructing the visual cyclic matrix and the auditory cyclic matrix from the visual projection vectors and the auditory projection vectors mapped into the low-dimensional space.
Preferably, the method for obtaining, based on the cyclic matrices, the feature representations embedded with inter-modal potential element interaction information comprises: applying matrix multiplication between the projection vectors and the cyclic matrices to obtain the feature representations embedded with inter-modal potential element interaction information.
Preferably, the calculation formula for obtaining the local attention characteristic is:
$Q_t = \mathrm{BN}(\mathrm{Conv}_2(\delta(\mathrm{BN}(\mathrm{Conv}_1(F_t))))), \quad t \in \{v, s\};$
where $\mathrm{Conv}_1$ denotes a 1×1 point-wise convolution used to reduce the dimensionality of the input features $F_v$ and $F_s$ to 1/r of the original, r being the dimension scaling ratio; BN denotes a BatchNorm layer; $\delta$ denotes the ReLU activation function; $\mathrm{Conv}_2$ denotes a 1×1 point-wise convolution that restores the feature dimension to that of the original input; $Q_t$, $t \in \{v, s\}$, denotes the captured modal feature with local attention characteristics; and $d_o$ is the dimension after mapping into the low-dimensional space.
Preferably, the calculation formula for obtaining the global attention characteristic is:
$G = \mathrm{GAP}[\mathrm{BN}[\mathrm{Conv}_2(\delta(\mathrm{BN}(\mathrm{Conv}_1(\mathrm{GAP}(F_h)))))]];$
where GAP denotes a global pooling operation; $F_h$ is the overall modal feature representation obtained by concatenating the feature representations embedded with inter-modal potential element interaction information; and $G$ denotes the captured overall modal feature representation with global attention characteristics.
Preferably, the calculation formula for obtaining the fused modal feature representation is:
where $F_v$ and $F_s$ are the visual and auditory features embedded with inter-modal potential element interaction information; $G$ denotes the captured overall modal feature representation with global attention characteristics; $Q_t$, $t \in \{v, s\}$, denotes the captured modal features with local attention characteristics; $\otimes$ denotes element-wise multiplication; and $\sigma$ denotes the sigmoid activation function.
Preferably, the calculation formula of the event category score is:
where the event category score is output by the fully connected layer, whose weight matrix is the parameter to be learned, and C denotes the number of event categories.
The invention also provides a short video event detection device based on multi-modal representation learning, comprising: a processor and a memory, the memory having program instructions stored therein, wherein the processor invokes the program instructions stored in the memory to cause the device to perform the method steps of any one of claims 1 to 8.
Compared with the prior art, the invention has the following advantages and technical effects:
1. The invention makes full use of the multi-modal information of short videos and mines the potential correlations of the preceding and following sequences;
2. By constructing a cyclic matrix for each modality, the invention explores the relations among the multi-modal data and the way they are coupled, mines the potential correlations among the multi-modal feature elements, and obtains feature representations embedded with inter-modal potential element interaction information;
3. The invention uses a multi-modal attention fusion network to mine the local attention and the global attention of the multi-modal information, guides the fusion of the multi-modal information with this network, obtains an attention-enhanced fused feature representation, and uses a classifier to compute the event category scores. The method provides a new approach to solving the short video event detection problem.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, illustrate and explain the application and are not to be construed as limiting the application. In the drawings:
fig. 1 is a flowchart of a short video event detection method according to an embodiment of the present invention.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
Example 1
As shown in fig. 1, the invention provides a short video event detection method based on multi-modal representation learning, which comprises the following steps:
101: constructing a potential sequence characteristic acquisition module, and acquiring potential characteristics of short video visual mode information and auditory mode information on a short video front-rear sequence by utilizing a two-way long short-time memory network;
102: guiding the learning of a cyclic interaction information embedding module by using the modal representation containing the potential relevant information of the previous and subsequent frames obtained in the step 101, respectively constructing cyclic matrixes for different modalities, fully exploring the relation and the coupling mode among multi-modal data, mining the potential relevance among multi-modal characteristic elements, and obtaining the characteristic representation of the potential element interaction information among the embedding modalities;
103: mining local attention and global attention of the multi-mode information through a multi-mode attention fusion network to obtain fusion characteristic representation with enhanced local attention and global attention characteristics;
104: and inputting the short video multi-mode fusion characteristics with the multi-mode information potential association characteristics and the attention enhanced thereof obtained through training into a classifier to obtain event category scores, and completing short video event detection tasks.
The scheme is further described below in conjunction with the calculation formulas and examples, and is described in detail below:
201: Input the visual and auditory modality features of a short video and mine the potential information correlations of its preceding and following sequences to obtain a visual representation containing potential correlation information between preceding and following frames and an auditory representation containing potential correlation information between preceding and following audio segments;
The short video visual features are $x_v \in \mathbb{R}^{l_v \times D_v}$ and the auditory features are $x_s \in \mathbb{R}^{l_s \times D_s}$, where $l_v$ and $l_s$ denote, respectively, the frame-sequence length after downsampling the short video and the length of the spectrogram picture sequence obtained by converting the short-video audio file, and $D_v$ and $D_s$ denote the dimensions of the visual and auditory features, respectively.
In order to acquire potential feature representations containing the correlations of the preceding and following sequence information, a bidirectional long short-term memory network is used to construct the potential sequence property acquisition module, which encodes the potential information of the two modality features:
$H_v = \text{Bi-LSTM}(x_v; \theta_v)$
$H_s = \text{Bi-LSTM}(x_s; \theta_s)$
where Bi-LSTM is a bidirectional long short-term memory network; $H_v \in \mathbb{R}^{l_v \times d_v}$ is the visual encoding feature obtained after Bi-LSTM training on the visual feature $x_v$, with $l_v$ its sequence length and $d_v$ its dimension; $H_s \in \mathbb{R}^{l_s \times d_s}$ is the auditory encoding feature obtained after Bi-LSTM training on the auditory feature $x_s$, with $l_s$ its sequence length and $d_s$ its dimension; and $\theta_v$ and $\theta_s$ are the network parameters to be learned. Through this module, a visual representation $H_v$ containing potential correlation information between preceding and following frames and an auditory representation $H_s$ containing potential correlation information between preceding and following audio segments are finally obtained.
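For illustration, a minimal PyTorch sketch of such a potential sequence property acquisition module is given below; the class name, hidden size and feature dimensions are assumptions made for the example and are not specified in the patent.

```python
# A minimal sketch of the potential sequence property acquisition module,
# assuming PyTorch. The class name, hidden size and feature dimensions are
# illustrative assumptions, not values taken from the patent.
import torch
import torch.nn as nn

class PotentialSequenceEncoder(nn.Module):
    """Encodes one modality's frame/spectrogram sequence with a Bi-LSTM."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # bidirectional=True yields the "Bi-" in Bi-LSTM; output dim is 2*hidden_dim
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, input_dim), e.g. (B, l_v, D_v) for the visual stream
        h, _ = self.bilstm(x)            # h: (batch, sequence_length, 2*hidden_dim)
        return h

# Usage sketch: one encoder per modality stream.
visual_encoder = PotentialSequenceEncoder(input_dim=2048, hidden_dim=256)
audio_encoder = PotentialSequenceEncoder(input_dim=128, hidden_dim=256)
H_v = visual_encoder(torch.randn(4, 32, 2048))   # (4, 32, 512)
H_s = audio_encoder(torch.randn(4, 64, 128))     # (4, 64, 512)
```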
202: guiding the learning of a cyclic interaction information embedding module by using the modal representation containing the potential relevant information of the previous and subsequent frames obtained in the step 201, respectively constructing cyclic matrixes for different modalities, fully exploring the relation and the coupling mode among multi-modal data, mining the potential relevance among multi-modal characteristic elements, and obtaining the characteristic representation of the potential element interaction information among the embedding modalities;
First, the visual and auditory features are mapped into a low-dimensional space:
$I_v = h_v W_v^{\mathsf{T}}$
$I_s = h_s W_s^{\mathsf{T}}$
where $h_v$ is the column-vector feature obtained by transposing $H_v$; $I_v$ is the visual projection vector mapped into the low-dimensional space, with $d_o$ the dimension after mapping; $W_v$ is the mapping matrix used to map $h_v$ into the low-dimensional space; $h_s$ is the column-vector feature obtained by transposing $H_s$; $I_s$ is the auditory projection vector mapped into the low-dimensional space, also of dimension $d_o$; and $W_s$ is the mapping matrix used to map $h_s$ into the low-dimensional space. Subsequently, $I_v$ and $I_s$ are used to construct the visual cyclic matrix $L_v$ and the auditory cyclic matrix $L_s$, respectively:
$L_v = \mathrm{circ}(I_v)$
$L_s = \mathrm{circ}(I_s)$
where $\mathrm{circ}(\cdot)$ denotes the operation that cyclically right-shifts the elements of a projection vector position by position and assembles the shifted copies into a matrix. Finally, so that the projection vectors and the elements of the cyclic matrices fully interact, matrix multiplication is applied between the cyclic matrices and the projection vectors, thereby obtaining the feature representations $F_v$ and $F_s$ embedded with inter-modal element interaction information; the specific formulas are:
$F_v = I_v L_s$
$F_s = I_s L_v$
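As an illustrative sketch, the cyclic interaction information embedding module could be implemented as follows in PyTorch; how the column-vector features $h_v$ and $h_s$ are obtained from $H_v$ and $H_s$ (here assumed to be already pooled into single vectors per sample), the mapping dimension $d_o$ and all names are assumptions of the example.

```python
# Sketch of the cyclic interaction information embedding module, assuming
# PyTorch. Here h_v and h_s are assumed to already be single vectors per
# sample (e.g. pooled/flattened from H_v and H_s); d_o and all names are
# assumptions of the example.
import torch
import torch.nn as nn

def circ(v: torch.Tensor) -> torch.Tensor:
    """Circulant matrix whose i-th row is v cyclically right-shifted i positions."""
    d = v.shape[-1]
    rows = [torch.roll(v, shifts=i, dims=-1) for i in range(d)]
    return torch.stack(rows, dim=-2)                     # (..., d, d)

class CyclicInteractionEmbedding(nn.Module):
    def __init__(self, dim_v: int, dim_s: int, d_o: int):
        super().__init__()
        self.W_v = nn.Linear(dim_v, d_o, bias=False)     # maps h_v into the low-dimensional space
        self.W_s = nn.Linear(dim_s, d_o, bias=False)     # maps h_s into the low-dimensional space

    def forward(self, h_v: torch.Tensor, h_s: torch.Tensor):
        I_v = self.W_v(h_v)                              # visual projection vector,   (B, d_o)
        I_s = self.W_s(h_s)                              # auditory projection vector, (B, d_o)
        L_v = circ(I_v)                                  # visual cyclic matrix,   (B, d_o, d_o)
        L_s = circ(I_s)                                  # auditory cyclic matrix, (B, d_o, d_o)
        # Matrix multiplication with the other modality's cyclic matrix embeds
        # inter-modal element interaction information: F_v = I_v L_s, F_s = I_s L_v.
        F_v = torch.bmm(I_v.unsqueeze(1), L_s).squeeze(1)   # (B, d_o)
        F_s = torch.bmm(I_s.unsqueeze(1), L_v).squeeze(1)   # (B, d_o)
        return F_v, F_s

module = CyclicInteractionEmbedding(dim_v=512, dim_s=512, d_o=64)
F_v, F_s = module(torch.randn(4, 512), torch.randn(4, 512))  # each (4, 64)
```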
203: in order to better fuse the modal characteristics containing different semantic characteristics, the local attention and the global attention of the multi-modal information are mined through a multi-modal attention fusion network, and finally fusion characteristic representation with enhanced local attention and global attention characteristics is obtained;
The feature representations $F_v$ and $F_s$ embedded with inter-modal element interaction information obtained in step 202 are combined to obtain an overall feature representation of the different modalities:
$F_h = \mathrm{Concat}(F_v, F_s)$
where $F_h$ denotes the overall feature representation obtained by combining the different modality representations, $d_h = d_o + d_o$ denotes its dimension, and $\mathrm{Concat}(\cdot)$ denotes the concatenation (cascading) operation.
The calculation formula for capturing the local attention characteristic is as follows:
$Q_t = \mathrm{BN}(\mathrm{Conv}_2(\delta(\mathrm{BN}(\mathrm{Conv}_1(F_t))))), \quad t \in \{v, s\}$
where $\mathrm{Conv}_1$ denotes a 1×1 point-wise convolution that reduces the dimensionality of the input features $F_v$ and $F_s$ to 1/r of the original, r being the dimension scaling ratio; BN denotes a BatchNorm layer; $\delta$ denotes the ReLU activation function; $\mathrm{Conv}_2$ denotes a 1×1 point-wise convolution that restores the feature dimension to that of the original input; and $Q_t$, $t \in \{v, s\}$, denotes the captured modal feature with local attention characteristics.
The calculation formula for capturing the global attention characteristic is as follows:
$G = \mathrm{GAP}[\mathrm{BN}[\mathrm{Conv}_2(\delta(\mathrm{BN}(\mathrm{Conv}_1(\mathrm{GAP}(F_h)))))]]$
where GAP denotes the global average pooling operation. Compared with the formula for capturing the local attention characteristic, the formula for capturing the global attention characteristic mainly adds global pooling layers; $F_h$ is the overall modal feature representation obtained by the concatenation operation; and $G$ denotes the captured overall modal feature representation with global attention characteristics.
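A minimal sketch of the local and global attention branches is given below, assuming PyTorch and treating each feature vector as a 1×1 feature map with $d$ channels so that 1×1 point-wise convolutions apply; the scaling ratio r, channel sizes and class names are illustrative assumptions.

```python
# Sketch of the local and global attention branches of the multi-modal
# attention fusion network, assuming PyTorch. Feature vectors are treated as
# 1x1 feature maps with d channels so that 1x1 point-wise convolutions apply;
# the scaling ratio r, channel sizes and class names are assumptions.
import torch
import torch.nn as nn

class AttentionBranch(nn.Module):
    def __init__(self, channels: int, r: int = 4, use_global_pooling: bool = False):
        super().__init__()
        reduced = max(channels // r, 1)
        self.use_gap = use_global_pooling
        self.gap = nn.AdaptiveAvgPool2d(1)               # GAP(.)
        self.body = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1), # Conv_1: reduce channels to 1/r
            nn.BatchNorm2d(reduced),                     # BN
            nn.ReLU(inplace=True),                       # delta
            nn.Conv2d(reduced, channels, kernel_size=1), # Conv_2: restore original channels
            nn.BatchNorm2d(channels),                    # BN
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, channels, H, W); vector features use H = W = 1
        if self.use_gap:
            return self.gap(self.body(self.gap(f)))      # global branch G
        return self.body(f)                              # local branch Q_t

# Usage sketch: Q_v, Q_s from the per-modality features, G from their concatenation F_h.
local_attn = AttentionBranch(channels=256, r=4)
global_attn = AttentionBranch(channels=512, r=4, use_global_pooling=True)
F_v, F_s = torch.randn(4, 256, 1, 1), torch.randn(4, 256, 1, 1)
F_h = torch.cat([F_v, F_s], dim=1)                       # (4, 512, 1, 1)
Q_v, Q_s, G = local_attn(F_v), local_attn(F_s), global_attn(F_h)
```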
Finally, the module performs a fusion operation on the modal features carrying the local attention characteristic and the global attention characteristic obtained above:
where $\otimes$ denotes element-wise multiplication, $\sigma$ denotes the sigmoid activation function, and the result is the fused modal feature representation weighted by the local and global attention characteristics.
204: Use the learned fused-modality short video feature representation to obtain event category scores and complete the short video event detection task;
The fused modal feature representation obtained above is input into a fully connected layer, and the Softmax(·) function is used to obtain the event detection prediction result:
where the weight matrix of the fully connected layer is the parameter to be learned and C denotes the number of event categories.
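A small sketch of this prediction step, assuming PyTorch; the fused-feature dimension and the number of event categories C are placeholder values.

```python
# Sketch of the event classification head: fully connected layer + Softmax
# over C event categories, assuming PyTorch. The feature dimension and C are
# placeholder values.
import torch
import torch.nn as nn

C = 10                                               # number of event categories (assumed)
classifier = nn.Linear(512, C)                       # parameters to be learned
F_fused = torch.randn(4, 512)                        # flattened fused modal feature representation
scores = torch.softmax(classifier(F_fused), dim=-1)  # event category scores, (4, C)
prediction = scores.argmax(dim=-1)                   # detected event category per short video
```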
Two loss functions are used for training: a classification loss and a reconstruction loss. For the classification loss, classification is performed with a binary cross-entropy loss:
where $\sigma(\cdot)$ is the sigmoid function; $y_{ij}$ is 1 if the i-th short video sample belongs to the j-th class and 0 otherwise; and $\hat{y}_{ij}$ is the probability that the i-th short video sample is predicted to belong to the j-th class.
For the low-rank constraint, in order to ensure that the short video content representation obtained after model training captures the characteristics among the elements of the different modalities without retaining excessive redundant information, thereby enhancing the robustness of the model, the nuclear norm $\|\cdot\|_*$ is used to build a low-rank constraint:
Finally, the following objective function is minimized during the model training phase:
where $\lambda_1$ and $\lambda_2$ are parameters that balance the contributions of the different losses.
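Putting the loss terms together, a sketch of the training objective might look as follows, assuming PyTorch; the exact forms of the reconstruction term and of the low-rank (nuclear-norm) term are not given in the text, so the versions below, as well as the λ values and tensor shapes, are assumptions.

```python
# Sketch of the combined training objective, assuming PyTorch. The exact
# reconstruction term and low-rank (nuclear-norm) term are not spelled out in
# the text, so their forms, the lambda values and the tensor shapes below are
# assumptions.
import torch
import torch.nn.functional as F

def objective(logits, targets, content_repr, recon, lambda1=0.1, lambda2=0.01):
    # Classification: binary cross entropy (sigma(.) folded into the
    # numerically stable with_logits form).
    cls_loss = F.binary_cross_entropy_with_logits(logits, targets)
    # Assumed reconstruction term: mean squared error against the content representation.
    recon_loss = F.mse_loss(recon, content_repr)
    # Low-rank constraint via the nuclear norm ||.||_* of the per-sample representation matrix.
    lowrank_loss = torch.linalg.matrix_norm(content_repr, ord='nuc').mean()
    return cls_loss + lambda1 * recon_loss + lambda2 * lowrank_loss

loss = objective(
    logits=torch.randn(4, 10, requires_grad=True),     # raw class scores
    targets=torch.randint(0, 2, (4, 10)).float(),      # multi-label ground truth
    content_repr=torch.randn(4, 32, 16),               # per-sample matrix representation (assumed shape)
    recon=torch.randn(4, 32, 16),                      # its reconstruction (assumed)
)
loss.backward()
```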
The model is trained with suitable parameter settings, and the final results are evaluated using Precision, Recall and mean Average Precision (mAP) as evaluation metrics.
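For reference, these metrics can be computed with scikit-learn as in the sketch below; the label layout and the macro averaging mode are assumptions of the example.

```python
# Sketch of the evaluation, assuming scikit-learn: Precision, Recall and mean
# Average Precision (mAP) from predicted scores and one-hot labels; the label
# layout and macro averaging are assumptions of the example.
import numpy as np
from sklearn.metrics import precision_score, recall_score, average_precision_score

y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])       # one-hot event labels
y_score = np.array([[0.8, 0.1, 0.1], [0.2, 0.6, 0.2],
                    [0.1, 0.3, 0.6], [0.7, 0.2, 0.1]])                # predicted class scores
y_pred = (y_score == y_score.max(axis=1, keepdims=True)).astype(int)  # top-scoring class per sample

precision = precision_score(y_true, y_pred, average='macro')
recall = recall_score(y_true, y_pred, average='macro')
mAP = average_precision_score(y_true, y_score, average='macro')       # mean of per-class AP
print(precision, recall, mAP)
```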
Example 2
The invention also provides a short video event detection device based on multi-modal representation learning, which comprises a processor and a memory, wherein program instructions are stored in the memory and the processor calls the program instructions stored in the memory to carry out the above method.
The foregoing is merely a preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily conceivable by those skilled in the art within the technical scope of the present application should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (9)
1. A short video event detection method based on multi-modal representation learning, comprising the steps of:
constructing a potential sequence property acquisition module by means of a bidirectional long short-term memory network, and acquiring, through the potential sequence property acquisition module, the potential characteristics of the multi-modal information over the preceding and following sequence of the short video, so as to obtain a visual representation containing potential correlation information between preceding and following frames and an auditory representation containing potential correlation information between preceding and following audio segments;
constructing a cyclic interaction information embedding module based on the visual representation and the auditory representation, constructing a cyclic matrix for each modality through the cyclic interaction information embedding module, and acquiring, based on the cyclic matrices, feature representations embedded with inter-modal potential element interaction information;
mining the local attention and the global attention of the multi-modal information through a multi-modal attention fusion network to obtain a fused modal feature representation weighted by local and global attention characteristics;
and inputting the fused modal feature representation into a classifier to obtain event category scores, and obtaining the short video event detection result based on the scores.
2. The short video event detection method based on multi-modal representation learning according to claim 1, wherein
the process of constructing the potential sequence property acquisition module comprises: inputting the visual modality features and the auditory modality features of the short video, and encoding them through a bidirectional long short-term memory network, the encoding formulas being:
$H_v = \text{Bi-LSTM}(x_v; \theta_v)$
$H_s = \text{Bi-LSTM}(x_s; \theta_s)$
where $x_v \in \mathbb{R}^{l_v \times D_v}$ is the visual feature and $x_s \in \mathbb{R}^{l_s \times D_s}$ is the auditory feature; $D_v$ and $D_s$ denote the dimensions of the visual and auditory features, respectively; Bi-LSTM is a bidirectional long short-term memory network; $H_v \in \mathbb{R}^{l_v \times d_v}$ is the visual encoding feature obtained after Bi-LSTM training on the visual feature $x_v$, with $l_v$ the length of the visual encoding feature sequence and $d_v$ the dimension of the visual encoding feature; $H_s \in \mathbb{R}^{l_s \times d_s}$ is the auditory encoding feature obtained after Bi-LSTM training on the auditory feature $x_s$, with $l_s$ the length of the auditory encoding feature sequence and $d_s$ the dimension of the auditory encoding feature; and $\theta_v$ and $\theta_s$ are the network parameters to be learned.
3. The short video event detection method based on multi-modal representation learning according to claim 1, wherein
the method for constructing a cyclic matrix for each modality comprises: mapping the visual features and the auditory features into a low-dimensional space, and constructing the visual cyclic matrix and the auditory cyclic matrix from the visual projection vectors and the auditory projection vectors mapped into the low-dimensional space.
4. The short video event detection method based on multi-modal representation learning according to claim 1, wherein
the method for obtaining, based on the cyclic matrices, the feature representations embedded with inter-modal potential element interaction information comprises: applying matrix multiplication between the projection vectors and the cyclic matrices to obtain the feature representations embedded with inter-modal potential element interaction information.
5. The short video event detection method based on multi-modal representation learning according to claim 1, wherein
the calculation formula for obtaining the local attention characteristic is:
$Q_t = \mathrm{BN}(\mathrm{Conv}_2(\delta(\mathrm{BN}(\mathrm{Conv}_1(F_t))))), \quad t \in \{v, s\};$
where $\mathrm{Conv}_1$ denotes a 1×1 point-wise convolution used to reduce the dimensionality of the input features $F_v$ and $F_s$ to 1/r of the original, r being the dimension scaling ratio; BN denotes a BatchNorm layer; $\delta$ denotes the ReLU activation function; $\mathrm{Conv}_2$ denotes a 1×1 point-wise convolution that restores the feature dimension to that of the original input; $Q_t$, $t \in \{v, s\}$, denotes the captured modal feature with local attention characteristics; and $d_o$ is the dimension after mapping into the low-dimensional space.
6. The short video event detection method based on multi-modal representation learning according to claim 1, wherein
the calculation formula for obtaining the global attention characteristic is:
$G = \mathrm{GAP}[\mathrm{BN}[\mathrm{Conv}_2(\delta(\mathrm{BN}(\mathrm{Conv}_1(\mathrm{GAP}(F_h)))))]];$
where GAP denotes a global pooling operation; $F_h$ is the overall modal feature representation obtained by concatenating the feature representations embedded with inter-modal potential element interaction information; and $G$ denotes the captured overall modal feature representation with global attention characteristics.
7. The short video event detection method based on multi-modal representation learning according to claim 1, wherein
the calculation formula for obtaining the fused modal feature representation is:
where $F_v$ and $F_s$ are the visual and auditory features embedded with inter-modal potential element interaction information; $G$ denotes the captured overall modal feature representation with global attention characteristics; $Q_t$, $t \in \{v, s\}$, denotes the captured modal features with local attention characteristics; $\otimes$ denotes element-wise multiplication; and $\sigma$ denotes the sigmoid activation function.
8. The short video event detection method based on multi-modal representation learning according to claim 1, wherein
the calculation formula of the event category score is:
where the event category score is output by the fully connected layer, whose weight matrix is the parameter to be learned, and C denotes the number of event categories.
9. A short video event detection device based on multi-modal representation learning, the device comprising: a processor and a memory, the memory having program instructions stored therein, wherein the processor invokes the program instructions stored in the memory to cause the device to perform the method steps of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310505779.1A CN116524407A (en) | 2023-05-05 | 2023-05-05 | Short video event detection method and device based on multi-modal representation learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310505779.1A CN116524407A (en) | 2023-05-05 | 2023-05-05 | Short video event detection method and device based on multi-modal representation learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116524407A true CN116524407A (en) | 2023-08-01 |
Family
ID=87406151
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310505779.1A Pending CN116524407A (en) | 2023-05-05 | 2023-05-05 | Short video event detection method and device based on multi-modal representation learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116524407A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117746518A (en) * | 2024-02-19 | 2024-03-22 | 河海大学 | Multi-mode feature fusion and classification method |
CN117746518B (en) * | 2024-02-19 | 2024-05-31 | 河海大学 | Multi-mode feature fusion and classification method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |