CN116524407A - Short video event detection method and device based on multi-modal representation learning - Google Patents
Short video event detection method and device based on multi-modal representation learning
- Publication number
- CN116524407A
- Authority
- CN
- China
- Prior art keywords
- short video
- representation
- auditory
- visual
- mode
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/44—Event detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/809—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
- G06V10/811—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a short video event detection method and device based on multi-modal representation learning. The method comprises: constructing a potential sequence property acquisition module that explores the potential characteristics of the short video visual and auditory modality information over the preceding and following sequence of the short video; constructing a cyclic interaction information embedding module that builds a cyclic matrix for each modality, fully explores the relations among the multi-modal data and the way they are coupled, and mines the potential correlations among the multi-modal feature elements; obtaining, through a multi-modal attention fusion network, a fused feature representation enhanced by local and global attention characteristics; and performing event detection using the short video multi-modal fusion features obtained by training. The invention uses the visual and auditory modality information of short videos to construct a short video feature representation learning network that fully mines the potential correlation characteristics of the multi-modal information and their attention enhancement, thereby detecting short video events and providing a new idea for solving the short video event detection problem.
Description
Technical Field
The invention belongs to the technical field of multimedia and big data analysis, and particularly relates to a short video event detection method and device based on multi-modal representation learning.
Background
With the rapid development of the short video industry, short video content analysis, typified by short video event detection, has received increasing attention. Short video event detection helps to address the problem of supervising short video content, supporting the continued healthy development of the industry. However, as the number of short videos keeps growing and their information becomes increasingly complex and diverse, how to use the existing short video information to quickly and efficiently find the short videos that users need has become a pressing problem.
At present, artificial intelligence techniques, typified by deep learning, have developed rapidly in many fields and are widely applied to video information processing. Using artificial intelligence technology to solve the short video event detection problem can both promote the development of the field of computer vision and improve user experience, and therefore has research value and practical application value.
Disclosure of Invention
In order to solve the technical problems, the invention provides a short video event detection method and device based on multi-mode representation learning, so as to solve the problem of how to quickly and efficiently search short videos required by users by utilizing the existing short video information.
In order to achieve the above object, the present invention provides a short video event detection method based on multi-modal representation learning, comprising the steps of:
constructing a potential sequence property acquisition module by means of a bidirectional long short-term memory network, and acquiring, through the potential sequence property acquisition module, the potential characteristics of the multi-modal information over the preceding and following sequence of the short video, so as to obtain a visual representation containing potential correlation information between preceding and following frames and an auditory representation containing potential correlation information between preceding and following audio segments;
constructing a cyclic interaction information embedding module based on the visual representation and the auditory representation, constructing a cyclic matrix for each modality through the cyclic interaction information embedding module, and acquiring, based on the cyclic matrices, feature representations embedded with inter-modal potential element interaction information;
mining the local attention and the global attention of the multi-modal information through a multi-modal attention fusion network to obtain a fused modal feature representation weighted by local and global attention characteristics;
and inputting the fused modal feature representation into a classifier to obtain event category scores, and obtaining the short video event detection result based on the scores.
Preferably, the process of constructing the potential sequence property acquisition module includes: inputting the visual modality features and the auditory modality features of the short video, and encoding them through a bidirectional long short-term memory network, the encoding formulas being:
$H_v = \text{Bi-LSTM}(x_v; \theta_v)$
$H_s = \text{Bi-LSTM}(x_s; \theta_s)$
where $x_v \in \mathbb{R}^{l_v \times D_v}$ is the visual feature and $x_s \in \mathbb{R}^{l_s \times D_s}$ is the auditory feature; $D_v$ and $D_s$ denote the dimensions of the visual and auditory features, respectively; Bi-LSTM is a bidirectional long short-term memory network; $H_v \in \mathbb{R}^{l_v \times d_v}$ is the visual encoding feature obtained after Bi-LSTM training on the visual feature $x_v$, with $l_v$ the length of the visual encoding feature sequence and $d_v$ the dimension of the visual encoding feature; $H_s \in \mathbb{R}^{l_s \times d_s}$ is the auditory encoding feature obtained after Bi-LSTM training on the auditory feature $x_s$, with $l_s$ the length of the auditory encoding feature sequence and $d_s$ the dimension of the auditory encoding feature; and $\theta_v$ and $\theta_s$ are the network parameters to be learned.
Preferably, the method for constructing a cyclic matrix for each modality comprises: mapping the visual features and the auditory features into a low-dimensional space, and constructing the visual cyclic matrix and the auditory cyclic matrix from the visual projection vectors and the auditory projection vectors mapped into the low-dimensional space.
Preferably, the method for obtaining, based on the cyclic matrices, the feature representations embedded with inter-modal potential element interaction information comprises: applying matrix multiplication between the projection vectors and the cyclic matrices to obtain the feature representations embedded with inter-modal potential element interaction information.
Preferably, the calculation formula for obtaining the local attention characteristic is:
$Q_t = \mathrm{BN}(\mathrm{Conv}_2(\delta(\mathrm{BN}(\mathrm{Conv}_1(F_t))))), \quad t \in \{v, s\};$
where $\mathrm{Conv}_1$ denotes a 1×1 point-wise convolution used to reduce the dimensionality of the input features $F_v$ and $F_s$ to 1/r of the original, r being the dimension scaling ratio; BN denotes a BatchNorm layer; $\delta$ denotes the ReLU activation function; $\mathrm{Conv}_2$ denotes a 1×1 point-wise convolution that restores the feature dimension to that of the original input; $Q_t$, $t \in \{v, s\}$, denotes the captured modal feature with local attention characteristics; and $d_o$ is the dimension after mapping into the low-dimensional space.
Preferably, the calculation formula for obtaining the global attention characteristic is:
$G = \mathrm{GAP}[\mathrm{BN}[\mathrm{Conv}_2(\delta(\mathrm{BN}(\mathrm{Conv}_1(\mathrm{GAP}(F_h)))))]];$
where GAP denotes a global pooling operation; $F_h$ is the overall modal feature representation obtained by concatenating the feature representations embedded with inter-modal potential element interaction information; and $G$ denotes the captured overall modal feature representation with global attention characteristics.
Preferably, the calculation formula for obtaining the fused modal feature representation is:
where $F_v$ and $F_s$ are the visual and auditory features embedded with inter-modal potential element interaction information; $G$ denotes the captured overall modal feature representation with global attention characteristics; $Q_t$, $t \in \{v, s\}$, denotes the captured modal features with local attention characteristics; $\otimes$ denotes element-wise multiplication; and $\sigma$ denotes the sigmoid activation function.
Preferably, the calculation formula of the event category score is:
where the event category score is output by the fully connected layer, whose weight matrix is the parameter to be learned, and C denotes the number of event categories.
The invention also provides a short video event detection device based on multi-modal representation learning, comprising: a processor and a memory, the memory having program instructions stored therein, wherein the processor invokes the program instructions stored in the memory to cause the device to perform the method steps of any one of claims 1 to 8.
Compared with the prior art, the invention has the following advantages and technical effects:
1. The invention makes full use of the multi-modal information of short videos and mines the potential correlations of the preceding and following sequences;
2. By constructing a cyclic matrix for each modality, the invention explores the relations among the multi-modal data and the way they are coupled, mines the potential correlations among the multi-modal feature elements, and obtains feature representations embedded with inter-modal potential element interaction information;
3. The invention uses a multi-modal attention fusion network to mine the local attention and the global attention of the multi-modal information, guides the fusion of the multi-modal information with this network, obtains an attention-enhanced fused feature representation, and uses a classifier to compute the event category scores. The method provides a new approach to solving the short video event detection problem.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, illustrate and explain the application and are not to be construed as limiting the application. In the drawings:
fig. 1 is a flowchart of a short video event detection method according to an embodiment of the present invention.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
Example 1
As shown in fig. 1, the invention provides a short video event detection method based on multi-modal representation learning, which comprises the following steps:
101: constructing a potential sequence characteristic acquisition module, and acquiring potential characteristics of short video visual mode information and auditory mode information on a short video front-rear sequence by utilizing a two-way long short-time memory network;
102: guiding the learning of a cyclic interaction information embedding module by using the modal representation containing the potential relevant information of the previous and subsequent frames obtained in the step 101, respectively constructing cyclic matrixes for different modalities, fully exploring the relation and the coupling mode among multi-modal data, mining the potential relevance among multi-modal characteristic elements, and obtaining the characteristic representation of the potential element interaction information among the embedding modalities;
103: mining local attention and global attention of the multi-mode information through a multi-mode attention fusion network to obtain fusion characteristic representation with enhanced local attention and global attention characteristics;
104: and inputting the short video multi-mode fusion characteristics with the multi-mode information potential association characteristics and the attention enhanced thereof obtained through training into a classifier to obtain event category scores, and completing short video event detection tasks.
The scheme is further described below in conjunction with the calculation formulas and examples, and is described in detail below:
201: Input the visual and auditory modality features of a short video and mine the potential information correlations of its preceding and following sequences to obtain a visual representation containing potential correlation information between preceding and following frames and an auditory representation containing potential correlation information between preceding and following audio segments;
The short video visual features are $x_v \in \mathbb{R}^{l_v \times D_v}$ and the auditory features are $x_s \in \mathbb{R}^{l_s \times D_s}$, where $l_v$ and $l_s$ denote, respectively, the frame-sequence length after downsampling the short video and the length of the spectrogram picture sequence obtained by converting the short-video audio file, and $D_v$ and $D_s$ denote the dimensions of the visual and auditory features, respectively.
In order to acquire potential feature representations containing the correlations of the preceding and following sequence information, a bidirectional long short-term memory network is used to construct the potential sequence property acquisition module, which encodes the potential information of the two modality features:
$H_v = \text{Bi-LSTM}(x_v; \theta_v)$
$H_s = \text{Bi-LSTM}(x_s; \theta_s)$
where Bi-LSTM is a bidirectional long short-term memory network; $H_v \in \mathbb{R}^{l_v \times d_v}$ is the visual encoding feature obtained after Bi-LSTM training on the visual feature $x_v$, with $l_v$ its sequence length and $d_v$ its dimension; $H_s \in \mathbb{R}^{l_s \times d_s}$ is the auditory encoding feature obtained after Bi-LSTM training on the auditory feature $x_s$, with $l_s$ its sequence length and $d_s$ its dimension; and $\theta_v$ and $\theta_s$ are the network parameters to be learned. Through this module, a visual representation $H_v$ containing potential correlation information between preceding and following frames and an auditory representation $H_s$ containing potential correlation information between preceding and following audio segments are finally obtained.
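For illustration, a minimal PyTorch sketch of such a potential sequence property acquisition module is given below; the class name, hidden size and feature dimensions are assumptions made for the example and are not specified in the patent.

```python
# A minimal sketch of the potential sequence property acquisition module,
# assuming PyTorch. The class name, hidden size and feature dimensions are
# illustrative assumptions, not values taken from the patent.
import torch
import torch.nn as nn

class PotentialSequenceEncoder(nn.Module):
    """Encodes one modality's frame/spectrogram sequence with a Bi-LSTM."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # bidirectional=True yields the "Bi-" in Bi-LSTM; output dim is 2*hidden_dim
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, input_dim), e.g. (B, l_v, D_v) for the visual stream
        h, _ = self.bilstm(x)            # h: (batch, sequence_length, 2*hidden_dim)
        return h

# Usage sketch: one encoder per modality stream.
visual_encoder = PotentialSequenceEncoder(input_dim=2048, hidden_dim=256)
audio_encoder = PotentialSequenceEncoder(input_dim=128, hidden_dim=256)
H_v = visual_encoder(torch.randn(4, 32, 2048))   # (4, 32, 512)
H_s = audio_encoder(torch.randn(4, 64, 128))     # (4, 64, 512)
```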
202: guiding the learning of a cyclic interaction information embedding module by using the modal representation containing the potential relevant information of the previous and subsequent frames obtained in the step 201, respectively constructing cyclic matrixes for different modalities, fully exploring the relation and the coupling mode among multi-modal data, mining the potential relevance among multi-modal characteristic elements, and obtaining the characteristic representation of the potential element interaction information among the embedding modalities;
First, the visual and auditory features are mapped into a low-dimensional space:
$I_v = h_v W_v^{\mathsf{T}}$
$I_s = h_s W_s^{\mathsf{T}}$
where $h_v$ is the column-vector feature obtained by transposing $H_v$; $I_v$ is the visual projection vector mapped into the low-dimensional space, with $d_o$ the dimension after mapping; $W_v$ is the mapping matrix used to map $h_v$ into the low-dimensional space; $h_s$ is the column-vector feature obtained by transposing $H_s$; $I_s$ is the auditory projection vector mapped into the low-dimensional space, also of dimension $d_o$; and $W_s$ is the mapping matrix used to map $h_s$ into the low-dimensional space. Subsequently, $I_v$ and $I_s$ are used to construct the visual cyclic matrix $L_v$ and the auditory cyclic matrix $L_s$, respectively:
$L_v = \mathrm{circ}(I_v)$
$L_s = \mathrm{circ}(I_s)$
where $\mathrm{circ}(\cdot)$ denotes the operation that cyclically right-shifts the elements of a projection vector position by position and assembles the shifted copies into a matrix. Finally, so that the projection vectors and the elements of the cyclic matrices fully interact, matrix multiplication is applied between the cyclic matrices and the projection vectors, thereby obtaining the feature representations $F_v$ and $F_s$ embedded with inter-modal element interaction information; the specific formulas are:
$F_v = I_v L_s$
$F_s = I_s L_v$
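As an illustrative sketch, the cyclic interaction information embedding module could be implemented as follows in PyTorch; how the column-vector features $h_v$ and $h_s$ are obtained from $H_v$ and $H_s$ (here assumed to be already pooled into single vectors per sample), the mapping dimension $d_o$ and all names are assumptions of the example.

```python
# Sketch of the cyclic interaction information embedding module, assuming
# PyTorch. Here h_v and h_s are assumed to already be single vectors per
# sample (e.g. pooled/flattened from H_v and H_s); d_o and all names are
# assumptions of the example.
import torch
import torch.nn as nn

def circ(v: torch.Tensor) -> torch.Tensor:
    """Circulant matrix whose i-th row is v cyclically right-shifted i positions."""
    d = v.shape[-1]
    rows = [torch.roll(v, shifts=i, dims=-1) for i in range(d)]
    return torch.stack(rows, dim=-2)                     # (..., d, d)

class CyclicInteractionEmbedding(nn.Module):
    def __init__(self, dim_v: int, dim_s: int, d_o: int):
        super().__init__()
        self.W_v = nn.Linear(dim_v, d_o, bias=False)     # maps h_v into the low-dimensional space
        self.W_s = nn.Linear(dim_s, d_o, bias=False)     # maps h_s into the low-dimensional space

    def forward(self, h_v: torch.Tensor, h_s: torch.Tensor):
        I_v = self.W_v(h_v)                              # visual projection vector,   (B, d_o)
        I_s = self.W_s(h_s)                              # auditory projection vector, (B, d_o)
        L_v = circ(I_v)                                  # visual cyclic matrix,   (B, d_o, d_o)
        L_s = circ(I_s)                                  # auditory cyclic matrix, (B, d_o, d_o)
        # Matrix multiplication with the other modality's cyclic matrix embeds
        # inter-modal element interaction information: F_v = I_v L_s, F_s = I_s L_v.
        F_v = torch.bmm(I_v.unsqueeze(1), L_s).squeeze(1)   # (B, d_o)
        F_s = torch.bmm(I_s.unsqueeze(1), L_v).squeeze(1)   # (B, d_o)
        return F_v, F_s

module = CyclicInteractionEmbedding(dim_v=512, dim_s=512, d_o=64)
F_v, F_s = module(torch.randn(4, 512), torch.randn(4, 512))  # each (4, 64)
```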
203: in order to better fuse the modal characteristics containing different semantic characteristics, the local attention and the global attention of the multi-modal information are mined through a multi-modal attention fusion network, and finally fusion characteristic representation with enhanced local attention and global attention characteristics is obtained;
The feature representations $F_v$ and $F_s$ embedded with inter-modal element interaction information obtained in step 202 are combined to obtain an overall feature representation of the different modalities:
$F_h = \mathrm{Concat}(F_v, F_s)$
where $F_h$ denotes the overall feature representation obtained by combining the different modality representations, $d_h = d_o + d_o$ denotes its dimension, and $\mathrm{Concat}(\cdot)$ denotes the concatenation (cascading) operation.
The calculation formula for capturing the local attention characteristic is as follows:
$Q_t = \mathrm{BN}(\mathrm{Conv}_2(\delta(\mathrm{BN}(\mathrm{Conv}_1(F_t))))), \quad t \in \{v, s\}$
where $\mathrm{Conv}_1$ denotes a 1×1 point-wise convolution that reduces the dimensionality of the input features $F_v$ and $F_s$ to 1/r of the original, r being the dimension scaling ratio; BN denotes a BatchNorm layer; $\delta$ denotes the ReLU activation function; $\mathrm{Conv}_2$ denotes a 1×1 point-wise convolution that restores the feature dimension to that of the original input; and $Q_t$, $t \in \{v, s\}$, denotes the captured modal feature with local attention characteristics.
The calculation formula for capturing the global attention characteristic is as follows:
$G = \mathrm{GAP}[\mathrm{BN}[\mathrm{Conv}_2(\delta(\mathrm{BN}(\mathrm{Conv}_1(\mathrm{GAP}(F_h)))))]]$
where GAP denotes the global average pooling operation. Compared with the formula for capturing the local attention characteristic, the formula for capturing the global attention characteristic mainly adds global pooling layers; $F_h$ is the overall modal feature representation obtained by the concatenation operation; and $G$ denotes the captured overall modal feature representation with global attention characteristics.
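A minimal sketch of the local and global attention branches is given below, assuming PyTorch and treating each feature vector as a 1×1 feature map with $d$ channels so that 1×1 point-wise convolutions apply; the scaling ratio r, channel sizes and class names are illustrative assumptions.

```python
# Sketch of the local and global attention branches of the multi-modal
# attention fusion network, assuming PyTorch. Feature vectors are treated as
# 1x1 feature maps with d channels so that 1x1 point-wise convolutions apply;
# the scaling ratio r, channel sizes and class names are assumptions.
import torch
import torch.nn as nn

class AttentionBranch(nn.Module):
    def __init__(self, channels: int, r: int = 4, use_global_pooling: bool = False):
        super().__init__()
        reduced = max(channels // r, 1)
        self.use_gap = use_global_pooling
        self.gap = nn.AdaptiveAvgPool2d(1)               # GAP(.)
        self.body = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1), # Conv_1: reduce channels to 1/r
            nn.BatchNorm2d(reduced),                     # BN
            nn.ReLU(inplace=True),                       # delta
            nn.Conv2d(reduced, channels, kernel_size=1), # Conv_2: restore original channels
            nn.BatchNorm2d(channels),                    # BN
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, channels, H, W); vector features use H = W = 1
        if self.use_gap:
            return self.gap(self.body(self.gap(f)))      # global branch G
        return self.body(f)                              # local branch Q_t

# Usage sketch: Q_v, Q_s from the per-modality features, G from their concatenation F_h.
local_attn = AttentionBranch(channels=256, r=4)
global_attn = AttentionBranch(channels=512, r=4, use_global_pooling=True)
F_v, F_s = torch.randn(4, 256, 1, 1), torch.randn(4, 256, 1, 1)
F_h = torch.cat([F_v, F_s], dim=1)                       # (4, 512, 1, 1)
Q_v, Q_s, G = local_attn(F_v), local_attn(F_s), global_attn(F_h)
```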
Finally, the module performs a fusion operation on the modal features carrying the local attention characteristic and the global attention characteristic obtained above:
where $\otimes$ denotes element-wise multiplication, $\sigma$ denotes the sigmoid activation function, and the result is the fused modal feature representation weighted by the local and global attention characteristics.
204: Use the learned fused-modality short video feature representation to obtain event category scores and complete the short video event detection task;
The fused modal feature representation obtained above is input into a fully connected layer, and the Softmax(·) function is used to obtain the event detection prediction result:
where the weight matrix of the fully connected layer is the parameter to be learned and C denotes the number of event categories.
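A small sketch of this prediction step, assuming PyTorch; the fused-feature dimension and the number of event categories C are placeholder values.

```python
# Sketch of the event classification head: fully connected layer + Softmax
# over C event categories, assuming PyTorch. The feature dimension and C are
# placeholder values.
import torch
import torch.nn as nn

C = 10                                               # number of event categories (assumed)
classifier = nn.Linear(512, C)                       # parameters to be learned
F_fused = torch.randn(4, 512)                        # flattened fused modal feature representation
scores = torch.softmax(classifier(F_fused), dim=-1)  # event category scores, (4, C)
prediction = scores.argmax(dim=-1)                   # detected event category per short video
```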
Two loss functions are used for training: a classification loss and a reconstruction loss. For the classification loss, classification is performed with a binary cross-entropy loss:
where $\sigma(\cdot)$ is the sigmoid function; $y_{ij}$ is 1 if the i-th short video sample belongs to the j-th class and 0 otherwise; and $\hat{y}_{ij}$ is the probability that the i-th short video sample is predicted to belong to the j-th class.
For the low-rank constraint, in order to ensure that the short video content representation obtained after model training captures the characteristics among the elements of the different modalities without retaining excessive redundant information, thereby enhancing the robustness of the model, the nuclear norm $\|\cdot\|_*$ is used to build a low-rank constraint:
Finally, the following objective function is minimized during the model training phase:
where $\lambda_1$ and $\lambda_2$ are parameters that balance the contributions of the different losses.
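Putting the loss terms together, a sketch of the training objective might look as follows, assuming PyTorch; the exact forms of the reconstruction term and of the low-rank (nuclear-norm) term are not given in the text, so the versions below, as well as the λ values and tensor shapes, are assumptions.

```python
# Sketch of the combined training objective, assuming PyTorch. The exact
# reconstruction term and low-rank (nuclear-norm) term are not spelled out in
# the text, so their forms, the lambda values and the tensor shapes below are
# assumptions.
import torch
import torch.nn.functional as F

def objective(logits, targets, content_repr, recon, lambda1=0.1, lambda2=0.01):
    # Classification: binary cross entropy (sigma(.) folded into the
    # numerically stable with_logits form).
    cls_loss = F.binary_cross_entropy_with_logits(logits, targets)
    # Assumed reconstruction term: mean squared error against the content representation.
    recon_loss = F.mse_loss(recon, content_repr)
    # Low-rank constraint via the nuclear norm ||.||_* of the per-sample representation matrix.
    lowrank_loss = torch.linalg.matrix_norm(content_repr, ord='nuc').mean()
    return cls_loss + lambda1 * recon_loss + lambda2 * lowrank_loss

loss = objective(
    logits=torch.randn(4, 10, requires_grad=True),     # raw class scores
    targets=torch.randint(0, 2, (4, 10)).float(),      # multi-label ground truth
    content_repr=torch.randn(4, 32, 16),               # per-sample matrix representation (assumed shape)
    recon=torch.randn(4, 32, 16),                      # its reconstruction (assumed)
)
loss.backward()
```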
The model is trained with suitable parameter settings, and the final results are evaluated using Precision, Recall and mean Average Precision (mAP) as evaluation metrics.
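For reference, these metrics can be computed with scikit-learn as in the sketch below; the label layout and the macro averaging mode are assumptions of the example.

```python
# Sketch of the evaluation, assuming scikit-learn: Precision, Recall and mean
# Average Precision (mAP) from predicted scores and one-hot labels; the label
# layout and macro averaging are assumptions of the example.
import numpy as np
from sklearn.metrics import precision_score, recall_score, average_precision_score

y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])       # one-hot event labels
y_score = np.array([[0.8, 0.1, 0.1], [0.2, 0.6, 0.2],
                    [0.1, 0.3, 0.6], [0.7, 0.2, 0.1]])                # predicted class scores
y_pred = (y_score == y_score.max(axis=1, keepdims=True)).astype(int)  # top-scoring class per sample

precision = precision_score(y_true, y_pred, average='macro')
recall = recall_score(y_true, y_pred, average='macro')
mAP = average_precision_score(y_true, y_score, average='macro')       # mean of per-class AP
print(precision, recall, mAP)
```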
Example 2
The invention also provides a short video event detection device based on multi-modal representation learning, which comprises a processor and a memory, wherein program instructions are stored in the memory and the processor calls the program instructions stored in the memory to carry out the above method.
The foregoing is merely a preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily conceivable by those skilled in the art within the technical scope of the present application should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (9)
1. A short video event detection method based on multi-modal representation learning, comprising the steps of:
constructing a potential sequence property acquisition module by means of a bidirectional long short-term memory network, and acquiring, through the potential sequence property acquisition module, the potential characteristics of the multi-modal information over the preceding and following sequence of the short video, so as to obtain a visual representation containing potential correlation information between preceding and following frames and an auditory representation containing potential correlation information between preceding and following audio segments;
constructing a cyclic interaction information embedding module based on the visual representation and the auditory representation, constructing a cyclic matrix for each modality through the cyclic interaction information embedding module, and acquiring, based on the cyclic matrices, feature representations embedded with inter-modal potential element interaction information;
mining the local attention and the global attention of the multi-modal information through a multi-modal attention fusion network to obtain a fused modal feature representation weighted by local and global attention characteristics;
and inputting the fused modal feature representation into a classifier to obtain event category scores, and obtaining the short video event detection result based on the scores.
2. The short video event detection method based on multi-modal representation learning according to claim 1, wherein
the process of constructing the potential sequence property acquisition module comprises: inputting the visual modality features and the auditory modality features of the short video, and encoding them through a bidirectional long short-term memory network, the encoding formulas being:
$H_v = \text{Bi-LSTM}(x_v; \theta_v)$
$H_s = \text{Bi-LSTM}(x_s; \theta_s)$
where $x_v \in \mathbb{R}^{l_v \times D_v}$ is the visual feature and $x_s \in \mathbb{R}^{l_s \times D_s}$ is the auditory feature; $D_v$ and $D_s$ denote the dimensions of the visual and auditory features, respectively; Bi-LSTM is a bidirectional long short-term memory network; $H_v \in \mathbb{R}^{l_v \times d_v}$ is the visual encoding feature obtained after Bi-LSTM training on the visual feature $x_v$, with $l_v$ the length of the visual encoding feature sequence and $d_v$ the dimension of the visual encoding feature; $H_s \in \mathbb{R}^{l_s \times d_s}$ is the auditory encoding feature obtained after Bi-LSTM training on the auditory feature $x_s$, with $l_s$ the length of the auditory encoding feature sequence and $d_s$ the dimension of the auditory encoding feature; and $\theta_v$ and $\theta_s$ are the network parameters to be learned.
3. The short video event detection method based on multi-modal representation learning according to claim 1, wherein
the method for constructing a cyclic matrix for each modality comprises: mapping the visual features and the auditory features into a low-dimensional space, and constructing the visual cyclic matrix and the auditory cyclic matrix from the visual projection vectors and the auditory projection vectors mapped into the low-dimensional space.
4. The short video event detection method based on multi-modal representation learning according to claim 1, wherein
the method for obtaining, based on the cyclic matrices, the feature representations embedded with inter-modal potential element interaction information comprises: applying matrix multiplication between the projection vectors and the cyclic matrices to obtain the feature representations embedded with inter-modal potential element interaction information.
5. The short video event detection method based on multi-modal representation learning according to claim 1, wherein
the calculation formula for obtaining the local attention characteristic is:
$Q_t = \mathrm{BN}(\mathrm{Conv}_2(\delta(\mathrm{BN}(\mathrm{Conv}_1(F_t))))), \quad t \in \{v, s\};$
where $\mathrm{Conv}_1$ denotes a 1×1 point-wise convolution used to reduce the dimensionality of the input features $F_v$ and $F_s$ to 1/r of the original, r being the dimension scaling ratio; BN denotes a BatchNorm layer; $\delta$ denotes the ReLU activation function; $\mathrm{Conv}_2$ denotes a 1×1 point-wise convolution that restores the feature dimension to that of the original input; $Q_t$, $t \in \{v, s\}$, denotes the captured modal feature with local attention characteristics; and $d_o$ is the dimension after mapping into the low-dimensional space.
6. The short video event detection method based on multi-modal representation learning according to claim 1, wherein
the calculation formula for obtaining the global attention characteristic is:
$G = \mathrm{GAP}[\mathrm{BN}[\mathrm{Conv}_2(\delta(\mathrm{BN}(\mathrm{Conv}_1(\mathrm{GAP}(F_h)))))]];$
where GAP denotes a global pooling operation; $F_h$ is the overall modal feature representation obtained by concatenating the feature representations embedded with inter-modal potential element interaction information; and $G$ denotes the captured overall modal feature representation with global attention characteristics.
7. The short video event detection method based on multi-modal representation learning according to claim 1, wherein
the calculation formula for obtaining the fused modal feature representation is:
where $F_v$ and $F_s$ are the visual and auditory features embedded with inter-modal potential element interaction information; $G$ denotes the captured overall modal feature representation with global attention characteristics; $Q_t$, $t \in \{v, s\}$, denotes the captured modal features with local attention characteristics; $\otimes$ denotes element-wise multiplication; and $\sigma$ denotes the sigmoid activation function.
8. The short video event detection method based on multi-modal representation learning according to claim 1, wherein
the calculation formula of the event category score is:
where the event category score is output by the fully connected layer, whose weight matrix is the parameter to be learned, and C denotes the number of event categories.
9. A short video event detection device based on multi-modal representation learning, the device comprising: a processor and a memory, the memory having program instructions stored therein, wherein the processor invokes the program instructions stored in the memory to cause the device to perform the method steps of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310505779.1A CN116524407A (en) | 2023-05-05 | 2023-05-05 | Short video event detection method and device based on multi-modal representation learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310505779.1A CN116524407A (en) | 2023-05-05 | 2023-05-05 | Short video event detection method and device based on multi-modal representation learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116524407A true CN116524407A (en) | 2023-08-01 |
Family
ID=87406151
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310505779.1A Pending CN116524407A (en) | 2023-05-05 | 2023-05-05 | Short video event detection method and device based on multi-modal representation learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116524407A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117746518A (en) * | 2024-02-19 | 2024-03-22 | 河海大学 | Multi-mode feature fusion and classification method |
CN117746518B (en) * | 2024-02-19 | 2024-05-31 | 河海大学 | Multi-mode feature fusion and classification method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |