CN116524407A - Short video event detection method and device based on multi-modal representation learning - Google Patents

Short video event detection method and device based on multi-modal representation learning

Info

Publication number
CN116524407A
CN116524407A CN202310505779.1A
Authority
CN
China
Prior art keywords
short video
representation
auditory
visual
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310505779.1A
Other languages
Chinese (zh)
Inventor
井佩光 (Jing Peiguang)
宋晓艺 (Song Xiaoyi)
苏育挺 (Su Yuting)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202310505779.1A priority Critical patent/CN116524407A/en
Publication of CN116524407A publication Critical patent/CN116524407A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/44Event detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a short video event detection method and device based on multi-modal representation learning. The method comprises the following steps: constructing a potential sequence characteristic acquisition module to explore the latent characteristics of short video visual and auditory modality information across the preceding and following sequence of the short video; constructing a cyclic interaction information embedding module that builds a circulant matrix for each modality, fully exploring the relations among the multi-modal data and the way they are coupled, and mining the latent correlations among multi-modal feature elements; obtaining a fused feature representation enhanced by local and global attention characteristics through a multi-modal attention fusion network; and realizing event detection using the short video multi-modal fusion features obtained through training. The invention uses the visual and auditory modality information of short videos to construct a short video feature representation learning network that fully mines the latent correlation characteristics of the multi-modal information and their attention enhancement, thereby realizing short video event detection and providing a new idea for solving the short video event detection problem.

Description

Short video event detection method and device based on multi-modal representation learning
Technical Field
The invention belongs to the technical field of multimedia and big data analysis, and particularly relates to a short video event detection method and device based on multi-modal representation learning.
Background
With the rapid development of the short video industry, short video content analysis, typified by short video event detection, has received increasing attention. Short video event detection helps address the problem of short video supervision, allowing the industry to develop continuously and healthily. However, as the number of short videos grows and the information they carry becomes increasingly complex and varied, how to use the existing short video information to quickly and efficiently retrieve the short videos users need is a problem that urgently needs to be solved.
Currently, artificial intelligence techniques, typified by deep learning, have developed rapidly in many fields and are widely applied in video information processing. Solving the short video event detection problem with artificial intelligence technology can both promote the development of the computer vision field and improve the user experience, and therefore has research value as well as practical application value.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a short video event detection method and device based on multi-modal representation learning, so as to solve the problem of how to quickly and efficiently retrieve the short videos required by users using the existing short video information.
In order to achieve the above object, the present invention provides a short video event detection method based on multi-modal representation learning, comprising the steps of:
constructing a potential sequence characteristic acquisition module through a bidirectional long short-term memory network, and acquiring, through the potential sequence characteristic acquisition module, the latent characteristics of the multi-modal information across the preceding and following sequence of the short video, so as to obtain a visual representation containing the latent correlation information of preceding and following frames and an auditory representation containing the latent correlation information of preceding and following audio;
constructing a cyclic interaction information embedding module based on the visual representation and the auditory representation, building a circulant matrix for each modality through the cyclic interaction information embedding module, and acquiring, based on the circulant matrices, feature representations embedded with the latent element interaction information between modalities;
mining the local attention and global attention of the multi-modal information through a multi-modal attention fusion network to obtain a fused modal feature representation weighted by local and global attention characteristics;
and inputting the fused modal feature representation into a classifier to obtain event category scores, and obtaining the short video event detection result based on the scores.
Preferably, the process of constructing the potential sequence characteristic acquisition module includes: inputting the visual modality features and auditory modality features of the short video, and encoding them through a bidirectional long short-term memory network, where the encoding formulas are:
H_v = Bi-LSTM(x_v; θ_v)
H_s = Bi-LSTM(x_s; θ_s)
where x_v is the visual feature; x_s is the auditory feature; D_v and D_s denote the dimensions of the visual and auditory features, respectively; Bi-LSTM is a bidirectional long short-term memory network; H_v is the visual coding feature obtained after the visual feature x_v is trained and learned by the Bi-LSTM, l_v is the length of the visual coding feature sequence, and d_v is the dimension of the visual coding feature; H_s is the auditory coding feature obtained after the auditory feature x_s is trained and learned by the Bi-LSTM, l_s is the length of the auditory coding feature sequence, and d_s is the dimension of the auditory coding feature; θ_v and θ_s are the network parameters to be learned.
Preferably, the method for constructing circulant matrices for the different modalities respectively comprises: mapping the visual features and auditory features into a low-dimensional space; and constructing the visual circulant matrix and the auditory circulant matrix from the visual projection vector and the auditory projection vector mapped into the low-dimensional space.
Preferably, the method for acquiring the feature representation embedded with the latent element interaction information between modalities based on the circulant matrices comprises: applying matrix multiplication between the projection vectors and the circulant matrices to obtain the feature representations embedded with the latent inter-modal element interaction information.
Preferably, the calculation formula for obtaining the local attention characteristic is:
Q_t = BN(Conv_2(δ(BN(Conv_1(F_t))))), t ∈ {v, s};
where Conv_1 denotes a 1×1 point-wise convolution that reduces the channel dimension of the input features F_v and F_s to 1/r of the original, with r the dimension scaling ratio; BN denotes a BatchNorm layer; δ denotes the ReLU activation function; Conv_2 denotes a 1×1 point-wise convolution that restores the feature dimension to that of the original input; Q_t (t ∈ {v, s}) denotes the captured modal feature with local attention characteristics; and d_o is the dimension after mapping into the low-dimensional space.
Preferably, the calculation formula for obtaining the global attention characteristic is:
G = GAP[BN[Conv_2(δ(BN(Conv_1(GAP(F_h)))))]];
where GAP denotes a global pooling operation; F_h is the overall modal feature representation obtained by concatenating the feature representations embedded with the latent inter-modal element interaction information; and G denotes the captured overall modal feature representation with global attention characteristics.
Preferably, the calculation formula for obtaining the fused modal feature representation is:
where F_v and F_s are the visual and auditory features embedded with the latent inter-modal element interaction information; G denotes the captured overall modal feature representation with global attention characteristics; Q_t (t ∈ {v, s}) denotes the captured modal feature with local attention characteristics; ⊙ denotes element-wise multiplication; and σ denotes the sigmoid activation function.
Preferably, the calculation formula of the event category score is:
where ŷ denotes the event category score, W denotes the parameters to be learned of the fully connected layer, and C denotes the number of event categories.
The invention also provides a short video event detection device based on multi-modal representation learning, the device comprising a processor and a memory, wherein the memory stores program instructions, and the processor invokes the program instructions stored in the memory to cause the device to perform the method steps of any one of claims 1-8.
Compared with the prior art, the invention has the following advantages and technical effects:
1. The invention makes full use of the multi-modal information of short videos and mines the latent correlations of the preceding and following sequences;
2. By constructing a circulant matrix for each modality, the invention explores the relations and coupling patterns among the multi-modal data, mines the latent correlations among multi-modal feature elements, and obtains feature representations embedded with the latent inter-modal element interaction information;
3. The invention uses a multi-modal attention fusion network to mine the local and global attention of the multi-modal information, guides the fusion of the multi-modal information with this network, obtains an attention-enhanced fused feature representation, and computes the event category scores with a classifier. The method provides a new idea for solving the short video event detection problem.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, illustrate and explain the application and are not to be construed as limiting the application. In the drawings:
fig. 1 is a flowchart of a short video event detection method according to an embodiment of the present invention.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
Example 1
As shown in fig. 1, the invention provides a short video event detection method based on multi-modal representation learning, which comprises the following steps:
101: constructing a potential sequence characteristic acquisition module, and acquiring the latent characteristics of the short video visual and auditory modality information across the preceding and following sequence of the short video using a bidirectional long short-term memory network;
102: guiding the learning of a cyclic interaction information embedding module with the modal representations containing the latent correlation information of preceding and following frames obtained in step 101, constructing a circulant matrix for each modality, fully exploring the relations and coupling patterns among the multi-modal data, mining the latent correlations among multi-modal feature elements, and obtaining feature representations embedded with the latent inter-modal element interaction information;
103: mining the local and global attention of the multi-modal information through a multi-modal attention fusion network to obtain a fused feature representation enhanced by local and global attention characteristics;
104: inputting the short video multi-modal fusion features obtained through training, which carry the latent correlation characteristics of the multi-modal information and their attention enhancement, into a classifier to obtain event category scores and complete the short video event detection task.
The scheme is described in further detail below in conjunction with the calculation formulas and specific examples:
201: inputting the visual and auditory modality features of the short video, and mining the latent information correlations of the preceding and following sequences of the short video to obtain a visual representation containing the latent correlation information of preceding and following frames and an auditory representation containing the latent correlation information of preceding and following audio;
The short video visual feature is x_v ∈ ℝ^(l_v×D_v) and the auditory feature is x_s ∈ ℝ^(l_s×D_s), where l_v and l_s respectively denote the length of the frame sequence after down-sampling the short video and the length of the picture sequence after converting the short video audio file into spectrograms, and D_v and D_s respectively denote the dimensions of the visual and auditory features.
In order to acquire latent feature representations containing the correlations of the preceding and following sequence information, a bidirectional long short-term memory network is used to construct the potential sequence characteristic acquisition module, which encodes the latent information of the two modality features:
H_v = Bi-LSTM(x_v; θ_v)
H_s = Bi-LSTM(x_s; θ_s)
where Bi-LSTM is a bidirectional long short-term memory network; H_v is the visual coding feature obtained after the visual feature x_v is trained and learned by the Bi-LSTM, l_v is its sequence length, and d_v is its dimension; H_s is the auditory coding feature obtained after the auditory feature x_s is trained and learned by the Bi-LSTM, l_s is its sequence length, and d_s is its dimension; θ_v and θ_s are the network parameters to be learned. Through this module, a visual representation H_v containing the latent correlation information of preceding and following frames and an auditory representation H_s containing the latent correlation information of preceding and following audio are finally obtained.
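As a concrete illustration of this module, the following is a minimal sketch assuming PyTorch (the patent does not name a framework); the hidden size, batch size, and feature dimensions are illustrative placeholders rather than values from the patent.

import torch
import torch.nn as nn

class PotentialSequenceEncoder(nn.Module):
    """Bi-LSTM encoder for one modality's frame or spectrogram sequence."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # Bidirectional LSTM: forward and backward hidden states are
        # concatenated, so each output step has 2 * hidden_dim channels.
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, input_dim), e.g. x_v or x_s
        h, _ = self.bilstm(x)   # (batch, sequence_length, 2 * hidden_dim)
        return h                # H_v or H_s with preceding/following context

# Illustrative usage: 32 clips, 20 down-sampled frames, 2048-d visual features.
x_v = torch.randn(32, 20, 2048)
H_v = PotentialSequenceEncoder(2048, 256)(x_v)   # shape (32, 20, 512)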
202: guiding the learning of the cyclic interaction information embedding module with the modal representations containing the latent correlation information of preceding and following frames obtained in step 201, constructing a circulant matrix for each modality, fully exploring the relations and coupling patterns among the multi-modal data, mining the latent correlations among multi-modal feature elements, and obtaining feature representations embedded with the latent inter-modal element interaction information;
First, the visual and auditory features are mapped into a low-dimensional space:
I_v = h_v W_v^T
I_s = h_s W_s^T
where h_v is the column-vector feature obtained by transposing H_v; I_v is the visual projection vector mapped into the low-dimensional space, and d_o is the dimension after mapping; W_v is the mapping matrix used to project h_v into the low-dimensional space; h_s is the column-vector feature obtained by transposing H_s; I_s is the auditory projection vector mapped into the low-dimensional space, with the same dimension d_o after mapping; W_s is the mapping matrix used to project h_s into the low-dimensional space. Subsequently, I_v and I_s are used to construct the visual circulant matrix L_v and the auditory circulant matrix L_s, respectively:
L_v = circ(I_v)
L_s = circ(I_s)
where circ(·) denotes cyclically right-shifting the positions of the elements in the projection vector in turn and combining the shifted copies into a matrix. Finally, so that the projection vectors and the elements of the circulant matrices fully interact, matrix multiplication is applied between the circulant matrix and the projection vector, thereby obtaining the feature representations F_v and F_s embedded with the inter-modal element interaction information. The specific formulas are:
F_v = I_v L_s
F_s = I_s L_v
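A minimal sketch of this cyclic interaction embedding for a single sample is given below, assuming PyTorch; the projection dimension d_o, the pooling of H_v and H_s into single vectors h_v and h_s, and the row-wise construction of circ(·) are illustrative assumptions rather than details fixed by the patent.

import torch
import torch.nn as nn

def circulant(v: torch.Tensor) -> torch.Tensor:
    """circ(v): stack cyclic right-shifts of v into a (d, d) matrix."""
    d = v.numel()
    return torch.stack([torch.roll(v, shifts=k) for k in range(d)], dim=0)

d_v, d_s, d_o = 512, 512, 64           # illustrative dimensions
W_v = nn.Linear(d_v, d_o, bias=False)  # mapping matrix for the visual modality
W_s = nn.Linear(d_s, d_o, bias=False)  # mapping matrix for the auditory modality

h_v = torch.randn(d_v)                 # column-vector visual coding feature
h_s = torch.randn(d_s)                 # column-vector auditory coding feature

I_v = W_v(h_v)                         # visual projection vector, shape (d_o,)
I_s = W_s(h_s)                         # auditory projection vector, shape (d_o,)

L_v = circulant(I_v)                   # visual circulant matrix, (d_o, d_o)
L_s = circulant(I_s)                   # auditory circulant matrix, (d_o, d_o)

# Cross-modal interaction: each projection vector is multiplied by the other
# modality's circulant matrix, so every pair of elements interacts.
F_v = I_v @ L_s                        # visual feature with embedded interactions
F_s = I_s @ L_v                        # auditory feature with embedded interactions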
203: in order to better fuse the modal features containing different semantic characteristics, the local and global attention of the multi-modal information are mined through the multi-modal attention fusion network, finally obtaining a fused feature representation enhanced by the local and global attention characteristics;
The feature representations F_v and F_s with inter-modal element interaction information obtained in step 202 are jointly used to obtain the overall feature representation of the different modalities:
F_h = Concat(F_v, F_s)
where F_h denotes the overall feature representation after the different modality representations are combined, d_h = d_o + d_o denotes its dimension, and Concat(·) denotes a concatenation operation.
The calculation formula for capturing the local attention characteristic is as follows:
Q_t = BN(Conv_2(δ(BN(Conv_1(F_t))))), t ∈ {v, s}
where Conv_1 denotes a 1×1 point-wise convolution that reduces the channel dimension of the input features F_v and F_s to 1/r of the original, with r the dimension scaling ratio; BN denotes a BatchNorm layer; δ denotes the ReLU activation function; Conv_2 denotes a 1×1 point-wise convolution that restores the feature dimension to that of the original input; and Q_t denotes the captured modal feature with local attention characteristics.
The calculation formula for capturing the global attention characteristic is as follows:
G = GAP[BN[Conv_2(δ(BN(Conv_1(GAP(F_h)))))]]
where GAP denotes a global pooling (Global Average Pooling) operation. Compared with the local attention capture formula, the global attention capture formula mainly adds global pooling layers; F_h is the overall modal feature representation obtained by the concatenation operation; and G denotes the captured overall modal feature representation with global attention characteristics.
Finally, the module performs a fusion operation on the modal features with local attention characteristics and the overall feature with global attention characteristics obtained above:
where ⊙ denotes element-wise multiplication; σ denotes the sigmoid activation function; and the result is the fused modal feature representation weighted by the local and global attention characteristics.
204: acquiring event category scores using the learned fused-modality short video feature representation, and completing the short video event detection task;
The fused modal feature representation finally obtained by the above process is input to a fully connected layer, and the Softmax(·) function is used to obtain the event detection prediction result:
where ŷ denotes the event category score, W is the parameter to be learned of the fully connected layer, and C denotes the number of event categories.
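A minimal sketch of this classification step, assuming PyTorch; the fused feature dimension and the number of event categories C below are illustrative placeholders.

import torch
import torch.nn as nn

d_fused, C = 128, 10                   # illustrative fused dimension and class count
classifier = nn.Linear(d_fused, C)     # fully connected layer with parameters W

F_hat = torch.randn(32, d_fused)       # batch of fused modal feature representations
scores = torch.softmax(classifier(F_hat), dim=-1)    # event category scores, (32, C)
predicted_event = scores.argmax(dim=-1)              # predicted event class per clip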
Two loss functions are used for training: a classification loss and a low-rank constraint loss. For the classification loss, binary cross-entropy loss is used for classification:
where σ(·) is the sigmoid function; y_ij equals 1 if the i-th short video sample belongs to the j-th class and 0 otherwise; and ŷ_ij is the probability value that the i-th short video sample is predicted as the j-th class.
For the low-rank constraint, in order to ensure that the short video content representation obtained after model training captures the characteristics among the elements of the different modalities without carrying excessive redundant information, and to enhance the robustness of the model, the nuclear norm ||·||_* is used to build the low-rank constraint:
Finally, the following objective function is minimized during the model training phase:
where λ_1 and λ_2 are parameters used to balance the contributions of the different loss terms.
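A minimal sketch of the training objective, assuming PyTorch. The binary cross-entropy term follows the description above; applying the nuclear norm to the batch matrix of fused representations is only one plausible reading of what the low-rank constraint acts on, and the λ_1 / λ_2 values are illustrative.

import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()           # sigmoid + binary cross-entropy in one call

def training_objective(logits: torch.Tensor, targets: torch.Tensor,
                       F_hat: torch.Tensor,
                       lambda_1: float = 1.0, lambda_2: float = 0.01) -> torch.Tensor:
    # logits: (batch, C) raw class scores; targets: (batch, C) {0, 1} labels.
    cls_loss = bce(logits, targets.float())
    # Low-rank constraint: nuclear norm (sum of singular values) of the fused
    # representation matrix, discouraging redundant information.
    low_rank = torch.linalg.matrix_norm(F_hat, ord='nuc')
    return lambda_1 * cls_loss + lambda_2 * low_rank

# Illustrative usage:
loss = training_objective(torch.randn(32, 10), torch.randint(0, 2, (32, 10)),
                          torch.randn(32, 128))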
The model is trained with appropriate parameter settings, and the final results are evaluated using precision (Precision), recall (Recall), and mean Average Precision (mAP) as evaluation metrics.
Example 2
The invention also provides a short video event detection device based on multi-modal representation learning, which comprises a processor and a memory, wherein the memory stores program instructions, and the processor invokes the program instructions stored in the memory to implement the above method.
The foregoing is merely a preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily conceivable by those skilled in the art within the technical scope of the present application should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A short video event detection method based on multi-modal representation learning, comprising the steps of:
constructing a potential sequence characteristic acquisition module through a bidirectional long short-term memory network, and acquiring, through the potential sequence characteristic acquisition module, the latent characteristics of the multi-modal information across the preceding and following sequence of the short video, so as to obtain a visual representation containing the latent correlation information of preceding and following frames and an auditory representation containing the latent correlation information of preceding and following audio;
constructing a cyclic interaction information embedding module based on the visual representation and the auditory representation, building a circulant matrix for each modality through the cyclic interaction information embedding module, and acquiring, based on the circulant matrices, feature representations embedded with the latent element interaction information between modalities;
mining the local attention and global attention of the multi-modal information through a multi-modal attention fusion network to obtain a fused modal feature representation weighted by local and global attention characteristics;
and inputting the fused modal feature representation into a classifier to obtain event category scores, and obtaining the short video event detection result based on the scores.
2. The short video event detection method based on multi-modal representation learning as set forth in claim 1, wherein
the process of constructing the potential sequence characteristic acquisition module includes: inputting the visual modality features and auditory modality features of the short video, and encoding them through a bidirectional long short-term memory network, where the encoding formulas are:
H_v = Bi-LSTM(x_v; θ_v)
H_s = Bi-LSTM(x_s; θ_s)
where x_v is the visual feature; x_s is the auditory feature; D_v and D_s denote the dimensions of the visual and auditory features, respectively; Bi-LSTM is a bidirectional long short-term memory network; H_v is the visual coding feature obtained after the visual feature x_v is trained and learned by the Bi-LSTM, l_v is the length of the visual coding feature sequence, and d_v is the dimension of the visual coding feature; H_s is the auditory coding feature obtained after the auditory feature x_s is trained and learned by the Bi-LSTM, l_s is the length of the auditory coding feature sequence, and d_s is the dimension of the auditory coding feature; θ_v and θ_s are the network parameters to be learned.
3. The short video event detection method based on multi-modal representation learning as set forth in claim 1, wherein
the method for constructing circulant matrices for the different modalities respectively comprises: mapping the visual features and auditory features into a low-dimensional space; and constructing the visual circulant matrix and the auditory circulant matrix from the visual projection vector and the auditory projection vector mapped into the low-dimensional space.
4. The short video event detection method based on multi-modal representation learning as set forth in claim 1, wherein
the method for acquiring the feature representation embedded with the latent element interaction information between modalities based on the circulant matrices comprises: applying matrix multiplication between the projection vectors and the circulant matrices to obtain the feature representations embedded with the latent inter-modal element interaction information.
5. The short video event detection method based on multi-modal representation learning as set forth in claim 1, wherein
the calculation formula for obtaining the local attention characteristic is:
Q_t = BN(Conv_2(δ(BN(Conv_1(F_t))))), t ∈ {v, s};
where Conv_1 denotes a 1×1 point-wise convolution that reduces the channel dimension of the input features F_v and F_s to 1/r of the original, with r the dimension scaling ratio; BN denotes a BatchNorm layer; δ denotes the ReLU activation function; Conv_2 denotes a 1×1 point-wise convolution that restores the feature dimension to that of the original input; Q_t (t ∈ {v, s}) denotes the captured modal feature with local attention characteristics; and d_o is the dimension after mapping into the low-dimensional space.
6. The short video event detection method based on multi-modal representation learning as set forth in claim 1, wherein
the calculation formula for obtaining the global attention characteristic is:
G = GAP[BN[Conv_2(δ(BN(Conv_1(GAP(F_h)))))]];
where GAP denotes a global pooling operation; F_h is the overall modal feature representation obtained by concatenating the feature representations embedded with the latent inter-modal element interaction information; and G denotes the captured overall modal feature representation with global attention characteristics.
7. The short video event detection method based on multi-modal representation learning as set forth in claim 1, wherein
the calculation formula for obtaining the fused modal feature representation is as follows:
where F_v and F_s are the visual and auditory features embedded with the latent inter-modal element interaction information; G denotes the captured overall modal feature representation with global attention characteristics; Q_t (t ∈ {v, s}) denotes the captured modal feature with local attention characteristics; ⊙ denotes element-wise multiplication; and σ denotes the sigmoid activation function.
8. The short video event detection method based on multi-modal representation learning as set forth in claim 1, wherein
the calculation formula of the event category score is as follows:
where ŷ denotes the event category score, W denotes the parameters to be learned of the fully connected layer, and C denotes the number of event categories.
9. A short video event detection device based on multi-modal representation learning, the device comprising: a processor and a memory, wherein the memory stores program instructions, and the processor invokes the program instructions stored in the memory to cause the device to perform the method steps of any one of claims 1-8.
CN202310505779.1A 2023-05-05 2023-05-05 Short video event detection method and device based on multi-modal representation learning Pending CN116524407A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310505779.1A CN116524407A (en) 2023-05-05 2023-05-05 Short video event detection method and device based on multi-modal representation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310505779.1A CN116524407A (en) 2023-05-05 2023-05-05 Short video event detection method and device based on multi-modal representation learning

Publications (1)

Publication Number Publication Date
CN116524407A true CN116524407A (en) 2023-08-01

Family

ID=87406151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310505779.1A Pending CN116524407A (en) 2023-05-05 2023-05-05 Short video event detection method and device based on multi-modal representation learning

Country Status (1)

Country Link
CN (1) CN116524407A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117746518A (en) * 2024-02-19 2024-03-22 河海大学 Multi-mode feature fusion and classification method
CN117746518B (en) * 2024-02-19 2024-05-31 河海大学 Multi-mode feature fusion and classification method

Similar Documents

Publication Publication Date Title
Zhao et al. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion
Awais et al. Foundational models defining a new era in vision: A survey and outlook
CN115203380B (en) Text processing system and method based on multi-mode data fusion
CN112507990A (en) Video time-space feature learning and extracting method, device, equipment and storage medium
CN111651573B (en) Intelligent customer service dialogue reply generation method and device and electronic equipment
CN114926835A (en) Text generation method and device, and model training method and device
CN113505193A (en) Data processing method and related equipment
CN117216546A (en) Model training method, device, electronic equipment, storage medium and program product
CN116958323A (en) Image generation method, device, electronic equipment, storage medium and program product
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN118229844B (en) Image generation data processing method, image generation method and device
CN116524407A (en) Short video event detection method and device based on multi-modal representation learning
CN116543351A (en) Self-supervision group behavior identification method based on space-time serial-parallel relation coding
WO2022222854A1 (en) Data processing method and related device
CN117972138B (en) Training method and device for pre-training model and computer equipment
Yao et al. Transformers and CNNs fusion network for salient object detection
Huang et al. Dynamic sign language recognition based on CBAM with autoencoder time series neural network
CN116980541B (en) Video editing method, device, electronic equipment and storage medium
US11948090B2 (en) Method and apparatus for video coding
Liu et al. Transformer based Pluralistic Image Completion with Reduced Information Loss
CN114328943A (en) Question answering method, device, equipment and storage medium based on knowledge graph
Wang et al. Multi-scale dense and attention mechanism for image semantic segmentation based on improved DeepLabv3+
CN116740078A (en) Image segmentation processing method, device, equipment and medium
CN117689745A (en) Generating images from text based on hints
Niu et al. An Overview of Text-based Person Search: Recent Advances and Future Directions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination