CN117392582A - Multi-mode video classification method and system - Google Patents
- Publication number
- CN117392582A (publication number); application CN202311329631.3A
- Authority
- CN
- China
- Prior art keywords
- mode
- gating
- loss
- visual
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/045—Combinations of networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a multi-modal video classification method based on parallel speech- and visual-modality ResNet18 encoders and a globally linked gating mechanism, comprising the following steps. S1, two structurally identical ResNet18 encoders extract feature representations of the speech and visual modalities respectively, which serve as the input features for gated cross-modal feature fusion. S2, a globally linked gating mechanism is designed, consisting of a cross-modal gated fusion module and an objective function based on a gated auxiliary loss; this gating mechanism is the core component that balances the different modalities. S3, the first part is a cross-modal gated fusion module designed with reference to the GRU gating principle; it receives the speech- and visual-modality features, automatically adjusts the input proportions of the different modalities, performs interactive fusion, and outputs the fused cross-modal features. S4, the second part uses the single-modality losses as auxiliary losses and, after suitable processing, uses the gating parameter from the first part as the weight of these auxiliary losses, forming the model's adaptive gating adjustment mechanism.
Description
Technical Field
The invention relates to the technical field of multi-mode video classification, in particular to a multi-mode video classification method and a system thereof.
Background
In recent years, deep learning (DL) has been widely applied in fields such as image recognition, machine translation, emotion analysis and natural language processing (NLP), and has produced many research results. For a machine to perceive the surrounding world more comprehensively and efficiently, it must be given the ability to understand, infer and fuse multi-modal information: people live in an environment in which multiple sensory domains are intertwined, and the sounds heard, the objects seen and the smells perceived are each a modality. Researchers have therefore begun to focus on how to fuse data from multiple domains and achieve heterogeneous complementarity. For example, research on speech recognition shows that the visual modality provides lip-movement and articulation information of the mouth, including its opening and closing, which helps to improve speech recognition performance. It can be seen that combining the semantics of multiple modalities is of great significance to deep learning research.
Multimodal data generally provides more information than unimodal data, so learning from multimodal data should match or exceed learning from unimodal data. In some cases, however, a multimodal model that optimizes a single joint learning objective for all modalities may perform worse than a unimodal model. This phenomenon arises because the modalities tend to converge at different speeds, so that one modality reaches a fitted state while the others are still under-fitted; this is the modality imbalance problem.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a multi-modal video classification method and system which not only fuse the multi-modal features effectively but also solve the modality imbalance problem well.
In order to solve the technical problems, the technical scheme of the invention is as follows:
A multi-modal video classification method, based on parallel speech- and visual-modality ResNet18 encoders and a globally linked gating mechanism, comprises the following steps:
S1: Two structurally identical ResNet18 encoders extract the feature representations of the speech and visual modalities respectively; these serve as the input features for gated cross-modal feature fusion.
S2: A globally linked gating mechanism is designed, consisting of a cross-modal gated fusion module and an objective function based on a gated auxiliary loss; it is the core component that balances the different modalities.
S3: The first part is a cross-modal gated fusion module designed with reference to the GRU gating principle; it receives the speech- and visual-modality features, automatically adjusts the input proportions of the different modalities, performs interactive fusion, and outputs the fused cross-modal features.
S4: The second part uses the single-modality losses as auxiliary losses and, after suitable processing, uses the gating parameter from the first part as the weight of these auxiliary losses, forming the model's adaptive gating adjustment mechanism.
Preferably, the specific steps of step S1 include:
S101: preprocessing the original video data of the relevant dataset, extracting the speech information in the video and converting it into a spectrogram to serve as the input of the speech modality, and, taking the differences between datasets into consideration, uniformly sampling 3 frames from the video to serve as the input of the visual modality;
S102: the speech modality and the visual modality employ two structurally identical ResNet18 networks as encoders to extract features from the input data of the speech and visual modalities, respectively:
H_a = E_a(X_a; θ_a), H_v = E_v(X_v; θ_v)
where X_a is the input from the speech (audio) modality, E_a is the ResNet18-based speech encoder, θ_a are the encoder parameters, and H_a is the speech-modality feature extracted by the encoder; similarly, X_v is the input from the visual modality, E_v is the ResNet18-based visual encoder, θ_v are the encoder parameters, and H_v is the visual-modality feature extracted by the encoder;
Preferably, the specific steps of step S2 include:
Firstly, the cross-modal fusion module is used to interactively fuse the two input features; the fused cross-modal feature H_av, the speech-modality feature H_a and the visual-modality feature H_v are each passed through classification prediction, and three independent losses, loss_av, loss_a and loss_v, are computed with the cross-entropy formula. Finally, the single-modality losses are used as auxiliary losses, and the processed gating parameter is used as the weight of these auxiliary losses, forming the objective function based on the gated auxiliary loss.
Preferably, the specific steps of step S3 include:
λ = Sigmoid(U·H_a + V·H_v)
H_av = (1-λ)·H_a + λ·H_v
where U and V are trainable variables, λ is the gating parameter that controls how much visual information is retained, Sigmoid is the activation function, and H_av is the fused feature of the different modalities.
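For illustration only, a minimal PyTorch sketch of this gated fusion is given below; the class and variable names are assumptions of this description and are not taken from the patent.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Gated fusion of speech (H_a) and visual (H_v) features:
    lambda = Sigmoid(U*H_a + V*H_v), H_av = (1-lambda)*H_a + lambda*H_v."""
    def __init__(self, dim: int):
        super().__init__()
        self.U = nn.Linear(dim, dim, bias=False)  # trainable variable U
        self.V = nn.Linear(dim, dim, bias=False)  # trainable variable V

    def forward(self, h_a: torch.Tensor, h_v: torch.Tensor):
        lam = torch.sigmoid(self.U(h_a) + self.V(h_v))  # multidimensional gate
        h_av = (1.0 - lam) * h_a + lam * h_v            # fused cross-modal feature
        return h_av, lam                                # lam is reused by the auxiliary loss
```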
Preferably, the specific steps of step S4 include:
The gating parameter λ from the cross-modal fusion module in the first part is reduced to one dimension by averaging, and its reciprocal is then used as the weight of the corresponding modality loss, so that a negative correlation between the modality loss and the modality's retained information is realized. That is, when one single-modality loss is relatively small (that modality has already reached a fitted state), the model correspondingly reduces the information retained from that modality; at the same time, when the other single-modality loss is relatively large (that modality has not yet fitted), the model correspondingly increases the information retained from that modality;
λ̄ = mean(λ)
where λ̄ is the one-dimensional gating parameter after averaging, β is a hyperparameter, loss_a is the cross-entropy loss of the single speech modality, loss_v is the cross-entropy loss of the single visual modality, and loss_av is the cross-entropy loss after multi-modal fusion.
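The expression of the objective function itself is given in a figure that is not reproduced in this text. As a non-authoritative illustration only, one plausible form consistent with the surrounding description (reciprocal gate weights on the auxiliary losses and the hyperparameter β) would be:

```latex
% Hypothetical reconstruction; the exact expression in the patent figure may differ.
loss = loss_{av} + \beta \left( \frac{loss_{a}}{1 - \bar{\lambda}} + \frac{loss_{v}}{\bar{\lambda}} \right)
```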
The invention also provides a multi-modal video classification system based on parallel speech- and visual-modality ResNet18 encoders and a globally linked gating mechanism, which comprises a parallel multi-modal feature extraction module, a cross-modal gated fusion module, an objective function based on a gated auxiliary loss, and a classification prediction module;
the parallel multi-modal feature extraction module is used for extracting and encoding the initial features of the speech and visual modalities as the input features for modality fusion;
the cross-modal gated fusion module and the objective function based on the gated auxiliary loss jointly construct a global gating adjustment mechanism, which can judge the fitting state of each modality according to the loss of the single modality and adaptively adjust the information proportion of the different modalities in cross-modal fusion, solving the modality imbalance problem of the multi-modal model;
and the classification prediction module performs classification prediction on the single-modality features and the fused modality features, and takes the classification prediction of the fused modality features as the final prediction result after model training is finished.
The invention has the following characteristics and beneficial effects:
by adopting the technical scheme, the universal global gating adjustment structure is designed for multi-mode video classification tasks, so that the multi-mode video classification tasks are effectively fused, the characteristic information of modes is more, and meanwhile, the problem of unbalance during the mode fusion is more concerned.
In addition, in order to further improve the complementarity between the fusion modal characteristics, the method simply and skillfully solves the problem of multi-modal unbalance based on the gating auxiliary loss objective function, and effectively improves the performance of the multi-modal fusion characteristics.
The method is very helpful for the multi-modal video classification task and can remarkably improve its classification prediction accuracy.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a flow chart of the multi-modal video classification method based on parallel speech- and visual-modality ResNet18 encoders and the globally linked gating mechanism of the present invention;
FIG. 2 is a schematic diagram of the model of the multi-modal video classification method and system based on parallel speech- and visual-modality ResNet18 encoders and the globally linked gating mechanism;
FIG. 3 is a schematic diagram of the basic ResNet building blocks that make up ResNet18;
FIG. 4 is a schematic diagram of the designed gating-based modality fusion module;
FIG. 5 is a block diagram of the architecture of the multi-modal video classification system based on parallel speech- and visual-modality ResNet18 encoders and the globally linked gating mechanism.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art in a specific case.
The invention provides a multi-modal video classification method, as shown in FIGS. 1-4, based on parallel speech- and visual-modality ResNet18 encoders and a globally linked gating mechanism, abbreviated RGGL, comprising the following steps:
s1: two resnet18 encoders with consistent structures are used for respectively extracting characteristic representations of voice and image modes, and the characteristic representations are used as input characteristics for gating cross-mode characteristic fusion;
preliminary processing and feature extraction of the dataset data includes the steps of:
S101: preprocessing the original video data of the relevant dataset, extracting the speech information in the video and converting it into a corresponding spectrogram with librosa as the input of the speech modality, and, taking the differences between datasets into consideration, uniformly sampling 3 frames from the video as the input of the visual modality;
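A minimal sketch of this preprocessing is given below, assuming librosa for the spectrogram and OpenCV for frame sampling; the audio track is assumed to have already been extracted to a separate file, and the STFT parameters, function names and file paths are illustrative assumptions rather than values specified by the patent.

```python
import cv2
import librosa
import numpy as np

def video_to_modal_inputs(video_path: str, audio_path: str, n_frames: int = 3):
    # Speech modality: load the extracted audio track and convert it to a spectrogram.
    wav, sr = librosa.load(audio_path, sr=None)
    spec = librosa.stft(wav)                          # complex spectrogram
    spec_db = librosa.amplitude_to_db(np.abs(spec))   # log-magnitude spectrogram

    # Visual modality: uniformly sample n_frames frames from the video.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, total - 1, n_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return spec_db, np.stack(frames)
```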
S102: the speech and visual modalities employ two structurally identical ResNet18 networks as encoders (where, for the speech-modality encoder, the input channel of the ResNet18 is changed from 3 to 1 and the remainder is left unchanged) to extract features from the input data of the speech and visual modalities, respectively:
H_a = E_a(X_a; θ_a), H_v = E_v(X_v; θ_v)
where X_a is the input from the speech (audio) modality, E_a is the ResNet18-based speech encoder, θ_a are the encoder parameters, and H_a is the speech-modality feature extracted by the encoder; similarly, X_v is the input from the visual modality, E_v is the ResNet18-based visual encoder, θ_v are the encoder parameters, and H_v is the visual-modality feature extracted by the encoder;
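A minimal PyTorch sketch of the two encoders is given below, assuming a recent torchvision ResNet18 with the speech branch's first convolution changed to a single input channel; removing the classification head to expose the pooled feature is an assumption of this sketch (the patent uses separate fully connected classifiers later), and all names are illustrative.

```python
import torch.nn as nn
from torchvision.models import resnet18

def make_encoder(in_channels: int) -> nn.Module:
    """ResNet18 backbone used as a feature encoder (classification head removed)."""
    net = resnet18(weights=None)
    if in_channels != 3:
        # Speech branch: the spectrogram input has 1 channel instead of 3.
        net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                              padding=3, bias=False)
    net.fc = nn.Identity()   # keep the 512-d pooled feature H_a / H_v
    return net

audio_encoder = make_encoder(in_channels=1)   # E_a
visual_encoder = make_encoder(in_channels=3)  # E_v
```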
SGD (momentum 0.9, weight decay 1e-4) is used herein as the optimizer, with an initial learning rate of 1e-3, multiplied by 0.1 every 60 epochs.
S2: A globally linked gating mechanism is designed, consisting of a cross-modal gated fusion module and an objective function based on a gated auxiliary loss; it is the core component that balances the different modalities.
S3: The first part is a cross-modal gated fusion module designed with reference to the GRU gating principle; it receives the speech- and visual-modality features, automatically adjusts the input proportions of the different modalities, performs interactive fusion, and outputs the fused cross-modal features.
Firstly, the speech- and visual-modality features are received, passed through a fully connected layer and added, and then activated by the Sigmoid function to obtain the multidimensional gating parameter λ. The gating parameter λ then controls how much information of the speech and visual modalities is retained;
S4: The second part uses the single-modality losses as auxiliary losses and, after suitable processing, uses the gating parameter from the first part as the weight of these auxiliary losses, forming the model's adaptive gating adjustment mechanism. The gating parameter λ from the cross-modal fusion module in the first part is reduced to one dimension by averaging, and its reciprocal is then used as the weight of the corresponding modality loss, so that a negative correlation between the modality loss and the modality's retained information is realized. That is, when one single-modality loss is relatively small (that modality has already reached a fitted state), the model correspondingly reduces the information retained from that modality; at the same time, when the other single-modality loss is relatively large (that modality has not yet fitted), the model correspondingly increases the information retained from that modality;
λ̄ = mean(λ)
where λ̄ is the one-dimensional gating parameter after averaging, β is a hyperparameter, loss_a is the cross-entropy loss of the single speech modality, loss_v is the cross-entropy loss of the single visual modality, and loss_av is the cross-entropy loss after multi-modal fusion.
It can be understood that loss_a, the cross-entropy loss of the single speech modality, and loss_v, the cross-entropy loss of the single visual modality, serve as the auxiliary losses, while loss_av, the cross-entropy loss after multi-modal fusion, is used to complete the main task.
In this embodiment, in step S4, the hyperparameter β is set to 0.1 in the experiments herein. Here loss_a, loss_v and loss_av are the losses of the different modality features computed with cross entropy; the objective function based on the gated auxiliary loss is used after the model has been trained for 30 epochs, while the objective function for the first 30 epochs is loss = loss_av. With reference to FIG. 2, the whole model is divided into four large blocks: the multi-modal feature extraction module, the gated cross-modal feature fusion module of the first part, the classification prediction module, and the objective function of the second part based on the gated auxiliary loss.
A fully connected (FC) layer is used as the classifier to obtain classification results for the three different modality features, and the classification result of the cross-modal feature is taken as the final classification result.
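Putting the pieces above together, a hedged sketch of one training step is given below; the encoder, fusion and classifier objects correspond to the earlier sketches, the reciprocal-gate weighting of the auxiliary losses is only one plausible reading of the objective-function figure, and all variable names are illustrative.

```python
import torch.nn.functional as F

def training_step(audio_encoder, visual_encoder, fusion, heads, spec, frames,
                  labels, epoch, beta=0.1):
    """One training step; `heads` is a dict of FC classifiers keyed 'a', 'v', 'av'."""
    h_a = audio_encoder(spec)        # speech-modality feature H_a
    h_v = visual_encoder(frames)     # visual-modality feature H_v
    h_av, lam = fusion(h_a, h_v)     # fused feature H_av and gate lambda

    loss_a = F.cross_entropy(heads['a'](h_a), labels)
    loss_v = F.cross_entropy(heads['v'](h_v), labels)
    loss_av = F.cross_entropy(heads['av'](h_av), labels)

    if epoch < 30:                   # first 30 epochs: objective is loss_av only
        return loss_av
    lam_bar = lam.mean()             # one-dimensional gate after averaging
    # Hypothetical reciprocal-gate weighting of the auxiliary losses;
    # the exact objective in the patent figure may differ.
    return loss_av + beta * (loss_a / (1.0 - lam_bar) + loss_v / lam_bar)
```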
In an embodiment of the present invention, referring to fig. 5, after model training is completed, a multi-mode video classification system based on gating cross-mode feature fusion is provided, including:
and the multi-mode feature extraction module is used for extracting the features of the voice and visual modes and encoding the features as input features for the fusion of the mode features.
And the gating cross-modal feature fusion module is used for carrying out balanced fusion on the input multi-modal features according to the information expression characteristics of the single-modal features.
And the classification prediction module is used for carrying out final classification prediction on the output characteristics after balance fusion.
The present invention has been tested on two common datasets, CREMA-D and VGGSound. In order to quantitatively evaluate the performance of RGGL, the accuracy (acc) and the mean average precision (mAP) are used as evaluation metrics.
TABLE 1
Audio-only is the result of using only voice data and using a voice modality model.
Visual-only is the result of using only Visual data and using a model of Visual modality.
The fusion mode for the modal characteristics in the Baseline model is simple addition, and a basic objective function is used.
The results in Table 1 show that the performance of RGGL, the video classification model based on parallel speech- and visual-modality ResNet18 encoders and the globally linked gating mechanism, is significantly higher than that of the Baseline model on the CREMA-D and VGGSound datasets. In particular, comparing the Audio-only and Baseline results on the CREMA-D dataset shows that performance decreases rather than increases after using the multi-modal model, from which the modality imbalance problem can be inferred; after using RGGL, however, the accuracy increases by about 15%. This shows that the global gating mechanism, consisting of the cross-modal gated fusion module and the objective function based on the gated auxiliary loss, can solve the modality imbalance problem of the multi-modal model and significantly improve the performance of the multi-modal video classification model.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments, including the components, without departing from the principles and spirit of the invention, yet fall within the scope of the invention.
Claims (8)
1. A multi-modal video classification method, characterized by comprising the following steps:
S1, inputting preprocessed video data, and respectively extracting speech-modality features and visual-modality features by using two ResNet18 encoders with the same structure;
S2, constructing a globally linked gating mechanism, which consists of a cross-modal gated fusion module and an objective function based on a gated auxiliary loss;
S3, taking the speech-modality features and the visual-modality features as input, obtaining the cross-modal features through the cross-modal gated fusion module, obtaining the gating parameters through a Sigmoid activation function, and using the gating parameters to control the information retention of the speech modality and the visual modality;
S4, respectively carrying out classification prediction on the speech-modality features, the visual-modality features and the cross-modal features through a fully connected layer, and calculating three independent losses loss_av, loss_a and loss_v using the cross-entropy formula;
S5, among the three independent losses loss_av, loss_a and loss_v, taking the single-modality losses as auxiliary losses and taking the processed gating parameters as the weight of the auxiliary losses, forming the objective function based on the gated auxiliary loss.
2. The method according to claim 1, wherein in the step S1, the video data includes a voice mode and a visual mode,
the voice mode acquisition method comprises the following steps: extracting voice information in the video and converting the voice information into a spectrogram as a voice mode;
the visual mode extraction method comprises the following steps: 3 frames are uniformly sampled in the video as a visual modality.
3. The method for classifying multi-modal video according to claim 2, wherein in the step S1, the method for extracting the speech mode feature and the image mode feature is as follows:
the speech mode and the visual mode respectively adopt two identical ResNet18 as encoders to respectively extract characteristics from input data of the speech mode and the visual mode:
H_a = E_a(X_a; θ_a), H_v = E_v(X_v; θ_v)
where X_a is the input from the speech modality, E_a is the ResNet18-based speech encoder, θ_a are the encoder parameters, and H_a is the speech-modality feature extracted by the encoder; similarly, X_v is the input from the visual modality, E_v is the ResNet18-based visual encoder, θ_v are the encoder parameters, and H_v is the visual-modality feature extracted by the encoder.
4. The multi-modal video classification method according to claim 1, wherein in the step S3, the cross-modal characteristics obtained after the fusion control the information retention of the voice mode and the visual mode by the gating parameters.
5. The multi-modal video classification method according to claim 1, wherein the specific method of step S3 is as follows: the features of the speech modality and the visual modality are received, passed through a fully connected layer and added, and activated by the Sigmoid function to obtain the multidimensional gating parameter λ; the gating parameter then controls the information retention of the speech modality and the visual modality, with the expression:
λ = Sigmoid(U·H_a + V·H_v)
H_av = (1-λ)·H_a + λ·H_v
where U and V are trainable variables, λ is the gating parameter, Sigmoid is the activation function, H_a is the speech-modality feature, H_v is the visual-modality feature, and H_av is the cross-modal feature.
6. The method for classifying multi-modal videos according to claim 4, wherein in the step S5, the gating parameters are processed as follows: the gating parameter λ is reduced to one dimension by averaging, and its reciprocal is then taken as the weight of the corresponding modality loss.
7. The method according to claim 5, wherein in the step S5, an expression of an objective function based on gating auxiliary loss is formed as follows:
where λ̄ is the one-dimensional gating parameter after averaging, β is a hyperparameter, loss_a is the cross-entropy loss of the single speech modality, loss_v is the cross-entropy loss of the single visual modality, and loss_av is the cross-entropy loss after multi-modal fusion.
8. A multi-modal video classification system, comprising:
a parallel multi-modal feature extraction module, a cross-modal gated fusion module, an objective function based on a gated auxiliary loss, and a classification prediction module;
the parallel multi-modal feature extraction module comprises two ResNet18 encoders with the same structure and is used for extracting the initial features of the speech modality and the visual modality;
the cross-modal gated fusion module and the objective function based on the gated auxiliary loss jointly construct a global gating adjustment mechanism, which is used for judging the fitting state of each modality according to the loss of the single modality and adaptively adjusting the information proportion of the different modalities in cross-modal fusion;
and the classification prediction module performs classification prediction on the single-modality features and the fused modality features, and takes the classification prediction of the fused modality features as the final prediction result after model training is finished.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202311329631.3A | 2023-10-16 | 2023-10-16 | Multi-mode video classification method and system
Publications (1)
Publication Number | Publication Date |
---|---|
CN117392582A true CN117392582A (en) | 2024-01-12 |
Family
ID=89436812
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311329631.3A Pending CN117392582A (en) | 2023-10-16 | 2023-10-16 | Multi-mode video classification method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117392582A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117576784A * | 2024-01-15 | 2024-02-20 | Jilin University | Method and system for recognizing diver gesture by fusing event and RGB data |
CN117576784B * | 2024-01-15 | 2024-03-26 | Jilin University | Method and system for recognizing diver gesture by fusing event and RGB data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |