CN115346261A - Audio-visual emotion classification method based on improved ConvMixer network and dynamic focus loss - Google Patents


Info

Publication number
CN115346261A
Authority
CN
China
Prior art keywords: feature, layer, network, characteristic, matrix
Prior art date
Legal status
Pending
Application number
CN202211015781.2A
Other languages
Chinese (zh)
Inventor
师硕
覃嘉俊
郝小可
郭迎春
于洋
朱叶
刘依
吕华
Current Assignee
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date
Filing date
Publication date
Application filed by Hebei University of Technology filed Critical Hebei University of Technology
Priority to CN202211015781.2A priority Critical patent/CN115346261A/en
Publication of CN115346261A publication Critical patent/CN115346261A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00 Computing arrangements based on biological models
            • G06N3/02 Neural networks
              • G06N3/08 Learning methods
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V10/00 Arrangements for image or video recognition or understanding
            • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
                • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                  • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
              • G06V10/82 Arrangements for image or video recognition or understanding using neural networks
          • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
            • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
              • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
                • G06V40/161 Detection; Localisation; Normalisation
                • G06V40/168 Feature extraction; Face representation
                • G06V40/174 Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention is an audio-visual emotion classification method based on an improved ConvMixer network and dynamic focus loss, which comprises the steps of: 1) collecting videos that express emotion and contain human face regions, extracting image sequences and audio signals from the videos, and converting the audio signals into Mel cepstrum coefficient spectrograms; 2) constructing a ConvMixer network combined with an adjacency matrix and extracting visual features from the image sequences with it; 3) extracting auditory features from the Mel cepstrum coefficient spectrograms with a ResNet34 network; 4) constructing a feature fusion and classification network that fuses the visual features and the auditory features and classifies the emotion of each video according to the fused features; 5) training the networks, with the training loss calculated through a focal loss function fused with dynamic weights. The method solves the problems that existing methods emphasize extracting local features of video frames while ignoring global features, and that their loss functions cannot make the model pay attention to hard-to-classify samples.

Description

Audio-visual emotion classification method based on improved ConvMixer network and dynamic focus loss
Technical Field
The invention belongs to the technical field of audio-visual emotion classification, and particularly relates to an audio-visual emotion classification method based on an improved ConvMixer network and dynamic focus loss.
Background
With the popularization of the internet and computers, human-computer interaction occurs ever more frequently. Audio-visual emotion classification measures and analyzes the external expressions of humans to infer their emotional state; applied to modern human-computer interaction systems, it can make the interaction more natural and friendly and improve the human-computer interaction experience.
Audio-visual emotion classification methods based on deep learning do not require manual feature extraction based on professional knowledge, show performance superior to traditional approaches, and achieve better results when applied to audio-visual emotion classification.
In the paper "End-to-End Multimodal Emotion Recognition Using Deep Neural Networks" published by Trigeorgis et al. in the IEEE Journal of Selected Topics in Signal Processing in 2017, features are extracted from the audio signal and the video signal with one-dimensional and two-dimensional Convolutional Neural Networks (CNN) respectively, and the spliced audio and video features are input into a Recurrent Neural Network (RNN) for emotion analysis. In the paper "Learning Affective Features with a Hybrid Deep Model for Audio-Visual Emotion Recognition" published by Zhang et al. in IEEE Transactions on Circuits and Systems for Video Technology in 2018, a dual-stream network built from a three-dimensional CNN and a two-dimensional CNN extracts features from a video frame sequence and an audio spectrogram segmented into fixed time lengths, the corresponding video features and audio features are fused through a Deep Belief Network (DBN), and the features of all time periods are finally average-pooled into a global feature for classification. However, the video and audio features of all time periods are extracted by the same dual-stream network and therefore cannot be extracted in a targeted way for each time period, and fusing the features of all time periods by average pooling cannot highlight the temporal information of the original video.
CN114724224A discloses a multi-modal emotion recognition method for a medical care robot, which extracts expression and action self-attention emotion features from the video information and voice and text self-attention emotion features from the audio information; the four self-attention emotion features are fused through a mutual-attention mechanism to obtain complete multi-modal emotion features; context emotion features are then extracted from the multi-modal emotion features with a graph convolutional neural network to obtain multi-modal emotion features containing context information, which are finally classified to obtain the emotion label. Because the four self-attention features are spliced into one multi-modal feature before the mutual-attention fusion, the splicing operation increases the amount of subsequent mutual-attention computation, neglects the temporal correlation among the four modalities, and enlarges the feature dimension. CN114582372A discloses a multi-modal emotion feature recognition method for judging the emotion of a driver, in which visual information and voice information are respectively input into a visual facial expression feature recognition model and a voice emotion feature recognition model to obtain a visual feature vector and a voice feature vector, and the two features are input into a bimodal emotion feature recognition model to obtain an emotion recognition result fused at the decision level; however, the extracted voice feature vector is a statistical feature based on the Mel cepstrum coefficient spectrogram of the audio, which ignores the emotion-related energy change information in the audio. CN113989893A discloses a child emotion recognition algorithm based on the expression and voice modalities, which constructs a semantic feature space from the emotion label information of the voice features and the expression features, extracts local and global features of the audio and video with a multi-scale feature extraction method, projects them into the semantic feature space, selects the features that contribute most to emotion classification, and performs emotion judgment and recognition.
Disclosure of Invention
Aiming at the defects of the prior art, the technical problem to be solved by the invention is to provide an audio-visual emotion classification method based on an improved ConvMixer network and dynamic focus loss. Visual features are extracted from the image sequence by a ConvMixer network combined with an adjacency matrix, auditory features are extracted from the Mel cepstrum coefficient spectrogram by a ResNet34 network, the cross-modal time attention modules of a feature fusion and classification network fuse the visual features and the auditory features, and the fused features are used to judge the emotion category of the video; the network model is trained with a Focal Loss function fused with dynamic weights, which optimizes the model parameters and improves the recognition rate of hard-to-classify samples. The invention solves the problems that existing audio-visual bimodal emotion recognition methods focus on extracting local features of the video frames while ignoring global features, that their bimodal fusion methods are simple, and that their loss functions cannot make the model pay attention to hard-to-classify samples.
The technical scheme adopted by the invention for solving the technical problem is as follows:
an audio-visual emotion classification method based on an improved ConvMixer network and dynamic focus loss comprises the following steps:
the method comprises the steps of firstly, collecting videos expressing emotion and related to human face regions, extracting image sequences and audio signals from the videos, and converting the audio signals into a Mel cepstrum coefficient spectrogram;
secondly, constructing a ConvMixer network combined with an adjacency matrix, which comprises three parts of operations, namely a block embedding operation, a Layer module operation and an average pooling operation; inputting the image sequence into the ConvMixer network combined with the adjacency matrix to extract visual features and obtain a feature map F;
step (2.1), block embedding operation:
sequentially carrying out the block embedding operation on the image sequence through a convolution layer, an activation function layer and a normalization layer to obtain the feature map F_{2.1} output by the block embedding operation;
Step (2.2), Layer module operation, which comprises four cascaded Layer modules;
the feature map F_{2.1} is input into the first Layer module; according to the spatial size of the feature map F_{2.1}, a two-dimensional spatial coordinate matrix of each image block in F_{2.1} is constructed, and the two-dimensional spatial coordinate matrix is copied and spliced according to the temporal size of F_{2.1} to obtain a spatial position code with the same size as F_{2.1}; the feature map F_{2.1} is spliced with the spatial position code and passed through a linear layer to obtain the feature map \hat{F}_s;
a spatial adjacency matrix is randomly generated according to the spatial size of the feature map F_{2.1}; the feature map \hat{F}_s is multiplied by the spatial adjacency matrix and passed through an activation function layer and a normalization layer to obtain the feature map F_s; the feature maps \hat{F}_s and F_s are superimposed to obtain the feature map F_s';
according to the temporal size of the feature map F_s', a one-dimensional temporal coordinate matrix of each image block in F_s' is constructed, and the one-dimensional temporal coordinate matrix is copied and spliced according to the spatial size of F_s' to obtain a temporal position code with the same size as F_s'; the feature map F_s' is spliced with the temporal position code and passed through a linear layer to obtain the feature map \hat{F}_t;
a temporal adjacency matrix is randomly generated according to the temporal size of the feature map F_s'; the feature map \hat{F}_t is multiplied by the temporal adjacency matrix and passed through an activation function layer and a normalization layer to obtain the feature map F_t; the feature maps \hat{F}_t and F_t are superimposed to obtain the feature map F_t'; the feature map F_t' is passed sequentially through a point-by-point convolution layer, an activation function layer and a normalization layer to obtain the feature map output by the first Layer module;
step (2.3), average pooling operation:
carrying out a spatial-dimension average pooling operation on the feature map output by the fourth Layer module through an average pooling layer to obtain the feature map F;
thirdly, extracting auditory features from the Mel cepstrum coefficient spectrogram by using a ResNet34 network to obtain a feature map M;
fourthly, constructing a feature fusion and classification network for fusing the visual features and the auditory features and carrying out emotion classification on each video according to the fused features; the feature fusion and classification network comprises two cross-modal time attention modules, pooling and splicing operation and classification operation;
step (4.1), the first cross-modal time attention module:
inputting the feature map F and the feature map M into the first cross-modal time attention module; the feature map F passes through a linear layer and a normalization layer to obtain the feature Q_1; the feature map M passes through two independent linear layers and two independent normalization layers to obtain the features K_1 and V_1 respectively;
a learnable intermediate matrix LIM_1 is generated according to the time dimension sizes of Q_1 and K_1, and the intermediate matrix LIM_1 is initialized with random parameters; the feature Q_1, the initialized intermediate matrix LIM_1 and the transpose of the feature K_1 are multiplied, divided by the square root of the channel number of K_1, and input into a softmax layer to obtain the normalized weight; the normalized weight is multiplied by the feature V_1 and then added to the feature Q_1 to obtain the cross-modal attention feature F_att based on the image sequence; the cross-modal attention feature F_att based on the image sequence passes through a point-by-point convolution layer to obtain the cross-modal feature F_cm based on the image sequence;
Step (4.2), the second cross-modal time attention module:
inputting the feature map F and the feature map M into the second cross-modal time attention module; the feature map M passes through a linear layer and a normalization layer to obtain the feature Q_2; the feature map F passes through two independent linear layers and two independent normalization layers to obtain the features K_2 and V_2;
a learnable intermediate matrix LIM_2 is generated according to the time dimension sizes of Q_2 and K_2, and the intermediate matrix LIM_2 is initialized with random parameters; the feature Q_2, the intermediate matrix LIM_2 and the transpose of the feature K_2 are multiplied, divided by the square root of the channel number of K_2, and input into a softmax layer to obtain the normalized weight; the normalized weight is multiplied by the feature V_2 and then added to the feature Q_2 to obtain the cross-modal attention feature M_att based on the Mel cepstrum coefficient spectrogram; the cross-modal attention feature M_att based on the Mel cepstrum coefficient spectrogram passes through a point-by-point convolution layer to obtain the cross-modal feature M_cm based on the Mel cepstrum coefficient spectrogram;
Step (4.3), pooling and splicing:
the cross-modal feature F_cm based on the image sequence and the cross-modal feature M_cm based on the Mel cepstrum coefficient spectrogram are respectively average-pooled and then spliced to obtain the feature f_FM;
Step (4.4), classification:
the feature f_FM is input into a linear layer and then a softmax layer to obtain the predicted probability distribution P{Y_1, Y_2, ..., Y_i, ..., Y_q} over the E emotion categories, where Y_i denotes the predicted probability distribution of the i-th video over the E emotion categories, denoted Y_i{y_{i1}, ..., y_{ie}, ..., y_{iE}}, y_{ie} denotes the predicted probability that the i-th video belongs to the e-th emotion category, and q denotes the number of videos;
fifthly, training the ConvMixer network combined with the adjacency matrix, the ResNet34 network and the feature fusion and classification network, and calculating the training loss through the focal loss function fused with the dynamic weight; extracting visual features from the image sequence with the trained ConvMixer network combined with the adjacency matrix, extracting auditory features from the Mel cepstrum coefficient spectrogram with the trained ResNet34 network, fusing the visual features and the auditory features with the trained feature fusion and classification network, performing emotion classification according to the fused features, and predicting the emotion category corresponding to each video.
Further, the focal loss function fused with the dynamic weight is:
Loss = -(1/q) \sum_{i=1}^{q} \alpha \log( y_{i\hat{e}_i} )    (21)
in formula (21), \alpha denotes the dynamic weight, log denotes the logarithmic function with base 2, \hat{e}_i denotes the real emotion category to which the i-th video belongs, and y_{i\hat{e}_i} denotes the predicted probability of that category;
the dynamic weight \alpha is calculated as:
\alpha = ( n_{\hat{e}_i} - C[\hat{e}_i][\hat{e}_i] ) / n_{\hat{e}_i}    (22)
in formula (22), C denotes the confusion matrix of the last training period, C[\hat{e}_i][t] denotes the number of times that videos whose real emotion category is \hat{e}_i were predicted as category t in the last training period, n_{\hat{e}_i} denotes the number of videos whose real emotion category is \hat{e}_i in the last training period, and C[\hat{e}_i][\hat{e}_i] denotes the number of times that videos whose real emotion category is \hat{e}_i were predicted as category \hat{e}_i.
Further, the construction process of the confusion matrix C is as follows: before each training period begins, an all-zero matrix of size E × E is generated, with rows and columns numbered from 0 to E-1; according to the prediction of each sample during training, when a sample of real emotion category 0 is predicted as category 2, 1 is added to the element in row 0, column 2 of the matrix, and the remaining samples update the all-zero matrix in the same way to obtain the confusion matrix C; when the training period is completed, the confusion matrix of the current period is used to calculate the dynamic weight of the loss function in the next training period.
Compared with the prior art, the invention has the prominent substantive characteristics and remarkable progress as follows:
(1) In the ConvMixer combined with the adjacency matrix (Adjacency-Matrix-based ConvMixer, ACAM for short), the Layer module operation expands the receptive field of the network by means of the spatial adjacency matrix and the temporal adjacency matrix, so that the network can simultaneously extract the global and local spatial and temporal features of the image sequence. In order to capture the emotion-related temporal correlation between the image sequence and the Mel cepstrum coefficient spectrogram, the Cross-Modal Time Attention Module (CMTAM) provided by the invention relates the temporal features of the video and the audio, which have different time scales, through a learnable intermediate matrix, and the CMTAM learns the ability to capture cross-modal temporal correlation through this intermediate matrix. The Focal Loss function fused with dynamic weights provided by the invention adjusts the loss value of the training samples through the dynamic weights, so that the model pays more attention to misclassified samples during optimization, which improves the generalization ability of the model.
(2) CN114694076A discloses a multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion, which splices and fuses the extracted single-modal features and uses a cross-entropy loss function as the optimization target of the model. Compared with CN114694076A, the cross-modal time attention module with a learnable intermediate matrix of the invention captures the cross-modal temporal correlation and extracts cross-modal time attention features, which improves the recognition effect of the model.
(3) CN114582372A discloses a multi-modal emotion feature recognition method for judging the emotion of a driver, which inputs a visual feature vector and a voice feature vector into a bimodal emotion feature recognition model to obtain an emotion recognition result fused at the decision level. Compared with CN114582372A, the invention adopts an end-to-end training method in which the audio and video feature extraction networks are trained simultaneously, which saves model training time; in addition, the invention extracts not only the spatial information of the video image sequence but also its temporal features, which ensures the applicability of the algorithm in different scenes.
(4) The method adopts the idea of deep learning. Traditional detection methods only extract low-level spatial features from the video image sequence frame by frame and cannot extract the temporal features of the sequence, whereas deep learning can extract high-level semantic features and express the images better.
Drawings
FIG. 1 is a flow chart of the training phase of the present invention;
FIG. 2 is a flow chart of the classification phase of the present invention;
FIG. 3 is a schematic diagram of the block embedding operation in the process of constructing ConvMixer network combined with adjacency matrix according to the present invention;
FIG. 4 is a schematic diagram of Layer module operation and average pooling operation in constructing ConvMixer networks incorporating adjacency matrices according to the present invention;
FIG. 5 is a schematic diagram of the shallow feature extraction operation in the ResNet34 network of the present invention;
FIG. 6 is a schematic diagram of the first to third, fifth to seventh, ninth to thirteenth, and fifteenth to sixteenth residual modules in the ResNet34 network;
FIG. 7 is a schematic diagram of a fourth, eighth, and fourteenth residual module in a ResNet34 network;
FIG. 8 is a schematic diagram of a first cross-modality time attention module in the process of building a feature fusion and classification network;
FIG. 9 is a schematic diagram of a second cross-modality time attention module in the process of constructing a feature fusion and classification network;
FIG. 10 is a schematic illustration of pooling and stitching, sorting operations in constructing a feature fusion and sorting network;
FIG. 11 is a schematic diagram of the construction of the Focal Loss function fused with dynamic weights in the present invention.
Detailed Description
The technical solutions of the present invention are described in detail below with reference to the accompanying drawings and the detailed description, but the scope of the present invention is not limited thereto.
The invention is an audio-visual emotion classification method based on an improved ConvMixer network and dynamic focus loss (see FIGS. 1-11), which extracts visual features from the image sequence with a ConvMixer network combined with an adjacency matrix, extracts auditory features from the audio Mel cepstrum coefficient spectrogram with a ResNet34 network, fuses the audio-visual features through cross-modal time attention modules, and obtains a Focal Loss function with dynamic weights by combining a confusion matrix as the optimization objective function of the network model; the specific steps are as follows:
the method comprises the steps of firstly, collecting a video which expresses emotion and relates to a human face area, extracting an image sequence and an audio signal from the video, and converting the audio signal into a Mel cepstrum coefficient spectrogram;
step (1.1), extracting images from the video, and converting the video into an image sequence;
a group of video sequences consists of a plurality of videos; images are extracted from the videos with OpenCV software, and the images extracted from each video form an image sequence, so the image data set is a set of image sequences, denoted T{V_1, V_2, ..., V_i, ..., V_q}, where V_i denotes the image sequence corresponding to the i-th video and q denotes the number of videos; each image sequence comprises N images, e.g. N images are extracted from the i-th video, i.e. the i-th image sequence is denoted V_i{v_{i1}, v_{i2}, ..., v_{id}, ..., v_{iN}}, where N is 64 and v_{id} denotes the d-th image extracted from the i-th video; the obtained image sequences are normalized and each frame image is resized to 112 × 112 pixels, so that the size of each image sequence is 64 × 112 × 112;
step (1.2), separating an audio signal from the video, and converting the audio signal into a Mel cepstrum coefficient spectrogram;
the audio signal is separated from the video with Librosa software, and a Mel cepstrum coefficient spectrogram with 32 frequency bins is extracted; the Mel cepstrum coefficient spectrogram corresponding to the i-th video is denoted A_i{a_{i1}, a_{i2}, ..., a_{id}, ..., a_{iN}}, where a_{id} denotes the Mel cepstrum coefficients of the d-th time segment of the spectrogram extracted from the i-th video, and the set of Mel cepstrum coefficient spectrograms corresponding to the whole data set is M{A_1, A_2, ..., A_i, ..., A_q};
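By way of illustration only, the following sketch shows how the first step could be carried out with OpenCV and Librosa; the file paths, the uniform frame sampling, and the assumption that the audio track has already been exported to a separate file (e.g. with ffmpeg) are not specified by the patent:

```python
# Illustrative sketch of step one: 64-frame image sequence + 32-band MFCC spectrogram.
import cv2
import numpy as np
import librosa

def extract_image_sequence(video_path, n_frames=64, size=112):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (size, size)))
    cap.release()
    # uniformly sample n_frames indices (repeats frames if the clip is short)
    idx = np.linspace(0, len(frames) - 1, n_frames).astype(int)
    seq = np.stack([frames[i] for i in idx])        # (64, 112, 112, 3)
    return seq.astype(np.float32) / 255.0           # simple [0, 1] normalization

def extract_mfcc_spectrogram(audio_path, n_mfcc=32):
    y, sr = librosa.load(audio_path, sr=None)                 # audio separated from the video
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (32, time segments)
```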
Secondly, constructing a ConvMixer network combined with an adjacency matrix, wherein the ConvMixer network comprises three parts of operations, namely block embedding operation, layer module operation and average pooling operation; inputting the image sequence into a ConvMixer network combined with an adjacency matrix to extract visual features, and obtaining a feature map F;
Step (2.1), block embedding operation:
the image sequence with size 64 × 112 × 112 and 3 channels obtained in step (1.1) is sequentially subjected to the block embedding operation through a convolution layer, an activation function layer and a normalization layer to obtain the feature map F_{2.1} with size 16 × 16 × 16 and 512 channels, see FIG. 3; the block embedding operation is shown in formula (1):
F_{2.1} = BN(GELU(Conv_{c_{in},h}(F_{in})))    (1)
in formula (1), F_{in} denotes the input of the block embedding operation, Conv_{c_{in},h} denotes a convolution layer whose step size and convolution kernel size are both 4 × 7 × 7, c_{in} and h denote the number of input channels and the number of output channels of the convolution layer respectively, GELU denotes the activation function layer, and BN denotes the normalization layer;
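A minimal PyTorch sketch of the block embedding operation of formula (1) is given below; only the kernel/stride sizes and channel numbers come from the description above, while the use of Conv3d, GELU and BatchNorm3d modules is an assumption about the concrete layers:

```python
import torch
import torch.nn as nn

# Block embedding of eq. (1): 4x7x7 convolution with stride 4x7x7, activation, normalization.
patch_embedding = nn.Sequential(
    nn.Conv3d(in_channels=3, out_channels=512, kernel_size=(4, 7, 7), stride=(4, 7, 7)),
    nn.GELU(),
    nn.BatchNorm3d(512),
)

x = torch.randn(2, 3, 64, 112, 112)   # a batch of image sequences (B, 3, T=64, H=112, W=112)
f_21 = patch_embedding(x)             # -> (2, 512, 16, 16, 16), the feature map F_2.1
```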
Step (2.2), Layer module operation, which comprises four cascaded Layer modules;
the feature map F_{2.1} obtained in step (2.1) is input into the first Layer module; according to the spatial size of the feature map F_{2.1}, a two-dimensional spatial coordinate matrix of each image block (patch) in F_{2.1} is constructed, and the two-dimensional spatial coordinate matrix is copied and spliced according to the temporal size of F_{2.1} to obtain a spatial position code with the same size as F_{2.1}; the feature map F_{2.1} is spliced with the spatial position code and passed through a linear layer to obtain the feature map \hat{F}_s;
a spatial adjacency matrix is randomly generated according to the spatial size of the feature map F_{2.1}; the feature map \hat{F}_s is multiplied by the spatial adjacency matrix and passed through an activation function layer and a normalization layer to obtain the feature map F_s; the feature maps \hat{F}_s and F_s are superimposed to obtain the feature map F_s';
according to the temporal size of the feature map F_s', a one-dimensional temporal coordinate matrix of each image block in F_s' is constructed, and the one-dimensional temporal coordinate matrix is copied and spliced according to the spatial size of F_s' to obtain a temporal position code with the same size as F_s'; the feature map F_s' is spliced with the temporal position code and passed through a linear layer to obtain the feature map \hat{F}_t;
a temporal adjacency matrix is randomly generated according to the temporal size of the feature map F_s'; the feature map \hat{F}_t is multiplied by the temporal adjacency matrix and passed through an activation function layer and a normalization layer to obtain the feature map F_t; the feature maps \hat{F}_t and F_t are superimposed to obtain the feature map F_t'; the feature map F_t' is passed sequentially through a point-by-point convolution layer, an activation function layer and a normalization layer to obtain the feature map F_{2.2.1}; the feature map F_{2.2.1} is input into the second Layer module, which outputs the feature map F_{2.2.2}; the feature map F_{2.2.2} is input into the third Layer module, which outputs the feature map F_{2.2.3}; the feature map F_{2.2.3} is input into the fourth Layer module, which outputs the feature map F_{2.2.4}, see FIG. 4;
each Layer module is shown in formulas (2)-(6):
\hat{F}_s = Linear(Concat(F_0, SPC))    (2)
in formula (2), F_0 denotes the input of the Layer module, SPC denotes the spatial position code, Concat denotes the splicing operation, and Linear denotes the linear layer;
F_s = BN(GELU(\hat{F}_s · SAM)),  F_s' = \hat{F}_s + F_s    (3)
in formula (3), SAM denotes the spatial adjacency matrix;
\hat{F}_t = Linear(Concat(F_s', TPC))    (4)
in formula (4), TPC denotes the temporal position code;
F_t = BN(GELU(\hat{F}_t · TAM)),  F_t' = \hat{F}_t + F_t    (5)
in formula (5), TAM denotes the temporal adjacency matrix;
F_out = BN(GELU(Conv_pw(F_t')))    (6)
in formula (6), Conv_pw denotes a point-by-point convolution layer with a convolution kernel size of 1 × 1 × 1;
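The sketch below is one possible PyTorch reading of a single Layer module (formulas (2)-(6)); the exact construction of the position codes, the linear-layer shapes, and whether the adjacency matrices remain trainable after their random initialization are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn

class LayerModule(nn.Module):
    def __init__(self, channels=512, t=16, h=16, w=16):
        super().__init__()
        # linear layers projecting (features ++ position code) back to `channels`, eqs. (2)/(4)
        self.spatial_linear = nn.Linear(channels + 2, channels)
        self.temporal_linear = nn.Linear(channels + 1, channels)
        # randomly initialized spatial / temporal adjacency matrices, eqs. (3)/(5)
        self.sam = nn.Parameter(torch.randn(h * w, h * w))
        self.tam = nn.Parameter(torch.randn(t, t))
        self.bn_s = nn.BatchNorm3d(channels)
        self.bn_t = nn.BatchNorm3d(channels)
        self.act = nn.GELU()
        # point-by-point convolution, activation and normalization, eq. (6)
        self.out = nn.Sequential(nn.Conv3d(channels, channels, 1), nn.GELU(), nn.BatchNorm3d(channels))

    def forward(self, f0):                                   # f0: (B, C, T, H, W)
        b, c, t, h, w = f0.shape
        # spatial position code: 2-D patch coordinates copied along time, eq. (2)
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        spc = torch.stack([ys, xs]).float().to(f0.device)    # (2, H, W)
        spc = spc[:, None].expand(2, t, h, w).expand(b, 2, t, h, w)
        fs_hat = torch.cat([f0, spc], dim=1).permute(0, 2, 3, 4, 1)
        fs_hat = self.spatial_linear(fs_hat).permute(0, 4, 1, 2, 3)
        # multiply by the spatial adjacency matrix, eq. (3)
        fs = (fs_hat.reshape(b, c, t, h * w) @ self.sam).reshape(b, c, t, h, w)
        fs = self.bn_s(self.act(fs))
        fs_prime = fs_hat + fs                               # F_s'
        # temporal position code: 1-D time coordinates copied over space, eq. (4)
        tpc = torch.arange(t, dtype=f0.dtype, device=f0.device)
        tpc = tpc[None, None, :, None, None].expand(b, 1, t, h, w)
        ft_hat = torch.cat([fs_prime, tpc], dim=1).permute(0, 2, 3, 4, 1)
        ft_hat = self.temporal_linear(ft_hat).permute(0, 4, 1, 2, 3)
        # multiply by the temporal adjacency matrix, eq. (5)
        ft = ft_hat.permute(0, 1, 3, 4, 2).reshape(b, c, h * w, t) @ self.tam
        ft = ft.reshape(b, c, h, w, t).permute(0, 1, 4, 2, 3)
        ft = self.bn_t(self.act(ft))
        ft_prime = ft_hat + ft                               # F_t'
        return self.out(ft_prime)                            # eq. (6)
```

Four such modules are cascaded, and average-pooling the fourth module's output over the spatial dimensions then yields the feature map F of step (2.3).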
step (2.3), average pooling operation:
the feature map F_{2.2.4} with size 16 × 16 × 16 and 512 channels obtained in step (2.2) is subjected to a spatial-dimension average pooling operation through an average pooling layer to obtain the feature map F with size 16 × 1 × 1 and 512 channels;
thereby constructing a ConvMixer network incorporating an adjacency matrix;
thirdly, constructing a ResNet34 network, which comprises a shallow feature extraction layer and sixteen residual modules, where the output of each residual module is the input of the next residual module; inputting the Mel cepstrum coefficient spectrogram into the ResNet34 network to extract auditory features and obtain the feature map M;
Step (3.1), shallow feature extraction operation:
the Mel cepstrum coefficient spectrogram obtained in step (1.2), with size 32 × 590 and 1 channel, is input into the ResNet34 network; the Mel cepstrum coefficient spectrogram is input into a shallow feature extraction layer composed of a convolution layer, a normalization layer and an activation function layer to obtain the feature map M_{3.1} with size 8 × 148 and 64 channels, see FIG. 5; the shallow feature extraction operation is shown in formula (7):
M_{3.1} = RELU(BN(Conv_{1,64,7,2,3}(M_{in})))    (7)
in formula (7), M_{in} denotes the input of the ResNet34 network, M_{3.1} denotes the output of the shallow feature extraction layer, and Conv_{1,64,7,2,3} denotes a convolution layer with 1 input channel, 64 output channels, a convolution kernel size of 7, a step size of 2 and an edge fill size of 3;
Step (3.2), deep feature extraction operation:
the feature map M_{3.1} obtained in step (3.1) is input into the first residual module; the number of output channels of the first to third residual modules is 64, that of the fourth to seventh residual modules is 128, that of the eighth to thirteenth residual modules is 256, and that of the fourteenth to sixteenth residual modules is 512; the sixteenth residual module outputs the feature map M with size 1 × 19 and 512 channels;
the operation of the first to third, fifth to seventh, ninth to thirteenth and fifteenth to sixteenth residual modules is shown in formula (8):
M_l = BN_2(Conv_2(RELU(BN_1(Conv_1(M_{l-1}))))) + M_{l-1}    (8)
in formula (8), M_{l-1} denotes the input of the l-th residual module, M_l denotes the output of the l-th residual module, Conv_1 and Conv_2 denote two independent convolution layers with a convolution kernel size of 3, a step size of 1 and an edge fill size of 1, and BN_1 and BN_2 denote two independent normalization layers;
the operation of the fourth, eighth and fourteenth residual modules is shown in formula (9):
M_l = BN_2(Conv_4(RELU(BN_1(Conv_3(M_{l-1}))))) + BN_3(Conv_5(M_{l-1}))    (9)
in formula (9), Conv_3 denotes a convolution layer with a convolution kernel size of 3, a step size of 2 and an edge fill size of 1, Conv_4 denotes a convolution layer with a convolution kernel size of 3, a step size of 1 and an edge fill size of 1, Conv_5 denotes a convolution layer with a convolution kernel size of 1, a step size of 2 and an edge fill size of 0, and BN_1, BN_2 and BN_3 denote three independent normalization layers;
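A PyTorch sketch of the two kinds of residual modules described by formulas (8) and (9) follows; treating the single-channel Mel cepstrum coefficient spectrogram with 2-D convolutions is an assumption based on the stated feature-map sizes, and, following the formulas literally, no activation is applied after the residual addition:

```python
import torch.nn as nn

class BasicResidualModule(nn.Module):
    """Eq. (8): residual modules with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x))))) + x

class DownsampleResidualModule(nn.Module):
    """Eq. (9): the 4th, 8th and 14th residual modules, which change channels and stride."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv3 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv4 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv5 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=2, padding=0)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        main = self.bn2(self.conv4(self.relu(self.bn1(self.conv3(x)))))
        return main + self.bn3(self.conv5(x))
```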
fourthly, constructing a feature fusion and classification network for fusing the visual features and the auditory features and classifying the emotion of each video according to the fused features; the feature fusion and classification network comprises two cross-modal time attention modules, pooling and splicing operation and classification operation;
step (4.1), the first cross-modal time attention module:
the feature map F obtained in step (2.3) and the feature map M obtained in step (3.2) are input into the first cross-modal time attention module; the feature map F passes through a linear layer and a normalization layer to obtain the feature Q_1; the feature map M passes through two independent linear layers and two independent normalization layers to obtain the features K_1 and V_1; the features Q_1, K_1 and V_1 are obtained as shown in formulas (10), (11) and (12):
Q_1 = BN(Linear(F))    (10)
K_1 = BN(Linear(M))    (11)
V_1 = BN(Linear(M))    (12)
a learnable intermediate matrix LIM_1 is generated according to the time dimension sizes of Q_1 and K_1, and the intermediate matrix LIM_1 is initialized with random parameters; the feature Q_1, the initialized intermediate matrix LIM_1 and the transpose of the feature K_1 are multiplied, divided by the square root of the channel number of K_1, and input into a softmax layer to obtain the normalized weight; the normalized weight is multiplied by the feature V_1 and then added to the feature Q_1 to obtain the cross-modal attention feature F_att based on the image sequence; F_att is calculated as shown in formula (13):
F_att = softmax( (Q_1 · LIM_1 · K_1^T) / sqrt(d_{K_1}) ) · V_1 + Q_1    (13)
in formula (13), T denotes the matrix transpose and d_{K_1} denotes the channel number of K_1;
the cross-modal attention feature F_att based on the image sequence passes through a point-by-point convolution layer to obtain the cross-modal feature F_cm based on the image sequence, see FIG. 8; F_cm is calculated as shown in formula (14):
F_cm = Conv_pw(F_att)    (14)
in formula (14), Conv_pw denotes a point-by-point convolution layer with a convolution kernel size of 1;
step (4.2), the second cross-modal time attention module:
the feature map F obtained in step (2.3) and the feature map M obtained in step (3.2) are input into the second cross-modal time attention module; the feature map M passes through a linear layer and a normalization layer to obtain the feature Q_2; the feature map F passes through two independent linear layers and two independent normalization layers to obtain the features K_2 and V_2; Q_2, K_2 and V_2 are calculated as shown in formulas (15), (16) and (17):
Q_2 = BN(Linear(M))    (15)
K_2 = BN(Linear(F))    (16)
V_2 = BN(Linear(F))    (17)
a learnable intermediate matrix LIM_2 is generated according to the time dimension sizes of Q_2 and K_2, and the intermediate matrix LIM_2 is initialized with random parameters; the feature Q_2, the intermediate matrix LIM_2 and the transpose of the feature K_2 are multiplied, divided by the square root of the channel number of K_2, and input into a softmax layer to obtain the normalized weight; the normalized weight is multiplied by the feature V_2 and then added to the feature Q_2 to obtain the cross-modal attention feature M_att based on the Mel cepstrum coefficient spectrogram; M_att is calculated as shown in formula (18):
M_att = softmax( (Q_2 · LIM_2 · K_2^T) / sqrt(d_{K_2}) ) · V_2 + Q_2    (18)
the cross-modal attention feature M_att based on the Mel cepstrum coefficient spectrogram passes through a point-by-point convolution layer to obtain the cross-modal feature M_cm based on the Mel cepstrum coefficient spectrogram, see FIG. 9; M_cm is calculated as shown in formula (19):
M_cm = Conv_pw(M_att)    (19)
step (4.3), pooling and splicing:
the cross-modal feature F_cm based on the image sequence obtained in step (4.1) and the cross-modal feature M_cm based on the Mel cepstrum coefficient spectrogram obtained in step (4.2) are respectively average-pooled and then spliced to obtain the feature f_FM with size 1 × 1 and 1024 channels; the pooling and splicing operation is shown in formula (20):
f_FM = Concat(AvgPool(F_cm), AvgPool(M_cm))    (20)
in formula (20), AvgPool denotes the average pooling operation;
Step (4.4), classification:
the feature f_FM obtained in step (4.3) is input into a linear layer and then a softmax layer to obtain the predicted probability distribution P{Y_1, Y_2, ..., Y_i, ..., Y_q} over the E emotion categories, where Y_i denotes the predicted probability distribution of the i-th video over the E emotion categories, denoted Y_i{y_{i1}, ..., y_{ie}, ..., y_{iE}}, and y_{ie} denotes the predicted probability that the i-th video belongs to the e-th emotion category;
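Continuing the sketch, steps (4.3)-(4.4) reduce each cross-modal feature over time, splice them into the 1024-dimensional feature f_FM and classify it; the number of emotion categories E is dataset-dependent, and the value 8 below is only an assumption:

```python
import torch
import torch.nn as nn

E = 8                                            # assumed number of emotion categories
classifier = nn.Linear(1024, E)

F_cm = torch.randn(2, 16, 512)                   # cross-modal visual feature from step (4.1)
M_cm = torch.randn(2, 19, 512)                   # cross-modal auditory feature from step (4.2)
f_FM = torch.cat([F_cm.mean(dim=1), M_cm.mean(dim=1)], dim=1)   # eq. (20): (B, 1024)
P = torch.softmax(classifier(f_FM), dim=1)       # predicted probability distribution over E classes
```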
fifthly, constructing the Focal Loss function fused with the dynamic weight, training the ConvMixer network combined with the adjacency matrix, the ResNet34 network and the feature fusion and classification network, and calculating the training loss through the Focal Loss function fused with the dynamic weight; extracting visual features from the image sequence with the trained ConvMixer network combined with the adjacency matrix, extracting auditory features from the Mel cepstrum coefficient spectrogram with the trained ResNet34 network, fusing the visual features and the auditory features with the trained feature fusion and classification network, performing emotion classification according to the fused features, and predicting the emotion category corresponding to each video;
the loss between the predicted probability distribution output in step (4.4) and the real emotion category is calculated according to formula (21):
Loss = -(1/q) \sum_{i=1}^{q} \alpha \log( y_{i\hat{e}_i} )    (21)
in formula (21), \alpha denotes the dynamic weight, log denotes the logarithmic function with base 2, \hat{e}_i denotes the real emotion category to which the i-th video belongs, and y_{i\hat{e}_i} denotes the predicted probability of that category;
the dynamic weight \alpha is calculated as:
\alpha = ( n_{\hat{e}_i} - C[\hat{e}_i][\hat{e}_i] ) / n_{\hat{e}_i}    (22)
in formula (22), C denotes the confusion matrix of the last training period, C[\hat{e}_i][t] denotes the number of times that videos whose real emotion category is \hat{e}_i were predicted as category t in the last training period, n_{\hat{e}_i} denotes the number of videos whose real emotion category is \hat{e}_i in the last training period, and C[\hat{e}_i][\hat{e}_i] denotes the number of times that videos whose real emotion category is \hat{e}_i were predicted as category \hat{e}_i;
construction of the confusion matrix C in formula (22): before each training period begins, an all-zero matrix of size E × E is generated, with rows and columns numbered from 1 to E; the all-zero matrix is updated according to the prediction of each sample during training to obtain the confusion matrix C; for example, when a sample of real emotion category 1 is predicted as category 2, 1 is added to the element in row 1, column 2 of the matrix; when the training period is completed, the confusion matrix C of the current period is used to calculate the dynamic weight α of the loss function in the next training period.
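The sketch below implements the dynamic-weight loss as reconstructed above, with the per-class weight α equal to the misclassification rate of the sample's true category in the previous training period's confusion matrix; since formulas (21)-(22) were recovered from the surrounding description, this exact weighting is an assumption rather than the patent's verbatim formula:

```python
import torch

def dynamic_weight_focal_loss(pred_probs, targets, prev_confusion):
    """pred_probs: (B, E) softmax outputs; targets: (B,) true class indices;
    prev_confusion: (E, E) confusion matrix of the previous training period."""
    per_class_total = prev_confusion.sum(dim=1).clamp(min=1)              # n_e
    alpha = 1.0 - prev_confusion.diagonal() / per_class_total             # eq. (22)
    p_true = pred_probs.gather(1, targets[:, None]).squeeze(1)            # y_{i, e_i}
    return -(alpha[targets] * torch.log2(p_true.clamp(min=1e-8))).mean()  # eq. (21)

def accumulate_confusion(confusion, pred_probs, targets):
    # rows: true category, columns: predicted category (updated during the current period)
    for t, p in zip(targets.tolist(), pred_probs.argmax(dim=1).tolist()):
        confusion[t, p] += 1
    return confusion
```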
In the fifth step, the batch size is 8, the number of iterations is set to 120, an Adam optimizer is adopted with an initial learning rate of 0.0001 and a momentum factor of 0.9, and the learning rate is reduced by 90% every 30 iterations.
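For the stated training hyper-parameters, an equivalent PyTorch setup could look as follows; the `model` placeholder stands for the combined audio-visual network and is an assumption:

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 8)   # placeholder for the combined audio-visual network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# reduce the learning rate by 90% every 30 iterations
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```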
Anything not described in detail in this specification belongs to the prior art.

Claims (3)

1. An audio-visual emotion classification method based on an improved ConvMixer network and dynamic focus loss is characterized by comprising the following steps of:
the method comprises the steps of firstly, collecting a video which expresses emotion and relates to a human face area, extracting an image sequence and an audio signal from the video, and converting the audio signal into a Mel cepstrum coefficient spectrogram;
secondly, constructing a ConvMixer network combined with an adjacency matrix, which comprises three parts of operations, namely a block embedding operation, a Layer module operation and an average pooling operation; inputting the image sequence into the ConvMixer network combined with the adjacency matrix to extract visual features and obtain a feature map F;
step (2.1), block embedding operation:
sequentially carrying out the block embedding operation on the image sequence through a convolution layer, an activation function layer and a normalization layer to obtain the feature map F_{2.1} output by the block embedding operation;
Step (2.2), Layer module operation, which comprises four cascaded Layer modules;
the feature map F_{2.1} is input into the first Layer module; according to the spatial size of the feature map F_{2.1}, a two-dimensional spatial coordinate matrix of each image block in F_{2.1} is constructed, and the two-dimensional spatial coordinate matrix is copied and spliced according to the temporal size of F_{2.1} to obtain a spatial position code with the same size as F_{2.1}; the feature map F_{2.1} is spliced with the spatial position code and passed through a linear layer to obtain the feature map \hat{F}_s;
a spatial adjacency matrix is randomly generated according to the spatial size of the feature map F_{2.1}; the feature map \hat{F}_s is multiplied by the spatial adjacency matrix and passed through an activation function layer and a normalization layer to obtain the feature map F_s; the feature maps \hat{F}_s and F_s are superimposed to obtain the feature map F_s';
according to the temporal size of the feature map F_s', a one-dimensional temporal coordinate matrix of each image block in F_s' is constructed, and the one-dimensional temporal coordinate matrix is copied and spliced according to the spatial size of F_s' to obtain a temporal position code with the same size as F_s'; the feature map F_s' is spliced with the temporal position code and passed through a linear layer to obtain the feature map \hat{F}_t;
a temporal adjacency matrix is randomly generated according to the temporal size of the feature map F_s'; the feature map \hat{F}_t is multiplied by the temporal adjacency matrix and passed through an activation function layer and a normalization layer to obtain the feature map F_t; the feature maps \hat{F}_t and F_t are superimposed to obtain the feature map F_t'; the feature map F_t' is passed sequentially through a point-by-point convolution layer, an activation function layer and a normalization layer to obtain the feature map output by the first Layer module;
step (2.3), average pooling operation:
carrying out a spatial-dimension average pooling operation on the feature map output by the fourth Layer module through an average pooling layer to obtain the feature map F;
thirdly, extracting auditory characteristics from the Mel cepstrum coefficient spectrogram by using a ResNet34 network to obtain a characteristic diagram M;
fourthly, constructing a feature fusion and classification network for fusing the visual features and the auditory features and classifying the emotion of each video according to the fused features; the feature fusion and classification network comprises two cross-modal time attention modules, pooling and splicing operation and classification operation;
step (4.1), the first cross-modal time attention module:
inputting the feature map F and the feature map M into the first cross-modal time attention module; the feature map F passes through a linear layer and a normalization layer to obtain the feature Q_1; the feature map M passes through two independent linear layers and two independent normalization layers to obtain the features K_1 and V_1 respectively;
a learnable intermediate matrix LIM_1 is generated according to the time dimension sizes of Q_1 and K_1, and the intermediate matrix LIM_1 is initialized with random parameters; the feature Q_1, the initialized intermediate matrix LIM_1 and the transpose of the feature K_1 are multiplied, divided by the square root of the channel number of K_1, and input into a softmax layer to obtain the normalized weight; the normalized weight is multiplied by the feature V_1 and then added to the feature Q_1 to obtain the cross-modal attention feature F_att based on the image sequence; the cross-modal attention feature F_att based on the image sequence passes through a point-by-point convolution layer to obtain the cross-modal feature F_cm based on the image sequence;
Step (4.2), the second cross-modal time attention module:
inputting the feature map F and the feature map M into the second cross-modal time attention module; the feature map M passes through a linear layer and a normalization layer to obtain the feature Q_2; the feature map F passes through two independent linear layers and two independent normalization layers to obtain the features K_2 and V_2;
a learnable intermediate matrix LIM_2 is generated according to the time dimension sizes of Q_2 and K_2, and the intermediate matrix LIM_2 is initialized with random parameters; the feature Q_2, the intermediate matrix LIM_2 and the transpose of the feature K_2 are multiplied, divided by the square root of the channel number of K_2, and input into a softmax layer to obtain the normalized weight; the normalized weight is multiplied by the feature V_2 and then added to the feature Q_2 to obtain the cross-modal attention feature M_att based on the Mel cepstrum coefficient spectrogram; the cross-modal attention feature M_att based on the Mel cepstrum coefficient spectrogram passes through a point-by-point convolution layer to obtain the cross-modal feature M_cm based on the Mel cepstrum coefficient spectrogram;
Step (4.3), pooling and splicing:
the cross-modal feature F_cm based on the image sequence and the cross-modal feature M_cm based on the Mel cepstrum coefficient spectrogram are respectively average-pooled and then spliced to obtain the feature f_FM;
Step (4.4), classification:
the feature f_FM is input into a linear layer and then a softmax layer to obtain the predicted probability distribution P{Y_1, Y_2, ..., Y_i, ..., Y_q} over the E emotion categories, where Y_i denotes the predicted probability distribution of the i-th video over the E emotion categories, denoted Y_i{y_{i1}, ..., y_{ie}, ..., y_{iE}}, y_{ie} denotes the predicted probability that the i-th video belongs to the e-th emotion category, and q denotes the number of videos;
fifthly, training the ConvMixer network combined with the adjacency matrix, the ResNet34 network and the feature fusion and classification network, and calculating the training loss through the focal loss function fused with the dynamic weight; extracting visual features from the image sequence with the trained ConvMixer network combined with the adjacency matrix, extracting auditory features from the Mel cepstrum coefficient spectrogram with the trained ResNet34 network, fusing the visual features and the auditory features with the trained feature fusion and classification network, performing emotion classification according to the fused features, and predicting the emotion category corresponding to each video.
2. The method of claim 1, wherein the focal loss function fused with the dynamic weight is:
Loss = -(1/q) \sum_{i=1}^{q} \alpha \log( y_{i\hat{e}_i} )    (21)
in formula (21), \alpha denotes the dynamic weight, log denotes the logarithmic function with base 2, \hat{e}_i denotes the real emotion category to which the i-th video belongs, and y_{i\hat{e}_i} denotes the predicted probability of that category;
the dynamic weight \alpha is calculated as:
\alpha = ( n_{\hat{e}_i} - C[\hat{e}_i][\hat{e}_i] ) / n_{\hat{e}_i}    (22)
in formula (22), C denotes the confusion matrix of the last training period, C[\hat{e}_i][t] denotes the number of times that videos whose real emotion category is \hat{e}_i were predicted as category t in the last training period, n_{\hat{e}_i} denotes the number of videos whose real emotion category is \hat{e}_i in the last training period, and C[\hat{e}_i][\hat{e}_i] denotes the number of times that videos whose real emotion category is \hat{e}_i were predicted as category \hat{e}_i.
3. The audio-visual emotion classification method based on the improved ConvMixer network and dynamic focus loss according to claim 2, wherein the confusion matrix C is constructed as follows: before each training period begins, an all-zero matrix of size E × E is generated, with rows and columns numbered from 1 to E; according to the prediction of each sample during training, when a sample of real emotion category 1 is predicted as category 2, 1 is added to the element in row 1, column 2 of the matrix, and the remaining samples update the all-zero matrix in the same way to obtain the confusion matrix C; when the training period is completed, the confusion matrix of the current period is used to calculate the dynamic weight of the loss function in the next training period.
CN202211015781.2A 2022-08-24 2022-08-24 Audio-visual emotion classification method based on improved ConvMixer network and dynamic focus loss Pending CN115346261A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211015781.2A CN115346261A (en) 2022-08-24 2022-08-24 Audio-visual emotion classification method based on improved ConvMixer network and dynamic focus loss

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211015781.2A CN115346261A (en) 2022-08-24 2022-08-24 Audio-visual emotion classification method based on improved ConvMixer network and dynamic focus loss

Publications (1)

Publication Number Publication Date
CN115346261A true CN115346261A (en) 2022-11-15

Family

ID=83953410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211015781.2A Pending CN115346261A (en) 2022-08-24 2022-08-24 Audio-visual emotion classification method based on improved ConvMixer network and dynamic focus loss

Country Status (1)

Country Link
CN (1) CN115346261A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116842445A (en) * 2023-07-03 2023-10-03 山东科技大学 Method and system for automatically recognizing awakening based on multi-mode space-time spectrum fusion
CN116842445B (en) * 2023-07-03 2024-06-11 山东科技大学 Method and system for automatically recognizing awakening based on multi-mode space-time spectrum fusion
CN117292442A (en) * 2023-10-13 2023-12-26 中国科学技术大学先进技术研究院 Cross-mode and cross-domain universal face counterfeiting positioning method
CN117292442B (en) * 2023-10-13 2024-03-26 中国科学技术大学先进技术研究院 Cross-mode and cross-domain universal face counterfeiting positioning method

Similar Documents

Publication Publication Date Title
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
CN110751208B (en) Criminal emotion recognition method for multi-mode feature fusion based on self-weight differential encoder
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
Mroueh et al. Deep multimodal learning for audio-visual speech recognition
CN110516536B (en) Weak supervision video behavior detection method based on time sequence class activation graph complementation
CN108346436B (en) Voice emotion detection method and device, computer equipment and storage medium
CN115346261A (en) Audio-visual emotion classification method based on improved ConvMixer network and dynamic focus loss
CN112329794B (en) Image description method based on dual self-attention mechanism
CN111161715B (en) Specific sound event retrieval and positioning method based on sequence classification
CN113792177B (en) Scene character visual question-answering method based on knowledge-guided deep attention network
CN111488487B (en) Advertisement detection method and detection system for all-media data
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN117149944B (en) Multi-mode situation emotion recognition method and system based on wide time range
Oghbaie et al. Advances and challenges in deep lip reading
CN111680602A (en) Pedestrian re-identification method based on double-flow hierarchical feature correction and model architecture
CN117037017A (en) Video emotion detection method based on key frame erasure
CN115858728A (en) Multi-mode data based emotion analysis method
CN113177112B (en) Neural network visual conversation device and method based on KR product fusion multi-mode information
Vayadande et al. Lipreadnet: A deep learning approach to lip reading
CN115294353A (en) Crowd scene image subtitle description method based on multi-layer attribute guidance
WO2021147084A1 (en) Systems and methods for emotion recognition in user-generated video(ugv)
CN117576279B (en) Digital person driving method and system based on multi-mode data
CN112926662B (en) Target detection method based on multi-scale language embedded REC
CN112765955B (en) Cross-modal instance segmentation method under Chinese finger representation
Kulkarni Integration of Audio video Speech Recognition using LSTM and Feed Forward Convolutional Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination