CN115346261A - Audio-visual emotion classification method based on improved ConvMixer network and dynamic focus loss - Google Patents
- Publication number: CN115346261A (application CN202211015781.2A)
- Authority
- CN
- China
- Prior art keywords
- feature
- layer
- network
- characteristic
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/161 — Human faces: detection; localisation; normalisation
- G06N3/08 — Neural networks: learning methods
- G06V10/7715 — Feature extraction, e.g. by transforming the feature space
- G06V10/806 — Fusion of extracted features
- G06V10/82 — Image or video recognition using neural networks
- G06V40/168 — Feature extraction; face representation
- G06V40/174 — Facial expression recognition
Abstract
The invention provides an audio-visual emotion classification method based on an improved ConvMixer network and a dynamic focal loss, comprising the following steps: 1) collecting videos of human face regions that express emotion, extracting image sequences and audio signals from the videos, and converting the audio signals into Mel cepstrum coefficient spectrograms; 2) constructing a ConvMixer network combined with adjacency matrices and using it to extract visual features from the image sequences; 3) extracting auditory features from the Mel cepstrum coefficient spectrograms with a ResNet34 network; 4) constructing a feature fusion and classification network that fuses the visual and auditory features and classifies the emotion of each video according to the fused features; 5) training the networks, computing the training loss through a focal loss function fused with dynamic weights. The method addresses two shortcomings of existing approaches: they emphasize local features of video frames while ignoring global features, and their loss functions do not make the model attend to hard-to-classify samples.
Description
Technical Field
The invention belongs to the technical field of audio-visual emotion classification, and particularly relates to an audio-visual emotion classification method based on an improved ConvMixer network and dynamic focus loss.
Background
With the popularization of the internet and computers, human-computer interaction occurs ever more frequently. Audio-visual emotion classification measures and analyzes the external expressions of humans to infer their emotional state; applied to modern human-computer interaction systems, it makes the interaction more natural and friendly and improves the user experience.
Audio-visual emotion classification based on deep learning requires no manual feature engineering grounded in expert knowledge, outperforms traditional approaches, and achieves good results when applied to audio-visual emotion classification.
In the paper "End-to-End Multimodal Emotion Recognition Using Deep Neural Networks", published by Trigeorgis et al. in the IEEE Journal of Selected Topics in Signal Processing in 2017, one-dimensional and two-dimensional convolutional neural networks (CNNs) extract features from the audio and video signals respectively; the audio and video features are concatenated and fed into a recurrent neural network (RNN) for emotion analysis. In 2018, Zhang et al. published "Learning Affective Features with a Hybrid Deep Model for Audio-Visual Emotion Recognition" in IEEE Transactions on Circuits and Systems for Video Technology. They constructed a two-stream network of three-dimensional and two-dimensional CNNs to extract features from video frame sequences and audio spectrograms segmented into fixed-length time periods, fused the corresponding video and audio features through a deep belief network (DBN), and finally average-pooled the features of all time periods into a global feature for classification. However, because the video and audio features of all time periods pass through the same two-stream network, features cannot be extracted in a manner targeted to each period; and because average pooling is used to fuse the features of all periods, the temporal information of the original video is not highlighted.
CN114724224A discloses a multi-modal emotion recognition method for a medical care robot. It extracts expression and action self-attention emotion features from the video information, and voice and text self-attention emotion features from the audio information; the four self-attention features are fused through a mutual-attention mechanism into a complete multi-modal emotion feature; a graph convolutional neural network then extracts context emotion features, yielding multi-modal emotion features containing context information, which are classified to obtain emotion labels. However, the four self-attention features are first spliced into a multi-modal feature before the mutual-attention fusion; this splicing increases the computational cost of the subsequent mutual-attention mechanism, ignores the temporal correlations among the four modalities, and increases the feature dimensionality.
CN114582372A discloses a multi-modal emotion feature recognition method for judging the emotion of a driver. Visual and voice information are input into a visual facial-expression feature recognition model and a voice emotion feature recognition model respectively, producing a visual feature vector and a voice feature vector; the two features are input into a bimodal emotion feature recognition model to obtain an emotion recognition result fused at the decision level. However, the extracted voice feature vector consists of statistical features computed from the Mel cepstrum coefficient spectrogram of the audio, which ignores the emotion-related energy variation in the audio. CN113989893A discloses a child emotion recognition algorithm based on the expression and voice modalities: a semantic feature space is constructed from the emotion label information of voice and expression features, local and global audio-visual features are extracted by a multi-scale feature extraction method and projected into the semantic feature space, and the features contributing most to emotion classification are selected for emotion judgment and recognition.
Disclosure of Invention
In view of the shortcomings of the prior art, the technical problem to be solved by the invention is to provide an audio-visual emotion classification method based on an improved ConvMixer network and a dynamic focal loss. Visual features are extracted from the image sequence by a ConvMixer network combined with adjacency matrices; auditory features are extracted from the Mel cepstrum coefficient spectrogram by a ResNet34 network; the cross-modal temporal attention modules of a feature fusion and classification network fuse the visual and auditory features, and the fused features determine the emotion category of the video. The network model is trained with a Focal Loss function fused with dynamic weights, which optimizes the model parameters and improves the recognition rate on hard-to-classify samples. The invention solves the problems that existing audio-visual bimodal emotion recognition methods focus on extracting local features of video frames while ignoring global features, use overly simple bimodal fusion methods, and employ loss functions that do not make the model attend to hard-to-classify samples.
The technical scheme adopted by the invention for solving the technical problem is as follows:
an audio-visual emotion classification method based on an improved ConvMixer network and dynamic focus loss comprises the following steps:
the method comprises the steps of firstly, collecting videos expressing emotion and related to human face regions, extracting image sequences and audio signals from the videos, and converting the audio signals into a Mel cepstrum coefficient spectrogram;
secondly, constructing a ConvMixer network combined with adjacency matrices, which comprises three operations: block embedding, Layer modules, and average pooling; inputting the image sequence into the ConvMixer network combined with adjacency matrices to extract visual features, obtaining a feature map F;
step (2.1), block embedding operation:
sequentially passing the image sequence through a convolution layer, an activation function layer and a normalization layer to obtain the feature map F_2.1 output by the block embedding operation;
Step (2.2), Layer module operation, which comprises four cascaded Layer modules;
inputting the feature map F_2.1 into the first Layer module; constructing, according to the spatial size of F_2.1, a two-dimensional spatial coordinate matrix of each image block in F_2.1, and copying and splicing this matrix along the temporal size of F_2.1 to obtain a spatial position code of the same size as F_2.1; splicing F_2.1 with the spatial position code and passing the result through a linear layer to obtain the feature map F̂_s; randomly generating a spatial adjacency matrix according to the spatial dimension of F_2.1, multiplying F̂_s by the spatial adjacency matrix, and passing the product through an activation function layer and a normalization layer to obtain the feature map F_s; superimposing F̂_s and F_s to obtain the feature map F_s';
constructing, according to the temporal dimension of F_s', a one-dimensional temporal coordinate matrix of each image block in F_s', and copying and splicing this matrix along the spatial size of F_s' to obtain a temporal position code of the same size as F_s'; splicing F_s' with the temporal position code and passing the result through a linear layer to obtain the feature map F̂_t; randomly generating a temporal adjacency matrix according to the temporal dimension of F_s', multiplying F̂_t by the temporal adjacency matrix, and passing the product through an activation function layer and a normalization layer to obtain the feature map F_t; superimposing F̂_t and F_t to obtain the feature map F_t'; passing F_t' sequentially through a point-wise convolution layer, an activation function layer and a normalization layer to obtain the feature map output by the first Layer module;
step (2.3), average pooling operation:
performing spatial-dimension average pooling on the feature map output by the fourth Layer module through an average pooling layer to obtain the feature map F;
thirdly, extracting auditory features from the Mel cepstrum coefficient spectrogram by using a ResNet34 network to obtain a feature map M;
fourthly, constructing a feature fusion and classification network for fusing the visual features and the auditory features and carrying out emotion classification on each video according to the fused features; the feature fusion and classification network comprises two cross-modal time attention modules, pooling and splicing operation and classification operation;
step (4.1), the first cross-modal time attention module:
inputting the feature maps F and M into the first cross-modal temporal attention module; the feature map F passes through a linear layer and a normalization layer to obtain the feature Q_1; the feature map M passes through two independent linear layers and two independent normalization layers to obtain the features K_1 and V_1;
generating a learnable intermediate matrix LIM_1 according to the time dimension sizes of Q_1 and K_1, and initializing LIM_1 with random parameters; multiplying the feature Q_1, the initialized intermediate matrix LIM_1 and the transpose of the feature K_1, dividing by the square root of the channel number of K_1, and inputting the result into a softmax layer to obtain normalized weights; multiplying the normalized weights by the feature V_1 and adding the feature Q_1 to obtain the cross-modal attention feature F_att based on the image sequence; passing F_att through a point-wise convolution layer to obtain the cross-modal feature F_cm based on the image sequence;
Step (4.2), the second cross-modal time attention module:
inputting the feature maps F and M into the second cross-modal temporal attention module; the feature map M passes through a linear layer and a normalization layer to obtain the feature Q_2; the feature map F passes through two independent linear layers and two independent normalization layers to obtain the features K_2 and V_2;
generating a learnable intermediate matrix LIM_2 according to the time dimension sizes of Q_2 and K_2, and initializing LIM_2 with random parameters; multiplying the feature Q_2, the intermediate matrix LIM_2 and the transpose of the feature K_2, dividing by the square root of the channel number of K_2, and inputting the result into a softmax layer to obtain normalized weights; multiplying the normalized weights by the feature V_2 and adding the feature Q_2 to obtain the cross-modal attention feature M_att based on the Mel cepstrum coefficient spectrogram; passing M_att through a point-wise convolution layer to obtain the cross-modal feature M_cm based on the Mel cepstrum coefficient spectrogram;
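As an illustration of the cross-modal temporal attention computation described in steps (4.1)-(4.2), the sketch below implements one plausible reading in NumPy. The shapes, the placement of the learnable intermediate matrix LIM inside the Q·LIM·Kᵀ product, and all names are assumptions for illustration rather than the patent's exact implementation; in particular, LIM is taken here as a d × d matrix so that the matrix chain is defined.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_temporal_attention(Q, K, V, LIM):
    # Q: (Tq, d) query features from one modality
    # K, V: (Tk, d) key/value features from the other modality
    # LIM: (d, d) learnable intermediate matrix (shape is an assumption)
    d = K.shape[-1]
    scores = Q @ LIM @ K.T / np.sqrt(d)   # (Tq, Tk) cross-modal similarities
    weights = softmax(scores, axis=-1)    # normalized over the key time axis
    return weights @ V + Q                # weighted values plus residual Q

rng = np.random.default_rng(0)
Tq, Tk, d = 16, 148, 64                   # toy time lengths and channel dim
Q = rng.standard_normal((Tq, d))
K = rng.standard_normal((Tk, d))
V = rng.standard_normal((Tk, d))
LIM = rng.standard_normal((d, d)) * 0.1   # randomly initialized, learned in training
F_att = cross_modal_temporal_attention(Q, K, V, LIM)
```

A point-wise convolution (per-time-step linear map) would then turn F_att into the cross-modal feature F_cm; swapping which modality supplies Q versus K and V gives the second module.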
Step (4.3), pooling and splicing:
performing average pooling on the cross-modal feature F_cm based on the image sequence and the cross-modal feature M_cm based on the Mel cepstrum coefficient spectrogram respectively, and then splicing them to obtain the feature f_FM;
And (4.4) classifying:
inputting the feature f_FM into a linear layer and then a softmax layer to obtain the predicted probability distributions over E emotion categories, P{Y_1, Y_2, ..., Y_i, ..., Y_q}, where Y_i denotes the predicted probability distribution of the ith video over the E emotion categories, written Y_i{y_i1, ..., y_ie, ..., y_iE}; y_ie denotes the predicted probability that the ith video belongs to the eth emotion category, and q denotes the number of videos;
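The classification step amounts to a linear map followed by a softmax over the E emotion categories. A minimal NumPy sketch, in which all sizes and weights are illustrative placeholders:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
q, d, E = 4, 1024, 8                    # videos, fused feature dim, classes (toy)
f_FM = rng.standard_normal((q, d))      # fused features, one row per video
W = rng.standard_normal((d, E)) * 0.01  # linear-layer weights (placeholder)
b = np.zeros(E)

P = softmax(f_FM @ W + b)               # P[i] is Y_i, the distribution of video i
pred = P.argmax(axis=1)                 # predicted emotion category per video
```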
fifthly, training the ConvMixer network combined with adjacency matrices, the ResNet34 network, and the feature fusion and classification network, computing the training loss through the focal loss function fused with dynamic weights; the trained ConvMixer network combined with adjacency matrices extracts visual features from the image sequence, the trained ResNet34 network extracts auditory features from the Mel cepstrum coefficient spectrogram, and the trained feature fusion and classification network fuses the visual and auditory features, performs emotion classification according to the fused features, and predicts the emotion category of each video.
Further, the focus loss function of the fusion dynamic weight is:
in the formula (21), α represents a dynamic weight, log represents a logarithmic function with base 2,representing the real emotion category to which the ith video belongsA predicted probability distribution of;
the calculation formula of the dynamic weight α is as follows:
in equation (22), C represents the confusion matrix for the last training cycle,representing the real emotion classification in the last training periodIs predicted as the number of times of the category t,representing the true emotion classification in the last training periodThe number of the videos of (a) is,representing the true emotion categoriesInto categoriesThe number of times of (c).
Further, the confusion matrix C is constructed as follows: before each training cycle begins, an all-zero matrix of size E × E is generated, with rows and columns indexed from 0 to E-1. According to the prediction of each sample during training, when a sample whose real emotion category is 0 is predicted as category 2, the element in row 0, column 2 of the matrix is incremented by 1; the remaining samples update the matrix in the same way, yielding the confusion matrix C. When the training cycle is completed, the confusion matrix of the current cycle is used to compute the dynamic weights of the loss function in the next training cycle.
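The confusion-matrix bookkeeping and the dynamic weighting can be sketched in NumPy as follows. The focusing exponent gamma and the exact form of the weighted focal loss are assumptions (the standard focal loss with its static weight replaced by a confusion-matrix-derived α); function names are illustrative.

```python
import numpy as np

def update_confusion(C, true_labels, pred_labels):
    # Accumulate one cycle's predictions: C[t, p] counts samples of
    # real category t predicted as category p.
    for t, p in zip(true_labels, pred_labels):
        C[t, p] += 1
    return C

def dynamic_weight(C, c):
    # alpha = fraction of last-cycle samples of real category c that were
    # misclassified (one plausible reading of the dynamic weight).
    n_c = C[c].sum()
    return 1.0 if n_c == 0 else 1.0 - C[c, c] / n_c

def dynamic_focal_loss(p_true, alpha, gamma=2.0):
    # Focal loss with dynamic weight; gamma is the usual focusing
    # parameter of focal loss (assumed, not specified in the text).
    return -alpha * (1.0 - p_true) ** gamma * np.log2(p_true)

E = 3
C = np.zeros((E, E), dtype=int)                        # E x E confusion matrix
C = update_confusion(C, [0, 0, 0, 1, 2], [0, 2, 0, 1, 2])
alpha0 = dynamic_weight(C, 0)                           # 1 of 3 class-0 samples wrong
loss = dynamic_focal_loss(0.8, alpha0)
```

Under this reading, a sample's loss is scaled up whenever its true category was frequently confused in the previous cycle, which is what pushes the model toward the hard-to-classify samples.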
Compared with the prior art, the invention has the prominent substantive characteristics and remarkable progress as follows:
(1) In the ConvMixer combined with adjacency matrices (Adjacency-Matrix-based ConvMixer, ACAM for short), the Layer module operation expands the receptive field of the network by means of the spatial and temporal adjacency matrices, so that the network can simultaneously extract global and local spatial and temporal features of the image sequence. To capture the emotion-related temporal correlation between the image sequence and the Mel cepstrum coefficient spectrogram, the Cross-Modal Temporal Attention Module (CMTAM) proposed by the invention relates the temporal features of video and audio, which have different time scales, through a learnable intermediate matrix, by which the CMTAM learns the ability to capture cross-modal temporal correlation. The Focal Loss function fused with dynamic weights proposed by the invention adjusts the loss value of each training sample through the dynamic weight, so that the model pays more attention to misclassified samples during optimization, improving its generalization ability.
(2) CN114694076A discloses a multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion, which is used for splicing and fusing extracted single-modal characteristics and using a cross entropy loss function as an optimization target of a model. Compared with CN114694076A, the cross-modal time attention module with the learnable intermediate matrix captures cross-modal time correlation, extracts cross-modal time attention features and improves the identification effect of the model.
(3) CN114582372A discloses a multi-modal emotion feature recognition method for judging emotion of a driver, which inputs a visual feature vector and a voice feature vector into a bimodal emotion feature recognition model to obtain an emotion recognition result of decision-level fusion. Compared with CN114582372A, the invention adopts end-to-end training method, and the audio and video feature extraction network trains simultaneously, thus saving the model training time; in addition, the invention not only extracts the spatial information of the video image sequence, but also extracts the time characteristics, thereby ensuring the applicability of the algorithm in different scenes.
(4) The method adopts the deep learning approach. Traditional detection methods extract only low-level spatial features frame by frame from the video image sequence and cannot extract its temporal features, whereas deep learning can extract high-level semantic features and better represent the images.
Drawings
FIG. 1 is a flow chart of the training phase of the present invention;
FIG. 2 is a flow chart of the classification phase of the present invention;
FIG. 3 is a schematic diagram of the block embedding operation in the process of constructing ConvMixer network combined with adjacency matrix according to the present invention;
FIG. 4 is a schematic diagram of Layer module operation and average pooling operation in constructing ConvMixer networks incorporating adjacency matrices according to the present invention;
FIG. 5 is a schematic diagram of the shallow feature extraction operation in the ResNet34 network of the present invention;
FIG. 6 is a schematic diagram of the first through third, fifth through seventh, ninth through thirteenth, and fifteenth through sixteenth residual modules in the ResNet34 network;
FIG. 7 is a schematic diagram of a fourth, eighth, and fourteenth residual module in a ResNet34 network;
FIG. 8 is a schematic diagram of a first cross-modality time attention module in the process of building a feature fusion and classification network;
FIG. 9 is a schematic diagram of a second cross-modality time attention module in the process of constructing a feature fusion and classification network;
FIG. 10 is a schematic illustration of pooling and stitching, sorting operations in constructing a feature fusion and sorting network;
FIG. 11 is a schematic diagram of the construction of the Focal Loss function fused with dynamic weights according to the present invention.
Detailed Description
The technical solutions of the present invention are described in detail below with reference to the accompanying drawings and the detailed description, but the scope of the present invention is not limited thereto.
The invention is an audio-visual emotion classification method based on an improved ConvMixer network and a dynamic focal loss (see FIGS. 1-11). It extracts visual features from the image sequence with a ConvMixer network combined with adjacency matrices, extracts auditory features from the audio Mel cepstrum coefficient spectrogram with a ResNet34 network, fuses the audio-visual features through cross-modal temporal attention modules, and combines a confusion matrix to obtain a Focal Loss function with dynamic weights as the optimization objective of the network model; the specific steps are as follows:
the method comprises the steps of firstly, collecting a video which expresses emotion and relates to a human face area, extracting an image sequence and an audio signal from the video, and converting the audio signal into a Mel cepstrum coefficient spectrogram;
step (1.1), extracting images from the video, and converting the video into an image sequence;
a group of video sequences consists of several videos; images are extracted from the videos using OpenCV, and the images extracted from each video form an image sequence, so the image data set is a set of image sequences, denoted T{V_1, V_2, ..., V_i, ..., V_q}, where V_i denotes the image sequence corresponding to the ith video and q denotes the number of videos. Each image sequence comprises N images; for example, N images are extracted from the ith video, i.e. the ith image sequence is denoted V_i{v_i1, v_i2, ..., v_id, ..., v_iN}, where N is 64 and v_id denotes the dth image extracted from the ith video. A normalization operation is performed on the obtained image sequences and each frame is resized to 112 × 112 pixels, so the size of each image sequence is 64 × 112 × 112;
step (1.2), separating an audio signal from the video, and converting the audio signal into a Mel cepstrum coefficient spectrogram;
the audio signal is separated from the video using Librosa, and a Mel cepstrum coefficient spectrogram with 32 frequency bins is extracted; the Mel cepstrum coefficient spectrogram corresponding to the ith video is denoted A_i{a_i1, a_i2, ..., a_id, ..., a_iN}, where a_id denotes the Mel cepstrum coefficients of the dth time segment of the spectrogram extracted from the ith video, and the set of Mel cepstrum coefficient spectrograms for the whole data set is M{A_1, A_2, ..., A_i, ..., A_q};
Secondly, constructing a ConvMixer network combined with an adjacency matrix, wherein the ConvMixer network comprises three parts of operations, namely block embedding operation, layer module operation and average pooling operation; inputting the image sequence into a ConvMixer network combined with an adjacency matrix to extract visual features, and obtaining a feature map F;
and (2.1) performing block embedding operation:
sequentially passing the image sequence obtained in step (1.1), of size 64 × 112 × 112 with 3 channels, through a convolution layer, an activation function layer and a normalization layer to obtain the feature map F_2.1 of size 16 × 16 × 16 with 512 channels, see FIG. 3; the block embedding operation is shown in equation (1):
F_2.1 = BN(GELU(Conv_{c_in, h}^{4×7×7}(F_in)))    (1)
in equation (1), F_in denotes the input of the block embedding operation, Conv_{c_in, h}^{4×7×7} denotes the convolution layer whose stride and convolution kernel size are both 4 × 7 × 7, c_in and h denote the numbers of input and output channels of the convolution layer respectively, GELU denotes the activation function layer, and BN denotes the normalization layer;
step (2.2), Layer module operation, which comprises four cascaded Layer modules;
inputting the feature map F_2.1 obtained in step (2.1) into the first Layer module; constructing, according to the spatial size of F_2.1, a two-dimensional spatial coordinate matrix of each image block (patch) in F_2.1, and copying and splicing this matrix along the temporal size of F_2.1 to obtain a spatial position code of the same size as F_2.1; splicing F_2.1 with the spatial position code and passing the result through a linear layer to obtain the feature map F̂_s; randomly generating a spatial adjacency matrix according to the spatial dimension of F_2.1, multiplying F̂_s by the spatial adjacency matrix, and passing the product through an activation function layer and a normalization layer to obtain the feature map F_s; superimposing F̂_s and F_s to obtain the feature map F_s';
constructing, according to the temporal dimension of F_s', a one-dimensional temporal coordinate matrix of each image block in F_s', and copying and splicing this matrix along the spatial size of F_s' to obtain a temporal position code of the same size as F_s'; splicing F_s' with the temporal position code and passing the result through a linear layer to obtain the feature map F̂_t; randomly generating a temporal adjacency matrix according to the temporal dimension of F_s', multiplying F̂_t by the temporal adjacency matrix, and passing the product through an activation function layer and a normalization layer to obtain the feature map F_t; superimposing F̂_t and F_t to obtain the feature map F_t'; passing F_t' sequentially through a point-wise convolution layer, an activation function layer and a normalization layer to obtain the feature map F_2.2.1; inputting F_2.2.1 into the second Layer module, which outputs the feature map F_2.2.2; inputting F_2.2.2 into the third Layer module, which outputs the feature map F_2.2.3; inputting F_2.2.3 into the fourth Layer module, which outputs the feature map F_2.2.4, see FIG. 4;
Each Layer module performs the operations shown in equations (2)-(6):

F' = Linear(Concat(F_0, SPC))    (2)

In equation (2), F_0 represents the input of the Layer module, SPC represents the spatial position code, Concat represents the splicing operation, and Linear represents the linear layer;

F_s' = F' + BN(GELU(F' × SAM))    (3)

In equation (3), SAM represents the spatial adjacency matrix, GELU represents the activation function layer, and BN represents the normalization layer;

F'' = Linear(Concat(F_s', TPC))    (4)

In equation (4), TPC represents the temporal position code;

F_t' = F'' + BN(GELU(F'' × TAM))    (5)

In equation (5), TAM represents the temporal adjacency matrix;

F_out = BN(GELU(Conv_pw(F_t')))    (6)

In equation (6), Conv_pw represents a point-by-point convolution layer with a convolution kernel size of 1 × 1 × 1;
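As an illustration, the Layer module described above can be sketched as follows. This is a minimal sketch, not the patented implementation: the tensor layout (T frames × N patches × C channels, N a perfect square), the use of LayerNorm for the normalization layers, and the exact construction of the position codes are assumptions.

```python
import torch
import torch.nn as nn

class LayerModule(nn.Module):
    """Sketch of one Layer module of the adjacency-matrix ConvMixer.
    Layout (T, N, C) and LayerNorm in place of the normalization
    layers are assumptions not fixed by the text."""

    def __init__(self, T: int, N: int, C: int):
        super().__init__()
        self.lin_s = nn.Linear(C + 2, C)            # fuse 2-D spatial position code
        self.lin_t = nn.Linear(C + 1, C)            # fuse 1-D temporal position code
        self.sam = nn.Parameter(torch.randn(N, N))  # randomly generated spatial adjacency matrix
        self.tam = nn.Parameter(torch.randn(T, T))  # randomly generated temporal adjacency matrix
        self.norm_s, self.norm_t, self.norm_o = (nn.LayerNorm(C) for _ in range(3))
        self.pw = nn.Linear(C, C)                   # point-by-point (1x1x1) convolution
        self.act = nn.GELU()
        # position codes: 2-D grid coordinate per patch, 1-D index per frame
        side = int(N ** 0.5)                        # assumes N is a perfect square
        ys, xs = torch.meshgrid(torch.arange(side), torch.arange(side), indexing="ij")
        self.register_buffer("spc", torch.stack([ys, xs], -1).reshape(N, 2).float())
        self.register_buffer("tpc", torch.arange(T).float().view(T, 1))

    def forward(self, f0: torch.Tensor) -> torch.Tensor:  # f0: (T, N, C)
        T, N, _ = f0.shape
        f1 = self.lin_s(torch.cat([f0, self.spc.expand(T, N, 2)], -1))                   # eq. (2)
        fs = f1 + self.norm_s(self.act(torch.einsum("tnc,nm->tmc", f1, self.sam)))       # eq. (3)
        f2 = self.lin_t(torch.cat([fs, self.tpc.unsqueeze(1).expand(T, N, 1)], -1))      # eq. (4)
        ft = f2 + self.norm_t(self.act(torch.einsum("tnc,ts->snc", f2, self.tam)))       # eq. (5)
        return self.norm_o(self.act(self.pw(ft)))                                        # eq. (6)
```

A call with 4 frames of 16 patches and 8 channels returns a tensor of the same (T, N, C) shape, matching the cascade of four Layer modules in step (2.2).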
step (2.3), average pooling operation:
The feature map F_2.2.4 of size 16 × 16 and channel number 512 obtained in step (2.2) is subjected to average pooling over the spatial dimension through the average pooling layer, obtaining a feature map F of size 16 × 1 and channel number 512;
thereby constructing a ConvMixer network incorporating an adjacency matrix;
Thirdly, a ResNet34 network is constructed, comprising a shallow feature extraction layer and sixteen residual modules, where the output of each residual module is the input of the next residual module; the Mel cepstrum coefficient spectrogram is input into the ResNet34 network to extract auditory features, obtaining a feature map M;
and (3.1) shallow feature extraction operation:
The Mel cepstrum coefficient spectrogram obtained in step (1.2), of size 32 × 590 and channel number 1, is input into the ResNet34 network; it passes through the shallow feature extraction layer, composed of a convolution layer, a normalization layer and an activation function layer, to obtain a feature map M_3.1 of size 8 × 148 and channel number 64, see FIG. 5; the shallow feature extraction operation is shown in equation (7):
M_3.1 = RELU(BN(Conv_{1,64,7,2,3}(M_in)))    (7)

In equation (7), M_in represents the input of the ResNet34 network, M_3.1 represents the output of the shallow feature extraction layer, and Conv_{1,64,7,2,3} represents a convolution layer with 1 input channel, 64 output channels, a convolution kernel size of 7, a stride of 2 and an edge fill size of 3;
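Under equation (7) alone, a stride-2 convolution would map the 32 × 590 input to 16 × 295; the stated 8 × 148 output is reproduced if the shallow feature extraction layer also contains the 3 × 3, stride-2 max-pooling of the standard ResNet34 stem. The following sketch makes that assumption explicit:

```python
import torch
import torch.nn as nn

# Shallow feature extraction of equation (7). The trailing max-pooling
# layer is an assumption borrowed from the standard ResNet34 stem:
# it is what turns the 16 x 295 convolution output into the stated 8 x 148.
stem = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),  # Conv_{1,64,7,2,3}
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),      # assumed, as in torchvision's ResNet
)

x = torch.randn(1, 1, 32, 590)   # Mel cepstrum coefficient spectrogram, channel number 1
m31 = stem(x)
print(tuple(m31.shape))          # (1, 64, 8, 148)
```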
and (3.2) deep feature extraction operation:
The feature map M_3.1 obtained in step (3.1) is input into the first residual module; the number of output channels of the first to third residual modules is 64, of the fourth to seventh residual modules 128, of the eighth to thirteenth residual modules 256, and of the fourteenth to sixteenth residual modules 512; the sixteenth residual module outputs a feature map M of size 1 × 19 and channel number 512;
The operations of the first to third, fifth to seventh, ninth to thirteenth and fifteenth to sixteenth residual modules are shown in equation (8):
M_l = BN_2(Conv_2(RELU(BN_1(Conv_1(M_{l-1}))))) + M_{l-1}    (8)
In equation (8), M_{l-1} represents the input of the l-th residual module, M_l represents the output of the l-th residual module, Conv_1 and Conv_2 represent two independent convolution layers with a convolution kernel size of 3, a stride of 1 and an edge fill size of 1, and BN_1 and BN_2 represent two independent normalization layers;
The operation of the fourth, eighth and fourteenth residual modules is shown in equation (9):
M_l = BN_2(Conv_4(RELU(BN_1(Conv_3(M_{l-1}))))) + BN_3(Conv_5(M_{l-1}))    (9)
In equation (9), Conv_3 represents a convolution layer with a convolution kernel size of 3, a stride of 2 and an edge fill size of 1, Conv_4 represents a convolution layer with a convolution kernel size of 3, a stride of 1 and an edge fill size of 1, Conv_5 represents a convolution layer with a convolution kernel size of 1, a stride of 2 and an edge fill size of 0, and BN_1, BN_2 and BN_3 represent three independent normalization layers;
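The two residual module variants of equations (8) and (9) can be sketched as one class: equation (8) is the identity-shortcut block, and equation (9) is the stride-2 block whose shortcut is a 1 × 1, stride-2 projection convolution (used as modules 4, 8 and 14). This is a minimal sketch under those assumptions.

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """Sketch of the residual modules of equations (8) and (9)."""

    def __init__(self, c_in: int, c_out: int, downsample: bool):
        super().__init__()
        stride = 2 if downsample else 1
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride, 1)   # Conv_1 / Conv_3
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, 1, 1)       # Conv_2 / Conv_4
        self.bn2 = nn.BatchNorm2d(c_out)
        self.relu = nn.ReLU(inplace=True)
        # equation (9): shortcut through a 1x1, stride-2 convolution and BN_3
        self.short = (
            nn.Sequential(nn.Conv2d(c_in, c_out, 1, 2, 0), nn.BatchNorm2d(c_out))
            if downsample else nn.Identity()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return y + self.short(x)   # residual addition of eqs. (8)/(9)
```

Applied to the 8 × 148 stem output, an identity block preserves the shape, while a downsampling block halves both spatial dimensions and changes the channel count, consistent with the 64 → 128 → 256 → 512 progression described above.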
fourthly, constructing a feature fusion and classification network for fusing the visual features and the auditory features and classifying the emotion of each video according to the fused features; the feature fusion and classification network comprises two cross-modal time attention modules, pooling and splicing operation and classification operation;
step (4.1), the first cross-modal time attention module:
The feature map F obtained in step (2.3) and the feature map M obtained in step (3.2) are input into the first cross-modal time attention module; the feature map F passes through a linear layer and a normalization layer to obtain a feature Q_1; the feature map M passes through two independent linear layers and two independent normalization layers to obtain features K_1 and V_1; the calculation of the features Q_1, K_1 and V_1 is shown in equations (10), (11) and (12):
Q_1 = BN(Linear(F))    (10)

K_1 = BN(Linear(M))    (11)

V_1 = BN(Linear(M))    (12)
A learnable intermediate matrix LIM_1 is generated according to the time-dimension sizes of the features Q_1 and K_1, and LIM_1 is initialized with random parameters; the feature Q_1 is multiplied by the initialized intermediate matrix LIM_1 and by the transpose of the feature K_1, divided by the square root of the channel number of the feature K_1, and input into the softmax layer to obtain normalized weights; the normalized weights are multiplied by the feature V_1 and then added to the feature Q_1 to obtain the image-sequence-based cross-modal attention feature F_att; the calculation of F_att is shown in equation (13):

F_att = softmax(Q_1 × LIM_1 × K_1^T / √d_{K_1}) × V_1 + Q_1    (13)

In equation (13), T represents the matrix transpose and d_{K_1} represents the channel number of the feature K_1;
The image-sequence-based cross-modal attention feature F_att passes through a point-by-point convolution layer to obtain the image-sequence-based cross-modal feature F_cm, see FIG. 8; the calculation of F_cm is shown in equation (14):

F_cm = Conv_pw(F_att)    (14)

In equation (14), Conv_pw represents a point-by-point convolution layer with a convolution kernel size of 1;
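The first cross-modal time attention module can be sketched as follows. Caveats: the translated text sizes LIM_1 by the time dimensions of Q_1 and K_1, but the chain product Q_1 · LIM_1 · K_1^T is only well-defined for sequences of different lengths if LIM_1 is d × d in the channel dimension, which is what this sketch assumes; LayerNorm also stands in for the normalization layers. The wiring is therefore an interpretation, not the patented implementation.

```python
import math
import torch
import torch.nn as nn

class CrossModalTimeAttention(nn.Module):
    """Sketch of the first cross-modal time attention module
    (equations (10)-(14)). LIM taken as d x d is an assumption."""

    def __init__(self, d: int):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(d, d), nn.LayerNorm(d))   # eq. (10)
        self.k = nn.Sequential(nn.Linear(d, d), nn.LayerNorm(d))   # eq. (11)
        self.v = nn.Sequential(nn.Linear(d, d), nn.LayerNorm(d))   # eq. (12)
        self.lim = nn.Parameter(torch.randn(d, d))                 # learnable intermediate matrix
        self.pw = nn.Linear(d, d)                                  # point-by-point convolution, eq. (14)

    def forward(self, f: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # f: (T_f, d) visual sequence, m: (T_m, d) auditory sequence
        q, k, v = self.q(f), self.k(m), self.v(m)
        w = torch.softmax(q @ self.lim @ k.T / math.sqrt(k.shape[-1]), dim=-1)
        f_att = w @ v + q                                          # eq. (13)
        return self.pw(f_att)                                      # F_cm, eq. (14)
```

The second module of step (4.2) is the same mechanism with the roles of F and M exchanged.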
step (4.2), the second cross-modal time attention module:
The feature map F obtained in step (2.3) and the feature map M obtained in step (3.2) are input into the second cross-modal time attention module; the feature map M passes through a linear layer and a normalization layer to obtain a feature Q_2; the feature map F passes through two independent linear layers and two independent normalization layers to obtain features K_2 and V_2; the calculation of Q_2, K_2 and V_2 is shown in equations (15), (16) and (17):
Q_2 = BN(Linear(M))    (15)

K_2 = BN(Linear(F))    (16)

V_2 = BN(Linear(F))    (17)
A learnable intermediate matrix LIM_2 is generated according to the time-dimension sizes of the features Q_2 and K_2, and LIM_2 is initialized with random parameters; the feature Q_2 is multiplied by the initialized intermediate matrix LIM_2 and by the transpose of the feature K_2, divided by the square root of the channel number of the feature K_2, and input into the softmax layer to obtain normalized weights; the normalized weights are multiplied by the feature V_2 and then added to the feature Q_2 to obtain the cross-modal attention feature M_att based on the Mel cepstrum coefficient spectrogram; the calculation of M_att is shown in equation (18):

M_att = softmax(Q_2 × LIM_2 × K_2^T / √d_{K_2}) × V_2 + Q_2    (18)

In equation (18), T represents the matrix transpose and d_{K_2} represents the channel number of the feature K_2;
The cross-modal attention feature M_att based on the Mel cepstrum coefficient spectrogram passes through a point-by-point convolution layer to obtain the cross-modal feature M_cm based on the Mel cepstrum coefficient spectrogram, see FIG. 9; the calculation of M_cm is shown in equation (19):

M_cm = Conv_pw(M_att)    (19)
step (4.3), pooling and splicing:
The image-sequence-based cross-modal feature F_cm obtained in step (4.1) and the cross-modal feature M_cm based on the Mel cepstrum coefficient spectrogram obtained in step (4.2) are each subjected to average pooling and then spliced to obtain a feature f_FM of size 1 × 1 and channel number 1024; the pooling and splicing operation is shown in equation (20):
f_FM = Concat(AvgPool(F_cm), AvgPool(M_cm))    (20)
in equation (20), avgPool represents the average pooling operation;
and (4.4) classifying:
The feature f_FM obtained in step (4.3) is input into a linear layer and then passes through a softmax layer to obtain the predicted probability distributions P{Y_1, Y_2, ..., Y_i, ..., Y_q} over E emotion categories, where q represents the number of videos; Y_i represents the predicted probability distribution of the i-th video over the E emotion categories, denoted Y_i{y_i1, ..., y_ie, ..., y_iE}, where y_ie represents the predicted probability that the i-th video belongs to the e-th emotion category;
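Steps (4.3) and (4.4) can be sketched as follows; the number of emotion categories E = 7 and the sequence lengths are illustrative assumptions, and the random tensors stand in for the cross-modal features.

```python
import torch

# Sketch of steps (4.3)-(4.4): average-pool each cross-modal feature
# over its time dimension, splice into a 1024-dim vector, then classify.
E = 7                          # illustrative number of emotion categories
f_cm = torch.randn(16, 512)    # image-sequence cross-modal feature (stand-in)
m_cm = torch.randn(19, 512)    # spectrogram cross-modal feature (stand-in)

f_fm = torch.cat([f_cm.mean(dim=0), m_cm.mean(dim=0)])   # eq. (20): 1024-dim feature
classifier = torch.nn.Linear(1024, E)
y_i = torch.softmax(classifier(f_fm), dim=-1)            # predicted distribution, sums to 1
```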
Fifthly, a Focal Loss function fusing dynamic weights is constructed; the ConvMixer network combined with the adjacency matrix, the ResNet34 network, and the feature fusion and classification network are trained, with the training loss calculated through the Focal Loss function fusing dynamic weights; the trained ConvMixer network combined with the adjacency matrix is used to extract visual features from the image sequence, the trained ResNet34 network is used to extract auditory features from the Mel cepstrum coefficient spectrogram, and the trained feature fusion and classification network is used to fuse the visual and auditory features, perform emotion classification according to the fused features, and predict the emotion category corresponding to each video;
The loss between the prediction probability distribution output in step (4.4) and the real emotion category is calculated according to equation (21):

Loss = -α(1 - y_{i,l_i})^γ log₂(y_{i,l_i})    (21)

In equation (21), α represents the dynamic weight, log represents the logarithmic function with base 2, l_i represents the real emotion category to which the i-th video belongs, y_{i,l_i} represents the predicted probability of that category, and γ represents the focusing parameter of the focal loss;
The dynamic weight α is calculated as follows:

α = (N_{l_i} - C_{l_i,l_i}) / N_{l_i}    (22)

In equation (22), C represents the confusion matrix of the last training cycle, C_{l_i,t} represents the number of times the real emotion category l_i was predicted as category t in the last training cycle, N_{l_i} = Σ_t C_{l_i,t} represents the number of videos whose real emotion category is l_i, and C_{l_i,l_i} represents the number of times the real emotion category l_i was predicted as category l_i;
Construction of the confusion matrix C in equation (22): before each training cycle starts, an all-zero matrix of size E × E is generated, with rows and columns numbered from 1 to E; the all-zero matrix is updated according to the prediction of each sample during training to obtain the confusion matrix C: for example, when a sample of real emotion category 1 is predicted as category 2, the element in row 1, column 2 of the matrix is incremented by 1; when the training cycle is completed, the confusion matrix C of the current cycle is used to calculate the dynamic weight α of the loss function in the next training cycle.
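One plausible reading of the dynamic-weight focal loss can be sketched as follows: α is taken as the misclassification rate of the sample's true class in the previous training cycle, and the focusing parameter γ is assumed (it is not stated in the text; γ = 2 is the common default for focal loss).

```python
import math

def dynamic_alpha(confusion, true_class):
    # Dynamic weight: fraction of last-cycle samples of the true class
    # that were misclassified (an interpretation of the translated text).
    row = confusion[true_class]
    n = sum(row)
    return 1.0 - row[true_class] / n if n else 1.0

def dynamic_focal_loss(p_true, alpha, gamma=2.0):
    # Focal loss with dynamic weight alpha and base-2 logarithm;
    # p_true is the predicted probability of the sample's true class.
    return -alpha * (1.0 - p_true) ** gamma * math.log2(p_true)

# Confusion matrix from the last training cycle for E = 3 classes:
C = [[8, 1, 1],    # class 0: 8 of 10 correct -> alpha = 0.2
     [2, 6, 2],    # class 1: 6 of 10 correct -> alpha = 0.4
     [0, 3, 7]]
loss = dynamic_focal_loss(p_true=0.7, alpha=dynamic_alpha(C, 0))
```

Classes that were confused more often in the previous cycle receive a larger α, so their samples contribute more to the loss in the next cycle.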
In the fifth step, the batch size is 8, the number of iterations is set to 120, and an Adam optimizer is adopted with an initial learning rate of 0.0001 and a momentum factor of 0.9; the learning rate is reduced by 90% after every 30 iterations.
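The training configuration above can be sketched with a standard optimizer and step scheduler. The use of PyTorch, the dummy model, and mapping the stated momentum factor 0.9 to Adam's β₁ are assumptions.

```python
import torch

# Sketch of the fifth-step training configuration: Adam with initial
# learning rate 1e-4 and a step scheduler that multiplies the learning
# rate by 0.1 (a 90% reduction) every 30 iterations.
model = torch.nn.Linear(1024, 7)   # dummy stand-in for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for iteration in range(120):       # 120 iterations, batch size 8 in the patent
    # ... forward pass, dynamic focal loss, loss.backward() ...
    optimizer.step()
    scheduler.step()

print(optimizer.param_groups[0]["lr"])   # after four decays: ~1e-8
```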
Matters not described in detail in this specification belong to the prior art known to those skilled in the art.
Claims (3)
1. An audio-visual emotion classification method based on an improved ConvMixer network and dynamic focus loss is characterized by comprising the following steps of:
the method comprises the steps of firstly, collecting a video which expresses emotion and relates to a human face area, extracting an image sequence and an audio signal from the video, and converting the audio signal into a Mel cepstrum coefficient spectrogram;
secondly, constructing a ConvMixer network combined with an adjacency matrix, wherein the ConvMixer network comprises three parts: a block embedding operation, a Layer module operation and an average pooling operation; inputting the image sequence into the ConvMixer network combined with the adjacency matrix to extract visual features, obtaining a feature map F;
step (2.1), block embedding operation:
performing the block embedding operation on the image sequence sequentially through a convolution layer, an activation function layer and a normalization layer, obtaining a feature map F_2.1 output by the block embedding operation;
Step (2.2), the Layer module is operated and comprises four cascaded Layer modules;
inputting the feature map F_2.1 into the first Layer module; according to the spatial size of F_2.1, constructing a two-dimensional spatial coordinate matrix of each image block in F_2.1; according to the temporal size of F_2.1, copying and splicing the two-dimensional spatial coordinate matrix to obtain a spatial position code of the same size as F_2.1; splicing the feature map F_2.1 with the spatial position code and passing it through a linear layer to obtain a feature map F'; randomly generating a spatial adjacency matrix according to the spatial dimension of F_2.1, multiplying the feature map F' by the spatial adjacency matrix and passing it through an activation function layer and a normalization layer to obtain a feature map F_s; superimposing the feature maps F' and F_s to obtain a feature map F_s';
according to the temporal dimension of the feature map F_s', constructing a one-dimensional temporal coordinate matrix of each image block in F_s'; according to the spatial size of F_s', copying and splicing the one-dimensional temporal coordinate matrix to obtain a temporal position code of the same size as F_s'; splicing the feature map F_s' with the temporal position code and passing it through a linear layer to obtain a feature map F''; randomly generating a temporal adjacency matrix according to the temporal dimension of F_s', multiplying the feature map F'' by the temporal adjacency matrix and passing it through an activation function layer and a normalization layer to obtain a feature map F_t; superimposing the feature maps F'' and F_t to obtain a feature map F_t'; passing the feature map F_t' sequentially through a point-by-point convolution layer, an activation function layer and a normalization layer to obtain the feature map output by the first Layer module;
step (2.3), average pooling operation:
performing spatial-dimension average pooling on the feature map output by the fourth Layer module through an average pooling layer to obtain the feature map F;
thirdly, extracting auditory characteristics from the Mel cepstrum coefficient spectrogram by using a ResNet34 network to obtain a characteristic diagram M;
fourthly, constructing a feature fusion and classification network for fusing the visual features and the auditory features and classifying the emotion of each video according to the fused features; the feature fusion and classification network comprises two cross-modal time attention modules, pooling and splicing operation and classification operation;
step (4.1), the first cross-modal time attention module:
inputting the feature map F and the feature map M into the first cross-modal time attention module; passing the feature map F through a linear layer and a normalization layer to obtain a feature Q_1; passing the feature map M through two independent linear layers and two independent normalization layers to obtain features K_1 and V_1;
generating a learnable intermediate matrix LIM_1 according to the time-dimension sizes of the features Q_1 and K_1, and initializing LIM_1 with random parameters; multiplying the feature Q_1 by the initialized intermediate matrix LIM_1 and by the transpose of the feature K_1, dividing by the square root of the channel number of the feature K_1, and inputting the result into a softmax layer to obtain normalized weights; multiplying the normalized weights by the feature V_1 and adding the feature Q_1 to obtain the image-sequence-based cross-modal attention feature F_att; passing F_att through a point-by-point convolution layer to obtain the image-sequence-based cross-modal feature F_cm;
Step (4.2), the second cross-modal time attention module:
inputting the feature map F and the feature map M into the second cross-modal time attention module; passing the feature map M through a linear layer and a normalization layer to obtain a feature Q_2; passing the feature map F through two independent linear layers and two independent normalization layers to obtain features K_2 and V_2;
generating a learnable intermediate matrix LIM_2 according to the time-dimension sizes of the features Q_2 and K_2, and initializing LIM_2 with random parameters; multiplying the feature Q_2 by the initialized intermediate matrix LIM_2 and by the transpose of the feature K_2, dividing by the square root of the channel number of the feature K_2, and inputting the result into a softmax layer to obtain normalized weights; multiplying the normalized weights by the feature V_2 and adding the feature Q_2 to obtain the cross-modal attention feature M_att based on the Mel cepstrum coefficient spectrogram; passing M_att through a point-by-point convolution layer to obtain the cross-modal feature M_cm based on the Mel cepstrum coefficient spectrogram;
Step (4.3), pooling and splicing:
cross-modal feature F to be based on image sequence cm And cross-modal feature M based on Mel cepstrum coefficient spectrogram cm Respectively carrying out average pooling, and then splicing to obtain the characteristic f FM ;
And (4.4) classifying:
inputting the feature f_FM into a linear layer and passing it through a softmax layer to obtain the predicted probability distributions P{Y_1, Y_2, ..., Y_i, ..., Y_q} over E emotion categories, where Y_i represents the predicted probability distribution of the i-th video over the E emotion categories, denoted Y_i{y_i1, ..., y_ie, ..., y_iE}, y_ie represents the predicted probability that the i-th video belongs to the e-th emotion category, and q represents the number of videos;
fifthly, training a ConvMixer network, a ResNet34 network and a feature fusion and classification network which are combined with the adjacency matrix, and calculating training loss through a focus loss function which fuses dynamic weights; and extracting visual features from the image sequence by using the trained ConvMixer network combined with the adjacency matrix, extracting auditory features from a Mel cepstrum coefficient spectrogram by using the trained ResNet34 network, performing feature fusion on the visual features and the auditory features by using the trained feature fusion and classification network, performing emotion classification according to the fused features, and predicting the emotion classes corresponding to the videos.
2. The method of claim 1, wherein the focus loss function of the fused dynamic weights is:
Loss = -α(1 - y_{i,l_i})^γ log₂(y_{i,l_i})    (21)

In equation (21), α represents the dynamic weight, log represents the logarithmic function with base 2, l_i represents the real emotion category to which the i-th video belongs, y_{i,l_i} represents the predicted probability of that category, and γ represents the focusing parameter;
the dynamic weight α is calculated as follows:

α = (N_{l_i} - C_{l_i,l_i}) / N_{l_i}    (22)

In equation (22), C represents the confusion matrix of the last training cycle, C_{l_i,t} represents the number of times the real emotion category l_i was predicted as category t in the last training cycle, N_{l_i} = Σ_t C_{l_i,t} represents the number of videos whose real emotion category is l_i, and C_{l_i,l_i} represents the number of times the real emotion category l_i was predicted as category l_i.
3. The audio-visual emotion classification method based on the improved ConvMixer network and dynamic focus loss as claimed in claim 2, wherein the confusion matrix C is constructed as follows: before each training cycle starts, an all-zero matrix of size E × E is generated, with rows and columns numbered from 1 to E; the all-zero matrix is updated according to the prediction of each sample during training: when a sample of real emotion category 1 is predicted as category 2, the element in row 1, column 2 of the matrix is incremented by 1, and the other samples update the matrix in the same way, obtaining the confusion matrix C; when the training cycle is completed, the confusion matrix of the current cycle is used to calculate the dynamic weight of the loss function in the next training cycle.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211015781.2A CN115346261A (en) | 2022-08-24 | 2022-08-24 | Audio-visual emotion classification method based on improved ConvMixer network and dynamic focus loss |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115346261A true CN115346261A (en) | 2022-11-15 |
Family
ID=83953410
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211015781.2A Pending CN115346261A (en) | 2022-08-24 | 2022-08-24 | Audio-visual emotion classification method based on improved ConvMixer network and dynamic focus loss |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115346261A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116842445A (en) * | 2023-07-03 | 2023-10-03 | 山东科技大学 | Method and system for automatically recognizing awakening based on multi-mode space-time spectrum fusion |
CN116842445B (en) * | 2023-07-03 | 2024-06-11 | 山东科技大学 | Method and system for automatically recognizing awakening based on multi-mode space-time spectrum fusion |
CN117292442A (en) * | 2023-10-13 | 2023-12-26 | 中国科学技术大学先进技术研究院 | Cross-mode and cross-domain universal face counterfeiting positioning method |
CN117292442B (en) * | 2023-10-13 | 2024-03-26 | 中国科学技术大学先进技术研究院 | Cross-mode and cross-domain universal face counterfeiting positioning method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110472531B (en) | Video processing method, device, electronic equipment and storage medium | |
CN110751208B (en) | Criminal emotion recognition method for multi-mode feature fusion based on self-weight differential encoder | |
CN112348075B (en) | Multi-mode emotion recognition method based on contextual attention neural network | |
Mroueh et al. | Deep multimodal learning for audio-visual speech recognition | |
CN110516536B (en) | Weak supervision video behavior detection method based on time sequence class activation graph complementation | |
CN108346436B (en) | Voice emotion detection method and device, computer equipment and storage medium | |
CN115346261A (en) | Audio-visual emotion classification method based on improved ConvMixer network and dynamic focus loss | |
CN112329794B (en) | Image description method based on dual self-attention mechanism | |
CN111161715B (en) | Specific sound event retrieval and positioning method based on sequence classification | |
CN113792177B (en) | Scene character visual question-answering method based on knowledge-guided deep attention network | |
CN111488487B (en) | Advertisement detection method and detection system for all-media data | |
CN111653270B (en) | Voice processing method and device, computer readable storage medium and electronic equipment | |
CN117149944B (en) | Multi-mode situation emotion recognition method and system based on wide time range | |
Oghbaie et al. | Advances and challenges in deep lip reading | |
CN111680602A (en) | Pedestrian re-identification method based on double-flow hierarchical feature correction and model architecture | |
CN117037017A (en) | Video emotion detection method based on key frame erasure | |
CN115858728A (en) | Multi-mode data based emotion analysis method | |
CN113177112B (en) | Neural network visual conversation device and method based on KR product fusion multi-mode information | |
Vayadande et al. | Lipreadnet: A deep learning approach to lip reading | |
CN115294353A (en) | Crowd scene image subtitle description method based on multi-layer attribute guidance | |
WO2021147084A1 (en) | Systems and methods for emotion recognition in user-generated video(ugv) | |
CN117576279B (en) | Digital person driving method and system based on multi-mode data | |
CN112926662B (en) | Target detection method based on multi-scale language embedded REC | |
CN112765955B (en) | Cross-modal instance segmentation method under Chinese finger representation | |
Kulkarni | Integration of Audio video Speech Recognition using LSTM and Feed Forward Convolutional Neural Network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||