CN111402928B - Attention-based speech emotion state evaluation method, device, medium and equipment - Google Patents

Attention-based speech emotion state evaluation method, device, medium and equipment

Info

Publication number
CN111402928B
CN111402928B CN202010143924.2A
Authority
CN
China
Prior art keywords
layer
attention
spectrogram
convolution
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010143924.2A
Other languages
Chinese (zh)
Other versions
CN111402928A (en)
Inventor
李淑贞
邢晓芬
徐向民
郭锴凌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010143924.2A priority Critical patent/CN111402928B/en
Publication of CN111402928A publication Critical patent/CN111402928A/en
Application granted
Publication of CN111402928B publication Critical patent/CN111402928B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention provides an attention-based speech emotion state assessment method, device, medium and equipment. The method comprises the following steps: S1, building a speech emotion state evaluation model: a basic framework is built with four convolutional layers; each convolutional layer is followed by a batch normalization layer, a ReLU activation function and an average pooling operation; a space-time attention module is connected after the third convolutional layer; a frequency attention module is connected after the fourth convolutional layer; finally, a softmax layer is connected; S2, inputting a speech emotion database to train and test the speech emotion state evaluation model; and S3, processing the audio data to be evaluated to obtain a spectrogram, and inputting the spectrogram into the speech emotion state evaluation model to evaluate the emotional state. The invention adopts a novel lightweight attention mechanism in which the space-time attention and the frequency attention cooperate with each other, so that emotional features are extracted quickly and accurately from a long audio clip, and the effect and performance of the emotional state evaluation model are effectively improved.

Description

Attention-based speech emotion state evaluation method, device, medium and equipment
Technical Field
The invention relates to the technical field of voice processing, in particular to a voice emotion state assessment method, device, medium and equipment based on attention.
Background
With the development of society and the advancement of science and technology, human-computer interaction technology has quietly entered many aspects of our lives, such as smart homes, mobile phones, vehicles, smart wearables and robots. In recent years, human-computer interaction technology has changed dramatically: people are no longer satisfied with the GUI (graphical user interface) era and increasingly expect a natural, conversation-like experience. As a new interaction technology, the VUI (voice user interface) is a human-computer interaction mode centered on human intention, an intelligent interaction experience with natural conversation at its core. Voice interaction is more efficient and more naturally expressive than interface-based input, as in the voice assistants Siri, Alexa and Cortana, VR virtual chat rooms, VR medical consultation systems, and the like. However, these tools do not analyze the emotional state of the interlocutor during human-computer interaction. Because the same sentence can express different meanings depending on the interlocutor's emotional state, acquiring the emotional state of the interlocutor is very important for the machine to understand semantics accurately.
The traditional speech emotion recognition method is based on acoustic statistical features and machine learning models. Acoustic statistical features commonly used for emotion recognition include mel-frequency cepstral coefficients (MFCCs), the GeMAPS feature set, vocal prosodic features, the BoAW feature set, and the like. Machine learning models applied to these acoustic statistical features include hidden Markov models, Gaussian mixture models, decision trees, and the like. However, emotion is a high-level semantic concept, and conventional acoustic statistical features have limited capability to represent it, which even limits model performance to a certain extent.
In recent years, thanks to the strong nonlinear representation capability of deep networks, deep learning methods have gradually been introduced into the field of speech emotion recognition. Nonlinear deep emotional features can be extracted from acoustic statistical features through CNN, DNN, DBN and LSTM networks, improving emotion representation capability, and the deep emotional features are then fed to machine learning models such as the ELM (extreme learning machine) and SVM (support vector machine) for decision.
Because conventional acoustic statistical features have limited representation capability and global statistics easily lose local information, researchers have turned their attention to the spectrogram. The spectrogram is a time-frequency representation that shows how speech energy changes with time and frequency; it preserves temporal and local information while clearly reflecting the harmonic structure and frequency content of speech. Speech emotion recognition based on the spectrogram and convolutional neural networks has therefore become a recent hot topic. How to quickly and effectively organize and extract emotional features from a lengthy spectrogram has become a key technical problem in the field of speech emotion recognition.
Since emotion accompanies the content spoken by the speaker, in a piece of audio emotion is hidden only in the frames rich in speech information, not in the silent frames. Blindly searching a lengthy spectrogram for emotion-related regions and features without any guidance is difficult and time-consuming; an attention mechanism can solve this problem. The attention mechanism is a weighting mechanism that can highlight important information and suppress irrelevant information without cutting the audio.
The attention mechanism commonly used for speech emotion recognition is global soft attention. Its weights are obtained by applying linear and nonlinear transformations to the original sequence and then normalizing, and the final result is the sum of the original sequence multiplied point by point by the corresponding weights. Because the original sequence feature vectors in a speech emotion recognition convolutional network are very large, the corresponding learnable parameters are also very large; this attention mechanism therefore requires a huge amount of computation and cannot easily be ported to mobile terminals.
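For reference, the following is a minimal sketch of one common formulation of global soft attention, given only to illustrate the computation just described (linear and nonlinear transforms, normalization, weighted sum); it is not the attention mechanism proposed by this invention, and all names are illustrative.

```python
# Minimal global soft attention over a sequence of frame features; a generic
# formulation for illustration only, not the mechanism proposed below.
import torch
from torch import nn


class GlobalSoftAttention(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)   # linear transformation
        self.score = nn.Linear(feat_dim, 1)         # scoring vector

    def forward(self, h):                           # h: (B, T, feat_dim)
        e = self.score(torch.tanh(self.proj(h)))    # nonlinear transform: (B, T, 1)
        a = torch.softmax(e, dim=1)                 # normalized attention weights
        return (a * h).sum(dim=1)                   # weighted sum: (B, feat_dim)
```

With the large feature maps produced by a convolutional front end, feat_dim and hence the projection matrices become very large, which is the computational burden noted above.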
In addition, there are some unconventional attention mechanisms, such as those based on maximum pooling and special convolution kernel sizes, which also require a large amount of computation and tend to hurt discrimination performance because of the noise introduced by the maximum pooling operation.
How to design a lightweight attention mechanism which effectively emphasizes emotional features is also a new key technical hotspot in the field of speech emotion recognition.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention aims to provide an attention-based method, device, medium and equipment for evaluating speech emotional states. The invention adopts a novel lightweight attention mechanism in which the space-time attention and the frequency attention cooperate with each other, so that emotional features are extracted quickly and accurately from a long audio clip, and the effect and performance of the emotional state evaluation model are effectively improved.
In order to achieve the purpose, the invention is realized by the following technical scheme: a speech emotion state assessment method based on attention is characterized in that: the method comprises the following steps:
S1, building a speech emotion state evaluation model: setting the input as a spectrogram; building a basic framework with four convolutional layers and setting the convolution kernel size of each convolutional layer; each convolutional layer is followed by a batch normalization layer, a ReLU activation function and an average pooling operation; a space-time attention module is connected after the third convolutional layer; a frequency attention module is connected after the fourth convolutional layer; finally, a softmax layer is connected to obtain the emotional state prediction result;
S2, inputting a speech emotion database, wherein each piece of audio data in the database has a corresponding emotion label; dividing the audio data of the speech emotion database into a training set and a test set; processing all the audio data to obtain spectrograms; and inputting the spectrograms into the speech emotion state evaluation model for training and testing;
and S3, processing the audio data to be evaluated to obtain a spectrogram, and inputting the spectrogram into the trained and tested speech emotion state evaluation model to evaluate the emotional state.
Preferably, in step S1, a feature map is obtained after the spectrogram is processed by each convolutional layer; the feature map has three dimensions: a depth C representing the number of channels, a height H representing the frequency axis, and a width W representing the time axis; the space-time attention module consists of a channel attention module unit and a spatial attention module unit.
In the channel attention module unit, an input feature map F ∈ ℝ^(C×H×W) is compressed by global average pooling over the H×W spatial plane to obtain a channel descriptor D_c ∈ ℝ^(C×1×1); the channel descriptor D_c is mapped to a channel attention weight A_c ∈ ℝ^(C×1×1) by two fully connected layers and a Sigmoid activation function; the channel attention weight A_c is multiplied point by point with the original feature map F to obtain a new feature map F_c. The process formulas are:
A_c = σ_s(W_2(W_1 · Avg_spatial(F) + B_1) + B_2)
F_c = A_c ⊗ F
where W_1 and B_1 are the weight coefficient and bias value of the first fully connected layer, W_2 and B_2 are the weight coefficient and bias value of the second fully connected layer, σ_s is the Sigmoid activation function, ⊗ denotes point-by-point multiplication, and Avg_spatial is the global average pooling function over the H×W spatial plane.
In the spatial attention module unit, the new feature map F_c is compressed by global average pooling along the C axis to obtain a spatial descriptor D_s ∈ ℝ^(1×H×W); a spatial attention weight A_s ∈ ℝ^(1×H×W) is generated by a convolutional layer and a ReLU activation function; the spatial attention weight A_s is multiplied point by point with the feature map F_c to obtain a new feature map F_st. The process formulas are:
A_s = σ_r(W_{7×7} ∗ Avg_channel(F_c) + B_3)
F_st = A_s ⊗ F_c
where W_{7×7} is the convolution kernel weight coefficient of the convolutional layer (kernel size 7×7), B_3 is the bias value of this convolutional layer, ∗ is the convolution operation, σ_r is the ReLU activation function, and Avg_channel is the global average pooling function along the C axis.
In the frequency attention module, the output feature map F_4 ∈ ℝ^(C×H×W) of the fourth convolutional layer is processed by a depth-column convolution with a depth-column convolution kernel W_F to obtain different frequency pattern results F_f for different channels. The process formula is:
F_f = W_F ⊛ F_4 + B_4
where B_4 is the bias value of the depth-column convolution and ⊛ is the depth-column convolution operation.
The feature map F_f is then compressed by global average pooling along the time axis (W axis) to obtain a channel descriptor D'_c; the channel descriptor D'_c is input to a fully connected layer with C neurons to calculate the frequency attention result F_FQ. The process formula is:
F_FQ = W_5 · Avg_time(F_f) + B_5
where W_5 is the weight coefficient of the fully connected layer, which is also the channel weight learned by the network, B_5 is its bias value, and Avg_time is the global average pooling function along the W axis.
Finally, F_FQ is input to a fully connected layer with 4 neurons and a softmax function to obtain the emotional state prediction result.
Preferably, in steps S2 and S3, the spectrogram is obtained by performing segmentation, framing, short-time Fourier transform and normalization on the audio data. The spectrogram shows, in the form of a time-frequency representation, how speech energy changes with time and frequency, and preserves temporal and local information while clearly reflecting the harmonic structure and frequency content of speech.
An attention-based speech emotion state evaluation device, characterized by comprising:
a speech emotion state evaluation model building module, used for setting the input as a spectrogram; building a basic framework with four convolutional layers and setting the convolution kernel size of each convolutional layer; each convolutional layer is followed by a batch normalization layer, a ReLU activation function and an average pooling operation; a space-time attention module is connected after the third convolutional layer; a frequency attention module is connected after the fourth convolutional layer; finally, a softmax layer is connected to obtain the emotional state prediction result;
a speech emotion state evaluation model training and testing module, used for inputting a speech emotion database in which each piece of audio data has a corresponding emotion label; dividing the audio data of the speech emotion database into a training set and a test set; processing all the audio data to obtain spectrograms; and inputting the spectrograms into the speech emotion state evaluation model for training and testing;
and a speech emotion state evaluation module, used for processing the audio data to be evaluated to obtain a spectrogram and inputting the spectrogram into the trained and tested speech emotion state evaluation model to evaluate the emotional state.
A storage medium, wherein the storage medium stores a computer program which, when executed by a processor, causes the processor to execute the above-described attention-based speech emotional state assessment method.
A computing device comprising a processor and a memory for storing a program executable by the processor, wherein the processor implements the above-described attention-based speech emotional state assessment method when executing the program stored in the memory.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. In the invention, the space-time attention can highlight the emotion-related regions (spatio-temporal regions) in a lengthy spectrogram, and the frequency attention captures emotional frequency features according to the frequency distribution in the above candidate regions; the invention adopts a novel lightweight attention mechanism in which the space-time attention and the frequency attention cooperate with each other, so that emotional features are extracted quickly and accurately from a long audio clip, and the effect and performance of the emotional state evaluation model are effectively improved;
2. The invention can help a voice interaction system evaluate the emotional state of the interlocutor in real time during human-computer conversation and feed it back to an intelligent question-answering system, helping the system better understand semantics and adjust its text and speech output, so that the answers of the voice interaction system better meet the interlocutor's needs.
Drawings
FIG. 1 is a schematic diagram of a speech emotional state assessment model of the present invention;
FIG. 2 is a schematic diagram of a feature map obtained after processing each convolution layer of a spectrogram in the present invention;
FIG. 3 is a schematic diagram of the spatiotemporal attention module of the present invention;
FIG. 4 is a schematic diagram of a frequency attention module of the present invention;
FIG. 5 is a flow chart of the present invention for training and testing a speech emotional state assessment model.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Example one
In the attention-based speech emotion state assessment method, the input is audio data. A spectrogram is extracted from the audio data by short-time Fourier transform and fed into the speech emotion state evaluation model for training. The spectrogram shows, in the form of a time-frequency representation, how speech energy changes with time and frequency, and preserves temporal and local information while clearly reflecting the harmonic structure and frequency content of speech. Because convolutional neural networks have strong image representation capability, the main network structure of the speech emotion state evaluation model uses four convolutional layers. In order to extract emotion-related features from the spectrogram and suppress irrelevant information, the invention designs a novel lightweight attention mechanism: spatio-temporal-frequency attention. Unlike previous attention mechanisms, spatio-temporal-frequency attention is a cascaded attention mechanism consisting of spatio-temporal attention and frequency attention. Since emotion is hidden in the spoken segments of the audio, spatio-temporal attention highlights these speech information regions (speech information spatio-temporal regions) through channel attention and spatial attention, suppressing unvoiced regions and noise regions. Research shows that emotion is closely related to speech frequency, so frequency attention acquires emotion-related frequency combination features through frequency-channel attention within the speech information candidate regions. The spatio-temporal attention and the frequency attention cooperate with each other, helping the neural network extract emotional features quickly and accurately from a long audio clip and effectively improving the effect and performance of the emotional state assessment model.
The method comprises the following steps:
S1, building a speech emotion state evaluation model: as shown in FIG. 1, a basic framework is built with four convolutional layers, and the convolution kernel size of each convolutional layer is set; for example, the kernel configurations of the four convolutional layers are 16×16×12, 24×12×8, 32×7×15 and 64×5×3; each convolutional layer is followed by a batch normalization layer, a ReLU activation function and an average pooling operation; a space-time attention module is connected after the third convolutional layer; a frequency attention module is connected after the fourth convolutional layer; finally, a softmax layer is connected to obtain the emotional state prediction result;
because the spectrogram in the form of an RGB image is input, the method mainly adopts a convolutional neural network. In order to be transplanted to the client, the method adopts a small network, namely only four convolutional layers. In order to rapidly extract emotional features from a lengthy spectrogram, the invention provides space-time-frequency cascade attention. Spatiotemporal attention can focus on a speech information region (speech information spatiotemporal region) from a lengthy spectrogram, and frequency attention can extract emotional frequency features from the speech information candidate region. The two are mutually matched, the auxiliary model extracts emotional characteristics quickly and accurately, and the accuracy of the model is improved.
In step S1, each pixel of the spectrogram represents 10 Hz and 10 ms of information; in order to capture sufficient information from the spectrogram, the size of each convolution kernel must be designed according to the image resolution of the spectrogram. The base skeleton network is therefore as follows:
[Table: configuration of the four-layer base skeleton network]
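For illustration, a minimal PyTorch sketch of this backbone is given below. It is a sketch under stated assumptions, not the patent's reference implementation: the kernel triples listed above are read as (output channels, kernel height, kernel width), the input is taken to be a single-channel 400×300 spectrogram, and 2×2 average pooling is used; all module and variable names are illustrative. The attention modules are sketched separately later in this description; the default placeholders below keep this sketch runnable on its own.

```python
# Four convolution blocks (conv -> batch norm -> ReLU -> average pooling),
# with hooks for the space-time and frequency attention modules.
import torch
from torch import nn


def conv_block(c_in, c_out, kernel):
    """Convolution -> batch normalization -> ReLU -> 2x2 average pooling."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=kernel, padding="same"),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.AvgPool2d(2),
    )


class EmotionNet(nn.Module):
    def __init__(self, n_classes=4, in_channels=1,
                 st_attention=None, freq_attention=None):
        super().__init__()
        self.block1 = conv_block(in_channels, 16, (16, 12))
        self.block2 = conv_block(16, 24, (12, 8))
        self.block3 = conv_block(24, 32, (7, 15))
        self.block4 = conv_block(32, 64, (5, 3))
        # space-time attention module plugged in after the third layer
        self.st_attention = st_attention if st_attention is not None else nn.Identity()
        # frequency attention module plugged in after the fourth layer; the
        # placeholder simply pools the feature map to a 64-dimensional vector
        self.freq_attention = freq_attention if freq_attention is not None else \
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                      # x: (B, 1, 400, 300)
        x = self.block3(self.block2(self.block1(x)))
        x = self.st_attention(x)               # (B, 32, 50, 37)
        x = self.block4(x)                     # (B, 64, 25, 18)
        x = self.freq_attention(x)             # (B, 64)
        return torch.softmax(self.classifier(x), dim=-1)
```

Instantiated as EmotionNet(), the sketch runs with the placeholders; the space-time and frequency attention modules sketched below can then be passed in through the two constructor arguments.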
In a piece of audio, emotion is hidden where the amount of speech information is abundant, and emotion is related to specific speech frequencies, so each pixel of the spectrogram contributes differently to emotion. How to effectively highlight the emotion-related regions and extract effective emotional frequency patterns is therefore the key to speech emotion recognition. Aiming at this key problem, the invention provides a spatio-temporal-frequency cascaded attention mechanism for effectively extracting emotional features: 1) spatio-temporal attention can highlight the emotion-related regions (spatio-temporal regions) in a lengthy spectrogram; 2) frequency attention captures emotional frequency features from the frequency distribution in the above candidate regions. The spatio-temporal attention and the frequency attention cooperate with each other and capture emotional features in the spectrogram step by step.
Since emotion occurs only at the moment of speech occurrence and not at the moment of silence, these areas of speech information are first addressed by spatio-temporal attention.
A feature map is obtained after the spectrogram is processed by each convolutional layer; as shown in FIG. 2, the feature map has three dimensions: a depth C representing the number of channels, a height H representing the frequency axis, and a width W representing the time axis; it can be viewed as a three-dimensional volume in time-space. The spatio-temporal attention module is composed of a channel attention module unit and a spatial attention module unit.
The channel attention module unit mainly highlights the channels highly related to emotion (in a convolutional neural network, each channel actually represents one feature type), and the spatial attention module unit strengthens the emotional spatial regions on the frequency-time (F-T) plane, as shown in FIG. 3.
In the channel attention module unit, an input feature map F ∈ ℝ^(C×H×W) is compressed by global average pooling over the H×W spatial plane to obtain a channel descriptor D_c ∈ ℝ^(C×1×1). Through the global average pooling operation, the channel descriptor D_c carries the spatial global information of each channel, so important channels can be highlighted and unimportant channels suppressed. The channel descriptor D_c is mapped to a channel attention weight A_c ∈ ℝ^(C×1×1) by two fully connected layers and a Sigmoid activation function. This weight A_c assigns high values to important channels, highlighting which channels, i.e. which feature types, are highly emotion-related. The channel attention weight A_c is multiplied point by point with the original feature map F to obtain a new feature map F_c. The process formulas are:
A_c = σ_s(W_2(W_1 · Avg_spatial(F) + B_1) + B_2)
F_c = A_c ⊗ F
where W_1 and B_1 are the weight coefficient and bias value of the first fully connected layer, W_2 and B_2 are the weight coefficient and bias value of the second fully connected layer, σ_s is the Sigmoid activation function, ⊗ denotes point-by-point multiplication, and Avg_spatial is the global average pooling function over the H×W spatial plane.
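As an illustration of these formulas, the following is a minimal PyTorch sketch of such a channel attention unit. The class name, the bottleneck ratio of the two fully connected layers and the batch-dimension handling are assumptions (the patent does not specify the hidden width); following the formula, only a final Sigmoid is applied, with no intermediate nonlinearity.

```python
# Channel attention: squeeze the spatial plane, weight the channels.
import torch
from torch import nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # W1, B1
        self.fc2 = nn.Linear(channels // reduction, channels)  # W2, B2

    def forward(self, x):                            # x = F: (B, C, H, W)
        d = x.mean(dim=(2, 3))                       # D_c = Avg_spatial(F): (B, C)
        a = torch.sigmoid(self.fc2(self.fc1(d)))     # A_c: (B, C)
        return x * a.view(x.size(0), -1, 1, 1)       # F_c = A_c (point-wise) F
```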
In the spatial attention module unit, the new feature map F_c is compressed by global average pooling along the C axis to obtain a spatial descriptor D_s ∈ ℝ^(1×H×W). By performing the global average pooling operation over the channels, the spatial descriptor D_s can highlight the spatially important information regions. A spatial attention weight A_s ∈ ℝ^(1×H×W) is then generated by a convolutional layer and a ReLU activation function; this weight A_s emphasizes the speech-information-rich regions on the H×W spatial plane. The spatial attention weight A_s is multiplied point by point with the feature map F_c to obtain a new feature map F_st. The process formulas are:
A_s = σ_r(W_{7×7} ∗ Avg_channel(F_c) + B_3)
F_st = A_s ⊗ F_c
where W_{7×7} is the convolution kernel weight coefficient of the convolutional layer (kernel size 7×7), B_3 is the bias value of this convolutional layer, ∗ is the convolution operation, σ_r is the ReLU activation function, and Avg_channel is the global average pooling function along the C axis.
F_st is the combined result of applying channel attention and spatial attention to the original feature map F; that is, each pixel of the original feature map obtains a corresponding weight. Notably, the descriptors of both sub-attention modules are obtained by an average pooling operation rather than a maximum pooling operation, which suppresses strong noise to some extent. Through space-time attention, the network can quickly find which channels and which spatial regions are rich in speech information, which are also the regions where emotion is hidden.
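A matching sketch of the spatial attention unit follows, under the same caveats (names and shapes are illustrative, not taken from the patent). Per the formulas above, a single 7×7 convolution and a ReLU turn the channel-pooled descriptor into the spatial attention weight.

```python
# Spatial attention: squeeze the channel axis, weight the H x W plane.
import torch
from torch import nn


class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)  # W_7x7, B_3

    def forward(self, x):                       # x = F_c: (B, C, H, W)
        d = x.mean(dim=1, keepdim=True)         # D_s = Avg_channel(F_c): (B, 1, H, W)
        a = torch.relu(self.conv(d))            # A_s, ReLU as stated in the text
        return x * a                            # F_st = A_s (point-wise) F_c
```

Chaining the two units, for example nn.Sequential(ChannelAttention(32), SpatialAttention()), gives one possible realization of the spatio-temporal attention module inserted after the third convolutional layer of the backbone sketch.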
Unlike the spatio-temporal attention module, the frequency attention module aims to learn specific emotional frequency patterns through the frequency distribution in the speech information regions. As a lightweight module, the frequency attention module can replace a conventional fully connected layer, avoiding overfitting to some extent. In the frequency attention, the invention mainly uses a depth-column convolution and a weighted grouped fully connected layer to extract frequency emotion features and channel emotion features respectively, as shown in FIG. 4.
Frequency attention is applied to the frequency axis to extract emotional frequency patterns. In a convolutional neural network, each channel represents one type of extracted feature. The output feature map F_4 ∈ ℝ^(C×H×W) of the fourth convolutional layer has C feature types; if a conventional convolution operation were used, the C feature types would have to be summed up and the individuality of the different feature types would be lost. The invention therefore uses a depth-column convolution to extract different frequency patterns for different channels. Depthwise convolution is a spatial convolution performed independently on each input channel, so the convolutions of different channels do not affect each other. To extract emotional frequency patterns along the frequency axis, the output feature map F_4 of the fourth convolutional layer is processed in the frequency attention module by a depth-column convolution whose kernel W_F is also the frequency weight learned by the network; this yields different frequency pattern results F_f for different channels, and the result still retains timing information. The process formula is:
F_f = W_F ⊛ F_4 + B_4
where B_4 is the bias value of the depth-column convolution and ⊛ is the depth-column convolution operation.
The channel weight mainly highlights the emotion-related feature types (channels). Taking F_f as input, the feature map is compressed by global average pooling along the time axis (W axis) to obtain a channel descriptor D'_c. To better highlight important channels, the channel descriptor D'_c is input to a fully connected layer with C neurons, and the frequency attention result F_FQ is calculated. The process formula is:
F_FQ = W_5 · Avg_time(F_f) + B_5
where W_5 is the weight coefficient of the fully connected layer, which is also the channel weight learned by the network, B_5 is its bias value, and Avg_time is the global average pooling function along the W axis.
Finally, F_FQ is input to a fully connected layer with 4 neurons and a softmax function to obtain the emotional state prediction result.
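A hedged sketch of the frequency attention module follows. The depth-column convolution is realized here as a depthwise (groups=C) convolution whose kernel spans the whole frequency axis, giving one learned frequency pattern per channel while keeping the time axis; the fully connected layer with C neurons then maps the time-pooled channel descriptor to F_FQ. The class and argument names are illustrative assumptions, and freq_bins must equal the height H of the fourth convolutional layer's output.

```python
# Frequency attention: per-channel frequency patterns, then channel weighting.
import torch
from torch import nn


class FrequencyAttention(nn.Module):
    def __init__(self, channels, freq_bins):
        super().__init__()
        # W_F, B_4: one (H, 1) column kernel per channel (depth-column convolution)
        self.depth_column = nn.Conv2d(channels, channels,
                                      kernel_size=(freq_bins, 1),
                                      groups=channels)
        # W_5, B_5: fully connected layer with C neurons (learned channel weights)
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                 # x = F_4: (B, C, H, W)
        f = self.depth_column(x)          # F_f: (B, C, 1, W), timing retained
        d = f.mean(dim=(2, 3))            # Avg_time(F_f): channel descriptor (B, C)
        return self.fc(d)                 # F_FQ: (B, C)
```

With the 400×300 input and 2×2 pooling assumed in the backbone sketch, the fourth layer's output height is 25, so FrequencyAttention(64, freq_bins=25) can replace the pooling placeholder there; its (batch, 64) output then feeds the final fully connected layer with 4 neurons and the softmax.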
In summary, the attention weights of the spatio-temporal attention come from the descriptors of the feature map itself, giving it strong adaptivity, while the attention weights of the frequency attention come from network-learned parameters, so emotional frequency patterns can be learned well. The spatio-temporal attention provides candidate speech information regions for the frequency attention, so that the frequency attention can quickly and accurately extract frequency emotion features; the two cooperate with each other, guide the network in searching for emotional features, and effectively improve the performance of the discrimination model.
S2, inputting a speech emotion database, wherein each piece of audio data in the database has a corresponding emotion label. Existing speech emotion databases are combined into one large database according to seven basic emotions (happiness, surprise, anger, disgust, fear, sadness and contempt), including the Emotional Voices, Emotional Voice, MELD, VoxCeleb, GEMEP, RML, eNTERFACE and IEMOCAP databases. The audio data of the speech emotion database is divided into a training set and a test set; all the audio data is processed to obtain spectrograms; and the spectrograms are input into the speech emotion state evaluation model for training and testing, as shown in FIG. 5.
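A training and testing sketch under stated assumptions is given below: the optimizer, learning rate, batch size and epoch count are illustrative choices, not values given in the patent. Since the backbone sketch above already ends in a softmax, the negative log-likelihood loss is applied to the log of the predicted probabilities.

```python
# Minimal train/test loop for the EmotionNet sketch (spectrogram tensors in,
# 4-way emotion labels out); hyperparameters are assumptions.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_and_test(model, x_train, y_train, x_test, y_test,
                   epochs=30, lr=1e-3, batch_size=32, device="cpu"):
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    nll = nn.NLLLoss()
    loader = DataLoader(TensorDataset(x_train, y_train),
                        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        model.train()
        for xb, yb in loader:
            opt.zero_grad()
            probs = model(xb.to(device))                    # (B, 4) probabilities
            nll(torch.log(probs + 1e-8), yb.to(device)).backward()
            opt.step()
    # testing: report accuracy on the held-out set
    model.eval()
    with torch.no_grad():
        pred = model(x_test.to(device)).argmax(dim=1).cpu()
    return (pred == y_test).float().mean().item()
```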
And S3, processing the audio data to be evaluated to obtain a spectrogram, and inputting the spectrogram into the trained and tested speech emotion state evaluation model to evaluate the emotional state. The method can be deployed on a client for forward inference and fed back to an intelligent question-answering system, helping the system better understand semantics and adjust its text and speech output, so that the answers of the voice interaction system better meet the interlocutor's needs.
In steps S2 and S3, the spectrogram is obtained by performing segmentation, framing, short-time Fourier transform and normalization on the audio data.
A specific example is as follows (a code sketch of steps A to E is given after this list):
A. Segmentation: the audio data is segmented so that each sub-audio segment is at most 3 s long; the label of each sub-audio segment is the label of the original long audio, and at prediction time the result for the original long audio is the average of the prediction results of its sub-audio segments;
B. Framing: each sub-audio segment is framed using a Hamming window with a window length of 40 ms and a time shift of 10 ms; for data augmentation, a Hamming window with a window length of 20 ms and a time shift of 10 ms is also used, which doubles the data;
C. Short-time Fourier transform: a short-time Fourier transform is applied to the framed audio to obtain the spectrogram;
D. Normalization: logarithm, mean subtraction and variance normalization operations are applied to the spectrogram;
E. Fixed length: since the network input must be of fixed size, the frequency axis of each spectrogram takes 400 points (covering 4 kHz, the frequency range of human speech) and the time axis takes 300 points (representing 3 s; shorter segments are zero-padded).
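The following sketch walks through steps A to E, assuming a 16 kHz sampling rate (not stated in the patent) so that an FFT length of 1600 yields the 10 Hz per frequency bin resolution mentioned above; function names and constants are illustrative, and the 20 ms data-augmentation window of step B is omitted for brevity.

```python
# Spectrogram preprocessing sketch: segment, frame, STFT, normalize, fix size.
import numpy as np
from scipy.signal import stft

SR = 16000                  # assumed sampling rate
SEG_LEN = 3 * SR            # step A: segments of at most 3 s
WIN = int(0.040 * SR)       # step B: 40 ms Hamming window
HOP = int(0.010 * SR)       # 10 ms time shift
N_FFT = 1600                # 16000 / 1600 = 10 Hz per frequency bin
N_FREQ, N_TIME = 400, 300   # step E: 4 kHz x 3 s fixed-size input


def audio_to_spectrograms(audio):
    """Split a waveform into <=3 s segments and return one normalized
    400x300 log spectrogram per segment."""
    segments = [audio[i:i + SEG_LEN] for i in range(0, len(audio), SEG_LEN)]
    specs = []
    for seg in segments:
        # steps B-C: Hamming-window framing + short-time Fourier transform
        _, _, z = stft(seg, fs=SR, window="hamming",
                       nperseg=WIN, noverlap=WIN - HOP, nfft=N_FFT)
        logspec = np.log(np.abs(z)[:N_FREQ, :] + 1e-8)        # keep 0-4 kHz, step D: log
        logspec = (logspec - logspec.mean()) / (logspec.std() + 1e-8)  # step D: normalize
        out = np.zeros((N_FREQ, N_TIME), dtype=np.float32)    # step E: pad/crop time axis
        t = min(N_TIME, logspec.shape[1])
        out[:, :t] = logspec[:, :t]
        specs.append(out)
    return specs
```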
Traditional acoustic statistical features are not used here: on the one hand, they are global features that easily smooth away important temporal and local information; on the other hand, they are difficult to combine with convolutional neural networks, which have powerful representation capability. Therefore, the method adopts the spectrogram as the input of the speech emotion state evaluation model. The spectrogram is a time-frequency representation that shows how speech energy changes with time and frequency, and it preserves temporal and local information while clearly reflecting the harmonic structure and frequency content of speech.
The method can help the voice interaction system to evaluate the emotion state of the interlocutor in real time during man-machine conversation, and further feed the emotion state back to the intelligent question-answering system, so that the auxiliary system can better understand the semantics, and correct the text and voice output of the system, and the answer of the voice interaction system is more suitable for the use requirements of the interlocutor. In addition, the method can be applied to a VR virtual chat room, and helps the virtual character projected by the interlocutor to have rich expression by acquiring the emotional state of the speaker, so that the VR virtual chat meets the requirement of virtual reality. The method can also be applied to a VR inquiry system to assist doctors in obtaining the emotional state information of patients, so that the time of the doctors is saved, and the medical resources are saved.
Example two
In order to implement the attention-based speech emotion state assessment method of the first embodiment, this embodiment provides an attention-based speech emotion state evaluation device, comprising:
a speech emotion state evaluation model building module, used for setting the input as a spectrogram; building a basic framework with four convolutional layers and setting the convolution kernel size of each convolutional layer; each convolutional layer is followed by a batch normalization layer, a ReLU activation function and an average pooling operation; a space-time attention module is connected after the third convolutional layer; a frequency attention module is connected after the fourth convolutional layer; finally, a softmax layer is connected to obtain the emotional state prediction result;
a speech emotion state evaluation model training and testing module, used for inputting a speech emotion database in which each piece of audio data has a corresponding emotion label; dividing the audio data of the speech emotion database into a training set and a test set; processing all the audio data to obtain spectrograms; and inputting the spectrograms into the speech emotion state evaluation model for training and testing;
and a speech emotion state evaluation module, used for processing the audio data to be evaluated to obtain a spectrogram and inputting the spectrogram into the trained and tested speech emotion state evaluation model to evaluate the emotional state.
EXAMPLE III
The present embodiment is a storage medium storing a computer program, which when executed by a processor causes the processor to execute the attention-based speech emotional state assessment method according to the first embodiment.
Example four
The embodiment of the invention relates to a computing device, which comprises a processor and a memory for storing a program executable by the processor, and is characterized in that when the processor executes the program stored in the memory, the attention-based speech emotion state assessment method described in the first embodiment is implemented.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such modifications are intended to be included in the scope of the present invention.

Claims (6)

1. A speech emotion state assessment method based on attention is characterized in that: the method comprises the following steps:
s1, building a speech emotion state evaluation model: setting an input as a spectrogram; building a basic framework by adopting four layers of convolution layers, and respectively setting the convolution kernel size of each layer of convolution layer; each convolutional layer is followed by a batch normalization layer, a RELU activation function and an average pooling operation; a space-time attention module is connected behind the third layer of convolution layer; connecting a frequency attention module behind the fourth layer of the convolution layer; finally, connecting a softmax layer to obtain an emotional state prediction result;
s2, inputting a voice emotion database, wherein each audio data in the voice emotion database is provided with a corresponding emotion label; dividing audio data of a speech emotion database into a training set and a test set; processing all audio data respectively to obtain a spectrogram; inputting the spectrogram into a speech emotion state evaluation model for training and testing;
and S3, processing the audio data to be evaluated to obtain a spectrogram, and inputting the spectrogram into the trained and tested speech emotion state evaluation model to evaluate the emotional state.
2. The attention-based speech emotional state assessment method according to claim 1, wherein: in the step S1, the spectrogram obtains a feature map after each layer of convolution layer processing; the characteristic diagram has three dimensions including a thickness C representing the number of channels, a height H representing the frequency axis, and a width W representing the time axis; the space-time attention module consists of a channel attention module unit and a space attention module unit;
in the channel attention module unit, an input feature map F ∈ ℝ^(C×H×W) is subjected to global average pooling compression over the H×W spatial plane to obtain a channel descriptor D_c ∈ ℝ^(C×1×1); the channel descriptor D_c is mapped to a channel attention weight A_c ∈ ℝ^(C×1×1) by two fully connected layers and a Sigmoid activation function; the channel attention weight A_c is multiplied point by point with the original feature map F to obtain a new feature map F_c; the process formulas are:
A_c = σ_s(W_2(W_1 · Avg_spatial(F) + B_1) + B_2)
F_c = A_c ⊗ F
wherein W_1 and B_1 are respectively the weight coefficient and bias value of the first fully connected layer, W_2 and B_2 are respectively the weight coefficient and bias value of the second fully connected layer, σ_s is the Sigmoid activation function, ⊗ denotes point-by-point multiplication, and Avg_spatial refers to the global average pooling function along the H×W spatial plane;
in the spatial attention module unit, the new feature map F_c is subjected to global average pooling compression along the C axis to obtain a spatial descriptor D_s ∈ ℝ^(1×H×W); a spatial attention weight A_s ∈ ℝ^(1×H×W) is generated by a convolutional layer and a ReLU activation function; the spatial attention weight A_s is multiplied point by point with the feature map F_c to obtain a new feature map F_st; the process formulas are:
A_s = σ_r(W_{7×7} ∗ Avg_channel(F_c) + B_3)
F_st = A_s ⊗ F_c
wherein W_{7×7} is the convolution kernel weight coefficient of the convolutional layer with a kernel size of 7×7, B_3 is the bias value of this convolutional layer, ∗ is the convolution operation symbol, σ_r is the ReLU activation function, and Avg_channel is the global average pooling function along the C axis;
in the frequency attention module, the output feature map F_4 ∈ ℝ^(C×H×W) of the fourth convolutional layer is subjected to depth-column convolution processing with a depth-column convolution kernel W_F to obtain different frequency pattern results F_f for different channels; the process formula is:
F_f = W_F ⊛ F_4 + B_4
wherein B_4 is the bias value of the depth-column convolution and ⊛ is the depth-column convolution operation symbol;
the feature map F_f is subjected to global average pooling compression along the time axis (W axis) to obtain a channel descriptor D'_c; the channel descriptor D'_c is input to a fully connected layer having C neurons, and the frequency attention result F_FQ is calculated; the process formula is:
F_FQ = W_5 · Avg_time(F_f) + B_5
wherein W_5 is the weight coefficient of the fully connected layer, which is also the channel weight learned by the network, B_5 is the bias value of the fully connected layer, and Avg_time refers to the global average pooling function along the W axis;
and finally, the result is input to a fully connected layer with 4 neurons and a softmax function to obtain the emotional state prediction result.
3. The attention-based speech emotional state assessment method according to claim 1, wherein: in the steps S2 and S3, the spectrogram is obtained by performing segmentation, framing, short-time Fourier transform, and normalization on the audio data.
4. An attention-based speech emotion state evaluation device characterized in that: the method comprises the following steps:
the voice emotion state evaluation model building module is used for setting and inputting a voice spectrogram; building a basic framework by adopting four layers of convolution layers, and respectively setting the convolution kernel size of each layer of convolution layer; each convolutional layer is followed by a batch normalization layer, a RELU activation function and an average pooling operation; a space-time attention module is connected behind the third layer of the convolution layer; connecting a frequency attention module behind the fourth layer of the convolution layer; finally, connecting a softmax layer to obtain an emotional state prediction result;
the voice emotion state evaluation model training and testing module is used for inputting a voice emotion database, and each audio data in the voice emotion database is provided with a corresponding emotion label; dividing audio data of a speech emotion database into a training set and a test set; processing all audio data respectively to obtain a spectrogram; inputting the spectrogram into a speech emotion state evaluation model for training and testing;
and the voice emotion state evaluation module is used for processing the audio data to be evaluated to obtain a spectrogram and inputting the spectrogram into the voice emotion state evaluation model which completes training and testing so as to evaluate the emotion state.
5. A storage medium storing a computer program which, when executed by a processor, causes the processor to execute the attention-based speech emotional state assessment method according to any one of claims 1 to 3.
6. A computing device comprising a processor and a memory for storing processor-executable programs, wherein the processor, when executing a program stored in the memory, implements the attention-based speech emotional state assessment method of any of claims 1-3.
CN202010143924.2A 2020-03-04 2020-03-04 Attention-based speech emotion state evaluation method, device, medium and equipment Active CN111402928B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010143924.2A CN111402928B (en) 2020-03-04 2020-03-04 Attention-based speech emotion state evaluation method, device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010143924.2A CN111402928B (en) 2020-03-04 2020-03-04 Attention-based speech emotion state evaluation method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN111402928A CN111402928A (en) 2020-07-10
CN111402928B true CN111402928B (en) 2022-06-14

Family

ID=71430481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010143924.2A Active CN111402928B (en) 2020-03-04 2020-03-04 Attention-based speech emotion state evaluation method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN111402928B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933188B (en) * 2020-09-14 2021-02-05 电子科技大学 Sound event detection method based on convolutional neural network
CN112581979B (en) * 2020-12-10 2022-07-12 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN112735477B (en) * 2020-12-31 2023-03-17 沈阳康慧类脑智能协同创新中心有限公司 Voice emotion analysis method and device
CN112581980B (en) * 2021-02-26 2021-05-25 中国科学院自动化研究所 Method and network for time-frequency channel attention weight calculation and vectorization
CN114343670B (en) * 2022-01-07 2023-07-14 北京师范大学 Interpretation information generation method and electronic equipment
CN115206305B (en) * 2022-09-16 2023-01-20 北京达佳互联信息技术有限公司 Semantic text generation method and device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5866728B2 (en) * 2011-10-14 2016-02-17 サイバーアイ・エンタテインメント株式会社 Knowledge information processing server system with image recognition system
KR102210908B1 (en) * 2017-10-17 2021-02-03 주식회사 네오펙트 Method, apparatus and computer program for providing cognitive training
CN108682431B (en) * 2018-05-09 2021-08-03 武汉理工大学 Voice emotion recognition method in PAD three-dimensional emotion space
CN110059587A (en) * 2019-03-29 2019-07-26 西安交通大学 Human bodys' response method based on space-time attention
CN110610168B (en) * 2019-09-20 2021-10-26 合肥工业大学 Electroencephalogram emotion recognition method based on attention mechanism
CN110853630B (en) * 2019-10-30 2022-02-18 华南师范大学 Lightweight speech recognition method facing edge calculation

Also Published As

Publication number Publication date
CN111402928A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111402928B (en) Attention-based speech emotion state evaluation method, device, medium and equipment
CN108717856B (en) Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
Venkataramanan et al. Emotion recognition from speech
CN108597539B (en) Speech emotion recognition method based on parameter migration and spectrogram
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN110400579B (en) Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
US11281945B1 (en) Multimodal dimensional emotion recognition method
WO2018227781A1 (en) Voice recognition method, apparatus, computer device, and storage medium
US20230267916A1 (en) Text-based virtual object animation generation method, apparatus, storage medium, and terminal
CN107452379B (en) Dialect language identification method and virtual reality teaching method and system
CN111312245B (en) Voice response method, device and storage medium
CN109377981B (en) Phoneme alignment method and device
Noroozi et al. Supervised vocal-based emotion recognition using multiclass support vector machine, random forests, and adaboost
CN113421547B (en) Voice processing method and related equipment
CN107972028A (en) Man-machine interaction method, device and electronic equipment
Wang et al. Research on speech emotion recognition technology based on deep and shallow neural network
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
CN114863938A (en) Bird language identification method and system based on attention residual error and feature fusion
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
CN112749567A (en) Question-answering system based on reality information environment knowledge graph
Liu et al. Learning salient features for speech emotion recognition using CNN
Huilian et al. Speech emotion recognition based on BLSTM and CNN feature fusion
CN115985320A (en) Intelligent device control method and device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant