CN111402928B - Attention-based speech emotion state evaluation method, device, medium and equipment - Google Patents

Attention-based speech emotion state evaluation method, device, medium and equipment

Info

Publication number
CN111402928B
CN111402928B CN202010143924.2A
Authority
CN
China
Prior art keywords
layer
attention
spectrogram
convolution
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010143924.2A
Other languages
Chinese (zh)
Other versions
CN111402928A (en)
Inventor
李淑贞
邢晓芬
徐向民
郭锴凌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010143924.2A priority Critical patent/CN111402928B/en
Publication of CN111402928A publication Critical patent/CN111402928A/en
Application granted
Publication of CN111402928B publication Critical patent/CN111402928B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention provides an attention-based speech emotion state assessment method, device, medium and equipment. The method comprises the following steps: S1, building a speech emotion state evaluation model: a basic framework is built with four convolutional layers; each convolutional layer is followed by a batch normalization layer, a ReLU activation function and an average pooling operation; a space-time attention module is connected after the third convolutional layer; a frequency attention module is connected after the fourth convolutional layer; finally, a softmax layer is connected; S2, inputting a speech emotion database to train and test the speech emotion state evaluation model; and S3, processing the audio data to be evaluated to obtain a spectrogram, and inputting the spectrogram into the speech emotion state evaluation model to evaluate the emotional state. The invention adopts a novel lightweight attention mechanism in which the space-time attention and the frequency attention cooperate with each other, so that emotional features are extracted quickly and accurately from a long audio clip, and the effect and performance of the emotional state evaluation model are effectively improved.

Description

Attention-based speech emotion state evaluation method, device, medium and equipment
Technical Field
The invention relates to the technical field of voice processing, in particular to a voice emotion state assessment method, device, medium and equipment based on attention.
Background
With the development of society and the advancement of science and technology, human-computer interaction technology has quietly entered many aspects of our lives, such as smart homes, mobile phones, vehicles, smart wearables and robots. In recent years, human-computer interaction technology has changed dramatically: people are no longer satisfied with the GUI (graphical user interface) era and increasingly expect a natural, conversation-like experience. As a new interaction technology, the VUI (voice user interface) is a human-computer interaction mode centered on human intention, an intelligent interaction experience with natural conversation at its core. Voice interaction is more efficient and more naturally expressive than interface-based input, as in the voice assistants Siri, Alexa and Cortana, VR virtual chat rooms, VR medical consultation systems, and the like. However, these tools do not analyze the emotional state of the interlocutor during human-computer interaction. Because the same sentence can express different meanings depending on the interlocutor's emotional state, acquiring the emotional state of the interlocutor is very important for the machine to understand semantics accurately.
The traditional speech emotion recognition method is based on acoustic statistical features and machine learning models. Acoustic statistical features commonly used for emotion recognition include mel-frequency cepstral coefficients (MFCCs), the GeMAPS feature set, vocal prosodic features, the BoAW feature set, and the like. Machine learning models applied to these acoustic statistical features include hidden Markov models, Gaussian mixture models, decision trees, and the like. However, emotion is a high-level semantic concept, and conventional acoustic statistical features have limited capability to represent it, which even limits model performance to a certain extent.
In recent years, thanks to the strong nonlinear representation capability of deep networks, deep learning methods have gradually been introduced into the field of speech emotion recognition. Nonlinear deep emotional features can be extracted from acoustic statistical features through CNN, DNN, DBN and LSTM networks, improving emotion representation capability, and the deep emotional features are then fed to machine learning models such as the ELM (extreme learning machine) and SVM (support vector machine) for decision.
Because conventional acoustic statistical features have limited representation capability and global statistics easily lose local information, researchers have turned their attention to the spectrogram. The spectrogram is a time-frequency representation that shows how speech energy changes with time and frequency; it preserves temporal and local information while clearly reflecting the harmonic structure and frequency content of speech. Speech emotion recognition based on the spectrogram and convolutional neural networks has therefore become a recent hot topic. How to quickly and effectively organize and extract emotional features from a lengthy spectrogram has become a key technical problem in the field of speech emotion recognition.
Since emotion accompanies the content spoken by the speaker, in a piece of audio emotion is hidden only in the frames rich in speech information, not in the silent frames. Blindly searching a lengthy spectrogram for emotion-related regions and features without any guidance is difficult and time-consuming; an attention mechanism can solve this problem. The attention mechanism is a weighting mechanism that can highlight important information and suppress irrelevant information without cutting the audio.
The attention mechanism commonly used for speech emotion recognition is global soft attention. Its weights are obtained by applying linear and nonlinear transformations to the original sequence and then normalizing, and the final result is the sum of the original sequence multiplied point by point by the corresponding weights. Because the original sequence feature vectors in a speech emotion recognition convolutional network are very large, the corresponding learnable parameters are also very large; this attention mechanism therefore requires a huge amount of computation and cannot easily be ported to mobile terminals.
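For reference, the following is a minimal sketch of one common formulation of global soft attention, given only to illustrate the computation just described (linear and nonlinear transforms, normalization, weighted sum); it is not the attention mechanism proposed by this invention, and all names are illustrative.

```python
# Minimal global soft attention over a sequence of frame features; a generic
# formulation for illustration only, not the mechanism proposed below.
import torch
from torch import nn


class GlobalSoftAttention(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)   # linear transformation
        self.score = nn.Linear(feat_dim, 1)         # scoring vector

    def forward(self, h):                           # h: (B, T, feat_dim)
        e = self.score(torch.tanh(self.proj(h)))    # nonlinear transform: (B, T, 1)
        a = torch.softmax(e, dim=1)                 # normalized attention weights
        return (a * h).sum(dim=1)                   # weighted sum: (B, feat_dim)
```

With the large feature maps produced by a convolutional front end, feat_dim and hence the projection matrices become very large, which is the computational burden noted above.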
In addition, there are some unconventional attention mechanisms, such as those based on maximum pooling and special convolution kernel sizes, which also require a large amount of computation and tend to hurt discrimination performance because of the noise introduced by the maximum pooling operation.
How to design a lightweight attention mechanism which effectively emphasizes emotional features is also a new key technical hotspot in the field of speech emotion recognition.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention aims to provide an attention-based method, device, medium and equipment for evaluating speech emotional states. The invention adopts a novel lightweight attention mechanism in which the space-time attention and the frequency attention cooperate with each other, so that emotional features are extracted quickly and accurately from a long audio clip, and the effect and performance of the emotional state evaluation model are effectively improved.
In order to achieve the purpose, the invention is realized by the following technical scheme: a speech emotion state assessment method based on attention is characterized in that: the method comprises the following steps:
S1, building a speech emotion state evaluation model: setting the input as a spectrogram; building a basic framework with four convolutional layers and setting the convolution kernel size of each convolutional layer; each convolutional layer is followed by a batch normalization layer, a ReLU activation function and an average pooling operation; a space-time attention module is connected after the third convolutional layer; a frequency attention module is connected after the fourth convolutional layer; finally, a softmax layer is connected to obtain the emotional state prediction result;
S2, inputting a speech emotion database, wherein each piece of audio data in the database has a corresponding emotion label; dividing the audio data of the speech emotion database into a training set and a test set; processing all the audio data to obtain spectrograms; and inputting the spectrograms into the speech emotion state evaluation model for training and testing;
and S3, processing the audio data to be evaluated to obtain a spectrogram, and inputting the spectrogram into the trained and tested speech emotion state evaluation model to evaluate the emotional state.
Preferably, in step S1, a feature map is obtained after the spectrogram is processed by each convolutional layer; the feature map has three dimensions: a depth C representing the number of channels, a height H representing the frequency axis, and a width W representing the time axis; the space-time attention module consists of a channel attention module unit and a spatial attention module unit.
In the channel attention module unit, an input feature map F ∈ ℝ^(C×H×W) is compressed by global average pooling over the H×W spatial plane to obtain a channel descriptor D_c ∈ ℝ^(C×1×1); the channel descriptor D_c is mapped to a channel attention weight A_c ∈ ℝ^(C×1×1) by two fully connected layers and a Sigmoid activation function; the channel attention weight A_c is multiplied point by point with the original feature map F to obtain a new feature map F_c. The process formulas are:
A_c = σ_s(W_2(W_1 · Avg_spatial(F) + B_1) + B_2)
F_c = A_c ⊗ F
where W_1 and B_1 are the weight coefficient and bias value of the first fully connected layer, W_2 and B_2 are the weight coefficient and bias value of the second fully connected layer, σ_s is the Sigmoid activation function, ⊗ denotes point-by-point multiplication, and Avg_spatial is the global average pooling function over the H×W spatial plane.
In the spatial attention module unit, the new feature map F_c is compressed by global average pooling along the C axis to obtain a spatial descriptor D_s ∈ ℝ^(1×H×W); a spatial attention weight A_s ∈ ℝ^(1×H×W) is generated by a convolutional layer and a ReLU activation function; the spatial attention weight A_s is multiplied point by point with the feature map F_c to obtain a new feature map F_st. The process formulas are:
A_s = σ_r(W_{7×7} ∗ Avg_channel(F_c) + B_3)
F_st = A_s ⊗ F_c
where W_{7×7} is the convolution kernel weight coefficient of the convolutional layer (kernel size 7×7), B_3 is the bias value of this convolutional layer, ∗ is the convolution operation, σ_r is the ReLU activation function, and Avg_channel is the global average pooling function along the C axis.
In the frequency attention module, the output feature map F_4 ∈ ℝ^(C×H×W) of the fourth convolutional layer is processed by a depth-column convolution with a depth-column convolution kernel W_F to obtain different frequency pattern results F_f for different channels. The process formula is:
F_f = W_F ⊛ F_4 + B_4
where B_4 is the bias value of the depth-column convolution and ⊛ is the depth-column convolution operation.
The feature map F_f is then compressed by global average pooling along the time axis (W axis) to obtain a channel descriptor D'_c; the channel descriptor D'_c is input to a fully connected layer with C neurons to calculate the frequency attention result F_FQ. The process formula is:
F_FQ = W_5 · Avg_time(F_f) + B_5
where W_5 is the weight coefficient of the fully connected layer, which is also the channel weight learned by the network, B_5 is its bias value, and Avg_time is the global average pooling function along the W axis.
Finally, F_FQ is input to a fully connected layer with 4 neurons and a softmax function to obtain the emotional state prediction result.
Preferably, in steps S2 and S3, the spectrogram is obtained by performing segmentation, framing, short-time Fourier transform and normalization on the audio data. The spectrogram shows, in the form of a time-frequency representation, how speech energy changes with time and frequency, and preserves temporal and local information while clearly reflecting the harmonic structure and frequency content of speech.
An attention-based speech emotion state evaluation device, characterized by comprising:
a speech emotion state evaluation model building module, used for setting the input as a spectrogram; building a basic framework with four convolutional layers and setting the convolution kernel size of each convolutional layer; each convolutional layer is followed by a batch normalization layer, a ReLU activation function and an average pooling operation; a space-time attention module is connected after the third convolutional layer; a frequency attention module is connected after the fourth convolutional layer; finally, a softmax layer is connected to obtain the emotional state prediction result;
a speech emotion state evaluation model training and testing module, used for inputting a speech emotion database in which each piece of audio data has a corresponding emotion label; dividing the audio data of the speech emotion database into a training set and a test set; processing all the audio data to obtain spectrograms; and inputting the spectrograms into the speech emotion state evaluation model for training and testing;
and a speech emotion state evaluation module, used for processing the audio data to be evaluated to obtain a spectrogram and inputting the spectrogram into the trained and tested speech emotion state evaluation model to evaluate the emotional state.
A storage medium, wherein the storage medium stores a computer program which, when executed by a processor, causes the processor to execute the above-described attention-based speech emotional state assessment method.
A computing device comprising a processor and a memory for storing a program executable by the processor, wherein the processor implements the above-described attention-based speech emotional state assessment method when executing the program stored in the memory.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. In the invention, the space-time attention can highlight the emotion-related regions (spatio-temporal regions) in a lengthy spectrogram, and the frequency attention captures emotional frequency features according to the frequency distribution in the above candidate regions; the invention adopts a novel lightweight attention mechanism in which the space-time attention and the frequency attention cooperate with each other, so that emotional features are extracted quickly and accurately from a long audio clip, and the effect and performance of the emotional state evaluation model are effectively improved;
2. The invention can help a voice interaction system evaluate the emotional state of the interlocutor in real time during human-computer conversation and feed it back to an intelligent question-answering system, helping the system better understand semantics and adjust its text and speech output, so that the answers of the voice interaction system better meet the interlocutor's needs.
Drawings
FIG. 1 is a schematic diagram of a speech emotional state assessment model of the present invention;
FIG. 2 is a schematic diagram of a feature map obtained after processing each convolution layer of a spectrogram in the present invention;
FIG. 3 is a schematic diagram of the spatiotemporal attention module of the present invention;
FIG. 4 is a schematic diagram of a frequency attention module of the present invention;
FIG. 5 is a flow chart of the present invention for training and testing a speech emotional state assessment model.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Example one
In the attention-based speech emotion state assessment method, the input is audio data. A spectrogram is extracted from the audio data by short-time Fourier transform and fed into the speech emotion state evaluation model for training. The spectrogram shows, in the form of a time-frequency representation, how speech energy changes with time and frequency, and preserves temporal and local information while clearly reflecting the harmonic structure and frequency content of speech. Because convolutional neural networks have strong image representation capability, the main network structure of the speech emotion state evaluation model uses four convolutional layers. In order to extract emotion-related features from the spectrogram and suppress irrelevant information, the invention designs a novel lightweight attention mechanism: spatio-temporal-frequency attention. Unlike previous attention mechanisms, spatio-temporal-frequency attention is a cascaded attention mechanism consisting of spatio-temporal attention and frequency attention. Since emotion is hidden in the spoken segments of the audio, spatio-temporal attention highlights these speech information regions (speech information spatio-temporal regions) through channel attention and spatial attention, suppressing unvoiced regions and noise regions. Research shows that emotion is closely related to speech frequency, so frequency attention acquires emotion-related frequency combination features through frequency-channel attention within the speech information candidate regions. The spatio-temporal attention and the frequency attention cooperate with each other, helping the neural network extract emotional features quickly and accurately from a long audio clip and effectively improving the effect and performance of the emotional state assessment model.
The method comprises the following steps:
S1, building a speech emotion state evaluation model: as shown in FIG. 1, a basic framework is built with four convolutional layers, and the convolution kernel size of each convolutional layer is set; for example, the kernel configurations of the four convolutional layers are 16×16×12, 24×12×8, 32×7×15 and 64×5×3; each convolutional layer is followed by a batch normalization layer, a ReLU activation function and an average pooling operation; a space-time attention module is connected after the third convolutional layer; a frequency attention module is connected after the fourth convolutional layer; finally, a softmax layer is connected to obtain the emotional state prediction result;
because the spectrogram in the form of an RGB image is input, the method mainly adopts a convolutional neural network. In order to be transplanted to the client, the method adopts a small network, namely only four convolutional layers. In order to rapidly extract emotional features from a lengthy spectrogram, the invention provides space-time-frequency cascade attention. Spatiotemporal attention can focus on a speech information region (speech information spatiotemporal region) from a lengthy spectrogram, and frequency attention can extract emotional frequency features from the speech information candidate region. The two are mutually matched, the auxiliary model extracts emotional characteristics quickly and accurately, and the accuracy of the model is improved.
In step S1, each pixel of the spectrogram represents 10 Hz and 10 ms of information; in order to capture sufficient information from the spectrogram, the size of each convolution kernel must be designed according to the image resolution of the spectrogram. The base skeleton network is therefore as follows:
[Table: configuration of the four-layer base skeleton network]
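For illustration, a minimal PyTorch sketch of this backbone is given below. It is a sketch under stated assumptions, not the patent's reference implementation: the kernel triples listed above are read as (output channels, kernel height, kernel width), the input is taken to be a single-channel 400×300 spectrogram, and 2×2 average pooling is used; all module and variable names are illustrative. The attention modules are sketched separately later in this description; the default placeholders below keep this sketch runnable on its own.

```python
# Four convolution blocks (conv -> batch norm -> ReLU -> average pooling),
# with hooks for the space-time and frequency attention modules.
import torch
from torch import nn


def conv_block(c_in, c_out, kernel):
    """Convolution -> batch normalization -> ReLU -> 2x2 average pooling."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=kernel, padding="same"),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.AvgPool2d(2),
    )


class EmotionNet(nn.Module):
    def __init__(self, n_classes=4, in_channels=1,
                 st_attention=None, freq_attention=None):
        super().__init__()
        self.block1 = conv_block(in_channels, 16, (16, 12))
        self.block2 = conv_block(16, 24, (12, 8))
        self.block3 = conv_block(24, 32, (7, 15))
        self.block4 = conv_block(32, 64, (5, 3))
        # space-time attention module plugged in after the third layer
        self.st_attention = st_attention if st_attention is not None else nn.Identity()
        # frequency attention module plugged in after the fourth layer; the
        # placeholder simply pools the feature map to a 64-dimensional vector
        self.freq_attention = freq_attention if freq_attention is not None else \
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                      # x: (B, 1, 400, 300)
        x = self.block3(self.block2(self.block1(x)))
        x = self.st_attention(x)               # (B, 32, 50, 37)
        x = self.block4(x)                     # (B, 64, 25, 18)
        x = self.freq_attention(x)             # (B, 64)
        return torch.softmax(self.classifier(x), dim=-1)
```

Instantiated as EmotionNet(), the sketch runs with the placeholders; the space-time and frequency attention modules sketched below can then be passed in through the two constructor arguments.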
In a piece of audio, emotion is hidden where the amount of speech information is abundant, and emotion is related to specific speech frequencies, so each pixel of the spectrogram contributes differently to emotion. How to effectively highlight the emotion-related regions and extract effective emotional frequency patterns is therefore the key to speech emotion recognition. Aiming at this key problem, the invention provides a spatio-temporal-frequency cascaded attention mechanism for effectively extracting emotional features: 1) spatio-temporal attention can highlight the emotion-related regions (spatio-temporal regions) in a lengthy spectrogram; 2) frequency attention captures emotional frequency features from the frequency distribution in the above candidate regions. The spatio-temporal attention and the frequency attention cooperate with each other and capture emotional features in the spectrogram step by step.
Since emotion occurs only at the moment of speech occurrence and not at the moment of silence, these areas of speech information are first addressed by spatio-temporal attention.
A feature map is obtained after the spectrogram is processed by each convolutional layer; as shown in FIG. 2, the feature map has three dimensions: a depth C representing the number of channels, a height H representing the frequency axis, and a width W representing the time axis; it can be viewed as a three-dimensional volume in time-space. The spatio-temporal attention module is composed of a channel attention module unit and a spatial attention module unit.
The channel attention module unit mainly highlights the channels highly related to emotion (in a convolutional neural network, each channel actually represents one feature type), and the spatial attention module unit strengthens the emotional spatial regions on the frequency-time (F-T) plane, as shown in FIG. 3.
In the channel attention module unit, an input feature map F ∈ ℝ^(C×H×W) is compressed by global average pooling over the H×W spatial plane to obtain a channel descriptor D_c ∈ ℝ^(C×1×1). Through the global average pooling operation, the channel descriptor D_c carries the spatial global information of each channel, so important channels can be highlighted and unimportant channels suppressed. The channel descriptor D_c is mapped to a channel attention weight A_c ∈ ℝ^(C×1×1) by two fully connected layers and a Sigmoid activation function. This weight A_c assigns high values to important channels, highlighting which channels, i.e. which feature types, are highly emotion-related. The channel attention weight A_c is multiplied point by point with the original feature map F to obtain a new feature map F_c. The process formulas are:
A_c = σ_s(W_2(W_1 · Avg_spatial(F) + B_1) + B_2)
F_c = A_c ⊗ F
where W_1 and B_1 are the weight coefficient and bias value of the first fully connected layer, W_2 and B_2 are the weight coefficient and bias value of the second fully connected layer, σ_s is the Sigmoid activation function, ⊗ denotes point-by-point multiplication, and Avg_spatial is the global average pooling function over the H×W spatial plane.
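As an illustration of these formulas, the following is a minimal PyTorch sketch of such a channel attention unit. The class name, the bottleneck ratio of the two fully connected layers and the batch-dimension handling are assumptions (the patent does not specify the hidden width); following the formula, only a final Sigmoid is applied, with no intermediate nonlinearity.

```python
# Channel attention: squeeze the spatial plane, weight the channels.
import torch
from torch import nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # W1, B1
        self.fc2 = nn.Linear(channels // reduction, channels)  # W2, B2

    def forward(self, x):                            # x = F: (B, C, H, W)
        d = x.mean(dim=(2, 3))                       # D_c = Avg_spatial(F): (B, C)
        a = torch.sigmoid(self.fc2(self.fc1(d)))     # A_c: (B, C)
        return x * a.view(x.size(0), -1, 1, 1)       # F_c = A_c (point-wise) F
```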
In the spatial attention module unit, the new feature map F_c is compressed by global average pooling along the C axis to obtain a spatial descriptor D_s ∈ ℝ^(1×H×W). By performing the global average pooling operation over the channels, the spatial descriptor D_s can highlight the spatially important information regions. A spatial attention weight A_s ∈ ℝ^(1×H×W) is then generated by a convolutional layer and a ReLU activation function; this weight A_s emphasizes the speech-information-rich regions on the H×W spatial plane. The spatial attention weight A_s is multiplied point by point with the feature map F_c to obtain a new feature map F_st. The process formulas are:
A_s = σ_r(W_{7×7} ∗ Avg_channel(F_c) + B_3)
F_st = A_s ⊗ F_c
where W_{7×7} is the convolution kernel weight coefficient of the convolutional layer (kernel size 7×7), B_3 is the bias value of this convolutional layer, ∗ is the convolution operation, σ_r is the ReLU activation function, and Avg_channel is the global average pooling function along the C axis.
F_st is the combined result of applying channel attention and spatial attention to the original feature map F; that is, each pixel of the original feature map obtains a corresponding weight. Notably, the descriptors of both sub-attention modules are obtained by an average pooling operation rather than a maximum pooling operation, which suppresses strong noise to some extent. Through space-time attention, the network can quickly find which channels and which spatial regions are rich in speech information, which are also the regions where emotion is hidden.
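A matching sketch of the spatial attention unit follows, under the same caveats (names and shapes are illustrative, not taken from the patent). Per the formulas above, a single 7×7 convolution and a ReLU turn the channel-pooled descriptor into the spatial attention weight.

```python
# Spatial attention: squeeze the channel axis, weight the H x W plane.
import torch
from torch import nn


class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)  # W_7x7, B_3

    def forward(self, x):                       # x = F_c: (B, C, H, W)
        d = x.mean(dim=1, keepdim=True)         # D_s = Avg_channel(F_c): (B, 1, H, W)
        a = torch.relu(self.conv(d))            # A_s, ReLU as stated in the text
        return x * a                            # F_st = A_s (point-wise) F_c
```

Chaining the two units, for example nn.Sequential(ChannelAttention(32), SpatialAttention()), gives one possible realization of the spatio-temporal attention module inserted after the third convolutional layer of the backbone sketch.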
Unlike the spatio-temporal attention module, the frequency attention module aims to learn specific emotional frequency patterns through the frequency distribution in the speech information regions. As a lightweight module, the frequency attention module can replace a conventional fully connected layer, avoiding overfitting to some extent. In the frequency attention, the invention mainly uses a depth-column convolution and a weighted grouped fully connected layer to extract frequency emotion features and channel emotion features respectively, as shown in FIG. 4.
Frequency attention is applied to the frequency axis to extract emotional frequency patterns. In a convolutional neural network, each channel represents one type of extracted feature. The output feature map F_4 ∈ ℝ^(C×H×W) of the fourth convolutional layer has C feature types; if a conventional convolution operation were used, the C feature types would have to be summed up and the individuality of the different feature types would be lost. The invention therefore uses a depth-column convolution to extract different frequency patterns for different channels. Depthwise convolution is a spatial convolution performed independently on each input channel, so the convolutions of different channels do not affect each other. To extract emotional frequency patterns along the frequency axis, the output feature map F_4 of the fourth convolutional layer is processed in the frequency attention module by a depth-column convolution whose kernel W_F is also the frequency weight learned by the network; this yields different frequency pattern results F_f for different channels, and the result still retains timing information. The process formula is:
F_f = W_F ⊛ F_4 + B_4
where B_4 is the bias value of the depth-column convolution and ⊛ is the depth-column convolution operation.
The channel weight mainly highlights the emotion-related feature types (channels). Taking F_f as input, the feature map is compressed by global average pooling along the time axis (W axis) to obtain a channel descriptor D'_c. To better highlight important channels, the channel descriptor D'_c is input to a fully connected layer with C neurons, and the frequency attention result F_FQ is calculated. The process formula is:
F_FQ = W_5 · Avg_time(F_f) + B_5
where W_5 is the weight coefficient of the fully connected layer, which is also the channel weight learned by the network, B_5 is its bias value, and Avg_time is the global average pooling function along the W axis.
Finally, F_FQ is input to a fully connected layer with 4 neurons and a softmax function to obtain the emotional state prediction result.
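A hedged sketch of the frequency attention module follows. The depth-column convolution is realized here as a depthwise (groups=C) convolution whose kernel spans the whole frequency axis, giving one learned frequency pattern per channel while keeping the time axis; the fully connected layer with C neurons then maps the time-pooled channel descriptor to F_FQ. The class and argument names are illustrative assumptions, and freq_bins must equal the height H of the fourth convolutional layer's output.

```python
# Frequency attention: per-channel frequency patterns, then channel weighting.
import torch
from torch import nn


class FrequencyAttention(nn.Module):
    def __init__(self, channels, freq_bins):
        super().__init__()
        # W_F, B_4: one (H, 1) column kernel per channel (depth-column convolution)
        self.depth_column = nn.Conv2d(channels, channels,
                                      kernel_size=(freq_bins, 1),
                                      groups=channels)
        # W_5, B_5: fully connected layer with C neurons (learned channel weights)
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                 # x = F_4: (B, C, H, W)
        f = self.depth_column(x)          # F_f: (B, C, 1, W), timing retained
        d = f.mean(dim=(2, 3))            # Avg_time(F_f): channel descriptor (B, C)
        return self.fc(d)                 # F_FQ: (B, C)
```

With the 400×300 input and 2×2 pooling assumed in the backbone sketch, the fourth layer's output height is 25, so FrequencyAttention(64, freq_bins=25) can replace the pooling placeholder there; its (batch, 64) output then feeds the final fully connected layer with 4 neurons and the softmax.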
In summary, the attention weights of the spatio-temporal attention come from the descriptors of the feature map itself, giving it strong adaptivity, while the attention weights of the frequency attention come from network-learned parameters, so emotional frequency patterns can be learned well. The spatio-temporal attention provides candidate speech information regions for the frequency attention, so that the frequency attention can quickly and accurately extract frequency emotion features; the two cooperate with each other, guide the network in searching for emotional features, and effectively improve the performance of the discrimination model.
S2, inputting a speech emotion database, wherein each piece of audio data in the database has a corresponding emotion label. Existing speech emotion databases are combined into one large database according to seven basic emotions (happiness, surprise, anger, disgust, fear, sadness and contempt), including the Emotional Voices, Emotional Voice, MELD, VoxCeleb, GEMEP, RML, eNTERFACE and IEMOCAP databases. The audio data of the speech emotion database is divided into a training set and a test set; all the audio data is processed to obtain spectrograms; and the spectrograms are input into the speech emotion state evaluation model for training and testing, as shown in FIG. 5.
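A training and testing sketch under stated assumptions is given below: the optimizer, learning rate, batch size and epoch count are illustrative choices, not values given in the patent. Since the backbone sketch above already ends in a softmax, the negative log-likelihood loss is applied to the log of the predicted probabilities.

```python
# Minimal train/test loop for the EmotionNet sketch (spectrogram tensors in,
# 4-way emotion labels out); hyperparameters are assumptions.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_and_test(model, x_train, y_train, x_test, y_test,
                   epochs=30, lr=1e-3, batch_size=32, device="cpu"):
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    nll = nn.NLLLoss()
    loader = DataLoader(TensorDataset(x_train, y_train),
                        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        model.train()
        for xb, yb in loader:
            opt.zero_grad()
            probs = model(xb.to(device))                    # (B, 4) probabilities
            nll(torch.log(probs + 1e-8), yb.to(device)).backward()
            opt.step()
    # testing: report accuracy on the held-out set
    model.eval()
    with torch.no_grad():
        pred = model(x_test.to(device)).argmax(dim=1).cpu()
    return (pred == y_test).float().mean().item()
```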
And S3, processing the audio data to be evaluated to obtain a spectrogram, and inputting the spectrogram into the trained and tested speech emotion state evaluation model to evaluate the emotional state. The method can be deployed on a client for forward inference and fed back to an intelligent question-answering system, helping the system better understand semantics and adjust its text and speech output, so that the answers of the voice interaction system better meet the interlocutor's needs.
In steps S2 and S3, the spectrogram is obtained by performing segmentation, framing, short-time Fourier transform and normalization on the audio data.
A specific example is as follows (a code sketch of steps A to E is given after this list):
A. Segmentation: the audio data is segmented so that each sub-audio segment is at most 3 s long; the label of each sub-audio segment is the label of the original long audio, and at prediction time the result for the original long audio is the average of the prediction results of its sub-audio segments;
B. Framing: each sub-audio segment is framed using a Hamming window with a window length of 40 ms and a time shift of 10 ms; for data augmentation, a Hamming window with a window length of 20 ms and a time shift of 10 ms is also used, which doubles the data;
C. Short-time Fourier transform: a short-time Fourier transform is applied to the framed audio to obtain the spectrogram;
D. Normalization: logarithm, mean subtraction and variance normalization operations are applied to the spectrogram;
E. Fixed length: since the network input must be of fixed size, the frequency axis of each spectrogram takes 400 points (covering 4 kHz, the frequency range of human speech) and the time axis takes 300 points (representing 3 s; shorter segments are zero-padded).
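The following sketch walks through steps A to E, assuming a 16 kHz sampling rate (not stated in the patent) so that an FFT length of 1600 yields the 10 Hz per frequency bin resolution mentioned above; function names and constants are illustrative, and the 20 ms data-augmentation window of step B is omitted for brevity.

```python
# Spectrogram preprocessing sketch: segment, frame, STFT, normalize, fix size.
import numpy as np
from scipy.signal import stft

SR = 16000                  # assumed sampling rate
SEG_LEN = 3 * SR            # step A: segments of at most 3 s
WIN = int(0.040 * SR)       # step B: 40 ms Hamming window
HOP = int(0.010 * SR)       # 10 ms time shift
N_FFT = 1600                # 16000 / 1600 = 10 Hz per frequency bin
N_FREQ, N_TIME = 400, 300   # step E: 4 kHz x 3 s fixed-size input


def audio_to_spectrograms(audio):
    """Split a waveform into <=3 s segments and return one normalized
    400x300 log spectrogram per segment."""
    segments = [audio[i:i + SEG_LEN] for i in range(0, len(audio), SEG_LEN)]
    specs = []
    for seg in segments:
        # steps B-C: Hamming-window framing + short-time Fourier transform
        _, _, z = stft(seg, fs=SR, window="hamming",
                       nperseg=WIN, noverlap=WIN - HOP, nfft=N_FFT)
        logspec = np.log(np.abs(z)[:N_FREQ, :] + 1e-8)        # keep 0-4 kHz, step D: log
        logspec = (logspec - logspec.mean()) / (logspec.std() + 1e-8)  # step D: normalize
        out = np.zeros((N_FREQ, N_TIME), dtype=np.float32)    # step E: pad/crop time axis
        t = min(N_TIME, logspec.shape[1])
        out[:, :t] = logspec[:, :t]
        specs.append(out)
    return specs
```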
Traditional acoustic statistical features are not used here: on the one hand, they are global features that easily smooth away important temporal and local information; on the other hand, they are difficult to combine with convolutional neural networks, which have powerful representation capability. Therefore, the method adopts the spectrogram as the input of the speech emotion state evaluation model. The spectrogram is a time-frequency representation that shows how speech energy changes with time and frequency, and it preserves temporal and local information while clearly reflecting the harmonic structure and frequency content of speech.
The method can help the voice interaction system to evaluate the emotion state of the interlocutor in real time during man-machine conversation, and further feed the emotion state back to the intelligent question-answering system, so that the auxiliary system can better understand the semantics, and correct the text and voice output of the system, and the answer of the voice interaction system is more suitable for the use requirements of the interlocutor. In addition, the method can be applied to a VR virtual chat room, and helps the virtual character projected by the interlocutor to have rich expression by acquiring the emotional state of the speaker, so that the VR virtual chat meets the requirement of virtual reality. The method can also be applied to a VR inquiry system to assist doctors in obtaining the emotional state information of patients, so that the time of the doctors is saved, and the medical resources are saved.
Example two
In order to implement the attention-based speech emotion state assessment method of the first embodiment, this embodiment provides an attention-based speech emotion state evaluation device, comprising:
a speech emotion state evaluation model building module, used for setting the input as a spectrogram; building a basic framework with four convolutional layers and setting the convolution kernel size of each convolutional layer; each convolutional layer is followed by a batch normalization layer, a ReLU activation function and an average pooling operation; a space-time attention module is connected after the third convolutional layer; a frequency attention module is connected after the fourth convolutional layer; finally, a softmax layer is connected to obtain the emotional state prediction result;
a speech emotion state evaluation model training and testing module, used for inputting a speech emotion database in which each piece of audio data has a corresponding emotion label; dividing the audio data of the speech emotion database into a training set and a test set; processing all the audio data to obtain spectrograms; and inputting the spectrograms into the speech emotion state evaluation model for training and testing;
and a speech emotion state evaluation module, used for processing the audio data to be evaluated to obtain a spectrogram and inputting the spectrogram into the trained and tested speech emotion state evaluation model to evaluate the emotional state.
EXAMPLE III
The present embodiment is a storage medium storing a computer program, which when executed by a processor causes the processor to execute the attention-based speech emotional state assessment method according to the first embodiment.
Example four
The embodiment of the invention relates to a computing device, which comprises a processor and a memory for storing a program executable by the processor, and is characterized in that when the processor executes the program stored in the memory, the attention-based speech emotion state assessment method described in the first embodiment is implemented.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such modifications are intended to be included in the scope of the present invention.

Claims (6)

1. A speech emotion state assessment method based on attention is characterized in that: the method comprises the following steps:
s1, building a speech emotion state evaluation model: setting an input as a spectrogram; building a basic framework by adopting four layers of convolution layers, and respectively setting the convolution kernel size of each layer of convolution layer; each convolutional layer is followed by a batch normalization layer, a RELU activation function and an average pooling operation; a space-time attention module is connected behind the third layer of convolution layer; connecting a frequency attention module behind the fourth layer of the convolution layer; finally, connecting a softmax layer to obtain an emotional state prediction result;
s2, inputting a voice emotion database, wherein each audio data in the voice emotion database is provided with a corresponding emotion label; dividing audio data of a speech emotion database into a training set and a test set; processing all audio data respectively to obtain a spectrogram; inputting the spectrogram into a speech emotion state evaluation model for training and testing;
and S3, processing the audio data to be evaluated to obtain a spectrogram, and inputting the spectrogram into the trained and tested speech emotion state evaluation model to evaluate the emotional state.
2. The attention-based speech emotional state assessment method according to claim 1, wherein: in the step S1, the spectrogram obtains a feature map after each layer of convolution layer processing; the characteristic diagram has three dimensions including a thickness C representing the number of channels, a height H representing the frequency axis, and a width W representing the time axis; the space-time attention module consists of a channel attention module unit and a space attention module unit;
in the channel attention module unit, an input feature map F ∈ ℝ^(C×H×W) is subjected to global average pooling compression over the H×W spatial plane to obtain a channel descriptor D_c ∈ ℝ^(C×1×1); the channel descriptor D_c is mapped to a channel attention weight A_c ∈ ℝ^(C×1×1) by two fully connected layers and a Sigmoid activation function; the channel attention weight A_c is multiplied point by point with the original feature map F to obtain a new feature map F_c; the process formulas are:
A_c = σ_s(W_2(W_1 · Avg_spatial(F) + B_1) + B_2)
F_c = A_c ⊗ F
wherein W_1 and B_1 are respectively the weight coefficient and bias value of the first fully connected layer, W_2 and B_2 are respectively the weight coefficient and bias value of the second fully connected layer, σ_s is the Sigmoid activation function, ⊗ denotes point-by-point multiplication, and Avg_spatial refers to the global average pooling function along the H×W spatial plane;
in the spatial attention module unit, the new feature map F_c is subjected to global average pooling compression along the C axis to obtain a spatial descriptor D_s ∈ ℝ^(1×H×W); a spatial attention weight A_s ∈ ℝ^(1×H×W) is generated by a convolutional layer and a ReLU activation function; the spatial attention weight A_s is multiplied point by point with the feature map F_c to obtain a new feature map F_st; the process formulas are:
A_s = σ_r(W_{7×7} ∗ Avg_channel(F_c) + B_3)
F_st = A_s ⊗ F_c
wherein W_{7×7} is the convolution kernel weight coefficient of the convolutional layer with a kernel size of 7×7, B_3 is the bias value of this convolutional layer, ∗ is the convolution operation symbol, σ_r is the ReLU activation function, and Avg_channel is the global average pooling function along the C axis;
in the frequency attention module, the output feature map F_4 ∈ ℝ^(C×H×W) of the fourth convolutional layer is subjected to depth-column convolution processing with a depth-column convolution kernel W_F to obtain different frequency pattern results F_f for different channels; the process formula is:
F_f = W_F ⊛ F_4 + B_4
wherein B_4 is the bias value of the depth-column convolution and ⊛ is the depth-column convolution operation symbol;
the feature map F_f is subjected to global average pooling compression along the time axis (W axis) to obtain a channel descriptor D'_c; the channel descriptor D'_c is input to a fully connected layer having C neurons, and the frequency attention result F_FQ is calculated; the process formula is:
F_FQ = W_5 · Avg_time(F_f) + B_5
wherein W_5 is the weight coefficient of the fully connected layer, which is also the channel weight learned by the network, B_5 is the bias value of the fully connected layer, and Avg_time refers to the global average pooling function along the W axis;
and finally, the result is input to a fully connected layer with 4 neurons and a softmax function to obtain the emotional state prediction result.
3. The attention-based speech emotional state assessment method according to claim 1, wherein: in the steps S2 and S3, the spectrogram is obtained by performing segmentation, framing, short-time Fourier transform, and normalization on the audio data.
4. An attention-based speech emotion state evaluation device characterized in that: the method comprises the following steps:
the voice emotion state evaluation model building module is used for setting and inputting a voice spectrogram; building a basic framework by adopting four layers of convolution layers, and respectively setting the convolution kernel size of each layer of convolution layer; each convolutional layer is followed by a batch normalization layer, a RELU activation function and an average pooling operation; a space-time attention module is connected behind the third layer of the convolution layer; connecting a frequency attention module behind the fourth layer of the convolution layer; finally, connecting a softmax layer to obtain an emotional state prediction result;
the voice emotion state evaluation model training and testing module is used for inputting a voice emotion database, and each audio data in the voice emotion database is provided with a corresponding emotion label; dividing audio data of a speech emotion database into a training set and a test set; processing all audio data respectively to obtain a spectrogram; inputting the spectrogram into a speech emotion state evaluation model for training and testing;
and the voice emotion state evaluation module is used for processing the audio data to be evaluated to obtain a spectrogram and inputting the spectrogram into the voice emotion state evaluation model which completes training and testing so as to evaluate the emotion state.
5. A storage medium storing a computer program which, when executed by a processor, causes the processor to execute the attention-based speech emotional state assessment method according to any one of claims 1 to 3.
6. A computing device comprising a processor and a memory for storing processor-executable programs, wherein the processor, when executing a program stored in the memory, implements the attention-based speech emotional state assessment method of any of claims 1-3.
CN202010143924.2A 2020-03-04 2020-03-04 Attention-based speech emotion state evaluation method, device, medium and equipment Active CN111402928B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010143924.2A CN111402928B (en) 2020-03-04 2020-03-04 Attention-based speech emotion state evaluation method, device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010143924.2A CN111402928B (en) 2020-03-04 2020-03-04 Attention-based speech emotion state evaluation method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN111402928A CN111402928A (en) 2020-07-10
CN111402928B true CN111402928B (en) 2022-06-14

Family

ID=71430481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010143924.2A Active CN111402928B (en) 2020-03-04 2020-03-04 Attention-based speech emotion state evaluation method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN111402928B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933188B (en) * 2020-09-14 2021-02-05 电子科技大学 Sound event detection method based on convolutional neural network
CN112581979B (en) * 2020-12-10 2022-07-12 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN112735477B (en) * 2020-12-31 2023-03-17 沈阳康慧类脑智能协同创新中心有限公司 Voice emotion analysis method and device
CN112581980B (en) * 2021-02-26 2021-05-25 中国科学院自动化研究所 Method and network for time-frequency channel attention weight calculation and vectorization
CN114343670B (en) * 2022-01-07 2023-07-14 北京师范大学 Interpretation information generation method and electronic equipment
CN115206305B (en) * 2022-09-16 2023-01-20 北京达佳互联信息技术有限公司 Semantic text generation method and device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5866728B2 (en) * 2011-10-14 2016-02-17 サイバーアイ・エンタテインメント株式会社 Knowledge information processing server system with image recognition system
KR102210908B1 (en) * 2017-10-17 2021-02-03 주식회사 네오펙트 Method, apparatus and computer program for providing cognitive training
CN108682431B (en) * 2018-05-09 2021-08-03 武汉理工大学 Voice emotion recognition method in PAD three-dimensional emotion space
CN110059587A (en) * 2019-03-29 2019-07-26 西安交通大学 Human bodys' response method based on space-time attention
CN110610168B (en) * 2019-09-20 2021-10-26 合肥工业大学 Electroencephalogram emotion recognition method based on attention mechanism
CN110853630B (en) * 2019-10-30 2022-02-18 华南师范大学 Lightweight speech recognition method facing edge calculation

Also Published As

Publication number Publication date
CN111402928A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111402928B (en) Attention-based speech emotion state evaluation method, device, medium and equipment
CN108717856B (en) Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
Venkataramanan et al. Emotion recognition from speech
CN108597539B (en) Speech emotion recognition method based on parameter migration and spectrogram
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN110400579B (en) Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
US11281945B1 (en) Multimodal dimensional emotion recognition method
WO2018227781A1 (en) Voice recognition method, apparatus, computer device, and storage medium
US20230267916A1 (en) Text-based virtual object animation generation method, apparatus, storage medium, and terminal
CN107452379B (en) Dialect language identification method and virtual reality teaching method and system
CN111312245B (en) Voice response method, device and storage medium
CN109377981B (en) Phoneme alignment method and device
Noroozi et al. Supervised vocal-based emotion recognition using multiclass support vector machine, random forests, and adaboost
CN113421547B (en) Voice processing method and related equipment
CN107972028A (en) Man-machine interaction method, device and electronic equipment
Wang et al. Research on speech emotion recognition technology based on deep and shallow neural network
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
CN114863938A (en) Bird language identification method and system based on attention residual error and feature fusion
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
CN112749567A (en) Question-answering system based on reality information environment knowledge graph
Liu et al. Learning salient features for speech emotion recognition using CNN
Huilian et al. Speech emotion recognition based on BLSTM and CNN feature fusion
CN115985320A (en) Intelligent device control method and device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant