CN111402928B - Attention-based speech emotion state evaluation method, device, medium and equipment - Google Patents
- Publication number
- Publication number: CN111402928B (application CN202010143924A, filed as CN202010143924.2A)
- Authority
- CN
- China
- Prior art keywords: layer, attention, spectrogram, convolution, emotion
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques specially adapted for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques specially adapted for estimating an emotional state
- G10L25/03—Speech or voice analysis techniques characterised by the type of extracted parameters
- G10L25/27—Speech or voice analysis techniques characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique using neural networks
Abstract
The invention provides an attention-based speech emotion state assessment method, device, medium and equipment. The method comprises the following steps: S1, building a speech emotion state evaluation model: a basic framework is built with four convolutional layers; each convolutional layer is followed by a batch normalization layer, a ReLU activation function and an average pooling operation; a spatio-temporal attention module is connected after the third convolutional layer; a frequency attention module is connected after the fourth convolutional layer; finally, a softmax layer is connected; S2, inputting a speech emotion database to train and test the speech emotion state evaluation model; and S3, processing the audio data to be evaluated to obtain a spectrogram, and inputting the spectrogram into the speech emotion state evaluation model to evaluate the emotion state. The invention adopts a novel lightweight attention mechanism in which spatio-temporal attention and frequency attention cooperate to extract emotional features quickly and accurately from lengthy audio, effectively improving the effect and performance of the emotion state evaluation model.
Description
Technical Field
The invention relates to the technical field of speech processing, and in particular to an attention-based speech emotion state assessment method, device, medium and equipment.
Background
With the development of society and the advancement of science and technology, human-computer interaction technology has quietly entered many aspects of our lives, such as smart homes, mobile phones, vehicles, smart wearables and robots. In recent years, human-computer interaction technology has undergone a revolution: people are no longer satisfied with the GUI (graphical user interface) era and are moving toward a natural, conversational experience. As a new interaction technology, the VUI (voice user interface) is a human-computer interaction mode centered on human intention, an intelligent interaction experience with natural conversation at its core. Voice interaction is more efficient and more naturally expressive than interface-based input; examples include voice assistants such as Siri, Alexa and Cortana, VR virtual chat rooms, VR medical consultation systems, and the like. However, these tools do not analyze the emotional state of the interlocutor during human-computer interaction. Because the same sentence can express substantially different meanings under different emotional states of the interlocutor, acquiring the interlocutor's emotional state is very important for the machine to understand semantics accurately.
Traditional speech emotion recognition methods are based on acoustic statistical features and machine learning models. Acoustic statistical features commonly used for emotion recognition include Mel-frequency cepstral coefficients (MFCCs), the GeMAPS feature set, vocal prosodic features, the BoAW feature set, and the like. Machine learning models applied to these acoustic statistical features include hidden Markov models, Gaussian mixture models, decision trees, and the like. However, emotion is high-level semantic knowledge; conventional acoustic statistical features have limited capability to represent emotion and can even restrict model performance to a certain extent.
In recent years, deep learning methods have gradually been introduced into the field of speech emotion recognition by virtue of the strong nonlinear characterization capability of deep networks. Nonlinear deep emotion features can be extracted from acoustic statistical features via CNNs, DNNs, DBNs and LSTMs, improving emotion characterization capability, after which the deep emotion features are fed to machine learning models such as extreme learning machines (ELM) and support vector machines (SVM) for classification.
Because conventional acoustic statistical features have limited characterization capability and global statistics easily lose local information, researchers have focused their attention on the spectrogram. The spectrogram is a time-frequency image that shows how speech energy changes with time and frequency; it embodies the harmonic structure and frequency information of speech well while preserving temporal and local information. Speech emotion recognition based on the spectrogram and convolutional neural networks has therefore become a recent hot technology. How to quickly and effectively organize and extract emotional features from a lengthy spectrogram has become a key technical problem in the speech emotion recognition field.
Since emotion accompanies the speaker's spoken content, in a piece of audio emotion is hidden only in the frames rich in speech information, not in the silent frames. Blindly searching a lengthy spectrogram for emotion-related regions and features without any indication is difficult and time-consuming; an attention mechanism can solve this problem. The attention mechanism is a weighting mechanism that can highlight important information and suppress irrelevant information without cutting the audio.
The attention mechanism commonly used in speech emotion recognition is global soft attention. Its weights are obtained by applying linear and nonlinear transformations to the original sequence and then normalizing; the final result is the point-wise product of the original sequence and the corresponding weights, summed. Because the original sequence feature vectors in a speech emotion recognition convolutional network are very large, the corresponding learned parameters are also very large; this attention mechanism therefore requires a huge amount of computation and cannot be readily ported to mobile terminals.
Besides this, there are some unconventional attention mechanisms, such as those based on max pooling and special convolution kernel sizes, which also require a large amount of computation; moreover, the noise introduced by the max pooling operation easily degrades discrimination performance.
How to design a lightweight attention mechanism which effectively emphasizes emotional features is also a new key technical hotspot in the field of speech emotion recognition.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention aims to provide an attention-based method, device, medium and equipment for evaluating speech emotional state. The invention adopts a novel lightweight attention mechanism in which spatio-temporal attention and frequency attention cooperate to extract emotional features quickly and accurately from lengthy audio, effectively improving the effect and performance of the emotion state evaluation model.
In order to achieve the purpose, the invention is realized by the following technical scheme: a speech emotion state assessment method based on attention is characterized in that: the method comprises the following steps:
s1, building a speech emotion state evaluation model: setting the input as a spectrogram; building a basic framework with four convolutional layers and setting the convolution kernel size of each layer; each convolutional layer is followed by a batch normalization layer, a ReLU activation function and an average pooling operation; a spatio-temporal attention module is connected after the third convolutional layer; a frequency attention module is connected after the fourth convolutional layer; finally, a softmax layer is connected to obtain the emotional state prediction result;
s2, inputting a voice emotion database, wherein each audio data in the voice emotion database is provided with a corresponding emotion label; dividing audio data of a speech emotion database into a training set and a test set; processing all audio data respectively to obtain a spectrogram; inputting the spectrogram into a speech emotion state evaluation model for training and testing;
and S3, processing the audio data to be evaluated to obtain a spectrogram, and inputting the spectrogram into the trained and tested speech emotion state evaluation model to evaluate the emotion state.
Preferably, in step S1, the spectrogram yields a feature map after each convolutional layer; the feature map has three dimensions: a depth C representing the number of channels, a height H representing the frequency axis, and a width W representing the time axis; the spatio-temporal attention module consists of a channel attention module unit and a spatial attention module unit;
in the channel attention module unit, a feature map F ∈ R^(C×H×W) is input, and global average pooling over the H×W spatial plane yields a channel descriptor Dc ∈ R^C; the channel descriptor Dc is mapped to a channel attention weight Ac ∈ R^C by two fully connected layers and a Sigmoid activation function; the channel attention weight Ac is multiplied element-wise with the original feature map F to obtain a new feature map F'; the process is formulated as:

Dc = Avg(F)
Ac = σs(W2·(W1·Dc + B1) + B2)
F' = Ac ⊗ F

where W1 and B1 are the weight coefficient and bias value of the first fully connected layer, W2 and B2 are the weight coefficient and bias value of the second fully connected layer, σs is the Sigmoid activation function, and Avg is the global average pooling function.
In the spatial attention module unit, the new feature map F' is compressed by global average pooling along the C axis to obtain a spatial descriptor Ds ∈ R^(H×W); a spatial attention weight As ∈ R^(H×W) is generated by a convolutional layer and a ReLU activation function; the spatial attention weight As is multiplied element-wise with the feature map F' to obtain a brand-new feature map F''; the process is formulated as:

Ds = Avg_C(F')
As = σr(W7×7 ∗ Ds + B3)
F'' = As ⊗ F'

where W7×7 is the convolution kernel weight coefficient of the convolutional layer (kernel size 7×7), B3 is the bias value of this convolutional layer, σr is the ReLU activation function, and Avg is the global average pooling function.
In the frequency attention module, the output feature map F'' ∈ R^(C×H×W) of the fourth convolutional layer undergoes a depth-column convolution whose per-channel column kernel Wcol ∈ R^(C×H×1) spans the full frequency axis, yielding different frequency pattern results for different channels, Fd ∈ R^(C×1×W); the process is formulated as:

Fd = Wcol ⊛ F'' + B4

where B4 is the bias value of the depth-column convolution and ⊛ denotes depthwise (per-channel) convolution.
The feature map Fd is compressed by global average pooling along the time axis (W axis) to obtain a channel descriptor Dc′ ∈ R^C; the channel descriptor Dc′ is input to a fully connected layer having C neurons, and the frequency attention result F_FQ is computed; the process is formulated as:

Dc′ = Avg_W(Fd)
F_FQ = Wfc·Dc′

where Wfc is the weight coefficient of the fully connected layer, which is also the channel weight learned by the network;
and finally, F_FQ is input into a fully connected layer with 4 neurons and a softmax function to obtain the emotional state prediction result.
Preferably, in steps S2 and S3, the spectrogram is obtained by performing segmentation, framing, short-time Fourier transform, and normalization on the audio data. The spectrogram shows, in the form of a time-frequency image, how speech energy changes with time and frequency, and embodies the harmonic structure and frequency information of speech well while preserving temporal and local information.
An attention-based speech emotion state evaluation device characterized in that: the method comprises the following steps:
the speech emotion state evaluation model building module is used for setting the input as a spectrogram; building a basic framework with four convolutional layers and setting the convolution kernel size of each layer; each convolutional layer is followed by a batch normalization layer, a ReLU activation function and an average pooling operation; a spatio-temporal attention module is connected after the third convolutional layer; a frequency attention module is connected after the fourth convolutional layer; finally, a softmax layer is connected to obtain the emotional state prediction result;
the voice emotion state evaluation model training and testing module is used for inputting a voice emotion database, and each audio data in the voice emotion database is provided with a corresponding emotion label; dividing audio data of a speech emotion database into a training set and a test set; processing all audio data respectively to obtain a spectrogram; inputting the spectrogram into a speech emotion state evaluation model for training and testing;
and the voice emotion state evaluation module is used for processing the audio data to be evaluated to obtain a spectrogram and inputting the spectrogram into the voice emotion state evaluation model which completes training and testing so as to evaluate the emotion state.
A storage medium, wherein the storage medium stores a computer program which, when executed by a processor, causes the processor to execute the above-described attention-based speech emotional state assessment method.
A computing device comprising a processor and a memory for storing a program executable by the processor, wherein the processor implements the above-described attention-based speech emotional state assessment method when executing the program stored in the memory.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. in the invention, spatio-temporal attention can highlight the emotion-related regions (space-time regions) in a lengthy spectrogram, and frequency attention captures emotional frequency features according to the frequency distribution in the above candidate regions; the invention adopts a novel lightweight attention mechanism in which spatio-temporal attention and frequency attention cooperate to extract emotional features quickly and accurately from lengthy audio, effectively improving the effect and performance of the emotion state evaluation model;
2. the invention can help a voice interaction system evaluate the interlocutor's emotional state in real time during human-machine dialogue and feed it back to the intelligent question-answering system, helping the system better understand semantics and correct its text and speech output, so that the answers of the voice interaction system better suit the interlocutor's needs.
Drawings
FIG. 1 is a schematic diagram of a speech emotional state assessment model of the present invention;
FIG. 2 is a schematic diagram of a feature map obtained after processing each convolution layer of a spectrogram in the present invention;
FIG. 3 is a schematic diagram of the spatiotemporal attention module of the present invention;
FIG. 4 is a schematic diagram of a frequency attention module of the present invention;
FIG. 5 is a flow chart of the present invention for training and testing a speech emotional state assessment model.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Example one
In the attention-based speech emotion state assessment method, the input is audio data. A spectrogram is extracted from the audio data by short-time Fourier transform and fed into the speech emotion state evaluation model for training. The spectrogram shows, in the form of a time-frequency image, how speech energy changes with time and frequency, and embodies the harmonic structure and frequency information of speech well while preserving temporal and local information. Because convolutional neural networks have strong image representation capability, the main network structure of the speech emotion state evaluation model adopts four convolutional layers. In order to extract emotion-related features from the spectrogram and suppress irrelevant information, the invention designs a novel lightweight attention mechanism: spatio-temporal-frequency attention. Unlike previous attention mechanisms, spatio-temporal-frequency attention is a cascaded attention mechanism consisting of spatio-temporal attention and frequency attention. Since emotion is hidden in the spoken segments of the audio, spatio-temporal attention highlights these speech information regions (speech information space-time regions) through channel attention and spatial attention, suppressing unvoiced regions and noise regions. Research shows that emotion is strongly related to speech frequency, so frequency attention acquires emotion-related frequency combination features through frequency-channel attention within the speech information candidate regions. Spatio-temporal attention and frequency attention cooperate to help the neural network extract emotional features quickly and accurately from lengthy audio, effectively improving the effect and performance of the emotion state assessment model.
The method comprises the following steps:
s1, building a speech emotion state evaluation model: as shown in fig. 1, a basic framework is built with four convolutional layers, and the convolution kernel size of each layer is set, for example: 16 × 16 × 12, 24 × 12 × 8, 32 × 7 × 15, 64 × 5 × 3; each convolutional layer is followed by a batch normalization layer, a ReLU activation function and an average pooling operation; a spatio-temporal attention module is connected after the third convolutional layer; a frequency attention module is connected after the fourth convolutional layer; finally, a softmax layer is connected to obtain the emotional state prediction result;
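To make the dimensions concrete, the backbone can be sketched as a shape calculation. The per-layer output channel counts follow the example kernel sizes above; stride-1 'same' convolutions and 2×2 average pooling are assumptions, since the patent does not state strides or pooling sizes, and the 400×300 input size comes from the fixed-length spectrogram described later in the embodiment.

```python
# Shape walkthrough of the assumed four-layer backbone.
# Assumption: stride-1 'same' convolutions (spatial size unchanged) and
# 2x2 average pooling (each spatial axis halved, floor division).
layers = [(16, (16, 12)), (24, (12, 8)), (32, (7, 15)), (64, (5, 3))]  # (out_channels, kernel)
h, w = 400, 300  # spectrogram: 400 frequency bins x 300 time frames
shapes = []
for out_ch, _kernel in layers:
    h, w = h // 2, w // 2  # conv keeps H x W; 2x2 average pooling halves them
    shapes.append((out_ch, h, w))
print(shapes)  # final feature map: (64, 25, 18)
```

Under these assumptions the spatio-temporal attention module operates on a 32 × 50 × 37 map and the frequency attention module on a 64 × 25 × 18 map.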
because the spectrogram in the form of an RGB image is input, the method mainly adopts a convolutional neural network. In order to be transplanted to the client, the method adopts a small network, namely only four convolutional layers. In order to rapidly extract emotional features from a lengthy spectrogram, the invention provides space-time-frequency cascade attention. Spatiotemporal attention can focus on a speech information region (speech information spatiotemporal region) from a lengthy spectrogram, and frequency attention can extract emotional frequency features from the speech information candidate region. The two are mutually matched, the auxiliary model extracts emotional characteristics quickly and accurately, and the accuracy of the model is improved.
In step S1, each pixel of the spectrogram represents 10 Hz and 10 ms of information, and in order to capture sufficient information from the spectrogram, the size of the convolution kernels must be designed according to the image resolution of the spectrogram; the backbone network is configured accordingly, with the per-layer kernel sizes given in step S1 above.
in a piece of audio, emotion is hidden in a place where the amount of speech information is abundant, and the emotion is related to a specific speech frequency, so that each pixel point of a spectrogram contributes differently to the emotion. How to effectively highlight the emotion related area and extract effective emotion frequency patterns becomes the key of speech emotion recognition. Aiming at the key problem, the invention provides a space-time-frequency cascade attention mechanism for effectively extracting emotional characteristics: 1) spatiotemporal attention can highlight emotion related regions (spatiotemporal regions) in a lengthy spectrogram; 2) frequency attention captures emotional frequency features from the frequency distribution in the upper candidate region. The space-time attention and the frequency attention are mutually matched, and the emotional characteristics are gradually captured in the spectrogram.
Since emotion occurs only at moments of speech and not at moments of silence, these speech information regions are first located by spatio-temporal attention.
The spectrogram yields a feature map after each convolutional layer; as shown in fig. 2, the feature map has three dimensions: a depth C representing the number of channels, a height H representing the frequency axis, and a width W representing the time axis; it can be viewed as a three-dimensional space-time volume. The spatio-temporal attention module is composed of a channel attention module unit and a spatial attention module unit.
The channel attention module unit mainly highlights channels highly related to emotion (in a convolutional neural network, each channel actually represents one feature type), and the spatial attention module unit strengthens the emotional spatial regions on the frequency-time (F-T) plane, as shown in fig. 3.
In the channel attention module unit, a feature map F ∈ R^(C×H×W) is input, and global average pooling over the H×W spatial plane yields a channel descriptor Dc ∈ R^C. Through the global average pooling operation, the channel descriptor Dc carries the spatial global information of each channel, so important channels can be highlighted and unimportant channels suppressed. The channel descriptor Dc is mapped to a channel attention weight Ac ∈ R^C by two fully connected layers and a Sigmoid activation function. This weight assigns high values to important channels, highlighting which channels, i.e. which feature types, are highly emotion-related. The channel attention weight Ac is multiplied element-wise with the original feature map F to obtain a new feature map F'. The process is formulated as:

Dc = Avg(F)
Ac = σs(W2·(W1·Dc + B1) + B2)
F' = Ac ⊗ F

where W1 and B1 are the weight coefficient and bias value of the first fully connected layer, W2 and B2 are the weight coefficient and bias value of the second fully connected layer, σs is the Sigmoid activation function, and Avg is the global average pooling function.
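A minimal NumPy sketch of this channel attention step, following the formulas above; the bottleneck ratio r and the random initialization are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, B1, W2, B2):
    """Squeeze the H x W plane by global average pooling, map the channel
    descriptor through two FC layers + Sigmoid, then reweight the channels."""
    Dc = F.mean(axis=(1, 2))                   # channel descriptor Dc in R^C
    Ac = sigmoid(W2 @ (W1 @ Dc + B1) + B2)     # channel attention weight in (0, 1)^C
    return Ac[:, None, None] * F               # F' = Ac (x) F, per-channel reweighting

rng = np.random.default_rng(0)
C, H, W, r = 32, 8, 10, 4                      # r: assumed bottleneck ratio of the two FC layers
F = rng.standard_normal((C, H, W))
W1, B1 = 0.1 * rng.standard_normal((C // r, C)), np.zeros(C // r)
W2, B2 = 0.1 * rng.standard_normal((C, C // r)), np.zeros(C)
Fp = channel_attention(F, W1, B1, W2, B2)
assert Fp.shape == (C, H, W)                   # shape is preserved; only the weighting changes
```

Each channel is scaled by a single scalar in (0, 1), so emotion-relevant feature types are kept and irrelevant ones are damped.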
In the spatial attention module unit, the new feature map F' is compressed by global average pooling along the C axis to obtain a spatial descriptor Ds ∈ R^(H×W). By pooling over the channels, the spatial descriptor Ds highlights the informative regions in space. A spatial attention weight As ∈ R^(H×W) is generated by a convolutional layer and a ReLU activation function; this spatial attention weight emphasizes the speech-information-rich regions on the H×W spatial plane. The spatial attention weight As is multiplied element-wise with the feature map F' to obtain a brand-new feature map F''. The process is formulated as:

Ds = Avg_C(F')
As = σr(W7×7 ∗ Ds + B3)
F'' = As ⊗ F'

where W7×7 is the convolution kernel weight coefficient of the convolutional layer (kernel size 7×7), B3 is the bias value of this convolutional layer, σr is the ReLU activation function, and Avg is the global average pooling function.
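A NumPy sketch of the spatial attention step; the zero 'same' padding is an assumption, while the 7×7 kernel and ReLU follow the text (the convolution is written as cross-correlation, which is equivalent for a learned kernel):

```python
import numpy as np

def conv2d_same(x, kernel, bias=0.0):
    """Naive single-channel 2D cross-correlation with zero 'same' padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel) + bias
    return out

def spatial_attention(Fp, W7, B3):
    """Pool the channels, convolve the spatial descriptor, ReLU, reweight space."""
    Ds = Fp.mean(axis=0)                           # spatial descriptor Ds in R^(H x W)
    As = np.maximum(conv2d_same(Ds, W7, B3), 0.0)  # ReLU, per the patent
    return As[None, :, :] * Fp                     # F'' = As (x) F'

rng = np.random.default_rng(1)
Fp = rng.standard_normal((32, 8, 10))
W7 = 0.1 * rng.standard_normal((7, 7))             # 7x7 kernel as specified
Fpp = spatial_attention(Fp, W7, 0.0)
assert Fpp.shape == Fp.shape
```

Here every spatial position gets one non-negative weight shared across all channels, so silent or noisy time-frequency regions are suppressed in every feature map at once.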
The feature map F'' is the combined result of applying channel attention and spatial attention to the original feature map F; that is, every pixel of the original feature map obtains a corresponding weight. Notably, the descriptors of both sub-attention modules are obtained by an average pooling operation rather than a max pooling operation, which suppresses strong noise to some extent. Through spatio-temporal attention, the network can quickly find which channels and which spatial regions are rich in speech information, which are also the regions where emotion hides.
Unlike the spatio-temporal attention module, the frequency attention module aims to learn specific emotional frequency patterns from the frequency distribution in the speech information regions. As a lightweight module, the frequency attention module can replace a conventional fully connected layer, avoiding overfitting to some extent. In frequency attention, the invention mainly uses a depth-column convolution and a weighted grouped fully connected layer to extract frequency emotion features and channel emotion features respectively, as shown in fig. 4.
Frequency attention is applied to the frequency axis to extract emotional frequency patterns. In a convolutional neural network, each channel represents one extracted feature type. The output feature map F'' ∈ R^(C×H×W) of the fourth convolutional layer carries C feature types; if a conventional convolution were used, the C types of features would be summed together and the individuality of the different feature types lost. The invention instead uses a depth-column convolution to extract different frequency patterns for different channels. A depthwise convolution is a spatial convolution performed independently on each input channel, so the convolutions of the channels do not affect each other. To extract emotional frequency patterns along the frequency axis, the frequency attention module applies to F'' a depth-column convolution whose per-channel column kernel Wcol ∈ R^(C×H×1) spans the full frequency axis (this kernel is also the frequency weight learned by the network), yielding different frequency pattern results for different channels, Fd ∈ R^(C×1×W); the result still retains the timing information. The process is formulated as:

Fd = Wcol ⊛ F'' + B4

where B4 is the bias value of the depth-column convolution and ⊛ denotes depthwise (per-channel) convolution.
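A NumPy sketch of the depth-column convolution, under the assumption (consistent with the shapes above) that each channel's kernel covers the entire frequency axis, so each channel collapses to a single frequency-pattern response per time step:

```python
import numpy as np

def depth_column_conv(F2, Wcol, B4):
    """Depth-column convolution: channel c has its own column kernel Wcol[c]
    of length H spanning the whole frequency axis; channels do not mix.
    Fd[c, w] = sum_h Wcol[c, h] * F2[c, h, w] + B4[c]."""
    Fd = np.einsum('ch,chw->cw', Wcol, F2) + B4[:, None]
    return Fd                               # shape (C, W): the time axis is retained

rng = np.random.default_rng(2)
C, H, W = 64, 25, 18                        # assumed feature-map size after the fourth layer
F2 = rng.standard_normal((C, H, W))
Wcol = 0.1 * rng.standard_normal((C, H))    # learned per-channel frequency weights
B4 = np.zeros(C)
Fd = depth_column_conv(F2, Wcol, B4)
assert Fd.shape == (C, W)
```

Because each channel keeps its own kernel, the C frequency patterns stay separate instead of being summed as in a conventional convolution.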
The channel weighting mainly highlights the emotion-related feature types (channels). With Fd as input, the feature map is compressed by global average pooling along the time axis (W axis) to obtain a channel descriptor Dc′ ∈ R^C. To better highlight important channels, the channel descriptor Dc′ is input to a fully connected layer having C neurons, and the frequency attention result F_FQ is computed. The process is formulated as:

Dc′ = Avg_W(Fd)
F_FQ = Wfc·Dc′

where Wfc is the weight coefficient of the fully connected layer, which is also the channel weight learned by the network;
Finally, F_FQ is input into a fully connected layer with 4 neurons and a softmax function to obtain the emotional state prediction result.
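The last steps, pooling along time, the C-neuron channel layer, and the 4-neuron softmax head, can be sketched as follows (the weight shapes are assumptions consistent with the text):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def frequency_head(Fd, Wfc, Wout, Bout):
    """Pool Fd along the time axis, apply the C-neuron channel layer,
    then a 4-neuron layer + softmax for the emotional state prediction."""
    Dc = Fd.mean(axis=1)              # channel descriptor Dc' in R^C
    FFQ = Wfc @ Dc                    # frequency attention result, C values
    return softmax(Wout @ FFQ + Bout)

rng = np.random.default_rng(3)
C, W = 64, 18
Fd = rng.standard_normal((C, W))
Wfc = 0.1 * rng.standard_normal((C, C))        # learned channel weights
Wout, Bout = 0.1 * rng.standard_normal((4, C)), np.zeros(4)
p = frequency_head(Fd, Wfc, Wout, Bout)
assert p.shape == (4,) and np.isclose(p.sum(), 1.0)
```

The output is a probability distribution over the 4 emotional state classes predicted by the model.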
In summary, the attention weight of the spatiotemporal attention comes from the descriptor of the feature map itself, so that the feature map has strong adaptivity, and the attention weight of the frequency attention comes from the network learning parameter, so that the emotional frequency pattern can be well learned. The space-time attention provides a candidate voice information area for the frequency attention, so that the frequency attention can quickly and accurately extract frequency emotion characteristics, the frequency emotion characteristics and the frequency emotion characteristics are matched with each other, a network is guided to search the emotion characteristics, and the performance of a discrimination model is effectively improved.
s2, inputting a speech emotion database, wherein each piece of audio data in the speech emotion database carries a corresponding emotion label. Existing speech emotion databases are combined into one large database according to seven basic emotions (happiness, surprise, anger, disgust, fear, sadness and contempt), comprising the Emotional Voices, Emotional Voice, MELD, VoxCeleb, GEMEP, RML, eNTERFACE and IEMOCAP databases. The audio data of the speech emotion database are divided into a training set and a test set; all audio data are processed to obtain spectrograms; the spectrograms are input to the speech emotion state evaluation model for training and testing, as shown in fig. 5.
And S3, processing the audio data to be evaluated to obtain a spectrogram, and inputting the spectrogram into the trained and tested speech emotion state evaluation model to evaluate the emotion state. The model can be ported to a client for forward inference, and its output fed back to the intelligent question-answering system, helping the system better understand semantics and correct its text and speech output, so that the answers of the voice interaction system better suit the interlocutor's needs.
In the steps S2 and S3, the spectrogram is obtained by performing segmentation, framing, short-time fourier transform, and normalization on the audio data.
A specific example comprises the following steps:
A. segmentation: the audio data is segmented so that each sub-audio segment is at most 3 s long; each sub-segment inherits the label of the original long audio, and at prediction time the result for the original long audio is the average of the predictions for its sub-segments;
B. framing: each sub-audio segment is framed using a Hamming window with a window length of 40 ms and a time shift of 10 ms; for data augmentation, a Hamming window with a window length of 20 ms and a time shift of 10 ms is also used, which doubles the data;
C. short-time Fourier transform: carrying out short-time Fourier transform on the audio after framing to obtain a spectrogram;
D. normalization: carrying out logarithm, mean value reduction and variance removal operation on the spectrogram;
E. fixed length: since the network input must be of fixed size, the frequency axis of each spectrogram takes 400 points (representing within 4KHz, which is the frequency range of human speech) and the time axis takes 300 points (representing 3s, less than zero padding).
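The preprocessing steps A–E above can be sketched as follows in NumPy. The sampling rate, FFT size, and crop/pad policy are assumptions for illustration; the patent does not fix these values:

```python
import numpy as np

def preprocess(audio, sr=16000, win_ms=40, hop_ms=10, n_freq=400, n_time=300):
    """Frame with a Hamming window, apply a short-time Fourier transform,
    log-normalize, and crop/zero-pad to a fixed (400, 300) size.
    sr and the FFT size are assumed, not specified by the source."""
    win = int(sr * win_ms / 1000)            # 40 ms window length
    hop = int(sr * hop_ms / 1000)            # 10 ms time shift
    n_fft = 1024                             # assumed FFT size
    hamming = np.hamming(win)
    frames = []
    for start in range(0, len(audio) - win + 1, hop):
        frame = audio[start:start + win] * hamming
        frames.append(np.abs(np.fft.rfft(frame, n_fft)))
    S = np.stack(frames, axis=1)             # (freq bins, time frames)
    S = np.log(S + 1e-8)                     # logarithm
    S = (S - S.mean()) / (S.std() + 1e-8)    # mean subtraction, variance normalization
    out = np.zeros((n_freq, n_time), dtype=S.dtype)
    f, t = min(n_freq, S.shape[0]), min(n_time, S.shape[1])
    out[:f, :t] = S[:f, :t]                  # crop long axes, zero-pad short ones
    return out
```

For a 2 s clip at 16 kHz this yields 197 frames, which are then zero-padded to the fixed 300-point time axis.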
On the one hand, traditional acoustic statistical features are global features, so important temporal and local information is easily smoothed away; on the other hand, such features are difficult to combine with convolutional neural networks, which have powerful representation capabilities. Therefore, the method adopts the spectrogram as the input of the speech emotion state assessment model. A spectrogram is a time-frequency representation showing how speech energy varies with time and frequency; it preserves temporal and local information while clearly reflecting the harmonic structure and frequency content of speech.
The method can help a voice interaction system assess the interlocutor's emotional state in real time during man-machine conversation and feed that state back to an intelligent question-answering system; this helps the system better understand the semantics and correct its text and speech output, so that the answers of the voice interaction system better fit the interlocutor's needs. The method can also be applied to a VR virtual chat room: by acquiring the speaker's emotional state, it helps the interlocutor's projected virtual character show rich expressions, so that the VR chat better meets the requirements of virtual reality. It can further be applied to a VR consultation system to help doctors obtain emotional state information about patients, saving doctors' time and medical resources.
Embodiment 2
To implement the attention-based speech emotion state assessment method of the first embodiment, this embodiment provides an attention-based speech emotion state assessment device, characterized in that it comprises:
the speech emotion state assessment model building module, configured to set the input as a spectrogram; build the basic framework with four convolution layers, setting the convolution kernel size of each layer; follow each convolution layer with a batch normalization layer, a ReLU activation function, and an average pooling operation; connect a space-time attention module after the third convolution layer; connect a frequency attention module after the fourth convolution layer; and finally connect a softmax layer to obtain the emotional state prediction result;
the voice emotion state evaluation model training and testing module is used for inputting a voice emotion database, and each audio data in the voice emotion database is provided with a corresponding emotion label; dividing audio data of a speech emotion database into a training set and a test set; processing all audio data respectively to obtain a spectrogram; inputting the spectrogram into a speech emotion state evaluation model for training and testing;
and the voice emotion state evaluation module is used for processing the audio data to be evaluated to obtain a spectrogram and inputting the spectrogram into the voice emotion state evaluation model which completes training and testing so as to evaluate the emotion state.
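As a concrete illustration of the model-building module above, the following PyTorch sketch assembles four convolution layers (each followed by batch normalization, ReLU, and average pooling), a space-time attention module after the third layer, and a softmax output. The channel counts, kernel sizes, reduction ratio, and the simplified pooled fully-connected stand-in for the frequency attention are my assumptions; the patent does not publish exact hyperparameters:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # SE-style channel attention: spatial average pooling, two FC layers, sigmoid
    def __init__(self, c, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(),
                                nn.Linear(c // r, c), nn.Sigmoid())
    def forward(self, x):                        # x: (N, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))          # pool over H×W, map to weights
        return x * w[:, :, None, None]           # reweight channels

class SpatialAttention(nn.Module):
    # spatial attention: channel average pooling, 7×7 conv, ReLU
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, 7, padding=3)
    def forward(self, x):
        w = torch.relu(self.conv(x.mean(dim=1, keepdim=True)))
        return x * w                             # reweight spatial positions

class EmotionNet(nn.Module):
    """Minimal sketch of the four-conv-layer backbone with attention.
    All sizes are illustrative assumptions."""
    def __init__(self, n_classes=7):
        super().__init__()
        chans = [1, 16, 32, 64, 128]
        blocks = []
        for i in range(4):
            blocks += [nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                       nn.BatchNorm2d(chans[i + 1]), nn.ReLU(), nn.AvgPool2d(2)]
            if i == 2:  # space-time attention after the third conv layer
                blocks += [ChannelAttention(chans[3]), SpatialAttention()]
        self.features = nn.Sequential(*blocks)
        self.freq_fc = nn.LazyLinear(chans[4])   # simplified frequency-attention stand-in
        self.head = nn.Linear(chans[4], n_classes)
    def forward(self, x):                        # x: (N, 1, 400, 300) spectrograms
        f = self.features(x)
        z = f.mean(dim=3).flatten(1)             # average-pool along time axis W
        return torch.softmax(self.head(torch.relu(self.freq_fc(z))), dim=1)
```

Here `n_classes=7` follows the seven basic emotions of the merged database; the output is a probability distribution over emotion classes.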
Embodiment 3
The present embodiment is a storage medium storing a computer program, which when executed by a processor causes the processor to execute the attention-based speech emotional state assessment method according to the first embodiment.
Embodiment 4
The embodiment of the invention relates to a computing device, which comprises a processor and a memory for storing a program executable by the processor, and is characterized in that when the processor executes the program stored in the memory, the attention-based speech emotion state assessment method described in the first embodiment is implemented.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such modifications are intended to be included in the scope of the present invention.
Claims (6)
1. An attention-based speech emotion state assessment method, characterized in that it comprises the following steps:
S1, building a speech emotion state assessment model: setting the input as a spectrogram; building the basic framework with four convolution layers, setting the convolution kernel size of each layer; following each convolution layer with a batch normalization layer, a ReLU activation function, and an average pooling operation; connecting a space-time attention module after the third convolution layer; connecting a frequency attention module after the fourth convolution layer; and finally connecting a softmax layer to obtain the emotional state prediction result;
s2, inputting a voice emotion database, wherein each audio data in the voice emotion database is provided with a corresponding emotion label; dividing audio data of a speech emotion database into a training set and a test set; processing all audio data respectively to obtain a spectrogram; inputting the spectrogram into a speech emotion state evaluation model for training and testing;
and S3, processing the audio data to be assessed to obtain a spectrogram, and inputting the spectrogram into the trained and tested speech emotion state assessment model to assess the emotional state.
2. The attention-based speech emotional state assessment method according to claim 1, wherein: in step S1, the spectrogram yields a feature map after each convolution layer; the feature map has three dimensions: the channel number C, the height H (frequency axis), and the width W (time axis); the space-time attention module consists of a channel attention unit and a spatial attention unit;
in the channel attention unit, the input feature map F ∈ R^(C×H×W) is compressed by global average pooling over the H×W spatial plane to obtain the channel descriptor z_c; z_c is mapped to the channel attention weight M_c by two fully connected layers and a Sigmoid activation function; M_c is multiplied element-wise with the original feature map F to obtain a new feature map F′; the process is formulated as:
z_c = Avg_spatial(F)
M_c = σ_s(W_2(W_1 z_c + B_1) + B_2)
F′ = M_c ⊗ F
where W_1 and B_1 are the weight coefficient and bias value of the first fully connected layer, W_2 and B_2 are the weight coefficient and bias value of the second fully connected layer, σ_s is the Sigmoid activation function, and Avg_spatial denotes the global average pooling function over the H×W spatial plane;
in the spatial attention unit, the new feature map F′ is compressed by global average pooling along the C axis to obtain the spatial descriptor z_s; the spatial attention weight M_s is generated by a convolution layer and a ReLU activation function; M_s is multiplied element-wise with the feature map F′ to obtain the feature map F″; the process is formulated as:
z_s = Avg_channel(F′)
M_s = σ_r(W_{7×7} * z_s + B_3)
F″ = M_s ⊗ F′
where W_{7×7} is the convolution kernel weight coefficient of the convolution layer (kernel size 7×7), B_3 is the bias value of this convolution layer, * is the convolution operation symbol, σ_r is the ReLU activation function, and Avg_channel is the global average pooling function along the C axis;
in the frequency attention module, the output feature map F_4 of the fourth convolution layer is processed by a depthwise column convolution with kernel K, obtaining different frequency patterns F_f for the different channels; the process is formulated as:
F_f = K ⊙ F_4 + B_4
where B_4 is the bias value of the depthwise column convolution and ⊙ is the depthwise column convolution operation symbol;
the feature map F_f is compressed by global average pooling along the time axis W to obtain the channel descriptor z_f; z_f is input to a fully connected layer having C neurons to compute the frequency attention result F_FQ; the process is formulated as:
z_f = Avg_time(F_f)
F_FQ = W_5 z_f + B_5
where W_5 is the weight of the fully connected layer, also learned by the network, B_5 is the bias value of the fully connected layer, and Avg_time refers to the global average pooling function along the W axis;
finally, the result is input to a fully connected layer with 4 neurons and a softmax function to obtain the emotional state prediction result.
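The frequency attention described in claim 2 might be sketched as follows in PyTorch: a depthwise ("deep column") convolution produces per-channel frequency patterns, the time axis W is average-pooled away, and a fully connected layer with C neurons produces the result. The column-kernel height and the flattening before the fully connected layer are assumptions, since the claim does not fix them:

```python
import torch
import torch.nn as nn

class FrequencyAttention(nn.Module):
    """Sketch of the claimed frequency attention module (assumed sizes)."""
    def __init__(self, c, kernel_h=9):
        super().__init__()
        # groups=c gives one column kernel per channel: a depthwise column convolution
        self.col_conv = nn.Conv2d(c, c, kernel_size=(kernel_h, 1),
                                  padding=(kernel_h // 2, 0), groups=c)
        self.fc = nn.LazyLinear(c)        # "a fully connected layer having C neurons"
    def forward(self, f4):                # f4: (N, C, H, W), output of conv layer 4
        ff = self.col_conv(f4)            # per-channel frequency patterns
        z = ff.mean(dim=3)                # global average pooling along the time axis W
        return self.fc(z.flatten(1))      # frequency attention result, one value per channel
```

With a (N, 64, 25, 18) input (a plausible feature-map size after four 2× poolings of a 400×300 spectrogram), the module returns a (N, 64) result.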
3. The attention-based speech emotional state assessment method according to claim 1, wherein: in the steps S2 and S3, the spectrogram is obtained by performing segmentation, framing, short-time fourier transform, and normalization on the audio data.
4. An attention-based speech emotion state assessment device, characterized in that it comprises:
the speech emotion state assessment model building module, configured to set the input as a spectrogram; build the basic framework with four convolution layers, setting the convolution kernel size of each layer; follow each convolution layer with a batch normalization layer, a ReLU activation function, and an average pooling operation; connect a space-time attention module after the third convolution layer; connect a frequency attention module after the fourth convolution layer; and finally connect a softmax layer to obtain the emotional state prediction result;
the voice emotion state evaluation model training and testing module is used for inputting a voice emotion database, and each audio data in the voice emotion database is provided with a corresponding emotion label; dividing audio data of a speech emotion database into a training set and a test set; processing all audio data respectively to obtain a spectrogram; inputting the spectrogram into a speech emotion state evaluation model for training and testing;
and the voice emotion state evaluation module is used for processing the audio data to be evaluated to obtain a spectrogram and inputting the spectrogram into the voice emotion state evaluation model which completes training and testing so as to evaluate the emotion state.
5. A storage medium storing a computer program which, when executed by a processor, causes the processor to execute the attention-based speech emotional state assessment method according to any one of claims 1 to 3.
6. A computing device comprising a processor and a memory for storing processor-executable programs, wherein the processor, when executing a program stored in the memory, implements the attention-based speech emotional state assessment method of any of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010143924.2A CN111402928B (en) | 2020-03-04 | 2020-03-04 | Attention-based speech emotion state evaluation method, device, medium and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111402928A CN111402928A (en) | 2020-07-10 |
CN111402928B true CN111402928B (en) | 2022-06-14 |
Family
ID=71430481
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010143924.2A Active CN111402928B (en) | 2020-03-04 | 2020-03-04 | Attention-based speech emotion state evaluation method, device, medium and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111402928B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111933188B (en) * | 2020-09-14 | 2021-02-05 | 电子科技大学 | Sound event detection method based on convolutional neural network |
CN112581979B (en) * | 2020-12-10 | 2022-07-12 | 重庆邮电大学 | Speech emotion recognition method based on spectrogram |
CN112735477B (en) * | 2020-12-31 | 2023-03-17 | 沈阳康慧类脑智能协同创新中心有限公司 | Voice emotion analysis method and device |
CN112581980B (en) * | 2021-02-26 | 2021-05-25 | 中国科学院自动化研究所 | Method and network for time-frequency channel attention weight calculation and vectorization |
CN114343670B (en) * | 2022-01-07 | 2023-07-14 | 北京师范大学 | Interpretation information generation method and electronic equipment |
CN115206305B (en) * | 2022-09-16 | 2023-01-20 | 北京达佳互联信息技术有限公司 | Semantic text generation method and device, electronic equipment and storage medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5866728B2 (en) * | 2011-10-14 | 2016-02-17 | サイバーアイ・エンタテインメント株式会社 | Knowledge information processing server system with image recognition system |
KR102210908B1 (en) * | 2017-10-17 | 2021-02-03 | 주식회사 네오펙트 | Method, apparatus and computer program for providing cognitive training |
CN108682431B (en) * | 2018-05-09 | 2021-08-03 | 武汉理工大学 | Voice emotion recognition method in PAD three-dimensional emotion space |
CN110059587A (en) * | 2019-03-29 | 2019-07-26 | 西安交通大学 | Human bodys' response method based on space-time attention |
CN110610168B (en) * | 2019-09-20 | 2021-10-26 | 合肥工业大学 | Electroencephalogram emotion recognition method based on attention mechanism |
CN110853630B (en) * | 2019-10-30 | 2022-02-18 | 华南师范大学 | Lightweight speech recognition method facing edge calculation |
2020-03-04 — CN application CN202010143924.2A filed; patent CN111402928B granted, status Active.
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||