CN116230020A - Speech emotion recognition and classification method - Google Patents
- Publication number
- CN116230020A (application CN202211516305.9A)
- Authority
- CN
- China
- Prior art keywords
- emotion recognition
- voice data
- features
- speech
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Abstract
The invention belongs to the technical field of speech recognition and provides a speech emotion recognition and classification method comprising the following steps: feeding speech data into an emotion recognition model; extracting WP-log-Mel spectrogram features from the speech data and inputting the spectrogram features into an encoder to obtain deep features; segmenting the speech data and extracting the paralinguistic (secondary language) features of the segmented speech data; fusing the paralinguistic features with the deep features and obtaining the key features in the fused features; and performing emotion prediction and classification according to the key features with the emotion recognition model and outputting the prediction result. The invention improves the processing of non-stationary speech signals and, by exploiting the effective features in the speech signal, improves the precision and accuracy of speech emotion recognition.
Description
Technical Field
The invention belongs to the technical field of voice emotion recognition, and particularly relates to a voice emotion recognition classification method.
Background
Humans judge a person's emotion from expressions, speech, gestures and the like; among these, speech is the most direct and effective bridge of human communication and the fastest, most efficient medium in human-computer interaction. Speech emotion recognition and analysis has therefore gradually become an active research field, and its results are widely applied in artificial intelligence, robotics, natural human-computer interaction and other areas, for example speech emotion quality inspection of telephone customer service in various industries, which checks service quality and improves the efficiency and management level of quality inspection personnel.
In the prior art, the log-Mel spectrogram is usually extracted from raw speech with a linear, equally spaced frequency-band division of the speech signal (the Fourier transform); this extracts features well only from stationary speech and handles non-stationary speech signals poorly, so speech emotion recognition accuracy is low. In addition, the prior art usually extracts deep features on top of hand-crafted features and ignores effective features in the raw speech signal, such as the pitch (fundamental frequency) and formants, which further reduces speech emotion recognition accuracy.
Disclosure of Invention
The invention aims to solve at least the above technical problems in the prior art and provides a speech emotion recognition and classification method that improves the processing of non-stationary speech signals and, by combining the effective features in the speech signal, improves the precision and accuracy of speech emotion recognition.
In order to achieve the above object, according to a first aspect of the present invention there is provided a speech emotion recognition and classification method comprising the following steps: feeding speech data into an emotion recognition model; extracting WP-log-Mel spectrogram features from the speech data and inputting the spectrogram features into an encoder to obtain deep features; segmenting the speech data and extracting the paralinguistic features of the segmented speech data; fusing the paralinguistic features with the deep features and obtaining the key features in the fused features; and performing emotion prediction and classification according to the key features with the emotion recognition model and outputting the prediction result.
Further, the step in which the first extraction network extracts the WP-log-Mel spectrogram features from the speech data is specifically: the WP-log-Mel spectrogram features are extracted from the speech signal of the speech data using the wavelet packet transform and audio processing.
Further, the output dimension of the fused features is (B, T, 64), where B denotes the batch size and T the number of frames.
Further, the emotion recognition model comprises a first extraction network, a second extraction network, a fusion network, a self-attention network and a softmax layer; the first extraction network comprises a Transformer encoder consisting of two or more identical blocks, with two or more heads in each block's multi-head attention mechanism.
Further, the second extraction network comprises a segmentation layer, a sinc-based convolution structure and a multi-pooling-based convolution structure in sequence.
Further, the convolution structure based on multi-pooling comprises a convolution layer, and an average pooling layer and a maximum pooling layer are arranged behind the convolution layer in parallel.
Further, the convolutional layers are layer-normalized, and their dropout value is between 0.2 and 0.5.
Further, the step of feeding the speech data into the emotion recognition model is specifically: performing classification preprocessing on the speech data, the classification preprocessing comprising silence-segment removal and denoising, and feeding the processed speech data into the emotion recognition model.
Further, the emotion recognition model is trained by the following steps: acquiring speech data and performing training preprocessing on it, the training preprocessing comprising silence-segment removal, denoising, data augmentation and data labelling; feeding the speech data into an initial model for training, the initial model being the untrained emotion recognition model; extracting WP-log-Mel spectrogram features from the speech data and inputting them into the encoder to obtain deep features; segmenting the speech data and extracting the paralinguistic features of the segmented speech data; fusing the paralinguistic features with the deep features and obtaining the key features in the fused features; performing emotion prediction and classification according to the key features with the emotion recognition model and outputting the prediction result; and computing the loss between the prediction result and the data labels with a loss function until the model converges, thereby generating the emotion recognition model.
Further, the training preprocessing step further includes generating a plurality of voice copies from the voice data.
The basic principle and beneficial effects of the invention are as follows: the invention replaces the Fourier transform in log-Mel spectrogram feature extraction with the wavelet packet transform, yielding the WP-log-Mel spectrogram feature from which deep features of the speech signal are extracted; because the wavelet packet transform is multi-resolution, the signal can be observed progressively from coarse to fine. Compared with the prior art, the invention additionally extracts paralinguistic features from the speech data and fuses them with the deep features to recognize the emotion in the speech, which makes the emotion recognition model more robust and improves the precision and accuracy of the emotion recognition result.
Drawings
FIG. 1 is a schematic diagram of a speech emotion recognition method of the present invention;
FIG. 2 is a logical schematic of the WP-log-Mel acquisition process of the present invention;
FIG. 3 is a schematic diagram of the structure of a sinc-based convolution structure of the present invention;
FIG. 4 is a schematic diagram of the architecture of the present invention based on a multi-pooling convolution structure;
FIG. 5 is a schematic diagram of the emotion recognition model of the present invention;
FIG. 6 is a schematic diagram of the steps of the training pre-process of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
In the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
In the description of the present invention, unless otherwise specified and defined, the terms "mounted," "connected," and "coupled" are to be construed broadly: a connection may, for example, be mechanical or electrical, direct or indirect through an intermediary, or an internal communication between two elements; those skilled in the art will understand the specific meaning of these terms in the present invention according to the particular circumstances.
As shown in fig. 1, the invention provides a voice emotion recognition classification method, which comprises the following steps:
placing the voice data into an emotion recognition model; the emotion recognition model comprises a first extraction network, a second extraction network, a fusion network, a self-attention mechanism network and a softmax layer;
the first extraction network extracts WP-log-Mel spectrogram features from voice signals of voice data by wavelet packet transformation and audio processing, and inputs the spectrogram features into an encoder to obtain deep features;
Specifically, the WP-log-Mel spectrogram features are log-Mel spectrogram features based on the wavelet packet transform. Their extraction process is shown in fig. 2: the speech data is input; wavelet packet decomposition yields the wavelet packet coefficients of each frequency band; wavelet packet reconstruction yields the spectrum of each band's coefficients; the spectra of the bands are spliced in frequency order into a complete spectrum; a Mel filter bank is constructed with an audio processing library and the energy passing through the Mel filter bank is computed to obtain the Mel spectrogram; taking the logarithm of the Mel spectrogram gives the WP-log-Mel spectrogram features; and the spectrogram feature matrix is input into a Transformer encoder to obtain the deep features.
In this embodiment, the speech data is divided into 46 ms frames with a 23 ms frame shift, giving 260 frames; in other implementations the frame width and frame shift may be set to other values. In this embodiment the wavelet basis used in the wavelet packet transform is the Daubechies-4 (db4) wavelet and the number of filters in the filter bank is set to 64; in other embodiments other wavelet bases and filter counts may be chosen according to the application scenario. In this embodiment the audio processing library is librosa; in other embodiments any audio processing library with equivalent functionality may be used.
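For illustration only, the following is a minimal sketch of the WP-log-Mel extraction described above, assuming the PyWavelets (pywt) and librosa libraries and the parameters of this embodiment (46 ms frames, 23 ms shift, db4 wavelet, 64 Mel filters). The per-band reconstruction step is approximated by taking the magnitude spectrum of each sub-band's wavelet packet coefficients, so this is a simplified stand-in rather than the exact procedure of the embodiment.

```python
import numpy as np
import pywt
import librosa

def wp_log_mel(y, sr=16000, frame_ms=46, hop_ms=23,
               wavelet="db4", level=3, n_mels=64):
    frame_len = int(sr * frame_ms / 1000)            # 736 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)                # 368 samples at 16 kHz
    frames = librosa.util.frame(y, frame_length=frame_len,
                                hop_length=hop_len).T  # (n_frames, frame_len)

    spliced = []
    for frame in frames:
        wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, maxlevel=level)
        # Leaf nodes in frequency order give the per-band wavelet packet coefficients.
        bands = [node.data for node in wp.get_level(level, order="freq")]
        # Spectrum of each band, spliced along the frequency axis into one spectrum.
        spliced.append(np.concatenate([np.abs(np.fft.rfft(b)) for b in bands]))
    spec = np.stack(spliced, axis=1)                 # (n_bins, n_frames)

    # Mel filter bank built with librosa and applied to the spliced spectrum.
    n_bins = spec.shape[0]
    mel_fb = librosa.filters.mel(sr=sr, n_fft=2 * (n_bins - 1), n_mels=n_mels)
    mel_spec = mel_fb @ (spec ** 2)                  # filter-bank energies
    return np.log(mel_spec + 1e-6)                   # (n_mels, n_frames) WP-log-Mel

# Usage: transposing the result gives the frames-by-Mel matrix (e.g. 260 x 64)
# fed to the encoder: feats = wp_log_mel(librosa.load("utt.wav", sr=16000)[0]).T
```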
Specifically, the first extraction network comprises a Transformer encoder consisting of two or more identical blocks, with two or more heads in each block's multi-head attention mechanism. In this embodiment the first extraction network uses layer normalization; the Transformer encoder consists of 6 identical blocks, each with 8 heads in its multi-head attention mechanism, and the 260×64 spectrogram feature matrix is input into the Transformer encoder to obtain the deep features.
The attention computation and the feed-forward activation are as follows:

Q_i = X·W_i^Q,  K_i = X·W_i^K,  V_i = X·W_i^V

H_i = softmax(Q_i·K_i^T / √h)·V_i

MHA = concat(H_1, H_2, …, H_8)·W

out = LayerNorm(out)

out = gelu(out)

where X denotes the input speech feature sequence, i denotes the i-th attention head, Q_i, K_i and V_i denote the query, key and value of the i-th head, K_i^T is the transpose of K_i, W_i^Q, W_i^K, W_i^V and W are learned linear transformation weight matrices, h is the hidden dimension of an attention head, H_i is the attention output of the i-th head, MHA is the multi-head attention output, and gelu is the activation function of the feed-forward layer.
In other embodiments, other activation functions may be selected for the feed-forward layer, and the number of blocks and the number of heads in each block's multi-head attention mechanism may be set as desired.
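For illustration, a minimal PyTorch sketch of such an encoder is given below; it uses the standard nn.TransformerEncoder with 6 blocks, 8 heads, GELU feed-forward activation and layer normalization, matching the 260×64 spectrogram input of this embodiment. The feed-forward width, dropout and the learned position embedding are illustrative assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class DeepFeatureEncoder(nn.Module):
    def __init__(self, d_model=64, n_heads=8, n_blocks=6, ff_dim=256, dropout=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ff_dim,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_blocks)
        # Learned position embedding so frames need not be fed sequentially.
        self.pos = nn.Parameter(torch.zeros(1, 260, d_model))

    def forward(self, x):                  # x: (batch, 260 frames, 64 features)
        return self.encoder(x + self.pos[:, : x.size(1)])

# Usage: deep = DeepFeatureEncoder()(torch.randn(4, 260, 64))  # -> (4, 260, 64)
```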
The advantage of the first extraction network is as follows. In the prior art, deep features are usually extracted with convolutional or recurrent neural networks; such network structures suffer from long-term dependence problems that degrade extraction performance, and training is slow. When long speech data is processed it must be split into blocks before being fed to the model, and with a recurrent neural network the blocks must also be fed in sequence. This embodiment instead extracts the deep features of the speech data with the first extraction network, whose Transformer encoder uses a multi-head attention mechanism and thus avoids recursion entirely; thanks to multi-head attention and position embedding, deep features can be extracted accurately without feeding the speech data sequentially, which significantly reduces training time and resource requirements and avoids recurrence and convolution.
The second extraction network segments the speech data and extracts the paralinguistic features of the segmented speech data. Specifically, the second extraction network comprises, in order, a segmentation layer, a sinc-based convolution structure and a multi-pooling-based convolution structure (CMPU). The segmentation layer segments the speech data; the sinc-based convolution structure and the multi-pooling convolution structure (CMPU) extract the paralinguistic features of the segmented speech data, including the pitch (fundamental frequency) and formants of the speech.
in this embodiment, the segmentation process specifically segments the waveform of the voice data into blocks of 100ms, where each block overlaps 5ms, for example, the first block is 1ms to 100ms, the second block is 95ms to 195ms, and the third block is 190ms to 290ms; in other embodiments, the length of each block and the length of the overlap of the division processing may be set according to the need.
The structure of the sinc-based convolution is shown in fig. 3. The sinc-based convolution uses a bank of filters; in this embodiment the filter length is 251 and the number of filters is 80, while in other embodiments the length and number of filters may be chosen as needed. The sinc-based convolution is computed as follows:
y[n]=x[n]*g[n,f1,f2]
where y[n] is the output of the sinc-based convolution structure, x[n] is the input speech, n is the sample (time) index, g is the sinc band-pass filter, and f1 and f2 are the learned low and high cut-off frequencies respectively; the cut-off frequencies may be randomly initialized within [0, fs/2], where fs is the sampling frequency of the input signal.
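For illustration, a hedged PyTorch sketch of such a sinc-based convolution layer is shown below: each of the 80 filters of length 251 is an ideal band-pass filter g[n, f1, f2] parameterized only by learnable cut-off frequencies initialized randomly in [0, fs/2]. Windowing and other refinements used in published sinc-convolution layers are omitted, so this is a simplified stand-in rather than the exact structure of the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv(nn.Module):
    def __init__(self, n_filters=80, kernel_size=251, fs=16000):
        super().__init__()
        self.kernel_size = kernel_size
        f = torch.sort(torch.rand(n_filters, 2) * fs / 2, dim=1).values
        self.f1 = nn.Parameter(f[:, 0])            # learned low cut-offs
        self.f2 = nn.Parameter(f[:, 1])            # learned high cut-offs
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1).float()
        self.register_buffer("n", n / fs)          # time axis in seconds

    def forward(self, x):                          # x: (batch, 1, samples)
        # Ideal band-pass filter: difference of two sinc low-pass filters.
        def lowpass(fc):                           # -> (n_filters, kernel_size)
            fc = fc.abs().unsqueeze(1)
            return 2 * fc * torch.sinc(2 * fc * self.n)
        g = lowpass(self.f2) - lowpass(self.f1)
        g = g / (g.abs().max(dim=1, keepdim=True).values + 1e-8)
        return F.conv1d(x, g.unsqueeze(1), padding=self.kernel_size // 2)

# Usage: out = SincConv()(torch.randn(4, 1, 1600))  # -> (4, 80, 1600)
```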
Preferably, in order to capture different characteristic information in the speech data through different pooling modes, the second extraction network comprises three multi-pooling-based convolution structures (CMPUs). The structure of each CMPU is shown in fig. 4: it comprises a convolution layer followed, in parallel, by an average pooling layer and a maximum pooling layer; the outputs of the two pooling layers are concatenated and fed into the next multi-pooling-based convolution structure. The multi-pooling-based convolution is computed as follows:
S = conv(x)

S1 = Maxpool1d(S)

S2 = Avgpool1d(S)

S = concat(S1, S2)
where x represents the output of a sinc-based convolution structure and S represents the output of a multi-pooling-based convolution structure.
Preferably, all convolution layers in the second extraction network are layer-normalized and the activation function is preferably the ReLU activation function; in other embodiments other activation functions may be selected. Preferably, the dropout value of the convolution layers of the second extraction network is between 0.2 and 0.5; in this embodiment it is preferably 0.5.
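For illustration, a sketch of one multi-pooling convolution unit (CMPU) with the layer normalization, ReLU activation and dropout described above is given below. Kernel sizes, pooling widths and channel counts are illustrative assumptions, the concatenation axis is assumed to be the channel axis, and GroupNorm with a single group is used as a per-sample layer normalization for 1-D convolution outputs.

```python
import torch
import torch.nn as nn

class CMPU(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=5, pool=2, dropout=0.5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2),
            nn.GroupNorm(1, out_ch),   # acts as per-sample layer normalization
            nn.ReLU(),
            nn.Dropout(dropout))       # dropout in the 0.2-0.5 range
        self.maxp = nn.MaxPool1d(pool)
        self.avgp = nn.AvgPool1d(pool)

    def forward(self, x):              # x: (batch, in_ch, time)
        s = self.conv(x)
        # Concatenate the max-pooled and average-pooled views along channels.
        return torch.cat([self.maxp(s), self.avgp(s)], dim=1)

# Three stacked CMPUs as in the embodiment; the input channel count (80)
# matches the sinc filters, other widths are assumptions.
paralinguistic_net = nn.Sequential(CMPU(80, 64), CMPU(128, 64), CMPU(128, 64))
```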
The second extraction network can extract LLD (low-level descriptor) features from the speech data more effectively. Compared with prior-art approaches that convolve the raw speech directly, the second extraction network convolves the speech with a sinc-function-based convolution module that acts as a filter bank suited to emotion recognition and therefore yields more effective features. This embodiment further provides the multi-pooling convolution unit (CMPU) to extract features, capturing different information through different pooling modes so that the paralinguistic features are captured more comprehensively.
There is no fixed order between the first extraction network and the second extraction network in extracting features from the speech data; the two extractions are performed simultaneously.
The fusion network fuses the paralinguistic features with the deep features. Specifically, the deep features and the paralinguistic features are each mapped to the same dimension by a multi-layer perceptron, as follows:
S1=MLP(X1,64)
S2=MLP(X2,64)
where X1 and X2 denote the deep features and the paralinguistic features respectively, and MLP is the multi-layer perceptron mapping them to 64 dimensions.
The mapped deep features and paralinguistic features are fused by a Concat operation; the fused features have output dimension (B, T, 64), where B denotes the batch size and T the number of frames. The fused features are input into the self-attention network, which captures the global speech representation and obtains the key features in the fused features; the self-attention computation follows the standard scaled dot-product attention described above for the multi-head attention mechanism.
The key features are input into the softmax layer for emotion prediction and classification, and the prediction result is output.
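For illustration, a hedged sketch of this fusion stage is given below. The two feature streams are assumed to be aligned to the same number of frames T; because the text does not state how concatenating two 64-dimensional streams yields the stated (B, T, 64) fused size, the sketch concatenates along the feature axis and projects back to 64 dimensions with a linear layer. The MLP form and the number of emotion classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, deep_dim=64, para_dim=128, d=64, n_classes=4):
        super().__init__()
        self.mlp_deep = nn.Sequential(nn.Linear(deep_dim, d), nn.ReLU())
        self.mlp_para = nn.Sequential(nn.Linear(para_dim, d), nn.ReLU())
        self.fuse = nn.Linear(2 * d, d)    # back to the stated (B, T, 64) size
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.head = nn.Linear(d, n_classes)

    def forward(self, deep, para):         # both assumed shaped (B, T, dim)
        fused = self.fuse(torch.cat(
            [self.mlp_deep(deep), self.mlp_para(para)], dim=-1))  # (B, T, 64)
        key, _ = self.attn(fused, fused, fused)  # self-attention over frames
        logits = self.head(key.mean(dim=1))      # pool over frames
        # logits.softmax(dim=-1) gives the predicted emotion distribution.
        return logits
```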
Preferably, in order to speed up feature extraction from the speech data, interference factors in the speech data are removed before extraction. Before the speech data is fed into the emotion recognition model it undergoes classification preprocessing, which comprises resampling, silence-segment removal and denoising; specifically, the speech data is resampled to 16 kHz, and silence removal and denoising are performed based on VAD (voice activity detection).
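A minimal sketch of this classification preprocessing, assuming librosa and a simple energy-based voice-activity detector in place of a dedicated VAD or denoiser, could be:

```python
import numpy as np
import librosa

def preprocess(path, sr=16000, top_db=30):
    y, _ = librosa.load(path, sr=sr)                   # resample to 16 kHz
    voiced = librosa.effects.split(y, top_db=top_db)   # non-silent intervals
    return np.concatenate([y[s:e] for s, e in voiced]) if len(voiced) else y
```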
Preferably, as shown in fig. 5, the emotion recognition model is trained by the following steps:
Speech data is acquired and training preprocessing is performed on it; as shown in fig. 6, the training preprocessing comprises resampling, silence-segment removal, denoising, data augmentation and data labelling. Compared with the classification preprocessing, the training preprocessing adds two steps, data augmentation and data labelling: data augmentation captures more information in the speech data and thus improves training, while the data labels are used to evaluate the emotion recognition results output during training and improve the accuracy of the emotion recognition model. In this embodiment the data augmentation method is vocal-tract length perturbation.
The speech data is fed into an initial model for training, the initial model being the untrained emotion recognition model: the WP-log-Mel spectrogram features are extracted from the speech data and input into the encoder to obtain deep features; the speech data is segmented and the paralinguistic features of the segmented speech data are extracted; and the paralinguistic features are fused with the deep features to obtain the key features in the fused features.
The emotion recognition model performs emotion prediction and classification according to the key features and outputs the prediction result; the loss between the prediction result and the data labels is computed with a loss function until the model converges, generating the emotion recognition model. In this embodiment the loss function is preferably the PyTorch cross-entropy loss function, which performs better on imbalanced data; during training a dynamic learning-rate strategy is used, reducing the learning rate by a fixed proportion whenever there is no improvement within 20 epochs.
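For illustration, a hedged PyTorch training-loop sketch with the cross-entropy loss and a learning-rate schedule that decays the rate after 20 epochs without improvement is given below; the optimizer, learning rate and decay factor are illustrative assumptions, and the model is assumed to return class logits (the softmax layer being folded into the cross-entropy loss during training).

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=200, device="cpu"):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=20)
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                           for x, y in val_loader) / max(len(val_loader), 1)
        # Reduce the learning rate when validation loss stops improving.
        scheduler.step(val_loss)
```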
Preferably, if the amount of speech data available for training is small, the training preprocessing step further comprises generating a plurality of copies of the speech data with identical data labels, so as to increase the amount of training data.
The prior art also includes methods that first translate the speech data into text and then perform emotion recognition on the text. In this embodiment the speech data is input directly into the emotion recognition model, which avoids the speech-to-text step and the errors it introduces, thereby improving emotion recognition precision.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
Claims (10)
1. A speech emotion recognition and classification method, characterized by comprising the following steps:

placing the speech data into an emotion recognition model;

extracting WP-log-Mel spectrogram features from the speech data and inputting the spectrogram features into an encoder to obtain deep features; segmenting the speech data and extracting the paralinguistic (secondary language) features of the segmented speech data;

fusing the paralinguistic features with the deep features to obtain the key features in the fused features; and performing, by the emotion recognition model, emotion prediction and classification according to the key features and outputting the prediction result.
2. The method of claim 1, wherein the step of the first extraction network extracting the WP-log-Mel spectrogram features from the speech data is specifically: extracting the WP-log-Mel spectrogram features from the speech signal of the speech data using the wavelet packet transform and audio processing.
3. The speech emotion recognition and classification method of claim 1, wherein the output dimension of the fused features is (B, T, 64), where B denotes the batch size and T the number of frames.
4. A method of classifying speech emotion recognition as claimed in claim 1, 2 or 3, characterized in that the emotion recognition model comprises a first extraction network, a second extraction network, a fusion network, a self-attention mechanism network and a softmax layer;
the first extraction network comprises a Transformer encoder consisting of two or more identical blocks, with two or more heads in each block's multi-head attention mechanism.
5. The speech emotion recognition classification method of claim 4, wherein the second extraction network comprises, in order, a segmentation layer, a sinc-based convolution structure, and a multi-pooling-based convolution structure.
6. The speech emotion recognition classification method of claim 5, wherein the multi-pooling-based convolution structure comprises a convolution layer followed by parallel average pooling layers and maximum pooling layers.
7. The speech emotion recognition classification method of claim 6, wherein the convolutional layers are layer-normalized and the dropout value of the convolutional layers is between 0.2 and 0.5.
8. The method for classifying speech emotion recognition of claim 1, 2, 3, 5, 6 or 7, wherein the step of placing the speech data into the emotion recognition model comprises: performing classification preprocessing on the speech data, the classification preprocessing comprising silence-segment removal and denoising, and placing the processed speech data into the emotion recognition model.
9. The speech emotion recognition classification method of claim 1, 2, 3, 5, 6 or 7, wherein the emotion recognition model is trained by the following steps:

acquiring speech data and performing training preprocessing on the speech data, the training preprocessing comprising silence-segment removal, denoising, data augmentation and data labelling;

feeding the speech data into an initial model for training, the initial model being the untrained emotion recognition model; extracting WP-log-Mel spectrogram features from the speech data and inputting the spectrogram features into an encoder to obtain deep features; segmenting the speech data and extracting the paralinguistic features of the segmented speech data;

fusing the paralinguistic features with the deep features to obtain the key features in the fused features; performing, by the emotion recognition model, emotion prediction and classification according to the key features and outputting the prediction result;

and computing the loss between the prediction result and the data labels based on a loss function until the model converges, generating the emotion recognition model.
10. A method of speech emotion recognition classification as claimed in claim 8 or 9, wherein the training preprocessing step further comprises generating a plurality of speech copies from the speech data.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211516305.9A (CN116230020A) | 2022-11-29 | 2022-11-29 | Speech emotion recognition and classification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116230020A true CN116230020A (en) | 2023-06-06 |
Family
ID=86588018
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211516305.9A (CN116230020A, pending) | Speech emotion recognition and classification method | 2022-11-29 | 2022-11-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116230020A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116504259A (en) * | 2023-06-30 | 2023-07-28 | 中汇丰(北京)科技有限公司 | Semantic recognition method based on natural language processing |
CN116504259B (en) * | 2023-06-30 | 2023-08-29 | 中汇丰(北京)科技有限公司 | Semantic recognition method based on natural language processing |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ma et al. | Emotion recognition from variable-length speech segments using deep learning on spectrograms. | |
CN110808033B (en) | Audio classification method based on dual data enhancement strategy | |
Kim et al. | Emotion Recognition from Human Speech Using Temporal Information and Deep Learning. | |
CN111326178A (en) | Multi-mode speech emotion recognition system and method based on convolutional neural network | |
NL2029780B1 (en) | Speech separation method based on time-frequency cross-domain feature selection | |
CN114566189B (en) | Speech emotion recognition method and system based on three-dimensional depth feature fusion | |
CN118248177B (en) | Speech emotion recognition system and method based on approximate nearest neighbor search algorithm | |
CN112183582A (en) | Multi-feature fusion underwater target identification method | |
CN114783418B (en) | End-to-end voice recognition method and system based on sparse self-attention mechanism | |
CN112927723A (en) | High-performance anti-noise speech emotion recognition method based on deep neural network | |
CN116230020A (en) | Speech emotion recognition and classification method | |
Gao et al. | Mixed-bandwidth cross-channel speech recognition via joint optimization of DNN-based bandwidth expansion and acoustic modeling | |
Kamal et al. | An innovative approach utilizing binary-view transformer for speech recognition task | |
CN113611286B (en) | Cross-language speech emotion recognition method and system based on common feature extraction | |
CN118280371A (en) | Voice interaction method and system based on artificial intelligence | |
Singh et al. | E-PANNs: Sound Recognition Using Efficient Pre-trained Audio Neural Networks | |
CN109346104A (en) | A kind of audio frequency characteristics dimension reduction method based on spectral clustering | |
Poojary et al. | Speech Emotion Recognition Using MLP Classifier | |
CN106228984A (en) | Voice recognition information acquisition methods | |
Zhou et al. | Environmental sound classification of western black-crowned gibbon habitat based on spectral subtraction and VGG16 | |
CN114626424A (en) | Data enhancement-based silent speech recognition method and device | |
CN114792518A (en) | Voice recognition system based on scheduling domain technology, method thereof and storage medium | |
Thomas et al. | Language identification using deep neural network for Indian languages | |
CN111312215A (en) | Natural speech emotion recognition method based on convolutional neural network and binaural representation | |
CN113327604B (en) | Method for identifying super-phrase voice and language |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |