CN113808620B - Tibetan language emotion recognition method based on CNN and LSTM - Google Patents

Tibetan language emotion recognition method based on CNN and LSTM

Info

Publication number
CN113808620B
CN113808620B (Application No. CN202110995181.6A)
Authority
CN
China
Prior art keywords
tibetan
speech
emotion
lstm
emotion recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110995181.6A
Other languages
Chinese (zh)
Other versions
CN113808620A (en)
Inventor
边巴旺堆
王希
王君堡
卓嘎
云登努布
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tibet University
Original Assignee
Tibet University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tibet University filed Critical Tibet University
Priority to CN202110995181.6A priority Critical patent/CN113808620B/en
Publication of CN113808620A publication Critical patent/CN113808620A/en
Application granted granted Critical
Publication of CN113808620B publication Critical patent/CN113808620B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a Tibetan speech emotion recognition method based on CNN and LSTM, belonging to the technical field of speech emotion recognition and comprising the following steps: establishing a Tibetan speech emotion corpus; preprocessing the Tibetan speech data in the Tibetan speech emotion corpus; performing feature extraction on the preprocessed Tibetan speech data to obtain a Tibetan speech spectrum; training a Tibetan speech emotion recognition network on the Tibetan speech spectrum to obtain a trained Tibetan speech emotion recognition network; and preprocessing and extracting features from the Tibetan speech data to be recognized, then inputting the result into the trained Tibetan speech emotion recognition network to obtain the Tibetan speech emotion classification result corresponding to the Tibetan speech data. The Tibetan speech emotion recognition method based on CNN and LSTM solves the problem of Tibetan speech emotion recognition.

Description

Tibetan language emotion recognition method based on CNN and LSTM
Technical Field
The invention belongs to the technical field of speech emotion recognition, and particularly relates to a Tibetan speech emotion recognition method based on CNN and LSTM.
Background
Tibetan speech emotion recognition is a special case of speech emotion recognition: emotional Tibetan speech is taken as input, and on the basis of an established mapping relationship the computer recognizes the emotion carried by the Tibetan speech, thereby enabling human-computer interaction.
In recent years, with the growing popularity of deep learning, many researchers have applied it to a wide range of fields, and speech recognition is one of its most active application areas. In speech emotion recognition, improving the robustness and accuracy of recognition has always been an important core problem to be explored and solved. Many researchers have made great efforts, and a variety of research results have emerged, such as multi-modal emotion recognition methods, emotion recognition methods based on the fusion of multiple classifiers, and emotion recognition systems based on deep neural networks.
However, most of the speech data used in speech emotion recognition in recent years come from Chinese, English, and similar corpora, and no speech emotion recognition method for a Tibetan corpus exists yet; in addition, the robustness and accuracy of current speech emotion recognition still need to be improved. Addressing these two points, this scheme proposes a Tibetan speech emotion recognition method that improves recognition robustness and accuracy, namely a Tibetan speech emotion recognition method based on CNN and LSTM.
Disclosure of Invention
To address the above deficiencies in the prior art, the invention provides a Tibetan speech emotion recognition method based on CNN and LSTM that solves the problem of Tibetan speech emotion recognition.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that:
the scheme provides a Tibetan language speech emotion recognition method based on CNN and LSTM, which comprises the following steps:
s1, establishing a Tibetan language speech emotion corpus;
s2, preprocessing Tibetan voice data in the Tibetan voice emotion corpus;
s3, performing feature extraction on the Tibetan speech data in the preprocessed Tibetan speech emotion corpus to obtain a Tibetan speech spectrum;
s4, training the Tibetan speech emotion recognition network according to the Tibetan speech spectrum to obtain a trained Tibetan speech emotion recognition network;
and S5, preprocessing and feature extracting Tibetan speech data to be recognized, and inputting the preprocessed and feature extracted Tibetan speech data into the trained Tibetan speech emotion recognition network to obtain a Tibetan speech emotion classification result corresponding to the Tibetan speech data.
The invention has the following beneficial effects: the invention provides a Tibetan speech emotion recognition method, filling the current gap in Tibetan speech emotion recognition; the Tibetan speech emotion recognition network combining CNN and LSTM can extract the abstract emotional features in the speech signal more fully, making emotion classification more accurate; the method uses a Hamming window to preprocess the speech signal, uses a Mel-scale filter to obtain the Tibetan speech spectrum related to human auditory perception, and inputs the Tibetan speech spectrum into the Tibetan speech emotion recognition network composed of a CNN and an LSTM for forward and backward training, obtaining the trained Tibetan speech emotion recognition network.
Further, the specific steps of step S1 are as follows:
s11, recording Tibetan voice data;
s12, emotion marking is carried out on the Tibetan speech data to obtain an initial Tibetan speech emotion corpus;
and S13, dividing the initial Tibetan language emotion corpus into a training set and a test set, and completing the establishment of the Tibetan language emotion corpus.
The beneficial effect of adopting the further scheme is as follows: corresponding Tibetan speech emotion data are recorded by a professional, a Tibetan speech emotion corpus is established, speech data are provided for accurately recognizing Tibetan speech emotion, and the Tibetan speech emotion corpus is divided into a training set and a test set for training and testing of a Tibetan speech emotion recognition network.
Further, the specific steps of step S2 are as follows:
s21, pre-emphasis: pre-emphasis processing is carried out on Tibetan language voice data in a Tibetan language emotion corpus training set, and the expression of the pre-emphasis processing is as follows:
g(n)=x(n)-ax(n-1)
where x(n) represents the input Tibetan speech sample, g(n) represents the pre-emphasized Tibetan speech sample, x(n-1) represents the previous input sample, and a represents the emphasis coefficient, with a = 0.96;
S22, framing: performing a framing operation on the pre-emphasized Tibetan speech data according to a preset frame length and a preset frame shift to obtain multiple frames of the Tibetan speech signal;
S23, windowing: multiplying each frame of the Tibetan speech signal by the window function to obtain the framed and windowed Tibetan speech signal, completing the preprocessing of the Tibetan speech data.
The beneficial effect of adopting the above further scheme is that: pre-emphasis boosts the energy of the high-frequency components of the Tibetan speech data; framing and windowing are then applied to the Tibetan speech data in the training set, with adjacent frames overlapping. The Hamming window mainly reflects the data in the middle of a frame, so information at both ends of the frame may be lost; however, because the window is shifted by only 1/3 or 1/2 of the window length each time, the data lost from the previous frame or two is captured again, ensuring the accuracy and integrity of the Tibetan speech data.
Further, the window function in step S23 adopts a Hamming window, and the Hamming window w(n) is expressed as follows:
w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1
where a0 is a constant equal to 0.53836, N represents the length of the Hamming window, and n represents the sample index of the input signal within the window.
The beneficial effect of adopting the above further scheme is that: the amplitude-frequency characteristic of the Hamming window is its large side-lobe attenuation; the attenuation from the main-lobe peak to the first side lobe can reach about 40 dB, which reduces spectral leakage.
Further, the specific steps of step S3 are as follows:
S31, performing a short-time Fourier transform on the framed and windowed Tibetan speech signal from step S23 and stacking the frames to obtain a Tibetan spectrogram;
and S32, processing the Tibetan spectrogram by using a Mel scale filter to obtain the Tibetan voice spectrum related to the auditory sense of the human ear.
The beneficial effect of adopting the above further scheme is that: the framed and windowed Tibetan speech signals are transformed by the short-time Fourier transform and stacked frame by frame to obtain the Tibetan spectrogram; the unit of the Mel frequency scale used by the Mel-scale filter is the mel, which is defined to describe pitch and reflects the nonlinear relationship between frequency and perceived pitch; the Tibetan speech spectrum is a Tibetan spectrogram with Mel characteristics.
Further, the Mel-scale filter H_m(k) in step S32 is expressed as follows:
H_m(k) = 0, for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0, for k > f(m+1)
where m denotes the m-th filter, k denotes the frequency index, and f(·) denotes the center frequency of each triangular filter in the Mel-scale filter bank H_m(k).
The beneficial effect of adopting the above further scheme is that: the Mel-scale filter H_m(k) is a triangular filter bank that maps the linear spectrum into a Mel-frequency nonlinear spectrum based on auditory perception, which can then be converted to a cepstrum.
Further, the specific steps of step S4 are as follows:
S41, forward propagation training: inputting the Tibetan speech spectrum into the Tibetan speech emotion recognition network composed of a CNN and an LSTM for training to obtain the predicted emotion feature category y(t);
S42, back propagation training: the set predicted emotion category Y'(t) is taken as input to reversely train the Tibetan speech emotion recognition network composed of the CNN and the LSTM, obtaining the category Y(t) closest to the real emotion features; the network parameters are then adjusted by a gradient descent algorithm so that the error between Y(t) and the predicted emotion feature category y(t) is smaller than a preset value, yielding the trained Tibetan speech emotion recognition network.
The beneficial effect of adopting the above further scheme is that: forward propagation and backward propagation training are performed on the Tibetan speech emotion recognition network composed of a CNN and an LSTM, and the gradient descent algorithm is used to reduce the error between the category Y(t) closest to the real emotion features and the predicted emotion feature category y(t), obtaining the trained Tibetan speech emotion recognition network and realizing accurate emotion recognition of Tibetan speech signals.
Further, the specific steps of step S41 are as follows:
s411, inputting the Tibetan voice spectrum into a three-channel CNN network for training to obtain three-dimensional characteristics of the Tibetan voice spectrum;
The first channel consists of sequentially arranged convolution layers with 20 convolution kernels of size 3×3 and a stride of 1; the second channel consists of sequentially arranged convolution layers with 40 convolution kernels of size 5×5 and a stride of 2; the third channel consists of sequentially arranged convolution layers with 60 convolution kernels of size 7×7 and a stride of 2;
s412, randomly arranging and combining the three-dimensional features to obtain an emotional feature vector x (t);
S413, normalizing the emotional feature vector, and inputting the normalized emotional feature vector into the LSTM network for training to obtain the memory-bearing long-time-domain emotional feature y_i(t);
The long-time-domain emotional feature y_i(t) is expressed as follows:
C~(t) = f(W·x(t) + V·y_i(t-1))
C(t) = g_forget(t)·C(t-1) + g_in(t)·C~(t)
y_i(t) = g_out(t)·f(C(t))
g_out(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t))
g_forget(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
g_in(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
where x(t) represents the emotional feature vector, y_i(t-1) represents the long-time-domain emotional feature at the previous moment, C~(t) represents the candidate memory, C(t) represents the memory state, W represents the first weight, V represents the second weight, U represents the third weight, f(·) represents the activation function tanh, g_in(t) is the input gate of the LSTM, g_forget(t) is the forget gate of the LSTM, g_out(t) is the output gate of the LSTM, sigmoid(·) is the sigmoid activation function, and t represents time.
S414, inputting the long-time-domain emotional features y_i(t) into the fully connected layer for processing to obtain all long-time-domain emotional features, and outputting the predicted emotion feature category y(t) through a Softmax classification layer;
The predicted emotion feature category y(t) output by the Softmax classification layer is expressed as follows:
y(t) = e^(y_i(t)) / Σ_{i=1}^{k} e^(y_i(t))
where y_i(t) represents a long-time-domain emotional feature, e represents the constant e, Σ_{i=1}^{k} e^(y_i(t)) represents the sum over all long-time-domain emotion feature categories, and i is the index of the long-time-domain emotion feature category, i = 1, 2, ..., k, with k the total number of long-time-domain emotion categories.
The beneficial effect of adopting the further scheme is as follows: the three-dimensional features of the Tibetan speech spectrum are extracted by the CNN, the memory-bearing long-time-domain emotional features y_i(t) are trained with the LSTM network, all long-time-domain emotional features are obtained through the fully connected layer, and the predicted emotion feature category y(t) is output through the Softmax layer.
Further, the error function e(t) in step S42 is expressed as follows:
e(t) = (1/(2n)) · Σ (Y(t) - y(t))²
where n represents the total number of samples, Y(t) represents the category closest to the real emotion features, y(t) represents the predicted emotion feature category, and t represents time.
The beneficial effect of adopting the further scheme is as follows: the error function e(t) adopts a quadratic cost function, which is suitable for the case where the output neurons are linear.
Further, the specific steps of step S5 are as follows:
S51, preprocessing the Tibetan speech data to be recognized and extracting features, inputting the result into the trained Tibetan speech emotion recognition network, and outputting predicted emotion feature categories y(t) with their probabilities through the Softmax classification layer;
S52, selecting the predicted emotion feature category y(t) with the maximum probability as the Tibetan speech emotion classification result corresponding to the Tibetan speech data.
The beneficial effect of adopting the above further scheme is that: the Softmax classification layer outputs several Tibetan speech emotion category results with non-zero probability, and accurate recognition of the Tibetan speech emotion is achieved by selecting the predicted emotion feature category y(t) with the maximum probability.
Drawings
FIG. 1 is a flowchart of steps of a Tibetan language emotion recognition method based on CNN and LSTM in the embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the present invention by those skilled in the art. It should be understood, however, that the present invention is not limited to the scope of the embodiments; to those skilled in the art, various changes are possible without departing from the spirit and scope of the invention as defined in the appended claims, and all matter produced using the inventive concept is protected.
As shown in fig. 1, in an embodiment of the present invention, the present invention provides a CNN and LSTM based Tibetan language emotion recognition method, which includes the following steps:
s1, establishing a Tibetan language emotion corpus;
the specific steps of step S1 are as follows:
s11, recording Tibetan voice data;
s12, emotion marking is carried out on the Tibetan language voice data to obtain an initial Tibetan language emotion corpus;
s13, dividing the initial Tibetan language emotion corpus into a training set and a test set to complete the establishment of the Tibetan language emotion corpus;
Corresponding Tibetan speech emotion data are recorded by professionals and a Tibetan speech emotion corpus is established, providing speech data for accurately recognizing Tibetan speech emotion; the Tibetan speech emotion corpus is divided into a training set and a test set for training and testing the Tibetan speech emotion recognition network.
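For illustration, a minimal sketch of how the corpus split of step S13 might be organized is given below (Python). The directory layout, the file-naming convention that encodes the emotion label, and the 80/20 split ratio are assumptions of this sketch; the patent only specifies that the corpus is divided into a training set and a test set.

```python
import os
import random

# Hypothetical layout (assumption): WAV files named "<emotion>_<id>.wav" inside corpus_dir.
def split_corpus(corpus_dir, train_ratio=0.8, seed=42):
    """Split the recorded, emotion-labelled Tibetan speech files into training and test sets."""
    files = [f for f in os.listdir(corpus_dir) if f.endswith(".wav")]
    labels = [f.split("_")[0] for f in files]      # emotion tag read from the file name
    pairs = list(zip(files, labels))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_ratio)
    return pairs[:cut], pairs[cut:]                # (training set, test set)
```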
S2, preprocessing Tibetan voice data in the Tibetan voice emotion corpus;
the specific steps of step S2 are as follows:
s21, pre-emphasis: pre-emphasis processing is carried out on Tibetan language voice data in a Tibetan language emotion corpus training set, and the expression of the pre-emphasis processing is as follows:
g(n)=x(n)-ax(n-1)
where x(n) represents the input Tibetan speech sample, g(n) represents the pre-emphasized Tibetan speech sample, x(n-1) represents the previous input sample, and a represents the emphasis coefficient, with a = 0.96;
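A minimal NumPy sketch of the pre-emphasis step g(n) = x(n) - a·x(n-1) with a = 0.96; the array-based interface and the handling of the first sample are assumptions of this sketch, not requirements of the patent.

```python
import numpy as np

def pre_emphasis(x, a=0.96):
    """g(n) = x(n) - a * x(n-1); the first sample is passed through unchanged."""
    x = np.asarray(x, dtype=np.float64)
    return np.append(x[0], x[1:] - a * x[:-1])
```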
S22, framing: performing a framing operation on the pre-emphasized Tibetan speech data according to a preset frame length and a preset frame shift to obtain multiple frames of the Tibetan speech signal;
S23, windowing: multiplying each frame of the Tibetan speech signal by the window function to obtain the framed and windowed Tibetan speech signal, completing the preprocessing of the Tibetan speech data.
In step S23, a Hamming window is used as the window function, and the Hamming window w(n) is expressed as follows:
w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1
where a0 is a constant equal to 0.53836, N represents the length of the Hamming window, and n represents the sample index of the input signal within the window;
Framing and windowing are applied to the Tibetan speech data in the training set, with adjacent frames overlapping. The Hamming window mainly reflects the data in the middle of a frame, so information at both ends of the frame may be lost; however, because the window is shifted by only 1/3 or 1/2 of the window length each time, the data lost from the previous frame or two is captured again, ensuring the accuracy and integrity of the Tibetan speech data. The amplitude-frequency characteristic of the Hamming window is its large side-lobe attenuation; the attenuation from the main-lobe peak to the first side lobe can reach about 40 dB, which reduces spectral leakage.
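A sketch of the framing and Hamming-windowing of steps S22 and S23, using w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)) with a0 = 0.53836. The 25 ms frame length and 10 ms frame shift are assumed values, since the patent only states that they are preset.

```python
import numpy as np

def frame_and_window(signal, sr, frame_len_s=0.025, frame_shift_s=0.010, a0=0.53836):
    """Split the pre-emphasized signal into overlapping frames and multiply each frame by w(n)."""
    frame_len = int(round(frame_len_s * sr))
    frame_shift = int(round(frame_shift_s * sr))
    n = np.arange(frame_len)
    hamming = a0 - (1.0 - a0) * np.cos(2.0 * np.pi * n / (frame_len - 1))   # Hamming window w(n)
    num_frames = 1 + (len(signal) - frame_len) // frame_shift               # signal assumed >= one frame
    frames = np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(num_frames)])
    return frames * hamming                                                 # (num_frames, frame_len)
```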
S3, performing feature extraction on the Tibetan speech data in the preprocessed Tibetan speech emotion corpus to obtain a Tibetan speech spectrum;
the specific steps of step S3 are as follows:
S31, performing a short-time Fourier transform on the framed and windowed Tibetan speech signal from step S23 and stacking the frames to obtain a Tibetan spectrogram;
S32, processing the Tibetan spectrogram with a Mel-scale filter to obtain the Tibetan speech spectrum related to human auditory perception, i.e. a Tibetan spectrogram with Mel characteristics.
The Mel-scale filter H_m(k) in step S32 is expressed as follows:
H_m(k) = 0, for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0, for k > f(m+1)
where m denotes the m-th filter, k denotes the frequency index, and f(·) denotes the center frequency of each triangular filter in the Mel-scale filter bank H_m(k);
The framed and windowed Tibetan speech signals are transformed by the short-time Fourier transform and stacked frame by frame to obtain the Tibetan spectrogram; the unit of the Mel frequency scale used by the Mel-scale filter is the mel, which is defined to describe pitch and reflects the nonlinear relationship between frequency and perceived pitch; the Mel-scale filter H_m(k) is a triangular filter bank that maps the linear spectrum into a Mel-frequency nonlinear spectrum based on auditory perception, which can then be converted to a cepstrum.
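A sketch of steps S31 and S32: the short-time Fourier transform of the windowed frames and a triangular Mel-scale filter bank H_m(k). The sampling rate, FFT size, number of filters, and the log compression are assumptions for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sr=16000, n_fft=512, n_mels=40):
    """Triangular filters H_m(k): rise from f(m-1) to f(m), fall from f(m) to f(m+1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)      # f(m) as FFT bin indices
    H = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return H

def mel_spectrum(frames, sr=16000, n_fft=512, n_mels=40):
    """Power spectrum of each windowed frame (stacked per frame), weighted by the Mel filter bank."""
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    return np.log(spec @ mel_filter_bank(sr, n_fft, n_mels).T + 1e-10)      # (num_frames, n_mels)
```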
S4, training the Tibetan speech emotion recognition network according to the Tibetan speech spectrum to obtain a trained Tibetan speech emotion recognition network;
the specific steps of step S4 are as follows:
s41, forward propagation training: inputting the Tibetan speech spectrum into a Tibetan speech emotion recognition network consisting of a CNN network and an LSTM for training to obtain a predicted emotion feature type y (t);
the specific steps of step S41 are as follows:
s411, inputting the Tibetan voice spectrum into a three-channel CNN network for training to obtain three-dimensional characteristics of the Tibetan voice spectrum;
The first channel consists of sequentially arranged convolution layers with 20 convolution kernels of size 3×3 and a stride of 1; the second channel consists of sequentially arranged convolution layers with 40 convolution kernels of size 5×5 and a stride of 2; the third channel consists of sequentially arranged convolution layers with 60 convolution kernels of size 7×7 and a stride of 2;
s412, randomly arranging and combining the three-dimensional features to obtain an emotional feature vector x (t);
S413, normalizing the emotional feature vector, and inputting the normalized emotional feature vector into the LSTM network for training to obtain the memory-bearing long-time-domain emotional feature y_i(t);
The long-time-domain emotional feature y_i(t) is expressed as follows:
C~(t) = f(W·x(t) + V·y_i(t-1))
C(t) = g_forget(t)·C(t-1) + g_in(t)·C~(t)
y_i(t) = g_out(t)·f(C(t))
g_out(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t))
g_forget(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
g_in(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
where x(t) represents the emotional feature vector, y_i(t) represents the long-time-domain emotional feature, y_i(t-1) represents the long-time-domain emotional feature at the previous moment, C~(t) represents the candidate memory, C(t) represents the memory state, W represents the first weight, V represents the second weight, U represents the third weight, f(·) represents the activation function tanh, g_in(t) is the input gate of the LSTM, g_forget(t) is the forget gate of the LSTM, g_out(t) is the output gate of the LSTM, sigmoid(·) is the sigmoid activation function, and t represents time.
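The gate equations above contain U·C(t-1) and U·C(t) terms, i.e. peephole-style connections, which most off-the-shelf LSTM layers do not implement. A minimal NumPy sketch of one time step under that reading is shown below; the use of separate weight matrices per gate (the patent writes a single W, V, U in every gate) and the vector shapes are assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, c_prev, W, V, U):
    """One step of the cell described above; W, V are per-gate matrices, U are per-gate peephole vectors."""
    g_in     = sigmoid(W["in"]  @ x_t + V["in"]  @ y_prev + U["in"]  * c_prev)   # input gate g_in(t)
    g_forget = sigmoid(W["fg"]  @ x_t + V["fg"]  @ y_prev + U["fg"]  * c_prev)   # forget gate g_forget(t)
    c_tilde  = np.tanh(W["c"]   @ x_t + V["c"]   @ y_prev)                       # candidate memory
    c_t      = g_forget * c_prev + g_in * c_tilde                                # memory state C(t)
    g_out    = sigmoid(W["out"] @ x_t + V["out"] @ y_prev + U["out"] * c_t)      # output gate g_out(t)
    y_t      = g_out * np.tanh(c_t)                                              # long-time-domain feature y_i(t)
    return y_t, c_t
```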
S414, inputting the long-time-domain emotional features y_i(t) into the fully connected layer for processing to obtain all long-time-domain emotional features, and outputting the predicted emotion feature category y(t) through a Softmax classification layer;
The predicted emotion feature category y(t) output by the Softmax classification layer is expressed as follows:
y(t) = e^(y_i(t)) / Σ_{i=1}^{k} e^(y_i(t))
where y_i(t) represents a long-time-domain emotional feature, e represents the constant e, Σ_{i=1}^{k} e^(y_i(t)) represents the sum over all long-time-domain emotion feature categories, and i is the index of the long-time-domain emotion feature category, i = 1, 2, ..., k, with k the total number of long-time-domain emotion categories;
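A PyTorch sketch of the forward-propagation network of steps S411-S414: three parallel convolutional channels (20 kernels of 3×3 with stride 1, 40 of 5×5 with stride 2, 60 of 7×7 with stride 2), concatenation of the channel features into x(t), an LSTM, a fully connected layer, and Softmax. The input layout, padding, pooling, hidden width, number of emotion classes, and the use of torch.nn.LSTM instead of the peephole cell sketched above are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class TibetanSERNet(nn.Module):
    """Three-channel CNN + LSTM + fully connected layer + Softmax (sketch of steps S411-S414)."""
    def __init__(self, hidden=128, n_classes=4):
        super().__init__()
        self.ch1 = nn.Conv2d(1, 20, kernel_size=3, stride=1, padding=1)
        self.ch2 = nn.Conv2d(1, 40, kernel_size=5, stride=2, padding=2)
        self.ch3 = nn.Conv2d(1, 60, kernel_size=7, stride=2, padding=3)
        self.pool = nn.AdaptiveAvgPool2d((1, None))            # collapse the frequency axis (assumption)
        self.lstm = nn.LSTM(input_size=20 + 40 + 60, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, spec):                                   # spec: (batch, 1, n_mels, time)
        t = spec.size(-1)
        feats = []
        for conv in (self.ch1, self.ch2, self.ch3):
            f = torch.relu(conv(spec))                         # (batch, C, freq', time')
            f = nn.functional.interpolate(f, size=(f.size(2), t))   # re-align the time axis
            feats.append(self.pool(f).squeeze(2))              # (batch, C, time)
        x = torch.cat(feats, dim=1).transpose(1, 2)            # emotion feature vectors x(t): (batch, time, 120)
        y, _ = self.lstm(x)                                    # long-time-domain features y_i(t)
        logits = self.fc(y[:, -1, :])                          # fully connected layer on the last step
        return torch.softmax(logits, dim=-1)                   # predicted emotion probabilities y(t)
```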
S42, back propagation training: the set predicted emotion category Y'(t) is taken as input to reversely train the Tibetan speech emotion recognition network composed of the CNN and the LSTM, obtaining the category Y(t) closest to the real emotion features; the network parameters are adjusted by a gradient descent algorithm so that the error e(t) between Y(t) and the predicted emotion feature category y(t) is smaller than a preset value, yielding the trained Tibetan speech emotion recognition network;
The error function e(t) in step S42 is expressed as follows:
e(t) = (1/(2n)) · Σ (Y(t) - y(t))²
where n represents the total number of samples, Y(t) represents the category closest to the real emotion features, y(t) represents the predicted emotion feature category, and t represents time.
The three-dimensional features of the Tibetan speech spectrum are extracted by the CNN, the memory-bearing long-time-domain emotional features y_i(t) are trained with the LSTM network, all long-time-domain emotional features are obtained through the fully connected layer, and the predicted emotion feature category y(t) is output through the Softmax layer. Forward propagation and backward propagation training are performed on the Tibetan speech emotion recognition network composed of a CNN and an LSTM, and the gradient descent algorithm is used to reduce the error e(t) between the category Y(t) closest to the real emotion features and the predicted emotion feature category y(t), obtaining the trained Tibetan speech emotion recognition network and realizing accurate emotion recognition of Tibetan speech signals. The error function e(t) adopts a quadratic cost function, which is suitable for the case where the output neurons are linear.
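A sketch of the training of step S4 with the quadratic cost e(t) = (1/(2n))·Σ(Y(t) - y(t))² and a gradient-descent update; the optimizer settings, batching, epoch count, and one-hot encoding of the target category Y(t) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

def train(model, loader, n_classes=4, epochs=30, lr=1e-3):
    """Forward propagation, quadratic cost between Y(t) and y(t), gradient-descent parameter update."""
    criterion = nn.MSELoss()                                   # quadratic cost function
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)     # gradient descent
    for _ in range(epochs):
        for spec, label in loader:                             # spec: Mel spectra, label: emotion index
            target = nn.functional.one_hot(label, n_classes).float()   # category Y(t) closest to the real emotion
            pred = model(spec)                                 # predicted category y(t)
            loss = 0.5 * criterion(pred, target)               # e(t) = (1/(2n)) * sum (Y - y)^2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```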
S5, preprocessing Tibetan voice data to be recognized and extracting characteristics, and then inputting the preprocessed Tibetan voice data into a trained Tibetan voice emotion recognition network to obtain a Tibetan voice emotion classification result corresponding to the Tibetan voice data;
the specific steps of step S5 are as follows:
S51, preprocessing the Tibetan speech data to be recognized and extracting features in the same way as in steps S2 and S3, inputting the result into the trained Tibetan speech emotion recognition network, and outputting predicted emotion feature categories y(t) with their probabilities through the Softmax classification layer;
S52, selecting the predicted emotion feature category y(t) with the maximum probability as the Tibetan speech emotion classification result corresponding to the Tibetan speech data;
The Softmax classification layer outputs several Tibetan speech emotion category results with non-zero probability, and accurate recognition of the Tibetan speech emotion is achieved by selecting the predicted emotion feature category y(t) with the maximum probability.
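A sketch of the recognition step S5 using the preprocessing and feature-extraction sketches above: the unlabeled Tibetan speech is pre-emphasized, framed and windowed, converted to a Mel spectrum, passed through the trained network, and the category with the largest Softmax probability is returned. The emotion label list is a hypothetical placeholder, not taken from the patent.

```python
import torch

EMOTIONS = ["neutral", "happy", "angry", "sad"]        # hypothetical label set (assumption)

def recognize(model, signal, sr=16000):
    """Preprocess, extract the Mel spectrum, and pick the maximum-probability emotion category."""
    frames = frame_and_window(pre_emphasis(signal), sr)            # step S2 (sketches above)
    spec = mel_spectrum(frames, sr)                                # step S3
    x = torch.tensor(spec.T, dtype=torch.float32)[None, None]      # (1, 1, n_mels, time)
    with torch.no_grad():
        probs = model(x)                                           # Softmax probabilities y(t)
    return EMOTIONS[int(probs.argmax(dim=-1))]                     # category with the largest probability
```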
The invention has the following beneficial effects: the invention provides a Tibetan speech emotion recognition method, filling the current gap in Tibetan speech emotion recognition; the Tibetan speech emotion recognition network combining CNN and LSTM can extract the abstract emotional features in the speech signal more fully, making emotion classification more accurate; the scheme uses a Hamming window to preprocess the speech signal, uses a Mel-scale filter to obtain the Tibetan speech spectrum related to human auditory perception, and inputs the Tibetan speech spectrum into the Tibetan speech emotion recognition network composed of a CNN and an LSTM for forward and backward training, obtaining a trained network that can accurately recognize and classify the emotion of Tibetan speech data.

Claims (6)

1. A Tibetan language emotion recognition method based on CNN and LSTM is characterized by comprising the following steps:
s1, establishing a Tibetan language emotion corpus;
s2, preprocessing Tibetan voice data in the Tibetan voice emotion corpus;
the specific steps of step S2 are as follows:
s21, pre-emphasis: pre-emphasis processing is carried out on Tibetan language voice data in a Tibetan language emotion corpus training set, and the expression of the pre-emphasis processing is as follows:
g(n)=x(n)-ax(n-1)
where x(n) represents the input Tibetan speech sample, g(n) represents the pre-emphasized Tibetan speech sample, x(n-1) represents the previous input sample, and a represents the emphasis coefficient, with a = 0.96;
s22, framing: carrying out framing operation on the pre-emphasized Tibetan speech data according to a preset frame length and a preset frame to obtain a plurality of sections of framed Tibetan speech signals;
s23, windowing: multiplying the window function by each frame Tibetan speech signal to obtain a frame windowed Tibetan speech signal and finishing the preprocessing of the Tibetan speech data;
s3, performing feature extraction on the Tibetan speech data in the preprocessed Tibetan speech emotion corpus to obtain a Tibetan speech spectrum;
the specific steps of step S3 are as follows:
s31, carrying out short-time Fourier transform on the framed windowed Tibetan speech signal in the step S23, and stacking according to each frame to obtain a Tibetan spectrogram;
s32, processing the Tibetan language spectrogram by using a Mel scale filter to obtain a Tibetan language voice spectrum related to the auditory sense of a human ear;
s4, training the Tibetan speech emotion recognition network according to the Tibetan speech spectrum to obtain a trained Tibetan speech emotion recognition network;
the specific steps of step S4 are as follows:
s41, forward propagation training: inputting the Tibetan speech spectrum into a Tibetan speech emotion recognition network consisting of a CNN network and an LSTM for training to obtain a predicted emotion feature type y (t);
the specific steps of step S41 are as follows:
s411, inputting the Tibetan voice spectrum into a three-channel CNN network for training to obtain three-dimensional characteristics of the Tibetan voice spectrum;
The first channel consists of sequentially arranged convolution layers with 20 convolution kernels of size 3×3 and a stride of 1; the second channel consists of sequentially arranged convolution layers with 40 convolution kernels of size 5×5 and a stride of 2; the third channel consists of sequentially arranged convolution layers with 60 convolution kernels of size 7×7 and a stride of 2;
s412, randomly arranging and combining the three-dimensional features to obtain an emotional feature vector x (t);
S413, normalizing the emotional feature vector, and inputting the normalized emotional feature vector into the LSTM network for training to obtain the memory-bearing long-time-domain emotional feature y_i(t);
The long-time-domain emotional feature y_i(t) is expressed as follows:
C~(t) = f(W·x(t) + V·y_i(t-1))
C(t) = g_forget(t)·C(t-1) + g_in(t)·C~(t)
y_i(t) = g_out(t)·f(C(t))
g_out(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t))
g_forget(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
g_in(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
where x(t) represents the emotional feature vector, y_i(t-1) represents the long-time-domain emotional feature at the previous moment, C~(t) represents the candidate memory, C(t) represents the memory state, W represents the first weight, V represents the second weight, U represents the third weight, f(·) represents the activation function tanh, g_in(t) is the input gate of the LSTM, g_forget(t) is the forget gate of the LSTM, g_out(t) is the output gate of the LSTM, sigmoid(·) is the sigmoid activation function, and t represents time;
s414, will be instituteLong time domain emotional characteristics y i (t) inputting the full connection layer for processing to obtain all long-time domain emotional characteristics, and outputting and predicting the emotional characteristic category y (t) through a Softmax classification layer;
the expression of the output prediction emotion characteristic category y (t) of the Softmax classification layer is as follows:
Figure FDA0004077245340000031
wherein, y i (t) represents a long-term emotional characteristic, e represents a constant e,
Figure FDA0004077245340000032
representing the sum of all long-term emotion feature types, i represents the number of the long-term emotion feature types, wherein i =1,2, …, k represents the total number of the long-term emotion types;
S42, back propagation training: taking the set predicted emotion category Y'(t) as input to reversely train the Tibetan speech emotion recognition network composed of the CNN and the LSTM, obtaining the category Y(t) closest to the real emotion features, and adjusting the network parameters by a gradient descent algorithm so that the error between Y(t) and the predicted emotion feature category y(t) is smaller than a preset value, obtaining the trained Tibetan speech emotion recognition network;
and S5, preprocessing and feature extracting Tibetan speech data to be recognized, and inputting the preprocessed and feature extracted Tibetan speech data into the trained Tibetan speech emotion recognition network to obtain a Tibetan speech emotion classification result corresponding to the Tibetan speech data.
2. The CNN and LSTM-based Tibetan language emotion recognition method as claimed in claim 1, wherein the specific steps of step S1 are as follows:
s11, recording Tibetan voice data;
s12, emotion marking is carried out on the Tibetan language voice data to obtain an initial Tibetan language emotion corpus;
and S13, dividing the initial Tibetan language emotion corpus into a training set and a test set, and completing the establishment of the Tibetan language emotion corpus.
3. The CNN and LSTM based Tibetan language emotion recognition method of claim 1, wherein the window function in step S23 adopts a Hamming window, and the Hamming window w(n) is expressed as follows:
w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1
where a0 is a constant equal to 0.53836, N represents the length of the Hamming window, and n represents the sample index of the input signal within the window.
4. The CNN and LSTM based Tibetan language emotion recognition method of claim 1, wherein the Mel-scale filter H_m(k) in step S32 is expressed as follows:
H_m(k) = 0, for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0, for k > f(m+1)
where m denotes the m-th filter, k denotes the frequency index, and f(·) denotes the center frequency of each triangular filter in the Mel-scale filter bank H_m(k).
5. The CNN and LSTM based Tibetan speech emotion recognition method of claim 1, wherein the error function e(t) in step S42 is expressed as follows:
e(t) = (1/(2n)) · Σ (Y(t) - y(t))²
where n represents the total number of samples, Y(t) represents the category closest to the real emotion features, y(t) represents the predicted emotion feature category, and t represents time.
6. The CNN and LSTM-based Tibetan language emotion recognition method as claimed in claim 5, wherein the specific steps of step S5 are as follows:
s51, preprocessing Tibetan speech data to be recognized and extracting features, inputting the preprocessed Tibetan speech data into a trained Tibetan speech emotion recognition network, and outputting emotion feature categories with probabilities through a Softmax classification layer;
and S52, selecting the emotional feature type with the maximum probability as a Tibetan language voice emotion classification result corresponding to the Tibetan language voice data.
CN202110995181.6A 2021-08-27 2021-08-27 Tibetan language emotion recognition method based on CNN and LSTM Active CN113808620B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110995181.6A CN113808620B (en) 2021-08-27 2021-08-27 Tibetan language emotion recognition method based on CNN and LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110995181.6A CN113808620B (en) 2021-08-27 2021-08-27 Tibetan language emotion recognition method based on CNN and LSTM

Publications (2)

Publication Number Publication Date
CN113808620A CN113808620A (en) 2021-12-17
CN113808620B true CN113808620B (en) 2023-03-21

Family

ID=78942011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110995181.6A Active CN113808620B (en) 2021-08-27 2021-08-27 Tibetan language emotion recognition method based on CNN and LSTM

Country Status (1)

Country Link
CN (1) CN113808620B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596960B (en) * 2022-03-01 2023-08-08 中山大学 Alzheimer's disease risk prediction method based on neural network and natural dialogue

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597539A (en) * 2018-02-09 2018-09-28 桂林电子科技大学 Speech-emotion recognition method based on parameter migration and sound spectrograph
CN110164476A (en) * 2019-05-24 2019-08-23 广西师范大学 A kind of speech-emotion recognition method of the BLSTM based on multi output Fusion Features
CN110415728A (en) * 2019-07-29 2019-11-05 内蒙古工业大学 A kind of method and apparatus identifying emotional speech
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic
WO2020196978A1 (en) * 2019-03-25 2020-10-01 한국과학기술원 Electronic device for multi-scale voice emotion recognition and operation method of same
CN111785301A (en) * 2020-06-28 2020-10-16 重庆邮电大学 Residual error network-based 3DACRNN speech emotion recognition method and storage medium
CN112562725A (en) * 2020-12-09 2021-03-26 山西财经大学 Mixed voice emotion classification method based on spectrogram and capsule network
CN112581979A (en) * 2020-12-10 2021-03-30 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN112712824A (en) * 2021-03-26 2021-04-27 之江实验室 Crowd information fused speech emotion recognition method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717856B (en) * 2018-06-16 2022-03-08 台州学院 Speech emotion recognition method based on multi-scale deep convolution cyclic neural network

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597539A (en) * 2018-02-09 2018-09-28 桂林电子科技大学 Speech-emotion recognition method based on parameter migration and sound spectrograph
WO2020196978A1 (en) * 2019-03-25 2020-10-01 한국과학기술원 Electronic device for multi-scale voice emotion recognition and operation method of same
CN110164476A (en) * 2019-05-24 2019-08-23 广西师范大学 A kind of speech-emotion recognition method of the BLSTM based on multi output Fusion Features
CN110415728A (en) * 2019-07-29 2019-11-05 内蒙古工业大学 A kind of method and apparatus identifying emotional speech
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic
CN111785301A (en) * 2020-06-28 2020-10-16 重庆邮电大学 Residual error network-based 3DACRNN speech emotion recognition method and storage medium
CN112562725A (en) * 2020-12-09 2021-03-26 山西财经大学 Mixed voice emotion classification method based on spectrogram and capsule network
CN112581979A (en) * 2020-12-10 2021-03-30 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN112712824A (en) * 2021-03-26 2021-04-27 之江实验室 Crowd information fused speech emotion recognition method and system

Also Published As

Publication number Publication date
CN113808620A (en) 2021-12-17

Similar Documents

Publication Publication Date Title
Deshwal et al. A language identification system using hybrid features and back-propagation neural network
CN112466326B (en) Voice emotion feature extraction method based on transducer model encoder
CN112259106A (en) Voiceprint recognition method and device, storage medium and computer equipment
CN112784730B (en) Multi-modal emotion recognition method based on time domain convolutional network
CN110853680A (en) double-BiLSTM structure with multi-input multi-fusion strategy for speech emotion recognition
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
Sinha et al. Acoustic-phonetic feature based dialect identification in Hindi Speech
CN113808620B (en) Tibetan language emotion recognition method based on CNN and LSTM
Sen et al. A convolutional neural network based approach to recognize bangla spoken digits from speech signal
Nivetha A survey on speech feature extraction and classification techniques
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Xue et al. Cross-modal information fusion for voice spoofing detection
KS et al. Comparative performance analysis for speech digit recognition based on MFCC and vector quantization
Sun et al. A novel convolutional neural network voiceprint recognition method based on improved pooling method and dropout idea
CN113539243A (en) Training method of voice classification model, voice classification method and related device
CN113571095A (en) Speech emotion recognition method and system based on nested deep neural network
Dwijayanti et al. Speaker identification using a convolutional neural network
CN115064175A (en) Speaker recognition method
Hanifa et al. Comparative Analysis on Different Cepstral Features for Speaker Identification Recognition
CN112951270B (en) Voice fluency detection method and device and electronic equipment
CN113488069A (en) Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network
Abdiche et al. Text-independent speaker identification using mel-frequency energy coefficients and convolutional neural networks
Singh A text independent speaker identification system using ANN, RNN, and CNN classification technique

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant