CN113808620B - Tibetan language emotion recognition method based on CNN and LSTM - Google Patents
- Publication number: CN113808620B (application CN202110995181.6A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G10L25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state
- G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
- G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
Abstract
The invention discloses a Tibetan speech emotion recognition method based on CNN and LSTM, belonging to the technical field of speech emotion recognition and comprising the following steps: establishing a Tibetan speech emotion corpus; preprocessing the Tibetan speech data in the corpus; extracting features from the preprocessed Tibetan speech data to obtain a Tibetan speech spectrum; training a Tibetan speech emotion recognition network on the Tibetan speech spectrum to obtain a trained network; and preprocessing and feature-extracting the Tibetan speech data to be recognized, then inputting them into the trained Tibetan speech emotion recognition network to obtain the corresponding Tibetan speech emotion classification result. The Tibetan speech emotion recognition method based on CNN and LSTM solves the problem of Tibetan speech emotion recognition.
Description
Technical Field
The invention belongs to the technical field of speech emotion recognition, and particularly relates to a Tibetan speech emotion recognition method based on CNN and LSTM.
Background
Tibetan speech emotion recognition is a special case of speech emotion recognition: emotional Tibetan speech is taken as input so that, once a mapping between speech and emotion has been established, a computer can recognize the emotion carried by Tibetan speech, enabling human-computer interaction.
In recent years, with the rapid rise of deep learning, researchers have applied it to many fields, and speech recognition is one of its active application areas. In speech emotion recognition, improving the robustness and accuracy of recognition has always been an important and core problem requiring exploration. Many researchers have made substantial efforts, and research results continue to appear, such as multi-modal emotion recognition methods, emotion recognition methods fusing multiple classifiers, and emotion recognition systems based on deep neural networks.
However, most speech data used in speech emotion recognition in recent years come from Chinese, English, and similar corpora; no speech emotion recognition method yet exists for a Tibetan corpus. Moreover, the robustness and accuracy of current speech emotion recognition still need improvement. Addressing these two points, this scheme proposes a Tibetan speech emotion recognition method that improves recognition robustness and accuracy, namely a Tibetan speech emotion recognition method based on CNN and LSTM.
Disclosure of Invention
Aiming at the defects in the prior art, the Tibetan speech emotion recognition method based on CNN and LSTM provided by the invention solves the problem of Tibetan speech emotion recognition.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that:
the scheme provides a Tibetan language speech emotion recognition method based on CNN and LSTM, which comprises the following steps:
s1, establishing a Tibetan language speech emotion corpus;
s2, preprocessing Tibetan voice data in the Tibetan voice emotion corpus;
s3, performing feature extraction on the Tibetan speech data in the preprocessed Tibetan speech emotion corpus to obtain a Tibetan speech spectrum;
s4, training the Tibetan speech emotion recognition network according to the Tibetan speech spectrum to obtain a trained Tibetan speech emotion recognition network;
and S5, preprocessing and feature extracting Tibetan speech data to be recognized, and inputting the preprocessed and feature extracted Tibetan speech data into the trained Tibetan speech emotion recognition network to obtain a Tibetan speech emotion classification result corresponding to the Tibetan speech data.
The invention has the beneficial effects that: the invention provides a Tibetan speech emotion recognition method, filling the current gap in Tibetan speech emotion recognition. The Tibetan speech emotion recognition network built by combining CNN and LSTM extracts the abstract emotional features in the speech signal more fully, making emotion classification more accurate. The method preprocesses the speech signal with a Hamming window, uses a Mel-scale filter bank to obtain a Tibetan speech spectrum related to human auditory perception, and inputs this spectrum into a Tibetan speech emotion recognition network composed of a CNN network and an LSTM for forward and backward training, obtaining the trained Tibetan speech emotion recognition network.
Further, the specific steps of step S1 are as follows:
s11, recording Tibetan voice data;
s12, emotion marking is carried out on the Tibetan speech data to obtain an initial Tibetan speech emotion corpus;
and S13, dividing the initial Tibetan language emotion corpus into a training set and a test set, and completing the establishment of the Tibetan language emotion corpus.
The beneficial effect of adopting the further scheme is as follows: corresponding Tibetan speech emotion data are recorded by a professional, a Tibetan speech emotion corpus is established, speech data are provided for accurately recognizing Tibetan speech emotion, and the Tibetan speech emotion corpus is divided into a training set and a test set for training and testing of a Tibetan speech emotion recognition network.
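The corpus split of step S13 can be sketched in Python; the file names, the 80/20 split ratio, and the four-class emotion labelling below are illustrative assumptions, not the patent's actual settings:

```python
import random

def split_corpus(samples, train_ratio=0.8, seed=42):
    """Shuffle labelled (utterance, emotion) pairs and split them into
    a training set and a test set (step S13)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical corpus: 100 recorded utterances with 4 emotion labels.
corpus = [(f"utt_{i:03d}.wav", i % 4) for i in range(100)]
train, test = split_corpus(corpus)
print(len(train), len(test))  # 80 20
```

Fixing the shuffle seed keeps the split reproducible across training runs.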
Further, the specific steps of step S2 are as follows:
s21, pre-emphasis: pre-emphasis processing is carried out on Tibetan language voice data in a Tibetan language emotion corpus training set, and the expression of the pre-emphasis processing is as follows:
g(n)=x(n)-ax(n-1)
wherein x(n) represents the input Tibetan speech sample, g(n) represents the pre-emphasized Tibetan speech sample, x(n-1) represents the previous input sample, and a represents the emphasis coefficient, here a = 0.96;
s22, framing: the pre-emphasized Tibetan speech data are split into frames according to a preset frame length and frame shift, obtaining multiple framed Tibetan speech signal segments;
s23, windowing: and multiplying the window function by each frame Tibetan speech signal to obtain a frame windowed Tibetan speech signal, and finishing the preprocessing of the Tibetan speech data.
The beneficial effect of adopting the above further scheme is that: pre-emphasis boosts the energy of the high-frequency components of the speech data. The Tibetan speech data in the training set are then framed and windowed, with successive frames overlapping the intercepted segments before and after them. The Hamming window emphasises the data in the middle of each frame, so information at the two ends would be lost; but because the window advances by only 1/3 or 1/2 of its length at a time, the data attenuated in the previous one or two frames are captured again, ensuring the accuracy and completeness of the Tibetan speech data.
Further, the window function in step S23 adopts a Hamming window, and the expression of the Hamming window w(n) is as follows:

w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1

wherein a0 represents the constant 0.53836, N represents the length of the Hamming window, and n indexes the samples within the window.
The beneficial effect of adopting the above further scheme is that: the amplitude-frequency characteristic of the Hamming window is its large side-lobe attenuation (the attenuation from the main-lobe peak to the first side lobe reaches about 40 dB), which reduces spectral leakage.
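The preprocessing of steps S21 to S23 can be sketched in Python with NumPy. The frame length, frame shift, and signal length below are illustrative assumptions (the patent does not fix them); a = 0.96 and a0 = 0.53836 follow the text:

```python
import numpy as np

def preprocess(x, a=0.96, frame_len=400, frame_shift=160):
    """Pre-emphasis g(n) = x(n) - a*x(n-1), then framing and Hamming windowing."""
    g = np.append(x[0], x[1:] - a * x[:-1])            # S21: pre-emphasis
    n_frames = 1 + (len(g) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    frames = g[idx]                                    # S22: (n_frames, frame_len)
    n = np.arange(frame_len)
    w = 0.53836 - 0.46164 * np.cos(2 * np.pi * n / (frame_len - 1))
    return frames * w                                  # S23: windowed frames

x = np.random.randn(16000)  # stand-in for 1 s of speech at 16 kHz
frames = preprocess(x)
print(frames.shape)  # (98, 400)
```

With a 400-sample window and 160-sample shift, consecutive frames overlap by more than half the window, matching the overlap argument made above.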
Further, the specific steps of step S3 are as follows:
s31, performing short-time Fourier transform on the framing windowed Tibetan speech signal in the step S23, and stacking according to each frame to obtain a Tibetan spectrogram;
and S32, processing the Tibetan spectrogram by using a Mel scale filter to obtain the Tibetan voice spectrum related to the auditory sense of the human ear.
The beneficial effect of adopting the above further scheme is that: the framed and windowed Tibetan speech signals undergo a short-time Fourier transform and are stacked frame by frame to obtain the Tibetan spectrogram; the unit of the Mel frequency scale of the Mel-scale filter is the mel, which is defined to describe pitch and reflects the nonlinear relationship between frequency and pitch more vividly, and the Tibetan speech spectrum here is a Tibetan spectrogram with Mel features.
Further, the Mel-scale filter H_m(k) in step S32 has the following expression:

H_m(k) = 0, k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), f(m) < k ≤ f(m+1)
H_m(k) = 0, k > f(m+1)

wherein m denotes the m-th filter, k denotes the frequency-bin index, and f(·) denotes the centre frequency of each triangular filter in the Mel-scale filter bank H_m(k).
The beneficial effect of adopting the above further scheme is that: the Mel-scale filter H_m(k) is a triangular filter bank that maps the linear spectrum into a Mel-frequency nonlinear spectrum based on auditory perception, which can then be converted to a cepstrum.
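The triangular Mel filter bank can be sketched as follows; the filter count, FFT size, and sampling rate are illustrative assumptions, while the piecewise-linear shape follows the H_m(k) definition above:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    """Triangular filters H_m(k) whose centre frequencies f(m) are
    equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        for k in range(left, centre):                      # rising slope
            H[m - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):                     # falling slope
            H[m - 1, k] = (right - k) / (right - centre)
    return H

H = mel_filterbank()
print(H.shape)  # (40, 257)
```

Multiplying a power spectrogram of shape (frames, 257) by H.T yields the Mel-feature Tibetan speech spectrum described in S32.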
Further, the specific steps of step S4 are as follows:
s41, forward propagation training: inputting the Tibetan speech spectrum into the Tibetan speech emotion recognition network formed by the CNN network and the LSTM for training to obtain the predicted emotion feature category y(t);
s42, back propagation training: the Tibetan speech emotion recognition network formed by the CNN network and the LSTM is trained in reverse with the set predicted emotion category Y'(t) as input to obtain the category Y(t) closest to the real emotional features, and the network parameters are adjusted by the gradient descent algorithm until the error between Y(t) and the predicted emotion feature category y(t) is smaller than a preset value, obtaining the trained Tibetan speech emotion recognition network.
The beneficial effect of adopting the above further scheme is that: forward-propagation and back-propagation training are performed on the Tibetan speech emotion recognition network composed of the CNN network and the LSTM, and the gradient descent algorithm is used to reduce the error between the category Y(t) closest to the real emotional features and the predicted emotion feature category y(t), obtaining the trained Tibetan speech emotion recognition network and realizing accurate emotion recognition of Tibetan speech signals.
Further, the specific steps of step S41 are as follows:
s411, inputting the Tibetan voice spectrum into a three-channel CNN network for training to obtain three-dimensional characteristics of the Tibetan voice spectrum;
wherein the first channel consists of sequentially arranged convolutional layers with 20 convolution kernels of size 3 × 3 and stride 1; the second channel consists of sequentially arranged convolutional layers with 40 convolution kernels of size 5 × 5 and stride 2; the third channel consists of sequentially arranged convolutional layers with 60 convolution kernels of size 7 × 7 and stride 2;
s412, randomly arranging and combining the three-dimensional features to obtain an emotional feature vector x (t);
s413, normalizing the emotional feature vector and inputting it into the LSTM network for training to obtain the memoried long-term emotional feature y_i(t);
The expression of the long-term emotional feature y_i(t) is as follows:

g_out(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t))
g_forget(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
g_in(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
C(t) = g_forget(t)·C(t-1) + g_in(t)·f(W·x(t) + V·y_i(t-1))
y_i(t) = g_out(t)·f(C(t))

wherein x(t) represents the emotional feature vector, y_i(t-1) represents the long-term emotional feature at the previous moment, C(t) represents the memory state, W represents the first weight, V represents the second weight, U represents the third weight, f(·) represents the activation function tanh, g_in(t) represents the input gate of the LSTM, g_forget(t) represents the forget gate of the LSTM, g_out(t) represents the output gate of the LSTM, sigmoid(·) represents the activation function, and t represents time.
S414, the long-term emotional features y_i(t) are input into the fully connected layer for processing to obtain all long-term emotional features, and the predicted emotion feature category y(t) is output through the Softmax classification layer;
The expression of the predicted emotion feature category y(t) output by the Softmax classification layer is as follows:

y(t) = e^(y_i(t)) / Σ_(j=1..k) e^(y_j(t))

wherein y_i(t) represents a long-term emotional feature, e represents the constant e, Σ_(j=1..k) e^(y_j(t)) represents the sum over all long-term emotion feature categories, i represents the index of the long-term emotion feature category, i = 1, 2, …, k, and k represents the total number of long-term emotion categories.
The beneficial effect of adopting the further scheme is as follows: the three-dimensional features of the Tibetan speech spectrum are extracted by the CNN network, the memoried long-term emotional features y_i(t) are trained by the LSTM network, all long-term emotional features are obtained by the fully connected layer, and the predicted emotion feature category y(t) is output through the Softmax layer.
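One step of the LSTM cell described above can be sketched in NumPy. Note that, as in the patent's equations, all three gates share the same weights W, V, U (real LSTM implementations use separate weights per gate, so this is a simplification faithful to the text); the hidden size and random initialisation are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, c_prev, W, V, U):
    """One LSTM step following the gate equations above: g_in and g_forget
    peek at C(t-1), while g_out peeks at the freshly updated C(t).
    With shared W, V, U the input and forget gates coincide numerically."""
    g_in = sigmoid(W @ x_t + V @ y_prev + U @ c_prev)
    g_forget = sigmoid(W @ x_t + V @ y_prev + U @ c_prev)
    c_t = g_forget * c_prev + g_in * np.tanh(W @ x_t + V @ y_prev)
    g_out = sigmoid(W @ x_t + V @ y_prev + U @ c_t)
    y_t = g_out * np.tanh(c_t)          # long-term emotional feature y_i(t)
    return y_t, c_t

d = 8  # hidden size (illustrative)
rng = np.random.default_rng(0)
W, V, U = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
y, c = np.zeros(d), np.zeros(d)
y, c = lstm_step(rng.standard_normal(d), y, c, W, V, U)
print(y.shape)  # (8,)
```

Iterating lstm_step over the sequence of emotional feature vectors x(t) produces the memoried features that the fully connected and Softmax layers then classify.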
Further, the expression of the error function e(t) in step S42 is as follows:

e(t) = (1/2n)·Σ (Y(t) - y(t))²

wherein n represents the total number of samples, Y(t) represents the category closest to the real emotional features, y(t) represents the predicted emotion feature category, and t represents time.
The beneficial effect of adopting the further scheme is as follows: the error function e(t) adopts a quadratic cost function, which is suitable when the output neurons are linear.
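The quadratic cost and the plain gradient-descent update it drives can be sketched as follows; the learning rate and the example vectors are illustrative assumptions (the patent specifies neither):

```python
import numpy as np

def quadratic_cost(Y, y):
    """e(t) = 1/(2n) * sum (Y(t) - y(t))^2, the quadratic cost of step S42."""
    n = len(Y)
    return np.sum((Y - y) ** 2) / (2 * n)

def gradient_step(theta, grad, lr=0.01):
    """Plain gradient-descent parameter update: theta <- theta - lr * grad."""
    return theta - lr * grad

Y = np.array([1.0, 0.0, 0.0])   # one-hot target category
y = np.array([0.8, 0.1, 0.1])   # predicted category probabilities
print(round(quadratic_cost(Y, y), 3))  # 0.01
```

Training repeats the update until the cost drops below the preset threshold mentioned in S42.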
Further, the specific steps of step S5 are as follows:
s51, preprocessing Tibetan speech data to be recognized and extracting features, inputting the preprocessed Tibetan speech data into a trained Tibetan speech emotion recognition network, and outputting a predicted emotion feature category y (t) with probability through a Softmax classification layer;
s52, selecting the prediction emotion feature type y (t) with the maximum probability as a Tibetan language speech emotion classification result corresponding to the Tibetan language speech data.
The beneficial effect of adopting the above further scheme is that: the Softmax classification layer outputs several Tibetan speech emotion category results with non-zero probability, and selecting the predicted emotion feature category y(t) with the maximum probability realizes accurate recognition of the Tibetan speech emotion.
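At inference time, steps S51 and S52 reduce to a softmax over the network outputs followed by an argmax. A minimal sketch, where the logit values and the emotion label set are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift by the max for numerical stability
    return e / e.sum()

def predict_emotion(logits, labels):
    """Return the emotion whose softmax probability is largest (step S52),
    plus the full probability vector output by the classification layer."""
    p = softmax(np.asarray(logits, dtype=float))
    return labels[int(np.argmax(p))], p

labels = ["neutral", "happy", "sad", "angry"]  # hypothetical label set
emotion, probs = predict_emotion([1.2, 3.4, 0.5, 0.9], labels)
print(emotion)  # happy
```

The probability vector can also be kept alongside the argmax when a confidence estimate for the recognized emotion is useful.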
Drawings
FIG. 1 is a flowchart of steps of a Tibetan language emotion recognition method based on CNN and LSTM in the embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; to those skilled in the art, various changes are apparent within the spirit and scope of the invention as defined in the appended claims, and all matter produced using the inventive concept is protected.
As shown in fig. 1, in an embodiment of the present invention, the present invention provides a CNN and LSTM based Tibetan language emotion recognition method, which includes the following steps:
s1, establishing a Tibetan language emotion corpus;
the specific steps of step S1 are as follows:
s11, recording Tibetan voice data;
s12, emotion marking is carried out on the Tibetan language voice data to obtain an initial Tibetan language emotion corpus;
s13, dividing the initial Tibetan language emotion corpus into a training set and a test set to complete the establishment of the Tibetan language emotion corpus;
corresponding Tibetan speech emotion data are recorded by professionals, a Tibetan speech emotion corpus is established, speech data are provided for accurately identifying Tibetan speech emotion, and the Tibetan speech emotion corpus is divided into a training set and a test set and used for training and testing of a Tibetan speech emotion identification network.
S2, preprocessing Tibetan voice data in the Tibetan voice emotion corpus;
the specific steps of step S2 are as follows:
s21, pre-emphasis: pre-emphasis processing is carried out on Tibetan language voice data in a Tibetan language emotion corpus training set, and the expression of the pre-emphasis processing is as follows:
g(n)=x(n)-ax(n-1)
wherein x(n) represents the input Tibetan speech sample, g(n) represents the pre-emphasized Tibetan speech sample, x(n-1) represents the previous input sample, and a represents the emphasis coefficient, here a = 0.96;
s22, framing: the pre-emphasized Tibetan speech data are split into frames according to a preset frame length and frame shift, obtaining multiple framed Tibetan speech signal segments;
s23, windowing: and multiplying the window function by each frame Tibetan speech signal to obtain a frame windowed Tibetan speech signal, and finishing the preprocessing of the Tibetan speech data.
In step S23, a Hamming window is used as the window function, and the expression of the Hamming window w(n) is as follows:

w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1

wherein a0 represents the constant 0.53836, N represents the length of the Hamming window, and n indexes the samples within the window;
the Tibetan speech data in the training set are framed and windowed, with successive frames overlapping the intercepted segments before and after them. The Hamming window emphasises the data in the middle of each frame, so information at the two ends would be lost; but because the window advances by only 1/3 or 1/2 of its length at a time, the data attenuated in the previous one or two frames are captured again, ensuring the accuracy and completeness of the Tibetan speech data. The amplitude-frequency characteristic of the Hamming window is its large side-lobe attenuation (the attenuation from the main-lobe peak to the first side lobe reaches about 40 dB), which reduces spectral leakage.
S3, performing feature extraction on the Tibetan speech data in the preprocessed Tibetan speech emotion corpus to obtain a Tibetan speech spectrum;
the specific steps of step S3 are as follows:
s31, carrying out short-time Fourier transform on the framed windowed Tibetan speech signal in the step S23, and stacking according to each frame to obtain a Tibetan spectrogram;
and S32, processing the Tibetan spectrogram with a Mel-scale filter bank to obtain the Tibetan speech spectrum related to human auditory perception, the Tibetan speech spectrum being a Tibetan spectrogram with Mel features.
The Mel-scale filter H_m(k) in step S32 has the following expression:

H_m(k) = 0, k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), f(m) < k ≤ f(m+1)
H_m(k) = 0, k > f(m+1)

wherein m denotes the m-th filter, k denotes the frequency-bin index, and f(·) denotes the centre frequency of each triangular filter in the Mel-scale filter bank H_m(k);
the framed and windowed Tibetan speech signals undergo a short-time Fourier transform and are stacked frame by frame to obtain the Tibetan spectrogram. The unit of the Mel frequency scale of the Mel-scale filter is the mel, which is defined to describe pitch and reflects the nonlinear relationship between frequency and pitch more vividly. The Mel-scale filter H_m(k) is a triangular filter bank that maps the linear spectrum into a Mel-frequency nonlinear spectrum based on auditory perception, which can then be converted to a cepstrum.
S4, training the Tibetan speech emotion recognition network according to the Tibetan speech spectrum to obtain a trained Tibetan speech emotion recognition network;
the specific steps of step S4 are as follows:
s41, forward propagation training: inputting the Tibetan speech spectrum into a Tibetan speech emotion recognition network consisting of a CNN network and an LSTM for training to obtain a predicted emotion feature type y (t);
the specific steps of step S41 are as follows:
s411, inputting the Tibetan voice spectrum into a three-channel CNN network for training to obtain three-dimensional characteristics of the Tibetan voice spectrum;
wherein the first channel consists of sequentially arranged convolutional layers with 20 convolution kernels of size 3 × 3 and stride 1; the second channel consists of sequentially arranged convolutional layers with 40 convolution kernels of size 5 × 5 and stride 2; the third channel consists of sequentially arranged convolutional layers with 60 convolution kernels of size 7 × 7 and stride 2;
s412, randomly arranging and combining the three-dimensional features to obtain an emotional feature vector x (t);
s413, normalizing the emotional feature vector and inputting it into the LSTM network for training to obtain the memoried long-term emotional feature y_i(t);
The expression of the long-term emotional feature y_i(t) is as follows:

g_out(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t))
g_forget(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
g_in(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
C(t) = g_forget(t)·C(t-1) + g_in(t)·f(W·x(t) + V·y_i(t-1))
y_i(t) = g_out(t)·f(C(t))

wherein x(t) represents the emotional feature vector, y_i(t) represents the long-term emotional feature, y_i(t-1) represents the long-term emotional feature at the previous moment, C(t) represents the memory state, W represents the first weight, V represents the second weight, U represents the third weight, f(·) represents the activation function tanh, g_in(t) represents the input gate of the LSTM, g_forget(t) represents the forget gate of the LSTM, g_out(t) represents the output gate of the LSTM, sigmoid(·) represents the activation function, and t represents time.
S414, the long-term emotional features y_i(t) are input into the fully connected layer for processing to obtain all long-term emotional features, and the predicted emotion feature category y(t) is output through the Softmax classification layer;
The expression of the predicted emotion feature category y(t) output by the Softmax classification layer is as follows:

y(t) = e^(y_i(t)) / Σ_(j=1..k) e^(y_j(t))

wherein y_i(t) represents a long-term emotional feature, e represents the constant e, Σ_(j=1..k) e^(y_j(t)) represents the sum over all long-term emotion feature categories, i represents the index of the long-term emotion feature category, i = 1, 2, …, k, and k represents the total number of long-term emotion categories;
s42, back propagation training: the Tibetan speech emotion recognition network formed by the CNN network and the LSTM is trained in reverse with the set predicted emotion category Y'(t) as input to obtain the category Y(t) closest to the real emotional features, and the network parameters are adjusted by the gradient descent algorithm until the error e(t) between Y(t) and the predicted emotion feature category y(t) is smaller than a preset value, obtaining the trained Tibetan speech emotion recognition network;
the expression of the error function e(t) in step S42 is as follows:

e(t) = (1/2n)·Σ (Y(t) - y(t))²

wherein n represents the total number of samples, Y(t) represents the category closest to the real emotional features, y(t) represents the predicted emotion feature category, and t represents time.
The three-dimensional features of the Tibetan speech spectrum are extracted by the CNN network; the memoried long-term emotional features y_i(t) are trained by the LSTM network; all long-term emotional features are obtained by the fully connected layer; and the predicted emotion feature category y(t) is output by the Softmax layer. Forward-propagation and back-propagation training are performed on the Tibetan speech emotion recognition network composed of the CNN network and the LSTM, and the gradient descent algorithm is used to reduce the error e(t) between the category Y(t) closest to the real emotional features and the predicted emotion feature category y(t), obtaining the trained Tibetan speech emotion recognition network and realizing accurate emotion recognition of Tibetan speech signals. The error function e(t) adopts a quadratic cost function, which is suitable when the output neurons are linear.
S5, preprocessing Tibetan voice data to be recognized and extracting characteristics, and then inputting the preprocessed Tibetan voice data into a trained Tibetan voice emotion recognition network to obtain a Tibetan voice emotion classification result corresponding to the Tibetan voice data;
the specific steps of step S5 are as follows:
s51, preprocessing Tibetan speech data to be recognized and extracting features, inputting the preprocessed Tibetan speech data into a trained Tibetan speech emotion recognition network, and outputting a predicted emotion feature category y (t) with probability through a Softmax classification layer, wherein the preprocessing and feature extraction method is the same as that in S2 and S3;
s52, selecting the prediction emotion feature type y (t) with the maximum probability as a Tibetan language speech emotion classification result corresponding to the Tibetan language speech data;
and the Softmax classification layer outputs a plurality of Tibetan speech emotion classification results with the probability not equal to zero, and accurate recognition of the Tibetan speech emotion is realized by selecting the prediction emotion feature classification y (t) with the maximum probability.
The invention has the beneficial effects that: the invention provides a Tibetan speech emotion recognition method, which fills the current vacancy of Tibetan speech emotion recognition; according to the scheme, the built Tibetan speech emotion recognition network combining CNN and LSTM is adopted, so that abstract emotion characteristics in speech signals can be extracted more fully, and emotion classification is more accurate; according to the scheme, a Hamming window is adopted to preprocess a voice signal, a Meier scale filter is utilized to obtain a Tibetan voice spectrum related to human ear auditory sense, the Tibetan voice spectrum is input into a Tibetan voice emotion recognition network formed by a CNN network and an LSTM to perform forward and reverse training on the Tibetan voice emotion recognition network, the trained Tibetan voice emotion recognition network is obtained, and Tibetan voice data can be accurately recognized and classified in emotion mode.
Claims (6)
1. A Tibetan language emotion recognition method based on CNN and LSTM is characterized by comprising the following steps:
s1, establishing a Tibetan language emotion corpus;
s2, preprocessing Tibetan voice data in the Tibetan voice emotion corpus;
the specific steps of step S2 are as follows:
s21, pre-emphasis: pre-emphasis processing is carried out on Tibetan language voice data in a Tibetan language emotion corpus training set, and the expression of the pre-emphasis processing is as follows:
g(n) = x(n) - a·x(n-1)
wherein x(n) represents the currently input Tibetan speech data sample, g(n) represents the pre-emphasized Tibetan speech data, x(n-1) represents the previously input sample, a represents the emphasis coefficient, and a is 0.96;
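As an illustrative sketch only (not the patent's implementation), the pre-emphasis filter g(n) = x(n) - a·x(n-1) with a = 0.96 can be written in plain Python:

```python
def pre_emphasis(x, a=0.96):
    """Apply the first-order pre-emphasis filter g(n) = x(n) - a*x(n-1).

    The first sample has no predecessor, so it is passed through unchanged.
    """
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

signal = [1.0, 1.0, 1.0, 1.0]
emphasized = pre_emphasis(signal)
# Constant (low-frequency) content is attenuated toward 1 - 0.96 = 0.04,
# which is how pre-emphasis boosts the relative weight of high frequencies.
```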
S22, framing: carrying out a framing operation on the pre-emphasized Tibetan speech data according to a preset frame length and a preset frame shift to obtain multiple frames of Tibetan speech signals;
S23, windowing: multiplying each frame of the Tibetan speech signal by a window function to obtain framed and windowed Tibetan speech signals, completing the preprocessing of the Tibetan speech data;
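The framing step can be sketched as follows; the frame length and frame shift values here are placeholders, since the claim only says they are preset:

```python
def frame_signal(x, frame_len, frame_shift):
    """Split signal x into overlapping frames of frame_len samples,
    advancing frame_shift samples per frame; a trailing partial frame is dropped."""
    return [x[i:i + frame_len]
            for i in range(0, len(x) - frame_len + 1, frame_shift)]

x = list(range(10))
frames = frame_signal(x, frame_len=4, frame_shift=2)
# With length 10, frame length 4, shift 2: frames start at 0, 2, 4, 6.
```

Each frame would then be multiplied element-wise by the window function of step S23 before the short-time Fourier transform.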
s3, performing feature extraction on the Tibetan speech data in the preprocessed Tibetan speech emotion corpus to obtain a Tibetan speech spectrum;
the specific steps of step S3 are as follows:
S31, carrying out a short-time Fourier transform on the framed and windowed Tibetan speech signals from step S23, and stacking the results frame by frame to obtain a Tibetan spectrogram;
S32, processing the Tibetan spectrogram with a Mel-scale filter bank to obtain a Tibetan speech spectrum related to human auditory perception;
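A sketch of the Mel-scale mapping behind the filter bank; the O'Shaughnessy Hz-to-Mel formula and the 8 kHz upper band edge are assumptions, as the patent does not state which variant it uses:

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to the Mel scale (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping from Mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Filter center frequencies are spaced uniformly on the Mel scale, which
# packs filters densely at low frequencies, like human hearing.
centers_mel = [i * hz_to_mel(8000.0) / 5 for i in range(6)]
centers_hz = [mel_to_hz(m) for m in centers_mel]
```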
s4, training the Tibetan speech emotion recognition network according to the Tibetan speech spectrum to obtain a trained Tibetan speech emotion recognition network;
the specific steps of step S4 are as follows:
S41, forward-propagation training: inputting the Tibetan speech spectrum into the Tibetan speech emotion recognition network composed of a CNN network and an LSTM for training to obtain the predicted emotion feature category y(t);
the specific steps of step S41 are as follows:
s411, inputting the Tibetan voice spectrum into a three-channel CNN network for training to obtain three-dimensional characteristics of the Tibetan voice spectrum;
the first channel consists of 20 sequentially arranged convolution layers with 3 × 3 convolution kernels and a stride of 1; the second channel consists of 40 sequentially arranged convolution layers with 5 × 5 convolution kernels and a stride of 2; the third channel consists of 60 sequentially arranged convolution layers with 7 × 7 convolution kernels and a stride of 2;
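To illustrate how the three kernel sizes and strides shape the feature maps, the usual valid-convolution size formula can be applied; the 64 × 64 input size and unpadded convolution are assumptions, as the claim does not specify them:

```python
def conv_out_size(size_in, kernel, stride):
    """Output side length of a valid (unpadded) convolution:
    out = floor((in - kernel) / stride) + 1."""
    return (size_in - kernel) // stride + 1

# One layer from each of the three channels on an assumed 64x64 input:
ch1 = conv_out_size(64, 3, 1)  # 3x3 kernel, stride 1
ch2 = conv_out_size(64, 5, 2)  # 5x5 kernel, stride 2
ch3 = conv_out_size(64, 7, 2)  # 7x7 kernel, stride 2
# Larger kernels and strides shrink the map faster, giving the three
# channels different receptive-field scales over the same spectrum.
```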
S412, randomly arranging and combining the three-dimensional features to obtain an emotional feature vector x(t);
S413, normalizing the emotional feature vector, and inputting the normalized emotional feature vector into the LSTM network for training to obtain the memory-retaining long-term emotional feature y_i(t);
The expression of the long-term emotional feature y_i(t) is as follows:
g_in(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
g_forget(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
C(t) = g_forget(t)·C(t-1) + g_in(t)·F(W·x(t) + V·y_i(t-1))
g_out(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t))
y_i(t) = g_out(t)·F(C(t))
wherein y_i(t-1) represents the long-term emotional feature at the previous moment, C(t) represents the memory state, W represents the first weight, V represents the second weight, U represents the third weight, F(·) represents the activation function tanh, g_in(t) represents the input gate of the LSTM, g_forget(t) represents the forget gate of the LSTM, g_out(t) represents the output gate of the LSTM, sigmoid(·) represents the activation function, and t represents time;
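A minimal scalar sketch of one LSTM step consistent with the gate equations above; the memory-state update and output follow the standard peephole LSTM (an assumption, since the claim lists only the gates), and the scalar weights W, V, U are illustrative values, not trained parameters:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, y_prev, c_prev, W=0.5, V=0.3, U=0.2):
    """One LSTM step with peephole connections: the input and forget gates
    see C(t-1), while the output gate sees the updated C(t)."""
    g_in = sigmoid(W * x_t + V * y_prev + U * c_prev)
    g_forget = sigmoid(W * x_t + V * y_prev + U * c_prev)
    # Candidate update uses F(.) = tanh; the memory state blends old and new.
    c_t = g_forget * c_prev + g_in * math.tanh(W * x_t + V * y_prev)
    g_out = sigmoid(W * x_t + V * y_prev + U * c_t)
    y_t = g_out * math.tanh(c_t)
    return y_t, c_t

y, c = lstm_step(x_t=1.0, y_prev=0.0, c_prev=0.0)
```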
S414, inputting the long-term emotional features y_i(t) into the fully connected layer for processing to obtain all long-term emotional features, and outputting the predicted emotion feature category y(t) through the Softmax classification layer;
the expression of the predicted emotion feature category y(t) output by the Softmax classification layer is as follows:
y(t) = e^(y_i(t)) / Σ_{i=1}^{k} e^(y_i(t))
wherein y_i(t) represents a long-term emotional feature, e represents the constant e, Σ_{i=1}^{k} e^(y_i(t)) represents the sum over all long-term emotion feature categories, i represents the index of the long-term emotion feature category, i = 1, 2, …, k, and k represents the total number of long-term emotion categories;
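The Softmax layer can be sketched in plain Python; subtracting the maximum score before exponentiating is a standard numerical-stability trick not stated in the claim:

```python
import math

def softmax(scores):
    """Map raw feature scores to probabilities that sum to 1."""
    m = max(scores)                        # stabilize the exponentials
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# The largest score gets the largest probability; taking its index is
# exactly the max-probability selection of step S52.
best = probs.index(max(probs))
```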
S42, back-propagation training: performing reverse training on the Tibetan speech emotion recognition network composed of the CNN network and the LSTM, taking the set predicted emotion category Y'(t) as input to obtain the closest real emotion feature category Y(t), and adjusting the network parameters with a gradient descent algorithm so that the error between the closest real emotion feature category Y(t) and the predicted emotion feature category y(t) is smaller than a preset value, obtaining the trained Tibetan speech emotion recognition network;
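The gradient-descent update used in backward training can be illustrated on a toy one-parameter loss; the learning rate and the quadratic loss are placeholders, not the patent's choices:

```python
def gradient_step(w, grad, lr=0.1):
    """One gradient-descent update: move w against the gradient."""
    return w - lr * grad

# Minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3).
w = 0.0
for _ in range(100):
    w = gradient_step(w, 2.0 * (w - 3.0))
# Repeated updates drive w toward the minimizer at w = 3, which is the
# same mechanism that drives the network error below the preset value.
```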
S5, preprocessing the Tibetan speech data to be recognized and extracting its features, and inputting the result into the trained Tibetan speech emotion recognition network to obtain the Tibetan speech emotion classification result corresponding to the Tibetan speech data.
2. The CNN and LSTM-based Tibetan language emotion recognition method as claimed in claim 1, wherein the specific steps of step S1 are as follows:
s11, recording Tibetan voice data;
s12, emotion marking is carried out on the Tibetan language voice data to obtain an initial Tibetan language emotion corpus;
and S13, dividing the initial Tibetan language emotion corpus into a training set and a test set, and completing the establishment of the Tibetan language emotion corpus.
3. The CNN and LSTM based Tibetan language emotion recognition method of claim 1, wherein the window function in step S23 is a Hamming window, and the expression of the Hamming window w(n) is as follows:
w(n) = a_0 - (1 - a_0)·cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
wherein a_0 represents a constant of 0.53836, N represents the length of the Hamming window, and n represents the sample index of the input window function signal.
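A sketch of the Hamming window of claim 3, taking a0 = 0.53836 from the claim and assuming the standard generalized-Hamming cosine form:

```python
import math

def hamming(N, a0=0.53836):
    """Generalized Hamming window: w(n) = a0 - (1 - a0) * cos(2*pi*n / (N - 1))."""
    return [a0 - (1.0 - a0) * math.cos(2.0 * math.pi * n / (N - 1))
            for n in range(N)]

w = hamming(11)
# The window peaks at 1.0 at its center and tapers symmetrically toward
# the edges, reducing spectral leakage at the frame boundaries.
```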
4. The CNN and LSTM based Tibetan language emotion recognition method of claim 1, wherein the Mel-scale filter H_m(k) in step S32 is expressed as follows:
H_m(k) = 0, for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) < k ≤ f(m+1)
H_m(k) = 0, for k > f(m+1)
wherein m denotes the m-th filter, k denotes the frequency-bin index, and f(·) denotes the center frequency of each triangular filter of the Mel-scale filter bank.
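A single triangular Mel filter can be sketched as follows, under the standard triangular formulation assumed for H_m(k); f_lo, f_c, f_hi stand for the neighboring center frequencies f(m-1), f(m), f(m+1):

```python
def triangular_filter(k, f_lo, f_c, f_hi):
    """Weight of frequency bin k under a triangular filter that rises on
    [f_lo, f_c], falls on [f_c, f_hi], and is zero outside."""
    if k < f_lo or k > f_hi:
        return 0.0
    if k <= f_c:
        return (k - f_lo) / (f_c - f_lo)
    return (f_hi - k) / (f_hi - f_c)

# The response peaks at the center frequency and is zero at both neighbors,
# so adjacent filters overlap halfway.
vals = [triangular_filter(k, 10, 20, 40) for k in (10, 15, 20, 30, 40, 50)]
```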
5. The CNN and LSTM based Tibetan language emotion recognition method of claim 1, wherein the expression of the error function e(t) in step S42 is as follows:
e(t) = (1/n) · Σ (Y(t) - y(t))²
wherein n represents the total number of samples and the sum runs over the n samples, Y(t) represents the closest real emotion feature category, y(t) represents the predicted emotion feature category, and t represents time.
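Reading e(t) as a mean squared error over the n samples (an assumption consistent with the symbols listed in claim 5), the error can be computed as:

```python
def mse(targets, predictions):
    """Mean squared error over n samples: (1/n) * sum((Y - y)^2)."""
    n = len(targets)
    return sum((Y - y) ** 2 for Y, y in zip(targets, predictions)) / n

# One-hot target vs. a Softmax-style prediction for three categories:
err = mse([1.0, 0.0, 0.0], [0.8, 0.1, 0.1])
```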
6. The CNN and LSTM-based Tibetan language emotion recognition method as claimed in claim 5, wherein the specific steps of step S5 are as follows:
s51, preprocessing Tibetan speech data to be recognized and extracting features, inputting the preprocessed Tibetan speech data into a trained Tibetan speech emotion recognition network, and outputting emotion feature categories with probabilities through a Softmax classification layer;
and S52, selecting the emotional feature type with the maximum probability as a Tibetan language voice emotion classification result corresponding to the Tibetan language voice data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110995181.6A CN113808620B (en) | 2021-08-27 | 2021-08-27 | Tibetan language emotion recognition method based on CNN and LSTM |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113808620A CN113808620A (en) | 2021-12-17 |
CN113808620B true CN113808620B (en) | 2023-03-21 |
Family
ID=78942011
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110995181.6A Active CN113808620B (en) | 2021-08-27 | 2021-08-27 | Tibetan language emotion recognition method based on CNN and LSTM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113808620B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114596960B (en) * | 2022-03-01 | 2023-08-08 | 中山大学 | Alzheimer's disease risk prediction method based on neural network and natural dialogue |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108597539A (en) * | 2018-02-09 | 2018-09-28 | 桂林电子科技大学 | Speech-emotion recognition method based on parameter migration and sound spectrograph |
CN110164476A (en) * | 2019-05-24 | 2019-08-23 | 广西师范大学 | A kind of speech-emotion recognition method of the BLSTM based on multi output Fusion Features |
CN110415728A (en) * | 2019-07-29 | 2019-11-05 | 内蒙古工业大学 | A kind of method and apparatus identifying emotional speech |
CN110534132A (en) * | 2019-09-23 | 2019-12-03 | 河南工业大学 | A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic |
WO2020196978A1 (en) * | 2019-03-25 | 2020-10-01 | 한국과학기술원 | Electronic device for multi-scale voice emotion recognition and operation method of same |
CN111785301A (en) * | 2020-06-28 | 2020-10-16 | 重庆邮电大学 | Residual error network-based 3DACRNN speech emotion recognition method and storage medium |
CN112562725A (en) * | 2020-12-09 | 2021-03-26 | 山西财经大学 | Mixed voice emotion classification method based on spectrogram and capsule network |
CN112581979A (en) * | 2020-12-10 | 2021-03-30 | 重庆邮电大学 | Speech emotion recognition method based on spectrogram |
CN112712824A (en) * | 2021-03-26 | 2021-04-27 | 之江实验室 | Crowd information fused speech emotion recognition method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108717856B (en) * | 2018-06-16 | 2022-03-08 | 台州学院 | Speech emotion recognition method based on multi-scale deep convolution cyclic neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||