CN113808620B - Tibetan language emotion recognition method based on CNN and LSTM - Google Patents

Tibetan language emotion recognition method based on CNN and LSTM

Info

Publication number
CN113808620B
CN113808620B (Application No. CN202110995181.6A)
Authority
CN
China
Prior art keywords
tibetan
speech
emotion
lstm
emotion recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110995181.6A
Other languages
Chinese (zh)
Other versions
CN113808620A (en)
Inventor
边巴旺堆
王希
王君堡
卓嘎
云登努布
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tibet University
Original Assignee
Tibet University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tibet University filed Critical Tibet University
Priority to CN202110995181.6A priority Critical patent/CN113808620B/en
Publication of CN113808620A publication Critical patent/CN113808620A/en
Application granted granted Critical
Publication of CN113808620B publication Critical patent/CN113808620B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a Tibetan speech emotion recognition method based on CNN and LSTM, belonging to the technical field of speech emotion recognition and comprising the following steps: establishing a Tibetan speech emotion corpus; preprocessing the Tibetan speech data in the Tibetan speech emotion corpus; performing feature extraction on the preprocessed Tibetan speech data to obtain a Tibetan speech spectrum; training a Tibetan speech emotion recognition network on the Tibetan speech spectrum to obtain a trained Tibetan speech emotion recognition network; and preprocessing and extracting features from the Tibetan speech data to be recognized, then inputting the result into the trained Tibetan speech emotion recognition network to obtain the Tibetan speech emotion classification result corresponding to the Tibetan speech data. The Tibetan speech emotion recognition method based on CNN and LSTM solves the problem of Tibetan speech emotion recognition.

Description

Tibetan language emotion recognition method based on CNN and LSTM
Technical Field
The invention belongs to the technical field of speech emotion recognition, and particularly relates to a Tibetan speech emotion recognition method based on CNN and LSTM.
Background
Tibetan speech emotion recognition is a special case of speech emotion recognition: emotional Tibetan speech is taken as input, and on the basis of an established mapping relationship the computer recognizes the emotion carried by the Tibetan speech, thereby enabling human-computer interaction.
In recent years, with the growing popularity of deep learning, many researchers have applied it to a wide range of fields, and speech recognition is one of its most active application areas. In speech emotion recognition, improving the robustness and accuracy of recognition has always been an important core problem to be explored and solved. Many researchers have made great efforts, and a variety of research results have emerged, such as multi-modal emotion recognition methods, emotion recognition methods based on the fusion of multiple classifiers, and emotion recognition systems based on deep neural networks.
However, most of the speech data used in speech emotion recognition in recent years come from Chinese, English, and similar corpora, and no speech emotion recognition method for a Tibetan corpus exists yet; in addition, the robustness and accuracy of current speech emotion recognition still need to be improved. Addressing these two points, this scheme proposes a Tibetan speech emotion recognition method that improves recognition robustness and accuracy, namely a Tibetan speech emotion recognition method based on CNN and LSTM.
Disclosure of Invention
To address the above deficiencies in the prior art, the invention provides a Tibetan speech emotion recognition method based on CNN and LSTM that solves the problem of Tibetan speech emotion recognition.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that:
the scheme provides a Tibetan language speech emotion recognition method based on CNN and LSTM, which comprises the following steps:
s1, establishing a Tibetan language speech emotion corpus;
s2, preprocessing Tibetan voice data in the Tibetan voice emotion corpus;
s3, performing feature extraction on the Tibetan speech data in the preprocessed Tibetan speech emotion corpus to obtain a Tibetan speech spectrum;
s4, training the Tibetan speech emotion recognition network according to the Tibetan speech spectrum to obtain a trained Tibetan speech emotion recognition network;
and S5, preprocessing and feature extracting Tibetan speech data to be recognized, and inputting the preprocessed and feature extracted Tibetan speech data into the trained Tibetan speech emotion recognition network to obtain a Tibetan speech emotion classification result corresponding to the Tibetan speech data.
The invention has the following beneficial effects: the invention provides a Tibetan speech emotion recognition method, filling the current gap in Tibetan speech emotion recognition; the Tibetan speech emotion recognition network combining CNN and LSTM can extract the abstract emotional features in the speech signal more fully, making emotion classification more accurate; the method uses a Hamming window to preprocess the speech signal, uses a Mel-scale filter to obtain the Tibetan speech spectrum related to human auditory perception, and inputs the Tibetan speech spectrum into the Tibetan speech emotion recognition network composed of a CNN and an LSTM for forward and backward training, obtaining the trained Tibetan speech emotion recognition network.
Further, the specific steps of step S1 are as follows:
s11, recording Tibetan voice data;
s12, emotion marking is carried out on the Tibetan speech data to obtain an initial Tibetan speech emotion corpus;
and S13, dividing the initial Tibetan language emotion corpus into a training set and a test set, and completing the establishment of the Tibetan language emotion corpus.
The beneficial effect of adopting the further scheme is as follows: corresponding Tibetan speech emotion data are recorded by a professional, a Tibetan speech emotion corpus is established, speech data are provided for accurately recognizing Tibetan speech emotion, and the Tibetan speech emotion corpus is divided into a training set and a test set for training and testing of a Tibetan speech emotion recognition network.
Further, the specific steps of step S2 are as follows:
s21, pre-emphasis: pre-emphasis processing is carried out on Tibetan language voice data in a Tibetan language emotion corpus training set, and the expression of the pre-emphasis processing is as follows:
g(n)=x(n)-ax(n-1)
where x(n) represents the input Tibetan speech sample, g(n) represents the pre-emphasized Tibetan speech sample, x(n-1) represents the previous input sample, and a represents the emphasis coefficient, with a = 0.96;
S22, framing: performing a framing operation on the pre-emphasized Tibetan speech data according to a preset frame length and a preset frame shift to obtain multiple frames of the Tibetan speech signal;
S23, windowing: multiplying each frame of the Tibetan speech signal by the window function to obtain the framed and windowed Tibetan speech signal, completing the preprocessing of the Tibetan speech data.
The beneficial effect of adopting the above further scheme is that: pre-emphasis boosts the energy of the high-frequency components of the Tibetan speech data; framing and windowing are then applied to the Tibetan speech data in the training set, with adjacent frames overlapping. The Hamming window mainly reflects the data in the middle of a frame, so information at both ends of the frame may be lost; however, because the window is shifted by only 1/3 or 1/2 of the window length each time, the data lost from the previous frame or two is captured again, ensuring the accuracy and integrity of the Tibetan speech data.
Further, the window function in step S23 adopts a Hamming window, and the Hamming window w(n) is expressed as follows:
w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1
where a0 is a constant equal to 0.53836, N represents the length of the Hamming window, and n represents the sample index of the input signal within the window.
The beneficial effect of adopting the above further scheme is that: the amplitude-frequency characteristic of the Hamming window is its large side-lobe attenuation; the attenuation from the main-lobe peak to the first side lobe can reach about 40 dB, which reduces spectral leakage.
Further, the specific steps of step S3 are as follows:
S31, performing a short-time Fourier transform on the framed and windowed Tibetan speech signal from step S23 and stacking the frames to obtain a Tibetan spectrogram;
and S32, processing the Tibetan spectrogram by using a Mel scale filter to obtain the Tibetan voice spectrum related to the auditory sense of the human ear.
The beneficial effect of adopting the above further scheme is that: the framed and windowed Tibetan speech signals are transformed by the short-time Fourier transform and stacked frame by frame to obtain the Tibetan spectrogram; the unit of the Mel frequency scale used by the Mel-scale filter is the mel, which is defined to describe pitch and reflects the nonlinear relationship between frequency and perceived pitch; the Tibetan speech spectrum is a Tibetan spectrogram with Mel characteristics.
Further, the Mel-scale filter H_m(k) in step S32 is expressed as follows:
H_m(k) = 0, for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0, for k > f(m+1)
where m denotes the m-th filter, k denotes the frequency index, and f(·) denotes the center frequency of each triangular filter in the Mel-scale filter bank H_m(k).
The beneficial effect of adopting the above further scheme is that: the Mel-scale filter H_m(k) is a triangular filter bank that maps the linear spectrum into a Mel-frequency nonlinear spectrum based on auditory perception, which can then be converted to a cepstrum.
Further, the specific steps of step S4 are as follows:
S41, forward propagation training: inputting the Tibetan speech spectrum into the Tibetan speech emotion recognition network composed of a CNN and an LSTM for training to obtain the predicted emotion feature category y(t);
S42, back propagation training: the set predicted emotion category Y'(t) is taken as input to reversely train the Tibetan speech emotion recognition network composed of the CNN and the LSTM, obtaining the category Y(t) closest to the real emotion features; the network parameters are then adjusted by a gradient descent algorithm so that the error between Y(t) and the predicted emotion feature category y(t) is smaller than a preset value, yielding the trained Tibetan speech emotion recognition network.
The beneficial effect of adopting the above further scheme is that: forward propagation and backward propagation training are performed on the Tibetan speech emotion recognition network composed of a CNN and an LSTM, and the gradient descent algorithm is used to reduce the error between the category Y(t) closest to the real emotion features and the predicted emotion feature category y(t), obtaining the trained Tibetan speech emotion recognition network and realizing accurate emotion recognition of Tibetan speech signals.
Further, the specific steps of step S41 are as follows:
s411, inputting the Tibetan voice spectrum into a three-channel CNN network for training to obtain three-dimensional characteristics of the Tibetan voice spectrum;
The first channel consists of sequentially arranged convolution layers with 20 convolution kernels of size 3×3 and a stride of 1; the second channel consists of sequentially arranged convolution layers with 40 convolution kernels of size 5×5 and a stride of 2; the third channel consists of sequentially arranged convolution layers with 60 convolution kernels of size 7×7 and a stride of 2;
s412, randomly arranging and combining the three-dimensional features to obtain an emotional feature vector x (t);
S413, normalizing the emotional feature vector, and inputting the normalized emotional feature vector into the LSTM network for training to obtain the memory-bearing long-time-domain emotional feature y_i(t);
The long-time-domain emotional feature y_i(t) is expressed as follows:
C~(t) = f(W·x(t) + V·y_i(t-1))
C(t) = g_forget(t)·C(t-1) + g_in(t)·C~(t)
y_i(t) = g_out(t)·f(C(t))
g_out(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t))
g_forget(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
g_in(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
where x(t) represents the emotional feature vector, y_i(t-1) represents the long-time-domain emotional feature at the previous moment, C~(t) represents the candidate memory, C(t) represents the memory state, W represents the first weight, V represents the second weight, U represents the third weight, f(·) represents the activation function tanh, g_in(t) is the input gate of the LSTM, g_forget(t) is the forget gate of the LSTM, g_out(t) is the output gate of the LSTM, sigmoid(·) is the sigmoid activation function, and t represents time.
S414, inputting the long-time-domain emotional features y_i(t) into the fully connected layer for processing to obtain all long-time-domain emotional features, and outputting the predicted emotion feature category y(t) through a Softmax classification layer;
The predicted emotion feature category y(t) output by the Softmax classification layer is expressed as follows:
y(t) = e^(y_i(t)) / Σ_{i=1}^{k} e^(y_i(t))
where y_i(t) represents a long-time-domain emotional feature, e represents the constant e, Σ_{i=1}^{k} e^(y_i(t)) represents the sum over all long-time-domain emotion feature categories, and i is the index of the long-time-domain emotion feature category, i = 1, 2, ..., k, with k the total number of long-time-domain emotion categories.
The beneficial effect of adopting the further scheme is as follows: the three-dimensional features of the Tibetan speech spectrum are extracted by the CNN, the memory-bearing long-time-domain emotional features y_i(t) are trained with the LSTM network, all long-time-domain emotional features are obtained through the fully connected layer, and the predicted emotion feature category y(t) is output through the Softmax layer.
Further, the error function e(t) in step S42 is expressed as follows:
e(t) = (1/(2n)) · Σ (Y(t) - y(t))²
where n represents the total number of samples, Y(t) represents the category closest to the real emotion features, y(t) represents the predicted emotion feature category, and t represents time.
The beneficial effect of adopting the further scheme is as follows: the error function e(t) adopts a quadratic cost function, which is suitable for the case where the output neurons are linear.
Further, the specific steps of step S5 are as follows:
S51, preprocessing the Tibetan speech data to be recognized and extracting features, inputting the result into the trained Tibetan speech emotion recognition network, and outputting predicted emotion feature categories y(t) with their probabilities through the Softmax classification layer;
S52, selecting the predicted emotion feature category y(t) with the maximum probability as the Tibetan speech emotion classification result corresponding to the Tibetan speech data.
The beneficial effect of adopting the above further scheme is that: the Softmax classification layer outputs several Tibetan speech emotion category results with non-zero probability, and accurate recognition of the Tibetan speech emotion is achieved by selecting the predicted emotion feature category y(t) with the maximum probability.
Drawings
FIG. 1 is a flowchart of steps of a Tibetan language emotion recognition method based on CNN and LSTM in the embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the present invention by those skilled in the art. It should be understood, however, that the present invention is not limited to the scope of the embodiments; to those skilled in the art, various changes are possible without departing from the spirit and scope of the invention as defined in the appended claims, and all matter produced using the inventive concept is protected.
As shown in fig. 1, in an embodiment of the present invention, the present invention provides a CNN and LSTM based Tibetan language emotion recognition method, which includes the following steps:
s1, establishing a Tibetan language emotion corpus;
the specific steps of step S1 are as follows:
s11, recording Tibetan voice data;
s12, emotion marking is carried out on the Tibetan language voice data to obtain an initial Tibetan language emotion corpus;
s13, dividing the initial Tibetan language emotion corpus into a training set and a test set to complete the establishment of the Tibetan language emotion corpus;
Corresponding Tibetan speech emotion data are recorded by professionals and a Tibetan speech emotion corpus is established, providing speech data for accurately recognizing Tibetan speech emotion; the Tibetan speech emotion corpus is divided into a training set and a test set for training and testing the Tibetan speech emotion recognition network.
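For illustration, a minimal sketch of how the corpus split of step S13 might be organized is given below (Python). The directory layout, the file-naming convention that encodes the emotion label, and the 80/20 split ratio are assumptions of this sketch; the patent only specifies that the corpus is divided into a training set and a test set.

```python
import os
import random

# Hypothetical layout (assumption): WAV files named "<emotion>_<id>.wav" inside corpus_dir.
def split_corpus(corpus_dir, train_ratio=0.8, seed=42):
    """Split the recorded, emotion-labelled Tibetan speech files into training and test sets."""
    files = [f for f in os.listdir(corpus_dir) if f.endswith(".wav")]
    labels = [f.split("_")[0] for f in files]      # emotion tag read from the file name
    pairs = list(zip(files, labels))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_ratio)
    return pairs[:cut], pairs[cut:]                # (training set, test set)
```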
S2, preprocessing Tibetan voice data in the Tibetan voice emotion corpus;
the specific steps of step S2 are as follows:
s21, pre-emphasis: pre-emphasis processing is carried out on Tibetan language voice data in a Tibetan language emotion corpus training set, and the expression of the pre-emphasis processing is as follows:
g(n)=x(n)-ax(n-1)
where x(n) represents the input Tibetan speech sample, g(n) represents the pre-emphasized Tibetan speech sample, x(n-1) represents the previous input sample, and a represents the emphasis coefficient, with a = 0.96;
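A minimal NumPy sketch of the pre-emphasis step g(n) = x(n) - a·x(n-1) with a = 0.96; the array-based interface and the handling of the first sample are assumptions of this sketch, not requirements of the patent.

```python
import numpy as np

def pre_emphasis(x, a=0.96):
    """g(n) = x(n) - a * x(n-1); the first sample is passed through unchanged."""
    x = np.asarray(x, dtype=np.float64)
    return np.append(x[0], x[1:] - a * x[:-1])
```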
S22, framing: performing a framing operation on the pre-emphasized Tibetan speech data according to a preset frame length and a preset frame shift to obtain multiple frames of the Tibetan speech signal;
S23, windowing: multiplying each frame of the Tibetan speech signal by the window function to obtain the framed and windowed Tibetan speech signal, completing the preprocessing of the Tibetan speech data.
In step S23, a Hamming window is used as the window function, and the Hamming window w(n) is expressed as follows:
w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1
where a0 is a constant equal to 0.53836, N represents the length of the Hamming window, and n represents the sample index of the input signal within the window;
Framing and windowing are applied to the Tibetan speech data in the training set, with adjacent frames overlapping. The Hamming window mainly reflects the data in the middle of a frame, so information at both ends of the frame may be lost; however, because the window is shifted by only 1/3 or 1/2 of the window length each time, the data lost from the previous frame or two is captured again, ensuring the accuracy and integrity of the Tibetan speech data. The amplitude-frequency characteristic of the Hamming window is its large side-lobe attenuation; the attenuation from the main-lobe peak to the first side lobe can reach about 40 dB, which reduces spectral leakage.
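A sketch of the framing and Hamming-windowing of steps S22 and S23, using w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)) with a0 = 0.53836. The 25 ms frame length and 10 ms frame shift are assumed values, since the patent only states that they are preset.

```python
import numpy as np

def frame_and_window(signal, sr, frame_len_s=0.025, frame_shift_s=0.010, a0=0.53836):
    """Split the pre-emphasized signal into overlapping frames and multiply each frame by w(n)."""
    frame_len = int(round(frame_len_s * sr))
    frame_shift = int(round(frame_shift_s * sr))
    n = np.arange(frame_len)
    hamming = a0 - (1.0 - a0) * np.cos(2.0 * np.pi * n / (frame_len - 1))   # Hamming window w(n)
    num_frames = 1 + (len(signal) - frame_len) // frame_shift               # signal assumed >= one frame
    frames = np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(num_frames)])
    return frames * hamming                                                 # (num_frames, frame_len)
```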
S3, performing feature extraction on the Tibetan speech data in the preprocessed Tibetan speech emotion corpus to obtain a Tibetan speech spectrum;
the specific steps of step S3 are as follows:
S31, performing a short-time Fourier transform on the framed and windowed Tibetan speech signal from step S23 and stacking the frames to obtain a Tibetan spectrogram;
S32, processing the Tibetan spectrogram with a Mel-scale filter to obtain the Tibetan speech spectrum related to human auditory perception, i.e. a Tibetan spectrogram with Mel characteristics.
The Mel-scale filter H_m(k) in step S32 is expressed as follows:
H_m(k) = 0, for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0, for k > f(m+1)
where m denotes the m-th filter, k denotes the frequency index, and f(·) denotes the center frequency of each triangular filter in the Mel-scale filter bank H_m(k);
The framed and windowed Tibetan speech signals are transformed by the short-time Fourier transform and stacked frame by frame to obtain the Tibetan spectrogram; the unit of the Mel frequency scale used by the Mel-scale filter is the mel, which is defined to describe pitch and reflects the nonlinear relationship between frequency and perceived pitch; the Mel-scale filter H_m(k) is a triangular filter bank that maps the linear spectrum into a Mel-frequency nonlinear spectrum based on auditory perception, which can then be converted to a cepstrum.
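A sketch of steps S31 and S32: the short-time Fourier transform of the windowed frames and a triangular Mel-scale filter bank H_m(k). The sampling rate, FFT size, number of filters, and the log compression are assumptions for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sr=16000, n_fft=512, n_mels=40):
    """Triangular filters H_m(k): rise from f(m-1) to f(m), fall from f(m) to f(m+1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)      # f(m) as FFT bin indices
    H = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return H

def mel_spectrum(frames, sr=16000, n_fft=512, n_mels=40):
    """Power spectrum of each windowed frame (stacked per frame), weighted by the Mel filter bank."""
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    return np.log(spec @ mel_filter_bank(sr, n_fft, n_mels).T + 1e-10)      # (num_frames, n_mels)
```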
S4, training the Tibetan speech emotion recognition network according to the Tibetan speech spectrum to obtain a trained Tibetan speech emotion recognition network;
the specific steps of step S4 are as follows:
s41, forward propagation training: inputting the Tibetan speech spectrum into a Tibetan speech emotion recognition network consisting of a CNN network and an LSTM for training to obtain a predicted emotion feature type y (t);
the specific steps of step S41 are as follows:
s411, inputting the Tibetan voice spectrum into a three-channel CNN network for training to obtain three-dimensional characteristics of the Tibetan voice spectrum;
The first channel consists of sequentially arranged convolution layers with 20 convolution kernels of size 3×3 and a stride of 1; the second channel consists of sequentially arranged convolution layers with 40 convolution kernels of size 5×5 and a stride of 2; the third channel consists of sequentially arranged convolution layers with 60 convolution kernels of size 7×7 and a stride of 2;
s412, randomly arranging and combining the three-dimensional features to obtain an emotional feature vector x (t);
S413, normalizing the emotional feature vector, and inputting the normalized emotional feature vector into the LSTM network for training to obtain the memory-bearing long-time-domain emotional feature y_i(t);
The long-time-domain emotional feature y_i(t) is expressed as follows:
C~(t) = f(W·x(t) + V·y_i(t-1))
C(t) = g_forget(t)·C(t-1) + g_in(t)·C~(t)
y_i(t) = g_out(t)·f(C(t))
g_out(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t))
g_forget(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
g_in(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
where x(t) represents the emotional feature vector, y_i(t) represents the long-time-domain emotional feature, y_i(t-1) represents the long-time-domain emotional feature at the previous moment, C~(t) represents the candidate memory, C(t) represents the memory state, W represents the first weight, V represents the second weight, U represents the third weight, f(·) represents the activation function tanh, g_in(t) is the input gate of the LSTM, g_forget(t) is the forget gate of the LSTM, g_out(t) is the output gate of the LSTM, sigmoid(·) is the sigmoid activation function, and t represents time.
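The gate equations above contain U·C(t-1) and U·C(t) terms, i.e. peephole-style connections, which most off-the-shelf LSTM layers do not implement. A minimal NumPy sketch of one time step under that reading is shown below; the use of separate weight matrices per gate (the patent writes a single W, V, U in every gate) and the vector shapes are assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, c_prev, W, V, U):
    """One step of the cell described above; W, V are per-gate matrices, U are per-gate peephole vectors."""
    g_in     = sigmoid(W["in"]  @ x_t + V["in"]  @ y_prev + U["in"]  * c_prev)   # input gate g_in(t)
    g_forget = sigmoid(W["fg"]  @ x_t + V["fg"]  @ y_prev + U["fg"]  * c_prev)   # forget gate g_forget(t)
    c_tilde  = np.tanh(W["c"]   @ x_t + V["c"]   @ y_prev)                       # candidate memory
    c_t      = g_forget * c_prev + g_in * c_tilde                                # memory state C(t)
    g_out    = sigmoid(W["out"] @ x_t + V["out"] @ y_prev + U["out"] * c_t)      # output gate g_out(t)
    y_t      = g_out * np.tanh(c_t)                                              # long-time-domain feature y_i(t)
    return y_t, c_t
```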
S414, inputting the long-time-domain emotional features y_i(t) into the fully connected layer for processing to obtain all long-time-domain emotional features, and outputting the predicted emotion feature category y(t) through a Softmax classification layer;
The predicted emotion feature category y(t) output by the Softmax classification layer is expressed as follows:
y(t) = e^(y_i(t)) / Σ_{i=1}^{k} e^(y_i(t))
where y_i(t) represents a long-time-domain emotional feature, e represents the constant e, Σ_{i=1}^{k} e^(y_i(t)) represents the sum over all long-time-domain emotion feature categories, and i is the index of the long-time-domain emotion feature category, i = 1, 2, ..., k, with k the total number of long-time-domain emotion categories;
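A PyTorch sketch of the forward-propagation network of steps S411-S414: three parallel convolutional channels (20 kernels of 3×3 with stride 1, 40 of 5×5 with stride 2, 60 of 7×7 with stride 2), concatenation of the channel features into x(t), an LSTM, a fully connected layer, and Softmax. The input layout, padding, pooling, hidden width, number of emotion classes, and the use of torch.nn.LSTM instead of the peephole cell sketched above are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class TibetanSERNet(nn.Module):
    """Three-channel CNN + LSTM + fully connected layer + Softmax (sketch of steps S411-S414)."""
    def __init__(self, hidden=128, n_classes=4):
        super().__init__()
        self.ch1 = nn.Conv2d(1, 20, kernel_size=3, stride=1, padding=1)
        self.ch2 = nn.Conv2d(1, 40, kernel_size=5, stride=2, padding=2)
        self.ch3 = nn.Conv2d(1, 60, kernel_size=7, stride=2, padding=3)
        self.pool = nn.AdaptiveAvgPool2d((1, None))            # collapse the frequency axis (assumption)
        self.lstm = nn.LSTM(input_size=20 + 40 + 60, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, spec):                                   # spec: (batch, 1, n_mels, time)
        t = spec.size(-1)
        feats = []
        for conv in (self.ch1, self.ch2, self.ch3):
            f = torch.relu(conv(spec))                         # (batch, C, freq', time')
            f = nn.functional.interpolate(f, size=(f.size(2), t))   # re-align the time axis
            feats.append(self.pool(f).squeeze(2))              # (batch, C, time)
        x = torch.cat(feats, dim=1).transpose(1, 2)            # emotion feature vectors x(t): (batch, time, 120)
        y, _ = self.lstm(x)                                    # long-time-domain features y_i(t)
        logits = self.fc(y[:, -1, :])                          # fully connected layer on the last step
        return torch.softmax(logits, dim=-1)                   # predicted emotion probabilities y(t)
```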
S42, back propagation training: the set predicted emotion category Y'(t) is taken as input to reversely train the Tibetan speech emotion recognition network composed of the CNN and the LSTM, obtaining the category Y(t) closest to the real emotion features; the network parameters are adjusted by a gradient descent algorithm so that the error e(t) between Y(t) and the predicted emotion feature category y(t) is smaller than a preset value, yielding the trained Tibetan speech emotion recognition network;
The error function e(t) in step S42 is expressed as follows:
e(t) = (1/(2n)) · Σ (Y(t) - y(t))²
where n represents the total number of samples, Y(t) represents the category closest to the real emotion features, y(t) represents the predicted emotion feature category, and t represents time.
The three-dimensional features of the Tibetan speech spectrum are extracted by the CNN, the memory-bearing long-time-domain emotional features y_i(t) are trained with the LSTM network, all long-time-domain emotional features are obtained through the fully connected layer, and the predicted emotion feature category y(t) is output through the Softmax layer. Forward propagation and backward propagation training are performed on the Tibetan speech emotion recognition network composed of a CNN and an LSTM, and the gradient descent algorithm is used to reduce the error e(t) between the category Y(t) closest to the real emotion features and the predicted emotion feature category y(t), obtaining the trained Tibetan speech emotion recognition network and realizing accurate emotion recognition of Tibetan speech signals. The error function e(t) adopts a quadratic cost function, which is suitable for the case where the output neurons are linear.
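A sketch of the training of step S4 with the quadratic cost e(t) = (1/(2n))·Σ(Y(t) - y(t))² and a gradient-descent update; the optimizer settings, batching, epoch count, and one-hot encoding of the target category Y(t) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

def train(model, loader, n_classes=4, epochs=30, lr=1e-3):
    """Forward propagation, quadratic cost between Y(t) and y(t), gradient-descent parameter update."""
    criterion = nn.MSELoss()                                   # quadratic cost function
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)     # gradient descent
    for _ in range(epochs):
        for spec, label in loader:                             # spec: Mel spectra, label: emotion index
            target = nn.functional.one_hot(label, n_classes).float()   # category Y(t) closest to the real emotion
            pred = model(spec)                                 # predicted category y(t)
            loss = 0.5 * criterion(pred, target)               # e(t) = (1/(2n)) * sum (Y - y)^2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```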
S5, preprocessing Tibetan voice data to be recognized and extracting characteristics, and then inputting the preprocessed Tibetan voice data into a trained Tibetan voice emotion recognition network to obtain a Tibetan voice emotion classification result corresponding to the Tibetan voice data;
the specific steps of step S5 are as follows:
S51, preprocessing the Tibetan speech data to be recognized and extracting features in the same way as in steps S2 and S3, inputting the result into the trained Tibetan speech emotion recognition network, and outputting predicted emotion feature categories y(t) with their probabilities through the Softmax classification layer;
S52, selecting the predicted emotion feature category y(t) with the maximum probability as the Tibetan speech emotion classification result corresponding to the Tibetan speech data;
The Softmax classification layer outputs several Tibetan speech emotion category results with non-zero probability, and accurate recognition of the Tibetan speech emotion is achieved by selecting the predicted emotion feature category y(t) with the maximum probability.
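A sketch of the recognition step S5 using the preprocessing and feature-extraction sketches above: the unlabeled Tibetan speech is pre-emphasized, framed and windowed, converted to a Mel spectrum, passed through the trained network, and the category with the largest Softmax probability is returned. The emotion label list is a hypothetical placeholder, not taken from the patent.

```python
import torch

EMOTIONS = ["neutral", "happy", "angry", "sad"]        # hypothetical label set (assumption)

def recognize(model, signal, sr=16000):
    """Preprocess, extract the Mel spectrum, and pick the maximum-probability emotion category."""
    frames = frame_and_window(pre_emphasis(signal), sr)            # step S2 (sketches above)
    spec = mel_spectrum(frames, sr)                                # step S3
    x = torch.tensor(spec.T, dtype=torch.float32)[None, None]      # (1, 1, n_mels, time)
    with torch.no_grad():
        probs = model(x)                                           # Softmax probabilities y(t)
    return EMOTIONS[int(probs.argmax(dim=-1))]                     # category with the largest probability
```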
The invention has the following beneficial effects: the invention provides a Tibetan speech emotion recognition method, filling the current gap in Tibetan speech emotion recognition; the Tibetan speech emotion recognition network combining CNN and LSTM can extract the abstract emotional features in the speech signal more fully, making emotion classification more accurate; the scheme uses a Hamming window to preprocess the speech signal, uses a Mel-scale filter to obtain the Tibetan speech spectrum related to human auditory perception, and inputs the Tibetan speech spectrum into the Tibetan speech emotion recognition network composed of a CNN and an LSTM for forward and backward training, obtaining a trained network that can accurately recognize and classify the emotion of Tibetan speech data.

Claims (6)

1. A Tibetan language emotion recognition method based on CNN and LSTM is characterized by comprising the following steps:
s1, establishing a Tibetan language emotion corpus;
s2, preprocessing Tibetan voice data in the Tibetan voice emotion corpus;
the specific steps of step S2 are as follows:
s21, pre-emphasis: pre-emphasis processing is carried out on Tibetan language voice data in a Tibetan language emotion corpus training set, and the expression of the pre-emphasis processing is as follows:
g(n)=x(n)-ax(n-1)
where x(n) represents the input Tibetan speech sample, g(n) represents the pre-emphasized Tibetan speech sample, x(n-1) represents the previous input sample, and a represents the emphasis coefficient, with a = 0.96;
s22, framing: carrying out framing operation on the pre-emphasized Tibetan speech data according to a preset frame length and a preset frame to obtain a plurality of sections of framed Tibetan speech signals;
s23, windowing: multiplying the window function by each frame Tibetan speech signal to obtain a frame windowed Tibetan speech signal and finishing the preprocessing of the Tibetan speech data;
s3, performing feature extraction on the Tibetan speech data in the preprocessed Tibetan speech emotion corpus to obtain a Tibetan speech spectrum;
the specific steps of step S3 are as follows:
s31, carrying out short-time Fourier transform on the framed windowed Tibetan speech signal in the step S23, and stacking according to each frame to obtain a Tibetan spectrogram;
s32, processing the Tibetan language spectrogram by using a Mel scale filter to obtain a Tibetan language voice spectrum related to the auditory sense of a human ear;
s4, training the Tibetan speech emotion recognition network according to the Tibetan speech spectrum to obtain a trained Tibetan speech emotion recognition network;
the specific steps of step S4 are as follows:
s41, forward propagation training: inputting the Tibetan speech spectrum into a Tibetan speech emotion recognition network consisting of a CNN network and an LSTM for training to obtain a predicted emotion feature type y (t);
the specific steps of step S41 are as follows:
s411, inputting the Tibetan voice spectrum into a three-channel CNN network for training to obtain three-dimensional characteristics of the Tibetan voice spectrum;
The first channel consists of sequentially arranged convolution layers with 20 convolution kernels of size 3×3 and a stride of 1; the second channel consists of sequentially arranged convolution layers with 40 convolution kernels of size 5×5 and a stride of 2; the third channel consists of sequentially arranged convolution layers with 60 convolution kernels of size 7×7 and a stride of 2;
s412, randomly arranging and combining the three-dimensional features to obtain an emotional feature vector x (t);
S413, normalizing the emotional feature vector, and inputting the normalized emotional feature vector into the LSTM network for training to obtain the memory-bearing long-time-domain emotional feature y_i(t);
The long-time-domain emotional feature y_i(t) is expressed as follows:
C~(t) = f(W·x(t) + V·y_i(t-1))
C(t) = g_forget(t)·C(t-1) + g_in(t)·C~(t)
y_i(t) = g_out(t)·f(C(t))
g_out(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t))
g_forget(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
g_in(t) = sigmoid(W·x(t) + V·y_i(t-1) + U·C(t-1))
where x(t) represents the emotional feature vector, y_i(t-1) represents the long-time-domain emotional feature at the previous moment, C~(t) represents the candidate memory, C(t) represents the memory state, W represents the first weight, V represents the second weight, U represents the third weight, f(·) represents the activation function tanh, g_in(t) is the input gate of the LSTM, g_forget(t) is the forget gate of the LSTM, g_out(t) is the output gate of the LSTM, sigmoid(·) is the sigmoid activation function, and t represents time;
s414, will be instituteLong time domain emotional characteristics y i (t) inputting the full connection layer for processing to obtain all long-time domain emotional characteristics, and outputting and predicting the emotional characteristic category y (t) through a Softmax classification layer;
the expression of the output prediction emotion characteristic category y (t) of the Softmax classification layer is as follows:
Figure FDA0004077245340000031
wherein, y i (t) represents a long-term emotional characteristic, e represents a constant e,
Figure FDA0004077245340000032
representing the sum of all long-term emotion feature types, i represents the number of the long-term emotion feature types, wherein i =1,2, …, k represents the total number of the long-term emotion types;
S42, back propagation training: taking the set predicted emotion category Y'(t) as input to reversely train the Tibetan speech emotion recognition network composed of the CNN and the LSTM, obtaining the category Y(t) closest to the real emotion features, and adjusting the network parameters by a gradient descent algorithm so that the error between Y(t) and the predicted emotion feature category y(t) is smaller than a preset value, obtaining the trained Tibetan speech emotion recognition network;
and S5, preprocessing and feature extracting Tibetan speech data to be recognized, and inputting the preprocessed and feature extracted Tibetan speech data into the trained Tibetan speech emotion recognition network to obtain a Tibetan speech emotion classification result corresponding to the Tibetan speech data.
2. The CNN and LSTM-based Tibetan language emotion recognition method as claimed in claim 1, wherein the specific steps of step S1 are as follows:
s11, recording Tibetan voice data;
s12, emotion marking is carried out on the Tibetan language voice data to obtain an initial Tibetan language emotion corpus;
and S13, dividing the initial Tibetan language emotion corpus into a training set and a test set, and completing the establishment of the Tibetan language emotion corpus.
3. The CNN and LSTM based Tibetan language emotion recognition method of claim 1, wherein the window function in step S23 adopts a Hamming window, and the Hamming window w(n) is expressed as follows:
w(n) = a0 - (1 - a0)·cos(2πn/(N - 1)), 0 ≤ n ≤ N - 1
where a0 is a constant equal to 0.53836, N represents the length of the Hamming window, and n represents the sample index of the input signal within the window.
4. The CNN and LSTM based Tibetan language emotion recognition method of claim 1, wherein the Mel-scale filter H_m(k) in step S32 is expressed as follows:
H_m(k) = 0, for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0, for k > f(m+1)
where m denotes the m-th filter, k denotes the frequency index, and f(·) denotes the center frequency of each triangular filter in the Mel-scale filter bank H_m(k).
5. The CNN and LSTM based Tibetan speech emotion recognition method of claim 1, wherein the error function e(t) in step S42 is expressed as follows:
e(t) = (1/(2n)) · Σ (Y(t) - y(t))²
where n represents the total number of samples, Y(t) represents the category closest to the real emotion features, y(t) represents the predicted emotion feature category, and t represents time.
6. The CNN and LSTM-based Tibetan language emotion recognition method as claimed in claim 5, wherein the specific steps of step S5 are as follows:
s51, preprocessing Tibetan speech data to be recognized and extracting features, inputting the preprocessed Tibetan speech data into a trained Tibetan speech emotion recognition network, and outputting emotion feature categories with probabilities through a Softmax classification layer;
and S52, selecting the emotional feature type with the maximum probability as a Tibetan language voice emotion classification result corresponding to the Tibetan language voice data.
CN202110995181.6A 2021-08-27 2021-08-27 Tibetan language emotion recognition method based on CNN and LSTM Active CN113808620B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110995181.6A CN113808620B (en) 2021-08-27 2021-08-27 Tibetan language emotion recognition method based on CNN and LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110995181.6A CN113808620B (en) 2021-08-27 2021-08-27 Tibetan language emotion recognition method based on CNN and LSTM

Publications (2)

Publication Number Publication Date
CN113808620A CN113808620A (en) 2021-12-17
CN113808620B true CN113808620B (en) 2023-03-21

Family

ID=78942011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110995181.6A Active CN113808620B (en) 2021-08-27 2021-08-27 Tibetan language emotion recognition method based on CNN and LSTM

Country Status (1)

Country Link
CN (1) CN113808620B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596960B (en) * 2022-03-01 2023-08-08 中山大学 Alzheimer's disease risk prediction method based on neural network and natural dialogue

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597539A (en) * 2018-02-09 2018-09-28 桂林电子科技大学 Speech-emotion recognition method based on parameter migration and sound spectrograph
CN110164476A (en) * 2019-05-24 2019-08-23 广西师范大学 A kind of speech-emotion recognition method of the BLSTM based on multi output Fusion Features
CN110415728A (en) * 2019-07-29 2019-11-05 内蒙古工业大学 A kind of method and apparatus identifying emotional speech
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic
WO2020196978A1 (en) * 2019-03-25 2020-10-01 한국과학기술원 Electronic device for multi-scale voice emotion recognition and operation method of same
CN111785301A (en) * 2020-06-28 2020-10-16 重庆邮电大学 Residual error network-based 3DACRNN speech emotion recognition method and storage medium
CN112562725A (en) * 2020-12-09 2021-03-26 山西财经大学 Mixed voice emotion classification method based on spectrogram and capsule network
CN112581979A (en) * 2020-12-10 2021-03-30 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN112712824A (en) * 2021-03-26 2021-04-27 之江实验室 Crowd information fused speech emotion recognition method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717856B (en) * 2018-06-16 2022-03-08 台州学院 Speech emotion recognition method based on multi-scale deep convolution cyclic neural network

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597539A (en) * 2018-02-09 2018-09-28 桂林电子科技大学 Speech-emotion recognition method based on parameter migration and sound spectrograph
WO2020196978A1 (en) * 2019-03-25 2020-10-01 한국과학기술원 Electronic device for multi-scale voice emotion recognition and operation method of same
CN110164476A (en) * 2019-05-24 2019-08-23 广西师范大学 A kind of speech-emotion recognition method of the BLSTM based on multi output Fusion Features
CN110415728A (en) * 2019-07-29 2019-11-05 内蒙古工业大学 A kind of method and apparatus identifying emotional speech
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic
CN111785301A (en) * 2020-06-28 2020-10-16 重庆邮电大学 Residual error network-based 3DACRNN speech emotion recognition method and storage medium
CN112562725A (en) * 2020-12-09 2021-03-26 山西财经大学 Mixed voice emotion classification method based on spectrogram and capsule network
CN112581979A (en) * 2020-12-10 2021-03-30 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN112712824A (en) * 2021-03-26 2021-04-27 之江实验室 Crowd information fused speech emotion recognition method and system

Also Published As

Publication number Publication date
CN113808620A (en) 2021-12-17

Similar Documents

Publication Publication Date Title
Deshwal et al. A language identification system using hybrid features and back-propagation neural network
CN112466326B (en) Voice emotion feature extraction method based on transducer model encoder
CN112259106A (en) Voiceprint recognition method and device, storage medium and computer equipment
CN112784730B (en) Multi-modal emotion recognition method based on time domain convolutional network
CN110853680A (en) double-BiLSTM structure with multi-input multi-fusion strategy for speech emotion recognition
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
Sinha et al. Acoustic-phonetic feature based dialect identification in Hindi Speech
CN113808620B (en) Tibetan language emotion recognition method based on CNN and LSTM
Sen et al. A convolutional neural network based approach to recognize bangla spoken digits from speech signal
Nivetha A survey on speech feature extraction and classification techniques
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Xue et al. Cross-modal information fusion for voice spoofing detection
KS et al. Comparative performance analysis for speech digit recognition based on MFCC and vector quantization
Sun et al. A novel convolutional neural network voiceprint recognition method based on improved pooling method and dropout idea
CN113539243A (en) Training method of voice classification model, voice classification method and related device
CN113571095A (en) Speech emotion recognition method and system based on nested deep neural network
Dwijayanti et al. Speaker identification using a convolutional neural network
CN115064175A (en) Speaker recognition method
Hanifa et al. Comparative Analysis on Different Cepstral Features for Speaker Identification Recognition
CN112951270B (en) Voice fluency detection method and device and electronic equipment
CN113488069A (en) Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network
Abdiche et al. Text-independent speaker identification using mel-frequency energy coefficients and convolutional neural networks
Singh A text independent speaker identification system using ANN, RNN, and CNN classification technique

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant