CN110223714B - Emotion recognition method based on voice - Google Patents

Emotion recognition method based on voice

Info

Publication number
CN110223714B
Authority
CN
China
Prior art keywords
frame
level
deep learning
features
model
Prior art date
Legal status
Active
Application number
CN201910478640.6A
Other languages
Chinese (zh)
Other versions
CN110223714A (en)
Inventor
伍林
尹朝阳
Current Assignee
Hangzhou Zhexin Information Technology Co ltd
Original Assignee
Hangzhou Zhexin Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Zhexin Information Technology Co ltd filed Critical Hangzhou Zhexin Information Technology Co ltd
Priority to CN201910478640.6A priority Critical patent/CN110223714B/en
Publication of CN110223714A publication Critical patent/CN110223714A/en
Application granted granted Critical
Publication of CN110223714B publication Critical patent/CN110223714B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The invention discloses a speech-based emotion recognition method, which comprises: performing frame processing on the speech and extracting a feature vector for each frame; inputting the feature vector of each frame into a deep learning temporal model and outputting frame-level features; inputting the frame-level features and the hidden state of a second deep learning temporal model at the previous time instant into an attention model and outputting segment-level features through learning; inputting the segment-level features into a second attention model to form the final pronunciation-level representation; and finally inputting that representation into a softmax layer to obtain the probability of each predicted emotion, thereby recognizing the emotion. The beneficial effects of the invention are: a hierarchical deep learning temporal model structure is used to extract features of different levels in the speech, and several attention mechanisms are introduced to effectively select key features, which facilitates emotion recognition; with this method, not only frame-level speech features but also segment-level speech features can be extracted, so the accuracy of emotion recognition can be effectively improved.

Description

Emotion recognition method based on voice
Technical Field
The invention relates to the technical field of emotion recognition, in particular to an emotion recognition method based on voice.
Background
With the development of computers and artificial intelligence technology, emotion recognition has become particularly important in natural human-computer interaction. Applications such as intelligent customer-service systems and chat robots need to give appropriate feedback according to the customer's emotion. Speech contains rich information about the speaker, and the speaker's emotion can be recognized from it. A traditional speech emotion recognition system first extracts acoustic features from each frame of audio, such as short-time energy, fundamental frequency and MFCCs (Mel-frequency cepstral coefficients, a commonly used spectral feature), then concatenates these acoustic features, and finally recognizes the emotion with a classifier. Commonly used classifiers include the SVM (support vector machine, a supervised classifier) and random forests.
In recent years, deep learning methods have been widely applied to speech emotion recognition, mainly in the following ways: 1) extracting the Mel spectrum of the audio as the input of a CNN (convolutional neural network, used to extract features) for further feature extraction, and extracting the temporal correlation between frames with an LSTM (long short-term memory network, suitable for processing time series), with an attention mechanism introduced to reduce the influence of silence; 2) converting the audio into a spectrogram, extracting features with the FCN (fully convolutional network) structure of AlexNet (a deep neural network), and introducing an attention mechanism to extract the parts useful for emotion and reduce the influence of emotion-irrelevant input; 3) extracting 32-dimensional acoustic features of the audio and recognizing emotion with a bidirectional LSTM plus an attention mechanism; 4) extracting 36-dimensional acoustic features of the audio and using a modified LSTM to better extract temporally correlated features.
Since speech is a time series, it is a good choice to extract the time-related features in speech using LSTM. In the above prior art, the input of LSTM at a certain time is the acoustic features of the corresponding frames of audio, and the association between frames is learned, but the training data set is based on segment-level labeled emotion, that is, one emotion is labeled by one piece of speech. Therefore, in addition to learning frame-level features in speech, there is a need to learn segment-level features, i.e., extract pronunciation-level features to better recognize emotion.
Disclosure of Invention
To solve the above problems, the present invention provides a method for recognizing speech emotion by using a hierarchical structure of a deep learning time sequence model, which can extract not only frame-level speech features but also segment-level speech features.
In order to achieve the above object, the present invention provides a speech-based emotion recognition method, including the steps of:
step 1: performing frame processing on the speech and extracting features from each frame to obtain a feature vector v_n for each frame, where v_n denotes the feature vector of the n-th frame;
step 2: inputting the feature vector v_n of each frame obtained in step 1 into a first-layer deep learning temporal model, learning the association between frames through the first-layer model, and outputting a frame-level feature y_i every k frames, thereby obtaining frame-level features y_i at M time instants, where y_i denotes the frame-level vector output by the first-layer deep learning temporal model at the i-th time instant;
step 3: inputting the frame-level features y_i obtained in step 2 and the hidden state h_{t-1} of the second-layer deep learning temporal model at time t-1 into a first attention model to obtain the input z_t of the second-layer deep learning temporal model at time t, and, after learning over the M time instants, outputting segment-level features s_t, where s_t denotes the segment-level vector output by the second-layer deep learning temporal model at time t;
step 4: inputting the segment-level features s_t obtained in step 3 into a second attention model to form the final pronunciation-level representation;
step 5: inputting the pronunciation-level representation obtained in step 4 into a softmax layer to obtain the probability of each predicted emotion, thereby recognizing the emotion.
As a further refinement of the present invention, the first layer deep learning temporal model and the second layer deep learning temporal model are one of LSTM, RNN and GRU.
As a further improvement of the present invention, in step 1, the length of each frame is 25ms, and the frame shift is 10 ms.
As a further improvement of the invention, in step 1, 36-dimensional features are extracted from each frame, and the feature vector of each frame consists of 13-dimensional MFCCs, zero-crossing rate, energy entropy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off, a 12-dimensional chroma vector, the chroma vector standard deviation, signal-to-noise ratio and pitch.
As a further improvement of the invention, in step 2, k = 3, so that a frame-level feature is output every 3 frames, giving M frame-level features in total, one per time instant.
As a further improvement of the present invention, in step 3, the first attention model operates according to formula (1), formula (2) and formula (3):
e_i^(t) = w^T · tanh(W_a · h_{t-1} + U_a · y_i + b_a)    (1)
α_i^(t) = exp(e_i^(t)) / Σ_{j=1}^{M} exp(e_j^(t))    (2)
z_t = Σ_{i=1}^{M} α_i^(t) · y_i    (3)
where w^T, W_a, U_a and b_a are network parameters of the first attention model, y_i is a frame-level feature, h_{t-1} is the hidden state of the LSTM at time t-1, z_t is the input of the LSTM at time t, e_i^(t) is the correlation coefficient between the frame-level feature y_i and the LSTM input z_t at time t, and α_i^(t) is the attention coefficient at time t.
As a further improvement of the invention, in step 4, the correlation coefficient of the second attention model is obtained by the vector multiplication (dot product) of the network parameter u and s_i.
As a further improvement of the invention, a plurality of the first layer deep learning time sequence models and the first attention model are used for extracting features of different levels in the voice.
The invention has the beneficial effects that: a hierarchical deep learning temporal model structure is used to extract features of different levels in the speech, and several attention mechanisms are introduced to effectively select key features, which facilitates emotion recognition; with this method, not only frame-level speech features but also segment-level speech features can be extracted, so the accuracy of emotion recognition can be effectively improved.
Drawings
FIG. 1 is a flow chart of a method for speech-based emotion recognition according to an embodiment of the present invention;
FIG. 2 is a block diagram of an emotion recognition system of a speech-based emotion recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an LSTM model structure of a speech-based emotion recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an RNN model structure of a speech-based emotion recognition method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a GRU model of a speech-based emotion recognition method according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments and with reference to the attached drawings.
As shown in fig. 1, a method for emotion recognition based on speech according to an embodiment of the present invention includes the following steps:
step 1: performing frame processing on the speech and extracting features from each frame to obtain a feature vector v_n for each frame, where v_n denotes the feature vector of the n-th frame;
step 2: inputting the feature vector v_n of each frame obtained in step 1 into a first-layer deep learning temporal model, learning the association between frames through the first-layer model, and outputting a frame-level feature y_i every k frames, thereby obtaining frame-level features y_i at M time instants, where y_i denotes the frame-level vector output by the first-layer deep learning temporal model at the i-th time instant;
step 3: inputting the frame-level features y_i obtained in step 2 and the hidden state h_{t-1} of the second-layer deep learning temporal model at time t-1 into a first attention model to obtain the input z_t of the second-layer deep learning temporal model at time t, and, after learning over the M time instants, outputting segment-level features s_t, where s_t denotes the segment-level vector output by the second-layer deep learning temporal model at time t;
step 4: inputting the segment-level features s_t obtained in step 3 into a second attention model to form the final pronunciation-level representation;
step 5: inputting the pronunciation-level representation obtained in step 4 into a softmax layer to obtain the probability of each predicted emotion, thereby recognizing the emotion.
Further, the first-layer deep learning temporal model and the second-layer deep learning temporal model are each one of LSTM, RNN and GRU.
Further, in step 1, the length of each frame is 25ms, and the frame shift is 10 ms.
Further, in step 1, 36-dimensional features are extracted from each frame, and the feature vector of each frame consists of 13-dimensional MFCCs, zero-crossing rate, energy entropy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off, a 12-dimensional chroma vector, the chroma vector standard deviation, signal-to-noise ratio and pitch.
Further, in step 2, k = 3, so that a frame-level feature is output every 3 frames, giving M frame-level features in total, one per time instant.
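As one way to realize the framing and 36-dimensional feature extraction described above, the sketch below uses librosa to compute a subset of the listed descriptors (13 MFCCs, zero-crossing rate, spectral centroid, spectral roll-off and a 12-dimensional chroma vector) with 25 ms frames and a 10 ms shift; the function name extract_frame_features, the 16 kHz sampling rate and the omission of the remaining descriptors are assumptions made only for illustration.

# Hedged sketch: 25 ms frames, 10 ms shift, a subset of the 36-dim descriptors (librosa assumed).
import numpy as np
import librosa

def extract_frame_features(path):
    y, sr = librosa.load(path, sr=16000)
    n_fft = int(0.025 * sr)        # 25 ms frame length
    hop = int(0.010 * sr)          # 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    # Stack to (num_frames, feature_dim); energy entropy, spectral spread/entropy/flux,
    # chroma standard deviation, SNR and pitch would be appended the same way to reach 36 dims.
    feats = np.vstack([mfcc, zcr, centroid, rolloff, chroma]).T
    return feats   # one feature vector v_n per frame

frames = extract_frame_features("utterance.wav")
print(frames.shape)   # (N frames, 28 dims in this partial sketch)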
Further, in step 3, the first attention model operates according to formula (1), formula (2) and formula (3):
e_i^(t) = w^T · tanh(W_a · h_{t-1} + U_a · y_i + b_a)    (1)
α_i^(t) = exp(e_i^(t)) / Σ_{j=1}^{M} exp(e_j^(t))    (2)
z_t = Σ_{i=1}^{M} α_i^(t) · y_i    (3)
where w^T, W_a, U_a and b_a are network parameters of the first attention model (W and U denote weights, b denotes a bias), y_i is a frame-level feature, h_{t-1} is the hidden state of the LSTM at time t-1, z_t is the input of the LSTM at time t, e_i^(t) is the correlation coefficient between the frame-level feature y_i and the LSTM input z_t at time t, and α_i^(t) is the attention coefficient at time t.
Further, in step 4, the correlation coefficient of the second attention model is obtained by the vector multiplication (dot product) of the network parameter u and s_i.
Furthermore, features of different levels in the speech can be extracted by using a plurality of first-layer deep learning temporal models together with first attention models, which improves the emotion recognition effect. Treating one first-layer deep learning temporal model plus one first attention model as an integral module, this module can be stacked so that the features in the speech are extracted by three or more layers of deep learning temporal models.
As shown in fig. 2 and fig. 3, when recognizing speech emotion with the hierarchical LSTM structure, the speech is first divided into frames, each 25 ms long with a frame shift of 10 ms, and 36-dimensional features are extracted for each frame, comprising 13-dimensional MFCCs, zero-crossing rate, energy entropy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off, a 12-dimensional chroma vector, the chroma vector standard deviation, signal-to-noise ratio and pitch. The 36-dimensional feature vector of each frame is then input into the first-layer LSTM structure; as shown in fig. 2, v_n denotes the feature vector of the n-th frame, and the association between frames is learned by the first-layer LSTM.
Second, every k frames the output y_i of the first-layer LSTM, together with the hidden state h_{t-1} of the second-layer LSTM at the previous time instant, is input to an attention model; with k = 3, the first-layer LSTM jointly yields outputs at M time instants, {y_1, y_2, ..., y_M}. The first attention model works as follows:
e_i^(t) = w^T · tanh(W_a · h_{t-1} + U_a · y_i + b_a)    (1)
α_i^(t) = exp(e_i^(t)) / Σ_{j=1}^{M} exp(e_j^(t))    (2)
z_t = Σ_{i=1}^{M} α_i^(t) · y_i    (3)
where w^T, W_a and U_a are weights, b_a is a bias, y_i is a frame-level feature, h_{t-1} is the hidden state of the second-layer LSTM at time t-1, z_t is the input of the second-layer LSTM at time t, e_i^(t) is the correlation coefficient between the frame-level feature y_i and the LSTM input z_t at time t, and α_i^(t) is the attention coefficient at time t. With this attention model, the second-layer LSTM can effectively select key features through larger attention coefficients; for example, when the attention coefficient α_i^(t) equals 0, the i-th frame-level feature y_i is not selected.
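To make the mechanism of equations (1) to (3) concrete, the short sketch below evaluates them once for a single time step with random tensors (PyTorch assumed); the dimensions M = 5 and d = 8 are arbitrary assumptions, and the variable names mirror w, W_a, U_a and b_a.

# Hedged numeric sketch of eqs. (1)-(3) for one time step (PyTorch assumed).
import torch

M, d = 5, 8                        # M frame-level features y_i, each d-dimensional
y = torch.randn(M, d)              # frame-level features y_1..y_M
h_prev = torch.randn(d)            # hidden state h_{t-1} of the second-layer LSTM
w = torch.randn(d)
W_a, U_a = torch.randn(d, d), torch.randn(d, d)
b_a = torch.randn(d)

e = torch.tanh(h_prev @ W_a.T + y @ U_a.T + b_a) @ w     # eq. (1): correlation coefficients e_i^(t)
alpha = torch.softmax(e, dim=0)                          # eq. (2): attention coefficients
z_t = (alpha.unsqueeze(1) * y).sum(dim=0)                # eq. (3): input of the second-layer LSTM

print(alpha)       # sums to 1; a coefficient near 0 means that frame-level feature is ignored
print(z_t.shape)   # torch.Size([8])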
Third, the features s_i learned by the second-layer LSTM are input to a second attention model to form the final pronunciation-level representation. Here the correlation coefficient of the second attention model is obtained by the vector multiplication (dot product) of the network parameter u and s_i.
And finally, inputting the representation of the pronunciation level into a softmax layer to obtain a probability value of the predicted emotion, thereby realizing emotion recognition.
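The second attention model and the softmax layer just described could be sketched as follows, where the utterance representation is a weighted sum of the segment-level features s_i with weights obtained from the dot product of u and each s_i; the dimensions and the four emotion classes are illustrative assumptions.

# Hedged sketch of the second attention model and the softmax layer (PyTorch assumed).
import torch
import torch.nn.functional as F

M, d, num_emotions = 6, 8, 4
s = torch.randn(M, d)                     # segment-level features s_1..s_M from the second-layer LSTM
u = torch.randn(d)                        # learned network parameter u

e = s @ u                                 # correlation coefficients: dot product of u and each s_i
alpha = torch.softmax(e, dim=0)           # attention coefficients
utterance = (alpha.unsqueeze(1) * s).sum(dim=0)   # pronunciation-level representation

W, b = torch.randn(num_emotions, d), torch.randn(num_emotions)
probs = F.softmax(utterance @ W.T + b, dim=0)     # probability of each predicted emotion
print(probs, probs.argmax().item())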
Using a layered LSTM structure: after framing, each frame of the audio lasts only tens of milliseconds, so the extracted features correspond to phonemes or even lower-level elements. With the hierarchical LSTM structure, the first-layer LSTM learns the relationship between phoneme-level features; phonemes form syllables, and syllables form different words and phrases, so the relationships extracted by the first-layer LSTM are output to the second-layer LSTM at intervals, which learns syllable- and phrase-level features. For example, for an utterance such as "Wei, ni hao!" ("Hello!"), the hierarchical LSTM structure can learn the relationships among phonemes such as /w/, /ei/ and /n/, and can also learn and distinguish emotion based on the syllables "wei" and "ni hao", whereas the prior art can only learn from phoneme-level features such as /w/, /ei/ and /n/, or from even lower-level elements. Compared with the prior art that uses a single-layer LSTM, the invention uses a layered LSTM structure to extract features of different levels in the speech, which is more conducive to emotion recognition.
In a specific implementation, the LSTM, RNN and GRU modules all take a frame sequence as input and produce an output at each time step; although their internal mechanisms differ slightly, all of them can extract the features of a sequence, so the LSTM can be replaced with similar temporal models such as the RNN or GRU.
The model structures of RNN, GRU and LSTM are as follows:
the LSTM internally comprises a memory unit and has long-time and short-time memory; the system comprises three gates, namely an input gate, a forgetting gate and an output gate, and the specific expression formula is as follows:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c · [h_{t-1}, x_t] + b_c)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)
where i_t, f_t and o_t are the input gate, forget gate and output gate respectively; c_t is the memory cell, h_t is the hidden state, σ denotes the sigmoid function, ⊙ denotes element-wise multiplication, W denotes a weight and b denotes a bias. Through the forget gate the LSTM decides how much past information is retained in the memory cell, and through the input gate it receives the information of the current time step.
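A direct transcription of the five LSTM equations above is sketched below (PyTorch assumed); the concatenation [h_{t-1}, x_t] is realized with torch.cat, and the randomly initialized gate weights and sizes are illustrative only.

# Hedged sketch: one LSTM step following the equations above (PyTorch assumed).
import torch

d_in, d_h = 36, 16
x_t = torch.randn(d_in)
h_prev, c_prev = torch.zeros(d_h), torch.zeros(d_h)

# One weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t].
W_i, W_f, W_c, W_o = (torch.randn(d_h, d_h + d_in) for _ in range(4))
b_i, b_f, b_c, b_o = (torch.zeros(d_h) for _ in range(4))

hx = torch.cat([h_prev, x_t])                           # [h_{t-1}, x_t]
i_t = torch.sigmoid(W_i @ hx + b_i)                     # input gate
f_t = torch.sigmoid(W_f @ hx + b_f)                     # forget gate
c_t = f_t * c_prev + i_t * torch.tanh(W_c @ hx + b_c)   # memory cell update
o_t = torch.sigmoid(W_o @ hx + b_o)                     # output gate
h_t = o_t * torch.tanh(c_t)                             # hidden state
print(h_t.shape)                                        # torch.Size([16])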
The RNN has no internal memory cell; it learns the relationship between the inputs at multiple time steps through its hidden layer, as expressed by the following formulas:
h_t = f(U · x_t + W · h_{t-1} + b)
o_t = V · h_t
where x_t denotes the input at time t, h_t denotes the hidden state at time t, o_t denotes the output at time t, f denotes the activation function (the tanh function is usually chosen), U, W and V are weights, and b is a bias. It can be seen that the hidden state at time t depends not only on the input at time t but also on the state before time t, so the association within a time series can be learned effectively.
The RNN uses a single hidden layer with shared weight parameters to map the input sequence x(t-1), x(t), x(t+1), ... to the output sequence y(t-1), y(t), y(t+1), ...; the model structure of the RNN is shown in fig. 4.
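The two RNN formulas above can likewise be written out as a short sketch (PyTorch assumed; tanh is used as the activation f, and the weight shapes are illustrative).

# Hedged sketch: one vanilla-RNN step, h_t = tanh(U x_t + W h_{t-1} + b), o_t = V h_t.
import torch

d_in, d_h, d_out = 36, 16, 4
U, W, V = torch.randn(d_h, d_in), torch.randn(d_h, d_h), torch.randn(d_out, d_h)
b = torch.zeros(d_h)

h_t = torch.zeros(d_h)
for x_t in torch.randn(10, d_in):             # a sequence of 10 input frames
    h_t = torch.tanh(U @ x_t + W @ h_t + b)   # hidden state depends on x_t and the previous state
    o_t = V @ h_t                             # output at time t
print(o_t.shape)                              # torch.Size([4])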
The GRU also has no memory cells inside, but has an update gate that helps the model determine how much past information to pass into the future, and a reset gate that mainly determines how much past information to forget. The concrete expression formula is as follows:
z_t = σ(U_z · x_t + W_z · h_{t-1})
r_t = σ(U_r · x_t + W_r · h_{t-1})
h̃_t = tanh(U · x_t + W · (r_t ⊙ h_{t-1}))
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
where x_t denotes the input at time t, h_t denotes the output at time t, z_t denotes the update gate, r_t denotes the reset gate, h̃_t denotes the intermediate (candidate) state at time t, σ denotes the sigmoid function, ⊙ denotes element-wise multiplication, and U and W denote weights. It can be seen that the output of the GRU at each time step depends on the reset and update gates.
The GRU is a variant of LSTM, simplifying the LSTM network, and the model structure of the GRU is shown in fig. 5.
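For comparison, the GRU equations can be transcribed as below (PyTorch assumed); the interpolation h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t follows the convention used in the formulas above, and all sizes are illustrative.

# Hedged sketch: one GRU step following the equations above (PyTorch assumed).
import torch

d_in, d_h = 36, 16
x_t = torch.randn(d_in)
h_prev = torch.zeros(d_h)

U_z, U_r, U = (torch.randn(d_h, d_in) for _ in range(3))
W_z, W_r, W = (torch.randn(d_h, d_h) for _ in range(3))

z_t = torch.sigmoid(U_z @ x_t + W_z @ h_prev)        # update gate
r_t = torch.sigmoid(U_r @ x_t + W_r @ h_prev)        # reset gate
h_tilde = torch.tanh(U @ x_t + W @ (r_t * h_prev))   # candidate (intermediate) state
h_t = (1 - z_t) * h_prev + z_t * h_tilde             # new output/hidden state
print(h_t.shape)                                     # torch.Size([16])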
Further extended, the method can also be applied to speech age group recognition and gender recognition.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A speech-based emotion recognition method, characterized in that the method comprises the steps of:
step 1: performing frame processing on the speech and extracting features from each frame to obtain a feature vector of each frame, where v_n denotes the feature vector of the n-th frame;
step 2: inputting the feature vector v_n of each frame obtained in step 1 into a first-layer deep learning temporal model, learning the association between frames through the first-layer model, and outputting a frame-level feature y_i every k frames, thereby obtaining frame-level features y_i at M time instants, where y_i denotes the frame-level vector output by the first-layer deep learning temporal model at the i-th time instant;
step 3: inputting the frame-level features y_i at the M time instants obtained in step 2 and the hidden state h_{t-1} of the second-layer deep learning temporal model at time t-1 into a first attention model to obtain the input z_t of the second-layer deep learning temporal model at time t, and, after learning over the M time instants, outputting segment-level features s_t, where s_t denotes the segment-level vector output by the second-layer deep learning temporal model at time t;
step 4: inputting the segment-level features s_t obtained in step 3 into a second attention model to form the final pronunciation-level representation;
step 5: inputting the pronunciation-level representation obtained in step 4 into a softmax layer to obtain the probability of each predicted emotion, thereby recognizing the emotion.
2. A speech based emotion recognition method as claimed in claim 1, wherein the first layer deep learning temporal model and the second layer deep learning temporal model are one of LSTM, RNN and GRU.
3. A speech-based emotion recognition method as claimed in claim 1, wherein in step 1, each frame has a length of 25ms and the frame shift is 10 ms.
4. The method of claim 1, wherein in step 1, 36-dimensional features are extracted from each frame, and the feature vector v_n of each frame consists of 13-dimensional MFCCs, zero-crossing rate, energy entropy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off, a 12-dimensional chroma vector, the chroma vector standard deviation, signal-to-noise ratio and pitch.
5. A speech-based emotion recognition method as claimed in claim 1, wherein in step 2, k = 3, so that a frame-level feature is output every 3 frames, giving M frame-level features in total, one per time instant.
6. A speech-based emotion recognition method as claimed in claim 1, wherein in step 3, the first attention model operates according to formula (1), formula (2) and formula (3):
e_i^(t) = w^T · tanh(W_a · h_{t-1} + U_a · y_i + b_a)    (1)
α_i^(t) = exp(e_i^(t)) / Σ_{j=1}^{M} exp(e_j^(t))    (2)
z_t = Σ_{i=1}^{M} α_i^(t) · y_i    (3)
where w^T, W_a, U_a and b_a are network parameters of the first attention model, y_i is a frame-level feature, h_{t-1} is the hidden state of the LSTM at time t-1, z_t is the input of the LSTM at time t, e_i^(t) is the correlation coefficient between the frame-level feature y_i at time t and the LSTM input z_t at time t, α_i^(t) is the attention coefficient at time t, and the second-layer deep learning temporal model is an LSTM.
7. A speech-based emotion recognition method as claimed in claim 1, wherein in step 4, the correlation coefficient of the second attention model is obtained by the vector multiplication of the network estimation parameter u and s_i.
8. The method of claim 1, wherein a plurality of the first-level deep learning temporal models and a plurality of the first attention models are used to extract features of different levels in the speech.
CN201910478640.6A 2019-06-03 2019-06-03 Emotion recognition method based on voice Active CN110223714B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910478640.6A CN110223714B (en) 2019-06-03 2019-06-03 Emotion recognition method based on voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910478640.6A CN110223714B (en) 2019-06-03 2019-06-03 Emotion recognition method based on voice

Publications (2)

Publication Number Publication Date
CN110223714A CN110223714A (en) 2019-09-10
CN110223714B true CN110223714B (en) 2021-08-03

Family

ID=67819528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910478640.6A Active CN110223714B (en) 2019-06-03 2019-06-03 Emotion recognition method based on voice

Country Status (1)

Country Link
CN (1) CN110223714B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110556130A (en) * 2019-09-17 2019-12-10 平安科技(深圳)有限公司 Voice emotion recognition method and device and storage medium
CN110600015B (en) * 2019-09-18 2020-12-15 北京声智科技有限公司 Voice dense classification method and related device
CN110956953B (en) * 2019-11-29 2023-03-10 中山大学 Quarrel recognition method based on audio analysis and deep learning
CN111276131B (en) * 2020-01-22 2021-01-12 厦门大学 Multi-class acoustic feature integration method and system based on deep neural network
CN111312292A (en) * 2020-02-18 2020-06-19 北京三快在线科技有限公司 Emotion recognition method and device based on voice, electronic equipment and storage medium
CN111583965A (en) * 2020-04-28 2020-08-25 北京慧闻科技(集团)有限公司 Voice emotion recognition method, device, equipment and storage medium
CN111968677B (en) * 2020-08-21 2021-09-07 南京工程学院 Voice quality self-evaluation method for fitting-free hearing aid
CN112185423B (en) * 2020-09-28 2023-11-21 南京工程学院 Voice emotion recognition method based on multi-head attention mechanism
CN112671984B (en) * 2020-12-01 2022-09-23 长沙市到家悠享网络科技有限公司 Service mode switching method and device, robot customer service and storage medium
CN113688822A (en) * 2021-09-07 2021-11-23 河南工业大学 Time sequence attention mechanism scene image identification method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9020822B2 (en) * 2012-10-19 2015-04-28 Sony Computer Entertainment Inc. Emotion recognition using auditory attention cues extracted from users voice
SG11201806080WA (en) * 2016-01-19 2018-08-30 Murdoch Childrens Res Inst Systems and computer-implemented methods for assessing social competency
US20180133900A1 (en) * 2016-11-15 2018-05-17 JIBO, Inc. Embodied dialog and embodied speech authoring tools for use with an expressive social robot
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US10347244B2 (en) * 2017-04-21 2019-07-09 Go-Vivace Inc. Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response
CN108334583B (en) * 2018-01-26 2021-07-09 上海智臻智能网络科技股份有限公司 Emotion interaction method and device, computer readable storage medium and computer equipment
CN108597539B (en) * 2018-02-09 2021-09-03 桂林电子科技大学 Speech emotion recognition method based on parameter migration and spectrogram
CN108717856B (en) * 2018-06-16 2022-03-08 台州学院 Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
CN108874782B (en) * 2018-06-29 2019-04-26 北京寻领科技有限公司 A kind of more wheel dialogue management methods of level attention LSTM and knowledge mapping
CN109003625B (en) * 2018-07-27 2021-01-12 中国科学院自动化研究所 Speech emotion recognition method and system based on ternary loss
CN109285562B (en) * 2018-09-28 2022-09-23 东南大学 Voice emotion recognition method based on attention mechanism
CN109243494B (en) * 2018-10-30 2022-10-11 南京工程学院 Children emotion recognition method based on multi-attention mechanism long-time memory network
CN109599129B (en) * 2018-11-13 2021-09-14 杭州电子科技大学 Voice depression recognition system based on attention mechanism and convolutional neural network
CN109599128B (en) * 2018-12-24 2022-03-01 北京达佳互联信息技术有限公司 Speech emotion recognition method and device, electronic equipment and readable medium
CN109637522B (en) * 2018-12-26 2022-12-09 杭州电子科技大学 Speech emotion recognition method for extracting depth space attention features based on spectrogram
CN109817246B (en) * 2019-02-27 2023-04-18 平安科技(深圳)有限公司 Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network

Also Published As

Publication number Publication date
CN110223714A (en) 2019-09-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant