CN108806724B - Method and system for predicting sentiment voice PAD value - Google Patents
- Publication number
- CN108806724B CN108806724B CN201810926352.8A CN201810926352A CN108806724B CN 108806724 B CN108806724 B CN 108806724B CN 201810926352 A CN201810926352 A CN 201810926352A CN 108806724 B CN108806724 B CN 108806724B
- Authority
- CN
- China
- Prior art keywords
- support vector
- regression model
- training
- vector regression
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- General Health & Medical Sciences (AREA)
- Child & Adolescent Psychology (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
Abstract
The invention discloses a method and system for predicting the PAD value of emotional speech. The method comprises the following steps: acquiring test emotional speech data; performing feature extraction on the test emotional speech data to obtain test feature data; obtaining a trained support vector regression model; and predicting on the test feature data with the trained support vector regression model to obtain the PAD value of the test emotional speech data. The method and system can predict the PAD value of emotional speech quickly and accurately.
Description
Technical Field
The invention relates to the field of predicting the PAD value of emotional speech, and in particular to a method and a system for predicting the PAD value of emotional speech.
Background
Speech is the most effective means of human communication and is increasingly used in human-computer interaction. Speech carries not only textual content but also rich information reflecting the speaker's emotional state. Speech emotion recognition uses a computer to judge the emotion category of a speaker, and most current research focuses on basic discrete emotions, for example recognising whether an utterance is angry or happy. In real life, however, human emotions are usually continuous, complex and changeable; mixed states such as bittersweet joy no longer belong entirely to a single discrete emotion category. In response, researchers proposed dimensional theory, which represents complex, varying emotion categories in a dimensional space: an emotion can be expressed as a coordinate point in a multi-dimensional emotion space. Dimensional emotional speech provides a firmer foundation for human-computer interaction and affective-computing research, and in recent years the study of dimensional emotional speech has gradually attracted wide attention. At present, dimensional coordinates are mainly obtained by manual annotation against an emotion scale, which is time-consuming and prone to subjective influence.
Disclosure of Invention
The invention aims to provide a method and a system for predicting the PAD value of emotional speech quickly and accurately.
In order to achieve the purpose, the invention provides the following scheme:
a method of emotion speech PAD value prediction, the method comprising:
acquiring test emotion voice data;
performing feature extraction on the test emotion voice data to obtain test feature data;
obtaining a trained support vector regression model;
and predicting the test characteristic data through the trained support vector regression model to obtain the PAD value of the test emotion voice data.
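The four steps above can be sketched with scikit-learn's SVR. Since an SVR predicts a single scalar, the sketch trains one regressor per PAD dimension; the feature dimensionality, sample counts and random values below are illustrative placeholders, not the patent's database:

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic stand-ins for extracted feature data and labelled PAD values
# (shapes and values are illustrative only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 8))          # training feature data
y_train = rng.uniform(-1, 1, size=(20, 3))  # labelled P, A, D values
X_test = rng.normal(size=(2, 8))            # test feature data

# One SVR per dimension, since each SVR predicts a single scalar output.
models = [SVR(kernel="rbf").fit(X_train, y_train[:, d]) for d in range(3)]

def predict_pad(x):
    """Predict a (P, A, D) triple for each row of a feature matrix."""
    return np.column_stack([m.predict(x) for m in models])

pad = predict_pad(X_test)
print(pad.shape)  # one (P, A, D) triple per test utterance
```

Training one independent regressor per output is the simplest way to apply SVR to a three-dimensional target; the patent does not specify how the three dimensions are combined.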
Optionally, before the acquiring the test emotion voice data, the method further includes:
acquiring training emotion voice data;
marking the training emotion voice data through a PAD three-dimensional emotion scale to obtain a marked PAD value;
extracting features of the training emotion voice data to obtain training feature data;
and training a support vector regression model through the training characteristic data and the labeled PAD value to obtain the trained support vector regression model.
Optionally, the training of the support vector regression model through the training feature data and the labeled PAD value to obtain the trained support vector regression model specifically includes:
inputting the training characteristic data into the support vector regression model to obtain output data;
judging whether the error between the output data and the labeled PAD value is within an error threshold range or not;
if so, obtaining a trained support vector regression model;
if not, adjusting the parameters of the support vector regression model to enable the error between the output data and the labeled PAD value to be within the range of an error threshold value, and obtaining the trained support vector regression model.
Optionally, the adjusting the parameters of the support vector regression model specifically includes:
and adjusting the penalty factor and the kernel function of the support vector regression model by a cross grid search method.
An emotion speech PAD value prediction system, the system comprising:
the test emotion voice data acquisition module is used for acquiring test emotion voice data;
the test feature data extraction module is used for extracting features of the test emotion voice data to obtain test feature data;
the support vector regression model acquisition module is used for acquiring a trained support vector regression model;
and the prediction module is used for predicting the test characteristic data through the trained support vector regression model to obtain the PAD value of the test emotion voice data.
Optionally, the system further includes:
the training emotion voice data acquisition module is used for acquiring training emotion voice data;
the marking module is used for marking the training emotion voice data through a PAD three-dimensional emotion scale to obtain a marked PAD value;
the training feature data extraction module is used for extracting features of the training emotion voice data to obtain training feature data;
and the training module is used for training a support vector regression model through the training characteristic data and the labeled PAD value to obtain the trained support vector regression model.
Optionally, the training module specifically includes:
the input unit is used for inputting the training characteristic data into the support vector regression model to obtain output data;
the judging unit is used for judging whether the error between the output data and the marked PAD value is within an error threshold range or not;
the result determining unit is used for obtaining a trained support vector regression model when the error between the output data and the labeled PAD value is within an error threshold range;
and the adjusting unit is used for adjusting the parameters of the support vector regression model when the error between the output data and the labeled PAD value is not within the error threshold range, so that the error between the output data and the labeled PAD value is within the error threshold range, and the trained support vector regression model is obtained.
Optionally, the adjusting unit adjusts the penalty factor and the kernel function of the support vector regression model by a cross grid search method.
Compared with the prior art, the invention has the following technical effect: the PAD value of dimensional emotional speech is predicted by a trained support vector regression model, which improves prediction precision and achieves accurate prediction of the PAD value of emotional speech.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method for predicting an emotional speech PAD value according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a system for predicting an emotion speech PAD value according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without creative effort shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
FIG. 1 is a flowchart illustrating a method for predicting an emotional speech PAD value according to an embodiment of the present invention. As shown in FIG. 1, an emotion speech PAD value prediction method comprises the following steps:
step 101: and acquiring test emotion voice data.
Step 102: and performing feature extraction on the test emotion voice data to obtain test feature data.
Step 103: and obtaining the trained support vector regression model.
Step 104: predict on the test feature data with the trained support vector regression model to obtain the PAD value of the test emotional speech data, where P (pleasure) represents the positive or negative character of the individual's emotional state; A (arousal, i.e. activation degree) represents the individual's level of neurophysiological activation; and D (dominance) represents the individual's control over the situation and other people.
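For illustration, a predicted triple can be read back into the three dimension names as follows; the zero thresholds assume values scaled to [-1, 1], which is an assumption for the sketch rather than a scale fixed by the patent:

```python
def describe_pad(p, a, d):
    """Coarse verbal reading of a (P, A, D) triple.

    Thresholding at zero assumes values scaled to [-1, 1]; this scale
    is an illustrative assumption, not fixed by the patent.
    """
    return {
        "pleasure": "positive" if p >= 0 else "negative",
        "arousal": "activated" if a >= 0 else "calm",
        "dominance": "dominant" if d >= 0 else "submissive",
    }

print(describe_pad(0.8, 0.6, -0.2))
# {'pleasure': 'positive', 'arousal': 'activated', 'dominance': 'submissive'}
```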
Before the acquiring the test emotion voice data, the method further comprises:
acquiring training emotion voice data;
marking the training emotion voice data through a PAD three-dimensional emotion scale to obtain a marked PAD value;
extracting features of the training emotion voice data to obtain training feature data;
and training a support vector regression model through the training characteristic data and the labeled PAD value to obtain the trained support vector regression model. Inputting the training characteristic data into the support vector regression model to obtain output data; judging whether the error between the output data and the labeled PAD value is within an error threshold range or not; if so, obtaining a trained support vector regression model; if not, the penalty factor and the kernel function of the support vector regression model are adjusted through a cross grid search method, so that the error between the output data and the labeled PAD value is within the error threshold range, and the trained support vector regression model is obtained.
The method comprises the following specific implementation steps:
according to a PAD three-dimensional emotion scale and a self-assessment model formulated by a Chinese academy, on the basis of an original discrete emotion voice database TUT 2.0 in a laboratory, 100 college students are recruited to mark P, A, D dimensionalities of each emotion voice according to the assessment model, validity verification is carried out on the data after marking data is obtained, a dimensionality emotion voice database is established, and comparison data are provided for subsequent training of an SVR regression model and prediction performance assessment. Extracting the speech speed, zero-crossing rate, short-time energy, fundamental tone frequency, formant and MFCC characteristics of the emotional speech, specifically: average speech rate; averaging the zero crossing rate; the maximum value, the minimum value and the average value of the energy and the 1 st order difference thereof; the maximum value, the minimum value and the average value of the fundamental frequency and the 1-order difference thereof; the 1 st formant (F1) and its maximum, minimum, mean, variance of the 1 st order difference; the 2 nd formant (F2) and its maximum, minimum, mean, variance of the 1 st order difference; the maximum, minimum, mean, variance of the 3 rd formant (F3) and its 1 st order difference; MFCCs are 98 dimensions in terms of skewness, kurtosis, mean, variance and median of MFCC 0-MFCC 11 order.
The labelled samples are then divided into a training set and a test set. The specific process is as follows: the PAD labels of the 237 annotated emotional utterances are represented as an N×3 matrix for building the regression prediction model; the experiment uses approximately 2/3 of the utterances as the training set and 1/3 as the test set, so the SVR training targets form a 158×3 matrix and the test targets a 79×3 matrix.
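The split can be sketched with numpy; the 237 utterances, the 158/79 division and the 98-dimension feature vectors come from the embodiment, while the random values are placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
feats = rng.normal(size=(237, 98))        # 98-dim feature matrix
pad = rng.uniform(-1, 1, size=(237, 3))   # labelled N x 3 PAD matrix

# Roughly 2/3 of the utterances for training, 1/3 for testing.
idx = rng.permutation(237)
train_idx, test_idx = idx[:158], idx[158:]
X_train, X_test = feats[train_idx], feats[test_idx]
pad_train, pad_test = pad[train_idx], pad[test_idx]

print(pad_train.shape, pad_test.shape)  # (158, 3) (79, 3)
```

Shuffling before splitting (an assumption here — the patent does not say whether the split is random) avoids any ordering bias in the annotated database.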
A regression kernel function of the support vector machine is selected, and the SVR parameters to be optimised are determined. The insensitivity value ε is set to 10⁻²; the penalty factor C and the RBF kernel parameter σ are optimised by cross grid search, finally selecting the parameter combination that minimises the mean squared error of the training model. The PAD value of the emotional speech is then predicted with the SVR model under the optimal training parameters.
According to the specific embodiment provided by the invention, the following technical effect is achieved: based on the established dimensional emotional speech database and the annotated PAD values of the speech, the PAD value of emotional speech is predicted with the optimally parameterised SVR model.
FIG. 2 is a schematic diagram of a system for predicting an emotion speech PAD value according to an embodiment of the present invention. As shown in fig. 2, an emotion voice PAD value prediction system includes:
and the test emotion voice data acquisition module is used for acquiring the test emotion voice data.
And the test feature data extraction module is used for extracting features of the test emotion voice data to obtain test feature data.
And the support vector regression model acquisition module is used for acquiring the trained support vector regression model.
And the prediction module is used for predicting the test characteristic data through the trained support vector regression model to obtain the PAD value of the test emotion voice data.
The system further comprises:
the training emotion voice data acquisition module is used for acquiring training emotion voice data;
the marking module is used for marking the training emotion voice data through a PAD three-dimensional emotion scale to obtain a marked PAD value;
the training feature data extraction module is used for extracting features of the training emotion voice data to obtain training feature data;
and the training module is used for training a support vector regression model through the training characteristic data and the labeled PAD value to obtain the trained support vector regression model.
The training module specifically comprises:
the input unit is used for inputting the training characteristic data into the support vector regression model to obtain output data;
the judging unit is used for judging whether the error between the output data and the marked PAD value is within an error threshold range or not;
the result determining unit is used for obtaining a trained support vector regression model when the error between the output data and the labeled PAD value is within an error threshold range;
and the adjusting unit is used for adjusting the parameters of the support vector regression model when the error between the output data and the labeled PAD value is not within the error threshold range, so that the error between the output data and the labeled PAD value is within the error threshold range, and the trained support vector regression model is obtained. And the adjusting unit adjusts the penalty factor and the kernel function of the support vector regression model by a cross grid search method.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and core concept of the invention. Meanwhile, a person skilled in the art may, according to the idea of the present invention, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the invention.
Claims (6)
1. A method for predicting an emotional speech PAD value, the method comprising:
acquiring training emotion voice data;
marking the training emotion voice data through a PAD three-dimensional emotion scale to obtain a marked PAD value;
extracting features of the training emotion voice data to obtain training feature data;
training a support vector regression model through the training characteristic data and the labeled PAD value to obtain a trained support vector regression model;
acquiring test emotion voice data;
performing feature extraction on the test emotion voice data to obtain test feature data; the characteristics comprise speech speed, zero-crossing rate, short-time energy, fundamental tone frequency, formants and MFCC of the emotional speech;
obtaining a trained support vector regression model;
and predicting the test characteristic data through the trained support vector regression model to obtain the PAD value of the test emotion voice data.
2. The method for predicting emotion speech PAD values as recited in claim 1, wherein said training a support vector regression model with said training feature data and said labeled PAD values to obtain a trained support vector regression model, specifically comprises:
inputting the training characteristic data into the support vector regression model to obtain output data;
judging whether the error between the output data and the labeled PAD value is within an error threshold range or not;
if so, obtaining a trained support vector regression model;
if not, adjusting the parameters of the support vector regression model to enable the error between the output data and the labeled PAD value to be within the range of an error threshold value, and obtaining the trained support vector regression model.
3. The method for predicting emotional speech PAD values according to claim 2, wherein the adjusting the parameters of the support vector regression model specifically comprises:
and adjusting the penalty factor and the kernel function of the support vector regression model by a cross grid search method.
4. An emotion speech PAD value prediction system, the system comprising:
the training emotion voice data acquisition module is used for acquiring training emotion voice data;
the marking module is used for marking the training emotion voice data through a PAD three-dimensional emotion scale to obtain a marked PAD value;
the training feature data extraction module is used for extracting features of the training emotion voice data to obtain training feature data;
the training module is used for training a support vector regression model through the training characteristic data and the labeled PAD value to obtain a trained support vector regression model;
the test emotion voice data acquisition module is used for acquiring test emotion voice data;
the test feature data extraction module is used for extracting features of the test emotion voice data to obtain test feature data;
the support vector regression model acquisition module is used for acquiring a trained support vector regression model;
and the prediction module is used for predicting the test characteristic data through the trained support vector regression model to obtain the PAD value of the test emotion voice data.
5. The system for predicting emotional speech PAD values of claim 4, wherein the training module specifically comprises:
the input unit is used for inputting the training characteristic data into the support vector regression model to obtain output data;
the judging unit is used for judging whether the error between the output data and the marked PAD value is within an error threshold range or not;
the result determining unit is used for obtaining a trained support vector regression model when the error between the output data and the labeled PAD value is within an error threshold range;
and the adjusting unit is used for adjusting the parameters of the support vector regression model when the error between the output data and the labeled PAD value is not within the error threshold range, so that the error between the output data and the labeled PAD value is within the error threshold range, and the trained support vector regression model is obtained.
6. The emotion speech PAD value prediction system of claim 5, wherein the adjustment unit adjusts the penalty factor and kernel function of the support vector regression model by cross-grid search.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810926352.8A CN108806724B (en) | 2018-08-15 | 2018-08-15 | Method and system for predicting sentiment voice PAD value |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810926352.8A CN108806724B (en) | 2018-08-15 | 2018-08-15 | Method and system for predicting sentiment voice PAD value |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108806724A CN108806724A (en) | 2018-11-13 |
CN108806724B (en) | 2020-08-25
Family
ID=64080122
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810926352.8A Active CN108806724B (en) | 2018-08-15 | 2018-08-15 | Method and system for predicting sentiment voice PAD value |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108806724B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111415680B (en) * | 2020-03-26 | 2023-05-23 | 心图熵动科技(苏州)有限责任公司 | Voice-based anxiety prediction model generation method and anxiety prediction system |
CN112185345A (en) * | 2020-09-02 | 2021-01-05 | 电子科技大学 | Emotion voice synthesis method based on RNN and PAD emotion models |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8340274B2 (en) * | 2008-12-22 | 2012-12-25 | Genesys Telecommunications Laboratories, Inc. | System for routing interactions using bio-performance attributes of persons as dynamic input |
CN102222500A (en) * | 2011-05-11 | 2011-10-19 | 北京航空航天大学 | Extracting method and modeling method for Chinese speech emotion combining emotion points |
CN102231276B (en) * | 2011-06-21 | 2013-03-20 | 北京捷通华声语音技术有限公司 | Method and device for forecasting duration of speech synthesis unit |
CN103198827B (en) * | 2013-03-26 | 2015-06-17 | 合肥工业大学 | Voice emotion correction method based on relevance of prosodic feature parameter and emotion parameter |
CN103531207B (en) * | 2013-10-15 | 2016-07-27 | 中国科学院自动化研究所 | A kind of speech-emotion recognition method merging long span emotion history |
CN103970864B (en) * | 2014-05-08 | 2017-09-22 | 清华大学 | Mood classification and mood component analyzing method and system based on microblogging text |
CN107437090A (en) * | 2016-05-28 | 2017-12-05 | 郭帅杰 | The continuous emotion Forecasting Methodology of three mode based on voice, expression and electrocardiosignal |
CN107633851B (en) * | 2017-07-31 | 2020-07-28 | 极限元(杭州)智能科技股份有限公司 | Discrete speech emotion recognition method, device and system based on emotion dimension prediction |
- 2018-08-15: application CN201810926352.8A, granted as patent CN108806724B (active)
Also Published As
Publication number | Publication date |
---|---|
CN108806724A (en) | 2018-11-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110838286B (en) | Model training method, language identification method, device and equipment | |
CN102142253B (en) | Voice emotion identification equipment and method | |
Darabkh et al. | An efficient speech recognition system for arm‐disabled students based on isolated words | |
Chandrashekar et al. | Spectro-temporal representation of speech for intelligibility assessment of dysarthria | |
Ke et al. | Speech emotion recognition based on SVM and ANN | |
CN101777347B (en) | Model complementary Chinese accent identification method and system | |
CN112259106A (en) | Voiceprint recognition method and device, storage medium and computer equipment | |
CN101346758A (en) | Emotion recognizer | |
CN103035241A (en) | Model complementary Chinese rhythm interruption recognition system and method | |
Jiang et al. | Speech emotion classification with the combination of statistic features and temporal features | |
CN114416934A (en) | Multi-modal dialog generation model training method and device and electronic equipment | |
CN102945673A (en) | Continuous speech recognition method with speech command range changed dynamically | |
CN108806724B (en) | Method and system for predicting sentiment voice PAD value | |
CN110992959A (en) | Voice recognition method and system | |
CN112562723B (en) | Pronunciation accuracy determination method and device, storage medium and electronic equipment | |
CN116665669A (en) | Voice interaction method and system based on artificial intelligence | |
CN107767881A (en) | A kind of acquisition methods and device of the satisfaction of voice messaging | |
CN113393828A (en) | Training method of voice synthesis model, and voice synthesis method and device | |
Hu et al. | A DNN-based acoustic modeling of tonal language and its application to Mandarin pronunciation training | |
CN110580897B (en) | Audio verification method and device, storage medium and electronic equipment | |
CN112669845A (en) | Method and device for correcting voice recognition result, electronic equipment and storage medium | |
CN112489634A (en) | Language acoustic model training method and device, electronic equipment and computer medium | |
CN104575495A (en) | Language identification method and system adopting total variable quantity factors | |
Song et al. | Speech signal-based emotion recognition and its application to entertainment robots | |
Tsai et al. | Self-defined text-dependent wake-up-words speaker recognition system |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |