CN108806724B - Method and system for predicting emotional speech PAD value - Google Patents

Method and system for predicting emotional speech PAD value

Info

Publication number
CN108806724B
Authority
CN
China
Prior art keywords
support vector
regression model
training
vector regression
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810926352.8A
Other languages
Chinese (zh)
Other versions
CN108806724A (en)
Inventor
张雪英
孙颖
张卫
张婷
黄丽霞
陈桂军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology
Priority to CN201810926352.8A
Publication of CN108806724A
Application granted
Publication of CN108806724B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses a method and system for predicting the PAD value of emotional speech. The method comprises the following steps: acquiring test emotional speech data; performing feature extraction on the test emotional speech data to obtain test feature data; obtaining a trained support vector regression model; and predicting the PAD value of the test emotional speech data from the test feature data with the trained support vector regression model. The method and system can predict the PAD value of emotional speech quickly and accurately.

Description

Method and system for predicting emotional speech PAD value
Technical Field
The invention relates to the field of emotional speech PAD value prediction, and in particular to a method and a system for predicting the PAD value of emotional speech.
Background
Speech is the most effective medium of human communication and is increasingly used in human-computer interaction. Speech carries not only textual content but also rich information reflecting the speaker's emotional state. Speech emotion recognition uses a computer to make a cognitive judgment about a speaker's emotion category, and most current research focuses on basic discrete emotions, such as recognizing whether an utterance is angry or happy. In real life, however, emotions are usually continuous, complex, and changeable; mixed states such as tears of joy or mingled sorrow and happiness no longer belong to any single discrete emotion category. To address this, researchers proposed dimensional theory, which represents complex, varying emotion categories in a dimensional space: an emotion is expressed as a coordinate point in a multi-dimensional emotion space. Dimensional emotional speech provides a more adequate foundation for realizing human-computer interaction and for research in affective computing, and in recent years it has gradually attracted wide attention. At present, dimensional coordinates are mainly obtained by manual annotation against an emotion scale, which is time-consuming and prone to subjective bias.
Disclosure of Invention
The invention aims to provide a method and a system for predicting the PAD value of emotional speech quickly and accurately.
To achieve this purpose, the invention provides the following scheme:
A method for predicting the emotional speech PAD value, the method comprising:
acquiring test emotional speech data;
performing feature extraction on the test emotional speech data to obtain test feature data;
obtaining a trained support vector regression model;
and predicting, with the trained support vector regression model, the PAD value of the test emotional speech data from the test feature data.
Optionally, before the acquiring of the test emotional speech data, the method further includes:
acquiring training emotional speech data;
annotating the training emotional speech data with the PAD three-dimensional emotion scale to obtain annotated PAD values;
performing feature extraction on the training emotional speech data to obtain training feature data;
and training a support vector regression model with the training feature data and the annotated PAD values to obtain the trained support vector regression model.
Optionally, the training of the support vector regression model with the training feature data and the annotated PAD values to obtain the trained support vector regression model specifically includes:
inputting the training feature data into the support vector regression model to obtain output data;
judging whether the error between the output data and the annotated PAD values is within an error threshold range;
if so, obtaining the trained support vector regression model;
if not, adjusting the parameters of the support vector regression model so that the error between the output data and the annotated PAD values falls within the error threshold range, thereby obtaining the trained support vector regression model.
Optionally, the adjusting of the parameters of the support vector regression model specifically includes:
adjusting the penalty factor and the kernel function parameter of the support vector regression model by cross-validated grid search.
An emotional speech PAD value prediction system, the system comprising:
a test emotional speech data acquisition module, configured to acquire test emotional speech data;
a test feature data extraction module, configured to perform feature extraction on the test emotional speech data to obtain test feature data;
a support vector regression model acquisition module, configured to acquire a trained support vector regression model;
and a prediction module, configured to predict, with the trained support vector regression model, the PAD value of the test emotional speech data from the test feature data.
Optionally, the system further includes:
a training emotional speech data acquisition module, configured to acquire training emotional speech data;
an annotation module, configured to annotate the training emotional speech data with the PAD three-dimensional emotion scale to obtain annotated PAD values;
a training feature data extraction module, configured to perform feature extraction on the training emotional speech data to obtain training feature data;
and a training module, configured to train a support vector regression model with the training feature data and the annotated PAD values to obtain the trained support vector regression model.
Optionally, the training module specifically includes:
an input unit, configured to input the training feature data into the support vector regression model to obtain output data;
a judging unit, configured to judge whether the error between the output data and the annotated PAD values is within an error threshold range;
a result determining unit, configured to obtain the trained support vector regression model when the error between the output data and the annotated PAD values is within the error threshold range;
and an adjusting unit, configured to adjust the parameters of the support vector regression model when the error between the output data and the annotated PAD values is not within the error threshold range, so that the error falls within the range, thereby obtaining the trained support vector regression model.
Optionally, the adjusting unit adjusts the penalty factor and the kernel function parameter of the support vector regression model by cross-validated grid search.
Compared with the prior art, the invention has the following technical effect: the PAD values of dimensional emotional speech are predicted by a trained support vector regression model, which improves prediction precision and achieves accurate prediction of the emotional speech PAD value.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for predicting an emotional speech PAD value according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a system for predicting an emotional speech PAD value according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
FIG. 1 is a flowchart illustrating a method for predicting an emotional speech PAD value according to an embodiment of the present invention. As shown in FIG. 1, the emotional speech PAD value prediction method comprises the following steps:
step 101: and acquiring test emotion voice data.
Step 102: and performing feature extraction on the test emotion voice data to obtain test feature data.
Step 103: and obtaining the trained support vector regression model.
Step 104: predicting the test feature data through the trained support vector regression model to obtain a PAD value of the test emotion voice data, wherein P is the pleasure degree and represents the positive and negative characteristics of the individual emotion state; a is activation degree, which represents the neurophysiologic activation degree of an individual; d is the dominance degree and represents the control state of the individual on the situation and other people.
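For illustration only, the following minimal Python sketch shows how steps 101 to 104 could be realized, assuming one trained SVR per PAD dimension has been saved with joblib; the model file name and the extract_features helper (sketched later in this description) are assumptions, not part of the patent.

    # Hypothetical sketch of steps 101-104; the model file and the helper
    # names are illustrative assumptions, not the patent's implementation.
    import joblib
    import numpy as np

    def predict_pad(wav_path):
        x = extract_features(wav_path).reshape(1, -1)        # steps 101-102
        p_svr, a_svr, d_svr = joblib.load("svr_pad.joblib")  # step 103
        return np.array([m.predict(x)[0]                     # step 104
                         for m in (p_svr, a_svr, d_svr)])

    # Example: p, a, d = predict_pad("test_utterance.wav")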
Before the test emotional speech data are acquired, the method further comprises:
acquiring training emotional speech data;
annotating the training emotional speech data with the PAD three-dimensional emotion scale to obtain annotated PAD values;
performing feature extraction on the training emotional speech data to obtain training feature data;
and training a support vector regression model with the training feature data and the annotated PAD values to obtain the trained support vector regression model. Specifically, the training feature data are input into the support vector regression model to obtain output data, and it is judged whether the error between the output data and the annotated PAD values is within the error threshold range. If so, the trained support vector regression model is obtained; if not, the penalty factor and the kernel function parameter of the support vector regression model are adjusted by cross-validated grid search until the error between the output data and the annotated PAD values falls within the error threshold range, giving the trained support vector regression model.
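A minimal sketch of this accept-or-retune training loop, using scikit-learn's SVR as the support vector regression model; the error threshold of 0.05 and the candidate parameter grids are illustrative assumptions.

    # Hypothetical training loop: accept the model if the training error is
    # within the threshold, otherwise adjust C and gamma over a grid.
    from itertools import product
    from sklearn.metrics import mean_squared_error
    from sklearn.svm import SVR

    def train_svr(X_train, y_train, threshold=0.05):
        model = SVR(kernel="rbf", epsilon=0.01).fit(X_train, y_train)
        err = mean_squared_error(y_train, model.predict(X_train))
        if err <= threshold:
            return model                    # error within threshold: done
        best, best_err = model, err         # otherwise adjust the parameters
        for C, gamma in product([0.1, 1, 10, 100], [1e-3, 1e-2, 1e-1, 1]):
            m = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=0.01)
            m.fit(X_train, y_train)
            e = mean_squared_error(y_train, m.predict(X_train))
            if e < best_err:
                best, best_err = m, e
        return best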
The specific implementation steps are as follows:
According to the PAD three-dimensional emotion scale and self-assessment model formulated by the Chinese Academy of Sciences, and on the basis of the laboratory's existing discrete emotional speech database TUT 2.0, 100 college students were recruited to annotate the P, A, and D dimensions of each emotional utterance according to the assessment model. The validity of the annotation data was then verified, and a dimensional emotional speech database was established, providing reference data for the subsequent training of the SVR regression model and the evaluation of its prediction performance. The speech rate, zero-crossing rate, short-time energy, fundamental frequency, formant, and MFCC features of the emotional speech are extracted, specifically: the average speech rate; the average zero-crossing rate; the maximum, minimum, and mean of the short-time energy and of its 1st-order difference; the maximum, minimum, and mean of the fundamental frequency and of its 1st-order difference; the maximum, minimum, mean, and variance of the 1st formant (F1) and of its 1st-order difference; the maximum, minimum, mean, and variance of the 2nd formant (F2) and of its 1st-order difference; the maximum, minimum, mean, and variance of the 3rd formant (F3) and of its 1st-order difference; and the skewness, kurtosis, mean, variance, and median of MFCC orders 0 to 11, for a total of 98 feature dimensions.
The training and test sets are then determined from the data samples. The specific process is as follows: the annotated PAD data of the 237 emotional utterances are represented as an N x 3 matrix, from which the regression prediction model is constructed. The experiment uses approximately 2/3 of the utterances as the training set and 1/3 as the test set, so the SVR training set is a 158 x 3 matrix and the test set is a 79 x 3 matrix.
The regression kernel function of the support vector machine is selected and the SVR parameters to be optimized are determined. The insensitivity value epsilon is set to 10^-2, and the penalty factor C and the RBF kernel function parameter sigma are optimized by cross-validated grid search, finally selecting the parameter combination that minimizes the mean squared error of the training model. The PAD value of the emotional speech is then predicted with the SVR model under the optimal training parameters.
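The parameter search could look as follows with scikit-learn, where GridSearchCV plays the role of the cross-validated grid search, epsilon=0.01 matches the stated insensitivity value, and sklearn's gamma stands in for the RBF parameter sigma; the candidate grids and the placeholder data are assumptions.

    # Hypothetical grid search over C and the RBF parameter; placeholder data
    # stand in for the 237 annotated utterances and their PAD labels.
    import numpy as np
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.normal(size=(237, 98))         # 98-dim features (placeholder)
    Y = rng.uniform(-1, 1, size=(237, 3))  # N x 3 annotated PAD values (placeholder)

    # ~2/3 training, 1/3 test: 158 vs 79 utterances
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=79, random_state=0)

    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
    models = []
    for dim, name in enumerate("PAD"):     # one SVR per PAD dimension
        search = GridSearchCV(SVR(kernel="rbf", epsilon=0.01), param_grid,
                              scoring="neg_mean_squared_error", cv=5)
        search.fit(X_tr, Y_tr[:, dim])
        print(name, search.best_params_)
        models.append(search.best_estimator_)

    Y_pred = np.column_stack([m.predict(X_te) for m in models])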
According to the specific embodiments provided by the invention, the following technical effect is achieved: based on the established dimensional emotional speech database and the annotated PAD values of the utterances, the PAD value of emotional speech is predicted with the SVR model under the optimal training parameters.
FIG. 2 is a schematic diagram of a system for predicting an emotional speech PAD value according to an embodiment of the present invention. As shown in FIG. 2, the emotional speech PAD value prediction system includes:
a test emotional speech data acquisition module, configured to acquire the test emotional speech data;
a test feature data extraction module, configured to perform feature extraction on the test emotional speech data to obtain test feature data;
a support vector regression model acquisition module, configured to acquire the trained support vector regression model;
and a prediction module, configured to predict, with the trained support vector regression model, the PAD value of the test emotional speech data from the test feature data.
The system further comprises:
a training emotional speech data acquisition module, configured to acquire training emotional speech data;
an annotation module, configured to annotate the training emotional speech data with the PAD three-dimensional emotion scale to obtain annotated PAD values;
a training feature data extraction module, configured to perform feature extraction on the training emotional speech data to obtain training feature data;
and a training module, configured to train a support vector regression model with the training feature data and the annotated PAD values to obtain the trained support vector regression model.
The training module specifically comprises:
an input unit, configured to input the training feature data into the support vector regression model to obtain output data;
a judging unit, configured to judge whether the error between the output data and the annotated PAD values is within an error threshold range;
a result determining unit, configured to obtain the trained support vector regression model when the error between the output data and the annotated PAD values is within the error threshold range;
and an adjusting unit, configured to adjust the parameters of the support vector regression model when the error is not within the error threshold range, so that the error falls within the range, thereby obtaining the trained support vector regression model. The adjusting unit adjusts the penalty factor and the kernel function parameter of the support vector regression model by cross-validated grid search.
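As a rough illustration of how the module decomposition of FIG. 2 could map onto code; the class names and the reuse of the earlier sketched helpers are assumptions, not the patent's implementation.

    # Hypothetical mapping of the FIG. 2 modules onto small classes.
    import joblib
    import numpy as np

    class TestFeatureDataExtractionModule:
        def run(self, speech_path):
            return extract_features(speech_path)      # test feature data

    class ModelAcquisitionModule:
        def run(self, path="svr_pad.joblib"):
            return joblib.load(path)                  # trained [P, A, D] SVRs

    class PredictionModule:
        def run(self, models, features):
            x = np.asarray(features).reshape(1, -1)
            return [m.predict(x)[0] for m in models]  # PAD value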
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
The principles and embodiments of the present invention are illustrated herein with specific examples, which are provided only to help understand the method and core concept of the invention. Meanwhile, those skilled in the art may, according to the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (6)

1. A method for predicting an emotional speech PAD value, the method comprising:
acquiring training emotional speech data;
annotating the training emotional speech data with a PAD three-dimensional emotion scale to obtain annotated PAD values;
performing feature extraction on the training emotional speech data to obtain training feature data;
training a support vector regression model with the training feature data and the annotated PAD values to obtain a trained support vector regression model;
acquiring test emotional speech data;
performing feature extraction on the test emotional speech data to obtain test feature data, wherein the features comprise the speech rate, zero-crossing rate, short-time energy, fundamental frequency, formants, and MFCCs of the emotional speech;
obtaining the trained support vector regression model;
and predicting, with the trained support vector regression model, the PAD value of the test emotional speech data from the test feature data.
2. The method for predicting an emotional speech PAD value according to claim 1, wherein the training of a support vector regression model with the training feature data and the annotated PAD values to obtain a trained support vector regression model specifically comprises:
inputting the training feature data into the support vector regression model to obtain output data;
judging whether the error between the output data and the annotated PAD values is within an error threshold range;
if so, obtaining the trained support vector regression model;
if not, adjusting the parameters of the support vector regression model so that the error between the output data and the annotated PAD values falls within the error threshold range, thereby obtaining the trained support vector regression model.
3. The method for predicting an emotional speech PAD value according to claim 2, wherein the adjusting of the parameters of the support vector regression model specifically comprises:
adjusting the penalty factor and the kernel function parameter of the support vector regression model by cross-validated grid search.
4. An emotional speech PAD value prediction system, the system comprising:
a training emotional speech data acquisition module, configured to acquire training emotional speech data;
an annotation module, configured to annotate the training emotional speech data with a PAD three-dimensional emotion scale to obtain annotated PAD values;
a training feature data extraction module, configured to perform feature extraction on the training emotional speech data to obtain training feature data;
a training module, configured to train a support vector regression model with the training feature data and the annotated PAD values to obtain a trained support vector regression model;
a test emotional speech data acquisition module, configured to acquire test emotional speech data;
a test feature data extraction module, configured to perform feature extraction on the test emotional speech data to obtain test feature data;
a support vector regression model acquisition module, configured to acquire the trained support vector regression model;
and a prediction module, configured to predict, with the trained support vector regression model, the PAD value of the test emotional speech data from the test feature data.
5. The emotional speech PAD value prediction system according to claim 4, wherein the training module specifically comprises:
an input unit, configured to input the training feature data into the support vector regression model to obtain output data;
a judging unit, configured to judge whether the error between the output data and the annotated PAD values is within an error threshold range;
a result determining unit, configured to obtain the trained support vector regression model when the error between the output data and the annotated PAD values is within the error threshold range;
and an adjusting unit, configured to adjust the parameters of the support vector regression model when the error between the output data and the annotated PAD values is not within the error threshold range, so that the error falls within the range, thereby obtaining the trained support vector regression model.
6. The emotional speech PAD value prediction system according to claim 5, wherein the adjusting unit adjusts the penalty factor and the kernel function parameter of the support vector regression model by cross-validated grid search.
CN201810926352.8A 2018-08-15 2018-08-15 Method and system for predicting emotional speech PAD value Active CN108806724B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810926352.8A CN108806724B (en) 2018-08-15 2018-08-15 Method and system for predicting emotional speech PAD value

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810926352.8A CN108806724B (en) 2018-08-15 2018-08-15 Method and system for predicting emotional speech PAD value

Publications (2)

Publication Number Publication Date
CN108806724A CN108806724A (en) 2018-11-13
CN108806724B (en) 2020-08-25

Family

ID=64080122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810926352.8A Active CN108806724B (en) Method and system for predicting emotional speech PAD value

Country Status (1)

Country Link
CN (1) CN108806724B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111415680B (en) * 2020-03-26 2023-05-23 心图熵动科技(苏州)有限责任公司 Voice-based anxiety prediction model generation method and anxiety prediction system
CN112185345A (en) * 2020-09-02 2021-01-05 电子科技大学 Emotion voice synthesis method based on RNN and PAD emotion models

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8340274B2 (en) * 2008-12-22 2012-12-25 Genesys Telecommunications Laboratories, Inc. System for routing interactions using bio-performance attributes of persons as dynamic input
CN102222500A (en) * 2011-05-11 2011-10-19 北京航空航天大学 Extracting method and modeling method for Chinese speech emotion combining emotion points
CN102231276B (en) * 2011-06-21 2013-03-20 北京捷通华声语音技术有限公司 Method and device for forecasting duration of speech synthesis unit
CN103198827B (en) * 2013-03-26 2015-06-17 合肥工业大学 Voice emotion correction method based on relevance of prosodic feature parameter and emotion parameter
CN103531207B (en) * 2013-10-15 2016-07-27 中国科学院自动化研究所 A kind of speech-emotion recognition method merging long span emotion history
CN103970864B (en) * 2014-05-08 2017-09-22 清华大学 Mood classification and mood component analyzing method and system based on microblogging text
CN107437090A (en) * 2016-05-28 2017-12-05 郭帅杰 The continuous emotion Forecasting Methodology of three mode based on voice, expression and electrocardiosignal
CN107633851B (en) * 2017-07-31 2020-07-28 极限元(杭州)智能科技股份有限公司 Discrete speech emotion recognition method, device and system based on emotion dimension prediction

Also Published As

Publication number Publication date
CN108806724A (en) 2018-11-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant