CN108806724B - Method and system for predicting emotional speech PAD value - Google Patents

Method and system for predicting emotional speech PAD value

Info

Publication number
CN108806724B
Authority
CN
China
Prior art keywords
support vector
regression model
training
vector regression
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810926352.8A
Other languages
Chinese (zh)
Other versions
CN108806724A (en)
Inventor
张雪英
孙颖
张卫
张婷
黄丽霞
陈桂军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology
Priority to CN201810926352.8A
Publication of CN108806724A
Application granted
Publication of CN108806724B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses a method and system for predicting the PAD value of emotional speech. The method comprises the following steps: acquiring test emotional speech data; performing feature extraction on the test emotional speech data to obtain test feature data; obtaining a trained support vector regression model; and predicting the PAD value of the test emotional speech data from the test feature data with the trained support vector regression model. The method and system can predict the PAD value of emotional speech quickly and accurately.

Description

Method and system for predicting emotional speech PAD value
Technical Field
The invention relates to the field of emotional speech PAD value prediction, and in particular to a method and a system for predicting the PAD value of emotional speech.
Background
Speech is the most effective medium of human communication and is increasingly used in human-computer interaction. Speech carries not only textual content but also rich information reflecting the speaker's emotional state. Speech emotion recognition uses a computer to make a cognitive judgment about a speaker's emotion category, and most current research focuses on basic discrete emotions, such as recognizing whether an utterance is angry or happy. In real life, however, emotions are usually continuous, complex, and changeable; mixed states such as tears of joy or mingled sorrow and happiness no longer belong to any single discrete emotion category. To address this, researchers proposed dimensional theory, which represents complex, varying emotion categories in a dimensional space: an emotion is expressed as a coordinate point in a multi-dimensional emotion space. Dimensional emotional speech provides a more adequate foundation for realizing human-computer interaction and for research in affective computing, and in recent years it has gradually attracted wide attention. At present, dimensional coordinates are mainly obtained by manual annotation against an emotion scale, which is time-consuming and prone to subjective bias.
Disclosure of Invention
The invention aims to provide a method and a system for predicting the PAD value of emotional speech quickly and accurately.
To achieve this purpose, the invention provides the following scheme:
A method for predicting the emotional speech PAD value, the method comprising:
acquiring test emotional speech data;
performing feature extraction on the test emotional speech data to obtain test feature data;
obtaining a trained support vector regression model;
and predicting, with the trained support vector regression model, the PAD value of the test emotional speech data from the test feature data.
Optionally, before the acquiring of the test emotional speech data, the method further includes:
acquiring training emotional speech data;
annotating the training emotional speech data with the PAD three-dimensional emotion scale to obtain annotated PAD values;
performing feature extraction on the training emotional speech data to obtain training feature data;
and training a support vector regression model with the training feature data and the annotated PAD values to obtain the trained support vector regression model.
Optionally, the training of the support vector regression model with the training feature data and the annotated PAD values to obtain the trained support vector regression model specifically includes:
inputting the training feature data into the support vector regression model to obtain output data;
judging whether the error between the output data and the annotated PAD values is within an error threshold range;
if so, obtaining the trained support vector regression model;
if not, adjusting the parameters of the support vector regression model so that the error between the output data and the annotated PAD values falls within the error threshold range, thereby obtaining the trained support vector regression model.
Optionally, the adjusting of the parameters of the support vector regression model specifically includes:
adjusting the penalty factor and the kernel function parameter of the support vector regression model by cross-validated grid search.
An emotional speech PAD value prediction system, the system comprising:
a test emotional speech data acquisition module, configured to acquire test emotional speech data;
a test feature data extraction module, configured to perform feature extraction on the test emotional speech data to obtain test feature data;
a support vector regression model acquisition module, configured to acquire a trained support vector regression model;
and a prediction module, configured to predict, with the trained support vector regression model, the PAD value of the test emotional speech data from the test feature data.
Optionally, the system further includes:
a training emotional speech data acquisition module, configured to acquire training emotional speech data;
an annotation module, configured to annotate the training emotional speech data with the PAD three-dimensional emotion scale to obtain annotated PAD values;
a training feature data extraction module, configured to perform feature extraction on the training emotional speech data to obtain training feature data;
and a training module, configured to train a support vector regression model with the training feature data and the annotated PAD values to obtain the trained support vector regression model.
Optionally, the training module specifically includes:
an input unit, configured to input the training feature data into the support vector regression model to obtain output data;
a judging unit, configured to judge whether the error between the output data and the annotated PAD values is within an error threshold range;
a result determining unit, configured to obtain the trained support vector regression model when the error between the output data and the annotated PAD values is within the error threshold range;
and an adjusting unit, configured to adjust the parameters of the support vector regression model when the error between the output data and the annotated PAD values is not within the error threshold range, so that the error falls within the range, thereby obtaining the trained support vector regression model.
Optionally, the adjusting unit adjusts the penalty factor and the kernel function parameter of the support vector regression model by cross-validated grid search.
Compared with the prior art, the invention has the following technical effect: the PAD values of dimensional emotional speech are predicted by a trained support vector regression model, which improves prediction precision and achieves accurate prediction of the emotional speech PAD value.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for predicting an emotional speech PAD value according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a system for predicting an emotional speech PAD value according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
FIG. 1 is a flowchart illustrating a method for predicting an emotional speech PAD value according to an embodiment of the present invention. As shown in FIG. 1, the emotional speech PAD value prediction method comprises the following steps:
step 101: and acquiring test emotion voice data.
Step 102: and performing feature extraction on the test emotion voice data to obtain test feature data.
Step 103: and obtaining the trained support vector regression model.
Step 104: predicting the test feature data through the trained support vector regression model to obtain a PAD value of the test emotion voice data, wherein P is the pleasure degree and represents the positive and negative characteristics of the individual emotion state; a is activation degree, which represents the neurophysiologic activation degree of an individual; d is the dominance degree and represents the control state of the individual on the situation and other people.
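For illustration only, the following minimal Python sketch shows how steps 101 to 104 could be realized, assuming one trained SVR per PAD dimension has been saved with joblib; the model file name and the extract_features helper (sketched later in this description) are assumptions, not part of the patent.

    # Hypothetical sketch of steps 101-104; the model file and the helper
    # names are illustrative assumptions, not the patent's implementation.
    import joblib
    import numpy as np

    def predict_pad(wav_path):
        x = extract_features(wav_path).reshape(1, -1)        # steps 101-102
        p_svr, a_svr, d_svr = joblib.load("svr_pad.joblib")  # step 103
        return np.array([m.predict(x)[0]                     # step 104
                         for m in (p_svr, a_svr, d_svr)])

    # Example: p, a, d = predict_pad("test_utterance.wav")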
Before the test emotional speech data are acquired, the method further comprises:
acquiring training emotional speech data;
annotating the training emotional speech data with the PAD three-dimensional emotion scale to obtain annotated PAD values;
performing feature extraction on the training emotional speech data to obtain training feature data;
and training a support vector regression model with the training feature data and the annotated PAD values to obtain the trained support vector regression model. Specifically, the training feature data are input into the support vector regression model to obtain output data, and it is judged whether the error between the output data and the annotated PAD values is within the error threshold range. If so, the trained support vector regression model is obtained; if not, the penalty factor and the kernel function parameter of the support vector regression model are adjusted by cross-validated grid search until the error between the output data and the annotated PAD values falls within the error threshold range, giving the trained support vector regression model.
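A minimal sketch of this accept-or-retune training loop, using scikit-learn's SVR as the support vector regression model; the error threshold of 0.05 and the candidate parameter grids are illustrative assumptions.

    # Hypothetical training loop: accept the model if the training error is
    # within the threshold, otherwise adjust C and gamma over a grid.
    from itertools import product
    from sklearn.metrics import mean_squared_error
    from sklearn.svm import SVR

    def train_svr(X_train, y_train, threshold=0.05):
        model = SVR(kernel="rbf", epsilon=0.01).fit(X_train, y_train)
        err = mean_squared_error(y_train, model.predict(X_train))
        if err <= threshold:
            return model                    # error within threshold: done
        best, best_err = model, err         # otherwise adjust the parameters
        for C, gamma in product([0.1, 1, 10, 100], [1e-3, 1e-2, 1e-1, 1]):
            m = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=0.01)
            m.fit(X_train, y_train)
            e = mean_squared_error(y_train, m.predict(X_train))
            if e < best_err:
                best, best_err = m, e
        return best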
The specific implementation steps are as follows:
According to the PAD three-dimensional emotion scale and self-assessment model formulated by the Chinese Academy of Sciences, and on the basis of the laboratory's existing discrete emotional speech database TUT 2.0, 100 college students were recruited to annotate the P, A, and D dimensions of each emotional utterance according to the assessment model. The validity of the annotation data was then verified, and a dimensional emotional speech database was established, providing reference data for the subsequent training of the SVR regression model and the evaluation of its prediction performance. The speech rate, zero-crossing rate, short-time energy, fundamental frequency, formant, and MFCC features of the emotional speech are extracted, specifically: the average speech rate; the average zero-crossing rate; the maximum, minimum, and mean of the short-time energy and of its 1st-order difference; the maximum, minimum, and mean of the fundamental frequency and of its 1st-order difference; the maximum, minimum, mean, and variance of the 1st formant (F1) and of its 1st-order difference; the maximum, minimum, mean, and variance of the 2nd formant (F2) and of its 1st-order difference; the maximum, minimum, mean, and variance of the 3rd formant (F3) and of its 1st-order difference; and the skewness, kurtosis, mean, variance, and median of MFCC orders 0 to 11, for a total of 98 feature dimensions.
The training and test sets are then determined from the data samples. The specific process is as follows: the annotated PAD data of the 237 emotional utterances are represented as an N x 3 matrix, from which the regression prediction model is constructed. The experiment uses approximately 2/3 of the utterances as the training set and 1/3 as the test set, so the SVR training set is a 158 x 3 matrix and the test set is a 79 x 3 matrix.
The regression kernel function of the support vector machine is selected and the SVR parameters to be optimized are determined. The insensitivity value epsilon is set to 10^-2, and the penalty factor C and the RBF kernel function parameter sigma are optimized by cross-validated grid search, finally selecting the parameter combination that minimizes the mean squared error of the training model. The PAD value of the emotional speech is then predicted with the SVR model under the optimal training parameters.
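The parameter search could look as follows with scikit-learn, where GridSearchCV plays the role of the cross-validated grid search, epsilon=0.01 matches the stated insensitivity value, and sklearn's gamma stands in for the RBF parameter sigma; the candidate grids and the placeholder data are assumptions.

    # Hypothetical grid search over C and the RBF parameter; placeholder data
    # stand in for the 237 annotated utterances and their PAD labels.
    import numpy as np
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.normal(size=(237, 98))         # 98-dim features (placeholder)
    Y = rng.uniform(-1, 1, size=(237, 3))  # N x 3 annotated PAD values (placeholder)

    # ~2/3 training, 1/3 test: 158 vs 79 utterances
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=79, random_state=0)

    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
    models = []
    for dim, name in enumerate("PAD"):     # one SVR per PAD dimension
        search = GridSearchCV(SVR(kernel="rbf", epsilon=0.01), param_grid,
                              scoring="neg_mean_squared_error", cv=5)
        search.fit(X_tr, Y_tr[:, dim])
        print(name, search.best_params_)
        models.append(search.best_estimator_)

    Y_pred = np.column_stack([m.predict(X_te) for m in models])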
According to the specific embodiments provided by the invention, the following technical effect is achieved: based on the established dimensional emotional speech database and the annotated PAD values of the utterances, the PAD value of emotional speech is predicted with the SVR model under the optimal training parameters.
FIG. 2 is a schematic diagram of a system for predicting an emotional speech PAD value according to an embodiment of the present invention. As shown in FIG. 2, the emotional speech PAD value prediction system includes:
a test emotional speech data acquisition module, configured to acquire the test emotional speech data;
a test feature data extraction module, configured to perform feature extraction on the test emotional speech data to obtain test feature data;
a support vector regression model acquisition module, configured to acquire the trained support vector regression model;
and a prediction module, configured to predict, with the trained support vector regression model, the PAD value of the test emotional speech data from the test feature data.
The system further comprises:
a training emotional speech data acquisition module, configured to acquire training emotional speech data;
an annotation module, configured to annotate the training emotional speech data with the PAD three-dimensional emotion scale to obtain annotated PAD values;
a training feature data extraction module, configured to perform feature extraction on the training emotional speech data to obtain training feature data;
and a training module, configured to train a support vector regression model with the training feature data and the annotated PAD values to obtain the trained support vector regression model.
The training module specifically comprises:
an input unit, configured to input the training feature data into the support vector regression model to obtain output data;
a judging unit, configured to judge whether the error between the output data and the annotated PAD values is within an error threshold range;
a result determining unit, configured to obtain the trained support vector regression model when the error between the output data and the annotated PAD values is within the error threshold range;
and an adjusting unit, configured to adjust the parameters of the support vector regression model when the error is not within the error threshold range, so that the error falls within the range, thereby obtaining the trained support vector regression model. The adjusting unit adjusts the penalty factor and the kernel function parameter of the support vector regression model by cross-validated grid search.
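As a rough illustration of how the module decomposition of FIG. 2 could map onto code; the class names and the reuse of the earlier sketched helpers are assumptions, not the patent's implementation.

    # Hypothetical mapping of the FIG. 2 modules onto small classes.
    import joblib
    import numpy as np

    class TestFeatureDataExtractionModule:
        def run(self, speech_path):
            return extract_features(speech_path)      # test feature data

    class ModelAcquisitionModule:
        def run(self, path="svr_pad.joblib"):
            return joblib.load(path)                  # trained [P, A, D] SVRs

    class PredictionModule:
        def run(self, models, features):
            x = np.asarray(features).reshape(1, -1)
            return [m.predict(x)[0] for m in models]  # PAD value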
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
The principles and embodiments of the present invention are illustrated herein with specific examples, which are provided only to help understand the method and core concept of the invention. Meanwhile, those skilled in the art may, according to the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (6)

1. A method for predicting an emotional speech PAD value, the method comprising:
acquiring training emotional speech data;
annotating the training emotional speech data with a PAD three-dimensional emotion scale to obtain annotated PAD values;
performing feature extraction on the training emotional speech data to obtain training feature data;
training a support vector regression model with the training feature data and the annotated PAD values to obtain a trained support vector regression model;
acquiring test emotional speech data;
performing feature extraction on the test emotional speech data to obtain test feature data, wherein the features comprise the speech rate, zero-crossing rate, short-time energy, fundamental frequency, formants, and MFCCs of the emotional speech;
obtaining the trained support vector regression model;
and predicting, with the trained support vector regression model, the PAD value of the test emotional speech data from the test feature data.
2. The method for predicting an emotional speech PAD value according to claim 1, wherein the training of a support vector regression model with the training feature data and the annotated PAD values to obtain a trained support vector regression model specifically comprises:
inputting the training feature data into the support vector regression model to obtain output data;
judging whether the error between the output data and the annotated PAD values is within an error threshold range;
if so, obtaining the trained support vector regression model;
if not, adjusting the parameters of the support vector regression model so that the error between the output data and the annotated PAD values falls within the error threshold range, thereby obtaining the trained support vector regression model.
3. The method for predicting an emotional speech PAD value according to claim 2, wherein the adjusting of the parameters of the support vector regression model specifically comprises:
adjusting the penalty factor and the kernel function parameter of the support vector regression model by cross-validated grid search.
4. An emotional speech PAD value prediction system, the system comprising:
a training emotional speech data acquisition module, configured to acquire training emotional speech data;
an annotation module, configured to annotate the training emotional speech data with a PAD three-dimensional emotion scale to obtain annotated PAD values;
a training feature data extraction module, configured to perform feature extraction on the training emotional speech data to obtain training feature data;
a training module, configured to train a support vector regression model with the training feature data and the annotated PAD values to obtain a trained support vector regression model;
a test emotional speech data acquisition module, configured to acquire test emotional speech data;
a test feature data extraction module, configured to perform feature extraction on the test emotional speech data to obtain test feature data;
a support vector regression model acquisition module, configured to acquire the trained support vector regression model;
and a prediction module, configured to predict, with the trained support vector regression model, the PAD value of the test emotional speech data from the test feature data.
5. The emotional speech PAD value prediction system according to claim 4, wherein the training module specifically comprises:
an input unit, configured to input the training feature data into the support vector regression model to obtain output data;
a judging unit, configured to judge whether the error between the output data and the annotated PAD values is within an error threshold range;
a result determining unit, configured to obtain the trained support vector regression model when the error between the output data and the annotated PAD values is within the error threshold range;
and an adjusting unit, configured to adjust the parameters of the support vector regression model when the error between the output data and the annotated PAD values is not within the error threshold range, so that the error falls within the range, thereby obtaining the trained support vector regression model.
6. The emotional speech PAD value prediction system according to claim 5, wherein the adjusting unit adjusts the penalty factor and the kernel function parameter of the support vector regression model by cross-validated grid search.
CN201810926352.8A 2018-08-15 2018-08-15 Method and system for predicting emotional speech PAD value Active CN108806724B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810926352.8A CN108806724B (en) 2018-08-15 2018-08-15 Method and system for predicting emotional speech PAD value

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810926352.8A CN108806724B (en) 2018-08-15 2018-08-15 Method and system for predicting emotional speech PAD value

Publications (2)

Publication Number Publication Date
CN108806724A CN108806724A (en) 2018-11-13
CN108806724B (en) 2020-08-25

Family

ID=64080122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810926352.8A Active CN108806724B (en) Method and system for predicting emotional speech PAD value

Country Status (1)

Country Link
CN (1) CN108806724B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111415680B (en) * 2020-03-26 2023-05-23 心图熵动科技(苏州)有限责任公司 Voice-based anxiety prediction model generation method and anxiety prediction system
CN112185345A (en) * 2020-09-02 2021-01-05 电子科技大学 Emotion voice synthesis method based on RNN and PAD emotion models

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8340274B2 (en) * 2008-12-22 2012-12-25 Genesys Telecommunications Laboratories, Inc. System for routing interactions using bio-performance attributes of persons as dynamic input
CN102222500A (en) * 2011-05-11 2011-10-19 北京航空航天大学 Extracting method and modeling method for Chinese speech emotion combining emotion points
CN102231276B (en) * 2011-06-21 2013-03-20 北京捷通华声语音技术有限公司 Method and device for forecasting duration of speech synthesis unit
CN103198827B (en) * 2013-03-26 2015-06-17 合肥工业大学 Voice emotion correction method based on relevance of prosodic feature parameter and emotion parameter
CN103531207B (en) * 2013-10-15 2016-07-27 中国科学院自动化研究所 A kind of speech-emotion recognition method merging long span emotion history
CN103970864B (en) * 2014-05-08 2017-09-22 清华大学 Mood classification and mood component analyzing method and system based on microblogging text
CN107437090A (en) * 2016-05-28 2017-12-05 郭帅杰 The continuous emotion Forecasting Methodology of three mode based on voice, expression and electrocardiosignal
CN107633851B (en) * 2017-07-31 2020-07-28 极限元(杭州)智能科技股份有限公司 Discrete speech emotion recognition method, device and system based on emotion dimension prediction

Also Published As

Publication number Publication date
CN108806724A (en) 2018-11-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant