CN111583966B

CN111583966B - Cross-database speech emotion recognition method and device based on joint distribution least square regression

Info

Publication number: CN111583966B
Application number: CN202010372728.2A
Authority: CN
Inventors: 宗源; 江林; 张佳成; 郑文明; 江星洵; 刘佳腾
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2020-05-06
Filing date: 2020-05-06
Publication date: 2022-06-28
Anticipated expiration: 2040-05-06
Also published as: CN111583966A

Abstract

The invention discloses a cross-database speech emotion recognition method and device based on joint distribution least square regression, wherein the method comprises the following steps: (1) acquiring a training database and a testing database, wherein the training voice database comprises a plurality of voice segments and corresponding voice emotion category labels, and the testing database only comprises a plurality of voice segments to be recognized; (2) processing and counting the voice segments by using a plurality of acoustic low-dimensional descriptors, taking each piece of information obtained by counting as an emotional feature, and forming a vector by using a plurality of emotional features as a feature vector of the corresponding voice segment; (3) establishing a least square regression model based on joint distribution, and performing joint training by using a training database and a test database to obtain a sparse projection matrix; (4) and (3) for the voice segment to be recognized, obtaining a feature vector according to the step (2), and obtaining a corresponding voice emotion category label by adopting the learned sparse projection matrix. The invention can adapt to different environments and has higher accuracy.

Description

Cross-database speech emotion recognition method and device based on joint distribution least square regression

Technical Field

The invention relates to speech emotion recognition, in particular to a cross-database speech emotion recognition method and device based on joint distribution least square regression.

Background

The purpose of speech emotion recognition is to enable a machine to have enough intelligence to extract the emotional state (such as happiness, fear, sadness and the like) of the machine from the speech of a speaker, so that the machine is an important link in human-computer interaction and has great research potential and development prospect. If the mental state of the driver is detected by combining the voice, expression and behavior information of the driver, the driver can be reminded to concentrate on the attention to avoid dangerous driving in time; the voice emotion of a speaker is detected in the human-computer interaction, so that the dialogue is smoother, the psychology of the speaker is better taken care of, and the speaker is close to cognition; the wearable equipment can make more timely and more appropriate feedback according to the emotional state of the wearer; meanwhile, in various fields such as classroom teaching and teacher accompanying, the speech emotion recognition plays an increasingly important role.

Traditional speech emotion recognition is trained and tested on the same speech database, and the training data and the testing data follow the same distribution. In real life, the trained model must face different environments, and the sounding background is also doped with various noises. Cross-database speech emotion recognition therefore faces significant challenges. How to make the trained model have good adaptability to different environments becomes a problem to be solved by the academics and the industry.

Disclosure of Invention

The invention aims to: aiming at the problems in the prior art, the invention provides a cross-database speech emotion recognition method and device based on joint distribution least square regression.

The technical scheme is as follows: the cross-database speech emotion recognition method based on joint distribution least square regression comprises the following steps:

(1) acquiring two voice databases which are respectively used as a training database and a testing database, wherein the training voice database comprises a plurality of voice fragments and corresponding voice emotion category labels, and the testing database only comprises a plurality of voice fragments to be recognized;

(2) processing and counting the voice segments by using a plurality of acoustic low-dimensional descriptors, taking each piece of information obtained by counting as an emotional feature, and forming a vector by using a plurality of emotional features as a feature vector of the corresponding voice segment;

(3) establishing a least square regression model based on joint distribution, and performing joint training on the least square regression model by using a training database of known tags and a test database of unknown tags to obtain a sparse projection matrix connecting the voice fragments and the voice emotion class tags;

(4) And (3) obtaining a feature vector for the voice segment to be recognized in the test database according to the step (2), and obtaining a corresponding voice emotion category label by adopting the learned sparse projection matrix.

Further, the step (2) specifically comprises:

(2-1) calculating 16 acoustic low-dimensional descriptor values and corresponding incremental parameters thereof for each speech segment; the 16 acoustic low-dimensional descriptors are respectively: time signal zero crossing rate, frame energy root mean square, fundamental frequency, harmonic signal-to-noise ratio and Mel-ton frequency cepstrum coefficient 1-12;

(2-2) for each voice segment, respectively carrying out 12 kinds of statistical function processing on 16 acoustic low-dimensional descriptors of the voice segment, wherein the 12 kinds of statistical functions are respectively averaging, standard deviation, kurtosis, skewness, maximum value, minimum value, relative position, relative range, two linear regression coefficients and mean square error thereof;

and (2-3) taking each piece of information obtained through statistics as an emotional feature, and taking a plurality of emotional features to form a vector as a feature vector of the corresponding voice segment.

Further, the least squares regression model established in step (3) is:

in the formula (I), the compound is shown in the specification,

means to find the matrix P, L that minimizes the parenthetical equation_s∈R^c×nA speech emotion category label vector for training the speech segments of the database, C is the class number of the speech emotion category, and n is the speech segment of the training database Number of (2), X_s∈R^d×nFor training the feature vector of the speech segment of the database, d is the dimension of the feature vector, and P is the element of R^d×cAs a sparse projection matrix, P^TIs a rank-of-turn matrix for P,

is the square of Frobenius norm, and is a balance coefficient for controlling a regular term,

X_t∈R^d×mis the feature vector of the test database speech segment, m is the number of segments of the test database speech segment,

respectively a set of speech segments of which the emotion types in the training database and the test database belong to the class c, n_c、m_cRespectively testing the number of the speech segments of which the emotion types belong to the class c in the database, | | | luminance_2,1Is 2,1 norm.

Further, the method for jointly training the known label and the unknown label by using the training database of the known label and the test database of the unknown label in the step (3) specifically includes:

(3-1) converting the least squares regression model to:

s.t.P＝Q

(3-2) estimating a pseudo label matrix formed by the speech emotion category pseudo labels corresponding to all speech segments in the test database through the converted least square regression model

(3-3) according to the pseudo labelLabel matrix

Is counted to obtain

And m_cAnd then calculated to obtain

(3-4) based on

Solving the converted least square regression model by using an augmented Lagrange multiplier method to obtain a projection matrix estimation value

(3-5) estimating values from the projection matrix

Applying the following formula to a pseudo label matrix

And (3) updating:

in the formula (I), the compound is shown in the specification,

it is indicated that the intermediate auxiliary variable,

is composed of

The element of the ith column and jth row,

the row number j representing the row for which the value of the element in the ith column is the largest is found,

is a pseudo label matrix

Column i, row k;

(3-6) Using the updated pseudo tag matrix

And (4) returning to execute the step (3-3) until the preset cycle number is reached, and obtaining the projection matrix estimated value after the cycle is ended

As the projection matrix P obtained by learning.

Further, the step (3-2) specifically comprises:

(3-2-1) obtaining an initial value of the estimated value of the projection matrix by using a formula of the transformed least square regression model without adding a regular term

(3-2-2) initial values according to projection matrix

Obtaining an initial value of the pseudo label matrix by adopting the following formula:

in the formula (I), the compound is shown in the specification,

it is shown that the intermediate auxiliary variable,

is an initial value of the pseudo tag matrix

Column i element of row k. Further, the step (3-4) specifically comprises:

(3-4-1) obtaining an augmented Lagrange equation of the least squares regression model:

in the formula, T is a Lagrange multiplier, k is more than 0 and is a regular parameter, and tr () represents the trace of the matrix;

(3-4-2) keeping P, T, k unchanged, updating Q:

And (3) extracting a part related to the variable Q in the augmented Lagrange equation to obtain:

solving the above equation to obtain:

(3-4-3) keeping Q, T, k unchanged, updating P:

and (3) extracting a part related to the variable P in the augmented Lagrange equation to obtain:

solving the above equation to obtain:

P_iis the ith column vector of P, T_iIs the ith column vector of T;

(3-4-4) keeping Q, P unchanged, updating T, k:

T＝T+k(P-C)

k＝min(ρk,k_max)

in the formula, k_maxIs the maximum value of the preset k, rho is the scaling factor, rho>1；

(3-4-5) check whether convergence occurs:

checking P-Q luminance_∞If the epsilon is less than epsilon, returning to execute the step (3-4-2) if the epsilon is not less than epsilon, and if the epsilon is less than epsilon or the iteration number is greater than a set value, taking the value of P at the moment as the solved sparse projection matrix, | | | | | magnetism_∞Represents the maximum element in the data, and ε represents the convergence threshold.

Further, the method for calculating the speech emotion category label of the test database in step (4) is as follows:

calculated using the formula:

where P is the final projection matrix, X, we have learned_tThe feature vector set representing the voice segments of the test database, namely the feature vector set of the voice segments to be recognized,

and j represents a middle auxiliary variable, and j represents a speech emotion category label of the speech segment to be recognized.

The cross-database speech emotion recognition device based on the joint distribution least square regression comprises a processor and a computer program which is stored on a memory and can run on the processor, wherein the processor executes the program to realize the method.

Has the beneficial effects that: compared with the prior art, the invention has the remarkable advantages that: the cross-database speech emotion recognition method and device of the invention are in cross-database learning, so that the method and device have good adaptability to different environments and the recognition result is more accurate.

Drawings

FIG. 1 is a schematic flow diagram of a cross-database speech emotion recognition method based on joint distribution least square regression according to the present invention.

Detailed Description

The embodiment provides a cross-database speech emotion recognition method based on joint distribution least square regression, as shown in fig. 1, including the following steps:

(1) the method comprises the steps of obtaining two voice databases which are respectively used as a training database and a testing database, wherein the training voice database comprises a plurality of voice fragments and corresponding voice emotion category labels, and the testing database only comprises a plurality of voice fragments to be recognized.

In this embodiment, we use three types of speech emotion databases, which are common in emotion speech recognition: berlin, eNTERFACE, and casia. Because the three types of databases contain different emotion categories, the data are selected when compared pairwise. When Berlin and eNFIGCE are compared, 375 pieces of data and 1077 pieces of data are respectively selected, and the emotion categories are 5 types (anger, fear, happiness, aversion and sadness); when Berlin and CAISA are compared, 408 data and 1000 data are respectively selected, and the emotion category is 5 types (anger, fear, happiness, disgust and sadness); when eNFORCE and CAISA are compared, we have chosen 1072 pieces of data and 1000 pieces of data, and the emotion category is 5 types (anger, fear, happiness, dislike, sadness).

(2) And processing and counting the voice segments by using a plurality of acoustic low-dimensional descriptors, taking each piece of information obtained by counting as an emotional feature, and forming a vector by using a plurality of emotional features as a feature vector of the corresponding voice segment.

The method comprises the following steps:

(2-1) calculating 16 acoustic low-dimensional descriptor values and corresponding incremental parameters thereof for each speech segment; the 16 acoustic low-dimensional descriptors are respectively: time signal zero crossing rate, frame energy root mean square, fundamental frequency, harmonic signal-to-noise ratio and Mel-ton frequency cepstrum coefficient 1-12; the descriptor is derived from INTERSPEECH 2009 the function set provided by Emotion Challenge;

(2-2) for each voice segment, respectively carrying out 12 kinds of statistical function processing on 16 acoustic low-dimensional descriptors of the voice segment by using openSIMLE software, wherein the 12 kinds of statistical functions are respectively averaging, standard deviation, kurtosis, skewness, maximum value, minimum value, relative position, relative range, two linear regression coefficients and mean square error thereof;

and (2-3) taking each piece of information obtained by statistics as one emotion feature, and taking a 16 × 2 × 12-384 emotion feature composition vector as a feature vector corresponding to a speech segment.

(3) And establishing a least square regression model based on joint distribution, and performing joint training on the least square regression model by using a training database of known labels and a test database of unknown labels to obtain a sparse projection matrix connecting the voice segments and the voice emotion category labels.

Wherein the established least squares regression model is as follows:

in the formula (I), the compound is shown in the specification,

means to find the matrix P, L that minimizes the parenthetical equation_s∈R^c×nA speech emotion category label vector for training the speech segments of the database, C is the class number of the speech emotion categories, n is the number of the speech segments of the training database, X_s∈R^d×nFor training the feature vector of the database speech segment, d is the dimension of the feature vector, and P belongs to R^d×cAs a sparse projection matrix, P^TIs a rank-of-turn matrix for P,

respectively a set of speech segments of which the emotion types in the training database and the test database belong to the class c, n_c、m_cRespectively testing the number of the speech segments of which the emotion types in the database belong to the class c, | | calting_2,1Is 2,1 norm.

The method for jointly training the known label and the unknown label by using the training database of the known label and the test database of the unknown label specifically comprises the following steps:

(3-1) converting the least squares regression model to:

s.t.P＝Q

(3-3) according to the pseudo tag matrix

Is counted to obtain

And m_cAnd then calculated to obtain

(3-4) based on

(3-5) estimating a value based on the projection matrix

Using the following formula to pseudo label matrix

Updating:

in the formula (I), the compound is shown in the specification,

it is shown that the intermediate auxiliary variable,

is composed of

The element of the ith column and the jth row,

is a pseudo label matrix

Column i, row k;

(3-6) Using the updated pseudo tag matrix

As the projection matrix P obtained by learning.

Further, the step (3-2) specifically comprises:

(3-2-2) initial values according to projection matrix

The initial value of the pseudo label matrix is obtained by adopting the following formula:

in the formula (I), the compound is shown in the specification,

it is indicated that the intermediate auxiliary variable,

is an initial value of a pseudo tag matrix

Column i element of row k. Pseudo label matrix

Each column of (1) has only one row of its corresponding category as 1, and the remaining rows are all 0.

The step (3-4) specifically comprises the following steps:

(3-4-2) keeping P, T, k unchanged, updating Q:

solving the above equation to obtain:

(3-4-3) keeping Q, T, k unchanged, updating P:

solving the above equation to obtain:

P_iis the ith column vector of P, T_iIs the ith column vector of T;

(3-4-4) keeping Q, P unchanged, updating T, k:

T＝T+k(P-C)

k＝min(ρk,k_max)

(3-4-5) check whether convergence occurs:

checking P-Q luminance_∞If the epsilon is less than epsilon, returning to execute the step (3-4-2) if the epsilon is not less than epsilon, and if the epsilon is less than epsilon or the iteration number is greater than a set value, taking the value of P at the moment as the solved sparse projection matrix, | | | | | magnetism _∞Representing the largest element in the data to be evaluated,

ε represents the convergence threshold.

The specific method is to calculate the category label by adopting the following formula:

representing intermediate auxiliary variables, j^*And the speech emotion category label represents the speech segment to be recognized.

The embodiment also provides a cross-database speech emotion recognition device based on joint distribution least square regression, which comprises a processor and a computer program stored on a memory and capable of running on the processor, wherein the processor implements the method when executing the computer program.

In order to verify the effectiveness of the invention, experiments are respectively carried out on the speech emotion databases Berlin, the eNBACE and the CAISA database pairwise. In each set of experiments, we treated the two databases as a source domain and a target domain, respectively, wherein the source domain provides training data and labels as a training set, and the target domain provides test data only and does not provide any labels as a test set. For more effective detection and identification accuracy, two detection methods, namely non-weighted average recall ratio (UAR) and weighted average recall ratio (WAR), are adopted. Wherein, UAR represents the number of correct predictions of each class divided by the number of tests participated in by each class, and then the number of correct rate substitution of all classes is averaged; and the WAR indicates the number of all correct predictions divided by the number of all participating tests, without considering the effect of each class number. The UAR and the WAR are comprehensively considered, so that the influence of unbalanced class number can be effectively avoided. As a comparative experiment, several classical and efficient algorithms in subspace learning are selected, which are respectively: SVM, TCA, TKL, DaLSR, DoSL. The experimental results are shown in table 1 below, wherein the method is represented in the table by JDLSR, the data set is source domain/target domain, E, B, C are abbreviations for encerface, Berlin, CASIA data sets, respectively, and the evaluation criteria is UAR/WAR.

Experimental results show that the micro-expression recognition method based on the invention obtains higher cross-database micro-expression recognition rate.

TABLE 1

Claims

1. A cross-database speech emotion recognition method based on joint distribution least square regression is characterized by comprising the following steps:

(1) acquiring two voice databases which are respectively used as a training database and a testing database, wherein the training database comprises a plurality of voice fragments and corresponding voice emotion category labels, and the testing database only comprises a plurality of voice fragments to be recognized;

(3) establishing a least square regression model based on joint distribution, and performing joint training on the least square regression model by using a training database of known tags and a test database of unknown tags to obtain a sparse projection matrix connecting the voice fragments and the voice emotion class tags; wherein the established least squares regression model is as follows:

in the formula (I), the compound is shown in the specification,

means find sparse projection matrix P, L that minimizes the parenthetical inner equation _s∈R^c×nFor training the speech emotion class label vector of the database speech segment, C isClass number of speech emotion category, n is number of speech segments of training database, X_sFor training feature vectors, X, of speech segments of a database_s∈R^d×nD is the dimension of the eigenvector, P is the sparse projection matrix, P belongs to R^d×c，P^TIs a rank-of-turn matrix for P,

respectively a set of speech segments of which the emotion types in the training database and the testing database belong to class c, wherein c represents the serial number of the emotion type, n_c、m_cRespectively testing the number of the speech segments of which the emotion types belong to the class c in the database, | | | luminance_2,1Is a norm of 2, 1;

the method for performing joint training on the known label by using the training database of the known label and the test database of the unknown label specifically comprises the following steps:

(3-1) converting the least squares regression model to:

s.t.P＝Q

(3-3) according to the pseudo tag matrix

Get statistics of

And m_cAnd then calculated to obtain

(3-4) based on

(3-5) estimating a value based on the projection matrix

Using the following formula to pseudo label matrix

Updating:

in the formula (I), the compound is shown in the specification,

it is shown that the intermediate auxiliary variable,

is composed of

The element of the ith column and the jth row,

is a pseudo label matrix

Column i, row k;

(3-6) Using the updated pseudo tag matrix

And (4) returning to the step (3-3) until the preset cycle number is reached, and finishing the cycle to obtain the projection matrix estimated value

As a learned sparse projection matrix P;

(4) and (3) for the voice segments to be recognized in the test database, obtaining the feature vectors according to the step (2), and obtaining the corresponding voice emotion category labels by adopting the sparse projection matrix learned in the step (3).

2. The cross-database speech emotion recognition method based on joint distribution least squares regression, as claimed in claim 1, wherein: the step (2) specifically comprises the following steps:

(2-1) calculating values of 16 acoustic low-dimensional descriptors and corresponding incremental parameters thereof for each speech segment; the 16 acoustic low-dimensional descriptors are respectively: time signal zero crossing rate, frame energy root mean square, fundamental frequency, harmonic signal to noise ratio and Melton frequency cepstrum coefficient 1-12;

3. The cross-database speech emotion recognition method based on joint distribution least square regression as claimed in claim 1, wherein: the step (3-2) specifically comprises the following steps:

(3-2-2) initial value based on projection matrix estimation value

in the formula (I), the compound is shown in the specification,

it is shown that the intermediate auxiliary variable,

is an initial value of the pseudo tag matrix

Column i element of row k.

4. The cross-database speech emotion recognition method based on joint distribution least squares regression, as claimed in claim 1, wherein: the step (3-4) specifically comprises the following steps:

in the formula, T is Lagrange multiplier, k >0 is a regular parameter, and tr () represents a trace of a matrix;

(3-4-2) keeping P, T, k unchanged, updating Q:

and extracting the part related to the variable Q in the augmented Lagrange equation to obtain:

solving the above equation to obtain:

(3-4-3) keeping Q, T, k unchanged, updating P:

and extracting the part related to the variable P in the augmented Lagrange equation to obtain:

solving the above equation to obtain:

P_iis the ith column vector of P, T_iIs the ith column vector of T;

(3-4-4) keeping Q, P unchanged, updating T, k:

T＝T+k(P-C)

k＝min(ρk,k_max)

(3-4-5) check whether convergence occurs:

5. The cross-database speech emotion recognition method based on joint distribution least squares regression, as claimed in claim 1, wherein: the method for calculating the voice emotion category label in the test database in the step (4) comprises the following steps:

Calculated using the formula:

wherein P is the sparse projection matrix learned in step (3), X_tThe feature vector set representing the voice segments of the test database, namely the feature vector set of the voice segments to be recognized,

6. A cross-database speech emotion recognition apparatus based on joint distribution least squares regression, comprising a processor and a computer program stored on a memory and operable on the processor, wherein: the processor, when executing the computer program, implements the method of any of claims 1-5.