CN111583966B - Cross-database speech emotion recognition method and device based on joint distribution least square regression - Google Patents

Cross-database speech emotion recognition method and device based on joint distribution least square regression Download PDF

Info

Publication number
CN111583966B
CN111583966B CN202010372728.2A CN202010372728A CN111583966B CN 111583966 B CN111583966 B CN 111583966B CN 202010372728 A CN202010372728 A CN 202010372728A CN 111583966 B CN111583966 B CN 111583966B
Authority
CN
China
Prior art keywords
database
voice
speech
training
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010372728.2A
Other languages
Chinese (zh)
Other versions
CN111583966A (en
Inventor
宗源
江林
张佳成
郑文明
江星洵
刘佳腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202010372728.2A priority Critical patent/CN111583966B/en
Publication of CN111583966A publication Critical patent/CN111583966A/en
Application granted granted Critical
Publication of CN111583966B publication Critical patent/CN111583966B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-database speech emotion recognition method and device based on joint distribution least square regression, wherein the method comprises the following steps: (1) acquiring a training database and a testing database, wherein the training voice database comprises a plurality of voice segments and corresponding voice emotion category labels, and the testing database only comprises a plurality of voice segments to be recognized; (2) processing and counting the voice segments by using a plurality of acoustic low-dimensional descriptors, taking each piece of information obtained by counting as an emotional feature, and forming a vector by using a plurality of emotional features as a feature vector of the corresponding voice segment; (3) establishing a least square regression model based on joint distribution, and performing joint training by using a training database and a test database to obtain a sparse projection matrix; (4) and (3) for the voice segment to be recognized, obtaining a feature vector according to the step (2), and obtaining a corresponding voice emotion category label by adopting the learned sparse projection matrix. The invention can adapt to different environments and has higher accuracy.

Description

Cross-database speech emotion recognition method and device based on joint distribution least square regression
Technical Field
The invention relates to speech emotion recognition, in particular to a cross-database speech emotion recognition method and device based on joint distribution least square regression.
Background
The purpose of speech emotion recognition is to enable a machine to have enough intelligence to extract the emotional state (such as happiness, fear, sadness and the like) of the machine from the speech of a speaker, so that the machine is an important link in human-computer interaction and has great research potential and development prospect. If the mental state of the driver is detected by combining the voice, expression and behavior information of the driver, the driver can be reminded to concentrate on the attention to avoid dangerous driving in time; the voice emotion of a speaker is detected in the human-computer interaction, so that the dialogue is smoother, the psychology of the speaker is better taken care of, and the speaker is close to cognition; the wearable equipment can make more timely and more appropriate feedback according to the emotional state of the wearer; meanwhile, in various fields such as classroom teaching and teacher accompanying, the speech emotion recognition plays an increasingly important role.
Traditional speech emotion recognition is trained and tested on the same speech database, and the training data and the testing data follow the same distribution. In real life, the trained model must face different environments, and the sounding background is also doped with various noises. Cross-database speech emotion recognition therefore faces significant challenges. How to make the trained model have good adaptability to different environments becomes a problem to be solved by the academics and the industry.
Disclosure of Invention
The invention aims to: aiming at the problems in the prior art, the invention provides a cross-database speech emotion recognition method and device based on joint distribution least square regression.
The technical scheme is as follows: the cross-database speech emotion recognition method based on joint distribution least square regression comprises the following steps:
(1) acquiring two voice databases which are respectively used as a training database and a testing database, wherein the training voice database comprises a plurality of voice fragments and corresponding voice emotion category labels, and the testing database only comprises a plurality of voice fragments to be recognized;
(2) processing and counting the voice segments by using a plurality of acoustic low-dimensional descriptors, taking each piece of information obtained by counting as an emotional feature, and forming a vector by using a plurality of emotional features as a feature vector of the corresponding voice segment;
(3) establishing a least square regression model based on joint distribution, and performing joint training on the least square regression model by using a training database of known tags and a test database of unknown tags to obtain a sparse projection matrix connecting the voice fragments and the voice emotion class tags;
(4) And (3) obtaining a feature vector for the voice segment to be recognized in the test database according to the step (2), and obtaining a corresponding voice emotion category label by adopting the learned sparse projection matrix.
Further, the step (2) specifically comprises:
(2-1) calculating 16 acoustic low-dimensional descriptor values and corresponding incremental parameters thereof for each speech segment; the 16 acoustic low-dimensional descriptors are respectively: time signal zero crossing rate, frame energy root mean square, fundamental frequency, harmonic signal-to-noise ratio and Mel-ton frequency cepstrum coefficient 1-12;
(2-2) for each voice segment, respectively carrying out 12 kinds of statistical function processing on 16 acoustic low-dimensional descriptors of the voice segment, wherein the 12 kinds of statistical functions are respectively averaging, standard deviation, kurtosis, skewness, maximum value, minimum value, relative position, relative range, two linear regression coefficients and mean square error thereof;
and (2-3) taking each piece of information obtained through statistics as an emotional feature, and taking a plurality of emotional features to form a vector as a feature vector of the corresponding voice segment.
Further, the least squares regression model established in step (3) is:
Figure BDA0002478916840000021
in the formula (I), the compound is shown in the specification,
Figure BDA0002478916840000022
means to find the matrix P, L that minimizes the parenthetical equations∈Rc×nA speech emotion category label vector for training the speech segments of the database, C is the class number of the speech emotion category, and n is the speech segment of the training database Number of (2), Xs∈Rd×nFor training the feature vector of the speech segment of the database, d is the dimension of the feature vector, and P is the element of Rd×cAs a sparse projection matrix, PTIs a rank-of-turn matrix for P,
Figure BDA0002478916840000023
is the square of Frobenius norm, and is a balance coefficient for controlling a regular term,
Figure BDA0002478916840000024
Xt∈Rd×mis the feature vector of the test database speech segment, m is the number of segments of the test database speech segment,
Figure BDA0002478916840000025
Figure BDA0002478916840000026
respectively a set of speech segments of which the emotion types in the training database and the test database belong to the class c, nc、mcRespectively testing the number of the speech segments of which the emotion types belong to the class c in the database, | | | luminance2,1Is 2,1 norm.
Further, the method for jointly training the known label and the unknown label by using the training database of the known label and the test database of the unknown label in the step (3) specifically includes:
(3-1) converting the least squares regression model to:
Figure BDA0002478916840000031
s.t.P=Q
(3-2) estimating a pseudo label matrix formed by the speech emotion category pseudo labels corresponding to all speech segments in the test database through the converted least square regression model
Figure BDA0002478916840000032
(3-3) according to the pseudo labelLabel matrix
Figure BDA0002478916840000033
Is counted to obtain
Figure BDA0002478916840000034
And mcAnd then calculated to obtain
Figure BDA0002478916840000035
(3-4) based on
Figure BDA0002478916840000036
Solving the converted least square regression model by using an augmented Lagrange multiplier method to obtain a projection matrix estimation value
Figure BDA0002478916840000037
(3-5) estimating values from the projection matrix
Figure BDA0002478916840000038
Applying the following formula to a pseudo label matrix
Figure BDA0002478916840000039
And (3) updating:
Figure BDA00024789168400000310
Figure BDA00024789168400000311
in the formula (I), the compound is shown in the specification,
Figure BDA00024789168400000312
it is indicated that the intermediate auxiliary variable,
Figure BDA00024789168400000313
is composed of
Figure BDA00024789168400000314
The element of the ith column and jth row,
Figure BDA00024789168400000315
the row number j representing the row for which the value of the element in the ith column is the largest is found,
Figure BDA00024789168400000316
is a pseudo label matrix
Figure BDA00024789168400000317
Column i, row k;
(3-6) Using the updated pseudo tag matrix
Figure BDA00024789168400000318
And (4) returning to execute the step (3-3) until the preset cycle number is reached, and obtaining the projection matrix estimated value after the cycle is ended
Figure BDA00024789168400000319
As the projection matrix P obtained by learning.
Further, the step (3-2) specifically comprises:
(3-2-1) obtaining an initial value of the estimated value of the projection matrix by using a formula of the transformed least square regression model without adding a regular term
Figure BDA00024789168400000320
Figure BDA00024789168400000321
(3-2-2) initial values according to projection matrix
Figure BDA00024789168400000322
Obtaining an initial value of the pseudo label matrix by adopting the following formula:
Figure BDA00024789168400000323
Figure BDA00024789168400000324
in the formula (I), the compound is shown in the specification,
Figure BDA0002478916840000041
it is shown that the intermediate auxiliary variable,
Figure BDA0002478916840000042
is an initial value of the pseudo tag matrix
Figure BDA0002478916840000043
Column i element of row k. Further, the step (3-4) specifically comprises:
(3-4-1) obtaining an augmented Lagrange equation of the least squares regression model:
Figure BDA0002478916840000044
in the formula, T is a Lagrange multiplier, k is more than 0 and is a regular parameter, and tr () represents the trace of the matrix;
(3-4-2) keeping P, T, k unchanged, updating Q:
And (3) extracting a part related to the variable Q in the augmented Lagrange equation to obtain:
Figure BDA0002478916840000045
solving the above equation to obtain:
Figure BDA0002478916840000046
(3-4-3) keeping Q, T, k unchanged, updating P:
and (3) extracting a part related to the variable P in the augmented Lagrange equation to obtain:
Figure BDA0002478916840000047
solving the above equation to obtain:
Figure BDA0002478916840000048
Piis the ith column vector of P, TiIs the ith column vector of T;
(3-4-4) keeping Q, P unchanged, updating T, k:
T=T+k(P-C)
k=min(ρk,kmax)
in the formula, kmaxIs the maximum value of the preset k, rho is the scaling factor, rho>1;
(3-4-5) check whether convergence occurs:
checking P-Q luminanceIf the epsilon is less than epsilon, returning to execute the step (3-4-2) if the epsilon is not less than epsilon, and if the epsilon is less than epsilon or the iteration number is greater than a set value, taking the value of P at the moment as the solved sparse projection matrix, | | | | | magnetismRepresents the maximum element in the data, and ε represents the convergence threshold.
Further, the method for calculating the speech emotion category label of the test database in step (4) is as follows:
calculated using the formula:
Figure BDA0002478916840000051
Figure BDA0002478916840000052
where P is the final projection matrix, X, we have learnedtThe feature vector set representing the voice segments of the test database, namely the feature vector set of the voice segments to be recognized,
Figure BDA0002478916840000053
and j represents a middle auxiliary variable, and j represents a speech emotion category label of the speech segment to be recognized.
The cross-database speech emotion recognition device based on the joint distribution least square regression comprises a processor and a computer program which is stored on a memory and can run on the processor, wherein the processor executes the program to realize the method.
Has the beneficial effects that: compared with the prior art, the invention has the remarkable advantages that: the cross-database speech emotion recognition method and device of the invention are in cross-database learning, so that the method and device have good adaptability to different environments and the recognition result is more accurate.
Drawings
FIG. 1 is a schematic flow diagram of a cross-database speech emotion recognition method based on joint distribution least square regression according to the present invention.
Detailed Description
The embodiment provides a cross-database speech emotion recognition method based on joint distribution least square regression, as shown in fig. 1, including the following steps:
(1) the method comprises the steps of obtaining two voice databases which are respectively used as a training database and a testing database, wherein the training voice database comprises a plurality of voice fragments and corresponding voice emotion category labels, and the testing database only comprises a plurality of voice fragments to be recognized.
In this embodiment, we use three types of speech emotion databases, which are common in emotion speech recognition: berlin, eNTERFACE, and casia. Because the three types of databases contain different emotion categories, the data are selected when compared pairwise. When Berlin and eNFIGCE are compared, 375 pieces of data and 1077 pieces of data are respectively selected, and the emotion categories are 5 types (anger, fear, happiness, aversion and sadness); when Berlin and CAISA are compared, 408 data and 1000 data are respectively selected, and the emotion category is 5 types (anger, fear, happiness, disgust and sadness); when eNFORCE and CAISA are compared, we have chosen 1072 pieces of data and 1000 pieces of data, and the emotion category is 5 types (anger, fear, happiness, dislike, sadness).
(2) And processing and counting the voice segments by using a plurality of acoustic low-dimensional descriptors, taking each piece of information obtained by counting as an emotional feature, and forming a vector by using a plurality of emotional features as a feature vector of the corresponding voice segment.
The method comprises the following steps:
(2-1) calculating 16 acoustic low-dimensional descriptor values and corresponding incremental parameters thereof for each speech segment; the 16 acoustic low-dimensional descriptors are respectively: time signal zero crossing rate, frame energy root mean square, fundamental frequency, harmonic signal-to-noise ratio and Mel-ton frequency cepstrum coefficient 1-12; the descriptor is derived from INTERSPEECH 2009 the function set provided by Emotion Challenge;
(2-2) for each voice segment, respectively carrying out 12 kinds of statistical function processing on 16 acoustic low-dimensional descriptors of the voice segment by using openSIMLE software, wherein the 12 kinds of statistical functions are respectively averaging, standard deviation, kurtosis, skewness, maximum value, minimum value, relative position, relative range, two linear regression coefficients and mean square error thereof;
and (2-3) taking each piece of information obtained by statistics as one emotion feature, and taking a 16 × 2 × 12-384 emotion feature composition vector as a feature vector corresponding to a speech segment.
(3) And establishing a least square regression model based on joint distribution, and performing joint training on the least square regression model by using a training database of known labels and a test database of unknown labels to obtain a sparse projection matrix connecting the voice segments and the voice emotion category labels.
Wherein the established least squares regression model is as follows:
Figure BDA0002478916840000061
in the formula (I), the compound is shown in the specification,
Figure BDA0002478916840000062
means to find the matrix P, L that minimizes the parenthetical equations∈Rc×nA speech emotion category label vector for training the speech segments of the database, C is the class number of the speech emotion categories, n is the number of the speech segments of the training database, Xs∈Rd×nFor training the feature vector of the database speech segment, d is the dimension of the feature vector, and P belongs to Rd×cAs a sparse projection matrix, PTIs a rank-of-turn matrix for P,
Figure BDA0002478916840000063
is the square of Frobenius norm, and is a balance coefficient for controlling a regular term,
Figure BDA0002478916840000064
Xt∈Rd×mis the feature vector of the test database speech segment, m is the number of segments of the test database speech segment,
Figure BDA0002478916840000071
Figure BDA0002478916840000072
respectively a set of speech segments of which the emotion types in the training database and the test database belong to the class c, nc、mcRespectively testing the number of the speech segments of which the emotion types in the database belong to the class c, | | calting2,1Is 2,1 norm.
The method for jointly training the known label and the unknown label by using the training database of the known label and the test database of the unknown label specifically comprises the following steps:
(3-1) converting the least squares regression model to:
Figure BDA0002478916840000073
s.t.P=Q
(3-2) estimating a pseudo label matrix formed by the speech emotion category pseudo labels corresponding to all speech segments in the test database through the converted least square regression model
Figure BDA0002478916840000074
(3-3) according to the pseudo tag matrix
Figure BDA0002478916840000075
Is counted to obtain
Figure BDA0002478916840000076
And mcAnd then calculated to obtain
Figure BDA0002478916840000077
(3-4) based on
Figure BDA0002478916840000078
Solving the converted least square regression model by using an augmented Lagrange multiplier method to obtain a projection matrix estimation value
Figure BDA0002478916840000079
(3-5) estimating a value based on the projection matrix
Figure BDA00024789168400000710
Using the following formula to pseudo label matrix
Figure BDA00024789168400000711
Updating:
Figure BDA00024789168400000712
Figure BDA00024789168400000713
in the formula (I), the compound is shown in the specification,
Figure BDA00024789168400000714
it is shown that the intermediate auxiliary variable,
Figure BDA00024789168400000715
is composed of
Figure BDA00024789168400000716
The element of the ith column and the jth row,
Figure BDA00024789168400000717
the row number j representing the row for which the value of the element in the ith column is the largest is found,
Figure BDA00024789168400000718
is a pseudo label matrix
Figure BDA00024789168400000719
Column i, row k;
(3-6) Using the updated pseudo tag matrix
Figure BDA00024789168400000720
And (4) returning to execute the step (3-3) until the preset cycle number is reached, and obtaining the projection matrix estimated value after the cycle is ended
Figure BDA00024789168400000721
As the projection matrix P obtained by learning.
Further, the step (3-2) specifically comprises:
(3-2-1) obtaining an initial value of the estimated value of the projection matrix by using a formula of the transformed least square regression model without adding a regular term
Figure BDA0002478916840000081
Figure BDA0002478916840000082
(3-2-2) initial values according to projection matrix
Figure BDA0002478916840000083
The initial value of the pseudo label matrix is obtained by adopting the following formula:
Figure BDA0002478916840000084
Figure BDA0002478916840000085
in the formula (I), the compound is shown in the specification,
Figure BDA0002478916840000086
it is indicated that the intermediate auxiliary variable,
Figure BDA0002478916840000087
is an initial value of a pseudo tag matrix
Figure BDA0002478916840000088
Column i element of row k. Pseudo label matrix
Figure BDA0002478916840000089
Each column of (1) has only one row of its corresponding category as 1, and the remaining rows are all 0.
The step (3-4) specifically comprises the following steps:
(3-4-1) obtaining an augmented Lagrange equation of the least squares regression model:
Figure BDA00024789168400000810
in the formula, T is a Lagrange multiplier, k is more than 0 and is a regular parameter, and tr () represents the trace of the matrix;
(3-4-2) keeping P, T, k unchanged, updating Q:
and (3) extracting a part related to the variable Q in the augmented Lagrange equation to obtain:
Figure BDA00024789168400000811
solving the above equation to obtain:
Figure BDA00024789168400000812
(3-4-3) keeping Q, T, k unchanged, updating P:
and (3) extracting a part related to the variable P in the augmented Lagrange equation to obtain:
Figure BDA00024789168400000813
solving the above equation to obtain:
Figure BDA0002478916840000091
Piis the ith column vector of P, TiIs the ith column vector of T;
(3-4-4) keeping Q, P unchanged, updating T, k:
T=T+k(P-C)
k=min(ρk,kmax)
in the formula, kmaxIs the maximum value of the preset k, rho is the scaling factor, rho>1;
(3-4-5) check whether convergence occurs:
checking P-Q luminanceIf the epsilon is less than epsilon, returning to execute the step (3-4-2) if the epsilon is not less than epsilon, and if the epsilon is less than epsilon or the iteration number is greater than a set value, taking the value of P at the moment as the solved sparse projection matrix, | | | | | magnetism Representing the largest element in the data to be evaluated,
ε represents the convergence threshold.
(4) And (3) obtaining a feature vector for the voice segment to be recognized in the test database according to the step (2), and obtaining a corresponding voice emotion category label by adopting the learned sparse projection matrix.
The specific method is to calculate the category label by adopting the following formula:
Figure BDA0002478916840000092
Figure BDA0002478916840000093
where P is the final projection matrix, X, we have learnedtThe feature vector set representing the voice segments of the test database, namely the feature vector set of the voice segments to be recognized,
Figure BDA0002478916840000094
representing intermediate auxiliary variables, j*And the speech emotion category label represents the speech segment to be recognized.
The embodiment also provides a cross-database speech emotion recognition device based on joint distribution least square regression, which comprises a processor and a computer program stored on a memory and capable of running on the processor, wherein the processor implements the method when executing the computer program.
In order to verify the effectiveness of the invention, experiments are respectively carried out on the speech emotion databases Berlin, the eNBACE and the CAISA database pairwise. In each set of experiments, we treated the two databases as a source domain and a target domain, respectively, wherein the source domain provides training data and labels as a training set, and the target domain provides test data only and does not provide any labels as a test set. For more effective detection and identification accuracy, two detection methods, namely non-weighted average recall ratio (UAR) and weighted average recall ratio (WAR), are adopted. Wherein, UAR represents the number of correct predictions of each class divided by the number of tests participated in by each class, and then the number of correct rate substitution of all classes is averaged; and the WAR indicates the number of all correct predictions divided by the number of all participating tests, without considering the effect of each class number. The UAR and the WAR are comprehensively considered, so that the influence of unbalanced class number can be effectively avoided. As a comparative experiment, several classical and efficient algorithms in subspace learning are selected, which are respectively: SVM, TCA, TKL, DaLSR, DoSL. The experimental results are shown in table 1 below, wherein the method is represented in the table by JDLSR, the data set is source domain/target domain, E, B, C are abbreviations for encerface, Berlin, CASIA data sets, respectively, and the evaluation criteria is UAR/WAR.
Experimental results show that the micro-expression recognition method based on the invention obtains higher cross-database micro-expression recognition rate.
TABLE 1
Figure 1

Claims (6)

1. A cross-database speech emotion recognition method based on joint distribution least square regression is characterized by comprising the following steps:
(1) acquiring two voice databases which are respectively used as a training database and a testing database, wherein the training database comprises a plurality of voice fragments and corresponding voice emotion category labels, and the testing database only comprises a plurality of voice fragments to be recognized;
(2) processing and counting the voice segments by using a plurality of acoustic low-dimensional descriptors, taking each piece of information obtained by counting as an emotional feature, and forming a vector by using a plurality of emotional features as a feature vector of the corresponding voice segment;
(3) establishing a least square regression model based on joint distribution, and performing joint training on the least square regression model by using a training database of known tags and a test database of unknown tags to obtain a sparse projection matrix connecting the voice fragments and the voice emotion class tags; wherein the established least squares regression model is as follows:
Figure FDA0003628972500000011
in the formula (I), the compound is shown in the specification,
Figure FDA0003628972500000012
means find sparse projection matrix P, L that minimizes the parenthetical inner equation s∈Rc×nFor training the speech emotion class label vector of the database speech segment, C isClass number of speech emotion category, n is number of speech segments of training database, XsFor training feature vectors, X, of speech segments of a databases∈Rd×nD is the dimension of the eigenvector, P is the sparse projection matrix, P belongs to Rd×c,PTIs a rank-of-turn matrix for P,
Figure FDA0003628972500000013
is the square of Frobenius norm, and is a balance coefficient for controlling a regular term,
Figure FDA0003628972500000014
Xt∈Rd×mis the feature vector of the test database speech segment, m is the number of segments of the test database speech segment,
Figure FDA0003628972500000015
Figure FDA0003628972500000016
respectively a set of speech segments of which the emotion types in the training database and the testing database belong to class c, wherein c represents the serial number of the emotion type, nc、mcRespectively testing the number of the speech segments of which the emotion types belong to the class c in the database, | | | luminance2,1Is a norm of 2, 1;
the method for performing joint training on the known label by using the training database of the known label and the test database of the unknown label specifically comprises the following steps:
(3-1) converting the least squares regression model to:
Figure FDA0003628972500000017
s.t.P=Q
(3-2) estimating a pseudo label matrix formed by the speech emotion category pseudo labels corresponding to all speech segments in the test database through the converted least square regression model
Figure FDA0003628972500000021
(3-3) according to the pseudo tag matrix
Figure FDA0003628972500000022
Get statistics of
Figure FDA0003628972500000023
And mcAnd then calculated to obtain
Figure FDA0003628972500000024
(3-4) based on
Figure FDA0003628972500000025
Solving the converted least square regression model by using an augmented Lagrange multiplier method to obtain a projection matrix estimation value
Figure FDA0003628972500000026
(3-5) estimating a value based on the projection matrix
Figure FDA0003628972500000027
Using the following formula to pseudo label matrix
Figure FDA0003628972500000028
Updating:
Figure FDA0003628972500000029
Figure FDA00036289725000000210
in the formula (I), the compound is shown in the specification,
Figure FDA00036289725000000211
it is shown that the intermediate auxiliary variable,
Figure FDA00036289725000000212
is composed of
Figure FDA00036289725000000213
The element of the ith column and the jth row,
Figure FDA00036289725000000214
the row number j representing the row for which the value of the element in the ith column is the largest is found,
Figure FDA00036289725000000215
is a pseudo label matrix
Figure FDA00036289725000000216
Column i, row k;
(3-6) Using the updated pseudo tag matrix
Figure FDA00036289725000000217
And (4) returning to the step (3-3) until the preset cycle number is reached, and finishing the cycle to obtain the projection matrix estimated value
Figure FDA00036289725000000218
As a learned sparse projection matrix P;
(4) and (3) for the voice segments to be recognized in the test database, obtaining the feature vectors according to the step (2), and obtaining the corresponding voice emotion category labels by adopting the sparse projection matrix learned in the step (3).
2. The cross-database speech emotion recognition method based on joint distribution least squares regression, as claimed in claim 1, wherein: the step (2) specifically comprises the following steps:
(2-1) calculating values of 16 acoustic low-dimensional descriptors and corresponding incremental parameters thereof for each speech segment; the 16 acoustic low-dimensional descriptors are respectively: time signal zero crossing rate, frame energy root mean square, fundamental frequency, harmonic signal to noise ratio and Melton frequency cepstrum coefficient 1-12;
(2-2) for each voice segment, respectively carrying out 12 kinds of statistical function processing on 16 acoustic low-dimensional descriptors of the voice segment, wherein the 12 kinds of statistical functions are respectively averaging, standard deviation, kurtosis, skewness, maximum value, minimum value, relative position, relative range, two linear regression coefficients and mean square error thereof;
and (2-3) taking each piece of information obtained through statistics as an emotional feature, and taking a plurality of emotional features to form a vector as a feature vector of the corresponding voice segment.
3. The cross-database speech emotion recognition method based on joint distribution least square regression as claimed in claim 1, wherein: the step (3-2) specifically comprises the following steps:
(3-2-1) obtaining an initial value of the estimated value of the projection matrix by using a formula of the transformed least square regression model without adding a regular term
Figure FDA0003628972500000031
Figure FDA0003628972500000032
(3-2-2) initial value based on projection matrix estimation value
Figure FDA0003628972500000033
Obtaining an initial value of the pseudo label matrix by adopting the following formula:
Figure FDA0003628972500000034
Figure FDA0003628972500000035
in the formula (I), the compound is shown in the specification,
Figure FDA0003628972500000036
it is shown that the intermediate auxiliary variable,
Figure FDA0003628972500000037
is an initial value of the pseudo tag matrix
Figure FDA0003628972500000038
Column i element of row k.
4. The cross-database speech emotion recognition method based on joint distribution least squares regression, as claimed in claim 1, wherein: the step (3-4) specifically comprises the following steps:
(3-4-1) obtaining an augmented Lagrange equation of the least squares regression model:
Figure FDA0003628972500000039
in the formula, T is Lagrange multiplier, k >0 is a regular parameter, and tr () represents a trace of a matrix;
(3-4-2) keeping P, T, k unchanged, updating Q:
and extracting the part related to the variable Q in the augmented Lagrange equation to obtain:
Figure FDA00036289725000000310
solving the above equation to obtain:
Figure FDA00036289725000000311
(3-4-3) keeping Q, T, k unchanged, updating P:
and extracting the part related to the variable P in the augmented Lagrange equation to obtain:
Figure FDA0003628972500000041
solving the above equation to obtain:
Figure FDA0003628972500000042
Piis the ith column vector of P, TiIs the ith column vector of T;
(3-4-4) keeping Q, P unchanged, updating T, k:
T=T+k(P-C)
k=min(ρk,kmax)
in the formula, kmaxIs the maximum value of the preset k, rho is the scaling factor, rho>1;
(3-4-5) check whether convergence occurs:
checking P-Q luminanceIf the epsilon is less than epsilon, returning to execute the step (3-4-2) if the epsilon is not less than epsilon, and if the epsilon is less than epsilon or the iteration number is greater than a set value, taking the value of P at the moment as the solved sparse projection matrix, | | | | | magnetismRepresents the maximum element in the data, and ε represents the convergence threshold.
5. The cross-database speech emotion recognition method based on joint distribution least squares regression, as claimed in claim 1, wherein: the method for calculating the voice emotion category label in the test database in the step (4) comprises the following steps:
Calculated using the formula:
Figure FDA0003628972500000043
Figure FDA0003628972500000044
wherein P is the sparse projection matrix learned in step (3), XtThe feature vector set representing the voice segments of the test database, namely the feature vector set of the voice segments to be recognized,
Figure FDA0003628972500000045
representing intermediate auxiliary variables, j*And the speech emotion category label represents the speech segment to be recognized.
6. A cross-database speech emotion recognition apparatus based on joint distribution least squares regression, comprising a processor and a computer program stored on a memory and operable on the processor, wherein: the processor, when executing the computer program, implements the method of any of claims 1-5.
CN202010372728.2A 2020-05-06 2020-05-06 Cross-database speech emotion recognition method and device based on joint distribution least square regression Active CN111583966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010372728.2A CN111583966B (en) 2020-05-06 2020-05-06 Cross-database speech emotion recognition method and device based on joint distribution least square regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010372728.2A CN111583966B (en) 2020-05-06 2020-05-06 Cross-database speech emotion recognition method and device based on joint distribution least square regression

Publications (2)

Publication Number Publication Date
CN111583966A CN111583966A (en) 2020-08-25
CN111583966B true CN111583966B (en) 2022-06-28

Family

ID=72113186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010372728.2A Active CN111583966B (en) 2020-05-06 2020-05-06 Cross-database speech emotion recognition method and device based on joint distribution least square regression

Country Status (1)

Country Link
CN (1) CN111583966B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112397092A (en) * 2020-11-02 2021-02-23 天津理工大学 Unsupervised cross-library speech emotion recognition method based on field adaptive subspace
CN113112994B (en) * 2021-04-21 2023-11-07 江苏师范大学 Cross-corpus emotion recognition method based on graph convolution neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103594084A (en) * 2013-10-23 2014-02-19 江苏大学 Voice emotion recognition method and system based on joint penalty sparse representation dictionary learning
US9892726B1 (en) * 2014-12-17 2018-02-13 Amazon Technologies, Inc. Class-based discriminative training of speech models
CN110120231A (en) * 2019-05-15 2019-08-13 哈尔滨工业大学 Across corpus emotion identification method based on adaptive semi-supervised Non-negative Matrix Factorization
CN110390955A (en) * 2019-07-01 2019-10-29 东南大学 A kind of inter-library speech-emotion recognition method based on Depth Domain adaptability convolutional neural networks
CN111048117A (en) * 2019-12-05 2020-04-21 南京信息工程大学 Cross-library speech emotion recognition method based on target adaptation subspace learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8484024B2 (en) * 2011-02-24 2013-07-09 Nuance Communications, Inc. Phonetic features for speech recognition

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103594084A (en) * 2013-10-23 2014-02-19 江苏大学 Voice emotion recognition method and system based on joint penalty sparse representation dictionary learning
US9892726B1 (en) * 2014-12-17 2018-02-13 Amazon Technologies, Inc. Class-based discriminative training of speech models
CN110120231A (en) * 2019-05-15 2019-08-13 哈尔滨工业大学 Across corpus emotion identification method based on adaptive semi-supervised Non-negative Matrix Factorization
CN110390955A (en) * 2019-07-01 2019-10-29 东南大学 A kind of inter-library speech-emotion recognition method based on Depth Domain adaptability convolutional neural networks
CN111048117A (en) * 2019-12-05 2020-04-21 南京信息工程大学 Cross-library speech emotion recognition method based on target adaptation subspace learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cross-Corpus Speech Emotion Recognition Based on Domain-adaptive Least Squares Regression;Yuan Zong et al.;《IEEE》;20161231;第1-9页 *

Also Published As

Publication number Publication date
CN111583966A (en) 2020-08-25

Similar Documents

Publication Publication Date Title
Sincan et al. Autsl: A large scale multi-modal turkish sign language dataset and baseline methods
Bishay et al. Schinet: Automatic estimation of symptoms of schizophrenia from facial behaviour analysis
Morel et al. Time-series averaging using constrained dynamic time warping with tolerance
CN111126263B (en) Electroencephalogram emotion recognition method and device based on double-hemisphere difference model
CN110175251A (en) The zero sample Sketch Searching method based on semantic confrontation network
Karnati et al. LieNet: A deep convolution neural network framework for detecting deception
CN111583966B (en) Cross-database speech emotion recognition method and device based on joint distribution least square regression
CN103544486B (en) Human age estimation method based on self-adaptation sign distribution
JP6977901B2 (en) Learning material recommendation method, learning material recommendation device and learning material recommendation program
Yang et al. Visual goal-step inference using wikiHow
CN113112994B (en) Cross-corpus emotion recognition method based on graph convolution neural network
Zhang et al. A kinect based golf swing score and grade system using gmm and svm
Cai et al. Mitigating behavioral variability for mouse dynamics: A dimensionality-reduction-based approach
Takano et al. Bigram-based natural language model and statistical motion symbol model for scalable language of humanoid robots
Caglayan et al. LIUM-CVC submissions for WMT18 multimodal translation task
CN112397092A (en) Unsupervised cross-library speech emotion recognition method based on field adaptive subspace
Spaulding et al. Frustratingly easy personalization for real-time affect interpretation of facial expression
Patro et al. Uncertainty class activation map (U-CAM) using gradient certainty method
Samsudin Modeling student’s academic performance during covid-19 based on classification in support vector machine
CN108197274B (en) Abnormal personality detection method and device based on conversation
Ye et al. Rebalanced zero-shot learning
Lu et al. A zero-shot intelligent fault diagnosis system based on EEMD
CN116244474A (en) Learner learning state acquisition method based on multi-mode emotion feature fusion
Ren et al. Subject-independent natural action recognition
Boukhennoufa et al. A novel model to generate heterogeneous and realistic time-series data for post-stroke rehabilitation assessment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant