CN111583966A - Cross-database speech emotion recognition method and device based on joint distribution least square regression - Google Patents

Cross-database speech emotion recognition method and device based on joint distribution least square regression

Info

Publication number
CN111583966A
Authority
CN
China
Prior art keywords
database
voice
speech
matrix
training
Prior art date
Legal status: Granted
Application number
CN202010372728.2A
Other languages
Chinese (zh)
Other versions
CN111583966B (en
Inventor
宗源
江林
张佳成
郑文明
江星洵
刘佳腾
Current Assignee: Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN202010372728.2A
Publication of CN111583966A
Application granted
Publication of CN111583966B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48 - specially adapted for particular use
    • G10L25/51 - for comparison or discrimination
    • G10L25/63 - for estimating an emotional state
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/18 - the extracted parameters being spectral information of each sub-band
    • G10L25/21 - the extracted parameters being power information
    • G10L25/24 - the extracted parameters being the cepstrum
    • G10L25/27 - characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-database speech emotion recognition method and device based on joint distribution least square regression. The method comprises the following steps: (1) acquiring a training database and a test database, wherein the training speech database comprises a number of speech segments with corresponding speech emotion category labels, while the test database comprises only speech segments to be recognized; (2) processing the speech segments with a number of acoustic low-level descriptors and computing statistics on them, taking each resulting statistic as an emotion feature, and forming the emotion features into the feature vector of the corresponding speech segment; (3) establishing a least square regression model based on joint distribution and training it jointly on the training database and the test database to obtain a sparse projection matrix; (4) for each speech segment to be recognized, obtaining its feature vector as in step (2) and obtaining the corresponding speech emotion category label with the learned sparse projection matrix. The invention adapts well to different environments and achieves higher accuracy.

Description

Cross-database speech emotion recognition method and device based on joint distribution least square regression
Technical Field
The invention relates to speech emotion recognition, in particular to a cross-database speech emotion recognition method and device based on joint distribution least square regression.
Background
The purpose of speech emotion recognition is to give a machine enough intelligence to extract the emotional state of a speaker (such as happiness, fear, or sadness) from the speaker's voice. It is therefore an important link in human-computer interaction and has great research potential and development prospects. For example, if a driver's mental state is detected by combining voice, facial expression, and behavioral information, the driver can be reminded in time to concentrate and avoid dangerous driving; detecting the speech emotion of a speaker during human-computer interaction makes the dialogue smoother, takes better care of the speaker's state of mind, and brings the interaction closer to human cognition; a wearable device can give more timely and more appropriate feedback according to the wearer's emotional state. In fields such as classroom teaching and companion care, speech emotion recognition likewise plays an increasingly important role.
Traditional speech emotion recognition is trained and tested on the same speech database, so the training data and the test data follow the same distribution. In real life, however, a trained model must face different environments, and the recording background is mixed with various kinds of noise. Cross-database speech emotion recognition therefore faces significant challenges, and how to make a trained model adapt well to different environments has become a problem to be solved by academia and industry.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a cross-database speech emotion recognition method and device based on joint distribution least square regression.
The technical scheme is as follows: the cross-database speech emotion recognition method based on the joint distribution least square regression comprises the following steps:
(1) acquiring two speech databases, used respectively as a training database and a test database, wherein the training speech database comprises a number of speech segments and their corresponding speech emotion category labels, while the test database comprises only speech segments to be recognized;
(2) processing the speech segments with a number of acoustic low-level descriptors and computing statistics on them, taking each resulting statistic as an emotion feature, and forming the emotion features into the feature vector of the corresponding speech segment;
(3) establishing a least square regression model based on joint distribution, and training it jointly with the labeled training database and the unlabeled test database to obtain a sparse projection matrix linking the speech segments and the speech emotion category labels;
(4) for the speech segments to be recognized in the test database, obtaining their feature vectors as in step (2), and obtaining the corresponding speech emotion category labels with the learned sparse projection matrix.
Further, the step (2) specifically comprises:
(2-1) calculating, for each speech segment, the values of 16 acoustic low-level descriptors and their corresponding delta (first-order difference) parameters, the 16 acoustic low-level descriptors being: the zero-crossing rate of the time signal, the root-mean-square frame energy, the fundamental frequency, the harmonics-to-noise ratio, and Mel-frequency cepstral coefficients 1-12;
(2-2) applying 12 statistical functions to the 16 acoustic low-level descriptors of each speech segment, the 12 statistical functions being: mean, standard deviation, kurtosis, skewness, maximum value, minimum value, the relative positions of the extrema, the range, two linear regression coefficients, and their mean square error;
(2-3) taking each statistic thus obtained as an emotion feature, and taking the emotion features together as the feature vector of the corresponding speech segment.
Further, the least squares regression model established in step (3) is:
(The objective function is given as an equation image in the original publication. It combines a least squares regression term ||L_s - P^T X_s||_F^2, a 2,1-norm sparsity term on the projection matrix P, and joint-distribution regularization terms that align the marginal and class-conditional feature statistics of the training and test databases.)
In the formula, min_P denotes finding the matrix P that minimizes the objective; L_s ∈ R^(c×n) is the speech emotion category label matrix of the training-database speech segments, c is the number of speech emotion categories, and n is the number of training-database speech segments; X_s ∈ R^(d×n) is the feature matrix of the training-database speech segments and d is the feature dimension; P ∈ R^(d×c) is the sparse projection matrix and P^T is its transpose; ||·||_F^2 is the squared Frobenius norm, and the balance coefficients control the strength of the regularization terms; X_t ∈ R^(d×m) is the feature matrix of the test-database speech segments and m is the number of test-database speech segments; X_s^c and X_t^c are the sets of speech segments whose emotion belongs to class c in the training database and the test database respectively, and n_c and m_c are the corresponding numbers of such segments; ||·||_{2,1} is the 2,1 norm.
Further, the joint training in step (3) using the labeled training database and the unlabeled test database specifically includes:
(3-1) converting the least square regression model into an equivalent form with an auxiliary variable Q (equation image in the original), subject to the constraint P = Q;
(3-2) estimating, with the converted least square regression model, the pseudo label matrix formed by the speech emotion category pseudo labels of all speech segments in the test database;
(3-3) from the pseudo label matrix, determining the class-wise segment sets X_t^c and counts m_c for the test database, and then computing the class-conditional statistics used in the model;
(3-4) on that basis, solving the converted least square regression model with the augmented Lagrange multiplier method to obtain the projection matrix estimate;
(3-5) based on the projection matrix estimate, updating the pseudo label matrix: the intermediate auxiliary prediction matrix (the projection matrix estimate applied to the test features) is computed; for each column i, the row index j of the largest element in that column is found, and the entry of the pseudo label matrix in column i, row j is set to 1 while the remaining entries of that column are set to 0;
(3-6) with the updated pseudo label matrix, returning to step (3-3) until a preset number of cycles is reached; the projection matrix estimate obtained when the loop ends is taken as the learned projection matrix P.
Further, the step (3-2) specifically comprises:
(3-2-1) obtaining an initial value of the projection matrix estimate from the converted least square regression model with the regularization term omitted (equation image in the original);
(3-2-2) from this initial projection matrix estimate, obtaining the initial pseudo label matrix by the same rule as in step (3-5): the intermediate auxiliary prediction matrix is computed, and for each column the entry in the row with the largest value is set to 1 while the remaining entries of that column are set to 0.
Further, the step (3-4) specifically comprises:
(3-4-1) forming the augmented Lagrange equation of the converted least square regression model (equation image in the original), in which T is the Lagrange multiplier, k > 0 is the penalty parameter, and tr(·) denotes the trace of a matrix;
(3-4-2) keeping P, T and k unchanged, updating Q: the part of the augmented Lagrange equation involving the variable Q is extracted and solved in closed form (equation images in the original);
(3-4-3) keeping Q, T and k unchanged, updating P: the part of the augmented Lagrange equation involving the variable P is extracted and solved column by column (equation images in the original), where P_i is the ith column vector of P and T_i is the ith column vector of T;
(3-4-4) keeping Q and P unchanged, updating T and k:
T = T + k(P - Q)
k = min(ρk, k_max)
where k_max is the preset maximum value of k and ρ > 1 is a scaling factor;
(3-4-5) checking for convergence: if ||P - Q||_∞ does not fall below the convergence threshold, return to step (3-4-2); if it does, or if the number of iterations exceeds the set value, take the current value of P as the sparse projection matrix; ||·||_∞ denotes the largest element of its argument.
Further, the method for calculating the speech emotion category label of the test database in step (4) is as follows:
The label is calculated with the following rule (equation images in the original): using the learned final projection matrix P and the feature matrix X_t of the test-database speech segments, i.e., of the speech segments to be recognized, the intermediate auxiliary prediction matrix P^T X_t is computed; for each speech segment to be recognized, the row index j* of the largest element in its column of P^T X_t is taken as its speech emotion category label.
The cross-database speech emotion recognition device based on joint distribution least square regression comprises a processor and a computer program stored in a memory and executable on the processor, wherein the processor implements the above method when executing the program.
Beneficial effects: compared with the prior art, the invention has the following remarkable advantage: because the cross-database speech emotion recognition method and device learn across databases, they adapt well to different environments and produce more accurate recognition results.
Drawings
FIG. 1 is a schematic flow diagram of a cross-database speech emotion recognition method based on joint distribution least square regression provided by the invention.
Detailed Description
The embodiment provides a cross-database speech emotion recognition method based on joint distribution least square regression, as shown in fig. 1, including the following steps:
(1) Two speech databases are acquired and used respectively as the training database and the test database; the training speech database comprises a number of speech segments with corresponding speech emotion category labels, while the test database comprises only speech segments to be recognized.
In this embodiment, we use three speech emotion databases that are common in emotional speech recognition: Berlin, eNTERFACE, and CASIA. Because the three databases contain different emotion categories, data are selected for each pairwise comparison. When Berlin and eNTERFACE are compared, 375 and 1077 samples are selected respectively, covering 5 emotion categories (anger, fear, happiness, disgust, sadness); when Berlin and CASIA are compared, 408 and 1000 samples are selected respectively, covering the same 5 categories; when eNTERFACE and CASIA are compared, 1072 and 1000 samples are selected respectively, again covering the same 5 categories.
(2) The speech segments are processed with a number of acoustic low-level descriptors and statistics are computed on them; each resulting statistic is taken as an emotion feature, and the emotion features together form the feature vector of the corresponding speech segment.
This specifically comprises:
(2-1) calculating, for each speech segment, the values of 16 acoustic low-level descriptors and their corresponding delta (first-order difference) parameters; the 16 acoustic low-level descriptors are: the zero-crossing rate of the time signal, the root-mean-square frame energy, the fundamental frequency, the harmonics-to-noise ratio, and Mel-frequency cepstral coefficients 1-12; these descriptors come from the feature set provided by the INTERSPEECH 2009 Emotion Challenge;
(2-2) for each speech segment, applying 12 statistical functions to its 16 acoustic low-level descriptors with the openSMILE toolkit, the 12 statistical functions being: mean, standard deviation, kurtosis, skewness, maximum value, minimum value, the relative positions of the extrema, the range, two linear regression coefficients, and their mean square error;
(2-3) taking each statistic thus obtained as an emotion feature; the resulting 16 × 2 × 12 = 384 emotion features form the feature vector of the corresponding speech segment.
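As a rough illustration of this 16 × 2 × 12 = 384-dimensional layout, the following Python sketch computes a comparable feature vector with librosa and scipy instead of openSMILE. The frame sizes, the YIN pitch tracker, and the use of spectral flatness as a stand-in for the harmonics-to-noise ratio are assumptions of this sketch, so it approximates, rather than reproduces, the INTERSPEECH 2009 Emotion Challenge feature set.

```python
# Illustrative sketch only (not the patent's exact openSMILE pipeline):
# 16 low-level descriptors plus their deltas, 12 functionals each -> 384 features.
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

def functionals(x):
    """12 statistics of one low-level-descriptor contour."""
    t = np.arange(len(x))
    slope, offset = np.polyfit(t, x, 1)            # two linear regression coefficients
    mse = np.mean((offset + slope * t - x) ** 2)   # their mean square error
    return np.array([
        x.mean(), x.std(), kurtosis(x), skew(x),
        x.max(), x.min(),
        np.argmax(x) / len(x), np.argmin(x) / len(x),   # relative positions of extrema
        x.max() - x.min(),                               # range
        slope, offset, mse,
    ])

def extract_features(wav_path, sr=16000, frame=400, hop=160):
    y, sr = librosa.load(wav_path, sr=sr)
    zcr  = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)[0]
    rms  = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
    f0   = librosa.yin(y, fmin=50, fmax=500, sr=sr, frame_length=frame * 4, hop_length=hop)
    flat = librosa.feature.spectral_flatness(y=y, n_fft=frame, hop_length=hop)[0]  # HNR stand-in
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=frame, hop_length=hop)
    T = min(len(zcr), len(rms), len(f0), len(flat), mfcc.shape[1])
    llds = np.vstack([zcr[:T], rms[:T], f0[:T], flat[:T], mfcc[:, :T]])   # 16 x T
    llds = np.vstack([llds, librosa.feature.delta(llds)])                 # add deltas -> 32 x T
    return np.concatenate([functionals(row) for row in llds])             # 32 * 12 = 384

# Example: one such vector would form one column of X_s or X_t
# x = extract_features("speech_segment.wav")   # shape (384,)
```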
(3) A least square regression model based on joint distribution is established and trained jointly with the labeled training database and the unlabeled test database to obtain a sparse projection matrix linking the speech segments and the speech emotion category labels.
Wherein the established least squares regression model is as follows:
(The objective function is given as an equation image in the original publication. It combines a least squares regression term ||L_s - P^T X_s||_F^2, a 2,1-norm sparsity term on the projection matrix P, and joint-distribution regularization terms that align the marginal and class-conditional feature statistics of the training and test databases.)
In the formula, min_P denotes finding the matrix P that minimizes the objective; L_s ∈ R^(c×n) is the speech emotion category label matrix of the training-database speech segments, c is the number of speech emotion categories, and n is the number of training-database speech segments; X_s ∈ R^(d×n) is the feature matrix of the training-database speech segments and d is the feature dimension; P ∈ R^(d×c) is the sparse projection matrix and P^T is its transpose; ||·||_F^2 is the squared Frobenius norm, and the balance coefficients control the strength of the regularization terms; X_t ∈ R^(d×m) is the feature matrix of the test-database speech segments and m is the number of test-database speech segments; X_s^c and X_t^c are the sets of speech segments whose emotion belongs to class c in the training database and the test database respectively, and n_c and m_c are the corresponding numbers of such segments; ||·||_{2,1} is the 2,1 norm.
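For concreteness, the following numpy sketch evaluates one plausible instantiation of such an objective: a squared-error regression term, a 2,1-norm sparsity term, and squared distances between projected marginal and class-conditional means. The coefficients lam and mu and the exact form of the distribution terms are assumptions of this sketch, since the patented model's equation appears only as an image.

```python
# A plausible instantiation of a joint-distribution least squares objective
# (assumed form; the patent's exact equation is given only as an image).
import numpy as np

def l21_norm(P):
    # sum of the Euclidean norms of the rows of P
    return np.sum(np.linalg.norm(P, axis=1))

def jdlsr_objective(P, Xs, Ls, Xt, Lt_pseudo, lam=1.0, mu=1.0):
    """Xs: d x n, Ls: c x n one-hot, Xt: d x m, Lt_pseudo: c x m one-hot pseudo labels."""
    fit = np.linalg.norm(Ls - P.T @ Xs, "fro") ** 2           # regression term
    sparsity = lam * l21_norm(P)                               # 2,1-norm sparsity on P
    # marginal alignment: projected global means of training and test features
    marg = np.linalg.norm(P.T @ (Xs.mean(axis=1) - Xt.mean(axis=1))) ** 2
    # conditional alignment: projected class means, using pseudo labels for the test data
    cond = 0.0
    for c in range(Ls.shape[0]):
        s_idx = Ls[c] > 0.5
        t_idx = Lt_pseudo[c] > 0.5
        if s_idx.any() and t_idx.any():
            cond += np.linalg.norm(P.T @ (Xs[:, s_idx].mean(axis=1)
                                          - Xt[:, t_idx].mean(axis=1))) ** 2
    return fit + sparsity + mu * (marg + cond)
```

In the alternating scheme described below, the class-conditional term for the test database would use the current pseudo labels, which is what the Lt_pseudo argument stands for here.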
The joint training using the labeled training database and the unlabeled test database specifically comprises the following steps:
(3-1) converting the least square regression model into an equivalent form with an auxiliary variable Q (equation image in the original), subject to the constraint P = Q;
(3-2) estimating, with the converted least square regression model, the pseudo label matrix formed by the speech emotion category pseudo labels of all speech segments in the test database;
(3-3) from the pseudo label matrix, determining the class-wise segment sets X_t^c and counts m_c for the test database, and then computing the class-conditional statistics used in the model;
(3-4) on that basis, solving the converted least square regression model with the augmented Lagrange multiplier method to obtain the projection matrix estimate;
(3-5) based on the projection matrix estimate, updating the pseudo label matrix: the intermediate auxiliary prediction matrix (the projection matrix estimate applied to the test features) is computed; for each column i, the row index j of the largest element in that column is found, and the entry of the pseudo label matrix in column i, row j is set to 1 while the remaining entries of that column are set to 0;
(3-6) with the updated pseudo label matrix, returning to step (3-3) until a preset number of cycles is reached; the projection matrix estimate obtained when the loop ends is taken as the learned projection matrix P.
Further, the step (3-2) specifically comprises:
(3-2-1) obtaining an initial value of the projection matrix estimate from the converted least square regression model with the regularization term omitted (equation image in the original);
(3-2-2) from this initial projection matrix estimate, obtaining the initial pseudo label matrix by the same rule as in step (3-5): the intermediate auxiliary prediction matrix is computed, and for each column the entry in the row with the largest value is set to 1 while the remaining entries of that column are set to 0. Each column of the pseudo label matrix therefore contains a single 1, in the row of its corresponding category, and zeros in all remaining rows.
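The column-wise argmax rule described above can be written compactly. The sketch below assumes the intermediate auxiliary variable is the projected test feature matrix (the projection estimate transposed times X_t), consistent with the prediction rule in step (4).

```python
import numpy as np

def update_pseudo_labels(P_hat, Xt):
    """Return a c x m one-hot pseudo label matrix: each column has a single 1
    in the row where the projected prediction P_hat^T Xt is largest."""
    Lt_tilde = P_hat.T @ Xt                  # intermediate auxiliary prediction, c x m
    Lt_hat = np.zeros_like(Lt_tilde)
    Lt_hat[np.argmax(Lt_tilde, axis=0), np.arange(Lt_tilde.shape[1])] = 1.0
    return Lt_hat
```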
The step (3-4) specifically comprises the following steps:
(3-4-1) forming the augmented Lagrange equation of the converted least square regression model (equation image in the original), in which T is the Lagrange multiplier, k > 0 is the penalty parameter, and tr(·) denotes the trace of a matrix;
(3-4-2) keeping P, T and k unchanged, updating Q: the part of the augmented Lagrange equation involving the variable Q is extracted and solved in closed form (equation images in the original);
(3-4-3) keeping Q, T and k unchanged, updating P: the part of the augmented Lagrange equation involving the variable P is extracted and solved column by column (equation images in the original), where P_i is the ith column vector of P and T_i is the ith column vector of T;
(3-4-4) keeping Q and P unchanged, updating T and k:
T = T + k(P - Q)
k = min(ρk, k_max)
where k_max is the preset maximum value of k and ρ > 1 is a scaling factor;
(3-4-5) checking for convergence: if ||P - Q||_∞ does not fall below the convergence threshold, return to step (3-4-2); if it does, or if the number of iterations exceeds the set value, take the current value of P as the sparse projection matrix; ||·||_∞ denotes the largest element of its argument.
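Because the sub-problem solutions appear only as images in the original, the following is a generic sketch of the augmented Lagrange multiplier loop for the simplified split min ||L_s - Q^T X_s||_F^2 + λ||P||_{2,1} subject to P = Q, with the joint-distribution terms omitted for brevity. The closed-form Q update and the group-shrinkage P update are the standard solutions for this simplified split and are assumptions, not the patent's exact formulas.

```python
import numpy as np

def shrink_rows(V, tau):
    """Proximal operator of tau*||.||_{2,1}, grouping by rows of V (the common
    convention; the patent's own closed form for the P sub-problem is an image)."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * V

def alm_solve(Xs, Ls, lam=0.1, rho=1.1, k_max=1e6, eps=1e-6, max_iter=500):
    """Sketch of the ALM loop for min ||Ls - Q^T Xs||_F^2 + lam*||P||_{2,1} s.t. P = Q
    (joint-distribution terms omitted for brevity)."""
    d, _ = Xs.shape
    c = Ls.shape[0]
    P = np.zeros((d, c)); T = np.zeros((d, c)); k = 1.0
    XXt = Xs @ Xs.T
    XLt = Xs @ Ls.T
    for _ in range(max_iter):
        # (3-4-2) Q update: quadratic in Q, solved in closed form
        Q = np.linalg.solve(2.0 * XXt + k * np.eye(d), 2.0 * XLt + k * P + T)
        # (3-4-3) P update: group-sparsity proximal step around Q - T/k
        P = shrink_rows(Q - T / k, lam / k)
        # (3-4-4) multiplier and penalty updates
        T = T + k * (P - Q)
        k = min(rho * k, k_max)
        # (3-4-5) convergence check on the constraint violation
        if np.max(np.abs(P - Q)) < eps:
            break
    return P
```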
(4) For the speech segments to be recognized in the test database, the feature vectors are obtained as in step (2), and the corresponding speech emotion category labels are obtained with the learned sparse projection matrix.
Specifically, the category labels are calculated with the following rule (equation images in the original): using the learned final projection matrix P and the feature matrix X_t of the speech segments to be recognized, the intermediate auxiliary prediction matrix P^T X_t is computed; for each speech segment, the row index j* of the largest element in its column of P^T X_t is taken as its speech emotion category label.
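Putting the earlier sketches together, a hypothetical end-to-end run could look as follows; the file names, label indices, and the reuse of the extract_features and alm_solve sketches are illustrative assumptions, not part of the patent.

```python
import numpy as np

# Hypothetical toy data: two labeled training segments, two unlabeled test segments.
train_files, train_labels = ["ang_01.wav", "sad_02.wav"], [0, 4]
test_files = ["unknown_01.wav", "unknown_02.wav"]
num_classes = 5

Xs = np.stack([extract_features(f) for f in train_files], axis=1)   # d x n
Xt = np.stack([extract_features(f) for f in test_files], axis=1)    # d x m
Ls = np.zeros((num_classes, len(train_labels)))
Ls[train_labels, np.arange(len(train_labels))] = 1.0                # one-hot c x n

P = alm_solve(Xs, Ls)                       # learned sparse projection matrix
predicted = np.argmax(P.T @ Xt, axis=0)     # emotion category index per test segment
```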
The embodiment also provides a cross-database speech emotion recognition device based on joint distribution least square regression, which comprises a processor and a computer program stored on a memory and capable of running on the processor, wherein the processor implements the method when executing the computer program.
In order to verify the effectiveness of the invention, experiments were carried out pairwise on the speech emotion databases Berlin, eNTERFACE, and CASIA. In each group of experiments, the two databases are treated as a source domain and a target domain respectively: the source domain provides training data and labels as the training set, and the target domain provides only test data, without any labels, as the test set. To measure recognition accuracy more effectively, two metrics are adopted: unweighted average recall (UAR) and weighted average recall (WAR). UAR divides the number of correct predictions for each class by the number of test samples of that class and then averages these per-class accuracies over all classes, while WAR divides the total number of correct predictions by the total number of test samples, without regard to the per-class sample counts. Considering UAR and WAR together effectively avoids the influence of class imbalance. As comparison methods, several classical and effective subspace-learning algorithms are selected: SVM, TCA, TKL, DaLSR, and DoSL. The experimental results are shown in Table 1 below, where the proposed method is denoted JDLSR, each data set is written as source domain/target domain, E, B, and C abbreviate the eNTERFACE, Berlin, and CASIA data sets respectively, and the evaluation criterion is UAR/WAR.
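For reference, both metrics can be computed with scikit-learn, whose macro-averaged recall equals UAR and whose accuracy equals WAR; the label arrays below are placeholders.

```python
import numpy as np
from sklearn.metrics import recall_score, accuracy_score

y_true = np.array([0, 0, 1, 2, 2, 2])   # placeholder ground-truth classes
y_pred = np.array([0, 1, 1, 2, 2, 0])   # placeholder predictions

uar = recall_score(y_true, y_pred, average="macro")  # mean of per-class recalls
war = accuracy_score(y_true, y_pred)                 # overall fraction correct
print(f"UAR={uar:.3f}, WAR={war:.3f}")
```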
The experimental results show that the method provided by the invention achieves a higher cross-database speech emotion recognition rate.
TABLE 1
(Table 1 is provided as an image in the original publication; the UAR/WAR results are not reproduced in the text.)

Claims (8)

1. A cross-database speech emotion recognition method based on joint distribution least square regression, characterized by comprising the following steps:
(1) acquiring two speech databases, used respectively as a training database and a test database, wherein the training speech database comprises a number of speech segments and their corresponding speech emotion category labels, while the test database comprises only speech segments to be recognized;
(2) processing the speech segments with a number of acoustic low-level descriptors and computing statistics on them, taking each resulting statistic as an emotion feature, and forming the emotion features into the feature vector of the corresponding speech segment;
(3) establishing a least square regression model based on joint distribution, and training it jointly with the labeled training database and the unlabeled test database to obtain a sparse projection matrix linking the speech segments and the speech emotion category labels;
(4) for the speech segments to be recognized in the test database, obtaining their feature vectors as in step (2), and obtaining the corresponding speech emotion category labels with the learned sparse projection matrix.
2. The cross-database speech emotion recognition method based on joint distribution least square regression as claimed in claim 1, wherein step (2) specifically comprises:
(2-1) calculating, for each speech segment, the values of 16 acoustic low-level descriptors and their corresponding delta (first-order difference) parameters, the 16 acoustic low-level descriptors being: the zero-crossing rate of the time signal, the root-mean-square frame energy, the fundamental frequency, the harmonics-to-noise ratio, and Mel-frequency cepstral coefficients 1-12;
(2-2) applying 12 statistical functions to the 16 acoustic low-level descriptors of each speech segment, the 12 statistical functions being: mean, standard deviation, kurtosis, skewness, maximum value, minimum value, the relative positions of the extrema, the range, two linear regression coefficients, and their mean square error;
(2-3) taking each statistic thus obtained as an emotion feature, and taking the emotion features together as the feature vector of the corresponding speech segment.
3. The cross-database speech emotion recognition method based on joint distribution least squares regression, as claimed in claim 1, wherein: the least square regression model established in the step (3) is as follows:
(The objective function is given as an equation image in the original publication. It combines a least squares regression term ||L_s - P^T X_s||_F^2, a 2,1-norm sparsity term on the projection matrix P, and joint-distribution regularization terms that align the marginal and class-conditional feature statistics of the training and test databases.)
In the formula, min_P denotes finding the matrix P that minimizes the objective; L_s ∈ R^(c×n) is the speech emotion category label matrix of the training-database speech segments, c is the number of speech emotion categories, and n is the number of training-database speech segments; X_s ∈ R^(d×n) is the feature matrix of the training-database speech segments and d is the feature dimension; P ∈ R^(d×c) is the sparse projection matrix and P^T is its transpose; ||·||_F^2 is the squared Frobenius norm, and the balance coefficients control the strength of the regularization terms; X_t ∈ R^(d×m) is the feature matrix of the test-database speech segments and m is the number of test-database speech segments; X_s^c and X_t^c are the sets of speech segments whose emotion belongs to class c in the training database and the test database respectively, and n_c and m_c are the corresponding numbers of such segments; ||·||_{2,1} is the 2,1 norm.
4. The cross-database speech emotion recognition method based on joint distribution least square regression as claimed in claim 3, wherein the joint training in step (3) using the labeled training database and the unlabeled test database specifically comprises:
(3-1) converting the least square regression model into an equivalent form with an auxiliary variable Q (equation image in the original), subject to the constraint P = Q;
(3-2) estimating, with the converted least square regression model, the pseudo label matrix formed by the speech emotion category pseudo labels of all speech segments in the test database;
(3-3) from the pseudo label matrix, determining the class-wise segment sets X_t^c and counts m_c for the test database, and then computing the class-conditional statistics used in the model;
(3-4) on that basis, solving the converted least square regression model with the augmented Lagrange multiplier method to obtain the projection matrix estimate;
(3-5) based on the projection matrix estimate, updating the pseudo label matrix: the intermediate auxiliary prediction matrix (the projection matrix estimate applied to the test features) is computed; for each column i, the row index j of the largest element in that column is found, and the entry of the pseudo label matrix in column i, row j is set to 1 while the remaining entries of that column are set to 0;
(3-6) with the updated pseudo label matrix, returning to step (3-3) until a preset number of cycles is reached; the projection matrix estimate obtained when the loop ends is taken as the learned projection matrix P.
5. The cross-database speech emotion recognition method based on joint distribution least squares regression as claimed in claim 4, wherein: the step (3-2) specifically comprises the following steps:
(3-2-1) obtaining an initial value of the projection matrix estimate from the converted least square regression model with the regularization term omitted (equation image in the original);
(3-2-2) from this initial projection matrix estimate, obtaining the initial pseudo label matrix by the same rule as in step (3-5): the intermediate auxiliary prediction matrix is computed, and for each column the entry in the row with the largest value is set to 1 while the remaining entries of that column are set to 0.
6. The cross-database speech emotion recognition method based on joint distribution least squares regression as claimed in claim 4, wherein: the step (3-4) specifically comprises the following steps:
(3-4-1) forming the augmented Lagrange equation of the converted least square regression model (equation image in the original), in which T is the Lagrange multiplier, k > 0 is the penalty parameter, and tr(·) denotes the trace of a matrix;
(3-4-2) keeping P, T and k unchanged, updating Q: the part of the augmented Lagrange equation involving the variable Q is extracted and solved in closed form (equation images in the original);
(3-4-3) keeping Q, T and k unchanged, updating P: the part of the augmented Lagrange equation involving the variable P is extracted and solved column by column (equation images in the original), where P_i is the ith column vector of P and T_i is the ith column vector of T;
(3-4-4) keeping Q and P unchanged, updating T and k:
T = T + k(P - Q)
k = min(ρk, k_max)
where k_max is the preset maximum value of k and ρ > 1 is a scaling factor;
(3-4-5) checking for convergence: if ||P - Q||_∞ does not fall below the convergence threshold, return to step (3-4-2); if it does, or if the number of iterations exceeds the set value, take the current value of P as the sparse projection matrix; ||·||_∞ denotes the largest element of its argument.
7. The cross-database speech emotion recognition method based on joint distribution least square regression as claimed in claim 1, wherein the speech emotion category labels of the test database in step (4) are calculated with the following rule (equation images in the original): using the projection matrix P learned in step (3) and the feature matrix X_t of the speech segments to be recognized in the test database, the intermediate auxiliary prediction matrix P^T X_t is computed; for each speech segment, the row index j* of the largest element in its column of P^T X_t is taken as its speech emotion category label.
8. A cross-database speech emotion recognition apparatus based on joint distribution least squares regression, comprising a processor and a computer program stored on a memory and operable on the processor, wherein: the processor, when executing the program, implements the method of any of claims 1-6.
CN202010372728.2A 2020-05-06 2020-05-06 Cross-database speech emotion recognition method and device based on joint distribution least square regression Active CN111583966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010372728.2A CN111583966B (en) 2020-05-06 2020-05-06 Cross-database speech emotion recognition method and device based on joint distribution least square regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010372728.2A CN111583966B (en) 2020-05-06 2020-05-06 Cross-database speech emotion recognition method and device based on joint distribution least square regression

Publications (2)

Publication Number Publication Date
CN111583966A true CN111583966A (en) 2020-08-25
CN111583966B CN111583966B (en) 2022-06-28

Family

ID=72113186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010372728.2A Active CN111583966B (en) 2020-05-06 2020-05-06 Cross-database speech emotion recognition method and device based on joint distribution least square regression

Country Status (1)

Country Link
CN (1) CN111583966B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112397092A (en) * 2020-11-02 2021-02-23 天津理工大学 Unsupervised cross-library speech emotion recognition method based on field adaptive subspace
CN113112994A (en) * 2021-04-21 2021-07-13 江苏师范大学 Cross-corpus emotion recognition method based on graph convolution neural network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120221333A1 (en) * 2011-02-24 2012-08-30 International Business Machines Corporation Phonetic Features for Speech Recognition
CN103594084A (en) * 2013-10-23 2014-02-19 江苏大学 Voice emotion recognition method and system based on joint penalty sparse representation dictionary learning
US9892726B1 (en) * 2014-12-17 2018-02-13 Amazon Technologies, Inc. Class-based discriminative training of speech models
CN110120231A (en) * 2019-05-15 2019-08-13 哈尔滨工业大学 Across corpus emotion identification method based on adaptive semi-supervised Non-negative Matrix Factorization
CN110390955A (en) * 2019-07-01 2019-10-29 东南大学 A kind of inter-library speech-emotion recognition method based on Depth Domain adaptability convolutional neural networks
CN111048117A (en) * 2019-12-05 2020-04-21 南京信息工程大学 Cross-library speech emotion recognition method based on target adaptation subspace learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120221333A1 (en) * 2011-02-24 2012-08-30 International Business Machines Corporation Phonetic Features for Speech Recognition
CN103594084A (en) * 2013-10-23 2014-02-19 江苏大学 Voice emotion recognition method and system based on joint penalty sparse representation dictionary learning
US9892726B1 (en) * 2014-12-17 2018-02-13 Amazon Technologies, Inc. Class-based discriminative training of speech models
CN110120231A (en) * 2019-05-15 2019-08-13 哈尔滨工业大学 Across corpus emotion identification method based on adaptive semi-supervised Non-negative Matrix Factorization
CN110390955A (en) * 2019-07-01 2019-10-29 东南大学 A kind of inter-library speech-emotion recognition method based on Depth Domain adaptability convolutional neural networks
CN111048117A (en) * 2019-12-05 2020-04-21 南京信息工程大学 Cross-library speech emotion recognition method based on target adaptation subspace learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUAN ZONG ET AL.: "Cross-Corpus Speech Emotion Recognition Based on Domain-adaptive Least Squares Regression", 《IEEE》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112397092A (en) * 2020-11-02 2021-02-23 天津理工大学 Unsupervised cross-library speech emotion recognition method based on field adaptive subspace
CN113112994A (en) * 2021-04-21 2021-07-13 江苏师范大学 Cross-corpus emotion recognition method based on graph convolution neural network
CN113112994B (en) * 2021-04-21 2023-11-07 江苏师范大学 Cross-corpus emotion recognition method based on graph convolution neural network

Also Published As

Publication number Publication date
CN111583966B (en) 2022-06-28

Similar Documents

Publication Publication Date Title
Bishay et al. Schinet: Automatic estimation of symptoms of schizophrenia from facial behaviour analysis
Morel et al. Time-series averaging using constrained dynamic time warping with tolerance
CN107526799B (en) Knowledge graph construction method based on deep learning
Guu et al. Traversing knowledge graphs in vector space
JP6977901B2 (en) Learning material recommendation method, learning material recommendation device and learning material recommendation program
Karnati et al. LieNet: A deep convolution neural network framework for detecting deception
CN111126263B (en) Electroencephalogram emotion recognition method and device based on double-hemisphere difference model
CN111583966B (en) Cross-database speech emotion recognition method and device based on joint distribution least square regression
CN107506350B (en) Method and equipment for identifying information
CN113705092B (en) Disease prediction method and device based on machine learning
CN113112994B (en) Cross-corpus emotion recognition method based on graph convolution neural network
CN107491729A (en) The Handwritten Digit Recognition method of convolutional neural networks based on cosine similarity activation
Takano et al. Bigram-based natural language model and statistical motion symbol model for scalable language of humanoid robots
Zhang et al. Intelligent Facial Action and emotion recognition for humanoid robots
CN112397092A (en) Unsupervised cross-library speech emotion recognition method based on field adaptive subspace
Samsudin Modeling student’s academic performance during covid-19 based on classification in support vector machine
Patro et al. Uncertainty class activation map (U-CAM) using gradient certainty method
CN116244474A (en) Learner learning state acquisition method based on multi-mode emotion feature fusion
CN110069601A (en) Mood determination method and relevant apparatus
Sun et al. Automatic inference of mental states from spontaneous facial expressions
CN116010563A (en) Multi-round dialogue data analysis method, electronic equipment and storage medium
Elbarougy et al. Feature selection method for real-time speech emotion recognition
Ren et al. Subject-independent natural action recognition
Utami et al. The EfficientNet Performance for Facial Expressions Recognition
Hachaj et al. Application of hidden markov models and gesture description language classifiers to oyama karate techniques recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant