CN111583966A - Cross-database speech emotion recognition method and device based on joint distribution least square regression - Google Patents
Cross-database speech emotion recognition method and device based on joint distribution least square regression
- Publication number
- CN111583966A (application CN202010372728.2A)
- Authority
- CN
- China
- Prior art keywords
- database
- voice
- speech
- matrix
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
- G10L25/18—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
- G10L25/21—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
- G10L25/24—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/27—Speech or voice analysis techniques characterised by the analysis technique
Abstract
The invention discloses a cross-database speech emotion recognition method and device based on joint distribution least squares regression. The method comprises the following steps: (1) acquiring a training database and a test database, wherein the training speech database comprises a plurality of speech segments and their corresponding speech emotion category labels, and the test database comprises only speech segments to be recognized; (2) processing each speech segment with a plurality of acoustic low-level descriptors and statistical functions, taking each resulting statistic as an emotion feature, and assembling the emotion features into the feature vector of the corresponding speech segment; (3) establishing a least squares regression model based on joint distribution and training it jointly on the training database and the test database to obtain a sparse projection matrix; (4) for each speech segment to be recognized, obtaining its feature vector according to step (2) and obtaining the corresponding speech emotion category label with the learned sparse projection matrix. The invention adapts well to different environments and achieves higher accuracy.
Description
Technical Field
The invention relates to speech emotion recognition, in particular to a cross-database speech emotion recognition method and device based on joint distribution least square regression.
Background
The purpose of speech emotion recognition is to give a machine enough intelligence to extract the speaker's emotional state (such as happiness, fear or sadness) from the speaker's speech; it is therefore an important link in human-computer interaction and has great research potential and development prospects. For example, if a driver's mental state is detected from the driver's speech, expression and behavior, the driver can be reminded in time to stay focused and avoid dangerous driving; detecting the speech emotion of a speaker during human-computer interaction makes the dialogue smoother and more attentive to the speaker's state of mind; wearable devices can give more timely and more appropriate feedback according to the wearer's emotional state; and in fields such as classroom teaching and companion care, speech emotion recognition plays an increasingly important role.
Traditional speech emotion recognition is trained and tested on the same speech database, so the training data and the test data follow the same distribution. In real life, however, a trained model must face different environments, and the recording background is mixed with various kinds of noise. Cross-database speech emotion recognition therefore faces significant challenges, and making a trained model adapt well to different environments has become a problem to be solved by academia and industry.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a cross-database speech emotion recognition method and device based on joint distribution least square regression.
The technical scheme is as follows: the cross-database speech emotion recognition method based on the joint distribution least square regression comprises the following steps:
(1) acquiring two voice databases which are respectively used as a training database and a testing database, wherein the training voice database comprises a plurality of voice fragments and corresponding voice emotion category labels, and the testing database only comprises a plurality of voice fragments to be recognized;
(2) processing each speech segment with a plurality of acoustic low-level descriptors and statistical functions, taking each resulting statistic as an emotion feature, and assembling the emotion features into a feature vector for the corresponding speech segment;
(3) establishing a least squares regression model based on joint distribution, and jointly training the model with the training database of known labels and the test database of unknown labels to obtain a sparse projection matrix connecting the speech segments and the speech emotion category labels;
(4) for each speech segment to be recognized in the test database, obtaining a feature vector according to step (2), and obtaining the corresponding speech emotion category label with the learned sparse projection matrix.
Further, the step (2) specifically comprises:
(2-1) calculating, for each speech segment, the values of 16 acoustic low-level descriptors and their corresponding delta (first-order difference) parameters; the 16 acoustic low-level descriptors are: zero-crossing rate of the time signal, root-mean-square frame energy, fundamental frequency, harmonics-to-noise ratio, and Mel-frequency cepstral coefficients 1-12;
(2-2) applying 12 statistical functions to the 16 acoustic low-level descriptors (and their deltas) of each speech segment; the 12 statistical functions are: arithmetic mean, standard deviation, kurtosis, skewness, maximum, minimum, relative position of the maximum, relative position of the minimum, range, two linear regression coefficients, and their mean square error;
(2-3) taking each statistic obtained in this way as an emotion feature, and assembling the emotion features into the feature vector of the corresponding speech segment.
Further, the least squares regression model based on joint distribution established in step (3) is:

$$\min_{P}\ \left\|P^{T}X_{s}-L_{s}\right\|_{F}^{2}+\lambda\left\|P\right\|_{2,1}+\mu\left(\left\|P^{T}\left(\frac{1}{n}X_{s}\mathbf{1}_{n}-\frac{1}{m}X_{t}\mathbf{1}_{m}\right)\right\|_{2}^{2}+\sum_{c=1}^{C}\left\|P^{T}\left(\frac{1}{n_{c}}X_{s}^{c}\mathbf{1}_{n_{c}}-\frac{1}{m_{c}}X_{t}^{c}\mathbf{1}_{m_{c}}\right)\right\|_{2}^{2}\right)$$

where min_P means finding the matrix P that minimizes the expression; L_s ∈ R^{C×n} is the speech emotion category label matrix of the training-database speech segments, C is the number of speech emotion categories, and n is the number of training-database speech segments; X_s ∈ R^{d×n} holds the feature vectors of the training-database speech segments, with d the feature dimension; P ∈ R^{d×C} is the sparse projection matrix and P^T is its transpose; ‖·‖_F^2 is the squared Frobenius norm; λ and μ are balance coefficients controlling the regularization and distribution-alignment terms; X_t ∈ R^{d×m} holds the feature vectors of the test-database speech segments, with m their number; X_s^c and X_t^c are the feature matrices of the speech segments whose emotion category is c in the training database and the test database respectively, with n_c and m_c the corresponding segment counts; 1_n denotes the all-ones column vector of length n; and ‖·‖_{2,1} is the ℓ_{2,1} norm.
Further, the joint training in step (3) using the training database of known labels and the test database of unknown labels specifically comprises:
(3-1) converting the least squares regression model, by introducing an auxiliary variable Q, into:

$$\min_{P,Q}\ \left\|Q^{T}X_{s}-L_{s}\right\|_{F}^{2}+\lambda\left\|P\right\|_{2,1}+\mu\,\Omega(Q)\quad\text{s.t.}\ P=Q$$

where Ω(Q) denotes the joint (marginal and conditional) distribution alignment terms of the original model written in terms of Q;
(3-2) estimating, through the converted least squares regression model, the pseudo-label matrix $\hat{L}_t$ formed by the speech emotion category pseudo labels of all speech segments in the test database;
(3-4) based on the pseudo-label matrix $\hat{L}_t$, solving the converted least squares regression model with the augmented Lagrange multiplier method to obtain the projection matrix estimate $\hat{P}$;
(3-5) based on the projection matrix estimate $\hat{P}$, updating the pseudo-label matrix $\hat{L}_t$ with the following rule: compute the intermediate auxiliary variable $Z=\hat{P}^{T}X_{t}$; for the i-th test speech segment, find the row index $j^{*}=\arg\max_{j}Z_{ji}$ of the largest element in the i-th column of Z, and set the element of $\hat{L}_t$ in row k and column i to 1 if $k=j^{*}$ and to 0 otherwise;
(3-6) using the updated pseudo-label matrix $\hat{L}_t$, returning to step (3-4) until the preset number of iterations is reached; after the loop ends, the resulting projection matrix estimate $\hat{P}$ is taken as the learned projection matrix P.
Further, the step (3-2) specifically comprises:
(3-2-1) obtaining an initial value $\hat{P}_{0}$ of the projection matrix estimate from the converted least squares regression model with the regularization term omitted;
(3-2-2) based on the initial projection matrix value $\hat{P}_{0}$, obtaining the initial pseudo-label matrix with the following rule: compute the intermediate auxiliary variable $Z_{0}=\hat{P}_{0}^{T}X_{t}$, and set the element of the initial pseudo-label matrix in row k and column i to 1 if row k holds the largest element of the i-th column of $Z_{0}$, and to 0 otherwise. Further, the step (3-4) specifically comprises:
(3-4-1) obtaining the augmented Lagrangian of the least squares regression model:

$$\mathcal{L}(P,Q,T)=\left\|Q^{T}X_{s}-L_{s}\right\|_{F}^{2}+\lambda\left\|P\right\|_{2,1}+\mu\,\Omega(Q)+\mathrm{tr}\left(T^{T}(P-Q)\right)+\frac{k}{2}\left\|P-Q\right\|_{F}^{2}$$

where T is the Lagrange multiplier, k > 0 is the penalty parameter, and tr(·) denotes the trace of a matrix;
(3-4-2) keeping P, T and k fixed, updating Q: extracting the part of the augmented Lagrangian related to the variable Q gives a quadratic subproblem in Q, whose solution yields the closed-form update of Q;
(3-4-3) keeping Q, T and k fixed, updating P: extracting the part of the augmented Lagrangian related to the variable P gives an ℓ_{2,1}-regularized subproblem, which is solved column by column to update P, where P_i is the i-th column vector of P and T_i is the i-th column vector of T;
(3-4-4) keeping Q and P fixed, updating T and k:

T = T + k(P − Q)

k = min(ρk, k_max)

where k_max is the preset maximum value of k and ρ > 1 is a scaling factor;
(3-4-5) checking for convergence: check whether ‖P − Q‖_∞ < ε; if not, return to step (3-4-2); if the condition holds, or if the number of iterations exceeds the set value, take the current value of P as the obtained sparse projection matrix, where ‖·‖_∞ denotes the largest element (in absolute value) of a matrix and ε denotes the convergence threshold.
Further, the speech emotion category label of a test-database speech segment in step (4) is calculated with the following formula:

$$j^{*}=\arg\max_{j}\left(P^{T}X_{t}\right)_{ji}$$

where P is the learned final projection matrix, X_t is the set of feature vectors of the test-database speech segments, i.e., the speech segments to be recognized, $P^{T}X_{t}$ is an intermediate auxiliary variable, and j* is the speech emotion category label assigned to the i-th speech segment to be recognized.
The cross-database speech emotion recognition device based on the joint distribution least square regression comprises a processor and a computer program which is stored on a memory and can run on the processor, wherein the processor realizes the method when executing the program.
Beneficial effects: compared with the prior art, the invention has the notable advantage that, because the cross-database speech emotion recognition method and device learn across databases, they adapt well to different environments and give more accurate recognition results.
Drawings
FIG. 1 is a schematic flow diagram of a cross-database speech emotion recognition method based on joint distribution least square regression provided by the invention.
Detailed Description
The embodiment provides a cross-database speech emotion recognition method based on joint distribution least square regression, as shown in fig. 1, including the following steps:
(1) the method comprises the steps of obtaining two voice databases which are respectively used as a training database and a testing database, wherein the training voice database comprises a plurality of voice fragments and corresponding voice emotion category labels, and the testing database only comprises a plurality of voice fragments to be recognized.
In this embodiment, three speech emotion databases commonly used in emotional speech recognition are adopted: Berlin, eNTERFACE and CASIA. Because the three databases contain different emotion categories, data are selected for each pairwise comparison. For Berlin versus eNTERFACE, 375 and 1077 samples are selected respectively, with 5 emotion categories (anger, fear, happiness, disgust and sadness); for Berlin versus CASIA, 408 and 1000 samples are selected respectively, with the same 5 emotion categories; for eNTERFACE versus CASIA, 1072 and 1000 samples are selected respectively, again with the same 5 emotion categories.
(2) Processing each speech segment with a plurality of acoustic low-level descriptors and statistical functions, taking each resulting statistic as an emotion feature, and assembling the emotion features into a feature vector for the corresponding speech segment.
The method specifically comprises the following steps:
(2-1) calculating, for each speech segment, the values of 16 acoustic low-level descriptors and their corresponding delta (first-order difference) parameters; the 16 acoustic low-level descriptors are: zero-crossing rate of the time signal, root-mean-square frame energy, fundamental frequency, harmonics-to-noise ratio, and Mel-frequency cepstral coefficients 1-12; the descriptors come from the feature set provided by the INTERSPEECH 2009 Emotion Challenge;
(2-2) for each speech segment, applying 12 statistical functions to its 16 acoustic low-level descriptors (and their deltas) using the openSMILE toolkit; the 12 statistical functions are: arithmetic mean, standard deviation, kurtosis, skewness, maximum, minimum, relative position of the maximum, relative position of the minimum, range, two linear regression coefficients, and their mean square error;
(2-3) taking each statistic obtained in this way as one emotion feature; the resulting 16 × 2 × 12 = 384 emotion features form the feature vector of the corresponding speech segment.
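The feature extraction above can be illustrated with a short Python sketch. It is a minimal illustration rather than the patented implementation: it assumes the 16 low-level descriptor contours and their 16 delta contours have already been extracted for a segment (for example with openSMILE), and the helper names is09_functionals and segment_features are introduced here for illustration only.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def is09_functionals(contour: np.ndarray) -> np.ndarray:
    """Apply the 12 statistical functionals to one LLD (or delta) contour."""
    t = np.arange(len(contour))
    # two linear regression coefficients (slope, offset) and their mean square error
    slope, offset = np.polyfit(t, contour, 1)
    mse = np.mean((contour - (slope * t + offset)) ** 2)
    return np.array([
        contour.mean(), contour.std(),
        kurtosis(contour), skew(contour),
        contour.max(), contour.min(),
        contour.argmax() / len(contour),   # relative position of the maximum
        contour.argmin() / len(contour),   # relative position of the minimum
        contour.max() - contour.min(),     # range
        slope, offset, mse,
    ])

def segment_features(llds: np.ndarray) -> np.ndarray:
    """llds: (32, T) array holding the 16 LLD contours plus their 16 delta contours
    for one speech segment; returns the 32 x 12 = 384-dimensional feature vector."""
    return np.concatenate([is09_functionals(c) for c in llds])
```

Applying the 12 functionals to each of the 32 contours yields the 384 emotion features described in step (2-3).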
(3) And establishing a least square regression model based on joint distribution, and performing joint training on the least square regression model by using a training database of known labels and a test database of unknown labels to obtain a sparse projection matrix connecting the voice segments and the voice emotion category labels.
Wherein the established least squares regression model is as follows:

$$\min_{P}\ \left\|P^{T}X_{s}-L_{s}\right\|_{F}^{2}+\lambda\left\|P\right\|_{2,1}+\mu\left(\left\|P^{T}\left(\frac{1}{n}X_{s}\mathbf{1}_{n}-\frac{1}{m}X_{t}\mathbf{1}_{m}\right)\right\|_{2}^{2}+\sum_{c=1}^{C}\left\|P^{T}\left(\frac{1}{n_{c}}X_{s}^{c}\mathbf{1}_{n_{c}}-\frac{1}{m_{c}}X_{t}^{c}\mathbf{1}_{m_{c}}\right)\right\|_{2}^{2}\right)$$

where min_P means finding the matrix P that minimizes the expression; L_s ∈ R^{C×n} is the speech emotion category label matrix of the training-database speech segments, C is the number of speech emotion categories, and n is the number of training-database speech segments; X_s ∈ R^{d×n} holds the feature vectors of the training-database speech segments, with d the feature dimension; P ∈ R^{d×C} is the sparse projection matrix and P^T is its transpose; ‖·‖_F^2 is the squared Frobenius norm; λ and μ are balance coefficients controlling the regularization and distribution-alignment terms; X_t ∈ R^{d×m} holds the feature vectors of the test-database speech segments, with m their number; X_s^c and X_t^c are the feature matrices of the speech segments whose emotion category is c in the training database and the test database respectively, with n_c and m_c the corresponding segment counts; 1_n denotes the all-ones column vector of length n; and ‖·‖_{2,1} is the ℓ_{2,1} norm.
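For illustration, the objective written above can be evaluated for a candidate projection matrix with the following Python sketch; lam and mu stand for the balance coefficients, the function name jdlsr_objective is introduced here, and the class-wise target means are formed with the pseudo labels used in the joint training of step (3).

```python
import numpy as np

def jdlsr_objective(P, Xs, Ls, Xt, Lt_pseudo, lam=1.0, mu=1.0):
    """Evaluate the joint-distribution least squares regression objective above.
    Xs: d x n source features, Ls: C x n one-hot source labels,
    Xt: d x m target features, Lt_pseudo: C x m one-hot pseudo labels."""
    C = Ls.shape[0]
    # least squares regression term
    reg = np.linalg.norm(P.T @ Xs - Ls, 'fro') ** 2
    # L2,1-norm sparsity term: sum of the L2 norms of the rows of P
    l21 = np.sum(np.linalg.norm(P, axis=1))
    # marginal distribution alignment: projected overall mean difference
    marginal = np.sum((P.T @ (Xs.mean(axis=1) - Xt.mean(axis=1))) ** 2)
    # conditional distribution alignment: class-wise projected mean differences,
    # with true labels on the source side and pseudo labels on the target side
    conditional = 0.0
    for c in range(C):
        sc, tc = Ls[c] == 1, Lt_pseudo[c] == 1
        if sc.any() and tc.any():
            diff = Xs[:, sc].mean(axis=1) - Xt[:, tc].mean(axis=1)
            conditional += np.sum((P.T @ diff) ** 2)
    return reg + lam * l21 + mu * (marginal + conditional)
```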
The joint training using the training database of known labels and the test database of unknown labels specifically comprises the following steps:
(3-1) converting the least squares regression model, by introducing an auxiliary variable Q, into:

$$\min_{P,Q}\ \left\|Q^{T}X_{s}-L_{s}\right\|_{F}^{2}+\lambda\left\|P\right\|_{2,1}+\mu\,\Omega(Q)\quad\text{s.t.}\ P=Q$$

where Ω(Q) denotes the joint (marginal and conditional) distribution alignment terms of the original model written in terms of Q;
(3-2) estimating, through the converted least squares regression model, the pseudo-label matrix $\hat{L}_t$ formed by the speech emotion category pseudo labels of all speech segments in the test database;
(3-4) based on the pseudo-label matrix $\hat{L}_t$, solving the converted least squares regression model with the augmented Lagrange multiplier method to obtain the projection matrix estimate $\hat{P}$;
(3-5) based on the projection matrix estimate $\hat{P}$, updating the pseudo-label matrix $\hat{L}_t$ with the following rule: compute the intermediate auxiliary variable $Z=\hat{P}^{T}X_{t}$; for the i-th test speech segment, find the row index $j^{*}=\arg\max_{j}Z_{ji}$ of the largest element in the i-th column of Z, and set the element of $\hat{L}_t$ in row k and column i to 1 if $k=j^{*}$ and to 0 otherwise;
(3-6) using the updated pseudo-label matrix $\hat{L}_t$, returning to step (3-4) until the preset number of iterations is reached; after the loop ends, the resulting projection matrix estimate $\hat{P}$ is taken as the learned projection matrix P.
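The pseudo-label update of step (3-5) amounts to a column-wise arg-max followed by one-hot encoding; a minimal numpy sketch (the function name is introduced here for illustration):

```python
import numpy as np

def update_pseudo_labels(P_hat: np.ndarray, Xt: np.ndarray) -> np.ndarray:
    """Step (3-5): recompute the one-hot pseudo-label matrix for the test database.
    P_hat: d x C projection estimate, Xt: d x m test feature matrix."""
    Z = P_hat.T @ Xt                          # intermediate auxiliary variable, C x m
    j_star = Z.argmax(axis=0)                 # row index of the largest element per column
    Lt = np.zeros_like(Z)
    Lt[j_star, np.arange(Z.shape[1])] = 1.0   # one 1 per column, all other rows 0
    return Lt
```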
Further, the step (3-2) specifically comprises:
(3-2-1) obtaining an initial value $\hat{P}_{0}$ of the projection matrix estimate from the converted least squares regression model with the regularization term omitted;
(3-2-2) based on the initial projection matrix value $\hat{P}_{0}$, obtaining the initial pseudo-label matrix with the following rule: compute the intermediate auxiliary variable $Z_{0}=\hat{P}_{0}^{T}X_{t}$, and set the element of the initial pseudo-label matrix in row k and column i to 1 if row k holds the largest element of the i-th column of $Z_{0}$, and to 0 otherwise. Each column of the pseudo-label matrix therefore has a 1 only in the row of its corresponding category and 0 in all other rows.
The step (3-4) specifically comprises the following steps:
(3-4-1) obtaining the augmented Lagrangian of the least squares regression model:

$$\mathcal{L}(P,Q,T)=\left\|Q^{T}X_{s}-L_{s}\right\|_{F}^{2}+\lambda\left\|P\right\|_{2,1}+\mu\,\Omega(Q)+\mathrm{tr}\left(T^{T}(P-Q)\right)+\frac{k}{2}\left\|P-Q\right\|_{F}^{2}$$

where T is the Lagrange multiplier, k > 0 is the penalty parameter, and tr(·) denotes the trace of a matrix;
(3-4-2) keeping P, T and k fixed, updating Q: extracting the part of the augmented Lagrangian related to the variable Q gives a quadratic subproblem in Q, whose solution yields the closed-form update of Q;
(3-4-3) keeping Q, T and k fixed, updating P: extracting the part of the augmented Lagrangian related to the variable P gives an ℓ_{2,1}-regularized subproblem, which is solved column by column to update P, where P_i is the i-th column vector of P and T_i is the i-th column vector of T;
(3-4-4) keeping Q and P fixed, updating T and k:

T = T + k(P − Q)

k = min(ρk, k_max)

where k_max is the preset maximum value of k and ρ > 1 is a scaling factor;
(3-4-5) checking for convergence: check whether ‖P − Q‖_∞ < ε; if not, return to step (3-4-2); if the condition holds, or if the number of iterations exceeds the set value, take the current value of P as the obtained sparse projection matrix, where ‖·‖_∞ denotes the largest element (in absolute value) of a matrix and ε denotes the convergence threshold.
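The outer structure of the augmented Lagrange multiplier iteration in step (3-4) can be sketched as follows. The closed-form Q and P updates of steps (3-4-2) and (3-4-3) are represented by placeholder callables solve_Q and solve_P (closures over the data), since their explicit formulas are not reproduced in this text; only the multiplier update, penalty update and stopping rule of steps (3-4-4) and (3-4-5) are made explicit. The function name alm_solve is introduced here for illustration.

```python
import numpy as np

def alm_solve(solve_Q, solve_P, d, C,
              rho=1.1, k=1e-2, k_max=1e6, eps=1e-6, max_iter=500):
    """Augmented Lagrange multiplier loop of step (3-4).
    solve_Q(P, T, k) and solve_P(Q, T, k) are placeholder closures for the
    subproblem solutions of steps (3-4-2) and (3-4-3)."""
    P = np.zeros((d, C))
    T = np.zeros((d, C))                 # Lagrange multiplier
    for _ in range(max_iter):
        Q = solve_Q(P, T, k)             # step (3-4-2): quadratic subproblem in Q
        P = solve_P(Q, T, k)             # step (3-4-3): column-wise update of P
        T = T + k * (P - Q)              # step (3-4-4): multiplier update
        k = min(rho * k, k_max)          #               penalty update
        if np.abs(P - Q).max() < eps:    # step (3-4-5): ||P - Q||_inf convergence check
            break
    return P
```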
(4) For each speech segment to be recognized in the test database, the feature vector is obtained according to step (2), and the corresponding speech emotion category label is obtained with the learned sparse projection matrix.
The specific method is to calculate the category label with the following formula:

$$j^{*}=\arg\max_{j}\left(P^{T}X_{t}\right)_{ji}$$

where P is the final projection matrix learned in step (3), X_t is the set of feature vectors of the test-database speech segments, i.e., the speech segments to be recognized, $P^{T}X_{t}$ is an intermediate auxiliary variable, and j* is the speech emotion category label assigned to the i-th speech segment to be recognized.
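As a minimal sketch of step (4), the category labels of all test segments can be read off from the columns of P^T X_t (the function name is introduced here for illustration):

```python
import numpy as np

def predict_labels(P: np.ndarray, Xt: np.ndarray) -> np.ndarray:
    """Assign each test speech segment the emotion category whose row of
    P^T Xt holds the largest value in that segment's column."""
    return (P.T @ Xt).argmax(axis=0)    # one category index per test segment
```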
The embodiment also provides a cross-database speech emotion recognition device based on joint distribution least square regression, which comprises a processor and a computer program stored on a memory and capable of running on the processor, wherein the processor implements the method when executing the computer program.
In order to verify the effectiveness of the invention, experiments were carried out on each pair of the speech emotion databases Berlin, eNTERFACE and CASIA. In each set of experiments, the two databases are treated as a source domain and a target domain respectively, where the source domain provides training data and labels as the training set and the target domain provides only test data, without any labels, as the test set. To evaluate recognition accuracy more reliably, two metrics are adopted: unweighted average recall (UAR) and weighted average recall (WAR). UAR is the number of correct predictions of each class divided by the number of test samples of that class, averaged over all classes; WAR is the number of all correct predictions divided by the number of all test samples, without considering the size of each class. Considering UAR and WAR together effectively avoids the influence of class imbalance. As comparison methods, several classical and effective subspace learning algorithms are selected: SVM, TCA, TKL, DaLSR and DoSL. The experimental results are shown in Table 1 below, where the proposed method is denoted JDLSR, each data set pair is listed as source domain/target domain, E, B and C abbreviate the eNTERFACE, Berlin and CASIA data sets respectively, and the evaluation criteria are UAR/WAR.
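UAR and WAR as defined above can be computed with scikit-learn; this is a minimal sketch under the usual reading that WAR equals overall accuracy (recall weighted by class size), with the helper name uar_war introduced here.

```python
from sklearn.metrics import accuracy_score, recall_score

def uar_war(y_true, y_pred):
    """UAR: mean of per-class recalls, ignoring class sizes.
    WAR: fraction of all correct predictions (recall weighted by class size)."""
    uar = recall_score(y_true, y_pred, average='macro')
    war = accuracy_score(y_true, y_pred)
    return uar, war
```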
Experimental results show that the cross-database speech emotion recognition method provided by the invention achieves a higher recognition rate.
TABLE 1
Claims (8)
1. A cross-database speech emotion recognition method based on joint distribution least square regression is characterized by comprising the following steps:
(1) acquiring two voice databases which are respectively used as a training database and a testing database, wherein the training voice database comprises a plurality of voice fragments and corresponding voice emotion category labels, and the testing database only comprises a plurality of voice fragments to be recognized;
(2) processing each speech segment with a plurality of acoustic low-level descriptors and statistical functions, taking each resulting statistic as an emotion feature, and assembling the emotion features into a feature vector for the corresponding speech segment;
(3) establishing a least squares regression model based on joint distribution, and jointly training the model with the training database of known labels and the test database of unknown labels to obtain a sparse projection matrix connecting the speech segments and the speech emotion category labels;
(4) for each speech segment to be recognized in the test database, obtaining a feature vector according to step (2), and obtaining the corresponding speech emotion category label with the learned sparse projection matrix.
2. The cross-database speech emotion recognition method based on joint distribution least squares regression, as claimed in claim 1, wherein: the step (2) specifically comprises the following steps:
(2-1) calculating, for each speech segment, the values of 16 acoustic low-level descriptors and their corresponding delta (first-order difference) parameters; the 16 acoustic low-level descriptors are: zero-crossing rate of the time signal, root-mean-square frame energy, fundamental frequency, harmonics-to-noise ratio, and Mel-frequency cepstral coefficients 1-12;
(2-2) applying 12 statistical functions to the 16 acoustic low-level descriptors (and their deltas) of each speech segment; the 12 statistical functions are: arithmetic mean, standard deviation, kurtosis, skewness, maximum, minimum, relative position of the maximum, relative position of the minimum, range, two linear regression coefficients, and their mean square error;
(2-3) taking each statistic obtained in this way as an emotion feature, and assembling the emotion features into the feature vector of the corresponding speech segment.
3. The cross-database speech emotion recognition method based on joint distribution least squares regression, as claimed in claim 1, wherein: the least square regression model established in the step (3) is as follows:
$$\min_{P}\ \left\|P^{T}X_{s}-L_{s}\right\|_{F}^{2}+\lambda\left\|P\right\|_{2,1}+\mu\left(\left\|P^{T}\left(\frac{1}{n}X_{s}\mathbf{1}_{n}-\frac{1}{m}X_{t}\mathbf{1}_{m}\right)\right\|_{2}^{2}+\sum_{c=1}^{C}\left\|P^{T}\left(\frac{1}{n_{c}}X_{s}^{c}\mathbf{1}_{n_{c}}-\frac{1}{m_{c}}X_{t}^{c}\mathbf{1}_{m_{c}}\right)\right\|_{2}^{2}\right)$$

where min_P means finding the matrix P that minimizes the expression; L_s ∈ R^{C×n} is the speech emotion category label matrix of the training-database speech segments, C is the number of speech emotion categories, and n is the number of training-database speech segments; X_s ∈ R^{d×n} holds the feature vectors of the training-database speech segments, with d the feature dimension; P ∈ R^{d×C} is the sparse projection matrix and P^T is its transpose; ‖·‖_F^2 is the squared Frobenius norm; λ and μ are balance coefficients controlling the regularization and distribution-alignment terms; X_t ∈ R^{d×m} holds the feature vectors of the test-database speech segments, with m their number; X_s^c and X_t^c are the feature matrices of the speech segments whose emotion category is c in the training database and the test database respectively, with n_c and m_c the corresponding segment counts; 1_n denotes the all-ones column vector of length n; and ‖·‖_{2,1} is the ℓ_{2,1} norm.
4. The cross-database speech emotion recognition method based on joint distribution least squares regression, as claimed in claim 3, wherein: the joint training in step (3) using the training database of known labels and the test database of unknown labels specifically comprises the following steps:
(3-1) converting the least squares regression model, by introducing an auxiliary variable Q, into:

$$\min_{P,Q}\ \left\|Q^{T}X_{s}-L_{s}\right\|_{F}^{2}+\lambda\left\|P\right\|_{2,1}+\mu\,\Omega(Q)\quad\text{s.t.}\ P=Q$$

where Ω(Q) denotes the joint (marginal and conditional) distribution alignment terms of the original model written in terms of Q;
(3-2) estimating, through the converted least squares regression model, the pseudo-label matrix $\hat{L}_t$ formed by the speech emotion category pseudo labels of all speech segments in the test database;
(3-4) based on the pseudo-label matrix $\hat{L}_t$, solving the converted least squares regression model with the augmented Lagrange multiplier method to obtain the projection matrix estimate $\hat{P}$;
(3-5) based on the projection matrix estimate $\hat{P}$, updating the pseudo-label matrix $\hat{L}_t$ with the following rule: compute the intermediate auxiliary variable $Z=\hat{P}^{T}X_{t}$; for the i-th test speech segment, find the row index $j^{*}=\arg\max_{j}Z_{ji}$ of the largest element in the i-th column of Z, and set the element of $\hat{L}_t$ in row k and column i to 1 if $k=j^{*}$ and to 0 otherwise.
5. The cross-database speech emotion recognition method based on joint distribution least squares regression as claimed in claim 4, wherein: the step (3-2) specifically comprises the following steps:
(3-2-1) obtaining an initial value $\hat{P}_{0}$ of the projection matrix estimate from the converted least squares regression model with the regularization term omitted;
(3-2-2) based on the initial projection matrix value $\hat{P}_{0}$, obtaining the initial pseudo-label matrix by computing the intermediate auxiliary variable $Z_{0}=\hat{P}_{0}^{T}X_{t}$ and setting the element of the initial pseudo-label matrix in row k and column i to 1 if row k holds the largest element of the i-th column of $Z_{0}$, and to 0 otherwise.
6. The cross-database speech emotion recognition method based on joint distribution least squares regression as claimed in claim 4, wherein: the step (3-4) specifically comprises the following steps:
(3-4-1) obtaining the augmented Lagrangian of the least squares regression model:

$$\mathcal{L}(P,Q,T)=\left\|Q^{T}X_{s}-L_{s}\right\|_{F}^{2}+\lambda\left\|P\right\|_{2,1}+\mu\,\Omega(Q)+\mathrm{tr}\left(T^{T}(P-Q)\right)+\frac{k}{2}\left\|P-Q\right\|_{F}^{2}$$

where T is the Lagrange multiplier, k > 0 is the penalty parameter, and tr(·) denotes the trace of a matrix;
(3-4-2) keeping P, T and k fixed, updating Q: extracting the part of the augmented Lagrangian related to the variable Q gives a quadratic subproblem in Q, whose solution yields the closed-form update of Q;
(3-4-3) keeping Q, T and k fixed, updating P: extracting the part of the augmented Lagrangian related to the variable P gives an ℓ_{2,1}-regularized subproblem, which is solved column by column to update P, where P_i is the i-th column vector of P and T_i is the i-th column vector of T;
(3-4-4) keeping Q and P fixed, updating T and k:

T = T + k(P − Q)

k = min(ρk, k_max)

where k_max is the preset maximum value of k and ρ > 1 is a scaling factor;
(3-4-5) checking for convergence: check whether ‖P − Q‖_∞ < ε; if not, return to step (3-4-2); if the condition holds, or if the number of iterations exceeds the set value, take the current value of P as the obtained sparse projection matrix, where ‖·‖_∞ denotes the largest element (in absolute value) of a matrix and ε denotes the convergence threshold.
7. The cross-database speech emotion recognition method based on joint distribution least squares regression, as claimed in claim 1, wherein: the method for calculating the voice emotion category label of the test database in the step (4) comprises the following steps:
calculated using the formula:

$$j^{*}=\arg\max_{j}\left(P^{T}X_{t}\right)_{ji}$$

where P is the projection matrix learned in step (3), X_t is the set of feature vectors of the test-database speech segments, i.e., the speech segments to be recognized, $P^{T}X_{t}$ is an intermediate auxiliary variable, and j* is the speech emotion category label assigned to the i-th speech segment to be recognized.
8. A cross-database speech emotion recognition apparatus based on joint distribution least squares regression, comprising a processor and a computer program stored on a memory and operable on the processor, wherein: the processor, when executing the program, implements the method of any of claims 1-6.
Priority Applications (1)
- CN202010372728.2A (CN111583966B) | Priority date: 2020-05-06 | Filing date: 2020-05-06 | Cross-database speech emotion recognition method and device based on joint distribution least square regression
Publications (2)
- CN111583966A, published 2020-08-25
- CN111583966B, published 2022-06-28
Family
- ID=72113186
Family Applications (1)
- CN202010372728.2A, filed 2020-05-06, patent CN111583966B, status: Active
Country Status (1)
- CN: CN111583966B (en)
Citations (6)
- US 2012/0221333 A1 (priority 2011-02-24, published 2012-08-30), International Business Machines Corporation: Phonetic Features for Speech Recognition
- CN103594084A (priority 2013-10-23, published 2014-02-19): Voice emotion recognition method and system based on joint penalty sparse representation dictionary learning
- US 9892726 B1 (priority 2014-12-17, published 2018-02-13), Amazon Technologies, Inc.: Class-based discriminative training of speech models
- CN110120231A (priority 2019-05-15, published 2019-08-13): Across corpus emotion identification method based on adaptive semi-supervised Non-negative Matrix Factorization
- CN110390955A (priority 2019-07-01, published 2019-10-29): A kind of inter-library speech-emotion recognition method based on Depth Domain adaptability convolutional neural networks
- CN111048117A (priority 2019-12-05, published 2020-04-21): Cross-library speech emotion recognition method based on target adaptation subspace learning
Non-Patent Citations (1)
- Yuan Zong et al., "Cross-Corpus Speech Emotion Recognition Based on Domain-adaptive Least Squares Regression," IEEE.
Cited By (3)
- CN112397092A (priority 2020-11-02, published 2021-02-23): Unsupervised cross-library speech emotion recognition method based on field adaptive subspace
- CN113112994A (priority 2021-04-21, published 2021-07-13): Cross-corpus emotion recognition method based on graph convolution neural network
- CN113112994B (priority 2021-04-21, granted 2023-11-07): Cross-corpus emotion recognition method based on graph convolution neural network
Also Published As
- CN111583966A, published 2020-08-25
- CN111583966B, published 2022-06-28
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant