CN111583966A - Cross-database speech emotion recognition method and device based on joint distribution least square regression - Google Patents

Cross-database speech emotion recognition method and device based on joint distribution least square regression

Info

Publication number
CN111583966A
Authority
CN
China
Prior art keywords
database
voice
speech
matrix
training
Prior art date
Legal status: Granted
Application number
CN202010372728.2A
Other languages
Chinese (zh)
Other versions
CN111583966B (en
Inventor
宗源
江林
张佳成
郑文明
江星洵
刘佳腾
Current Assignee: Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN202010372728.2A
Publication of CN111583966A
Application granted
Publication of CN111583966B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48 - specially adapted for particular use
    • G10L25/51 - for comparison or discrimination
    • G10L25/63 - for estimating an emotional state
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/18 - the extracted parameters being spectral information of each sub-band
    • G10L25/21 - the extracted parameters being power information
    • G10L25/24 - the extracted parameters being the cepstrum
    • G10L25/27 - characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-database speech emotion recognition method and device based on joint distribution least square regression. The method comprises the following steps: (1) acquiring a training database and a test database, wherein the training speech database comprises a number of speech segments with corresponding speech emotion category labels, while the test database comprises only speech segments to be recognized; (2) processing the speech segments with a number of acoustic low-level descriptors and computing statistics on them, taking each resulting statistic as an emotion feature, and forming the emotion features into the feature vector of the corresponding speech segment; (3) establishing a least square regression model based on joint distribution and training it jointly on the training database and the test database to obtain a sparse projection matrix; (4) for each speech segment to be recognized, obtaining its feature vector as in step (2) and obtaining the corresponding speech emotion category label with the learned sparse projection matrix. The invention adapts well to different environments and achieves higher accuracy.

Description

Cross-database speech emotion recognition method and device based on joint distribution least square regression
Technical Field
The invention relates to speech emotion recognition, in particular to a cross-database speech emotion recognition method and device based on joint distribution least square regression.
Background
The purpose of speech emotion recognition is to give a machine enough intelligence to extract the emotional state of a speaker (such as happiness, fear, or sadness) from the speaker's voice. It is therefore an important link in human-computer interaction and has great research potential and development prospects. For example, if a driver's mental state is detected by combining voice, facial expression, and behavioral information, the driver can be reminded in time to concentrate and avoid dangerous driving; detecting the speech emotion of a speaker during human-computer interaction makes the dialogue smoother, takes better care of the speaker's state of mind, and brings the interaction closer to human cognition; a wearable device can give more timely and more appropriate feedback according to the wearer's emotional state. In fields such as classroom teaching and companion care, speech emotion recognition likewise plays an increasingly important role.
Traditional speech emotion recognition is trained and tested on the same speech database, so the training data and the test data follow the same distribution. In real life, however, a trained model must face different environments, and the recording background is mixed with various kinds of noise. Cross-database speech emotion recognition therefore faces significant challenges, and how to make a trained model adapt well to different environments has become a problem to be solved by academia and industry.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a cross-database speech emotion recognition method and device based on joint distribution least square regression.
The technical scheme is as follows: the cross-database speech emotion recognition method based on the joint distribution least square regression comprises the following steps:
(1) acquiring two speech databases, used respectively as a training database and a test database, wherein the training speech database comprises a number of speech segments and their corresponding speech emotion category labels, while the test database comprises only speech segments to be recognized;
(2) processing the speech segments with a number of acoustic low-level descriptors and computing statistics on them, taking each resulting statistic as an emotion feature, and forming the emotion features into the feature vector of the corresponding speech segment;
(3) establishing a least square regression model based on joint distribution, and training it jointly with the labeled training database and the unlabeled test database to obtain a sparse projection matrix linking the speech segments and the speech emotion category labels;
(4) for the speech segments to be recognized in the test database, obtaining their feature vectors as in step (2), and obtaining the corresponding speech emotion category labels with the learned sparse projection matrix.
Further, the step (2) specifically comprises:
(2-1) calculating, for each speech segment, the values of 16 acoustic low-level descriptors and their corresponding delta (first-order difference) parameters, the 16 acoustic low-level descriptors being: the zero-crossing rate of the time signal, the root-mean-square frame energy, the fundamental frequency, the harmonics-to-noise ratio, and Mel-frequency cepstral coefficients 1-12;
(2-2) applying 12 statistical functions to the 16 acoustic low-level descriptors of each speech segment, the 12 statistical functions being: mean, standard deviation, kurtosis, skewness, maximum value, minimum value, the relative positions of the extrema, the range, two linear regression coefficients, and their mean square error;
(2-3) taking each statistic thus obtained as an emotion feature, and taking the emotion features together as the feature vector of the corresponding speech segment.
Further, the least squares regression model established in step (3) is:
(The objective function is given as an equation image in the original publication. It combines a least squares regression term ||L_s - P^T X_s||_F^2, a 2,1-norm sparsity term on the projection matrix P, and joint-distribution regularization terms that align the marginal and class-conditional feature statistics of the training and test databases.)
In the formula, min_P denotes finding the matrix P that minimizes the objective; L_s ∈ R^(c×n) is the speech emotion category label matrix of the training-database speech segments, c is the number of speech emotion categories, and n is the number of training-database speech segments; X_s ∈ R^(d×n) is the feature matrix of the training-database speech segments and d is the feature dimension; P ∈ R^(d×c) is the sparse projection matrix and P^T is its transpose; ||·||_F^2 is the squared Frobenius norm, and the balance coefficients control the strength of the regularization terms; X_t ∈ R^(d×m) is the feature matrix of the test-database speech segments and m is the number of test-database speech segments; X_s^c and X_t^c are the sets of speech segments whose emotion belongs to class c in the training database and the test database respectively, and n_c and m_c are the corresponding numbers of such segments; ||·||_{2,1} is the 2,1 norm.
Further, the joint training in step (3) using the labeled training database and the unlabeled test database specifically includes:
(3-1) converting the least square regression model into an equivalent form with an auxiliary variable Q (equation image in the original), subject to the constraint P = Q;
(3-2) estimating, with the converted least square regression model, the pseudo label matrix formed by the speech emotion category pseudo labels of all speech segments in the test database;
(3-3) from the pseudo label matrix, determining the class-wise segment sets X_t^c and counts m_c for the test database, and then computing the class-conditional statistics used in the model;
(3-4) on that basis, solving the converted least square regression model with the augmented Lagrange multiplier method to obtain the projection matrix estimate;
(3-5) based on the projection matrix estimate, updating the pseudo label matrix: the intermediate auxiliary prediction matrix (the projection matrix estimate applied to the test features) is computed; for each column i, the row index j of the largest element in that column is found, and the entry of the pseudo label matrix in column i, row j is set to 1 while the remaining entries of that column are set to 0;
(3-6) with the updated pseudo label matrix, returning to step (3-3) until a preset number of cycles is reached; the projection matrix estimate obtained when the loop ends is taken as the learned projection matrix P.
Further, the step (3-2) specifically comprises:
(3-2-1) obtaining an initial value of the projection matrix estimate from the converted least square regression model with the regularization term omitted (equation image in the original);
(3-2-2) from this initial projection matrix estimate, obtaining the initial pseudo label matrix by the same rule as in step (3-5): the intermediate auxiliary prediction matrix is computed, and for each column the entry in the row with the largest value is set to 1 while the remaining entries of that column are set to 0.
Further, the step (3-4) specifically comprises:
(3-4-1) forming the augmented Lagrange equation of the converted least square regression model (equation image in the original), in which T is the Lagrange multiplier, k > 0 is the penalty parameter, and tr(·) denotes the trace of a matrix;
(3-4-2) keeping P, T and k unchanged, updating Q: the part of the augmented Lagrange equation involving the variable Q is extracted and solved in closed form (equation images in the original);
(3-4-3) keeping Q, T and k unchanged, updating P: the part of the augmented Lagrange equation involving the variable P is extracted and solved column by column (equation images in the original), where P_i is the ith column vector of P and T_i is the ith column vector of T;
(3-4-4) keeping Q and P unchanged, updating T and k:
T = T + k(P - Q)
k = min(ρk, k_max)
where k_max is the preset maximum value of k and ρ > 1 is a scaling factor;
(3-4-5) checking for convergence: if ||P - Q||_∞ does not fall below the convergence threshold, return to step (3-4-2); if it does, or if the number of iterations exceeds the set value, take the current value of P as the sparse projection matrix; ||·||_∞ denotes the largest element of its argument.
Further, the method for calculating the speech emotion category label of the test database in step (4) is as follows:
The label is calculated with the following rule (equation images in the original): using the learned final projection matrix P and the feature matrix X_t of the test-database speech segments, i.e., of the speech segments to be recognized, the intermediate auxiliary prediction matrix P^T X_t is computed; for each speech segment to be recognized, the row index j* of the largest element in its column of P^T X_t is taken as its speech emotion category label.
The cross-database speech emotion recognition device based on joint distribution least square regression comprises a processor and a computer program stored in a memory and executable on the processor, wherein the processor implements the above method when executing the program.
Beneficial effects: compared with the prior art, the invention has the following remarkable advantage: because the cross-database speech emotion recognition method and device learn across databases, they adapt well to different environments and produce more accurate recognition results.
Drawings
FIG. 1 is a schematic flow diagram of a cross-database speech emotion recognition method based on joint distribution least square regression provided by the invention.
Detailed Description
The embodiment provides a cross-database speech emotion recognition method based on joint distribution least square regression, as shown in fig. 1, including the following steps:
(1) Two speech databases are acquired and used respectively as the training database and the test database; the training speech database comprises a number of speech segments with corresponding speech emotion category labels, while the test database comprises only speech segments to be recognized.
In this embodiment, we use three speech emotion databases that are common in emotional speech recognition: Berlin, eNTERFACE, and CASIA. Because the three databases contain different emotion categories, data are selected for each pairwise comparison. When Berlin and eNTERFACE are compared, 375 and 1077 samples are selected respectively, covering 5 emotion categories (anger, fear, happiness, disgust, sadness); when Berlin and CASIA are compared, 408 and 1000 samples are selected respectively, covering the same 5 categories; when eNTERFACE and CASIA are compared, 1072 and 1000 samples are selected respectively, again covering the same 5 categories.
(2) The speech segments are processed with a number of acoustic low-level descriptors and statistics are computed on them; each resulting statistic is taken as an emotion feature, and the emotion features together form the feature vector of the corresponding speech segment.
This specifically comprises:
(2-1) calculating, for each speech segment, the values of 16 acoustic low-level descriptors and their corresponding delta (first-order difference) parameters; the 16 acoustic low-level descriptors are: the zero-crossing rate of the time signal, the root-mean-square frame energy, the fundamental frequency, the harmonics-to-noise ratio, and Mel-frequency cepstral coefficients 1-12; these descriptors come from the feature set provided by the INTERSPEECH 2009 Emotion Challenge;
(2-2) for each speech segment, applying 12 statistical functions to its 16 acoustic low-level descriptors with the openSMILE toolkit, the 12 statistical functions being: mean, standard deviation, kurtosis, skewness, maximum value, minimum value, the relative positions of the extrema, the range, two linear regression coefficients, and their mean square error;
(2-3) taking each statistic thus obtained as an emotion feature; the resulting 16 × 2 × 12 = 384 emotion features form the feature vector of the corresponding speech segment.
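As a rough illustration of this 16 × 2 × 12 = 384-dimensional layout, the following Python sketch computes a comparable feature vector with librosa and scipy instead of openSMILE. The frame sizes, the YIN pitch tracker, and the use of spectral flatness as a stand-in for the harmonics-to-noise ratio are assumptions of this sketch, so it approximates, rather than reproduces, the INTERSPEECH 2009 Emotion Challenge feature set.

```python
# Illustrative sketch only (not the patent's exact openSMILE pipeline):
# 16 low-level descriptors plus their deltas, 12 functionals each -> 384 features.
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

def functionals(x):
    """12 statistics of one low-level-descriptor contour."""
    t = np.arange(len(x))
    slope, offset = np.polyfit(t, x, 1)            # two linear regression coefficients
    mse = np.mean((offset + slope * t - x) ** 2)   # their mean square error
    return np.array([
        x.mean(), x.std(), kurtosis(x), skew(x),
        x.max(), x.min(),
        np.argmax(x) / len(x), np.argmin(x) / len(x),   # relative positions of extrema
        x.max() - x.min(),                               # range
        slope, offset, mse,
    ])

def extract_features(wav_path, sr=16000, frame=400, hop=160):
    y, sr = librosa.load(wav_path, sr=sr)
    zcr  = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)[0]
    rms  = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
    f0   = librosa.yin(y, fmin=50, fmax=500, sr=sr, frame_length=frame * 4, hop_length=hop)
    flat = librosa.feature.spectral_flatness(y=y, n_fft=frame, hop_length=hop)[0]  # HNR stand-in
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=frame, hop_length=hop)
    T = min(len(zcr), len(rms), len(f0), len(flat), mfcc.shape[1])
    llds = np.vstack([zcr[:T], rms[:T], f0[:T], flat[:T], mfcc[:, :T]])   # 16 x T
    llds = np.vstack([llds, librosa.feature.delta(llds)])                 # add deltas -> 32 x T
    return np.concatenate([functionals(row) for row in llds])             # 32 * 12 = 384

# Example: one such vector would form one column of X_s or X_t
# x = extract_features("speech_segment.wav")   # shape (384,)
```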
(3) A least square regression model based on joint distribution is established and trained jointly with the labeled training database and the unlabeled test database to obtain a sparse projection matrix linking the speech segments and the speech emotion category labels.
Wherein the established least squares regression model is as follows:
(The objective function is given as an equation image in the original publication. It combines a least squares regression term ||L_s - P^T X_s||_F^2, a 2,1-norm sparsity term on the projection matrix P, and joint-distribution regularization terms that align the marginal and class-conditional feature statistics of the training and test databases.)
In the formula, min_P denotes finding the matrix P that minimizes the objective; L_s ∈ R^(c×n) is the speech emotion category label matrix of the training-database speech segments, c is the number of speech emotion categories, and n is the number of training-database speech segments; X_s ∈ R^(d×n) is the feature matrix of the training-database speech segments and d is the feature dimension; P ∈ R^(d×c) is the sparse projection matrix and P^T is its transpose; ||·||_F^2 is the squared Frobenius norm, and the balance coefficients control the strength of the regularization terms; X_t ∈ R^(d×m) is the feature matrix of the test-database speech segments and m is the number of test-database speech segments; X_s^c and X_t^c are the sets of speech segments whose emotion belongs to class c in the training database and the test database respectively, and n_c and m_c are the corresponding numbers of such segments; ||·||_{2,1} is the 2,1 norm.
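For concreteness, the following numpy sketch evaluates one plausible instantiation of such an objective: a squared-error regression term, a 2,1-norm sparsity term, and squared distances between projected marginal and class-conditional means. The coefficients lam and mu and the exact form of the distribution terms are assumptions of this sketch, since the patented model's equation appears only as an image.

```python
# A plausible instantiation of a joint-distribution least squares objective
# (assumed form; the patent's exact equation is given only as an image).
import numpy as np

def l21_norm(P):
    # sum of the Euclidean norms of the rows of P
    return np.sum(np.linalg.norm(P, axis=1))

def jdlsr_objective(P, Xs, Ls, Xt, Lt_pseudo, lam=1.0, mu=1.0):
    """Xs: d x n, Ls: c x n one-hot, Xt: d x m, Lt_pseudo: c x m one-hot pseudo labels."""
    fit = np.linalg.norm(Ls - P.T @ Xs, "fro") ** 2           # regression term
    sparsity = lam * l21_norm(P)                               # 2,1-norm sparsity on P
    # marginal alignment: projected global means of training and test features
    marg = np.linalg.norm(P.T @ (Xs.mean(axis=1) - Xt.mean(axis=1))) ** 2
    # conditional alignment: projected class means, using pseudo labels for the test data
    cond = 0.0
    for c in range(Ls.shape[0]):
        s_idx = Ls[c] > 0.5
        t_idx = Lt_pseudo[c] > 0.5
        if s_idx.any() and t_idx.any():
            cond += np.linalg.norm(P.T @ (Xs[:, s_idx].mean(axis=1)
                                          - Xt[:, t_idx].mean(axis=1))) ** 2
    return fit + sparsity + mu * (marg + cond)
```

In the alternating scheme described below, the class-conditional term for the test database would use the current pseudo labels, which is what the Lt_pseudo argument stands for here.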
The joint training using the labeled training database and the unlabeled test database specifically comprises the following steps:
(3-1) converting the least square regression model into an equivalent form with an auxiliary variable Q (equation image in the original), subject to the constraint P = Q;
(3-2) estimating, with the converted least square regression model, the pseudo label matrix formed by the speech emotion category pseudo labels of all speech segments in the test database;
(3-3) from the pseudo label matrix, determining the class-wise segment sets X_t^c and counts m_c for the test database, and then computing the class-conditional statistics used in the model;
(3-4) on that basis, solving the converted least square regression model with the augmented Lagrange multiplier method to obtain the projection matrix estimate;
(3-5) based on the projection matrix estimate, updating the pseudo label matrix: the intermediate auxiliary prediction matrix (the projection matrix estimate applied to the test features) is computed; for each column i, the row index j of the largest element in that column is found, and the entry of the pseudo label matrix in column i, row j is set to 1 while the remaining entries of that column are set to 0;
(3-6) with the updated pseudo label matrix, returning to step (3-3) until a preset number of cycles is reached; the projection matrix estimate obtained when the loop ends is taken as the learned projection matrix P.
Further, the step (3-2) specifically comprises:
(3-2-1) obtaining an initial value of the projection matrix estimate from the converted least square regression model with the regularization term omitted (equation image in the original);
(3-2-2) from this initial projection matrix estimate, obtaining the initial pseudo label matrix by the same rule as in step (3-5): the intermediate auxiliary prediction matrix is computed, and for each column the entry in the row with the largest value is set to 1 while the remaining entries of that column are set to 0. Each column of the pseudo label matrix therefore contains a single 1, in the row of its corresponding category, and zeros in all remaining rows.
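The column-wise argmax rule described above can be written compactly. The sketch below assumes the intermediate auxiliary variable is the projected test feature matrix (the projection estimate transposed times X_t), consistent with the prediction rule in step (4).

```python
import numpy as np

def update_pseudo_labels(P_hat, Xt):
    """Return a c x m one-hot pseudo label matrix: each column has a single 1
    in the row where the projected prediction P_hat^T Xt is largest."""
    Lt_tilde = P_hat.T @ Xt                  # intermediate auxiliary prediction, c x m
    Lt_hat = np.zeros_like(Lt_tilde)
    Lt_hat[np.argmax(Lt_tilde, axis=0), np.arange(Lt_tilde.shape[1])] = 1.0
    return Lt_hat
```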
The step (3-4) specifically comprises the following steps:
(3-4-1) forming the augmented Lagrange equation of the converted least square regression model (equation image in the original), in which T is the Lagrange multiplier, k > 0 is the penalty parameter, and tr(·) denotes the trace of a matrix;
(3-4-2) keeping P, T and k unchanged, updating Q: the part of the augmented Lagrange equation involving the variable Q is extracted and solved in closed form (equation images in the original);
(3-4-3) keeping Q, T and k unchanged, updating P: the part of the augmented Lagrange equation involving the variable P is extracted and solved column by column (equation images in the original), where P_i is the ith column vector of P and T_i is the ith column vector of T;
(3-4-4) keeping Q and P unchanged, updating T and k:
T = T + k(P - Q)
k = min(ρk, k_max)
where k_max is the preset maximum value of k and ρ > 1 is a scaling factor;
(3-4-5) checking for convergence: if ||P - Q||_∞ does not fall below the convergence threshold, return to step (3-4-2); if it does, or if the number of iterations exceeds the set value, take the current value of P as the sparse projection matrix; ||·||_∞ denotes the largest element of its argument.
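Because the sub-problem solutions appear only as images in the original, the following is a generic sketch of the augmented Lagrange multiplier loop for the simplified split min ||L_s - Q^T X_s||_F^2 + λ||P||_{2,1} subject to P = Q, with the joint-distribution terms omitted for brevity. The closed-form Q update and the group-shrinkage P update are the standard solutions for this simplified split and are assumptions, not the patent's exact formulas.

```python
import numpy as np

def shrink_rows(V, tau):
    """Proximal operator of tau*||.||_{2,1}, grouping by rows of V (the common
    convention; the patent's own closed form for the P sub-problem is an image)."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * V

def alm_solve(Xs, Ls, lam=0.1, rho=1.1, k_max=1e6, eps=1e-6, max_iter=500):
    """Sketch of the ALM loop for min ||Ls - Q^T Xs||_F^2 + lam*||P||_{2,1} s.t. P = Q
    (joint-distribution terms omitted for brevity)."""
    d, _ = Xs.shape
    c = Ls.shape[0]
    P = np.zeros((d, c)); T = np.zeros((d, c)); k = 1.0
    XXt = Xs @ Xs.T
    XLt = Xs @ Ls.T
    for _ in range(max_iter):
        # (3-4-2) Q update: quadratic in Q, solved in closed form
        Q = np.linalg.solve(2.0 * XXt + k * np.eye(d), 2.0 * XLt + k * P + T)
        # (3-4-3) P update: group-sparsity proximal step around Q - T/k
        P = shrink_rows(Q - T / k, lam / k)
        # (3-4-4) multiplier and penalty updates
        T = T + k * (P - Q)
        k = min(rho * k, k_max)
        # (3-4-5) convergence check on the constraint violation
        if np.max(np.abs(P - Q)) < eps:
            break
    return P
```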
(4) For the speech segments to be recognized in the test database, the feature vectors are obtained as in step (2), and the corresponding speech emotion category labels are obtained with the learned sparse projection matrix.
Specifically, the category labels are calculated with the following rule (equation images in the original): using the learned final projection matrix P and the feature matrix X_t of the speech segments to be recognized, the intermediate auxiliary prediction matrix P^T X_t is computed; for each speech segment, the row index j* of the largest element in its column of P^T X_t is taken as its speech emotion category label.
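Putting the earlier sketches together, a hypothetical end-to-end run could look as follows; the file names, label indices, and the reuse of the extract_features and alm_solve sketches are illustrative assumptions, not part of the patent.

```python
import numpy as np

# Hypothetical toy data: two labeled training segments, two unlabeled test segments.
train_files, train_labels = ["ang_01.wav", "sad_02.wav"], [0, 4]
test_files = ["unknown_01.wav", "unknown_02.wav"]
num_classes = 5

Xs = np.stack([extract_features(f) for f in train_files], axis=1)   # d x n
Xt = np.stack([extract_features(f) for f in test_files], axis=1)    # d x m
Ls = np.zeros((num_classes, len(train_labels)))
Ls[train_labels, np.arange(len(train_labels))] = 1.0                # one-hot c x n

P = alm_solve(Xs, Ls)                       # learned sparse projection matrix
predicted = np.argmax(P.T @ Xt, axis=0)     # emotion category index per test segment
```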
The embodiment also provides a cross-database speech emotion recognition device based on joint distribution least square regression, which comprises a processor and a computer program stored on a memory and capable of running on the processor, wherein the processor implements the method when executing the computer program.
In order to verify the effectiveness of the invention, experiments were carried out pairwise on the speech emotion databases Berlin, eNTERFACE, and CASIA. In each group of experiments, the two databases are treated as a source domain and a target domain respectively: the source domain provides training data and labels as the training set, and the target domain provides only test data, without any labels, as the test set. To measure recognition accuracy more effectively, two metrics are adopted: unweighted average recall (UAR) and weighted average recall (WAR). UAR divides the number of correct predictions for each class by the number of test samples of that class and then averages these per-class accuracies over all classes, while WAR divides the total number of correct predictions by the total number of test samples, without regard to the per-class sample counts. Considering UAR and WAR together effectively avoids the influence of class imbalance. As comparison methods, several classical and effective subspace-learning algorithms are selected: SVM, TCA, TKL, DaLSR, and DoSL. The experimental results are shown in Table 1 below, where the proposed method is denoted JDLSR, each data set is written as source domain/target domain, E, B, and C abbreviate the eNTERFACE, Berlin, and CASIA data sets respectively, and the evaluation criterion is UAR/WAR.
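For reference, both metrics can be computed with scikit-learn, whose macro-averaged recall equals UAR and whose accuracy equals WAR; the label arrays below are placeholders.

```python
import numpy as np
from sklearn.metrics import recall_score, accuracy_score

y_true = np.array([0, 0, 1, 2, 2, 2])   # placeholder ground-truth classes
y_pred = np.array([0, 1, 1, 2, 2, 0])   # placeholder predictions

uar = recall_score(y_true, y_pred, average="macro")  # mean of per-class recalls
war = accuracy_score(y_true, y_pred)                 # overall fraction correct
print(f"UAR={uar:.3f}, WAR={war:.3f}")
```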
The experimental results show that the method provided by the invention achieves a higher cross-database speech emotion recognition rate.
TABLE 1
(Table 1 is provided as an image in the original publication; the UAR/WAR results are not reproduced in the text.)

Claims (8)

1. A cross-database speech emotion recognition method based on joint distribution least square regression, characterized by comprising the following steps:
(1) acquiring two speech databases, used respectively as a training database and a test database, wherein the training speech database comprises a number of speech segments and their corresponding speech emotion category labels, while the test database comprises only speech segments to be recognized;
(2) processing the speech segments with a number of acoustic low-level descriptors and computing statistics on them, taking each resulting statistic as an emotion feature, and forming the emotion features into the feature vector of the corresponding speech segment;
(3) establishing a least square regression model based on joint distribution, and training it jointly with the labeled training database and the unlabeled test database to obtain a sparse projection matrix linking the speech segments and the speech emotion category labels;
(4) for the speech segments to be recognized in the test database, obtaining their feature vectors as in step (2), and obtaining the corresponding speech emotion category labels with the learned sparse projection matrix.
2. The cross-database speech emotion recognition method based on joint distribution least square regression as claimed in claim 1, wherein step (2) specifically comprises:
(2-1) calculating, for each speech segment, the values of 16 acoustic low-level descriptors and their corresponding delta (first-order difference) parameters, the 16 acoustic low-level descriptors being: the zero-crossing rate of the time signal, the root-mean-square frame energy, the fundamental frequency, the harmonics-to-noise ratio, and Mel-frequency cepstral coefficients 1-12;
(2-2) applying 12 statistical functions to the 16 acoustic low-level descriptors of each speech segment, the 12 statistical functions being: mean, standard deviation, kurtosis, skewness, maximum value, minimum value, the relative positions of the extrema, the range, two linear regression coefficients, and their mean square error;
(2-3) taking each statistic thus obtained as an emotion feature, and taking the emotion features together as the feature vector of the corresponding speech segment.
3. The cross-database speech emotion recognition method based on joint distribution least squares regression, as claimed in claim 1, wherein: the least square regression model established in the step (3) is as follows:
(The objective function is given as an equation image in the original publication. It combines a least squares regression term ||L_s - P^T X_s||_F^2, a 2,1-norm sparsity term on the projection matrix P, and joint-distribution regularization terms that align the marginal and class-conditional feature statistics of the training and test databases.)
In the formula, min_P denotes finding the matrix P that minimizes the objective; L_s ∈ R^(c×n) is the speech emotion category label matrix of the training-database speech segments, c is the number of speech emotion categories, and n is the number of training-database speech segments; X_s ∈ R^(d×n) is the feature matrix of the training-database speech segments and d is the feature dimension; P ∈ R^(d×c) is the sparse projection matrix and P^T is its transpose; ||·||_F^2 is the squared Frobenius norm, and the balance coefficients control the strength of the regularization terms; X_t ∈ R^(d×m) is the feature matrix of the test-database speech segments and m is the number of test-database speech segments; X_s^c and X_t^c are the sets of speech segments whose emotion belongs to class c in the training database and the test database respectively, and n_c and m_c are the corresponding numbers of such segments; ||·||_{2,1} is the 2,1 norm.
4. The cross-database speech emotion recognition method based on joint distribution least square regression as claimed in claim 3, wherein the joint training in step (3) using the labeled training database and the unlabeled test database specifically comprises:
(3-1) converting the least square regression model into an equivalent form with an auxiliary variable Q (equation image in the original), subject to the constraint P = Q;
(3-2) estimating, with the converted least square regression model, the pseudo label matrix formed by the speech emotion category pseudo labels of all speech segments in the test database;
(3-3) from the pseudo label matrix, determining the class-wise segment sets X_t^c and counts m_c for the test database, and then computing the class-conditional statistics used in the model;
(3-4) on that basis, solving the converted least square regression model with the augmented Lagrange multiplier method to obtain the projection matrix estimate;
(3-5) based on the projection matrix estimate, updating the pseudo label matrix: the intermediate auxiliary prediction matrix (the projection matrix estimate applied to the test features) is computed; for each column i, the row index j of the largest element in that column is found, and the entry of the pseudo label matrix in column i, row j is set to 1 while the remaining entries of that column are set to 0;
(3-6) with the updated pseudo label matrix, returning to step (3-3) until a preset number of cycles is reached; the projection matrix estimate obtained when the loop ends is taken as the learned projection matrix P.
5. The cross-database speech emotion recognition method based on joint distribution least squares regression as claimed in claim 4, wherein: the step (3-2) specifically comprises the following steps:
(3-2-1) obtaining an initial value of the projection matrix estimate from the converted least square regression model with the regularization term omitted (equation image in the original);
(3-2-2) from this initial projection matrix estimate, obtaining the initial pseudo label matrix by the same rule as in step (3-5): the intermediate auxiliary prediction matrix is computed, and for each column the entry in the row with the largest value is set to 1 while the remaining entries of that column are set to 0.
6. The cross-database speech emotion recognition method based on joint distribution least squares regression as claimed in claim 4, wherein: the step (3-4) specifically comprises the following steps:
(3-4-1) forming the augmented Lagrange equation of the converted least square regression model (equation image in the original), in which T is the Lagrange multiplier, k > 0 is the penalty parameter, and tr(·) denotes the trace of a matrix;
(3-4-2) keeping P, T and k unchanged, updating Q: the part of the augmented Lagrange equation involving the variable Q is extracted and solved in closed form (equation images in the original);
(3-4-3) keeping Q, T and k unchanged, updating P: the part of the augmented Lagrange equation involving the variable P is extracted and solved column by column (equation images in the original), where P_i is the ith column vector of P and T_i is the ith column vector of T;
(3-4-4) keeping Q and P unchanged, updating T and k:
T = T + k(P - Q)
k = min(ρk, k_max)
where k_max is the preset maximum value of k and ρ > 1 is a scaling factor;
(3-4-5) checking for convergence: if ||P - Q||_∞ does not fall below the convergence threshold, return to step (3-4-2); if it does, or if the number of iterations exceeds the set value, take the current value of P as the sparse projection matrix; ||·||_∞ denotes the largest element of its argument.
7. The cross-database speech emotion recognition method based on joint distribution least square regression as claimed in claim 1, wherein the speech emotion category labels of the test database in step (4) are calculated with the following rule (equation images in the original): using the projection matrix P learned in step (3) and the feature matrix X_t of the speech segments to be recognized in the test database, the intermediate auxiliary prediction matrix P^T X_t is computed; for each speech segment, the row index j* of the largest element in its column of P^T X_t is taken as its speech emotion category label.
8. A cross-database speech emotion recognition apparatus based on joint distribution least squares regression, comprising a processor and a computer program stored on a memory and operable on the processor, wherein: the processor, when executing the program, implements the method of any of claims 1-6.
CN202010372728.2A 2020-05-06 2020-05-06 Cross-database speech emotion recognition method and device based on joint distribution least square regression Active CN111583966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010372728.2A CN111583966B (en) 2020-05-06 2020-05-06 Cross-database speech emotion recognition method and device based on joint distribution least square regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010372728.2A CN111583966B (en) 2020-05-06 2020-05-06 Cross-database speech emotion recognition method and device based on joint distribution least square regression

Publications (2)

Publication Number Publication Date
CN111583966A true CN111583966A (en) 2020-08-25
CN111583966B CN111583966B (en) 2022-06-28

Family

ID=72113186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010372728.2A Active CN111583966B (en) 2020-05-06 2020-05-06 Cross-database speech emotion recognition method and device based on joint distribution least square regression

Country Status (1)

Country Link
CN (1) CN111583966B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112397092A (en) * 2020-11-02 2021-02-23 天津理工大学 Unsupervised cross-library speech emotion recognition method based on field adaptive subspace
CN113112994A (en) * 2021-04-21 2021-07-13 江苏师范大学 Cross-corpus emotion recognition method based on graph convolution neural network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120221333A1 (en) * 2011-02-24 2012-08-30 International Business Machines Corporation Phonetic Features for Speech Recognition
CN103594084A (en) * 2013-10-23 2014-02-19 江苏大学 Voice emotion recognition method and system based on joint penalty sparse representation dictionary learning
US9892726B1 (en) * 2014-12-17 2018-02-13 Amazon Technologies, Inc. Class-based discriminative training of speech models
CN110120231A (en) * 2019-05-15 2019-08-13 哈尔滨工业大学 Across corpus emotion identification method based on adaptive semi-supervised Non-negative Matrix Factorization
CN110390955A (en) * 2019-07-01 2019-10-29 东南大学 A kind of inter-library speech-emotion recognition method based on Depth Domain adaptability convolutional neural networks
CN111048117A (en) * 2019-12-05 2020-04-21 南京信息工程大学 Cross-library speech emotion recognition method based on target adaptation subspace learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120221333A1 (en) * 2011-02-24 2012-08-30 International Business Machines Corporation Phonetic Features for Speech Recognition
CN103594084A (en) * 2013-10-23 2014-02-19 江苏大学 Voice emotion recognition method and system based on joint penalty sparse representation dictionary learning
US9892726B1 (en) * 2014-12-17 2018-02-13 Amazon Technologies, Inc. Class-based discriminative training of speech models
CN110120231A (en) * 2019-05-15 2019-08-13 哈尔滨工业大学 Across corpus emotion identification method based on adaptive semi-supervised Non-negative Matrix Factorization
CN110390955A (en) * 2019-07-01 2019-10-29 东南大学 A kind of inter-library speech-emotion recognition method based on Depth Domain adaptability convolutional neural networks
CN111048117A (en) * 2019-12-05 2020-04-21 南京信息工程大学 Cross-library speech emotion recognition method based on target adaptation subspace learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUAN ZONG ET AL.: "Cross-Corpus Speech Emotion Recognition Based on Domain-adaptive Least Squares Regression", 《IEEE》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112397092A (en) * 2020-11-02 2021-02-23 天津理工大学 Unsupervised cross-library speech emotion recognition method based on field adaptive subspace
CN113112994A (en) * 2021-04-21 2021-07-13 江苏师范大学 Cross-corpus emotion recognition method based on graph convolution neural network
CN113112994B (en) * 2021-04-21 2023-11-07 江苏师范大学 Cross-corpus emotion recognition method based on graph convolution neural network

Also Published As

Publication number Publication date
CN111583966B (en) 2022-06-28

Similar Documents

Publication Publication Date Title
Bishay et al. Schinet: Automatic estimation of symptoms of schizophrenia from facial behaviour analysis
Morel et al. Time-series averaging using constrained dynamic time warping with tolerance
CN107526799B (en) Knowledge graph construction method based on deep learning
Guu et al. Traversing knowledge graphs in vector space
JP6977901B2 (en) Learning material recommendation method, learning material recommendation device and learning material recommendation program
Karnati et al. LieNet: A deep convolution neural network framework for detecting deception
CN111126263B (en) Electroencephalogram emotion recognition method and device based on double-hemisphere difference model
CN111583966B (en) Cross-database speech emotion recognition method and device based on joint distribution least square regression
CN107506350B (en) Method and equipment for identifying information
CN113705092B (en) Disease prediction method and device based on machine learning
CN113112994B (en) Cross-corpus emotion recognition method based on graph convolution neural network
CN107491729A (en) The Handwritten Digit Recognition method of convolutional neural networks based on cosine similarity activation
Takano et al. Bigram-based natural language model and statistical motion symbol model for scalable language of humanoid robots
Zhang et al. Intelligent Facial Action and emotion recognition for humanoid robots
CN112397092A (en) Unsupervised cross-library speech emotion recognition method based on field adaptive subspace
Samsudin Modeling student’s academic performance during covid-19 based on classification in support vector machine
Patro et al. Uncertainty class activation map (U-CAM) using gradient certainty method
CN116244474A (en) Learner learning state acquisition method based on multi-mode emotion feature fusion
CN110069601A (en) Mood determination method and relevant apparatus
Sun et al. Automatic inference of mental states from spontaneous facial expressions
CN116010563A (en) Multi-round dialogue data analysis method, electronic equipment and storage medium
Elbarougy et al. Feature selection method for real-time speech emotion recognition
Ren et al. Subject-independent natural action recognition
Utami et al. The EfficientNet Performance for Facial Expressions Recognition
Hachaj et al. Application of hidden markov models and gesture description language classifiers to oyama karate techniques recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant