CN111583966A - Cross-database speech emotion recognition method and device based on joint distribution least square regression - Google Patents
Cross-database speech emotion recognition method and device based on joint distribution least squares regression
- Publication number
- CN111583966A (application number CN202010372728.2A)
- Authority
- CN
- China
- Prior art keywords
- speech
- database
- squares regression
- matrix
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Abstract
The invention discloses a cross-database speech emotion recognition method and device based on joint distribution least squares regression. The method comprises: (1) obtaining a training database and a test database, where the training speech database contains a number of speech segments together with their speech emotion category labels, while the test database contains only speech segments to be recognized; (2) processing each speech segment with a set of acoustic low-level descriptors and applying statistical functions to them, taking each resulting statistic as an emotional feature and assembling the emotional features into the feature vector of the corresponding speech segment; (3) establishing a least squares regression model based on joint distribution and training it jointly on the training and test databases to obtain a sparse projection matrix; (4) for each speech segment to be recognized, computing its feature vector as in step (2) and applying the learned sparse projection matrix to obtain its speech emotion category label. The invention adapts to different environments and achieves higher recognition accuracy.
Description
Technical field
The invention relates to speech emotion recognition, and in particular to a cross-database speech emotion recognition method and device based on joint distribution least squares regression.
Background
The purpose of speech emotion recognition is to give machines enough intelligence to infer a speaker's emotional state (such as happiness, fear or sadness) from his or her speech. It is an important link in human-computer interaction and has great research potential and development prospects. For example, detecting a driver's mental state from voice, facial expression and behavior makes it possible to remind the driver in time to concentrate and avoid dangerous driving; detecting the speech emotion of an interlocutor during human-computer interaction makes the dialogue smoother and more attentive to the interlocutor's state of mind; wearable devices can give more timely and considerate feedback according to the wearer's emotional state; and in fields such as classroom teaching and companion care, speech emotion recognition is playing an increasingly important role.
Traditional speech emotion recognition is trained and tested on the same speech database, so the training and test data follow the same distribution. In real life, however, a trained model must face different environments, and the recording background is mixed with all kinds of noise. Cross-database speech emotion recognition therefore faces great challenges. How to make a trained model adapt well to different environments has become a problem to be solved in both academia and industry.
Summary of the invention
Purpose of the invention: in view of the problems in the prior art, the present invention provides a cross-database speech emotion recognition method and device based on joint distribution least squares regression, which adapts well to different environments and produces more accurate recognition results.
Technical solution: the cross-database speech emotion recognition method based on joint distribution least squares regression according to the present invention comprises the following steps (a high-level sketch wiring the four steps together is given after this list):
(1) Obtain two speech databases, used as the training database and the test database respectively, where the training speech database contains a number of speech segments and the corresponding speech emotion category labels, while the test database contains only speech segments to be recognized.
(2) Process each speech segment with a set of acoustic low-level descriptors and apply statistical functions to them; take each resulting statistic as an emotional feature and assemble the emotional features into the feature vector of the corresponding speech segment.
(3) Establish a least squares regression model based on joint distribution, train it jointly on the labeled training database and the unlabeled test database, and obtain a sparse projection matrix connecting the speech segments with the speech emotion category labels.
(4) For each speech segment to be recognized in the test database, compute its feature vector as in step (2) and apply the learned sparse projection matrix to obtain the corresponding speech emotion category label.
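The four steps above can be wired together as in the following minimal NumPy sketch. It is illustrative rather than the patent's implementation: the helper names `extract_features` and `train_jdlsr` are assumptions standing for the procedures detailed under steps (2) and (3) below.

```python
import numpy as np

def cross_corpus_ser(train_wavs, train_labels, test_wavs, n_classes):
    """Sketch of steps (1)-(4); extract_features and train_jdlsr are the
    routines sketched under steps (2) and (3) (names are illustrative)."""
    # Step (2): one 384-dim feature vector per speech segment, stacked as columns.
    Xs = np.column_stack([extract_features(w) for w in train_wavs])  # d x n
    Xt = np.column_stack([extract_features(w) for w in test_wavs])   # d x m

    # One-hot label matrix Ls (C x n) for the training database.
    Ls = np.zeros((n_classes, len(train_labels)))
    Ls[np.asarray(train_labels), np.arange(len(train_labels))] = 1.0

    # Step (3): jointly learn the sparse projection matrix P from both databases.
    P = train_jdlsr(Xs, Ls, Xt, lam=0.1, mu=0.1)                     # d x C

    # Step (4): label each test segment by the largest entry of P^T x_t.
    return np.argmax(P.T @ Xt, axis=0)
```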
Further, step (2) specifically comprises:
(2-1) For each speech segment, compute 16 acoustic low-level descriptor contours and their corresponding delta coefficients. The 16 acoustic low-level descriptors are: zero-crossing rate of the time signal, root-mean-square frame energy, fundamental frequency (pitch), harmonics-to-noise ratio, and Mel-frequency cepstral coefficients 1-12.
(2-2) For each speech segment, apply 12 statistical functions to each of the 16 acoustic low-level descriptors (and their deltas). The 12 statistical functions are: mean, standard deviation, kurtosis, skewness, maximum, minimum, relative position, relative range, and two linear regression coefficients together with their mean squared error.
(2-3) Take each statistic so obtained as an emotional feature, and assemble the emotional features into a vector as the feature vector of the corresponding speech segment.
Further, the least squares regression model established in step (3) is:

$$\min_{P}\;\|L_s - P^{T}X_s\|_F^{2}+\lambda\|P\|_{2,1}+\mu\left(\Big\|\tfrac{1}{n}P^{T}X_s\mathbf{1}_n-\tfrac{1}{m}P^{T}X_t\mathbf{1}_m\Big\|_2^{2}+\sum_{c=1}^{C}\Big\|\tfrac{1}{n_c}P^{T}X_s^{c}\mathbf{1}_{n_c}-\tfrac{1}{m_c}P^{T}X_t^{c}\mathbf{1}_{m_c}\Big\|_2^{2}\right)$$

where \(\min_P\) means finding the matrix P that minimizes the bracketed expression; \(L_s\in\mathbb{R}^{C\times n}\) is the matrix of speech emotion category label vectors of the training-database speech segments, C is the number of speech emotion classes and n is the number of training segments; \(X_s\in\mathbb{R}^{d\times n}\) collects the feature vectors of the training segments, d being the feature dimension; \(P\in\mathbb{R}^{d\times C}\) is the sparse projection matrix and \(P^{T}\) its transpose; \(\|\cdot\|_F^{2}\) is the squared Frobenius norm; \(\lambda\) and \(\mu\) are trade-off coefficients controlling the regularization terms; \(X_t\in\mathbb{R}^{d\times m}\) collects the feature vectors of the test segments, m being the number of test segments; \(X_s^{c}\) and \(X_t^{c}\) are the feature matrices of the segments whose emotion category is the c-th class in the training and test databases respectively; \(n_c\) and \(m_c\) are the numbers of such segments in the training and test databases respectively; \(\mathbf{1}\) denotes an all-ones column vector of the indicated length; and \(\|\cdot\|_{2,1}\) is the \(\ell_{2,1}\) norm.
Further, the method of jointly training the model on the labeled training database and the unlabeled test database in step (3) specifically comprises:
(3-1) Rewriting the least squares regression model in the equivalent constrained form

$$\min_{P,Q}\;\|L_s - Q^{T}X_s\|_F^{2}+\lambda\|P\|_{2,1}+\mu\,\Omega(Q)\qquad \text{s.t. } P=Q,$$

where \(\Omega(Q)\) denotes the joint-distribution (marginal and class-conditional) alignment terms of the model written with Q in place of P.
(3-2) Using the transformed least squares regression model, estimating the pseudo-label matrix \(\tilde{L}_t\) formed by the speech emotion category pseudo-labels of all speech segments in the test database.
(3-3) From the pseudo-label matrix \(\tilde{L}_t\), obtaining \(X_t^{c}\) and \(m_c\), and then computing the class-conditional alignment terms of the model.
(3-4) Given \(\tilde{L}_t\), solving the transformed least squares regression model with the augmented Lagrange multiplier method to obtain the projection matrix estimate \(\hat{P}\).
(3-5) Updating the pseudo-label matrix from the projection matrix estimate \(\hat{P}\) by

$$\tilde{L}_t(k,i)=\begin{cases}1, & k=\arg\max_{j}\big(\hat{P}^{T}X_t\big)_{ji}\\[2pt]0, & \text{otherwise,}\end{cases}$$

where \(\hat{P}^{T}X_t\) is the intermediate auxiliary variable, \(\big(\hat{P}^{T}X_t\big)_{ji}\) is its element in the j-th row of the i-th column, \(\arg\max_{j}\) returns the row index j of the largest element in the i-th column, and \(\tilde{L}_t(k,i)\) is the element in the k-th row of the i-th column of the pseudo-label matrix.
(3-6) Returning to step (3-3) with the updated pseudo-label matrix \(\tilde{L}_t\), until the preset number of cycles is reached; the projection matrix estimate obtained at the end of the loop is taken as the learned projection matrix P.
Further, step (3-2) specifically comprises:
(3-2-1) Using the transformed least squares regression model without the regularization terms, obtaining the initial value \(\hat{P}_0\) of the projection matrix estimate.
(3-2-2) Obtaining the initial value of the pseudo-label matrix from the initial projection matrix \(\hat{P}_0\) by

$$\tilde{L}_t(k,i)=\begin{cases}1, & k=\arg\max_{j}\big(\hat{P}_0^{T}X_t\big)_{ji}\\[2pt]0, & \text{otherwise,}\end{cases}$$

where \(\hat{P}_0^{T}X_t\) is the intermediate auxiliary variable and \(\tilde{L}_t(k,i)\) is the element in the k-th row of the i-th column of the initial pseudo-label matrix.
Further, step (3-4) specifically comprises:
(3-4-1) Forming the augmented Lagrangian of the constrained least squares regression model:

$$\mathcal{L}(P,Q,T)=\|L_s-Q^{T}X_s\|_F^{2}+\mu\,\Omega(Q)+\lambda\|P\|_{2,1}+\operatorname{tr}\!\big(T^{T}(P-Q)\big)+\frac{k}{2}\|P-Q\|_F^{2},$$

where T is the Lagrange multiplier, k>0 is a penalty (regularization) parameter, and \(\operatorname{tr}(\cdot)\) denotes the trace of a matrix.
(3-4-2) Keeping P, T and k fixed, updating Q: extracting the part of the augmented Lagrangian that depends on Q gives

$$\min_{Q}\;\|L_s-Q^{T}X_s\|_F^{2}+\mu\,\Omega(Q)-\operatorname{tr}\!\big(T^{T}Q\big)+\frac{k}{2}\|P-Q\|_F^{2};$$

setting its derivative with respect to Q to zero yields the closed-form update of Q as the solution of the resulting linear system.
(3-4-3) Keeping Q, T and k fixed, updating P: extracting the part of the augmented Lagrangian that depends on P gives

$$\min_{P}\;\lambda\|P\|_{2,1}+\operatorname{tr}\!\big(T^{T}P\big)+\frac{k}{2}\|P-Q\|_F^{2};$$

solving it column-wise yields

$$P_i=\max\!\left(1-\frac{\lambda}{k\,\|Q_i-T_i/k\|_2},\,0\right)\left(Q_i-\frac{T_i}{k}\right),$$

where \(P_i\), \(Q_i\) and \(T_i\) are the i-th column vectors of P, Q and T respectively.
(3-4-4) Keeping Q and P fixed, updating T and k:

$$T=T+k\,(P-Q)$$
$$k=\min(\rho k,\;k_{\max})$$

where \(k_{\max}\) is the preset maximum value of k and \(\rho>1\) is a scaling coefficient.
(3-4-5) Checking for convergence: check whether \(\|P-Q\|_{\infty}<\varepsilon\) holds; if not, return to step (3-4-2); if it holds, or if the number of iterations exceeds the set value, take the current value of P as the required sparse projection matrix, where \(\|\cdot\|_{\infty}\) denotes the largest element of its argument and \(\varepsilon\) is the convergence threshold.
Further, the speech emotion category label of a test-database segment in step (4) is computed as

$$j^{*}=\arg\max_{j}\big(P^{T}X_t\big)_{ji},$$

where P is the final learned projection matrix, \(X_t\) is the set of feature vectors of the test-database speech segments, i.e. of the speech segments to be recognized, \(P^{T}X_t\) is the intermediate auxiliary variable, and \(j^{*}\) is the speech emotion category label assigned to the i-th speech segment to be recognized.
The cross-database speech emotion recognition device based on joint distribution least squares regression according to the present invention comprises a processor and a computer program stored in a memory and executable on the processor, the processor implementing the above method when executing the program.
Beneficial effects: compared with the prior art, the present invention has the significant advantage that the cross-database speech emotion recognition method and device learn across databases and therefore adapt well to different environments and give more accurate recognition results.
Description of the drawings
FIG. 1 is a schematic flowchart of the cross-database speech emotion recognition method based on joint distribution least squares regression provided by the present invention.
Detailed description of the embodiments
This embodiment provides a cross-database speech emotion recognition method based on joint distribution least squares regression, as shown in FIG. 1, comprising the following steps:
(1) Obtain two speech databases, used as the training database and the test database respectively, where the training speech database contains a number of speech segments and the corresponding speech emotion category labels, while the test database contains only speech segments to be recognized.
In this embodiment, three speech emotion databases commonly used in emotional speech recognition are employed: Berlin, eNTERFACE and CASIA. Because the three databases cover different emotion categories, the data are selected for each pairwise comparison. When Berlin and eNTERFACE are compared, 375 and 1077 samples are selected respectively, over five emotion categories (anger, fear, happiness, disgust, sadness); when Berlin and CASIA are compared, 408 and 1000 samples are selected respectively, over the same five categories; when eNTERFACE and CASIA are compared, 1072 and 1000 samples are selected respectively, again over the same five categories.
(2) Process each speech segment with a set of acoustic low-level descriptors and apply statistical functions to them; take each resulting statistic as an emotional feature and assemble the emotional features into a vector as the feature vector of the corresponding speech segment.
This step specifically comprises:
(2-1) For each speech segment, compute 16 acoustic low-level descriptor contours and their corresponding delta coefficients. The 16 acoustic low-level descriptors are: zero-crossing rate of the time signal, root-mean-square frame energy, fundamental frequency (pitch), harmonics-to-noise ratio, and Mel-frequency cepstral coefficients 1-12. The descriptors come from the feature set provided by the INTERSPEECH 2009 Emotion Challenge.
(2-2) For each speech segment, use the openSMILE toolkit to apply 12 statistical functions to each of the 16 acoustic low-level descriptors (and their deltas). The 12 statistical functions are: mean, standard deviation, kurtosis, skewness, maximum, minimum, relative position, relative range, and two linear regression coefficients together with their mean squared error.
(2-3) Take each statistic so obtained as an emotional feature, and assemble the 16 × 2 × 12 = 384 emotional features into a vector as the feature vector of the corresponding speech segment. A rough stand-in for this extraction step is sketched below.
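The following sketch uses librosa and NumPy rather than openSMILE; the exact functional definitions and the harmonics-to-noise ratio of the INTERSPEECH 2009 configuration are not reproduced (the HNR contour is left as a placeholder), so it approximates, rather than replicates, the 384-dimensional feature set.

```python
import numpy as np
import librosa
from scipy.stats import kurtosis, skew

def functionals(c):
    """12 statistics of one low-level-descriptor contour (an approximation of the IS09 list)."""
    t = np.arange(len(c))
    slope, offset = np.polyfit(t, c, 1)            # two linear-regression coefficients
    mse = np.mean((offset + slope * t - c) ** 2)   # ...and their mean squared error
    return np.array([c.mean(), c.std(), kurtosis(c), skew(c), c.max(), c.min(),
                     np.argmax(c) / len(c), np.argmin(c) / len(c),  # relative positions
                     c.max() - c.min(),                             # range
                     slope, offset, mse])

def extract_features(wav_path, sr=16000):
    """Segment-level vector: 16 LLD contours (+ deltas) x 12 functionals = 384 dimensions."""
    y, sr = librosa.load(wav_path, sr=sr)
    zcr  = librosa.feature.zero_crossing_rate(y)[0]
    rms  = librosa.feature.rms(y=y)[0]
    f0   = librosa.yin(y, fmin=50, fmax=500, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)[1:13]            # MFCC 1-12
    hnr  = np.zeros_like(zcr)                                           # placeholder for the HNR contour
    n = min(len(zcr), len(rms), len(f0), mfcc.shape[1])
    lld = np.vstack([zcr[:n], rms[:n], f0[:n], hnr[:n], mfcc[:, :n]])   # 16 x frames
    lld = np.vstack([lld, librosa.feature.delta(lld)])                  # add deltas -> 32 x frames
    return np.concatenate([functionals(c) for c in lld])                # 32 x 12 = 384
```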
(3) Establish a least squares regression model based on joint distribution, train it jointly on the labeled training database and the unlabeled test database, and obtain a sparse projection matrix connecting the speech segments with the speech emotion category labels.
The least squares regression model established is:

$$\min_{P}\;\|L_s - P^{T}X_s\|_F^{2}+\lambda\|P\|_{2,1}+\mu\left(\Big\|\tfrac{1}{n}P^{T}X_s\mathbf{1}_n-\tfrac{1}{m}P^{T}X_t\mathbf{1}_m\Big\|_2^{2}+\sum_{c=1}^{C}\Big\|\tfrac{1}{n_c}P^{T}X_s^{c}\mathbf{1}_{n_c}-\tfrac{1}{m_c}P^{T}X_t^{c}\mathbf{1}_{m_c}\Big\|_2^{2}\right)$$

where \(\min_P\) means finding the matrix P that minimizes the bracketed expression; \(L_s\in\mathbb{R}^{C\times n}\) is the matrix of speech emotion category label vectors of the training-database speech segments, C is the number of speech emotion classes and n is the number of training segments; \(X_s\in\mathbb{R}^{d\times n}\) collects the feature vectors of the training segments, d being the feature dimension; \(P\in\mathbb{R}^{d\times C}\) is the sparse projection matrix and \(P^{T}\) its transpose; \(\|\cdot\|_F^{2}\) is the squared Frobenius norm; \(\lambda\) and \(\mu\) are trade-off coefficients controlling the regularization terms; \(X_t\in\mathbb{R}^{d\times m}\) collects the feature vectors of the test segments, m being the number of test segments; \(X_s^{c}\) and \(X_t^{c}\) are the feature matrices of the segments whose emotion category is the c-th class in the training and test databases respectively; \(n_c\) and \(m_c\) are the numbers of such segments in the training and test databases respectively; and \(\|\cdot\|_{2,1}\) is the \(\ell_{2,1}\) norm.
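For implementation it is convenient to write both alignment terms of the model as a single quadratic form tr(QᵀXMXᵀQ) with X = [X_s, X_t] and an MMD-style coefficient matrix M; this compact form is an implementation convention assumed here, not notation used in the patent. A sketch of building M from the training labels and the current test pseudo-labels:

```python
import numpy as np

def mmd_matrix(y_s, y_t, n_classes):
    """Coefficient matrix M such that tr(Q^T X M X^T Q), with X = [Xs, Xt],
    equals the marginal plus class-conditional mean-matching terms above."""
    n, m = len(y_s), len(y_t)
    e = np.concatenate([np.full(n, 1.0 / n), np.full(m, -1.0 / m)])
    M = np.outer(e, e)                               # marginal (whole-database) term
    for c in range(n_classes):
        idx_s = np.where(np.asarray(y_s) == c)[0]
        idx_t = n + np.where(np.asarray(y_t) == c)[0]
        ec = np.zeros(n + m)
        if len(idx_s) > 0 and len(idx_t) > 0:
            ec[idx_s] = 1.0 / len(idx_s)             # 1/n_c on training segments of class c
            ec[idx_t] = -1.0 / len(idx_t)            # -1/m_c on test segments of class c
        M += np.outer(ec, ec)                        # class-conditional term for class c
    return M
```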
The model is trained jointly on the labeled training database and the unlabeled test database as follows (a sketch of this alternation is given after step (3-6)):
(3-1) Rewrite the least squares regression model in the equivalent constrained form

$$\min_{P,Q}\;\|L_s - Q^{T}X_s\|_F^{2}+\lambda\|P\|_{2,1}+\mu\,\Omega(Q)\qquad \text{s.t. } P=Q,$$

where \(\Omega(Q)\) denotes the joint-distribution (marginal and class-conditional) alignment terms of the model written with Q in place of P.
(3-2) Using the transformed least squares regression model, estimate the pseudo-label matrix \(\tilde{L}_t\) formed by the speech emotion category pseudo-labels of all speech segments in the test database.
(3-3) From the pseudo-label matrix \(\tilde{L}_t\), obtain \(X_t^{c}\) and \(m_c\), and then compute the class-conditional alignment terms of the model.
(3-4) Given \(\tilde{L}_t\), solve the transformed least squares regression model with the augmented Lagrange multiplier method to obtain the projection matrix estimate \(\hat{P}\).
(3-5) Update the pseudo-label matrix from the projection matrix estimate \(\hat{P}\) by

$$\tilde{L}_t(k,i)=\begin{cases}1, & k=\arg\max_{j}\big(\hat{P}^{T}X_t\big)_{ji}\\[2pt]0, & \text{otherwise,}\end{cases}$$

where \(\hat{P}^{T}X_t\) is the intermediate auxiliary variable, \(\big(\hat{P}^{T}X_t\big)_{ji}\) is its element in the j-th row of the i-th column, \(\arg\max_{j}\) returns the row index j of the largest element in the i-th column, and \(\tilde{L}_t(k,i)\) is the element in the k-th row of the i-th column of the pseudo-label matrix.
(3-6) Return to step (3-3) with the updated pseudo-label matrix \(\tilde{L}_t\), until the preset number of cycles is reached; the projection matrix estimate obtained at the end of the loop is taken as the learned projection matrix P.
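A sketch of this alternation, under the assumptions stated earlier: the alignment terms are carried by the matrix built with `mmd_matrix`, the inner solver `alm_solve` is the one sketched after step (3-4-5) below, and a small ridge term stands in for the unregularized initialization of step (3-2-1) so the matrix inversion stays well conditioned.

```python
import numpy as np

def train_jdlsr(Xs, Ls, Xt, lam=0.1, mu=0.1, outer_iters=5):
    """Alternate between target pseudo-labels and the projection matrix P (steps 3-1 to 3-6)."""
    d = Xs.shape[0]
    y_s = np.argmax(Ls, axis=0)

    # Step (3-2): initial projection from plain least squares (small ridge for stability),
    # then initial pseudo-labels from the argmax of P0^T Xt.
    P = np.linalg.solve(Xs @ Xs.T + 1e-6 * np.eye(d), Xs @ Ls.T)
    y_t = np.argmax(P.T @ Xt, axis=0)

    for _ in range(outer_iters):
        M = mmd_matrix(y_s, y_t, Ls.shape[0])         # step (3-3): rebuild alignment terms
        P = alm_solve(Xs, Ls, Xt, M, lam=lam, mu=mu)  # step (3-4): augmented Lagrangian solve
        y_t = np.argmax(P.T @ Xt, axis=0)             # step (3-5): refresh pseudo-labels
    return P                                          # step (3-6): learned projection matrix
```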
Further, step (3-2) specifically comprises:
(3-2-1) Using the transformed least squares regression model without the regularization terms, obtain the initial value \(\hat{P}_0\) of the projection matrix estimate.
(3-2-2) From the initial projection matrix \(\hat{P}_0\), obtain the initial value of the pseudo-label matrix by

$$\tilde{L}_t(k,i)=\begin{cases}1, & k=\arg\max_{j}\big(\hat{P}_0^{T}X_t\big)_{ji}\\[2pt]0, & \text{otherwise,}\end{cases}$$

where \(\hat{P}_0^{T}X_t\) is the intermediate auxiliary variable and \(\tilde{L}_t(k,i)\) is the element in the k-th row of the i-th column of the initial pseudo-label matrix. In each column of the pseudo-label matrix \(\tilde{L}_t\), only the row corresponding to the predicted category is 1 and all other rows are 0.
Step (3-4) specifically comprises the following sub-steps (a code sketch of this solver follows step (3-4-5)):
(3-4-1) Form the augmented Lagrangian of the constrained least squares regression model:

$$\mathcal{L}(P,Q,T)=\|L_s-Q^{T}X_s\|_F^{2}+\mu\,\Omega(Q)+\lambda\|P\|_{2,1}+\operatorname{tr}\!\big(T^{T}(P-Q)\big)+\frac{k}{2}\|P-Q\|_F^{2},$$

where T is the Lagrange multiplier, k>0 is a penalty (regularization) parameter, and \(\operatorname{tr}(\cdot)\) denotes the trace of a matrix.
(3-4-2) Keep P, T and k fixed and update Q: extracting the part of the augmented Lagrangian that depends on Q gives

$$\min_{Q}\;\|L_s-Q^{T}X_s\|_F^{2}+\mu\,\Omega(Q)-\operatorname{tr}\!\big(T^{T}Q\big)+\frac{k}{2}\|P-Q\|_F^{2};$$

setting its derivative with respect to Q to zero yields the closed-form update of Q as the solution of the resulting linear system.
(3-4-3) Keep Q, T and k fixed and update P: extracting the part of the augmented Lagrangian that depends on P gives

$$\min_{P}\;\lambda\|P\|_{2,1}+\operatorname{tr}\!\big(T^{T}P\big)+\frac{k}{2}\|P-Q\|_F^{2};$$

solving it column-wise yields

$$P_i=\max\!\left(1-\frac{\lambda}{k\,\|Q_i-T_i/k\|_2},\,0\right)\left(Q_i-\frac{T_i}{k}\right),$$

where \(P_i\), \(Q_i\) and \(T_i\) are the i-th column vectors of P, Q and T respectively.
(3-4-4) Keep Q and P fixed and update T and k:

$$T=T+k\,(P-Q)$$
$$k=\min(\rho k,\;k_{\max})$$

where \(k_{\max}\) is the preset maximum value of k and \(\rho>1\) is a scaling coefficient.
(3-4-5) Check for convergence: check whether \(\|P-Q\|_{\infty}<\varepsilon\) holds; if not, return to step (3-4-2); if it holds, or if the number of iterations exceeds the set value, take the current value of P as the required sparse projection matrix, where \(\|\cdot\|_{\infty}\) denotes the largest element of its argument and \(\varepsilon\) is the convergence threshold.
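A sketch of the inner solver of steps (3-4-1) to (3-4-5). It assumes the compact tr(QᵀXMXᵀQ) form of the alignment terms introduced above; under that assumption the Q-subproblem of step (3-4-2) is the linear system solved below, and the P-subproblem reduces to the column-wise shrinkage of step (3-4-3).

```python
import numpy as np

def alm_solve(Xs, Ls, Xt, M, lam=0.1, mu=0.1, k=1.0, rho=1.1,
              k_max=1e6, eps=1e-6, max_iters=200):
    """Augmented Lagrange multiplier solver for the constrained model with P = Q."""
    d, C = Xs.shape[0], Ls.shape[0]
    X = np.hstack([Xs, Xt])                         # d x (n+m), both databases
    A = 2 * (Xs @ Xs.T) + 2 * mu * (X @ M @ X.T)    # fixed part of the Q-subproblem
    B = 2 * (Xs @ Ls.T)
    P = np.zeros((d, C))
    Q = P.copy()
    T = np.zeros_like(P)

    for _ in range(max_iters):
        # (3-4-2) Q-update: derivative of the Q-dependent part set to zero -> linear system.
        Q = np.linalg.solve(A + k * np.eye(d), B + k * P + T)
        # (3-4-3) P-update: column-wise shrinkage (proximal operator of the 2,1 norm).
        G = Q - T / k
        col_norms = np.maximum(np.linalg.norm(G, axis=0), 1e-12)
        P = G * np.maximum(1.0 - lam / (k * col_norms), 0.0)
        # (3-4-4) multiplier and penalty updates.
        T = T + k * (P - Q)
        k = min(rho * k, k_max)
        # (3-4-5) convergence check on the largest element of |P - Q|.
        if np.max(np.abs(P - Q)) < eps:
            break
    return P
```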
(4) For each speech segment to be recognized in the test database, compute its feature vector as in step (2), and apply the learned sparse projection matrix to obtain the corresponding speech emotion category label.
Specifically, the category label is computed as

$$j^{*}=\arg\max_{j}\big(P^{T}X_t\big)_{ji},$$

where P is the final learned projection matrix, \(X_t\) is the set of feature vectors of the test-database speech segments, i.e. of the speech segments to be recognized, \(P^{T}X_t\) is the intermediate auxiliary variable, and \(j^{*}\) is the speech emotion category label assigned to the i-th speech segment to be recognized.
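The labeling rule of step (4), applied to all test segments at once (a small sketch; integer class indices rather than label names are returned):

```python
import numpy as np

def predict_labels(P, Xt):
    """Step (4): emotion class index j* per test segment from the auxiliary variable P^T Xt."""
    scores = P.T @ Xt                  # C x m intermediate auxiliary variable
    return np.argmax(scores, axis=0)   # one label index per column (test segment)
```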
This embodiment also provides a cross-database speech emotion recognition device based on joint distribution least squares regression, comprising a processor and a computer program stored in a memory and executable on the processor, the processor implementing the above method when executing the program.
To verify the effectiveness of the present invention, pairwise experiments were carried out on the Berlin, eNTERFACE and CASIA speech emotion databases. In each group of experiments, the two databases serve as the source domain and the target domain respectively: the source domain is used as the training set and provides both training data and labels, while the target domain is used as the test set and provides only test data without any labels. To measure recognition accuracy more effectively, two metrics are used: the unweighted average recall (UAR) and the weighted average recall (WAR). UAR is the number of correct predictions in each class divided by the number of test samples of that class, averaged over all classes; WAR is the total number of correct predictions divided by the total number of test samples, regardless of class sizes. Considering UAR and WAR together effectively avoids the influence of class imbalance. As comparison experiments, several classic and efficient subspace-learning algorithms were selected: SVM, TCA, TKL, DaLSR and DoSL. The experimental results are shown in Table 1 below, where the proposed method is abbreviated JDLSR, the data sets are given as source domain/target domain, E, B and C are abbreviations of the eNTERFACE, Berlin and CASIA data sets respectively, and the evaluation metric is UAR/WAR.
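The two metrics can be computed as below (a minimal sketch assuming integer class indices for both the ground-truth and predicted labels):

```python
import numpy as np

def uar_war(y_true, y_pred):
    """Unweighted average recall (mean of per-class recalls) and weighted average recall (overall accuracy)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    uar = float(np.mean(recalls))            # average over classes, ignoring class sizes
    war = float(np.mean(y_pred == y_true))   # correct predictions over all test samples
    return uar, war
```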
The experimental results show that the proposed method achieves a high cross-database speech emotion recognition rate.
Table 1
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010372728.2A CN111583966B (en) | 2020-05-06 | 2020-05-06 | Cross-database speech emotion recognition method and device based on joint distribution least square regression |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111583966A true CN111583966A (en) | 2020-08-25 |
CN111583966B CN111583966B (en) | 2022-06-28 |
Family
ID=72113186
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010372728.2A Active CN111583966B (en) | 2020-05-06 | 2020-05-06 | Cross-database speech emotion recognition method and device based on joint distribution least square regression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111583966B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120221333A1 (en) * | 2011-02-24 | 2012-08-30 | International Business Machines Corporation | Phonetic Features for Speech Recognition |
CN103594084A (en) * | 2013-10-23 | 2014-02-19 | Jiangsu University | Voice emotion recognition method and system based on joint penalty sparse representation dictionary learning |
US9892726B1 (en) * | 2014-12-17 | 2018-02-13 | Amazon Technologies, Inc. | Class-based discriminative training of speech models |
CN110120231A (en) * | 2019-05-15 | 2019-08-13 | Harbin Institute of Technology | Across corpus emotion identification method based on adaptive semi-supervised Non-negative Matrix Factorization |
CN110390955A (en) * | 2019-07-01 | 2019-10-29 | Southeast University | A cross-database speech emotion recognition method based on deep domain adaptive convolutional neural network |
CN111048117A (en) * | 2019-12-05 | 2020-04-21 | Nanjing University of Information Science and Technology | Cross-library speech emotion recognition method based on target adaptation subspace learning |
Non-Patent Citations (1)
Title |
---|
YUAN ZONG ET AL.: "Cross-Corpus Speech Emotion Recognition Based on Domain-adaptive Least Squares Regression", IEEE *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112397092A (en) * | 2020-11-02 | 2021-02-23 | Tianjin University of Technology | Unsupervised cross-library speech emotion recognition method based on field adaptive subspace |
CN113112994A (en) * | 2021-04-21 | 2021-07-13 | Jiangsu Normal University | Cross-corpus emotion recognition method based on graph convolution neural network |
CN113112994B (en) * | 2021-04-21 | 2023-11-07 | Jiangsu Normal University | Cross-corpus emotion recognition method based on graph convolutional neural network |
CN115035915A (en) * | 2022-05-31 | 2022-09-09 | Southeast University | Cross-database speech emotion recognition method and device based on implicit alignment subspace learning |
CN115171662A (en) * | 2022-06-29 | 2022-10-11 | Southeast University | Cross-library speech emotion recognition method and device based on CISF (common information System) model |
CN115497508A (en) * | 2022-08-23 | 2022-12-20 | Southeast University | CDAR model-based cross-library speech emotion recognition method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111583966B (en) | 2022-06-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||