CN102201237A - Emotional speaker identification method based on reliability detection of fuzzy support vector machine - Google Patents

Emotional speaker identification method based on reliability detection of fuzzy support vector machine

Info

Publication number
CN102201237A
Authority
CN
China
Prior art keywords
speaker, support vector machine, component, model
Prior art date
Legal status
Granted
Application number
CN201110121720XA
Other languages
Chinese (zh)
Other versions
CN102201237B (en)
Inventor
杨莹春
陈力
吴朝晖
Current Assignee
Zhejiang University (ZJU)
Original Assignee
Zhejiang University (ZJU)
Priority date: 2011-05-12
Filing date: 2011-05-12
Publication date: 2011-09-28
Application filed by Zhejiang University (ZJU)
Priority to CN201110121720XA
Publication of CN102201237A
Application granted
Publication of CN102201237B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses an emotional speaker recognition method based on reliability detection with a fuzzy support vector machine. Speech component features are extracted and combined with the corresponding weights in the UBM model to form universal background model component features; the obtained universal background model component features are used as fuzzy membership degrees to build a fuzzy support vector machine model under each universal background model component; the fuzzy support vector machine models are used for reliability detection to obtain the reliable features; and the reliable features are scored to identify the speaker. The method improves the robustness of the speaker recognition system and its speaker identification performance.

Description

Emotional speaker recognition method based on reliability detection of fuzzy support vector machine

Technical field

The invention relates to signal processing and pattern recognition, and in particular to an emotional speaker recognition method based on reliability feature detection with a fuzzy support vector machine.

Background art

Speaker recognition refers to identifying a speaker's identity from his or her voice by means of signal processing and pattern recognition. It mainly comprises two steps: speaker model training and speech testing.

Currently, the main features used in speaker recognition include Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC), and perceptual linear prediction coefficients (PLP). The main speaker recognition algorithms include vector quantization (VQ), the Gaussian mixture model-universal background model approach (GMM-UBM), and support vector machines (SVM). Among them, GMM-UBM is used very widely throughout the speaker recognition field.

In emotional speaker recognition, the training speech is usually neutral, because in real applications users generally provide only neutral speech to train their own models, whereas the test speech may carry various emotions, such as happiness or sadness. Traditional speaker recognition systems cannot handle this mismatch between training and testing conditions, so emotional speaker recognition must address the performance degradation caused by the inconsistency between a speaker's emotional states in the training and testing stages.

Experimental observation shows that speakers' vocal production differs across emotional states, so the spatial distribution of the speech features differs as well. Relative to the neutral training model, emotional speech features therefore mismatch the model and can be regarded as unreliable features; removing them in the testing stage helps improve the system's recognition performance.

Summary of the invention

To address the deficiencies of the prior art, the present invention proposes an emotional speaker recognition method based on reliability feature detection with a fuzzy support vector machine. It reduces the degree of model mismatch by removing emotional speech features from the test speech, thereby improving the robustness of the speaker recognition system and its speaker recognition performance.

To solve the above technical problems, the technical solution of the present invention is as follows:

The emotional speaker recognition method based on reliability detection of the fuzzy support vector machine comprises the following steps:

1) Extract speech component features and combine them with the corresponding weights in the UBM model to form universal background model component features;

2) Use the universal background model component features obtained in step 1) as fuzzy membership degrees, and build a fuzzy support vector machine model under each universal background model component;

3) Perform reliability detection with the fuzzy support vector machine models of step 2) to obtain the reliable features;

4) Score the reliable features of step 3) to identify the speaker.

As an optional scheme, extracting the speech component features comprises the following steps:

1) Collect the speech signal and preprocess it;

2) Extract features from the preprocessed speech signal;

The feature extraction uses a method based on Mel-frequency cepstral coefficients and/or a method based on linear predictive cepstral coefficients;

The preprocessing comprises, in order: sampling and quantization, zero-drift removal, pre-emphasis, and windowing.

As an optional scheme, forming the universal background model component features comprises the following steps:

1) Randomly divide the collected speech signals into a development library and an evaluation library;

2) Take all the speech in the development library, extract features, and train a universal background model with the EM algorithm;

3) For each frame of test speech, compute the posterior probability on each Gaussian component of the universal background model as its weight;

4) Combine steps 2) and 3) to form the universal background model component features.

As an optional scheme, the fuzzy support vector machine model is a two-class reliable-vs-unreliable feature classifier on each Gaussian component; the positive samples of the classifier are drawn from the neutral speech of the development library and the negative samples from its emotional speech.

As an optional scheme, reliability detection with the fuzzy support vector machine comprises the following steps:

1) Compute the reliability score of the test speech feature x_t under each Gaussian component c by the formula

f_c(x_t) = w_c · φ(x_t) + b_c

where w_c and b_c are the parameters of the classification hyperplane under the c-th Gaussian component;

2) Compute the weighted reliability score of the test speech over all Gaussian components by the formula

S(x_t) = Σ_c P(c | x_t) · f_c(x_t)

where P(c | x_t) is the weight feature;

3) Judge whether the feature is reliable by the result of step 2): if the result is greater than the set threshold, keep it as a reliable feature; otherwise discard it.

As an optional scheme, identifying the speaker from the reliable features comprises the following steps:

1) Train a Gaussian mixture model for each speaker; the speaker models are adapted by the maximum a posteriori probability method;

2) Obtain the likelihood score of the test speech feature x_t on the i-th speaker model λ_i as log p(x_t | λ_i), and obtain the score of the whole test utterance by the formula

Score_i = Σ_{t: S(x_t) > θ} log p(x_t | λ_i)

where θ is the feature-reliability threshold set in the experiment and p(· | λ_i) is the Gaussian mixture probability density;

3) Identify the speaker with the largest score in step 2), i.e.

ID = argmax_i Score_i

where ID denotes the speaker identity.

The beneficial effect of the present invention is that, by removing the unreliable features of a speech passage that are most affected by emotional variation, the robustness of the speaker recognition system is improved, and so is its speaker identification performance.

Brief description of the drawings

Fig. 1 is the basic schematic diagram of the emotional speaker recognition method based on reliability detection of the fuzzy support vector machine.

Detailed description of embodiments

The present invention is further described below with reference to the accompanying drawing and specific embodiments.

As shown in Fig. 1, the emotional speaker recognition method based on reliability detection of the fuzzy support vector machine mainly comprises four steps:

1) Extract speech component features and combine them with the corresponding weights in the UBM model to form universal background model component features;

2) Use the universal background model component features obtained in step 1) as fuzzy membership degrees, and build the fuzzy support vector machine models under the universal background model components (UCFSVM);

3) Perform reliability detection with the UCFSVM models of step 2), judging by the magnitude of the score S(x_t) to obtain the reliable features;

4) Score the reliable features of step 3) to identify the speaker.

Universal background model component feature extraction comprises the following.

The speech signal is collected and preprocessed. The preprocessing steps comprise sampling and quantization, zero-drift removal, pre-emphasis, and windowing.
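
A minimal sketch of these preprocessing steps in Python (the frame length, frame shift, and pre-emphasis coefficient below are common defaults assumed for illustration; the patent does not fix concrete values):

```python
import numpy as np

def preprocess(signal, frame_len=400, frame_shift=160, alpha=0.97):
    """Zero-drift removal, pre-emphasis and Hamming windowing (illustrative defaults)."""
    x = signal - np.mean(signal)                  # remove zero drift (DC offset)
    x = np.append(x[0], x[1:] - alpha * x[:-1])   # pre-emphasis filter 1 - alpha * z^-1
    n_frames = 1 + (len(x) - frame_len) // frame_shift  # assumes len(x) >= frame_len
    window = np.hamming(frame_len)
    return np.stack([x[i * frame_shift : i * frame_shift + frame_len] * window
                     for i in range(n_frames)])   # one windowed frame per row
```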

Features are extracted from the preprocessed speech using a method based on Mel-frequency cepstral coefficients (MFCC), a method based on linear predictive cepstral coefficients (LPCC), or both.

For each utterance, a feature sequence X = {x_1, ..., x_T} is obtained, where each frame feature x_t is a D-dimensional vector and T is the total number of feature frames in the utterance.
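
For instance, the MFCC stream could be produced with librosa (an assumption for illustration; the patent does not name a toolkit, and the dimension D = 13 below is a common choice, not a value from the source):

```python
import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    """Return a T x D matrix of MFCC frames x_1 ... x_T for one utterance."""
    y, sr = librosa.load(wav_path, sr=16000)               # load and resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                          # shape (T, D), one frame per row
```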

All the speech for training the UBM is used to train the UBM with the EM algorithm. For each test feature x_t, a weight is computed on each Gaussian component of the UBM. Suppose the UBM parameters are λ = {w_c, μ_c, Σ_c}, c = 1, ..., M, where w_c, μ_c and Σ_c denote the weight, mean and covariance of the c-th component. Then the posterior probability that the feature x_t belongs to the c-th Gaussian component can be expressed as:

P(c | x_t) = w_c · p(x_t | μ_c, Σ_c) / Σ_{j=1}^{M} w_j · p(x_t | μ_j, Σ_j)

where p(· | μ, Σ) denotes the Gaussian probability density.

后验概率也可以理解为该特征属于该

Figure 176182DEST_PATH_IMAGE026
分量的权重,将原特征和权重结合,即可形成新的通用背景模型分量特征。The posterior probability can also be understood as the feature belongs to the
Figure 176182DEST_PATH_IMAGE026
The weight of the component, the original feature and the weight are combined to form a new general background model component feature.
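
A sketch of this step with scikit-learn's GaussianMixture standing in for the UBM (an assumption for illustration; the component count 512 is illustrative, and the development-library feature matrix is passed in by the caller):

```python
from sklearn.mixture import GaussianMixture

def train_ubm(dev_features, n_components=512):
    """Train the UBM by EM on pooled development-library frames (T_total x D)."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type='diag', max_iter=100)
    ubm.fit(dev_features)
    return ubm

def ubm_component_weights(ubm, X):
    """Posterior P(c | x_t) of every frame in X on every UBM component."""
    return ubm.predict_proba(X)   # shape (T, M); each row sums to 1
```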

The features formed in step (1) carry the feature's weights on the UBM components, so the newly constructed weight feature serves both as the fuzzy membership degree when training the fuzzy support vector machines and as the importance weight of each Gaussian component when computing the reliability score.

Building the fuzzy support vector machine models under the universal background model components:

On the basis of the GMM-UBM model, a two-class reliable-vs-unreliable fuzzy support vector machine is trained for each Gaussian component. Neutral features are regarded as reliable and emotional features as unreliable; the positive samples are drawn from the neutral speech of the development library and the negative samples from its emotional speech. The fuzzy membership degree of each sample is the weight feature described in step (1).

The fuzzy support vector machine is trained as follows. Given a training set labeled with membership degrees

S = {(x_i, y_i, s_i)}, i = 1, ..., N,

each training sample x_i that is emotional speech is regarded as unreliable, with label y_i = -1; if it is neutral speech, its label is y_i = +1.

Optimizing the hyperplane is equivalent to the problem:

min_{w, b, ξ}  (1/2)‖w‖² + C Σ_{i=1}^{N} s_i ξ_i
subject to  y_i (w · φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., N

where C is a constant, φ(x_i) maps x_i from the input space to the feature space, the membership degree s_i represents the degree to which the data point x_i belongs to its class, and w and b are the linear coefficient and offset of the classification hyperplane w · φ(x) + b = 0. This problem can be solved as in (Chun-Fu Lin, Sheng-De Wang. Fuzzy Support Vector Machines. IEEE Transactions on Neural Networks, 13(2):464-471, March 2002).

The problem above can be converted into its dual form (the standard fuzzy-SVM dual of the cited reference):

max_α  Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j K(x_i, x_j)
subject to  Σ_{i=1}^{N} y_i α_i = 0,  0 ≤ α_i ≤ s_i C

Meanwhile, applying the Kuhn-Tucker conditions, the two systems can be solved to obtain the classification-hyperplane parameters w_c and b_c under each Gaussian component.
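
In practice, this per-component fuzzy SVM can be approximated with scikit-learn's SVC, whose sample_weight argument scales the penalty C per sample exactly as the C·s_i term in the primal above does (a sketch under that assumption; the RBF kernel choice is illustrative, not specified by the patent):

```python
import numpy as np
from sklearn.svm import SVC

def train_component_fsvm(X_neutral, X_emotional, s_neutral, s_emotional, C=1.0):
    """Train the reliable-vs-unreliable fuzzy SVM for one Gaussian component.

    s_neutral / s_emotional are the fuzzy memberships P(c | x_t) of each
    frame on this component; sample_weight realizes the C * s_i slack penalty.
    """
    X = np.vstack([X_neutral, X_emotional])
    y = np.concatenate([np.ones(len(X_neutral)), -np.ones(len(X_emotional))])
    s = np.concatenate([s_neutral, s_emotional])
    clf = SVC(kernel='rbf', C=C)
    clf.fit(X, y, sample_weight=s)
    return clf
```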

Feature reliability detection based on the fuzzy support vector machine

For a test speech feature x_t, its reliability score must be computed; if the score is too low, the feature is discarded. The score is computed in two steps. First, the reliability score of the feature on the fuzzy support vector machine of a single UBM Gaussian component is obtained: f_c(x_t) = w_c · φ(x_t) + b_c. Second, the weighted sum of the reliability scores over all UBM Gaussian components is computed:

S(x_t) = Σ_{c=1}^{M} P(c | x_t) · f_c(x_t)

where P(c | x_t) is the feature's weight on the c-th Gaussian component and f_c is as defined above. This score determines whether the feature is reliable: if the score is greater than the threshold, the feature is kept as reliable; otherwise it is discarded.
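
Continuing the sketch, the weighted reliability score of each frame combines the per-component SVM decision values with the UBM posteriors (fsvms here is the hypothetical list of M per-component classifiers trained as above):

```python
import numpy as np

def reliability_scores(X, fsvms, ubm):
    """S(x_t) = sum_c P(c | x_t) * f_c(x_t) for every frame in X."""
    P = ubm.predict_proba(X)                                          # (T, M) posteriors
    F = np.column_stack([clf.decision_function(X) for clf in fsvms])  # (T, M) SVM scores
    return (P * F).sum(axis=1)                                        # (T,) weighted scores

# Frames scoring above the threshold theta are kept as reliable:
# reliable_frames = X[reliability_scores(X, fsvms, ubm) > theta]
```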

Reliable feature score calculation

After the reliable-feature detection of step (3), the score of the whole utterance must be computed.

First, a Gaussian mixture model is trained for each speaker; the speaker models are adapted by the maximum a posteriori probability (MAP) method.

Next, for the i-th speaker model λ_i, the likelihood score of a test feature x_t is obtained by computing its log-likelihood on that model, log p(x_t | λ_i). For the whole test utterance, the score is computed as:

Score_i = Σ_{t: S(x_t) > θ} log p(x_t | λ_i)

where θ is the reliability threshold set in the experiment: if the reliability score of a frame is greater than θ, its feature score is kept; otherwise it is discarded.

Finally, the target speaker of the utterance is chosen as the speaker with the largest score:

ID = argmax_i Score_i
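
A sketch of this final step, assuming the MAP-adapted speaker models are available as a list speaker_gmms of GaussianMixture-like objects (scikit-learn has no built-in MAP adaptation, so that step is taken as given; theta is the experimentally set threshold):

```python
import numpy as np

def identify_speaker(X, speaker_gmms, fsvms, ubm, theta):
    """Keep only reliable frames, then pick the speaker with the largest score."""
    keep = reliability_scores(X, fsvms, ubm) > theta  # reliability mask over frames
    X_rel = X[keep]
    scores = [gmm.score_samples(X_rel).sum()          # sum of frame log-likelihoods
              for gmm in speaker_gmms]
    return int(np.argmax(scores))                     # index of the identified speaker
```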

Experimental results

The database used in the experiments is the Chinese emotional speech database (MASC), recorded with an Olympus DM-20 voice recorder in a quiet environment. It contains 68 native Chinese speakers, 45 male and 23 female. Each speaker produces utterances in five emotions: neutral, anger, happiness, panic, and sadness. Each speaker reads 2 neutral paragraphs under the neutral condition, and speaks 5 words and 20 sentences 3 times each under every emotion.

The experiments were run on an IBM server with an E5420 CPU at 2.5 GHz and 4 GB of memory.

In the experiments, the speech of the first 18 speakers is used as the development library: the 18 speakers' neutral paragraph speech is used to train the UBM, and their sentence utterances under the 5 emotions are used to train the fuzzy support vector machine models. The remaining 50 speakers form the evaluation set; each speaker's GMM is adapted from his or her neutral paragraphs. All sentences under the five emotional conditions are used for testing, 15,000 test utterances in total (50 speakers × 5 emotions × 20 sentences × 3 repetitions). The experiments simulate the speaker identification task; Table 1 compares the results with the GMM-UBM baseline.

Table 1. Comparison of the results of this method and the baseline experiment

Emotion      Baseline    This method
Neutral      96.23%      95.50%
Anger        31.50%      37.60%
Happiness    33.57%      39.47%
Panic        35.00%      39.77%
Sadness      61.43%      63.63%
Average      51.55%      55.19%

The above experimental results show that the method can effectively detect the reliable features in an utterance: the recognition accuracy improves considerably under the emotional states, and the overall identification accuracy rises by 3.64%. This indicates that the method is of great help in improving the performance and robustness of a speaker recognition system.

The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the concept of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (6)

1. An emotional speaker identification method based on reliability detection of the fuzzy support vector machine, characterized by comprising the following steps:
1) extracting speech component features and combining them with the corresponding weights in the UBM model to form universal background model component features;
2) using the universal background model component features obtained in step 1) as fuzzy membership degrees, and building a fuzzy support vector machine model under each universal background model component;
3) carrying out reliability detection with the fuzzy support vector machine models of step 2) to obtain reliable features;
4) scoring the reliable features of step 3) to identify the speaker.
2. The method for emotional speaker recognition based on reliability detection of the fuzzy support vector machine according to claim 1, wherein extracting the speech component features comprises the following steps:
1) collecting the speech signal and preprocessing it;
2) extracting features from the preprocessed speech signal;
the feature extraction uses a method based on Mel-frequency cepstral coefficients and/or a method based on linear predictive cepstral coefficients;
the preprocessing comprises, in order:
sampling and quantization, zero-drift removal, pre-emphasis, and windowing.
3. The method for emotional speaker recognition based on reliability detection of the fuzzy support vector machine according to claim 1, wherein forming the universal background model component features comprises the following steps:
1) randomly dividing the collected speech signals into a development library and an evaluation library;
2) selecting all the speech in the development library, extracting features, and training a universal background model by the EM method;
3) computing, for each frame of speech, the posterior probability on each Gaussian component of the universal background model as its weight;
4) combining step 2) and step 3) to form the universal background model component features.
4. The method as claimed in claim 2, wherein the fuzzy support vector machine model is a two-class neutral-vs-emotion (reliable-vs-unreliable) feature classifier on each Gaussian component, the positive samples of the classifier being selected from the neutral speech and the negative samples from the emotional speech.
5. The method for emotional speaker recognition based on reliability detection of the fuzzy support vector machine according to any one of claims 1-4, wherein the reliability detection with the fuzzy support vector machine comprises the following steps:
1) computing the reliability score of the test speech feature x_t under each Gaussian component by the formula f_c(x_t) = w_c · φ(x_t) + b_c, where w_c and b_c are the parameters of the classification hyperplane under each Gaussian component;
2) computing the weighted reliability score of the test speech over all Gaussian components by the formula S(x_t) = Σ_c P(c | x_t) · f_c(x_t), where P(c | x_t) is the weight feature;
3) judging whether the feature is reliable according to the result obtained in step 2): if the result is greater than the set threshold, taking the feature as a reliable feature, otherwise rejecting it.
6. The method for emotional speaker recognition based on reliability detection of the fuzzy support vector machine according to any one of claims 1-4, wherein computing the universal background model component features to identify the speaker comprises the following steps:
1) training a Gaussian mixture model for each speaker, the speaker models being adapted by the maximum a posteriori probability method;
2) obtaining the likelihood score log p(x_t | λ_i) of the test speech feature x_t on the i-th speaker model, and obtaining the score of the whole test utterance by the formula Score_i = Σ_{t: S(x_t) > θ} log p(x_t | λ_i), where θ is the feature-reliability threshold set in the experiment and p(· | λ_i) is the Gaussian mixture probability density;
3) identifying the speaker with the largest score in step 2), i.e. ID = argmax_i Score_i, where ID denotes the speaker identity.
CN201110121720XA 2011-05-12 2011-05-12 Emotional speaker identification method based on reliability detection of fuzzy support vector machine Expired - Fee Related CN102201237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110121720XA CN102201237B (en) 2011-05-12 2011-05-12 Emotional speaker identification method based on reliability detection of fuzzy support vector machine


Publications (2)

Publication Number Publication Date
CN102201237A true CN102201237A (en) 2011-09-28
CN102201237B CN102201237B (en) 2013-03-13

Family

ID=44661863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110121720XA Expired - Fee Related CN102201237B (en) 2011-05-12 2011-05-12 Emotional speaker identification method based on reliability detection of fuzzy support vector machine

Country Status (1)

Country Link
CN (1) CN102201237B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1758332A (en) * 2005-10-31 2006-04-12 浙江大学 Speaker recognition method based on MFCC linear emotion compensation
JP2008146054A (en) * 2006-12-06 2008-06-26 Korea Electronics Telecommun Speaker information acquisition system and method using voice feature information of speaker
CN101178897A (en) * 2007-12-05 2008-05-14 浙江大学 Speaker Recognition Method Based on Fundamental Band Envelope Removal of Emotional Speech

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhenyu Shan et al., "Scores selection for emotional speaker recognition", Advances in Biometrics: Third International Conference, ICB 2009. *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102779510A (en) * 2012-07-19 2012-11-14 东南大学 Speech emotion recognition method based on feature space self-adaptive projection
CN102930297A (en) * 2012-11-05 2013-02-13 北京理工大学 Emotion recognition method for enhancing coupling hidden markov model (HMM) voice-vision fusion
CN102930297B (en) * 2012-11-05 2015-04-29 北京理工大学 Emotion recognition method for enhancing coupling hidden markov model (HMM) voice-vision fusion
CN102968990A (en) * 2012-11-15 2013-03-13 江苏嘉利德电子科技有限公司 Speaker identifying method and system
CN102968990B (en) * 2012-11-15 2015-04-15 朱东来 Speaker identifying method and system
CN103258532A (en) * 2012-11-28 2013-08-21 河海大学常州校区 Method for recognizing Chinese speech emotions based on fuzzy support vector machine
CN103258532B (en) * 2012-11-28 2015-10-28 河海大学常州校区 A kind of Chinese speech sensibility recognition methods based on fuzzy support vector machine
CN103258537A (en) * 2013-05-24 2013-08-21 安宁 Method utilizing characteristic combination to identify speech emotions and device thereof
CN106504772A (en) * 2016-11-04 2017-03-15 东南大学 Speech Emotion Recognition Method Based on Importance Weight Support Vector Machine Classifier
CN106504772B (en) * 2016-11-04 2019-08-20 东南大学 Speech Emotion Recognition Method Based on Importance Weight Support Vector Machine Classifier
CN107886942A (en) * 2017-10-31 2018-04-06 东南大学 A kind of voice signal emotion identification method returned based on local punishment random spectrum
CN107886942B (en) * 2017-10-31 2021-09-28 东南大学 Voice signal emotion recognition method based on local punishment random spectral regression
CN110047491A (en) * 2018-01-16 2019-07-23 中国科学院声学研究所 A kind of relevant method for distinguishing speek person of random digit password and device
CN108922564A (en) * 2018-06-29 2018-11-30 北京百度网讯科技有限公司 Emotion identification method, apparatus, computer equipment and storage medium
CN108922564B (en) * 2018-06-29 2021-05-07 北京百度网讯科技有限公司 Emotion recognition method and device, computer equipment and storage medium
CN115104152A (en) * 2020-02-25 2022-09-23 松下电器(美国)知识产权公司 Speaker identification device, speaker identification method, and program

Also Published As

Publication number Publication date
CN102201237B (en) 2013-03-13

Similar Documents

Publication Publication Date Title
CN102201237B (en) Emotional speaker identification method based on reliability detection of fuzzy support vector machine
CN102799899B (en) Special audio event layered and generalized identification method based on SVM (Support Vector Machine) and GMM (Gaussian Mixture Model)
CN103177733B (en) Standard Chinese suffixation of a nonsyllabic "r" sound voice quality evaluating method and system
CN108922541B (en) Multi-dimensional feature parameter voiceprint recognition method based on DTW and GMM models
Weninger et al. Deep learning based mandarin accent identification for accent robust ASR.
Lee et al. Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams
Lengerich et al. An end-to-end architecture for keyword spotting and voice activity detection
TWI395201B (en) Method and system for identifying emotional voices
Semwal et al. Automatic speech emotion detection system using multi-domain acoustic feature selection and classification models
CN104240706B (en) It is a kind of that the method for distinguishing speek person that similarity corrects score is matched based on GMM Token
CN111128128B (en) Voice keyword detection method based on complementary model scoring fusion
CN105632501A (en) Deep-learning-technology-based automatic accent classification method and apparatus
CN101645269A (en) Language recognition system and method
CN103578481B (en) A kind of speech-emotion recognition method across language
CN105280181B (en) A kind of training method and Language Identification of languages identification model
CN110211594A (en) A kind of method for distinguishing speek person based on twin network model and KNN algorithm
Franco et al. Adaptive and discriminative modeling for improved mispronunciation detection
CN103456302A (en) Emotion speaker recognition method based on emotion GMM model weight synthesis
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
CN104901807A (en) Vocal print password method available for low-end chip
Guo et al. Speaker Verification Using Short Utterances with DNN-Based Estimation of Subglottal Acoustic Features.
Zeinali et al. A fast speaker identification method using nearest neighbor distance
Chakroun et al. A hybrid system based on GMM-SVM for speaker identification
Vestman et al. Supervector compression strategies to speed up i-vector system development
Lin An improved GMM-based clustering algorithm for efficient speaker identification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130313

CF01 Termination of patent right due to non-payment of annual fee