CN102201237A - Emotional speaker identification method based on reliability detection of fuzzy support vector machine - Google Patents

Emotional speaker identification method based on reliability detection of fuzzy support vector machine

Info

Publication number
CN102201237A
Authority
CN
China
Prior art keywords
speaker, support vector machine, component, model
Prior art date
Legal status
Granted
Application number
CN201110121720XA
Other languages
Chinese (zh)
Other versions
CN102201237B (en)
Inventor
杨莹春
陈力
吴朝晖
Current Assignee
Zhejiang University (ZJU)
Original Assignee
Zhejiang University (ZJU)
Priority date: 2011-05-12
Filing date: 2011-05-12
Publication date: 2011-09-28
Application filed by Zhejiang University (ZJU)
Priority to CN201110121720XA
Publication of CN102201237A
Application granted
Publication of CN102201237B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses an emotional speaker recognition method based on reliability detection with a fuzzy support vector machine. Speech component features are extracted and combined with the corresponding weights in the UBM model to form universal background model component features; the obtained universal background model component features are used as fuzzy membership degrees to build a fuzzy support vector machine model under each universal background model component; the fuzzy support vector machine models are used for reliability detection to obtain the reliable features; and the reliable features are scored to identify the speaker. The method improves the robustness of the speaker recognition system and its speaker identification performance.

Description

Emotional speaker recognition method based on reliability detection of fuzzy support vector machine

Technical field

The invention relates to signal processing and pattern recognition, and in particular to an emotional speaker recognition method based on reliability feature detection with a fuzzy support vector machine.

Background art

Speaker recognition refers to identifying a speaker's identity from his or her voice by means of signal processing and pattern recognition. It mainly comprises two steps: speaker model training and speech testing.

Currently, the main features used in speaker recognition include Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC), and perceptual linear prediction coefficients (PLP). The main speaker recognition algorithms include vector quantization (VQ), the Gaussian mixture model-universal background model approach (GMM-UBM), and support vector machines (SVM). Among them, GMM-UBM is used very widely throughout the speaker recognition field.

In emotional speaker recognition, the training speech is usually neutral, because in real applications users generally provide only neutral speech to train their own models, whereas the test speech may carry various emotions, such as happiness or sadness. Traditional speaker recognition systems cannot handle this mismatch between training and testing conditions, so emotional speaker recognition must address the performance degradation caused by the inconsistency between a speaker's emotional states in the training and testing stages.

Experimental observation shows that speakers' vocal production differs across emotional states, so the spatial distribution of the speech features differs as well. Relative to the neutral training model, emotional speech features therefore mismatch the model and can be regarded as unreliable features; removing them in the testing stage helps improve the system's recognition performance.

Summary of the invention

To address the deficiencies of the prior art, the present invention proposes an emotional speaker recognition method based on reliability feature detection with a fuzzy support vector machine. It reduces the degree of model mismatch by removing emotional speech features from the test speech, thereby improving the robustness of the speaker recognition system and its speaker recognition performance.

To solve the above technical problems, the technical solution of the present invention is as follows:

The emotional speaker recognition method based on reliability detection of the fuzzy support vector machine comprises the following steps:

1) Extract speech component features and combine them with the corresponding weights in the UBM model to form universal background model component features;

2) Use the universal background model component features obtained in step 1) as fuzzy membership degrees, and build a fuzzy support vector machine model under each universal background model component;

3) Perform reliability detection with the fuzzy support vector machine models of step 2) to obtain the reliable features;

4) Score the reliable features of step 3) to identify the speaker.

As an optional scheme, extracting the speech component features comprises the following steps:

1) Collect the speech signal and preprocess it;

2) Extract features from the preprocessed speech signal;

The feature extraction uses a method based on Mel-frequency cepstral coefficients and/or a method based on linear predictive cepstral coefficients;

The preprocessing comprises, in order: sampling and quantization, zero-drift removal, pre-emphasis, and windowing.

As an optional scheme, forming the universal background model component features comprises the following steps:

1) Randomly divide the collected speech signals into a development library and an evaluation library;

2) Take all the speech in the development library, extract features, and train a universal background model with the EM algorithm;

3) For each frame of test speech, compute the posterior probability on each Gaussian component of the universal background model as its weight;

4) Combine steps 2) and 3) to form the universal background model component features.

As an optional scheme, the fuzzy support vector machine model is a two-class reliable-vs-unreliable feature classifier on each Gaussian component; the positive samples of the classifier are drawn from the neutral speech of the development library and the negative samples from its emotional speech.

As an optional scheme, reliability detection with the fuzzy support vector machine comprises the following steps:

1) Compute the reliability score of the test speech feature x_t under each Gaussian component c by the formula

f_c(x_t) = w_c · φ(x_t) + b_c

where w_c and b_c are the parameters of the classification hyperplane under the c-th Gaussian component;

2) Compute the weighted reliability score of the test speech over all Gaussian components by the formula

S(x_t) = Σ_c P(c | x_t) · f_c(x_t)

where P(c | x_t) is the weight feature;

3) Judge whether the feature is reliable by the result of step 2): if the result is greater than the set threshold, keep it as a reliable feature; otherwise discard it.

As an optional scheme, identifying the speaker from the reliable features comprises the following steps:

1) Train a Gaussian mixture model for each speaker; the speaker models are adapted by the maximum a posteriori probability method;

2) Obtain the likelihood score of the test speech feature x_t on the i-th speaker model λ_i as log p(x_t | λ_i), and obtain the score of the whole test utterance by the formula

Score_i = Σ_{t: S(x_t) > θ} log p(x_t | λ_i)

where θ is the feature-reliability threshold set in the experiment and p(· | λ_i) is the Gaussian mixture probability density;

3) Identify the speaker with the largest score in step 2), i.e.

ID = argmax_i Score_i

where ID denotes the speaker identity.

The beneficial effect of the present invention is that, by removing the unreliable features of a speech passage that are most affected by emotional variation, the robustness of the speaker recognition system is improved, and so is its speaker identification performance.

Brief description of the drawings

Fig. 1 is the basic schematic diagram of the emotional speaker recognition method based on reliability detection of the fuzzy support vector machine.

Detailed description of embodiments

The present invention is further described below with reference to the accompanying drawing and specific embodiments.

As shown in Fig. 1, the emotional speaker recognition method based on reliability detection of the fuzzy support vector machine mainly comprises four steps:

1) Extract speech component features and combine them with the corresponding weights in the UBM model to form universal background model component features;

2) Use the universal background model component features obtained in step 1) as fuzzy membership degrees, and build the fuzzy support vector machine models under the universal background model components (UCFSVM);

3) Perform reliability detection with the UCFSVM models of step 2), judging by the magnitude of the score S(x_t) to obtain the reliable features;

4) Score the reliable features of step 3) to identify the speaker.

Universal background model component feature extraction comprises the following.

The speech signal is collected and preprocessed. The preprocessing steps comprise sampling and quantization, zero-drift removal, pre-emphasis, and windowing.
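
A minimal sketch of these preprocessing steps in Python (the frame length, frame shift, and pre-emphasis coefficient below are common defaults assumed for illustration; the patent does not fix concrete values):

```python
import numpy as np

def preprocess(signal, frame_len=400, frame_shift=160, alpha=0.97):
    """Zero-drift removal, pre-emphasis and Hamming windowing (illustrative defaults)."""
    x = signal - np.mean(signal)                  # remove zero drift (DC offset)
    x = np.append(x[0], x[1:] - alpha * x[:-1])   # pre-emphasis filter 1 - alpha * z^-1
    n_frames = 1 + (len(x) - frame_len) // frame_shift  # assumes len(x) >= frame_len
    window = np.hamming(frame_len)
    return np.stack([x[i * frame_shift : i * frame_shift + frame_len] * window
                     for i in range(n_frames)])   # one windowed frame per row
```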

Features are extracted from the preprocessed speech using a method based on Mel-frequency cepstral coefficients (MFCC), a method based on linear predictive cepstral coefficients (LPCC), or both.

For each utterance, a feature sequence X = {x_1, ..., x_T} is obtained, where each frame feature x_t is a D-dimensional vector and T is the total number of feature frames in the utterance.
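
For instance, the MFCC stream could be produced with librosa (an assumption for illustration; the patent does not name a toolkit, and the dimension D = 13 below is a common choice, not a value from the source):

```python
import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    """Return a T x D matrix of MFCC frames x_1 ... x_T for one utterance."""
    y, sr = librosa.load(wav_path, sr=16000)               # load and resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                          # shape (T, D), one frame per row
```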

All the speech for training the UBM is used to train the UBM with the EM algorithm. For each test feature x_t, a weight is computed on each Gaussian component of the UBM. Suppose the UBM parameters are λ = {w_c, μ_c, Σ_c}, c = 1, ..., M, where w_c, μ_c and Σ_c denote the weight, mean and covariance of the c-th component. Then the posterior probability that the feature x_t belongs to the c-th Gaussian component can be expressed as:

P(c | x_t) = w_c · p(x_t | μ_c, Σ_c) / Σ_{j=1}^{M} w_j · p(x_t | μ_j, Σ_j)

where p(· | μ, Σ) denotes the Gaussian probability density.

后验概率也可以理解为该特征属于该

Figure 176182DEST_PATH_IMAGE026
分量的权重,将原特征和权重结合,即可形成新的通用背景模型分量特征。The posterior probability can also be understood as the feature belongs to the
Figure 176182DEST_PATH_IMAGE026
The weight of the component, the original feature and the weight are combined to form a new general background model component feature.
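
A sketch of this step with scikit-learn's GaussianMixture standing in for the UBM (an assumption for illustration; the component count 512 is illustrative, and the development-library feature matrix is passed in by the caller):

```python
from sklearn.mixture import GaussianMixture

def train_ubm(dev_features, n_components=512):
    """Train the UBM by EM on pooled development-library frames (T_total x D)."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type='diag', max_iter=100)
    ubm.fit(dev_features)
    return ubm

def ubm_component_weights(ubm, X):
    """Posterior P(c | x_t) of every frame in X on every UBM component."""
    return ubm.predict_proba(X)   # shape (T, M); each row sums to 1
```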

The features formed in step (1) carry the feature's weights on the UBM components, so the newly constructed weight feature serves both as the fuzzy membership degree when training the fuzzy support vector machines and as the importance weight of each Gaussian component when computing the reliability score.

Building the fuzzy support vector machine models under the universal background model components:

On the basis of the GMM-UBM model, a two-class reliable-vs-unreliable fuzzy support vector machine is trained for each Gaussian component. Neutral features are regarded as reliable and emotional features as unreliable; the positive samples are drawn from the neutral speech of the development library and the negative samples from its emotional speech. The fuzzy membership degree of each sample is the weight feature described in step (1).

The fuzzy support vector machine is trained as follows. Given a training set labeled with membership degrees

S = {(x_i, y_i, s_i)}, i = 1, ..., N,

each training sample x_i that is emotional speech is regarded as unreliable, with label y_i = -1; if it is neutral speech, its label is y_i = +1.

Optimizing the hyperplane is equivalent to the problem:

min_{w, b, ξ}  (1/2)‖w‖² + C Σ_{i=1}^{N} s_i ξ_i
subject to  y_i (w · φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., N

where C is a constant, φ(x_i) maps x_i from the input space to the feature space, the membership degree s_i represents the degree to which the data point x_i belongs to its class, and w and b are the linear coefficient and offset of the classification hyperplane w · φ(x) + b = 0. This problem can be solved as in (Chun-Fu Lin, Sheng-De Wang. Fuzzy Support Vector Machines. IEEE Transactions on Neural Networks, 13(2):464-471, March 2002).

The problem above can be converted into its dual form (the standard fuzzy-SVM dual of the cited reference):

max_α  Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j K(x_i, x_j)
subject to  Σ_{i=1}^{N} y_i α_i = 0,  0 ≤ α_i ≤ s_i C

Meanwhile, applying the Kuhn-Tucker conditions, the two systems can be solved to obtain the classification-hyperplane parameters w_c and b_c under each Gaussian component.
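
In practice, this per-component fuzzy SVM can be approximated with scikit-learn's SVC, whose sample_weight argument scales the penalty C per sample exactly as the C·s_i term in the primal above does (a sketch under that assumption; the RBF kernel choice is illustrative, not specified by the patent):

```python
import numpy as np
from sklearn.svm import SVC

def train_component_fsvm(X_neutral, X_emotional, s_neutral, s_emotional, C=1.0):
    """Train the reliable-vs-unreliable fuzzy SVM for one Gaussian component.

    s_neutral / s_emotional are the fuzzy memberships P(c | x_t) of each
    frame on this component; sample_weight realizes the C * s_i slack penalty.
    """
    X = np.vstack([X_neutral, X_emotional])
    y = np.concatenate([np.ones(len(X_neutral)), -np.ones(len(X_emotional))])
    s = np.concatenate([s_neutral, s_emotional])
    clf = SVC(kernel='rbf', C=C)
    clf.fit(X, y, sample_weight=s)
    return clf
```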

Feature reliability detection based on the fuzzy support vector machine

For a test speech feature x_t, its reliability score must be computed; if the score is too low, the feature is discarded. The score is computed in two steps. First, the reliability score of the feature on the fuzzy support vector machine of a single UBM Gaussian component is obtained: f_c(x_t) = w_c · φ(x_t) + b_c. Second, the weighted sum of the reliability scores over all UBM Gaussian components is computed:

S(x_t) = Σ_{c=1}^{M} P(c | x_t) · f_c(x_t)

where P(c | x_t) is the feature's weight on the c-th Gaussian component and f_c is as defined above. This score determines whether the feature is reliable: if the score is greater than the threshold, the feature is kept as reliable; otherwise it is discarded.
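
Continuing the sketch, the weighted reliability score of each frame combines the per-component SVM decision values with the UBM posteriors (fsvms here is the hypothetical list of M per-component classifiers trained as above):

```python
import numpy as np

def reliability_scores(X, fsvms, ubm):
    """S(x_t) = sum_c P(c | x_t) * f_c(x_t) for every frame in X."""
    P = ubm.predict_proba(X)                                          # (T, M) posteriors
    F = np.column_stack([clf.decision_function(X) for clf in fsvms])  # (T, M) SVM scores
    return (P * F).sum(axis=1)                                        # (T,) weighted scores

# Frames scoring above the threshold theta are kept as reliable:
# reliable_frames = X[reliability_scores(X, fsvms, ubm) > theta]
```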

Reliable feature score calculation

After the reliable-feature detection of step (3), the score of the whole utterance must be computed.

First, a Gaussian mixture model is trained for each speaker; the speaker models are adapted by the maximum a posteriori probability (MAP) method.

Next, for the i-th speaker model λ_i, the likelihood score of a test feature x_t is obtained by computing its log-likelihood on that model, log p(x_t | λ_i). For the whole test utterance, the score is computed as:

Score_i = Σ_{t: S(x_t) > θ} log p(x_t | λ_i)

where θ is the reliability threshold set in the experiment: if the reliability score of a frame is greater than θ, its feature score is kept; otherwise it is discarded.

Finally, the target speaker of the utterance is chosen as the speaker with the largest score:

ID = argmax_i Score_i
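
A sketch of this final step, assuming the MAP-adapted speaker models are available as a list speaker_gmms of GaussianMixture-like objects (scikit-learn has no built-in MAP adaptation, so that step is taken as given; theta is the experimentally set threshold):

```python
import numpy as np

def identify_speaker(X, speaker_gmms, fsvms, ubm, theta):
    """Keep only reliable frames, then pick the speaker with the largest score."""
    keep = reliability_scores(X, fsvms, ubm) > theta  # reliability mask over frames
    X_rel = X[keep]
    scores = [gmm.score_samples(X_rel).sum()          # sum of frame log-likelihoods
              for gmm in speaker_gmms]
    return int(np.argmax(scores))                     # index of the identified speaker
```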

Experimental results

The database used in the experiments is the Chinese emotional speech database (MASC), recorded with an Olympus DM-20 voice recorder in a quiet environment. It contains 68 native Chinese speakers, 45 male and 23 female. Each speaker produces utterances in five emotions: neutral, anger, happiness, panic, and sadness. Each speaker reads 2 neutral paragraphs under the neutral condition, and speaks 5 words and 20 sentences 3 times each under every emotion.

The experiments were run on an IBM server with an E5420 CPU at 2.5 GHz and 4 GB of memory.

In the experiments, the speech of the first 18 speakers is used as the development library: the 18 speakers' neutral paragraph speech is used to train the UBM, and their sentence utterances under the 5 emotions are used to train the fuzzy support vector machine models. The remaining 50 speakers form the evaluation set; each speaker's GMM is adapted from his or her neutral paragraphs. All sentences under the five emotional conditions are used for testing, 15,000 test utterances in total (50 speakers × 5 emotions × 20 sentences × 3 repetitions). The experiments simulate the speaker identification task; Table 1 compares the results with the GMM-UBM baseline.

Table 1. Comparison of the results of this method and the baseline experiment

Emotion      Baseline    This method
Neutral      96.23%      95.50%
Anger        31.50%      37.60%
Happiness    33.57%      39.47%
Panic        35.00%      39.77%
Sadness      61.43%      63.63%
Average      51.55%      55.19%

The above experimental results show that the method can effectively detect the reliable features in an utterance: the recognition accuracy improves considerably under the emotional states, and the overall identification accuracy rises by 3.64%. This indicates that the method is of great help in improving the performance and robustness of a speaker recognition system.

The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the concept of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (6)

1. An emotional speaker identification method based on reliability detection of the fuzzy support vector machine, characterized by comprising the following steps:
1) extracting speech component features and combining them with the corresponding weights in the UBM model to form universal background model component features;
2) using the universal background model component features obtained in step 1) as fuzzy membership degrees, and building a fuzzy support vector machine model under each universal background model component;
3) carrying out reliability detection with the fuzzy support vector machine models of step 2) to obtain reliable features;
4) scoring the reliable features of step 3) to identify the speaker.
2. The method for emotional speaker recognition based on reliability detection of the fuzzy support vector machine according to claim 1, wherein extracting the speech component features comprises the following steps:
1) collecting the speech signal and preprocessing it;
2) extracting features from the preprocessed speech signal;
the feature extraction uses a method based on Mel-frequency cepstral coefficients and/or a method based on linear predictive cepstral coefficients;
the preprocessing comprises, in order:
sampling and quantization, zero-drift removal, pre-emphasis, and windowing.
3. The method for emotional speaker recognition based on reliability detection of the fuzzy support vector machine according to claim 1, wherein forming the universal background model component features comprises the following steps:
1) randomly dividing the collected speech signals into a development library and an evaluation library;
2) selecting all the speech in the development library, extracting features, and training a universal background model by the EM method;
3) computing, for each frame of speech, the posterior probability on each Gaussian component of the universal background model as its weight;
4) combining step 2) and step 3) to form the universal background model component features.
4. The method as claimed in claim 2, wherein the fuzzy support vector machine model is a two-class neutral-vs-emotion (reliable-vs-unreliable) feature classifier on each Gaussian component, the positive samples of the classifier being selected from the neutral speech and the negative samples from the emotional speech.
5. The method for emotional speaker recognition based on reliability detection of the fuzzy support vector machine according to any one of claims 1-4, wherein the reliability detection with the fuzzy support vector machine comprises the following steps:
1) computing the reliability score of the test speech feature x_t under each Gaussian component by the formula f_c(x_t) = w_c · φ(x_t) + b_c, where w_c and b_c are the parameters of the classification hyperplane under each Gaussian component;
2) computing the weighted reliability score of the test speech over all Gaussian components by the formula S(x_t) = Σ_c P(c | x_t) · f_c(x_t), where P(c | x_t) is the weight feature;
3) judging whether the feature is reliable according to the result obtained in step 2): if the result is greater than the set threshold, taking the feature as a reliable feature, otherwise rejecting it.
6. The method for emotional speaker recognition based on reliability detection of the fuzzy support vector machine according to any one of claims 1-4, wherein computing the universal background model component features to identify the speaker comprises the following steps:
1) training a Gaussian mixture model for each speaker, the speaker models being adapted by the maximum a posteriori probability method;
2) obtaining the likelihood score log p(x_t | λ_i) of the test speech feature x_t on the i-th speaker model, and obtaining the score of the whole test utterance by the formula Score_i = Σ_{t: S(x_t) > θ} log p(x_t | λ_i), where θ is the feature-reliability threshold set in the experiment and p(· | λ_i) is the Gaussian mixture probability density;
3) identifying the speaker with the largest score in step 2), i.e. ID = argmax_i Score_i, where ID denotes the speaker identity.
CN201110121720XA 2011-05-12 2011-05-12 Emotional speaker identification method based on reliability detection of fuzzy support vector machine Expired - Fee Related CN102201237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110121720XA CN102201237B (en) 2011-05-12 2011-05-12 Emotional speaker identification method based on reliability detection of fuzzy support vector machine


Publications (2)

Publication Number Publication Date
CN102201237A true CN102201237A (en) 2011-09-28
CN102201237B CN102201237B (en) 2013-03-13

Family

ID=44661863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110121720XA Expired - Fee Related CN102201237B (en) 2011-05-12 2011-05-12 Emotional speaker identification method based on reliability detection of fuzzy support vector machine

Country Status (1)

Country Link
CN (1) CN102201237B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1758332A (en) * 2005-10-31 2006-04-12 浙江大学 Speaker recognition method based on MFCC linear emotion compensation
JP2008146054A (en) * 2006-12-06 2008-06-26 Korea Electronics Telecommun Speaker information acquisition system and method using voice feature information of speaker
CN101178897A (en) * 2007-12-05 2008-05-14 浙江大学 Speaker Recognition Method Based on Fundamental Band Envelope Removal of Emotional Speech

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhenyu Shan et al., "Scores selection for emotional speaker recognition", Advances in Biometrics: Third International Conference, ICB 2009. *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102779510A (en) * 2012-07-19 2012-11-14 东南大学 Speech emotion recognition method based on feature space self-adaptive projection
CN102930297A (en) * 2012-11-05 2013-02-13 北京理工大学 Emotion recognition method for enhancing coupling hidden markov model (HMM) voice-vision fusion
CN102930297B (en) * 2012-11-05 2015-04-29 北京理工大学 Emotion recognition method for enhancing coupling hidden markov model (HMM) voice-vision fusion
CN102968990A (en) * 2012-11-15 2013-03-13 江苏嘉利德电子科技有限公司 Speaker identifying method and system
CN102968990B (en) * 2012-11-15 2015-04-15 朱东来 Speaker identifying method and system
CN103258532A (en) * 2012-11-28 2013-08-21 河海大学常州校区 Method for recognizing Chinese speech emotions based on fuzzy support vector machine
CN103258532B (en) * 2012-11-28 2015-10-28 河海大学常州校区 A kind of Chinese speech sensibility recognition methods based on fuzzy support vector machine
CN103258537A (en) * 2013-05-24 2013-08-21 安宁 Method utilizing characteristic combination to identify speech emotions and device thereof
CN106504772A (en) * 2016-11-04 2017-03-15 东南大学 Speech Emotion Recognition Method Based on Importance Weight Support Vector Machine Classifier
CN106504772B (en) * 2016-11-04 2019-08-20 东南大学 Speech Emotion Recognition Method Based on Importance Weight Support Vector Machine Classifier
CN107886942A (en) * 2017-10-31 2018-04-06 东南大学 A kind of voice signal emotion identification method returned based on local punishment random spectrum
CN107886942B (en) * 2017-10-31 2021-09-28 东南大学 Voice signal emotion recognition method based on local punishment random spectral regression
CN110047491A (en) * 2018-01-16 2019-07-23 中国科学院声学研究所 A kind of relevant method for distinguishing speek person of random digit password and device
CN108922564A (en) * 2018-06-29 2018-11-30 北京百度网讯科技有限公司 Emotion identification method, apparatus, computer equipment and storage medium
CN108922564B (en) * 2018-06-29 2021-05-07 北京百度网讯科技有限公司 Emotion recognition method and device, computer equipment and storage medium
CN115104152A (en) * 2020-02-25 2022-09-23 松下电器(美国)知识产权公司 Speaker identification device, speaker identification method, and program

Also Published As

Publication number Publication date
CN102201237B (en) 2013-03-13

Similar Documents

Publication Publication Date Title
CN102201237B (en) Emotional speaker identification method based on reliability detection of fuzzy support vector machine
CN102799899B (en) Special audio event layered and generalized identification method based on SVM (Support Vector Machine) and GMM (Gaussian Mixture Model)
CN103177733B (en) Standard Chinese suffixation of a nonsyllabic "r" sound voice quality evaluating method and system
CN108922541B (en) Multi-dimensional feature parameter voiceprint recognition method based on DTW and GMM models
Weninger et al. Deep learning based mandarin accent identification for accent robust ASR.
Lee et al. Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams
Lengerich et al. An end-to-end architecture for keyword spotting and voice activity detection
TWI395201B (en) Method and system for identifying emotional voices
Semwal et al. Automatic speech emotion detection system using multi-domain acoustic feature selection and classification models
CN104240706B (en) It is a kind of that the method for distinguishing speek person that similarity corrects score is matched based on GMM Token
CN111128128B (en) Voice keyword detection method based on complementary model scoring fusion
CN105632501A (en) Deep-learning-technology-based automatic accent classification method and apparatus
CN101645269A (en) Language recognition system and method
CN103578481B (en) A kind of speech-emotion recognition method across language
CN105280181B (en) A kind of training method and Language Identification of languages identification model
CN110211594A (en) A kind of method for distinguishing speek person based on twin network model and KNN algorithm
Franco et al. Adaptive and discriminative modeling for improved mispronunciation detection
CN103456302A (en) Emotion speaker recognition method based on emotion GMM model weight synthesis
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
CN104901807A (en) Vocal print password method available for low-end chip
Guo et al. Speaker Verification Using Short Utterances with DNN-Based Estimation of Subglottal Acoustic Features.
Zeinali et al. A fast speaker identification method using nearest neighbor distance
Chakroun et al. A hybrid system based on GMM-SVM for speaker identification
Vestman et al. Supervector compression strategies to speed up i-vector system development
Lin An improved GMM-based clustering algorithm for efficient speaker identification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130313

CF01 Termination of patent right due to non-payment of annual fee