CN111785285A - Voiceprint recognition method for home multi-feature parameter fusion

- Publication number: CN111785285A
- Application number: CN202010439120.7A
- Authority: CN (China)
- Prior art keywords: feature parameters, voiceprint recognition, recognition method, filter
- Prior art date: 2020-05-22
- Legal status: Pending (status assumed by Google Patents; not a legal conclusion)
Classifications

- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
  - G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
  - G10L17/06—Decision making techniques; Pattern matching strategies
    - G10L17/10—Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
  - G10L25/03—Speech or voice analysis techniques characterised by the type of extracted parameters
    - G10L25/24—the extracted parameters being the cepstrum
Abstract
The invention discloses a voiceprint recognition method based on multi-feature parameter fusion for home applications, comprising the following steps: computing the MFCC, GFCC and LPCC feature parameters extracted from the speech signal; training three Gaussian mixture models with the MFCC, GFCC and LPCC feature parameters respectively; weighting and fusing the outputs of the three Gaussian mixture models, making a soft decision against a set threshold, and using stochastic gradient descent to obtain the optimal weight coefficients and output the final recognition result. By fusing the MFCC, GFCC and LPCC feature parameters, the invention remedies the inability of any single feature parameter to express the speaker's characteristics well, thereby substantially improving the accuracy of voiceprint recognition.
Description
Technical Field

The invention belongs to the field of voiceprint recognition, and in particular relates to a voiceprint recognition method based on multi-feature parameter fusion for home applications.
Background

Voiceprint recognition, also known as speaker recognition, includes speaker identification and speaker verification. Its applications are wide-ranging, spanning finance, military security, medicine and home security. In many voiceprint recognition systems, beyond the preprocessing stage, the choice of feature parameters and the model matching are critical to recognition accuracy.

A single traditional feature parameter cannot express the speaker's voice characteristics well and may cause overfitting; moreover, MFCC feature parameters are easy to imitate. Going beyond single features, many researchers concatenate GFCC and MFCC directly into a new feature vector, which invites the curse of dimensionality and increases the computational load of the system. Current home voiceprint recognition algorithms therefore cannot express speaker characteristics adequately, and their recognition accuracy needs improvement.
Summary of the Invention

Purpose of the invention: to overcome the deficiencies of the prior art, a voiceprint recognition method based on multi-feature parameter fusion for home applications is provided, which effectively solves the problem that a single feature parameter cannot fully express the speaker's voice characteristics and improves the accuracy of voiceprint recognition.

Technical solution: to achieve the above purpose, the present invention provides a voiceprint recognition method based on multi-feature parameter fusion for home applications, comprising the following steps:
S1: compute the MFCC, GFCC and LPCC feature parameters extracted from the speech signal;

S2: train three Gaussian mixture models using the MFCC, GFCC and LPCC feature parameters respectively;

S3: weight and fuse the outputs of the three Gaussian mixture models, make a soft decision against a set threshold, and use stochastic gradient descent to obtain the optimal weight coefficients and output the final recognition result.
Further, in step S1 the speech signal undergoes a preprocessing operation before feature parameter extraction.

Further, in step S1 the preprocessing operation includes sampling and quantization, pre-emphasis, framing and windowing, and endpoint detection.
Further, the MFCC feature parameters in step S1 are extracted as follows:

A1) preprocess the input speech signal to generate a time-domain signal, and apply a fast Fourier transform or discrete Fourier transform to each frame to obtain the linear speech spectrum;

A2) pass the linear spectrum through a Mel filter bank to generate the Mel spectrum, and take the logarithmic energy of the Mel spectrum to generate the corresponding log spectrum;

A3) convert the log spectrum into the MFCC feature parameters using the discrete cosine transform.
Further, the GFCC feature parameters in step S1 are extracted as follows:

B1) preprocess the speech signal to generate a time-domain signal, and apply a fast Fourier transform or discrete Fourier transform to obtain the discrete spectrum;

B2) square the discrete spectrum to generate the speech energy spectrum, and filter it with a Gammatone filter bank;

B3) apply exponential compression to the output of each Gammatone filter to obtain a set of exponential energy spectra;

B4) convert the exponential energy spectra into the GFCC feature parameters using the discrete cosine transform.
Further, the LPCC feature parameters in step S1 are extracted as follows:

C1) define the system function of the vocal tract model;

C2) define the impulse response of the system function and compute its complex cepstrum;

C3) compute the LPCC feature parameters from the relationship between the complex cepstrum and the cepstral coefficients.
Further, the recognition result in step S3 is determined as follows: when the weighted fusion result is greater than or equal to the threshold, the speaker is recognized as the target speaker; otherwise, the speaker is recognized as a non-target speaker.

Beneficial effects: compared with the prior art, the present invention fuses the MFCC, GFCC and LPCC feature parameters, remedying the inability of a single feature parameter to express the speaker's characteristics well, and thereby substantially improving voiceprint recognition accuracy.
Brief Description of the Drawings

Figure 1 is a block diagram of the overall structure of the method of the present invention;

Figure 2 is a flowchart of MFCC feature parameter extraction;

Figure 3 is a flowchart of GFCC feature parameter extraction.
Detailed Description

The present invention is further clarified below in conjunction with the accompanying drawings and specific embodiments. It should be understood that these embodiments are intended only to illustrate the invention and not to limit its scope; after reading this disclosure, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the claims appended to this application.
As shown in Figure 1, the present invention provides a voiceprint recognition method based on multi-feature parameter fusion for home applications, comprising the following steps (a preprocessing sketch follows this list):

1) Preprocess the input speaker's speech; the preprocessing includes sampling and quantization, pre-emphasis, windowing and framing, and endpoint detection. The purpose of preprocessing is to eliminate interference from the vocal organs and the speech acquisition equipment and to improve the recognition rate of the system.

2) Compute the MFCC, GFCC and LPCC feature parameters extracted from the speech signal;

3) Train three Gaussian mixture models, GMM model A, GMM model B and GMM model C, using the MFCC, GFCC and LPCC feature parameters respectively;

4) Weight and fuse the outputs of GMM model A, GMM model B and GMM model C, make a soft decision against a set threshold, and use stochastic gradient descent to obtain the optimal weight coefficients and output the final recognition result.
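As a concrete illustration of step 1, the following Python sketch implements the preprocessing chain under assumptions the patent does not state: a 0.97 pre-emphasis coefficient, 25 ms frames with a 10 ms hop, a Hamming window, and a crude short-time-energy threshold standing in for the endpoint detector.

```python
import numpy as np

def preprocess(signal, sr=16000, alpha=0.97, frame_len=0.025, hop_len=0.010):
    """Pre-emphasis, framing, Hamming windowing and a simple
    energy-based endpoint detector. All parameter values here are
    illustrative assumptions, not taken from the patent."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Split into overlapping frames and apply a Hamming window.
    n_frame, n_hop = int(sr * frame_len), int(sr * hop_len)
    n_frames = 1 + max(0, (len(emphasized) - n_frame) // n_hop)
    idx = np.arange(n_frame)[None, :] + n_hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(n_frame)

    # Endpoint detection: keep frames whose short-time energy exceeds
    # a fixed fraction of the maximum frame energy.
    energy = np.sum(frames ** 2, axis=1)
    return frames[energy > 0.05 * energy.max()]
```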
As shown in Figure 2, the MFCC feature parameters in this embodiment are extracted as follows:

A1) Preprocess the input speech signal s(n) to generate a time-domain signal x(n) (signal sequence length N = 256). Then apply a fast Fourier transform or discrete Fourier transform to each frame of the speech signal to obtain the linear speech spectrum X(k), which can be expressed as:

X(k) = Σ_{n=0}^{N-1} x(n) e^{-j2πnk/N}, 0 ≤ k ≤ N-1   (1)

A2) Pass the linear spectrum X(k) through a Mel filter bank to generate the Mel spectrum, then take the logarithmic energy of the Mel spectrum to generate the corresponding log spectrum S(m).

Here the Mel filter bank is a set of triangular band-pass filters H_m(k) with 0 ≤ m ≤ M, where M is the number of filters, typically 20 to 28. The transfer function of each band-pass filter can be expressed as:

H_m(k) = 0, k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), f(m) < k ≤ f(m+1)
H_m(k) = 0, k > f(m+1)   (2)

In equation (2), f(m) is the center frequency of the m-th filter.

The logarithm of the Mel energy spectrum is taken to improve the performance of the voiceprint recognition system. The mapping from the linear speech spectrum X(k) to the log spectrum S(m) is:

S(m) = ln( Σ_{k=0}^{N-1} |X(k)|² H_m(k) ), 0 ≤ m ≤ M   (3)

A3) Convert the log spectrum S(m) into the MFCC feature parameters using the discrete cosine transform (DCT); the n-th feature component C(n) of the MFCC feature parameters is:

C(n) = Σ_{m=1}^{M} S(m) cos( πn(m - 0.5)/M ), n = 1, 2, …, L   (4)

where L is the number of cepstral coefficients retained.

The MFCC feature parameters obtained through the above steps reflect only the static characteristics of the speech signal; dynamic characteristic parameters can be obtained by computing their first-order and second-order differences.
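A minimal NumPy/SciPy sketch of equations (1) through (4) follows. The filter count (24), cepstral dimension (13) and the HTK-style Mel scale used to place the center frequencies f(m) are illustrative assumptions; the patent fixes none of them. `preprocess` refers to the sketch given earlier.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frames, sr=16000, n_filters=24, n_ceps=13):
    """MFCC per equations (1)-(4): FFT -> Mel filter bank -> log -> DCT."""
    N = frames.shape[1]                                   # e.g. N = 256
    power = np.abs(np.fft.rfft(frames, N)) ** 2           # |X(k)|^2, eq. (1)

    # Triangular Mel filter bank H_m(k), eq. (2), with center
    # frequencies f(m) spaced uniformly on the Mel scale.
    to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    from_mel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = from_mel(np.linspace(0, to_mel(sr / 2), n_filters + 2))
    bins = np.floor((N + 1) * edges / sr).astype(int)
    H = np.zeros((n_filters, N // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        H[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        H[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)

    S = np.log(power @ H.T + 1e-10)                       # log Mel spectrum, eq. (3)
    return dct(S, axis=1, norm='ortho')[:, 1:n_ceps + 1]  # DCT, eq. (4)
```

First- and second-order differences of these coefficients can then be appended to capture the dynamic characteristics mentioned above.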
In this embodiment, a Gammatone filter is used in extracting the GFCC (Gammatone frequency cepstral coefficient) feature parameters; it is designed as follows:

The Gammatone filter bank simulates the auditory characteristics of the basilar membrane of the cochlea; its time-domain expression is:

g_i(t) = t^{n-1} e^{-2π b_i t} cos(2π f_i t + φ_i) U(t), 1 ≤ i ≤ N   (5)

where N is the number of filters; n is the filter order, usually taken as 4; i is the filter index; f_i is the center frequency of the i-th filter; U(t) is the unit step function; b_i is the attenuation factor of the i-th filter; and φ_i is the phase of the i-th filter, usually taken as 0.

The bandwidth of each filter is related to the critical bands of human hearing; according to psychoacoustic theory, the critical band can be expressed by the equivalent rectangular bandwidth:

EBR(f) = 24.7 × (4.37f/1000 + 1)   (6)

The attenuation factor b_i of the filter is related to the bandwidth and determines the decay rate of the impulse response:

b_i = 1.019 EBR(f_i)   (7)

The time-domain impulse response of the Gammatone filter is an analog function; to facilitate computation it must be discretized. Taking the Laplace transform of equation (5) gives its transfer function (8), which is then sampled to obtain the discrete impulse response g_i(n). The output of the Gammatone filter bank is obtained by convolving the input speech signal s(n) with g_i(n).
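The following sketch builds a bank of sampled Gammatone impulse responses from equations (5) through (7). It discretizes by directly sampling the analog impulse response rather than via the Laplace-domain route in the text, and the filter count, band edges and ERB-rate spacing of the center frequencies are illustrative assumptions.

```python
import numpy as np

def gammatone_bank(sr=16000, n_filters=32, f_lo=80.0, f_hi=7600.0, dur=0.032):
    """Sampled Gammatone impulse responses g_i(n), order n = 4, phase 0."""
    t = np.arange(int(sr * dur)) / sr
    erb = lambda f: 24.7 * (4.37 * f / 1000 + 1)            # eq. (6)
    # Center frequencies spaced uniformly on the ERB-rate scale.
    erb_rate = lambda f: 21.4 * np.log10(4.37 * f / 1000 + 1)
    rates = np.linspace(erb_rate(f_lo), erb_rate(f_hi), n_filters)
    fc = (10 ** (rates / 21.4) - 1) * 1000 / 4.37
    bank = []
    for fi in fc:
        bi = 1.019 * erb(fi)                                # eq. (7)
        g = t ** 3 * np.exp(-2 * np.pi * bi * t) * np.cos(2 * np.pi * fi * t)
        bank.append(g / np.abs(g).sum())                    # crude gain normalization
    return fc, np.array(bank)
```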
The extraction of the GFCC feature parameters is similar to that of the MFCC feature parameters; it simply replaces the traditional Mel filter bank with the Gammatone filter bank, thereby exploiting the Gammatone filter's modeling of the cochlear basilar membrane and applying a well-matched nonlinear processing to the speech signal.

Based on the above Gammatone filter, as shown in Figure 3, the GFCC (Gammatone frequency cepstral coefficient) feature parameters are extracted as follows:

B1) First preprocess the input speech signal s(n) to generate a time-domain signal x(n), then apply a fast Fourier transform or discrete Fourier transform to obtain the discrete spectrum X(k):

X(k) = Σ_{n=0}^{N-1} x(n) e^{-j2πnk/N}, 0 ≤ k ≤ N-1   (9)

B2) Square the discrete spectrum X(k) to generate the speech energy spectrum, then filter it with the Gammatone filter bank.

B3) To further improve the performance of the voiceprint recognition system, apply exponential compression to the output of each filter, obtaining a set of exponential energy spectra s_1, s_2, …, s_M:

s_m = E_m^{e(f)}, m = 1, 2, …, M   (10)

where E_m is the output energy of the m-th filter, e(f) is the exponential compression value, and M is the number of filter channels.

B4) Finally, convert the exponential energy spectra into the GFCC feature parameters using the discrete cosine transform (DCT):

GFCC(n) = Σ_{m=1}^{M} s_m cos( πn(m - 0.5)/M ), n = 1, 2, …, L   (11)

where L is the dimension of the feature parameters.
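With the filter bank above, steps B1) through B4) can be sketched as below. Filtering is done in the frequency domain by weighting the energy spectrum with each filter's squared magnitude response, which plays the role of the time-domain convolution; the compression exponent e(f) is not fixed by the patent, so a constant 0.33 is assumed here.

```python
import numpy as np
from scipy.fftpack import dct

def gfcc(frames, bank, e=0.33, n_ceps=13):
    """GFCC per steps B1)-B4): energy spectrum -> Gammatone filtering
    -> exponential compression -> DCT. The constant exponent e is an
    assumption standing in for the patent's unspecified e(f)."""
    N = frames.shape[1]
    energy = np.abs(np.fft.rfft(frames, N)) ** 2      # speech energy spectrum
    resp = np.abs(np.fft.rfft(bank, N)) ** 2          # filter magnitude responses
    s = (energy @ resp.T) ** e                        # compressed spectra s_1..s_M
    return dct(s, axis=1, norm='ortho')[:, :n_ceps]   # eq. (11)
```

Typical use, building on the earlier sketches: `fc, bank = gammatone_bank()` followed by `feats = gfcc(preprocess(x), bank)`.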
In this embodiment, the LPCC (linear prediction cepstral coefficient) feature parameters are extracted as follows:

Suppose the system function of the vocal tract model is:

H(z) = 1 / (1 - Σ_{i=1}^{p} a_i z^{-i})   (12)

where p in equation (12) is the order of the predictor.

Let h(n) be the impulse response of H(z) and ĥ(n) the complex cepstrum of h(n), so that:

Ĥ(z) = log H(z) = Σ_{n=1}^{∞} ĥ(n) z^{-n}   (13)

Combining equations (12) and (13), differentiating both sides with respect to z^{-1}, and simplifying gives:

(1 - Σ_{i=1}^{p} a_i z^{-i}) Σ_{n=1}^{∞} n ĥ(n) z^{-n+1} = Σ_{i=1}^{p} i a_i z^{-i+1}   (14)

Equating the coefficients of each power of z^{-1} on the two sides of equation (14) yields the complex cepstrum:

ĥ(1) = a_1;
ĥ(n) = a_n + Σ_{k=1}^{n-1} (k/n) ĥ(k) a_{n-k}, 1 < n ≤ p   (15)

From the relationship between the complex cepstrum and the cepstral coefficients (16), the linear prediction cepstral coefficients can be computed recursively:

c(1) = a_1;
c(n) = a_n + Σ_{k=1}^{n-1} (k/n) c(k) a_{n-k}, 1 < n ≤ p;
c(n) = Σ_{k=n-p}^{n-1} (k/n) c(k) a_{n-k}, n > p   (17)

where c(n) are the linear prediction cepstral coefficients (LPCC) and a_n are the linear prediction coefficients.
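A sketch of the recursion in equation (17) follows. The patent does not say how the linear prediction coefficients a_n are estimated; here librosa's Burg-method LPC is used, with its sign convention converted to the H(z) = 1/(1 - Σ a_i z^{-i}) form assumed above, and the predictor order p = 12 is an illustrative choice.

```python
import numpy as np
import librosa

def lpcc(frame, p=12, n_ceps=13):
    """LPCC via the recursion of equation (17)."""
    # librosa.lpc returns [1, b_1, ..., b_p] for A(z) = 1 + sum b_k z^-k,
    # so negate to get a_i in H(z) = 1 / (1 - sum a_i z^-i).
    a = -librosa.lpc(frame.astype(float), order=p)[1:]
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = sum((k / n) * c[k - 1] * a[n - k - 1]
                  for k in range(max(1, n - p), n))
        c[n - 1] = (a[n - 1] if n <= p else 0.0) + acc
    return c
```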
In step 4 of this embodiment, the number of mixture components of GMM model A, GMM model B and GMM model C is set to 1024 in each case. The outputs of the three models are a, b and c respectively; the three results are fused with weight coefficients ω_i, giving the final result D = ω_1·a + ω_2·b + ω_3·c. A threshold γ is set: when D is greater than or equal to γ, the speaker is recognized as the target speaker; otherwise the speaker is recognized as a non-target speaker.
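A minimal sketch of the fusion and decision stage follows. The patent does not state which loss its stochastic gradient descent minimizes; here the weights are fitted by SGD on a cross-entropy loss over a sigmoid of the fused margin D - γ, which is one plausible reading of the soft decision, and the learning rate and epoch count are arbitrary.

```python
import numpy as np

def decide(scores, w, gamma):
    """Soft-decision fusion: target speaker iff D = w1*a + w2*b + w3*c >= gamma."""
    return float(np.dot(w, scores)) >= gamma

def fit_weights(S, y, gamma=0.0, lr=0.01, epochs=100):
    """SGD for the weight coefficients w_i. S: (n_samples, 3) array of
    the three GMM scores (a, b, c); y: 0/1 target-speaker labels.
    Cross-entropy on sigmoid(D - gamma) is an assumed surrogate loss."""
    w = np.full(3, 1.0 / 3.0)
    for _ in range(epochs):
        for s, t in zip(np.asarray(S, float), np.asarray(y, float)):
            p = 1.0 / (1.0 + np.exp(-(np.dot(w, s) - gamma)))  # P(target)
            w -= lr * (p - t) * s                              # one SGD step
    return w
```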
Claims (8)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010439120.7A | 2020-05-22 | 2020-05-22 | Voiceprint recognition method for home multi-feature parameter fusion |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010439120.7A | 2020-05-22 | 2020-05-22 | Voiceprint recognition method for home multi-feature parameter fusion |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN111785285A | 2020-10-16 |

Family

ID=72754331

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010439120.7A | Voiceprint recognition method for home multi-feature parameter fusion | 2020-05-22 | 2020-05-22 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN111785285A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101436405A (en) * | 2008-12-25 | 2009-05-20 | 北京中星微电子有限公司 | Method and system for recognizing speaking people |
CN104835498A (en) * | 2015-05-25 | 2015-08-12 | 重庆大学 | Voiceprint identification method based on multi-type combination characteristic parameters |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112712820A (en) * | 2020-12-25 | 2021-04-27 | 广州欢城文化传媒有限公司 | Tone classification method, device, equipment and medium |
CN112542174A (en) * | 2020-12-25 | 2021-03-23 | 南京邮电大学 | VAD-based multi-dimensional characteristic parameter voiceprint identification method |
CN112885355A (en) * | 2021-01-25 | 2021-06-01 | 上海头趣科技有限公司 | Speech recognition method based on multiple features |
CN113257266A (en) * | 2021-05-21 | 2021-08-13 | 特斯联科技集团有限公司 | Complex environment access control method and device based on voiceprint multi-feature fusion |
CN113393847A (en) * | 2021-05-27 | 2021-09-14 | 杭州电子科技大学 | Voiceprint recognition method based on fusion of Fbank features and MFCC features |
CN113393847B (en) * | 2021-05-27 | 2022-11-15 | 杭州电子科技大学 | Voiceprint recognition method based on fusion of Fbank features and MFCC features |
CN113177536A (en) * | 2021-06-28 | 2021-07-27 | 四川九通智路科技有限公司 | Vehicle collision detection method and device based on deep residual shrinkage network |
CN113177536B (en) * | 2021-06-28 | 2021-09-10 | 四川九通智路科技有限公司 | Vehicle collision detection method and device based on deep residual shrinkage network |
CN113612738A (en) * | 2021-07-20 | 2021-11-05 | 深圳市展韵科技有限公司 | Voiceprint real-time authentication encryption method, voiceprint authentication equipment and controlled equipment |
CN113823290A (en) * | 2021-08-31 | 2021-12-21 | 杭州电子科技大学 | Multi-feature fusion voiceprint recognition method |
CN113823293A (en) * | 2021-09-28 | 2021-12-21 | 武汉理工大学 | A method and system for speaker recognition based on speech enhancement |
CN113823293B (en) * | 2021-09-28 | 2024-04-26 | 武汉理工大学 | Speaker recognition method and system based on voice enhancement |
CN116386647A (en) * | 2023-05-26 | 2023-07-04 | 北京瑞莱智慧科技有限公司 | Audio verification method, related device, storage medium and program product |
CN116386647B (en) * | 2023-05-26 | 2023-08-22 | 北京瑞莱智慧科技有限公司 | Audio verification method, related device, storage medium and program product |
Similar Documents

| Publication | Title |
|---|---|
| CN111785285A (en) | Voiceprint recognition method for home multi-feature parameter fusion |
| CN107886967B (en) | Bone conduction voice enhancement method using a deep bidirectional gated recurrent neural network |
| CN110619885A (en) | Generative adversarial network speech enhancement method based on a deep fully convolutional neural network |
| CN108520753B (en) | Speech lie detection method based on a convolutional bidirectional long short-term memory network |
| CN111653289B (en) | Playback voice detection method |
| CN109192200B (en) | Speech recognition method |
| CN103310789A (en) | Sound event recognition method based on optimized parallel model combination |
| CN108564956B (en) | Voiceprint recognition method and device, server and storage medium |
| CN108630209A (en) | Marine organism recognition method based on feature fusion and a deep belief network |
| CN111785262B (en) | Speaker age and gender classification method based on a residual network and fused features |
| CN111986679A (en) | Speaker verification method, system and storage medium for complex acoustic environments |
| CN108564965A (en) | Anti-noise speech recognition system |
| CN110648684A (en) | WaveNet-based bone conduction speech enhancement waveform generation method |
| CN112735477B (en) | Voice emotion analysis method and device |
| CN101419800B (en) | Emotional speaker recognition method based on spectrum shifting |
| Jing et al. | Speaker recognition based on principal component analysis of LPCC and MFCC |
| Gamit et al. | Isolated words recognition using MFCC, LPC and neural network |
| CN106653004A (en) | Speaker recognition feature extraction method using perceptual-spectrum-regularized cochlear filter coefficients |
| CN111243621A (en) | Construction method of a GRU-SVM deep learning model for synthetic speech detection |
| CN112863517B (en) | Speech recognition method based on the convergence rate of the perceptual spectrum |
| Gandhiraj et al. | Auditory-based wavelet packet filterbank for speech recognition using neural network |
| CN114038469A (en) | Speaker recognition method based on a multi-class spectral feature attention fusion network |
| Nirjon et al. | sMFCC: exploiting sparseness in speech for fast acoustic feature extraction on mobile devices--a feasibility study |
| Yadav et al. | Speaker identification system using wavelet transform and VQ modeling technique |
| CN107871498A (en) | Hybrid feature combination algorithm based on Fisher's criterion to improve the speech recognition rate |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20201016 |