CN105957520A - Voice state detection method suitable for echo cancellation system

Info

Publication number
CN105957520A
Authority
CN
China
Prior art keywords
signal
gaussian
speech
voice
block
Prior art date
Legal status
Granted
Application number
CN201610519040.6A
Other languages
Chinese (zh)
Other versions
CN105957520B (en)
Inventor
王珂
明萌
纪红
李曦
张鹤立
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201610519040.6A
Publication of CN105957520A
Application granted
Publication of CN105957520B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G10L15/063 Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/08 Speech classification or search
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L21/0208 Noise filtering (speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention is a voice state detection method suitable for an echo cancellation system, and relates to the technical field of voice interaction over IP networks. Noise training samples and speech training samples are used to construct a support vector machine (SVM) classifier. The signals to be detected are the far-end and near-end signals after blocking. The constructed SVM classifier, which is based on a Gaussian mixture model, performs a VAD decision on the current block of the far-end signal. If the decision is that no far-end speech is present, filter updating and filtering are stopped and the near-end speech signal is output directly. If far-end speech is present, a double-talk decision is made; during double talk, filter coefficient updating is stopped and the near-end signal is filtered; otherwise, the filter coefficients are updated and filtering is performed according to the far-end signal. The invention improves the accuracy of voice activity detection, avoids misjudging the both-ends-silent state as the double-talk state, and prevents erroneous filter updating and filtering when no reference signal is present.

Description

A Voice State Detection Method Suitable for an Echo Cancellation System

Technical Field

The present invention relates to the technical field of voice interaction over IP networks, and in particular to a voice state detection method suitable for an echo cancellation system.

Background Art

Echo cancellation is widely used in IP-network-based voice interaction systems such as teleconferencing systems, in-car Bluetooth systems and IP phones. It removes the acoustic echo that arises when the sound played by the loudspeaker propagates along multiple paths, is picked up by the microphone, and is transmitted back to the far end of the system. The core idea of echo cancellation is to model the echo path with an adaptive filter and to subtract the estimated echo from the signal picked up by the microphone.

Voice state detection plays a vital role in echo cancellation. Before the sound signal enters the filter, the current voice state must be determined, and the working state of the filter is set according to the voice state of the system. Whether the voice state can be judged accurately and quickly has a great influence on the echo cancellation performance.

Existing echo cancellation systems usually apply a DTD (Double-Talk Detection) algorithm directly to decide whether the system is in the double-talk state, and stop updating the filter coefficients during double talk to prevent the filter from diverging under interference from near-end speech. The commonly used Geigel DTD algorithm decides whether near-end speech is present by comparing the amplitudes of the near-end and far-end signals: the system is considered to be in the double-talk state when the ratio ξ(g) of the near-end amplitude to the far-end amplitude exceeds a threshold T, i.e. when:

ξ(g) = |y(k)| / max{|x(k-1)|, ..., |x(k-N)|} > T

near-end speech is considered present and the system is in the double-talk state. Here |y(k)| is the near-end amplitude, and max{|x(k-1)|, ..., |x(k-N)|} is the maximum amplitude of the previous N samples of the far-end signal. The threshold T is determined by the echo-path attenuation and is usually set to 0.5; N is usually equal to the filter length.
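For illustration only (this is background, not the method of the invention), a minimal Python sketch of the Geigel decision described above might look as follows; the threshold value 0.5 and the use of the previous N far-end samples follow the description, while the function name and interface are assumptions:

```python
import numpy as np

def geigel_dtd(y_k, x_history, T=0.5):
    """Geigel double-talk decision for one near-end sample.

    y_k       : current near-end sample y(k)
    x_history : the previous N far-end samples x(k-1), ..., x(k-N)
    T         : threshold determined by the echo-path attenuation (about 0.5)

    Returns True when near-end speech (double talk) is declared.
    """
    xi_g = abs(y_k) / (np.max(np.abs(x_history)) + 1e-12)  # guard against division by zero
    return xi_g > T
```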

However, this method has the following drawbacks:

1. The Geigel algorithm assumes that the near-end speech is much stronger than the far-end echo, which does not always hold in practical echo cancellation scenarios, so it is inaccurate in some cases.

2. Performing DTD directly without first performing VAD (Voice Activity Detection) on the far-end signal may cause the both-ends-silent state to be misjudged as the double-talk state.

3. Filter coefficient updating is stopped only in the double-talk state; continuing to filter and update the coefficients when no far-end speech is present may cause the filter to diverge and erroneously subtract non-existent far-end speech from the near-end signal.

Summary of the Invention

To overcome the above three problems, the present invention proposes a voice state detection method that combines VAD and DTD, and designs a new filtering and updating strategy based on the detection result, so as to improve detection accuracy, avoid misjudgment of the voice state, and prevent erroneous filter updating and filtering.

The voice state detection method for an echo cancellation system provided by the present invention is implemented in the following steps:

Step 1: construct a support vector machine (SVM) classifier from noise training samples and speech training samples.

Feature extraction and Gaussian mixture model (GMM) training are performed separately on the noise training samples and the speech training samples, and the corresponding Gaussian supervectors are constructed. The Gaussian supervectors are used to construct the kernel function of the SVM classifier and the SVM models corresponding to the speech signal and the noise signal; the SVM classifier is then obtained from the constructed kernel function and SVM models.

Step 2: the signals to be detected are the far-end and near-end signals after blocking. The constructed GMM-based SVM classifier performs a VAD decision on the current block of the far-end signal.

Feature extraction and GMM training are performed on the current far-end block to construct its Gaussian supervector, which is fed into the constructed SVM classifier for a decision. If the block is classified as noise, meaning no speech is present, filter updating and filtering are stopped and the near-end speech signal is output directly. Otherwise far-end speech is present, and the double-talk decision of the next step is performed.

Step 3: determine whether the system is in the double-talk state.

The normalized cross-correlation ξ_XECC between the far-end signal and the error signal is computed and compared with the preset threshold T_XECC. When ξ_XECC < T_XECC, near-end speech is present and the system is in the double-talk state: filter coefficient updating is stopped and the near-end signal is filtered. When ξ_XECC ≥ T_XECC, no near-end speech is present, and the filter coefficients are updated and filtering is performed according to the far-end signal.

The advantages and beneficial effects of the present invention are:

(1) A support vector machine algorithm based on a Gaussian mixture model is used for voice activity detection on the far-end signal, which improves the accuracy of voice activity detection and overcomes the inaccuracy of the commonly used energy-based voice activity detection methods under low signal-to-noise-ratio conditions.

(2) Far-end voice activity detection is performed before double-talk detection, and double-talk detection is carried out only when far-end speech is present, which avoids misjudging the both-ends-silent state as the double-talk state. A cross-correlation-based double-talk detection algorithm is adopted, which improves the accuracy of double-talk detection.

(3) Different filtering and updating strategies are adopted according to the voice state of the system. Compared with a traditional echo cancellation system, which stops updating the filter coefficients only during double talk, coefficient updating and filtering are also stopped when no far-end speech is present, which further prevents erroneous filter updating and filtering when no reference signal is available.

Brief Description of the Drawings

Fig. 1 is an overall flow chart of the voice state detection method for an echo cancellation system according to the present invention;

Fig. 2 shows the two PCM streams used in the simulation of the embodiment of the present invention;

Fig. 3 shows the echo cancellation result of the embodiment when only energy-based DTD detection is used;

Fig. 4 shows the echo cancellation result of the embodiment when the method of the present invention is used;

Fig. 5 shows the Sipdroid echo cancellation result of the embodiment using the echo cancellation library before the improvement;

Fig. 6 shows the Sipdroid echo cancellation result of the embodiment using the improved echo cancellation library.

Detailed Description of the Embodiments

The present invention is described in further detail below with reference to the accompanying drawings and embodiments.

The method of the present invention performs VAD on the far-end signal before DTD. When VAD detects that no far-end signal is present, filter coefficient updating and filtering are stopped immediately to prevent the filter from diverging and filtering erroneously. DTD is performed only when VAD detects far-end speech, and filter coefficient updating is stopped during double talk. The VAD algorithm used is an SVM (Support Vector Machine) algorithm based on a GMM (Gaussian Mixture Model): the GMM is used to construct a feature supervector, which serves both as the SVM feature input and in the kernel function construction, and its accuracy is higher than that of the commonly used energy-based or correlation-based VAD algorithms. The DTD algorithm used is based on the cross-correlation between the far-end signal and the error signal, whose accuracy is also higher than that of the commonly used energy-based Geigel algorithm. Combining far-end VAD with DTD improves the accuracy of voice state detection, and adopting different filtering strategies in different voice states prevents filter divergence and erroneous filtering, greatly improving the echo cancellation result.
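For orientation, the per-block control flow described above (far-end VAD first, then cross-correlation DTD, then the filtering strategy) can be sketched as follows. This is only an illustrative outline: `vad_is_speech`, `normalized_xcorr` and the `afilter` object with `filter()` and `update()` methods are hypothetical stand-ins for the components detailed in the steps below, and the threshold value 0.95 is an assumed example within the 0.9-1.0 range mentioned later.

```python
def process_block(far_block, near_block, afilter, vad_is_speech,
                  normalized_xcorr, t_xecc=0.95):
    """One block of the combined far-end VAD + DTD echo-cancellation strategy."""
    # Far-end VAD: with no far-end speech there is no echo, so do not adapt or filter.
    if not vad_is_speech(far_block):
        return near_block

    # Error signal = near-end signal minus the echo estimated by the adaptive filter.
    error = near_block - afilter.filter(far_block)

    # DTD: low cross-correlation between far-end and error means near-end speech is present.
    if normalized_xcorr(far_block, error) < t_xecc:
        return error                      # double talk: filter with frozen coefficients

    afilter.update(far_block, error)      # far-end single talk: adapt the filter
    return error
```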

The steps of the voice state detection method for an echo cancellation system of the present invention are described with reference to Fig. 1.

Step 1: construct the SVM classifier from noise training samples and speech training samples, comprising steps S101 to S103.

Step S101: extract feature values from the noise training samples and the speech training samples. The features used here are Mel-frequency cepstral coefficients (MFCC). The MFCC extraction procedure is as follows: the signal is pre-emphasised, divided into blocks and windowed; each windowed block is passed through a fast Fourier transform (FFT) to obtain its spectrum; the spectrum of each block is passed through a Mel-scale filter bank consisting of K triangular band-pass filters, numbered 0 to K-1; the output of each band is converted to a logarithm to obtain the log energy, giving K log-spectrum values per block. K is a positive integer, typically between 20 and 30. Finally, the K log-spectrum values are cosine-transformed to obtain the Mel cepstral coefficients. The formula for transforming the log spectrum to the cepstral domain by a discrete cosine transform is:

m_i(l) = Σ_{k=0}^{K-1} S_i(k)·cos(πl(k + 1/2)/K),  0 ≤ k < K, 0 ≤ l < L    (1)

where S_i(k) is the log spectrum obtained when the i-th block passes through the band-pass filter numbered k, K is the number of Mel band-pass filters, m_i(l) is the l-th order MFCC parameter of the i-th block, L is the total number of MFCC orders extracted, and i (a positive integer) in formula (1) indexes the block.
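As a rough illustration of formula (1), the following numpy-only sketch computes the MFCC of one block. The frame handling is simplified, and the parameter values (K = 26 filters, L = 13 coefficients, 256-point FFT, 0.97 pre-emphasis, Hamming window) are common choices assumed for the example, not values fixed by the invention; a practical system would normally use an existing feature-extraction library instead.

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_block(block, fs=8000, K=26, L=13, nfft=256):
    """MFCC of one signal block, following the procedure of step S101 and formula (1)."""
    block = np.asarray(block, dtype=float)
    block = np.append(block[0], block[1:] - 0.97 * block[:-1])            # pre-emphasis
    spec = np.abs(np.fft.rfft(block * np.hamming(len(block)), nfft)) ** 2

    # K triangular Mel-scale band-pass filters between 0 Hz and fs/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), K + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((K, nfft // 2 + 1))
    for k in range(K):
        left, centre, right = bins[k], bins[k + 1], bins[k + 2]
        fbank[k, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[k, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)

    S = np.log(fbank @ spec + 1e-12)          # the K log-spectrum values S_i(k)
    l_idx = np.arange(L)[:, None]
    k_idx = np.arange(K)[None, :]
    return (S * np.cos(np.pi * l_idx * (k_idx + 0.5) / K)).sum(axis=1)    # formula (1)
```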

Step S102: generate the Gaussian supervectors corresponding to the noise training samples and the speech training samples.

The MFCC parameters of the noise training samples and of the speech training samples are used to build the Gaussian mixture models of the noise signal and of the speech signal, respectively. A GMM is essentially a multidimensional probability density function; an N-order Gaussian mixture model g(x) describes the distribution of the frame features in feature space as a linear combination of N single Gaussian distributions. For a given block, g(x) is expressed as:

g(x) = Σ_{i=1}^{N} w_i·p_i(x)    (2)

where x is the L-dimensional feature vector formed by the MFCC parameters of the training-sample block, N is the order of the Gaussian mixture model, p_i(x) is the i-th Gaussian component of the model, and w_i is the weighting factor of component p_i(x).

p_i(x) is expressed as follows:

p_i(x) = (1 / ((2π)^{L/2} |Σ_i|^{1/2}))·exp{ -(x - μ_i)^T Σ_i^{-1} (x - μ_i) / 2 }    (3)

where Σ_i is the covariance matrix of the i-th Gaussian component and μ_i is its mean vector. The parameter set λ of the GMM can therefore be expressed as:

λ = (w_i, μ_i, Σ_i),  i = 1, 2, ..., N    (4)

and the corresponding Gaussian mixture model g(x) can be expressed as:

g(x) = Σ_{i=1}^{N} w_i·N(x; μ_i, Σ_i)    (5)

where N(·) denotes the Gaussian probability density function.

Building the GMM is in effect estimating its parameters through training. The expectation-maximization (EM) algorithm can be used to update the model parameters. The algorithm has two main steps: the expectation (E) step and the maximization (M) step. The E step uses the current parameter set to compute the expected value of the likelihood function of the complete data, and the M step obtains new parameters by maximizing that expectation. The E and M steps are iterated until convergence. Finally, the GMMs of speech and of noise are obtained, denoted g(s) and g(n), where s denotes the speech signal and n denotes the noise signal.
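A minimal sketch of this training step, assuming that scikit-learn's GaussianMixture (which performs exactly this EM iteration) may be used; the model order N = 8 and the diagonal covariance type are assumptions of the example:

```python
from sklearn.mixture import GaussianMixture

def train_gmm(mfcc_frames, n_components=8):
    """Fit an N-order GMM to a (frames x L) matrix of MFCC vectors via EM."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",   # diagonal covariances keep the supervector simple
                          max_iter=200, random_state=0)
    gmm.fit(mfcc_frames)
    return gmm

# g_s = train_gmm(speech_mfcc)   # g(s): GMM of the speech training samples
# g_n = train_gmm(noise_mfcc)    # g(n): GMM of the noise training samples
```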

The established Gaussian mixture models are then used to construct the Gaussian supervectors. A Gaussian supervector is built from the parameters of a Gaussian mixture model; the GMM supervectors m_s and m_n of speech and noise can be expressed as:

m_s = ((w_1 Σ_1^{-1/2} μ_1^s)^T, (w_2 Σ_2^{-1/2} μ_2^s)^T, ..., (w_N Σ_N^{-1/2} μ_N^s)^T)    (6)

m_n = ((w_1 Σ_1^{-1/2} μ_1^n)^T, (w_2 Σ_2^{-1/2} μ_2^n)^T, ..., (w_N Σ_N^{-1/2} μ_N^n)^T)    (7)

where μ_i^s denotes the mean vectors of the Gaussian components of g(s), and μ_i^n denotes the mean vectors of the Gaussian components of g(n).
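Following formulas (6) and (7), one possible way of assembling the Gaussian supervector from a fitted GaussianMixture with diagonal covariances (an assumption carried over from the previous sketch) is:

```python
import numpy as np

def gaussian_supervector(gmm):
    """Stack w_i * Sigma_i^(-1/2) * mu_i over all N components, as in formulas (6)/(7)."""
    w = gmm.weights_            # shape (N,)
    mu = gmm.means_             # shape (N, L)
    var = gmm.covariances_      # shape (N, L): diagonal entries of Sigma_i
    return (w[:, None] * mu / np.sqrt(var)).ravel()   # supervector of length N*L
```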

Step S103: construct the SVM classifier from the constructed Gaussian supervectors. The Gaussian supervectors m_n and m_s corresponding to the noise signal and the speech signal are used to build the SVM models of the noise signal and the speech signal, and to construct the K-L kernel function, which is built from the K-L divergence between the two GMM probability distributions.

The kernel function K(n, s) constructed from the GMM supervectors m_n and m_s of noise and speech is:

K(n, s) = Σ_{i=1}^{N} (w_i Σ_i^{-1/2} μ_i^n)^T (w_i Σ_i^{-1/2} μ_i^s)    (8)

Once the kernel function, the SVM model of the speech signal and the SVM model of the noise signal are determined, the SVM classifier is obtained.
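By the definitions in formulas (6) to (8), K(n, s) is the inner product of the two supervectors, so one way to realise the classifier is a linear-kernel SVM operating on supervectors; the sketch below assumes scikit-learn's SVC and labelled collections of supervectors for the two classes:

```python
import numpy as np
from sklearn.svm import SVC

def train_svm_classifier(speech_supervectors, noise_supervectors):
    """Train the speech/noise SVM on GMM supervectors (kernel of formula (8))."""
    X = np.vstack([speech_supervectors, noise_supervectors])
    y = np.hstack([np.ones(len(speech_supervectors)),     # 1 = speech
                   np.zeros(len(noise_supervectors))])    # 0 = noise
    clf = SVC(kernel="linear")   # linear kernel on supervectors = K(n, s) of formula (8)
    clf.fit(X, y)
    return clf

# VAD decision for one far-end block (step 2 below), assuming the helpers sketched above:
#   sv = gaussian_supervector(train_gmm(block_mfcc_frames))
#   far_end_has_speech = clf.predict(sv.reshape(1, -1))[0] == 1
```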

Step 2: use the constructed GMM-based SVM classifier to make a VAD decision on the current far-end block. The signals to be detected that are input to the SVM classifier are the far-end and near-end signals after blocking. They are first converted to the frequency domain by a Fourier transform, and the block features (MFCC, normalized cross-correlation, etc.) are then computed from the spectra. This stage comprises steps S201 to S203.

Step S201: extract the MFCC parameters of the current far-end block. The extraction procedure is the same as in step S101; the MFCC parameters of the block are finally obtained from formula (1).

Step S202: generate the Gaussian supervector of the current far-end block. A Gaussian mixture model is built from the MFCC parameters of the block, and the Gaussian supervector of the block is then constructed from that model, as in step S102 and formulas (6) and (7).

Step S203: feed the Gaussian supervector of the current far-end block into the constructed SVM classifier and perform speech/noise classification with the GMM-based SVM algorithm to obtain the VAD decision for the far-end signal. If the block is classified as noise, i.e. no speech is present, filter updating and filtering are stopped and the near-end speech signal is output directly. If it is classified as speech, far-end speech is present and the double-talk decision of the next step is performed.

Step 3: determine whether the system is in the double-talk state.

Step S301: compute the error signal.

The adaptive filter coefficients model the echo path, so convolving the current far-end block with the adaptive filter coefficients yields the estimated echo signal x^T(n)w(n); the error signal e(n) is the difference between the near-end signal d(n) of the block and the estimated echo x^T(n)w(n).

The adaptive filter coefficients are continuously updated from the error signal and the far-end signal according to an adaptive algorithm. The update formula of the commonly used LMS algorithm is:

w(n+1)=w(n)+2μe(n)x(n) (9)w(n+1)=w(n)+2μe(n)x(n) (9)

where μ is the step size, w(n) is the filter weight vector, e(n) is the error signal, x(n) is the far-end signal, and n denotes the n-th time instant (sample).
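A minimal sketch of the estimated echo, the error signal and the LMS update of formula (9); the step-size value 0.01 and the sample-by-sample interface are assumptions of the example:

```python
import numpy as np

def lms_step(w, x_vec, d_n, mu=0.01, adapt=True):
    """One LMS iteration: estimate the echo, form the error, optionally update w.

    w     : filter weight vector w(n)
    x_vec : the most recent far-end samples (same length as w)
    d_n   : current near-end (microphone) sample d(n)
    adapt : set False to freeze the coefficients (double talk or far-end silence)
    """
    echo_est = np.dot(w, x_vec)          # x^T(n) w(n), the estimated echo
    e_n = d_n - echo_est                 # error signal e(n)
    if adapt:
        w = w + 2 * mu * e_n * x_vec     # formula (9)
    return w, e_n
```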

Step S302: compute the normalized cross-correlation between the far-end signal and the error signal. Since cross-correlation in the time domain corresponds to point-wise multiplication of the two spectra in the frequency domain, the normalized cross-correlation can be obtained directly from the far-end spectrum X(k) and the error spectrum E(k), with low computational complexity. In the frequency domain the normalized cross-correlation is computed as:

ξ_XECC = max_k { E[X(k)E(k)] / √(E[X(k)²]·E[E(k)²]) }    (10)

where ξ_XECC denotes the normalized cross-correlation between the far-end signal and the error signal, and k denotes the frequency bin.
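In formula (10) the expectation E[·] has to be estimated in practice; a common choice, assumed in the sketch below, is recursive smoothing of the cross- and auto-spectra over successive blocks (the smoothing factor 0.9 is likewise an assumption):

```python
import numpy as np

class XeccDetector:
    """Normalized cross-correlation xi_XECC of formula (10), computed in the frequency domain."""

    def __init__(self, nbins, alpha=0.9):
        self.alpha = alpha
        self.sxe = np.zeros(nbins)   # smoothed estimate of E[X(k) E(k)]
        self.sxx = np.zeros(nbins)   # smoothed estimate of E[X(k)^2]
        self.see = np.zeros(nbins)   # smoothed estimate of E[E(k)^2]

    def update(self, X_mag, E_mag):
        """X_mag, E_mag: magnitude spectra of the current far-end and error blocks."""
        a = self.alpha
        self.sxe = a * self.sxe + (1 - a) * X_mag * E_mag
        self.sxx = a * self.sxx + (1 - a) * X_mag ** 2
        self.see = a * self.see + (1 - a) * E_mag ** 2
        return np.max(self.sxe / (np.sqrt(self.sxx * self.see) + 1e-12))
```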

Step S303: DTD decision. The normalized cross-correlation ξ_XECC between the far-end signal and the error signal is compared with the normalized cross-correlation threshold. When there is no near-end speech, ξ_XECC should equal 1; when near-end speech is present, ξ_XECC is smaller than 1. A constant T_XECC slightly smaller than 1 can therefore be set as the threshold; T_XECC is usually between 0.9 and 1, and the threshold is updated in real time according to the detection results, with the update rule chosen according to the actual situation. A good threshold should keep both the false-alarm probability and the miss probability relatively small. For example, one may first arbitrarily choose a constant slightly smaller than 1, then set the near-end speech to zero, compute the false-alarm and miss probabilities, and adjust T_XECC within a certain range until both probabilities are small.

When the normalized cross-correlation falls below the threshold, i.e.:

ξ_XECC < T_XECC    (11)

the system is in the double-talk state: filter coefficient updating is stopped and the near-end signal is filtered directly with the existing filter coefficients. Otherwise there is no near-end speech and only far-end speech is present, in which case both filter coefficient updating and filtering are performed.
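One simple way of carrying out the threshold adjustment described in step S303 is to scan candidate values of T_XECC and keep the one with the smallest combined error; the sketch below is only an illustration and assumes that ξ_XECC measurements are available both with the near-end speech set to zero and during genuine double talk:

```python
def tune_threshold(xecc_single_talk, xecc_double_talk, candidates=None):
    """Pick T_XECC so that the false-alarm and miss probabilities are both small.

    xecc_single_talk : xi_XECC values measured with the near-end speech set to zero
    xecc_double_talk : xi_XECC values measured during genuine double talk
    """
    if candidates is None:
        candidates = [0.90 + 0.005 * i for i in range(20)]   # scan the range 0.90 .. 0.995
    best_t, best_err = None, float("inf")
    for t in candidates:
        false_alarm = sum(v < t for v in xecc_single_talk) / len(xecc_single_talk)
        miss = sum(v >= t for v in xecc_double_talk) / len(xecc_double_talk)
        if false_alarm + miss < best_err:
            best_t, best_err = t, false_alarm + miss
    return best_t   # at run time: declare double talk iff xi_XECC < best_t
```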

The voice state detection method proposed by the present invention is applied to an actual echo cancellation system with two terminals, and the actual call quality is verified with the VoIP software Sipdroid.

First, the voice state detection method combining VAD and DTD proposed by the present invention is simulated in MATLAB. The speech signals used in the simulation are a 30-second far-end speech PCM (Pulse Code Modulation) stream and a corresponding near-end speech PCM stream, both sampled at 8000 Hz. In the echo cancellation system, the filter length is set to 128, the adaptive filtering algorithm is BFDAF (the frequency-domain NLMS algorithm), and the voice state detection algorithm is the method proposed by the present invention.

Fig. 2 shows the two PCM streams used in the simulation: from top to bottom, the far-end signal waveform and the near-end signal waveform. The horizontal axis is time in seconds and the vertical axis is amplitude. With the original voice state detection method, i.e. only energy-based DTD detection, the echo cancellation result is shown in Fig. 3. It can be seen that, without the VAD improvement, echo cancellation in the first half is fairly good although a small amount of residual echo remains; in the second half the result is less satisfactory, as too much of the original speech is removed and the signal after echo cancellation is noticeably distorted.

With the voice state detection method proposed by the present invention, the echo cancellation result is shown in Fig. 4. Comparing the two PCM streams obtained after echo cancellation before and after the improvement, the echo cancellation result improves clearly once the voice state detection method is improved: the residual echo is removed more thoroughly and the near-end speech shows almost no distortion.

To further verify the effect of the proposed voice state detection method in an actual echo cancellation system, the method is implemented as a C program and tested with the voice communication software Sipdroid.

The parts of the WebRTC echo cancellation library that perform VAD and DTD are modified according to the steps of the voice state detection method of the present invention, and the modified library is then called from Sipdroid. Actual two-party calls are made and recorded with Sipdroid in different environments, and the speech PCM streams before and after echo cancellation are saved for analysis of the echo cancellation effect.

To make observation and analysis of the extracted speech streams convenient and clear, in each test the two callers counted from 1 to 10 in turn. In different environments, multiple call tests were performed on the Sipdroid versions before and after the improvement for comparison.

First, multiple call tests were performed on Sipdroid with the original (unimproved) echo cancellation library, and the far-end, near-end and echo-cancelled PCM streams were extracted. The test result is shown in Fig. 5, where only the counting part of the PCM streams is shown. The first PCM stream is the far-end signal, the second is the near-end signal, and the third is the near-end signal after echo cancellation. The echo cancellation result is not ideal: a little residual echo remains in the counting part, as marked by the dashed box. Most other test results are similar.

Then the echo cancellation effect of Sipdroid with the improved echo cancellation library was tested in the same way over multiple calls, and the far-end, near-end and echo-cancelled PCM streams were extracted. Fig. 6 shows a representative test result. As in Fig. 5, the first PCM stream is the far-end signal, the second is the near-end signal, and the third is the near-end signal after echo cancellation. With the improved voice state detection method of the present invention, the echo cancellation result is close to ideal: the residual echo in the counting part is removed almost completely, as marked by the dashed box, while the original speech is preserved. Repeated tests show that the echo cancellation effect is affected to some extent by the environment and its stability still needs further improvement, but in most cases the echo cancellation result obtained with the proposed voice state detection method is clearly better than before the improvement.

Claims (5)

1. A voice state detection method suitable for an echo cancellation system, characterized in that it is implemented in the following steps:

Step 1: construct a support vector machine (SVM) classifier from noise training samples and speech training samples;

feature extraction and Gaussian mixture model (GMM) training are performed separately on the noise training samples and the speech training samples, and the corresponding Gaussian supervectors are constructed; the Gaussian supervectors are then used to construct the kernel function of the SVM classifier and the SVM models corresponding to the speech signal and the noise signal; the SVM classifier is obtained from the constructed kernel function and SVM models;

Step 2: the signals to be detected are the far-end and near-end signals after blocking; the constructed SVM classifier performs a VAD decision on the current far-end block, where VAD denotes voice activity detection;

feature extraction and GMM training are performed on the current far-end block to construct its Gaussian supervector, which is then fed into the constructed SVM classifier for a decision; if the decision result is noise, meaning no speech is present, filter updating and filtering are stopped and the near-end speech signal is output directly; otherwise far-end speech is present and the double-talk decision of the next step is performed;

Step 3: determine whether the system is in the double-talk state;

the normalized cross-correlation ξ_XECC between the far-end signal and the error signal is computed and compared with the preset threshold T_XECC; when ξ_XECC < T_XECC, the system is in the double-talk state, filter coefficient updating is stopped, and the near-end signal is filtered; otherwise there is no near-end speech, and the filter coefficients are updated and filtering is performed according to the far-end signal.

2. The voice state detection method suitable for an echo cancellation system according to claim 1, characterized in that the first step of constructing the SVM classifier comprises the following steps:

Step S101: extract feature values from the noise training samples and the speech training samples; the features used are Mel cepstral coefficients (MFCC);

the MFCC extraction procedure is: the signal is pre-emphasised, divided into blocks and windowed; each windowed block is passed through a fast Fourier transform (FFT) to obtain its spectrum; the spectrum of each block is passed through a Mel-scale filter bank consisting of K triangular band-pass filters, and the logarithm of each band output is taken to obtain the log spectrum; with the K band-pass filters numbered 0 to K-1, the log spectrum obtained when the i-th block passes through the band-pass filter numbered k is S_i(k), and the l-th order MFCC parameter m_i(l) of the i-th block is:

m_i(l) = Σ_{k=0}^{K-1} S_i(k)·cos(πl(k + 1/2)/K),  0 ≤ k < K, 0 ≤ l < L

where L is the total number of MFCC orders extracted;

Step S102: generate the Gaussian supervectors of the noise training samples and the speech training samples;

the MFCC parameters of the noise training samples and of the speech training samples are used to build the Gaussian mixture models of the noise signal and of the speech signal, respectively;

for a given block, the N-order Gaussian mixture model g(x) is expressed as:

g(x) = Σ_{i=1}^{N} w_i·p_i(x)

where x is the L-dimensional feature vector formed by the MFCC parameters of the training-sample block, p_i(x) is the i-th Gaussian component of the model, w_i is the weighting factor of the i-th Gaussian component, Σ_i is the covariance matrix of the i-th Gaussian component, and μ_i is the mean vector of the i-th Gaussian component;

the Gaussian mixture model g(x) is further expressed as g(x) = Σ_{i=1}^{N} w_i·N(x; μ_i, Σ_i), where N(·) denotes the Gaussian probability density function;

the expectation-maximization algorithm is used to update the parameters of the Gaussian mixture model; the Gaussian mixture model finally obtained for the speech training samples is g(s), in which the mean vector of each Gaussian component is μ_i^s and s denotes the speech signal; the Gaussian mixture model finally obtained for the noise training samples is g(n), in which the mean vector of each Gaussian component is μ_i^n and n denotes the noise signal; the established Gaussian mixture models are used to construct the Gaussian supervectors m_s and m_n of the speech training samples and the noise training samples:

m_s = ((w_1 Σ_1^{-1/2} μ_1^s)^T, (w_2 Σ_2^{-1/2} μ_2^s)^T, ..., (w_N Σ_N^{-1/2} μ_N^s)^T)

m_n = ((w_1 Σ_1^{-1/2} μ_1^n)^T, (w_2 Σ_2^{-1/2} μ_2^n)^T, ..., (w_N Σ_N^{-1/2} μ_N^n)^T)

Step S103: construct the SVM classifier from the constructed Gaussian supervectors;

the Gaussian supervectors m_n and m_s are used to build the SVM models corresponding to the noise signal and the speech signal;

the kernel function K(n, s) is constructed from the Gaussian supervectors m_n and m_s as:

K(n, s) = Σ_{i=1}^{N} (w_i Σ_i^{-1/2} μ_i^n)^T (w_i Σ_i^{-1/2} μ_i^s)

the kernel function, the SVM model of the speech signal and the SVM model of the noise signal are determined, and the SVM classifier is obtained.

3. The voice state detection method suitable for an echo cancellation system according to claim 1 or 2, characterized in that, in the third step, the error signal is computed as follows: the current far-end block is convolved with the adaptive filter coefficients to obtain the estimated echo signal, and the error signal is the difference between the near-end signal of the block and the estimated echo signal.

4. The voice state detection method suitable for an echo cancellation system according to claim 1 or 2, characterized in that, in the third step, the normalized cross-correlation ξ_XECC between the far-end signal and the error signal is computed according to the following formula:

ξ_XECC = max_k { E[X(k)E(k)] / √(E[X(k)²]·E[E(k)²]) }

where k denotes the frequency bin, X(k) is the far-end signal spectrum, and E(k) is the error signal spectrum.

5. The voice state detection method suitable for an echo cancellation system according to claim 1 or 2, characterized in that, in the third step, the threshold T_XECC is set to a value between 0.9 and 1 and is updated in real time according to the decision results.
CN201610519040.6A 2016-07-04 2016-07-04 A Speech State Detection Method Applicable to Echo Cancellation System Active CN105957520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610519040.6A CN105957520B (en) 2016-07-04 2016-07-04 A Speech State Detection Method Applicable to Echo Cancellation System

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610519040.6A CN105957520B (en) 2016-07-04 2016-07-04 A Speech State Detection Method Applicable to Echo Cancellation System

Publications (2)

Publication Number Publication Date
CN105957520A true CN105957520A (en) 2016-09-21
CN105957520B CN105957520B (en) 2019-10-11

Family

ID=56903377

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610519040.6A Active CN105957520B (en) 2016-07-04 2016-07-04 A Speech State Detection Method Applicable to Echo Cancellation System

Country Status (1)

Country Link
CN (1) CN105957520B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012009047A1 (en) * 2010-07-12 2012-01-19 Audience, Inc. Monaural noise suppression based on computational auditory scene analysis
WO2013040414A1 (en) * 2011-09-16 2013-03-21 Qualcomm Incorporated Mobile device context information using speech detection
CN103258532A (en) * 2012-11-28 2013-08-21 河海大学常州校区 Method for recognizing Chinese speech emotions based on fuzzy support vector machine
CN103151039A (en) * 2013-02-07 2013-06-12 中国科学院自动化研究所 Speaker age identification method based on SVM (Support Vector Machine)
CN105657110A (en) * 2016-02-26 2016-06-08 深圳Tcl数字技术有限公司 Voice communication echo cancellation method and device

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106448661B (en) * 2016-09-23 2019-07-16 South China University of Technology Audio type detection method based on two-level modeling of clean speech and background noise
CN106448661A (en) * 2016-09-23 2017-02-22 华南理工大学 Audio type detection method based on pure voice and background noise two-level modeling
CN108429994A (en) * 2017-02-15 2018-08-21 阿里巴巴集团控股有限公司 Audio identification, echo cancel method, device and equipment
CN108429994B (en) * 2017-02-15 2020-10-09 阿里巴巴集团控股有限公司 Audio identification and echo cancellation method, device and equipment
CN109215672A (en) * 2017-07-05 2019-01-15 上海谦问万答吧云计算科技有限公司 A kind of processing method of acoustic information, device and equipment
CN109309764B (en) * 2017-07-28 2021-09-03 北京搜狗科技发展有限公司 Audio data processing method and device, electronic equipment and storage medium
CN109309764A (en) * 2017-07-28 2019-02-05 北京搜狗科技发展有限公司 Audio data processing method, device, electronic equipment and storage medium
US11151976B2 (en) 2017-10-19 2021-10-19 Zhejiang Dahua Technology Co., Ltd. Methods and systems for operating a signal filter device
WO2019076328A1 (en) * 2017-10-19 2019-04-25 Zhejiang Dahua Technology Co., Ltd. Methods and systems for operating a signal filter device
CN107888792A (en) * 2017-10-19 2018-04-06 浙江大华技术股份有限公司 A kind of echo cancel method, apparatus and system
CN107888792B (en) * 2017-10-19 2019-09-17 浙江大华技术股份有限公司 A kind of echo cancel method, apparatus and system
CN109068012A (en) * 2018-07-06 2018-12-21 南京时保联信息科技有限公司 A kind of double talk detection method for audio conference system
CN109348072B (en) * 2018-08-30 2021-03-02 湖北工业大学 Double-end call detection method applied to echo cancellation system
CN109348072A (en) * 2018-08-30 2019-02-15 湖北工业大学 A double-ended talk detection method applied to echo cancellation system
CN109473123A (en) * 2018-12-05 2019-03-15 百度在线网络技术(北京)有限公司 Voice activity detection method and device
US11127416B2 (en) 2018-12-05 2021-09-21 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for voice activity detection
CN109379501B (en) * 2018-12-17 2021-12-21 嘉楠明芯(北京)科技有限公司 Filtering method, device, equipment and medium for echo cancellation
CN109448748A (en) * 2018-12-17 2019-03-08 杭州嘉楠耘智信息科技有限公司 Filtering method, device, equipment and medium for echo cancellation
CN109493878B (en) * 2018-12-17 2021-08-31 嘉楠明芯(北京)科技有限公司 Filtering method, device, equipment and medium for echo cancellation
CN109448748B (en) * 2018-12-17 2021-08-03 嘉楠明芯(北京)科技有限公司 Filtering method, device, equipment and medium for echo cancellation
CN109379501A (en) * 2018-12-17 2019-02-22 杭州嘉楠耘智信息科技有限公司 Filtering method, device, equipment and medium for echo cancellation
CN109493878A (en) * 2018-12-17 2019-03-19 杭州嘉楠耘智信息科技有限公司 Filtering method, device, equipment and medium for echo cancellation
CN109547655A (en) * 2018-12-30 2019-03-29 广东大仓机器人科技有限公司 Echo cancellation processing method for network voice call
CN111294473B (en) * 2019-01-28 2022-01-04 展讯通信(上海)有限公司 Signal processing method and device
CN111294473A (en) * 2019-01-28 2020-06-16 展讯通信(上海)有限公司 Signal processing method and device
CN112133324A (en) * 2019-06-06 2020-12-25 北京京东尚科信息技术有限公司 Call state detection method, device, computer system and medium
CN110246516A (en) * 2019-07-25 2019-09-17 福建师范大学福清分校 The processing method of small space echo signal in a kind of voice communication
CN112614500A (en) * 2019-09-18 2021-04-06 北京声智科技有限公司 Echo cancellation method, device, equipment and computer storage medium
CN110944089A (en) * 2019-11-04 2020-03-31 中移(杭州)信息技术有限公司 Double-talk detection method and electronic equipment
CN111049848A (en) * 2019-12-23 2020-04-21 腾讯科技(深圳)有限公司 Call method, device, system, server and storage medium
US11842751B2 (en) 2019-12-23 2023-12-12 Tencent Technology (Shenzhen) Company Limited Call method, apparatus, and system, server, and storage medium
CN111049848B (en) * 2019-12-23 2021-11-23 腾讯科技(深圳)有限公司 Call method, device, system, server and storage medium
CN111048118B (en) * 2019-12-24 2022-07-26 大众问问(北京)信息科技有限公司 Voice signal processing method and device and terminal
CN111048118A (en) * 2019-12-24 2020-04-21 大众问问(北京)信息科技有限公司 Voice signal processing method and device and terminal
CN111161748A (en) * 2020-02-20 2020-05-15 百度在线网络技术(北京)有限公司 Double-talk state detection method and device and electronic equipment
US11804235B2 (en) 2020-02-20 2023-10-31 Baidu Online Network Technology (Beijing) Co., Ltd. Double-talk state detection method and device, and electronic device
CN114242106A (en) * 2020-09-09 2022-03-25 中车株洲电力机车研究所有限公司 A voice processing method and device thereof
CN112637833B (en) * 2020-12-21 2022-10-11 新疆品宣生物科技有限责任公司 Communication terminal information detection method and equipment
CN112637833A (en) * 2020-12-21 2021-04-09 新疆品宣生物科技有限责任公司 Communication terminal information detection method and device
CN113223546A (en) * 2020-12-28 2021-08-06 南京愔宜智能科技有限公司 Audio and video conference system and echo cancellation device for same
CN113241085A (en) * 2021-04-29 2021-08-10 北京梧桐车联科技有限责任公司 Echo cancellation method, device, equipment and readable storage medium
CN113241085B (en) * 2021-04-29 2022-07-22 北京梧桐车联科技有限责任公司 Echo cancellation method, device, equipment and readable storage medium
CN115273909A (en) * 2022-07-28 2022-11-01 歌尔科技有限公司 Voice activity detection method, device, equipment and computer readable storage medium
CN115273909B (en) * 2022-07-28 2024-07-30 歌尔科技有限公司 Voice activity detection method, device, equipment and computer readable storage medium
CN117437929A (en) * 2023-12-21 2024-01-23 睿云联(厦门)网络通讯技术有限公司 Real-time echo cancellation method based on neural network
CN117437929B (en) * 2023-12-21 2024-03-08 睿云联(厦门)网络通讯技术有限公司 Real-time echo cancellation method based on neural network
CN118645113A (en) * 2024-08-14 2024-09-13 腾讯科技(深圳)有限公司 A method, device, equipment, medium and product for processing speech signals
CN118645113B (en) * 2024-08-14 2024-10-29 腾讯科技(深圳)有限公司 A method, device, equipment, medium and product for processing speech signals

Also Published As

Publication number Publication date
CN105957520B (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN105957520B (en) A Speech State Detection Method Applicable to Echo Cancellation System
US11017791B2 (en) Deep neural network-based method and apparatus for combining noise and echo removal
CN112735456A (en) Speech enhancement method based on DNN-CLSTM network
Carbajal et al. Multiple-input neural network-based residual echo suppression
US9633671B2 (en) Voice quality enhancement techniques, speech recognition techniques, and related systems
CN109841206A (en) A kind of echo cancel method based on deep learning
Zhao et al. Late reverberation suppression using recurrent neural networks with long short-term memory
CN105469785A (en) Voice activity detection method in communication-terminal double-microphone denoising system and apparatus thereof
CN112687276B (en) Audio signal processing method and device and storage medium
CN106033673B (en) A kind of near-end voice signals detection method and device
Hou et al. Domain adversarial training for speech enhancement
Zhang et al. A Robust and Cascaded Acoustic Echo Cancellation Based on Deep Learning.
CN114530160A (en) Model training method, echo cancellation method, system, device and storage medium
CN111883154A (en) Echo cancellation method and device, computer-readable storage medium, and electronic device
CN106161820B (en) An Inter-Channel Decorrelation Method for Stereo Acoustic Echo Cancellation
CN105957536A (en) Frequency domain echo eliminating method based on channel aggregation degree
Zhang et al. LCSM: A lightweight complex spectral mapping framework for stereophonic acoustic echo cancellation
KR100844176B1 (en) Statistical Model-based Residual Echo Cancellation
CN110148421B (en) Residual echo detection method, terminal and device
CN112133324A (en) Call state detection method, device, computer system and medium
Peer et al. Reverberation matching for speaker recognition
Schmidt et al. Reduction of non-stationary noise using a non-negative latent variable decomposition
CN116453532A (en) Echo cancellation method of acoustic echo
Bavkar et al. PCA based single channel speech enhancement method for highly noisy environment
CN113345457B (en) Acoustic echo cancellation adaptive filter based on Bayes theory and filtering method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant