CN101778322B - Microphone array postfiltering sound enhancement method based on multi-models and hearing characteristic

Info

Publication number: CN101778322B
Application number: CN2009102503930A
Authority: CN
Inventors: 刘文举; 程宁; 李超
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2009-12-07
Filing date: 2009-12-07
Publication date: 2013-09-25
Anticipated expiration: 2029-12-07
Also published as: CN101778322A

Abstract

The invention discloses a microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics. It addresses two factors that strongly influence the performance of microphone array post-filtering: accurate estimation of the signal parameters, and a suitable compromise between improving noise reduction and reducing speech distortion. The scheme comprises the following steps: aligning the signals collected by the microphone array in the time domain, then applying a short-time Fourier transform and an eigenvalue analysis of the power spectrum; determining the dimension of the signal subspace by maximizing the presence probability of the target speech signal in the noisy speech signal; adaptively selecting a distribution model for the noise power spectrum in the noisy speech signal; estimating the noise power spectrum using conditional probability; estimating an auditory masking threshold based on the signal subspace; and estimating a post-filter by combining Lagrange multipliers with the auditory perception characteristics.

Description

Microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics

Technical Field

The invention relates to the signal subspace method for microphone arrays, the auditory masking effect, and the design of post-filters.

Background Art

Speech in real life is often corrupted by environmental noise, and multi-channel speech enhancement methods have received wide attention in recent years. The advantage of microphone array speech enhancement over single-channel speech enhancement is that it can exploit the correlation among the channels to estimate the signal characteristics more accurately, and so achieve better enhancement. Among these methods, microphone array post-filtering has been widely used in recent years because of its excellent noise reduction performance.

Simmer et al. (Reference 1: K. Uwe Simmer, et al., "Post-filtering techniques", in Microphone Arrays, M. Brandstein and D. Ward, Eds. New York: Springer, ch. 3, pp. 36-60, 2001.) proved that the optimal multi-channel speech enhancement solution in the minimum mean square error sense can be decomposed into a minimum variance distortionless response (MVDR) beamformer followed by a single-channel Wiener post-filter. Although the post-filtering method is optimal in theory, in practice it is difficult to estimate the power spectra of the speech and noise signals accurately enough to obtain the ideal post-filter, which limits its performance. A well-designed post-filter and accurate signal power spectrum estimation can therefore greatly improve speech enhancement performance. Zelinski (Reference 2: R. Zelinski, "A microphone array with adaptive post-filtering for noise reduction in reverberant rooms", in Proc. of ICASSP-88, 1988, Vol. 5, pp. 2578-2581.) proposed a post-filter design that assumes the noise signals at the array elements are uncorrelated. Because the noise at the array elements is in fact correlated to some degree in real environments, this method performs poorly. McCowan (Reference 3: Iain A. McCowan, Hervé Bourlard, "Microphone array post-filter based on noise field coherence", IEEE Transactions on Speech and Audio Processing, Vol. 11, pp. 709-715, Nov. 2003.) took the correlation between the noise signals into account and, using the properties of a diffuse noise field, proposed a post-filter design with better speech enhancement performance. Because that method relies on the diffuse noise field assumption, its performance drops markedly when the actual noise field is not diffuse.

The present invention uses the auditory masking effect of the human ear to propose a post-filter design based on auditory perception characteristics. To estimate the noise power spectrum more accurately, the invention decomposes the noisy signal space into a signal subspace and a noise subspace, estimates the dimensions of the two subspaces by maximizing the presence probability of the target speech signal, and estimates the noise power spectrum on the noise subspace using conditional probability. Experiments show that the proposed noise estimation method is more accurate than previous noise estimation methods, and that the proposed post-filter based on auditory perception characteristics is more effective than traditional post-filters.

Assume that the frequency-domain representation of the noisy speech signal vector received by an array of $L$ microphones is $X = [X_1, \ldots, X_L]^H$. The frequency-domain representation of the enhanced speech signal, obtained as a weighted sum of the array input signals, is:

$Y = w^H X = w^H [S d + N] \quad (1)$

where $w$ is the vector of array weighting coefficients, $S$ is the target signal, $d = [d_1, \ldots, d_L]^T$ is the propagation vector, $N = [N_1, \ldots, N_L]^H$ is the noise signal vector, and $[\cdot]^H$ is the conjugate transpose operator.

The power of the error signal $e = S - w^H X$ is:

$\phi_{ee} = E[(S - w^H X)(S^H - X^H w)] = \phi_{SS} - w^H \phi_{XS} - \phi_{XS}^H w + w^H \Phi_{XX} w \quad (2)$

where $\Phi_{XX}$ is the cross-power spectral matrix of the multi-channel noisy speech signal $X$, $\phi_{XS}$ is the cross-power spectrum between $X$ and the single-channel target signal $S$, and $\phi_{SS}$ is the power spectrum of the single-channel target speech signal $S$.

Setting the derivative of $\phi_{ee}$ with respect to the weights $w$ to zero gives the optimal weighting coefficients:

$w_{opt} = \Phi_{XX}^{-1} \phi_{XS} \quad (3)$

Under the assumption that the target speech signal is uncorrelated with the noise, Equation (3) becomes:

$w_{opt} = \Phi_{XX}^{-1} \phi_{SS} d = [\phi_{SS} d d^H + \Phi_{NN}]^{-1} \phi_{SS} d \quad (4)$

Applying the Sherman-Morrison-Woodbury identity, Equation (4) can be rewritten as:

$w_{opt} = \left[\frac{\phi_{SS}}{\phi_{SS} + (d^H \Phi_{NN}^{-1} d)^{-1}}\right] \frac{\Phi_{NN}^{-1} d}{d^H \Phi_{NN}^{-1} d} = \left[\frac{\phi_{SS}}{\phi_{SS} + \phi_{NN}}\right] \frac{\Phi_{NN}^{-1} d}{d^H \Phi_{NN}^{-1} d} \quad (5)$

where $\phi_{NN} = (d^H \Phi_{NN}^{-1} d)^{-1}$ is the auto-power spectrum of the single-channel noise and $\Phi_{NN}$ is the cross-power spectral matrix of the multi-channel noise. Equation (5) can be read as a minimum variance distortionless response (MVDR) beamformer $\Phi_{NN}^{-1} d / (d^H \Phi_{NN}^{-1} d)$ followed by a single-channel Wiener post-filter $\phi_{SS} / (\phi_{SS} + \phi_{NN})$.
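As a numerical sanity check of Equations (4) and (5), the following minimal NumPy sketch computes the optimal weights both directly and as an MVDR beamformer scaled by a Wiener gain. The function names, the random test matrix, and the parameter values are illustrative assumptions and do not come from the patent.

```python
import numpy as np

def mmse_weights_direct(phi_ss, d, Phi_nn):
    """Equation (4): w = [phi_SS d d^H + Phi_NN]^{-1} phi_SS d."""
    Phi_xx = phi_ss * np.outer(d, d.conj()) + Phi_nn
    return np.linalg.solve(Phi_xx, phi_ss * d)

def mmse_weights_factored(phi_ss, d, Phi_nn):
    """Equation (5): Wiener post-filter gain times MVDR beamformer weights."""
    pinv_d = np.linalg.solve(Phi_nn, d)        # Phi_NN^{-1} d
    denom = np.real(np.vdot(d, pinv_d))        # d^H Phi_NN^{-1} d (real for Hermitian Phi_NN)
    w_mvdr = pinv_d / denom                    # MVDR beamformer
    phi_nn = 1.0 / denom                       # single-channel noise power after beamforming
    wiener_gain = phi_ss / (phi_ss + phi_nn)   # Wiener post-filter
    return wiener_gain * w_mvdr

def random_noise_matrix(L, seed=0):
    """Hypothetical helper: a random Hermitian positive-definite noise matrix."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((L, L)) + 1j * rng.standard_normal((L, L))
    return A @ A.conj().T + L * np.eye(L)
```

For any propagation vector `d` and Hermitian positive-definite `Phi_nn`, the two functions should agree to numerical precision, which is what the factorization in Equation (5) states.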

Summary of the Invention

To solve the problems of the prior art, the object of the present invention is to design the single-channel post-filter, using adaptive selection among multiple distribution models together with auditory characteristics to obtain a new post-filter. Two aspects must be considered in the design of a single-channel post-filter: good noise reduction performance and low distortion of the target speech signal. In general, a post-filter that reduces noise may also increase the distortion of the target speech signal, so a reasonable compromise between the two is a problem that any post-filter design must address.

To achieve this object, the present invention provides a microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics. The specific steps of the method are as follows:

Step a: Collect noisy multi-channel speech signals with a microphone array composed of L microphones, align the noisy speech signals of the individual channels in the time domain, use the short-time discrete Fourier transform to represent each aligned signal as a complex-valued frequency-domain signal, compute the power spectrum matrix of the multi-channel microphone array signal, and perform an eigenvalue decomposition of this matrix to obtain the eigenvalue matrix and the eigenvector matrix;

Step b: Determine the dimension Q of the signal subspace, with Q≤L, by maximizing the presence probability of the target speech signal in the noisy speech signal;

Step c: Adaptively select the distribution model of the noise power spectrum in the noisy speech signal based on the stationarity of the spectrum;

Step d: Estimate the noise power spectrum using conditional probability;

Step e: From the signal subspace dimension and the noise power spectrum estimate, use the auditory masking effect to estimate the auditory masking threshold of each frequency bin based on the signal subspace;

Step f: From the noise power spectrum and the auditory masking threshold, estimate the post-filter using Lagrange multipliers so that the residual noise in the enhanced speech is below the auditory masking threshold of the human ear, which removes the influence of the residual noise, while keeping the distortion of the target speech signal as small as possible. This completes the microphone array post-filtering speech enhancement.

The eigenvalue decomposition of the power spectrum matrix comprises the following.

Eigenvalue decomposition divides the noisy speech signal space into two subspaces: the signal subspace, which contains the target speech signal and noise, and the noise subspace, which contains only noise. The power spectrum matrix $\Phi_{XX}(k, t)$ of the noisy speech signal $X$ at time frame $t$ and frequency $k$ is decomposed as:

$\Phi_{XX}(k, t) = U \Lambda_{XX} U^H = U (\Lambda_{SS} + \phi_{NN}(k, t) I) U^H$

where $X = S + N$, $X$ is the noisy speech signal, $S$ is the target speech signal, and $N$ is the noise; $\Lambda_{XX}$ is the eigenvalue matrix of the noisy speech power spectrum with eigenvalues in descending order, $\Lambda_{SS}$ is the eigenvalue matrix of the target speech power spectrum with eigenvalues in descending order, $U$ is the eigenvector matrix, $\phi_{NN}(k, t)$ is the noise power at time frame $t$ and frequency $k$, $I$ is the identity matrix of order $L$, and $[\cdot]^H$ is the conjugate transpose operator.
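Step a and the decomposition above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the channels are taken to be already time-aligned, the power spectrum matrix is estimated by recursive averaging with a hypothetical smoothing factor `alpha`, and the frame length, hop size, and window are arbitrary choices. None of these details are specified in the patent text.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Short-time DFT of a 1-D signal; returns an array of shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def power_spectrum_eig(signals, frame_len=256, hop=128, alpha=0.8):
    """signals: array of shape (L, samples) holding time-aligned channels.

    Returns the eigenvalues, shape (frames, bins, L), in descending order,
    and the matching eigenvectors, shape (frames, bins, L, L).
    """
    X = np.stack([stft(ch, frame_len, hop) for ch in signals], axis=-1)  # (T, K, L)
    T, K, L = X.shape
    Phi = np.zeros((K, L, L), dtype=complex)
    eigvals = np.empty((T, K, L))
    eigvecs = np.empty((T, K, L, L), dtype=complex)
    for t in range(T):
        # Instantaneous outer product x x^H for every frequency bin.
        inst = X[t][:, :, None] * X[t].conj()[:, None, :]
        # Recursive averaging (an assumed estimator of Phi_XX(k, t)).
        Phi = inst if t == 0 else alpha * Phi + (1 - alpha) * inst
        lam, U = np.linalg.eigh(Phi)       # eigh returns ascending eigenvalues
        eigvals[t] = lam[:, ::-1]          # reorder to descending
        eigvecs[t] = U[:, :, ::-1]
    return eigvals, eigvecs
```

For each frame and bin, the last L-Q eigenvalues in the returned array correspond to the noise subspace once Q has been chosen.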

The signal subspace dimension is determined by choosing the value of Q that maximizes the probability that the target speech signal is present in the noisy speech. Using conditional probability, the steps comprise:

Define the mutually exclusive events $H_0$ and $H_1$:

Event $H_0$: the noisy speech signal contains only noise and no target speech signal;

Event $H_1$: the noisy speech signal contains both the target speech signal and noise.

The signal subspace dimension Q is defined as:

$\underset{Q}{\arg\max}\; P(S(k, t) \mid H_1)$

where $S(k, t)$ is the power spectrum of the target speech signal at the $k$-th frequency bin of frame $t$, $P(\cdot)$ is the distribution function of the target speech spectrum, and $\arg\max(\cdot)$ is the operator that returns the parameter value with the highest score.

The adaptive selection, based on the stationarity of the spectrum, of the noise power spectrum distribution model in the noisy speech signal comprises the following steps:

Step c1: Define a discriminant function Ω that describes the stationarity of the power spectrum:

$\Omega = \frac{\sqrt[L-Q]{\prod_{i=Q+1}^{L} \lambda_{X_i}}}{\frac{1}{L-Q} \sum_{i=Q+1}^{L} \lambda_{X_i}}$

That is, Ω is the ratio of the geometric mean $\sqrt[L-Q]{\prod_{i=Q+1}^{L} \lambda_{X_i}}$ to the arithmetic mean $\frac{1}{L-Q} \sum_{i=Q+1}^{L} \lambda_{X_i}$, where $\lambda_{X_i}$ is the $i$-th eigenvalue of the eigenvalue matrix $\Lambda_{XX}$ of the noisy speech power spectrum, $i \in \{Q+1, \ldots, L\}$ is the eigenvalue index, and the value of Ω lies between 0 and 1;

Step c2: Compare the value of the discriminant function with preset thresholds to determine the noise power spectrum distribution model that applies to the noisy speech signal.

The comparison of the discriminant function value with the preset thresholds comprises:

Step c21: Set two preset thresholds $\Omega_1$ and $\Omega_2$, with $\Omega_1 < \Omega_2$;

Step c22: Compare the discriminant function with the preset thresholds. If the discriminant function is smaller than $\Omega_1$, select the zero-mean Gaussian distribution; if it is greater than $\Omega_2$, select the Gamma distribution; otherwise, select the Laplace distribution.
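Steps c1 to c22 can be sketched as follows. The threshold defaults `omega1=0.3` and `omega2=0.7` are hypothetical placeholders, since the patent text does not give numerical values, and the small floor on the eigenvalues is an added numerical safeguard.

```python
import numpy as np

def spectral_flatness(eigvals_desc, Q):
    """Omega: geometric mean over arithmetic mean of eigenvalues Q+1..L."""
    lam = np.asarray(eigvals_desc, dtype=float)
    if not 0 <= Q < lam.size:
        raise ValueError("Q must satisfy 0 <= Q < L so the noise subspace is non-empty")
    noise = np.maximum(lam[Q:], 1e-12)         # floor avoids log(0)
    geometric = np.exp(np.mean(np.log(noise)))
    arithmetic = np.mean(noise)
    return geometric / arithmetic

def select_noise_model(omega, omega1=0.3, omega2=0.7):
    """Map Omega to a distribution model; the thresholds are assumed values."""
    if omega < omega1:
        return "gaussian"   # zero-mean Gaussian
    if omega > omega2:
        return "gamma"
    return "laplace"
```

Computing the geometric mean through logarithms avoids overflow or underflow in the product when L-Q is large.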

The estimation of the noise power spectrum using conditional probability comprises the following.

For each frame of the noisy speech signal, the probability that it contains only noise is $P(H_0 \mid X)$, and the probability that it contains both noise and the target speech signal is $P(H_1 \mid X)$. For these two cases, the noise power spectrum is estimated as:

$\begin{cases} H_0: \phi_{NN}^{0} = \frac{1}{L} \sum_{i=1}^{L} \lambda_{X_i} \\ H_1: \phi_{NN}^{1} = \frac{1}{L-Q} \sum_{i=Q+1}^{L} \lambda_{X_i} \end{cases}$

where $\phi_{NN}^{0}$ and $\phi_{NN}^{1}$ are the noise power spectra when the mutually exclusive events $H_0$ and $H_1$ occur, respectively, and $i \in \{1, \ldots, L\}$ is the eigenvalue index.

According to the conditional probability formula, the noise power spectrum is estimated as:

$\tilde{\phi}_{NN} = P(H_0 \mid X)\, \phi_{NN}^{0} + P(H_1 \mid X)\, \phi_{NN}^{1}.$
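The conditional-probability noise estimate can be sketched as below. How the posterior $P(H_1 \mid X)$ is obtained is not shown in this section, so the sketch takes it as an input `p_h1`, and it assumes that $H_0$ and $H_1$ are exhaustive so that $P(H_0 \mid X) = 1 - P(H_1 \mid X)$.

```python
import numpy as np

def noise_power_estimate(eigvals_desc, Q, p_h1):
    """Blend the noise-only and speech-present noise power estimates.

    eigvals_desc: the L eigenvalues of Phi_XX(k, t), in descending order.
    Q:            signal subspace dimension, 0 <= Q < L.
    p_h1:         P(H1 | X), supplied by the caller (assumed known here).
    """
    lam = np.asarray(eigvals_desc, dtype=float)
    if not 0 <= Q < lam.size:
        raise ValueError("Q must satisfy 0 <= Q < L")
    phi_h0 = lam.mean()         # H0: average of all L eigenvalues
    phi_h1 = lam[Q:].mean()     # H1: average of the L-Q noise-subspace eigenvalues
    return (1.0 - p_h1) * phi_h0 + p_h1 * phi_h1
```

With eigenvalues `[10, 2, 2, 2]` and `Q = 1`, the two conditional estimates are 4 and 2, and the blended estimate moves between them as `p_h1` goes from 0 to 1.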

The estimation of the auditory masking threshold comprises the following steps:

Step f1: Divide the auditory frequency range of 0-15500 Hz into a number of critical sub-bands;

Step f2: Compute the auditory masking threshold in each sub-band separately.

To compute the auditory masking threshold in each sub-band, first compute the energy of each frequency bin in each sub-band and the transmission coefficient of the basilar membrane of the human ear for sound in each frequency band. Multiplying the two gives the excitation energy on the basilar membrane. The masking threshold is then computed from the functional relationship between the excitation energy on the basilar membrane and the auditory masking threshold.

The estimation of the post-filter G using Lagrange multipliers comprises the following steps:

Step fa: Formulate an optimization problem that minimizes the distortion of the target speech signal under the constraint that the residual noise power is below the masking threshold;

Step fb: Solve the problem with Lagrange multipliers to obtain the optimal estimate of the post-filter;

Step fc: Substitute the auditory masking threshold and the noise power spectrum estimate to complete the design of the post-filter.

Beneficial effects of the invention: the invention uses the auditory masking effect of the human ear to propose a reasonable compromise and designs a new post-filter based on auditory perception characteristics. The traditional noise estimation method is based on voice activity detection (VAD): it detects the noise-only frames in the noisy speech and uses the average power spectrum over those frames to estimate the noise power spectrum in the frames where speech and noise are mixed. Because the noise varies, the noise actually differs from frame to frame, so using the average noise power spectrum of the noise-only frames as the estimate for all frames leads to large estimation errors. To address this, the invention proposes a noise power spectrum estimation method based on subspace decomposition of the noisy signal. It estimates the noise power spectrum in every frame, which greatly reduces the noise estimation error. The invention then designs the post-filter using the auditory masking effect of the human ear, so that the residual noise in the enhanced speech is masked by the target speech. This reduces the distortion of the target speech while reducing the noise.

Brief Description of the Drawings

Further features and advantages of the invention are described below with reference to the illustrative drawings.

Fig. 1 is an example flowchart of an application of the microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics;

Fig. 2 is a flowchart of the method for determining the signal subspace dimension;

Fig. 3 is a flowchart for determining the noise power spectrum distribution model in the noisy speech signal;

Fig. 4 is a flowchart for estimating the noise power spectrum using conditional probability;

Fig. 5 is a flowchart for computing the auditory masking threshold;

Fig. 6 is a flowchart for designing the post-filter.

Detailed Description

It should be understood that the following detailed description of the examples and drawings is not intended to limit the invention to the particular illustrative embodiments. The described embodiments merely illustrate the individual steps of the invention, whose scope is defined by the appended claims.

The invention uses the auditory masking effect of the human ear to propose a reasonable compromise and designs a new post-filter based on auditory perception characteristics. The auditory masking effect works as follows: the target speech signal is normally a strong signal while the background noise is relatively weak, so the auditory system determines an auditory masking threshold in the frequency domain according to the specific target speech signal. If the residual noise after filtering is kept below this threshold, the noise is not perceived by the human ear, and the noisy speech signal is thereby enhanced. The specific steps are as follows:

A new microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics comprises the following steps:

Step b: Determine the dimension Q of the signal subspace by maximizing the presence probability of the target speech signal in the noisy speech signal;

The commonly used noise estimation method is based on voice activity detection (VAD). It detects the noise-only frames in the noisy speech and uses the average power spectrum over those frames to estimate the noise power spectrum in the frames where speech and noise are mixed. Because the noise varies, the noise actually differs from frame to frame, so using the average noise power spectrum of the noise-only frames as the estimate for all frames leads to large estimation errors.

To address this, steps b) and d) of the invention use a method based on subspace decomposition of the noisy signal to estimate the dimension of the noise subspace and the noise power spectrum. The noise power spectrum is estimated in every frame, which greatly reduces the noise estimation error.

Under the assumption that the target speech signal is uncorrelated with the noise, the power spectrum matrix $\Phi_{XX}(k, t)$ of the noisy speech signal at time frame $t$ and frequency $k$ can be written as the sum of the power spectrum matrix of the target speech signal $\Phi_{SS}(k, t)$ and the power spectrum matrix of the noise signal $\Phi_{NN}(k, t)$:

$\Phi_{XX}(k, t) = \Phi_{SS}(k, t) + \Phi_{NN}(k, t) \quad (6)$

For microphone array signals, it can be assumed that the auto-power spectra of the noise at all array elements are equal and that the noise signals at different array elements are uncorrelated. The following then holds:

$\Phi_{NN}(k, t) = \phi_{NN}(k, t)\, I \quad (7)$

where $I$ is the identity matrix of order $L$ and $\phi_{NN}(k, t)$ is the auto-power spectrum of the single-channel noise.

Let the eigenvalue decomposition of the power spectrum matrix of the target speech signal be:

$\Phi_{SS}(k, t) = U \Lambda_{SS} U^H \quad (8)$

where $\Lambda_{SS}$ is the eigenvalue matrix with eigenvalues in descending order, $U$ is the corresponding eigenvector matrix, and $Q$ is the rank of the matrix, with $Q \leq L$.

Eigenvalue decomposition divides the noisy signal space into two subspaces: the signal subspace (containing the target speech signal and noise) and the noise subspace (containing only noise). Let the eigenvalue decomposition of the power spectrum matrix of the noisy signal be:

$\Phi_{XX}(k, t) = U \Lambda_{XX} U^H = U (\Lambda_{SS} + \phi_{NN}(k, t) I) U^H \quad (9)$

where $\Lambda_{XX}$ is the eigenvalue matrix of the noisy speech power spectrum with eigenvalues in descending order, and $I$ is the identity matrix of order $L$.
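The relationship between Equations (6) to (9) can be checked numerically. The short sketch below builds a rank-1 target matrix $\phi_{SS} d d^H$ (so $Q = 1$), adds spatially white noise, and compares eigenvalues. The rank-1 model, the function name, and the parameter values are assumptions chosen only for illustration.

```python
import numpy as np

def eigen_shift_demo(L=4, phi_ss=2.0, phi_nn=0.5, seed=0):
    """Return descending eigenvalues of Phi_SS and of Phi_XX = Phi_SS + phi_NN * I."""
    rng = np.random.default_rng(seed)
    d = rng.standard_normal(L) + 1j * rng.standard_normal(L)
    Phi_ss = phi_ss * np.outer(d, d.conj())      # rank-1 target matrix, Q = 1
    Phi_xx = Phi_ss + phi_nn * np.eye(L)         # Equations (6) and (7)
    lam_s = np.linalg.eigvalsh(Phi_ss)[::-1]     # eigvalsh returns ascending order
    lam_x = np.linalg.eigvalsh(Phi_xx)[::-1]
    return lam_s, lam_x

lam_s, lam_x = eigen_shift_demo()
# Every eigenvalue is shifted up by phi_NN, as Equation (9) states.
print(np.allclose(lam_x, lam_s + 0.5))
# The L - Q smallest eigenvalues of Phi_XX equal the noise power phi_NN.
print(np.allclose(lam_x[1:], 0.5))
```

This is why the trailing eigenvalues of the noisy power spectrum matrix can serve as a per-frame estimate of the noise power.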

The invention proposes a method for estimating the noise auto-power spectrum $\phi_{NN}$ from the noise subspace. The dimension $Q$ of the signal subspace and the dimension $P$ of the noise subspace must be determined first.

Step b) provides a method for determining $Q$ by maximizing the presence probability of the target speech signal in the noisy speech signal, that is, by choosing the value of $Q$ that maximizes the probability that the target speech signal is present.

Using conditional probability, define the mutually exclusive events $H_0$ (the noisy speech signal contains only noise) and $H_1$ (the noisy speech signal contains both the target speech signal and noise). The signal subspace dimension $Q$ is then:

$\underset{Q}{\arg\max}\; P(S(k, t) \mid H_1) \quad (10)$

Step c) provides an adaptive method for selecting the noise power spectrum distribution model in the noisy speech signal based on the stationarity of the spectrum. The method comprises the following steps.

First, define the discriminant function Ω:

$\Omega = \frac{\sqrt[L-Q]{\prod_{i=Q+1}^{L} \lambda_{X_i}}}{\frac{1}{L-Q} \sum_{i=Q+1}^{L} \lambda_{X_i}} \quad (11)$

That is, Ω is the ratio of the geometric mean $\sqrt[L-Q]{\prod_{i=Q+1}^{L} \lambda_{X_i}}$ to the arithmetic mean $\frac{1}{L-Q} \sum_{i=Q+1}^{L} \lambda_{X_i}$, where $\lambda_{X_i}$ is the $i$-th eigenvalue of the eigenvalue matrix $\Lambda_{XX}$ of the noisy speech power spectrum, $i \in \{Q+1, \ldots, L\}$ is the eigenvalue index, and the value of Ω lies between 0 and 1.

Then, set two preset thresholds $\Omega_1$ and $\Omega_2$ ($\Omega_1 < \Omega_2$) and compare the discriminant function with them. If the discriminant function is smaller than $\Omega_1$, select the zero-mean Gaussian distribution; if it is greater than $\Omega_2$, select the Gamma distribution; otherwise, select the Laplace distribution.

Step d) provides a method for estimating the noise power spectrum using conditional probability. For each frame of the noisy speech signal, the probability that it contains only noise is $P(H_0 \mid X)$, and the probability that it contains both noise and the target speech signal is $P(H_1 \mid X)$. For these two cases, the noise power spectrum is estimated as:

$\begin{cases} H_0: \phi_{NN}^{0} = \frac{1}{L} \sum_{i=1}^{L} \lambda_{X_i} \\ H_1: \phi_{NN}^{1} = \frac{1}{L-Q} \sum_{i=Q+1}^{L} \lambda_{X_i} \end{cases} \quad (12)$

where $i \in \{1, \ldots, L\}$ is the eigenvalue index, and $\phi_{NN}^{0}$ and $\phi_{NN}^{1}$ are the noise power spectra when the mutually exclusive events $H_0$ and $H_1$ occur, respectively.

According to the conditional probability formula, the noise power spectrum is estimated as:

$\tilde{\phi}_{NN} = P(H_0 \mid X)\, \phi_{NN}^{0} + P(H_1 \mid X)\, \phi_{NN}^{1} \quad (13)$

Step e) provides a method that, from the signal subspace dimension and the noise power spectrum estimate, uses the auditory masking effect to estimate the auditory masking threshold of each frequency bin based on the signal subspace.

The auditory frequency range is 0 to 15500 Hz and covers 24 critical sub-bands; the auditory masking threshold must be computed in each sub-band. First compute the energy of each frequency bin in each sub-band, then compute the transmission coefficient of the basilar membrane of the human ear for sound in each frequency band, and multiply the two to obtain the excitation energy on the basilar membrane. Finally, compute the masking threshold from the functional relationship between the excitation energy on the basilar membrane and the auditory masking threshold.
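The band-wise computation described above can be outlined as follows. This is only a structural sketch under strong assumptions. The band edges are the commonly tabulated Zwicker critical-band edges, which give 24 bands over 0 to 15500 Hz. The per-band transmission coefficients default to 1, and the mapping from excitation energy to threshold is replaced by a fixed offset `offset_db`. Both are hypothetical placeholders, because the patent text in this section does not give the actual coefficients or functional relationship.

```python
import numpy as np

# Critical-band edges in Hz (Zwicker's table): 25 edges delimit 24 bands.
BARK_EDGES_HZ = np.array([0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                          1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                          6400, 7700, 9500, 12000, 15500], dtype=float)

def masking_threshold(power_spectrum, sample_rate, transmission=None, offset_db=10.0):
    """Per-bin masking threshold from a one-sided power spectrum (rfft layout)."""
    p = np.asarray(power_spectrum, dtype=float)
    freqs = np.linspace(0.0, sample_rate / 2.0, p.size)
    # Index of the critical band containing each bin; bins above 15500 Hz
    # are folded into the last band.
    band = np.clip(np.searchsorted(BARK_EDGES_HZ, freqs, side="right") - 1, 0, 23)
    energy = np.bincount(band, weights=p, minlength=24)   # energy per sub-band
    if transmission is None:
        transmission = np.ones(24)                        # placeholder coefficients
    excitation = energy * transmission                    # excitation on the basilar membrane
    threshold_per_band = excitation * 10.0 ** (-offset_db / 10.0)  # placeholder mapping
    return threshold_per_band[band]                       # spread back to the bins
```

In a real system the placeholder offset would be replaced by the psychoacoustic model's own excitation-to-threshold function, and the input would be the speech power spectrum estimated from the signal subspace.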

步骤f)提供了一种根据噪声功率谱、听觉掩蔽阈值，结合拉格朗日乘子估计后滤波器G(e^jω)的方法。使得增强语音中的残余噪声小于人耳的听觉掩蔽阈值，从而消除残余噪声影响，并使目标语音信号的畸变尽可能的小。完成麦克风阵列后滤波语音增强。Step f) provides a method for estimating the post-filter G(e ^jω ) according to the noise power spectrum, auditory masking threshold, and Lagrangian multipliers. The residual noise in the enhanced speech is made smaller than the auditory masking threshold of the human ear, thereby eliminating the influence of the residual noise and making the distortion of the target speech signal as small as possible. Filtered speech enhancement after completing the microphone array.

Assume that the output signal of the minimum variance distortionless response (MVDR) beamformer is $\tilde{S}(e^{jω})$ and that the target speech signal is $S(e^{jω})$. The error between the post-filtered, enhanced speech signal and the target speech signal can then be expressed as follows:

$E(e^{jω}) = G(e^{jω}) \tilde{S}(e^{jω}) - S(e^{jω}) = [G(e^{jω}) - 1] S(e^{jω}) + G(e^{jω}) \tilde{N}(e^{jω}) \qquad (14)$

where $\tilde{N}(e^{jω})$ is the noise in $\tilde{S}(e^{jω})$.

The first term in Eq. (14) describes the distortion of the target speech signal in the enhanced speech, and the second term describes the level of residual noise in the enhanced speech. A suitable post-filter $G(e^{jω})$ can be calculated so that the residual noise in the enhanced speech stays below the auditory masking threshold of the human ear, which removes its audible effect. Based on Eq. (14), the present invention proposes the following constrained objective:

$\min E_{T} = [G(e^{jω}) - 1]^{2} S(e^{jω})^{2} + G(e^{jω})^{2} \tilde{N}(e^{jω})^{2} \qquad (15)$

Constraint:

$G(e^{jω})^{2} \tilde{N}(e^{jω})^{2} \leq C_{thr} \qquad (16)$

where $C_{thr}$ is the auditory masking threshold.

Solving with the method of Lagrange multipliers, let:

$J = E_{T} + μ (G(e^{jω})^{2} \tilde{N}(e^{jω})^{2} - C_{thr}) \qquad (17)$

where μ is the Lagrange multiplier.

Setting the derivative of J with respect to $G(e^{jω})$ to zero gives:

$G(e^{jω}) = \frac{S(e^{jω})^{2}}{S(e^{jω})^{2} + (1 + μ) \tilde{N}(e^{jω})^{2}} \qquad (18)$

Eq. (18) shows that, under the constrained objective of the present invention, the post-filter based on auditory perception characteristics has the form of a Wiener filter with a more reasonable noise estimate.

Setting the derivative of J with respect to μ to zero gives:

$G(e^{jω}) = \sqrt{\frac{C_{thr}}{\tilde{N}(e^{jω})^{2}}} \qquad (19)$

Equating Eqs. (18) and (19) gives:

$1 + μ = \frac{S(e^{jω})^{2}}{\tilde{N}(e^{jω})^{2}} \max(\sqrt{\frac{\tilde{N}(e^{jω})^{2}}{C_{thr}}} - 1, 0) \qquad (20)$
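The algebra connecting Eqs. (17) to (20) is short and can be written out as a sketch. It treats $G$, $S^{2}$ and $\tilde{N}^{2}$ as real quantities at a single frequency; dropping the argument $e^{jω}$ is a notational shorthand introduced here for readability only.

$\frac{\partial J}{\partial G} = 2 (G - 1) S^{2} + 2 (1 + μ) G \tilde{N}^{2} = 0 \;\Rightarrow\; G = \frac{S^{2}}{S^{2} + (1 + μ) \tilde{N}^{2}}$

$\frac{\partial J}{\partial μ} = G^{2} \tilde{N}^{2} - C_{thr} = 0 \;\Rightarrow\; G = \sqrt{\frac{C_{thr}}{\tilde{N}^{2}}}$

$S^{2} + (1 + μ) \tilde{N}^{2} = S^{2} \sqrt{\frac{\tilde{N}^{2}}{C_{thr}}} \;\Rightarrow\; 1 + μ = \frac{S^{2}}{\tilde{N}^{2}} (\sqrt{\frac{\tilde{N}^{2}}{C_{thr}}} - 1)$

The $\max(\cdot, 0)$ in Eq. (20) keeps the right-hand side non-negative. This appears to cover the case where the noise already lies below the masking threshold, in which Eq. (18) reduces to $G = 1$.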

Substituting Eq. (20) into Eq. (18), and using $\tilde{φ}_{NN}$ from Eq. (13) in place of $\tilde{N}(e^{jω})^{2}$, yields the post-filter based on auditory perception characteristics proposed here:

$G(e^{jω}) = \frac{1}{1 + \max(\sqrt{\frac{\tilde{φ}_{NN}}{C_{thr}}} - 1, 0)} \qquad (21)$
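Eq. (21) is a per-frequency scalar gain, so it can be illustrated in a few lines. The sketch below is a minimal Python rendering under the assumption that the noise estimate and the masking threshold are positive real numbers for one frequency bin; the function name `postfilter_gain` is hypothetical and does not come from the patent.

```python
import math


def postfilter_gain(phi_nn: float, c_thr: float) -> float:
    """Post-filter gain of Eq. (21) for a single frequency bin.

    phi_nn: estimated noise power (assumed > 0).
    c_thr:  auditory masking threshold (assumed > 0).
    """
    excess = math.sqrt(phi_nn / c_thr) - 1.0
    # When the noise is already below the threshold, excess < 0 and the gain is 1.
    return 1.0 / (1.0 + max(excess, 0.0))


if __name__ == "__main__":
    g = postfilter_gain(4.0, 1.0)
    # Residual noise power after filtering is g**2 * phi_nn, which should not exceed c_thr.
    print(g, g * g * 4.0)
```

When the noise exceeds the threshold, the gain equals $\sqrt{C_{thr} / \tilde{φ}_{NN}}$, so the residual noise power sits exactly at the threshold, as constraint (16) requires.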

FIG. 1 shows a flow chart of the microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics. The system comprises a microphone array of at least two microphones 101.

The microphones of the array may be arranged in different ways. In particular, the microphones 101 may be placed in a row, with each microphone at a predetermined distance from its neighbors; for example, the spacing between two microphones may be about 5 cm. The microphone array may be installed in whatever position suits the application environment and technical requirements.

The speech signals collected by the microphones 101 are sent to the signal processing unit 102. Before reaching the signal processing unit, the speech signals may be pre-processed by a low-pass filter.

The signal processing unit 102 applies delay compensation to the speech signals collected by the different microphones to align them in the time domain. A short-time discrete Fourier transform expresses each aligned microphone signal as a complex-valued frequency-domain signal. The power spectrum matrix $Φ_{XX}(k, t)$ of the multi-channel noisy speech signals at time frame t and frequency k is then calculated and eigen-decomposed, giving the eigenvalue matrix $Λ_{XX}$ and the eigenvector matrix U.
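The text does not say how the power spectrum matrix is formed from the aligned STFT coefficients. One common choice, used here purely as an assumption, is to average the outer product $x x^{H}$ over a few neighboring frames at a fixed frequency. The helper name `spectral_matrix` and the averaging window are hypothetical; the eigenvalue decomposition itself would normally be delegated to a linear algebra library and is not shown.

```python
def spectral_matrix(snapshots):
    """Average outer product x x^H over complex snapshot vectors.

    snapshots: list of length-L lists of complex STFT coefficients, one list
               per frame, all taken at the same frequency bin (assumed layout).
    Returns an L x L Hermitian matrix as nested lists.
    """
    size = len(snapshots[0])
    phi = [[0j] * size for _ in range(size)]
    for x in snapshots:
        for a in range(size):
            for b in range(size):
                # Entry (a, b) of x x^H is x_a times the conjugate of x_b.
                phi[a][b] += x[a] * x[b].conjugate()
    count = len(snapshots)
    return [[value / count for value in row] for row in phi]


if __name__ == "__main__":
    print(spectral_matrix([[1 + 0j, 1j], [1 + 0j, -1j]]))
```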

In the next step 103, the eigenvalue matrix $Λ_{XX}$ is used to determine the dimension Q of the signal subspace by maximizing the probability that the target speech signal is present in the noisy speech signal.

Next, step 104 uses the signal subspace dimension Q to adaptively select, based on the stationarity of the spectrum, a distribution model for the noise power spectrum in the noisy speech signal.

Step 105 uses the signal subspace dimension Q and the noise power spectrum distribution model to estimate the noise power spectrum from conditional probabilities.

Step 106 uses the signal subspace dimension and the noise power spectrum estimate to obtain, from the signal subspace and the auditory masking effect, the auditory masking threshold of each frequency bin.

Finally, step 107 uses the noise power spectrum estimate and the auditory masking threshold, together with a Lagrange multiplier, to design the post-filter.

FIG. 2 illustrates the flow of a method for determining the signal subspace dimension; this method corresponds to step 103 in FIG. 1.

After steps 101 and 102, the speech signals collected by the microphone array have been aligned in the time domain and short-time Fourier transformed, and the power spectrum matrix $Φ_{XX}$ of the multi-channel noisy speech signals has been eigen-decomposed into the eigenvalue matrix $Λ_{XX}$ and the eigenvector matrix U. By Eq. (9), the eigenvalue matrix of the noisy signal power spectrum is the sum of the signal power spectrum eigenvalues and the noise power spectrum eigenvalues, and Q is the dimension of the signal subspace.

In the first step 201, the signal subspace dimension Q is initialized to 1.

Next, step 202 updates the noise power spectrum and the target speech signal power spectrum. Because the eigenvalues in the noisy speech power spectrum eigenvalue matrix $Λ_{XX}$ are in descending order, and the signal is assumed to be stronger than the noise, the noise power when the signal subspace has dimension Q is

$φ_{NN} = \frac{1}{L - Q} \sum_{i = Q + 1}^{L} λ_{X_{i}} \qquad (22)$

where i ∈ {Q+1, …, L} is the eigenvalue index.

The power of the target speech signal is

$S = \frac{1}{Q} \sum_{i = 1}^{Q} (λ_{X_{i}} - φ_{NN})^{\frac{1}{2}} \qquad (23)$

where i ∈ {1, …, Q} is the eigenvalue index.

The variance of the target speech signal is then

$v_{s} = \begin{cases} λ_{X_{1}} - φ_{NN} & Q = 1 \\ \frac{1}{Q} \sum_{i = 1}^{Q} [(λ_{X_{i}} - φ_{NN})^{\frac{1}{2}} - S]^{2} & Q > 1 \end{cases} \qquad (24)$

where i ∈ {1, …, Q} is the eigenvalue index.

Step 203 selects any one of the Gaussian, Laplacian and Gamma models to describe the spectral distribution of the target speech signal and calculates the conditional probability $P_{G}(S(k, t) | H_{1})$ of the target speech signal. In particular, when the Gaussian model is selected,

$P_{G}(S(k, t) | H_{1}) = \frac{1}{\sqrt{2 π v_{s}(k, t)}} \exp \left\{ - \frac{S^{2}(k, t)}{2 v_{s}(k, t)} \right\}$

Step 204 increments the variables Q and j:

Q = Q + 1

Step 205 then tests the loop termination condition Q > L. If the condition is not satisfied, the method returns to step 202; otherwise it proceeds to step 206.

Step 206 uses Eq. (10) of the present invention to finally determine the signal subspace dimension Q, namely

$\underset{Q}{\arg \max} P(S(k, t) | H_{1}).$
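The loop of steps 201 to 206 can be sketched in Python for a single time-frequency bin, using the Gaussian model and Eqs. (22) to (24). Two details are assumptions made here and are not stated in the text: the loop stops at Q = L - 1, because Eq. (22) divides by L - Q, and the variance is floored at a small positive value to avoid division by zero. All function names are hypothetical.

```python
import math


def noise_power(eigs, q):
    """Eq. (22): mean of the L - Q smallest eigenvalues (eigs sorted descending)."""
    tail = eigs[q:]
    return sum(tail) / len(tail)


def speech_stats(eigs, q, phi_nn):
    """Eqs. (23) and (24): mean amplitude S and variance v_s over the first Q eigenvalues."""
    amps = [math.sqrt(max(lam - phi_nn, 0.0)) for lam in eigs[:q]]
    s = sum(amps) / q
    if q == 1:
        v_s = eigs[0] - phi_nn
    else:
        v_s = sum((a - s) ** 2 for a in amps) / q
    return s, v_s


def gaussian_likelihood(s, v_s, floor=1e-12):
    """Gaussian model P_G(S | H1); the variance floor is an added safeguard."""
    v = max(v_s, floor)
    return math.exp(-s * s / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)


def select_subspace_dimension(eigs):
    """Return the Q in 1..L-1 that maximizes the Gaussian likelihood."""
    best_q, best_p = 1, -1.0
    for q in range(1, len(eigs)):  # assumption: stop before Q = L
        phi_nn = noise_power(eigs, q)
        s, v_s = speech_stats(eigs, q, phi_nn)
        p = gaussian_likelihood(s, v_s)
        if p > best_p:
            best_q, best_p = q, p
    return best_q


if __name__ == "__main__":
    print(select_subspace_dimension([10.0, 9.0, 1.0, 1.0]))
```

The eigenvalues are expected in descending order, matching the ordering of $Λ_{XX}$ described above.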

FIG. 3 illustrates a flow chart for determining the noise power spectrum distribution model in the noisy speech signal. This method corresponds to step 104 in FIG. 1.

The Gaussian, Laplacian and Gamma models can all be used to describe the spectral coefficients of speech and noise signals, but noise characteristics differ between noise types, so the model should be chosen according to the characteristics of the target noise. This example gives a method of model selection based on spectral stationarity, derived from statistics of computer fan noise.

In step 301, the discriminant function value Ω is calculated from Eq. (11).

Step 302 tests whether Ω is less than Ω₁; if so, the Gaussian model is selected. Otherwise step 303 tests whether Ω is less than Ω₂; if so, the Laplacian model is selected, and otherwise the Gamma model is selected.

The adaptive model selection algorithm of the present invention is based on statistics from a large body of computer fan noise experiments. The experiments found that the Gaussian model is optimal when Ω is small, that the Laplacian model is optimal when Ω is large, and that the Gamma model has the smallest overall average noise estimation error. The present invention selects the model accordingly, using the threshold comparison of steps 302 and 303.
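The threshold comparison of steps 302 and 303 can be sketched as follows. The stationarity measure is taken as the ratio of the geometric mean to the arithmetic mean of the noise-subspace eigenvalues, as the claims define Ω. The text gives no values for Ω₁ and Ω₂, so the numbers used in the example call are placeholders, and the function names are hypothetical.

```python
import math


def stationarity(noise_eigs):
    """Ratio of geometric mean to arithmetic mean of positive eigenvalues (between 0 and 1)."""
    n = len(noise_eigs)
    # Compute the geometric mean in the log domain to avoid overflow in the product.
    geometric = math.exp(sum(math.log(x) for x in noise_eigs) / n)
    arithmetic = sum(noise_eigs) / n
    return geometric / arithmetic


def select_noise_model(omega, omega1, omega2):
    """Steps 302 and 303: pick a model name from two thresholds with omega1 < omega2."""
    if omega < omega1:
        return "gaussian"
    if omega < omega2:
        return "laplacian"
    return "gamma"


if __name__ == "__main__":
    omega = stationarity([1.0, 4.0])
    # 0.3 and 0.7 are placeholder thresholds, not values from the patent.
    print(omega, select_noise_model(omega, 0.3, 0.7))
```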

FIG. 4 illustrates a flow chart of a method for estimating the noise power spectrum using conditional probabilities. This method corresponds to step 105 in FIG. 1.

Step 401 calculates the average power spectrum $φ_{NN}^{pre}$ of the pure-noise frames at the start of the noisy speech signal.

Step 402 calculates the power spectrum of the current frame

$φ_{NN}^{cur} = \frac{1}{L} \sum_{i = 1}^{L} λ_{X_{i}}$

where i ∈ {1, …, L} is the eigenvalue index.

Next, step 403 calculates the ratio of the current-frame power spectrum to the pure-noise power spectrum

$r = \frac{φ_{NN}^{cur}}{φ_{NN}^{pre}}$

Steps 403 to 408 together calculate the conditional probability P(H₀|X). First, r is compared with a preset threshold α, which takes a small value slightly greater than 1; in particular, α is set to 1.2. When r < α, the current frame is more likely a pure-noise frame, so P(H₀|X) should take a large value, and the present invention sets its lower limit to 0.8. When r > α, the current frame is more likely a speech frame, and P(H₀|X) should take a suitable value. Because the signal energy is distributed unevenly across frequency, different values of P(H₀|X) are used at different frequencies. At low frequencies P(H₀|X) should be larger than at high frequencies, because most of the signal energy is concentrated in the low-frequency region. That is,

$P(H_{0} | X) = \begin{cases} \max(\frac{1}{1 + r β_{1}}, 0.8) & r \leq 1.2 \\ \begin{cases} \frac{1}{1 + r β_{2}} & \text{if } f \leq f_{thr} \\ \frac{1}{1 + r β_{3}} & \text{if } f > f_{thr} \end{cases} & r > 1.2 \end{cases} \qquad (26)$

where $f_{thr}$ is the boundary frequency between the low and high bands, and β₁, β₂ and β₃ are weighting coefficients.

Step 409 calculates the conditional probability P(H₁|X) = 1 - P(H₀|X).

Once the conditional probabilities P(H₀|X) and P(H₁|X) have been obtained, step 410 uses Eq. (13) to obtain the noise power spectrum estimate $\tilde{φ}_{NN}$.
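Steps 403 to 410 reduce to two small functions: the piecewise probability of Eq. (26) and the weighted combination of Eq. (13). The sketch below assumes scalar inputs for one frame and one frequency. The threshold 1.2 and the floor 0.8 come from the text; the weighting coefficients and the boundary frequency are not given there, so the values in the example call are placeholders, and the function names are hypothetical.

```python
def prob_noise_only(r, f_hz, beta1, beta2, beta3, f_thr_hz, alpha=1.2, floor=0.8):
    """Eq. (26): P(H0 | X) from the power ratio r and the frequency f_hz."""
    if r <= alpha:
        # Likely a pure-noise frame: the probability is kept at or above the floor.
        return max(1.0 / (1.0 + r * beta1), floor)
    # Likely a speech frame: the weighting depends on the frequency band.
    beta = beta2 if f_hz <= f_thr_hz else beta3
    return 1.0 / (1.0 + r * beta)


def noise_psd_estimate(p_h0, phi0, phi1):
    """Eq. (13): combine the two noise estimates, with P(H1 | X) = 1 - P(H0 | X)."""
    return p_h0 * phi0 + (1.0 - p_h0) * phi1


if __name__ == "__main__":
    # Placeholder coefficients; beta2 < beta3 gives a larger P(H0 | X) at low frequencies.
    p = prob_noise_only(2.0, 500.0, beta1=0.1, beta2=0.5, beta3=1.0, f_thr_hz=1000.0)
    print(p, noise_psd_estimate(p, phi0=2.0, phi1=1.0))
```

Here `phi0` and `phi1` stand for $φ_{NN}^{0}$ and $φ_{NN}^{1}$, the noise power under H₀ and H₁ respectively.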

FIG. 5 illustrates a flow chart of a method for calculating the auditory masking threshold. This method corresponds to step 106 in FIG. 1. To mask the noise in the signal, and thereby enhance the target speech signal, the noise must be kept below this threshold.

Step 501 divides the human hearing range of 0 to 15500 Hz into 24 sub-bands so that the auditory masking threshold can be calculated in each sub-band.

In step 502, the energy of each frequency bin is calculated using the signal subspace dimension obtained in step 206. H(j, b) denotes the energy at the b-th frequency bin in the j-th sub-band and can be calculated from the eigenvalues and eigenvectors of the signal subspace:

$H(j, b) = \mathrm{mean}(\frac{1}{L} \sum_{i = 1}^{Q} λ_{S_{i}} |U_{1, i}|^{2}) \qquad (27)$

where $λ_{S_{i}} = λ_{X_{i}} - \tilde{φ}_{NN}$ is the estimated eigenvalue of the target speech signal power spectrum matrix, $U_{1, i}$ is the i-th basis vector of the signal subspace, i ∈ {1, …, Q} is the eigenvalue index, and mean(·) is the averaging operator.

SF(j) is a function expressing the propagation characteristics of the basilar membrane of the human ear in the j-th sub-band, j ∈ {1, …, 24}.

In step 503, the propagation function of each sub-band is calculated:

$SF(j) = 15.81 + 7.5 (j + 0.474) - 17.5 \sqrt{1 + (j + 0.474)^{2}},$ j ∈ {1, …, 24} (28)

Next, step 504 calculates the excitation energy, which represents the energy on the basilar membrane of the human ear:

C(j, b) = SF(j) * H(j, b), j ∈ {1, …, 24} (29)

Step 505 calculates the auditory masking threshold:

$C_{thr} = 10^{\log_{10} |C(j, b)| - |\frac{O(j)}{10}| - |\frac{\tilde{φ}_{NN}}{10}|} \qquad (30)$

where O(j) is the offset and j ∈ {1, …, 24} denotes the j-th sub-band.
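Steps 503 to 505 can be sketched directly from Eqs. (28) to (30). The sketch takes the bin energy H(j, b) as an input rather than computing Eq. (27), and it treats the offset O(j) as a caller-supplied number because the text does not give its values. The function names are hypothetical, and the arithmetic follows the equations as printed.

```python
import math


def propagation(j):
    """Eq. (28): propagation function SF(j) for sub-band j in 1..24."""
    x = j + 0.474
    return 15.81 + 7.5 * x - 17.5 * math.sqrt(1.0 + x * x)


def excitation(j, h):
    """Eq. (29): excitation energy C(j, b) from the bin energy h = H(j, b)."""
    return propagation(j) * h


def masking_threshold(c, offset, phi_nn):
    """Eq. (30): masking threshold from C(j, b), the offset O(j) and the noise estimate.

    c must be non-zero because its absolute value is passed to log10.
    """
    exponent = math.log10(abs(c)) - abs(offset / 10.0) - abs(phi_nn / 10.0)
    return 10.0 ** exponent


if __name__ == "__main__":
    c = excitation(1, 2.0)
    # offset=5.0 and phi_nn=1.0 are placeholder inputs for illustration.
    print(c, masking_threshold(c, offset=5.0, phi_nn=1.0))
```

With Eq. (28) as printed, SF(j) is negative for every j in 1..24, so C(j, b) is negative for positive energies; Eq. (30) uses its absolute value.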

FIG. 6 illustrates a flow chart for designing the post-filter. This method corresponds to step 107 in FIG. 1.

The post-filter is designed to minimize the distortion of the target speech signal while keeping the power of the residual noise in the enhanced speech below the auditory masking threshold.

Step 601 states the constrained optimization problem as follows:

Objective:

$\min E_{T} = [G(e^{jω}) - 1]^{2} S(e^{jω})^{2} + G(e^{jω})^{2} \tilde{N}(e^{jω})^{2}$

Constraint:

$G(e^{jω})^{2} \tilde{N}(e^{jω})^{2} \leq C_{thr}$

Step 602 solves the problem with the method of Lagrange multipliers. Let:

$J = E_{T} + μ (G(e^{jω})^{2} \tilde{N}(e^{jω})^{2} - C_{thr})$

Setting the derivatives of J with respect to $G(e^{jω})$ and μ to zero gives:

$\begin{cases} G(e^{jω}) = \frac{S(e^{jω})^{2}}{S(e^{jω})^{2} + (1 + μ) \tilde{N}(e^{jω})^{2}} \\ G(e^{jω}) = \sqrt{\frac{C_{thr}}{\tilde{N}(e^{jω})^{2}}} \end{cases}$

Step 603 solves this system of equations to obtain the optimal estimate of the post-filter:

$G(e^{jω}) = \frac{1}{1 + \max(\sqrt{\frac{\tilde{φ}_{NN}}{C_{thr}}} - 1, 0)}$

Step 604 then substitutes the noise power spectrum estimate $\tilde{φ}_{NN}$ obtained in step 410 and the auditory masking threshold $C_{thr}$ obtained in step 505, completing the design of the post-filter.

Further modifications and variations of the present invention will be apparent to those skilled in the art from this specification. Accordingly, this description is to be regarded as illustrative, and its purpose is to teach those skilled in the art the general manner of carrying out the invention. It should be understood that the forms of the invention shown and described in this specification are to be regarded as the presently preferred embodiments.

Claims

1. A microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics, characterized in that it comprises the following steps:

Step a: collecting multi-channel noisy speech signals with a microphone array composed of L microphones, aligning the noisy speech signals of all channels in the time domain, using a short-time discrete Fourier transform to express each aligned channel signal as a complex-valued frequency-domain signal, calculating the power spectrum matrix of the multi-channel microphone array signals, and performing eigenvalue decomposition on this power spectrum matrix to obtain an eigenvalue matrix and an eigenvector matrix;

Step b: determining the dimension Q of the signal subspace, with Q ≤ L, by maximizing the probability that the target speech signal is present in the noisy speech signal;

Step c: adaptively selecting, based on the stationarity of the spectrum, a distribution model for the noise power spectrum in the noisy speech signal;

Step d: estimating the noise power spectrum using conditional probabilities;

Step e: obtaining the auditory masking threshold of each frequency bin from the signal subspace, using the auditory masking effect together with the signal subspace dimension and the noise power spectrum estimate;

Step f: estimating the post-filter from the noise power spectrum and the auditory masking threshold using a Lagrange multiplier, so that the residual noise in the enhanced speech is below the auditory masking threshold of the human ear, thereby eliminating the effect of the residual noise, while the distortion of the target speech signal is as small as possible, thus completing the microphone array post-filtering speech enhancement, wherein:

the step of estimating the post-filter G using a Lagrange multiplier is as follows:

Step fa: setting up an optimization problem that minimizes the distortion of the target speech signal under the constraint that the residual noise power is less than the masking threshold;

Step fb: solving with a Lagrange multiplier to obtain the optimal estimate of the post-filter;

Step fc: substituting the auditory masking threshold and the noise power spectrum estimate to complete the design of the post-filter.

2. The method of claim 1, characterized in that performing eigenvalue decomposition on the power spectrum matrix comprises:

using eigenvalue decomposition to divide the noisy speech signal space into two subspaces, namely a signal subspace, which contains the target speech signal and noise, and a noise subspace, which contains only noise; the eigenvalue decomposition of the power spectrum matrix $Φ_{XX}(k, t)$ of the noisy speech signal X at time frame t and frequency k is:

$Φ_{XX}(k, t) = U Λ_{XX} U^{H} = U (Λ_{SS} + φ_{NN}(k, t) I) U^{H}$

where X = S + N, X is the noisy speech signal, S is the target speech signal and N is the noise; $Λ_{XX}$ is the eigenvalue matrix of the noisy speech signal power spectrum with eigenvalues in descending order, $Λ_{SS}$ is the eigenvalue matrix of the target speech signal power spectrum with eigenvalues in descending order, U is the eigenvector matrix, $φ_{NN}(k, t)$ is the noise power at time frame t and frequency k, I is the identity matrix of order L, and $(\cdot)^{H}$ is the conjugate transpose operator.

3. The method of claim 1, characterized in that determining the signal subspace dimension means choosing the value of Q that maximizes the probability that the target speech signal is present in the noisy speech; the calculation uses conditional probabilities and comprises:

defining mutually exclusive events H₀ and H₁:

Event H₀: only noise is present in the noisy speech signal, and the target speech signal is absent;

Event H₁: the target speech signal and noise are both present in the noisy speech signal;

the signal subspace dimension Q is defined as:

$\underset{Q}{\arg \max} P(S(k, t) | H_{1})$

where S(k, t) is the power spectrum of the target speech signal at the k-th frequency bin of the t-th frame, P(·) is the distribution function of the target speech signal spectrum, and arg max(·) is the operator that finds the parameter value giving the maximum.

4. The method of claim 1, characterized in that adaptively selecting the noise power spectrum distribution model in the noisy speech signal based on the stationarity of the spectrum comprises the following steps:

Step c1: defining a discriminant function Ω that describes the stationarity of the power spectrum:

$Ω = \frac{\sqrt[L - Q]{\prod_{i = Q + 1}^{L} λ_{X_{i}}}}{\frac{1}{L - Q} \sum_{i = Q + 1}^{L} λ_{X_{i}}}$

that is, Ω is the ratio of the geometric mean $\sqrt[L - Q]{\prod_{i = Q + 1}^{L} λ_{X_{i}}}$ to the arithmetic mean $\frac{1}{L - Q} \sum_{i = Q + 1}^{L} λ_{X_{i}}$, where $λ_{X_{i}}$ is the i-th eigenvalue of the noisy speech signal power spectrum eigenvalue matrix $Λ_{XX}$, i ∈ {Q+1, …, L} is the eigenvalue index, and the value of Ω lies between 0 and 1;

Step c2: comparing the discriminant function value with preset thresholds to determine the noise power spectrum distribution model that applies to the noisy speech signal.

5. The method of claim 4, characterized in that the step of comparing the discriminant function value with preset thresholds comprises:

Step c21: determining two preset thresholds Ω₁ and Ω₂, with Ω₁ < Ω₂;

Step c22: comparing the discriminant function with the preset thresholds; in particular, if the discriminant function is less than the preset threshold Ω₁, the zero-mean Gaussian distribution is selected; if the discriminant function is greater than the preset threshold Ω₂, the Gamma distribution is selected; otherwise the Laplacian distribution is selected.

6. The method of claim 1, characterized in that the step of estimating the noise power spectrum using conditional probabilities comprises:

for each frame of the noisy speech signal, the probability that it contains only noise is P(H₀|X), and the probability that it contains both noise and the target speech signal is P(H₁|X); in these two cases the noise power spectrum is estimated respectively as:

$\begin{cases} H_{0} : φ_{NN}^{0} = \frac{1}{L} \sum_{i = 1}^{L} λ_{X_{i}} \\ H_{1} : φ_{NN}^{1} = \frac{1}{L - Q} \sum_{i = Q + 1}^{L} λ_{X_{i}} \end{cases}$

where $φ_{NN}^{0}$ and $φ_{NN}^{1}$ are the noise power spectra when the mutually exclusive events H₀ and H₁ occur, respectively, and i ∈ {1, …, L} is the eigenvalue index;

according to the conditional probability formula, the noise power spectrum is estimated as:

$\tilde{φ}_{NN} = P(H_{0} | X) φ_{NN}^{0} + P(H_{1} | X) φ_{NN}^{1}.$

7. The method of claim 1, characterized in that the step of estimating the auditory masking threshold comprises:

Step f1: dividing the auditory frequency range of 0 to 15500 Hz into a number of critical sub-bands;

Step f2: calculating the auditory masking threshold in each sub-band.

8. The method of claim 7, characterized in that calculating the auditory masking threshold in each sub-band comprises calculating the energy of each frequency bin in each sub-band, calculating the propagation coefficient of the basilar membrane of the human ear for the sound in each sub-band, multiplying the energy of each frequency bin in each sub-band by the propagation coefficient of that sub-band to obtain the excitation energy on the basilar membrane, and then calculating the masking threshold from the functional relationship between the excitation energy on the basilar membrane and the auditory masking threshold.