CN104091593A

CN104091593A - Speech endpoint detection algorithm adopting perceptual speech spectrum structure boundary parameters

Info

Publication number: CN104091593A
Application number: CN201410175090.8A
Authority: CN
Inventors: 吴迪; 赵鹤鸣; 陶智
Original assignee: Suzhou University
Current assignee: Suzhou Cheng Bang Energy Conservation Science & Technology Co Ltd
Priority date: 2014-04-29
Filing date: 2014-04-29
Publication date: 2014-10-08
Anticipated expiration: 2034-04-29
Also published as: CN104091593B

Abstract

The invention belongs to the field of speech recognition and discloses a speech endpoint detection algorithm that uses perceptual spectrogram structure boundary (PSSB) parameters. The noisy speech is first enhanced using the characteristics of auditory perception. Then, exploiting the difference between the continuous distribution of the speech signal and the random distribution of the residual noise, a two-dimensional enhancement is applied to the time-frequency spectrogram of the enhanced speech, which further highlights the continuously distributed spectral structure of clean speech. The PSSB parameters are obtained through two-dimensional boundary detection on the enhanced spectrogram structure and are used for endpoint detection. Experimental results show that the endpoint detection algorithm using PSSB parameters detects speech endpoints more effectively in white noise at signal-to-noise ratios (SNR) from -10 dB to 10 dB. At the extremely low SNR of -10 dB, the proposed method still achieves a correct rate of 75.2%.

Description

Speech Endpoint Detection Algorithm Using Perceptual Spectrogram Structure Boundary Parameters

Technical field

The invention belongs to the field of speech recognition and relates to a speech endpoint detection algorithm, in particular to one that uses perceptual spectrogram structure boundary parameters.

Background

Correct and effective endpoint detection underlies both speech recognition and speaker recognition and can greatly improve the recognition rate of such systems. In a laboratory environment with a high signal-to-noise ratio (SNR), traditional endpoint detection algorithms detect speech endpoints well. At low SNR, however, the performance of most endpoint detection algorithms drops sharply.

In recent years, many researchers have studied noise-robust endpoint detection. Ganapathiraju et al. (A. Ganapathiraju, et al. Comparison of Energy-Based Endpoint Detectors for Speech Signal Processing. In Proc. IEEE Publications, 1996; 500-503) studied endpoint detection with a method that combines short-time energy and the short-time zero-crossing rate (Energy and Zero-Crossing Rate, EZCR). Compared with the traditional energy method it is more robust, but it does not work at lower SNR. Chen Zhenbiao et al. (Chen Zhenbiao, Xu Bo. Research on an Optimized Speech Endpoint Detection Algorithm Based on Sub-Band Energy Features. Acta Acustica, 2005; 30(2):171-176) studied sub-band amplitude (Sub-Band Amplitude, SBA) and energy on the basis of the frequency-domain energy distribution of speech. They performed endpoint detection by combining several sub-band energies, which are more discriminative and more noise-resistant, with the optimal edge detection commonly used in image processing, which clearly improved endpoint detection in complex noise environments. Zhang et al. (Xueying Zhang, et al. A Speech Endpoint Detection Method Based on Wavelet Coefficient Variance and Sub-Band Amplitude Variance. In Proc. IEEE ICICIC, 2006; 105-109) proposed a method based on wavelet coefficients (Wavelet Coefficient, WC). Because wavelet analysis examines the signal at multiple scales, the method can separate speech segments from noise segments to some extent. Wu et al. (Bing-Fei Wu, Kun-Ching Wang. Robust Endpoint Detection Algorithm Based on the Adaptive Band-Partitioning Spectral Entropy in Adverse Environments. IEEE Transactions on Speech and Audio Processing, 2005; 13(5):762-775) applied adaptive band-partitioning spectral entropy (Adaptive Band-Partitioning Spectral Entropy, ABSE) to endpoint detection. The method separates the sub-band signals of speech from noise well and achieved a good correct rate in noisy environments. Li et al. (Q. Li, et al. A Robust Real-Time Endpoint Detector with Energy Normalization for ASR in Adverse Environments. International Conference on Acoustics, Speech and Signal Processing, 2001; 574-577) borrowed optimal edge detection from image processing and used a filter plus three-state decision logic for endpoint detection, so that the threshold need not be adjusted for different SNRs. The image-processing algorithm serves endpoint detection well as an auxiliary tool. However, none of these methods achieves a high correct rate of endpoint detection at low SNR.

Summary of the invention

Technical problem to be solved: in low-SNR environments, conventional endpoint detection methods have a very low correct rate.

Technical solution: based on the different characteristics of speech and noise in the two-dimensional time-frequency space at low SNR, and in combination with a speech enhancement algorithm based on auditory perception, the perceptual spectrogram structure boundary parameter PSSB (Perception Spectrogram Structure Boundary) is proposed and used for endpoint detection. First, the low-SNR speech is enhanced on the basis of auditory masking. Compared with traditional speech enhancement algorithms, this method preserves more of the speech components that the human ear can perceive. Next, the continuous distribution of the clean speech spectrogram along the time axis is taken into account at the two-dimensional level, and the noisy speech is enhanced in two dimensions, so that the spectrogram structure of the speech stands out further while that of the noise is suppressed. Finally, the two-dimensional boundary of the continuously distributed clean speech spectrogram structure is located, and the PSSB parameter is derived from it for endpoint detection.

1. Speech enhancement based on auditory perception characteristics

At low SNR, most endpoint detection algorithms detect speech endpoints poorly or fail completely, whereas humans can still recognize speech segments in strongly noisy environments. In noise, the perceptual characteristics of the human ear play an important role. Auditory masking, one of these characteristics, makes it possible to suppress noise to some extent while retaining more of the speech components. The PSSB parameter proposed by the invention therefore starts from speech enhancement based on auditory masking, which suppresses noise as far as possible while protecting the speech. The most important step of this enhancement method is the calculation of the masking threshold. The masking threshold calculation and the speech enhancement system are as follows:

(1) Bark-domain power spectrum

The speech signal x(n) is transformed into the frequency domain by the fast Fourier transform (FFT). The signal power spectrum is:

(1)

The Bark power spectrum is:

$B_{i} = \sum_{k = b_{li}}^{b_{hi}} P(k) \qquad (2)$

where $B_{i}$ is the energy of the i-th Bark band, $b_{li}$ is the lowest frequency of the i-th band, and $b_{hi}$ is the highest frequency of the i-th band.
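Equation (2) can be sketched in a few lines. This is a minimal illustration, assuming that each band is given as an inclusive pair of bin indices. The function name and the band edges in the usage lines are hypothetical and are not taken from the patent.

```python
def bark_band_power(power_spectrum, band_edges):
    """Sum the power spectrum P(k) over each band [b_l, b_h] (inclusive), as in equation (2)."""
    return [sum(power_spectrum[lo:hi + 1]) for lo, hi in band_edges]


# Illustrative values only: an 8-bin power spectrum split into three bands.
P = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
bands = [(0, 1), (2, 4), (5, 7)]
B = bark_band_power(P, bands)
```

In a real system the band edges would follow the critical-band (Bark) scale for the chosen sampling rate and FFT size.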

(2) Spread Bark-domain power spectrum

A spreading function $S_{ij}$ is introduced. It is a matrix that satisfies the condition:

(3)

It is defined as follows:

(4)

where the argument of the function is the difference between the band numbers of the two bands.

$C_{i} = \sum_{j = 1}^{j_{max}} S_{ij} \cdot B_{j}, \quad i = 1, 2, \ldots, i_{max} \qquad (5)$
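The spreading step of equation (5) can be sketched as a matrix-vector product. Because equations (3) and (4) are not available in this text, the sketch assumes the widely used Schroeder spreading function (in dB) as the definition of the matrix entries, and it assumes that the sum runs over the source band index j. Both are assumptions, and the function names are hypothetical.

```python
import math


def spreading_db(delta):
    """Schroeder spreading function in dB for a band-number difference delta (assumed form)."""
    d = delta + 0.474
    return 15.81 + 7.5 * d - 17.5 * math.sqrt(1.0 + d * d)


def spread_bark_spectrum(bark_power):
    """Apply C_i = sum_j S_ij * B_j with S_ij built from spreading_db(i - j)."""
    n = len(bark_power)
    spread = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            s_ij = 10.0 ** (spreading_db(i - j) / 10.0)  # dB -> linear power ratio
            total += s_ij * bark_power[j]
        spread.append(total)
    return spread


# A single active band leaks more energy into the band above it than into the band below.
C = spread_bark_spectrum([0.0, 1.0, 0.0])
```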

(3) Offset function of the masking energy and calculation of the masking threshold

(6)

$T_{i} = 10^{\log_{10}(C_{i}) - (O_{i} / 10)} \qquad (7)$

The coefficient in the offset function takes a value between 0 and 1 and is determined by the speech content. $T_{i}$ is the masking threshold of the i-th Bark band. It is renamed using the index b, where b has the same meaning as the preceding i.

This threshold is compared with the threshold of hearing in quiet:

(8)

and the larger of the two is taken as the final fitted masking threshold. The result is the corresponding Bark-domain masking curve.
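Equation (7) and the comparison with the threshold in quiet can be sketched as follows. The offset values O_i and the quiet thresholds are taken as inputs, because equations (6) and (8) are not available in this text. The function name and the numbers in the usage line are illustrative assumptions.

```python
import math


def masking_threshold(spread_power, offset_db, quiet_threshold):
    """Per-band threshold T_i = 10 ** (log10(C_i) - O_i / 10), floored by the quiet threshold.

    Every C_i must be strictly positive, otherwise log10 is undefined.
    """
    thresholds = []
    for c_i, o_i, q_i in zip(spread_power, offset_db, quiet_threshold):
        t_i = 10.0 ** (math.log10(c_i) - o_i / 10.0)  # equation (7)
        thresholds.append(max(t_i, q_i))              # keep the larger of the two
    return thresholds


# Band 0 is governed by masking, band 1 by the threshold in quiet.
T = masking_threshold([100.0, 1.0], [10.0, 20.0], [1.0, 0.5])
```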

(4) Spectral subtraction and adjustment of the subtraction parameters

The gain function used by the spectral subtraction algorithm is:

$H(k) = \begin{cases} \left(1 - \alpha \cdot \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma}\right)^{1/\gamma}, & \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma} < \frac{1}{\alpha + \beta} \\ \left(\beta \cdot \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma}\right)^{1/\gamma}, & \text{otherwise} \end{cases} \qquad (9)$

First, the noise masking threshold of each Bark band is computed for every speech frame, and the adaptive subtraction parameters $\alpha$ and $\beta$ are then derived from it. If the masking threshold is high, the residual noise is naturally masked and inaudible, so the subtraction parameters take their minimum values. If the masking threshold is low, the residual noise strongly affects the listener and must be reduced. For each frame m, the minimum of the masking threshold corresponds to the maximum of the subtraction parameters $\alpha$ and $\beta$ for that frame. The subtraction parameters are applied according to the following relation:

(10)

Each of the parameters $\alpha$ and $\beta$ is bounded by a minimum and a maximum value. When the masking threshold is at its minimum, the parameter takes its maximum value; when the masking threshold is at its maximum, the parameter takes its minimum value. The minimum and maximum of the masking threshold are obtained frame by frame. In the experiments, the parameters were set as follows:

(5) Real-time noise power spectrum estimation

Speech enhancement requires a noise spectrum estimator that responds in real time. A noise power spectrum estimation method based on constrained-variance spectral smoothing and minimum tracking is used. Its core is a constrained-variance smoothing filter, which controls the variance of the short-time smoothed power spectrum so that the minimum is tracked more accurately. The noise spectrum estimated in this way follows abrupt changes in the noise promptly, without noticeable delay, and is more accurate than the noise spectra estimated by other methods.
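The idea of minimum tracking can be illustrated with a deliberately simplified stand-in. The sketch below uses a fixed first-order smoothing constant and a sliding-window minimum for a single frequency bin. It is not the constrained-variance filter referred to above, and the function name, the smoothing constant, and the window length are all hypothetical.

```python
from collections import deque


def track_noise_floor(frame_powers, smoothing=0.8, window=5):
    """Smooth one frequency bin's power over time, then track its running minimum."""
    recent = deque(maxlen=window)   # last `window` smoothed values
    smoothed = None
    noise = []
    for p in frame_powers:
        smoothed = p if smoothed is None else smoothing * smoothed + (1.0 - smoothing) * p
        recent.append(smoothed)
        noise.append(min(recent))   # the minimum is taken as the noise estimate
    return noise


# A short burst of speech energy (9.0) barely moves the estimate while it lasts.
est = track_noise_floor([1.0, 1.0, 9.0, 9.0, 1.0, 1.0], smoothing=0.5, window=3)
```

The last two values of `est` show the typical weakness of a short window: once the quiet frames have left the window, the estimate is biased upward until the smoothed power decays.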

(6) Speech enhancement system

The adaptive subtraction parameters $\alpha$ and $\beta$ are obtained from the masking threshold. The speech enhancement system is shown in Figure 1.
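The gain function of equation (9) and the adaptation of a subtraction parameter can be sketched as below. Several points are assumptions: D(k) is read as the estimated noise magnitude and Y(k) as the noisy-speech magnitude, the default exponent of 2 is a guess, and linear interpolation between the two end points is used because equation (10) is not available in this text. All names and numeric values are illustrative.

```python
def spectral_gain(noise_mag, noisy_mag, alpha, beta, gamma=2.0):
    """Gain H(k) of equation (9) for one frequency bin."""
    ratio = (noise_mag / noisy_mag) ** gamma
    if ratio < 1.0 / (alpha + beta):
        return (1.0 - alpha * ratio) ** (1.0 / gamma)   # subtraction branch
    return (beta * ratio) ** (1.0 / gamma)              # spectral-floor branch


def adapt_parameter(threshold, t_min, t_max, p_min, p_max):
    """Map a masking threshold to a subtraction parameter: high threshold -> small parameter.

    Linear interpolation is an assumption; the source only fixes the two end points.
    """
    if t_max == t_min:
        return p_max
    frac = (threshold - t_min) / (t_max - t_min)
    return p_max - frac * (p_max - p_min)


# Illustrative numbers only.
alpha = adapt_parameter(0.2, t_min=0.0, t_max=1.0, p_min=1.0, p_max=6.0)
g_sub = spectral_gain(1.0, 2.0, alpha=2.0, beta=0.02)     # noise well below the signal
g_floor = spectral_gain(1.0, 1.0, alpha=2.0, beta=0.01)   # noise as strong as the signal
```

The first branch always returns a positive gain, because the condition on the ratio guarantees that the subtracted term stays below 1.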

2. Two-dimensional enhancement of speech

After low-SNR speech has been enhanced, both noise and speech are attenuated by the spectral subtraction. However, because the voiced segments of speech contain high-energy structures such as formants, the low-frequency region of the speech spectrogram retains a relatively high SNR in the two-dimensional time-frequency domain even under noise. These high-energy speech structures are also usually distributed continuously in time. Therefore, once these continuously distributed high-energy regions have been found in the two-dimensional spectrogram, and the adjoining unvoiced segments have been located from them, the start and end points of the speech are obtained. In this method, boundary detection is the algorithm that finds continuously distributed two-dimensional data structures.

However, whether or not the low-SNR speech has been enhanced, the noise (residual musical noise after enhancement) leaves the boundaries of its own spectrogram structure in the boundary detection. The spectrogram structure of the clean speech is then confused with that of the noise, which severely interferes with locating the clean speech structure, as shown in Figures 2 and 3.

Figure 2 is the spectrogram of speech containing white noise at -5 dB. The continuous dark horizontal stripes are the speech signal (in the high-frequency band, the lower-energy speech has already been masked by noise, and the formant structure of the high-frequency region is no longer visible in the spectrogram), and the dark snowflake-like background is white noise. Figure 3 is the spectrogram after speech enhancement. The noise is greatly weakened by the enhancement, but residual musical noise of varying strength remains. The invention divides this residual noise into stronger residual noise and weaker residual noise, as shown in Figure 3. Both kinds severely interfere with finding the speech endpoints. Therefore, before the endpoints are determined, the invention enhances the speech in two dimensions, exploiting the difference between the spectrogram structure of the residual noise and that of clean speech. The enhancement consists of a two-dimensional noise erosion algorithm and a two-dimensional speech dilation algorithm.

Two-dimensional noise erosion algorithm

Among enhancement algorithms for two-dimensional data, erosion can weaken or remove specific two-dimensional structures. In the spectrogram after speech enhancement, the weaker residual noise (the dim snowflake-like structures) is usually randomly distributed, as shown in Figure 3, and has small size and low energy. Although these structures are not as strong as the white noise in Figure 3, they still interfere with finding the spectrogram structure boundary of the clean speech. To address this, the invention proposes a two-dimensional noise erosion algorithm that weakens such two-dimensional structures.

The two-dimensional noise erosion of the speech spectrogram proceeds as follows. First, a short-time Fourier transform is applied to the speech, and the spectrum of each frame is computed as:

(11)

Equation (11) involves the m-th frame of the speech signal, the spectrum of that frame, and a Hamming window. N is both the frame length and the number of points of the short-time Fourier transform. The power spectrum of each frame can be expressed as:

(12)

This power spectrum is defined as the spectrogram of the speech signal.
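The framing, windowing, and power-spectrum steps can be sketched as follows, assuming that the power spectrum is the squared magnitude of the frame spectrum. The defaults of 256 and 128 are the frame length and frame shift reported in the experimental section of this document. A plain DFT is used instead of an FFT to keep the sketch dependency-free, and the function names are hypothetical.

```python
import cmath
import math


def hamming(n_points):
    """Hamming window of length n_points."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1)) for n in range(n_points)]


def spectrogram(signal, frame_len=256, hop=128):
    """Power spectrum |X_m(k)|^2 of each Hamming-windowed frame (plain DFT for clarity)."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        power = []
        for k in range(frame_len // 2 + 1):          # non-negative frequencies only
            x_k = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            power.append(abs(x_k) ** 2)
        frames.append(power)
    return frames                                     # frames[m][k]


# Usage with a short frame so the plain DFT stays fast: a tone centred on bin 4.
tone = [math.sin(2.0 * math.pi * 4 * n / 32) for n in range(64)]
S = spectrogram(tone, frame_len=32, hop=16)
peak_bin = max(range(len(S[0])), key=lambda k: S[0][k])
```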

The two-dimensional noise erosion of the spectrogram is defined as:

(13)

The definition involves a structuring element, the domain of the spectrogram, and the domain of the structuring element. The translated positions must lie within the domain of the spectrogram, and the translation parameters must lie within the domain of the structuring element. Two-dimensional noise erosion has a twofold effect on the signal: (1) if all elements of the structuring element are positive, the output signal tends to be weaker than the original signal; (2) a noise structure in the input spectrogram that resembles the structuring element is weakened, to a degree that depends on the shape of the noise structure and on the shape of the structuring element.

In the spectrogram of speech, erosion weakens both noise and speech. The purpose of the two-dimensional noise erosion algorithm proposed by the invention is to weaken the noise relatively more while preserving the speech better. To match the shape of the weaker residual noise in the spectrogram, the structuring element of the two-dimensional noise erosion algorithm is defined as:

(14)

This structuring element is close to the spectrogram structure of the weaker residual noise (small dots). Eroding the spectrogram with it therefore weakens this noise to some extent.
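The erosion step can be sketched with the standard grey-scale erosion from image processing, which is assumed here to be what equation (13) expresses. The 3x3 flat structuring element in the usage lines is hypothetical, because equation (14) is not available in this text. In the example, rows index frequency and columns index time.

```python
def erode(spec, element):
    """Grey-scale erosion: min over the element of spec[s + x][t + y] - element[x][y]."""
    rows, cols = len(spec), len(spec[0])
    e_rows, e_cols = len(element), len(element[0])
    r0, c0 = e_rows // 2, e_cols // 2            # element origin at its centre
    out = [[0.0] * cols for _ in range(rows)]
    for s in range(rows):
        for t in range(cols):
            candidates = []
            for x in range(e_rows):
                for y in range(e_cols):
                    i, j = s + x - r0, t + y - c0
                    if 0 <= i < rows and 0 <= j < cols:   # stay inside the spectrogram
                        candidates.append(spec[i][j] - element[x][y])
            out[s][t] = min(candidates)
    return out


# A thick bar (speech-like) and one isolated speck (weak residual noise).
spec = [
    [0, 0, 0, 0, 0, 0, 0],
    [5, 5, 5, 5, 5, 0, 0],
    [5, 5, 5, 5, 5, 0, 0],
    [5, 5, 5, 5, 5, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 3],
]
small_dot = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
eroded = erode(spec, small_dot)
```

The speck disappears while the core of the bar survives, which also shows why erosion thins the speech structure and should not be applied too aggressively.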

Two-dimensional speech dilation algorithm

After the two-dimensional noise erosion, the weaker residual noise is well suppressed. However, the stronger residual noise (see Figure 3) is similar in energy to clean speech, so excessive erosion would also weaken the two-dimensional structure of the clean speech. Dilation strengthens two-dimensional spectrogram structures that resemble the structuring element and relatively weakens those that do not. The invention therefore proposes a two-dimensional speech dilation algorithm that exploits the difference between the stronger residual noise and the structure of clean speech. The structuring element is defined to resemble the continuously distributed clean speech, so that the noise structure is relatively suppressed.

The two-dimensional speech dilation, applied to the result of the two-dimensional noise erosion, is defined as:

(15)

The definition involves a structuring element, the domain of the eroded spectrogram, and the domain of the structuring element. Conceptually, the structuring element is translated to every position in the spectrogram, its values are added to the values of the two-dimensional signal, and the maximum is taken. Two-dimensional speech dilation has a twofold effect on the speech signal: (1) if all elements of the structuring element are positive, the output signal tends to be stronger than the original signal; (2) whether a given structure in the input spectrogram is relatively strengthened depends on the values and shape of the structuring element used for the dilation.

Dilation strengthens the speech structure but also strengthens the corresponding noise structure. The purpose of the two-dimensional speech dilation algorithm proposed by the invention is to strengthen the speech structure as much as possible while relatively suppressing the noise structure. The spectrogram structure of the voiced parts of clean speech is usually a long strip extending along the time axis, whereas that of the stronger residual noise is usually a square or circle of varying size, as shown in Figure 3. The structuring element is therefore defined as a long strip extending along the time axis, which strengthens all similar structures and relatively weakens noise structures of a different shape.

The structuring element of the two-dimensional speech dilation algorithm is thus defined with the following shape:

(16)

This is a horizontal structuring element that extends along the time direction. All structures similar to it are strengthened. Because the spectrogram structure of clean speech is usually distributed continuously in time, it resembles this element and is strengthened. The spectrogram structure of the stronger residual noise usually consists of large round or square dots, so it is relatively weakened.
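The dilation step can be sketched with the standard grey-scale dilation (add the structuring element, take the maximum), which matches the verbal description above. The 1x5 flat horizontal element is hypothetical, because equation (16) is not available in this text. Rows index frequency and columns index time.

```python
def dilate(spec, element):
    """Grey-scale dilation: max over the element of spec[s - x][t - y] + element[x][y]."""
    rows, cols = len(spec), len(spec[0])
    e_rows, e_cols = len(element), len(element[0])
    r0, c0 = e_rows // 2, e_cols // 2            # element origin at its centre
    out = [[0.0] * cols for _ in range(rows)]
    for s in range(rows):
        for t in range(cols):
            candidates = []
            for x in range(e_rows):
                for y in range(e_cols):
                    i, j = s - (x - r0), t - (y - c0)
                    if 0 <= i < rows and 0 <= j < cols:   # stay inside the spectrogram
                        candidates.append(spec[i][j] + element[x][y])
            out[s][t] = max(candidates)
    return out


# A speech-like bar along the time axis with a one-frame gap.
spec = [
    [0, 0, 0, 0, 0, 0, 0],
    [4, 4, 4, 0, 4, 4, 4],
    [0, 0, 0, 0, 0, 0, 0],
]
time_strip = [[0, 0, 0, 0, 0]]   # one frequency bin high, five frames wide
dilated = dilate(spec, time_strip)
```

The gap in the bar is bridged along time, while nothing spreads into the neighbouring frequency rows, so structures that are continuous in time gain relative to compact blobs.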

3. Perceptual Spectrogram Structure Boundary (PSSB) parameter and endpoint detection algorithm

3.1 Perceptual Spectrogram Structure Boundary (PSSB) parameter

The invention takes into account, at the two-dimensional level, the continuous distribution of the clean speech spectrogram along the time axis and enhances the noisy speech in two dimensions, so that the spectrogram structure of the speech stands out further while that of the noise is suppressed. The invention then locates the boundary of the continuously distributed clean speech spectrogram structure and proposes the perceptual spectrogram structure boundary parameter PSSB for endpoint detection.

To obtain the PSSB parameter, the boundary information of the spectrogram structure must be determined first. Boundary detection is an important method for finding the boundaries of two-dimensional structures. The boundary of a continuous two-dimensional signal can be represented by the gradient given by its first derivatives. The invention approximates the gradient of the two-dimensionally enhanced spectrogram with the neighborhood model in equation (17).

(17)

The model is centered on its middle point. The gradient at the center of the neighborhood is given by:

(18)

Its two components are determined by equations (19) and (20):

(19)

(20)

The gradient obtained in this way is the boundary of the enhanced spectrogram. It describes the boundaries of the continuously distributed speech signal in the noisy speech spectrogram.
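A gradient computed from a neighborhood model can be sketched as below. The sketch assumes a 3x3 neighborhood and the Sobel operator with a Euclidean magnitude. Equations (17) to (20) are not available in this text, so the actual operator used by the patent may differ. The function name is hypothetical.

```python
import math


def gradient_magnitude(spec):
    """Approximate the gradient at each interior point from its 3x3 neighbourhood (Sobel)."""
    rows, cols = len(spec), len(spec[0])
    out = [[0.0] * cols for _ in range(rows)]    # border points are left at zero
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            z = [spec[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]  # z[0]..z[8]
            g_x = (z[6] + 2 * z[7] + z[8]) - (z[0] + 2 * z[1] + z[2])   # change across rows
            g_y = (z[2] + 2 * z[5] + z[8]) - (z[0] + 2 * z[3] + z[6])   # change across columns
            out[r][c] = math.sqrt(g_x * g_x + g_y * g_y)
    return out


# A horizontal step edge: the gradient is large only along the edge.
step = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
G = gradient_magnitude(step)
```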

Analysis of the boundary and of the speech spectrogram shows that at low SNR the signal and spectrogram features in the high-frequency region of speech are masked by noise, whereas in the low-frequency region the spectrogram structure of voiced segments still has high energy relative to the noise and has a boundary that can be determined. The lower the frequency, the more pronounced this effect is, because the energy of voiced speech is concentrated mainly in the first few formants at low and middle frequencies. Therefore, once the boundary of the speech spectrogram has been obtained, the boundary values of each frame are summed along the frequency axis with weights that favor the low-frequency region, which yields the perceptual spectrogram structure boundary parameter PSSB.

The PSSB parameter is defined as:

(21)

Equation (21) gives the PSSB parameter of the m-th frame, where M is the total number of frames.

The PSSB parameter reflects the relative amount of voiced speech in a frame well and is robust to noise.
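The weighted summation can be sketched as below. Equation (21) is not available in this text, so the linear weighting and the normalization by the largest frame value are assumptions. They are chosen only so that low frequencies count more and the result lies between 0 and 1, in line with the thresholds used for endpoint detection. The function name is hypothetical.

```python
def pssb(boundary, weights=None):
    """Per-frame weighted sum of boundary values, scaled so the largest frame equals 1.

    boundary[m][k] is the boundary strength of frame m at frequency bin k.
    """
    n_bins = len(boundary[0])
    if weights is None:
        # Hypothetical weighting: falls linearly from 1 at the lowest bin towards 0.
        weights = [1.0 - k / n_bins for k in range(n_bins)]
    raw = [sum(w * g for w, g in zip(weights, frame)) for frame in boundary]
    peak = max(raw)
    return [value / peak if peak > 0 else 0.0 for value in raw]


# The same boundary strength counts four times as much at the lowest bin as at the highest.
values = pssb([[0, 0, 0, 0], [4, 0, 0, 0], [0, 0, 0, 4]])
```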

3.2 Speech endpoint detection

Voiced segments in speech usually last for a relatively long continuous time. Unvoiced segments occur in two positions: (1) in the middle of a speech segment; (2) at the beginning of a speech segment.

Experiments show that unvoiced sounds in the middle of a speech segment are reliably recognized as speech (PSSB parameter greater than the threshold 0.5). This is because the unvoiced sound in the middle of a spoken word is usually short and the invention uses frames that overlap by 50%. The unvoiced sound in the middle of a word is therefore analyzed together with the neighboring voiced sound, so the unvoiced frame carries information from the neighboring voiced frames.

However, as the SNR decreases, especially below 0 dB, the PSSB of unvoiced sounds at the beginning of a speech segment becomes less discriminative (its value is small). If endpoints were determined with a single fixed threshold, the detection of unvoiced sounds would degrade sharply. Although the PSSB of unvoiced sounds is small compared with that of voiced sounds, it usually still has some discriminative power (small but non-zero). The invention therefore uses a detection method based on the continuity of speech, which treats voiced segments and the unvoiced segments at the endpoints differently. The endpoint detection method is as follows:

(1) First, detect the speech segments whose PSSB parameter is greater than a threshold a for m consecutive frames. These are the detected voiced segments.

(2) Starting from such a segment, all adjoining frames whose PSSB is continuously greater than or equal to a threshold b are added to it, and the result is defined as a speech segment. The threshold b is small; in the experiments, values of b from 0.01 to 0.05 all gave good recognition results. In this way, unvoiced segments with small PSSB values are identified.

(3) The start and end of this speech segment are the speech endpoints.

In experimental tests with white noise, the system performed well with a = 0.5, b = 0.01, and m = 20.
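The three steps above can be sketched as follows. The defaults are the values reported for white noise (a = 0.5, b = 0.01, m = 20). Reading "m consecutive frames" as "at least m consecutive frames", and merging segments that touch after extension, are interpretations made for this sketch. The function name is hypothetical.

```python
def detect_endpoints(pssb_values, a=0.5, b=0.01, m=20):
    """Return (start, end) frame indices of each detected speech segment."""
    n = len(pssb_values)
    segments = []
    i = 0
    while i < n:
        if pssb_values[i] > a:
            j = i
            while j + 1 < n and pssb_values[j + 1] > a:
                j += 1
            if j - i + 1 >= m:                       # step 1: long enough voiced run
                start, end = i, j
                while start > 0 and pssb_values[start - 1] >= b:
                    start -= 1                       # step 2: extend left over weak frames
                while end + 1 < n and pssb_values[end + 1] >= b:
                    end += 1                         # step 2: extend right over weak frames
                if segments and start <= segments[-1][1] + 1:
                    segments[-1] = (segments[-1][0], max(end, segments[-1][1]))  # merge
                else:
                    segments.append((start, end))    # step 3: segment limits are the endpoints
            i = j + 1
        else:
            i += 1
    return segments


# Short illustrative track with m lowered to 3 so the example stays small.
track = [0.0, 0.0, 0.02, 0.03, 0.6, 0.7, 0.8, 0.6, 0.04, 0.0, 0.0]
found = detect_endpoints(track, a=0.5, b=0.01, m=3)
```

The weak frames on both sides of the voiced run (values 0.02, 0.03, and 0.04) are attached to the segment, which is how low-PSSB unvoiced onsets are recovered without lowering the main threshold a.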

本发明的端点检测算法的框图如图4所示。 The block diagram of the endpoint detection algorithm of the present invention is shown in FIG. 4 . the

有益效果： Beneficial effect:

实验设计在不同信噪比环境下。输入的低信噪比语音是16k采样，16位量化。使用汉明窗，帧长256，帧移128。语音选自TIMIT语音数据库，白噪声来自NoiseX-92 噪声数据库。图5是数据库中的一段语音实例（artists）的波形图，图6是加入白噪声使信噪比达到-10dB的低信噪比语音波形。 The experiments were designed under different signal-to-noise ratio environments. The input speech with low signal-to-noise ratio is 16k samples and 16-bit quantization. Using the Hamming window, the frame length is 256, and the frame shift is 128. The speech is selected from the TIMIT speech database, and the white noise is from the NoiseX-92 noise database. Figure 5 is a waveform diagram of a speech example (artists) in the database, and Figure 6 is a speech waveform with a low SNR of -10dB by adding white noise.

In Fig. 5 the speech starts at frame 40 and ends at frame 87. When white noise is added so that the SNR reaches -10 dB, the speech signal is completely submerged in the noise, and traditional endpoint detection algorithms cannot effectively extract the speech endpoints from such a signal.

Fig. 7 is the spectrogram of the clean speech example ("artists"), Fig. 8 is the spectrogram of the low-SNR speech, and Fig. 9 is the spectrogram after speech enhancement based on auditory masking.

Fig. 8 shows that at an SNR of -10 dB most of the spectral structure of the speech is submerged by noise; only the formant structure in the low-frequency region can still be distinguished from the noise. Fig. 9 shows that speech enhancement attenuates both the noise and the speech, and that randomly distributed musical noise remains. This is an inherent property of spectral-subtraction-type algorithms.

If the spectral boundaries were extracted directly from the spectrogram in Fig. 9, noise and speech would still be difficult to separate. A further two-dimensional enhancement of the speech spectrogram is therefore required, as shown in Fig. 10 and Fig. 11.

Fig. 10 is the result of applying the two-dimensional noise erosion algorithm to Fig. 9. Compared with Fig. 9, the residual noise is suppressed to some extent, except for the high-energy residual noise and the formant structure of the speech at low frequencies. Fig. 11 is the result of applying the two-dimensional speech dilation algorithm to the spectral structure in Fig. 10. The randomly distributed high-energy noise structures are relatively weakened, while the spectral structure of the speech is relatively enhanced.

Boundary detection is then applied to Fig. 11; the result is shown in Fig. 12. Between frames 40 and 85 the boundary structure of the speech spectrum in the low-frequency region is extracted well. However, because a small amount of two-dimensional noise structure remains, many boundary structures of mid- and high-frequency noise also appear in the non-speech regions, which is undesirable. Therefore, in the PSSB parameter the boundary structure in the low-frequency region is given a higher weight, so that speech and noise are well separated, as shown in Fig. 13.

Fig. 13 shows the PSSB parameter computed from Fig. 12. Even at -10 dB, the PSSB parameter of the speech signal remains clearly distinguishable along the time axis. During endpoint detection the PSSB sequence is checked for continuity: if the PSSB values are continuously greater than 0 and, within that run, more than 20 consecutive frames exceed the threshold 0.5, the whole run of PSSB values continuously greater than 0 is judged to be a speech segment.

In the experiments, the endpoint detection algorithm of the present invention (PSSB) is compared with four other endpoint detection algorithms in terms of accuracy: (1) energy and short-time zero-crossing rate (EZCR); (2) the subband amplitude method (SBA); (3) the wavelet coefficient method (WC); and (4) the subband spectral entropy method (ABSE). Seventy words from the TIMIT speech database are selected as detection targets, and endpoint detection is performed three times for each word. White noise from the NoiseX-92 noise database is added with appropriate weights to obtain speech at different SNRs. An endpoint detection with an error of less than 4 frames is counted as correct, and the endpoint detection accuracy is defined as the number of correct results divided by the total number of speech segments used for endpoint detection. Table 1 and Fig. 14 show the endpoint detection accuracy of each algorithm at different SNRs.
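The accuracy definition above can be written as a small helper. This is a sketch under assumptions: the name `endpoint_accuracy` is hypothetical, and a result is counted as correct only when both the start and the end frame are within the tolerance, which the text does not specify.

```python
def endpoint_accuracy(detected, reference, tolerance=4):
    """Fraction of segments whose endpoints are off by fewer than `tolerance` frames.

    detected and reference are equal-length lists of (start, end) frame pairs.
    Assumption: both endpoints must satisfy the tolerance for a correct result.
    """
    correct = 0
    for (d_start, d_end), (r_start, r_end) in zip(detected, reference):
        if abs(d_start - r_start) < tolerance and abs(d_end - r_end) < tolerance:
            correct += 1
    return correct / len(reference)
```

With this convention, a detection of frames 40 to 85 against a reference of frames 40 to 87 counts as correct, since both errors are below 4 frames.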


Table 1. Endpoint detection accuracy (%) at different signal-to-noise ratios

An asterisk ("*") in Table 1 indicates that the algorithm fails under that condition, in which case the accuracy is taken to be zero. Table 1 and Fig. 14 show that at 10 dB the endpoint detection accuracy of the three traditional methods (EZCR, SBA and WC) is already below 86%. When the SNR falls below 0 dB these three methods fail completely, which shows that they are not robust to noise. The ABSE method has relatively high accuracy because it also analyzes the high-energy components of the clean speech when detecting endpoints. The method of the present invention, which uses the PSSB parameter, achieves higher endpoint detection accuracy than ABSE and still reaches 75.2% at -10 dB.


Description of drawings:

Fig. 1 shows the speech enhancement system based on auditory characteristics;

Fig. 2 is the spectrogram of speech with white noise at -5 dB;

Fig. 3 is the spectrogram after speech enhancement;

Fig. 4 shows the endpoint detection algorithm using the PSSB parameter;

Fig. 5 is the clean speech waveform;

Fig. 6 is the low-SNR speech waveform at -10 dB;

Fig. 7 is the spectrogram of the clean speech signal;

Fig. 8 is the spectrogram of the low-SNR speech signal at -10 dB;

Fig. 9 is the speech enhancement result;

Fig. 10 is the spectrogram after the two-dimensional noise erosion algorithm;

Fig. 11 is the spectrogram after the two-dimensional speech dilation algorithm;

Fig. 12 shows the spectral boundaries;

Fig. 13 shows the PSSB parameter and the endpoint detection;

Fig. 14 compares the endpoint detection results.


Detailed description of embodiments

Embodiment 1

Step 1: Speech enhancement based on auditory perception characteristics. Speech enhancement based on auditory masking is used to suppress noise as much as possible while preserving the speech. The calculation of the masking threshold and the speech enhancement system are as follows:

ⅰ. Bark-domain power spectrum

The speech signal x(n) is converted into a frequency-domain signal by the fast Fourier transform (FFT); the signal power spectrum is:

(1)

The Bark power spectrum is:

$$B_{i} = \sum_{k = b_{li}}^{b_{hi}} P(k) \qquad (2)$$

where $B_{i}$ is the energy of the i-th Bark band, $b_{li}$ is the lowest frequency of the i-th band, and $b_{hi}$ is the highest frequency of the i-th band;
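Equation (2) is a plain summation of power-spectrum bins over each Bark band. A minimal sketch, assuming the power spectrum `power` is a list indexed by frequency bin k and that the band limits are supplied as inclusive bin-index pairs; the name `bark_band_energies` and the band limits in the usage note are hypothetical and are not the Bark band edges used in the patent.

```python
def bark_band_energies(power, band_edges):
    """Compute B_i = sum of P(k) for k = b_li .. b_hi (inclusive) for each band.

    band_edges is a list of (b_li, b_hi) bin-index pairs, one per Bark band.
    """
    return [sum(power[lo:hi + 1]) for lo, hi in band_edges]
```

For instance, `bark_band_energies([1, 2, 3, 4, 5, 6], [(0, 1), (2, 5)])` sums bins 0 to 1 and bins 2 to 5 separately.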

ⅱ. Spread Bark-domain power spectrum

A spread function $S_{ij}$ is introduced; it is a matrix that satisfies the condition:

(3)

It is defined as follows:

(4)

where the variable in (4) denotes the difference between the band numbers of the two bands;

$$C_{i} = \sum_{j = 1}^{j_{\max}} S_{ij} \cdot B_{i}, \quad i = 1, 2, \ldots, i_{\max} \qquad (5)$$

ⅲ. Offset function of the masking energy and calculation of the masking threshold

(6)

$$T_{i} = 10^{\log_{10}(C_{i}) - (O_{i} / 10)} \qquad (7)$$

The coefficient in (6) takes a value between 0 and 1, determined by the speech content; $T_{i}$ is the masking threshold of the i-th Bark band, which is subsequently written with the index b, where b has the same meaning as the preceding i;

The masking threshold is compared with the absolute threshold of hearing in quiet:

(8)

and the larger of the two values is taken as the final masking threshold, which gives the corresponding Bark-domain masking curve;
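The threshold computation in (7), followed by the comparison with the threshold in quiet, can be sketched as below. The function name `masking_thresholds` and its argument names are hypothetical; the offsets $O_{i}$ and the quiet-threshold values are taken as given inputs because equations (6) and (8) are not reproduced in this text, and all three inputs are assumed to be per-band lists in consistent units.

```python
import math


def masking_thresholds(spread_energy, offset_db, quiet_threshold):
    """Per-band masking threshold: Eq. (7), floored by the threshold in quiet.

    spread_energy   -- C_i, spread Bark-domain energy (must be positive)
    offset_db       -- O_i, masking-energy offset in dB (assumed precomputed)
    quiet_threshold -- absolute hearing threshold per band (assumed precomputed)
    """
    thresholds = []
    for c_i, o_i, q_i in zip(spread_energy, offset_db, quiet_threshold):
        t_i = 10 ** (math.log10(c_i) - o_i / 10)  # Eq. (7)
        thresholds.append(max(t_i, q_i))          # keep the larger value
    return thresholds
```

A band whose computed threshold falls below the threshold in quiet is simply assigned the latter, as the text describes.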

ⅳ. Spectral subtraction and adjustment of the subtraction parameters

The gain function used by the spectral subtraction algorithm is as follows:

$$H(k) = \begin{cases} \left(1 - \alpha \cdot \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma}\right)^{1/\gamma}, & \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma} < \frac{1}{\alpha + \beta} \\ \left(\beta \cdot \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma}\right)^{1/\gamma}, & \text{otherwise} \end{cases} \qquad (9)$$

First, the noise masking threshold of each Bark band is calculated for every frame of speech, and the adaptive subtraction parameters $\alpha$ and $\beta$ are then derived from it. If the masking threshold is high, the residual noise is naturally masked and inaudible, so the subtraction parameters take their minimum values; if the masking threshold is low, the residual noise is clearly audible and must be reduced. For each frame m, the minimum of the masking threshold corresponds to the maximum of the per-frame subtraction parameters $\alpha$ and $\beta$. The subtraction parameters are applied according to the following relationship:

(10)

Here each of the parameters $\alpha$ and $\beta$ is bounded by its own minimum and maximum value, and the minimum and maximum of the masking threshold are obtained frame by frame: when the masking threshold equals its minimum, the subtraction parameter takes its maximum, and when the masking threshold equals its maximum, the subtraction parameter takes its minimum. In the experiments, the parameters were set as follows:

ⅴ. Real-time noise power spectrum estimation: a noise power spectrum estimation method based on constrained-variance spectral smoothing and minimum tracking is used.

ⅵ. Speech enhancement system: the adaptive subtraction parameters $\alpha$ and $\beta$ are obtained from the masking threshold;
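The piecewise gain function (9) can be evaluated per frequency bin as in the following sketch. The name `subtraction_gain` is hypothetical; $|D(k)|$ is assumed to be the estimated noise magnitude and $|Y(k)|$ the noisy-speech magnitude, which the text does not define explicitly, and the parameter values in the usage note are illustrative only, since the values used in the patent's experiments are not reproduced here.

```python
def subtraction_gain(noise_mag, noisy_mag, alpha, beta, gamma=2.0):
    """Evaluate the gain H(k) of Eq. (9) for one frequency bin.

    noise_mag -- |D(k)|, assumed to be the estimated noise magnitude
    noisy_mag -- |Y(k)|, assumed to be the noisy-speech magnitude (non-zero)
    alpha, beta, gamma -- subtraction parameters (alpha, beta > 0)
    """
    ratio = (noise_mag / noisy_mag) ** gamma
    if ratio < 1.0 / (alpha + beta):
        # Subtraction branch: remove alpha times the relative noise level.
        return (1.0 - alpha * ratio) ** (1.0 / gamma)
    # Flooring branch: keep a small fraction beta of the noise instead.
    return (beta * ratio) ** (1.0 / gamma)
```

With illustrative values alpha = 2 and beta = 0.01, a bin where the noise is half the noisy magnitude falls in the subtraction branch, while a bin dominated by noise falls in the flooring branch and receives a small fixed gain.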

Step 2: Two-dimensional enhancement of the speech;

2.1 Two-dimensional noise erosion algorithm

The two-dimensional noise erosion algorithm for the speech spectrogram is defined by the following procedure. First, a short-time Fourier transform is applied to the speech, and the spectrum of each frame is calculated by:

(11)

In (11), the input is the speech signal of the m-th frame and the output is the spectrum of the m-th frame; N is both the frame length and the number of short-time Fourier transform points; the window is a Hamming window. The power spectrum of the speech signal in each frame can be expressed as:

(12)

which is defined as the spectrogram of the speech signal;
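The framing, windowing and power-spectrum steps described for (11) and (12) can be sketched in plain Python as follows. This is an illustration under assumptions: the names `hamming` and `power_spectrogram` are hypothetical, the symmetric Hamming definition with N - 1 in the denominator is assumed, and a direct DFT is used for clarity where practical code would call an FFT routine.

```python
import cmath
import math


def hamming(n_points):
    """Symmetric Hamming window (assumed form; the source gives no formula)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def power_spectrogram(signal, frame_len=256, hop=128):
    """Return one list of power values |X_m(k)|^2 per frame, k = 0 .. N/2."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at bin k.
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc) ** 2)
        frames.append(spectrum)
    return frames
```

The result is a two-dimensional time-frequency array (frames by frequency bins), which is the input assumed by the erosion and dilation steps that follow.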

The two-dimensional noise erosion of the spectrogram is defined as:

(13)

where the operand of the erosion is the structuring element, and the spectrogram and the structuring element each have their own domain of definition; the translated coordinates must lie within the domain of the spectrogram, and the offsets must lie within the domain of the structuring element;

For the structural shape of the low-energy residual noise in the spectrogram, the structuring element of the two-dimensional noise erosion algorithm is defined as:

(14)

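2.2 Two-dimensional speech dilation algorithm

For the result of the two-dimensional noise erosion, the two-dimensional speech dilation algorithm is defined by:

(15)

where the operand of the dilation is the structuring element, and the eroded spectrogram and the structuring element each have their own domain of definition;

The structuring element of the two-dimensional speech dilation algorithm is therefore defined with the following shape:

(16)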
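Because equations (13) to (16) are not reproduced in this text, the sketch below uses the standard grayscale morphology definitions (minimum of shifted differences for erosion, maximum of shifted sums for dilation) as an assumption, not as the patent's exact formulas. The names `erode` and `dilate` are hypothetical, and the structuring element in the usage note is an illustrative flat element, not the element defined in (14) or (16). The structuring element is a dictionary mapping offsets (x, y) to values and is assumed to contain the offset (0, 0).

```python
def erode(f, b):
    """Grayscale erosion: min over offsets of f(s + x, t + y) - b(x, y)."""
    rows, cols = len(f), len(f[0])
    out = [[0.0] * cols for _ in range(rows)]
    for s in range(rows):
        for t in range(cols):
            # Only offsets whose translated point stays inside f are used.
            out[s][t] = min(f[s + x][t + y] - v for (x, y), v in b.items()
                            if 0 <= s + x < rows and 0 <= t + y < cols)
    return out


def dilate(f, b):
    """Grayscale dilation: max over offsets of f(s - x, t - y) + b(x, y)."""
    rows, cols = len(f), len(f[0])
    out = [[0.0] * cols for _ in range(rows)]
    for s in range(rows):
        for t in range(cols):
            out[s][t] = max(f[s - x][t - y] + v for (x, y), v in b.items()
                            if 0 <= s - x < rows and 0 <= t - y < cols)
    return out
```

With a flat three-point element along one axis, for example `{(0, -1): 0.0, (0, 0): 0.0, (0, 1): 0.0}`, erosion removes an isolated spike but keeps a run of three equal values, and dilation then widens what survives. This mirrors the idea in the text that randomly distributed noise is suppressed while continuously distributed speech structure is retained.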

Step 3: Perceptual spectral structure boundary (PSSB) parameter and endpoint detection algorithm

3.1 Perceptual spectral structure boundary (PSSB) parameter

The present invention uses the neighborhood model in formula (17) to approximate the gradient of the result of the two-dimensional speech enhancement;

(17)

The central element of (17) is the center point of this neighborhood model; the gradient of the center neighborhood is given by:

(18)

The two quantities in (18) are determined by formula (19) and formula (20):

(19)

(20)

The result is the boundary of the enhanced spectrogram, which describes the boundary information of the continuously distributed speech signal in the noisy speech spectrum;

The perceptual spectral structure boundary parameter PSSB is defined as follows:

(21)

where the left-hand side of (21) is the PSSB parameter of the m-th frame and M is the total number of frames;

3.2 Speech endpoint detection

A detection method based on the continuity of the speech distribution is adopted, so that voiced segments and the unvoiced segments at the endpoints are treated differently. The specific endpoint detection method is as follows:

(1) First, detect a segment in which the PSSB parameter is greater than threshold a for m consecutive frames; this segment is the detected voiced segment;

(2) Starting from this segment, all adjoining frames whose PSSB values are continuously greater than or equal to threshold b are also defined as speech. Threshold b is chosen to be small; in the experiments, values of b between 0.01 and 0.05 all gave good detection results. In this way, unvoiced segments with small PSSB values are also identified;

(3) The start and end points of the resulting speech segment are the speech endpoints.

The experiments were carried out under different signal-to-noise ratio conditions. The input low-SNR speech is sampled at 16 kHz with 16-bit quantization; a Hamming window is used, with a frame length of 256 and a frame shift of 128; the speech is taken from the TIMIT speech database and the white noise from the NoiseX-92 noise database.

Claims

1. A speech endpoint detection algorithm using perceptual spectral structure boundary parameters, characterized in that the algorithm comprises the following steps: (1) speech enhancement based on auditory perception characteristics; (2) two-dimensional enhancement of the speech, comprising a two-dimensional noise erosion algorithm and a two-dimensional speech dilation algorithm; (3) computation of the perceptual spectral structure boundary (PSSB) parameter and speech endpoint detection.

2. The speech endpoint detection algorithm using perceptual spectral structure boundary parameters according to claim 1, characterized in that the steps of the algorithm are as follows:

Step 1: Speech enhancement based on auditory perception characteristics. Speech enhancement based on auditory masking is used to suppress noise as much as possible while preserving the speech. The calculation of the masking threshold and the speech enhancement system are as follows:

ⅰ. Bark-domain power spectrum

The speech signal x(n) is converted into a frequency-domain signal by the fast Fourier transform (FFT); the signal power spectrum is:

(1)

The Bark power spectrum is:

$$B_{i} = \sum_{k = b_{li}}^{b_{hi}} P(k) \qquad (2)$$

where $B_{i}$ is the energy of the i-th Bark band, $b_{li}$ is the lowest frequency of the i-th band, and $b_{hi}$ is the highest frequency of the i-th band;

ⅱ. Spread Bark-domain power spectrum

A spread function $S_{ij}$ is introduced; it is a matrix that satisfies the condition:

(3)

It is defined as follows:

(4)

where the variable in (4) denotes the difference between the band numbers of the two bands;

$$C_{i} = \sum_{j = 1}^{j_{\max}} S_{ij} \cdot B_{i}, \quad i = 1, 2, \ldots, i_{\max} \qquad (5)$$

ⅲ. Offset function of the masking energy and calculation of the masking threshold

(6)

$$T_{i} = 10^{\log_{10}(C_{i}) - (O_{i} / 10)} \qquad (7)$$

The coefficient in (6) takes a value between 0 and 1, determined by the speech content; $T_{i}$ is the masking threshold of the i-th Bark band, which is subsequently written with the index b, where b has the same meaning as the preceding i;

The masking threshold is compared with the absolute threshold of hearing in quiet:

(8)

and the larger of the two values is taken as the final masking threshold, which gives the corresponding Bark-domain masking curve;

ⅳ. Spectral subtraction and adjustment of the subtraction parameters

The gain function used by the spectral subtraction algorithm is as follows:

$$H(k) = \begin{cases} \left(1 - \alpha \cdot \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma}\right)^{1/\gamma}, & \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma} < \frac{1}{\alpha + \beta} \\ \left(\beta \cdot \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma}\right)^{1/\gamma}, & \text{otherwise} \end{cases} \qquad (9)$$

First, the noise masking threshold of each Bark band is calculated for every frame of speech, and the adaptive subtraction parameters $\alpha$ and $\beta$ are then derived from it. If the masking threshold is high, the residual noise is naturally masked and inaudible, so the subtraction parameters take their minimum values; if the masking threshold is low, the residual noise is clearly audible and must be reduced. For each frame m, the minimum of the masking threshold corresponds to the maximum of the per-frame subtraction parameters $\alpha$ and $\beta$. The subtraction parameters are applied according to the following relationship:

(10)

Here each of the parameters $\alpha$ and $\beta$ is bounded by its own minimum and maximum value, and the minimum and maximum of the masking threshold are obtained frame by frame: when the masking threshold equals its minimum, the subtraction parameter takes its maximum, and when the masking threshold equals its maximum, the subtraction parameter takes its minimum. In the experiments, the parameters were set as follows:

ⅴ. Real-time noise power spectrum estimation: a noise power spectrum estimation method based on constrained-variance spectral smoothing and minimum tracking is used;

ⅵ. Speech enhancement system: the adaptive subtraction parameters $\alpha$ and $\beta$ are obtained from the masking threshold;

Step 2: Two-dimensional enhancement of the speech;

2.1 Two-dimensional noise erosion algorithm

The two-dimensional noise erosion algorithm for the speech spectrogram is defined by the following procedure. First, a short-time Fourier transform is applied to the speech, and the spectrum of each frame is calculated by:

(11)

In (11), the input is the speech signal of the m-th frame and the output is the spectrum of the m-th frame; N is both the frame length and the number of short-time Fourier transform points; the window is a Hamming window. The power spectrum of the speech signal in each frame can be expressed as:

(12)

which is defined as the spectrogram of the speech signal;

The two-dimensional noise erosion of the spectrogram is defined as:

(13)

where the operand of the erosion is the structuring element, and the spectrogram and the structuring element each have their own domain of definition; the translated coordinates must lie within the domain of the spectrogram, and the offsets must lie within the domain of the structuring element;

For the structural shape of the low-energy residual noise in the spectrogram, the structuring element of the two-dimensional noise erosion algorithm is defined as:

(14)

2.2 Two-dimensional speech dilation algorithm

For the result of the two-dimensional noise erosion, the two-dimensional speech dilation algorithm is defined by:

(15)

where the operand of the dilation is the structuring element, and the eroded spectrogram and the structuring element each have their own domain of definition;

The structuring element of the two-dimensional speech dilation algorithm is therefore defined with the following shape:

(16)

Step 3: Perceptual spectral structure boundary (PSSB) parameter and endpoint detection algorithm

3.1 Perceptual spectral structure boundary (PSSB) parameter

The present invention uses the neighborhood model in formula (17) to approximate the gradient of the result of the two-dimensional speech enhancement;

(17)

The central element of (17) is the center point of this neighborhood model; the gradient of the center neighborhood is given by:

(18)

The two quantities in (18) are determined by formula (19) and formula (20):

(19)

(20)

The result is the boundary of the enhanced spectrogram, which describes the boundary information of the continuously distributed speech signal in the noisy speech spectrum;

The perceptual spectral structure boundary parameter PSSB is defined as follows:

(21)

where the left-hand side of (21) is the PSSB parameter of the m-th frame and M is the total number of frames;

3.2 Speech endpoint detection

A detection method based on the continuity of the speech distribution is adopted, so that voiced segments and the unvoiced segments at the endpoints are treated differently. The specific endpoint detection method is as follows:

(1) First, detect a segment in which the PSSB parameter is greater than threshold a for m consecutive frames; this segment is the detected voiced segment;

(2) Starting from this segment, all adjoining frames whose PSSB values are continuously greater than or equal to threshold b are also defined as speech. Threshold b is chosen to be small; in the experiments, values of b between 0.01 and 0.05 all gave good detection results. In this way, unvoiced segments with small PSSB values are also identified;

(3) The start and end points of the resulting speech segment are the speech endpoints.

3. The speech endpoint detection algorithm using perceptual spectral structure boundary parameters according to claim 2, characterized in that the experiments are carried out under different signal-to-noise ratio conditions, and the input low-SNR speech is sampled at 16 kHz with 16-bit quantization.

4. The speech endpoint detection algorithm using perceptual spectral structure boundary parameters according to claim 2, characterized in that a Hamming window is used, with a frame length of 256 and a frame shift of 128.

5. The speech endpoint detection algorithm using perceptual spectral structure boundary parameters according to claim 2, characterized in that the speech is taken from the TIMIT speech database and the white noise from the NoiseX-92 noise database.