CN103440872B

CN103440872B - Denoising Method of Transient Noise

Info

Publication number: CN103440872B
Application number: CN201310357211.6A
Authority: CN
Inventors: 陈喆; 殷福亮; 周文颖
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2013-08-15
Filing date: 2013-08-15
Publication date: 2016-06-01
Anticipated expiration: 2033-08-15
Also published as: CN103440872A

Abstract

The invention discloses a denoising method of transient noise, belonging to the technical field of signal processing. The invention firstly calculates the Mel cepstrum coefficient of the frame signal, simultaneously predicts the pitch period of the frame signal, then uses the Mel cepstrum coefficient to detect whether the noise exists in the frame signal, if so, uses the pitch period predicted value to rebuild the frame signal.

Description

Denoising Method of Transient Noise

Technical field

The invention relates to a method for removing transient noise and belongs to the technical field of signal processing.

Background art

Transient additive noise in audio signals is also called impulse noise. In the time domain it is typically discontinuous, intermittent and pulse-like: the noise energy is concentrated in a short time interval, within which it is far larger than the energy of the clean signal. Typical examples are table knocks, door slams, applause, keyboard keystrokes, mouse clicks and hammer blows, which occur in many applications such as hearing aids, mobile phones and video-conferencing equipment. Transient noise seriously degrades audio quality, so it needs to be suppressed. Most existing noise suppression algorithms target stationary and continuous noise; speech enhancement usually relies on the methods described in "Research on Speech Enhancement and Related Technologies", such as spectral subtraction and adaptive filtering. These algorithms have essentially no suppression effect on the transient noise described above.

Summary of the invention

In view of the above problems, the present invention provides a method for removing transient noise.

The technical solution adopted by the present invention is as follows: first compute the Mel cepstral coefficients of the current frame and, at the same time, predict its pitch period; then use the Mel cepstral coefficients to detect whether the current frame contains noise; if noise is present, use the predicted pitch period to reconstruct the waveform.

Beneficial effects of the present invention: Experiments were carried out with 20 clean speech recordings (from adult men, adult women and children) and 4 types of noise: mouse clicks, knocking, a metronome and keyboard keystrokes. The noise durations are 10 ms for mouse clicks, 20 ms for knocking and the metronome, and 30 ms for keystrokes. Each of the 4 noise types was added to every clean recording, giving 80 noisy recordings. Each recording contains 30 noise events, equally spaced. The sampling rate of all audio is f_s = 48 kHz and the frame length is N = 480. In the MFCC computation stage, an NFFT = 1024-point FFT is used, the Mel filter bank has M = 24 filters, and L = 12-dimensional MFCC are computed. In the transient noise detection stage, the adaptive threshold is Thres = const·ener; so that the threshold suits all noise types, the constant const is set to 10, and ener, the energy of each input frame, has its minimum value set to 60.0. When the threshold is updated, the forgetting factor b is set to 0.4. In the pitch period estimation stage, the pitch period is searched within (2 ms, 12 ms), corresponding to (96, 576) samples. In the waveform reconstruction stage, the cross-fade lengths N₁ and N₂ are both 32, and the buffer buf(n) has length 2240. Denoising noisy speech with the present invention greatly improves speech intelligibility and reduces listener fatigue. The denoising effect is evaluated with two metrics, segmental signal-to-noise ratio (SNR_seg) and PEAQ; the results are shown in Figures 12 and 13.

Description of drawings

Figure 1: Relationship between Mel frequency and linear frequency.

Figure 2: Flow chart of the technical solution of prior art 1.

Figure 3: Flow chart of the technical solution of prior art 2.

Figure 4: Block diagram of the present technical solution.

Figure 5: Block diagram of MFCC feature extraction.

Figure 6: Mel frequency filter bank.

Figure 7: Block diagram of pitch period estimation.

Figure 8: Linear interpolation between two points.

Figure 9(a): Signal before the current frame is repaired.

Figure 9(b): New pitch-period waveform pw^(p)(n).

Figure 9(c): Signal after the current frame is repaired.

Figure 10(a): Signal before the current frame is repaired.

Figure 10(b): Current frame signal.

Figure 10(c): Signal after repair.

Figure 11(a): Signal before denoising.

Figure 11(b): Signal after denoising.

Figure 12: Evaluation of the denoising effect (SNR).

Figure 13: Evaluation of the denoising effect (PEAQ).

Detailed description

The present invention is further described below with reference to the accompanying drawings.

Mel cepstral coefficients:

Research on human hearing shows that the ear's sensitivity varies with the frequency of the sound wave, and that the components of a speech signal between 200 Hz and 5 kHz have the greatest influence on intelligibility. The ear also exhibits masking: a strong sound partially masks a weaker one. Lower-frequency sounds generally mask higher-frequency sounds easily, whereas the reverse is harder; in other words, the critical bandwidth of masking is smaller at low frequencies than at high frequencies. Accordingly, a bank of band-pass filters is arranged from dense to sparse according to the critical bandwidth and applied to the input signal. Taking the energy of each band-pass filter's output as a basic feature of the signal and processing it further yields a speech feature, the Mel cepstral coefficients (MFCC). This feature does not depend on the nature of the signal, i.e. it makes no assumptions about and places no restrictions on the input signal, while exploiting the perceptual characteristics of human hearing. Compared with linear predictive cepstral coefficients (LPCC), which are based on a vocal tract model, MFCC are therefore more robust and retain good speech recognition performance at low signal-to-noise ratios.

MFCC are cepstral parameters extracted in the Mel-scale frequency domain. The Mel scale describes the nonlinear frequency perception of the human ear, and its relationship with linear frequency can be approximated as

$f_{mel} = 2595 \log_{10}\left(1 + \frac{f_{linear}}{700}\right)$ (18)

where f_linear is the linear frequency in Hz. Figure 1 shows the relationship between Mel frequency and linear frequency: as f_linear grows linearly, f_mel grows logarithmically.
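As an illustration, the Mel mapping of formula (18) and its inverse can be sketched in a few lines of Python. The function names `hz_to_mel` and `mel_to_hz` are chosen here for illustration and do not appear in the patent; the inverse is obtained simply by solving formula (18) for the linear frequency.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Map a linear frequency in Hz to the Mel scale, following formula (18)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel: float) -> float:
    """Inverse mapping, obtained by solving formula (18) for the linear frequency."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


if __name__ == "__main__":
    # Equal steps in Hz give progressively smaller steps in Mel.
    for f in (0.0, 1000.0, 2000.0, 4000.0, 24000.0):
        print(f, round(hz_to_mel(f), 1))
```

The inverse mapping is what the filter bank design later in the description needs, since the filter center frequencies are chosen on the Mel scale and then converted back to Hz.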

Packet loss concealment:

In voice communication systems based on the IP protocol, such as voice over IP (VoIP), network congestion or delay jitter during transmission causes packet loss: some packets do not arrive at the receiver on time, which seriously degrades the received speech quality. The receiver must therefore take measures to reduce the speech distortion caused by packet loss. Such measures are generally called packet loss concealment (PLC) algorithms.

PLC algorithms fall into two classes: sender-based and receiver-based. Sender-based PLC requires the participation of both the sending and the receiving end. Receiver-based PLC recovers the original speech as far as possible using only the packets received correctly, the sequence numbers of the lost packets, and the coding scheme known in advance. Because receiver-based PLC needs no additional data from the sender, it does not increase network traffic or delay. Common receiver-based PLC methods include silence substitution, repetition of the previous packet, template matching, pitch waveform replication and linear prediction.

The pitch-period waveform replication (PWR) method proposed here is a receiver-based PLC method.

Prior art 1 related to the present invention

Technical solution of prior art 1

In the paper "Speech Enhancement Based on Kalman Filtering in Impulse Noise Environments", He Zhiyong et al. proposed a speech enhancement method for transient noise environments. Its flow chart is shown in Figure 2. The method first finds the frequency band in which the ratio of transient-noise sample energy to noisy-signal sample energy is largest, and then uses the energy distribution in that band to decide, frame by frame, whether the speech signal is corrupted by transient noise. Frames found to be corrupted are denoised with a Kalman filter. The method also improves the parameter estimation of the autoregressive (AR) model.

Shortcomings of prior art 1

(1) For noise with a long tail, the tail may go undetected.

(2) The Kalman filter used for denoising suits stationary noise but not non-stationary transient noise, so the denoising effect is limited and considerable residual noise remains, which degrades speech quality.

Prior art 2 related to the present invention

Technical solution of prior art 2

In the patent "Repetitive transient noise removal", Hetherington et al. proposed a transient noise suppression method whose flow chart is shown in Figure 3. The method first models the noise according to its characteristics and then uses the correlation coefficient between the model signal and the signal under test to decide whether the data contain noise. If noise is present, the noise component is removed from the signal under test according to the model signal.

Shortcomings of prior art 2

The Hetherington method removes repetitive noise effectively. However, transient noise comes in many types, and when several different types occur within a short time the model becomes inaccurate and the denoising effect of the Hetherington method deteriorates.

Detailed description of the technical solution of the present invention

Technical problem to be solved by the present invention

Enhance speech in audio corrupted by transient noise: suppress the transient noise, improve speech quality and increase intelligibility.

Complete technical solution provided by the present invention:

The block diagram of the technical solution is shown in Figure 4. MFCC parameters are extracted from the input audio signal and used to detect whether the signal contains noise. If noise is detected, the noisy frame data are replaced using the PWR method to reconstruct the waveform; if no noise is detected, the audio signal is output unchanged.
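The control flow of this block diagram can be sketched as a frame loop. This is only an outline under stated assumptions. The three callables `extract_features`, `is_noisy` and `reconstruct` are hypothetical placeholders for the MFCC extraction, noise detection and PWR stages. The sketch also keeps the reference features unchanged on flagged frames, a simplification standing in for the smoothing update described later.

```python
from typing import Callable, List, Optional, Sequence

Frame = List[float]


def denoise_stream(
    frames: Sequence[Frame],
    extract_features: Callable[[Frame], object],
    is_noisy: Callable[[object, object, Frame], bool],
    reconstruct: Callable[[List[Frame]], Frame],
) -> List[Frame]:
    """Pass clean frames through unchanged and replace flagged frames."""
    output: List[Frame] = []
    reference: Optional[object] = None  # features of the last clean frame
    for frame in frames:
        features = extract_features(frame)
        if reference is not None and is_noisy(features, reference, frame):
            # Noisy frame: rebuild it from the frames already output.
            output.append(reconstruct(output))
        else:
            # Clean frame: output as is and refresh the reference features.
            output.append(list(frame))
            reference = features
    return output
```

Passing the stages in as parameters keeps the loop independent of how each stage is implemented, so the detection and reconstruction details can be swapped without touching the control flow.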

Implementation steps of the technical solution of the present invention:

The input is a mono audio signal with sampling rate f_s = 48 kHz. The noisy input signal x(n) can be written as x(n) = s(n) + d(n), where s(n) is the clean speech signal and d(n) is the transient noise signal.

(1) MFCC feature extraction from the audio signal

The MFCC extraction process is shown in Figure 5, which is provided as a grayscale image for reference. The time-domain audio signal is first transformed to the frequency domain and its energy spectrum is computed. The energy spectrum is then multiplied by a bank of Mel-scale triangular filters, and a discrete cosine transform (DCT) is applied to the logarithm of the resulting energies; the first L dimensions of the result are the MFCC. The specific steps are:

1) Framing. The input signal is divided into frames of 10 ms; with a sampling frequency of f_s = 48 kHz, one frame contains N = 480 samples. The data are then normalized: for 16-bit quantization, each sample is divided by 2¹⁵ so that the values lie in (-1, 1). Let the current frame be the p-th frame; then

$x^{(p)}(n) = x[p \cdot (N-1) + n], \quad n = 0, 1, \ldots, N-1$ (19)

2) Preprocessing. Pre-emphasis and windowing are applied to the current frame:

$y^{(p)}(n) = x^{(p)}(n) - \beta x^{(p)}(n-1)$ (20)

$y_w^{(p)}(n) = y^{(p)}(n)\, w(n)$ (21)

where the pre-emphasis factor is β = 0.938 and w(n) is the Hamming window, w(n) = 0.54 - 0.46cos(nπ/N).

3) An NFFT = 1024-point FFT is applied to the preprocessed signal to obtain the frequency-domain signal Y^(p)(k).

4) The energy spectrum |Y^(p)(k)|² of the frequency-domain signal Y^(p)(k) is computed.

5) The energy spectrum is filtered in the frequency domain by a bank H of Mel-scale triangular filters.

The filter bank contains M overlapping triangular filters, as shown in Figure 6. The center frequency of each filter is f(m), m = 1, 2, …, M; the present invention uses M = 24. The filters are designed as follows. The highest frequency of the input signal, f_s/2 = 24 kHz, is mapped to the Mel scale by formula (18), giving F_smel. The interval (0, F_smel) is divided into 25 equal parts; excluding the two end points 0 and F_smel, the remaining 24 division points are the center frequencies of the 24 filters. The points f(m) are uniformly spaced on the Mel scale and are mapped back to the linear frequency scale by inverting formula (18). After this mapping, the spacing between adjacent f(m) narrows as m decreases and widens as m increases.

From the division points f(m), the frequency response of the triangular filter bank H(m,k) is

$H(m,k) = \begin{cases} 0, & f(k) < f(m-1) \\ \frac{2[f(k) - f(m-1)]}{[f(m+1) - f(m-1)][f(m) - f(m-1)]}, & f(m-1) \leq f(k) < f(m) \\ \frac{2[f(m+1) - f(k)]}{[f(m+1) - f(m-1)][f(m+1) - f(m)]}, & f(m) \leq f(k) \leq f(m+1) \\ 0, & f(k) > f(m+1) \end{cases}$ (22)

6) The logarithm of the energy sum at the output of each filter H(m,k) is computed, giving E(m):

$E(m) = \log_{10}\left[\sum_{k} H(m,k)\,|Y^{(p)}(k)|^2\right], \quad m = 1, 2, \ldots, M$ (23)

Applying the discrete cosine transform (DCT) to E(m) yields the MFCC of order L = 12, denoted C(l):

$C^{(p)}(0) = \sqrt{\frac{2}{L}} \sum_{m=0}^{M-1} E(m), \quad l = 0$ (24)

$C^{(p)}(l) = \frac{2}{\sqrt{L}} \sum_{m=0}^{M-1} E(m) \cos\left(\frac{\pi l (2m+1)}{2M}\right), \quad 1 \leq l \leq L-1$
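Steps 5) and 6) can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation. It assumes the energy spectrum of one frame is already available as a list of NFFT/2 + 1 values, takes f(k) = k·f_s/NFFT as the frequency of bin k, treats the filter as zero outside the interval f(m-1) to f(m+1), and adds a small floor before the logarithm to avoid log(0). The function names and the `floor` parameter are assumptions introduced here.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


def mel_points(fs, M):
    """Return f(0)..f(M+1) in Hz: M+2 points equally spaced on the Mel scale."""
    top = hz_to_mel(fs / 2.0)
    return [mel_to_hz(top * i / (M + 1)) for i in range(M + 2)]


def triangular_filterbank(fs, nfft, M):
    """Build H[m-1][k] for m = 1..M from the triangular response of formula (22)."""
    f = mel_points(fs, M)
    bins = nfft // 2 + 1
    H = [[0.0] * bins for _ in range(M)]
    for m in range(1, M + 1):
        lo, mid, hi = f[m - 1], f[m], f[m + 1]
        for k in range(bins):
            fk = k * fs / nfft  # assumed frequency of FFT bin k
            if lo <= fk < mid:      # rising edge
                H[m - 1][k] = 2.0 * (fk - lo) / ((hi - lo) * (mid - lo))
            elif mid <= fk <= hi:   # falling edge
                H[m - 1][k] = 2.0 * (hi - fk) / ((hi - lo) * (hi - mid))
    return H


def mfcc_from_power(power, H, L, floor=1e-12):
    """Log filter energies as in formula (23), then the DCT of formula (24) and the line after it."""
    M = len(H)
    E = [math.log10(max(sum(h * p for h, p in zip(row, power)), floor))
         for row in H]
    coeffs = [math.sqrt(2.0 / L) * sum(E)]
    for l in range(1, L):
        coeffs.append(2.0 / math.sqrt(L) * sum(
            E[m] * math.cos(math.pi * l * (2 * m + 1) / (2.0 * M))
            for m in range(M)))
    return coeffs
```

With the parameters stated in the description, `triangular_filterbank(48000.0, 1024, 24)` gives 24 filters over 513 bins, and `mfcc_from_power(power, H, 12)` returns the 12 coefficients of one frame.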

(2) Noise detection:

The Euclidean distance dist between the MFCC of the current frame and the MFCC of the previous frame is computed:

$dist = \sqrt{\sum_{l=0}^{L} [C^{(p)}(l) - C^{(p-1)}(l)]^2}$ (25)

Whether the current frame contains noise is decided by comparing dist with a threshold Thres, which is determined adaptively by

$Thres = 10 \cdot ener$ (26)

where ener is the energy of the normalized frame signal, with its minimum value set to 60.0.

After detection, the MFCC features of the current frame are updated:

$C^{(p)}(l) = b \cdot C^{(p-1)}(l) + (1-b) \cdot C^{(p)}(l)$ (27)

where the forgetting factor is b = 0.4. This update prevents false detection when the frame following a noise frame is a speech frame.
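The detection stage can be sketched as below, under several assumptions that the text does not spell out. A frame is taken to be noisy when dist exceeds Thres. The floor of 60.0 is applied to ener before multiplying by the constant. The update is read as a blend of the stored coefficients with those of the current frame, which is an interpretation of formula (27). All function names are illustrative.

```python
import math


def mfcc_distance(c_cur, c_prev):
    """Euclidean distance between two MFCC vectors, as in formula (25)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_cur, c_prev)))


def adaptive_threshold(ener, const=10.0, ener_min=60.0):
    """Thres = const * ener, with ener floored at ener_min (assumed reading of formula (26))."""
    return const * max(ener, ener_min)


def is_transient(c_cur, c_prev, ener):
    """Assumed decision rule: flag the frame when the distance exceeds the threshold."""
    return mfcc_distance(c_cur, c_prev) > adaptive_threshold(ener)


def update_reference(c_prev, c_cur, b=0.4):
    """Blend stored and current coefficients with forgetting factor b (assumed reading of formula (27))."""
    return [b * p + (1.0 - b) * c for p, c in zip(c_prev, c_cur)]
```

The smoothed vector returned by `update_reference` would then serve as the previous-frame MFCC in the next call to `is_transient`.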

(3) Pitch period prediction:

The pitch period is estimated for every frame of the speech signal. If the current frame is a noise frame, its pitch period is predicted from the pitch periods of the previous two frames. The block diagram of pitch period estimation is shown in Figure 7. For different speakers the pitch period generally lies between 2 ms and 12 ms, so the pitch period is searched within this range. Let PMAX be the number of samples corresponding to 12 ms, i.e. PMAX = 576, and PMIN the number of samples corresponding to 2 ms, i.e. PMIN = 96. A buffer buf(n) of length 3PMAX + N = 2208, which stores the data already output, is used to estimate the pitch period.

The pitch period is estimated as follows:

1) buf(n) is low-pass filtered to obtain buf_d(n). The cutoff frequency of the low-pass filter (LPF) is 900 Hz.

2) buf_d(n) is center-clipped to obtain buf_c(n):

$buf_c(n) = \begin{cases} buf_d(n) - C_L, & buf_d(n) > C_L \\ buf_d(n) + C_L, & buf_d(n) < -C_L \\ 0, & |buf_d(n)| \leq C_L \end{cases}$ (28)

where C_L is the clipping level, usually set to 68% of the maximum value of the normalized data.

3) The autocorrelation of buf_c(n) is computed, and the position of its maximum within the range (96, 576) is taken as the pitch period estimate Pitch:

$r_{buf_c}(n) = \sum_{m=0}^{2PMAX-1} buf_c(m)\, buf_c(m+n), \quad PMIN \leq n \leq PMAX$ (29)

$Pitch = \arg\max_{PMIN \leq n \leq PMAX} r_{buf_c}(n)$ (30)

4) To prevent pitch doubling, the pitch period estimates of the previous two frames, Pitch^(p-1) and Pitch^(p-2), are smoothed using formula (31).

The pitch period of the current frame, Pitch^(p), is then predicted from the two smoothed pitch periods:

$Pitch^{(p)} = Pitch^{(p-1)} + (Pitch^{(p-1)} - Pitch^{(p-2)})$ (32)
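The clipping, autocorrelation search and extrapolation steps can be sketched as follows. The sketch omits the 900 Hz low-pass filter and the smoothing step, and it assumes that the clipping level is 68% of the largest absolute value in the buffer. The buffer passed to `estimate_pitch` must hold at least 3·PMAX samples so that every index m + n in formula (29) is valid. Function names are illustrative.

```python
def center_clip(x, ratio=0.68):
    """Center clipping as in formula (28); the level is ratio * max|x| (assumed)."""
    cl = ratio * max(abs(v) for v in x)
    out = []
    for v in x:
        if v > cl:
            out.append(v - cl)
        elif v < -cl:
            out.append(v + cl)
        else:
            out.append(0.0)
    return out


def estimate_pitch(buf_c, pmin, pmax):
    """Return the lag in [pmin, pmax] that maximizes the autocorrelation of formula (29)."""
    best_lag, best_r = pmin, float("-inf")
    for lag in range(pmin, pmax + 1):
        r = sum(buf_c[m] * buf_c[m + lag] for m in range(2 * pmax))
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag


def predict_pitch(pitch_prev1, pitch_prev2):
    """Linear extrapolation of the pitch period, as in formula (32)."""
    return pitch_prev1 + (pitch_prev1 - pitch_prev2)


if __name__ == "__main__":
    # Synthetic pulse train with a period of 50 samples, searched over lags 20..80.
    pulses = [1.0 if n % 50 == 0 else 0.0 for n in range(240)]
    print(estimate_pitch(center_clip(pulses), 20, 80))
```

With the values in the description, the search would be called with pmin = 96 and pmax = 576 on the low-pass-filtered and clipped buffer.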

(4) Waveform reconstruction:

The last pitch-period waveform of the previous frame is extracted and linearly interpolated to obtain a new pitch-period waveform.

1) Because buf(n) stores the frames already output, the pitch-period waveform of the previous frame, i.e. the last Pitch^(p-1) samples of the previous output frame, can be taken from buf(n); it is denoted pw^(p-1)(n). Linear interpolation of pw^(p-1)(n) gives a new waveform of length Pitch^(p), denoted pw^(p)(n). Linear interpolation between two points is illustrated in Figure 8, and the interpolation formula is

$pw^{(p)}(n') = \left(\frac{Pitch^{(p-1)}}{Pitch^{(p)}} \cdot n' - n + 1\right) \cdot [pw^{(p-1)}(n) - pw^{(p-1)}(n-1)] + pw^{(p-1)}(n-1), \quad n-1 \leq \frac{Pitch^{(p-1)}}{Pitch^{(p)}} \cdot n' < n$ (33)

2) Periodic copying of the new waveform:

d. The principle of periodic waveform copying is shown in Figures 9(a) to 9(c). If the current frame is a noise frame (whether the previous frame is a noise frame or a clean speech frame), the processing is as follows. According to formula (34), segment AB of buf(n) is overlap-added with segment CD using a cross-fade, so that the data on both sides of D are continuous:

$buf_{CD}(n) = \alpha \cdot buf_{CD}(n) + (1-\alpha) \cdot buf_{AB}(n) = \alpha \cdot buf_{CD}(n) + (1-\alpha) \cdot buf_{CD}(n - Pitch), \quad 0 \leq n < N_1$ (34)

$\alpha = \frac{N_1 - i}{N_1}, \quad i = 0, 1, \ldots, N_1 - 1$ (35)

where α is the attenuation factor, decaying linearly from 1 to 0, and segments AB and CD both have length N₁ = 32.

e. The new waveform pw^(p)(n) is copied repeatedly, with period Pitch^(p), into region DF. Segment DE is the repaired current frame. Segment EF has length N₂ = 32; when the next frame is a speech frame it is used for cross-fading, to ensure continuity across E, i.e. between frames.

f. One frame of data starting at point C of buf(n) is output. The output of this method is delayed by the length of segment CD. All data in buf(n) are then shifted forward by N samples (one frame length).
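The three building blocks of this reconstruction (resampling one pitch cycle by linear interpolation, copying it periodically, and cross-fading two segments) can be sketched as below. The sketch leaves out the buffer bookkeeping around points A to F, which depends on the figures. Holding the last sample at the end of the cycle is an assumption made here, as are the function names.

```python
def resample_cycle(pw_prev, new_len):
    """Linearly interpolate one pitch cycle to new_len samples, as in formula (33)."""
    old_len = len(pw_prev)
    ratio = old_len / new_len
    out = []
    for n_new in range(new_len):
        t = ratio * n_new                # position on the old sample grid
        n0 = int(t)                      # left neighbour, n - 1 in formula (33)
        n1 = min(n0 + 1, old_len - 1)    # right neighbour; last sample is held (assumption)
        out.append(pw_prev[n0] + (t - n0) * (pw_prev[n1] - pw_prev[n0]))
    return out


def periodic_fill(cycle, length):
    """Repeat the cycle until `length` samples have been produced."""
    return [cycle[i % len(cycle)] for i in range(length)]


def crossfade(seg_cd, seg_ab):
    """Blend two equal-length segments with alpha = (N1 - i) / N1, as in formulas (34) and (35)."""
    n1 = len(seg_cd)
    return [((n1 - i) / n1) * seg_cd[i] + (i / n1) * seg_ab[i]
            for i in range(n1)]
```

For a noise frame, `resample_cycle` would be called with the last Pitch^(p-1) output samples and the predicted length Pitch^(p), and `periodic_fill` would produce the N + N₂ samples of region DF.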

As shown in Figures 10(a) to 10(c): Figure 10(a) shows the signal with the current frame discarded when the current frame is to be repaired; Figure 10(b) shows the current frame signal reconstructed by the method of this patent; Figure 10(c) shows the repaired signal. If the current frame is a clean speech frame and the previous frame is a noise frame, the processing is as follows:

d. At this point segment DG of buf(n) holds the EF segment data of the previous frame. Segment DG is blended with the first N₂ samples of the current input frame (computed similarly to formula (34)), and the result is stored in DG.

e. The remaining samples of the current frame are copied unchanged into buf(n) after point G.

f. One frame of data starting at point C is output, and all data in buf(n) are then shifted forward by one frame length, i.e. N samples.

If both the current frame and the previous frame are clean speech frames, the input data of the current frame are copied unchanged into the region of buf(n) to be repaired, i.e. region DE in Figure 9, and one frame of data starting at point C is output.

Beneficial effects of the technical solution of the present invention:

Experiments were carried out with 20 clean speech recordings (from adult men, adult women and children) and 4 types of noise: mouse clicks, knocking, a metronome and keyboard keystrokes. The noise durations are 10 ms for mouse clicks, 20 ms for knocking and the metronome, and 30 ms for keystrokes. Each of the 4 noise types was added to every clean recording, giving 80 noisy recordings. Each recording contains 30 noise events, equally spaced.

The sampling rate of all audio is f_s = 48 kHz and the frame length is N = 480. In the MFCC computation stage, an NFFT = 1024-point FFT is used, the Mel filter bank has M = 24 filters, and L = 12-dimensional MFCC are computed. In the transient noise detection stage, the adaptive threshold is Thres = const·ener; so that the threshold suits all noise types, the constant const is set to 10, and ener, the energy of each input frame, has its minimum value set to 60.0. When the threshold is updated, the forgetting factor b is set to 0.4. In the pitch period estimation stage, the pitch period is searched within (2 ms, 12 ms), corresponding to (96, 576) samples. In the waveform reconstruction stage, the cross-fade lengths N₁ and N₂ are both 32, and the buffer buf(n) has length 2240.

Denoising noisy speech with the present invention greatly improves speech intelligibility and reduces listener fatigue. The denoising effect is evaluated with two metrics, segmental signal-to-noise ratio (SNR_seg) and PEAQ, where the segmental SNR is computed as

$SNR_{seg}^{in} = \frac{1}{R} \sum_{i=1}^{R} 10 \log_{10} \frac{\sum_{n \in frame_i} |s(n)|^2}{\sum_{n \in frame_i} |x(n) - s(n)|^2}, \qquad (36)$

$SNR_{seg}^{out} = \frac{1}{R} \sum_{i=1}^{R} 10 \log_{10} \frac{\sum_{n \in frame_i} |s(n)|^2}{\sum_{n \in frame_i} |\hat{s}(n) - s(n)|^2}, \qquad (37)$
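Equations (36) and (37) differ only in the signal compared against s(n), so a single function can cover both. The sketch below reads s(n) as the clean signal, x(n) as the noisy input and ŝ(n) as the denoised output, which is an interpretation of the formulas rather than a definition given in this passage. The function name `segmental_snr` is hypothetical, and skipping frames in which the ratio is undefined (zero signal energy or zero error) is an assumption: the formulas do not say how such frames are handled.

```python
import math


def segmental_snr(clean, test, frame_len=480):
    """Mean over frames of 10*log10(sum s^2 / sum (test - s)^2), in dB."""
    n_frames = min(len(clean), len(test)) // frame_len
    values = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        t = test[i * frame_len:(i + 1) * frame_len]
        signal_energy = sum(v * v for v in s)
        error_energy = sum((a - b) ** 2 for a, b in zip(t, s))
        if signal_energy == 0.0 or error_energy == 0.0:
            continue  # assumption: frames with an undefined ratio are skipped
        values.append(10.0 * math.log10(signal_energy / error_energy))
    if not values:
        raise ValueError("no frame with a defined SNR")
    return sum(values) / len(values)


# Two frames: the first has a 20 dB SNR, the second 0 dB, so the mean is about 10 dB.
clean_demo = [1.0] * 960
test_demo = [1.1] * 480 + [2.0] * 480
snr_demo = segmental_snr(clean_demo, test_demo)
```

Passing the noisy signal as `test` corresponds to equation (36); passing the denoised signal corresponds to equation (37).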

The method's denoising performance was evaluated with both metrics; the results are shown in Figures 12 and 13. Figure 12 compares the objective audio quality of the noisy signal before and after denoising using the signal-to-noise ratio; Figure 13 makes the same comparison using PEAQ.

The spectrograms of the noisy signal and of the signal denoised by this scheme are shown in Figure 11(a) and Figure 11(b). Because grayscale images convey the technical effect of the invention more clearly, Figures 11(a) and 11(b) are provided as grayscale images for the examiner's reference. Figure 11(a) is the spectrogram of audio contaminated by mouse-click noise; Figure 11(b) is the spectrogram of the noisy audio of Figure 11(a) after denoising.

The above is only a preferred embodiment of the invention, but the scope of protection of the invention is not limited to it. Any equivalent replacement or modification that a person skilled in the art makes within the technical scope disclosed here, based on the technical solution and inventive concept of the invention, shall fall within the scope of protection of the invention.

Definitions of abbreviations and key terms used in the invention

AR: Autoregressive Model.

DCT: Discrete Cosine Transform.

FFT: Fast Fourier Transform.

LPF: Low-Pass Filter.

LPCC: Linear Prediction Cepstrum Coefficient.

MFCC: Mel Frequency Cepstrum Coefficient.

VoIP: Voice over IP, voice carried over IP networks.

PLC: Packet Loss Concealment, an algorithm for concealing lost packets.

PWR: Pitch Waveform Replication, replication of the pitch-cycle waveform.

SNR: Signal-to-Noise Ratio.

PEAQ: Perceptual Evaluation of Audio Quality, an objective measure of perceived audio quality recommended in ITU-R BS.1387.

Claims

1. A denoising method for transient noise, characterized in that: the Mel cepstrum coefficients of the current frame signal are first calculated and, at the same time, the pitch period of the frame signal is predicted; the Mel cepstrum coefficients are then used to detect whether the frame signal contains noise (noise detection); if noise is present, the predicted pitch period is used for waveform reconstruction;

The method of pitch period prediction is as follows:

1) Low-pass filter buf(n) to obtain buf_d(n), where the cutoff frequency of the low-pass filter (LPF) is 900 Hz;

2) Apply center clipping to buf_d(n) to obtain buf_c(n), namely

$buf_c(n) = \begin{cases} buf_d(n) - C_L, & buf_d(n) > C_L \\ buf_d(n) + C_L, & buf_d(n) < -C_L \\ 0, & |buf_d(n)| \leq C_L \end{cases} \qquad (1)$

where C_L is the clipping level, usually set to 68% of the maximum value of the normalized data;

3) Compute the autocorrelation of buf_c(n), search for the position of the autocorrelation maximum in the range (96, 576), and take it as the pitch period estimate Pitch;

$r_{buf_c}(n) = \sum_{m=0}^{2 \cdot PMAX - 1} buf_c(m) \, buf_c(m + n), \quad PMIN \leq n \leq PMAX \qquad (2)$

$Pitch = \arg \max_{PMIN \leq n \leq PMAX} r_{buf_c}(n) \qquad (3)$

4) To prevent frequency-doubling errors, the pitch period estimates Pitch^(p-1) and Pitch^(p-2) of the two preceding frames are smoothed using equation (4); the pitch period of the current frame, Pitch^(p), is then predicted from the two smoothed pitch periods, namely

Pitch^(p) = Pitch^(p-1) + (Pitch^(p-1) − Pitch^(p-2)) (5)
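The center clipping of equation (1), the autocorrelation search of equations (2) and (3), and the linear extrapolation of equation (5) can be sketched as below. The function names are hypothetical, the 900 Hz low-pass filter is assumed to have been applied beforehand, and the smoothing of equation (4) is not included. The test signal (a pulse train with a 200-sample period and decaying amplitude) is made up for illustration.

```python
def center_clip(buf_d, ratio=0.68):
    """Equation (1): clip with level C_L = ratio * max|buf_d|."""
    c_l = ratio * max(abs(v) for v in buf_d)
    out = []
    for v in buf_d:
        if v > c_l:
            out.append(v - c_l)
        elif v < -c_l:
            out.append(v + c_l)
        else:
            out.append(0.0)
    return out


def estimate_pitch(buf_c, pmin=96, pmax=576):
    """Equations (2)-(3): lag in [pmin, pmax] that maximizes the autocorrelation."""
    span = 2 * pmax                       # the sum runs over m = 0 .. 2*PMAX - 1
    if len(buf_c) < span + pmax:
        raise ValueError("buffer too short for the requested lag range")

    def autocorr(lag):
        return sum(buf_c[m] * buf_c[m + lag] for m in range(span))

    return max(range(pmin, pmax + 1), key=autocorr)


def predict_pitch(pitch_prev1, pitch_prev2):
    """Equation (5): linear extrapolation from the two preceding frames."""
    return pitch_prev1 + (pitch_prev1 - pitch_prev2)


# Illustrative input: pulses every 200 samples with slowly decaying amplitude.
buf = [0.0] * 2240
for k in range(12):
    buf[200 * k] = 1.0 - 0.05 * k
pitch = estimate_pitch(center_clip(buf))
```

With the buffer length of 2240 given in the description, the largest index touched is 2·576 − 1 + 576 = 1727, so the sum in equation (2) stays inside the buffer.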

The method of waveform reconstruction is:

1) Since the output frame data is stored in buf(n), the pitch-cycle waveform of the previous frame, that is, the last Pitch^(p-1) points of the previous frame's output signal, can be extracted from buf(n); this waveform data is denoted pw^(p-1)(n). Linear interpolation of pw^(p-1)(n) yields a new waveform of length Pitch^(p), denoted pw^(p)(n). The interpolation formula is

$pw^{(p)}(n') = \left(\frac{pitch^{(p-1)}}{pitch^{(p)}} \cdot n' - n + 1\right) \cdot \left[pw^{(p-1)}(n) - pw^{(p-1)}(n-1)\right] + pw^{(p-1)}(n-1), \quad n - 1 \leq \frac{pitch^{(p-1)}}{pitch^{(p)}} \cdot n' < n \qquad (6)$

2) Use the new waveform for periodic waveform replication, as follows:

a. If the current frame is a noise frame, then regardless of whether the previous frame is a noise frame or a clean speech frame, the processing is as follows: according to formula (7), the AB-segment data and the CD-segment data in buf(n) are overlap-added with fade-in/fade-out so that the data on both sides of D are continuous, that is,

$buf_{CD}(n) = \alpha \cdot buf_{CD}(n) + (1 - \alpha) \cdot buf_{AB}(n) = \alpha \cdot buf_{CD}(n) + (1 - \alpha) \cdot buf_{CD}(n - Pitch), \quad 0 \leq n < N_1 \qquad (7)$

$\alpha = \frac{N_1 - i}{N_1}, \quad i = 0, 1, \ldots, N_1 - 1; \qquad (8)$

where α is the attenuation factor, which decays linearly from 1 toward 0, and the AB and CD segments each have length N₁ = 32;

b. With period Pitch^(p), the new waveform pw^(p)(n) is copied repeatedly into the DF region. The DE segment is the repaired current frame; the EF segment has length N₂ = 32 and is used for fade-in/fade-out when the next frame is a speech frame, ensuring continuity on both sides of E, that is, between frames;

c. Output one frame of data starting from point C in buf(n), then move all data in buf(n) forward by N points. The output of this method is delayed by the length of the CD segment;

If the current frame is a clean speech frame and the previous frame is a noise frame, the processing is as follows:

a. At this point the DG-segment data in buf(n) is the EF-segment data of the previous frame. The DG segment is fused with the first N₂ data points of the current input frame in the same way that the AB and CD segments are overlap-added in formula (7), and the result is stored in DG;

b. Copy the remaining data points of the current frame unchanged into buf(n), starting at point G;

c. Output one frame length of data starting from point C, then move all data in buf(n) forward by N points, that is, one frame length. If the current frame and the previous frame are both clean speech frames, copy the current input frame unchanged into the region to be repaired in buf(n), and output one frame length of data starting from point C.
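The three building blocks of the reconstruction (the linear interpolation of equation (6), the cross-fade of equations (7) and (8), and the periodic copy into the DF region) can be sketched as standalone helpers. The function names are hypothetical, and the positions of points A to G inside buf(n) are not assumed here, so the helpers operate on plain lists. Holding the last sample when the interpolation would read one point past the end of the cycle is also an assumption.

```python
def resample_cycle(pw_prev, new_len):
    """Equation (6): linearly interpolate one pitch cycle to `new_len` samples."""
    ratio = len(pw_prev) / new_len                # pitch^(p-1) / pitch^(p)
    last = len(pw_prev) - 1
    out = []
    for k in range(new_len):
        t = ratio * k                             # position in the old cycle
        i = min(int(t), last)                     # this is n - 1 in equation (6)
        frac = t - i
        nxt = pw_prev[min(i + 1, last)]           # assumption: hold the last sample
        out.append(pw_prev[i] + frac * (nxt - pw_prev[i]))
    return out


def crossfade(cd, ab):
    """Equations (7)-(8): alpha falls linearly from 1, fading CD out and AB in."""
    n1 = len(cd)
    return [((n1 - i) / n1) * c + (i / n1) * a
            for i, (c, a) in enumerate(zip(cd, ab))]


def fill_periodic(cycle, length):
    """Copy `cycle` repeatedly until `length` samples have been produced."""
    return [cycle[k % len(cycle)] for k in range(length)]


stretched = resample_cycle([0.0, 2.0, 4.0, 6.0], 8)
faded = crossfade([1.0] * 4, [0.0] * 4)
repeated = fill_periodic([1, 2, 3], 7)
```

In the method described above, the repeated cycle would fill N + N₂ = 512 samples: the repaired frame DE followed by the fade region EF.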

2. The denoising method for transient noise according to claim 1, characterized in that the Mel cepstrum coefficients are calculated as follows:

1) The input signal is divided into frames with frame length N = 480, that is, 10 ms of data, and the data are normalized; if the current frame is the p-th frame, then

x^(p)(n) = x[p·(N−1)+n], n = 0, 1, ..., N−1; (9)

2) Preprocessing: apply pre-emphasis and windowing to the current frame signal, namely

y^(p)(n) = x^(p)(n) − βx^(p)(n−1); (10)

where the pre-emphasis factor β = 0.938, and w(n) is the Hamming window, that is, w(n) = 0.54 − 0.46cos(nπ/N);

3) Apply an NFFT = 1024-point FFT to the preprocessed signal to obtain the frequency-domain signal Y^(p)(k);

4) Calculate the energy spectrum |Y^(p)(k)|² of the frequency-domain signal Y^(p)(k);

5) Pass the energy spectrum through a bank H of Mel-scale triangular filters to perform frequency-domain filtering;

The filter bank contains M triangular filters that overlap one another; the center frequency of each filter is f(m), m = 1, 2, ..., M, with M = 24;

Filter design method: the highest frequency of the input signal, f_s/2 = 24 kHz, is transformed to the Mel scale by the formula

$f_{mel} = 2595 \log_{10}\left(1 + \frac{f_{linear}}{700}\right), \qquad (12)$

where f_linear is the frequency in Hz, giving F_smel. The interval (0, F_smel) is divided into 25 equal parts; discarding the two endpoints 0 and F_smel, the remaining 24 division points serve as the center frequencies of the 24 filters. The division points f(m) are uniformly spaced on the Mel scale and are then transformed back to the linear frequency scale by inverting formula (12). After the transformation, the spacing between the f(m) narrows as m decreases and widens as m increases. From the division points f(m), the frequency response of the triangular filter bank H(m,k) is

$H(m,k) = \begin{cases} 0, & f(k) < f(m-1) \\ \frac{2[f(k) - f(m-1)]}{[f(m+1) - f(m-1)][f(m) - f(m-1)]}, & f(m-1) \leq f(k) < f(m) \\ \frac{2[f(m+1) - f(k)]}{[f(m+1) - f(m-1)][f(m+1) - f(m)]}, & f(m) \leq f(k) \leq f(m+1) \\ 0, & f(k) > f(m+1) \end{cases} \qquad (13)$

6) Calculate the logarithm of the energy output by each filter H(m,k) to obtain E(m), namely

$E(m) = \log_{10}\left[\sum_{k} H(m,k) \, |Y^{(p)}(k)|^2\right], \quad m = 1, 2, \ldots, M \qquad (14)$

Apply the discrete cosine transform (DCT) to E(m) to obtain the L = 12-order MFCC, denoted C(l):

$\begin{cases} C^{(p)}(0) = \sqrt{\frac{2}{L}} \sum_{m=0}^{M-1} E(m), & l = 0 \\ C^{(p)}(l) = \frac{2}{\sqrt{L}} \sum_{m=0}^{M-1} E(m) \cos\left(\frac{\pi l (2m+1)}{2M}\right), & 1 \leq l \leq L-1 \end{cases} \qquad (15)$
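The six steps of this claim can be strung together as in the sketch below, which uses only the Python standard library. Several details are assumptions made for illustration, not statements of the claimed method: the sample before the frame is taken as 0 in the pre-emphasis, the window uses the textbook Hamming form cos(2πn/(N−1)) rather than the expression printed in step 2, the frame is zero-padded from 480 to 1024 samples, filter energies are floored at 1e-12 before the logarithm, and E(m) is indexed from 0. All function names are hypothetical.

```python
import cmath
import math

FS, N, NFFT, M, L, BETA = 48000, 480, 1024, 24, 12, 0.938


def hz_to_mel(f):
    """Equation (12)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(mel):
    """Inverse of equation (12)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def fft(a):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + tw, even[k] - tw
    return out


def mel_filterbank():
    """Equation (13): M triangular filters over the bins 0 .. NFFT/2."""
    top = hz_to_mel(FS / 2)
    edges = [mel_to_hz(top * i / (M + 1)) for i in range(M + 2)]  # f(0) .. f(M+1)
    bins = [k * FS / NFFT for k in range(NFFT // 2 + 1)]
    bank = []
    for m in range(1, M + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bins:
            if lo <= f < mid:
                row.append(2 * (f - lo) / ((hi - lo) * (mid - lo)))
            elif mid <= f <= hi:
                row.append(2 * (hi - f) / ((hi - lo) * (hi - mid)))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, bank):
    """Steps 2-6: pre-emphasis, window, FFT, Mel energies, log, DCT."""
    y = [frame[0]] + [frame[n] - BETA * frame[n - 1] for n in range(1, len(frame))]
    y = [v * (0.54 - 0.46 * math.cos(2 * math.pi * n / (len(y) - 1)))
         for n, v in enumerate(y)]                      # assumed standard Hamming
    spectrum = fft(y + [0.0] * (NFFT - len(y)))         # zero-pad to NFFT
    power = [abs(c) ** 2 for c in spectrum[:NFFT // 2 + 1]]
    e = [math.log10(max(sum(h * p for h, p in zip(row, power)), 1e-12))
         for row in bank]                               # equation (14), with a floor
    c = [math.sqrt(2.0 / L) * sum(e)]                   # equation (15), l = 0
    for l in range(1, L):
        c.append(2.0 / math.sqrt(L) * sum(
            e[m] * math.cos(math.pi * l * (2 * m + 1) / (2 * M)) for m in range(M)))
    return c


bank = mel_filterbank()
frame = [math.sin(2 * math.pi * 1000 * n / FS) for n in range(N)]  # 1 kHz test tone
coeffs = mfcc(frame, bank)
```

The filter bank depends only on the fixed parameters, so it is built once and reused for every frame.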

3. The denoising method for transient noise according to claim 1, characterized in that the noise detection process is as follows:

Calculate the Euclidean distance dist between the MFCC of the current frame signal and the MFCC of the previous frame signal:

$dist = \sqrt{\sum_{l=0}^{L} \left[C^{(p)}(l) - C^{(p-1)}(l)\right]^2}, \qquad (16)$

Whether the current frame contains noise is determined by comparing the distance value with the threshold Thres, which is determined adaptively by

Thres = 10·ener, (17)

where ener is the normalized energy of each frame signal, and its minimum value is set to 60.0. After detection is complete, the MFCC feature of the current frame is updated, that is,

C^(p)(l) = b·C^(p-1)(l) + (1−b)·C^(p-1)(l), (18)

where the forgetting factor b = 0.4; this update prevents false detections when a noise frame is followed by a speech frame.
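The detection step can be sketched as below. Two points are assumptions and are marked as such in the code: the 60.0 floor is applied to the threshold itself, which is one possible reading of the text, and the stored reference MFCC is updated by exponential smoothing toward the current frame with forgetting factor b, which is an assumed reading of equation (18). The function names are hypothetical, and the distance is summed over all coefficients that are supplied.

```python
import math


def mfcc_distance(c_cur, c_ref):
    """Equation (16): Euclidean distance between two MFCC vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_cur, c_ref)))


def adaptive_threshold(ener, const=10.0, floor=60.0):
    """Equation (17), Thres = const * ener; the floor placement is an assumption."""
    return max(const * ener, floor)


def detect_and_update(c_cur, c_ref, ener, b=0.4):
    """Return (is_noise, new_reference) for the current frame."""
    is_noise = mfcc_distance(c_cur, c_ref) > adaptive_threshold(ener)
    # Assumed reading of equation (18): smooth the reference toward the current frame.
    new_ref = [b * r + (1.0 - b) * c for r, c in zip(c_ref, c_cur)]
    return is_noise, new_ref


flag, ref = detect_and_update([100.0, 0.0], [0.0, 0.0], ener=1.0)
```

Because the reference follows the current frame only partially, a single noisy frame shifts it without replacing it.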