CN103440872A

CN103440872A - Transient state noise removing method

Info

Publication number: CN103440872A
Application number: CN2013103572116A
Authority: CN
Inventors: 陈喆; 殷福亮; 周文颖
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2013-08-15
Filing date: 2013-08-15
Publication date: 2013-12-11
Anticipated expiration: 2033-08-15
Also published as: CN103440872B

Abstract

The invention discloses a method for removing transient noise and belongs to the technical field of signal processing. The method comprises: calculating the Mel-frequency cepstrum coefficients (MFCC) of the current frame and predicting the pitch period of the current frame; using the MFCC to detect whether noise exists in the current frame; and, if noise exists, reconstructing the current frame using the predicted pitch period.

Description

Method for removing transient noise

Technical field

The present invention relates to a method for removing transient noise and belongs to the field of signal processing technology.

Background art

Transient additive noise in an audio signal is also called transient noise or impulsive noise. In the time domain, transient noise is usually discontinuous, intermittent, and pulse-like; its energy is concentrated in short time intervals, within which it is much larger than the energy of the clean signal. Typical transient noises include desk knocks, door slams, clamor, keyboard keystrokes, mouse clicks, and hammer impacts. They occur in many applications, such as hearing aids, mobile phones, and video conferencing equipment. Transient noise seriously degrades audio quality, so it must be suppressed to improve that quality. Most current noise suppression algorithms target stationary and continuous noise; speech enhancement usually follows the methods described in the document "Research on speech enhancement and related techniques", such as spectral subtraction and adaptive filtering. These algorithms are largely ineffective against the transient noise described above.

Summary of the invention

In view of the above problem, the present invention provides a method for removing transient noise.

The technical scheme adopted by the present invention is as follows: first, the Mel cepstrum coefficients of the current frame are calculated and, at the same time, the pitch period of the current frame is predicted; then the Mel cepstrum coefficients are used to detect whether the current frame contains noise; if noise is present, the waveform is reconstructed using the predicted pitch period.

Beneficial effects of the present invention: the method was tested with 20 clean speech recordings (male, female, and child speakers) and four types of noise: mouse clicks, knocks, metronome ticks, and keyboard strokes. The noise durations are 10 ms for mouse clicks, 20 ms for knocks and metronome ticks, and 30 ms for keyboard strokes. Each clean recording was mixed with each of the four noises, giving 80 noisy recordings. Each recording contains 30 noise bursts at equal spacing.

All audio is sampled at f_s = 48 kHz with a frame length of N = 480. In the MFCC calculation stage, an NFFT = 1024-point FFT is used, the Mel filter bank has M = 24 filters, and L = 12 MFCCs are computed. In the transient noise detection stage, the adaptive threshold is Thres = const · ener; so that the threshold suits all noise types, the constant const is set to 10, ener is the energy of each input frame, and the minimum value is set to 60.0. When the threshold is updated, the forgetting factor b is set to 0.4. In the pitch period estimation stage, the pitch period is searched within (2 ms, 12 ms), which corresponds to (96, 576) samples. In the waveform reconstruction stage, the fade lengths N_1 and N_2 are both 32, and the length of the buffer buf(n) is 2240.

After noisy speech is denoised with the present invention, speech intelligibility is greatly improved and listener fatigue is reduced. The denoising performance was evaluated with two measures, the segmental signal-to-noise ratio SNR_seg and PEAQ; the results are shown in Fig. 12 and Fig. 13.

Brief description of the drawings

Fig. 1: Relation between Mel frequency and linear frequency.

Fig. 2: Flowchart of the technical scheme of prior art one.

Fig. 3: Flowchart of the technical scheme of prior art two.

Fig. 4: Block diagram of the present technical scheme.

Fig. 5: Block diagram of MFCC feature extraction.

Fig. 6: Mel frequency filter bank.

Fig. 7: Block diagram of pitch period estimation.

Fig. 8: Linear interpolation between two points.

Fig. 9(a): Signal before the current frame is repaired.

Fig. 9(b): New pitch cycle waveform pw^{(p)}(n).

Fig. 9(c): Signal after the current frame is repaired.

Fig. 10(a): Signal before the current frame is repaired.

Fig. 10(b): Current frame signal.

Fig. 10(c): Signal after repair.

Fig. 11(a): Signal before denoising.

Fig. 11(b): Signal after denoising.

Fig. 12: Denoising effect evaluation table (SNR).

Fig. 13: Denoising effect evaluation (PEAQ).

Detailed description of the embodiments

The present invention is further described below with reference to the accompanying drawings.

Mel cepstrum coefficients:

Research on the human hearing mechanism shows that the ear has different sensitivity to sound waves of different frequencies; speech components between 200 Hz and 5 kHz have the greatest influence on intelligibility. The ear also exhibits masking: a strong sound masks a weaker one to some extent. Generally, a lower-frequency sound masks a higher-frequency sound easily, whereas the reverse is harder; that is, the critical bandwidth of masking is smaller at low frequencies than at high frequencies. Accordingly, a group of band-pass filters is arranged from dense to sparse according to the critical bandwidth, and the input signal is filtered by them. The energy of each band-pass filter's output is taken as a basic feature of the signal; after further processing, this feature can serve as a speech feature, namely the Mel cepstrum coefficients (MFCC). This feature does not depend on the nature of the signal and makes no assumptions about or restrictions on the input, while exploiting the perceptual properties of the human ear. Compared with linear prediction cepstrum coefficients (LPCC), which are based on a vocal tract model, MFCC is therefore more robust and still gives good speech recognition performance at low signal-to-noise ratios.

MFCC are cepstral parameters extracted in the Mel-scale frequency domain. The Mel scale describes the nonlinear frequency perception of the human ear, and its relation to linear frequency can be approximated as

f_{mel} = 2595 \log_{10}\left(1 + \frac{f_{linear}}{700}\right) \qquad (18)

where f_{linear} is the frequency in Hz. Fig. 1 shows the relation between Mel frequency and linear frequency: as f_{linear} grows linearly, f_{mel} grows logarithmically.
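The Mel mapping can be checked numerically with a minimal Python sketch. The function names `hz_to_mel` and `mel_to_hz` are illustrative and do not come from the patent; the inverse is obtained by solving the Mel formula above algebraically for the linear frequency.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Map a linear frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel: float) -> float:
    """Algebraic inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


if __name__ == "__main__":
    # Equal steps in Hz give shrinking steps in Mel (logarithmic growth).
    for f in (1000.0, 2000.0, 4000.0, 8000.0, 24000.0):
        print(f, round(hz_to_mel(f), 1))
```

With these constants, 1000 Hz maps to approximately 1000 Mel, which is the usual anchor point of the scale.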

Packet loss concealment:

In IP-based voice communication systems, i.e. voice over IP (VoIP), network congestion or delay jitter during transmission can cause packet loss: some packets do not arrive at the receiver on time, which seriously degrades the speech quality at the receiver. Measures must therefore be taken at the receiver to reduce the speech distortion caused by packet loss. Such measures are generally called packet loss concealment (PLC) algorithms.

PLC algorithms fall into two classes: sender-based and receiver-based. Sender-based PLC requires the participation of both the sending and receiving ends. Receiver-based PLC recovers the original speech as far as possible using only the packets received normally, the sequence numbers of the lost packets, and the coding scheme known in advance. Because receiver-based PLC needs no additional data from the sender, it increases neither network traffic nor delay. Common receiver-based PLC methods include silence substitution, repetition of the last packet, template matching, pitch waveform replication, and linear prediction.

The pitch waveform replication (PWR) method used herein is a receiver-based PLC method.

Prior art one related to the present invention

Technical scheme of prior art one

He Zhiyong et al., in the paper "Speech enhancement based on Kalman filtering in an impulse noise environment", proposed a speech enhancement method for transient noise environments. As shown in the flowchart in Fig. 2, the method first finds the frequency band in which the ratio of transient noise sample energy to noisy signal sample energy is largest, and then uses the energy distribution in that band to decide, frame by frame, whether the speech signal is disturbed by transient noise. The Kalman filtering algorithm is then applied to denoise the frames disturbed by transient noise. In addition, the method improves the parameter estimation of the autoregressive (AR) model.

Shortcomings of prior art one

(1) For noise with a long tail, the tail may go undetected.

(2) The Kalman filter used for denoising suits stationary noise, not non-stationary transient noise, so the denoising effect is limited and considerable residual noise remains, which degrades speech quality.

Prior art two related to the present invention

Technical scheme of prior art two

Hetherington et al., in the patent "Repetitive transient noise removal", propose a method for suppressing transient noise, whose flowchart is shown in Fig. 3. The method first builds a model of the noise from its characteristics, then uses the correlation coefficient between the modeled signal and the signal under test to decide whether the data contain noise; if noise is present, the noise component is removed from the signal under test according to the modeled signal.

Shortcomings of prior art two

The Hetherington method denoises repetitive noise effectively. However, transient noise types are diverse; when several different types of transient noise occur within a short time, the modeling becomes inaccurate and the denoising performance of the Hetherington method is poor.

Detailed description of the technical solution of the present invention

Technical problem to be solved by the present invention

To enhance speech in audio disturbed by transient noise: suppress the transient noise, and improve speech quality and intelligibility.

Complete technical scheme provided by the present invention:

The block diagram of the technical solution is shown in Fig. 4. MFCC parameters are extracted from the input audio signal and used to detect whether the signal contains noise. If noise is detected, the noisy frame is replaced using the PWR method, i.e. the waveform is reconstructed; if no noise is detected, the audio signal is output unchanged.

Implementation steps of the technical solution:

The input is a mono audio signal sampled at f_s = 48 kHz. The noisy input signal x(n) can be written as x(n) = s(n) + d(n), where s(n) is the clean speech signal and d(n) is the transient noise signal.

(1) MFCC feature extraction from the audio signal

The MFCC extraction process is shown in Fig. 5, which is provided as a grayscale figure to illustrate the technical effect of the present invention more clearly. The time-domain audio signal is first transformed to the frequency domain and its energy spectrum is calculated. The energy spectrum is then multiplied by a bank of Mel-scale triangular filters, and a discrete cosine transform (DCT) is applied to the logarithmic energies of the products; the first L coefficients obtained in this way are the MFCC. The detailed steps are:

1) Framing. The input signal is divided into frames of 10 ms; since the sampling frequency is f_s = 48 kHz, one frame contains N = 480 samples. The data are then normalized: for 16-bit quantization, each sample is divided by 2^{15}, which scales the data to the range (-1, 1). If the current frame is the p-th frame, then

x^{(p)}(n) = x[p \cdot (N-1) + n], \quad n = 0, 1, \ldots, N-1 \qquad (19)

2) Preprocessing. Pre-emphasis and windowing are applied to the current frame:

y^{(p)}(n) = x^{(p)}(n) - \beta x^{(p)}(n-1) \qquad (20)

y_{w}^{(p)}(n) = y^{(p)}(n) w(n) \qquad (21)

where the pre-emphasis factor is β = 0.938 and w(n) is a Hamming window, i.e. w(n) = 0.54 - 0.46cos(2πn/N).

3) An NFFT = 1024-point FFT is applied to the preprocessed signal, giving the frequency-domain signal Y^{(p)}(k).

4) The energy spectrum |Y^{(p)}(k)|^{2} of the frequency-domain signal Y^{(p)}(k) is calculated.

5) The energy spectrum is passed through a bank H of Mel-scale triangular filters, which filters it in the frequency domain.

The filter bank contains M overlapping triangular filters, as shown in Fig. 6. The centre frequency of each filter is f(m), m = 1, 2, ..., M; the present invention uses M = 24. Filter design: the upper frequency limit of the input signal, f_s/2 = 24 kHz, is transformed to the Mel frequency scale by formula (18), giving F_smel. The interval (0, F_smel) is divided into 25 equal parts; excluding the two end points 0 and F_smel, the 24 remaining division points are the centre frequencies of the 24 filters. The division points f(m) are thus uniformly spaced on the Mel scale and are then converted back to the linear frequency scale by inverting formula (18). After conversion, the spacing between adjacent f(m) narrows as m decreases and widens as m increases.

From the division points f(m), with the two end points taken as f(0) = 0 and f(M+1) = f_s/2, the frequency response of the triangular filter bank H(m, k) is

H(m, k) = \begin{cases} 0, & f(k) < f(m-1) \\ \frac{2 [f(k) - f(m-1)]}{[f(m+1) - f(m-1)] [f(m) - f(m-1)]}, & f(m-1) \leq f(k) < f(m) \\ \frac{2 [f(m+1) - f(k)]}{[f(m+1) - f(m-1)] [f(m+1) - f(m)]}, & f(m) \leq f(k) \leq f(m+1) \\ 0, & f(k) > f(m+1) \end{cases} \qquad (22)

6) The logarithm of the energy at the output of each filter H(m, k) is calculated, giving E(m):

E(m) = \log_{10}\left[\sum_{k} H(m, k) |Y^{(p)}(k)|^{2}\right], \quad m = 1, 2, \ldots, M \qquad (23)

A discrete cosine transform is applied to E(m), giving the MFCC of order L = 12, denoted C^{(p)}(l):

C^{(p)}(l) = \begin{cases} \sqrt{\frac{2}{L}} \sum_{m=0}^{M-1} E(m), & l = 0 \\ \frac{2}{\sqrt{L}} \sum_{m=0}^{M-1} E(m) \cos\left(\frac{\pi l (2m+1)}{2M}\right), & 1 \leq l \leq L-1 \end{cases} \qquad (24)
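Steps 1) to 6) can be strung together in a short pure-Python sketch. It is an illustration under stated assumptions, not the patented implementation: the Hamming window uses the standard form 0.54 - 0.46cos(2πn/N), the sample before the frame start is taken as 0, bin k is mapped to frequency k·f_s/NFFT, a direct DFT stands in for the FFT, a small floor `eps` keeps the logarithm finite, and E(m) is stored 0-indexed so that the DCT sums run over m = 0..M-1 as printed. All function names are hypothetical.

```python
import cmath
import math

FS = 48000.0   # sampling rate in Hz
NFFT = 1024    # FFT size
M = 24         # number of Mel filters
L = 12         # number of cepstral coefficients
BETA = 0.938   # pre-emphasis factor


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


def preprocess(frame):
    """Pre-emphasis and Hamming windowing; x(-1) is assumed to be 0."""
    n_len = len(frame)
    out = []
    prev = 0.0
    for n, v in enumerate(frame):
        w = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / n_len)  # assumed window form
        out.append((v - BETA * prev) * w)
        prev = v
    return out


def power_spectrum(y, nfft=NFFT):
    """|Y(k)|^2 for k = 0..nfft/2 by a direct DFT (zero padding is implicit)."""
    spec = []
    for k in range(nfft // 2 + 1):
        acc = sum(v * cmath.exp(-2j * math.pi * k * n / nfft) for n, v in enumerate(y))
        spec.append(abs(acc) ** 2)
    return spec


def mel_filterbank(fs=FS, nfft=NFFT, num_filters=M):
    """Triangular filters with centres equally spaced on the Mel scale."""
    step = hz_to_mel(fs / 2.0) / (num_filters + 1)
    edges = [mel_to_hz(step * i) for i in range(num_filters + 2)]  # f(0)..f(M+1)
    bank = []
    for m in range(1, num_filters + 1):
        lo, c, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(nfft // 2 + 1):
            f_k = k * fs / nfft  # assumed bin-to-frequency mapping
            if f_k < lo or f_k > hi:
                row.append(0.0)
            elif f_k < c:
                row.append(2.0 * (f_k - lo) / ((hi - lo) * (c - lo)))
            else:
                row.append(2.0 * (hi - f_k) / ((hi - lo) * (hi - c)))
        bank.append(row)
    return bank


def mfcc(frame, bank, nfft=NFFT, num_ceps=L):
    """Log filter-bank energies followed by the DCT, with E stored 0-indexed."""
    spec = power_spectrum(preprocess(frame), nfft)
    eps = 1e-12  # assumed floor that keeps log10 finite for silent frames
    e = [math.log10(sum(h * s for h, s in zip(row, spec)) + eps) for row in bank]
    m_len = len(e)
    c = [math.sqrt(2.0 / num_ceps) * sum(e)]
    for l in range(1, num_ceps):
        c.append(2.0 / math.sqrt(num_ceps) * sum(
            e[m] * math.cos(math.pi * l * (2 * m + 1) / (2 * m_len)) for m in range(m_len)))
    return c
```

The filter bank depends only on f_s, NFFT and M, so in practice it would be built once with `mel_filterbank()` and reused for every frame. Each triangle peaks at 2/[f(m+1) - f(m-1)], so every filter has unit area on the linear frequency axis.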

(2) Noise detection:

The Euclidean distance dist between the MFCC of the current frame and the MFCC of the previous frame is calculated:

dist = \sqrt{\sum_{l=0}^{L-1} [C^{(p)}(l) - C^{(p-1)}(l)]^{2}} \qquad (25)

Whether the current frame contains noise is decided from the distance dist and the threshold Thres. The threshold is determined adaptively by

Thres = 10 \cdot ener \qquad (26)

where ener is the energy of each frame after normalization; its minimum value is set to 60.0.

After detection, the MFCC feature of the current frame is updated:

C^{(p)}(l) = b \cdot C^{(p-1)}(l) + (1 - b) \cdot C^{(p)}(l) \qquad (27)

where the forgetting factor is b = 0.4. This update prevents a false detection when the frame following a noise frame is a speech frame.
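The detection stage can be sketched as follows. Three points are assumptions rather than statements of the patent: a frame is flagged as noisy when dist exceeds Thres (the text only says the decision is made from the two values), the floor of 60.0 is applied to ener as the text words it, and the update is read as a recursive average of the stored and the current MFCC vectors. The function names are hypothetical.

```python
import math

CONST = 10.0       # threshold constant from the text
ENER_FLOOR = 60.0  # minimum value stated in the text, applied to ener here
B = 0.4            # forgetting factor


def frame_energy(frame):
    """Energy of one normalized frame."""
    return sum(v * v for v in frame)


def mfcc_distance(c_cur, c_ref):
    """Euclidean distance between two MFCC vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_cur, c_ref)))


def adaptive_threshold(frame):
    """Thres = const * ener, with ener floored."""
    return CONST * max(frame_energy(frame), ENER_FLOOR)


def is_noise_frame(c_cur, c_ref, frame):
    """Assumed decision rule: noise when the distance exceeds the threshold."""
    return mfcc_distance(c_cur, c_ref) > adaptive_threshold(frame)


def update_reference(c_cur, c_ref, b=B):
    """Recursive average of the stored and the current MFCC vectors."""
    return [b * r + (1.0 - b) * c for c, r in zip(c_cur, c_ref)]
```

Averaging keeps part of the previous feature vector in the reference, so a single noisy frame does not fully replace it before the next comparison.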

(3) Pitch period prediction:

The pitch period is estimated for every frame of the speech signal. If the current frame is a noise frame, its pitch period is predicted from the pitch periods of the previous two frames. The block diagram of pitch period estimation is shown in Fig. 7. Across different speakers the pitch period generally lies within 2-12 ms, so the pitch period is searched within 2-12 ms. Let PMAX be the number of samples corresponding to 12 ms, i.e. PMAX = 576, and PMIN the number of samples corresponding to 2 ms, i.e. PMIN = 96. A buffer buf(n) of length 3·PMAX + N = 2208, which stores data that have already been output, is used to estimate the pitch period.
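The sample counts above follow directly from the sampling rate, as this small Python check shows. The helper name `ms_to_samples` is illustrative only.

```python
FS = 48000  # sampling rate in Hz
N = 480     # frame length in samples (10 ms)


def ms_to_samples(duration_ms, fs=FS):
    """Number of samples covering duration_ms milliseconds at rate fs."""
    return round(duration_ms * fs / 1000)


PMIN = ms_to_samples(2)    # shortest pitch period searched
PMAX = ms_to_samples(12)   # longest pitch period searched
BUF_LEN = 3 * PMAX + N     # buffer length used for pitch estimation

if __name__ == "__main__":
    print(PMIN, PMAX, BUF_LEN)
```

The search range of 96 to 576 samples corresponds to fundamental frequencies between about 83 Hz and 500 Hz.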

The pitch period is estimated as follows:

1) buf(n) is low-pass filtered to give buf_d(n). The cutoff frequency of the low-pass filter (LPF) is 900 Hz.

2) buf_d(n) is centre clipped to give buf_c(n):

buf_{c}(n) = \begin{cases} buf_{d}(n) - C_{L}, & buf_{d}(n) > C_{L} \\ buf_{d}(n) + C_{L}, & buf_{d}(n) < -C_{L} \\ 0, & |buf_{d}(n)| \leq C_{L} \end{cases} \qquad (28)

where C_L is the clipping level, usually set to 68% of the maximum of the normalized data.

3) The autocorrelation of buf_c(n) is computed, and the position of its maximum within the range (96, 576) is taken as the pitch period estimate Pitch:

r_{buf_{c}}(n) = \sum_{m=0}^{2 PMAX - 1} buf_{c}(m) \, buf_{c}(m + n), \quad PMIN \leq n \leq PMAX \qquad (29)

Pitch = \arg\max_{PMIN \leq n \leq PMAX} r_{buf_{c}}(n) \qquad (30)

4) To prevent frequency-doubling errors, the pitch period estimates of the previous two frames, Pitch^{(p-1)} and Pitch^{(p-2)}, are smoothed using formula (31).

The pitch period of the current frame, Pitch^{(p)}, is then predicted from the two smoothed pitch periods:

Pitch^{(p)} = Pitch^{(p-1)} + (Pitch^{(p-1)} - Pitch^{(p-2)}) \qquad (32)
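Centre clipping, the autocorrelation search, and the linear prediction of the pitch period can be sketched in Python as below. The 900 Hz low-pass filter and the smoothing step are left out because the text does not give their details, the clipping level is taken as 68% of the peak magnitude of the buffer, and clamping the prediction to [PMIN, PMAX] is an added safeguard that the patent does not state. Function names are hypothetical.

```python
def center_clip(signal, ratio=0.68):
    """Centre clipping with clipping level ratio * peak magnitude."""
    peak = max((abs(v) for v in signal), default=0.0)
    c_l = ratio * peak
    out = []
    for v in signal:
        if v > c_l:
            out.append(v - c_l)
        elif v < -c_l:
            out.append(v + c_l)
        else:
            out.append(0.0)
    return out


def estimate_pitch(buf_c, pmin=96, pmax=576):
    """Lag in [pmin, pmax] that maximizes the autocorrelation of buf_c."""
    span = 2 * pmax  # the sum runs over m = 0 .. 2*PMAX - 1
    if len(buf_c) < span + pmax:
        raise ValueError("buffer must hold at least 3 * pmax samples")
    best_lag, best_val = pmin, float("-inf")
    for lag in range(pmin, pmax + 1):
        r = sum(buf_c[m] * buf_c[m + lag] for m in range(span))
        if r > best_val:
            best_lag, best_val = lag, r
    return best_lag


def predict_pitch(pitch_prev1, pitch_prev2, pmin=96, pmax=576):
    """Linear extrapolation from the two previous pitch periods, clamped (assumption)."""
    predicted = pitch_prev1 + (pitch_prev1 - pitch_prev2)
    return max(pmin, min(pmax, predicted))
```

A buffer of 3·PMAX samples is exactly enough for the largest index m + n reached by the autocorrelation sum, which matches the 3·PMAX part of the buffer length given above.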

(4) Waveform reconstruction:

The last pitch cycle waveform of the previous frame is extracted and linearly interpolated to obtain a new pitch cycle waveform.

1) Because buf(n) stores the output frame data, the pitch cycle waveform of the previous frame, i.e. the last Pitch^{(p-1)} samples of the previous frame's output, can be extracted from buf(n); it is denoted pw^{(p-1)}(n). pw^{(p-1)}(n) is linearly interpolated to give a new waveform of length Pitch^{(p)}, denoted pw^{(p)}(n). As shown in Fig. 8, the interpolation is linear between two points:

pw^{(p)}(n') = \left(\frac{Pitch^{(p-1)}}{Pitch^{(p)}} \cdot n' - n + 1\right) \cdot [pw^{(p-1)}(n) - pw^{(p-1)}(n-1)] + pw^{(p-1)}(n-1), \quad n - 1 \leq \frac{Pitch^{(p-1)}}{Pitch^{(p)}} \cdot n' < n \qquad (33)

2) The new waveform is replicated periodically:

a. The principle of periodic replication is shown in Fig. 9(a) to Fig. 9(c). If the current frame is a noise frame (whether the previous frame is a noise frame or a clean speech frame), the AB and CD segments of buf(n) are overlap-added with a fade according to formula (34), so that the data on both sides of D are continuous:

buf_{CD}(n) = \alpha \cdot buf_{CD}(n) + (1 - \alpha) \cdot buf_{AB}(n) = \alpha \cdot buf_{CD}(n) + (1 - \alpha) \cdot buf_{CD}(n - Pitch), \quad 0 \leq n < N_{1} \qquad (34)

\alpha = \frac{N_{1} - n}{N_{1}}, \quad n = 0, 1, \ldots, N_{1} - 1 \qquad (35)

where α is the attenuation factor, which decays linearly from 1 towards 0, and the AB and CD segments both have length N_1 = 32.

b. The new waveform pw^{(p)}(n) is copied repeatedly, with period Pitch^{(p)}, into the DF region. The DE segment is the repaired current frame. The EF segment has length N_2 = 32; if the next frame is a speech frame, it is used for fading so that the data on both sides of E, i.e. between frames, are continuous.

c. One frame of data starting at point C in buf(n) is output. The output of the method is therefore delayed by the length of the CD segment. All data in buf(n) are then shifted forward by N points (one frame length).
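The interpolation of step 1) and the noise-frame branch of step 2) can be sketched in Python as follows. The buffer indices `c_idx` and `d_idx` (positions of points C and D) are hypothetical parameters, the fade weight is taken as (N_1 - n)/N_1 so that it starts at 1, the last interpolated sample is clamped to the final input sample, and the replicated waveform is assumed to start at the beginning of pw at point D. None of these details are confirmed by the patent text.

```python
def stretch_pitch_waveform(pw_prev, new_len):
    """Resample one pitch cycle to new_len samples by two-point linear interpolation."""
    old_len = len(pw_prev)
    ratio = old_len / new_len
    out = []
    for n_new in range(new_len):
        t = ratio * n_new                 # position on the old sample grid
        lo = int(t)                       # left neighbour
        hi = min(lo + 1, old_len - 1)     # right neighbour, clamped at the end (assumption)
        out.append(pw_prev[lo] + (t - lo) * (pw_prev[hi] - pw_prev[lo]))
    return out


def crossfade_with_previous_period(buf, c_idx, pitch, n1=32):
    """Fade the CD segment into the data one pitch period earlier (AB segment), in place."""
    for n in range(n1):
        alpha = (n1 - n) / n1
        buf[c_idx + n] = alpha * buf[c_idx + n] + (1.0 - alpha) * buf[c_idx + n - pitch]


def replicate_into(buf, d_idx, length, pw):
    """Fill buf[d_idx : d_idx + length] with repeated copies of the pitch cycle pw."""
    for i in range(length):
        buf[d_idx + i] = pw[i % len(pw)]


def shift_buffer(buf, n):
    """Drop the n oldest samples and append n zeros as room for the next frame."""
    return buf[n:] + [0.0] * n
```

For a noise frame the calls would run in the order `stretch_pitch_waveform`, `crossfade_with_previous_period`, `replicate_into` over N + N_2 samples starting at D, output of N samples from C, then `shift_buffer`. The in-place crossfade is safe because the pitch period (at least 96 samples) exceeds N_1 = 32, so the samples read from the AB segment have not been modified.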

Fig. 10(a) to Fig. 10(c) illustrate the case in which several consecutive frames need repair: Fig. 10(a) shows the signal with the current frame discarded, Fig. 10(b) shows the current frame reconstructed by the method of this patent, and Fig. 10(c) shows the signal after repair. If the current frame is a clean speech frame and the previous frame is a noise frame, the processing is as follows:

a. The DG segment of buf(n) now holds the EF segment of the previous frame. The DG segment is blended with the first N_2 samples of the current input frame (calculated similarly to formula (34)), and the result is stored in DG.

b. The remaining samples of the current frame are copied unchanged into buf(n) after point G.

c. One frame of data starting at point C is output, and all data in buf(n) are then shifted forward by one frame length, i.e. N points.

If both the current frame and the previous frame are clean speech frames, the input data of the current frame are copied unchanged into the region of buf(n) to be repaired, i.e. the DE region in Fig. 9, and one frame of data starting at point C is output.
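The transition from a reconstructed frame back to clean speech can be sketched as a short crossfade. The text only says the blend is calculated similarly to the overlap-add formula, so the weights below, fading linearly from the reconstructed tail to the new input over N_2 samples, are an assumption, and `fuse_tail_with_input` is a hypothetical name.

```python
def fuse_tail_with_input(tail, frame, n2=32):
    """Blend n2 reconstructed samples into the start of a clean input frame.

    tail  : the n2 samples left over from the previous reconstruction (DG segment)
    frame : the current clean input frame
    Returns the samples to store from point D onward.
    """
    fused = []
    for n in range(n2):
        alpha = (n2 - n) / n2  # assumed weight: 1 at the start, approaching 0
        fused.append(alpha * tail[n] + (1.0 - alpha) * frame[n])
    # The remaining samples of the frame are copied unchanged after point G.
    return fused + list(frame[n2:])
```

The result has the same length as the input frame, so the subsequent output and buffer shift proceed as in the other cases.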

Beneficial effects of the technical solution of the present invention:

The method was tested with 20 clean speech recordings (male, female, and child speakers) and four types of noise: mouse clicks, knocks, metronome ticks, and keyboard strokes. The noise durations are 10 ms for mouse clicks, 20 ms for knocks and metronome ticks, and 30 ms for keyboard strokes. Each clean recording was mixed with each of the four noises, giving 80 noisy recordings. Each recording contains 30 noise bursts at equal spacing.

All audio is sampled at f_s = 48 kHz with a frame length of N = 480. In the MFCC calculation stage, an NFFT = 1024-point FFT is used, the Mel filter bank has M = 24 filters, and L = 12 MFCCs are computed. In the transient noise detection stage, the adaptive threshold is Thres = const · ener; so that the threshold suits all noise types, the constant const is set to 10, ener is the energy of each input frame, and the minimum value is set to 60.0. When the threshold is updated, the forgetting factor b is set to 0.4. In the pitch period estimation stage, the pitch period is searched within (2 ms, 12 ms), which corresponds to (96, 576) samples. In the waveform reconstruction stage, the fade lengths N_1 and N_2 are both 32, and the length of the buffer buf(n) is 2240.

After noisy speech is denoised with the present invention, speech intelligibility is greatly improved and listener fatigue is reduced. The denoising performance was evaluated with two measures, the segmental signal-to-noise ratio SNR_seg and PEAQ, where the segmental signal-to-noise ratio is calculated as

SNR_{seg}^{in} = \frac{1}{R} \sum_{i=1}^{R} 10 \log_{10} \frac{\sum_{n \in frame_{i}} |s(n)|^{2}}{\sum_{n \in frame_{i}} |x(n) - s(n)|^{2}} \qquad (36)

SNR_{seg}^{out} = \frac{1}{R} \sum_{i=1}^{R} 10 \log_{10} \frac{\sum_{n \in frame_{i}} |s(n)|^{2}}{\sum_{n \in frame_{i}} |\hat{s}(n) - s(n)|^{2}} \qquad (37)

where R is the number of frames and \hat{s}(n) is the denoised signal.
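The segmental signal-to-noise ratio can be computed with a few lines of Python. This is a sketch: the function name is illustrative, frames are taken as consecutive non-overlapping blocks, and frames whose signal or error energy is zero are skipped because the logarithm is undefined there, a choice the patent does not specify.

```python
import math


def segmental_snr(clean, test, frame_len=480):
    """Mean per-frame SNR in dB of `test` against the clean reference.

    Pass the noisy signal x(n) for the input measure or the denoised
    signal for the output measure.
    """
    total = 0.0
    count = 0
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        seg = range(start, start + frame_len)
        sig = sum(clean[n] ** 2 for n in seg)
        err = sum((test[n] - clean[n]) ** 2 for n in seg)
        if sig == 0.0 or err == 0.0:
            continue  # assumed handling of undefined frames
        total += 10.0 * math.log10(sig / err)
        count += 1
    return total / count if count else float("nan")
```

Comparing `segmental_snr(s, x)` with `segmental_snr(s, s_hat)` on the same recording gives the improvement attributable to denoising.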

The results of the evaluation with the two measures are shown in Fig. 12 and Fig. 13. Fig. 12 compares the objective audio quality of the noisy signals before and after denoising in terms of signal-to-noise ratio; Fig. 13 makes the same comparison in terms of PEAQ.

The spectrograms of a noisy signal and of the same signal after denoising by this scheme are shown in Fig. 11(a) and Fig. 11(b); they are provided as grayscale figures to illustrate the technical effect of the present invention more clearly. Fig. 11(a) is the spectrogram of audio contaminated by mouse-click noise; Fig. 11(b) is the spectrogram of the audio of Fig. 11(a) after denoising.

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any equivalent replacement or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical scheme and inventive concept, shall fall within the scope of protection of the present invention.

Abbreviations and key terms used in the present invention

AR: Autoregressive model.

DCT: Discrete cosine transform.

FFT: Fast Fourier transform.

LPF: Low-pass filter.

LPCC: Linear prediction cepstrum coefficient.

MFCC: Mel frequency cepstrum coefficient.

VoIP: Voice over IP.

PLC: Packet loss concealment.

PWR: Pitch waveform replication.

SNR: Signal-to-noise ratio.

PEAQ: Perceptual Evaluation of Audio Quality, the objective method for measuring perceived audio quality specified in Recommendation ITU-R BS.1387.

Claims

1. A method for removing transient noise, characterized in that: first, the Mel cepstrum coefficients of the current frame are calculated and, at the same time, the pitch period of the current frame is predicted; then the Mel cepstrum coefficients are used to detect whether the current frame contains noise; if noise is present, the waveform is reconstructed using the predicted pitch period.

2. The method for removing transient noise according to claim 1, characterized in that the Mel cepstrum coefficients are calculated as follows:

1) the input signal is divided into frames of length N = 480, i.e. 10 ms of data, and the data are normalized; if the current frame is the p-th frame, then

x^{(p)}(n) = x[p \cdot (N-1) + n], \quad n = 0, 1, \ldots, N-1; \qquad (1)

2) preprocessing: pre-emphasis and windowing are applied to the current frame,

y^{(p)}(n) = x^{(p)}(n) - \beta x^{(p)}(n-1); \qquad (2)

where the pre-emphasis factor is β = 0.938 and w(n) is a Hamming window, i.e. w(n) = 0.54 - 0.46cos(2πn/N);

3) an NFFT = 1024-point FFT is applied to the preprocessed signal, giving the frequency-domain signal Y^{(p)}(k);

4) the energy spectrum |Y^{(p)}(k)|^{2} of the frequency-domain signal Y^{(p)}(k) is calculated;

5) the energy spectrum is passed through a bank H of Mel-scale triangular filters, which filters it in the frequency domain;

the filter bank contains M overlapping triangular filters; the centre frequency of each filter is f(m), m = 1, 2, ..., M, with M = 24;

filter design: the upper frequency limit of the input signal, f_s/2 = 24 kHz, is transformed by the formula

where f is the frequency in Hz, to the Mel frequency scale, giving F_smel; the interval (0, F_smel) is divided into 25 equal parts; excluding the two end points 0 and F_smel, the 24 remaining division points are the centre frequencies of the 24 filters; the division points f(m) are uniformly spaced on the Mel scale and are then converted back to the linear frequency scale by inverting the same formula; after conversion, the spacing between adjacent f(m) narrows as m decreases and widens as m increases; from the division points f(m), the frequency response of the triangular filter bank H(m, k) is

(6)

3. The method for removing transient noise according to claim 1, characterized in that noise detection proceeds as follows:

whether the current frame contains noise is decided from the distance value and the threshold Thres; the threshold Thres is determined adaptively by

Thres = 10 \cdot ener, \qquad (8)

where ener is the energy of each frame after normalization, and its minimum value is set to 60.0; after detection, the MFCC feature of the current frame is updated,

C^{(p)}(l) = b \cdot C^{(p-1)}(l) + (1 - b) \cdot C^{(p)}(l), \qquad (9)

where the forgetting factor is b = 0.4; this update prevents a false detection when the frame following a noise frame is a speech frame.

4. The method for removing transient noise according to claim 1, characterized in that the pitch period is predicted as follows:

1) buf(n) is low-pass filtered to give buf_d(n); the cutoff frequency of the low-pass filter (LPF) is 900 Hz;

2) buf_d(n) is centre clipped to give buf_c(n),

where C_L is the clipping level, usually set to 68% of the maximum of the normalized data;

3) the autocorrelation of buf_c(n) is computed, and the position of its maximum within the range (96, 576) is taken as the pitch period estimate Pitch;

Pitch^{(p)} = Pitch^{(p-1)} + (Pitch^{(p-1)} - Pitch^{(p-2)}). \qquad (14)

5. The method for removing transient noise according to claim 1, characterized in that the waveform is reconstructed as follows:

1) because buf(n) stores the output frame data, the pitch cycle waveform of the previous frame, i.e. the last Pitch^{(p-1)} samples of the previous frame's output, can be extracted from buf(n) and is denoted pw^{(p-1)}(n); pw^{(p-1)}(n) is linearly interpolated to give a new waveform of length Pitch^{(p)}, denoted pw^{(p)}(n); the interpolation formula is

(15)

2) the new waveform is replicated periodically, as follows:

a. if the current frame is a noise frame, whether the previous frame is a noise frame or a clean speech frame, the AB and CD segments of buf(n) are overlap-added with a fade according to formula (16), so that the data on both sides of D are continuous,

buf_{CD}(n) = \alpha \cdot buf_{CD}(n) + (1 - \alpha) \cdot buf_{AB}(n) = \alpha \cdot buf_{CD}(n) + (1 - \alpha) \cdot buf_{CD}(n - Pitch), \quad 0 \leq n < N_{1} \qquad (16)

where α is the attenuation factor, which decays linearly from 1 towards 0; the AB and CD segments both have length N_1 = 32;

b. the new waveform pw^{(p)}(n) is copied repeatedly, with period Pitch^{(p)}, into the DF region; the DE segment is the repaired current frame; the EF segment has length N_2 = 32 and, if the next frame is a speech frame, is used for fading so that the data on both sides of E, i.e. between frames, are continuous;

c. one frame of data starting at point C in buf(n) is output; the output of the method is therefore delayed by the length of the CD segment; all data in buf(n) are then shifted forward by N points;

if the current frame is a clean speech frame and the previous frame is a noise frame, the processing is as follows,

a. the DG segment of buf(n) now holds the EF segment of the previous frame; the DG segment is blended with the first N_2 samples of the current input frame, calculated similarly to formula (16), and the result is stored in DG;

b. the remaining samples of the current frame are copied unchanged into buf(n) after point G;

c. one frame of data starting at point C is output, and all data in buf(n) are then shifted forward by N points, i.e. one frame length; if both the current frame and the previous frame are clean speech frames, the input data of the current frame are copied unchanged into the region of buf(n) to be repaired, and one frame of data starting at point C is output.