CN104658544A

CN104658544A - Method for inhibiting transient noise in voice

Info

Publication number: CN104658544A
Application number: CN201310590841.8A
Authority: CN
Inventors: 盖丽
Original assignee: Dalian You Jia Software Science And Technology Ltd
Current assignee: Dalian You Jia Software Science And Technology Ltd
Priority date: 2013-11-20
Filing date: 2013-11-20
Publication date: 2015-05-27

Abstract

The invention discloses a method for inhibiting transient noise in a voice and belongs to the technical field of signal processing. The method for inhibiting the transient noise in the voice is characterized in that a Gammatone frequency cepstral coefficient extracting module, a transient noise detecting module and a voice signal re-establishing module are employed in the method; an input end of the Gammatone frequency cepstral coefficient extracting module receives a noise-containing voice signal, an output end of the Gammatone frequency cepstral coefficient extracting module is connected with an input end of the transient noise detecting module, an output end of the transient noise detecting module is connected with an input end of the voice signal re-establishing module, the input end of the voice signal re-establishing module receives the noise-containing voice signal and is further connected with the output end of the transient noise detecting module, and the voice signal re-establishing module outputs a de-noised voice.

Description

A method for suppressing transient noise in speech

Technical field

The present invention relates to a method for suppressing transient noise in speech, and belongs to the field of signal processing technology.

Background technology

Transient noise occurs in many applications, for example in voice communication terminals such as hearing aids, hands-free kits, mobile phones and video conferencing equipment. It severely degrades speech quality, lowers the clarity and intelligibility of the speech signal, and causes listening fatigue. Transient noise in speech is usually additive. In the time domain it is typically sudden and impulsive: its energy is concentrated in a short time interval but is spread widely in frequency. A typical transient signal consists of an initial peak followed by a short decaying oscillation lasting about 10 to 50 ms; door knocks, mouse clicks, metronome ticks, keyboard strokes and hammer blows are all transient noise. In most cases transient noise is hard to remove, because it overlaps the speech signal completely in the time-frequency domain and is discontinuous. Most current speech noise suppression algorithms, such as spectral subtraction, adaptive filtering and Wiener filtering, target stationary and continuous noise and suppress transient noise very poorly. A speech noise suppression technique designed for transient noise conditions is therefore needed.

Because the ultimate measure of speech noise suppression is the subjective perception of the listener, the influence of the auditory perception characteristics of the human ear on noise suppression performance must be taken into account. The basilar membrane of the ear plays an important role in forming auditory perception, and it has good frequency selectivity and resolution. Based on this property, the frequency-division behavior of the basilar membrane can be realized by a bank of band-pass filters, called a human auditory filter bank. Johannesma proposed the Gammatone (GT) filter model in 1972. It is based on the basilar membrane model in auditory modeling and was first used to describe the physiological impulse response of the auditory nerve of the cat. This filter can simulate the frequency response of the human auditory system and matches the auditory perception characteristics of the human ear. The time-domain expression of its impulse response is

g (t) = [B^{n} t^{n - 1} e^{- 2 π B t} \cos (2 π f_{i} t + φ)] u (t)

B = b_{1} \cdot ERB (f_{i})

where B^n is the filter gain; n is the filter order (a Gammatone filter with n = 4 already simulates the filtering characteristics of the basilar membrane well); φ is the initial phase; u(t) is the unit step function; f_i is the center frequency; and ERB(f_i) is the equivalent rectangular bandwidth of the Gammatone filter, which is related to the center frequency f_i by:

ERB (f_{i}) = 24.7 + 0.108 f_{i}
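As a rough illustration of the impulse response and bandwidth formulas above, the following Python sketch samples g(t) for a single channel. The function names, the default b_1 = 1.019 and the zero initial phase are assumptions made for illustration only; the text does not give values for b_1 or φ.

```python
import math

def erb(fc_hz):
    """Equivalent rectangular bandwidth, ERB(f_i) = 24.7 + 0.108 f_i."""
    return 24.7 + 0.108 * fc_hz

def gammatone_impulse_response(fc_hz, fs_hz, num_samples,
                               order=4, b1=1.019, phase=0.0):
    """Sample g(t) = B^n t^(n-1) exp(-2 pi B t) cos(2 pi f_i t + phase), t >= 0.

    b1 = 1.019 and phase = 0 are illustrative assumptions, not values
    taken from the text.
    """
    bandwidth = b1 * erb(fc_hz)  # B = b1 * ERB(f_i)
    response = []
    for k in range(num_samples):
        t = k / fs_hz
        envelope = (bandwidth ** order) * (t ** (order - 1)) \
            * math.exp(-2.0 * math.pi * bandwidth * t)
        response.append(envelope * math.cos(2.0 * math.pi * fc_hz * t + phase))
    return response

# 50 ms of the 1 kHz channel at a 48 kHz sampling rate.
g = gammatone_impulse_response(fc_hz=1000.0, fs_hz=48000.0, num_samples=2400)
```

The unit step u(t) is handled implicitly by sampling only t ≥ 0.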

The center frequency of a Gammatone filter determines characteristics such as its equivalent bandwidth and frequency response. In accordance with auditory perception, the center frequencies of the Gammatone filters are uniformly distributed on a logarithmic scale and are determined by:

f_{i} = (f_{H} + 228.7) \exp (i \times v) - 228.7

= (f_{H} + 228.7) \exp (i \times \frac{\ln \frac{f_{L} + 228.7}{f_{H} + 228.7}}{CH}) - 228.7,1 \leq i \leq CH

v = \frac{\ln \frac{f_{L} + 228.7}{f_{H} + 228.7}}{CH}

where the parameter v is the overlap factor between filters and represents the degree of overlap between adjacent filters, f_L and f_H are the cut-off frequencies of the filter bank, and CH is the number of channels of the Gammatone filter bank. Taking the Laplace transform of the Gammatone impulse response gives the continuous-domain transfer function of the 4th-order Gammatone filter:

G (s) = \frac{[s + b + (\sqrt{2} - 1) ω_{i}] [s + b - (\sqrt{2} - 1) ω_{i}] [s + b + (\sqrt{2} + 1) ω_{i}] [s + b - (\sqrt{2} + 1) ω_{i}]}{{[{(s + b)}^{2} + {ω_{i}}^{2}]}^{4}}

where ω_i = 2πf_i is the center angular frequency of each filter. Using the impulse invariance method, the Laplace transform G(s) of the Gammatone impulse response is mapped to the z-domain, giving:

G_{i} (z) = \frac{T_{s} - T_{s} a_{3} (a_{1} + (\sqrt{2} - 1) a_{2}) z^{- 1}}{1 - 2 a_{1} a_{3} z^{- 1} + {a_{3}}^{2} z^{- 2}} \times \frac{T_{s} - T_{s} a_{3} (a_{1} - (\sqrt{2} - 1) a_{2}) z^{- 1}}{1 - 2 a_{1} a_{3} z^{- 1} + {a_{3}}^{2} z^{- 2}}

\times \frac{T_{s} - T_{s} a_{3} (a_{1} + (\sqrt{2} + 1) a_{2}) z^{- 1}}{1 - 2 a_{1} a_{3} z^{- 1} + {a_{3}}^{2} z^{- 2}} \times \frac{T_{s} - T_{s} a_{3} (a_{1} - (\sqrt{2} + 1) a_{2}) z^{- 1}}{1 - 2 a_{1} a_{3} z^{- 1} + {a_{3}}^{2} z^{- 2}}

= G_{1, i} (z) \cdot G_{2, i} (z) \cdot G_{3, i} (z) \cdot G_{4, i} (z)

where T_s is the sampling period, a_1 = cos(ω_i T_s), a_2 = sin(ω_i T_s), and a_3 = e^{-b T_s}.

As the above expression shows, a 4th-order Gammatone filter can be realized as a cascade of four second-order transfer functions. Inverse-transforming each of the four second-order transfer functions gives the impulse responses of four second-order filters, g_{1,i}(n), g_{2,i}(n), g_{3,i}(n) and g_{4,i}(n). Convolving the speech signal with each impulse response in turn gives the output of the Gammatone filter. The amplitude-frequency responses of a 64-channel Gammatone filter bank at a 48 kHz sampling rate are shown in Fig. 1.
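The cascade of four second-order sections can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, b is taken as 2πB (inferred by comparing the envelope e^{-2πBt} with the factor s + b), and b_1 = 1.019 is an illustrative value that the text does not specify.

```python
import math

def erb(fc_hz):
    return 24.7 + 0.108 * fc_hz

def gammatone_sections(fc_hz, fs_hz, b1=1.019):
    """Return four (numerator, denominator) pairs for G_1,i .. G_4,i.

    Assumes b = 2*pi*b1*ERB(f_i); b1 = 1.019 is not given in the text.
    """
    ts = 1.0 / fs_hz
    b = 2.0 * math.pi * b1 * erb(fc_hz)
    omega = 2.0 * math.pi * fc_hz
    a1, a2, a3 = math.cos(omega * ts), math.sin(omega * ts), math.exp(-b * ts)
    root2 = math.sqrt(2.0)
    sections = []
    # Signs follow the four numerators: +(sqrt2-1), -(sqrt2-1), +(sqrt2+1), -(sqrt2+1).
    for k in (root2 - 1.0, -(root2 - 1.0), root2 + 1.0, -(root2 + 1.0)):
        numerator = (ts, -ts * a3 * (a1 + k * a2))
        denominator = (1.0, -2.0 * a1 * a3, a3 * a3)
        sections.append((numerator, denominator))
    return sections

def filter_cascade(signal, sections):
    """Run the signal through each second-order section in turn."""
    current = list(signal)
    for (n0, n1), (_, d1, d2) in sections:
        x1 = y1 = y2 = 0.0
        output = []
        for x0 in current:
            y0 = n0 * x0 + n1 * x1 - d1 * y1 - d2 * y2
            output.append(y0)
            x1, y2, y1 = x0, y1, y0
        current = output
    return current

sections = gammatone_sections(1000.0, 48000.0)
impulse_response = filter_cascade([1.0] + [0.0] * 1999, sections)
```

Filtering sample by sample with the difference equations is equivalent to convolving with the four impulse responses, and avoids truncating them.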

(2) Packet loss concealment

VoIP transmits speech as data packets over an IP network. As an alternative to the traditional public switched telephone network (PSTN), it is attracting more and more attention. Network congestion or delay jitter during transmission can cause some packets to fail to arrive at the receiver in time; this is called packet loss. A well-designed technique for handling packet loss can greatly improve the quality of the transmitted speech. Such techniques fall into sender-based packet loss recovery (PLR) and receiver-based packet loss concealment (PLC). Packet loss recovery includes forward error correction (FEC) and interleaving. In general, sender-based recovery works better than receiver-based concealment, but it is more complex and increases both network bandwidth and transmission delay. For real-time reasons, many practical VoIP systems use packet loss concealment. Common PLC algorithms include silence substitution, packet repetition, pattern matching, pitch waveform replication and linear prediction. The present invention uses a packet loss concealment method based on bidirectional linear prediction (BLP) to suppress transient noise.

Vaseghi et al. proposed an impulsive noise detection and suppression algorithm based on a linear prediction model and interpolation. The algorithm has two parts, impulsive noise detection and signal interpolation repair. The detection part comprises linear prediction analysis based on an AR model, an inverse filter and a threshold detector. The detector output is a binary switch value that controls the interpolator: if impulsive noise is detected, the interpolator is activated and replaces the contaminated samples. A functional block diagram of the method is shown in Fig. 2.

In the monograph "Advanced Digital Signal Processing and Noise Reduction" (3rd ed., New York: Wiley, 2006), S. V. Vaseghi gives a method for detecting and suppressing impulsive noise. Its main drawbacks are: (a) an accurate model of many one-dimensional signals (such as speech) is hard to obtain, so harmonic distortion is easily introduced; (b) pulses of small amplitude cannot be detected.

In the patent "Repetitive transient noise removal" (US patent: 2006116873, 2003), Phillip A. Hetherington and Shreyas A. Paranjpe propose modeling the noise according to its characteristics and then using the correlation coefficient between the modeled signal and the signal under test to decide whether the data under test contain noise. If noise is present, the noise component is removed from the signal under test according to the modeled signal. A flowchart of the method is shown in Fig. 3.

This technique is suited to removing repetitive noise. Transient noise, however, comes in many types; when several different types of transient noise occur within a short time, the modeling becomes inaccurate and the denoising performance suffers.

Summary of the invention

The present invention is proposed in view of the above problems and provides a method for suppressing transient noise in speech. Following the detect-and-repair approach, it uses Gammatone frequency cepstral coefficients (GFCC) to improve the detection accuracy of transient noise and a speech signal reconstruction method to repair the detected frames, thereby improving the quality of the speech signal.

The technical solution adopted by the present invention is as follows:

A method for suppressing transient noise in speech comprises three modules: a Gammatone frequency cepstral coefficient extraction module, a transient noise detection module and a speech signal reconstruction module.

The input of the Gammatone frequency cepstral coefficient extraction module receives the noisy speech signal, and its output is connected to the input of the transient noise detection module. The output of the transient noise detection module is connected to an input of the speech signal reconstruction module. The speech signal reconstruction module thus receives both the external noisy speech signal and the output of the transient noise detection module, and outputs the denoised speech. The Gammatone frequency cepstral coefficient extraction module extracts Gammatone frequency cepstral coefficients from the input noisy speech signal. The transient noise detection module decides, from the difference between the Gammatone frequency cepstral coefficients of adjacent frames, whether the current speech frame contains transient noise. If it does, the speech signal reconstruction module reconstructs the current speech frame, and the reconstructed frame replaces the current frame at the output. If it does not, the current speech frame is output directly without processing.

Principle and beneficial effects of the invention: as can be seen from Fig. 12 of the accompanying drawings, the traditional detect-and-repair technique misses some transient noise, and the repaired speech is not smooth and tends to contain new frequency components. As can be seen from Fig. 13, the present invention detects the noise and reconstructs the speech signal effectively, leaving less residual noise after reconstruction than the traditional algorithm.

Brief description of the drawings

Fig. 1 shows the amplitude-frequency response of the 64-channel GT filter bank.

Fig. 2 is a block diagram of the impulsive noise detection and suppression method given in the monograph "Advanced Digital Signal Processing and Noise Reduction" (3rd ed., New York: Wiley, 2006).

Fig. 3 is a flowchart of the method of the patent "Repetitive transient noise removal" (US patent: 2006116873, 2003).

Fig. 4 is a functional block diagram of the present invention.

Fig. 5 is a functional block diagram of Gammatone frequency cepstral coefficient (GFCC) extraction.

Fig. 6 is a functional block diagram of the speech signal reconstruction method based on bidirectional linear prediction.

Fig. 7 is a functional block diagram of the forward pitch period detection method.

Fig. 8 shows the transient noise suppression results evaluated with the SNRSeg metric.

Fig. 9 shows the transient noise suppression results evaluated with the segmental log-spectral distortion LSDSeg.

Fig. 10 is an example spectrogram of speech without transient noise.

Fig. 11 is the spectrogram of the speech of Fig. 10 after noise is added.

Fig. 12 is the spectrogram of the result of processing the speech of Fig. 11 with the impulsive noise detection and suppression method given by S. V. Vaseghi in the monograph "Advanced Digital Signal Processing and Noise Reduction" (3rd ed., New York: Wiley, 2006).

Fig. 13 is the spectrogram of the result of processing the speech of Fig. 11 with the present invention.

Embodiment

The present invention is further described below with reference to the accompanying drawings. Grayscale figures (Fig. 10 to Fig. 13) are provided because they illustrate the technical effect of the invention and help the examiner assess it.

The scheme of the present invention mainly comprises a Gammatone frequency cepstral coefficient extraction module, a transient noise detection module and a speech signal reconstruction module, as shown in Fig. 4. In the transient noise detection stage, a Gammatone filter bank that simulates the auditory model of the human cochlea is used to extract Gammatone frequency cepstral coefficients (GFCC), and transient noise is detected from the GFCC difference between adjacent frames. If a frame is detected as noisy, the correlation and short-time stationarity of the speech signal are exploited: a receiver-side packet loss concealment (PLC) algorithm based on bidirectional linear prediction reconstructs the waveform of the noisy frame from the information in neighboring frames. If a frame is detected as noise-free, no extra processing is done and it is output directly.

The input is a monophonic speech signal with sampling rate f_s = 48 kHz. The noisy input speech signal x(n) is expressed as

x(n)=s(n)+d(n)，

where s(n) is the clean speech signal and d(n) is the transient noise. The technical solution of the invention is described in detail below.

Gammatone frequency cepstral coefficient (GFCC) extraction

The functional block diagram of GFCC extraction is shown in Fig. 5. The steps are as follows:

(a) Pre-emphasize the original noisy speech x(n) to boost the high-frequency components,

x_{e} (n) = x (n) - α x (n - 1),

where α is the pre-emphasis coefficient, 0 < α < 1; the present invention suggests α = 0.97.

(b) Gammatone filter bank filtering, using the following Gammatone filter bank,

G_{i} (z) = \frac{T_{s} - T_{s} a_{3} [a_{1} + (\sqrt{2} - 1) a_{2}] z^{- 1}}{1 - 2 a_{1} a_{3} z^{- 1} + {a_{3}}^{2} z^{- 2}} \times \frac{T_{s} - T_{s} a_{3} [a_{1} - (\sqrt{2} - 1) a_{2}] z^{- 1}}{1 - 2 a_{1} a_{3} z^{- 1} + {a_{3}}^{2} z^{- 2}}

\times \frac{T_{s} - T_{s} a_{3} [a_{1} + (\sqrt{2} + 1) a_{2}] z^{- 1}}{1 - 2 a_{1} a_{3} z^{- 1} + {a_{3}}^{2} z^{- 2}} \times \frac{T_{s} - T_{s} a_{3} [a_{1} - (\sqrt{2} + 1) a_{2}] z^{- 1}}{1 - 2 a_{1} a_{3} z^{- 1} + {a_{3}}^{2} z^{- 2}}, 1 \leq i \leq CH

= G_{1, i} (z) \cdot G_{2, i} (z) \cdot G_{3, i} (z) \cdot G_{4, i} (z)

a_{1} = \cos (ω_{i} T_{s}), 1 \leq i \leq CH

a_{2} = \sin (ω_{i} T_{s}), 1 \leq i \leq CH

a_{3} = e^{- b T_{s}}

ω_{i} = 2 π f_{i}, 1 \leq i \leq CH

f_{i} = (f_{H} + 228.7) \exp (i \times v) - 228.7

= (f_{H} + 228.7) \exp (i \times \frac{\ln \frac{f_{L} + 228.7}{f_{H} + 228.7}}{CH}) - 228.7,1 \leq i \leq CH

v = \frac{\ln \frac{f_{L} + 228.7}{f_{H} + 228.7}}{CH}

where T_s is the sampling period and CH is the number of channels of the Gammatone filter bank; the present invention suggests CH = 64. The parameter v is the overlap factor between filters and represents the degree of overlap between adjacent filters. f_H and f_L are the cut-off frequencies of the filter bank: f_H is set to the sampling rate of the input signal, and f_L ranges from 10 to 100 Hz, with 50 Hz suggested. Passing x_w(n) through each of the 64 filters gives the filtered outputs y_i(n):

y_{i} (n) = x_{w} (n) * g_{1, i} (n) * g_{2, i} (n) * g_{3, i} (n) * g_{4, i} (n), i = 0, 1, . . ., 63

where '*' denotes convolution in the digital signal processing sense;
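The center-frequency formula used in step (b) can be evaluated directly. The sketch below is illustrative only; the function name is hypothetical, and the example arguments simply reuse the suggested values CH = 64 and f_L = 50 Hz together with f_H equal to the 48 kHz sampling rate.

```python
import math

def center_frequencies(f_low_hz, f_high_hz, channels):
    """f_i = (f_H + 228.7) * exp(i * v) - 228.7 for i = 1 .. CH,
    with v = ln((f_L + 228.7) / (f_H + 228.7)) / CH."""
    v = math.log((f_low_hz + 228.7) / (f_high_hz + 228.7)) / channels
    return [(f_high_hz + 228.7) * math.exp(i * v) - 228.7
            for i in range(1, channels + 1)]

freqs = center_frequencies(f_low_hz=50.0, f_high_hz=48000.0, channels=64)
```

Because v is negative, the frequencies decrease with i, and the last channel (i = CH) lands on f_L.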

(c) Compute the energy of each channel. The Gammatone filter bank outputs are divided into frames of length N, where 240 ≤ N ≤ 960; the present invention suggests N = 480 (10 ms at a 48 kHz sampling frequency). For the current frame, the logarithmic energy of each channel's filter output is computed:

E (m) = \log_{e} Σ_{n = start}^{start + N - 1} {[y_{m} (n)]}^{2}, m = 0,1, . . ., 63

where start is the starting position of the current frame in the signal x(n);

(d) Compress the per-channel log energies with the discrete cosine transform to obtain the Gammatone frequency cepstral coefficients (GFCC).

C^{(p)} (0) = \sqrt{\frac{2}{L}} Σ_{m = 0}^{CH - 1} E (m), l = 0

C^{(p)} (l) = \frac{2}{\sqrt{L}} Σ_{m = 0}^{CH - 1} E (m) \cos [\frac{πl (2 m + 1)}{2 CH}], 1 \leq l < L

where L is the order of the GFCC, 16 ≤ L ≤ 64; the present invention suggests L = 32.
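Steps (c) and (d) can be sketched in Python as follows, taking the per-channel filter outputs for one frame as input. The function names and the small energy floor that guards against log(0) are assumptions added for illustration; the scaling factors follow the two DCT expressions above.

```python
import math

def channel_log_energies(channel_frames, floor=1e-12):
    """E(m) = ln(sum of squared samples) for each channel's current frame.

    The floor is an illustrative safeguard for silent frames.
    """
    return [math.log(max(sum(v * v for v in frame), floor))
            for frame in channel_frames]

def gfcc_from_log_energies(log_energies, num_coeffs=32):
    """DCT of the log energies, with L = num_coeffs output coefficients."""
    ch = len(log_energies)
    coeffs = [math.sqrt(2.0 / num_coeffs) * sum(log_energies)]
    for l in range(1, num_coeffs):
        acc = sum(log_energies[m] * math.cos(math.pi * l * (2 * m + 1) / (2 * ch))
                  for m in range(ch))
        coeffs.append(2.0 / math.sqrt(num_coeffs) * acc)
    return coeffs

# 64 channels, one 480-sample frame each (constant data, for illustration).
frames = [[1.0] * 480 for _ in range(64)]
coeffs = gfcc_from_log_energies(channel_log_energies(frames))
```

With identical energy in every channel, only the zeroth coefficient is non-zero, which is a convenient sanity check for the transform.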

Noise detection

Speech frames and noise frames are distinguished by comparing the difference in GFCC between adjacent frames.

The detailed procedure is as follows:

Compute the Euclidean distance Dis between the GFCC vector C^{(p)}(l) of the current frame (frame p) and the smoothed GFCC vector C_{aver}^{p-1}(l) of the previous frame (frame p-1),

Dis = \sqrt{Σ_{l = 0}^{L - 1} {[C^{(p)} (l) - C_{aver}^{p - 1} (l)]}^{2}}

The smoothed GFCC vector C_{aver}^{p-1}(l) is updated as

C_{aver}^{p - 1} (l) = β \cdot C_{aver}^{p - 2} (l) + (1 - β) \cdot C^{(p - 1)} (l)

where 0 < β < 1 is the smoothing factor of the GFCC vector; the present invention suggests β = 0.6.

Noise frames are detected with a soft-threshold decision based on signal energy. First the input signal energies E(p) and E(p-1) of the current and previous frames are computed, and the threshold is set from them as thres = q[E(p) + E(p-1)]/2, where 0.01 ≤ q ≤ 100; the present invention suggests q = 0.25. When the GFCC distance Dis exceeds the threshold thres, the current frame is judged to contain transient noise.
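The detection rule can be sketched in Python as follows. The function names are hypothetical and the frame energies are passed in as precomputed numbers, since the text does not spell out how E(p) is calculated.

```python
import math

def update_smoothed(prev_smoothed, prev_coeffs, beta=0.6):
    """C_aver^{p-1}(l) = beta * C_aver^{p-2}(l) + (1 - beta) * C^{(p-1)}(l)."""
    return [beta * s + (1.0 - beta) * c
            for s, c in zip(prev_smoothed, prev_coeffs)]

def is_transient(curr_coeffs, smoothed_prev, energy_curr, energy_prev, q=0.25):
    """Return (decision, Dis, thres) for the current frame."""
    dis = math.sqrt(sum((c - s) ** 2
                        for c, s in zip(curr_coeffs, smoothed_prev)))
    thres = q * (energy_curr + energy_prev) / 2.0
    return dis > thres, dis, thres

# Toy values for illustration only.
flag, dis, thres = is_transient([3.0, 4.0, 0.0, 0.0], [0.0] * 4, 1.0, 1.0)
```

Because the threshold scales with the frame energies, the same cepstral jump is less likely to be flagged in a loud passage than in a quiet one.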

Speech signal reconstruction based on bidirectional linear prediction

Based on the waveforms of the adjacent speech, an interpolation reconstruction algorithm generates a replacement for the noise-contaminated speech frame. First, bidirectional linear prediction is performed on the frames before and after the noisy frame, inverse filters are designed from the linear prediction coefficients, and the residual signals are computed. The pitch period is then estimated from the residual signals with a pitch detection algorithm. From the residual signals and pitch periods of the adjacent frames, the excitation signal of the current noisy frame is generated; the current frame is reconstructed from the excitation signal and the linear prediction coefficients of the previous frame, and is smoothed against the adjacent frames by fading in and out, thereby suppressing the transient noise in the speech. The functional block diagram of the reconstruction method based on bidirectional linear prediction is shown in Fig. 6.

Let D denote the delay of the output signal, which is used to blend the boundaries of the current frame with the adjacent frames, with 16 ≤ D ≤ 48, and let Lev denote the order of the linear prediction filter, with 10 ≤ Lev ≤ 30;

The reconstruction method based on bidirectional linear prediction generates an estimate of the current frame from the samples of the adjacent frames. The B samples immediately preceding the current frame are therefore stored as history data for estimating the forward linear prediction coefficients and the forward excitation signal, and are denoted buf_f(n); correspondingly, the B samples following the current frame, used for estimating the backward linear prediction coefficients and the backward excitation signal, are denoted buf_b(n). The present invention suggests D = 24, Lev = 20 and B = 1.5N. For ease of description and notation, all symbols in the remainder of this description refer to the current frame, and the frame index p is no longer marked,

Concrete grammar is as follows:

(1) From the result of the transient noise detection module, when the current frame is detected as noisy and the previous frame as noise-free, linear prediction analysis is performed on the history data in the buffer. First x_e(n) is windowed to give x_w(n) = x_e(n) w(n), where the window is the Hamming window w(n) = 0.54 - 0.46 cos[(2n+1)π/N], n = 0, 1, ..., N-1. The autocorrelation function of x_w(n) is,

r_{corr} (m) = Σ_{n = 0}^{B / 2 - 1 - m} {buf}_{f} (n + B / 2) \cdot {buf}_{f} (n + B / 2 + m), 0 \leq m < Lev

The forward linear prediction coefficients a_f(i) are then computed with the Levinson-Durbin algorithm.
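For reference, the Levinson-Durbin recursion mentioned in step (1) can be sketched in Python as follows. This is the standard textbook recursion rather than code from the patent; the function names are hypothetical, the autocorrelation is computed over the whole input instead of the second half of the buffer, and the coefficient sign convention matches a predictor of the form x(n) ≈ Σ a(i) x(n - i).

```python
def autocorrelation(samples, max_lag):
    """r(m) = sum_n x(n) x(n + m) for m = 0 .. max_lag."""
    count = len(samples)
    return [sum(samples[n] * samples[n + m] for n in range(count - m))
            for m in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for predictor coefficients a(1..order) from r(0..order).

    Returns (coefficients, final prediction error energy).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # degenerate input; keep the coefficients found so far
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= (1.0 - k * k)
    return a[1:], error

# A decaying exponential is perfectly predicted by a(1) = 0.9.
signal = [0.9 ** n for n in range(200)]
coeffs, err = levinson_durbin(autocorrelation(signal, 2), order=2)
```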

(2) An inverse filter is designed from the forward linear prediction coefficients and applied to buf_f(n) to obtain the residual signal e_f(n):

e_{f} (n) = {buf}_{f} (n + Lev) - Σ_{i = 1}^{Lev} a_{f} (i) {buf}_{f} (n + Lev - i), 0 \leq n < B - Lev
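The inverse filtering of step (2) amounts to subtracting the linear prediction from each sample. A minimal Python sketch, with a hypothetical function name and a toy buffer:

```python
def lp_residual(buffer, coeffs):
    """e(n) = buf(n + Lev) - sum_{i=1..Lev} a(i) * buf(n + Lev - i),
    for 0 <= n < len(buffer) - Lev, where Lev = len(coeffs)."""
    lev = len(coeffs)
    return [buffer[n + lev]
            - sum(coeffs[i - 1] * buffer[n + lev - i] for i in range(1, lev + 1))
            for n in range(len(buffer) - lev)]

# A first-order predictor removes a decaying exponential almost completely.
residual = lp_residual([0.9 ** n for n in range(50)], [0.9])
```

The residual is Lev samples shorter than the buffer, which is why later index ranges run up to B - Lev.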

(3) Forward pitch detection

This method performs pitch detection on the residual signal; the functional block diagram of forward pitch detection is shown in Fig. 7. Pitch detection proceeds as follows:

(a) Low-pass filtering

Because pitch detection results are often affected by the formant frequencies, the residual signal is first low-pass filtered to remove the high-frequency formants as far as possible. The pitch period of different speakers generally lies between 2 and 12 ms. The pass-band cut-off frequency f_p of the low-pass filter takes values in 0.8 kHz < f_p < 1.2 kHz; the present invention sets f_p = 0.9 kHz,

(b) Center clipping

The pitch information of a speech signal lies mainly in its envelope, whereas much of the formant information lies in the low-amplitude part. To reduce the influence of the formants, the low-pass filtered residual signal is processed nonlinearly with a center clipping function, defined as:

e_{fc} (n) = \begin{cases} e_{f} (n) - T_{c}, & e_{f} (n) > T_{c} \\ 0, & | e_{f} (n) | \leq T_{c} \\ e_{f} (n) + T_{c}, & e_{f} (n) < - T_{c} \end{cases},

where the threshold T_c is the clipping level, T_c = γ max{e_f(n), 0 ≤ n < B - Lev}; the present invention suggests γ = 0.4;

(c) Pitch period estimation

Compute the normalized autocorrelation of e_fc(n) and search for the position of its maximum in the range (P_MIN, P_MAX); this position is the pitch period estimate P_f,

r_{fc} (m) = \frac{Σ_{n = B - Lev - C}^{B - Lev - 1} e_{fc} (n - m) e_{fc} (n)}{\sqrt{Σ_{n = B - Lev - C}^{B - Lev - 1} e_{fc} (n - m) e_{fc} (n - m)}}, P_{MIN} \leq m \leq P_{MAX}

P_{f} = \arg \max_{P_{MIN} \leq m \leq P_{MAX}} r_{fc} (m)

where C is the averaging length of the autocorrelation, and P_MIN and P_MAX are the minimum and maximum of the pitch period search range. The suggested value of C is 150, and P_MIN and P_MAX are 96 and 576, the numbers of samples corresponding to 2 ms and 12 ms;
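Sub-steps (b) and (c) can be sketched in Python as follows; low-pass filtering is omitted. The function names are hypothetical, the clipping level is taken from the largest absolute residual value (an assumption about how the maximum is meant), and lags with a zero denominator are scored as zero.

```python
import math

def center_clip(residual, gamma=0.4):
    """Zero out samples within +/- T_c and shift the rest toward zero.

    T_c = gamma * max |e_f(n)|; using the absolute value is an assumption.
    """
    t_c = gamma * max(abs(v) for v in residual)
    clipped = []
    for v in residual:
        if v > t_c:
            clipped.append(v - t_c)
        elif v < -t_c:
            clipped.append(v + t_c)
        else:
            clipped.append(0.0)
    return clipped

def estimate_pitch(clipped, p_min, p_max, window):
    """Lag in [p_min, p_max] maximizing the normalized autocorrelation
    computed over the last `window` samples of the clipped residual."""
    end = len(clipped)
    start = end - window
    if start - p_max < 0:
        raise ValueError("residual too short for the requested search range")
    best_lag, best_score = p_min, -math.inf
    for m in range(p_min, p_max + 1):
        num = sum(clipped[n - m] * clipped[n] for n in range(start, end))
        den = math.sqrt(sum(clipped[n - m] ** 2 for n in range(start, end)))
        score = num / den if den > 0.0 else 0.0
        if score > best_score:
            best_lag, best_score = m, score
    return best_lag

# Pulse train with a period of 100 samples, for illustration.
pulses = [1.0 if n % 100 == 0 else 0.0 for n in range(1000)]
period = estimate_pitch(center_clip(pulses), p_min=50, p_max=150, window=150)
```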

(4) Backward pitch detection

If the current frame is detected as noise-free, the backward pitch period is detected using steps similar to the forward pitch detection in step (3). First, linear prediction analysis is performed on the data in the backward buffer buf_b(n) to obtain the backward linear prediction coefficients a_b(i); a backward inverse filter is designed from these coefficients and applied to buf_b(n) to obtain the backward residual signal e_b(n); pitch detection on the backward residual signal then gives the backward pitch period estimate P_b,

(5) Pitch period correction

To guard against inaccurate pitch period estimates caused by pitch doubling, the present invention smooths the forward pitch period P_f and the backward pitch period P_b and estimates the pitch period P_c of the current frame from the smoothed values, as follows,

P_{c} = P_{f} + \frac{P_{b} - P_{f}}{2},

where δ is the decision threshold on the pitch period difference, determined by the difference between the pitch periods of adjacent speech frames; the present invention suggests δ = 10;

(6) Generation of the current-frame excitation signal

The excitation signal of the current noisy frame is estimated from the residual signals of the adjacent frames and the estimated pitch period. e_f(n) and e_b(n) are each extended periodically with period P_c, giving the forward excitation signal \tilde{e}_f(n) and the backward excitation signal \tilde{e}_b(n) of the current frame:

{\tilde{e}}_{f} (n) = \begin{cases} e_{f} (B - Lev - P_{c} - D + n), & 0 \leq n < D + P_{c} \\ {\tilde{e}}_{f} (n - P_{c}), & D + P_{c} \leq n < N + 2 D \end{cases}

{\tilde{e}}_{b} (n) = \begin{cases} e_{b} (n - (N + D - P_{c})), & N + D - P_{c} \leq n < N + 2 D \\ {\tilde{e}}_{b} (n + P_{c}), & N + D - P_{c} > n \geq 0 \end{cases}

So that the reconstructed signal can be overlap-added with the adjacent frames to smooth the boundaries, the present invention sets the excitation signal length to N + 2D, where D is the length of the overlap region,
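The two periodic extensions of step (6) can be sketched in Python as follows. The function names are hypothetical, and the residual length stands in for B - Lev; each residual must contain at least P_c + D samples.

```python
def extend_forward(e_f, p_c, frame_len, overlap):
    """Copy the last D + P_c residual samples, then repeat with period P_c."""
    total = frame_len + 2 * overlap
    offset = len(e_f) - p_c - overlap  # B - Lev - P_c - D
    out = []
    for n in range(total):
        if n < overlap + p_c:
            out.append(e_f[offset + n])
        else:
            out.append(out[n - p_c])
    return out

def extend_backward(e_b, p_c, frame_len, overlap):
    """Copy the first D + P_c residual samples into the tail, then fill
    earlier positions by repeating with period P_c, working backward."""
    total = frame_len + 2 * overlap
    first_copied = frame_len + overlap - p_c
    out = [0.0] * total
    for n in range(total - 1, -1, -1):
        if n >= first_copied:
            out[n] = e_b[n - first_copied]
        else:
            out[n] = out[n + p_c]
    return out

# Toy residual (a ramp) so that the copied and repeated samples are easy to see.
ramp = [float(i) for i in range(100)]
fwd = extend_forward(ramp, p_c=10, frame_len=20, overlap=4)
bwd = extend_backward(ramp, p_c=10, frame_len=20, overlap=4)
```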

(7) Waveform reconstruction

The current-frame speech signal is reconstructed from the generated excitation signals and the corresponding linear prediction coefficients,

\begin{cases} {\tilde{x}}_{f} (n) = Σ_{i = 1}^{Lev} a_{f} (i) {\tilde{x}}_{f} (n - i) + {\tilde{e}}_{f} (n), & 0 \leq n < N + 2 D \\ {\tilde{x}}_{b} (n) = Σ_{i = 1}^{Lev} a_{b} (i) {\tilde{x}}_{b} (n - i) + {\tilde{e}}_{b} (n), & N + 2 D > n \geq 0 \end{cases};

The Lev samples in the buffers nearest to the current frame are assigned to \tilde{x}_f(n) and \tilde{x}_b(n), respectively, as the initial state of the reconstructed signals;

\begin{cases} {\tilde{x}}_{f} (n) = {buf}_{f} (B - D + n), & - Lev \leq n \leq - 1 \\ {\tilde{x}}_{b} (n) = {buf}_{b} (n - (N + 2 D)), & N + 2 D \leq n < N + 2 D + Lev \end{cases},

The middle N samples of the reconstructed signal replace the current noisy frame, and the D samples at each end are overlap-added with the preceding and following frames with a fade-in and fade-out, which smooths the reconstructed signal and keeps it continuous with the data on both sides,
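The forward half of the synthesis in step (7) is an all-pole filter driven by the excitation and primed with the most recent buffer samples. A minimal Python sketch, with a hypothetical function name and toy values:

```python
def synthesize_forward(excitation, coeffs, history):
    """x(n) = sum_{i=1..Lev} a(i) * x(n - i) + e(n).

    `history` supplies the initial state; its last Lev samples are used,
    most recent sample last.
    """
    lev = len(coeffs)
    state = list(history[-lev:])
    out = []
    for e in excitation:
        value = e + sum(coeffs[i] * state[-1 - i] for i in range(lev))
        out.append(value)
        state.append(value)
    return out

# First-order example: zero excitation decays from the initial state.
decay = synthesize_forward([0.0, 0.0, 0.0], coeffs=[0.5], history=[1.0])
```

The backward signal can be produced the same way on time-reversed data, which is one common way to implement a backward predictor; the text does not specify the implementation.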

(8) Boundary blending of the reconstructed signal with the preceding and following frames

To smooth the boundary between the reconstructed signal and the adjacent frames, the last D samples of the forward buffer buf_f(n) and the first D samples of the forward reconstructed signal \tilde{x}_f(n) are overlap-added with a triangular window; the result is used to update the last D samples of the previous frame and the buffer buf_f(n),

x^{(p - 1)} (n) = \frac{n + 1}{D + 1} {\tilde{x}}_{f} (n) + \frac{D - n}{D + 1} {buf}_{f} (B - D - n), 0 \leq n < D,

{buf}_{f} (n) = \begin{cases} x^{(p - 1)} (n - B + 2 N), & 0 \leq n < B - N \\ {\tilde{x}}_{f} (n - (B - N)), & B - N \leq n < B \end{cases},

(a) Linear weighting

The forward linear prediction signal predicts the first half of the current frame more accurately, and the backward linear prediction signal the second half. The two predicted signals are therefore linearly weighted to obtain the final reconstructed signal; when the backward prediction is unavailable, the noisy speech frame is replaced using the forward prediction signal alone,

{\tilde{x}}^{(p)} (n) = \frac{N - n}{N + 1} {\tilde{x}}_{f} (n + D) + \frac{n + 1}{N + 1} {\tilde{x}}_{b} (n + D), 0 \leq n < N,
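The linear weighting above is a cross-fade whose weights sum to one at every sample. A minimal Python sketch, with a hypothetical function name and constant toy inputs:

```python
def blend_predictions(x_fwd, x_bwd, frame_len, overlap):
    """x(n) = (N - n)/(N + 1) * x_f(n + D) + (n + 1)/(N + 1) * x_b(n + D)."""
    out = []
    for n in range(frame_len):
        w_fwd = (frame_len - n) / (frame_len + 1)
        w_bwd = (n + 1) / (frame_len + 1)
        out.append(w_fwd * x_fwd[n + overlap] + w_bwd * x_bwd[n + overlap])
    return out

# Forward prediction stuck at 1.0, backward at 3.0: the output ramps between them.
blended = blend_predictions([1.0] * 8, [3.0] * 8, frame_len=4, overlap=2)
```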

(b) Several consecutive noisy frames:

In this case the LP coefficients and excitation signal of the previous frame are no longer recomputed; the corresponding estimates from the previous frame are used instead, and the synthesis filtering, boundary blending and linear weighting of steps (6) to (8) are then carried out.

Beneficial effects of the technical solution of the present invention

The transient noise suppression results are evaluated with the segmental signal-to-noise ratio SNR_seg and the segmental log-spectral distortion LSD_seg, defined respectively as

{SNR}_{seg} = \frac{1}{N_{t}} Σ_{k = 1}^{N_{t}} 10 \cdot \log_{10} \frac{\underset{n &Element; {frm}_{k}}{Σ} {| x (n) |}^{2}}{\underset{n &Element; {frm}_{k}}{Σ} {| \hat{x} (n) - x (n) |}^{2}},

{LSD}_{seg} = \frac{1}{N_{t}} Σ_{l = 0}^{N_{t} - 1} {\frac{2}{N} Σ_{k = 0}^{N / 2 - 1} {[10 \cdot \log_{10} TX (k, l) - 10 \cdot \log_{10} T \hat{X} (k, l)]}^{2}}^{\frac{1}{2}},

where X is the short-time Fourier transform of the original speech, \hat{X} is the short-time Fourier transform of the speech under test, N_t is the number of frames of the speech under test, and TX is defined as follows:

TX (k, l) = \max \{ {| X (k, l) |}^{2}, δ \},

δ = 10^{- \frac{50}{10}} \max_{k, l} {| X (k, l) |}^{2},
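To make the segmental signal-to-noise ratio concrete, here is a minimal Python sketch of the SNR_seg definition. The function name `segmental_snr`, the non-overlapping framing and the small numerical floor are assumptions for illustration; the log-spectral distortion is omitted because it additionally requires a short-time Fourier transform.

```python
import math

def segmental_snr(x, x_hat, frame_len):
    """Average per-frame SNR in dB between reference x and test signal x_hat."""
    n_frames = min(len(x), len(x_hat)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")
    total = 0.0
    for k in range(n_frames):
        seg = range(k * frame_len, (k + 1) * frame_len)
        sig = sum(x[n] ** 2 for n in seg)
        err = sum((x_hat[n] - x[n]) ** 2 for n in seg)
        # Assumption: a tiny floor avoids log(0) and division by zero;
        # the source does not specify how silent or error-free frames are handled.
        total += 10.0 * math.log10(max(sig, 1e-12) / max(err, 1e-12))
    return total / n_frames
```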

Beneficial effects obtainable with the present invention

The technical solution of the present invention is compared here with an impulsive noise detection and suppression method given by S. V. Vaseghi in the monograph "Advanced Digital Signal Processing and Noise Reduction" (3rd ed. New York: Wiley, 2006). The segmental signal-to-noise ratio and segmental log-spectral distortion results are shown in Fig. 8 and Fig. 9. As Fig. 8 shows, under three different input signal-to-noise ratios the scheme of the present invention yields a larger segmental SNR improvement than the traditional detection-and-recovery technique. As Fig. 9 shows, the segmental log-spectral distortion of the present scheme is smaller than that of the traditional detection-and-recovery technique, indicating better performance in terms of frequency-domain distortion. A certain gap relative to the original speech remains, mainly because the reconstructed signal still exhibits spectral distortion even when all transient noise frames are correctly detected.

Figures 10 ~ 13 are spectrograms showing, respectively: an example of original speech; the speech of Fig. 10 after noise is added; the result of processing the speech of Fig. 11 with the impulsive noise detection and suppression method given in S. V. Vaseghi's monograph "Advanced Digital Signal Processing and Noise Reduction" (3rd ed. New York: Wiley, 2006); and the result of processing the speech of Fig. 11 with the present invention.

As Fig. 12 shows, the traditional detection-and-recovery technique misses some transient noise, and the repaired speech is not smooth and tends to contain new frequency components. As Fig. 13 shows, the present invention detects the noise and reconstructs the speech signal effectively, leaving less residual noise after reconstruction than the traditional algorithm.

The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent replacement or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, such as changing the sampling frequency from 48 kHz to 44.1 kHz, 32 kHz, 16 kHz, 8 kHz, etc., shall fall within the protection scope of the present invention.

The abbreviations and key terms used in the present invention are defined as follows:

AR model: AutoRegressive model.

BLP: Bidirectional Linear Prediction.

DCT: Discrete Cosine Transform.

FEC: Forward Error Correction.

GFCC: Gammatone Frequency Cepstral Coefficient.

LPF: Low-Pass Filter.

LSD: Log-Spectral Distortion.

PLC: Packet Loss Concealment.

PLR: Packet Loss Recovery.

PSTN: Public Switched Telephone Network.

PWR: Pitch Waveform Replication.

SNR: Signal-to-Noise Ratio.

VoIP: Voice over IP, i.e. voice carried over an IP network.

Claims

1. A method for suppressing transient noise in speech, characterized in that it comprises three modules: a Gammatone frequency cepstral coefficient extraction module, a transient noise detection module and a speech signal reconstruction module; the input of the Gammatone frequency cepstral coefficient extraction module receives the noisy speech signal, and its output is connected to the input of the transient noise detection module; the output of the transient noise detection module is connected to the input of the speech signal reconstruction module; the input of the speech signal reconstruction module, besides receiving the external noisy speech signal, is also connected to the output of the transient noise detection module; the speech signal reconstruction module outputs the denoised speech; the Gammatone frequency cepstral coefficient extraction module extracts the Gammatone frequency cepstral coefficients from the input noisy speech signal; the transient noise detection module decides, from the difference between the Gammatone frequency cepstral coefficients of adjacent frames, whether the current speech frame contains transient noise; if it does, the speech signal reconstruction module reconstructs the current speech frame, and the reconstructed frame replaces the current speech frame and is output; if it does not, the current speech frame is output directly without processing.

2. The method for suppressing transient noise in speech according to claim 1, characterized in that the processing steps of the Gammatone frequency cepstral coefficient extraction module are as follows:

(a) Pre-emphasize the original noisy speech x(n) to boost its high-frequency components; let the original noisy speech signal be x(n) and the pre-emphasized speech signal be x_e(n),

x_{e} (n) = x (n) - α x (n - 1),

where α is the pre-emphasis factor, taken as 0.97;
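The pre-emphasis step is a first-order difference filter. A minimal Python sketch follows; the function name `pre_emphasis` and the choice of zero for the sample before the start of the signal are assumptions for illustration.

```python
def pre_emphasis(x, alpha=0.97):
    """Return x_e(n) = x(n) - alpha * x(n - 1).

    Assumption: the sample before the start of the signal is taken as 0.
    """
    out = []
    prev = 0.0
    for sample in x:
        out.append(sample - alpha * prev)
        prev = sample
    return out
```

A constant input is attenuated to 3% of its level after the first sample, which shows how low-frequency content is suppressed relative to high-frequency content.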

(b) Gammatone filter bank filtering: the signal is filtered with the following Gammatone band-pass filter bank,

G_{i} (z) = \frac{T_{s} - T_{s} a_{3} [a_{1} + (\sqrt{2} - 1) a_{2}] z^{- 1}}{1 - 2 a_{1} a_{3} z^{- 1} + {a_{3}}^{2} z^{- 2}} \times \frac{T_{s} - T_{s} a_{3} [a_{1} - (\sqrt{2} - 1) a_{2}] z^{- 1}}{1 - 2 a_{1} a_{3} z^{- 1} + {a_{3}}^{2} z^{- 2}}

\times \frac{T_{s} - T_{s} a_{3} [a_{1} + (\sqrt{2} + 1) a_{2}] z^{- 1}}{1 - 2 a_{1} a_{3} z^{- 1} + {a_{3}}^{2} z^{- 2}} \times \frac{T_{s} - T_{s} a_{3} [a_{1} - (\sqrt{2} + 1) a_{2}] z^{- 1}}{1 - 2 a_{1} a_{3} z^{- 1} + {a_{3}}^{2} z^{- 2}}

= G_{1, i} (z) \cdot G_{2, i} (z) \cdot G_{3, i} (z) \cdot G_{4, i} (z), 1 \leq i \leq CH,

a_{1} = \cos (ω_{i} T_{s}), 1 \leq i \leq CH,

a_{2} = \sin (ω_{i} T_{s}), 1 \leq i \leq CH,

a_{3} = e^{- b T_{s}},

ω_{i} = 2 π f_{i}, 1 \leq i \leq CH,

f_{i} = (f_{H} + 228.7) \exp (i \times v) - 228.7

= (f_{H} + 228.7) \exp (i \times \frac{\ln \frac{f_{L} + 228.7}{f_{H} + 228.7}}{CH}) - 228.7,1 \leq i \leq CH,

v = \frac{\ln \frac{f_{L} + 228.7}{f_{H} + 228.7}}{CH},

where T_s is the sampling period; CH is the number of channels of the Gammatone band-pass filter bank, CH = 64; the parameter v is the overlap factor, which indicates the degree of overlap between adjacent filters; f_H is the cutoff frequency of the filter bank, whose value is the sampling rate of the input signal; f_L ranges over 10 ~ 100 Hz and is taken as 50 Hz. Passing x_w(n) through each of the 64 filters yields the filtered outputs y_i(n);

y_{i} (n) = x_{w} (n) * g_{1, i} (n) * g_{2, i} (n) * g_{3, i} (n) * g_{4, i} (n), i = 0, 1, ..., 63;
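The center-frequency formula of step (b) can be checked numerically. The sketch below evaluates f_i for i = 1..CH; the function name `gammatone_center_frequencies` is hypothetical, and the value passed for f_H in the usage note is only an example input, not a value taken from the patent.

```python
import math

def gammatone_center_frequencies(f_low, f_high, channels=64):
    """Center frequencies f_i, i = 1..CH, following the formula in step (b)."""
    # Overlap factor v = ln((f_L + 228.7) / (f_H + 228.7)) / CH (negative).
    v = math.log((f_low + 228.7) / (f_high + 228.7)) / channels
    return [(f_high + 228.7) * math.exp(i * v) - 228.7
            for i in range(1, channels + 1)]
```

For instance, `gammatone_center_frequencies(50.0, 24000.0)` gives 64 frequencies that decrease monotonically with i, and the last one (i = CH) equals f_L, as the formula implies.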

(c) Compute the energy of each channel signal: the GT filter bank output signals are divided into frames of length N, where N ranges over 240 ≤ N ≤ 960 and is taken as 480, which corresponds to 10 ms at a sampling frequency of 48 kHz; the log energy of each channel's filter output within the current frame is then computed;

E (m) = \log_{e} Σ_{n = start}^{start + N - 1} {[y_{m} (n)]}^{2}, m = 0, 1, ..., 63

(d) Compress the channel energies with the discrete cosine transform to obtain the Gammatone frequency cepstral coefficients;

C^{(p)} (0) = \sqrt{\frac{2}{L}} Σ_{m = 0}^{CH - 1} E (m), l = 0

C^{(p)} (l) = \frac{2}{\sqrt{L}} Σ_{m = 0}^{CH - 1} E (m) \cos [\frac{π l (2 m + 1)}{2 CH}], 1 \leq l < L

where L is the order of the Gammatone frequency cepstral coefficients, L ranges over 16 ≤ L ≤ 64, and L is taken as 32.
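Steps (c) and (d) can be sketched in a few lines of Python. This is an illustrative reading of the two formulas with the scale factors exactly as written above; the function names, the list-of-lists layout for the channel outputs and the small floor inside the logarithm are assumptions.

```python
import math

def frame_log_energies(y, start, n_frame):
    """E(m): natural log of each channel's energy within one frame.

    y is a list of channel signals (one list of samples per channel).
    """
    # Assumption: a tiny floor avoids log(0) on silent frames; not in the source.
    return [math.log(max(sum(s * s for s in ch[start:start + n_frame]), 1e-12))
            for ch in y]

def gfcc_from_energies(energies, order=32):
    """Apply the cosine transform of step (d) to the channel log energies."""
    ch = len(energies)
    coeffs = [math.sqrt(2.0 / order) * sum(energies)]        # l = 0
    for l in range(1, order):                                # 1 <= l < L
        acc = sum(e * math.cos(math.pi * l * (2 * m + 1) / (2 * ch))
                  for m, e in enumerate(energies))
        coeffs.append(2.0 / math.sqrt(order) * acc)
    return coeffs
```

When all channels carry the same energy, every coefficient with l ≥ 1 vanishes, so the higher-order coefficients describe only the shape of the spectrum across channels.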

3. The method for suppressing transient noise in speech according to claim 1, characterized in that the detection procedure of the transient noise detection module is as follows:

Dis = \sqrt{Σ_{l = 0}^{L - 1} {[C^{(p)} (l) - C_{aver}^{p - 1} (l)]}^{2}};

C_{aver}^{p - 1} (l) = β \cdot C_{aver}^{p - 2} (l) + (1 - β) \cdot C^{(p - 1)} (l);

where β is the smoothing factor of the Gammatone frequency cepstral coefficients, β = 0.6. A soft-threshold decision method based on noise energy is used to detect noise frames: the input signal energies E(p) and E(p-1) of the current frame and the previous frame are computed first, and the threshold is set according to the signal energy as thres = q [E(p) + E(p-1)] / 2, with q taken as 0.25; when the Gammatone frequency cepstral coefficient vector distance Dis is greater than the threshold thres, the current frame is judged to contain transient noise.
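The detection rule can be sketched as a small stateful class. The class name `TransientDetector`, the handling of the first frame and the choice to update the running average on every frame (including frames flagged as noisy) are assumptions; the source does not specify them.

```python
import math

class TransientDetector:
    """Flags frames whose GFCC vector departs sharply from its running average."""

    def __init__(self, beta=0.6, q=0.25):
        self.beta = beta
        self.q = q
        self.c_aver = None       # smoothed GFCC of earlier frames, C_aver^{p-1}
        self.prev_energy = None  # frame energy E(p-1)

    def step(self, gfcc, energy):
        """Return True if the current frame is judged to contain transient noise."""
        if self.c_aver is None:
            # Assumption: the first frame only initialises the state.
            self.c_aver = list(gfcc)
            self.prev_energy = energy
            return False
        dis = math.sqrt(sum((c - a) ** 2 for c, a in zip(gfcc, self.c_aver)))
        thres = self.q * (energy + self.prev_energy) / 2.0
        noisy = dis > thres
        # Recursive smoothing: C_aver <- beta * C_aver + (1 - beta) * C.
        self.c_aver = [self.beta * a + (1.0 - self.beta) * c
                       for a, c in zip(self.c_aver, gfcc)]
        self.prev_energy = energy
        return noisy
```

Because the threshold scales with the frame energy, a given cepstral jump is more likely to be flagged in a quiet passage than in a loud one.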

4. The method for suppressing transient noise in speech according to claim 1, characterized in that the processing method of the speech signal reconstruction module is as follows:

Based on the waveform of adjacent speech, an interpolation reconstruction algorithm regenerates the speech frame contaminated by noise. First, bidirectional linear prediction is performed on the frames before and after the noisy speech frame, an inverse filter is designed from the linear prediction coefficients, and the residual signal is computed. The pitch period is then obtained from the residual signal by a pitch detection algorithm; the excitation signal of the current noisy frame is generated from the residual signals and pitch periods of the adjacent frames; the current-frame speech signal is reconstructed from the excitation signal and the linear prediction coefficients of the previous frame, and is smoothed with the adjacent frames by fade-in and fade-out, thereby suppressing the transient noise in the speech. Let D denote the delay of the output signal, used for boundary fusion between the current frame and its adjacent frames, with 16 ≤ D ≤ 48; let Lev denote the order of the linear prediction filter, with 10 ≤ Lev ≤ 30. The speech signal reconstruction method based on bidirectional linear prediction generates the estimate of the current frame from the samples of adjacent frames, so the B samples nearest before the current frame must be stored as history data for estimating the forward linear prediction coefficients and the forward excitation signal, denoted buf_f; correspondingly, the B samples after the current frame, used for estimating the backward linear prediction coefficients and the backward excitation signal, are denoted buf_b. The values of D, Lev and B are 24, 20 and 1.5N respectively. For ease of description and notation, all symbols in the remainder of the claims refer to the current frame, and the frame index p is no longer marked,

The specific method is as follows:

r_{corr} (m) = Σ_{n = 0}^{B / 2 - 1 - m} {buf}_{f} (n + B / 2) \cdot {buf}_{f} (n + B / 2 + m), 0 \leq m < Lev;

e_{f} (n) = {buf}_{f} (n + Lev) - Σ_{i = 1}^{Lev} a_{f} (i) {buf}_{f} (n + Lev - i), 0 \leq n < B - Lev;
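The autocorrelation and residual formulas above can be tied together with a short Python sketch. How the patent obtains the coefficients a_f(i) from r_corr(m) is not shown in this section; the Levinson-Durbin recursion used here is a standard choice and is an assumption, as are the function names. The sketch also computes lags 0..Lev, one more than the range printed above, because an order-Lev predictor needs r_corr(Lev).

```python
def autocorrelation(buf, lev):
    """r_corr(m) over the second half of the buffer, for lags 0..lev."""
    seg = buf[len(buf) // 2:]
    return [sum(seg[n] * seg[n + m] for n in range(len(seg) - m))
            for m in range(lev + 1)]

def levinson_durbin(r, lev):
    """Solve the normal equations for predictor coefficients a(1..lev)."""
    a = [0.0] * (lev + 1)            # a[0] is unused
    err = r[0]                       # prediction error energy
    for i in range(1, lev + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]

def lp_residual(buf, coeffs):
    """e(n) = buf(n + Lev) - sum_i a(i) * buf(n + Lev - i), 0 <= n < B - Lev."""
    lev = len(coeffs)
    return [buf[n + lev] - sum(coeffs[i - 1] * buf[n + lev - i]
                               for i in range(1, lev + 1))
            for n in range(len(buf) - lev)]
```

A typical call chain would be `r = autocorrelation(buf_f, 20)`, `a_f = levinson_durbin(r, 20)`, `e_f = lp_residual(buf_f, a_f)`, which yields a residual of B - Lev samples.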

(3) Forward pitch detection

This method performs pitch detection on the residual signal, as follows:

(a) Low-pass filtering

Because pitch detection results are often affected by the formant frequencies, the residual signal is first low-pass filtered to remove high-frequency formants as far as possible. For different speakers the pitch period generally lies within 2 ~ 12 ms; therefore the passband cutoff frequency f_p of this low-pass filter ranges over 0.8 kHz < f_p < 1.2 kHz and is taken as 0.9 kHz,

(b) Center clipping

e_{fc} (n) = \{\begin{matrix} e_{f} (n) - T_{c}, & e_{f} (n) > T_{c} \\ 0, & | e_{f} (n) | \leq T_{c} \\ e_{f} (n) + T_{c}, & e_{f} (n) < - T_{c} \end{matrix},

where the threshold T_c is the clipping level, T_c = γ max{e_f(n), 0 ≤ n < B - Lev}, and γ is 0.4;
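Center clipping keeps only the parts of the residual that exceed the clipping level, which sharpens the periodic peaks before correlation. A minimal Python sketch follows; the function name `center_clip` is hypothetical, and taking the peak over the absolute value of the input is an assumption.

```python
def center_clip(e, gamma=0.4):
    """Center-clip a residual with clipping level T_c = gamma * peak."""
    # Assumption: the peak is measured on the absolute value of the input.
    t_c = gamma * max(abs(v) for v in e)
    out = []
    for v in e:
        if v > t_c:
            out.append(v - t_c)      # positive excursion above the level
        elif v < -t_c:
            out.append(v + t_c)      # negative excursion below the level
        else:
            out.append(0.0)          # everything inside the band is zeroed
    return out
```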

(c) Pitch period estimation

r_{fc} (m) = \frac{Σ_{n = B - Lev - C}^{B - Lev - 1} e_{fc} (n - m) e_{fc} (n)}{\sqrt{Σ_{n = B - Lev - C}^{B - Lev - 1} e_{fc} (n - m) e_{fc} (n - m)}}, P_{MIN} \leq m \leq P_{MAX};

P_{f} = \arg \max_{P_{MIN} \leq m \leq P_{MAX}} r_{fc} (m),

where C is the averaging length of the autocorrelation computation, C ranges over 120 < C < 240 and is taken as 150; P_MIN and P_MAX are the minimum and maximum of the pitch period search range, taken as 96 and 576 respectively, the numbers of samples corresponding to 2 ms and 12 ms;
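The pitch search can be sketched as a direct evaluation of r_fc(m) over the last C samples of the clipped residual. The function name `estimate_pitch` is hypothetical. The sketch skips any lag that would read before the start of the buffer; this guard is an assumption added because B - Lev - C - P_MAX is negative for the stated values (B = 720, Lev = 20, C = 150, P_MAX = 576).

```python
import math

def estimate_pitch(e_fc, p_min, p_max, c_len):
    """Return the lag in [p_min, p_max] maximising the normalised correlation."""
    end = len(e_fc)                  # corresponds to B - Lev
    start = end - c_len              # corresponds to B - Lev - C
    # Assumption: lags that would index before the buffer start are skipped.
    p_max = min(p_max, start)
    best_lag, best_val = p_min, -math.inf
    for m in range(p_min, p_max + 1):
        num = sum(e_fc[n - m] * e_fc[n] for n in range(start, end))
        den = math.sqrt(sum(e_fc[n - m] ** 2 for n in range(start, end)))
        val = num / den if den > 0.0 else 0.0
        if val > best_val:
            best_lag, best_val = m, val
    return best_lag
```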

(4) Backward pitch detection

If the current frame is detected to be a non-noisy frame, the backward pitch period is detected with a method and steps similar to the forward pitch detection in step (3): linear prediction analysis is first performed on the data in the backward buffer buf_b to obtain the backward linear prediction coefficients a_b; a backward inverse filter is designed from these coefficients and applied to buf_b to obtain the backward residual signal e_b; pitch detection on the backward residual signal then gives the backward pitch period estimate P_b;

(5) Pitch period correction

To prevent inaccurate pitch period estimates caused by pitch doubling, the forward pitch period P_f and the backward pitch period P_b are smoothed, and the pitch period P_c of the current frame is estimated from the smoothed pitch periods. The specific procedure is,

P_{c} = P_{f} + \frac{P_{b} - P_{f}}{2},

where δ is the decision threshold for the pitch period difference, determined by the difference between the pitch periods of adjacent speech frames, and is taken as 10;

(6) Generation of the current-frame excitation signal;

{\tilde{e}}_{f} (n) = \{\begin{matrix} e_{f} (B - Lev - P_{c} - D + n), & 0 \leq n < D + P_{c} \\ {\tilde{e}}_{f} (n - P_{c}), & D + P_{c} \leq n < N + 2 D \end{matrix}

{\tilde{e}}_{b} (n) = \{\begin{matrix} e_{b} (n - (N + D - P_{c})), & N + D - P_{c} \leq n < N + 2 D \\ {\tilde{e}}_{b} (n + P_{c}), & N + D - P_{c} > n \geq 0 \end{matrix}

So that the reconstructed signal can be overlap-added with the adjacent frames for boundary smoothing, the excitation signal length is set to N + 2D, where D is the length of the overlap region,
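The forward excitation formula copies the last D + P_c residual samples and then repeats them with period P_c. A minimal Python sketch, with the hypothetical name `forward_excitation` and the residual assumed to hold B - Lev samples:

```python
def forward_excitation(e_f, p_c, n_frame, d):
    """Extend the residual e_f by pitch-period repetition to N + 2D samples."""
    total = n_frame + 2 * d
    base = len(e_f) - p_c - d            # equals B - Lev - P_c - D
    if base < 0:
        raise ValueError("residual too short for this pitch period")
    out = []
    for n in range(total):
        if n < d + p_c:
            out.append(e_f[base + n])    # copy the last D + P_c residual samples
        else:
            out.append(out[n - p_c])     # repeat with period P_c
    return out
```

The backward excitation follows the same idea mirrored in time, filling the buffer from its end toward its start.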

(7) Waveform reconstruction

\{\begin{matrix} {\tilde{x}}_{f} (n) = Σ_{i = 1}^{Lev} a_{f} (i) {\tilde{x}}_{f} (n - i) + {\tilde{e}}_{f} (n), & 0 \leq n < N + 2 D \\ {\tilde{x}}_{b} (n) = Σ_{i = 1}^{Lev} a_{b} (i) {\tilde{x}}_{b} (n - i) + {\tilde{e}}_{b} (n), & N + 2 D > n \geq 0 \end{matrix};

\{\begin{matrix} {\tilde{x}}_{f} (n) = {buf}_{f} (B - D + n), & - Lev \leq n \leq - 1 \\ {\tilde{x}}_{b} (n) = {buf}_{b} (n - (N + 2 D)), & N + 2 D \leq n < N + 2 D + Lev \end{matrix},
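The forward branch of the waveform reconstruction is an all-pole synthesis filter driven by the excitation, with its initial state taken from the history buffer. The following Python sketch illustrates it under the assumption that the filter needs Lev history samples at n = -Lev..-1; the function name `synthesize_forward` is hypothetical.

```python
def synthesize_forward(excitation, a_f, buf_f, d):
    """All-pole synthesis of the forward prediction from its excitation."""
    lev = len(a_f)
    b = len(buf_f)
    # History samples for n = -Lev..-1 are read from buf_f(B - D + n).
    x = [buf_f[b - d + n] for n in range(-lev, 0)]
    for e in excitation:
        k = len(x)                   # x[k] will hold output sample k - lev
        pred = sum(a_f[i - 1] * x[k - i] for i in range(1, lev + 1))
        x.append(pred + e)
    return x[lev:]                   # drop the history, keep the new samples
```

The backward branch runs the same recursion in reverse time, seeded from the start of the backward buffer.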

(8) Boundary fusion of the reconstructed signal with the preceding and following frames

x^{(p - 1)} (n) = \frac{n + 1}{D + 1} {\tilde{x}}_{f} (n) + \frac{D - n}{D + 1} {buf}_{f} (B - D - n), 0 \leq n < D,

{buf}_{f} (n) = \{\begin{matrix} x^{(p - 1)} (n - B + 2 N), & 0 \leq n < B - N \\ {\tilde{x}}_{f} (n - (B - N)), & B - N \leq n < B \end{matrix},

(a) Linear weighting

{\tilde{x}}^{(p)} (n) = \frac{N - n}{N + 1} {\tilde{x}}_{f} (n + D) + \frac{n + 1}{N + 1} {\tilde{x}}_{b} (n + D), 0 \leq n < N,

(b) The case where multiple consecutive frames are noisy: