CN106486131A

CN106486131A - A kind of method and device of speech de-noising

Info

Publication number: CN106486131A
Application number: CN201610898662.4A
Authority: CN
Inventors: 吴威麒; 张凯磊
Original assignee: Shanghai Qian Wan Answer Cloud Computing Technology Co Ltd
Current assignee: Suzhou Qianwen wandaba Education Technology Co., Ltd
Priority date: 2016-10-14
Filing date: 2016-10-14
Publication date: 2017-03-08
Anticipated expiration: 2036-10-14
Also published as: CN106486131B

Abstract

The embodiment of the invention discloses a kind of method and device of speech de-noising.The method includes：Speech detection is carried out to Noisy Speech Signal, to distinguish speech frame and non-speech frame；Respectively noise estimation is carried out to speech frame and non-speech frame, noise power spectrum fusion estimate is obtained, wherein, the noise power spectrum fusion estimate is the fusion value of speech frame noise power spectrum estimate and non-speech frame noise power spectrum estimate；Merging estimate according to the noise power spectrum carries out denoising to the Noisy Speech Signal.The embodiment of the present invention is by adopting technique scheme, noise estimation has all been carried out to speech frame and non-speech frame, and comprehensively both noise estimated results carry out denoising to Noisy Speech Signal, can be effectively improved the denoising effect of existing voice denoising scheme, improve voice quality.

Description

A kind of method and device of speech de-noising

Technical field

The present embodiments relate to voice process technology, more particularly to a kind of method and device of speech de-noising.

Background technology

During real-time speech communicating, various noise jamming problems can be run into, especially for mobile devices such as mobile phones For, voice noise problem seems especially prominent.Additionally, in the case of sound is played by loudspeaker, due to there is echo Problem, so for remote recording, in this case the tonequality of voice is highly prone to external environment noise and non-linear The impact of residual echo.

In order to improve voice communication quality, need to carry out denoising to voice, to improve the definition of voice.Traditional Speech de-noising algorithm usually assumes that noise is additivity and smoothly, using voice activity detection (Voice Activity Detection, VAD) noisy speech divided into phonological component and non-speech portion (i.e. unvoiced segments), non-speech portion master by technology Noise characteristic to be shown as, is then processed to non-speech portion by certain statistical method again, you can obtain ambient noise The approximate evaluation of characteristic.However, the noise in phonological component there may be difference with the noise of non-speech portion, especially receiving (there is the noise of the property taken advantage of) in the case of affecting to residual echo, according only to coming the noise estimated result of non-speech portion to whole Body voice signal carries out the effect on driving birds is not good of denoising.

Content of the invention

A kind of method and device of speech de-noising is embodiments provided, to improve going for existing voice denoising scheme Make an uproar effect.

In a first aspect, embodiments providing a kind of method of speech de-noising, the method includes：

Speech detection is carried out to Noisy Speech Signal, to distinguish speech frame and non-speech frame；

Respectively noise estimation is carried out to speech frame and non-speech frame, noise power spectrum fusion estimate is obtained, wherein, described Noise power spectrum fusion estimate is the fusion of speech frame noise power spectrum estimate and non-speech frame noise power spectrum estimate Value；

Merging estimate according to the noise power spectrum carries out denoising to the Noisy Speech Signal.

Second aspect, the embodiment of the present invention additionally provide a kind of device of speech de-noising, and the device includes：

Speech detection module, for carrying out speech detection to Noisy Speech Signal, to distinguish speech frame and non-speech frame；

Noise estimation module, for carrying out noise estimation to speech frame and non-speech frame respectively, obtains noise power spectrum and melts Estimate is closed, wherein, the noise power spectrum fusion estimate is speech frame noise power spectrum estimate and non-speech frame noise The fusion value of power Spectral Estimation value；

Denoising module, goes to the Noisy Speech Signal for merging estimate according to the noise power spectrum Make an uproar process.

A kind of method and device of speech de-noising is embodiments provided, by voice being carried out to Noisy Speech Signal Detection, distinguishes speech frame and non-speech frame, and carries out noise estimation to which, obtains noise power spectrum fusion estimate, according to Noise power spectrum fusion estimate carries out denoising to Noisy Speech Signal.By adopting technique scheme, to speech frame Noise estimation is all carried out with non-speech frame, and comprehensively both noise estimated results have been carried out at denoising to Noisy Speech Signal Reason, can be effectively improved the denoising effect of existing voice denoising scheme, improve voice quality.

Description of the drawings

Fig. 1 is the method flow diagram of the speech de-noising that the embodiment of the present invention one is provided；

Fig. 2 is the method flow diagram of the speech de-noising that the embodiment of the present invention two is provided；

Fig. 3 is the method flow diagram of the speech de-noising that the embodiment of the present invention three is provided；

Fig. 4 is the method flow diagram of the speech de-noising that the embodiment of the present invention four is provided；

Fig. 5 a is the original Noisy Speech Signal spectrogram that the embodiment of the present invention four is provided；

Fig. 5 b is the speech signal spec-trum figure after the denoising that the embodiment of the present invention four is provided；

Fig. 6 is the structural representation of the device of the speech de-noising that the embodiment of the present invention five is provided.

Specific embodiment

The present invention is described in further detail with reference to the accompanying drawings and examples.It is understood that this place is retouched The specific embodiment that states is used only for explaining the present invention, rather than limitation of the invention.It also should be noted that, in order to just Part related to the present invention rather than entire infrastructure is illustrate only in description, accompanying drawing.

Embodiment one

Fig. 1 is the method flow diagram of the speech de-noising that the embodiment of the present invention one is provided, and the present embodiment can be used for speech de-noising, The method can be executed by the device of speech de-noising, and the device can be realized by the mode of software and/or hardware, and the device is permissible It is integrated in any intelligent terminal that speech de-noising function is provided, in implementing, intelligent terminal may include：Panel computer, hand The mobile terminal such as machine and electronic reader, above-mentioned terminal are only citings, and non exhaustive, including but not limited to above-mentioned intelligent terminal.

Referring to Fig. 1, the method for the speech de-noising, including：

S110, speech detection is carried out to Noisy Speech Signal, to distinguish speech frame and non-speech frame.

Intelligent terminal for reception is that a kind of non-stationary time-varying noisy speech by formation after environmental disturbances is believed to voice signal Number.Intelligent terminal for reception first, is sampled to time domain Noisy Speech Signal to after non-stationary time-varying voice signal, will simulation Signal is converted into data signal.Generally, the sample frequency of time domain Noisy Speech Signal obtains 44100 for 44100Hz, i.e., one second Individual sampled data.Adding window framing is carried out to the time domain Noisy Speech Signal after sampling, makes each frame time domain Noisy Speech Signal all It is stable.Preferably, the window function that commonly uses in speech processes has rectangular window, Hanning window and Hamming window.To adding window framing when Domain Noisy Speech Signal carries out Fourier transformation, is converted into frequency domain Noisy Speech Signal frame.Wherein, sampling, framing and Fu In leaf transformation for those skilled in the art common technology means, for brevity, will not be described here.

To frequency domain Noisy Speech Signal frame, speech detection is carried out frame by frame, to distinguish speech frame and non-speech frame.Voice is examined Survey can be regarded as carrying out feature extraction according to speech characteristic parameter, and wherein, speech characteristic parameter effectively can represent phonetic feature, With good distinction, voice and non-voice can be efficiently differentiated out according to its feature.VAD skill can be adopted in the present embodiment Art carries out speech detection.Generally, can be by the mel-frequency cepstrum coefficient (Mel in the frequency domain character parameter of extraction voice signal Frequency Cepstrum Coefficient, MFCC) distinguishing speech frame and non-speech frame.

S120, respectively noise estimation is carried out to speech frame and non-speech frame, obtain noise power spectrum fusion estimate.

Exemplary, after speech detection (VAD), one by one noise estimation can be carried out to each frame.If it is determined that current Frame is speech frame, then estimate that mode carries out noise estimation to present frame according to the noise of speech frame, obtain speech frame noise Power estimation value.For example noise estimation can be carried out using minimum tracking algorithm or quantile noise estimation method.Preferably, adopt Carry out noise estimation with fractional-dimension calculus method, in a period of time, Noisy Speech Signal is in the quantile of this frequency arrowband Value is considered as the noise power estimation value (also being understood as noise energy) of present band.Specifically, can be according to following public affairs Formula draws speech frame noise power spectrum estimate：

λ_d(n, k)=Quantiles (X (n, k)²) n=0,1,2..., M

Wherein, M represents frame number；X (n, k) represents the voice spectrum component of k-th frequency of n-th frame；Quantiles () represents Quantile is taken, typically takes 0.25 or 0.5.It should be noted that M represents carrying out obtaining after sub-frame processing to Noisy Speech Signal Totalframes, in above formula, n represents the frame number of speech frame, and the concrete value of n is determined by voice detection results.

If it is determined that present frame is non-speech frame, then estimate that mode carries out noise to present frame according to the noise of non-speech frame Estimate, obtain non-speech noise power estimation value.Non-speech frame noise power spectrum estimate can be drawn according to equation below：

λ_d(n, k)=a*X (n, k)²+(1-a)*Quantiles(X(n,k)²) n=0,1,2..., M

It should be noted that n represents the frame number of non-speech frame in the formula, the concrete value of n is determined by voice detection results.

Noise power spectrum fusion estimate is speech frame noise power spectrum estimate and non-speech frame noise power Power estimation According to equation below, the fusion value of value, can show that noise power spectrum merges estimate：

Wherein, L represents noise smoothing siding-to-siding block length, can use 9 frames；A represents weight coefficient, it is preferred that a takes 0.8；λ_d(n, K) represent speech frame noise power spectrum estimate or non-speech frame noise power spectrum estimate, specifically can be determined by the value of n； λ_last(n, k) represents noise power spectrum fusion estimate.

S130, according to the noise power spectrum merge estimate denoising is carried out to the Noisy Speech Signal.

Exemplary, after noise power spectrum fusion estimate is obtained, prior weight is entered using direct judgement method Row is estimated.Estimate that the method for prior weight is not limited only to above-mentioned direct judgement method, the algorithm that can also be suitable for using other, example Such as Casual algorithm, Non-Casual algorithm etc..In the present embodiment, estimate can be merged according to noise power spectrum and calculate priori letter Make an uproar and compare, Wiener filtering gain function can be correspondingly made available according to prior weight.Obtain Wiener filtering gain function it Afterwards, Wiener filtering is carried out to frequency domain Noisy Speech Signal, obtains frequency domain and remove noisy speech signal.The estimation of prior weight, wiener The calculating of filtering gain function can refer to existing calculation, will not be described here.

Further, inversefouriertransform can be carried out to the voice signal of frequency domain denoising, it is possible to by Overlap method The final output voice of synthesis, completes the process of whole speech de-noising.Wherein, synthesize the method for final output voice and not only limit In Overlap, here will not enumerate.

The present embodiment provide technical scheme, by carrying out speech detection to Noisy Speech Signal, distinguish speech frame and Non-speech frame, and noise estimation is carried out to which, noise power spectrum fusion estimate is obtained, estimate is merged according to noise power spectrum Denoising is carried out to Noisy Speech Signal.By technique scheme is adopted, not only noise figure can be estimated in non-speech frame, While in speech frame, can also update the estimate of noise component(s), and comprehensively both noise estimated results are believed to noisy speech Number denoising is carried out, the denoising effect of existing voice denoising scheme can be effectively improved, improves voice quality.

The speech de-noising method of the embodiment of the present invention is applicable to the voice signal during real-time voice network service Denoising is carried out, the denoising effect under the application scenarios such as real-time voice chat and online question-answering is compared to existing voice denoising side For case, effect is especially prominent.Compared with black phone voice communication and social networks phone, speech de-noising during online question-answering Difficulty is larger, this is because during online question-answering, student in order to see the mobile phone or computer screen real-time notes, be unwilling to beat as tradition Phone equally againsts ear and speaks, and student seldom wears earphone during online question-answering.Based on above-mentioned situation, played by loudspeaker Sound, echo problem are more projected, and relatively remote record, speech quality is highly prone to external environment noise and non-linear residual The impact of remaining echo.The speech de-noising method of the embodiment of the present invention, to application scenarios such as real-time voice chat and online question-answerings Under speech de-noising, the noise that not only can suppress in non-speech portion, and the noise of phonological component can be suppressed, especially Inhibition to residual echo is more obvious, can be effectively improved the denoising effect of existing voice denoising scheme, improve voice Quality.

Embodiment two

Fig. 2 is the method flow diagram of the speech de-noising that the embodiment of the present invention two is provided, and the present embodiment is in above-described embodiment On the basis of, when carrying out speech detection to Noisy Speech Signal, it is proposed that an effective phonetic feature combination, can be more accurate Distinguish noisy speech frame and non-speech frame.

Referring to Fig. 2, the method for the speech de-noising, including：

S210, the phonetic feature of extraction Noisy Speech Signal.

The noisy speech feature of extraction includes mel cepstrum coefficients MFCC, linear predictive coding residual sum spectral centroid Centroid.The mankind have different perceptions to different frequency voice：To below 1kHz, linear with frequency, right More than 1kHz, becomes logarithmic relationship with frequency.Frequency is higher, and perception is poorer.In the application usually only using low frequency MFCC, And abandon medium-high frequency MFCC.In Mel (Mel) frequency domain, people is linear relationship to the perception of tone, if two sections of voices Mel difference on the frequency twice, then people is in perceptually also poor twice.Mel frequency domain is simulation perception energy of the human ear to different frequency voice Power, in order to this perception characteristic of human ear is simulated, common frequencies yardstick is transformed into by Mel dimensions in frequency according to equation below：

Wherein, f is actual linear frequency, f_melFor Mel mark frequency.

Time domain Noisy Speech Signal to adding window framing carries out Fourier transformation, is converted into frequency domain Noisy Speech Signal After frame, the triangle bandpass filter group of frequency upper linear distribution, power spectral filter to voice signal are marked with one group of M Mel.Each The scope that individual triangle bandpass filter is covered all is similar to a critical bandwidth of human ear, and that simulates human ear with this shelters effect Should.Wherein, on M Mel mark frequency, the triangle bandpass filter group response formula of linear distribution is as follows：

Wherein, H_mK () represents the triangle bandpass filter group of the upper linear distribution of Mel mark frequency, k represents k-th wave filter, 1 ≤ m≤M, M typically take the centre frequency that 40, f (m) represents m-th triangular filter.

Logarithm is asked for output P (m) of triangle bandpass filter group according to equation below, can obtain being similar to homomorphism change The result that changes：

X (m)=log (P (m))

N rank discrete cosine transform (Discrete Cosine Transformation, DCT) is carried out to X (m), is removed each Correlation between dimension voice signal, voice signal is mapped to lower dimensional space, tries to achieve the MFCC parameter of standard.According to following public affairs Formula tries to achieve the MFCC parameter of standard：

Wherein, N typically takes 13, X_kThe MFCC parameter of expression standard.

Linear predictive coding (Linear Prediction Coding LPC) residual error is LPC residual, and it belongs to excitation letter Breath source information, reflects the cyclophysis of vocal cord vibration, and noise does not have the characteristic.Wherein, the voice frequency of n-th sampled point Spectrum output X (n) can be obtained by the linear combination of front p specimen sample point estimation, can be obtained according to equation below：

X(n)≈a₁X(n-1)+a₂X(n-2)+...a_pX(n-p)

Wherein, a₁、a₂...a_pFor p rank LPC coefficient, minimized by the systematic error quadratic sum of the model and try to achieve, but original Frame voice spectrum data are poor with X (n), obtain LPC residual.

Spectral centroid Centroid reflects the barycenter of frequency spectrum, wherein, the barycenter of speech region closer to 800～ 4000Hz, and the barycenter distribution of noise is significantly different.Spectral centroid Centroid can be obtained according to equation below：

Preferably, the phonetic feature also includes that frequency spectrum flatness Flatness, spectrum offset amount Rolloff and frequency spectrum are disturbed At least one of dynamic degree Zcr.

Wherein it is possible to obtain frequency spectrum flatness Flatness according to equation below：

Spectrum offset amount Rolloff is one kind tolerance to spectral shape, and it describes the deviation post of 85% energy spectrum.Can To obtain spectrum offset amount Rolloff according to equation below：

Frequency spectrum disturbance degree Zcr reflects voice spectrum disturbance, chaotic degree.Frequency spectrum can be obtained according to equation below Disturbance degree Zcr：

S220, speech model and noise model are generated using classifier training according to the phonetic feature for being extracted, to distinguish Go out speech frame and non-speech frame.

The phonetic feature of said extracted is constituted speech feature vector, voice is trained using grader, generate language Sound model and noise model.Wherein, grader can using mixed Gauss model (Gaussian Mixture Model, GMM) or SVMs (Support Vector Machine, SVM), is instructed to voice using mixed Gauss model in the present embodiment Practice, generate speech model and noise model.Output probability maximum according to speech model and noise model is carried out to every frame voice Judge, to distinguish speech frame and non-speech frame.When being trained to speech model and noise model, expectation maximization can be used Method, estimates the initial parameter of each Gauss based on k mean algorithm, through iteration several times, until speech model and noise mode Type convergence is restrained.

S230, respectively noise estimation is carried out to speech frame and non-speech frame, obtain noise power spectrum fusion estimate.

S240, according to the noise power spectrum merge estimate denoising is carried out to the Noisy Speech Signal.

The method of the speech de-noising that the present embodiment is provided, during by carrying out speech detection to Noisy Speech Signal, extracts band The phonetic feature of noisy speech signal, constitutes an effective phonetic feature combination, can more accurately distinguish noisy speech Frame and non-speech frame, contribute to improving the denoising effect of speech de-noising scheme further, improve voice quality.

Embodiment three

Fig. 3 is the method flow diagram of the speech de-noising that the embodiment of the present invention three is provided, and the present embodiment is in above-described embodiment On the basis of, stationary noise suppression, non-speech noise suppression and nonstationary noise suppression is carried out to Noisy Speech Signal.

Referring to Fig. 3, the method for the speech de-noising, including：

S310, speech detection is carried out to Noisy Speech Signal, to distinguish speech frame and non-speech frame.

S320, respectively noise estimation is carried out to speech frame and non-speech frame, obtain noise power spectrum fusion estimate.

S330, according to the noise power spectrum merge estimate the Noisy Speech Signal is carried out stationary noise suppression, Non-speech noise suppression and nonstationary noise suppression.

Exemplary, after noise power spectrum fusion estimate is obtained, can be according to the Wiener filtering gain function pair for calculating The frequency domain Noisy Speech Signal carries out denoising.Suppression to stationary noise, non-voice can be completed by said process The suppression of noise and the suppression of nonstationary noise, wherein, the suppression to three kinds of noises can be carried out step by step, i.e., first to voice The stationary noise of signal is suppressed, and then again non-speech noise is suppressed, and finally recycles a nonstationary noise suppression Factor pair nonstationary noise processed is suppressed, and is finally exported the voice signal for completing above-mentioned three kinds of noise suppressed, you can obtain The voice signal of final denoising.Additionally, also can the feature of comprehensive three kinds of noises carry out disposable noise suppressed, and export denoising Voice signal.

The method of the speech de-noising that the present embodiment is provided, by carrying out stationary noise suppression, non-language to Noisy Speech Signal Sound noise suppressed and nonstationary noise suppression, improve the denoising effect of speech de-noising scheme further, can not only suppress flat Steady noise, moreover it is possible to suppress nonstationary noise and residual echo, improve voice quality.

Example IV

Fig. 4 is the method flow diagram of the speech de-noising that the embodiment of the present invention four is provided, and the present embodiment is in above-described embodiment On the basis of, the fusion noise that estimate generates stationary noise, non-speech noise and nonstationary noise is merged according to noise power spectrum Inhibiting factor, and stationary noise suppression, non-voice are carried out to the Noisy Speech Signal according to the fusion noise suppression factor Noise suppressed and nonstationary noise suppression.

Referring to Fig. 4, the method for the speech de-noising, including：

S410, speech detection is carried out to Noisy Speech Signal, to distinguish speech frame and non-speech frame.

S420, respectively noise estimation is carried out to speech frame and non-speech frame, obtain noise power spectrum fusion estimate.

S430, merge estimate according to the noise power spectrum and generate stationary noise, non-speech noise and nonstationary noise Fusion noise suppression factor.

The fusion noise suppression factor of stationary noise, non-speech noise and nonstationary noise is made an uproar to all in voice signal One comprehensive consideration of sound, according to the fusion noise suppression factor, can be disposably to all of noise in voice signal Suppressed, and without the need for suppressing to three noise likes substep, improved the treatment effeciency of speech de-noising.Permissible according to equation below Obtain merging noise suppression factor：

Wherein, X (n, k) represents the voice spectrum component of k-th frequency of n-th frame, and g (n, k) represents wiener inhibiting factor, Coeff (n, k) represents non-voice inhibiting factor, and θ represents nonstationary noise inhibiting factor.θ is an invariant, and value is general For a constant between 0.001-0.1.

Wherein, the wiener inhibiting factor can be drawn by equation below：

Wherein, γ (n, k) represents posteriori SNR,Represent prior weight, max () expression takes maximum, λ_last (n, k) represents noise power spectrum fusion estimate, variable factor when β (n) represents.In traditional algorithm, β is immobilisation factor, generally takes 0.9～0.98, and in the algorithm using the β (n) of time-varying, be conducive to making fast reaction to voice signal conversion.

Because the problem of residual echo, VAD detection is caused it is possible that erroneous judgement, needs to carry out frequency spectrum further point Analysis, because within voice is concentrated mainly on 0～4000Hz of low frequency, even if under the interference of residual echo, occupies voice at least 50% energy above, so need to suppress non-speech portion (including noise and residual echo).

In this step, the non-voice inhibiting factor can be drawn by equation below：

Work as Ratio>During=threshold

Otherwise, coeff (n, k)=1.0

Wherein, Ratio representsWithRatio；Low, for presetting minimum, is an experience Value；Threshold is given threshold, and threshold generally takes 1.0.

Optionally, the fusion noise suppression factor can also be：

Wherein, X (n, k) represents the voice spectrum component of k-th frequency of n-th frame, and g (n, k) represents wiener inhibiting factor, Coeff (n, k) represents non-voice inhibiting factor, and θ represents nonstationary noise inhibiting factor.

S440, according to described fusion noise suppression factor carry out stationary noise suppression, non-language to the Noisy Speech Signal Sound noise suppressed and nonstationary noise suppression.

After obtaining the fusion noise suppression factor, Noisy Speech Signal can be suppressed, obtain the voice of denoising Signal.Can obtain final removing noisy speech signal according to equation below (1) or formula (2).As shown in Figure 5 a, it is that grandfather tape is made an uproar The spectrogram of voice signal；As shown in Figure 5 b, be according to the present embodiment provide speech de-noising method denoising after voice letter Number spectrogram.

Wherein, output (n, k) represents the voice spectrum component after the denoising of k-th frequency of n-th frame.

Comparison diagram 5a and Fig. 5 b can be seen that the method for the present embodiment can effectively be suppressed to various noise component(s)s, right The inhibition of nonstationary noise is especially prominent, and after denoising, speech components are effectively maintained.

The method of the speech de-noising that the present embodiment is provided, is steadily made an uproar by merging estimate generation according to noise power spectrum The fusion noise suppression factor of sound, non-speech noise and nonstationary noise, and according to fusion noise suppression factor to noisy speech Signal carries out denoising, can disposably complete the work to speech de-noising, improves the treatment effeciency of speech de-noising.Effectively change The denoising effect of speech de-noising scheme has been apt to it, stationary noise can not only have been suppressed, moreover it is possible to which analysis has estimated non-voice composition, to non-language Sound noise is suppressed, while when suppressing to nonstationary noise, weak component is carried out reducing, strong component is amplified, real Existing dynamic smoothing non-stationary noise reduction, improves voice quality.

Embodiment five

Fig. 5 is the structural representation of the device of the speech de-noising that the embodiment of the present invention five is provided, the dress of the speech de-noising Put including speech detection module 610, noise estimation module 620 and denoising module 630, below each module is carried out specifically Bright.

The speech detection module 610, for carrying out speech detection to Noisy Speech Signal, to distinguish speech frame and non- Speech frame；

The noise estimation module 620, for carrying out noise estimation to speech frame and non-speech frame respectively, obtains noise work( Rate spectrum fusion estimate, wherein, the noise power spectrum fusion estimate is speech frame noise power spectrum estimate and non-voice The fusion value of frame noise power spectrum estimate；

The denoising module 630, believes to the noisy speech for merging estimate according to the noise power spectrum Number carry out denoising.

Optionally, speech detection module 610 includes：

Speech feature extraction unit, for extracting the phonetic feature of Noisy Speech Signal, wherein, the phonetic feature includes Mel cepstrum coefficients MFCC, linear predictive coding residual sum spectral centroid Centroid；

Taxon, for generating speech model and noise mode according to the phonetic feature for being extracted using classifier training Type, to distinguish speech frame and non-speech frame.

Optionally, the phonetic feature also includes that frequency spectrum flatness Flatness, spectrum offset amount Rolloff and frequency spectrum are disturbed At least one of dynamic degree Zcr.

Optionally, noise estimation module 620 includes：

Speech frame noise estimation module, for drawing speech frame noise power spectrum estimate according to equation below：

λ_d(n, k)=Quantiles (X (n, k)²) n=0,1,2..., M

Non-speech frame noise estimation module, for drawing non-speech frame noise power spectrum estimate according to equation below：

λ_d(n, k)=a*X (n, k)²+(1-a)*Quantiles(X(n,k)²) n=0,1,2..., M

According to equation below, fusion noise estimation module, for showing that noise power spectrum merges estimate：

Wherein, M represents frame number, and X (n, k) represents the voice spectrum component of k-th frequency of n-th frame, and Quantiles () represents Quantile is taken, L represents noise smoothing siding-to-siding block length, a represents weight coefficient, λ_d(n, k) represents speech frame noise power Power estimation Value or non-speech frame noise power spectrum estimate, λ_last(n, k) represents noise power spectrum fusion estimate.

Optionally, denoising module 630 includes：

Denoising unit, puts down to the Noisy Speech Signal for merging estimate according to the noise power spectrum Steady noise suppressed, non-speech noise suppression and nonstationary noise suppression.

Optionally, denoising unit specifically for：

The fusion that estimate generates stationary noise, non-speech noise and nonstationary noise is merged according to the noise power spectrum Noise suppression factor；

Stationary noise suppression, non-speech noise are carried out to the Noisy Speech Signal according to the fusion noise suppression factor Suppression and nonstationary noise suppression.

Optionally, the fusion noise suppression factor is：

Optionally, the wiener inhibiting factor is drawn by equation below：

Wherein, γ (n, k) represents posteriori SNR,Represent prior weight, max () expression takes maximum, λ_last (n, k) represents noise power spectrum fusion estimate, variable factor when β (n) represents.

Optionally, the non-voice inhibiting factor is drawn by equation below：

Work as Ratio>During=threshold

Otherwise, coeff (n, k)=1.0

Wherein, Ratio representsWithRatio, low for preset minimum, threshold For given threshold.

The device of the speech de-noising provided by the embodiment of the present invention can perform the voice provided by any embodiment of the present invention The method of denoising, possesses the corresponding functional module of execution method and beneficial effect.

Obviously, it will be understood by those skilled in the art that each module of the above-mentioned present invention or each step can be by as above Described sliced service device and management server are implementing.Optionally, the embodiment of the present invention be able to can be held with computer installation Capable program realizing, such that it is able to be stored in storage device being executed by processor, can deposit by described program It is stored in a kind of computer-readable recording medium, storage medium mentioned above can be read-only storage, disk or CD etc.； Or they are fabricated to each integrated circuit modules respectively, or the multiple modules in them or step are fabricated to single collection Become circuit module to realize.So, the present invention is not restricted to the combination of any specific hardware and software.

Note, above are only presently preferred embodiments of the present invention and institute's application technology principle.It will be appreciated by those skilled in the art that The invention is not restricted to specific embodiment described here, can carry out for a person skilled in the art various obvious changes, Readjust and substitute without departing from protection scope of the present invention.Therefore, although the present invention is carried out by above example It is described in further detail, but the present invention is not limited only to above example, without departing from the inventive concept, also Other Equivalent embodiments more can be included, and the scope of the present invention is determined by scope of the appended claims.

Claims

1. a kind of method of speech de-noising, it is characterised in that include：

Respectively noise estimation is carried out to speech frame and non-speech frame, obtain noise power spectrum fusion estimate, wherein, the noise Power spectrum fusion estimate is the fusion value of speech frame noise power spectrum estimate and non-speech frame noise power spectrum estimate；

2. method according to claim 1, it is characterised in that speech detection is carried out to Noisy Speech Signal, to distinguish Speech frame and non-speech frame, including：

The phonetic feature of Noisy Speech Signal is extracted, wherein, the phonetic feature includes mel cepstrum coefficients MFCC, linear prediction Coded residual and spectral centroid Centroid；

Speech model and noise model are generated using classifier training according to the phonetic feature for being extracted, with distinguish speech frame and Non-speech frame.

3. method according to claim 2, it is characterised in that the phonetic feature also includes frequency spectrum flatness At least one of Flatness, spectrum offset amount Rolloff and frequency spectrum disturbance degree Zcr.

4. method according to claim 1, it is characterised in that respectively noise estimation is carried out to speech frame and non-speech frame, Noise power spectrum fusion estimate is obtained, including：

Speech frame noise power spectrum estimate is drawn according to equation below：

λ_d(n, k)=Quantiles (X (n, k)²) n=0,1,2..., M

Non-speech frame noise power spectrum estimate is drawn according to equation below：

λ_d(n, k)=a*X (n, k)²+(1-a)*Quantiles(X(n,k)²) n=0,1,2..., M

Show that noise power spectrum merges estimate according to equation below：

λ_{l a s t} (n, k) = \frac{Σ_{n = 0}^{L - 1} λ_{d} (n, k)}{L}

Wherein, M represents frame number, and X (n, k) represents the voice spectrum component of k-th frequency of n-th frame, and Quantiles () represents and takes point Digit, L represent noise smoothing siding-to-siding block length, and a represents weight coefficient, λ_d(n, k) represent speech frame noise power spectrum estimate or Non-speech frame noise power spectrum estimate, λ_last(n, k) represents noise power spectrum fusion estimate.

5. method according to claim 1, it is characterised in that described estimate is merged to institute according to the noise power spectrum Stating Noisy Speech Signal carries out denoising, including：

Merging estimate according to the noise power spectrum carries out stationary noise suppression, non-speech noise to the Noisy Speech Signal Suppression and nonstationary noise suppression.

6. method according to claim 5, it is characterised in that estimate is merged to the band according to the noise power spectrum Noisy speech signal carries out stationary noise suppression, non-speech noise suppression and nonstationary noise suppression, including：

The fusion noise that estimate generates stationary noise, non-speech noise and nonstationary noise is merged according to the noise power spectrum Inhibiting factor；

Stationary noise suppression, non-speech noise suppression are carried out according to the fusion noise suppression factor to the Noisy Speech Signal And nonstationary noise suppression.

7. method according to claim 6, it is characterised in that the fusion noise suppression factor is：

Q (n, k) = \frac{| X (n, k) | * g (n, k) * c o e f f (n, k)}{| X (n, k) | * g (n, k) * c o e f f (n, k) + θ}

8. method according to claim 6, it is characterised in that the fusion noise suppression factor is：

Q (n, k) = \frac{{(| X (n, k) | * g (n, k) * c o e f f (n, k))}^{2}}{{(| X (n, k) | * g (n, k) * c o e f f (n, k))}^{2} + θ}

9. the method according to claim 7 or 8, it is characterised in that the wiener inhibiting factor is drawn by equation below：

γ (n, k) = \frac{X {(n, k)}^{2}}{λ_{l a s t} (n, k)}

\hat{ϵ} (n, k) = β (n) \frac{g (n - 1, k) * γ (n - 1, k)}{λ_{l a s t} (n, k)} + (1 - β (n)) m a x [γ (n, k) - 1, 0]

β (n) = \frac{1}{1 + | \frac{m a x [γ (n, k) - 1, 0] - \hat{ϵ} (n - 1, k)}{m a x [γ (n, k) - 1, 0] + 1} |}

g (n, k) = \frac{\hat{ϵ} (n, k)}{\hat{ϵ} (n, k) + 1}

Wherein, γ (n, k) represents posteriori SNR,Represent prior weight, max () expression takes maximum, λ_last(n, K) noise power spectrum fusion estimate, variable factor when β (n) represents are represented.

10. the method according to claim 7 or 8, it is characterised in that the non-voice inhibiting factor is obtained by equation below Go out：

R a t i o = \frac{Σ_{k = l o w + 1}^{N - 1} | X (n, k) |}{Σ_{k = 1}^{l o w} | X (n, k) |}

Work as Ratio>During=threshold

\{\begin{matrix} c o e f f (n, k) = 0.95 & k = 1, ..., l o w \\ c o e f f (n, k) = 0.95 - \frac{(0.95 - 0.5 / R a t i o)}{l o w} * (k - l o w - 1) & k = l o w + 1, ..., N - 1 \end{matrix}

Otherwise, coeff (n, k)=1.0

Wherein, Ratio representsWithRatio, for presetting minimum, threshold is for setting for low Determine threshold value.

11. a kind of devices of speech de-noising, it is characterised in that include：

Noise estimation module, for carrying out noise estimation to speech frame and non-speech frame respectively, obtains noise power spectrum fusion and estimates Evaluation, wherein, the noise power spectrum fusion estimate is speech frame noise power spectrum estimate and non-speech frame noise power The fusion value of Power estimation value；

Denoising module, is carried out at denoising to the Noisy Speech Signal for merging estimate according to the noise power spectrum Reason.

12. devices according to claim 11, it is characterised in that denoising module includes：

Denoising unit, is steadily made an uproar to the Noisy Speech Signal for merging estimate according to the noise power spectrum Sound suppression, non-speech noise suppression and nonstationary noise suppression.

13. devices according to claim 12, it is characterised in that denoising unit specifically for：

14. devices according to claim 13, it is characterised in that the fusion noise suppression factor is：

Q (n, k) = \frac{| X (n, k) | * g (n, k) * c o e f f (n, k)}{| X (n, k) | * g (n, k) * c o e f f (n, k) + θ}

15. devices according to claim 13, it is characterised in that the fusion noise suppression factor is：

Q (n, k) = \frac{{(| X (n, k) | * g (n, k) * c o e f f (n, k))}^{2}}{{(| X (n, k) | * g (n, k) * c o e f f (n, k))}^{2} + θ}