CN103295580A

CN103295580A - Method and device for suppressing noise of voice signals

Info

Publication number: CN103295580A
Application number: CN 201310175549
Authority: CN
Inventors: 宋辉
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2013-05-13
Filing date: 2013-05-13
Publication date: 2013-09-11

Abstract

The invention discloses a method and a device for suppressing noise in voice signals. The method comprises: performing Mel-domain decomposition on a target voice signal X(f) to obtain K Mel-domain subband signals X_i(f), i = 1, 2, ..., K; using a preset time window to determine the local minimum energy value of each subband in the time domain and the signal corresponding to that minimum energy value; using these signals to estimate the noise signal of each subband, and obtaining a full-band noise estimate N(f) by splicing the per-subband estimates; and generating a filter with N(f) as the removal target and applying it to suppress the noise in X(f). The scheme preserves the integrity of voice features as far as possible during noise reduction and thereby improves speech recognition accuracy.

Description

Method and device for suppressing noise in voice signals

Technical field

The present invention relates to the technical field of audio signal processing, and in particular to a method and a device for suppressing noise in voice signals.

Background technology

With the development of the Internet, what users search for online is no longer limited to text: pictures, audio and video have all become objects supported by search engines. Among these, voice-based search has become a popular form of application. Unlike traditional text search, noise is unavoidable while a voice signal is being input, so effective noise reduction has become the key to improving voice search performance.

At present, noise reduction for voice search mainly follows two approaches. The first adds noise to the clean speech feature signals in the feature library and trains on them to obtain noisy speech feature data, thereby increasing the match between the feature library and the user's actual recordings. This approach has two drawbacks: the cost of data training is high, and the added noise can hardly cover every noise type found in practice. Once the user's surroundings during a voice search contain a new kind of noise, the training data provides no noise robustness at all.

The second approach applies an existing speech enhancement algorithm to the noisy voice signal. However, existing speech enhancement algorithms were designed to improve the subjective listening quality of speech; they damage the spectral structure of the speech and impair the integrity of the speech features. Applying existing speech enhancement techniques directly to a speech recognition system therefore does not necessarily bring a gain, and in many cases even lowers the recognition rate of the system.

Summary of the invention

To solve the above technical problem, embodiments of the invention provide a method and a device for suppressing noise in voice signals, so as to preserve the integrity of voice features as far as possible during noise reduction and improve speech recognition accuracy. The technical scheme is as follows:

An embodiment of the invention provides a method for suppressing noise in voice signals, the method comprising:

performing Mel-domain decomposition on a target voice signal X(f) to obtain K Mel-domain subband signals X_i(f), i = 1, 2, ..., K;

using a preset time window, determining the local minimum energy value E_{N,i}^{\min} of each subband in the time domain and the signal X_i^{\min}(f) corresponding to that minimum energy value;

using X_i^{\min}(f) of each subband to estimate the noise signal of the corresponding subband, and obtaining a full-band noise estimate N(f) by splicing the estimation results;

generating a filter with N(f) as the removal target, and performing noise suppression on the target voice signal X(f).

According to one embodiment of the invention, using X_i^{\min}(f) of each subband to estimate the noise signal of the corresponding subband comprises:

taking X_i^{\min}(f) of each subband as the noise estimate of the corresponding subband.

According to one embodiment of the invention, using X_i^{\min}(f) of each subband to estimate the noise signal of the corresponding subband comprises:

taking the local minimum energy value E_{N,i}^{\min} of each subband as the noise energy estimate of that subband;

calculating, from the noise energy estimates of the subbands, the average subband signal-to-noise ratio (SNR) of each time unit of the target voice signal;

determining each time unit whose average subband SNR is below a preset threshold to be a non-speech time unit;

using the non-speech time unit signal X_i^{non-speech}(f) and the minimum-energy signal X_i^{\min}(f) of each subband to estimate the noise signal of the corresponding subband.

According to one embodiment of the invention, calculating the average subband signal-to-noise ratio (SNR) of each time unit of the target voice signal from the noise energy estimates of the subbands comprises:

calculating the average subband SNR of each time unit of the target voice signal from the noise energy estimates of the subbands in the preset mid-frequency and high-frequency bands only.

According to one embodiment of the invention, using the non-speech time unit signal X_i^{non-speech}(f) and the minimum-energy signal X_i^{\min}(f) of each subband to estimate the noise signal of the corresponding subband comprises:

using

X_{i}^{update} (f) = λ_{i} X_{i}^{\min} (f) + (1 - λ_{i}) X_{i}^{non - speech} (f),

to obtain the noise estimate of each subband, where λ_i is a preset weight and 0 ≤ λ_i ≤ 1.

According to one embodiment of the invention, different values of λ_i are preset for different subbands.

According to one embodiment of the invention, presetting different values of λ_i for different subbands comprises:

presetting a smaller λ_i for subbands in the lower frequency range and a larger λ_i for subbands in the higher frequency range.

According to one embodiment of the invention, the time unit is one audio frame.

An embodiment of the invention also provides a device for suppressing noise in voice signals, the device comprising:

a signal decomposition module, configured to perform Mel-domain decomposition on a target voice signal X(f) to obtain K Mel-domain subband signals X_i(f), i = 1, 2, ..., K;

a minimum energy determination module, configured to use a preset time window to determine the local minimum energy value E_{N,i}^{\min} of each subband in the time domain and the signal X_i^{\min}(f) corresponding to that minimum energy value;

a noise estimation module, configured to use X_i^{\min}(f) of each subband to estimate the noise signal of the corresponding subband, and to obtain a full-band noise estimate N(f) by splicing the estimation results;

a noise suppression module, configured to generate a filter with N(f) as the removal target and to perform noise suppression on the target voice signal X(f).

According to one embodiment of the invention, the noise estimation module is specifically configured to:

take X_i^{\min}(f) of each subband as the noise estimate of the corresponding subband.

According to one embodiment of the invention, the noise estimation module comprises:

a time unit type determination submodule, configured to take the local minimum energy value E_{N,i}^{\min} of each subband as the noise energy estimate of that subband; to calculate, from the noise energy estimates of the subbands, the average subband signal-to-noise ratio (SNR) of each time unit of the target voice signal; and to determine each time unit whose average subband SNR is below a preset threshold to be a non-speech time unit;

a noise estimation submodule, configured to use the non-speech time unit signal X_i^{non-speech}(f) and the minimum-energy signal X_i^{\min}(f) of each subband to estimate the noise signal of the corresponding subband.

According to one embodiment of the invention, the time unit type determination submodule is specifically configured to:

calculate the average subband signal-to-noise ratio (SNR) of each time unit of the target voice signal from the noise energy estimates of the subbands in the preset mid-frequency and high-frequency bands only.

According to one embodiment of the invention, the noise estimation submodule is specifically configured to:

use

X_{i}^{update} (f) = λ_{i} X_{i}^{\min} (f) + (1 - λ_{i}) X_{i}^{non - speech} (f),

to obtain the noise estimate of each subband, where λ_i is a preset weight and 0 ≤ λ_i ≤ 1.

According to one embodiment of the invention, a smaller λ_i is preset for subbands in the lower frequency range and a larger λ_i for subbands in the higher frequency range.

The technical scheme provided by embodiments of the invention applies minimum-energy tracking in the Mel domain: the minimum energy of each Mel band is tracked, and the signal X_i^{\min}(f) corresponding to the minimum energy of each Mel subband is saved. The difference from traditional noise suppression methods is that these signals X_i^{\min}(f) may not belong to the same time frame, yet the noise estimate of each Mel band is guaranteed to be the one of minimum energy. This effectively ensures the reliability of the noise estimate, so that speech components are not damaged during noise suppression.

Further, the technical scheme can also use the minimum-energy tracking result together with speech/non-speech decision information to correct the noise estimate in each Mel band, and can select different noise correction step sizes according to the different noise characteristics of different Mel bands. In this way the noise estimate remains reliable and does not introduce speech components, while the noise in different frequency bands is estimated dynamically, which further improves noise suppression for voice signals with non-stationary noise.

Description of drawings

To illustrate the technical schemes of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments recorded in the invention; a person of ordinary skill in the art can derive other drawings from them.

Fig. 1 is a structural diagram of a speech recognition system according to an embodiment of the invention;

Fig. 2 is a flowchart of a method for suppressing noise in voice signals according to an embodiment of the invention;

Fig. 3 is a structural diagram of a device for suppressing noise in voice signals according to an embodiment of the invention.

Embodiment

Fig. 1 shows a typical speech recognition system in the prior art, which consists of a noise suppression module, a noise-added training module, a recognition module and other parts. In the training stage, noisy speech passes through the noise suppression module to give an enhanced voice signal, and the noise-added training module uses this signal to train a noise-added model, yielding the noise-added acoustic model. In the recognition stage, the noisy voice signal likewise passes through the noise suppression module; the resulting enhanced speech is sent to the recognition module, which performs recognition according to the pre-trained noise-added model and language model to obtain the final recognition result.

The noise suppression module is thus an important component of the whole speech recognition system. The existing approach to noise reduction is to apply an existing speech enhancement algorithm to the noisy voice signal. Although such algorithms can improve the subjective listening quality of speech, they damage its spectral structure and impair the integrity of the speech features. Applying existing speech enhancement techniques directly to a speech recognition system therefore does not necessarily bring a gain, and in many cases lowers the recognition rate.

To address the above problems, embodiments of the invention provide a scheme for suppressing noise in voice signals. Its design goal is to suppress background noise as far as possible without damaging the voice signal, so that the residual noise in the final enhanced speech is as low in amplitude as possible. The basic working principle of the noise suppression system is as follows:

First, the input noisy voice signal is passed through a bank of Mel triangular filters (Mel Filter Bank, MFB) identical to the one used for speech feature extraction, which performs Mel-domain subband decomposition. Local minimum-energy tracking is applied to each Mel subband signal: the local minimum energy of each Mel subband in the time domain is recorded in real time, and the Mel-domain signal corresponding to each minimum energy is determined at the same time. (The Mel subband minimum energy serves two purposes: it is the basis for detecting the start and end points of speech in the input signal, and it is the basis for estimating the noise signal in the noise suppression algorithm.) Next, the minimum-energy signal of each subband is used to estimate the noise signal of that subband, and the full-band noise estimate is obtained by splicing the estimates. Finally, a filter is designed with the full-band noise estimate as the removal target, and noise suppression is applied to the original noisy input signal.

In this scheme the whole noise estimation process is carried out in the Mel domain (rather than in the frequency domain, as in traditional speech enhancement algorithms), so it operates in the same transform domain as speech feature extraction. Because minimum-energy tracking is performed on every Mel subband, what is removed from each Mel subband is noise rather than speech, which effectively protects the speech content of each Mel subband. Compared with common speech enhancement techniques, this can significantly improve the accuracy of a speech recognition system.

To help those skilled in the art better understand the technical scheme of the invention, the technical scheme in the embodiments is described in detail below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention fall within the scope of protection of the invention.

Fig. 2 is a flowchart of a noise suppression method of the invention. The method may include the following steps:

S101: perform Mel-domain decomposition on the target voice signal to obtain K Mel-domain subband signals.

The Mel domain is partitioned by a bank of triangular filters defined in the frequency domain; this filter bank is also called the Mel Filter Bank (MFB). Because the MFB models the critical-band effect of the human auditory system well, speech features can be extracted effectively in the Mel domain in a speech recognition system. In the scheme of this embodiment, subband decomposition follows the Mel frequency scale, and noise suppression is also carried out in the Mel domain.

Suppose the target voice signal to be processed is X(f). Mel-domain subband decomposition divides the full-band voice signal according to the bandwidths of the Mel triangular filter bank into K Mel-domain subband signals X_1(f), ..., X_K(f).

Each subband signal can in turn be expressed as the sum of a speech subband component and a noise subband component:

X_i(f) = S_i(f) + N_i(f), i = 1, ..., K

If the noise component N_i(f) of each subband can be estimated, the full-band noise signal N(f) can be estimated; a filter is then designed with the full-band noise signal as the removal target to complete noise suppression. How N(f) is estimated is described in detail later in this embodiment.
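The decomposition in S101 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the Mel mapping 2595·log10(1 + f/700) is the commonly used formula, and the function names, the choice of 8 bands, 129 frequency bins and a 16 kHz sampling rate are all assumptions made for the example.

```python
import math

def hz_to_mel(hz):
    # Common Mel-scale mapping (assumed here; the text does not fix one).
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(num_bands, num_bins, sample_rate):
    """Return num_bands triangular weight vectors over num_bins frequency bins."""
    nyquist = sample_rate / 2.0
    # num_bands + 2 edge points, equally spaced on the Mel scale.
    mel_points = [hz_to_mel(nyquist) * j / (num_bands + 1)
                  for j in range(num_bands + 2)]
    hz_points = [mel_to_hz(m) for m in mel_points]
    bin_hz = [nyquist * b / (num_bins - 1) for b in range(num_bins)]
    bank = []
    for i in range(1, num_bands + 1):
        lower, centre, upper = hz_points[i - 1], hz_points[i], hz_points[i + 1]
        weights = []
        for f in bin_hz:
            if lower < f <= centre:        # rising edge of the triangle
                weights.append((f - lower) / (centre - lower))
            elif centre < f < upper:       # falling edge of the triangle
                weights.append((upper - f) / (upper - centre))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def decompose(spectrum, bank):
    """Split a spectrum X(f) into K subband signals X_i(f)."""
    return [[w * x for w, x in zip(weights, spectrum)] for weights in bank]

bank = mel_filter_bank(num_bands=8, num_bins=129, sample_rate=16000)
subbands = decompose([1.0] * 129, bank)   # flat test spectrum
```

Each triangle shares its edges with the centres of its neighbours, so adjacent subbands overlap, which is why the later splicing step cannot simply add the subband estimates together.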

S102: using a preset time window, determine the local minimum energy value E_{N,i}^{\min} of each subband in the time domain and the signal X_i^{\min}(f) corresponding to that minimum energy value.

The scheme uses the local minimum energy of each Mel subband to estimate the noise component N_i(f) of that subband. First the minimum energy of each subband is tracked, that is, the local minimum energy value of each subband in the time domain is determined. Since real noise signals are mostly non-stationary, a long time window can be introduced here to improve adaptation to non-stationary noise.

For example, with a time window of 300 frames, local minimum-energy tracking is performed on each Mel subband within that window: for each Mel subband, a run of 300 consecutive frames is found, each run being the 300 frames with the least energy in its own subband.

Suppose tracking yields the following minimum energies for the Mel subbands:

E_{N, 1}^{\min}, \ldots, E_{N, K}^{\min}.

The signal of each Mel subband corresponding to its minimum energy is recorded at the same time:

X_{1}^{\min} (f), \ldots, X_{K}^{\min} (f).

These subband signals may not belong to the same signal frame in the time domain, but each is guaranteed to be the minimum-energy signal of its Mel band, so this set of signals is a very reliable basis for noise estimation. If this part is treated as the component to be filtered out during noise suppression, everything that is removed is noise from the individual Mel bands, while the speech components are kept intact. In other words, the scheme of this application adopts a relatively "safe" approach: only signals that are noise with high certainty are filtered out, while signals that cannot be confirmed as noise are left unfiltered, to avoid damaging the speech features of the signal.
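One possible reading of the tracking step in S102 is sketched below: within the most recent window of frames, each subband independently keeps the frame whose energy in that subband is lowest. The function name, the data layout and the tiny three-frame example are assumptions for illustration; the text itself only fixes the idea of a per-subband local minimum over a preset window.

```python
def track_minimum(band_energies, band_signals, window=300):
    """For each subband, return (E_min, X_min) over the last `window` frames.

    band_energies[t][i] is the energy of subband i in frame t;
    band_signals[t][i] is the subband signal X_i(f) of frame t.
    """
    first = max(0, len(band_energies) - window)
    recent = range(first, len(band_energies))
    num_bands = len(band_energies[0])
    minima = []
    for i in range(num_bands):
        # Frame index with the lowest energy in subband i inside the window.
        t_min = min(recent, key=lambda t: band_energies[t][i])
        minima.append((band_energies[t_min][i], band_signals[t_min][i]))
    return minima

# Three frames, two subbands (hypothetical values).
energies = [[4.0, 1.0], [2.0, 3.0], [5.0, 0.5]]
signals = [[[4.0], [1.0]], [[2.0], [3.0]], [[5.0], [0.5]]]
minima = track_minimum(energies, signals)
```

In this example subband 0 takes its minimum from frame 1 and subband 1 from frame 2, which illustrates the point made above that the minimum-energy signals need not come from the same frame.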

S103: use X_i^{\min}(f) of each subband to estimate the noise signal of the corresponding subband, and obtain the full-band noise estimate N(f) by splicing the estimation results.

1) Estimating the noise signal of each subband:

In one embodiment of the invention, the minimum-energy signal X_i^{\min}(f) of each subband can be used directly as the noise estimate of that subband. As explained above, this method of estimation ensures that the noise estimate contains no speech components.

In another embodiment of the invention, the minimum-energy signal X_i^{\min}(f) of each subband can be further corrected to achieve a more accurate noise estimate. The specific steps are as follows:

S103a: take the local minimum energy value E_{N,i}^{\min} of each subband as the noise energy estimate of that subband;

S103b: calculate, from the noise energy estimates of the subbands, the average subband signal-to-noise ratio (SNR) of each time unit of the target voice signal.

An audio signal can be divided in the time domain into a number of time units, each serving as a basic processing unit. In audio processing, the typical basic processing unit is an audio frame. The embodiments of the invention are likewise described with the audio frame as the basic processing unit, but this should not be construed as limiting the scheme of the invention.

For each frame of the signal, the average subband SNR of the frame is calculated:

SNR (f) = \sum_{i = 1}^{K} \frac{E_{i}}{E_{i}^{\min}}

where E_i denotes the total energy of the i-th subband, and E_i^{\min} is the local minimum energy of the i-th subband, which takes part in the SNR calculation here as the noise energy estimate of that subband. Summing the SNR values of all subbands of a frame gives the average subband SNR, SNR(f), of that frame. The larger this value, the stronger the voice signal; the smaller it is, the weaker the voice signal, to the point where the frame may contain no speech at all and consist entirely of noise.

Studies of a large number of real voice signals show that noise is usually concentrated in the low-frequency range, so the SNR of the mid- and high-frequency range is better at distinguishing speech from noise. On this basis, the average subband SNR of each frame can be calculated with the following formula:

SNR (f) = \sum_{i = K / 3}^{K} \frac{E_{i}}{E_{i}^{\min}}

Note that the SNR here is not summed over all subbands but only over the subbands of the mid- and high-frequency range: the summation runs from K/3 to K, which removes the low bands and keeps the mid and high bands. Because noise is mostly concentrated in the low bands, the input signal suffers more interference in the low-frequency range and less in the mid- and high-frequency range, so the SNR calculated there is more reliable. K/3 is an empirical value, and it should not be construed as limiting the scheme of the invention.
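The two SNR formulas can be illustrated with a short sketch. The function name and the `skip_low_bands` flag are hypothetical, integer division is assumed for K/3, and the minimum energies are assumed to be strictly positive so that the division is defined.

```python
def average_subband_snr(frame_energies, min_energies, skip_low_bands=True):
    """Sum E_i / E_i^min over subbands, optionally from band K/3 up to band K.

    frame_energies[i] is E_i for the current frame; min_energies[i] is the
    tracked minimum energy of subband i (assumed > 0).
    """
    num_bands = len(frame_energies)
    # The formula counts subbands from 1, Python lists from 0, so subband K/3
    # sits at index K // 3 - 1.
    start = max(num_bands // 3 - 1, 0) if skip_low_bands else 0
    return sum(frame_energies[i] / min_energies[i]
               for i in range(start, num_bands))

frame = [8.0, 4.0, 6.0, 2.0, 10.0, 3.0]   # hypothetical E_i, K = 6
floor = [2.0] * 6                          # hypothetical E_i^min
snr_all = average_subband_snr(frame, floor, skip_low_bands=False)
snr_mid_high = average_subband_snr(frame, floor)
```

With K = 6 the mid/high-band version drops only the first subband, so a frame dominated by low-frequency noise energy scores lower than it would under the full-band sum.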

S103c: determine each time unit whose average subband SNR is below a preset threshold to be a non-speech time unit.

The SNR(f) of each frame is compared with the preset threshold SNR_{Thr}. If SNR(f) > SNR_{Thr}, the frame is classified as a speech frame; otherwise it is classified as a non-speech frame.

For a non-speech frame, its corresponding signal X_i^{non-speech}(f) in each subband is further determined; X_i^{non-speech}(f) represents the signal feature value of the frame in the i-th subband. If a subband has several non-speech frames, X_i^{non-speech}(f) can be obtained by averaging them with equal weights.
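A minimal sketch of the speech/non-speech decision and the equal-weight averaging follows. The function name, the nested-list layout and the return value of `None` when no non-speech frame exists are assumptions made for the example; the threshold comparison itself follows the rule stated above.

```python
def non_speech_signal(frame_snrs, band_signals, snr_threshold):
    """Average X_i(f) over the frames classified as non-speech.

    frame_snrs[t] is SNR(f) of frame t; band_signals[t][i] is X_i(f) of
    frame t as a list of per-bin values. Returns None if every frame is speech.
    """
    # A frame is speech only when its SNR exceeds the threshold.
    quiet = [t for t, snr in enumerate(frame_snrs) if snr <= snr_threshold]
    if not quiet:
        return None
    num_bands = len(band_signals[0])
    averaged = []
    for i in range(num_bands):
        # Group the values of each bin across all non-speech frames.
        bins = zip(*(band_signals[t][i] for t in quiet))
        averaged.append([sum(values) / len(quiet) for values in bins])
    return averaged

snrs = [9.0, 2.0, 1.0]                       # hypothetical per-frame SNR(f)
frames = [[[5.0, 5.0]], [[1.0, 3.0]], [[3.0, 1.0]]]   # one subband, two bins
x_non_speech = non_speech_signal(snrs, frames, snr_threshold=3.0)
```

Here frames 1 and 2 fall below the threshold, so the non-speech signal of the single subband is the bin-wise mean of those two frames.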

S103d: use the non-speech time unit signal X_i^{non-speech}(f) and the minimum-energy signal X_i^{\min}(f) of each subband to estimate the noise signal of the corresponding subband.

The result of the speech/non-speech decision is used to correct the minimum Mel subband energy tracking result:

X_{1}^{\min} (f), \ldots, X_{K}^{\min} (f)

which gives a "more reliable" noise estimate for each Mel subband:

X_{1}^{update} (f), \ldots, X_{K}^{update} (f).

The noise of each subband can be corrected as follows:

X_{i}^{update} (f) = λ_{i} X_{i}^{\min} (f) + (1 - λ_{i}) X_{i}^{non - speech} (f)

As the formula shows, the corrected noise estimate is a weighted combination of two parts: the non-speech frame signal X_i^{non-speech}(f) and the minimum-energy signal X_i^{\min}(f).

Here λ_i is a preset weight in the range [0, 1] that controls the correction step size.

The two extreme cases are:

λ_i = 0: the noise estimate is corrected to the "non-speech frame" signal. In other words, whenever a non-speech signal is found, it is used as the noise estimate.

λ_i = 1: the minimum-energy tracking result is always used as the noise estimate, which is equivalent to making no correction.

In practice, λ_i can be adjusted according to the specific environment and noise characteristics. In addition, a different step size λ_i can be chosen for each Mel band. For example, in the low bands the noise tends to be stationary, so λ_i can be chosen smaller, which is the "safer" choice; in the high bands the noise tends to be non-stationary, so λ_i can be chosen larger, which allows changes in the noise to be tracked dynamically.

The ranges of "low frequency" and "high frequency" mentioned here are usually empirical and have no absolute limits. For example, if the band of the voice signal is 0 to 8 kHz, 0 to 2 kHz is generally regarded as low frequency and anything above 2 kHz as high frequency. Repeated experiments suggest setting λ_i to 0.1 to 0.2 in the low band and 0.5 to 0.6 in the high band. These are only the best values found experimentally and should not be construed as limiting the scheme of the invention.
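The per-subband correction X_i^{update}(f) = λ_i X_i^{\min}(f) + (1 - λ_i) X_i^{non-speech}(f) can be written out directly. The function name and the list-of-lists layout are assumptions for illustration; the example exercises the two extreme cases λ_i = 0 and λ_i = 1 described above.

```python
def update_noise(x_min, x_non_speech, lambdas):
    """Blend the minimum-energy and non-speech signals of every subband.

    x_min[i] and x_non_speech[i] are per-bin lists for subband i;
    lambdas[i] is the preset weight of subband i, required to be in [0, 1].
    """
    updated = []
    for band_min, band_quiet, lam in zip(x_min, x_non_speech, lambdas):
        if not 0.0 <= lam <= 1.0:
            raise ValueError("lambda_i must lie in [0, 1]")
        updated.append([lam * a + (1.0 - lam) * b
                        for a, b in zip(band_min, band_quiet)])
    return updated

x_min = [[1.0, 2.0], [4.0, 4.0]]          # hypothetical X_i^min(f)
x_quiet = [[3.0, 6.0], [8.0, 0.0]]        # hypothetical X_i^non-speech(f)
extremes = update_noise(x_min, x_quiet, [0.0, 1.0])
halfway = update_noise(x_min, x_quiet, [0.5, 0.5])
```

Passing one weight per subband mirrors the suggestion above that low and high Mel bands may use different step sizes.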

Two methods of estimating the noise signal of each subband have been given above. The advantage of the first is its simplicity: in simple environments, such as stationary or quasi-stationary noise, the minimum-energy tracking result can be used directly as the noise estimate. The second adds the ability to track noise dynamically: it uses the "non-speech frame" signal (that is, the real-time noise signal) to correct the minimum Mel energy tracking result, and it gives the designer considerable freedom to choose correction step factors according to the actual noise environment. More importantly, different step sizes can be chosen according to the noise characteristics of different Mel bands, so that the noise in each band is updated in a targeted way and the estimate of non-stationary noise becomes more accurate.

2) Splicing the noise signals of the subbands:

Step 1) yields the noise estimate of each subband:

X_{1}^{\min} (f), \ldots, X_{K}^{\min} (f)

or

X_{1}^{update} (f), \ldots, X_{K}^{update} (f).

To design a filter for noise suppression, the subband noise signals must be spliced into the full-band noise estimate N(f) or N^{update}(f).

Because the K subband signals obtained by Mel-domain decomposition overlap in the standard frequency domain, N(f) cannot be obtained by direct addition. Specifically, a "sectional splicing method" can be used. For example, take two adjacent Mel bands X_i^{update}(f) and X_{i+1}^{update}(f), whose Mel band ranges are [f_i^{lower}, f_i^{upper}] and [f_{i+1}^{lower}, f_{i+1}^{upper}] respectively.

The splicing rule is then as follows:

N^{update} (f) = X_{i}^{update} (f), if f_{i}^{lower} / 3 < f < f_{i}^{upper} / 3

N^{update} (f) = X_{i + 1}^{update} (f), if f_{i + 1}^{lower} / 3 < f < f_{i + 1}^{upper} / 3

N^{update} (f) = [X_{i}^{update} (f) + X_{i + 1}^{update} (f)] / 2, if f_{i}^{upper} / 3 < f < f_{i + 1}^{lower} / 3

Take the i-th subband as an example. It overlaps with both the preceding and the following subband, so it can be divided into three parts: 1) the part overlapping with subband i-1; 2) the independent part; 3) the part overlapping with subband i+1. For ease of calculation, the overlapping parts are assumed to have the same length as the independent part, so each part is one third of the whole subband length; this is why the interval endpoints are divided by 3.
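The idea behind sectional splicing (keep a band's own value where it stands alone, average where two adjacent bands overlap) can be sketched as below. This is a simplified illustration and not the exact rule of the formulas above: it describes each band by an inclusive range of frequency-bin indices, which is an assumption, instead of the one-third interval endpoints used in the text. The function name and example values are hypothetical.

```python
def splice_two_bands(band_a, range_a, band_b, range_b):
    """Merge two adjacent subband noise estimates defined on the same bins.

    range_a and range_b are (first_bin, last_bin) pairs, inclusive, giving
    the bins each band covers (assumed layout).
    """
    merged = [0.0] * len(band_a)
    for k in range(len(band_a)):
        in_a = range_a[0] <= k <= range_a[1]
        in_b = range_b[0] <= k <= range_b[1]
        if in_a and in_b:
            merged[k] = (band_a[k] + band_b[k]) / 2.0   # overlapping part
        elif in_a:
            merged[k] = band_a[k]                       # independent part of a
        elif in_b:
            merged[k] = band_b[k]                       # independent part of b
    return merged

band_i = [2.0, 2.0, 2.0, 0.0, 0.0]       # covers bins 0..2
band_next = [0.0, 0.0, 4.0, 4.0, 4.0]    # covers bins 2..4
spliced = splice_two_bands(band_i, (0, 2), band_next, (2, 4))
```

Only bin 2 belongs to both bands here, so it receives the average of the two estimates while every other bin keeps the value of the single band that covers it.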

Besides sectional splicing, an "interleaved Mel-domain splicing method" can also be used, in which the odd-numbered and even-numbered Mel subbands are spliced separately. Because the odd-numbered filters of the Mel triangular filter bank, and likewise the even-numbered filters, each cover the full band without overlapping, each group can be spliced directly.

The above introduces two typical splicing methods. Depending on the characteristics of Mel-domain signal decomposition, those skilled in the art may use other methods to splice the decomposed subband signals; the embodiments of the invention do not limit the specific splicing method.

S104: generate a filter with N(f) as the removal target and perform noise suppression on the target voice signal X(f).

Once the full-band noise estimate N(f) or N^{update}(f) has been obtained, noise suppression can be realized by designing a filter.

For example, each input frame of noisy signal is filtered with N^{update}(f) as the removal target, which completes the suppression of the noise N^{update}(f). Suppose the current frame is X(f). According to the minimum mean square error criterion, the optimal filter satisfies:

H (f) = \sqrt{1 - \frac{1}{SNR (f)}} = \sqrt{1 - \frac{N^{update} (f)}{X (f)}}

Filtering the input signal with H(f) yields the denoised voice signal. On the surface this filtering is still performed in the frequency domain, but because noise estimation and noise updating are both done in the Mel domain, it is in effect still optimal Wiener filtering in the Mel domain. Because the noise estimate of each Mel band contains no speech components, the reliability of the noise estimate is effectively ensured and speech components are not damaged during filtering.
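The gain H(f) = sqrt(1 - N^{update}(f)/X(f)) can be sketched per frequency bin as follows. Several details are assumptions added for the example and are not stated in the text: X(f) and N^{update}(f) are treated as non-negative per-bin power values, the ratio is clamped so that the square root never receives a negative argument, and an optional `floor` parameter sets a minimum gain. The function names are hypothetical.

```python
import math

def suppression_gain(noisy, noise, floor=0.0):
    """Per-bin gain H(f) = sqrt(1 - N(f) / X(f)), clamped to [floor, 1].

    noisy[k] and noise[k] are assumed to be non-negative power values.
    """
    gains = []
    for x, n in zip(noisy, noise):
        # A bin with no input energy is treated as pure noise.
        ratio = n / x if x > 0.0 else 1.0
        # Clamp before the square root (an added safeguard, not in the text).
        gains.append(math.sqrt(max(1.0 - ratio, floor * floor)))
    return gains

def apply_filter(noisy, gains):
    """Scale every bin of the noisy frame by its gain."""
    return [h * x for h, x in zip(gains, noisy)]

frame = [4.0, 1.0, 2.0]            # hypothetical X(f)
noise_est = [1.0, 1.0, 4.0]        # hypothetical N^update(f)
gains = suppression_gain(frame, noise_est)
cleaned = apply_filter(frame, gains)
```

Bins where the noise estimate reaches or exceeds the input are driven to the floor, while bins with little estimated noise pass almost unchanged.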

It will be understood that, given the full-band noise estimate N(f) or N^{update}(f), those skilled in the art may also design the filter by other methods; the embodiments of the invention do not limit the specific filter design method.

Corresponding to the method embodiment above, the invention also provides a device for suppressing noise in voice signals. Referring to Fig. 3, the device may comprise a signal decomposition module 110, a minimum energy determination module 120, a noise estimation module 130 and a noise suppression module 140. The specific function of each module and the way the modules cooperate are described below.

The signal decomposition module 110 is configured to perform Mel-domain decomposition on the target voice signal X(f) to obtain K Mel-domain subband signals X_i(f), i = 1, 2, ..., K.

The Mel domain is partitioned by a bank of triangular filters defined in the frequency domain; this filter bank is also called the Mel Filter Bank (MFB). Because the MFB models the critical-band effect of the human auditory system well, speech features can be extracted effectively in the Mel domain in a speech recognition system. In the scheme of this embodiment, subband decomposition follows the Mel frequency scale, and noise suppression is also carried out in the Mel domain.

Each subband signal can be expressed as the sum of a speech subband component and a noise subband component:

X_i(f) = S_i(f) + N_i(f), i = 1, ..., K

If the noise component N_i(f) of each subband can be estimated, the full-band noise signal N(f) can be estimated; a filter is then designed with the full-band noise signal as the removal target to complete noise suppression. How N(f) is estimated is described in detail later in this embodiment.

The minimum energy determination module 120 is configured to use a preset time window to determine, for each sub-band, the local minimum energy value in the time domain, E_{N, i}^{\min}, and the signal corresponding to that minimum energy value, X_{i}^{\min} (f). For the K sub-bands this gives the minimum energy values

E_{N, 1}^{\min}, ..., E_{N, K}^{\min}

and the corresponding signals

X_{1}^{\min} (f), ..., X_{K}^{\min} (f).

These sub-band signals may not belong to the same signal frame in the time domain, but each is guaranteed to have the minimum energy in its Mel band, so they form a very reliable basis for noise estimation. If this part of the signal is treated as the component to be removed during noise suppression filtering, everything that is filtered out is noise from the individual Mel bands, and the speech components are kept intact. In other words, the scheme of the present application takes a comparatively "safe" approach: only signals that are very likely to be noise are filtered, and signals whose status as noise is uncertain are left unfiltered, so that the speech features of the signal are not damaged.
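
The per-band minimum tracking described above can be sketched as follows. This is only an illustration: the data layout (`frames[t][i]` holding the bins of sub-band i in frame t), the sum-of-squares energy, and the function names are assumptions, not details given in the text.

```python
def band_energy(band_signal):
    """Energy of one sub-band signal, taken here as the sum of squared magnitudes."""
    return sum(abs(x) ** 2 for x in band_signal)

def track_local_minimum(frames, window):
    """For each sub-band, find the minimum-energy frame within the last `window` frames.

    Returns one (E_min, X_min) pair per sub-band. The minima of different
    sub-bands may come from different frames.
    """
    recent = frames[-window:]
    num_bands = len(recent[0])
    result = []
    for i in range(num_bands):
        best = min(recent, key=lambda frame: band_energy(frame[i]))
        result.append((band_energy(best[i]), best[i]))
    return result
```

In the small example below, the minimum of the first sub-band comes from the second frame and the minimum of the second sub-band from the third frame.

```python
frames = [
    [[3.0, 4.0], [1.0, 1.0]],
    [[1.0, 0.0], [2.0, 2.0]],
    [[2.0, 2.0], [0.5, 0.5]],
]
minima = track_local_minimum(frames, window=3)
```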

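The noise estimation module 130 is configured to use X_{i}^{\min} (f) of each sub-band to estimate the noise signal of the corresponding sub-band, and to splice the estimation results into an estimate N(f) of the full-band noise.

1) Estimating the noise signal of each sub-band:

The minimum-energy tracking result X_{i}^{\min} (f) can be used directly as the noise estimate of each sub-band, or it can be corrected further to achieve a more accurate noise estimate. The correction is designed as follows: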

An audio signal can be divided in the time domain into a number of time units, each of which serves as a basic processing unit. In audio processing the typical basic processing unit is the audio frame, and the embodiments of the invention are likewise described with the "audio frame" as the basic processing unit, but this should not be construed as limiting the scheme of the invention.

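For each audio frame, a signal-to-noise ratio is computed over the sub-bands:

SNR (f) = \sum_{i} \frac{E_{i}}{E_{i}^{\min}}

where E_{i} denotes the total energy of the i-th sub-band and E_{i}^{\min} its local minimum energy; or, summing only over sub-bands K/3 through K:

SNR (f) = \sum_{i = K / 3}^{K} \frac{E_{i}}{E_{i}^{\min}}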

For a non-speech frame, its corresponding signal in each sub-band, X_{i}^{non - speech} (f), is further determined. X_{i}^{non - speech} (f) represents the signal characteristic value of a given frame in the i-th sub-band; if a sub-band contains several non-speech frames, X_{i}^{non - speech} (f) can be obtained by averaging them with equal weights.
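
A minimal sketch of the frame-level ratio and the equal-weight averaging is given below. The surviving text does not state how a frame is judged to be non-speech, so the fixed `threshold` comparison is purely an assumption, as are the function names and the requirement that every minimum energy be positive.

```python
def frame_snr(band_energies, min_energies, first_band=0):
    """Sum of E_i / E_i^min over sub-bands first_band..K-1 (minima assumed > 0)."""
    pairs = zip(band_energies[first_band:], min_energies[first_band:])
    return sum(e / e_min for e, e_min in pairs)

def average_non_speech(frames, energies, min_energies, threshold):
    """Average, with equal weights, the sub-band signals of the non-speech frames.

    A frame is treated as non-speech when its ratio is below `threshold`
    (an assumed rule). Returns one averaged signal per sub-band, or None
    when no frame qualifies.
    """
    quiet = [
        frame for frame, e in zip(frames, energies)
        if frame_snr(e, min_energies) < threshold
    ]
    if not quiet:
        return None
    num_bands = len(quiet[0])
    return [
        [sum(bins) / len(quiet) for bins in zip(*(frame[i] for frame in quiet))]
        for i in range(num_bands)
    ]
```

Passing a non-zero `first_band` restricts the sum to the upper sub-bands, in the spirit of the second formula above.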

The noise estimation submodule is configured to use the non-speech time unit signal of each sub-band, X_{i}^{non - speech} (f), together with the signal corresponding to the minimum energy value, X_{i}^{\min} (f), to estimate the noise signal of the corresponding sub-band. That is, the non-speech signals are used to correct

X_{1}^{\min} (f), ..., X_{K}^{\min} (f)

to obtain a "more reliable" noise estimate for each Mel sub-band:

X_{1}^{update} (f), ..., X_{K}^{update} (f).

The noise of each sub-band can be corrected as follows:

X_{i}^{update} (f) = λ_{i} X_{i}^{\min} (f) + (1 - λ_{i}) X_{i}^{non - speech} (f)

That is, X_{i}^{update} (f) is the weighted combination of two parts: the non-speech signal X_{i}^{non - speech} (f) and the signal corresponding to the minimum energy value, X_{i}^{\min} (f).

Here λ_{i} is a preset weight with values in [0, 1]; it controls the step size of the correction.

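The two extreme cases are:

λ_{i} = 0: the non-speech signal is always used as the noise estimate, and the minimum-energy tracking result is not used.

λ_{i} = 1: the minimum-energy tracking result is always used as the noise estimate, which is equivalent to making no correction.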

In practice, different values of λ_{i} can be chosen according to the specific environment and noise characteristics. In addition, a different step size λ_{i} can be chosen for each Mel band. For example, in the low bands the noise tends to be stationary, so λ_{i} can be chosen somewhat smaller, which is "safer"; in the high bands the noise tends to be non-stationary, so λ_{i} can be chosen larger, so that changes in the noise can be tracked dynamically.
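
The weighted correction with a separate λ_{i} per sub-band can be sketched as below. The function name and the list-of-lists layout (one list of bins per sub-band) are assumptions for illustration.

```python
def update_noise(x_min, x_non_speech, lambdas):
    """Apply X_i^update = lambda_i * X_i^min + (1 - lambda_i) * X_i^non-speech per sub-band."""
    updated = []
    for band_min, band_ns, lam in zip(x_min, x_non_speech, lambdas):
        if not 0.0 <= lam <= 1.0:
            raise ValueError("lambda_i must lie in [0, 1]")
        updated.append([lam * a + (1.0 - lam) * b for a, b in zip(band_min, band_ns)])
    return updated
```

For example, with three sub-bands and weights 1.0, 0.0 and 0.5, the first sub-band keeps the minimum-energy signal, the second takes the non-speech signal, and the third is their midpoint:

```python
x_min = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
x_ns = [[3.0, 3.0], [3.0, 3.0], [3.0, 3.0]]
result = update_noise(x_min, x_ns, [1.0, 0.0, 0.5])
```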

Two schemes for estimating the noise signal of each sub-band are given above. The advantage of the first is its simplicity: in simple environments, such as stationary or quasi-stationary noise, the minimum-energy tracking result can be used directly as the noise estimate. The second adds the ability to track noise dynamically: it uses the signal of the "non-speech frames" (that is, the real-time noise signal) to correct the minimum Mel energy tracking result, and it gives the designer considerable freedom to choose correction step factors suited to the actual noise environment. More importantly, different step sizes can be chosen according to the noise characteristics of different Mel bands, so that the noise in each band is updated in a targeted way, which further improves the accuracy of the estimate for non-stationary noise.

2) Splicing the noise signals of the sub-bands:

The processing in 1) yields the noise signal estimate of each sub-band:

X_{1}^{\min} (f), . . . . . ., X_{K}^{\min} (f)

Or

X_{1}^{update} (f), . . . . . ., X_{K}^{update} (f) .

Because the K sub-band signals obtained by Mel-domain decomposition overlap in the ordinary frequency domain, N(f) cannot be obtained by simply adding them. Specifically, a "segmented splicing method" can be used. As an illustration, for two adjacent Mel bands X_{i}^{update} (f) and X_{i + 1}^{update} (f), whose Mel band ranges are [f_{i}^{lower}, f_{i}^{upper}] and [f_{i + 1}^{lower}, f_{i + 1}^{upper}] respectively, the splicing is as follows:

N^{update} (f) = X_{i}^{update} (f), if f_{i}^{lower} / 3 < f < f_{i}^{upper} / 3

N^{update} (f) = X_{i + 1}^{update} (f), if f_{i + 1}^{lower} / 3 < f < f_{i + 1}^{upper} / 3

N^{update} (f) = [X_{i}^{update} (f) + X_{i + 1}^{update} (f)] / 2, if f_{i}^{upper} / 3 < f < f_{i + 1}^{lower} / 3

Besides the segmented splicing method, an "interleaved Mel-domain splicing method" can also be used, in which the odd-numbered and the even-numbered Mel sub-bands are spliced separately. Because the odd-numbered filters and the even-numbered filters of the Mel triangular filter bank can each form a full-band signal without overlap, each set can be spliced directly.

Of course, the above only introduces two typical splicing methods. Depending on the characteristics of the Mel-domain signal decomposition, those skilled in the art may also splice the decomposed sub-band signals by other methods; the embodiments of the invention do not limit the specific splicing method.
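
The idea behind segmented splicing (take a single sub-band's estimate where only one band covers a frequency, and average where two bands overlap) can be sketched as below. This is an assumption-laden illustration: it treats a frequency as covered by a band when it lies strictly inside that band's [lower, upper] range, which may not match the exact boundary conditions of the formulas above, and the function name and data layout are hypothetical.

```python
def splice(band_estimates, band_ranges, bin_freqs):
    """Combine per-band noise estimates into one full-band estimate N(f).

    band_estimates[i][k] is the estimate of sub-band i at frequency bin k,
    band_ranges[i] is that band's (lower, upper) range in Hz. Bins covered
    by several bands get the average of those bands; bins covered by no
    band get 0.0 (an assumed fallback).
    """
    noise = []
    for k, f in enumerate(bin_freqs):
        covering = [
            est[k] for est, (lo, hi) in zip(band_estimates, band_ranges)
            if lo < f < hi
        ]
        noise.append(sum(covering) / len(covering) if covering else 0.0)
    return noise
```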

The noise suppression module 140 is configured to generate a filter with N(f) as the component to remove, and to perform noise suppression on the target speech signal X(f).

H (f) = \sqrt{1 - \frac{1}{SNR (f)}} = \sqrt{1 - \frac{N^{update} (f)}{X (f)}}
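
The gain in the formula above can be computed per frequency bin as in the following sketch. Two details are assumptions added here, not taken from the text: the quantity under the square root is clamped at `floor` so that it never goes negative when the noise estimate exceeds the signal, and a bin with zero signal is given zero gain. The function names are illustrative.

```python
import math

def suppression_gain(noise, signal, floor=0.0):
    """Per-bin gain H(f) = sqrt(1 - N(f) / X(f)), clamped below at sqrt(floor)."""
    gains = []
    for n, x in zip(noise, signal):
        ratio = n / x if x > 0.0 else 1.0  # zero signal: treat the bin as all noise
        gains.append(math.sqrt(max(1.0 - ratio, floor)))
    return gains

def apply_filter(spectrum, gains):
    """Scale each bin of the input spectrum by its gain."""
    return [g * x for g, x in zip(gains, spectrum)]
```

A bin with no estimated noise passes unchanged, and a bin whose noise estimate is at least as large as the signal is removed entirely.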

For convenience of description, the device above is described in terms of separate functional modules. Of course, when the invention is implemented, the functions of the modules may be realized in one or more pieces of software and/or hardware. Modules described as separate components may or may not be physically separate, and components shown as units may or may not be physical modules; that is, they may be located in one place or distributed over several network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

From the description of the embodiments above, those skilled in the art can clearly understand that the invention can be implemented by software together with the necessary general-purpose hardware platform. On this understanding, the technical scheme of the invention in essence, or the part of it that contributes over the prior art, can be embodied as a software product. The computer software product can be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to carry out the method described in the embodiments of the invention or in parts of the embodiments.

The above is only a specific embodiment of the invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principle of the invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the invention.

Claims

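1. A speech signal noise suppression method, characterized in that the method comprises:

performing Mel-domain decomposition on a target speech signal X(f) to obtain K Mel-domain sub-band signals X_{i} (f), i = 1, 2, ..., K;

using a preset time window to determine, for each sub-band, the local minimum energy value in the time domain, E_{N, i}^{\min}, and the signal corresponding to that minimum energy value, X_{i}^{\min} (f);

using X_{i}^{\min} (f) of each sub-band to estimate the noise signal of the corresponding sub-band, and splicing the estimation results to obtain an estimate N(f) of the full-band noise;

generating a filter with N(f) as the component to remove, and performing noise suppression on the target speech signal X(f).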

2. The method according to claim 1, characterized in that said using X_{i}^{\min} (f) of each sub-band to estimate the noise signal of the corresponding sub-band comprises:

taking X_{i}^{\min} (f) of each sub-band as the noise estimate signal of the corresponding sub-band.

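3. The method according to claim 1, characterized in that said using X_{i}^{\min} (f) of each sub-band to estimate the noise signal of the corresponding sub-band comprises:

taking the local minimum energy value E_{N, i}^{\min} of each sub-band as the noise energy estimate of that sub-band;

calculating, from the noise energy estimates of the sub-bands, the average sub-band signal-to-noise ratio of each time unit of the target speech signal;

using the non-speech time unit signal X_{i}^{non - speech} (f) of each sub-band and the signal corresponding to the minimum energy value, X_{i}^{\min} (f), to estimate the noise signal of the corresponding sub-band.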

4. The method according to claim 3, characterized in that said calculating, from the noise energy estimates of the sub-bands, the average sub-band signal-to-noise ratio of each time unit of the target speech signal comprises:

5. The method according to claim 3, characterized in that said using the non-speech time unit signal X_{i}^{non - speech} (f) of each sub-band and the signal corresponding to the minimum energy value, X_{i}^{\min} (f), to estimate the noise signal of the corresponding sub-band comprises:

estimating the noise signal of the i-th sub-band using

X_{i}^{update} (f) = λ_{i} X_{i}^{\min} (f) + (1 - λ_{i}) X_{i}^{non - speech} (f),

where λ_{i} is a preset weight.

6. The method according to claim 5, characterized in that different values of λ_{i} are preset for different sub-bands.

7. The method according to claim 6, characterized in that said presetting different values of λ_{i} for different sub-bands comprises:

8. The method according to any one of claims 3 to 7, characterized in that the time unit is an audio frame.

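9. A speech signal noise suppression device, characterized in that the device comprises:

a signal decomposition module, configured to perform Mel-domain decomposition on a target speech signal X(f) to obtain K Mel-domain sub-band signals X_{i} (f), i = 1, 2, ..., K;

a minimum energy determination module, configured to use a preset time window to determine, for each sub-band, the local minimum energy value in the time domain, E_{N, i}^{\min}, and the signal corresponding to that minimum energy value, X_{i}^{\min} (f);

a noise estimation module, configured to use X_{i}^{\min} (f) of each sub-band to estimate the noise signal of the corresponding sub-band, and to splice the estimation results to obtain an estimate N(f) of the full-band noise;

a noise suppression module, configured to generate a filter with N(f) as the component to remove, and to perform noise suppression on the target speech signal X(f).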

10. The device according to claim 9, characterized in that the noise estimation module is specifically configured to:

take X_{i}^{\min} (f) of each sub-band as the noise estimate signal of the corresponding sub-band.

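11. The device according to claim 9, characterized in that the noise estimation module comprises:

a noise estimation submodule, configured to use the non-speech time unit signal X_{i}^{non - speech} (f) of each sub-band and the signal corresponding to the minimum energy value, X_{i}^{\min} (f), to estimate the noise signal of the corresponding sub-band.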

12. The device according to claim 11, characterized in that the time unit type judging submodule is specifically configured to:

13. The device according to claim 11, characterized in that the noise estimation submodule is specifically configured to:

estimate the noise signal of the i-th sub-band using

X_{i}^{update} (f) = λ_{i} X_{i}^{\min} (f) + (1 - λ_{i}) X_{i}^{non - speech} (f),

where λ_{i} is a preset weight.

14. The device according to claim 13, characterized in that different values of λ_{i} are preset for different sub-bands.

15. The device according to claim 14, characterized in that a smaller value of λ_{i} is preset for sub-bands in the lower frequency range, and a larger value of λ_{i} is preset for sub-bands in the higher frequency range.

16. The device according to any one of claims 11 to 15, characterized in that the time unit is an audio frame.