CN104103278A

CN104103278A - Real time voice denoising method and device

Info

Publication number: CN104103278A
Application number: CN201310112271.1A
Authority: CN
Inventors: 朱宝
Original assignee: Beijing Oak Pacific Interactive Technology Development Co Ltd
Current assignee: Beijing Oak Pacific Interactive Technology Development Co Ltd
Priority date: 2013-04-02
Filing date: 2013-04-02
Publication date: 2014-10-15

Abstract

The invention provides a method and a device for real-time voice denoising. The method comprises: generating a frequency-domain noisy speech signal from speech input received by a voice receiver; calculating a log-spectral posterior signal-to-noise ratio from the frequency-domain noisy speech signal, the log-spectral posterior signal-to-noise ratio being the ratio between the logarithm of the power spectrum of the current frame of the frequency-domain noisy speech signal and the logarithm of the noise power estimate of the previous frame; obtaining a noise power spectrum estimate from the log-spectral posterior signal-to-noise ratio based on a weighted noise estimation algorithm; generating a Wiener filter gain function from the noise power spectrum estimate and filtering the frequency-domain noisy speech signal with the gain function to generate a frequency-domain denoised speech signal; and generating a time-domain denoised speech signal from the frequency-domain denoised speech signal, the time-domain denoised speech signal being further processed by the voice receiver. Correspondingly, the invention also provides a device for real-time voice denoising.

Description

Method and device for real-time voice denoising

Technical field

The present invention relates to the field of digital speech processing, and in particular to a method and a device for real-time voice denoising.

Background technology

In noise suppression, the Wiener filtering algorithm has always been the most important and most effective estimation algorithm, and it is widely used in fields such as image, video and speech processing. For speech denoising in particular, many methods based on Wiener filtering exist. However, these methods usually cannot be applied well to voice receivers with limited processing power, such as intelligent mobile terminals. Taking an intelligent mobile terminal as an example, the limitations are as follows. First, existing speech denoising methods do not track noise fast enough and are relatively complex to implement, so they do not meet the real-time operation requirements of an intelligent mobile terminal. Second, for real-time noise estimation, the existing approach is usually to take the start frames of the noisy speech signal as the initial noise; as a result, the noise cannot be tracked accurately for a period of time after the speech starts, and the processed sound is distorted during that period. Although this period is usually short, the user of the intelligent mobile terminal can still clearly perceive the distortion, and the user experience suffers. In addition, current Wiener-filtering-based speech denoising methods do not distinguish well between weak speech and noise, which easily causes distortion of weak speech.

Therefore, it is desirable to provide a Wiener-filtering-based method and device for real-time voice denoising that solve the above problems.

Summary of the invention

To overcome the above defects of the prior art, the present invention provides a method for real-time voice denoising, the method comprising:

Generating a frequency-domain noisy speech signal from speech input received by a voice receiver;

Calculating a log-spectral posterior signal-to-noise ratio from the frequency-domain noisy speech signal, the log-spectral posterior signal-to-noise ratio being the ratio between the logarithm of the power spectrum of the current frame of the frequency-domain noisy speech signal and the logarithm of the noise power estimate of the previous frame;

Obtaining a noise power spectrum estimate from the log-spectral posterior signal-to-noise ratio based on a weighted noise estimation algorithm;

Generating a Wiener filter gain function from the noise power spectrum estimate, and filtering the frequency-domain noisy speech signal with the gain function to generate a frequency-domain denoised speech signal;

Generating a time-domain denoised speech signal from the frequency-domain denoised speech signal, the time-domain denoised speech signal being further processed by the voice receiver.

According to one aspect of the present invention, the logarithm used in the method is the natural logarithm, with base e.

According to one aspect of the present invention, calculating the log-spectral posterior signal-to-noise ratio in the method comprises: using the power of Gaussian white noise as the initial noise power estimate of the frequency-domain noisy speech signal.

According to another aspect of the present invention, obtaining the noise power spectrum estimate from the log-spectral posterior signal-to-noise ratio based on the weighted noise estimation algorithm comprises: calculating a weighting factor; and setting a flag value, which is used to distinguish strong speech frames from weak speech frames, and obtaining the noise power spectrum estimate from the log-spectral posterior signal-to-noise ratio, the weighting factor and the flag value.

According to a further aspect of the present invention, setting the flag value and obtaining the noise power spectrum estimate from the log-spectral posterior signal-to-noise ratio, the weighting factor and the flag value comprises: if the log-spectral posterior signal-to-noise ratio of the current frame is greater than a first threshold, judging that the current frame is strong speech, setting the flag value, and keeping the noise power spectrum estimate unchanged; if the log-spectral posterior signal-to-noise ratio of the current frame is less than or equal to the first threshold and the flag value is set, judging that the current frame is weak speech following strong speech, progressively decrementing the flag value to a predetermined value, and updating the noise power spectrum estimate from the log-spectral posterior signal-to-noise ratio and the weighting factor; and if the log-spectral posterior signal-to-noise ratio of the current frame is less than or equal to the first threshold and the flag value is not set, updating the noise power spectrum estimate from the log-spectral posterior signal-to-noise ratio and the weighting factor.

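According to a further aspect of the present invention, the flag value in the method is defined as:

flag = [\frac{time \times fre}{win_length}]

where time is the duration, in seconds, of the weak speech to be protected, fre is the sampling frequency of the time-domain noisy speech signal, win_length is the window length of the Hamming window, and [x] denotes the largest integer not exceeding x.

The step by which the flag value is decremented is defined as:

Δflag = \frac{Δwin_length}{win_length}

where Δwin_length is the window shift of the Hamming window.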

According to a further aspect of the present invention, generating the time-domain denoised speech signal from the frequency-domain denoised speech signal comprises the following. The frequency-domain denoised speech signal consists of multiple groups of data that follow one another in sequence, and the groups are processed one after another. If the group to be processed is the first group of the frequency-domain denoised speech signal: the last frame of the first group is cached; one frame of zero data is prepended to the first group; the first group, with the frame of zero data prepended, is processed by the overlap-add method; the overlap-add result of the last frame of the first group is cached; and the position information of the data in the last frame of the first group that has not been fully overlap-added is cached. If the group to be processed is the N-th group of the frequency-domain denoised speech signal, where N is greater than or equal to 2: the last frame of the N-th group is cached; the last frame of the (N-1)-th group is prepended to the N-th group; based on the position information of the data in the last frame of the (N-1)-th group that has not been fully overlap-added, the N-th group, with the last frame of the (N-1)-th group prepended, is processed by the overlap-add method; the processed overlap-add result is superimposed on the overlap-add result of the last frame of the (N-1)-th group; the overlap-add result of the last frame of the N-th group is cached; and the position information of the data in the last frame of the N-th group that has not been fully overlap-added is cached.
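The group-by-group overlap-add with caching described above can be illustrated with a simplified sketch. It is not the claimed procedure itself: instead of caching the last frame together with position information, it caches only the partial sums that still await contributions from later frames, which has the same effect of letting each group continue where the previous one stopped. The class name `StreamingOverlapAdd` and its parameters are hypothetical, and the frames are assumed to be already transformed back to the time domain.

```python
class StreamingOverlapAdd:
    """Simplified streaming overlap-add (illustrative, not the claimed scheme)."""

    def __init__(self, win_len=256, hop=64):
        self.win_len = win_len
        self.hop = hop
        # Partial sums that later frames still contribute to. Starting from
        # zeros plays the role of the zero frame prepended to the first group.
        self.tail = [0.0] * (win_len - hop)

    def process(self, frames):
        """Overlap-add one group of time-domain frames, each win_len samples
        long, and return the samples that are now complete."""
        out = []
        for frame in frames:
            buf = self.tail + [0.0] * self.hop
            for i, value in enumerate(frame):
                buf[i] += value
            out.extend(buf[:self.hop])   # no later frame overlaps these samples
            self.tail = buf[self.hop:]   # cached for the next frame or group
        return out
```

Because the cached tail carries over between calls, feeding the frames in one group or in several consecutive groups yields the same output, and no frame is processed twice.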

According to a further aspect of the present invention, the method further comprises: after the time-domain denoised speech signal is generated from the frequency-domain denoised speech signal, applying band-pass correction to the time-domain denoised speech signal.

According to a further aspect of the present invention, the further processing of the time-domain denoised speech signal by the voice receiver in the method comprises transmitting, playing and/or storing it.

According to a further aspect of the present invention, the voice receiver in the method is an intelligent mobile terminal.

Correspondingly, the present invention also provides a device for real-time voice denoising, the device comprising:

A time-to-frequency conversion module, configured to generate a frequency-domain noisy speech signal from speech input received by a voice receiver;

A signal-to-noise ratio calculation module, configured to calculate a log-spectral posterior signal-to-noise ratio from the frequency-domain noisy speech signal, the log-spectral posterior signal-to-noise ratio being the ratio between the logarithm of the power spectrum of the current frame of the frequency-domain noisy speech signal and the logarithm of the noise power estimate of the previous frame;

An estimation module, configured to obtain a noise power spectrum estimate from the log-spectral posterior signal-to-noise ratio based on a weighted noise estimation algorithm;

A Wiener filtering module, configured to generate a Wiener filter gain function from the noise power spectrum estimate and to filter the frequency-domain noisy speech signal with the gain function to generate a frequency-domain denoised speech signal;

A frequency-to-time conversion module, configured to generate a time-domain denoised speech signal from the frequency-domain denoised speech signal.

According to one aspect of the present invention, the logarithm used in the device is the logarithm with base e.

According to one aspect of the present invention, the device further comprises a noise generation module, configured to generate Gaussian white noise; when calculating the log-spectral posterior signal-to-noise ratio, the signal-to-noise ratio calculation module uses the power of the Gaussian white noise as the initial noise power estimate of the frequency-domain noisy speech signal.

According to another aspect of the present invention, the estimation module in the device comprises: a weighting factor calculation unit, configured to calculate a weighting factor; and a noise power spectrum estimation unit, configured to set a flag value, which is used to distinguish strong speech frames from weak speech frames, and to obtain the noise power spectrum estimate from the log-spectral posterior signal-to-noise ratio, the weighting factor and the flag value.

According to a further aspect of the present invention, the estimation module in the device further comprises a judging unit, configured: when the log-spectral posterior signal-to-noise ratio of the current frame is greater than a first threshold, to judge that the current frame is strong speech and to trigger the noise power spectrum estimation unit to set the flag value and keep the noise power spectrum estimate unchanged; when the log-spectral posterior signal-to-noise ratio of the current frame is less than or equal to the first threshold and the flag value is set, to judge that the current frame is weak speech following strong speech and to trigger the noise power spectrum estimation unit to progressively decrement the flag value to a predetermined value and update the noise power spectrum estimate from the log-spectral posterior signal-to-noise ratio and the weighting factor; and when the log-spectral posterior signal-to-noise ratio of the current frame is less than or equal to the first threshold and the flag value is not set, to update the noise power spectrum estimate from the log-spectral posterior signal-to-noise ratio and the weighting factor.

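According to a further aspect of the present invention, the flag value in the device is defined as:

flag = [\frac{time \times fre}{win_length}]

where time is the duration, in seconds, of the weak speech to be protected, fre is the sampling frequency of the time-domain noisy speech signal, win_length is the window length of the Hamming window, and [x] denotes the largest integer not exceeding x.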

According to a further aspect of the present invention, the frequency-domain denoised speech signal in the device consists of multiple groups of data that follow one another in sequence, and the groups are processed one after another. If the group to be processed is the first group of the frequency-domain denoised speech signal, the frequency-to-time conversion module: caches the last frame of the first group; prepends one frame of zero data to the first group; processes the first group, with the frame of zero data prepended, by the overlap-add method; caches the overlap-add result of the last frame of the first group; and caches the position information of the data in the last frame of the first group that has not been fully overlap-added. If the group to be processed is the N-th group of the frequency-domain denoised speech signal, where N is greater than or equal to 2, the frequency-to-time conversion module: caches the last frame of the N-th group; prepends the last frame of the (N-1)-th group to the N-th group; based on the position information of the data in the last frame of the (N-1)-th group that has not been fully overlap-added, processes the N-th group, with the last frame of the (N-1)-th group prepended, by the overlap-add method; superimposes the processed overlap-add result on the overlap-add result of the last frame of the (N-1)-th group; caches the overlap-add result of the last frame of the N-th group; and caches the position information of the data in the last frame of the N-th group that has not been fully overlap-added.

According to a further aspect of the present invention, the device further comprises a band-pass filtering module, configured to apply band-pass correction to the time-domain denoised speech signal.

According to a further aspect of the present invention, the device further comprises a processing module, by which the voice receiver further processes the time-domain denoised speech signal, the further processing comprising transmitting, playing and/or storing.

According to a further aspect of the present invention, the voice receiver in the device is an intelligent mobile terminal.

Compared with the prior art, the present invention has the following advantages:

(1) The present invention improves on the weighted noise estimation algorithm by using the log-spectral posterior signal-to-noise ratio to estimate noise in real time. On the one hand, the improved algorithm retains the simplicity, effectiveness and fast noise tracking of the weighted noise estimation algorithm itself, and can meet the real-time operation requirements of voice receivers with limited processing power, such as intelligent mobile terminals. On the other hand, using the log-spectral posterior signal-to-noise ratio makes noise tracking faster; moreover, because the human ear is more sensitive to decibel values (that is, logarithmic values), using logarithmic values in place of the plain signal-to-noise ratio for speech processing better matches the auditory characteristics of the human ear. A further advantage of the present invention is that, for real-time noise estimation, white noise is used as the initial noise estimate, so that noise can be tracked faster and more accurately; and because all frequencies of white noise have the same energy, given the frequency response characteristics of the human ear, any distortion that does occur has little effect on the user's reception and recognition of the speech.

(2) Based on the common property that strong speech is generally followed by weak speech, a flag value is set when the speech signal is judged to be strong speech; when the speech signal changes, the flag value is progressively decremented until it reaches 0, and only then does the noise update begin. Weak speech is thus protected during the period in which the flag value is being decremented.

(3) When the frequency-domain denoised speech signal obtained after Wiener filtering is processed by the overlap-add method, the following are cached: the last frame of the frequency-domain denoised speech signal processed so far, the overlap-add data obtained by processing that last frame with the overlap-add method, and the position information of the data in that last frame that has not been fully processed by the overlap-add method. This caching mechanism ensures that the signal is processed frame by frame without any frame being processed twice, effectively reduces the amount of computation, and thereby speeds up processing.

Brief description of the drawings

Other features, objects and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:

Fig. 1 is a flowchart of the method for real-time voice denoising according to the present invention;

Fig. 2(a) shows a time-domain noisy speech signal received by a voice receiver;

Fig. 2(b) shows the speech signal after processing by the method for real-time voice denoising provided by the present invention;

Fig. 3 is a schematic structural diagram of the device for real-time voice denoising according to the present invention;

Fig. 4 is a schematic structural diagram of an intelligent terminal for implementing the method and device for real-time voice denoising provided by the present invention.

In the drawings, the same or similar reference numerals denote the same or similar components.

Embodiment

For a better understanding and explanation of the present invention, it is described in further detail below with reference to the accompanying drawings.

As is well known, Wiener filtering is the foundation of linear filtering theory. In the Wiener filtering algorithm, the input of a linear filter is assumed to be the sum of a useful signal and noise, both of which are wide-sense stationary processes with known second-order statistics; the parameters of the optimal linear filter are then obtained according to the minimum mean square error criterion. On this basis, the optimal linear filter can also be obtained according to the maximum output signal-to-noise ratio criterion, statistical detection criteria and other optimality criteria. The present invention proposes a method and a device for real-time voice denoising based on the Wiener filtering algorithm. Before the method and device are described in detail, the main idea of the present invention is set out.

Specifically, the time-domain noisy speech signal can be expressed as follows:

y(t)=s(t)+n(t)

Wherein y (t) is the time-domain noisy speech signal, s (t) is the time-domain original speech signal, and n (t) is the time-domain noise signal; s (t) and n (t) are uncorrelated.

Applying the short-time Fourier transform to the time-domain noisy speech signal y (t) converts it into the frequency-domain noisy speech signal:

Y(k,ω)=S(k,ω)+N(k,ω)

Wherein k is the frame index and ω is the frequency bin index; Y (k, ω), S (k, ω) and N (k, ω) are the spectral components at frequency bin ω of frame k of the frequency-domain noisy speech signal, the frequency-domain original speech signal and the frequency-domain noise signal, respectively.

A Wiener filter with gain function H (k, ω) is designed; after Wiener filtering, the frequency-domain noisy speech signal Y (k, ω) yields an estimate of the frequency-domain original speech signal S (k, ω):

\hat{S} (k, ω) = Y (k, ω) \times H (k, ω)

Wherein \hat{S} (k, ω) is the estimate of S (k, ω).

The design criterion for the gain function H (k, ω) of the Wiener filter is that the estimate \hat{S} (k, ω) of the frequency-domain original speech signal obtained after filtering should equal the frequency-domain original speech signal S (k, ω), that is:

\hat{S} (k, ω) = S (k, ω)

That is to say, Wiener filtering can extract the original speech signal from the noisy speech signal. Therefore,

\hat{S} (k, ω) = Y (k, ω) \times H (k, ω)

can be rewritten as:

S(k,ω)=Y(k,ω)×H(k,ω)

Rearranging gives:

H (k, ω) = \frac{S (k, ω)}{Y (k, ω)}

Substituting Y (k, ω)=S (k, ω)+N (k, ω) into the above formula and simplifying gives:

H (k, ω) = \frac{\frac{S (k, ω)}{N (k, ω)}}{\frac{S (k, ω)}{N (k, ω)} + 1}

The a priori signal-to-noise ratio is defined as:

PRIO_SNR (k, ω) = \frac{{| S (k, ω) |}^{2}}{{| N (k, ω) |}^{2}}

Wherein PRIO_SNR (k, ω) is the a priori signal-to-noise ratio at frequency bin ω of frame k of the frequency-domain noisy speech signal. Substituting PRIO_SNR (k, ω) into the above formula gives:

H (k, ω) = \sqrt{\frac{PRIO_SNR (k, ω)}{PRIO_SNR (k, ω) + 1}}
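As a small illustration of the gain formula above, the following sketch evaluates H (k, ω) from a given a priori signal-to-noise ratio and applies it to the spectral components of one frame. The function names `wiener_gain` and `apply_gain` are hypothetical, and the a priori signal-to-noise ratio is simply taken as an input here, since its estimation is a separate matter.

```python
import math


def wiener_gain(prio_snr):
    """Gain H = sqrt(PRIO_SNR / (PRIO_SNR + 1)) for one frequency bin."""
    return math.sqrt(prio_snr / (prio_snr + 1.0))


def apply_gain(spectrum, prio_snrs):
    """Multiply each complex spectral component Y(k, w) by its gain H(k, w)."""
    return [y * wiener_gain(snr) for y, snr in zip(spectrum, prio_snrs)]
```

The gain tends to 0 where the a priori signal-to-noise ratio is low and approaches 1 where it is high, so bins dominated by noise are attenuated while bins dominated by speech pass almost unchanged.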

The above analysis shows that the key to removing noise well (denoising, for short) with the Wiener filtering algorithm is the estimation of the a priori signal-to-noise ratio.

The estimation of the a priori signal-to-noise ratio in turn depends on the estimation of the noise power spectrum, so estimating the noise power spectrum accurately is very important. Among existing noise power spectrum estimation algorithms, the weighted noise estimation (WN, Weight Noise Estimation) algorithm is known for being simple and effective and for tracking noise quickly. Its main idea is as follows: a weighting factor is first calculated from the estimated posterior signal-to-noise ratio; the noisy speech signal is then multiplied by the weighting factor to obtain a weighted value; and the weighted values are averaged to obtain the noise power spectrum to be estimated. The present invention improves on the WN algorithm. On the one hand, the improved algorithm retains the simplicity, effectiveness and fast noise tracking of the WN algorithm itself, and can meet the real-time operation requirements of voice receivers with limited processing power, such as intelligent mobile terminals. On the other hand, it tracks noise faster, processes speech in a way that better matches the auditory characteristics of the human ear, and effectively overcomes the WN algorithm's inability to distinguish accurately between weak speech and noise, thereby achieving a good real-time voice denoising effect.

It should be noted, as those skilled in the art will understand, that the particular suitability of the present invention for voice receivers with limited processing power does not mean that it is limited to them; clearly, applying it to voice receivers with greater processing power will give an even better real-time voice denoising effect.

The method for real-time voice denoising provided by the present invention is described below on the basis of the above idea. Please refer to Fig. 1, which is a flowchart of the method for real-time voice denoising according to the present invention. As shown in the figure, the method comprises the following steps:

In step S100, a frequency-domain noisy speech signal is generated from speech input received by a voice receiver.

Specifically, this embodiment is described taking an intelligent mobile terminal as the voice receiver. The speech signal received by the intelligent mobile terminal is generally a time-domain noisy speech signal, formed for example after interference from ambient noise. After receiving the time-domain noisy speech signal, the intelligent mobile terminal first samples it in order to convert the analog signal into a digital signal. In this embodiment, the sampling frequency is 44100 Hz, that is, 44100 samples are obtained per second. Those skilled in the art will appreciate that the specific sampling frequency can be set according to the actual requirement on the reproduction quality of the speech signal; the higher the sampling frequency, the more faithfully and naturally the speech signal is reproduced. The sampled time-domain noisy speech signal is then divided into frames. In this embodiment, framing is performed by applying a Hamming window: a Hamming window of fixed length cuts one frame of the time-domain noisy speech signal out of the sampled data, and the window is then shifted by a certain length to produce the next frame. Here the window length of the Hamming window is defined as the number of samples contained in one frame of the time-domain noisy speech signal. The sampled signal is divided into frames for the following reason: the Wiener filtering algorithm is based on stationary random processes, whereas the time-domain noisy speech signal as a whole is a non-stationary process; within a short time range (generally taken to be 10 to 30 ms), however, a speech signal can be regarded as stationary. Each frame obtained after framing is therefore treated as stationary, so the Wiener filtering algorithm can be used to denoise each frame. In this embodiment, the window length of the Hamming window is set to 256, that is, a frame contains 256 samples, and the window shift is one quarter of the window length, that is, the Hamming window moves by 64 samples each time. After framing, each windowed time-domain noisy speech frame is converted into a frequency-domain noisy speech frame by the short-time Fourier transform, and endpoint detection is performed on the framed signal. In this embodiment, a short-time Fourier transform with 128 frequency bins is used. Sampling, Hamming windowing and the short-time Fourier transform are common technical means for those skilled in the art and, for brevity, are not described further here.
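The framing and transform of step S100 can be sketched as follows, using only the Python standard library. This is a minimal illustration under stated assumptions rather than the embodiment's implementation: the function names are hypothetical, the symmetric Hamming window formula is one common choice, "128 frequency bins" is read here as the first 128 bins of a 256-point transform, and a naive DFT stands in for a fast Fourier transform.

```python
import cmath
import math


def hamming(win_len):
    """Symmetric Hamming window of length win_len."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (win_len - 1))
            for n in range(win_len)]


def split_frames(samples, win_len=256, hop=64):
    """Cut samples into overlapping frames and apply the Hamming window."""
    window = hamming(win_len)
    frames = []
    for start in range(0, len(samples) - win_len + 1, hop):
        chunk = samples[start:start + win_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


def dft_bins(frame, n_bins=128):
    """Naive DFT of one frame, restricted to the first n_bins bins."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))
            for k in range(n_bins)]


def power_spectrum(frame, n_bins=128):
    """Power |Y(k, w)|^2 of one windowed frame at each retained bin."""
    return [abs(c) ** 2 for c in dft_bins(frame, n_bins)]
```

With a window length of 256 and a shift of 64, consecutive frames overlap by three quarters, which is what later makes overlap-add reconstruction necessary.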

In step S101, a log-spectral posterior signal-to-noise ratio is calculated from the frequency-domain noisy speech signal.

Specifically, in this embodiment the log-spectral posterior signal-to-noise ratio is defined as the ratio between the logarithm of the power spectrum of the current frame of the frequency-domain noisy speech signal and the logarithm of the noise power estimate of the previous frame. It can be expressed by the following formula:

POST_SNR (k, ω) = \frac{\log_{a} {| Y (k, ω) |}^{2}}{\log_{a} λ_{D} (k - 1, ω)}

Wherein POST_SNR (k, ω) is the log-spectral posterior signal-to-noise ratio at frequency bin ω of frame k of the frequency-domain noisy speech signal, |Y (k, ω)|² is the power spectrum of the frequency-domain noisy speech signal at frequency bin ω of frame k, and λ_D (k-1, ω) is the noise power estimate at frequency bin ω of frame k-1.

In this embodiment, the base a equals the constant e, that is, the natural logarithms of |Y (k, ω)|² and λ_D (k-1, ω) are taken. It should be noted that using logarithmic values makes the speech processing better match the auditory characteristics of the human ear and increases the speed of noise tracking; the smaller the base, the finer the sound, and the larger the base, the faster the noise tracking. Those skilled in the art can choose the base of the logarithm according to the required sound quality after denoising, the real-time requirements of the specific implementation, and the computing capability and speed of the device.

When the log-spectral posterior signal-to-noise ratio of the initial frame (that is, k=1) is calculated, the power of Gaussian white noise is preferably used as the initial noise power estimate for that frame. Gaussian white noise here refers to random noise whose power spectral density is uniformly distributed over the whole frequency domain.
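A minimal sketch of step S101 follows, using the natural logarithm as in this embodiment. The function names and the flat initial noise power are hypothetical illustration choices; the sketch also assumes that all powers are greater than 1, so that both logarithms are positive and the denominator is never zero.

```python
import math


def white_noise_init(n_bins, noise_power):
    """Flat initial noise estimate: the same power in every frequency bin."""
    return [noise_power] * n_bins


def log_post_snr(power, prev_noise):
    """Per-bin ratio ln|Y(k, w)|^2 / ln(lambda_D(k-1, w)).

    Assumes every value in power and prev_noise is greater than 1.
    """
    return [math.log(p) / math.log(d) for p, d in zip(power, prev_noise)]
```

For the first frame, `prev_noise` would be the output of `white_noise_init`; for later frames it would be the noise power estimate obtained for the preceding frame.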

In step S102, a noise power spectrum estimate is obtained from the log-spectral posterior signal-to-noise ratio based on the WN algorithm.

Specifically, a weighting factor is first calculated from the log-spectral posterior signal-to-noise ratio by means of a weighting factor function. The weighting factor function is as follows:

gain (k, ω) = \begin{cases} 1, & POST_SNR (k, ω) < γ_{1} \\ \frac{γ_{2} - POST_SNR (k, ω)}{γ_{2} - γ_{1}}, & γ_{1} \leq POST_SNR (k, ω) \leq γ_{2} \\ 0, & POST_SNR (k, ω) > γ_{2} \end{cases}

Wherein gain (k, ω) is the weighting factor of the frequency-domain noisy speech signal at frequency bin ω of frame k, and γ₁ and γ₂ are thresholds used to separate strong speech, weak speech or noise, and noise.

When POST_SNR (k, ω) > γ₂, the signal at frequency bin ω of frame k is judged to be strong speech, so the noise does not need to be updated and the weighting factor is 0. When γ₁ ≤ POST_SNR (k, ω) ≤ γ₂, the signal at frequency bin ω of frame k may be weak speech or may be noise; the noise needs to be updated, and the weighting factor is (γ₂ - POST_SNR (k, ω)) / (γ₂ - γ₁). When POST_SNR (k, ω) < γ₁, the signal at frequency bin ω of frame k is judged to be noise; the weighting factor is 1 and the noise needs to be updated. In this embodiment, γ₁=1 and γ₂=1.07. Those skilled in the art will understand that the specific values of the thresholds γ₁ and γ₂ can be set according to actual design needs.
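The weighting factor function can be written directly as a small piecewise function. The sketch below uses the threshold values of this embodiment as defaults; the function name `weighting_factor` is hypothetical.

```python
def weighting_factor(post_snr, gamma1=1.0, gamma2=1.07):
    """Piecewise weighting factor gain(k, w) for one frequency bin."""
    if post_snr < gamma1:
        return 1.0   # judged to be noise
    if post_snr > gamma2:
        return 0.0   # judged to be strong speech
    # Weak speech or noise: falls linearly from 1 at gamma1 to 0 at gamma2.
    return (gamma2 - post_snr) / (gamma2 - gamma1)
```

The linear segment makes the factor continuous at both thresholds, so a bin whose ratio drifts across γ₁ or γ₂ does not cause an abrupt change in how strongly it contributes to the noise update.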

Then a flag value is set, which is used to distinguish strong speech frames from weak speech frames, and the noise power spectrum estimate is obtained from the log-spectral posterior signal-to-noise ratio, the weighting factor and the flag value. In general, strong speech is very likely to be followed by weak speech, and weak speech is easily confused with noise. Therefore, to prevent distortion of weak speech, weak speech needs to be protected. This is done as follows:

If the log-spectral posterior signal-to-noise ratio of the current frame is greater than the first threshold, the current frame is judged to be strong speech; a flag value is set and the noise power spectrum estimate is kept unchanged. In this embodiment, for example, the first threshold can be set equal to γ₂: if the log-spectral posterior signal-to-noise ratio of the current frame is greater than γ₂, the current frame is judged to be strong speech, gain is set to 0, a flag value is set, and the noise power spectrum estimate is kept unchanged, that is, the noise power spectrum does not need to be updated. The flag value is preferably set as follows:

flag = [\frac{time \times fre}{win_length}]

Here, flag is the flag value; time is the duration, in seconds, for which weak speech is to be protected; fre is the sampling frequency of the time-domain noisy speech signal; win_length is the window length of the Hamming window; and [x] denotes the largest integer not exceeding x. As the formula shows, the flag value in this embodiment is an integer. For example, to protect 0.1 seconds of weak speech with a sampling frequency of 44100 Hz and a window length of 256, the flag value is set to:

flag = [\frac{0.1 \times 44100}{256}] = 17

If the log-spectral a posteriori SNR of the current frame is less than or equal to the first threshold and a flag value has been set, that is, the log-spectral a posteriori SNR has dropped from above the first threshold to at or below it, the current frame is judged to be weak speech following strong speech. The flag value is then decremented step by step, where the step of each decrement is:

Δflag = \frac{Δwin_length}{win_length}

Here, Δflag is the decrement step of the flag value, win_length is the window length of the Hamming window, and Δwin_length is the window shift length of the Hamming window. For example, if the window shift is one quarter of the window length, the decrement step equals 0.25, i.e. the flag value decreases by 0.25 each time.
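The flag value and its decrement step can be checked numerically with a short Python sketch. The function names and the derived quantities `frames_to_zero` and `protected_seconds` are illustrative assumptions and do not appear in the formulas above; they only show that counting the flag down by Δflag per frame covers roughly the intended protection time.

```python
import math

def initial_flag(protect_seconds, sample_rate_hz, win_length):
    """flag = [time * fre / win_length], [x] being the largest integer <= x."""
    return math.floor(protect_seconds * sample_rate_hz / win_length)

def flag_step(win_shift, win_length):
    """Decrement step = window shift / window length."""
    return win_shift / win_length

flag = initial_flag(0.1, 44100, 256)            # 17
step = flag_step(64, 256)                       # 0.25

# Illustrative check: frames needed to count down to 0, and the time they span.
frames_to_zero = flag / step                    # 68 frames
protected_seconds = frames_to_zero * 64 / 44100
```

With a quarter-window shift, each frame advances the signal by a quarter of a window, so 68 decrements of 0.25 span about 0.0987 seconds, close to the requested 0.1 seconds.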

When the flag value has been decremented to a predetermined value (for example, 0), the noise power spectrum estimate starts to be updated according to the log-spectral a posteriori SNR and the weighting factor. The method of updating the noise power spectrum estimate is already disclosed in the existing WN algorithm and, for brevity, is not repeated here. Thus, while the flag value is being decremented, the noise power spectrum is not updated, so weak speech cannot be mistaken for noise, and weak speech is thereby protected.

Note that, while the flag value is being decremented, if the flag value has not yet reached the predetermined value but the log-spectral a posteriori SNR of the current frame rises from at or below the first threshold to above it, that is, the current frame has become strong speech before the intended weak-speech protection duration has elapsed, the decrementing stops, the flag value is reset, and the noise power spectrum is not updated. Continuing the example above with a flag value of 17 and a decrement step of 0.25, suppose the flag value has been decremented to 5 when the log-spectral a posteriori SNR of the current frame is detected to exceed the first threshold; the flag value is then reset to 17.

If the log-spectral a posteriori SNR of the current frame is less than or equal to the first threshold and no flag value has been set, that is, no strong speech frame has occurred before the weak speech frame, the noise power spectrum estimate is updated according to the log-spectral a posteriori SNR and the weighting factor. For example, the log-spectral a posteriori SNR of the first one or more frames of the noisy speech signal may be less than or equal to the first threshold, and the flag value has not been set because no strong speech has yet occurred; in this case the noise power spectrum is updated and the weak-speech protection mechanism described above is not used. Note that weak speech following strong speech is the common case, whereas weak speech frames that are not preceded by any strong speech frame are comparatively rare. Therefore, even if weak speech is distorted in this case because the protection mechanism is not activated, the effect on the user's overall reception and recognition of the speech is small.
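The three cases above (strong speech, weak speech after strong speech, weak speech with no flag set) can be summarised as a small state machine. The following Python sketch is only an illustration under stated assumptions: the class name `WeakSpeechGuard` is hypothetical, a single scalar SNR per frame is assumed, and the frame on which the flag reaches the predetermined value is still treated as protected. The noise update itself is left to the WN algorithm and is not shown.

```python
class WeakSpeechGuard:
    """Decides, frame by frame, whether the noise estimate may be updated."""

    def __init__(self, flag_init, step, threshold=1.07, floor=0.0):
        self.flag_init = flag_init    # e.g. 17 in the example above
        self.step = step              # e.g. 0.25 for a quarter-window shift
        self.threshold = threshold    # first threshold, here gamma2
        self.floor = floor            # predetermined value, here 0
        self.flag = None              # None means "no flag value set yet"

    def allow_update(self, frame_snr):
        if frame_snr > self.threshold:
            # Strong speech: set (or reset) the flag, keep the noise estimate.
            self.flag = self.flag_init
            return False
        if self.flag is None:
            # No strong speech has occurred so far: update normally.
            return True
        if self.flag > self.floor:
            # Weak speech following strong speech: count down, no update.
            self.flag -= self.step
            return False
        # Flag has reached the predetermined value: updating resumes.
        return True
```

A caller would invoke `allow_update` once per frame and apply the weighted noise update only when it returns `True`.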

After the noise power spectrum estimate has been obtained with the WN algorithm, the noise power spectrum is preferably further smoothed (for example with a third-order smoothing filter) to obtain the final noise power spectrum estimate.

In step S103, the gain function of the Wiener filter is generated from the noise power spectrum estimate, and the frequency-domain noisy speech signal is filtered according to this gain function to generate a frequency-domain denoised speech signal.

Specifically, in this embodiment, after the noise power spectrum estimate has been obtained, the decision-directed algorithm is used to estimate the a priori SNR. The decision-directed method computes the a priori SNR of the current frame as a first-order smoothing of the a priori SNR of the previous frame and the a posteriori SNR of the current frame. This algorithm is well known to those skilled in the art and is not described further here. Those skilled in the art will understand that the estimation of the a priori SNR is not limited to the decision-directed algorithm; other suitable algorithms, such as causal or non-causal algorithms, can also be used and are not listed here one by one.
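For readers unfamiliar with the decision-directed approach, the following Python sketch shows one commonly used textbook form of it. This is an assumption about the general technique, not the specific formulation used in this embodiment: the function name, the smoothing constant `alpha = 0.98` and the use of the linear (not log-spectral) a posteriori SNR are all illustrative choices.

```python
def decision_directed(prev_clean_power, noisy_power, noise_power, alpha=0.98):
    """Per-bin a priori SNR estimate in a common decision-directed form.

    prev_clean_power: estimated clean-speech power of the previous frame
    noisy_power:      power spectrum of the current noisy frame
    noise_power:      current noise power spectrum estimate
    alpha:            smoothing constant (assumed value, not from the source)
    """
    prio = []
    for s_prev, y, n in zip(prev_clean_power, noisy_power, noise_power):
        post = y / n  # linear a posteriori SNR of the current frame
        prio.append(alpha * s_prev / n + (1.0 - alpha) * max(post - 1.0, 0.0))
    return prio
```

The `max(..., 0.0)` term keeps the estimate non-negative when the noisy power falls below the noise estimate.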

From the above, the gain function of the Wiener filter can be expressed in terms of the a priori SNR as follows:

H (k, ω) = \sqrt{\frac{PRIO_SNR (k, ω)}{PRIO_SNR (k, ω) + 1}}

Therefore, once the a priori SNR has been obtained with the decision-directed algorithm, the gain function of the Wiener filter can be calculated accordingly.

After the gain function of the Wiener filter has been obtained, Wiener filtering is applied to the frequency-domain noisy speech signal to obtain the frequency-domain denoised speech signal.
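The gain function H (k, ω) and its application to one frame can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that filtering means multiplying each complex frequency bin of the noisy frame by its real-valued gain.

```python
import math

def wiener_gain(prio_snr):
    """H = sqrt(PRIO_SNR / (PRIO_SNR + 1)) for one frequency bin."""
    return math.sqrt(prio_snr / (prio_snr + 1.0))

def apply_gain(noisy_spectrum, prio_snrs):
    """Scale each complex bin of the noisy frame by its gain (assumed filtering step)."""
    return [wiener_gain(x) * y for y, x in zip(noisy_spectrum, prio_snrs)]

# Made-up frame with two bins: one speech-dominated, one noise-dominated.
denoised = apply_gain([2 + 0j, 1 + 1j], [3.0, 0.0])
```

The gain approaches 1 when the a priori SNR is large and falls to 0 when it is 0, so noise-dominated bins are suppressed while speech-dominated bins pass almost unchanged.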

In step S104, a time-domain denoised speech signal is generated from the frequency-domain denoised speech signal, and this time-domain denoised speech signal is further processed by the voice receiver.

Specifically, the frequency-domain denoised speech signal consists of multiple groups of data that follow one another in sequence, and the groups are processed one at a time. If the group to be processed is the first group of the frequency-domain denoised speech signal, the last frame of the first group is cached and one frame of zeros is prepended to the first group; the zero-padded first group is processed with the overlap-add (splice-and-add) method; and both the overlap-add result for the last frame of the first group and the position of the data in that last frame that have not been completely overlap-added are cached. If the group to be processed is the N-th group of the frequency-domain denoised speech signal, where N is greater than or equal to 2, the last frame of the N-th group is cached and the last frame of the (N-1)-th group is prepended to the N-th group; based on the position of the incompletely overlap-added data in the last frame of the (N-1)-th group, the N-th group extended in this way is processed with the overlap-add method; the result is added to the cached overlap-add result for the last frame of the (N-1)-th group; and both the overlap-add result for the last frame of the N-th group and the position of the incompletely overlap-added data in that last frame are cached.

The above steps are illustrated below with a concrete example. Suppose each group consists of 1000 data points (this is only an illustration; the length of each group obtained after Wiener filtering depends on the length of the voice input received by the voice receiver), the frame length is 256 (i.e. each frame contains 256 data points), and, when the inverse Fourier transform is performed, the frame shift is one quarter of the frame length (i.e. 64 data points).

For the first group of data after Wiener filtering, its last frame is cached first. No data have been output before the first group, so one frame of zeros, i.e. 256 zeros, is prepended to it. For clarity, these 1256 data points are numbered 1 to 1256. The inverse Fourier transform is then applied to these 1256 data points frame by frame and the results are superimposed. Data 1 to 1024 are superimposed 4 times, data 1025 to 1088 are superimposed 3 times, data 1089 to 1152 twice, data 1153 to 1216 once, and data 1217 to 1256 are not processed. In other words, data 1 to 1024 are completely overlap-added, while data 1025 to 1256 are not. Because the first group contains 1000 data points, 1000 data points must still be returned after overlap-add processing: the averages of the overlap-add results for data 1 to 1000 (each result divided by its number of superpositions) are returned, which gives the time-domain signal for data 1 to 1000. In addition, the overlap-add results for data 1001 to 1256 (the last frame) are cached, together with the position of the incompletely overlap-added data (the position of data point 1025).

Next, the second group of data after Wiener filtering is processed. The last frame of the second group is cached first, and the cached last frame of the first group is then prepended to the second group. Continuing the numbering above, these 1256 data points are numbered 1001 to 2256, where data 1001 to 1256 are the last frame of the first group and data 1257 to 2256 are the second group. Because the first 24 data points of the last frame of the first group (data 1001 to 1024) have already been fully processed (superimposed 4 times), processing proceeds frame by frame starting from the 25th data point (data point 1025), based on the cached position of the incompletely overlap-added data in the last frame of the first group. After processing, data 1025 to 1088 have been superimposed once, data 1089 to 1152 twice, data 1153 to 1216 3 times, data 1217 to 2048 4 times, data 2049 to 2112 3 times, data 2113 to 2176 twice, data 2177 to 2240 once, and data 2241 to 2256 are not processed. Adding the overlap-add results of this pass to the cached overlap-add results for data 1001 to 1256 brings data 1001 to 2048 to 4 superpositions each, i.e. they are completely overlap-added. The averages of the overlap-add results for data 1001 to 2000 (each result divided by its number of superpositions) are returned, which gives the time-domain signal for data 1001 to 2000. Likewise, the overlap-add results for the last frame of data 1001 to 2256 (data 2001 to 2256) are cached, together with the position of the incompletely overlap-added data in that last frame (the position of data point 2241), for use in processing the third group.

Subsequent groups are processed in the same way and, for brevity, are not described again. When the last group is processed, its last frame is discarded and the averages of the overlap-add results for the other data are returned; because one frame is very short, discarding the last frame has little effect on the user.
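The superposition counts at the end of each buffer in this example can be reproduced with a short Python sketch that only counts how many frames cover each data point. The function name `coverage_counts` is hypothetical, and the sketch does not perform the inverse Fourier transform; it illustrates the bookkeeping only. Labels in the text are 1-based, whereas the list indices below are 0-based.

```python
def coverage_counts(total, frame_len, hop, start=0):
    """Count, for each position, how many complete frames cover it."""
    counts = [0] * total
    pos = start
    while pos + frame_len <= total:
        for i in range(pos, pos + frame_len):
            counts[i] += 1
        pos += hop
    return counts

# First pass: 256 zeros + 1000 data points, labels 1..1256.
first = coverage_counts(1256, 256, 64)
# Counts at labels 1024, 1025, 1089, 1153 and 1217.
tail = [first[1023], first[1024], first[1088], first[1152], first[1216]]

# Second pass: labels 1001..2256, processing resumes at label 1025 (index 24).
second = coverage_counts(1256, 256, 64, start=24)
cached = first[1000:1256]                 # counts cached for labels 1001..1256
combined = [c + s for c, s in zip(cached, second[:256])] + second[256:]
```

After adding the cached counts, every label from 1001 to 2048 reaches a count of 4, which matches the statement that these data are completely overlap-added.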

The time-domain signal obtained by the overlap-add method is a digital signal, so it also needs to be converted into an analog signal by digital-to-analog conversion to obtain the final time-domain denoised speech signal. This completes the whole process of extracting a clean speech signal from the noisy speech signal.

Finally, the intelligent mobile terminal can further process this clean speech signal according to the user's needs, where the further processing includes transmission, playback and/or storage.

Preferably, the time-domain denoised speech signal obtained in step S104 is band-pass filtered to further remove low-frequency and high-frequency noise, thereby improving the denoising effect.

Refer to Fig. 2 (a) and Fig. 2 (b). Fig. 2 (a) shows the time-domain noisy speech signal received by the voice receiver, and Fig. 2 (b) shows the speech signal after processing by the real-time voice denoising method provided by the present invention. A comparison of Fig. 2 (a) and Fig. 2 (b) shows that the method provided by the present invention effectively removes noise while protecting weak speech.

Correspondingly, the present invention also provides a device for real-time voice denoising. Refer to Fig. 3, which is a schematic structural diagram of the real-time voice denoising device according to the present invention. As shown in the figure, the device 20 comprises:

a time-frequency conversion module 201, for generating a frequency-domain noisy speech signal from the voice input received by the voice receiver;

an SNR calculation module 202, for calculating the log-spectral a posteriori SNR from the frequency-domain noisy speech signal, the log-spectral a posteriori SNR being the ratio between the logarithm of the power spectrum of the current frame of the frequency-domain noisy speech signal and the logarithm of the noise power estimate of the previous frame;

an estimation module 203, for obtaining a noise power spectrum estimate from the log-spectral a posteriori SNR using the weighted noise estimation algorithm;

a Wiener filtering module 204, for generating the gain function of the Wiener filter from the noise power spectrum estimate and filtering the frequency-domain noisy speech signal according to this gain function to generate a frequency-domain denoised speech signal;

a frequency-time conversion module 205, for generating a time-domain denoised speech signal from the frequency-domain denoised speech signal.

The specific working process of the above modules is described below.

Specifically, this embodiment is described taking an intelligent mobile terminal as an example of the voice receiver. The speech signal received by an intelligent mobile terminal is generally a time-domain noisy speech signal formed after interference by environmental noise. After the intelligent mobile terminal receives the time-domain noisy speech signal, the time-frequency conversion module 201 first samples it in order to convert the analog signal into a digital signal. The time-frequency conversion module 201 then divides the sampled time-domain noisy speech signal into frames. In this embodiment, framing is performed by applying a Hamming window: a Hamming window of fixed length is applied to the sampled data to generate one frame of the time-domain noisy speech signal, and the window is then shifted by a certain length to generate the next frame. Here the window length of the Hamming window is defined as the number of samples contained in one frame of the time-domain noisy speech signal. The reason for framing the sampled signal is that the Wiener filtering algorithm is based on stationary random processes, whereas the time-domain noisy speech signal as a whole is a non-stationary process; within a short time span (generally taken to be 10 to 30 ms), however, a speech signal can be regarded as a stationary process. Each frame obtained by framing is treated as stationary, so the Wiener filtering algorithm can be used to denoise each frame. After framing, the time-frequency conversion module 201 converts the Hamming-windowed time-domain noisy speech signal frames into frequency-domain noisy speech signal frames by the short-time Fourier transform and performs endpoint detection on the framed signal.
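The framing step performed by the time-frequency conversion module 201 can be sketched in Python as follows. The function names, the default window length of 256 and the default shift of 64 are illustrative assumptions taken from the numerical examples in the method description; the Hamming window uses the usual 0.54 and 0.46 coefficients. The short-time Fourier transform and endpoint detection are not shown.

```python
import math

def hamming(n):
    """Hamming window of length n (standard 0.54 / 0.46 coefficients)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def split_frames(samples, win_length=256, hop=64):
    """Cut the sampled signal into overlapping, Hamming-windowed frames."""
    window = hamming(win_length)
    frames = []
    for start in range(0, len(samples) - win_length + 1, hop):
        chunk = samples[start:start + win_length]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

# Made-up input: 1000 samples of constant amplitude.
frames = split_frames([1.0] * 1000)
```

Each returned frame has exactly `win_length` samples; trailing samples that do not fill a complete window are left out in this sketch.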

Then, the SNR calculation module 202 calculates the log-spectral a posteriori SNR. In this embodiment, the log-spectral a posteriori SNR is defined as the ratio between the logarithm of the power spectrum of the current frame of the frequency-domain noisy speech signal and the logarithm of the noise power estimate of the previous frame. Preferably, the SNR calculation module 202 uses the natural logarithm (base e) when taking the logarithm of the power spectrum of the current frame and of the noise power estimate of the previous frame.

In a preferred embodiment, the device 20 provided by the present invention further comprises a noise generation module (not shown) for generating white Gaussian noise, where white Gaussian noise refers to noise whose power spectral density is uniformly distributed over the entire frequency domain. When calculating the log-spectral a posteriori SNR, the SNR calculation module 202 uses the power of this white Gaussian noise as the initial noise power estimate of the frequency-domain noisy speech signal.

Then, the estimation module 203 obtains the noise power spectrum estimate from the log-spectral a posteriori SNR using the weighted noise estimation (WN, Weighted Noise Estimation) algorithm. Optionally, the estimation module 203 comprises a weighting factor calculation unit 2031 and a noise power spectrum estimation unit 2032. The weighting factor calculation unit 2031 calculates the weighting factor from the log-spectral a posteriori SNR by means of the weighting factor function. For the weighting factor function, refer to the one given in method step S102 above; it is not repeated here.

Then, the noise power spectrum estimation unit 2032 sets a flag value, which is used to distinguish strong speech frames from weak speech frames, and obtains the noise power spectrum estimate from the log-spectral a posteriori SNR, the weighting factor and the flag value.

The estimation module 203 further comprises a judging unit (not shown). If the log-spectral a posteriori SNR of the current frame is greater than the first threshold, the judging unit judges the current frame to be strong speech and triggers the noise power spectrum estimation unit 2032 to set the flag value and keep the noise power spectrum estimate unchanged. The flag value is set as follows:

flag = [\frac{time \times fre}{win_length}]

Here, flag is the flag value; time is the duration, in seconds, for which weak speech is to be protected; fre is the sampling frequency of the time-domain noisy speech signal; win_length is the window length of the Hamming window; and [x] denotes the largest integer not exceeding x.

If the log-spectral a posteriori SNR of the current frame is less than or equal to the first threshold and the flag value has been set, the judging unit judges the current frame to be weak speech following strong speech and triggers the noise power spectrum estimation unit 2032 to decrement the flag value step by step to the predetermined value and then update the noise power spectrum estimate according to the log-spectral a posteriori SNR and the weighting factor. The step of each decrement of the flag value is:

Δflag = \frac{Δwin_length}{win_length}

Here, Δflag is the decrement step of the flag value, win_length is the window length of the Hamming window, and Δwin_length is the window shift length of the Hamming window.

If the log-spectral a posteriori SNR of the current frame is less than or equal to the first threshold and the flag value has not been set, the noise power spectrum estimate is updated according to the log-spectral a posteriori SNR and the weighting factor.

Thus, while the flag value is being decremented, weak speech cannot be mistaken for noise, and weak speech is thereby protected.

In a preferred embodiment, the device 20 further comprises a smoothing filter module (not shown) for further smoothing the noise power after the estimation module 203 has obtained the noise power estimate, so as to obtain the final noise power estimate.

After the noise power estimate has been obtained, the Wiener filtering module 204 estimates the a priori SNR, for example with the decision-directed algorithm. The Wiener filtering module 204 then calculates the gain function of the Wiener filter from the a priori SNR and filters the frequency-domain noisy speech signal according to this gain function to generate the frequency-domain denoised speech signal.

The frequency-domain denoised speech signal obtained after the Wiener filtering module 204 filters the frequency-domain noisy speech signal consists of multiple groups of data that follow one another in sequence, and the frequency-time conversion module 205 processes the groups one at a time. If the group to be processed is the first group of the frequency-domain denoised speech signal, the frequency-time conversion module 205 caches the last frame of the first group and prepends one frame of zeros to the first group; processes the zero-padded first group with the overlap-add (splice-and-add) method; and caches both the overlap-add result for the last frame of the first group and the position of the data in that last frame that have not been completely overlap-added. If the group to be processed is the N-th group of the frequency-domain denoised speech signal, where N is greater than or equal to 2, the frequency-time conversion module 205 caches the last frame of the N-th group and prepends the last frame of the (N-1)-th group to the N-th group; based on the position of the incompletely overlap-added data in the last frame of the (N-1)-th group, processes the N-th group extended in this way with the overlap-add method; adds the result to the cached overlap-add result for the last frame of the (N-1)-th group; and caches both the overlap-add result for the last frame of the N-th group and the position of the incompletely overlap-added data in that last frame.

The time-domain signal that the frequency-time conversion module 205 obtains with the overlap-add method is a digital signal, so the frequency-time conversion module 205 also needs to convert this digital signal into an analog signal to obtain the final time-domain denoised speech signal. The device 20 provided by the present invention has thus completed the whole process of extracting a clean speech signal from the noisy speech signal.

Preferably, the device 20 provided by the present invention further comprises a band-pass filter module (not shown) for band-pass filtering the time-domain denoised speech signal to further remove low-frequency and high-frequency noise, thereby improving the denoising effect and obtaining the final clean speech signal.

Preferably, the device 20 provided by the present invention further comprises a processing module (not shown) with which the intelligent mobile terminal further processes the time-domain denoised speech signal, where the further processing includes transmission, playback and/or storage.

Compared with the prior art, the method and device for real-time voice denoising provided by the present invention have the following advantages:

Refer to Fig. 4, which is a schematic structural diagram of an intelligent mobile terminal (i.e. device 20) for implementing the method and device for real-time voice denoising provided by the present invention. The internal components, software and protocol structure of a typical intelligent mobile terminal are described with reference to Fig. 4.

The intelligent mobile terminal has a processor 510, which is responsible for the overall operation of the mobile terminal and can be implemented with any commercially available central processing unit, digital signal processor or any other electronic programmable logic device. The processor 510 has an associated memory 520, which includes but is not limited to RAM, ROM, EEPROM, flash memory, or a combination thereof. The memory 520 is used for various purposes under the control of the processor 510, one of which is storing program instructions and data for the various software in the intelligent mobile terminal.

The software layer of the intelligent mobile terminal comprises a real-time operating system 540, drivers for the man-machine interface 560, an application handler 550 and various applications. The applications are, for example, a text editor 551, a handwriting recognition application 552 and various other multimedia applications 553. These other multimedia applications typically include a voice call application, a video call application, an application for sending and receiving Short Message Service (SMS) messages, a Multimedia Messaging Service (MMS) application or e-mail application, a web browser, an instant messaging application, a phone book application, a calendar application, a control panel application, a camera application, one or more video games, a notepad application and so on. Note that two or more of the above applications may be executed as the same application.

The intelligent mobile terminal further comprises one or more hardware controllers which, together with the drivers for the man-machine interface 560, cooperate with the display device 561, the physical keys 562, the microphone 563 and various other I/O devices (such as a loudspeaker, vibrator, ringtone generator, LED indicator and so on) to realise the man-machine interaction of the intelligent mobile terminal. Those skilled in the art will understand that the user can operate the mobile terminal through the man-machine interface 560 formed in this way.

The software layer of the intelligent mobile terminal may also comprise communication-related logic such as various modules, protocol stacks and drivers, summarised as the communication interface 570 shown in Fig. 4, which provides communication services (for example transport, network and connectivity) for the radio interface 571 and optionally for the Bluetooth interface 572 and/or the infrared interface 573, so as to realise the network connectivity of the intelligent mobile terminal. The radio interface 571 comprises an internal or external antenna and suitable radio circuitry for establishing and maintaining a wireless link to a base station. As is known to those skilled in the art, the radio circuitry comprises a series of analog and digital electronic components that together form a radio receiver and transmitter. These components include, for example, band-pass filters, amplifiers, mixers, local oscillators, low-pass filters, AD/DA converters and so on.

The mobile communication terminal may also comprise a card reader device 530, which generally includes a processor, a data memory and so on, for reading the information of the SIM card and, on that basis, cooperating with the radio interface 571 to access the network provided by the operator.

The method of real-time voice denoising provided by the invention can realize by programmable logic device (PLD), also may be embodied as computer software, can be for example a kind of computer program according to embodiments of the invention, move this program product computing machine is carried out for demonstrated method.Described computer program comprises computer-readable recording medium, comprises computer program logic or code section on this medium, for realizing each step of said method.Described computer-readable recording medium can be the removable medium (for example hot-plugging technology memory device) that is installed in the built-in medium in computing machine or can dismantles from basic computer.Described built-in medium includes but not limited to rewritable nonvolatile memory, for example RAM, ROM, flash memory and hard disk.Described removable medium includes but not limited to: optical storage media (for example CD-ROM and DVD), magneto-optic storage media (for example MO), magnetic recording medium (for example tape or portable hard drive), have the media (for example storage card) of built-in rewritable nonvolatile memory and have the media (for example ROM box) of built-in ROM.

Those skilled in the art will appreciate that any computer system having suitable programming means is capable of carrying out all steps of the method of the invention embodied in a program product. Although most embodiments described in this specification focus on software programs, alternative embodiments that implement the method provided by the invention as firmware or hardware are likewise within the scope of protection claimed by the invention.

It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be included in the invention. No reference numeral in a claim should be construed as limiting the claim concerned. Furthermore, the word "comprising" obviously does not exclude other parts, units or steps, and the singular does not exclude the plural. Multiple parts, units or devices recited in a system claim may also be realized by a single part, unit or device, in software or hardware.

What is disclosed above is merely some preferred embodiments of the invention, which of course cannot be used to limit the scope of rights of the invention; equivalent variations made according to the claims of the invention therefore still fall within the scope covered by the invention.

Claims

1. A method for real-time voice denoising, the method comprising:

2. The method according to claim 1, wherein the logarithm is the natural logarithm with base e.

3. method according to claim 1 and 2, wherein, the described logarithmic spectrum posteriori SNR of described calculating comprises:

Adopt the performance number of white Gaussian noise as the initial noise power estimation value of described frequency domain Noisy Speech Signal.
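By way of non-limiting illustration, the initialization described in claim 3 might be sketched as follows in Python with NumPy. The function names, the frame length of 512 samples, the noise standard deviation `sigma` and the small constant `eps` are assumptions introduced only for this sketch and are not taken from the specification. The ratio follows the definition given in the abstract: the natural logarithm of the current-frame power spectrum divided by the natural logarithm of the previous-frame noise power estimate.

```python
import numpy as np

def initial_noise_power(frame_len=512, sigma=100.0, seed=0):
    """Power spectrum of one frame of synthetic white Gaussian noise.

    frame_len, sigma and seed are illustrative assumptions; the claim only
    states that the power of white Gaussian noise is the initial estimate.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, frame_len)
    return np.abs(np.fft.rfft(noise)) ** 2

def log_spectral_posterior_snr(frame_spectrum, prev_noise_power, eps=1e-12):
    """Per-bin ratio ln(|Y|^2) / ln(previous noise power estimate).

    eps (an assumption) guards against log(0). The ratio is only meaningful
    when both powers are well above 1, e.g. for integer-scaled samples.
    """
    frame_power = np.abs(frame_spectrum) ** 2
    return np.log(frame_power + eps) / np.log(prev_noise_power + eps)
```

For the first frame, `prev_noise_power` would be the output of `initial_noise_power`, so the start frame of the noisy voice signal is not itself used as the initial noise estimate.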

4. method according to claim 1 and 2, wherein, describedly obtains noise power spectrum estimated value based on weighted noise algorithm for estimating according to described logarithmic spectrum posteriori SNR and comprises:

Calculate weighting factor;

Set mark value, this mark value is used for distinguishing strong speech frame and weak speech frame, and obtains described noise power spectrum estimated value according to described logarithmic spectrum posteriori SNR, described weighting factor and described mark value.

5. The method according to claim 4, wherein setting the mark value and obtaining the noise power spectrum estimate according to the log-spectral a posteriori signal-to-noise ratio, the weighting factor and the mark value comprises:

if the value of the log-spectral a posteriori signal-to-noise ratio of the current frame signal is greater than a first threshold, judging that the current frame signal is strong speech, setting the mark value, and keeping the noise power spectrum estimate unchanged;

if the value of the log-spectral a posteriori signal-to-noise ratio of the current frame signal is less than or equal to the first threshold and the mark value is set, judging that the current frame signal is weak speech following strong speech, progressively decrementing the mark value to a predetermined value, and updating the noise power spectrum estimate according to the log-spectral a posteriori signal-to-noise ratio and the weighting factor;

if the value of the log-spectral a posteriori signal-to-noise ratio of the current frame signal is less than or equal to the first threshold and the mark value is not set, updating the noise power spectrum estimate according to the log-spectral a posteriori signal-to-noise ratio and the weighting factor.
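The three branches of claim 5 might, purely as an illustration, be organized as in the Python sketch below. Several points are assumptions made for the sketch and are not stated in this claim: the frame-level signal-to-noise ratio is taken as a single scalar, the weighting factor is supplied by the caller, the update is a simple recursive average, and the names `NoiseTracker`, `flag_start` and `flag_floor` as well as all numeric defaults are hypothetical.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class NoiseTracker:
    noise_power: np.ndarray   # current noise power spectrum estimate
    threshold: float = 1.2    # hypothetical "first threshold"
    flag_start: int = 5       # hypothetical value the mark is set to
    flag_floor: int = 0       # hypothetical "predetermined value"
    flag: int = 0             # mark value; above flag_floor means "set"

    def update(self, frame_power, snr, weight):
        """Classify one frame and update the noise estimate if allowed."""
        if snr > self.threshold:
            # Strong speech: set the mark, keep the noise estimate unchanged.
            self.flag = self.flag_start
            return "strong"
        if self.flag > self.flag_floor:
            # Weak speech following strong speech: count the mark down.
            self.flag -= 1
            label = "weak_after_strong"
        else:
            label = "mark_not_set"
        # Assumed recursive-average update; the weighting factor in the
        # specification is derived from the signal-to-noise ratio.
        self.noise_power = weight * self.noise_power + (1.0 - weight) * frame_power
        return label
```

Under these assumptions, a run of low-SNR frames after a strong frame is labelled as weak speech for `flag_start` frames before the mark is considered cleared.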

6. The method according to claim 5, wherein:

the mark value is defined as:

7. method according to claim 1 and 2, wherein, generates time domain denoising voice signal according to described frequency domain denoising voice signal and comprises:

Described frequency domain denoising voice signal is made up of the multi-group data sequentially joining, successively each group data processed, wherein:

If one group of pending data are first group of data of described frequency domain denoising voice signal, the last frame data of first group of data described in buffer memory, and before a frame remainder certificate is replenished to described first group of data, utilize splicing adding method to supplemented a described frame remainder according to after described first group of data process, and the positional information of the data of not splicing completely in the last frame data of first group of data described in splice result and the buffer memory of the last frame data of described first group of data after caching process;

If one group of pending data are the N group data of described frequency domain denoising voice signal, wherein N is more than or equal to 2, the last frame data of N group data described in buffer memory, and the last frame data filling of N-1 being organized to data is before described N group data, the positional information of the data of not splicing completely in the last frame data based on described N-1 group data, utilize splicing adding method to process the described N group data of the last frame data of having supplemented described N-1 group data, the result of splicing of the last frame data of splice result and described N-1 group data after treatment is superposeed, and the result of splicing of the last frame data of described N group data after caching process, and the positional information of the data of not splicing completely in the last frame data of N group data described in buffer memory.
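The group-by-group buffering of claim 7 can be illustrated, in simplified form, by a streaming overlap-add that carries an unfinished tail from one group to the next. The Python sketch below is not the literal procedure of the claim: instead of prepending the previous group's last frame, it caches only the samples that have not yet received all of their contributions. The class name `StreamingOverlapAdd`, the parameters `frame_len` and `hop`, and the absence of a synthesis window are assumptions of the sketch.

```python
import numpy as np

class StreamingOverlapAdd:
    def __init__(self, frame_len, hop):
        self.frame_len = frame_len
        self.hop = hop
        # Samples not yet completely overlapped. For the first group this is
        # all zeros, which plays the role of the prepended frame of zero data.
        self.tail = np.zeros(frame_len - hop)

    def process_group(self, spectra):
        """Convert one group of rfft spectra, shape (n_frames, frame_len//2 + 1),
        into n_frames * hop finished time-domain samples."""
        frames = np.fft.irfft(spectra, n=self.frame_len, axis=1)
        n_frames = frames.shape[0]
        out = np.zeros(n_frames * self.hop + self.frame_len - self.hop)
        # Superimpose the cached overlap result of the previous group.
        out[: self.tail.size] += self.tail
        for i, frame in enumerate(frames):
            start = i * self.hop
            out[start : start + self.frame_len] += frame
        done = n_frames * self.hop
        # Cache the part of the last frame that still awaits the next group.
        self.tail = out[done:].copy()
        return out[:done]
```

With this carry-over, splitting a signal into groups at any frame boundary yields the same samples as processing all frames at once, which is the property the caching in the claim is meant to preserve.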

8. method according to claim 1 and 2, generates time domain denoising voice signal according to described frequency domain denoising voice signal and also comprises afterwards:

Described time domain denoising voice signal is with to logical rectification.

9. method according to claim 1 and 2, wherein, described pronunciation receiver comprises transmission, plays and/or stores the further processing of described time domain denoising voice signal.

10. method according to claim 1 and 2, wherein, described pronunciation receiver is intelligent mobile terminal.

11. A device for real-time voice denoising, the device comprising:

12. The device according to claim 11, wherein the logarithm is the natural logarithm with base e.

13. The device according to claim 11 or 12, wherein:

the device further comprises a noise generation module for generating white Gaussian noise;

the signal-to-noise ratio calculation module, when calculating the log-spectral a posteriori signal-to-noise ratio, uses the power value of the white Gaussian noise as the initial noise power estimate of the frequency-domain noisy voice signal.

14. The device according to claim 11 or 12, wherein the estimation module comprises:

a weighting factor calculation unit for calculating a weighting factor;

a noise power spectrum estimation unit for setting a mark value, the mark value being used to distinguish strong speech frames from weak speech frames, and for obtaining the noise power spectrum estimate according to the log-spectral a posteriori signal-to-noise ratio, the weighting factor and the mark value.

15. The device according to claim 14, wherein the estimation module further comprises:

a judging unit for, when the value of the log-spectral a posteriori signal-to-noise ratio of the current frame signal is greater than a first threshold, judging that the current frame signal is strong speech and triggering the noise power spectrum estimation unit to set the mark value and keep the noise power spectrum estimate unchanged;

and for, when the value of the log-spectral a posteriori signal-to-noise ratio of the current frame signal is less than or equal to the first threshold and the mark value is set, judging that the current frame signal is weak speech following strong speech and triggering the noise power spectrum estimation unit to progressively decrement the mark value to a predetermined value and to update the noise power spectrum estimate according to the log-spectral a posteriori signal-to-noise ratio and the weighting factor;

and for, when the value of the log-spectral a posteriori signal-to-noise ratio of the current frame signal is less than or equal to the first threshold and the mark value is not set, triggering an update of the noise power spectrum estimate according to the log-spectral a posteriori signal-to-noise ratio and the weighting factor.

16. The device according to claim 15, wherein:

the mark value is defined as:

17. The device according to claim 11 or 12, wherein:

if the group of data to be processed is the first group of data of the frequency-domain denoised voice signal, the frequency-to-time conversion module caches the last frame of data of the first group and prepends one frame of zero data to the first group of data; processes the first group of data with the prepended frame of zero data using the overlap-add method; and caches the overlap result of the last frame of data of the processed first group, together with position information of the data in the last frame of the first group that has not been completely overlapped;

if the group of data to be processed is the N-th group of data of the frequency-domain denoised voice signal, where N is greater than or equal to 2, the frequency-to-time conversion module caches the last frame of data of the N-th group and prepends the last frame of data of the (N-1)-th group to the N-th group of data; based on the position information of the data in the last frame of the (N-1)-th group that has not been completely overlapped, processes the N-th group of data with the prepended last frame of the (N-1)-th group using the overlap-add method; superimposes the processed overlap result on the overlap result of the last frame of data of the (N-1)-th group; and caches the overlap result of the last frame of data of the processed N-th group, together with position information of the data in the last frame of the N-th group that has not been completely overlapped.

18. The device according to claim 11 or 12, the device further comprising:

a band-pass filtering module for performing band-pass correction on the time-domain denoised voice signal.

19. The device according to claim 11 or 12, the device further comprising:

a processing module by which the voice receiver further processes the time-domain denoised voice signal, wherein the further processing comprises transmitting, playing and/or storing.

20. The device according to claim 11 or 12, wherein the voice receiver is an intelligent mobile terminal.