CN101599274A

CN101599274A - Speech enhancement method

Info

Publication number: CN101599274A
Application number: CNA2009101084022A
Authority: CN
Inventors: 叶利剑
Original assignee: CHANGZHOU MEIOU ELECTRONICS Co Ltd; AAC Acoustic Technologies Shenzhen Co Ltd
Current assignee: Changzhou Meiou Electronics Co., Ltd.; AAC Technologies Pte Ltd
Priority date: 2009-06-26
Filing date: 2009-06-26
Publication date: 2009-12-09
Anticipated expiration: 2029-06-26
Also published as: CN101599274B

Abstract

The invention provides a speech enhancement method comprising: framing a noisy speech signal, applying pre-emphasis, and transforming it to the frequency domain; dividing it into several frequency bands, calculating the signal energy of each band, and updating the noise energy estimate; calculating the a posteriori signal-to-noise ratio (SNR) of the current frame and the a priori SNR estimate; calculating the attenuation factor of each band and applying it to obtain a speech signal with enhanced SNR; and transforming the processed speech signal back to the time domain for output.

Description

Speech enhancement method

[Technical Field]

The present invention relates to a speech enhancement method.

[Background]

Because of the large amount of ambient noise, the speech signal collected by a voice acquisition system such as a microphone generally has an insufficient signal-to-noise ratio (SNR). To collect a speech signal with a high SNR, a directional microphone is usually used to pick up speech within a certain range, or a speech enhancement method is used to raise the SNR of the speech signal.

Existing speech enhancement algorithms require a relatively large amount of computation and storage and therefore place high demands on hardware. The silicon area needed to fabricate a dedicated chip is also large, which raises its cost, and the noise reduction achieved is still not ideal.

It is therefore necessary to develop a new speech enhancement method that achieves good noise reduction.

[Summary of the Invention]

The technical problem to be solved by the present invention is to provide a speech enhancement method with an excellent noise reduction effect.

To solve this technical problem, a speech enhancement method is designed, comprising the following steps:

(1) The noisy speech signal collected by the voice collection device is framed and pre-emphasized by a processor, and then transformed to the frequency domain by a short-time Fourier transform;

(2) The noisy speech signal in the frequency domain is divided into several frequency bands, and the energy of each band is calculated and smoothed to obtain the smoothed signal energy in each band. The signal energy comprises speech energy and noise energy. A noise energy estimate is obtained and is continually updated by time-recursive averaging;

(3) From the signal energy and the noise energy estimate, the a posteriori SNR of the current frame is calculated for each band, and the a priori SNR estimate of the current frame is obtained from the a priori SNR estimate of the previous frame;

(4) The attenuation gain factor of each band is calculated from the a priori SNR estimate;

(5) The signal spectrum of each band is processed with the attenuation gain factor obtained;

(6) The processed frequency-domain signal is transformed to the time domain, de-emphasized, and overlap-added to form the final output signal.

Preferably, the processing in step (5) multiplies the noisy speech signal of the current frame by the attenuation gain factor of the band.

Preferably, step (6) comprises:

(61) transforming the frequency-domain signal to the time domain by an inverse fast Fourier transform to obtain the enhanced time-domain speech signal;

(62) de-emphasizing the signal with a low-pass filter;

(63) adding the overlapping parts of adjacent frames to obtain the final output signal.

Compared with the related art, the speech enhancement method of the present invention realizes a real-time speech enhancement system whose output is a directly noise-reduced speech signal. It greatly improves noise attenuation while preserving speech intelligibility, and it is particularly effective against stationary additive noise such as automobile noise and street noise.

[Description of the Drawings]

Fig. 1 is a flow chart of the speech enhancement method of the present invention.

[Detailed Description]

The invention is further described below with reference to the drawings and embodiments.

The speech enhancement algorithm of the present invention can be integrated into a dedicated chip processor, which exchanges data with a corresponding voice collection device through its interface to form a real-time speech enhancement system. The speech signal is collected by the voice collection device and processed directly by the speech enhancement algorithm in the chip processor to obtain an SNR-enhanced signal, which is output for subsequent use. In this embodiment the voice collection device may be a microphone; the analog signal collected by the microphone must also be converted to a digital signal for processing by the chip processor.

The speech enhancement method of the present invention comprises the following steps:

(5) The signal spectrum of each band is processed with the attenuation gain factor obtained: the noisy speech signal of the current frame is multiplied by the attenuation gain factor of the band;

(6) The processed frequency-domain signal is transformed to the time domain, de-emphasized, and overlap-added to form the final output signal. The specific steps are: (61) the frequency-domain signal is transformed to the time domain by an inverse fast Fourier transform to obtain the enhanced time-domain speech signal; (62) the signal is de-emphasized with a low-pass filter; (63) the overlapping parts of adjacent frames are added to obtain the final output signal.

A specific embodiment is described below. The noisy speech signal input to the speech enhancement system has a sampling rate of 8 kHz and a precision of 16 bits.

First, the noisy speech signal is framed in the time domain, that is, divided frame by frame into noisy speech signal units. Each unit consists of sampling points. A sampling frequency of 8 kHz is chosen in the present invention. For short-time spectral analysis the frame length is generally set between 10 and 35 ms; this embodiment uses 32 ms frames, so one frame contains 256 sampling points (0.032 s × 8000 Hz = 256). The frame length of every frame in the present invention is therefore 256.

To prevent blocking effects between adjacent frames, adjacent frames are made to overlap during framing: D samples of the current frame are taken from the previous frame. The overlap is described as follows:

s(n) = d(m, D+n),  0 ≤ n < L

where s denotes the input noisy speech signal, and

d(m, n) = d(m-1, L+n),  0 ≤ n < D

where d denotes the 256 sample points of the current frame. Because every frame has length 256 and the overlap ratio is 75%, the number of overlapping samples is D = 192, and the first samples of adjacent frames are L = 256 - 192 = 64 samples apart.

Adjacent frames may overlap by 50% to 75% in the present invention. This embodiment uses a 75% overlap, i.e., the first 75% (192 points) of the current frame are identical to the last 75% (192 points) of the previous frame.
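The framing scheme described above can be sketched in Python as follows. This is a minimal illustration, assuming the input is a plain list of samples and that a trailing partial frame is simply dropped; the function name `split_frames` is hypothetical and does not come from the source.

```python
FRAME_LEN = 256            # M: samples per frame (32 ms at 8 kHz)
HOP = 64                   # L: offset between the starts of adjacent frames
OVERLAP = FRAME_LEN - HOP  # D = 192 samples shared with the previous frame


def split_frames(signal, frame_len=FRAME_LEN, hop=HOP):
    """Split a sample sequence into overlapping frames.

    Assumption: samples that do not fill a whole frame at the end are dropped.
    """
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(list(signal[start:start + frame_len]))
        start += hop
    return frames
```

With these defaults, the first 192 samples of each frame repeat the last 192 samples of the frame before it, which matches d(m, n) = d(m-1, L+n) for 0 ≤ n < D.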

Next, the framed noisy speech signal is passed through a high-pass filter as pre-emphasis. Because the background noise in a noisy speech signal generally has more energy at low frequencies, the high-pass filter attenuates the low-frequency components and improves the noise reduction. The filter has the form:

H(z) = 1 - αz^{-1}

α generally lies between 0.75 and 0.95; α = 0.9 is used here and gives good results.
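In the time domain, H(z) = 1 - αz^{-1} corresponds to the difference equation y[n] = x[n] - α·x[n-1]. A minimal Python sketch follows; the function name `pre_emphasis` and the `prev_sample` argument (the filter state carried in from earlier samples) are assumptions made for illustration, since the source does not specify the initial state.

```python
def pre_emphasis(frame, alpha=0.9, prev_sample=0.0):
    """Apply y[n] = x[n] - alpha * x[n-1], i.e. H(z) = 1 - alpha * z^-1."""
    out = []
    prev = prev_sample  # assumed initial state, not given in the source
    for x in frame:
        out.append(x - alpha * prev)
        prev = x
    return out
```

A constant (zero-frequency) input is scaled by 1 - α = 0.1 after the first sample, which illustrates the attenuation of low-frequency content.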

Because a speech signal is short-time stationary, it can be processed frame by frame. Framing, however, introduces discontinuities at the frame boundaries, which cause spectral leakage. A short-time Fourier transform (STFT) is therefore applied to the framed speech signal. The STFT can be understood as windowing the frame and then taking its Fourier transform. The purpose of the window function is to reduce the spectral leakage caused by the discontinuities at the frame boundaries. Here a Hamming window whose length equals the frame length of 256 is used, which effectively reduces the oscillations of the Gibbs effect.

The Hamming window function is defined as:

win(n) = \begin{cases} 0.54 - 0.46 \cos(2 \pi n / M) & 0 \leq n \leq M - 1 \\ 0 & \text{otherwise} \end{cases}

The short-time Fourier transform is:

X(m, k_1) = \frac{2}{M} \sum_{n=0}^{M-1} win(n - m) \times x(m) e^{-j 2 \pi k_1 n / M}

0 ≤ k_1 ≤ M-1

where M = 256 is the length of the short-time Fourier transform and m denotes the m-th frame.
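The windowing and transform step can be sketched as below. This is an illustration under stated assumptions: the window is applied sample by sample to the current frame before the transform, as the text describes, and a direct O(M²) DFT is used for readability where a real implementation would use an FFT. The function names `hamming` and `stft_frame` are hypothetical.

```python
import cmath
import math


def hamming(M):
    """win(n) = 0.54 - 0.46*cos(2*pi*n/M) for 0 <= n <= M-1, as defined in the text."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / M) for n in range(M)]


def stft_frame(frame):
    """Window one frame, then take a direct DFT scaled by 2/M."""
    M = len(frame)
    windowed = [w * x for w, x in zip(hamming(M), frame)]
    spectrum = []
    for k in range(M):
        acc = sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / M) for n in range(M))
        spectrum.append((2.0 / M) * acc)
    return spectrum
```

For a 256-sample frame, `stft_frame(frame)` returns 256 complex coefficients, one per frequency index.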

The noisy speech signal s of the current frame has thus been transformed from the time domain to the frequency domain.

The noisy speech signal in the frequency domain contains both speech and noise. For each frame it is divided into several frequency bands, and different processing strategies are then applied to different bands.

The noisy speech signal below 4 kHz is then divided into frequency bands, and all subsequent processing is carried out per band. This both reduces computational complexity and allows different bands to be treated differently, which gives better results.

The signal is divided into 23 frequency bands in total in the present invention; see Table 1.

Table 1: Division into 23 frequency bands

Band number	Start frequency (Hz)	Cutoff frequency (Hz)
1	62.5	93.75
2	125	156.25
3	187.5	218.75
4	250	281.25
5	312.5	343.75
6	375	406.25
7	437.5	468.75
8	500	531.25
9	562.5	593.75
10	625	656.25
11	687.5	718.75
12	750	781.25
13	812.5	906.25
14	937.5	1062.5
15	1093.75	1250
16	1281.25	1468.75
17	1500	1718.75
18	1750	2000
19	2031.25	2312.5
20	2343.75	2687.5
21	2718.75	3125
22	3156.25	3687.5
23	3718.75	3968.75
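With an 8 kHz sampling rate and a 256-point transform, adjacent frequency bins are 8000 / 256 = 31.25 Hz apart, and every band edge in Table 1 is a multiple of that spacing. The sketch below converts the table to bin indices. It assumes that the start and cutoff frequencies are the centers of the first and last bin of each band (inclusive); the names `BANDS_HZ` and `band_bins` are hypothetical.

```python
SAMPLE_RATE = 8000
FFT_LEN = 256
BIN_HZ = SAMPLE_RATE / FFT_LEN  # 31.25 Hz per bin

# (start Hz, cutoff Hz) for bands 1..23, copied from Table 1
BANDS_HZ = [
    (62.5, 93.75), (125, 156.25), (187.5, 218.75), (250, 281.25),
    (312.5, 343.75), (375, 406.25), (437.5, 468.75), (500, 531.25),
    (562.5, 593.75), (625, 656.25), (687.5, 718.75), (750, 781.25),
    (812.5, 906.25), (937.5, 1062.5), (1093.75, 1250), (1281.25, 1468.75),
    (1500, 1718.75), (1750, 2000), (2031.25, 2312.5), (2343.75, 2687.5),
    (2718.75, 3125), (3156.25, 3687.5), (3718.75, 3968.75),
]


def band_bins(bands_hz=BANDS_HZ, bin_hz=BIN_HZ):
    """Return inclusive (first_bin, last_bin) pairs, assuming edges are bin centers."""
    return [(round(lo / bin_hz), round(hi / bin_hz)) for lo, hi in bands_hz]
```

Under this assumption the 23 bands tile bins 2 through 127 without gaps: the lowest twelve bands are two bins wide, and the bands widen toward higher frequencies.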

The signal energy of each band is estimated and smoothed with the following formulas:

E(m, k) = |X(m, k)|²,  0 ≤ k ≤ N-1

Y(m, k) = αY(m-1, k) + (1-α)E(m, k),  0 ≤ k ≤ N-1

where Y(m, k) denotes the smoothed signal energy in each band, m is the index of the current frame, k is the index of the current subband, and α = 0.75 is the smoothing factor. N is the total number of bands chosen, i.e., 23.
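The recursive smoothing of the band energies can be sketched as follows, assuming the per-band energies E(m, k) have already been computed and are passed in as a list; the function name `smooth_band_energy` is hypothetical.

```python
def smooth_band_energy(prev_smoothed, energy, alpha=0.75):
    """Y(m,k) = alpha * Y(m-1,k) + (1 - alpha) * E(m,k) for every band k."""
    if len(prev_smoothed) != len(energy):
        raise ValueError("both lists must hold one value per band")
    return [alpha * y + (1.0 - alpha) * e for y, e in zip(prev_smoothed, energy)]
```

With α = 0.75, each new frame contributes a quarter of its energy to the smoothed value, so a sudden change in band energy is tracked over several frames rather than immediately.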

The smoothed signal energy in each band comprises speech energy and noise energy. The noise energy estimate is obtained first. It is updated in every frame so that it stays as close as possible to the actual noise energy.

In the present invention, the noise energy estimate of each band is updated by time-recursive averaging based on an approximate a posteriori SNR. The update formula is:

V(m, k) = α(m, k)V(m-1, k) + (1 - α(m, k))E(m, k)

where α(m, k) is the smoothing factor, calculated as:

\alpha(m, k) = \frac{1}{1 + e^{-\beta (\gamma_k(m) - \lambda)}}

\gamma_k(m) = \frac{E(m, k)}{\frac{1}{10} \sum_{i=1}^{10} V(m - i, k)}

γ_k(m) is the approximate a posteriori SNR, obtained by comparing the signal energy of the current frame with the arithmetic mean of the noise energy of the previous 10 frames. β and λ are coefficients; β = 6.2 and λ = 1.4 in the present invention.

Thus, if the current frame is pure speech, the approximate a posteriori SNR is very large, so α(m, k) → 1 and V(m, k) ≈ V(m-1, k), i.e., the noise energy estimate is not updated. Conversely, if the current frame is pure noise, the approximate a posteriori SNR is very small, so α(m, k) → 0 and V(m, k) ≈ E(m, k), i.e., the noise energy estimate is updated to the noise power of the current frame.
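The noise update for a single band can be sketched as below. This is a minimal illustration, assuming `past_noise` holds the previous noise estimates of that band in chronological order (so `past_noise[-1]` is V(m-1, k)) and that their mean is positive; the function names are hypothetical.

```python
import math


def noise_smoothing_factor(energy, past_noise, beta=6.2, lam=1.4):
    """alpha(m,k) = 1 / (1 + exp(-beta * (gamma - lam))).

    gamma is E(m,k) divided by the mean of the past noise estimates
    (10 frames in the text); the mean is assumed to be positive.
    """
    gamma = energy / (sum(past_noise) / len(past_noise))
    return 1.0 / (1.0 + math.exp(-beta * (gamma - lam)))


def update_noise(energy, past_noise):
    """V(m,k) = alpha * V(m-1,k) + (1 - alpha) * E(m,k)."""
    alpha = noise_smoothing_factor(energy, past_noise)
    return alpha * past_noise[-1] + (1.0 - alpha) * energy
```

For example, with ten past estimates of 1.0, a frame energy of 100 leaves the estimate essentially at 1.0 (speech-like frame), while a frame energy of 0.5 pulls the estimate almost all the way to 0.5 (noise-like frame).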

Then, the a posteriori SNR of the current frame is calculated for each band from the signal energy and the noise energy estimate:

SNR_{post}(m, k) = \frac{Y(m, k)}{V(k)}

where V(k) denotes the continually updated noise energy estimate. Here V, V(k), and V(m, k) all denote the noise energy estimate, m is the index of the current frame, and k is the index of the current subband.

Then, the a priori SNR estimate of the current frame is calculated based on the a priori SNR estimation formula of Ephraim and Malah.
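The two SNR quantities can be sketched as follows. The a posteriori SNR follows the formula given in the text. For the a priori SNR, the sketch uses the widely known decision-directed estimator of Ephraim and Malah as an assumption; the exact form used by the invention may differ, and the weight 0.98 as well as the names `posterior_snr` and `prior_snr_decision_directed` are hypothetical choices made for illustration.

```python
def posterior_snr(smoothed_energy, noise_energy):
    """SNR_post(m,k) = Y(m,k) / V(k); noise_energy is assumed to be positive."""
    return smoothed_energy / noise_energy


def prior_snr_decision_directed(prev_clean_energy, noise_energy, snr_post, weight=0.98):
    """Standard decision-directed estimate (assumed form, not taken from the source).

    prev_clean_energy: energy of the enhanced signal in this band in the previous frame.
    weight: hypothetical smoothing weight between the previous and current frame.
    """
    previous_term = weight * prev_clean_energy / noise_energy
    current_term = (1.0 - weight) * max(snr_post - 1.0, 0.0)
    return previous_term + current_term
```

The first term carries over information from the previous frame, and the second term uses the current a posteriori SNR, clipped at zero so that the estimate never becomes negative.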

Next, the attenuation gain factor of each band is calculated. Different strategies are applied depending on the a priori SNR estimate calculated above. A band with a high SNR is regarded as speech, and its attenuation factor is obtained by spectral subtraction; a band with a low SNR is regarded as noise and is attenuated to a certain degree. The present invention uses SNR_prior = 1.5 as the criterion: bands above 1.5 are regarded as speech, and bands below 1.5 are regarded as noise.

The attenuation gain factor is calculated as follows:

where a and b are different constants.

Automobile noise and similar noise is concentrated mainly in the lower bands below 800 Hz, whereas street noise is distributed fairly evenly in the frequency domain. Different parameters are therefore used for the low-to-mid bands and the high bands.

For bands with k ≤ 14, i.e., signals below 800 Hz, a = 6.3 and b = 4.8.

For bands with k > 18, i.e., signals above 2 kHz, a = 6.7 and b = 3.3.

Otherwise, i.e., for 14 < k ≤ 18, a = 5.9 and b = 3.5.
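The band-dependent choice of constants can be written as a small lookup. This sketch covers only the selection of a and b from the three ranges stated above and does not attempt the gain formula itself; the function name `gain_constants` and the 1-based band numbering (matching Table 1) are assumptions.

```python
def gain_constants(k):
    """Return (a, b) for band number k, assumed to run from 1 to 23 as in Table 1."""
    if not 1 <= k <= 23:
        raise ValueError("band number must be between 1 and 23")
    if k <= 14:
        return 6.3, 4.8   # low bands
    if k <= 18:
        return 5.9, 3.5   # middle bands, 14 < k <= 18
    return 6.7, 3.3       # high bands, k > 18
```

For example, `gain_constants(16)` returns `(5.9, 3.5)`.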

After the attenuation gain factor is obtained, the noisy speech signal X(m, k) of each band of the current frame is multiplied by the attenuation gain factor of that band. The result is the SNR-enhanced speech signal of that band:

\hat{S}(k) = q(k) * X(k)

0 ≤ k ≤ N-1

where N = 23 is the total number of bands, q(k) is the attenuation gain factor of the k-th band, and \hat{S}(k) is the estimated speech signal of the k-th band after enhancement.
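Applying the gains is an element-wise multiplication. The sketch below assumes one spectral value and one gain per band, passed as two lists of equal length; the function name `apply_band_gains` is hypothetical.

```python
def apply_band_gains(band_spectrum, gains):
    """Return [q(k) * X(k)] for each band k; X(k) may be complex, q(k) is real."""
    if len(band_spectrum) != len(gains):
        raise ValueError("one gain per band is required")
    return [q * x for q, x in zip(gains, band_spectrum)]
```

Because each gain is a real number, the multiplication scales the magnitude of each band and leaves its phase unchanged.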

Finally, the SNR-enhanced speech signal is transformed from the frequency domain to the time domain, de-emphasized, and overlap-added to form the final output signal. The operations are:

Step 1: an inverse fast Fourier transform (IFFT) transforms the frequency-domain speech signal to the time domain to obtain the enhanced time-domain speech signal.

The transformation to the time domain is implemented with the general inverse discrete Fourier transform (IDFT):

s(m, n) = \frac{1}{2} \sum_{k=0}^{M-1} \hat{S}(k) e^{j 2 \pi n k / M}

0 ≤ n ≤ M-1

where M = 256 is the frame length and s is the enhanced full-band speech signal after transformation to the time domain.
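The scale factors of the two transforms are consistent with each other: a forward factor of 2/M combined with an inverse factor of 1/2 gives the usual overall 1/M. The sketch below checks this with direct (non-fast) transforms, assuming the inverse sum runs over the frequency index k; the window is left out so that the round trip is exact, and the function names are hypothetical.

```python
import cmath
import math


def forward_dft(x):
    """X(k) = (2/M) * sum over n of x(n) * exp(-j*2*pi*k*n/M), window omitted."""
    M = len(x)
    return [
        (2.0 / M) * sum(x[n] * cmath.exp(-2j * math.pi * k * n / M) for n in range(M))
        for k in range(M)
    ]


def inverse_dft(S):
    """s(n) = (1/2) * sum over k of S(k) * exp(j*2*pi*n*k/M)."""
    M = len(S)
    return [
        0.5 * sum(S[k] * cmath.exp(2j * math.pi * n * k / M) for k in range(M))
        for n in range(M)
    ]
```

Applying `inverse_dft(forward_dft(x))` returns the original samples up to floating-point rounding, with negligible imaginary parts for real input.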

Step 2: de-emphasis.

In contrast to the earlier pre-emphasis, the signal is passed through a low-pass filter here to restore the original signal as far as possible. The frequency response of the filter is:

H(z) = 1 + αz^{-1}

The coefficient corresponds to that of the earlier pre-emphasis: α = 0.9.
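In the time domain, H(z) = 1 + αz^{-1} corresponds to y[n] = x[n] + α·x[n-1]. A minimal sketch follows, with the function name `de_emphasis` and the `prev_sample` state argument as assumptions made for illustration.

```python
def de_emphasis(frame, alpha=0.9, prev_sample=0.0):
    """Apply y[n] = x[n] + alpha * x[n-1], i.e. H(z) = 1 + alpha * z^-1 as given in the text."""
    out = []
    prev = prev_sample  # assumed initial state, not given in the source
    for x in frame:
        out.append(x + alpha * prev)
        prev = x
    return out
```

This filter boosts low frequencies relative to high ones and so counteracts the pre-emphasis, although it is not its exact algebraic inverse, which would be the recursive filter 1 / (1 - αz^{-1}).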

Step 3: the overlapping parts of adjacent frames of the de-emphasized speech signal are added.

The overlap addition can be expressed as follows:

s'(n) = \begin{cases} s(m, n) + s(m - 1, n + L) & 0 \leq n < M - L \\ s(m, n) & M - L \leq n < M \end{cases}

where L = 64 is the distance between the starts of adjacent frames and M = 256 is the frame length. s' denotes the final output signal after the overlap addition.
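The overlap addition above can be transcribed literally for one pair of frames. The sketch assumes both frames are lists of equal length M; how the resulting s' is emitted to the output stream is not specified in the text and is left out. The function name `overlap_add` is hypothetical.

```python
def overlap_add(current, previous, hop=64):
    """s'(n) = s(m,n) + s(m-1,n+L) for 0 <= n < M-L, and s(m,n) for M-L <= n < M."""
    M = len(current)
    if len(previous) != M:
        raise ValueError("frames must have equal length")
    out = []
    for n in range(M):
        if n < M - hop:
            # overlapping region: add the matching sample of the previous frame
            out.append(current[n] + previous[n + hop])
        else:
            # last L samples have no counterpart in the previous frame
            out.append(current[n])
    return out
```

With M = 256 and L = 64, the first 192 samples of each frame are summed with the last 192 samples of the frame before it.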

Compared with the related art, the speech enhancement method of the present invention realizes a real-time speech enhancement system in which the voice collection device outputs a directly noise-reduced speech signal. This saves the cost of applying a corresponding algorithm elsewhere, greatly improves noise attenuation, raises the SNR, and preserves speech intelligibility. It is particularly effective against stationary additive noise such as automobile noise and street noise.

The above describes only embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make improvements without departing from the inventive concept, and such improvements all fall within the protection scope of the present invention.

Claims

1. A speech enhancement method, characterized by comprising the following steps:

2. The speech enhancement method according to claim 1, characterized in that the processing in step (5) multiplies the noisy speech signal of the current frame by the attenuation gain factor of the band.

3. The speech enhancement method according to claim 1, characterized in that step (6) comprises:

(62) de-emphasizing the signal with a low-pass filter;