CN101582264A

CN101582264A - Method and voice collecting system for speech enhancement

Info

Publication number: CN101582264A
Application number: CNA2009101080587A
Authority: CN
Inventors: 叶利剑
Original assignee: AAC Acoustic Technologies Shenzhen Co Ltd; AAC Acoustic Technologies Changzhou Co Ltd
Current assignee: AAC Technologies Holdings Shenzhen Co Ltd; AAC Technologies Holdings Changzhou Co Ltd
Priority date: 2009-06-12
Filing date: 2009-06-12
Publication date: 2009-11-18

Abstract

The invention provides a method and a voice collecting system for speech enhancement. The system comprises a microphone collecting device and a chip which integrates the method for speech enhancement. The method for speech enhancement comprises the following steps: carrying out frame division and pre-emphasis on the noise-containing speech signal and converting it to the frequency domain; dividing it into a plurality of frequency bands; calculating the signal energy of each band; calculating the posterior signal-to-noise ratio and the estimate of the prior signal-to-noise ratio of the current frame; judging whether to update the estimate of the noise energy; calculating and applying the attenuation factor of each frequency band so as to obtain a speech signal with enhanced signal-to-noise ratio; and converting the processed speech signal into the time domain and outputting it.

Description

Speech enhancement method and speech-enhancing voice acquisition system

[technical field]

The present invention relates to a speech enhancement method and to a voice acquisition system that integrates this method.

[background technology]

Because of the large amount of environmental noise, the speech signal collected by a voice acquisition system such as a microphone generally does not have a sufficiently high signal-to-noise ratio (SNR). To collect a speech signal with a high SNR, a directional microphone is usually used to pick up speech within a certain range, or a speech enhancement method is used to raise the SNR of the speech signal.

Existing speech enhancement algorithms require a relatively large amount of computation and storage and place relatively high demands on hardware. The silicon area needed to build a dedicated chip is therefore also large, which raises the cost, and the noise reduction achieved is not very satisfactory either.

It is therefore necessary to develop a new speech enhancement method that achieves good noise reduction.

[summary of the invention]

The technical problem to be solved by the present invention is to provide a speech enhancement method with an excellent noise reduction effect.

To solve the above technical problem, a speech enhancement method is designed, comprising the following steps:

(1) The noisy speech signal collected by the voice collection device is framed and pre-emphasized by the chip, and then transformed to the frequency domain by a short-time Fourier transform;

(2) The noisy speech signal in the frequency domain is divided into several frequency bands; the energy of each band is calculated and smoothed to obtain the smoothed signal energy in each band, where the signal energy comprises speech energy and noise energy; and an initial estimate of the noise energy is obtained;

(3) From the signal energy and the initial estimate of the noise energy, the a posteriori SNR of the current frame is calculated for each band, and the a priori SNR estimate of the current frame is obtained from the a priori SNR estimate of the previous frame;

(4) The current frame is judged, using the a priori SNR estimate obtained, to determine whether it is noise; if not, step (5) is executed; if so, step (6) is executed;

(5) The noise energy estimate of each band is updated; the a posteriori SNR of the current frame is then recalculated for each band from the signal energy and the updated noise energy estimate, the a priori SNR estimate of the current frame is obtained from that of the previous frame, and step (4) is executed again to repeat the judgment;

(6) The attenuation gain factor of each band is calculated from the a priori SNR estimate obtained;

(7) The signal spectrum, divided into bands, is processed with the attenuation gain factors obtained;

(8) The processed frequency-domain signal is transformed to the time domain and de-emphasized to form the output signal.

Preferably, the processing in step (7) consists of multiplying the noisy speech signal of the current frame by the attenuation gain factor of the corresponding band.

Preferably, step (8) comprises:

(81) transforming the frequency-domain signal to the time domain by an inverse fast Fourier transform to obtain the enhanced time-domain speech signal;

(82) performing de-emphasis with a low-pass filter.

Another technical problem solved by the present invention is to provide a speech-enhancing voice acquisition system, comprising a voice collection device and a chip integrating the speech enhancement method described above.

Preferably, the chip is integrated in the voice collection device.

Preferably, the voice collection device is a microphone.

Compared with the related art, the speech enhancement method of the present invention realizes a real-time speech enhancement system in which the voice collection device directly outputs the noise-reduced speech signal. Noise attenuation is greatly improved and speech intelligibility is preserved; the attenuation is particularly effective for stationary additive noise such as automobile noise and street noise.

[description of drawings]

Fig. 1 is a flow chart of the speech enhancement method of the present invention.

[embodiment]

The invention is further described below with reference to the drawing and the embodiment.

The main idea of the present invention is to integrate a speech enhancement algorithm into a dedicated chip and to connect this chip to the corresponding voice collection device through an interface for data transmission, forming a real-time speech enhancement system. The speech signal collected by the voice collection device is processed directly by the speech enhancement algorithm in the chip to obtain a signal with enhanced SNR, which is output for subsequent use.

The speech-enhancing voice acquisition system of the present invention comprises a voice collection device and a speech signal processing chip, the chip being integrated in the voice collection device. In this embodiment the voice collection device is a microphone; the analog signal collected by the microphone must also be converted to a digital signal for processing by the chip.

The speech enhancement method integrated in the chip comprises the following steps:

(1) The noisy speech signal collected by the voice collection device (a digital signal) is framed and pre-emphasized by the chip, and then transformed to the frequency domain by a short-time Fourier transform;

(5) The noise energy estimate of each band is updated; the a posteriori SNR of the current frame is then recalculated for each band from the signal energy and the updated noise energy estimate, the a priori SNR estimate of the current frame is obtained from that of the previous frame, and step (4) is executed again to repeat the judgment;

(7) The signal spectrum, divided into bands, is processed with the attenuation gain factors obtained: the noisy speech signal of the current frame is multiplied by the attenuation gain factor of the corresponding band;

(8) The processed frequency-domain signal is transformed to the time domain and de-emphasized to form the output signal. Specifically, step (8) comprises:

(82) performing de-emphasis with a low-pass filter.

A specific embodiment is described below. The noisy speech signal input to the speech-enhancing voice acquisition system has a sampling rate of 8 kHz and a precision of 16 bits.

First, the noisy speech signal in the time domain is framed, that is, divided into noisy speech signal units of one frame each. Each unit consists of sampled points. The present invention uses a sampling frequency of 8 kHz. For short-time spectral analysis the frame length is generally set between 10 and 35 ms; this embodiment uses 32 ms frames, so each frame contains 256 sampled points. Every frame in the present invention thus has a frame length of 256.
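The framing arithmetic above (8 kHz sampling, 32 ms frames, 256 samples per frame) can be sketched as follows. This is a minimal illustration only: the function name `split_into_frames` is hypothetical, and the choices of non-overlapping frames and of dropping an incomplete final frame are assumptions, since the text does not specify them.

```python
from typing import List

SAMPLE_RATE_HZ = 8000   # sampling rate given in the embodiment
FRAME_MS = 32           # frame duration given in the embodiment
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000  # samples per frame


def split_into_frames(signal: List[float], frame_len: int = FRAME_LEN) -> List[List[float]]:
    """Split a signal into consecutive frames of frame_len samples.

    Assumption: frames do not overlap, and trailing samples that do not
    fill a whole frame are dropped. The source states neither choice.
    """
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


one_second = [0.0] * SAMPLE_RATE_HZ      # one second of silence
frames = split_into_frames(one_second)
print(FRAME_LEN, len(frames))
```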

After framing, the speech signal is passed through a high-pass filter as pre-emphasis. Because the background noise in a speech signal generally has more energy at low frequencies, this high-pass filter attenuates the low-frequency components and improves the noise reduction. Its form is as follows:

H(z) = 1 - αz^{-1}

α generally takes a value between 0.75 and 0.95; here α = 0.9 gives good results.
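In the time domain, the filter H(z) = 1 - αz^{-1} corresponds to y(n) = x(n) - αx(n-1). A minimal sketch follows; the function name and the `prev_sample` parameter (the last sample of the preceding frame, taken as 0 by default) are illustrative assumptions, not part of the source.

```python
from typing import List


def pre_emphasis(frame: List[float], alpha: float = 0.9, prev_sample: float = 0.0) -> List[float]:
    """Apply y(n) = x(n) - alpha * x(n-1) to one frame.

    prev_sample is the sample that precedes the frame (assumed 0.0 here).
    """
    out = []
    prev = prev_sample
    for x in frame:
        out.append(x - alpha * prev)
        prev = x
    return out


# A constant (purely low-frequency) input is strongly attenuated after the first sample.
print(pre_emphasis([1.0, 1.0, 1.0]))
```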

Because a speech signal is stationary over short intervals, it can be processed frame by frame; framing, however, introduces discontinuities at the frame boundaries, which cause spectral leakage. A short-time Fourier transform (STFT) is therefore applied to the framed speech signal. The STFT can be understood as windowing each frame and then taking its Fourier transform. The purpose of the window function is to reduce the spectral leakage caused by the discontinuities at the frame boundaries, and thus the truncation effect. A Hamming window of length 256, equal to the frame length, is used here; it effectively reduces the oscillation of the Gibbs effect.

The Hamming window function is defined as follows:

win(n) = \begin{cases} 0.54 - 0.46 \cos(2πn / M), & 0 ≤ n ≤ M - 1 \\ 0, & \text{otherwise} \end{cases}

The short-time Fourier transform is as follows:

X(m, k_1) = \frac{2}{M} Σ_{n = 0}^{M - 1} win(n - m) x(n) e^{-2πj k_1 n / M}

0 ≤ k_1 ≤ M - 1

where M = 256 is the length of the short-time Fourier transform and m denotes the m-th frame.

In this way the noisy speech signal s of the current frame is transformed from the time domain to the frequency domain.
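The windowing and transform can be sketched for a single frame as below. This is an illustration under stated assumptions: the window is indexed locally within the frame (so win(n - m) becomes win(n)), a direct DFT sum is used for clarity where a chip would use an FFT, and the names `hamming` and `stft_frame` are hypothetical.

```python
import cmath
import math
from typing import List


def hamming(M: int) -> List[float]:
    """Hamming window as defined in the text: 0.54 - 0.46*cos(2*pi*n/M)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / M) for n in range(M)]


def stft_frame(frame: List[float]) -> List[complex]:
    """Window one frame and take its DFT with the 2/M scaling from the text.

    Assumption: the window is aligned with the frame, so local sample
    index n runs from 0 to M-1 for both the window and the signal.
    """
    M = len(frame)
    windowed = [w * x for w, x in zip(hamming(M), frame)]
    return [
        (2.0 / M) * sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / M) for n in range(M))
        for k in range(M)
    ]


spectrum = stft_frame([1.0] * 256)   # spectrum of a constant frame
print(len(spectrum), abs(spectrum[0]))
```

For a constant frame the cosine term of the window sums to zero over a full period, so the magnitude at bin 0 is 2 × 0.54 = 1.08.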

The noisy speech signal in the frequency domain contains both speech and noise. It is divided frame by frame into several frequency bands, and different processing strategies are subsequently applied to different bands.

Next, the noisy speech signal below 4 kHz is divided into frequency bands, and all subsequent processing is carried out band by band. This reduces computational complexity and also allows different bands to be treated differently, giving a better enhancement result.

The signal is divided into 23 frequency bands in total; see Table 1.

Table 1: Division into 23 frequency bands

Band number	Start frequency (Hz)	Cutoff frequency (Hz)
1	62.5	93.75
2	125	156.25
3	187.5	218.75
4	250	281.25
5	312.5	343.75
6	375	406.25
7	437.5	468.75
8	500	531.25
9	562.5	593.75
10	625	656.25
11	687.5	718.75
12	750	781.25
13	812.5	906.25
14	937.5	1062.5
15	1093.75	1250
16	1281.25	1468.75
17	1500	1718.75
18	1750	2000
19	2031.25	2312.5
20	2343.75	2687.5
21	2718.75	3125
22	3156.25	3687.5
23	3718.75	3968.75
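With the 8 kHz sampling rate and 256-point transform given in the text, the spacing between frequency bins is 8000 / 256 = 31.25 Hz, and every band edge in Table 1 is a whole multiple of this spacing. The sketch below converts the band edges to bin indices. It assumes, which the source does not state, that the start and cutoff frequencies are the centres of the first and last bins of each band, both inclusive; the names are illustrative.

```python
from typing import List, Tuple

SAMPLE_RATE_HZ = 8000
FFT_LEN = 256
BIN_HZ = SAMPLE_RATE_HZ / FFT_LEN  # 31.25 Hz between adjacent bins

# (start frequency, cutoff frequency) in Hz for bands 1..23, from Table 1.
BAND_EDGES_HZ: List[Tuple[float, float]] = [
    (62.5, 93.75), (125, 156.25), (187.5, 218.75), (250, 281.25),
    (312.5, 343.75), (375, 406.25), (437.5, 468.75), (500, 531.25),
    (562.5, 593.75), (625, 656.25), (687.5, 718.75), (750, 781.25),
    (812.5, 906.25), (937.5, 1062.5), (1093.75, 1250), (1281.25, 1468.75),
    (1500, 1718.75), (1750, 2000), (2031.25, 2312.5), (2343.75, 2687.5),
    (2718.75, 3125), (3156.25, 3687.5), (3718.75, 3968.75),
]


def band_to_bins(start_hz: float, cutoff_hz: float) -> Tuple[int, int]:
    """Return the (first, last) bin index of a band, both inclusive (assumed)."""
    return (round(start_hz / BIN_HZ), round(cutoff_hz / BIN_HZ))


BAND_BINS = [band_to_bins(lo, hi) for lo, hi in BAND_EDGES_HZ]
print(BAND_BINS[0], BAND_BINS[12], BAND_BINS[-1])
```

Under this reading the 23 bands tile bins 2 to 127 without gaps: the low bands are two bins wide and the bands widen toward high frequencies.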

The signal energy of each band is estimated and smoothed with the following formulas:

E(m, k) = |X(m, k)|², 0 ≤ k ≤ N - 1

Y(m, k) = αY(m - 1, k) + (1 - α)E(m, k), 0 ≤ k ≤ N - 1

where Y(m, k) is the smoothed signal energy in each band, m is the index of the current frame, k is the index of the current band, α = 0.75 is the smoothing factor, and N = 23 is the total number of bands.
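The energy estimate and its recursive smoothing can be sketched as follows. The text does not say how the per-band quantity X(m, k) is formed from the individual frequency bins; the sketch assumes the band energy is the sum of squared bin magnitudes inside the band. The function names and the `band_bins` argument (inclusive bin ranges per band) are hypothetical.

```python
from typing import List, Sequence, Tuple

ALPHA = 0.75  # smoothing factor from the text


def band_energies(spectrum: Sequence[complex], band_bins: List[Tuple[int, int]]) -> List[float]:
    """E(m, k): energy per band.

    Assumption: sum of |X|^2 over the bins lo..hi (inclusive) of each band.
    """
    return [sum(abs(spectrum[b]) ** 2 for b in range(lo, hi + 1)) for lo, hi in band_bins]


def smooth_energies(prev_smoothed: List[float], energies: List[float], alpha: float = ALPHA) -> List[float]:
    """Y(m, k) = alpha * Y(m-1, k) + (1 - alpha) * E(m, k)."""
    return [alpha * y + (1.0 - alpha) * e for y, e in zip(prev_smoothed, energies)]


toy_spectrum = [1, 1j, 2, 0]           # four bins
toy_bands = [(0, 1), (2, 3)]           # two bands of two bins each
energy = band_energies(toy_spectrum, toy_bands)
print(energy, smooth_energies([4.0, 4.0], energy))
```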

The smoothed signal energy in each band comprises speech energy and noise energy. An initial estimate of the noise energy is obtained first; the a posteriori SNR of the current frame is then calculated for each band from the signal energy and the initial noise energy estimate, and the a priori SNR estimate of the current frame is calculated from the a priori SNR of the previous frame. The current frame is then judged, using the a priori SNR estimate obtained, to determine whether it is noise:

If the judgment is "no", that is, the frame is not noise, the noise energy estimate of each band is updated; the a posteriori SNR of the current frame is recalculated for each band from the signal energy and the updated noise energy estimate, the a priori SNR estimate of the current frame is calculated from the a priori SNR of the previous frame, and the current frame is judged again to determine whether it is noise and whether the noise energy estimate needs updating.

If the judgment is "yes", that is, the frame is noise, the attenuation gain factor of each band is calculated from the a priori SNR estimate obtained, and processing continues with the next step.

The a posteriori SNR of the current frame is calculated as follows:

SNR_{post}(m, k) = \frac{Y(m, k)}{V(k)}

where V(k) is the current estimate of the noise energy.

Then, based on the a priori SNR estimation formula of Ephraim and Malah, the a priori SNR estimate of the current frame is calculated as follows:

In the present invention, the judgment and update of the noise energy in each band use a voice activity detection (VAD) method based on the a priori SNR. It is first judged whether the current frame is a pure noise signal.

VAD(m) = Σ_{k = 1}^{N} [\frac{γ(m, k) ζ(m, k)}{1 + ζ(m, k)} - \lg(1 + ζ(m, k))]

where γ(m, k) = min[SNR_{post}(m, k), 40].

VAD(m) is then judged and the noise estimate is updated as follows:

V(m, k) = \begin{cases} μV(m - 1, k) + (1 - μ)E(m, k), & VAD(m) < η \\ V(m - 1, k), & VAD(m) ≥ η \end{cases}

where η is the noise update decision factor; η = 0.01 in the present invention.

μ is a smoothing factor; μ = 0.9 here.
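The VAD statistic and the conditional noise update can be sketched as below. Several points are assumptions rather than statements of the source: ζ(m, k) is taken to be the a priori SNR estimate and is passed in as an input, because its formula is not reproduced in this text; lg is read as the base-10 logarithm; and the function names are hypothetical.

```python
import math
from typing import List

ETA = 0.01  # noise update decision factor from the text
MU = 0.9    # noise smoothing factor from the text


def vad_statistic(post_snr: List[float], prior_snr: List[float]) -> float:
    """VAD(m): sum over bands of gamma*zeta/(1+zeta) - lg(1+zeta)."""
    total = 0.0
    for snr_post, zeta in zip(post_snr, prior_snr):
        gamma = min(snr_post, 40.0)  # a posteriori SNR clipped at 40
        # Assumption: lg denotes the base-10 logarithm.
        total += gamma * zeta / (1.0 + zeta) - math.log10(1.0 + zeta)
    return total


def update_noise(prev_noise: List[float], energies: List[float], vad: float,
                 eta: float = ETA, mu: float = MU) -> List[float]:
    """V(m, k): smooth toward E(m, k) when VAD(m) < eta, otherwise hold."""
    if vad < eta:
        return [mu * v + (1.0 - mu) * e for v, e in zip(prev_noise, energies)]
    return list(prev_noise)


vad = vad_statistic([100.0], [1.0])     # one band, strong signal
print(vad, update_noise([1.0], [2.0], vad))
```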

Next, the attenuation gain factor of each band is calculated. Different strategies are used depending on the a priori SNR estimate computed above. Bands with a high SNR are regarded as speech, and the attenuation factor is obtained by spectral subtraction; bands with a low SNR are regarded as noise and are attenuated to a certain degree. The specific formula is as follows.

where a, b and c are different constants.

Because the noise is concentrated mainly in the lower frequency bands, different values of a, b and c are used for the low-to-mid bands and for the high bands.

In the present invention, for bands with k ≤ 18, that is, signals below 2 kHz: a = 10, b = 5.5, c = 8.

For bands with k > 18, that is, signals above 2 kHz: a = 5, b = 4.8, c = 5.
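The choice of constants per band can be written as a small lookup. The sketch only selects a, b and c; it does not compute the gain, because the gain formula itself is not reproduced in this text. It assumes band numbers run from 1 to 23 as in Table 1, so that band 18 is the band ending at 2000 Hz; the function name is hypothetical.

```python
from typing import Tuple

NUM_BANDS = 23


def gain_constants(band_number: int) -> Tuple[float, float, float]:
    """Return (a, b, c) for a band numbered 1..23 as in Table 1 (assumed)."""
    if not 1 <= band_number <= NUM_BANDS:
        raise ValueError("band_number must be between 1 and 23")
    if band_number <= 18:          # bands up to 2 kHz
        return (10.0, 5.5, 8.0)
    return (5.0, 4.8, 5.0)         # bands above 2 kHz


print(gain_constants(18), gain_constants(19))
```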

Once the attenuation gain factor has been obtained, the noisy speech signal X(m, k) of each band of the current frame is multiplied by it; the result is the speech signal with enhanced SNR in that band.

\hat{S}(k) = q(k) X(k), 0 ≤ k ≤ N - 1

where N = 23 is the total number of bands, q(k) is the attenuation gain factor of band k, and \hat{S}(k) is the estimate of the enhanced speech signal in band k.
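Applying one gain per band to a bin-level spectrum can be sketched as follows. Two choices here are assumptions not stated in the source: the same gain is also applied to the mirrored (negative-frequency) bins so that the inverse transform stays real, and bins that belong to no band are left unchanged. The function name and the `band_bins` argument are hypothetical.

```python
from typing import List, Sequence, Tuple


def apply_band_gains(spectrum: Sequence[complex],
                     band_bins: List[Tuple[int, int]],
                     gains: List[float]) -> List[complex]:
    """Multiply every bin of each band by that band's gain q(k)."""
    M = len(spectrum)
    out = list(spectrum)               # bins outside all bands stay as they are (assumed)
    for (lo, hi), q in zip(band_bins, gains):
        for b in range(lo, hi + 1):
            out[b] = q * spectrum[b]
            mirror = (M - b) % M       # conjugate-symmetric partner of bin b
            if mirror != b:
                out[mirror] = q * spectrum[mirror]
    return out


flat = [1 + 0j] * 8
print(apply_band_gains(flat, [(1, 2)], [0.5]))
```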

Finally, the SNR-enhanced speech signal is transformed from the frequency domain to the time domain and de-emphasized to form the output signal. The operations are:

Step 1: an inverse fast Fourier transform (IFFT) transforms the frequency-domain speech signal to the time domain, giving the enhanced time-domain speech signal.

The transformation to the time domain is realized with the general inverse discrete Fourier transform (IDFT).

s(m, n) = \frac{1}{2} Σ_{k = 0}^{M - 1} \hat{S}(k) e^{j2πnk / M}

0 ≤ n ≤ M - 1

where M = 256 is the frame length and s is the full-band enhanced speech signal after transformation to the time domain.
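The factor 1/2 in the inverse transform pairs with the factor 2/M in the forward transform: (2/M) × (1/2) × M = 1, so a forward transform followed by the inverse returns the input. The sketch below checks this with direct DFT sums. It omits the window for simplicity (with a window, the round trip returns the windowed frame), and the function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def forward_dft(x: Sequence[float]) -> List[complex]:
    """Forward transform with the 2/M scaling used in the text (no window)."""
    M = len(x)
    return [
        (2.0 / M) * sum(x[n] * cmath.exp(-2j * math.pi * k * n / M) for n in range(M))
        for k in range(M)
    ]


def inverse_dft(S: Sequence[complex]) -> List[float]:
    """Inverse transform with the 1/2 scaling used in the text."""
    M = len(S)
    return [
        0.5 * sum(S[k] * cmath.exp(2j * math.pi * n * k / M) for k in range(M)).real
        for n in range(M)
    ]


samples = [0.0, 1.0, -2.0, 3.5]
recovered = inverse_dft(forward_dft(samples))
print(recovered)
```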

Step 2: de-emphasis.

In contrast to the earlier pre-emphasis, the signal is passed here through a low-pass filter so as to restore the original signal as far as possible. The frequency response of the filter is as follows:

H(z) = 1 + αz^{-1}

The coefficient corresponds to that of the earlier pre-emphasis: α = 0.9.
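In the time domain, the filter H(z) = 1 + αz^{-1} corresponds to y(n) = x(n) + αx(n-1). The sketch below implements the filter exactly as written in the text; note that this first-order low-pass is an approximation and not the exact algebraic inverse of the pre-emphasis filter, which would be the recursive filter 1 / (1 - αz^{-1}). The function name and the `prev_sample` parameter are illustrative assumptions.

```python
from typing import List


def de_emphasis(frame: List[float], alpha: float = 0.9, prev_sample: float = 0.0) -> List[float]:
    """Apply y(n) = x(n) + alpha * x(n-1) to one frame.

    prev_sample is the sample that precedes the frame (assumed 0.0 here).
    """
    out = []
    prev = prev_sample
    for x in frame:
        out.append(x + alpha * prev)
        prev = x
    return out


print(de_emphasis([1.0, 0.0, 0.0]))
```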

Compared with the related art, the speech enhancement method of the present invention realizes a real-time speech enhancement system in which the voice collection device directly outputs the noise-reduced speech signal, which saves the cost of running a corresponding algorithm elsewhere. Noise attenuation is greatly improved, the SNR is raised and speech intelligibility is preserved; the attenuation is particularly effective for stationary additive noise such as automobile noise and street noise.

The above is only an embodiment of the present invention. It should be noted that a person of ordinary skill in the art can make improvements without departing from the inventive concept, and all such improvements fall within the protection scope of the present invention.

Claims

1. A speech enhancement method, characterized in that it comprises the following steps:

2. The speech enhancement method according to claim 1, characterized in that the processing in step (7) consists of multiplying the noisy speech signal of the current frame by the attenuation gain factor of the corresponding band.

3. The speech enhancement method according to claim 1, characterized in that step (8) comprises:

(82) performing de-emphasis with a low-pass filter.

4. A speech-enhancing voice acquisition system, characterized in that it comprises a voice collection device and a chip integrating the speech enhancement method according to claim 1.

5. The voice acquisition system according to claim 4, characterized in that the chip is integrated in the voice collection device.

6. The voice acquisition system according to claim 4 or 5, characterized in that the voice collection device is a microphone.