CN110310650A - Speech enhancement algorithm based on a second-order differential microphone array - Google Patents

Speech enhancement algorithm based on a second-order differential microphone array

Info

Publication number
CN110310650A
Authority
CN
China
Prior art keywords
signal
voice
microphone
beam
beamforming
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910275383.6A
Other languages
Chinese (zh)
Inventor
李冬梅
辜君龙
刘润生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201910275383.6A priority Critical patent/CN110310650A/en
Publication of CN110310650A publication Critical patent/CN110310650A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 - Microphone arrays; Beamforming

Abstract

The present invention proposes a speech enhancement algorithm based on a second-order differential microphone array, belonging to the field of speech signal processing. The method first builds a microphone array and acquires three channels of the speaker's voice. A second-order differential algorithm extracts a target-speech beamforming signal and a noise beamforming signal, which are divided into frames and frequency bands; one of the three acquired channels is likewise framed and divided into bands. A masking value is computed for each time-frequency unit and smoothed, yielding the time-frequency unit values of the enhanced speech. Finally, an inverse Fourier transform, windowing, and overlap-add produce the enhanced signal corresponding to the speaker's voice. The method combines beamforming with computational auditory scene analysis: the beamforming results are used only as estimates of the target-speech and noise energies, and the masking-value generation process of computational auditory scene analysis is optimized so that the masking values vary smoothly and suit practical application scenarios, giving a clearly audible enhancement in the synthesized speech.

Description

Speech enhancement algorithm based on a second-order differential microphone array
Technical field
The invention belongs to the field of digital speech signal processing, and in particular relates to a speech enhancement algorithm based on a second-order differential microphone array.
Background technique
With the development of electronic information technology, speech processing has been widely adopted in voice interaction systems. The various noises present in the surrounding environment, however, strongly degrade the quality of speech processing, mainly as reduced recognition rates and reduced intelligibility. In speech signal processing it is therefore necessary to pre-process the speech with speech enhancement techniques that remove noise and interference, so as to improve the quality of subsequent processing.
A speech enhancement algorithm based on a second-order differential microphone array is well suited to achieving strongly directional speech denoising. The two main tools used are the beamforming algorithm and the computational auditory scene analysis algorithm.
A second-order differential microphone-array beamforming algorithm usually assumes that the direction of the target source is known and attenuates, to different degrees, speech signals arriving from other, non-target directions, thereby enhancing speech in the specified direction; its directivity is stronger than that of first-order algorithms. Assume a sound wave is incident on the microphone array from angle θ with signal strength P_0. At time t = 0, the first microphone receives E_1 = P_0, the second microphone receives E_2(ω, θ) = P_0 e^{-jωd·cosθ/c}, and the third microphone receives E_3(ω, θ) = P_0 e^{-jω(d+d')·cosθ/c}, where ω is the angular frequency, d and d' are the distances between adjacent microphones, and c is the speed of sound. The principle is shown in Fig. 1.
The three microphones are arranged in a straight line and the two adjacent pairs are selected (every two adjacent microphones form one pair, giving two first-order differential microphone pairs). Within each pair, one signal is delayed by τ_1 (the first delay parameter) and subtracted from the other, which yields a first-order delayed differential signal for each pair. One of these two first-order differential signals is then delayed again by τ_2 (the second delay parameter) and subtracted from the other, producing the second-order difference; the output is transformed to the frequency domain for processing. Following an analysis similar to that of the first-order differential microphone array, the derivation proceeds as follows:
Delaying the signal of the second microphone by τ_1 and subtracting it from the signal of the first microphone gives
E_12(ω, θ) ≈ P_0 ω (τ_1 + d·cosθ/c)   (1-2)
Similarly, delaying the signal of the third microphone by τ_1' and subtracting it from the signal of the second microphone gives
E_23(ω, θ) ≈ P_0 ω (τ_1' + d'·cosθ/c) e^{-jωd·cosθ/c}   (1-4)
The expression for E_12 shows that, in the first-order differential microphone array, the delay-and-subtract operation leaves the angular frequency ω as a factor in the result, so different frequency components of the signal receive different gains. To keep the differenced signal undistorted, a low-pass filter with frequency response 1/ω has to be added to cancel the effect of this factor on the frequency components. In the second-order differential microphone array the second-order difference is computed further; setting τ_1 = τ_1' and d = d', the angular frequency again appears only as a factor in the expression for E_123(ω, θ):
E_123(ω, θ) ≈ P_0 ω² (τ_1 + d·cosθ/c)(τ_2 + d·cosθ/c)
To keep the gain of every frequency component obtained after the delay-and-subtract stages independent of frequency, the second-order difference signal is passed through a low-pass filter with frequency response 1/ω², which gives
E_out(θ) ≈ P_0 (τ_1 + d·cosθ/c)(τ_2 + d·cosθ/c)
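The derivation above maps directly onto a frequency-domain implementation. The following Python sketch is illustrative only and is not the patent's reference implementation: the function name, the zeroing of the DC bin, and the small regularization term are assumptions made here, and the delay parameters tau1 and tau2 are supplied by the caller (for example tau1 = 0.8*d/c and tau2 = 0 with d the adjacent spacing, as in the embodiment described later).

```python
# Illustrative frequency-domain sketch of the second-order differential beamformer
# derived above (delay-and-subtract twice, then 1/w**2 equalization).
import numpy as np

def second_order_diff_beam(x1, x2, x3, fs, tau1, tau2, eps=1e-12):
    """x1, x2, x3: equally long, time-aligned signals from the three microphones.
    Returns the second-order differential beamformer output in the time domain."""
    n = len(x1)
    X1, X2, X3 = (np.fft.rfft(x) for x in (x1, x2, x3))
    w = 2.0 * np.pi * np.fft.rfftfreq(n, d=1.0 / fs)   # angular frequency (rad/s)

    # First-order stage: delay the farther microphone of each pair by tau1 and subtract.
    E12 = X1 - X2 * np.exp(-1j * w * tau1)
    E23 = X2 - X3 * np.exp(-1j * w * tau1)

    # Second-order stage: delay one first-order output by tau2 and subtract.
    E123 = E12 - E23 * np.exp(-1j * w * tau2)

    # Equalize the w**2 factor introduced by the two differencing stages
    # (the 1/w**2 low-pass discussed in the text); the DC bin is zeroed.
    E123[1:] = E123[1:] / (w[1:] ** 2 + eps)
    E123[0] = 0.0
    return np.fft.irfft(E123, n)

# Example call with assumed parameters (16 kHz sampling, 1 cm spacing):
# y = second_order_diff_beam(x1, x2, x3, fs=16000, tau1=0.8 * 0.01 / 343.0, tau2=0.0)
```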
The computational auditory scene analysis (CASA) algorithm uses a computer to simulate the human ear's processing of sound. Its core problem is how to extract the information belonging to a single sound source from the sound fragments contained in the time-frequency units produced by the decomposition. Taking the auditory masking property of the human ear into account, the quantity CASA computes is the ideal ratio mask (IRM), whose expression is
IRM(m, c) = X(m, c) / (X(m, c) + N(m, c))
where X(m, c) denotes the target-speech estimate in frame m and band c, and N(m, c) denotes the noise estimate in frame m and band c. For each time-frequency unit the computer "judges" how much of it belongs to the desired sound source and assigns an appropriate masking value as a weight; reassembling all masked time-frequency units into speech then recovers the information of the target source.
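As a small illustration of the IRM weighting just described, the sketch below assumes that X and N are arrays of per-unit target-speech and noise energy estimates of identical shape; the function name and the clipping to [0, 1] are choices made here for illustration only.

```python
# Minimal sketch of ideal-ratio-mask weighting over time-frequency units.
import numpy as np

def ideal_ratio_mask(X, N, eps=1e-12):
    """IRM(m, c) = X(m, c) / (X(m, c) + N(m, c)), clipped to the range [0, 1]."""
    return np.clip(X / (X + N + eps), 0.0, 1.0)

# Weighting a noisy time-frequency representation Y with the mask gives an
# estimate of the target source's time-frequency units:
# S_hat = ideal_ratio_mask(X, N) * Y
```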
The above methods have the following shortcomings:
1. Beamforming can exploit the differences in arrival time between the signals collected by different microphones to perform highly directional denoising, but the result contains the frequency term ω² seen in the formulas above; without a compensating filtering operation the output is severely distorted as a function of frequency, and the compensation filter is complex and difficult to realize in hardware.
2. In the computational auditory scene analysis algorithm the IRM calculation is rather rigid. First, the estimates of the target-speech and noise energies may be inaccurate, which affects the estimated masking values. Second, in practical applications the target speech and the noise often arrive from widely separated directions and are unevenly distributed over time, so weighting all time-frequency units directly with the IRM value as the masking value makes it difficult to reach optimal performance.
Summary of the invention
The purpose of the present invention is to overcome the shortcomings of the prior art by proposing a speech enhancement algorithm based on a second-order differential microphone array. The method combines the beamforming algorithm and the computational auditory scene analysis algorithm in a novel way: the beamforming results are used only as estimates of the target-speech and noise energies, which removes the influence of the frequency term, and the masking-value generation process of computational auditory scene analysis is optimized and post-processed so that the masking values vary smoothly and suit practical application scenarios, giving a clearly audible enhancement in the synthesized speech.
The present invention proposes a speech enhancement algorithm based on a second-order differential microphone array, characterized in that it comprises the following steps:
1) Construct a microphone array consisting of 3 or 4 microphones. If there are 3 microphones, they are arranged in a line and the distance from the 1st microphone to the 2nd equals the distance from the 2nd microphone to the 3rd. If there are 4 microphones, they are arranged in a line, the distance from the 1st microphone to the 2nd equals the distance from the 2nd microphone to the 3rd, and the distance from the 1st microphone to the 3rd equals the distance from the 3rd microphone to the 4th;
2) Using the microphone array built in step 1), select 3 microphones and acquire the speaker's voice in real time, obtaining 3 channels of speech signals. When acquiring the signals, the line of microphones is aimed at the speaker; along the direction of this line, define the 0° and 180° directions, where the 0° direction, from which the speaker's voice arrives, is recorded as the enhancement direction and the opposite 180° direction is recorded as the main suppression direction;
Wherein, if the array consists of 4 microphones, either microphones 1, 2 and 3 or microphones 1, 3 and 4 are selected to acquire the speech signals;
3) Using the 3 speech channels obtained in step 2), extract the target-speech beamforming signal and the noise beamforming signal with the second-order differential algorithm: steering the beam toward the 0° direction yields the target-speech beamforming signal, and steering the beam toward the 180° direction yields the noise beamforming signal. The target-speech beamforming signal is used as the estimate of the target-speech energy and the noise beamforming signal as the estimate of the noise energy;
4) Calculate and smooth the masking values; the specific steps are as follows:
4.1) Divide the target-speech beamforming signal obtained in step 3) into frames and frequency bands to obtain S(λ, μ), divide the noise beamforming signal obtained in step 3) into frames and frequency bands to obtain N(λ, μ), and divide any one of the 3 speech channels obtained in step 2) into frames and frequency bands to obtain Y(λ, μ), where S(λ, μ) denotes the speech energy of the target-speech beamforming signal on the μ-th band of frame λ, N(λ, μ) denotes the noise energy of the noise beamforming signal on the μ-th band of frame λ, and Y(λ, μ) denotes the energy of the selected speech signal on the μ-th band of frame λ;
The framing and band division are performed as follows: for each signal, take 0.02 s as one frame with a frame shift of 0.01 s, apply a fast Fourier transform to each frame, and divide each frame into M frequency bands, so that each signal is decomposed into a number of time-frequency units equal to the number of frames multiplied by the number of bands;
4.2) Calculate the masking values: the masking value G(λ, μ), denoting the masking value on the μ-th band of frame λ, is obtained for each time-frequency unit from the masking function of S(λ, μ) and N(λ, μ);
4.3) Smooth the masking values;
First calculate a signal enhancement ratio δ(λ) for each frame, defined over all M frequency bands of the frame (M being the total number of bands) as the ratio of the energy after masking to the energy before masking;
Then calculate the smoothing filter H(μ) from δ(λ), where round is the rounding function and N(λ) is an intermediate variable;
The masking values are smoothed according to formula (5):
G_PF(λ, μ) = G(λ, μ) * H(μ)    (5)
where G_PF(λ, μ) denotes the smoothed masking value on the μ-th band of frame λ and * denotes convolution;
5) speech synthesis;
5.1) Multiply each time-frequency unit value Y(λ, μ) of the speech signal selected in step 4.1) by the smoothed masking value G_PF(λ, μ) of that unit to obtain the time-frequency unit value of the enhanced speech of that unit;
5.2) speech synthesis:
Apply an inverse Fourier transform to each time-frequency unit of the enhanced speech, then window and overlap-add the frames to obtain the enhanced signal corresponding to the speaker's voice acquired in step 2).
The features and beneficial effects of the present invention are:
1) The method combines the beamforming algorithm and the computational auditory scene analysis algorithm in a novel way: the beamforming results are used only as estimates of the target-speech and noise energies, which removes the influence of the frequency term, and the masking-value generation process of computational auditory scene analysis is optimized and post-processed so that the masking values vary smoothly and suit practical application scenarios.
2) The present invention removes noise arriving from a fixed direction very effectively and offers excellent signal-to-noise performance in the main-lobe direction.
3) All operations of the present invention are completed in the time domain, avoiding operations such as the Fourier transform and its inverse.
4) By correcting the masking values, the present invention solves the musical-noise problem that discontinuous masking values in computational auditory scene analysis may introduce into the synthesized speech.
5) The linear three-microphone array structure makes fuller use of spatial information; compared with the traditional two-microphone array structure and first-order beamforming algorithms it offers better directivity, with more pronounced attenuation away from the main-lobe direction.
6) The present invention can be used in the microphone modules of devices such as mobile phones, computers, conference phones and vehicle-mounted communication systems, and therefore has considerable practical value.
Detailed description of the invention
Fig. 1 is a schematic diagram of the second-order beamforming algorithm for the linear three-microphone array.
Fig. 2 is a schematic diagram of the target-speech and noise estimation in the present invention.
Fig. 3 is the measured directivity pattern for τ_1 = 0.8d/c, τ_2 = 0 in the embodiment of the present invention.
Fig. 4 is a schematic diagram of the masking function constructed by the present invention.
Specific embodiment
The present invention proposes a speech enhancement algorithm based on a second-order differential microphone array, which is described in further detail below with reference to the accompanying drawings and a specific embodiment.
The main difference between this method and the prior art is that it combines the two algorithms of beamforming and computational auditory scene analysis: the main lobes of the beamformers are used only to estimate the target speech and the noise, masking values are then derived from the masking function and used to weight each time-frequency unit, and a speech enhancement post-processing module is added afterwards to improve the auditory comfort of the output speech.
The present invention proposes a speech enhancement algorithm based on a second-order differential microphone array, comprising the following steps:
1) Construct a microphone array. The number of microphones in the array is 3 or 4, with no particular requirement on the microphone model. With 3 microphones, they are arranged in a line such that the distance from the first microphone to the second equals the distance from the second microphone to the third. With 4 microphones, they are arranged in a line such that the distance from the first microphone to the second equals the distance from the second microphone to the third, and the distance from the first microphone to the third equals the distance from the third microphone to the fourth. (The present embodiment uses 4 MEMS microphones arranged in a line with successive adjacent spacings of 1 cm, 1 cm and 2 cm.)
2) Using the microphone array built in step 1), select three microphones and acquire the speaker's voice in real time, obtaining three channels of speech signals. When acquiring the signals, the line of microphones is aimed at the speaker; the length of the recording is not limited. Along the direction of this line, define the 0° and 180° directions: the 0° direction, from which the speaker's voice arrives, is the enhancement direction, and the exactly opposite 180° direction is the main suppression direction.
If the array consists of 4 microphones, either the speech signals acquired by microphones 1, 2 and 3 or the speech signals acquired by microphones 1, 3 and 4 may be selected.
3) Using the three speech channels obtained in step 2), apply the second-order differential algorithm twice to extract the target-speech beamforming signal and the noise beamforming signal, respectively.
Fig. 2 is a schematic diagram of the target-speech and noise estimation in the present invention. As shown in Fig. 2, the beam is first steered toward the target speech at 0°; in this embodiment the first delay parameter is set to τ_1 = 0.8d/c and the second delay parameter to τ_2 = 0, which yields the beam on the right of Fig. 2, the target-speech beam. The beam is then steered toward 180°; in this embodiment the first delay parameter is set to τ_1 = -0.8d/c and the second delay parameter to τ_2 = 0, which yields the beam on the left of Fig. 2, the noise beam.
The beamforming signal whose main lobe points to 0° serves as the estimate of the energy of the target speech, while the beamforming signal whose main lobe points to 180° serves as the estimate of the energy of the noise.
Note that in practical applications the target speech and the noise source may not lie in exactly opposite directions, but the principal direction of the microphone line (0°) can be fixed as the enhancement direction and the opposite direction (180°) as the main suppression direction. In practice, the principal direction of the array is aimed at the target source; noise arriving from any direction off the main lobe is then suppressed to a varying degree, the suppression becoming stronger as the deviation from the principal direction grows, with the exact dependence of the suppression on angle determined by the delay parameters.
In this embodiment, the delay parameters were chosen as follows:
After τ_2 in Fig. 1 is set to 0, the four lobes of the second-order algorithm change in a convenient way: the main-lobe maximum at 0° is fixed by τ_2, and there are three sidelobes in the rear half-plane. When τ_1 takes its maximum value d/c, the sidelobe at 180° is eliminated and only the two lateral sidelobes remain; as τ_1 decreases, the 180° sidelobe gradually grows and absorbs the envelopes of the main lobe and the other two sidelobes, so the main lobe becomes progressively narrower until, at τ_1 = 0, the other two sidelobes disappear completely and the main lobe is at its narrowest. Because of this behaviour, the two directivity requirements on the algorithm's feature-extraction module are in conflict: on the one hand the main lobe should be as narrow as possible, which calls for a τ_1 as small as possible; on the other hand the beamforming output in the direction opposite the main lobe should be as small as possible, filtering out as much of the speech energy from that direction as possible, which calls for a τ_1 as large as possible. The choice must therefore be weighed against the actual application. In the experiments τ_1 = 0.8d/c and τ_2 = 0 were used; the resulting directivity pattern of the algorithm is shown in Fig. 3. It can be seen that the beamformer extracts the speech in the main-lobe direction well while suppressing signals from the sidelobe directions.
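For illustration, the lobe behaviour described above can be reproduced from the low-frequency approximation E_out(θ) ≈ P_0 (τ_1 + d·cosθ/c)(τ_2 + d·cosθ/c) derived in the background section. The sketch below is an assumption-laden illustration (the function name, the 1 cm spacing and the relative-level printout are chosen here), not a reproduction of the measured pattern in Fig. 3.

```python
# Sketch of the low-frequency directivity pattern implied by the derivation above,
# used to illustrate how tau1 trades main-lobe width against rear suppression.
import numpy as np

def directivity(theta_rad, d, tau1, tau2, c=343.0):
    """|(tau1 + d*cos(theta)/c) * (tau2 + d*cos(theta)/c)|, normalized to its maximum."""
    b = np.abs((tau1 + d * np.cos(theta_rad) / c) * (tau2 + d * np.cos(theta_rad) / c))
    return b / b.max()

c = 343.0                                  # speed of sound (m/s)
d = 0.01                                   # 1 cm adjacent spacing, as in the embodiment
theta = np.radians(np.arange(361))         # 0 to 360 degrees
pattern = directivity(theta, d, tau1=0.8 * d / c, tau2=0.0)
print("180 deg response relative to the 0 deg main lobe: %.1f dB"
      % (20 * np.log10(pattern[180])))
```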
4) Calculate and smooth the masking values; the specific steps are as follows:
4.1) Divide the target-speech beamforming signal obtained in step 3) into frames and frequency bands to obtain S(λ, μ), divide the noise beamforming signal obtained in step 3) into frames and frequency bands to obtain N(λ, μ), and divide any one of the three speech channels obtained in step 2) (a noisy speech signal) into frames and frequency bands to obtain Y(λ, μ), where S(λ, μ) denotes the speech energy of the target-speech beamforming signal on the μ-th band of frame λ, N(λ, μ) denotes the noise energy of the noise beamforming signal on the μ-th band of frame λ, and Y(λ, μ) denotes the energy of the selected noisy speech signal on the μ-th band of frame λ.
The framing and band division are performed as follows: for each beamforming signal or speech signal, take 0.02 s as one frame with a frame shift of 0.01 s, apply a fast Fourier transform to each frame, and divide each frame into M frequency bands (M is usually chosen as 64, 128, 256 or a similar value; 64 bands are used in this embodiment), so that each beamforming or speech signal is decomposed into a number of time-frequency units equal to the number of frames multiplied by the number of bands.
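A minimal sketch of this framing and band division is given below. The patent specifies only the frame length, the frame shift and the number of bands M, so the sampling rate fs, the analysis window and the uniform grouping of FFT bins into bands are assumptions made here for illustration.

```python
# Sketch of framing (20 ms frames, 10 ms hop) and sub-band energy decomposition.
import numpy as np

def frame_subband_energy(x, fs, n_bands=64, frame_sec=0.02, hop_sec=0.01):
    """Return an (n_frames, n_bands) array of band energies for signal x."""
    frame_len = int(round(frame_sec * fs))
    hop = int(round(hop_sec * fs))
    window = np.hanning(frame_len)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)

    spec_bins = frame_len // 2 + 1
    edges = np.linspace(0, spec_bins, n_bands + 1).astype(int)   # uniform band edges

    out = np.zeros((n_frames, n_bands))
    for i in range(n_frames):
        seg = x[i * hop: i * hop + frame_len]
        if len(seg) < frame_len:                                 # zero-pad last frame
            seg = np.pad(seg, (0, frame_len - len(seg)))
        mag2 = np.abs(np.fft.rfft(seg * window)) ** 2
        for b in range(n_bands):
            out[i, b] = mag2[edges[b]:edges[b + 1]].sum()
    return out
```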
4.2) Calculate the masking values.
When constructing the masking function, the present invention refers to the characteristics of actual speech in order to reach the best masking effect. The resulting function is shown in Fig. 4, where the abscissa is the ratio of target-speech energy to noise energy in a time-frequency unit and the ordinate is the masking value G(λ, μ) of that unit on the μ-th band of frame λ. The masking function takes the point where the target-speech energy equals the noise energy (ratio equal to 1) as its breakpoint, fitting a convex cubic function below it and a convex ideal-ratio function above it, so that speech is well preserved in time-frequency units dominated by the speech signal while units dominated by the noise signal are suppressed ever faster as the noise energy grows; in this way a balance is reached between the signal-to-noise ratio of the synthesized speech and its auditory comfort.
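Since the fitted coefficients of the masking function appear only in Fig. 4, the sketch below merely illustrates a function with the qualitative shape just described: a breakpoint at an energy ratio of 1, a convex cubic segment below it and an ideal-ratio-like segment above it. Every numeric constant in it is an assumption chosen for illustration and is not the patent's fitted curve.

```python
# Illustrative masking function with the described qualitative shape (assumed constants).
import numpy as np

def masking_value(S, N, eps=1e-12):
    """G(lam, mu) as a function of the per-unit energy ratio r = S / N, in [0, 1)."""
    r = S / (N + eps)
    below = 0.5 * np.clip(r, 0.0, 1.0) ** 3      # convex cubic segment for r <= 1
    above = r / (r + 1.0)                        # ideal-ratio-like segment for r > 1
    return np.where(r <= 1.0, below, above)      # continuous at r = 1 (value 0.5)
```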
4.3) Smooth the masking values:
First, the time-frequency-unit masking values obtained jointly from the beamforming feature extraction and the computational auditory scene analysis are used to estimate the distribution of the signal-to-noise ratio over time. Unlike the per-unit SNR estimate, the time-domain SNR estimate requires computing, for each frame, the overall SNR of the speech across all frequency bands and then comparing the SNRs of different frames to obtain the SNR distribution of the input speech over different time periods. Because in practice only the mixed noisy speech is available, with neither clean speech nor pure noise, the SNR cannot be computed directly; instead a signal enhancement ratio δ(λ) is computed for each frame in the time domain, where λ is the frame index, μ is the band index and M is the total number of bands. δ(λ) is the ratio, over all bands of frame λ, of the energy after masking to the energy before masking, and is used to gauge how the current frame of the original signal splits between speech and noise. Since every masking value G(λ, μ) lies in (0, 1), the signal enhancement ratio is also a number between 0 and 1: a value close to 1 indicates that the frame of the original signal is close to clean speech, while a value close to 0 indicates that it is close to pure noise.
Once the SNR distribution of the noisy speech signal over time is known, the masking values of the computational auditory scene analysis can be further corrected. The present invention adopts a correction method based on a smoothing filter H(μ), computed from δ(λ) using the rounding function round and an intermediate variable N(λ); with the reference parameters the improvement is significant. The filter keeps all its values between 0 and 1, applies a relatively large attenuation to the masking values of low-SNR frames, and leaves the masking values of high-SNR frames almost unchanged. The masking values are smoothed according to formula (5):
G_PF(λ, μ) = G(λ, μ) * H(μ)    (5)
where G_PF(λ, μ) denotes the smoothed masking value on the μ-th band of frame λ and * denotes convolution.
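The sketch below illustrates this post-processing step. δ(λ) is computed here as the within-frame ratio of masked to unmasked band energy, and the boxcar kernel standing in for H(μ) is an assumption: its length and scaling are chosen only to mimic the described behaviour (strong attenuation and smoothing on low-SNR frames, little change on high-SNR frames), not to reproduce the patent's exact expressions.

```python
# Sketch of masking-value smoothing driven by a per-frame signal enhancement ratio.
import numpy as np

def smooth_masks(G, Y, max_len=9):
    """G, Y: (n_frames, n_bands) arrays of masking values and band energies."""
    n_frames, n_bands = G.shape
    G_pf = np.empty_like(G)
    for lam in range(n_frames):
        energy = Y[lam].sum()
        # delta(lam): masked energy / unmasked energy within the frame, in [0, 1].
        delta = (G[lam] * Y[lam]).sum() / energy if energy > 0 else 0.0
        # Assumed kernel: length grows and total weight shrinks as delta decreases,
        # so low-SNR frames are smoothed and attenuated, high-SNR frames barely change.
        N = int(round(max_len * (1.0 - delta))) + 1
        H = np.full(N, delta / N)                          # kernel sums to delta <= 1
        G_pf[lam] = np.convolve(G[lam], H, mode='same')    # G_PF = G * H (convolution)
    return G_pf
```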
5) speech synthesis;
5.1) Multiply each time-frequency unit value Y(λ, μ) of the noisy speech signal selected in step 4.1) by the smoothed masking value G_PF(λ, μ) of that unit to obtain the time-frequency unit value of the enhanced speech of that unit.
5.2) speech synthesis:
Apply an inverse Fourier transform to each time-frequency unit of the enhanced speech, then window and overlap-add the frames to obtain the enhanced signal corresponding to the speaker's voice acquired in step 2). The overlap parameters are the same as those used for framing, the overlap length being half the frame length. A Hanning window is chosen here, with time-domain expression (6):
w(n) = 0.5 [1 - cos(2πn / (N + 1))],  n = 1, …, N    (6)
where N is the window length, equal to the frame length, and n, running from 1 to N, is the independent variable of the window function.
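A brief sketch of this resynthesis step follows. It assumes the enhanced time-frequency units have already been expanded back to per-frame complex spectra (an array called enhanced_frames here), and it omits the amplitude normalization of the overlapping windows for brevity.

```python
# Sketch of overlap-add resynthesis with a Hann window and 50 % overlap.
import numpy as np

def overlap_add(enhanced_frames, frame_len, hop=None):
    """enhanced_frames: (n_frames, frame_len // 2 + 1) complex frame spectra."""
    if hop is None:
        hop = frame_len // 2                          # half-frame overlap (10 ms of 20 ms)
    window = np.hanning(frame_len)
    n_frames = enhanced_frames.shape[0]
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    for i, spec in enumerate(enhanced_frames):
        frame = np.fft.irfft(spec, frame_len) * window   # inverse FFT, then window
        out[i * hop: i * hop + frame_len] += frame       # overlap-add into the output
    return out
```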
In this embodiment, the final experimental results compare the SNR improvement in the main-lobe direction against the conventional first-order algorithm and with and without masking-value smoothing, as shown in Table 1.
Table 1. Signal-to-noise ratio in the main-lobe direction under noise environments of different intensities
As can be seen from the table, the present invention performs the speech enhancement in the main-lobe direction well; compared with the first-order or second-order algorithm alone, and comparing the results before and after masking-value smoothing, it raises the maximum enhanced SNR in the main-lobe direction by more than 10 dB.

Claims (1)

1. A speech enhancement algorithm based on a second-order differential microphone array, characterized in that it comprises the following steps:
1) Construct a microphone array consisting of 3 or 4 microphones. If there are 3 microphones, they are arranged in a line and the distance from the 1st microphone to the 2nd equals the distance from the 2nd microphone to the 3rd. If there are 4 microphones, they are arranged in a line, the distance from the 1st microphone to the 2nd equals the distance from the 2nd microphone to the 3rd, and the distance from the 1st microphone to the 3rd equals the distance from the 3rd microphone to the 4th;
2) Using the microphone array built in step 1), select 3 microphones and acquire the speaker's voice in real time, obtaining 3 channels of speech signals. When acquiring the signals, the line of microphones is aimed at the speaker; along the direction of this line, define the 0° and 180° directions, where the 0° direction, from which the speaker's voice arrives, is recorded as the enhancement direction and the opposite 180° direction is recorded as the main suppression direction;
Wherein, if the array consists of 4 microphones, either microphones 1, 2 and 3 or microphones 1, 3 and 4 are selected to acquire the speech signals;
3) Using the 3 speech channels obtained in step 2), extract the target-speech beamforming signal and the noise beamforming signal with the second-order differential algorithm: steering the beam toward the 0° direction yields the target-speech beamforming signal, and steering the beam toward the 180° direction yields the noise beamforming signal. The target-speech beamforming signal is used as the estimate of the target-speech energy and the noise beamforming signal as the estimate of the noise energy;
4) Calculate and smooth the masking values; the specific steps are as follows:
4.1) Divide the target-speech beamforming signal obtained in step 3) into frames and frequency bands to obtain S(λ, μ), divide the noise beamforming signal obtained in step 3) into frames and frequency bands to obtain N(λ, μ), and divide any one of the 3 speech channels obtained in step 2) into frames and frequency bands to obtain Y(λ, μ), where S(λ, μ) denotes the speech energy of the target-speech beamforming signal on the μ-th band of frame λ, N(λ, μ) denotes the noise energy of the noise beamforming signal on the μ-th band of frame λ, and Y(λ, μ) denotes the energy of the selected speech signal on the μ-th band of frame λ;
The framing and band division are performed as follows: for each signal, take 0.02 s as one frame with a frame shift of 0.01 s, apply a fast Fourier transform to each frame, and divide each frame into M frequency bands, so that each signal is decomposed into a number of time-frequency units equal to the number of frames multiplied by the number of bands;
4.2) Calculate the masking values: the masking value G(λ, μ), denoting the masking value on the μ-th band of frame λ, is obtained for each time-frequency unit from the masking function of S(λ, μ) and N(λ, μ);
4.3) Smooth the masking values;
First calculate a signal enhancement ratio δ(λ) for each frame, defined over all M frequency bands of the frame (M being the total number of bands) as the ratio of the energy after masking to the energy before masking;
Then calculate the smoothing filter H(μ) from δ(λ), where round is the rounding function and N(λ) is an intermediate variable;
The masking values are smoothed according to formula (5):
G_PF(λ, μ) = G(λ, μ) * H(μ)    (5)
where G_PF(λ, μ) denotes the smoothed masking value on the μ-th band of frame λ and * denotes convolution;
5) speech synthesis;
5.1) Multiply each time-frequency unit value Y(λ, μ) of the speech signal selected in step 4.1) by the smoothed masking value G_PF(λ, μ) of that unit to obtain the time-frequency unit value of the enhanced speech of that unit;
5.2) speech synthesis:
Apply an inverse Fourier transform to each time-frequency unit of the enhanced speech, then window and overlap-add the frames to obtain the enhanced signal corresponding to the speaker's voice acquired in step 2).
CN201910275383.6A 2019-04-08 2019-04-08 Speech enhancement algorithm based on a second-order differential microphone array Pending CN110310650A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910275383.6A CN110310650A (en) 2019-04-08 2019-04-08 Speech enhancement algorithm based on a second-order differential microphone array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910275383.6A CN110310650A (en) 2019-04-08 2019-04-08 Speech enhancement algorithm based on a second-order differential microphone array

Publications (1)

Publication Number Publication Date
CN110310650A true CN110310650A (en) 2019-10-08

Family

ID=68074420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910275383.6A Pending CN110310650A (en) Speech enhancement algorithm based on a second-order differential microphone array

Country Status (1)

Country Link
CN (1) CN110310650A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080312918A1 (en) * 2007-06-18 2008-12-18 Samsung Electronics Co., Ltd. Voice performance evaluation system and method for long-distance voice recognition
CN101976565A (en) * 2010-07-09 2011-02-16 瑞声声学科技(深圳)有限公司 Dual-microphone-based speech enhancement device and method
CN102347028A (en) * 2011-07-14 2012-02-08 瑞声声学科技(深圳)有限公司 Double-microphone speech enhancer and speech enhancement method thereof
CN108389586A (en) * 2017-05-17 2018-08-10 宁波桑德纳电子科技有限公司 A kind of long-range audio collecting device, monitoring device and long-range collection sound method
CN108806708A (en) * 2018-06-13 2018-11-13 中国电子科技集团公司第三研究所 Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Gu Junlong et al., "Study of Speech Enhancement Based on the Second-Order Differential Microphone Array," 2018 2nd International Conference on Imaging, Signal Processing and Communication *
Thomas Esch et al., "Efficient musical noise suppression for speech enhancement systems," International Conference on Acoustics, Speech and Signal Processing *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110992963A (en) * 2019-12-10 2020-04-10 腾讯科技(深圳)有限公司 Network communication method, device, computer equipment and storage medium
CN110992963B (en) * 2019-12-10 2023-09-29 腾讯科技(深圳)有限公司 Network communication method, device, computer equipment and storage medium
CN111863003A (en) * 2020-07-24 2020-10-30 苏州思必驰信息科技有限公司 Voice data enhancement method and device

Similar Documents

Publication Publication Date Title
US8654990B2 (en) Multiple microphone based directional sound filter
CN107221336B (en) Device and method for enhancing target voice
CN102164328B (en) Audio input system used in home environment based on microphone array
JP3484112B2 (en) Noise component suppression processing apparatus and noise component suppression processing method
CN107479030B (en) Frequency division and improved generalized cross-correlation based binaural time delay estimation method
AU2010346387B2 (en) Device and method for direction dependent spatial noise reduction
Lotter et al. Dual-channel speech enhancement by superdirective beamforming
KR101060301B1 (en) Method and apparatus for adjusting mismatch of device or signal in sensor array
US8965003B2 (en) Signal processing using spatial filter
CN103907152B (en) Method and system for audio signal noise suppression
WO2015196729A1 (en) Microphone array speech enhancement method and device
US9532149B2 (en) Method of signal processing in a hearing aid system and a hearing aid system
WO1995008248A1 (en) Noise reduction system for binaural hearing aid
JP2013543987A (en) System, method, apparatus and computer readable medium for far-field multi-source tracking and separation
AU2011334840A1 (en) Apparatus and method for spatially selective sound acquisition by acoustic triangulation
CN102204281A (en) A system and method for producing a directional output signal
CN110827847B (en) Microphone array voice denoising and enhancing method with low signal-to-noise ratio and remarkable growth
US11381909B2 (en) Method and apparatus for forming differential beam, method and apparatus for processing signal, and chip
CN110310650A (en) Speech enhancement algorithm based on a second-order differential microphone array
CN103945291A (en) Method and device for achieving orientation voice transmission through two microphones
US20230319469A1 (en) Suppressing Spatial Noise in Multi-Microphone Devices
Madhu et al. Localisation-based, situation-adaptive mask generation for source separation
CN206728234U (en) Remote sound collector with audio/video linkage
Jingzhou et al. End-fire microphone array based on phase difference enhancement algorithm
Lotter et al. A stereo input-output superdirective beamformer for dual channel noise reduction.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191008

WD01 Invention patent application deemed withdrawn after publication