CN101778322B - Microphone array postfiltering sound enhancement method based on multi-models and hearing characteristic - Google Patents


Info

Publication number
CN101778322B
Authority
CN
China
Prior art keywords
noise
signal
power spectrum
voice signal
target voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009102503930A
Other languages
Chinese (zh)
Other versions
CN101778322A (en
Inventor
刘文举
程宁
李超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN2009102503930A priority Critical patent/CN101778322B/en
Publication of CN101778322A publication Critical patent/CN101778322A/en
Application granted granted Critical
Publication of CN101778322B publication Critical patent/CN101778322B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics. It addresses two factors that largely determine the performance of microphone array post-filtering speech enhancement: accurate estimation of the signal parameters, and a suitable trade-off between stronger noise reduction and lower speech distortion. The scheme comprises the following steps: time-aligning the signals collected by the microphone array, applying the short-time Fourier transform, and performing an eigenvalue analysis of the power spectrum; determining the dimension of the signal subspace by maximizing the presence probability of the target speech signal in the noisy speech signal; adaptively selecting a distribution model for the noise power spectrum in the noisy speech signal; estimating the noise power spectrum using conditional probability; estimating the auditory masking threshold based on the signal subspace; and estimating the postfilter by combining Lagrange multipliers with the auditory perception characteristics.

Description

Microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics
Technical field
The present invention relates to signal subspace methods, the auditory masking effect, and postfilter design for microphone arrays.
Background technology
Speech in real environments is usually corrupted by ambient noise, and multichannel speech enhancement methods have received wide attention in recent years. The advantage of microphone array speech enhancement over single-channel speech enhancement is that it can exploit the correlation between the channels to estimate signal characteristics more accurately, and thereby achieve better enhancement. Among these methods, microphone array post-filtering has been widely used in recent years because of its outstanding noise reduction performance. Simmer et al. (Reference 1: K. Uwe Simmer, et al., "Post-filtering techniques", in Microphone Arrays, M. Brandstein and D. Ward, Eds. New York: Springer, ch. 3, pp. 36-60, 2001.) proved that the optimal multichannel speech enhancement solution in the minimum mean square error sense can be decomposed into a minimum variance distortionless response (MVDR) beamformer followed by a single-channel Wiener postfilter. Although the optimality of post-filtering has been proved in theory, in practice it is difficult to estimate the power spectra of the speech and noise signals accurately enough to obtain the ideal postfilter, and this limits the performance of post-filtering. A well-designed postfilter and accurate power spectrum estimation can therefore improve speech enhancement significantly. Zelinski (Reference 2: R. Zelinski, "A microphone array with adaptive post-filtering for noise reduction in reverberant rooms", in Proc. of ICASSP-88, 1988, Vol. 5, pp. 2578-2581.) assumed that the noise signals at the array elements are uncorrelated and proposed a postfilter design method on that basis. In real environments, however, the noise at different array elements is correlated to some degree, so this method performs poorly. McCowan (Reference 3: Iain A. McCowan, Hervé Bourlard, "Microphone array post-filter based on noise field coherence", IEEE Transactions on Speech and Audio Processing, Vol. 11, pp. 709-715, Nov. 2003.) took the correlation between the noise signals into account and used the properties of a diffuse noise field to propose a postfilter design method with better enhancement performance. Because that method assumes a diffuse noise field, its performance degrades markedly when the actual noise field is not diffuse.
The present invention uses the auditory masking effect of the human ear to propose a postfilter design method based on auditory perception characteristics. To estimate the noise power spectrum more accurately, the invention decomposes the noisy signal space into a signal subspace and a noise subspace, estimates the subspace dimensions by maximizing the presence probability of the target speech signal, and estimates the noise power spectrum on the noise subspace using conditional probability. Experiments show that the proposed noise estimation method is more accurate than earlier noise estimation methods, and that the proposed postfilter based on auditory perception characteristics is more effective than traditional postfilters.
Suppose the frequency-domain representation of the noisy speech signal vector received by an array of L microphones is $X=[X_1,\ldots,X_L]^H$. The frequency-domain representation of the enhanced speech signal, obtained as a weighted sum of the array input signals, is:
$$Y = w^H X = w^H [S d + N] \tag{1}$$
where $w$ is the vector of array weight coefficients, $S$ is the target signal, $d=[d_1,\ldots,d_L]^T$ is the propagation vector, $N=[N_1,\ldots,N_L]^H$ is the noise signal vector, and $[\cdot]^H$ is the conjugate transpose operator.
The power of the error signal $e = S - w^H X$ is:
$$\phi_{ee} = E\left[\{S - w^H X\}\{S^H - X^H w\}\right] = \phi_{SS} - w^H \phi_{XS} - \phi_{XS}^H w + w^H \Phi_{XX} w \tag{2}$$
where $\Phi_{XX}$ is the cross-power spectrum matrix of the multichannel noisy speech signal $X$, $\phi_{XS}$ is the cross-power spectrum of the multichannel noisy speech signal $X$ and the single-channel target signal $S$, and $\phi_{SS}$ is the power spectrum of the single-channel target speech signal $S$.
Setting the derivative of $\phi_{ee}$ with respect to the weight $w$ to zero gives the optimal weight coefficients:
$$w_{opt} = \Phi_{XX}^{-1} \phi_{XS} \tag{3}$$
Under the assumption that the target speech signal and the noise are uncorrelated, (3) becomes:
$$w_{opt} = \Phi_{XX}^{-1} \phi_{SS} d = \left[\phi_{SS}\, d d^H + \Phi_{NN}\right]^{-1} \phi_{SS} d \tag{4}$$
Applying the Sherman-Morrison-Woodbury identity, this can be rewritten as:
$$w_{opt} = \left[\frac{\phi_{SS}}{\phi_{SS} + (d^H \Phi_{NN}^{-1} d)^{-1}}\right] \frac{\Phi_{NN}^{-1} d}{d^H \Phi_{NN}^{-1} d} = \left[\frac{\phi_{SS}}{\phi_{SS} + \phi_{NN}}\right] \frac{\Phi_{NN}^{-1} d}{d^H \Phi_{NN}^{-1} d} \tag{5}$$
where $\phi_{NN}$ is the auto-power spectrum of the single-channel noise and $\Phi_{NN}$ is the multichannel noise cross-power spectrum matrix. Equation (5) can be read as an MVDR beamformer $\Phi_{NN}^{-1} d / (d^H \Phi_{NN}^{-1} d)$ followed by a single-channel Wiener postfilter $\phi_{SS}/(\phi_{SS}+\phi_{NN})$.
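The factorization in (5) can be checked numerically. The following is a minimal NumPy sketch, not part of the patented method: the array size, the random propagation vector `d`, and the random noise matrix `Phi_NN` are made-up test values, and all variable names are chosen here for illustration. It builds the weights from (4) and compares them with the MVDR beamformer scaled by the Wiener postfilter gain.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 4                      # number of microphones (illustrative value)
phi_SS = 2.0               # target speech power at one frequency bin (illustrative)

# Hypothetical propagation vector d and Hermitian positive-definite noise matrix Phi_NN.
d = np.exp(-1j * rng.uniform(0, 2 * np.pi, L))
A = rng.standard_normal((L, L)) + 1j * rng.standard_normal((L, L))
Phi_NN = A @ A.conj().T + L * np.eye(L)

# Eq. (4): multichannel Wiener solution.
Phi_XX = phi_SS * np.outer(d, d.conj()) + Phi_NN
w_opt = np.linalg.solve(Phi_XX, phi_SS * d)

# Eq. (5): MVDR beamformer followed by a single-channel Wiener postfilter.
Pinv_d = np.linalg.solve(Phi_NN, d)
denom = np.real(d.conj() @ Pinv_d)       # d^H Phi_NN^{-1} d, real for Hermitian Phi_NN
w_mvdr = Pinv_d / denom
phi_nn_out = 1.0 / denom                 # (d^H Phi_NN^{-1} d)^{-1}, noise power after MVDR
postfilter = phi_SS / (phi_SS + phi_nn_out)

max_diff = np.max(np.abs(w_opt - postfilter * w_mvdr))
print(max_diff)
```

The two weight vectors agree to rounding error, and `w_mvdr` passes the target direction undistorted (its inner product with `d` is 1), which is why only the scalar postfilter gain remains to be designed.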
Summary of the invention
To solve the problems of the prior art, the object of the present invention is to design the single-channel postfilter, using adaptive selection among multiple distribution models together with auditory characteristics to obtain a new postfilter. Single-channel postfilter design has to consider two aspects: good noise reduction performance and low distortion of the target speech signal. A postfilter that reduces noise may also increase the distortion of the target speech signal, so a reasonable trade-off between the two must be considered in the design.
To achieve this object, the invention provides a microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics. The steps of the method are as follows:
Step a: a microphone array composed of L microphones collects multichannel noisy speech signals; the noisy speech signals of the channels are aligned in the time domain; the aligned signal of each channel is converted by the short-time discrete Fourier transform into a complex-valued frequency-domain representation; the power spectrum matrix of the multichannel microphone array signals is computed and eigendecomposed to obtain the eigenvalue matrix and the eigenvector matrix;
Step b: the dimension Q of the signal subspace is determined by maximizing the presence probability of the target speech signal in the noisy speech signal, with Q≤L;
Step c: the distribution model of the noise power spectrum in the noisy speech signal is selected adaptively, based on the stationarity of the spectrum;
Step d: the noise power spectrum is estimated using conditional probability;
Step e: from the signal subspace dimension and the noise power spectrum estimate, the auditory masking effect is used to estimate the auditory masking threshold at each frequency bin based on the signal subspace;
Step f: from the noise power spectrum and the auditory masking threshold, the postfilter is estimated in combination with a Lagrange multiplier, so that the residual noise in the enhanced speech is below the auditory masking threshold of the human ear, which removes the audible effect of the residual noise while keeping the distortion of the target speech signal as small as possible; this completes the microphone array post-filtering speech enhancement.
The eigendecomposition of the power spectrum matrix comprises the following:
Eigendecomposition divides the noisy speech signal space into two subspaces: the signal subspace, which contains the target speech signal and noise, and the noise subspace, which contains only noise. The eigendecomposition of the power spectrum matrix $\Phi_{XX}(k,t)$ of the noisy speech signal X at time frame t and frequency bin k is:
$$\Phi_{XX}(k,t) = U \Lambda_{XX} U^H = U\left(\Lambda_{SS} + \phi_{NN}(k,t)\, I\right) U^H$$
where $X = S + N$, X is the noisy speech signal, S is the target speech signal, and N is the noise; $\Lambda_{XX}$ is the eigenvalue matrix of the noisy speech power spectrum with eigenvalues in descending order, $\Lambda_{SS}$ is the eigenvalue matrix of the target speech power spectrum with eigenvalues in descending order, U is the eigenvector matrix, $\phi_{NN}(k,t)$ is the noise power at time frame t and frequency bin k, I is the identity matrix of order L, and $[\cdot]^H$ is the conjugate transpose operator.
The signal subspace dimension is determined by choosing the value of Q that maximizes the probability that the target speech signal is present in the noisy speech. It is computed using conditional probability, as follows:
Define the mutually exclusive events $H_0$ and $H_1$:
Event $H_0$: the noisy speech signal contains only noise and no target speech signal;
Event $H_1$: the noisy speech signal contains both the target speech signal and noise;
The signal subspace dimension Q is defined as:
$$\arg\max_{Q} P\left(S(k,t) \mid H_1\right)$$
where $S(k,t)$ is the power spectrum of the target speech signal at frequency bin k of frame t, $P(\cdot)$ is the distribution function of the target speech spectrum, and $\arg\max(\cdot)$ is the operator that returns the parameter value with the maximum score.
The adaptive selection of the noise power spectrum distribution model in the noisy speech signal, based on the stationarity of the spectrum, comprises the following steps:
Step c1: define a discriminant function Ω that expresses the stationarity of the power spectrum:
$$\Omega = \frac{\left(\prod_{i=Q+1}^{L} \lambda_X^i\right)^{\frac{1}{L-Q}}}{\frac{1}{L-Q}\sum_{i=Q+1}^{L} \lambda_X^i}$$
That is, Ω is the ratio of the geometric mean $\left(\prod_{i=Q+1}^{L} \lambda_X^i\right)^{1/(L-Q)}$ to the arithmetic mean $\frac{1}{L-Q}\sum_{i=Q+1}^{L} \lambda_X^i$, where $\lambda_X^i$ is the i-th eigenvalue of the noisy speech power spectrum eigenvalue matrix $\Lambda_{XX}$, $i \in \{Q+1,\ldots,L\}$ is the eigenvalue index, and Ω takes values between 0 and 1;
Step c2: compare the value of the discriminant function with preset thresholds to determine the noise power spectrum distribution model that suits the noisy speech signal.
The comparison of the discriminant function value with the preset thresholds comprises:
Step c21: determine two preset thresholds $\Omega_1$ and $\Omega_2$, with $\Omega_1 < \Omega_2$;
Step c22: compare the discriminant function with the preset thresholds. Specifically, if the discriminant function is less than $\Omega_1$, the zero-mean Gaussian distribution is selected; if it is greater than $\Omega_2$, the gamma distribution is selected; otherwise the Laplacian distribution is selected.
The estimation of the noise power spectrum using conditional probability comprises:
For each frame of the noisy speech signal, the probability that it contains only noise is $P(H_0 \mid X)$, and the probability that it contains both noise and the target speech signal is $P(H_1 \mid X)$. For these two cases the noise power spectrum is estimated as follows:
$$H_0:\ \phi_{NN}^{0} = \frac{1}{L}\sum_{i=1}^{L} \lambda_X^i \qquad H_1:\ \phi_{NN}^{1} = \frac{1}{L-Q}\sum_{i=Q+1}^{L} \lambda_X^i$$
where $\phi_{NN}^{0}$ and $\phi_{NN}^{1}$ are the noise power spectra when the mutually exclusive events $H_0$ and $H_1$ occur, respectively, and $i \in \{1,\ldots,L\}$ is the eigenvalue index;
By the conditional probability formula, the noise power spectrum is estimated as:
$$\tilde{\phi}_{NN} = P(H_0 \mid X)\,\phi_{NN}^{0} + P(H_1 \mid X)\,\phi_{NN}^{1}$$
The estimation of the auditory masking threshold comprises:
Step f1: divide the auditory frequency range of 0 to 15500 Hz into a number of critical sub-bands;
Step f2: compute the auditory masking threshold in each sub-band.
The auditory masking threshold in each sub-band is computed as follows: compute the energy of each frequency bin in each sub-band; compute the spreading coefficients of the basilar membrane of the human ear for sound in each band; multiply the energy of each frequency bin in each sub-band by the spreading coefficients to obtain the excitation energy on the basilar membrane; then compute the masking threshold from the functional relationship between the excitation energy on the basilar membrane and the auditory masking threshold.
The estimation of the postfilter G in combination with a Lagrange multiplier comprises the following steps:
Step fa: set up an optimization problem that minimizes the distortion of the target speech signal under the constraint that the residual noise power is below the masking threshold;
Step fb: solve it with a Lagrange multiplier to obtain the optimal estimate of the postfilter;
Step fc: substitute the auditory masking threshold and the noise power spectrum estimate to complete the postfilter design.
Beneficial effects of the invention: the invention uses the auditory masking effect of the human ear to propose a reasonable trade-off, and designs a new postfilter based on auditory perception characteristics. Traditional noise estimation relies on voice activity detection (VAD): the noise-only frames in the noisy speech are detected, and the average power spectrum over those frames is used as the noise power spectrum of the frames that contain both speech and noise. Because the noise varies over time, it actually differs from frame to frame, so using the average noise power spectrum of the noise-only frames for all frames causes large estimation errors. To address this, the invention proposes a noise power spectrum estimation method based on subspace decomposition of the noisy signal, which estimates the noise power spectrum on every frame and greatly reduces the noise estimation error. The invention then uses the auditory masking effect of the human ear to design the postfilter, so that the residual noise in the enhanced speech is masked by the target speech while the distortion of the target speech introduced by noise reduction is also reduced.
Description of drawings
Further features and advantages of the invention are described below with reference to the illustrative drawings.
Fig. 1 is an example flowchart of an application of the microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics;
Fig. 2 is a flowchart of a method for determining the signal subspace dimension;
Fig. 3 is a flowchart for determining the noise power spectrum distribution model in the noisy speech signal;
Fig. 4 is a flowchart for estimating the noise power spectrum using conditional probability;
Fig. 5 is a flowchart for computing the auditory masking threshold;
Fig. 6 is a flowchart for designing the postfilter.
Embodiment
It should be understood that the following detailed description of the examples and drawings is not intended to limit the invention to the particular illustrative embodiments; the illustrative embodiments described merely illustrate the steps of the invention, whose scope is defined by the appended claims.
The invention uses the auditory masking effect of the human ear to propose a reasonable trade-off, and designs a new postfilter based on auditory perception characteristics. The auditory masking effect works as follows: under normal conditions the target speech signal is strong and the background noise is relatively weak, so the auditory system determines an auditory masking threshold in the frequency domain from the specific target speech signal. If the residual noise after filtering is kept below this masking threshold, the human ear does not perceive it, and the noisy speech signal is thereby enhanced. The specific steps are as follows:
A new microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics comprises the following steps:
Step a: a microphone array composed of L microphones collects multichannel noisy speech signals; the noisy speech signals of the channels are aligned in the time domain; the aligned signal of each channel is converted by the short-time discrete Fourier transform into a complex-valued frequency-domain representation; the power spectrum matrix of the multichannel microphone array signals is computed and eigendecomposed to obtain the eigenvalue matrix and the eigenvector matrix;
Step b: the dimension Q of the signal subspace is determined by maximizing the presence probability of the target speech signal in the noisy speech signal;
Step c: the distribution model of the noise power spectrum in the noisy speech signal is selected adaptively, based on the stationarity of the spectrum;
Step d: the noise power spectrum is estimated using conditional probability;
Step e: from the signal subspace dimension and the noise power spectrum estimate, the auditory masking effect is used to estimate the auditory masking threshold at each frequency bin based on the signal subspace;
Step f: from the noise power spectrum and the auditory masking threshold, the postfilter is estimated in combination with a Lagrange multiplier, so that the residual noise in the enhanced speech is below the auditory masking threshold of the human ear, which removes the audible effect of the residual noise while keeping the distortion of the target speech signal as small as possible; this completes the microphone array post-filtering speech enhancement.
The commonly used noise estimation method is based on voice activity detection (VAD): the noise-only frames in the noisy speech are detected, and the average power spectrum over those frames is used as the noise power spectrum of the frames that contain both speech and noise. Because the noise varies over time, it actually differs from frame to frame, so using the average noise power spectrum of the noise-only frames for all frames causes large estimation errors.
To address this, steps b) and d) of the invention use a method based on subspace decomposition of the noisy signal to estimate the dimension of the noise subspace and the noise power spectrum. The noise power spectrum is estimated on every frame, which greatly reduces the noise estimation error.
Under the assumption that the target speech signal and the noise are uncorrelated, the power spectrum matrix $\Phi_{XX}(k,t)$ of the noisy speech signal at time frame t and frequency bin k can be expressed as the sum of the target speech power spectrum matrix $\Phi_{SS}(k,t)$ and the noise power spectrum matrix $\Phi_{NN}(k,t)$:
$$\Phi_{XX}(k,t) = \Phi_{SS}(k,t) + \Phi_{NN}(k,t) \tag{6}$$
For microphone array signals it can be assumed that the auto-power spectrum of the noise is equal at every array element and that the noise is uncorrelated between array elements, so that:
$$\Phi_{NN}(k,t) = \phi_{NN}(k,t)\, I \tag{7}$$
where I is the identity matrix of order L and $\phi_{NN}(k,t)$ is the auto-power spectrum of the single-channel noise.
Let the eigendecomposition of the target speech power spectrum matrix be:
$$\Phi_{SS}(k,t) = U \Lambda_{SS} U^H \tag{8}$$
where $\Lambda_{SS}$ is the eigenvalue matrix with eigenvalues in descending order, U is the corresponding eigenvector matrix, and Q is the rank of the matrix, with Q≤L.
Eigendecomposition divides the noisy signal space into two subspaces: the signal subspace (containing the target speech signal and noise) and the noise subspace (containing only noise). The eigendecomposition of the noisy signal power spectrum matrix is:
$$\Phi_{XX}(k,t) = U \Lambda_{XX} U^H = U\left(\Lambda_{SS} + \phi_{NN}(k,t)\, I\right) U^H \tag{9}$$
where $\Lambda_{XX}$ is the eigenvalue matrix of the noisy speech power spectrum with eigenvalues in descending order and I is the identity matrix of order L.
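The structure described by (6) to (9) can be illustrated with a small synthetic case. The sketch below is a NumPy illustration under stated assumptions, not the patent's implementation: the matrix `Phi_SS` is a made-up rank-Q matrix, `phi_NN` is an arbitrary noise level, and the names are chosen here. It shows that the smallest L-Q eigenvalues of the noisy matrix all equal the noise power, which is what makes the noise subspace usable for noise estimation.

```python
import numpy as np

rng = np.random.default_rng(1)
L, Q = 6, 2          # microphones and true signal-subspace dimension (illustrative)
phi_NN = 0.5         # noise power, equal on every channel as assumed in Eq. (7)

# Rank-Q target speech matrix Phi_SS built from Q hypothetical random vectors.
V = rng.standard_normal((L, Q)) + 1j * rng.standard_normal((L, Q))
Phi_SS = V @ V.conj().T
Phi_XX = Phi_SS + phi_NN * np.eye(L)          # Eqs. (6) and (7)

# eigh returns ascending eigenvalues for Hermitian input; flip to descending order.
eigvals, U = np.linalg.eigh(Phi_XX)
eigvals, U = eigvals[::-1], U[:, ::-1]

noise_eigs = eigvals[Q:]                      # eigenvalues of the noise subspace
signal_eigs = eigvals[:Q] - phi_NN            # nonzero diagonal of Lambda_SS in Eq. (9)
print(np.round(eigvals, 3))
```

With measured data the trailing eigenvalues are only approximately equal, which is why the method averages them rather than reading off a single value.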
The invention proposes a method for estimating the noise auto-power spectrum $\phi_{NN}$ from the noise subspace. The dimension Q of the signal subspace and the dimension P of the noise subspace must first be determined.
Step b) provides a method for determining Q by maximizing the presence probability of the target speech signal in the noisy speech signal, that is, by choosing the value of Q that maximizes the probability that the target speech signal is present.
The computation uses conditional probability. Define the mutually exclusive events $H_0$ and $H_1$:
Event $H_0$: the noisy speech signal contains only noise and no target speech signal;
Event $H_1$: the noisy speech signal contains both the target speech signal and noise;
The signal subspace dimension Q is defined as:
$$\arg\max_{Q} P\left(S(k,t) \mid H_1\right) \tag{10}$$
where $S(k,t)$ is the power spectrum of the target speech signal at frequency bin k of frame t, $P(\cdot)$ is the distribution function of the target speech spectrum, and $\arg\max(\cdot)$ is the operator that returns the parameter value with the maximum score.
Step c) provides an adaptive method for selecting the noise power spectrum distribution model in the noisy speech signal based on the stationarity of the spectrum. The method comprises the following steps:
First, define the discriminant function Ω:
$$\Omega = \frac{\left(\prod_{i=Q+1}^{L} \lambda_X^i\right)^{\frac{1}{L-Q}}}{\frac{1}{L-Q}\sum_{i=Q+1}^{L} \lambda_X^i} \tag{11}$$
That is, Ω is the ratio of the geometric mean $\left(\prod_{i=Q+1}^{L} \lambda_X^i\right)^{1/(L-Q)}$ to the arithmetic mean $\frac{1}{L-Q}\sum_{i=Q+1}^{L} \lambda_X^i$, where $\lambda_X^i$ is the i-th eigenvalue of the noisy speech power spectrum eigenvalue matrix $\Lambda_{XX}$, $i \in \{Q+1,\ldots,L\}$ is the eigenvalue index, and Ω takes values between 0 and 1.
Then, determine two preset thresholds $\Omega_1$ and $\Omega_2$ ($\Omega_1 < \Omega_2$) and compare the discriminant function with them. Specifically, if the discriminant function is less than $\Omega_1$, the zero-mean Gaussian distribution is selected; if it is greater than $\Omega_2$, the gamma distribution is selected; otherwise the Laplacian distribution is selected.
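The discriminant function (11) and the threshold rule can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names are invented here, and the threshold values `OMEGA_1` and `OMEGA_2` are hypothetical placeholders, since the text gives no numeric thresholds. The geometric mean is computed through logarithms to avoid overflow or underflow of the product.

```python
import math

def stationarity_discriminant(eigenvalues, Q):
    """Eq. (11): geometric mean over arithmetic mean of the noise-subspace eigenvalues."""
    noise = eigenvalues[Q:]                     # lambda_X^i for i = Q+1..L (descending input)
    if not noise or min(noise) <= 0:
        raise ValueError("need at least one positive noise-subspace eigenvalue")
    geometric = math.exp(sum(math.log(v) for v in noise) / len(noise))
    arithmetic = sum(noise) / len(noise)
    return geometric / arithmetic

def select_noise_model(omega, omega_1, omega_2):
    """Threshold rule of step c); omega_1 < omega_2 are preset thresholds."""
    if omega < omega_1:
        return "gaussian"
    if omega > omega_2:
        return "gamma"
    return "laplacian"

# Hypothetical thresholds; the source does not give numeric values.
OMEGA_1, OMEGA_2 = 0.6, 0.9
flat = stationarity_discriminant([9.0, 4.0, 1.0, 1.0, 1.0], Q=2)
uneven = stationarity_discriminant([9.0, 4.0, 2.0, 0.5, 0.01], Q=2)
print(flat, select_noise_model(flat, OMEGA_1, OMEGA_2))
print(uneven, select_noise_model(uneven, OMEGA_1, OMEGA_2))
```

Equal noise eigenvalues give Ω = 1, and the more the eigenvalues differ, the closer Ω gets to 0, so Ω behaves like a flatness measure of the noise subspace.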
Step d) provides a method for estimating the noise power spectrum using conditional probability. For each frame of the noisy speech signal, the probability that it contains only noise is $P(H_0 \mid X)$, and the probability that it contains both noise and the target speech signal is $P(H_1 \mid X)$. For these two cases the noise power spectrum is estimated as follows:
$$H_0:\ \phi_{NN}^{0} = \frac{1}{L}\sum_{i=1}^{L} \lambda_X^i \qquad H_1:\ \phi_{NN}^{1} = \frac{1}{L-Q}\sum_{i=Q+1}^{L} \lambda_X^i \tag{12}$$
where $i \in \{1,\ldots,L\}$ is the eigenvalue index, and $\phi_{NN}^{0}$ and $\phi_{NN}^{1}$ are the noise power spectra when the mutually exclusive events $H_0$ and $H_1$ occur, respectively.
By the conditional probability formula, the noise power spectrum is estimated as:
$$\tilde{\phi}_{NN} = P(H_0 \mid X)\,\phi_{NN}^{0} + P(H_1 \mid X)\,\phi_{NN}^{1} \tag{13}$$
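Equations (12) and (13) amount to a probability-weighted average of two eigenvalue means. The sketch below is a minimal illustration with invented names; it assumes that $H_0$ and $H_1$ are exhaustive as well as mutually exclusive, so that $P(H_0 \mid X) = 1 - P(H_1 \mid X)$, and it takes the speech presence probability as a given input because this passage does not say how it is obtained.

```python
def estimate_noise_power(eigenvalues, Q, p_h1):
    """Eqs. (12)-(13): probability-weighted noise power from descending eigenvalues.

    p_h1 is P(H1 | X); P(H0 | X) is taken as 1 - p_h1 (assumption: H0 and H1
    are exhaustive as well as mutually exclusive).
    """
    L = len(eigenvalues)
    if not 0 <= Q < L:
        raise ValueError("Q must satisfy 0 <= Q < L so the noise subspace is non-empty")
    phi_h0 = sum(eigenvalues) / L                 # H0: all L eigenvalues are noise
    phi_h1 = sum(eigenvalues[Q:]) / (L - Q)       # H1: only the last L-Q eigenvalues
    return (1.0 - p_h1) * phi_h0 + p_h1 * phi_h1

eigs = [9.0, 4.0, 1.2, 1.0, 0.8]
print(estimate_noise_power(eigs, Q=2, p_h1=0.0))
print(estimate_noise_power(eigs, Q=2, p_h1=1.0))
print(estimate_noise_power(eigs, Q=2, p_h1=0.7))
```

Because the estimate is recomputed from the eigenvalues of every frame, it follows changes in the noise instead of reusing an average taken over earlier noise-only frames.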
Step e) provides a method that, from the signal subspace dimension and the noise power spectrum estimate, uses the auditory masking effect to estimate the auditory masking threshold at each frequency bin based on the signal subspace.
The auditory frequency range of 0 to 15500 Hz covers 24 critical sub-bands, and the auditory masking threshold must be computed in each sub-band. First, compute the energy of each frequency bin in each sub-band. Next, compute the spreading coefficients of the basilar membrane of the human ear for sound in each band. Then multiply the energy of each frequency bin in each sub-band by the spreading coefficients to obtain the excitation energy on the basilar membrane. Finally, compute the masking threshold from the functional relationship between the excitation energy on the basilar membrane and the auditory masking threshold.
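The four stages just described can be sketched as follows. The text does not specify the band mapping, the spreading coefficients, or the excitation-to-threshold relation, so this sketch swaps in common psychoacoustic choices as assumptions: the Zwicker-Terhardt Bark approximation for the critical bands, the Schroeder spreading function for the basilar membrane, and a fixed offset `offset_db` (a placeholder value) for the threshold relation. The function names are invented here.

```python
import math

def hz_to_bark(f_hz):
    """Zwicker-Terhardt approximation of the critical-band (Bark) scale (assumed mapping)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def spreading_gain(delta_bark):
    """Schroeder spreading function (assumed), converted from dB to a linear power gain."""
    x = delta_bark + 0.474
    level_db = 15.81 + 7.5 * x - 17.5 * math.sqrt(1.0 + x * x)
    return 10.0 ** (level_db / 10.0)

def masking_thresholds(power_spectrum, sample_rate, n_bands=24, offset_db=5.5):
    """Per-band masking threshold from a one-sided power spectrum (bins 0..N/2)."""
    n_bins = len(power_spectrum)
    band_energy = [0.0] * n_bands
    for k, p in enumerate(power_spectrum):
        f = k * (sample_rate / 2.0) / (n_bins - 1)
        band = min(int(hz_to_bark(f)), n_bands - 1)
        band_energy[band] += p                       # stage 1: energy per critical band
    thresholds = []
    for i in range(n_bands):
        # stages 2-3: excitation = band energies weighted by the spreading coefficients
        excitation = sum(spreading_gain(i - j) * band_energy[j] for j in range(n_bands))
        # stage 4: assumed relation, threshold = excitation lowered by offset_db
        thresholds.append(excitation * 10.0 ** (-offset_db / 10.0))
    return thresholds

# Single spectral line at 968.75 Hz, sampled so that the Nyquist frequency is 15500 Hz.
spectrum = [0.0] * 257
spectrum[16] = 1.0
thr = masking_thresholds(spectrum, sample_rate=31000)
print(thr.index(max(thr)), max(thr))
```

With this mapping, 15500 Hz falls at about 24 Bark, which matches the 24 critical sub-bands mentioned above; the threshold peaks in the band of the masker and decays on both sides.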
Step f) provides a method for estimating the postfilter $G(e^{j\omega})$ from the noise power spectrum and the auditory masking threshold in combination with a Lagrange multiplier, so that the residual noise in the enhanced speech is below the auditory masking threshold of the human ear, which removes the audible effect of the residual noise while keeping the distortion of the target speech signal as small as possible. This completes the microphone array post-filtering speech enhancement.
Suppose the output signal of the MVDR beamformer is $\tilde{S}(e^{j\omega})$ and the target speech signal is $S(e^{j\omega})$. The error between the post-filtered enhanced speech signal and the target speech signal can be expressed as:
$$E(e^{j\omega}) = G(e^{j\omega})\,\tilde{S}(e^{j\omega}) - S(e^{j\omega}) = \left[G(e^{j\omega}) - 1\right] S(e^{j\omega}) + G(e^{j\omega})\,\tilde{N}(e^{j\omega}) \tag{14}$$
where $\tilde{N}(e^{j\omega})$ is the noise in $\tilde{S}(e^{j\omega})$.
The first term in (14) describes the distortion of the target speech signal in the enhanced speech, and the second term describes the amount of residual noise in the enhanced speech. A suitable postfilter $G(e^{j\omega})$ can be computed so that the residual noise in the enhanced speech is below the auditory masking threshold of the human ear and its effect is removed. Based on (14), the invention proposes the following objective and constraint:
$$\min\ E_T = \left[G(e^{j\omega}) - 1\right]^2 S(e^{j\omega})^2 + G(e^{j\omega})^2\,\tilde{N}(e^{j\omega})^2 \tag{15}$$
Constraint:
$$G(e^{j\omega})^2\,\tilde{N}(e^{j\omega})^2 \le C_{thr} \tag{16}$$
where $C_{thr}$ is the auditory masking threshold.
To solve this with the method of Lagrange multipliers, let:
$$J = E_T + \mu\left(G(e^{j\omega})^2\,\tilde{N}(e^{j\omega})^2 - C_{thr}\right) \tag{17}$$
where μ is the Lagrange multiplier.
Setting the derivative of J with respect to $G(e^{j\omega})$ to zero gives:
$$G(e^{j\omega}) = \frac{S(e^{j\omega})^2}{S(e^{j\omega})^2 + (1+\mu)\,\tilde{N}(e^{j\omega})^2} \tag{18}$$
Equation (18) shows that, under the constraint proposed by the invention, the postfilter based on auditory perception characteristics takes the form of a Wiener filter in which the noise term is estimated more reasonably.
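The step from (17) to (18) can be written out explicitly. The following worked derivation is added for illustration; it abbreviates $G(e^{j\omega})$, $S(e^{j\omega})$ and $\tilde{N}(e^{j\omega})$ as $G$, $S$ and $\tilde{N}$, and treats $G$ as a real-valued gain, which is an assumption the text does not state.

```latex
\begin{aligned}
\frac{\partial J}{\partial G}
  &= 2\,[G-1]\,S^2 + 2\,G\,\tilde{N}^2 + 2\mu\,G\,\tilde{N}^2 = 0 \\
\Rightarrow\quad G\left(S^2 + (1+\mu)\,\tilde{N}^2\right) &= S^2 \\
\Rightarrow\quad G &= \frac{S^2}{S^2 + (1+\mu)\,\tilde{N}^2}
\end{aligned}
```

For $\mu = 0$ the result reduces to the classical Wiener gain $S^2/(S^2+\tilde{N}^2)$, so the multiplier acts as an extra weight on the noise term.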
Setting the derivative of J with respect to μ to zero gives:
$$G(e^{j\omega}) = \sqrt{\frac{C_{thr}}{\tilde{N}(e^{j\omega})^2}} \tag{19}$$
Equating (18) and (19) gives:
$$1 + \mu = \frac{S(e^{j\omega})^2}{\tilde{N}(e^{j\omega})^2}\,\max\left(\sqrt{\frac{\tilde{N}(e^{j\omega})^2}{C_{thr}}} - 1,\ 0\right) \tag{20}$$
Substituting (20) into (18), and replacing $\tilde{N}(e^{j\omega})^2$ with $\tilde{\phi}_{NN}$ from (13), yields the proposed postfilter based on auditory perception characteristics:
$$G(e^{j\omega}) = \frac{1}{1 + \max\left(\sqrt{\dfrac{\tilde{\phi}_{NN}}{C_{thr}}} - 1,\ 0\right)} \tag{21}$$
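The gain in (21) is a one-line computation per frequency bin. The sketch below is illustrative, with an invented function name, and rests on one stated assumption: it reads the ratio inside max(·) under a square root, which is the form implied when the constraint (16) is active, that is, when the residual noise power $G^2\tilde{\phi}_{NN}$ equals $C_{thr}$.

```python
import math

def perceptual_postfilter_gain(noise_power, masking_threshold):
    """Gain of Eq. (21) for one frequency bin (square-root reading, see lead-in).

    If the estimated noise is already below the masking threshold, the gain is 1
    (no attenuation, hence no added speech distortion). Otherwise the gain
    attenuates just enough that G^2 * noise_power equals the threshold.
    """
    if noise_power < 0 or masking_threshold <= 0:
        raise ValueError("noise_power must be >= 0 and masking_threshold > 0")
    excess = max(math.sqrt(noise_power / masking_threshold) - 1.0, 0.0)
    return 1.0 / (1.0 + excess)

for phi, c_thr in [(0.5, 1.0), (4.0, 1.0)]:
    gain = perceptual_postfilter_gain(phi, c_thr)
    print(phi, c_thr, gain, gain * gain * phi)   # last column: residual noise power
```

The design choice this reflects is that attenuation is applied only where noise would otherwise be audible, which limits speech distortion in bins where the speech already masks the noise.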
Fig. 1 shows a flowchart of an application of the microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics. The system comprises a microphone array of at least two microphones 101.
The microphones of the array may be arranged in different ways. In particular, the microphones 101 may be placed in a row, with a preset distance between each microphone and its neighbor; for example, the distance between two microphones may be about 5 centimeters. The microphone array may be positioned as appropriate for the application environment and technical requirements.
The speech signals collected by the microphones 101 are sent to the signal processing unit 102. Before reaching the signal processing unit, the speech signals may be preprocessed by a low-pass filter.
The signal processing unit 102 applies delay compensation to the speech signals collected by the different microphones to align them in the time domain. The aligned signal of each microphone is converted by the short-time discrete Fourier transform into a complex-valued frequency-domain representation. The power spectrum matrix $\Phi_{XX}(k,t)$ of the multichannel noisy speech signal collected by the microphone array is computed at time frame t and frequency bin k, and is eigendecomposed to obtain the eigenvalue matrix $\Lambda_{XX}$ and the eigenvector matrix U.
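The processing in unit 102 can be sketched with NumPy as below. This is an illustration under assumptions that the text does not specify: the delays are known integer numbers of samples, alignment uses a circular shift for simplicity, frames use a Hann window, and the power spectrum matrix is obtained by recursive averaging with a hypothetical smoothing factor `alpha`. The function name and parameters are invented here.

```python
import numpy as np

def spatial_power_spectra(signals, delays, frame_len=256, hop=128, alpha=0.8):
    """Return Phi_XX with shape (frames, bins, L, L) for integer-sample delays."""
    signals = np.asarray(signals, dtype=float)
    L = signals.shape[0]
    # Delay compensation: advance each channel by its delay (circular shift, a simplification).
    aligned = np.stack([np.roll(x, -int(tau)) for x, tau in zip(signals, delays)])
    window = np.hanning(frame_len)
    n_frames = 1 + (aligned.shape[1] - frame_len) // hop
    n_bins = frame_len // 2 + 1
    Phi = np.zeros((n_frames, n_bins, L, L), dtype=complex)
    smoothed = np.zeros((n_bins, L, L), dtype=complex)
    for t in range(n_frames):
        frame = aligned[:, t * hop : t * hop + frame_len] * window
        X = np.fft.rfft(frame, axis=1).T                 # (bins, L) complex spectra
        inst = X[:, :, None] * X[:, None, :].conj()      # per-bin outer product X X^H
        # Recursive averaging over frames (assumed estimator of Phi_XX(k, t)).
        smoothed = inst if t == 0 else alpha * smoothed + (1 - alpha) * inst
        Phi[t] = smoothed
    return Phi

# Synthetic test: one source reaching three microphones with delays of 0, 3 and 5 samples.
rng = np.random.default_rng(2)
s = rng.standard_normal(2048)
mics = np.stack([np.roll(s, d) for d in (0, 3, 5)]) + 0.1 * rng.standard_normal((3, 2048))
Phi = spatial_power_spectra(mics, delays=(0, 3, 5))
eigvals = np.linalg.eigh(Phi)[0][..., ::-1]   # descending eigenvalues per frame and bin
print(Phi.shape, eigvals.shape)
```

After alignment the source is coherent across channels, so most of the energy in each bin collects in the largest eigenvalue, while the remaining eigenvalues stay near the noise level.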
In the following step 103, the eigenvalue matrix $\Lambda_{XX}$ is used to determine the dimension Q of the signal subspace by maximizing the presence probability of the target speech signal in the noisy speech signal.
Then, step 104 uses the signal subspace dimension Q to select the noise power spectrum distribution model in the noisy speech signal adaptively, based on the stationarity of the spectrum.
Step 105 uses the signal subspace dimension Q and the noise power spectrum distribution model to estimate the noise power spectrum by conditional probability.
Step 106 uses the signal subspace dimension and the noise power spectrum estimate, together with the auditory masking effect, to estimate the auditory masking threshold at each frequency bin based on the signal subspace.
Finally, step 107 uses the noise power spectrum estimate and the auditory masking threshold to design the postfilter in combination with a Lagrange multiplier.
At Fig. 2, the flow process of the method for a definite signal subspace dimension has been described, this method is corresponding to the step 103 among Fig. 1.
Through step 101 and step 102, the voice signal that microphone array is gathered has passed through time domain alignment, Short Time Fourier Transform.And to the power spectrum Φ of multichannel Noisy Speech Signal XXCarry out characteristic value and decompose, obtain eigenvalue matrix Λ XXWith eigenvectors matrix U.By (9) formula, signals with noise power spectrum characteristic value matrix be broken down into power spectrum signal characteristic value and noise power spectrum characteristic value and, Q is the dimension of signal subspace.
In first step 201, the dimension Q of initializing signal subspace, making it is 1.
Next, step 202 is upgraded noise power spectrum and target voice signal power spectrum.Because Noisy Speech Signal power spectrum characteristic value matrix Λ XXBe descending, and the hypothesis signal strength signal intensity is greater than noise, so when the dimension of signal subspace was Q, the power of noise was
$$\phi_{NN} = \frac{1}{L-Q}\sum_{i=Q+1}^{L}\lambda_X^i \qquad (22)$$
where $i \in \{Q+1, \ldots, L\}$ is the eigenvalue index.
The power of the target speech signal is
$$S = \frac{1}{Q}\sum_{i=1}^{Q}\left(\lambda_X^i - \phi_{NN}\right)^{\frac{1}{2}} \qquad (23)$$
where $i \in \{1, \ldots, Q\}$ is the eigenvalue index.
The variance of the target speech signal is then
$$v_s = \begin{cases}\lambda_X^1 - \phi_{NN}, & Q = 1 \\ \dfrac{1}{Q}\sum_{i=1}^{Q}\left[\left(\lambda_X^i - \phi_{NN}\right)^{\frac{1}{2}} - S\right]^2, & Q > 1\end{cases} \qquad (24)$$
where $i \in \{1, \ldots, Q\}$ is the eigenvalue index.
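As a minimal sketch, the quantities in (22) to (24) can be computed from a list of eigenvalues sorted in descending order. The function name `subspace_stats` is hypothetical, and excluding Q = L (where no noise eigenvalue remains and (22) would divide by zero) is an assumption of this sketch, not something the text states.

```python
import math


def subspace_stats(eigvals, q):
    """Return (phi_nn, s, v_s) for signal-subspace dimension q, per (22)-(24).

    eigvals: eigenvalues of the noisy power spectral matrix, sorted descending.
    Requires 1 <= q < len(eigvals); q == L leaves no noise eigenvalues
    (assumption: that case is excluded here).
    """
    L = len(eigvals)
    if not 1 <= q < L:
        raise ValueError("q must satisfy 1 <= q < L")
    phi_nn = sum(eigvals[q:]) / (L - q)                        # (22)
    # max(..., 0.0) only guards against rounding below zero.
    amps = [math.sqrt(max(lam - phi_nn, 0.0)) for lam in eigvals[:q]]
    s = sum(amps) / q                                          # (23)
    if q == 1:
        v_s = eigvals[0] - phi_nn                              # (24), Q = 1
    else:
        v_s = sum((a - s) ** 2 for a in amps) / q              # (24), Q > 1
    return phi_nn, s, v_s
```

Because the eigenvalues are in descending order, every eigenvalue in the signal subspace is at least as large as the mean of the remaining ones, so the square roots are real.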
Step 203 selects any one of the Gaussian, Laplacian, and gamma models to describe the spectral distribution of the target speech signal, and calculates the conditional probability $P_G(S(k,t) \mid H_1)$ of the target speech signal. In particular, when the Gaussian model is selected,
$$P_G(S(k,t) \mid H_1) = \frac{1}{\sqrt{2\pi v_s(k,t)}}\exp\left\{-\frac{S^2(k,t)}{2 v_s(k,t)}\right\}$$
Step 204 increments the variables Q and j:
Q=Q+1
Step 205 then tests the loop termination condition Q > L. If the condition is not satisfied, the flow returns to step 202; otherwise it proceeds to step 206.
Step 206 uses formula (10) of the present invention to finally determine the dimension Q of the signal subspace, namely
$$\arg\max_{Q} P(S(k,t) \mid H_1).$$
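The loop of steps 201 to 206 can be sketched as follows for the Gaussian model. This is an illustration under stated assumptions rather than the patented implementation: the search is limited to Q = 1, ..., L-1 so that at least one noise eigenvalue remains, a small `eps` guards against a zero variance, and the names `gaussian_likelihood` and `select_dimension` are hypothetical.

```python
import math


def gaussian_likelihood(s, v_s):
    """Gaussian model P_G(S | H1) with zero mean and variance v_s."""
    return math.exp(-s * s / (2.0 * v_s)) / math.sqrt(2.0 * math.pi * v_s)


def select_dimension(eigvals, eps=1e-12):
    """Return the Q in 1..L-1 that maximizes P_G(S | H1) (steps 201-206)."""
    L = len(eigvals)
    if L < 2:
        raise ValueError("need at least two eigenvalues")
    best_q, best_p = 1, -1.0
    for q in range(1, L):  # Q = L is skipped: no noise eigenvalues would be left
        phi_nn = sum(eigvals[q:]) / (L - q)                     # noise power
        amps = [math.sqrt(max(lam - phi_nn, 0.0)) for lam in eigvals[:q]]
        s = sum(amps) / q                                       # signal power
        if q == 1:
            v_s = eigvals[0] - phi_nn
        else:
            v_s = sum((a - s) ** 2 for a in amps) / q
        p = gaussian_likelihood(s, max(v_s, eps))  # eps guards a zero variance
        if p > best_p:
            best_q, best_p = q, p
    return best_q
```

The same loop would apply to the Laplacian or gamma model by swapping the likelihood function; only the Gaussian form is given explicitly in the text.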
Fig. 3 describes a flow chart for determining the noise power spectrum distribution model in the noisy speech signal. This method corresponds to step 104 in Fig. 1.
The Gaussian, Laplacian, and gamma models can all be used to describe the spectral coefficients of speech and noise signals, but noise characteristics differ between noise types, so the model should be selected according to the characteristics of the target noise. In this example, a method of model selection based on the stationarity of the spectrum is given, derived from statistics of computer fan noise.
In step 301, the discriminant value Ω is calculated by formula (11).
Step 302 tests whether the discriminant value Ω is less than $\Omega_1$. If so, the Gaussian model is selected; otherwise step 303 tests whether Ω is less than $\Omega_2$. If so, the Laplacian model is selected; otherwise the gamma model is selected.
The adaptive model selection algorithm embodied in the present invention is based on statistics from a large number of experiments on computer fan noise. The experiments found that the Gaussian model is the best model when Ω takes smaller values, that the Laplacian model is best when Ω is larger, and that the gamma model has the smallest total average noise estimation error. Accordingly, the present invention performs model selection as follows:
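$$\text{model} = \begin{cases}\text{Gaussian}, & \Omega < \Omega_1 \\ \text{Laplacian}, & \Omega_1 \le \Omega < \Omega_2 \\ \text{gamma}, & \Omega \ge \Omega_2\end{cases} \qquad (25)$$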
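The selection in steps 301 to 303 can be sketched as below. The sketch assumes that formula (11) is the ratio of the geometric mean to the arithmetic mean of the noise-subspace eigenvalues, as stated in claim 4; the function names are hypothetical, and the thresholds are left as parameters because the text does not give their values.

```python
import math


def flatness(eigvals, q):
    """Discriminant Omega: geometric mean over arithmetic mean of the
    L - q smallest eigenvalues (the noise-subspace eigenvalues)."""
    noise = eigvals[q:]
    if not noise or min(noise) <= 0.0:
        raise ValueError("need at least one positive noise eigenvalue")
    # Geometric mean computed in the log domain to avoid overflow of the product.
    geo = math.exp(sum(math.log(x) for x in noise) / len(noise))
    return geo / (sum(noise) / len(noise))


def select_noise_model(omega, omega1, omega2):
    """Steps 302-303: Gaussian below omega1, Laplacian below omega2, else gamma."""
    if omega < omega1:
        return "gaussian"
    if omega < omega2:
        return "laplacian"
    return "gamma"
```

A perfectly flat set of noise eigenvalues gives Ω = 1, and Ω decreases toward 0 as the eigenvalues become more unequal.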
Fig. 4 describes a flow chart of a method that uses conditional probabilities to estimate the noise power spectrum. This method corresponds to step 105 in Fig. 1.
Step 401 calculates the average power spectrum $\phi_{NN}^{pre}$ of the pure-noise frames in the initial segment of the noisy speech signal.
Step 402 calculates the power spectrum of the current frame
$$\phi_{NN}^{cur} = \frac{1}{L}\sum_{i=1}^{L}\lambda_X^i$$
where $i \in \{1, \ldots, L\}$ is the eigenvalue index.
Next, step 403 calculates the ratio of the current-frame power spectrum to the pure-noise power spectrum
$$r = \frac{\phi_{NN}^{cur}}{\phi_{NN}^{pre}}$$
Steps 403 to 408 together compute the conditional probability $P(H_0 \mid X)$. First, r is compared with a preset threshold α, a small value slightly greater than 1; in particular, α is taken as 1.2. When r < α, the current frame is more likely to be a pure-noise frame, so $P(H_0 \mid X)$ should take a larger value; the present invention sets its lower bound to 0.8. When r > α, the current frame is more likely to be a speech frame, and $P(H_0 \mid X)$ should take a suitable value. Because the signal energy is distributed unevenly across frequencies, different values of $P(H_0 \mid X)$ are used for different frequencies. At low frequencies the value of $P(H_0 \mid X)$ should be greater than at high frequencies, because most of the signal energy is concentrated in the low-frequency region. That is,
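$$P(H_0 \mid X) = \begin{cases}\max\left(\dfrac{1}{1 + r\beta_1},\ 0.8\right), & r \le 1.2 \\ \dfrac{1}{1 + r\beta_2}, & r > 1.2,\ f \le f_{thr} \\ \dfrac{1}{1 + r\beta_3}, & r > 1.2,\ f > f_{thr}\end{cases} \qquad (26)$$
where $f_{thr}$ is the threshold frequency separating low and high frequencies, and $\beta_1$, $\beta_2$, and $\beta_3$ are weighting coefficients.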
Step 409 calculates the conditional probability $P(H_1 \mid X) = 1 - P(H_0 \mid X)$.
Having obtained the conditional probabilities $P(H_0 \mid X)$ and $P(H_1 \mid X)$, step 410 uses formula (13) to obtain the noise power spectrum estimate $\tilde{\phi}_{NN}$.
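Step 410 can be sketched as a weighted combination of two hypothesis-conditional noise estimates. The sketch assumes that formula (13) is the combination stated in claim 6, takes $P(H_0 \mid X)$ as an input rather than computing it from (26), and uses the hypothetical name `noise_psd_estimate`.

```python
def noise_psd_estimate(eigvals, q, p_h0):
    """Combine the two hypothesis-conditional noise estimates.

    eigvals: eigenvalues sorted in descending order
    q:       signal-subspace dimension, 1 <= q < L
    p_h0:    P(H0 | X), probability that the frame contains noise only
    """
    L = len(eigvals)
    if not 1 <= q < L:
        raise ValueError("q must satisfy 1 <= q < L")
    if not 0.0 <= p_h0 <= 1.0:
        raise ValueError("p_h0 must be a probability")
    phi0 = sum(eigvals) / L              # H0: every eigenvalue is noise
    phi1 = sum(eigvals[q:]) / (L - q)    # H1: only the L - q smallest are noise
    return p_h0 * phi0 + (1.0 - p_h0) * phi1
```

When the frame is judged to be almost certainly noise, the estimate approaches the mean of all eigenvalues; when speech is almost certainly present, it approaches the mean of the noise-subspace eigenvalues only.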
Fig. 5 describes a flow chart of a method for calculating the auditory masking threshold. This method corresponds to step 106 in Fig. 1. To mask the noise in the signal and thereby enhance the target speech signal, the noise must be kept below this threshold.
Step 501 divides the range of human hearing, 0 to 15500 Hz, into 24 sub-bands so that the auditory masking threshold can be calculated within each sub-band.
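One common way to obtain 24 sub-bands covering 0 to 15500 Hz is the classical critical-band (Bark) table. The sketch below assumes that table; the patent text gives only the range and the number of bands, and the name `band_index` is hypothetical.

```python
import bisect

# Upper edges (Hz) of the 24 critical bands in the classical Zwicker table;
# using this particular table is an assumption, the patent only gives the
# range and the number of bands.
BAND_UPPER_EDGES_HZ = [
    100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
    2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500,
]


def band_index(freq_hz):
    """Return the 1-based sub-band index j for a frequency in [0, 15500) Hz."""
    if not 0.0 <= freq_hz < BAND_UPPER_EDGES_HZ[-1]:
        raise ValueError("frequency outside the 0-15500 Hz range")
    return bisect.bisect_right(BAND_UPPER_EDGES_HZ, freq_hz) + 1
```

Each frequency bin of the short-time spectrum can be assigned to a sub-band j this way before the per-band quantities of the following steps are computed.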
In step 502, the signal-subspace dimension obtained in step 206 is used to calculate the energy at each frequency. H(j, b) denotes the energy at the b-th frequency in the j-th sub-band and can be calculated from the eigenvalues and eigenvectors of the signal subspace:
$$H(j,b) = \operatorname{mean}\left(\frac{1}{L}\sum_{i=1}^{Q}\lambda_S^i\left|U_{1,i}\right|^2\right) \qquad (27)$$
where $\lambda_S^i = \lambda_X^i - \tilde{\phi}_{NN}$ is the estimated eigenvalue of the power spectral matrix of the target speech signal, $U_{1,i}$ is the i-th basis of the signal subspace, $i \in \{1, \ldots, Q\}$ is the eigenvalue index, and mean(·) is the averaging operator.
SF(j) is the function that expresses the spreading characteristic of the basilar membrane of the human ear in the j-th sub-band, $j \in \{1, \ldots, 24\}$.
In step 503, the spreading function of each sub-band is calculated:
$$SF(j) = 15.81 + 7.5\,(j + 0.474) - 17.5\sqrt{1 + (j + 0.474)^2}, \quad j \in \{1, \ldots, 24\} \qquad (28)$$
Next, step 504 calculates the excitation energy on the basilar membrane of the human ear:
$$C(j,b) = SF(j) \cdot H(j,b), \quad j \in \{1, \ldots, 24\} \qquad (29)$$
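Equations (28) and (29) can be sketched directly. The sketch assumes the usual square-root form of the spreading function and reads (29) as a plain product, as claim 8 describes; the function names are hypothetical.

```python
import math


def spreading(j):
    """Spreading function SF(j) of equation (28) for sub-band j in 1..24."""
    x = j + 0.474
    return 15.81 + 7.5 * x - 17.5 * math.sqrt(1.0 + x * x)


def excitation(h_band, j):
    """Equation (29): C(j, b) = SF(j) * H(j, b) for every bin b of sub-band j."""
    sf = spreading(j)
    return [sf * h for h in h_band]
```

Here `h_band` is the list of energies H(j, b) over the frequency bins b of sub-band j.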
Step 505 calculates the auditory masking threshold
$$C_{thr} = 10^{\,\log_{10}\left|C(j,b)\right| - \left|\frac{O(j)}{10}\right| - \left|\frac{\tilde{\phi}_{NN}}{10}\right|} \qquad (30)$$
where O(j) is the offset and $j \in \{1, \ldots, 24\}$ denotes the j-th sub-band.
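A minimal sketch of equation (30) follows. It reads the equation as a power of ten whose exponent is the logarithm of the excitation minus the two scaled terms; that reading of the flattened source equation is an assumption, as are the name `masking_threshold` and the choice to return 0 when there is no excitation.

```python
import math


def masking_threshold(c, offset, phi_nn):
    """Equation (30): C_thr = 10 ** (log10|C| - |O/10| - |phi_nn/10|).

    c:      excitation C(j, b)
    offset: O(j) for the sub-band
    phi_nn: noise power spectrum estimate
    """
    if c == 0.0:
        return 0.0  # log10(0) is undefined; assumption: no excitation, no masking
    exponent = math.log10(abs(c)) - abs(offset / 10.0) - abs(phi_nn / 10.0)
    return 10.0 ** exponent
```

The values of the offset O(j) are not given in the text, so they are left as an input.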
Fig. 6 describes a flow chart for designing the post-filter. This method corresponds to step 107 in Fig. 1.
The goal is to minimize the distortion of the target speech signal under the condition that the power of the residual noise in the enhanced speech stays below the auditory masking threshold.
Step 601 states the constrained optimization problem as follows:
Objective:
$$\min E_T = \left[G(e^{j\omega}) - 1\right]^2 S(e^{j\omega})^2 + G(e^{j\omega})^2\,\tilde{N}(e^{j\omega})^2$$
Constraint:
$$G(e^{j\omega})^2\,\tilde{N}(e^{j\omega})^2 \le C_{thr}$$
Step 602 solves the problem with the method of Lagrange multipliers. Let:
$$J = E_T + \mu\left(G(e^{j\omega})^2\,\tilde{N}(e^{j\omega})^2 - C_{thr}\right)$$
Differentiating J with respect to $G(e^{j\omega})$ and μ respectively and setting the derivatives to zero gives:
$$G(e^{j\omega}) = \frac{S(e^{j\omega})^2}{S(e^{j\omega})^2 + (1 + \mu)\,\tilde{N}(e^{j\omega})^2}, \qquad G(e^{j\omega}) = \frac{C_{thr}}{\tilde{N}(e^{j\omega})^2}$$
Step 603 solves these equations to obtain the optimal estimate of the post-filter, namely:
$$G(e^{j\omega}) = \frac{1}{1 + \max\left(\frac{\tilde{\phi}_{NN}}{C_{thr}} - 1,\ 0\right)}$$
The noise power spectrum estimate $\tilde{\phi}_{NN}$ obtained in step 410 and the auditory masking threshold $C_{thr}$ obtained in step 505 are then substituted in, and step 604 completes the design of the post-filter.
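The resulting post-filter gain for a single frequency bin can be sketched in a few lines. The function name `postfilter_gain` is hypothetical, and rejecting a non-positive threshold is an assumption of the sketch.

```python
def postfilter_gain(phi_nn, c_thr):
    """G = 1 / (1 + max(phi_nn / c_thr - 1, 0)) for one frequency bin.

    phi_nn: noise power spectrum estimate
    c_thr:  auditory masking threshold (must be positive)
    """
    if c_thr <= 0.0:
        raise ValueError("c_thr must be positive")
    return 1.0 / (1.0 + max(phi_nn / c_thr - 1.0, 0.0))
```

When the noise estimate is already below the masking threshold the gain is 1 and the bin passes unchanged; otherwise the gain equals the ratio of the threshold to the noise estimate.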
From this specification, further modifications and variations of the present invention will be apparent to those skilled in the art. This description is therefore to be regarded as illustrative, and its purpose is to teach those of ordinary skill in the art the general manner of carrying out the present invention. It should be understood that the forms of the invention shown and described in this specification are to be taken as the presently preferred embodiments.

Claims (8)

1. A microphone-array post-filtering speech enhancement method based on multiple models and auditory characteristics, characterized in that it comprises the following steps:
Step a: collecting multichannel noisy speech signals with a microphone array composed of L microphones; aligning the noisy speech signals of all channels in the time domain; expressing each aligned channel as a complex-valued frequency-domain signal using the short-time discrete Fourier transform; calculating the power spectral matrix of the multichannel microphone-array signal and performing eigenvalue decomposition on this matrix to obtain an eigenvalue matrix and an eigenvector matrix;
Step b: determining the dimension Q of the signal subspace, with Q ≤ L, by maximizing the probability that the target speech signal is present in the noisy speech signal;
Step c: based on the stationarity of the spectrum, adaptively selecting the noise power spectrum distribution model in the noisy speech signal;
Step d: estimating the noise power spectrum using conditional probabilities;
Step e: from the signal-subspace dimension and the noise power spectrum estimate, using the auditory masking effect, estimating the auditory masking threshold of each frequency based on the signal subspace;
Step f: from the noise power spectrum and the auditory masking threshold, estimating the post-filter in combination with a Lagrange multiplier, such that the residual noise in the enhanced speech is below the auditory masking threshold of the human ear, thereby eliminating the effect of the residual noise while keeping the distortion of the target speech signal as small as possible, and completing the microphone-array post-filtering speech enhancement, wherein:
the step of estimating the post-filter G in combination with a Lagrange multiplier is as follows:
Step fa: minimizing the distortion of the target speech signal under the constraint that the residual noise power is below the masking threshold, and setting up an optimization problem accordingly;
Step fb: solving the problem in combination with a Lagrange multiplier to obtain the optimal estimate of the post-filter;
Step fc: substituting the auditory masking threshold and the noise power spectrum estimate to complete the design of the post-filter.
2. The method of claim 1, characterized in that performing eigenvalue decomposition on the power spectral matrix comprises:
using eigenvalue decomposition to divide the noisy speech signal space into two subspaces, namely a signal subspace, which contains the target speech signal and noise, and a noise subspace, which contains only noise; the eigenvalue decomposition of the power spectral matrix $\Phi_{XX}(k,t)$ of the noisy speech signal X at time frame t and frequency k is:
$$\Phi_{XX}(k,t) = U\Lambda_{XX}U^H = U\left(\Lambda_{SS} + \phi_{NN}(k,t)\,I\right)U^H$$
where X = S + N, X is the noisy speech signal, S is the target speech signal, and N is the noise; $\Lambda_{XX}$ is the eigenvalue matrix of the noisy speech power spectrum with eigenvalues in descending order, $\Lambda_{SS}$ is the eigenvalue matrix of the target speech power spectrum with eigenvalues in descending order, U is the eigenvector matrix, $\phi_{NN}(k,t)$ is the noise power at time frame t and frequency k, I is the identity matrix of order L, and $(\cdot)^H$ is the conjugate transpose operator.
3. The method of claim 1, characterized in that determining the signal-subspace dimension means choosing the most suitable value of Q so that the probability that the target speech signal is present in the noisy speech is maximized; the calculation uses conditional probability, and the steps comprise:
defining mutually exclusive events $H_0$ and $H_1$:
Event $H_0$: the noisy speech signal contains only noise and no target speech signal;
Event $H_1$: the noisy speech signal contains both the target speech signal and noise;
the signal-subspace dimension Q is defined as:
$$\arg\max_{Q} P(S(k,t) \mid H_1)$$
where S(k, t) is the power spectrum of the target speech signal at the k-th frequency point of the t-th frame, P(·) is the distribution function of the target speech spectrum, and argmax(·) is the operator that finds the parameter value with the maximum score.
4. The method of claim 1, characterized in that adaptively selecting, based on the stationarity of the spectrum, the noise power spectrum distribution model in the noisy speech signal comprises the following steps:
Step c1: defining a discriminant function Ω that expresses the stationarity of the power spectrum:
$$\Omega = \frac{\sqrt[L-Q]{\prod_{i=Q+1}^{L}\lambda_X^i}}{\frac{1}{L-Q}\sum_{i=Q+1}^{L}\lambda_X^i}$$
that is, Ω is the ratio of the geometric mean $\sqrt[L-Q]{\prod_{i=Q+1}^{L}\lambda_X^i}$ to the arithmetic mean $\frac{1}{L-Q}\sum_{i=Q+1}^{L}\lambda_X^i$, where $\lambda_X^i$ is the i-th eigenvalue of the noisy speech power spectrum eigenvalue matrix $\Lambda_{XX}$, $i \in \{Q+1, \ldots, L\}$ is the eigenvalue index, and the value of Ω lies between 0 and 1;
Step c2: comparing the discriminant value with preset thresholds to determine the noise power spectrum distribution model applicable to the noisy speech signal.
5. The method of claim 4, characterized in that the step of comparing the discriminant value with preset thresholds comprises:
Step c21: determining two preset thresholds $\Omega_1$ and $\Omega_2$, with $\Omega_1 < \Omega_2$;
Step c22: comparing the discriminant function with the preset thresholds; in particular, if the discriminant function is less than the preset threshold $\Omega_1$, the zero-mean Gaussian distribution is selected; if the discriminant function is greater than the preset threshold $\Omega_2$, the gamma distribution is selected; otherwise the Laplacian distribution is selected.
6. The method of claim 1, characterized in that the step of estimating the noise power spectrum using conditional probabilities comprises:
for each frame of the noisy speech signal, the probability that it contains only noise is $P(H_0 \mid X)$, and the probability that it contains both noise and the target speech signal is $P(H_1 \mid X)$; for these two cases the noise power spectrum is estimated respectively as follows:
$$H_0:\ \phi_{NN}^0 = \frac{1}{L}\sum_{i=1}^{L}\lambda_X^i \qquad H_1:\ \phi_{NN}^1 = \frac{1}{L-Q}\sum_{i=Q+1}^{L}\lambda_X^i$$
where $\phi_{NN}^0$ and $\phi_{NN}^1$ are the noise power spectra when the mutually exclusive events $H_0$ and $H_1$ occur, respectively, and $i \in \{1, \ldots, L\}$ is the eigenvalue index;
according to the conditional probability formula, the noise power spectrum is estimated as follows:
$$\tilde{\phi}_{NN} = P(H_0 \mid X)\,\phi_{NN}^0 + P(H_1 \mid X)\,\phi_{NN}^1.$$
7. The method of claim 1, characterized in that the step of estimating the auditory masking threshold comprises:
Step f1: dividing the auditory frequency range 0-15500 Hz into several critical sub-bands;
Step f2: calculating the auditory masking threshold in each sub-band separately.
8. The method of claim 7, characterized in that calculating the auditory masking threshold in each sub-band comprises: calculating the energy of each frequency in each sub-band; calculating the spreading coefficient of the basilar membrane of the human ear for the sound in each band; multiplying the energy of each frequency in each sub-band by the spreading coefficient of the corresponding band to obtain the excitation energy on the basilar membrane of the human ear; and then calculating the masking threshold from the functional relationship between the excitation energy on the basilar membrane and the auditory masking threshold.
CN2009102503930A 2009-12-07 2009-12-07 Microphone array postfiltering sound enhancement method based on multi-models and hearing characteristic Expired - Fee Related CN101778322B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102503930A CN101778322B (en) 2009-12-07 2009-12-07 Microphone array postfiltering sound enhancement method based on multi-models and hearing characteristic

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102503930A CN101778322B (en) 2009-12-07 2009-12-07 Microphone array postfiltering sound enhancement method based on multi-models and hearing characteristic

Publications (2)

Publication Number Publication Date
CN101778322A CN101778322A (en) 2010-07-14
CN101778322B true CN101778322B (en) 2013-09-25



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130925