CN106504763A - Microphone array multi-target speech enhancement method based on blind source separation and spectral subtraction - Google Patents
Microphone array multi-target speech enhancement method based on blind source separation and spectral subtraction
- Publication number: CN106504763A (application CN201611191478.2A)
- Authority: CN (China)
- Prior art keywords: signal, frame, spectrum, speech, noise
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Abstract
The invention discloses a microphone array multi-target speech enhancement method based on blind source separation and spectral subtraction, comprising the following steps: collecting multi-channel multi-target signals with a microphone array; band-pass filtering each collected single-channel signal to suppress non-speech-band noise and interference, followed by pre-emphasis; windowing and framing the speech to obtain frame signals, transforming each frame to the frequency domain with the short-time Fourier transform, and extracting the magnitude spectrum and phase spectrum of each frame; detecting the start and end endpoints of the speech signal and estimating the noise power spectrum; reducing the background noise of the speech frames by spectral subtraction; combining the spectral-subtraction output with the phase spectrum and applying the inverse short-time Fourier transform to obtain the time-domain speech signal; and finally performing blind source separation to obtain each target signal. The method is simple to implement, demands few resources, has low computational complexity, and can enhance multiple target signals.
Description
Technical field
The invention belongs to the fields of signal processing and computer speech signal processing, and in particular relates to a speech enhancement method based on a microphone array.
Background art
The goal of speech enhancement is to extract the original speech, as clean as possible, from a noisy speech signal, suppressing background noise, improving speech quality, and increasing listener comfort so that listeners do not tire. It plays an increasingly important role in combating noise pollution, improving speech quality, and raising speech intelligibility. Speech enhancement is a problem that speech signal processing must urgently solve as it matures into practical use. In speech recognition, noise robustness is an important factor in recognition accuracy; as recognition applications expand and enter practical deployment, more effective speech enhancement techniques are urgently needed to strengthen speech features and make speech easier to recognize. A speech signal is a complex nonlinear signal; isolating the desired speech from a mixture of signals, particularly from co-channel speech interference, is a very difficult digital signal processing problem. No algorithm can remove noise completely, and all struggle to maintain high subjective and objective performance under every noise condition.
The typical workflow of a microphone-array speech enhancement method is shown in Fig. 1 and mainly includes the following steps:
1) Design a microphone array structure that meets the application's requirements.
2) Use a multi-channel speech acquisition system to collect the multi-channel speech signals.
3) Preprocess the collected multi-channel speech signals: operations such as voice activity detection, channel delay estimation, and target signal direction estimation.
4) Apply an array speech enhancement algorithm to obtain a cleaner speech signal.
In step 1), designing a suitable microphone array structure is very important. Microphone array topologies can be divided into one-dimensional linear arrays (including uniformly spaced, nested linear, and non-uniformly spaced arrays), two-dimensional planar arrays (including uniform and non-uniform circular arrays and square arrays), and three-dimensional volumetric arrays. In practice, uniform linear arrays, nested linear arrays, and uniform planar arrays are the most widely used. Research shows that the array topology has a large impact on a microphone-array speech system, and the topology design is closely tied to the choice of multi-channel signal model.
According to the distance between the source and the array, acoustic signal models can be divided into far-field and near-field models. The difference is: the far-field model uses a plane-wave model, which ignores the amplitude differences between channels; the source has a single incidence angle with respect to the array, and the delays between array elements are linear. The near-field model uses a spherical wavefront; it accounts for the amplitude differences between received signals, assigns an incidence angle to each element, and the inter-element delays follow no simple linear relation. There is no absolute criterion separating near field and far field; it is generally accepted that a source is in the far field when its distance to the array center is much larger than the signal wavelength, and in the near field otherwise.
Generally, a microphone array can be regarded as a spatial sampling device; analogous to time sampling, the array's sampling frequency must be high enough to avoid spatial angle ambiguity, i.e. spatial aliasing. For a uniform linear array, the spatial sampling rate is defined as U_s = 1/d, i.e. the spatial sampling frequency U_s is determined by the microphone spacing d. Treating the difference between adjacent spatial samples of the same signal as a phase shift, the normalized spatial frequency is defined as U = (d/λ)·sin Φ, where λ is the wavelength and Φ the incidence angle. To avoid spatial aliasing, the normalized frequency must satisfy |U| ≤ 1/2. Since the incidence angle ranges over −90° ≤ Φ ≤ 90°, the spacing between adjacent microphones should satisfy d ≤ λ_min/2.
This spatial sampling theorem relates the microphone spacing, the signal frequency, and the direction of arrival (incidence angle Φ). If it is not satisfied, spatial aliasing occurs.
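The spacing bound above is simple to evaluate. The following is a minimal Python sketch (not from the patent) of the rule d ≤ λ_min/2 = c/(2·f_max) for a uniform linear array; the speed-of-sound constant and the function name are assumptions for illustration.

```python
# Anti-aliasing spacing bound for a uniform linear array.
# Over incidence angles -90..90 degrees, spatial aliasing is avoided when
# d <= lambda_min / 2 = c / (2 * f_max).
C_SOUND = 340.0  # nominal speed of sound in air, m/s (assumed)

def max_mic_spacing(f_max_hz, c=C_SOUND):
    """Largest element spacing in metres that avoids spatial aliasing
    for signal content up to f_max_hz."""
    return c / (2.0 * f_max_hz)

if __name__ == "__main__":
    # Speech band-limited to 3400 Hz (the pre-filter upper cutoff used later)
    print(round(max_mic_spacing(3400.0), 4))  # 0.05 m, i.e. 5 cm
```

For full-band 16 kHz audio the bound drops to about 2.1 cm, which is why wideband arrays often use nested spacings.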
For a uniform linear microphone array, define r_m as the straight-line distance from the source to the m-th microphone. The discrete signal output by the m-th microphone can then be expressed as x_m[n] = s[n − Δn_m] + η_m[n], where s[n] is the source signal, Δn_m is the delay in samples between the signal received at the m-th microphone and the source signal, and η_m[n] is the noise received at the m-th microphone. Let Δτ_m be the corresponding delay in seconds between the m-th microphone's signal and the source signal; then Δn_m = f_s·Δτ_m = f_s·r_m/c, where f_s is the time sampling frequency and c is the propagation speed of sound in space. The array output can thus be written as the signal set:
x_1[n] = s[n − Δn_1] + η_1[n]
x_2[n] = s[n − Δn_2] + η_2[n]
...
x_N[n] = s[n − Δn_N] + η_N[n]
where N is the number of microphones in the array.
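Under stated simplifications (delays rounded to whole samples, white sensor noise), the model x_m[n] = s[n − Δn_m] + η_m[n] with Δn_m = f_s·r_m/c can be simulated directly. The helper below is an illustrative sketch, not part of the patent.

```python
import numpy as np

def simulate_array(s, r, fs=16000, c=340.0, noise_std=0.0, seed=0):
    """Simulate x_m[n] = s[n - dn_m] + eta_m[n] for microphone distances
    r (metres), with dn_m = round(fs * r_m / c)."""
    rng = np.random.default_rng(seed)
    delays = np.round(fs * np.asarray(r) / c).astype(int)
    X = np.zeros((len(r), len(s)))
    for m, dn in enumerate(delays):
        X[m, dn:] = s[: len(s) - dn]                  # delayed copy of the source
        X[m] += noise_std * rng.standard_normal(len(s))  # additive sensor noise
    return X, delays
```

With f_s = 16000 Hz and c = 340 m/s, a path-length difference of 17 cm corresponds to exactly 8 samples of delay.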
In step 3), individual preprocessing operations may be added or omitted depending on the enhancement method used.
In preprocessing, pre-emphasis and pre-filtering are determined by the characteristics of the speech signal. Pre-filtering serves two purposes: 1. suppressing all components of the input signal whose frequency exceeds f_s/2, to prevent aliasing; 2. suppressing 50 Hz mains hum. The pre-filter must therefore be a band-pass filter; with upper and lower cutoff frequencies f_H and f_L, typical values are f_H = 3400 Hz, f_L = 60–100 Hz, and sampling frequency f_s = 16000 Hz.
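A band-pass pre-filter with roughly these cutoffs can be sketched as a windowed-sinc FIR. This is a textbook design under assumed parameters (tap count, Hamming window), not the patent's filter.

```python
import numpy as np

def bandpass_fir(fl=60.0, fh=3400.0, fs=16000.0, numtaps=301):
    """Windowed-sinc band-pass FIR with cutoffs fl..fh: rejects DC and mains
    hum below fl, and components near fs/2, as the pre-filter requires."""
    n = np.arange(numtaps) - (numtaps - 1) / 2

    def lowpass(fc):
        # ideal low-pass impulse response, Hamming-windowed, unit DC gain
        h = np.sinc(2 * fc / fs * n) * np.hamming(numtaps)
        return h / h.sum()

    return lowpass(fh) - lowpass(fl)   # band-pass = LP(fh) - LP(fl)
```

By construction the DC gain is exactly zero, and the stopband above f_H is set by the Hamming window's roughly −53 dB side lobes; sharper rejection of 50 Hz hum so close to f_L would need a longer filter.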
Because the average power spectrum of speech is shaped by glottal excitation and lip/nose radiation, it falls off at about 6 dB/octave above roughly 800 Hz, so higher-frequency components are weaker and the high-frequency part of the spectrum is harder to estimate than the low-frequency part. Pre-emphasis is therefore applied during preprocessing. Its purpose is to boost the high-frequency part so that the spectrum becomes flat and can be estimated with the same signal-to-noise ratio over the whole band from low to high frequency, which benefits spectral analysis and vocal-tract parameter analysis. Pre-emphasis can be realized with a first-order digital filter that boosts the high-frequency characteristics; from its operating principle, the pre-emphasized signal is s′(n) = s(n) − α·s(n+1). To recover the original signal, the pre-emphasized signal must be de-emphasized: s″(n) = s′(n) + β·s′(n+1), where s(n) is the source signal, s′(n) the pre-emphasized signal, and s″(n) the de-emphasized signal; α and β are weighting factors, typically taken as 0.8–0.95.
Speech is a non-stationary, time-varying signal, produced by the motion of the articulatory organs; since the articulators change state far more slowly than the acoustic vibration, speech can be regarded as short-time stationary. Research shows that within 5–50 ms the spectral characteristics and certain physical parameters of speech remain essentially constant. Methods and theory for stationary processes can therefore be brought into short-time speech processing by dividing the signal into many short speech segments, each called an analysis frame. Processing each frame is then equivalent to processing a signal with fixed characteristics. Frames may be contiguous or overlapping; the frame length is generally 10–30 ms. The overlap between the previous and the next frame is called the frame shift, and the ratio of frame shift to frame length is typically 0–1/2. Each extracted speech frame is windowed, i.e. multiplied by a window function w(n), forming the windowed speech. The main role of windowing is to reduce the spectral leakage introduced by framing: framing truncates the speech abruptly, which corresponds to a periodic convolution of the signal spectrum with the spectrum of a rectangular window. Because the side lobes of the rectangular window's spectrum are high, the signal spectrum acquires a "tail", i.e. spectral leakage. A Hamming window can be used instead: its side lobes are lower, effectively suppressing leakage, and its smoother low-pass characteristic yields a smoother spectrum.
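The framing-plus-Hamming-window step can be sketched as below, using a 25 ms frame with 50% overlap at 16 kHz (typical values chosen from the ranges in the text; the function name is illustrative).

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=200):
    """Split x into overlapping frames (25 ms frames, 12.5 ms hop at 16 kHz)
    and apply a Hamming window to each frame, as described in the text."""
    n_frames = 1 + (len(x) - frame_len) // hop
    w = np.hamming(frame_len)
    return np.stack([x[i * hop : i * hop + frame_len] * w
                     for i in range(n_frames)])
```

Each row of the result is one windowed analysis frame, ready for the per-frame short-time Fourier transform used in the later steps.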
Estimating the delays between array elements plays a very important role in microphone-array speech enhancement: together with the signal frequency, the delays determine the beam directivity and support estimation of the source direction. The accuracy of the delay estimate directly affects the performance of the speech processing system. Because the array samples the sound field spatially, each microphone's signal has a certain delay relative to a reference microphone. Keeping the expected speech received by all microphones synchronized, so that the beamformer's output maximum points at the target source, is the key means of solving this problem. Typical delay estimation methods include generalized cross-correlation, adaptive-filtering-based estimation, adaptive eigen-decomposition, and higher-order statistics methods; among these, generalized cross-correlation is the most widely used. Assume the signals received by a pair of microphones are modeled as x_1(t) = s(t) + η_1 and x_2(t) = s(t − D) + η_2, where s(t) is the source signal, x_1(t) and x_2(t) are the two received signals, D is the propagation delay between the two microphones, and η_1, η_2 are additive background noise. Assume s(t), η_1, η_2 are mutually uncorrelated, and ignore amplitude decay. The generalized cross-correlation between x_1(t) and x_2(t) is then
R_12(τ) = ∫ ψ_12(ω) X_1(ω) X_2*(ω) e^{jωτ} dω
where X_1(ω) and X_2(ω) are the Fourier transforms of x_1(t) and x_2(t), and ψ_12(ω) is the generalized cross-correlation weighting function. Choosing an appropriate weighting function for the situation sharpens the peak of R_12(τ); the location of the peak is the delay between the two microphones.
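One common concrete choice of the weighting is the PHAT weighting ψ_12(ω) = 1/|X_1(ω)X_2*(ω)|, giving GCC-PHAT. The sketch below is an illustration of that choice, not the patent's estimator; it returns the integer-sample delay D such that x_2[n] ≈ x_1[n − D].

```python
import numpy as np

def gcc_phat(x1, x2, max_lag=50):
    """GCC with PHAT weighting: whiten the cross-spectrum so only phase
    remains, then locate the correlation peak within +/- max_lag samples."""
    n = len(x1) + len(x2)
    G = np.fft.rfft(x2, n) * np.conj(np.fft.rfft(x1, n))
    G /= np.abs(G) + 1e-12                       # PHAT: keep phase only
    r = np.fft.irfft(G, n)                       # circular cross-correlation
    lags = np.arange(-max_lag, max_lag + 1)
    return lags[int(np.argmax(r[lags % n]))]     # circular index per lag
```

Because the magnitude is discarded, the peak sharpness depends only on the phase coherence, which is what makes PHAT robust for broadband speech in mild reverberation.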
Voice activity detection, also known as speech detection or speech endpoint detection, accurately determines the start and end of the input speech and is essential for good speech-processing performance: speech and noise are processed differently, and if the current frame cannot be classified as a noisy-speech frame or a noise frame, it cannot be handled appropriately. In a speech enhancement system, to gather more characteristics of the background noise, endpoint detection is especially concerned with accurately detecting the speech-free segments. Both the learning of speech knowledge and the accumulation of noise-source statistics rely on accurate endpoint detection. Voice activity detection commonly operates on speech frames of 10–30 ms. The method can be summarized as: extract one or a series of contrast feature parameters from the input signal and compare them against one or a series of thresholds; exceeding the threshold indicates a speech segment, otherwise a speech-free segment.
Speech detection generally takes two steps:
Step 1: Based on features of the speech signal, such as energy, zero-crossing rate, entropy, and pitch, together with their derived parameters, classify the speech/non-speech segments in the signal stream.
Step 2: Once speech is detected in the signal stream, decide whether the point is a start or an end of speech. In a speech system, the variable background and the natural dialogue pattern make pauses (non-speech) inside a sentence likely, especially the silent gap before a plosive initial, so the start/end decision is particularly important.
Current speech endpoint detection methods fall broadly into two classes:
The first class detects speech endpoints under noise using HMM-based models; it requires the background noise to be stationary and the signal-to-noise ratio to be fairly high.
The second class is based on the short-time energy of the signal: statistics of the background noise energy are used to construct an energy threshold, which then determines the speech starting point.
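The second class can be sketched as a short-time energy detector. The threshold rule below (a fixed multiple of the energy averaged over the first few frames, assumed noise-only) is an illustrative simplification of such statistics.

```python
import numpy as np

def energy_vad(x, frame_len=400, hop=200, noise_frames=5, factor=3.0):
    """Energy-threshold VAD: estimate the noise energy from the first few
    frames (assumed speech-free) and flag frames whose short-time energy
    exceeds factor * that estimate as speech."""
    n_frames = 1 + (len(x) - frame_len) // hop
    e = np.array([np.sum(x[i * hop : i * hop + frame_len] ** 2)
                  for i in range(n_frames)])
    thresh = factor * e[:noise_frames].mean()
    return e > thresh
```

Practical detectors add hangover smoothing and secondary features (zero-crossing rate, entropy) to survive low-energy consonants; this sketch shows only the core thresholding.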
In step 4), a cleaner speech signal is obtained with a speech enhancement algorithm.
Speech enhancement techniques can be broadly divided into single-channel methods and multi-channel (microphone-array) methods. Single-channel methods are numerous, mostly combining various noise cancellation schemes with speech signal features into targeted algorithms; the most theoretically mature and the simplest effective one is spectral subtraction (SS: Spectral Subtraction). A single pickup sensor, however, is constrained by venue, distance, and application scenario, so its pickup quality is compromised and subsequent enhancement becomes difficult.
The basic principle of spectral subtraction is: subtract the frequency-domain noise power spectrum from the power spectrum of the noisy speech to obtain an estimate of the speech power spectrum; its square root gives the speech magnitude estimate, and after restoring the phase the time-domain signal is recovered with the inverse Fourier transform. Considering that the human ear is insensitive to phase, the phase used during phase restoration is the phase information of the noisy speech. Because speech is short-time stationary, it is treated as a stationary random signal in the short-time magnitude estimation.
Let s(n), η(n), and x(n) denote the speech, the noise, and the noisy speech, with short-time spectra S(ω), Γ(ω), and X(ω) respectively. Assume s(n) and η(n) are uncorrelated and the noise is additive, giving the additive signal model x(n) = s(n) + η(n). Writing the windowed signals as x_w(n), s_w(n), η_w(n), we have x_w(n) = s_w(n) + η_w(n); taking the Fourier transform gives X_w(ω) = S_w(ω) + Γ_w(ω), and hence for the power spectrum:
|X_w(ω)|² = |S_w(ω)|² + |Γ_w(ω)|² + S_w(ω)Γ_w*(ω) + S_w*(ω)Γ_w(ω)
Only |X_w(ω)|² can be estimated from the observed data; the remaining terms must be approximated by their statistical averages. Since s(n) and η(n) are independent, the statistical average of the cross-power terms is 0, so the estimate of the original speech is:
|Ŝ_w(ω)|² = |X_w(ω)|² − E[|Γ_w(ω)|²]
The estimate |Ŝ_w(ω)|² is not guaranteed to be non-negative, because the noise estimate has errors: when the estimated average noise power exceeds the noisy-speech power of some frame, that frame's estimate becomes negative. These negative values can be flipped in sign to positive, or simply set to zero. Restoring the phase of Ŝ_w(ω) and applying the inverse short-time Fourier transform (IFFT) then yields the time-domain estimate of the speech signal.
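The whole per-frame procedure — power subtraction, zero-clipping of negative estimates, noisy-phase reuse, and overlap-add resynthesis — can be sketched as below. The frame size, window, and noise-averaging scheme are assumptions for illustration, not the exact flow of Fig. 4.

```python
import numpy as np

def spectral_subtract(x, noise, frame_len=512, hop=256):
    """Basic power spectral subtraction: noise PSD averaged over a
    noise-only segment, negative estimates set to zero, noisy phase reused,
    windowed overlap-add resynthesis."""
    w = np.hamming(frame_len)
    n_frames = 1 + (len(noise) - frame_len) // hop
    N = np.mean([np.abs(np.fft.rfft(noise[i * hop : i * hop + frame_len] * w)) ** 2
                 for i in range(n_frames)], axis=0)   # average noise power spectrum
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    for i in range(1 + (len(x) - frame_len) // hop):
        seg = x[i * hop : i * hop + frame_len] * w
        X = np.fft.rfft(seg)
        P = np.maximum(np.abs(X) ** 2 - N, 0.0)       # subtract, zero-clip negatives
        S = np.sqrt(P) * np.exp(1j * np.angle(X))     # reuse the noisy phase
        out[i * hop : i * hop + frame_len] += np.fft.irfft(S, frame_len) * w
        norm[i * hop : i * hop + frame_len] += w ** 2
    return out / np.maximum(norm, 1e-8)
```

Zero-clipping is the simplest fix for negative power estimates; it trades residual "musical noise" for guaranteed non-negativity, as discussed above.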
Current microphone-array speech enhancement mainly uses beamforming, subspace decomposition, blind source separation, and similar techniques. Blind source separation (BSS) refers to recovering the source signals from the observed signals alone, when the source signals and the mixing process are unknown or unavailable. Because it does not depend on prior knowledge of the scene, blind source separation can enhance speech with fewer microphones; the core problem the algorithm solves is separating each speaker's voice when several voices interfere and alias, thereby enhancing each target voice.
Independent component analysis (ICA) is one of the effective approaches to blind signal separation and belongs to linear instantaneous-mixture signal processing. It does not rely on detailed knowledge of the source signal types or accurate identification of the transmission channel characteristics, making it an effective redundancy-removal technique. Different cost functions lead to different ICA algorithms, such as information maximization (Infomax), FastICA, maximum entropy (ME) and minimum mutual information (MMI) algorithms, and maximum likelihood (ML) algorithms. The basic principle is: regard the observed signals as target signals mixed through a linear transformation; to recover the target signals, one needs to find an inverse linear transformation that decomposes the acquired signals, thereby achieving source separation.
In the noise-free case, let X = [x_1(t) x_2(t) … x_N(t)]′ denote a group of observed signals received by the microphone array, where t is the time or sample index and N is the number of microphones. Assume X is a linear mixture of independent components S = [s_1(t) s_2(t) … s_N(t)]′, where A is an unknown non-singular matrix; the vector expression of the signal model is then X = AS.
In the presence of noise, assume the noise is additive; the signal model becomes X = AS + Γ, where Γ = [η_1 η_2 … η_N]′ is the noise vector. Rewriting X = AS + Γ as X = A(S + Γ_0) with Γ = AΓ_0, i.e. Γ_0 = A⁻¹Γ, shows that the noisy model is still the basic ICA model, only with the independent components changed from S to S + Γ_0. Under the basic ICA model, let W be the separation matrix to be found and Y the matrix of separated signals; then Y = WX = WAS. The ultimate aim of ICA is to find an optimal, or at least good, separation matrix W such that the signals in the separated matrix Y are mutually independent and approximate the source signals as closely as possible.
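One standard way to find W is FastICA: whiten the mixtures, then run a symmetric fixed-point iteration with the tanh nonlinearity, re-orthogonalizing W each step. The sketch below is generic textbook ICA, not the patent's separation step.

```python
import numpy as np

def fastica(X, n_iter=200, seed=0):
    """Symmetric FastICA: whiten X, then iterate
    W <- E[g(WZ)Z'] - diag(E[g'(WZ)]) W  with g = tanh,
    followed by symmetric decorrelation W <- (W W')^{-1/2} W."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(X))       # eigendecomposition of covariance
    Z = (E / np.sqrt(d)).T @ X             # whitened mixtures: cov(Z) = I
    n = X.shape[0]
    W = rng.standard_normal((n, n))
    for _ in range(n_iter):
        Y = W @ Z
        g, gp = np.tanh(Y), 1.0 - np.tanh(Y) ** 2
        W = (g @ Z.T) / Z.shape[1] - np.diag(gp.mean(axis=1)) @ W
        u, _, vt = np.linalg.svd(W)        # (W W')^{-1/2} W  ==  U V'
        W = u @ vt
    return W @ Z                           # separated signals
```

As with all ICA, the recovered components come back with arbitrary order and sign, so evaluation should match each source to its best-correlated output.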
At present, speech enhancement based on microphone arrays is all carried out for a single target, which limits the effective pickup capability of array pickup devices; traditional single-target enhancement cannot meet the demands of practical applications.
Summary of the invention
To solve the current technical problem of multi-target enhancement based on array speech signals, the present invention proposes a microphone array multi-target speech enhancement method based on blind source separation and spectral subtraction.
The microphone array multi-target speech enhancement method based on blind source separation and spectral subtraction of the present invention includes the following steps:
Step 1: Collect noisy speech signals with a two-dimensional microphone array, obtaining the collected signal of each channel, the number of microphones in the array being greater than or equal to 4.
Step 2: Apply steps 201–205 to the collected signal of each channel:
Step 201: Band-pass filter the collected signal to suppress non-speech-band noise and interference; then apply pre-emphasis to the band-pass-filtered signal, and frame and window it to obtain frame signals.
Step 202: Transform every frame signal to the frequency domain, i.e. apply the short-time Fourier transform to each frame, and compute each frame's power spectrum; at the same time compute and retain each frame's phase spectrum, for phase restoration during spectral subtraction.
Step 203: Perform speech detection on every frame signal, deciding whether the current frame is a speech frame or a noise frame, and estimate the noise power spectrum from the noise frames.
Step 204: Remove the noise power spectrum from the power spectrum of each speech frame by spectral subtraction, obtaining each frame's speech power spectrum estimate.
Step 205: Take the square root of the speech power spectrum estimate, restore the phase using the corresponding frame's phase spectrum, and then apply the inverse short-time Fourier transform, obtaining the time-domain estimate of the speech frame.
Step 2 preprocesses each channel's collected signal, dividing it into many short (noisy) speech segments, i.e. frame signals, and then applies spectral subtraction to each frame to reduce the background noise of the speech frames.
Step 3: Apply blind source separation to the time-domain speech-frame estimates of all channels to separate the sources, obtaining the target signal of each source.
Step 4: De-emphasize, un-window, and re-assemble the frames of the target signals of each source, obtaining the target speech signal of each source.
In summary, thanks to the above technical solution, the beneficial effects of the invention are: (1) it solves the technical problems of traditional single-channel speech enhancement in handling environmental background noise, while the algorithm stays simple and the resource requirements modest; (2) it no longer relies on array signal processing algorithms for spatial filtering and does not need a wideband beamforming algorithm, reducing the structural complexity of the algorithm; (3) it uses a blind source separation algorithm to enhance the target signals, no longer enhancing a single target one at a time.
Description of the drawings
Fig. 1 is a schematic diagram of a conventional speech enhancement system.
Fig. 2 is a schematic diagram of the system realized by the specific embodiment of the invention.
Fig. 3 is the flow chart of speech detection.
Fig. 4 is the flow chart of the spectral-subtraction single-channel speech enhancement method.
Specific embodiment
For making the object, technical solutions and advantages of the present invention clearer, with reference to embodiment and accompanying drawing, to this
Bright it is described in further detail.
Referring to Fig. 2, the multi-target speech enhancement method of the present invention first performs signal preprocessing on each single-channel signal (speech signal) collected by a two-dimensional microphone array: each single-channel speech signal is divided into many short-time speech segments, yielding frame signals for the subsequent voice activity detection and spectral-subtraction processing. The signal preprocessing includes band-pass filtering, pre-emphasis, overlapped framing, and Hamming windowing.
Voice activity detection and spectral subtraction are then performed on the frame signals of each channel; blind source separation is next applied across all channels of the same speech frame, obtaining the target signals of the different sources. Finally, the inverse operations of the signal preprocessing are applied: the target signal of each source is de-emphasized, the Hamming window is removed, and the frames are recombined, yielding each target speech signal and thereby realizing the enhancement of multiple target speech signals.
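The preprocessing chain just described (pre-emphasis, overlapped framing, Hamming windowing) can be sketched as follows. This is a minimal illustration, not the patented implementation: the band-pass filtering stage is omitted, and the frame length, hop size and pre-emphasis coefficient are assumed values the passage does not specify.

```python
import numpy as np

def preprocess(signal, frame_len=256, hop=128, alpha=0.97):
    """Pre-emphasis, overlapped framing and Hamming windowing of one channel."""
    # Pre-emphasis: x'[n] = x[n] - alpha * x[n-1], boosting high frequencies
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Overlapped frames (50% overlap when hop = frame_len // 2)
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Hamming window applied to every frame
    return frames * np.hamming(frame_len)
```

The inverse operations of step 4 (de-emphasis, window removal, overlap-add recombination) would undo these stages in reverse order.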
For the sound-source and noise-field characteristics of an indoor environment, a scattered (diffuse) noise-field model and a near-field source model are used to model the multi-channel noisy speech signals of the actual environment. The speech signals in the space are collected by an 8 × 8 planar array composed of 64 microphones.
Let X = [x1(t) x2(t) … xj(t) … xN(t)]′ denote the noisy speech signals output by the channels, where j is the microphone channel index.
After signal preprocessing of the noisy speech signal output by each channel, the resulting array signal (frame signal) is Xpw = [x1pw(n) x2pw(n) … xjpw(n) … xNpw(n)]′, where n = 1, 2, …, L, L is the frame length, and w is the frame number.
A short-time Fourier transform is applied to the frame signal Xpw, giving the amplitude spectrum |Xpw(ω)| and the phase spectrum Φpw(ω), where the frequency variable ω is stepped over N uniformly spaced sample points of the angular frequency from 0 to 2π. Thus:
|Xpw| = [|X1pw(ω)| |X2pw(ω)| … |Xjpw(ω)| … |XNpw(ω)|]′
Using |Xpw| = [|X1pw(ω)| |X2pw(ω)| … |Xjpw(ω)| … |XNpw(ω)|]′, the detection of the speech start endpoint and end endpoint is carried out according to the flow chart shown in Fig. 3, i.e. each current frame is judged to be a noise frame or a speech frame, and the decision is used for spectral-subtraction de-noising. The detection of the speech start endpoint (start frame) and end endpoint (end frame) proceeds as follows:
Using the formula M_w = Σ_ω |X_w(ω)|², the speech energy of each frame is calculated, where N is the frame length, w is the frame number, 1 ≤ w ≤ L, L is the number of frames, and ω runs over the frequency points of the frame;
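The per-frame energy M_w = Σ_ω |X_w(ω)|² can be computed from the short-time Fourier transform of the windowed frames. A minimal sketch (the function name and the frames-by-rows array layout are assumptions for illustration):

```python
import numpy as np

def frame_energies(frames):
    """Speech energy of each frame: M_w = sum over omega of |X_w(omega)|^2."""
    spectra = np.fft.fft(frames, axis=1)   # short-time Fourier transform per frame
    mags = np.abs(spectra)                 # amplitude spectrum |X_w(omega)|
    phases = np.angle(spectra)             # phase spectrum Phi_w(omega), kept for later recovery
    energies = np.sum(mags ** 2, axis=1)   # M_w for every frame w
    return energies, mags, phases
```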
The threshold T is initialized: by statistics of the background-noise energy, the initial value of T is set.
Each frame is then classified against the threshold T, i.e. the current frame is judged to be a noise frame or a speech frame, while T is updated from the most recent k noise frames:
a. The speech energy M_w of the current frame is calculated. If M_w is greater than T, the current frame is judged to be a speech frame; otherwise it is judged to be a noise frame.
b. If the current frame is a noise frame, the threshold T is updated from the most recent k noise frames (k is an empirical value, usually greater than or equal to 10):
b1: the average speech energy E_MN, the maximum energy E_MAX and the minimum energy E_MIN of the most recent k noise frames are calculated;
b2: the updated threshold T is obtained from the formula T = min[a × (E_MAX − E_MIN) + E_MN, b × E_MN] (0 < a < 1, 1 < b < 10).
c. If the current frame is a speech frame, it is checked whether all frames have been processed; if so, the endpoint detection is finished; otherwise, steps a to c are repeated for the next frame.
Further, the speech-frame and noise-frame decisions can also be verified with the short-time zero-crossing rate to prevent misjudgment.
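The adaptive-threshold detection of steps a to c can be sketched as follows. The initialization from the first few frames and the values of a, b and k are illustrative assumptions (the passage only constrains 0 < a < 1, 1 < b < 10 and k ≥ 10), and the short-time zero-crossing-rate check is omitted:

```python
import numpy as np
from collections import deque

def detect_speech(energies, k=10, a=0.5, b=2.0, init_noise_frames=5):
    """Classify each frame as speech (True) or noise (False) using a
    threshold T updated from the most recent k noise frames."""
    noise_hist = deque(maxlen=k)
    # Initialise T from the first few frames, assumed to be background noise
    noise_hist.extend(energies[:init_noise_frames])
    T = b * np.mean(noise_hist)
    flags = []
    for M_w in energies:
        is_speech = M_w > T                      # step a: compare energy with threshold
        flags.append(bool(is_speech))
        if not is_speech:                        # step b: noise frame updates T
            noise_hist.append(M_w)
            E_MN = np.mean(noise_hist)           # b1: mean noise energy
            E_MAX, E_MIN = max(noise_hist), min(noise_hist)
            T = min(a * (E_MAX - E_MIN) + E_MN, b * E_MN)  # b2: updated threshold
    return np.array(flags)
```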
Referring to Fig. 4, the noise power spectrum can be estimated from all of the detected noise frames; the estimated noise is then removed from each speech frame by spectral subtraction, i.e. the currently estimated noise power spectrum is subtracted from the power spectrum of the speech frame, giving the speech power-spectrum estimate of the frame. The square root of the speech power-spectrum estimate is taken, the phase is restored from the phase spectrum of the speech frame, and an inverse short-time Fourier transform is applied, giving the time-domain estimated signal of the speech frame, i.e. the enhanced single-channel speech signal.
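This per-frame spectral-subtraction step can be sketched as follows, assuming the noise power spectrum is estimated as the average over the detected noise frames; the half-wave rectification (flooring negative power at zero) is a common simple choice that this passage does not prescribe:

```python
import numpy as np

def spectral_subtract(frames, speech_flags):
    """Spectral subtraction with phase recovery.

    frames: windowed time-domain frames, shape (n_frames, frame_len)
    speech_flags: boolean numpy array, True where a frame is speech
    """
    spectra = np.fft.fft(frames, axis=1)
    power = np.abs(spectra) ** 2
    phase = np.angle(spectra)
    # Noise power spectrum estimated as the mean over detected noise frames
    noise_power = power[~speech_flags].mean(axis=0)
    # Subtract the noise power spectrum; floor at zero to avoid negative power
    clean_power = np.maximum(power - noise_power, 0.0)
    # Square root back to magnitude, restore the phase, inverse FFT to time domain
    clean_spec = np.sqrt(clean_power) * np.exp(1j * phase)
    return np.real(np.fft.ifft(clean_spec, axis=1))
```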
After the above is completed, blind source separation is performed using natural-gradient ICA. The concrete processing procedure is:
(1) if the mean of the current observation signal X (the sequence of enhanced single-channel speech signals) is not zero, the mean is first subtracted from the observation signal X;
(2) a matrix B is selected such that the covariance matrix E{VV^T} is the identity matrix I, where V = BX; the components of the vector V are uncorrelated and have unit variance;
(3) whitening based on singular value decomposition: first the covariance R_x = E{XX^T} of X is estimated; R_x is a real symmetric (Hermitian) matrix. The singular value decomposition R_x = UΣU^T is then computed, where the columns of U = [u1, u2, …, un] are the left singular vectors of R_x. The purpose of the whitening is to weaken the correlation of the mixed speech signals;
(4) since σ1 ≥ σ2 ≥ … ≥ σm > 0 and σ_{m+1} = … = σ_n = 0 (m ≤ n), the estimated number of source signals is m;
(5) finally the orthogonal transformation is carried out: with U_m = [u1, u2, …, um], the whitened signal is V = BX, where B = Σ_m^{−1/2} U_m^T.
According to the formula Y = WX = WAS = PS, the recovered signal Y is obtained, where P = WA. P may be called the performance matrix or the convergence matrix, W denotes the separation matrix in independent component analysis (ICA), A denotes the mixing matrix in ICA, and S denotes the source signals.
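A minimal sketch of the whitening and natural-gradient separation described in (1) to (5). The learning rate, iteration count and tanh nonlinearity are illustrative choices not taken from this document, and the recovered sources are only determined up to permutation and scaling, as is usual for ICA:

```python
import numpy as np

def natural_gradient_ica(X, n_iter=300, lr=0.1):
    """SVD-based whitening followed by natural-gradient ICA.

    X: observed mixtures, shape (n_channels, n_samples).
    Returns Y = W V, an estimate of the sources up to permutation and scaling.
    """
    # (1) remove the mean of each observed channel
    X = X - X.mean(axis=1, keepdims=True)
    # (3) whitening: R_x = E{X X^T} = U Sigma U^T
    Rx = X @ X.T / X.shape[1]
    U, sigma, _ = np.linalg.svd(Rx)
    # (4) estimated number of sources = number of non-negligible singular values
    m = int(np.sum(sigma > 1e-10 * sigma[0]))
    # (5) B = Sigma_m^{-1/2} U_m^T, so that V = B X satisfies E{V V^T} = I
    B = np.diag(sigma[:m] ** -0.5) @ U[:, :m].T
    V = B @ X
    # Natural-gradient update: W += lr * (I - E{g(Y) Y^T}) W
    W = np.eye(m)
    n = V.shape[1]
    for _ in range(n_iter):
        Y = W @ V
        g = np.tanh(Y)                     # nonlinearity suited to super-Gaussian speech
        W += lr * (np.eye(m) - g @ Y.T / n) @ W
    return W @ V
```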
The above is only a specific embodiment of the present invention. Any feature disclosed in this specification may, unless specifically stated otherwise, be replaced by other equivalent features or alternative features serving a similar purpose; and all of the disclosed features, or all of the steps of the disclosed methods or processes, may be combined in any manner, except for mutually exclusive features and/or steps.
Claims (3)
1. A microphone-array multi-target speech enhancement method based on blind source separation and spectral subtraction, characterized by comprising the following steps:
Step 1: Noisy speech signals are collected by a microphone array arranged as a two-dimensional array, obtaining the collected signal of each channel of the microphone array, wherein the number of microphone array elements is greater than or equal to 4;
Step 2: Steps 201 to 205 are executed on the collected signal of each channel:
Step 201: Band-pass filtering is applied to the collected signal to reject non-speech noise and interference; the band-pass-filtered signal is then pre-emphasized, framed and windowed, obtaining frame signals;
Step 202: A short-time Fourier transform is applied to the frame signal of each frame, and the power spectrum and phase spectrum of each frame are calculated;
Step 203: Speech detection is performed on the frame signal of each frame to judge whether the current frame is a speech frame or a noise frame, and the noise power spectrum is estimated from the noise frames;
Step 204: The noise power spectrum is removed from the power spectrum of each speech frame by spectral subtraction, obtaining the speech power-spectrum estimate of each frame;
Step 205: The square root of the speech power-spectrum estimate is taken, the phase is restored from the phase spectrum of the corresponding frame, and an inverse short-time Fourier transform is applied, obtaining the time-domain estimated signal of the speech frame;
Step 3: Source separation is performed on the time-domain estimated signals of the speech frames of all channels using a blind source separation method, obtaining the target signals of the various sources;
Step 4: The target signal of each source is de-emphasized, de-windowed and frame-recombined, obtaining the target speech signal of each source.
2. the method for claim 1, it is characterised in that the microphone array is classified as 8*8 rectangle plane arrays, each battle array
Unit is uniformly distributed on the whole, can carry out Subarray partition, and different submatrixs can work independently.
3. The method of claim 1 or 2, characterized in that in step 3, the source separation is performed using adaptive natural-gradient blind source separation.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2015109672348 | 2015-12-22 | ||
CN201510967234 | 2015-12-22 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106504763A true CN106504763A (en) | 2017-03-15 |
Family
ID=58333455
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611191478.2A Pending CN106504763A (en) | 2015-12-22 | 2016-12-21 | Based on blind source separating and the microphone array multiple target sound enhancement method of spectrum-subtraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106504763A (en) |
- 2016-12-21 CN CN201611191478.2A patent/CN106504763A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102750956A (en) * | 2012-06-18 | 2012-10-24 | 歌尔声学股份有限公司 | Method and device for removing reverberation of single channel voice |
CN202749088U (en) * | 2012-08-08 | 2013-02-20 | 滨州学院 | Voice reinforcing system using blind source separation algorithm |
US20150078571A1 (en) * | 2013-09-17 | 2015-03-19 | Lukasz Kurylo | Adaptive phase difference based noise reduction for automatic speech recognition (asr) |
CN103854660A (en) * | 2014-02-24 | 2014-06-11 | 中国电子科技集团公司第二十八研究所 | Four-microphone voice enhancement method based on independent component analysis |
CN104935546A (en) * | 2015-06-18 | 2015-09-23 | 河海大学 | MIMO-OFDM (Multiple Input Multiple Output-Orthogonal Frequency Division Multiplexing) signal blind separation method for increasing natural gradient algorithm convergence speed |
Non-Patent Citations (4)
Title |
---|
李蕴华: "基于盲源分离的单通道语音信号增强", 《计算机仿真》 * |
杨震等: "基于SB卡的语音识别实时仿真系统", 《南京邮电学院学报》 * |
职振华: "语音盲分离算法的研究", 《中国硕士学位论文全文数据库(电子期刊)信息科技辑》 * |
陈为国: "实时语音信号处理系统理论和应用", 《中国博士学位论文全文数据库(电子期刊)信息科技辑》 * |
Cited By (56)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107102296B (en) * | 2017-04-27 | 2020-04-14 | 大连理工大学 | Sound source positioning system based on distributed microphone array |
CN107102296A (en) * | 2017-04-27 | 2017-08-29 | 大连理工大学 | A kind of sonic location system based on distributed microphone array |
CN107293305A (en) * | 2017-06-21 | 2017-10-24 | 惠州Tcl移动通信有限公司 | It is a kind of to improve the method and its device of recording quality based on blind source separation algorithm |
CN107785029B (en) * | 2017-10-23 | 2021-01-29 | 科大讯飞股份有限公司 | Target voice detection method and device |
CN107785029A (en) * | 2017-10-23 | 2018-03-09 | 科大讯飞股份有限公司 | Target voice detection method and device |
US11308974B2 (en) | 2017-10-23 | 2022-04-19 | Iflytek Co., Ltd. | Target voice detection method and apparatus |
US11869481B2 (en) | 2017-11-30 | 2024-01-09 | Alibaba Group Holding Limited | Speech signal recognition method and device |
CN109859749A (en) * | 2017-11-30 | 2019-06-07 | 阿里巴巴集团控股有限公司 | A kind of voice signal recognition methods and device |
US11482237B2 (en) | 2017-12-01 | 2022-10-25 | Tencent Technology (Shenzhen) Company Limited | Method and terminal for reconstructing speech signal, and computer storage medium |
CN109887494A (en) * | 2017-12-01 | 2019-06-14 | 腾讯科技(深圳)有限公司 | The method and apparatus of reconstructed speech signal |
WO2019105238A1 (en) * | 2017-12-01 | 2019-06-06 | 腾讯科技(深圳)有限公司 | Method and terminal for speech signal reconstruction and computer storage medium |
CN108831500A (en) * | 2018-05-29 | 2018-11-16 | 平安科技(深圳)有限公司 | Sound enhancement method, device, computer equipment and storage medium |
CN108899052A (en) * | 2018-07-10 | 2018-11-27 | 南京邮电大学 | A kind of Parkinson's sound enhancement method based on mostly with spectrum-subtraction |
CN108899052B (en) * | 2018-07-10 | 2020-12-01 | 南京邮电大学 | Parkinson speech enhancement method based on multi-band spectral subtraction |
CN109671439B (en) * | 2018-12-19 | 2024-01-19 | 成都大学 | Intelligent fruit forest bird pest control equipment and bird positioning method thereof |
CN109671439A (en) * | 2018-12-19 | 2019-04-23 | 成都大学 | A kind of intelligence fruit-bearing forest bird pest prevention and treatment equipment and its birds localization method |
CN109884591A (en) * | 2019-02-25 | 2019-06-14 | 南京理工大学 | A kind of multi-rotor unmanned aerial vehicle acoustical signal Enhancement Method based on microphone array |
US11049509B2 (en) | 2019-03-06 | 2021-06-29 | Plantronics, Inc. | Voice signal enhancement for head-worn audio devices |
US11664042B2 (en) | 2019-03-06 | 2023-05-30 | Plantronics, Inc. | Voice signal enhancement for head-worn audio devices |
CN110111806A (en) * | 2019-03-26 | 2019-08-09 | 广东工业大学 | A kind of blind separating method of moving source signal aliasing |
CN110060704A (en) * | 2019-03-26 | 2019-07-26 | 天津大学 | A kind of sound enhancement method of improved multiple target criterion study |
CN110491410A (en) * | 2019-04-12 | 2019-11-22 | 腾讯科技(深圳)有限公司 | Speech separating method, audio recognition method and relevant device |
CN110223708A (en) * | 2019-05-07 | 2019-09-10 | 平安科技(深圳)有限公司 | Sound enhancement method and relevant device based on speech processes |
CN110223708B (en) * | 2019-05-07 | 2023-05-30 | 平安科技(深圳)有限公司 | Speech enhancement method based on speech processing and related equipment |
CN111986692A (en) * | 2019-05-24 | 2020-11-24 | 腾讯科技(深圳)有限公司 | Sound source tracking and pickup method and device based on microphone array |
CN112289335A (en) * | 2019-07-24 | 2021-01-29 | 阿里巴巴集团控股有限公司 | Voice signal processing method and device and pickup equipment |
CN110459236B (en) * | 2019-08-15 | 2021-11-30 | 北京小米移动软件有限公司 | Noise estimation method, apparatus and storage medium for audio signal |
CN110459236A (en) * | 2019-08-15 | 2019-11-15 | 北京小米移动软件有限公司 | Noise estimation method, device and the storage medium of audio signal |
CN110459234A (en) * | 2019-08-15 | 2019-11-15 | 苏州思必驰信息科技有限公司 | For vehicle-mounted audio recognition method and system |
CN110459234B (en) * | 2019-08-15 | 2022-03-22 | 思必驰科技股份有限公司 | Vehicle-mounted voice recognition method and system |
CN111128217A (en) * | 2019-12-31 | 2020-05-08 | 杭州爱莱达科技有限公司 | Distributed multi-channel voice coherent laser radar interception method and device |
CN111239680B (en) * | 2020-01-19 | 2022-09-16 | 西北工业大学太仓长三角研究院 | Direction-of-arrival estimation method based on differential array |
CN111239680A (en) * | 2020-01-19 | 2020-06-05 | 西北工业大学太仓长三角研究院 | Direction-of-arrival estimation method based on differential array |
CN113314137B (en) * | 2020-02-27 | 2022-07-26 | 东北大学秦皇岛分校 | Mixed signal separation method based on dynamic evolution particle swarm shielding EMD |
CN113314137A (en) * | 2020-02-27 | 2021-08-27 | 东北大学秦皇岛分校 | Mixed signal separation method based on dynamic evolution particle swarm shielding EMD |
CN111402917A (en) * | 2020-03-13 | 2020-07-10 | 北京松果电子有限公司 | Audio signal processing method and device and storage medium |
CN111627456A (en) * | 2020-05-13 | 2020-09-04 | 广州国音智能科技有限公司 | Noise elimination method, device, equipment and readable storage medium |
CN113763982A (en) * | 2020-06-05 | 2021-12-07 | 阿里巴巴集团控股有限公司 | Audio processing method and device, electronic equipment and readable storage medium |
CN112309414A (en) * | 2020-07-21 | 2021-02-02 | 东莞市逸音电子科技有限公司 | Active noise reduction method based on audio coding and decoding, earphone and electronic equipment |
CN112309414B (en) * | 2020-07-21 | 2024-01-12 | 东莞市逸音电子科技有限公司 | Active noise reduction method based on audio encoding and decoding, earphone and electronic equipment |
CN112151036A (en) * | 2020-09-16 | 2020-12-29 | 科大讯飞(苏州)科技有限公司 | Anti-sound-crosstalk method, device and equipment based on multi-pickup scene |
CN112151036B (en) * | 2020-09-16 | 2021-07-30 | 科大讯飞(苏州)科技有限公司 | Anti-sound-crosstalk method, device and equipment based on multi-pickup scene |
CN112735464A (en) * | 2020-12-21 | 2021-04-30 | 招商局重庆交通科研设计院有限公司 | Tunnel emergency broadcast sound effect information detection method |
CN113030862A (en) * | 2021-03-12 | 2021-06-25 | 中国科学院声学研究所 | Multi-channel speech enhancement method and device |
WO2022198820A1 (en) * | 2021-03-22 | 2022-09-29 | 北京搜狗科技发展有限公司 | Speech processing method and apparatus, and apparatus for speech processing |
CN113077808A (en) * | 2021-03-22 | 2021-07-06 | 北京搜狗科技发展有限公司 | Voice processing method and device for voice processing |
CN113077808B (en) * | 2021-03-22 | 2024-04-26 | 北京搜狗科技发展有限公司 | Voice processing method and device for voice processing |
CN113160845A (en) * | 2021-03-29 | 2021-07-23 | 南京理工大学 | Speech enhancement algorithm based on speech existence probability and auditory masking effect |
CN113329288A (en) * | 2021-04-29 | 2021-08-31 | 开放智能技术(南京)有限公司 | Bluetooth headset noise reduction method based on notch technology |
CN113053406A (en) * | 2021-05-08 | 2021-06-29 | 北京小米移动软件有限公司 | Sound signal identification method and device |
CN113314135A (en) * | 2021-05-25 | 2021-08-27 | 北京小米移动软件有限公司 | Sound signal identification method and device |
CN113314135B (en) * | 2021-05-25 | 2024-04-26 | 北京小米移动软件有限公司 | Voice signal identification method and device |
CN114639398B (en) * | 2022-03-10 | 2023-05-26 | 电子科技大学 | Broadband DOA estimation method based on microphone array |
CN114639398A (en) * | 2022-03-10 | 2022-06-17 | 电子科技大学 | Broadband DOA estimation method based on microphone array |
CN117238278A (en) * | 2023-11-14 | 2023-12-15 | 三一智造(深圳)有限公司 | Speech recognition error correction method and system based on artificial intelligence |
CN117238278B (en) * | 2023-11-14 | 2024-02-09 | 三一智造(深圳)有限公司 | Speech recognition error correction method and system based on artificial intelligence |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106504763A (en) | Based on blind source separating and the microphone array multiple target sound enhancement method of spectrum-subtraction | |
CN106251877B (en) | Voice Sounnd source direction estimation method and device | |
Hoshen et al. | Speech acoustic modeling from raw multichannel waveforms | |
Erdogan et al. | Improved mvdr beamforming using single-channel mask prediction networks. | |
WO2020108614A1 (en) | Audio recognition method, and target audio positioning method, apparatus and device | |
EP3360250B1 (en) | A sound signal processing apparatus and method for enhancing a sound signal | |
CN109830245A (en) | A kind of more speaker's speech separating methods and system based on beam forming | |
WO2015196729A1 (en) | Microphone array speech enhancement method and device | |
CN106057210B (en) | Quick speech blind source separation method based on frequency point selection under binaural distance | |
Niwa et al. | Post-filter design for speech enhancement in various noisy environments | |
JP6225245B2 (en) | Signal processing apparatus, method and program | |
KR20210137146A (en) | Speech augmentation using clustering of queues | |
Velasco et al. | Novel GCC-PHAT model in diffuse sound field for microphone array pairwise distance based calibration | |
Martín-Doñas et al. | Dual-channel DNN-based speech enhancement for smartphones | |
CN112394324A (en) | Microphone array-based remote sound source positioning method and system | |
EP3847645B1 (en) | Determining a room response of a desired source in a reverberant environment | |
Bohlender et al. | Neural networks using full-band and subband spatial features for mask based source separation | |
Paikrao et al. | Consumer Personalized Gesture Recognition in UAV Based Industry 5.0 Applications | |
Yegnanarayana et al. | Determining mixing parameters from multispeaker data using speech-specific information | |
Bavkar et al. | PCA based single channel speech enhancement method for highly noisy environment | |
Zhu et al. | Modified complementary joint sparse representations: a novel post-filtering to MVDR beamforming | |
Firoozabadi et al. | Combination of nested microphone array and subband processing for multiple simultaneous speaker localization | |
Pfeifenberger et al. | Blind source extraction based on a direction-dependent a-priori SNR. | |
Taghia et al. | Dual-channel noise reduction based on a mixture of circular-symmetric complex Gaussians on unit hypersphere | |
Hu et al. | Robust binaural sound localisation with temporal attention |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170315 ||