CN106373589B - Binaural mixed-speech separation method based on an iterative structure - Google Patents
Binaural mixed-speech separation method based on an iterative structure Download PDF Info
- Publication number
- CN106373589B CN106373589B CN201610824648.XA CN201610824648A CN106373589B CN 106373589 B CN106373589 B CN 106373589B CN 201610824648 A CN201610824648 A CN 201610824648A CN 106373589 B CN106373589 B CN 106373589B
- Authority
- CN
- China
- Prior art keywords
- ears
- sound source
- frame
- signal
- itd
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
Abstract
The invention discloses a binaural mixed-speech separation method based on an iterative structure. Using the binaural spatial cues of interaural time difference (ITD, Interaural Time Difference) and interaural intensity difference (IID, Interaural Intensity Difference), the multiple sound sources in the mixed speech are first coarsely localized. The estimated number of sources and the azimuth of each source then serve as the basis for separation, and each source's data stream is separated and reconstructed from this azimuth information. The source azimuths are subsequently re-estimated from the reconstructed speech signals, and the mixture is separated again using the corrected azimuth information. After iterating these steps, the source data streams reconstructed in the last separation pass are taken as the final separation result. Compared with conventional binaural speech separation methods, the proposed iterative, spatial-information-based method significantly improves the perceptual quality of the separated speech under low signal-to-noise ratio and strong reverberation.
Description
Technical field
The present invention relates to the fields of sound-source localization and speech separation, and in particular to a binaural mixed-speech separation method based on an iterative structure and spatial azimuth information.
Background art
Binaural mixed-speech separation is an emerging interdisciplinary field involving artificial intelligence, auditory psychology, auditory physiology, signal processing, and other research areas. With the rapid development of modern science and technology, speech separation has found broad practical use. In video conferencing, for example, spatial information can be added to the voices of multiple speakers so that the main speaker's voice is separated and enhanced; in hearing aids, it can help hearing-impaired users focus their attention on a single speaker. Studying binaural mixed-speech separation therefore has important theoretical and practical value for improving the robustness of speech signal processing and for solving the "cocktail party" problem of separating mixed sounds in complex acoustic environments.
Summary of the invention
Object of the invention: to overcome the deficiencies of the prior art, the present invention provides a binaural mixed-speech separation method based on an iterative structure and spatial azimuth information. It combines sound-source localization with speech separation: the spatial azimuth information of the sources serves as the characteristic parameter for speech separation, while the separated speech in turn improves localization performance. This forms a localization-separation iterative structure that improves the separation of mixed speech based on spatial information.
Technical solution: the present invention provides a binaural mixed-speech separation method based on an iterative structure, comprising the steps of:
1) Parameter training stage:
1.1) Training uses directional binaural white-noise signals of known azimuth, generated by convolving head-related impulse response (HRIR) data with a monaural white-noise signal. The source azimuth θ is defined as the angle between the projection of the direction vector onto the horizontal plane and the median plane, ranging over [-90°, 90°] in steps of 5°.
1.2) The binaural white-noise signals of 1.1) are preprocessed to obtain framed single-frame binaural signals; the preprocessing comprises amplitude normalization, framing with windowing, and endpoint detection.
1.3) A cross-correlation function is computed for each single-frame binaural signal of 1.2) and interpolated with a cubic spline function, from which the frame's ITD estimate is calculated. The mean of the ITD estimates over all frames of the same azimuth is the ITD training value for that azimuth, denoted ITD(θ).
1.4) Each single-frame binaural signal of 1.2) is transformed to the frequency domain by the short-time Fourier transform, and the ratio of the left-ear and right-ear magnitude spectra at each frequency bin gives the IID estimate. The mean of the IID estimates over all frames of the same azimuth is the IID training value for that azimuth, denoted IID(ω, θ), where ω is the angular frequency.
2) Localization stage of the test process:
2.1) The test binaural mixed-speech signal is preprocessed into single-frame binaural speech signals, including amplitude normalization, framing with windowing, and endpoint detection.
2.2) The ITD test value of each single-frame signal of 2.1) is computed by the method of 1.3); the distance between the computed test value and the ITD training value of every azimuth from step 1) is evaluated, yielding an azimuth estimate for each frame of the binaural speech signal.
2.3) A histogram of the azimuth estimates of all frames from 2.2) is compiled; the number of sources and the source azimuths in the test mixture are estimated by detecting the peaks of the histogram.
3) Speech separation stage of the test process: using the per-azimuth ITD training values and the per-azimuth, per-frequency IID training values from step 1), the distance between every frequency bin of every frame of the test mixture and each source found in 2.3) is computed. A binary mask is built for each frame's frequency bins by the minimum-distance rule, the bins of every frame are assigned to sources according to the mask, and the frequency components belonging to each azimuth's source are obtained. All frames and all frequency bins belonging to the same source are then reconstructed, separating the test binaural mixture into its differently located sources.
4) Iteration stage:
4.1) The source azimuths are re-estimated by applying step 2) to the separated binaural speech signals of 3), giving corrected azimuth information.
4.2) Using the corrected azimuth information of 4.1), the test binaural mixture is separated again by step 3), giving newly separated per-source data streams.
4.3) Steps 4.1) and 4.2) are repeated; when the iterations end, the multi-source data streams are the final mixed-speech separation result.
In 1.3), the ITD estimate of a single-frame signal is computed by cubic-spline interpolation as follows:
On each interval [k_i, k_(i+1)], R(τ, k) is fitted with a cubic polynomial:
R(τ, x) = a_i·x³ + b_i·x² + c_i·x + d_i
where a_i, b_i, c_i and d_i are undetermined coefficients, and i indexes the coordinate interval of the polynomial fit.
Under the conditions that the second derivative is continuous and vanishes at the boundaries, the three-moment method yields the cross-correlation function R(τ, μ) as a function of delay time, where μ denotes the delay in units of the sampling time and τ denotes the τ-th frame.
The single-frame ITD_τ is then defined as the delay at which R(τ, μ) is maximal:
ITD_τ = argmax_μ R(τ, μ)
Taking the expectation of ITD_τ over all frames of the binaural signals of one azimuth gives the ITD training value of azimuth θ, denoted ITD(θ):
ITD(θ) = mean(ITD_τ).
The detailed procedure of step 3) is as follows:
3.1) For each frequency bin of each frame of the test binaural mixed-speech signal, its distance to each source found in 2.3) is computed, and the bin is assigned to a source:
J(τ, ω) = argmin over l of the (ITD, IID) distance between the bin and azimuth δ(l)
where J(τ, ω) is the index of the source to which the ω-th bin of the τ-th frame belongs, L is the number of sources, and δ(l) is the azimuth of the l-th source.
3.2) A binary mask is built for each source according to the minimum distance:
M_l(τ, ω) = 1 if J(τ, ω) = l, and 0 otherwise.
The binaural signal at every frequency bin of every frame is classified according to the binary mask, giving the per-frame, per-bin data of each differently located source:
X_l(τ, ω) = M_l(τ, ω) · X(τ, ω)
where X_l(τ, ω) denotes the bin data of the l-th source in the τ-th frame.
An inverse short-time Fourier transform (ISTFT) applied to the separated frequency-domain signal of the l-th source gives its τ-th time-domain frame s_l(τ, m).
The window is then removed; with w_H(m) the Hamming window, the de-windowed τ-th frame is obtained.
Overlap-adding the de-windowed frames synthesizes the separated speech signal s_l of the l-th source.
In 2.3), the sources in the mixed speech are estimated as the effective peaks of the histogram; a peak is judged effective when the ratio of its frame count to the total frame count exceeds a threshold.
The number of iterations in step 4) is 3.
Advantageous effects: compared with existing binaural mixed-speech separation techniques, the proposed method based on an iterative structure and azimuth information significantly improves both sound-source localization and separation. Under low signal-to-noise ratio and strong reverberation, the accuracy of binaural multi-source localization is effectively improved. Because separation is driven by the azimuth information of the different sources, the method avoids the inability of single-channel separation methods to separate unvoiced signals; and compared with conventional binaural separation methods, the introduced iterative structure raises multi-source localization accuracy and thereby markedly improves the perceptual quality of the separated speech.
Brief description of the drawings
Fig. 1 is the system block diagram of the algorithm of the invention.
Fig. 2 plots the ITD training value of each azimuth against the azimuth angle.
Fig. 3 is the flow chart of binaural spatial-cue extraction in the invention.
Fig. 4 is a schematic diagram of time-frequency-bin separation of the mixed speech.
Detailed description of the embodiments
The present invention is further explained below with reference to the accompanying drawings.
The invention proposes a binaural mixed-speech separation method based on an iterative structure and azimuth information. The method comprises two stages: a parameter training stage and a test stage with an iterative structure.
Parameter training is performed first. The binaural white-noise signals carrying azimuth information are preprocessed by amplitude normalization, framing with windowing, and endpoint detection. For each preprocessed single-frame binaural signal, the interaural cross-correlation is computed and interpolated with a cubic spline to estimate the interaural time difference (ITD) training value of each azimuth, while the interaural amplitude ratio of the binaural signal gives the IID training value of each azimuth.
In the actual binaural mixed-speech separation stage, separation based on the iterative structure is applied. Under the test conditions, the single-frame ITD test values of the binaural mixture containing differently located sources are computed first, and the number of sources and their azimuths are coarsely estimated from a histogram. Using the estimated number and azimuths, each frame of the binaural mixture is separated into the data streams of the different sources. Spatial localization cues are then extracted anew from each source's data stream, the source azimuths are re-estimated, and with the corrected azimuths the sources are separated again, forming a localization + separation iterative structure. When the iterations end, the reconstructed data stream of each source is obtained, realizing azimuth-based separation of the binaural mixed-speech signal.
Fig. 1 gives the overall flow chart of the binaural mixed-speech separation algorithm based on the iterative structure and azimuth information.
A specific embodiment of the technical solution is described in detail below with reference to the drawings:
1. Parameter training stage:
1.1) The invention trains on directional binaural white-noise signals of known azimuth, generated by convolving head-related impulse response (HRIR, Head Related Impulse Response) data with a monaural white-noise signal. The source azimuth θ is defined as the angle between the projection of the direction vector onto the horizontal plane and the median plane, ranging over [-90°, 90°] in steps of 5°.
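The synthesis of a known-azimuth training signal in step 1.1) can be sketched as follows. The HRIR pair here is a toy stand-in (a pure delay plus attenuation for the far ear); in practice a measured HRIR database entry for the chosen azimuth would be convolved with the noise instead.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000
noise = rng.standard_normal(fs)           # 1 s of monaural white noise

# Toy HRIR pair standing in for a measured database entry: the right-ear
# response is delayed and attenuated relative to the left, mimicking the
# ITD/IID that a real HRIR for a lateral azimuth would encode.
delay = 8                                 # assumed interaural delay in samples
hrir_l = np.zeros(64); hrir_l[0] = 1.0
hrir_r = np.zeros(64); hrir_r[delay] = 0.7

x_l = np.convolve(noise, hrir_l)[:fs]     # left-ear white-noise signal
x_r = np.convolve(noise, hrir_r)[:fs]     # right-ear white-noise signal
binaural = np.stack([x_l, x_r])           # known-azimuth training signal
```

The resulting two-channel signal then enters the preprocessing of step 1.2).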
1.2) The binaural white-noise signals of known azimuth are preprocessed by amplitude normalization, framing with windowing, and endpoint detection.
Because the amplitudes of binaural speech signals differ greatly across environments, amplitude normalization is needed before feature extraction:
x_L = x_L / max(|x_L|, |x_R|)
x_R = x_R / max(|x_L|, |x_R|)
where x_L and x_R are the left-ear and right-ear signals and max(|x_L|, |x_R|) is the maximum amplitude of the two ear signals.
The binaural signals are framed with a frame length of 32 ms and a frame shift of 16 ms.
Each frame is windowed with a Hamming window:
w_H(m) = 0.54 − 0.46·cos(2πm/(N−1)), 0 ≤ m ≤ N−1
where N is the frame length. The sampling rate here is 16 kHz and the frame length 32 ms, so N = 512.
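The framing and windowing of step 1.2) can be sketched as below, a minimal version with the stated 32 ms frames and 16 ms shift (amplitude normalization and endpoint detection are handled separately):

```python
import numpy as np

fs = 16000
frame_len = int(0.032 * fs)   # 32 ms frame -> N = 512 samples
hop = int(0.016 * fs)         # 16 ms frame shift -> 256 samples

def frame_and_window(x):
    """Split a signal into 50%-overlapped frames and apply a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hamming(frame_len)           # 0.54 - 0.46*cos(2*pi*m/(N-1))
    return np.stack([x[t * hop : t * hop + frame_len] * win
                     for t in range(n_frames)])

x = np.random.default_rng(1).standard_normal(fs)   # 1 s test signal
frames = frame_and_window(x)                        # (n_frames, 512)
```

Each row of `frames` is one single-frame binaural channel ready for the ITD/IID feature computation.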
The invention performs endpoint detection with a dynamic double-threshold method: a high and a low threshold are set for both the short-time energy E_τ and the short-time zero-crossing rate Z_τ of the speech signal, where τ is the frame index. The right-ear signal serves as the basis for endpoint detection:
E_τ = Σ_m x_R(τ, m)²
Z_τ = (1/2) Σ_m |sgn(x_R(τ, m)) − sgn(x_R(τ, m−1))|
where x_R(τ, m) is the framed right-ear speech signal and sgn(·) is the sign function.
Dynamic double-threshold endpoint detection proceeds through four states: silence, transition, speech, and end.
Silence: after detection starts, the detector first watches for the signal to leave silence; when the short-time energy or zero-crossing rate of some frame exceeds the low threshold, that frame is marked as the tentative start of speech and the state moves to transition.
Transition: while in transition, the short-time energy and zero-crossing rate continue to be observed. If some frame's energy or zero-crossing rate falls below the low threshold, the state returns to silence; if three consecutive frames have energy or zero-crossing rate above the high threshold, the state moves to speech.
Speech: while the energy or zero-crossing rate stays above the low threshold the signal is in the speech state. When some frame's energy or zero-crossing rate drops below the low threshold, that frame is marked as a suspected endpoint and the detector begins testing whether speech has ended.
End: if the run of frames whose energy or zero-crossing rate is below the low threshold lasts longer than the maximum silence length, speech is judged to have ended at the previously marked endpoint. Otherwise, if a later frame's energy or zero-crossing rate again exceeds the low threshold, the endpoint mark is cancelled and the signal remains in the speech state while detection continues.
The minimum speech length above is the shortest duration a detected speech segment may have; the maximum silence length is the longest pause allowed between two adjacent words.
The high and low thresholds of the short-time energy and zero-crossing rate, denoted E_H, E_L, Z_H and Z_L, are computed from E_max and E_min, the maximum and minimum short-time energy of the original speech signal, and from μ_z and σ_z, the mean and variance of the short-time zero-crossing rate of the first 15 frames, with Z_c an empirical constant, usually taken as 25.
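A sketch of the short-time features and the double-threshold logic follows. It uses illustrative fixed thresholds and only the energy track, and compresses the four-state machine into "trigger on the high threshold, extend the segment while above the low threshold"; the ZCR thresholds, the three-frame confirmation, and the maximum-silence hangover of the full method are omitted.

```python
import numpy as np

def short_time_features(frames):
    """Short-time energy and zero-crossing rate per frame."""
    energy = np.sum(frames ** 2, axis=1)
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)
    return energy, zcr

def double_threshold(energy, e_hi, e_lo):
    """Simplified double-threshold endpoint detection: a frame is speech if
    it exceeds the high threshold, or connects to such a frame through a
    run of frames that stay above the low threshold."""
    speech = energy > e_hi
    for t in range(1, len(energy)):                 # extend runs forward
        speech[t] |= speech[t - 1] and energy[t] > e_lo
    for t in range(len(energy) - 2, -1, -1):        # and backward
        speech[t] |= speech[t + 1] and energy[t] > e_lo
    return speech

energy = np.array([0.1, 0.5, 5.0, 6.0, 0.5, 0.1])   # toy per-frame energies
flags = double_threshold(energy, e_hi=2.0, e_lo=0.3)
```

With the toy energies, the two loud frames trigger detection and the adjacent 0.5-energy frames are absorbed into the speech segment, while the 0.1-energy edges stay silent.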
1.3) For binaural signals of azimuth θ, the single-frame ITD is computed after framing and windowing. First the cross-correlation function R(τ, k) of the left-ear and right-ear signals is computed:
R(τ, k) = Σ_m x_L(τ, m) · x_R(τ, m + k)
where x_L(τ, m) and x_R(τ, m) are the left-ear and right-ear signals of the τ-th frame, m is the sample index, and k is the delay in samples, here ranging over [-16, 16].
Cubic-spline interpolation is applied to the sample-domain cross-correlation R(τ, k), refining the time scale of the lag variable k from 62.5 μs to 1 μs.
The cubic-spline interpolation proceeds as follows. Since R(τ, k) is a sequence of discrete values, on each interval [k_i, k_(i+1)] (with i indexing the coordinate interval of the polynomial fit) R(τ, k) is fitted with a cubic polynomial:
R(τ, x) = a_i·x³ + b_i·x² + c_i·x + d_i
where a_i, b_i, c_i and d_i are undetermined coefficients; under the conditions that the second derivative is continuous and vanishes at the boundaries, the three-moment method determines the coefficients.
Interpolating the sample-lag cross-correlation R(τ, k) in this way yields the cross-correlation function R(τ, μ) based on delay time, where μ denotes the delay in time units.
The single-frame ITD_τ is then defined as the delay at which R(τ, μ) is maximal:
ITD_τ = argmax_μ R(τ, μ)
Taking the expectation of ITD_τ over all frames of the azimuth's binaural signals gives the ITD training value of azimuth θ, denoted ITD(θ):
ITD(θ) = mean(ITD_τ)
Over the azimuth range [-90°, 90°], the relationship between ITD(θ) and the azimuth θ is shown in Fig. 2.
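The single-frame ITD computation of step 1.3) can be sketched with SciPy's natural cubic spline, whose zero-second-derivative boundary condition matches the three-moment solution described above:

```python
import numpy as np
from scipy.interpolate import CubicSpline

fs = 16000
max_lag = 16                                   # k ranges over [-16, 16] samples

def xcorr(xl, xr, k):
    """R(tau, k) = sum_m xL(m) * xR(m + k), over the overlapping samples."""
    if k >= 0:
        return float(np.dot(xl[:len(xl) - k], xr[k:])) if k > 0 else float(np.dot(xl, xr))
    return float(np.dot(xl[-k:], xr[:len(xr) + k]))

def frame_itd_us(xl, xr):
    """Single-frame ITD in microseconds: peak of the interaural
    cross-correlation, refined from the 62.5 us sample grid (1/16 kHz)
    to a 1 us grid by natural cubic-spline interpolation."""
    lags = np.arange(-max_lag, max_lag + 1)
    r = np.array([xcorr(xl, xr, int(k)) for k in lags])
    lag_us = lags * 1e6 / fs                   # lag axis in microseconds
    spline = CubicSpline(lag_us, r, bc_type='natural')
    fine = np.arange(lag_us[0], lag_us[-1] + 1.0, 1.0)   # 1 us grid
    return fine[np.argmax(spline(fine))]

rng = np.random.default_rng(4)
frame_l = rng.standard_normal(512)
frame_r = np.roll(frame_l, 5)                  # right ear lags by 5 samples
itd = frame_itd_us(frame_l, frame_r)           # expect roughly 5/fs = 312.5 us
```

Averaging `frame_itd_us` over all frames of one azimuth gives that azimuth's ITD(θ) training value.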
1.4) For binaural signals of azimuth θ, the single-frame IID training value is computed. The IID is computed in the frequency domain: the framed binaural signals are transformed by the short-time Fourier transform (STFT, Short-Time Fourier Transform):
X_L(τ, ω) = STFT{x_L(τ, m)}, X_R(τ, ω) = STFT{x_R(τ, m)}
where ω denotes the angular frequency, spanning [0, 2π) with spacing 2π/512; X_L(τ, ω) and X_R(τ, ω) are the spectra of the single-frame left-ear and right-ear signals.
The interaural intensity difference of the τ-th frame is defined from the ratio of the left- and right-ear magnitude spectra:
IID(τ, ω) = 20·log10(|X_L(τ, ω)| / |X_R(τ, ω)|)
giving a 256-dimensional IID vector per frame. Taking the expectation of IID(τ, ω) over all frames gives the IID training value of azimuth θ:
IID(θ, ω) = mean(IID(τ, ω))
This establishes the model relating the azimuth θ to the ITD and IID parameters. The extraction of the above feature parameters is shown in Fig. 3.
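The IID training of step 1.4) can be sketched as follows. The original text defines the IID from the magnitude-spectrum ratio; the dB scaling below is the usual convention and should be read as an assumption, and the toy frames simply attenuate the right ear by a fixed factor.

```python
import numpy as np

N = 512                                  # frame length; rfft keeps bins 0..256

def frame_iid_db(xl, xr, eps=1e-12):
    """Per-bin interaural intensity difference of one windowed frame as the
    dB ratio of the left/right magnitude spectra (dB convention assumed)."""
    mag_l = np.abs(np.fft.rfft(xl, N)) + eps
    mag_r = np.abs(np.fft.rfft(xr, N)) + eps
    return 20.0 * np.log10(mag_l / mag_r)

# IID training value of one azimuth: mean IID over all frames of that azimuth
rng = np.random.default_rng(2)
frames_l = rng.standard_normal((40, N))
frames_r = 0.5 * frames_l                # toy frames: right ear 6 dB quieter
iid_train = np.mean([frame_iid_db(l, r) for l, r in zip(frames_l, frames_r)],
                    axis=0)
```

For the toy attenuation of 0.5, every bin of the training vector sits at 20·log10(2) ≈ 6.02 dB, as expected.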
2. Initial localization stage of the iterative test stage
2.1) The test binaural mixed-speech signal in the test process contains multiple sources, each at a different azimuth. The test mixture is preprocessed by amplitude normalization, framing with windowing, and endpoint detection, exactly as described in step 1.2).
2.2) From the test mixture framed as in step 1.2), the ITD(τ) of each test frame is estimated by the method of step 1.3), and the frequency-domain signals X_L(τ, ω) and X_R(τ, ω) of the binaural mixture are computed as in step 1.4).
2.3) The ITD measurement ITD(τ) of each frame of the test mixture is compared against the training values ITD(θ) by a distance computation, and the azimuth estimate θ(τ) of the τ-th test frame is obtained by minimum absolute error:
θ(τ) = argmin_θ |ITD(τ) − ITD(θ)|
where ITD(τ) is the ITD measurement of the τ-th frame of the test mixture and ITD(θ) is the ITD training value of azimuth θ.
2.4) The azimuth estimates of all frames of the test mixture are compiled into a histogram, and a threshold based on the peak distribution selects the effective source azimuths. A peak is judged effective when the ratio of its frame count to the total frame count exceeds the threshold.
The number of effective peaks is the number of sources, and the azimuths of the effective peaks are the source azimuths contained in the test mixture, denoted L and δ(l) respectively, with l = 1, 2, ..., L.
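Steps 2.3)-2.4) can be sketched as below; the 10 % frame-ratio threshold is illustrative, since the patent leaves its value open.

```python
import numpy as np

def detect_sources(frame_azimuths, ratio=0.1):
    """Histogram the per-frame azimuth estimates on the 5-degree training
    grid and return the azimuths of the effective peaks: local maxima
    whose frame count exceeds `ratio` of the total frame count."""
    grid = np.arange(-90, 91, 5)
    counts = np.array([int(np.sum(frame_azimuths == a)) for a in grid])
    total = len(frame_azimuths)
    padded = np.concatenate(([-1], counts, [-1]))   # so edge bins can be peaks
    return [int(grid[i]) for i in range(len(grid))
            if padded[i + 1] > padded[i] and padded[i + 1] > padded[i + 2]
            and counts[i] / total > ratio]

# Toy frame estimates: two genuine sources plus a few stray frames
frames = np.array([-30] * 50 + [20] * 40 + [0] * 3 + [5] * 2)
sources = detect_sources(frames)
```

The stray frames at 0° and 5° fall under the frame-ratio threshold, so only the two genuine azimuths survive as effective peaks.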
3. Speech separation stage of the test process
3.1) Using the number of sources and the azimuth information obtained in step 2), the distance of every frequency bin of every frame of the test mixture to each source is computed, and the bin is assigned to a source:
J(τ, ω) = argmin over l of the (ITD, IID) distance between the bin and azimuth δ(l)
where J(τ, ω) is the index of the source to which the ω-th bin of the τ-th frame belongs; the numerator of the distance expresses, for a given azimuth δ(l), the deviation of the bin's binaural cues from that azimuth's IID and ITD values.
3.2) A binary mask is built for each source from the minimum distance:
M_l(τ, ω) = 1 if J(τ, ω) = l, and 0 otherwise.
The binaural signal at every frequency bin of every frame is classified according to the binary mask, giving the per-frame, per-bin data of each differently located source:
X_l(τ, ω) = M_l(τ, ω) · X(τ, ω)
where X_l(τ, ω) denotes the bin data of the l-th source in the τ-th frame. The separation of the mixture's time-frequency bins is shown in Fig. 4.
An inverse short-time Fourier transform (ISTFT, Inverse Short-Time Fourier Transform) applied to the separated frequency-domain signal of the l-th source gives its τ-th time-domain frame s_l(τ, m).
After conversion to the time domain, the window is removed; with w_H(m) the Hamming window given above, the de-windowed τ-th frame is obtained.
Overlap-adding the de-windowed frames synthesizes the separated speech signal s_l of the l-th source, realizing the separation of the differently located source signals.
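The masking and resynthesis of step 3.2) can be sketched as below. Rather than literally dividing each frame by the window (ill-conditioned near the window's small edge values), the sketch uses weighted overlap-add with a squared-window normalizer, which realizes the same un-windowing in a numerically safe way; treat that substitution as an assumption.

```python
import numpy as np

N, hop = 512, 256
win = np.hamming(N)

def resynthesize(X, masks):
    """Apply one binary mask per source to the mixture STFT `X`
    (shape: frames x 257), inverse-FFT each masked frame, and
    overlap-add with squared-window normalization."""
    n_frames = X.shape[0]
    out = []
    for M in masks:
        frames = np.fft.irfft(X * M, N)            # masked frames, time domain
        sig = np.zeros(hop * (n_frames - 1) + N)
        norm = np.zeros_like(sig)
        for t in range(n_frames):
            sig[t * hop : t * hop + N] += frames[t] * win
            norm[t * hop : t * hop + N] += win ** 2
        out.append(sig / np.maximum(norm, 1e-8))
    return out

rng = np.random.default_rng(3)
x = rng.standard_normal(4 * N)
T = 1 + (len(x) - N) // hop
X = np.fft.rfft(np.stack([x[t * hop : t * hop + N] * win for t in range(T)]),
                axis=1)
y = resynthesize(X, [np.ones((T, N // 2 + 1))])[0]   # all-pass mask
```

With an all-pass mask the pipeline reconstructs the input exactly, which verifies that only the masking, not the analysis/synthesis chain, alters the separated signal.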
4. Iteration stage:
4.1) For the per-source binaural speech signals obtained in step 3.2), the source azimuth information is re-estimated by step 2), giving corrected source azimuths.
4.2) With the corrected azimuths of step 4.1), the mixed binaural speech is separated again by step 3) using the spatial cues, giving the differently located source streams of the second separation.
4.3) Steps 4.1) and 4.2) are repeated, forming a localization + separation iterative structure; the number of iterations is set to 3. When the iterations end, the multi-source data streams are the final separation result of the test binaural mixture. In the present invention the iteration count of 3 was obtained from the results of the full simulation tests: after 3 iterations the localization results essentially stabilize.
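The localization + separation loop of step 4) has the following shape; the `locate` and `separate` callables stand for steps 2) and 3) and are placeholders supplied by the caller, not part of the patent's notation.

```python
def iterative_separation(mixture, locate, separate, n_iter=3):
    """Run the positioning+separation iteration: localize on the mixture
    once, then alternately re-localize on the separated streams and
    re-separate the original mixture, n_iter passes in total."""
    azimuths = locate([mixture])               # initial coarse localization
    streams = separate(mixture, azimuths)
    for _ in range(n_iter - 1):
        azimuths = locate(streams)             # corrected azimuths
        streams = separate(mixture, azimuths)  # separate the mixture again
    return streams

# Toy stand-ins that just record how often each step runs
calls = {"locate": 0, "separate": 0}
def toy_locate(signals):
    calls["locate"] += 1
    return [-30, 20]
def toy_separate(mixture, azimuths):
    calls["separate"] += 1
    return [f"stream@{a}" for a in azimuths]

result = iterative_separation("mixture", toy_locate, toy_separate)
```

With the default of three passes, localization and separation each run three times and the streams of the last pass are returned as the final result.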
The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.
Claims (5)
1. A binaural mixed-speech separation method based on an iterative structure, characterized by comprising the steps of:
1) Parameter training stage:
1.1) Training with directional binaural white-noise signals, each binaural white-noise signal being generated by convolving head-related impulse response (HRIR) data with a monaural white-noise signal; the sound-source azimuth angle θ is defined as the angle between the projection of the direction vector onto the horizontal plane and the median vertical plane, sampled in 5° steps over the range [-90°, 90°];
1.2) Pre-processing the binaural white-noise signals of 1.1) to obtain single-frame binaural signals after framing; the pre-processing comprises amplitude normalization, framing with windowing, and endpoint detection;
1.3) Computing the cross-correlation function of each single-frame binaural signal obtained in 1.2), interpolating the cross-correlation function with cubic spline functions, and calculating the ITD estimate of the single-frame binaural signal; the mean of the ITD estimates over all frames of the same azimuth is taken as the ITD training value of that azimuth, denoted ITD(θ);
1.4) Applying a short-time Fourier transform to each single-frame binaural signal obtained in 1.2) to transform it to the frequency domain, and computing the ratio of the left-ear and right-ear magnitude spectra at each frequency bin to obtain IID estimates; the mean of the IID estimates over all frames of the same azimuth is taken as the IID training value of that azimuth, denoted IID(ω, θ), where ω is the angular frequency;
2) Localization stage of the test process:
2.1) Pre-processing the test binaural mixed-speech signal to obtain single-frame binaural speech signals; the pre-processing comprises amplitude normalization, framing with windowing, and endpoint detection;
2.2) Calculating the ITD test value of each single-frame binaural speech signal obtained in 2.1) by the method of 1.3), computing the distance between each ITD test value and the ITD training value of each azimuth from step 1), and thereby obtaining the azimuth estimate of each frame of the binaural speech signal;
2.3) Performing histogram statistics on the azimuth information of all frames obtained in 2.2), and estimating the number and azimuths of the sound sources in the test binaural mixed-speech signal by detecting the peaks of the histogram;
3) Speech separation stage of the test process: using the per-azimuth ITD training values and the per-azimuth, per-frequency IID training values obtained in 1), computing the distance between each frequency bin of each frame of the test binaural mixed-speech signal and each sound source obtained in 2.3); establishing a binary mask for each frequency bin of each frame according to the minimum-distance principle, classifying the signal at each frequency bin of each frame according to the binary mask to obtain the frequency components corresponding to the sound sources at different azimuths, and reconstructing the signal from all frames and all frequency bins belonging to the same sound source, thereby separating the test binaural mixed-speech signal into the sound sources at different azimuths;
4) Iteration stage:
4.1) Re-estimating, by stage 2), the sound-source azimuth information of the separated signals obtained in 3), yielding corrected sound-source azimuth information;
4.2) Separating the test binaural mixed speech again, by stage 3), according to the corrected sound-source azimuth information obtained in 4.1), yielding re-separated data streams for the sound sources at different azimuths;
4.3) Repeating 4.1) and 4.2) iteratively; after the iteration, the multi-source data streams are the final separation result of the test binaural mixed-speech signal.
2. The binaural mixed-speech separation method according to claim 1, characterized in that the process of obtaining the ITD estimate of a single-frame binaural signal by cubic-spline interpolation in 1.3) is as follows:
On the interval [k_i, k_(i+1)], R(τ, k) is fitted with a cubic polynomial, namely:
R(τ, k) = a_i k^3 + b_i k^2 + c_i k + d_i
where a_i, b_i, c_i and d_i are undetermined coefficients, k is the delay in sampling points, and i indexes the coordinate interval of the i-th polynomial piece;
Under the boundary conditions that the second derivative is continuous and vanishes at the end points, the three-moment method is used to solve for the cross-correlation function R(τ, μ) as a function of delay time, where μ denotes the delay in sampling time and τ denotes the τ-th frame;
The ITD_τ of the single-frame binaural signal is then defined as the delay at which the cross-correlation function R(τ, μ) attains its maximum:
ITD_τ = argmax_μ R(τ, μ)
Taking the expectation of ITD_τ over all frames of the binaural signals at the same azimuth gives the ITD training value of azimuth θ, denoted ITD(θ), namely:
ITD(θ) = mean(ITD_τ).
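Claim 2's boundary conditions (continuous second derivative, zero at the end points) are exactly those of a natural cubic spline, which the three-moment method solves. The pure-NumPy sketch below is a hedged illustration, not the patent's code: it solves the three-moment tridiagonal system for the spline through the discrete cross-correlation samples and searches a fine delay grid for the maximum that defines ITD_τ (returned here in sample units; the function name and the `oversample` parameter are assumptions).

```python
import numpy as np

def natural_spline_max(lags, corr, oversample=100):
    """Three-moment natural cubic spline through (lags, corr); returns the
    (possibly fractional) delay at which the interpolated R is maximal."""
    n = len(lags) - 1
    h = lags[1] - lags[0]                     # assumes uniform lag spacing
    # Tridiagonal system for the moments M_i (second derivatives); M_0 = M_n = 0
    A = np.zeros((n - 1, n - 1))
    np.fill_diagonal(A, 4.0)
    np.fill_diagonal(A[1:], 1.0)              # sub-diagonal
    np.fill_diagonal(A[:, 1:], 1.0)           # super-diagonal
    rhs = 6.0 * (corr[:-2] - 2 * corr[1:-1] + corr[2:]) / h**2
    M = np.zeros(n + 1)
    M[1:-1] = np.linalg.solve(A, rhs)
    # Evaluate each cubic piece on a fine grid and track the global maximum
    best_x, best_y = lags[0], corr[0]
    for i in range(n):
        x = np.linspace(lags[i], lags[i + 1], oversample)
        t0, t1 = lags[i + 1] - x, x - lags[i]
        y = (M[i] * t0**3 + M[i + 1] * t1**3) / (6 * h) \
            + (corr[i] - M[i] * h**2 / 6) * t0 / h \
            + (corr[i + 1] - M[i + 1] * h**2 / 6) * t1 / h
        j = np.argmax(y)
        if y[j] > best_y:
            best_x, best_y = x[j], y[j]
    return best_x
```

Dividing the returned lag by the sampling rate converts it to ITD_τ in seconds.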
3. The binaural mixed-speech separation method according to claim 1, characterized in that the detailed process of step 3) is as follows:
3.1) For each frequency bin of each frame of the test binaural mixed-speech signal, compute its distance to each of the sound sources obtained in 2.3) so as to classify the bin by sound source, where J(τ, ω) denotes the index of the sound source to which the ω-th frequency bin of the τ-th frame belongs, L is the number of sound sources, δ(l) is the azimuth of the l-th sound source, and X_L(τ, ω) and X_R(τ, ω) are the spectra of the single-frame left-ear and right-ear signals, respectively;
3.2) Establish a binary mask for each sound source according to the minimum distance values, and classify the binaural speech signal at each frequency bin of each frame according to the binary masks, obtaining the per-frame, per-bin data corresponding to the sound source at each azimuth, i.e. the frequency-bin data of the τ-th frame of the l-th sound source;
Apply the inverse short-time Fourier transform (ISTFT) to the separated frequency-domain signal of the l-th sound source to obtain the time-domain signal of the τ-th frame of source l, m denoting the sample index within the frame;
Then remove the analysis window from each frame, where w_H(m) is the Hamming window and N is the frame length;
Finally, overlap-add the de-windowed frames; the synthesis yields the separated speech signal s_l of the l-th sound source.
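The mask-and-resynthesize steps of claim 3 might be sketched as follows. `binary_masks` implements the minimum-distance assignment J(τ, ω) of step 3.1), given a precomputed distance array; `overlap_add` is a standard weighted-overlap-add stand-in for the claim's window removal and frame summation. Function names, the distance-array layout, and the normalization scheme are illustrative assumptions, not the patent's exact procedure.

```python
import numpy as np

def binary_masks(dist):
    """dist[l, t, f]: distance of time-frequency bin (t, f) to source l.
    The minimum-distance source claims the bin (J(tau, omega) in step 3.1)."""
    owner = np.argmin(dist, axis=0)
    return [(owner == l).astype(float) for l in range(dist.shape[0])]

def overlap_add(frames, hop, win):
    """Sum frames at hop-sized offsets and divide by the summed window to
    undo it -- a common stand-in for 'remove window, then overlap-add'."""
    n = len(win)
    out = np.zeros(hop * (len(frames) - 1) + n)
    norm = np.zeros_like(out)
    for t, frame in enumerate(frames):
        out[t * hop:t * hop + n] += frame
        norm[t * hop:t * hop + n] += win      # accumulated window weight
    return out / np.maximum(norm, 1e-8)       # avoid division by zero at edges
```

Multiplying each mask by the mixture STFT and overlap-adding the inverse-transformed frames of one source yields that source's separated signal s_l.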
4. The binaural mixed-speech separation method according to claim 1, characterized in that in 2.3) the sound sources in the mixed speech are estimated from the effective peaks of the histogram; a peak is judged effective when the ratio of its frame count to the total number of frames is greater than a threshold.
5. The binaural mixed-speech separation method according to claim 1, characterized in that the number of iterations in step 4) is 3.
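The fixed-count refinement of the iteration stage (step 4 of claim 1, with claim 5 fixing three iterations) reduces to a short loop. `localize` and `separate` stand in for stages 2) and 3) and are hypothetical caller-supplied interfaces, not the patent's API:

```python
def iterative_separation(mixture, localize, separate, n_iter=3):
    """Alternate localization (stage 2) and separation (stage 3):
    4.1) re-estimate each stream's azimuth, 4.2) re-separate the mixture
    with the corrected azimuths, 4.3) repeat n_iter times (claim 5: 3)."""
    azimuths = localize(mixture)              # initial localization
    streams = separate(mixture, azimuths)     # initial separation
    for _ in range(n_iter):
        corrected = [localize(s)[0] for s in streams]   # step 4.1
        streams = separate(mixture, corrected)          # step 4.2
    return streams                            # final separation result
```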
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610824648.XA CN106373589B (en) | 2016-09-14 | 2016-09-14 | A kind of ears mixing voice separation method based on iteration structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106373589A CN106373589A (en) | 2017-02-01 |
CN106373589B true CN106373589B (en) | 2019-07-26 |
Family
ID=57896703
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610824648.XA Active CN106373589B (en) | 2016-09-14 | 2016-09-14 | A kind of ears mixing voice separation method based on iteration structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106373589B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107942290B (en) * | 2017-11-16 | 2019-10-11 | 东南大学 | Binaural sound sources localization method based on BP neural network |
CN108091345B (en) * | 2017-12-27 | 2020-11-20 | 东南大学 | Double-ear voice separation method based on support vector machine |
CN108647556A (en) * | 2018-03-02 | 2018-10-12 | 重庆邮电大学 | Sound localization method based on frequency dividing and deep neural network |
CN109949821B (en) * | 2019-03-15 | 2020-12-08 | 慧言科技(天津)有限公司 | Method for removing reverberation of far-field voice by using U-NET structure of CNN |
CN110491410B (en) * | 2019-04-12 | 2020-11-20 | 腾讯科技(深圳)有限公司 | Voice separation method, voice recognition method and related equipment |
CN110275138B (en) * | 2019-07-16 | 2021-03-23 | 北京工业大学 | Multi-sound-source positioning method using dominant sound source component removal |
CN110491412B (en) * | 2019-08-23 | 2022-02-25 | 北京市商汤科技开发有限公司 | Sound separation method and device and electronic equipment |
CN111539449B (en) * | 2020-03-23 | 2023-08-18 | 广东省智能制造研究所 | Sound source separation and positioning method based on second-order fusion attention network model |
CN113079452B (en) * | 2021-03-30 | 2022-11-15 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio processing method, audio direction information generating method, electronic device, and medium |
CN113782047B (en) * | 2021-09-06 | 2024-03-08 | 云知声智能科技股份有限公司 | Voice separation method, device, equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102438189A (en) * | 2011-08-30 | 2012-05-02 | Southeast University | Dual-channel acoustic signal-based sound source localization method |
CN102565759A (en) * | 2011-12-29 | 2012-07-11 | Southeast University | Binaural sound source localization method based on sub-band signal to noise ratio estimation |
CN103983946A (en) * | 2014-05-23 | 2014-08-13 | Beijing Shenzhou Puhui Technology Co., Ltd. | Method for processing signals of multiple measuring channels in sound source localization process |
CN104464750A (en) * | 2014-10-24 | 2015-03-25 | Southeast University | Voice separation method based on binaural sound source localization |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6253226B2 (en) * | 2012-10-29 | 2017-12-27 | Mitsubishi Electric Corporation | Sound source separation device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||