CN106373589B - A binaural mixed speech separation method based on an iterative structure - Google Patents

A binaural mixed speech separation method based on an iterative structure

Info

Publication number
CN106373589B
CN106373589B (application CN201610824648.XA)
Authority
CN
China
Prior art keywords
binaural
sound source
frame
signal
ITD
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610824648.XA
Other languages
Chinese (zh)
Other versions
CN106373589A (en)
Inventor
周琳
李楠
束佳明
吴镇扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201610824648.XA priority Critical patent/CN106373589B/en
Publication of CN106373589A publication Critical patent/CN106373589A/en
Application granted granted Critical
Publication of CN106373589B publication Critical patent/CN106373589B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 Voice signal separating
    • G10L 21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)

Abstract

The invention discloses a binaural mixed speech separation method based on an iterative structure. Binaural spatial cues, namely the interaural time difference (ITD) and the interaural intensity difference (IID), are used to perform an initial localization of the multiple sound sources contained in the mixed speech. Taking the initially estimated number of sources and the azimuth information of each source as the separation criterion, the data stream of each source is separated and reconstructed on the basis of the azimuth information. The source azimuths are then re-estimated from the reconstructed speech signals, and the mixed speech is separated again with the corrected azimuth information. After several such iterations, the source data streams reconstructed in the last iteration are taken as the final separation result. Compared with conventional binaural speech separation methods, the proposed iterative, spatial-information-based binaural speech separation method significantly improves the perceptual quality of the separated speech under low signal-to-noise ratio and strong reverberation conditions.

Description

A binaural mixed speech separation method based on an iterative structure
Technical field
The present invention relates to the fields of sound source localization and speech separation, and in particular to a binaural mixed speech separation method based on an iterative structure and spatial azimuth information.
Background art
Binaural mixed speech separation is an emerging interdisciplinary field involving artificial intelligence, auditory psychology, auditory physiology, signal processing and several other research areas. With the rapid development of modern science and technology, speech separation has wide practical applications in many fields. For example, in a video conference, spatial information can be added to the voices of multiple speakers so that the speech of the main speaker can be separated and enhanced; the technique can also be used in hearing aids to help hearing-impaired listeners focus their attention on a single speaker. Studying binaural mixed speech separation is therefore of important theoretical and practical value for improving the robustness of speech signal processing and for solving the mixed speech separation problem in complex acoustic environments known as the "cocktail party effect".
Summary of the invention
Object of the invention: in order to overcome the deficiencies of the prior art, the present invention provides a binaural mixed speech separation method based on an iterative structure and spatial azimuth information. Sound source localization and speech separation are combined: the spatial azimuth information of the sound sources is used as the characteristic parameter for speech separation, and the separated speech in turn improves the localization performance, forming an iterative localization-and-separation structure that improves the performance of mixed speech separation based on spatial information.
Technical solution: the present invention provides a binaural mixed speech separation method based on an iterative structure, comprising the steps of:
1) Parameter training stage:
1.1) Training is performed with directional binaural white-noise signals. Each binaural white-noise signal is a signal of known azimuth generated by convolving head-related impulse response (HRIR) data with a monaural white-noise signal. The source azimuth θ is defined as the angle between the projection of the direction vector onto the horizontal plane and the median plane, ranging over [-90°, 90°] in steps of 5°.
1.2) The binaural white-noise signals of 1.1) are pre-processed to obtain framed single-frame binaural signals; the pre-processing comprises amplitude normalization, framing, windowing and endpoint detection.
1.3) The cross-correlation function of the single-frame binaural signals obtained in 1.2) is computed and interpolated with a cubic spline function, and the ITD estimate of each single-frame binaural signal is calculated. The mean of the ITD estimates over all frames of the same azimuth is taken as the ITD training value of that azimuth, denoted ITD(θ).
1.4) A short-time Fourier transform is applied to the single-frame binaural signals obtained in 1.2) to transform them to the frequency domain, and the ratio between the magnitude spectra of the left-ear and right-ear signals is computed at each frequency bin to obtain the IID estimates. The mean of the IID estimates over all frames of the same azimuth is taken as the IID training value of that azimuth, denoted IID(ω, θ), where ω is the angular frequency.
2) Localization stage of the test process:
2.1) The test binaural mixed speech signal is pre-processed to obtain single-frame binaural speech signals; the pre-processing comprises amplitude normalization, framing, windowing and endpoint detection.
2.2) For each single-frame binaural speech signal obtained in 2.1), the ITD test value is calculated with the method of 1.3); the distance between the computed ITD test value and the ITD training value of each azimuth from step 1) is then evaluated, yielding the azimuth estimate of each frame of the binaural speech signal.
2.3) A histogram of the azimuth estimates of all frames obtained in 2.2) is computed, and the number of sound sources and the source azimuths in the test binaural mixed speech signal are estimated by detecting the peaks of the histogram.
3) Speech separation stage of the test process: using the ITD training values of each azimuth and the IID training values of each azimuth at the different frequency bins obtained in 1), the distance between each frequency bin of each frame of the test binaural mixed speech signal and each sound source obtained in 2.3) is calculated. A binary mask is established for each frequency bin of each frame according to the minimum-distance principle, and each frequency bin of each frame is assigned to a source according to the binary mask, yielding the frequency components corresponding to the sources at different azimuths. All frames and all frequency bins belonging to the same source are reconstructed, realizing the separation of the test binaural mixed speech signal into the sources at different azimuths.
4) Iteration stage:
4.1) For the binaural speech signals of the different-azimuth sources obtained in 3), the source azimuth information is re-estimated by 2), yielding corrected source azimuth information.
4.2) According to the corrected source azimuth information obtained in 4.1), the test binaural mixed speech signal is separated again by 3), yielding the re-separated data streams of the different-azimuth sources.
4.3) Steps 4.1) and 4.2) are repeated iteratively; after the iterations, the data streams of the sources constitute the final mixed speech separation result.
The process in 1.3) of calculating the ITD estimate of a single-frame signal using cubic spline interpolation is as follows:
On the interval [k_i, k_(i+1)], R(τ, k) is fitted with a cubic polynomial, namely:
R(τ, k) = a_i·k³ + b_i·k² + c_i·k + d_i
where a_i, b_i, c_i and d_i are coefficients to be determined, and i denotes the i-th coordinate interval of the polynomial fit.
Using the conditions that the second derivative is continuous and that the second derivative is zero at the boundaries, the coefficients are solved with the three-moment method, yielding the cross-correlation function R(τ, μ) as a function of delay time, where μ denotes the delay in sampling time and τ denotes the τ-th frame.
The single-frame ITD, ITD_τ, is then defined as the delay corresponding to the maximum of the cross-correlation function R(τ, μ).
The ITD_τ of all frames of the binaural signals of one azimuth are averaged to obtain the ITD training value of azimuth θ, denoted ITD(θ), namely:
ITD(θ) = mean(ITD_τ).
The detailed process of step 3) described above is as follows:
3.1) For each frequency bin of each frame of the test binaural mixed speech signal, its distance to each sound source obtained in 2.3) is calculated, and the frequency bin is assigned to a sound source accordingly:
where J(τ, ω) denotes the index of the sound source to which the ω-th frequency bin of the τ-th frame belongs, L is the number of sound sources, and δ(l) is the azimuth of the l-th sound source.
3.2) A binary mask is established for each sound source according to the minimum distance value.
The binaural signals of each frequency bin of each frame are classified according to the binary mask, yielding the frame and frequency-bin data corresponding to the sources at different azimuths, as shown in the following formula,
where the masked spectrum denotes the frequency-bin data of the τ-th frame of the l-th source.
An inverse short-time Fourier transform (ISTFT) is applied to the frequency-domain signal of the l-th separated source to obtain the τ-th frame time-domain signal s_l(τ, m) of source l,
where s_l(τ, m) denotes the τ-th frame time-domain signal of the l-th source.
The window is then removed; the τ-th frame signal after de-windowing is given by the following formula, where w_H(m) is the Hamming window.
The de-windowed frames are combined by overlap-add to synthesize the separated speech signal s_l of the l-th source.
The sound sources estimated in 2.3) from the mixed speech are the valid peaks of the histogram; the criterion for a valid peak is that the ratio of its frame count to the total frame count exceeds a threshold.
The number of iterations in 4) is 3.
Beneficial effects: compared with existing binaural mixed speech separation techniques, the binaural mixed speech separation method based on an iterative structure and azimuth information proposed by the present invention significantly improves the performance of sound source localization and separation. Under low signal-to-noise ratio and strong reverberation conditions, the accuracy of multi-source localization based on binaural signals is effectively improved. Meanwhile, the speech separation method based on the iterative structure separates speech according to the azimuth information of the different sources, overcoming the inability of single-channel speech separation methods to separate unvoiced signals; and compared with conventional binaural speech separation methods, the introduced iterative structure improves the accuracy of multi-source localization and therefore significantly improves the perceptual quality of the separated speech.
Brief description of the drawings
Fig. 1 is the system block diagram of the algorithm of the present invention.
Fig. 2 shows the relationship between the ITD training values of each azimuth obtained by the present invention and the azimuth angle.
Fig. 3 is the flowchart of binaural spatial cue extraction in the present invention.
Fig. 4 is a schematic diagram of the time-frequency bin separation of the mixed speech in the present invention.
Detailed description of the embodiments
The present invention is further explained below with reference to the accompanying drawings.
The present invention proposes a binaural mixed speech separation method based on an iterative structure and azimuth information. The method mainly comprises two stages: a parameter training stage and a test stage with an iterative structure.
The present invention first performs parameter training. The binaural white-noise signals containing azimuth information are pre-processed, including amplitude normalization, framing, windowing and endpoint detection. The binaural cross-correlation function of the pre-processed single-frame binaural signals is computed and interpolated with a cubic spline, which is used to estimate the interaural time difference (ITD) training value of each azimuth; at the same time, the interaural amplitude ratio of the binaural signals is computed to obtain the IID training value of each azimuth.
In the actual binaural mixed speech separation stage, the speech separation based on the iterative structure is used. Under the test environment, the single-frame ITD test values of the binaural mixed speech signal containing sources at different azimuths are calculated first, and the number of sources and their azimuth information are preliminarily estimated by histogram statistics. Using the estimated number of sources and azimuths, the binaural mixed speech signal of each frame is separated, yielding the data stream of each source. Spatial localization cues are then extracted again from the data stream of each source, the source azimuth information is re-estimated, and the corrected source azimuths are used to separate the sources again, forming a localization-plus-separation iterative structure. After the iterations, the reconstructed data stream corresponding to each source is obtained, realizing the binaural mixed speech signal separation based on azimuth information.
Fig. 1 gives the overall flowchart of the binaural mixed speech separation algorithm based on the iterative structure and azimuth information. The specific embodiment of the technical solution of the present invention is described in detail below with reference to the accompanying drawings:
1. Parameter training stage:
1.1) Directional binaural white-noise signals are used for training in the present invention. Each binaural white-noise signal is a signal of known azimuth generated by convolving head-related impulse response (HRIR, Head Related Impulse Response) data with a monaural white-noise signal. The source azimuth θ is defined as the angle between the projection of the direction vector onto the horizontal plane and the median plane, ranging over [-90°, 90°] in steps of 5°.
1.2) The binaural white-noise signals of known azimuth are pre-processed, including amplitude normalization, framing, windowing and endpoint detection.
Since the amplitudes of binaural speech signals recorded in different environments differ greatly, amplitude normalization is required before feature extraction; its expression is:
x_L = x_L / max(x_L, x_R)
x_R = x_R / max(x_L, x_R)
where x_L and x_R denote the left-ear and right-ear signals respectively, and max(x_L, x_R) denotes the maximum amplitude of the left-ear and right-ear signals.
The present invention divides the binaural signals into frames with a frame length of 32 ms and a frame shift of 16 ms.
The framed binaural signals are windowed with a Hamming window; the Hamming window is expressed as:
where N is the frame length. The sampling rate in the present invention is 16 kHz and the frame length is 32 ms, so N is 512.
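For illustration only, the following is a minimal NumPy sketch of the pre-processing described above (joint amplitude normalization, 32 ms framing with a 16 ms shift, Hamming windowing). The function and variable names are illustrative assumptions and are not part of the patent.

import numpy as np

def preprocess_binaural(x_left, x_right, fs=16000, frame_ms=32, shift_ms=16):
    """Amplitude-normalize, frame and Hamming-window a binaural signal pair."""
    # Joint amplitude normalization: divide both channels by the common maximum amplitude.
    peak = max(np.max(np.abs(x_left)), np.max(np.abs(x_right)))
    x_left, x_right = x_left / peak, x_right / peak

    frame_len = int(fs * frame_ms / 1000)    # 512 samples at 16 kHz
    frame_shift = int(fs * shift_ms / 1000)  # 256 samples at 16 kHz
    window = np.hamming(frame_len)

    n_frames = 1 + (len(x_left) - frame_len) // frame_shift
    frames_l = np.empty((n_frames, frame_len))
    frames_r = np.empty((n_frames, frame_len))
    for t in range(n_frames):
        start = t * frame_shift
        frames_l[t] = x_left[start:start + frame_len] * window
        frames_r[t] = x_right[start:start + frame_len] * window
    return frames_l, frames_r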
The present invention performs endpoint detection with a dynamic double-threshold method: one high and one low threshold are set for the short-time energy E_τ and the short-time zero-crossing rate Z_τ of the speech signal, where τ denotes the frame index. In the present invention the right-ear signal is used as the basis for endpoint detection,
where x_R(τ, m) is the framed right-ear speech signal and sgn(·) is the sign function.
The dynamic double-threshold endpoint detection distinguishes four stages: silence, transition, speech, and ending.
Silence: after endpoint detection starts, it is first checked whether the signal is in the silence stage. When the short-time energy or zero-crossing rate of a frame exceeds the low threshold, that frame is marked as a possible starting point of speech and the detection enters the transition stage.
Transition: while in the transition stage, the short-time energy and zero-crossing rate continue to be observed. If they fall below the low threshold for a frame, the detection returns to the silence stage; if the short-time energy or zero-crossing rate stays above the high threshold for three consecutive frames, the speech is considered to have entered the speech stage.
Speech: while the short-time energy or zero-crossing rate is above the low threshold, the signal is in the speech stage. If the short-time energy or zero-crossing rate of a frame falls below the low threshold, that frame is marked as a suspected endpoint and the detector starts to check whether the speech has ended.
Ending: if the number of consecutive frames whose short-time energy or zero-crossing rate is below the low threshold exceeds the maximum silence length, the speech is considered to have ended and the previously marked frame is taken as the endpoint. Otherwise, if the number of such frames is smaller than the maximum silence length and a later frame again exceeds the low threshold, the endpoint mark is cancelled, the signal is still considered to be in the speech stage, and endpoint detection continues.
The minimum speech length mentioned above refers to the minimum length of an identified speech segment, i.e. the shortest duration of a speech segment; the maximum silence length refers to the longest pause allowed between two adjacent words of the speech.
The high and low thresholds of the short-time energy and zero-crossing rate are calculated as follows:
where E_H, E_L, Z_H and Z_L are the high and low thresholds of the short-time energy and zero-crossing rate respectively, E_max and E_min are the maximum and minimum short-time energy of the original speech signal, μ_z and σ_z are the mean and variance of the short-time zero-crossing rate of the first 15 frames, and Z_c is an empirical constant, usually taken as 25.
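A simplified sketch of the dynamic double-threshold endpoint detection described above is given below. Since the threshold formulas appear only as figures in the original text, the thresholds are passed in as parameters, and the frame counts and function names are illustrative assumptions.

import numpy as np

def short_time_energy(frame):
    return np.sum(frame ** 2)

def zero_crossing_rate(frame):
    return np.sum(np.abs(np.diff(np.sign(frame)))) / 2

def detect_endpoints(frames, e_low, e_high, z_low, z_high,
                     min_speech_frames=5, max_silence_frames=10):
    """Return (start, end) frame indices found by a double-threshold state machine."""
    state, start, end = "silence", None, None
    silence_run, active_run = 0, 0
    for t, frame in enumerate(frames):
        e, z = short_time_energy(frame), zero_crossing_rate(frame)
        above_low = e > e_low or z > z_low
        above_high = e > e_high or z > z_high
        if state == "silence":
            if above_low:
                state, start = "transition", t
                active_run = 1 if above_high else 0
        elif state == "transition":
            if not above_low:
                state, active_run = "silence", 0
            elif above_high:
                active_run += 1
                if active_run >= 3:           # three consecutive frames above the high threshold
                    state = "speech"
            else:
                active_run = 0
        elif state == "speech":
            if not above_low:
                silence_run += 1
                if end is None:
                    end = t                   # suspected endpoint
                if silence_run > max_silence_frames:
                    break                     # silence lasted too long: speech has ended
            else:
                silence_run, end = 0, None    # cancel the suspected endpoint
    if start is not None and end is not None and end - start < min_speech_frames:
        return None, None                     # shorter than the minimum speech length
    return start, end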
1.3) For the binaural signals of azimuth θ, the single-frame ITD is calculated after framing and windowing. First the cross-correlation function R(τ, k) of the left-ear and right-ear signals is computed,
where x_L(τ, m) and x_R(τ, m) denote the left-ear and right-ear signals of the τ-th frame respectively, m denotes the sample index, and k is the delay in samples, with a value range of [-16, 16] here.
Cubic spline interpolation is applied to the sample-based cross-correlation function R(τ, k), raising the time resolution of the delay variable k from 62.5 μs to 1 μs.
The cubic spline interpolation proceeds as follows. Since R(τ, k) is a series of discrete values, on the interval [k_i, k_(i+1)], where i denotes the i-th coordinate interval of the polynomial fit, R(τ, k) is fitted with a cubic polynomial, namely:
R(τ, k) = a_i·k³ + b_i·k² + c_i·k + d_i
where a_i, b_i, c_i and d_i are coefficients to be determined; using the conditions that the second derivative is continuous and is zero at the boundaries, the coefficients are solved with the three-moment method.
In this way the cross-correlation function R(τ, k), defined at integer sample delays, is interpolated into the cross-correlation function R(τ, μ) defined over continuous delay times, where μ denotes the delay in sampling time.
The single-frame ITD, ITD_τ, is then defined as the delay corresponding to the maximum of the cross-correlation function R(τ, μ).
The ITD_τ of all frames of the binaural signals of one azimuth are averaged to obtain the ITD training value of azimuth θ, denoted ITD(θ), namely:
ITD(θ) = mean(ITD_τ)
Within the azimuth range [-90°, 90°], the relationship between ITD(θ) and the azimuth θ is shown in Fig. 2.
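The ITD training computation can be sketched as follows. This sketch uses scipy's natural cubic spline (zero second derivative at the boundaries), which plays the role of the three-moment construction described above; all function names are illustrative assumptions.

import numpy as np
from scipy.interpolate import CubicSpline

def frame_itd(frame_l, frame_r, fs=16000, max_lag=16, fine_step_us=1.0):
    """Estimate one frame's ITD from the interpolated interaural cross-correlation."""
    lags = np.arange(-max_lag, max_lag + 1)
    # Cross-correlation over integer sample delays k in [-16, 16] (no circular wrap-around).
    r = np.array([np.sum(frame_l[max_lag:-max_lag] *
                         frame_r[max_lag + k: len(frame_r) - max_lag + k])
                  for k in lags])
    # Natural cubic spline, analogous to the three-moment construction in the patent.
    spline = CubicSpline(lags / fs, r, bc_type='natural')
    fine_delays = np.arange(-max_lag / fs, max_lag / fs, fine_step_us * 1e-6)
    return fine_delays[np.argmax(spline(fine_delays))]

def itd_training_value(frames_l, frames_r, fs=16000):
    """Average the per-frame ITDs of one azimuth to obtain its ITD training value."""
    return np.mean([frame_itd(fl, fr, fs) for fl, fr in zip(frames_l, frames_r)])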
1.4) For the binaural signals of azimuth θ, the single-frame IID training values are calculated. The IID is computed in the frequency domain: a short-time Fourier transform (STFT, Short-Time Fourier Transform) is applied to the framed binaural signals to transform them to the frequency domain,
where ω denotes the angular frequency, ranging from 0 to 2π in steps of 2π/512, and X_L(τ, ω) and X_R(τ, ω) are the spectra of the single-frame left-ear and right-ear signals respectively.
The interaural intensity difference IID(τ, ω) of the τ-th frame signal is defined as the ratio of the left-ear and right-ear magnitude spectra at each frequency bin, yielding a 256-dimensional IID vector per frame. The IID(τ, ω) of all frames are averaged to obtain the IID training value corresponding to azimuth θ:
IID(θ, ω) = mean(IID(τ, ω))
The models relating the angle θ to the ITD parameter and the IID parameter are thereby established. The above feature parameter extraction process is shown in Fig. 3.
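A corresponding sketch of the IID training-value computation is given below. Expressing the bin-wise left/right magnitude ratio in decibels is an assumption of this sketch, since the exact definition appears only as a figure in the original text; the names are illustrative.

import numpy as np

def iid_training_value(frames_l, frames_r, n_fft=512, eps=1e-12):
    """Per-azimuth IID training value: frame-averaged left/right magnitude ratio per bin."""
    # Magnitude spectra of every (already Hamming-windowed) frame.
    mag_l = np.abs(np.fft.rfft(frames_l, n=n_fft, axis=1))
    mag_r = np.abs(np.fft.rfft(frames_r, n=n_fft, axis=1))
    # Bin-wise level ratio in dB (a convention assumed here), then averaged over frames.
    iid_per_frame = 20.0 * np.log10((mag_l + eps) / (mag_r + eps))
    return iid_per_frame.mean(axis=0)   # one IID training value per frequency bin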
2. Initial localization stage of the iterative test process
2.1) The test binaural mixed speech signal in the test process contains multiple sound sources, each corresponding to a different azimuth. The test binaural mixed speech signal containing sources at different azimuths is pre-processed, including amplitude normalization, framing, windowing and endpoint detection; the detailed process is the same as described in step 1.2).
2.2) From the framed test binaural mixed speech signal obtained after step 1.2), the ITD(τ) of each frame of the test binaural speech signal is estimated with the method of step 1.3), and at the same time the frequency-domain signals X_L(τ, ω) and X_R(τ, ω) of the binaural mixed speech signal are computed with step 1.4).
2.3) The ITD measurement ITD(τ) of each frame of the test binaural mixed speech signal is compared with the training values ITD(θ) by a distance calculation to obtain the azimuth estimate θ(τ) of the τ-th frame of the test binaural mixed speech signal; the decision rule is as follows:
Here the azimuth of the τ-th frame is estimated by minimizing the absolute error, where ITD(τ) denotes the ITD measurement of the τ-th frame of the test binaural mixed speech signal and ITD(θ) denotes the ITD training value corresponding to azimuth θ.
2.4) The azimuths estimated for all frames of the test binaural mixed speech signal are collected into a histogram, a threshold is set according to the peak distribution of the histogram, and the valid source azimuths are determined. The decision criterion is that a peak is judged valid if the ratio of its frame count to the total frame count exceeds the threshold:
The number of valid peaks is the number of sound sources, and the azimuths corresponding to the valid peaks are the source azimuths contained in the test binaural mixed speech signal, denoted L and δ(l) respectively, with l = 1, 2, ..., L.
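The per-frame azimuth decision and histogram-based source counting of steps 2.3) and 2.4) can be sketched as follows. The peak-ratio threshold value and the omission of an explicit local-maximum test are simplifying assumptions, and the names are illustrative.

import numpy as np

def localize_sources(frame_itds, itd_train, azimuths, peak_ratio_threshold=0.05):
    """Per-frame nearest-ITD azimuth estimate, then histogram peak picking.

    frame_itds : ITD measurement of every test frame (1-D array)
    itd_train  : ITD training value for each candidate azimuth (same order as `azimuths`)
    azimuths   : candidate azimuths, e.g. np.arange(-90, 91, 5)
    """
    # Assign each frame to the azimuth whose ITD training value is closest (absolute error).
    frame_az = azimuths[np.argmin(np.abs(frame_itds[:, None] - itd_train[None, :]), axis=1)]

    # Histogram over candidate azimuths; a peak is valid if it holds enough of the frames.
    counts = np.array([np.sum(frame_az == az) for az in azimuths])
    valid = counts / len(frame_az) > peak_ratio_threshold
    source_azimuths = azimuths[valid]
    return len(source_azimuths), source_azimuths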
3. Speech separation stage of the test process
3.1) Using the number of sound sources and the azimuth information obtained in step 2), the distance between each frequency bin of each frame of the test binaural mixed speech signal and each sound source is calculated, and the frequency bin is assigned to a sound source accordingly:
where J(τ, ω) denotes the index of the sound source to which the ω-th frequency bin of the τ-th frame belongs; the numerator of the formula measures, for the given azimuth δ(l), the deviation of the binaural speech signal at that frequency bin from the IID and ITD of that azimuth.
3.2) A binary mask is established for each sound source according to the minimum distance value.
The binaural signals of each frequency bin of each frame are classified according to the binary mask, yielding the frame and frequency-bin data corresponding to the sources at different azimuths, as shown in the following formula,
where the masked spectrum denotes the frequency-bin data of the τ-th frame of the l-th source. The time-frequency bin separation of the mixed speech is shown in Fig. 4.
An inverse short-time Fourier transform (ISTFT, Inverse Short-Time Fourier Transform) is applied to the frequency-domain signal of the l-th separated source to obtain the τ-th frame time-domain signal s_l(τ, m) of source l,
where s_l(τ, m) denotes the τ-th frame time-domain signal of the l-th source.
After conversion to the time domain, the window is removed; the τ-th frame signal after de-windowing can be expressed by the following formula, where w_H(m) is the Hamming window defined above.
The de-windowed frames are combined by overlap-add to synthesize the separated speech signal s_l of the l-th source, thereby realizing the separation of the source signals at different azimuths.
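The frequency-bin classification, binary masking and overlap-add reconstruction of this stage can be sketched as follows. Because the distance formula of step 3.1) appears only as a figure in the original text, this sketch combines the IID and ITD deviations with a hypothetical weight alpha, applies the mask to the left-ear spectrum only, and uses a standard overlap-add in place of the explicit de-windowing step; all names are illustrative assumptions.

import numpy as np

def separate_sources(X_L, X_R, frame_itds, itd_train, iid_train, source_azimuths,
                     azimuths, window, frame_shift=256, alpha=1.0):
    """Binary-mask separation of the mixed STFT into one waveform per located source.

    X_L, X_R        : STFT of the left/right test signal, shape (n_frames, n_bins)
    frame_itds      : per-frame ITD measurement of the mixture
    itd_train       : ITD training value per candidate azimuth
    iid_train       : IID training value per candidate azimuth and bin, shape (n_az, n_bins)
    source_azimuths : located source azimuths δ(l), a subset of `azimuths`
    """
    az_index = {int(a): i for i, a in enumerate(azimuths)}
    n_frames, n_bins = X_L.shape
    iid_test = 20.0 * np.log10((np.abs(X_L) + 1e-12) / (np.abs(X_R) + 1e-12))

    # Distance of every time-frequency bin to every located source (assumed weighting alpha).
    dist = np.empty((len(source_azimuths), n_frames, n_bins))
    for l, az in enumerate(source_azimuths):
        i = az_index[int(az)]
        d_iid = np.abs(iid_test - iid_train[i][None, :])
        d_itd = np.abs(frame_itds - itd_train[i])[:, None]
        dist[l] = d_iid + alpha * d_itd

    # Binary masks: each bin goes to the source with the minimum distance.
    masks = (np.argmin(dist, axis=0)[None, :, :] ==
             np.arange(len(source_azimuths))[:, None, None])

    # Mask, inverse-FFT each frame and overlap-add to reconstruct each source waveform.
    out_len = (n_frames - 1) * frame_shift + len(window)
    separated = np.zeros((len(source_azimuths), out_len))
    for l in range(len(source_azimuths)):
        frames = np.fft.irfft(X_L * masks[l], n=len(window), axis=1)
        for t in range(n_frames):
            separated[l, t * frame_shift: t * frame_shift + len(window)] += frames[t]
    return separated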
4. Iteration stage:
4.1) From the separated binaural speech signals of the different sources obtained in step 3.2), the source azimuth information is re-estimated according to step 2), yielding corrected source azimuth information.
4.2) Using the corrected source azimuths obtained in step 4.1), the mixed binaural speech signal is separated again with the spatial cues according to step 3), yielding the data streams of the different-azimuth sources after the second separation.
4.3) Steps 4.1) and 4.2) are repeated, forming a localization-plus-separation iterative structure. The number of iterations is set to 3; after the iterations, the data streams of the sources are the final separation result of the test binaural mixed speech signal. In the present invention, the choice of 3 iterations is based on the results of the complete simulation tests of the algorithm: after 3 iterations the localization results tend to be stable.
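Finally, the localization-plus-separation loop can be sketched end to end. This self-contained version works entirely in the STFT domain and estimates per-frame ITDs from the interaural cross-power spectrum, which is a simplification of the time-domain cross-correlation used in the patent; the distance weighting, threshold and all names are assumptions.

import numpy as np

def iterate_localize_separate(X_L, X_R, itd_train, iid_train, azimuths,
                              fs=16000, n_iter=3, peak_ratio_threshold=0.05):
    """Localization + separation iterated n_iter times, entirely in the STFT domain."""
    n_frames, n_bins = X_L.shape
    n_fft = 2 * (n_bins - 1)
    lags = (np.arange(n_fft) - n_fft // 2) / fs
    az_index = {int(a): i for i, a in enumerate(azimuths)}

    def frame_itds(SL, SR):
        # Inverse FFT of the cross-power spectrum approximates the interaural cross-correlation.
        xcorr = np.fft.fftshift(np.fft.irfft(SL * np.conj(SR), n=n_fft, axis=1), axes=1)
        return lags[np.argmax(xcorr, axis=1)]

    def localize(SL, SR):
        # Nearest-ITD azimuth per frame, then histogram peaks above the frame-ratio threshold.
        fa = azimuths[np.argmin(np.abs(frame_itds(SL, SR)[:, None] - itd_train[None, :]), axis=1)]
        counts = np.array([np.sum(fa == az) for az in azimuths])
        order = np.argsort(counts)[::-1]
        return [azimuths[i] for i in order if counts[i] / len(fa) > peak_ratio_threshold]

    src_az = localize(X_L, X_R)                    # initial source azimuths
    if not src_az:
        return None, []
    assign = None
    for _ in range(n_iter):
        iid_test = 20 * np.log10((np.abs(X_L) + 1e-12) / (np.abs(X_R) + 1e-12))
        itd_test = frame_itds(X_L, X_R)[:, None]
        # Distance of every time-frequency bin to every current source (assumed unit weighting).
        dist = np.stack([np.abs(iid_test - iid_train[az_index[int(az)]])
                         + np.abs(itd_test - itd_train[az_index[int(az)]])
                         for az in src_az])
        assign = np.argmin(dist, axis=0)           # per-bin source index (the binary masks)
        # Corrected azimuths: re-localize each source from its own masked spectra,
        # keeping the strongest peak, or the previous azimuth if none is found.
        new_az = []
        for l, az in enumerate(src_az):
            peaks = localize(X_L * (assign == l), X_R * (assign == l))
            new_az.append(peaks[0] if peaks else az)
        src_az = new_az
    return assign, src_az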
The above is only a preferred embodiment of the present invention. It should be pointed out that, for a person of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (5)

1. A binaural mixed speech separation method based on an iterative structure, characterized by comprising the steps of:
1) a parameter training stage:
1.1) training with directional binaural white-noise signals, each binaural white-noise signal being generated by convolving head-related impulse response (HRIR) data with a monaural white-noise signal, the source azimuth θ being defined as the angle between the projection of the direction vector onto the horizontal plane and the median plane, ranging over [-90°, 90°] in steps of 5°;
1.2) pre-processing the binaural white-noise signals of 1.1) to obtain framed single-frame binaural signals, the pre-processing comprising amplitude normalization, framing, windowing and endpoint detection;
1.3) computing the cross-correlation function of the single-frame binaural signals obtained in 1.2), interpolating the cross-correlation function with a cubic spline function, and calculating the ITD estimate of each single-frame binaural signal; taking the mean of the ITD estimates over all frames of the same azimuth as the ITD training value of that azimuth, denoted ITD(θ);
1.4) applying a short-time Fourier transform to the single-frame binaural signals obtained in 1.2) to transform them to the frequency domain, and computing the ratio between the magnitude spectra of the left-ear and right-ear signals at each frequency bin to obtain IID estimates; taking the mean of the IID estimates over all frames of the same azimuth as the IID training value of that azimuth, denoted IID(ω, θ), where ω is the angular frequency;
2) a localization stage of the test process:
2.1) pre-processing the test binaural mixed speech signal to obtain single-frame binaural speech signals, the pre-processing comprising amplitude normalization, framing, windowing and endpoint detection;
2.2) calculating the ITD test value of each single-frame binaural speech signal obtained in 2.1) with the method of 1.3), and computing the distance between the calculated ITD test value and the ITD training value of each azimuth from step 1) to obtain the azimuth estimate of each frame of the binaural speech signal;
2.3) computing a histogram of the azimuth estimates of all frames obtained in 2.2), and estimating the number of sound sources and the source azimuths in the test binaural mixed speech signal by detecting the peaks of the histogram;
3) a speech separation stage of the test process: using the ITD training values of each azimuth and the IID training values of each azimuth at the different frequency bins obtained in 1), calculating the distance between each frequency bin of each frame of the test binaural mixed speech signal and each sound source obtained in 2.3); establishing a binary mask for each frequency bin of each frame according to the minimum-distance principle, assigning each frequency bin of each frame to a source according to the binary mask to obtain the frequency components corresponding to the sources at different azimuths, and reconstructing all frames and all frequency bins corresponding to the same source, thereby realizing the separation of the test binaural mixed speech signal into the sources at different azimuths;
4) an iteration stage:
4.1) re-estimating, by means of 2), the source azimuth information of the separated binaural speech signals of the different-azimuth sources obtained in 3), to obtain corrected source azimuth information;
4.2) separating the test binaural mixed speech signal again by means of 3) according to the corrected source azimuth information obtained in 4.1), to obtain the re-separated data streams of the different-azimuth sources;
4.3) repeating 4.1) and 4.2) iteratively; after the iterations, the data streams of the sources are the final separation result of the test binaural mixed speech signal.
2. The binaural mixed speech separation method according to claim 1, characterized in that the process of calculating the ITD estimate of a single-frame binaural signal in 1.3) using cubic spline interpolation is as follows:
on the interval [k_i, k_(i+1)], R(τ, k) is fitted with a cubic polynomial, namely:
R(τ, k) = a_i·k³ + b_i·k² + c_i·k + d_i
where a_i, b_i, c_i and d_i are coefficients to be determined, k is the delay in samples, and i denotes the i-th coordinate interval of the polynomial fit;
using the conditions that the second derivative is continuous and that the second derivative is zero at the boundaries, the cross-correlation function R(τ, μ) as a function of delay time is obtained with the three-moment method, where μ denotes the delay in sampling time and τ denotes the τ-th frame;
the single-frame ITD of the binaural signals, ITD_τ, is then defined as the delay corresponding to the maximum of the cross-correlation function R(τ, μ);
the ITD_τ of all frames of the binaural signals of one azimuth are averaged to obtain the ITD training value of azimuth θ, denoted ITD(θ), namely:
ITD(θ) = mean(ITD_τ).
3. The binaural mixed speech separation method according to claim 1, characterized in that the detailed process of step 3) is as follows:
3.1) for each frequency bin of each frame of the test binaural mixed speech signal, calculating its distance to each sound source obtained in 2.3), and assigning the frequency bin to a sound source accordingly:
where J(τ, ω) denotes the index of the sound source to which the ω-th frequency bin of the τ-th frame belongs, L is the number of sound sources, δ(l) is the azimuth of the l-th sound source, and X_L(τ, ω) and X_R(τ, ω) are the spectra of the single-frame left-ear and right-ear signals respectively;
3.2) establishing a binary mask for each sound source according to the minimum distance value:
classifying the binaural speech signals of each frequency bin of each frame according to the binary mask, to obtain the frame and frequency-bin data corresponding to the sources at different azimuths, as shown in the following formula,
where the masked spectrum denotes the frequency-bin data of the τ-th frame of the l-th source;
applying an inverse short-time Fourier transform (ISTFT) to the frequency-domain signal of the l-th separated source to obtain the τ-th frame time-domain signal of source l, where m denotes the sample index,
this result denoting the τ-th frame time-domain signal of the l-th source;
then removing the window, the τ-th frame signal after de-windowing being given by the following formula, where w_H(m) is the Hamming window and N is the frame length;
combining the de-windowed frames by overlap-add to synthesize the separated speech signal s_l of the l-th source.
4. The binaural mixed speech separation method according to claim 1, characterized in that the sound sources in the mixed speech estimated in 2.3) are the valid peaks of the histogram, the criterion for a valid peak being that the ratio of its frame count to the total frame count exceeds a threshold.
5. The binaural mixed speech separation method according to claim 1, characterized in that the number of iterations in 4) is 3.
CN201610824648.XA 2016-09-14 2016-09-14 A binaural mixed speech separation method based on an iterative structure Active CN106373589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610824648.XA CN106373589B (en) 2016-09-14 2016-09-14 A binaural mixed speech separation method based on an iterative structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610824648.XA CN106373589B (en) 2016-09-14 2016-09-14 A binaural mixed speech separation method based on an iterative structure

Publications (2)

Publication Number Publication Date
CN106373589A CN106373589A (en) 2017-02-01
CN106373589B true CN106373589B (en) 2019-07-26

Family

ID=57896703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610824648.XA Active CN106373589B (en) 2016-09-14 2016-09-14 A binaural mixed speech separation method based on an iterative structure

Country Status (1)

Country Link
CN (1) CN106373589B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107942290B (en) * 2017-11-16 2019-10-11 东南大学 Binaural sound sources localization method based on BP neural network
CN108091345B (en) * 2017-12-27 2020-11-20 东南大学 Double-ear voice separation method based on support vector machine
CN108647556A (en) * 2018-03-02 2018-10-12 重庆邮电大学 Sound localization method based on frequency dividing and deep neural network
CN109949821B (en) * 2019-03-15 2020-12-08 慧言科技(天津)有限公司 Method for removing reverberation of far-field voice by using U-NET structure of CNN
CN110491410B (en) * 2019-04-12 2020-11-20 腾讯科技(深圳)有限公司 Voice separation method, voice recognition method and related equipment
CN110275138B (en) * 2019-07-16 2021-03-23 北京工业大学 Multi-sound-source positioning method using dominant sound source component removal
CN110491412B (en) * 2019-08-23 2022-02-25 北京市商汤科技开发有限公司 Sound separation method and device and electronic equipment
CN111539449B (en) * 2020-03-23 2023-08-18 广东省智能制造研究所 Sound source separation and positioning method based on second-order fusion attention network model
CN113079452B (en) * 2021-03-30 2022-11-15 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, audio direction information generating method, electronic device, and medium
CN113782047B (en) * 2021-09-06 2024-03-08 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102438189A (en) * 2011-08-30 2012-05-02 东南大学 Dual-channel acoustic signal-based sound source localization method
CN102565759A (en) * 2011-12-29 2012-07-11 东南大学 Binaural sound source localization method based on sub-band signal to noise ratio estimation
CN103983946A (en) * 2014-05-23 2014-08-13 北京神州普惠科技股份有限公司 Method for processing signals of multiple measuring channels in sound source localization process
CN104464750A (en) * 2014-10-24 2015-03-25 东南大学 Voice separation method based on binaural sound source localization

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6253226B2 (en) * 2012-10-29 2017-12-27 三菱電機株式会社 Sound source separation device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102438189A (en) * 2011-08-30 2012-05-02 东南大学 Dual-channel acoustic signal-based sound source localization method
CN102565759A (en) * 2011-12-29 2012-07-11 东南大学 Binaural sound source localization method based on sub-band signal to noise ratio estimation
CN103983946A (en) * 2014-05-23 2014-08-13 北京神州普惠科技股份有限公司 Method for processing signals of multiple measuring channels in sound source localization process
CN104464750A (en) * 2014-10-24 2015-03-25 东南大学 Voice separation method based on binaural sound source localization

Also Published As

Publication number Publication date
CN106373589A (en) 2017-02-01

Similar Documents

Publication Publication Date Title
CN106373589B (en) A binaural mixed speech separation method based on an iterative structure
CN109839612B (en) Sound source direction estimation method and device based on time-frequency masking and deep neural network
CN102565759B (en) Binaural sound source localization method based on sub-band signal to noise ratio estimation
CN102438189B (en) Dual-channel acoustic signal-based sound source localization method
CN107346664A (en) A binaural speech separation method based on critical bands
Liu et al. Continuous sound source localization based on microphone array for mobile robots
CN106226739A (en) Dual sound source localization method incorporating sub-band analysis
CN111429939B (en) Sound signal separation method of double sound sources and pickup
CN104464750A (en) Voice separation method based on binaural sound source localization
CN109188362A (en) A microphone array sound source localization signal processing method
CN106019230B (en) A sound source localization method based on i-vector speaker recognition
CN103901400A (en) Binaural sound source positioning method based on delay compensation and binaural coincidence
Wang et al. Pseudo-determined blind source separation for ad-hoc microphone networks
Dumortier et al. Blind RT60 estimation robust across room sizes and source distances
CN110265060B (en) Speaker number automatic detection method based on density clustering
Talagala et al. Binaural localization of speech sources in the median plane using cepstral HRTF extraction
Wu et al. Spatial feature learning for robust binaural sound source localization using a composite feature vector
Hu et al. Evaluation and comparison of three source direction-of-arrival estimators using relative harmonic coefficients
Zohourian et al. Multi-channel speaker localization and separation using a model-based GSC and an inertial measurement unit
Wu et al. Binaural localization of speech sources in 3-D using a composite feature vector of the HRTF
Hu et al. Robust binaural sound localisation with temporal attention
Fang et al. A robust interaural time differences estimation and dereverberation algorithm based on the coherence function
Malek et al. Speaker extraction using LCMV beamformer with DNN-based SPP and RTF identification scheme
Habib et al. Auditory inspired methods for localization of multiple concurrent speakers
Liu et al. Robust interaural time difference estimation based on convolutional neural network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant