CN106373589B - Binaural mixed-speech separation method based on an iterative structure - Google Patents
Binaural mixed-speech separation method based on an iterative structure Download PDF Info
- Publication number
- CN106373589B CN106373589B CN201610824648.XA CN201610824648A CN106373589B CN 106373589 B CN106373589 B CN 106373589B CN 201610824648 A CN201610824648 A CN 201610824648A CN 106373589 B CN106373589 B CN 106373589B
- Authority
- CN
- China
- Prior art keywords
- ears
- sound source
- frame
- signal
- itd
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
Abstract
The invention discloses a binaural mixed-speech separation method based on an iterative structure. Using the binaural spatial cues of interaural time difference (ITD, Interaural Time Difference) and interaural intensity difference (IID, Interaural Intensity Difference), the multiple sound sources in the mixed speech are first coarsely localized. The estimated number of sources and the azimuth of each source then serve as the basis for separation, and each source's data stream is separated and reconstructed from this azimuth information. The source azimuths are subsequently re-estimated from the reconstructed speech signals, and the mixture is separated again using the corrected azimuth information. After iterating these steps, the source data streams reconstructed in the last separation pass are taken as the final separation result. Compared with conventional binaural speech separation methods, the proposed iterative, spatial-information-based method significantly improves the perceptual quality of the separated speech under low signal-to-noise ratio and strong reverberation.
Description
Technical field
The present invention relates to the fields of sound-source localization and speech separation, and in particular to a binaural mixed-speech separation method based on an iterative structure and spatial azimuth information.
Background art
Binaural mixed-speech separation is an emerging interdisciplinary field involving artificial intelligence, auditory psychology, auditory physiology, signal processing, and other research areas. With the rapid development of modern science and technology, speech separation has found broad practical use. In video conferencing, for example, spatial information can be added to the voices of multiple speakers so that the main speaker's voice is separated and enhanced; in hearing aids, it can help hearing-impaired users focus their attention on a single speaker. Studying binaural mixed-speech separation therefore has important theoretical and practical value for improving the robustness of speech signal processing and for solving the "cocktail party" problem of separating mixed sounds in complex acoustic environments.
Summary of the invention
Object of the invention: to overcome the deficiencies of the prior art, the present invention provides a binaural mixed-speech separation method based on an iterative structure and spatial azimuth information. It combines sound-source localization with speech separation: the spatial azimuth information of the sources serves as the characteristic parameter for speech separation, while the separated speech in turn improves localization performance. This forms a localization-separation iterative structure that improves the separation of mixed speech based on spatial information.
Technical solution: the present invention provides a binaural mixed-speech separation method based on an iterative structure, comprising the steps of:
1) Parameter training stage:
1.1) Training uses directional binaural white-noise signals of known azimuth, generated by convolving head-related impulse response (HRIR) data with a monaural white-noise signal. The source azimuth θ is defined as the angle between the projection of the direction vector onto the horizontal plane and the median plane, ranging over [-90°, 90°] in steps of 5°.
1.2) The binaural white-noise signals of 1.1) are preprocessed to obtain framed single-frame binaural signals; the preprocessing comprises amplitude normalization, framing with windowing, and endpoint detection.
1.3) A cross-correlation function is computed for each single-frame binaural signal of 1.2) and interpolated with a cubic spline function, from which the frame's ITD estimate is calculated. The mean of the ITD estimates over all frames of the same azimuth is the ITD training value for that azimuth, denoted ITD(θ).
1.4) Each single-frame binaural signal of 1.2) is transformed to the frequency domain by the short-time Fourier transform, and the ratio of the left-ear and right-ear magnitude spectra at each frequency bin gives the IID estimate. The mean of the IID estimates over all frames of the same azimuth is the IID training value for that azimuth, denoted IID(ω, θ), where ω is the angular frequency.
2) Localization stage of the test process:
2.1) The test binaural mixed-speech signal is preprocessed into single-frame binaural speech signals, including amplitude normalization, framing with windowing, and endpoint detection.
2.2) The ITD test value of each single-frame signal of 2.1) is computed by the method of 1.3); the distance between the computed test value and the ITD training value of every azimuth from step 1) is evaluated, yielding an azimuth estimate for each frame of the binaural speech signal.
2.3) A histogram of the azimuth estimates of all frames from 2.2) is compiled; the number of sources and the source azimuths in the test mixture are estimated by detecting the peaks of the histogram.
3) Speech separation stage of the test process: using the per-azimuth ITD training values and the per-azimuth, per-frequency IID training values from step 1), the distance between every frequency bin of every frame of the test mixture and each source found in 2.3) is computed. A binary mask is built for each frame's frequency bins by the minimum-distance rule, the bins of every frame are assigned to sources according to the mask, and the frequency components belonging to each azimuth's source are obtained. All frames and all frequency bins belonging to the same source are then reconstructed, separating the test binaural mixture into its differently located sources.
4) Iteration stage:
4.1) The source azimuths are re-estimated by applying step 2) to the separated binaural speech signals of 3), giving corrected azimuth information.
4.2) Using the corrected azimuth information of 4.1), the test binaural mixture is separated again by step 3), giving newly separated per-source data streams.
4.3) Steps 4.1) and 4.2) are repeated; when the iterations end, the multi-source data streams are the final mixed-speech separation result.
In 1.3), the ITD estimate of a single-frame signal is computed by cubic-spline interpolation as follows:
On each interval [k_i, k_(i+1)], R(τ, k) is fitted with a cubic polynomial:
R(τ, x) = a_i·x³ + b_i·x² + c_i·x + d_i
where a_i, b_i, c_i and d_i are undetermined coefficients, and i indexes the coordinate interval of the polynomial fit.
Under the conditions that the second derivative is continuous and vanishes at the boundaries, the three-moment method yields the cross-correlation function R(τ, μ) as a function of delay time, where μ denotes the delay in units of the sampling time and τ denotes the τ-th frame.
The single-frame ITD_τ is then defined as the delay at which R(τ, μ) is maximal:
ITD_τ = argmax_μ R(τ, μ)
Taking the expectation of ITD_τ over all frames of the binaural signals of one azimuth gives the ITD training value of azimuth θ, denoted ITD(θ):
ITD(θ) = mean(ITD_τ).
The detailed procedure of step 3) is as follows:
3.1) For each frequency bin of each frame of the test binaural mixed-speech signal, its distance to each source found in 2.3) is computed, and the bin is assigned to a source:
J(τ, ω) = argmin over l of the (ITD, IID) distance between the bin and azimuth δ(l)
where J(τ, ω) is the index of the source to which the ω-th bin of the τ-th frame belongs, L is the number of sources, and δ(l) is the azimuth of the l-th source.
3.2) A binary mask is built for each source according to the minimum distance:
M_l(τ, ω) = 1 if J(τ, ω) = l, and 0 otherwise.
The binaural signal at every frequency bin of every frame is classified according to the binary mask, giving the per-frame, per-bin data of each differently located source:
X_l(τ, ω) = M_l(τ, ω) · X(τ, ω)
where X_l(τ, ω) denotes the bin data of the l-th source in the τ-th frame.
An inverse short-time Fourier transform (ISTFT) applied to the separated frequency-domain signal of the l-th source gives its τ-th time-domain frame s_l(τ, m).
The window is then removed; with w_H(m) the Hamming window, the de-windowed τ-th frame is obtained.
Overlap-adding the de-windowed frames synthesizes the separated speech signal s_l of the l-th source.
In 2.3), the sources in the mixed speech are estimated as the effective peaks of the histogram; a peak is judged effective when the ratio of its frame count to the total frame count exceeds a threshold.
The number of iterations in step 4) is 3.
Advantageous effects: compared with existing binaural mixed-speech separation techniques, the proposed method based on an iterative structure and azimuth information significantly improves both sound-source localization and separation. Under low signal-to-noise ratio and strong reverberation, the accuracy of binaural multi-source localization is effectively improved. Because separation is driven by the azimuth information of the different sources, the method avoids the inability of single-channel separation methods to separate unvoiced signals; and compared with conventional binaural separation methods, the introduced iterative structure raises multi-source localization accuracy and thereby markedly improves the perceptual quality of the separated speech.
Brief description of the drawings
Fig. 1 is the system block diagram of the algorithm of the invention.
Fig. 2 plots the ITD training value of each azimuth against the azimuth angle.
Fig. 3 is the flow chart of binaural spatial-cue extraction in the invention.
Fig. 4 is a schematic diagram of time-frequency-bin separation of the mixed speech.
Detailed description of the embodiments
The present invention is further explained below with reference to the accompanying drawings.
The invention proposes a binaural mixed-speech separation method based on an iterative structure and azimuth information. The method comprises two stages: a parameter training stage and a test stage with an iterative structure.
Parameter training is performed first. The binaural white-noise signals carrying azimuth information are preprocessed by amplitude normalization, framing with windowing, and endpoint detection. For each preprocessed single-frame binaural signal, the interaural cross-correlation is computed and interpolated with a cubic spline to estimate the interaural time difference (ITD) training value of each azimuth, while the interaural amplitude ratio of the binaural signal gives the IID training value of each azimuth.
In the actual binaural mixed-speech separation stage, separation based on the iterative structure is applied. Under the test conditions, the single-frame ITD test values of the binaural mixture containing differently located sources are computed first, and the number of sources and their azimuths are coarsely estimated from a histogram. Using the estimated number and azimuths, each frame of the binaural mixture is separated into the data streams of the different sources. Spatial localization cues are then extracted anew from each source's data stream, the source azimuths are re-estimated, and with the corrected azimuths the sources are separated again, forming a localization + separation iterative structure. When the iterations end, the reconstructed data stream of each source is obtained, realizing azimuth-based separation of the binaural mixed-speech signal.
Fig. 1 gives the overall flow chart of the binaural mixed-speech separation algorithm based on the iterative structure and azimuth information.
A specific embodiment of the technical solution is described in detail below with reference to the drawings:
1. Parameter training stage:
1.1) The invention trains on directional binaural white-noise signals of known azimuth, generated by convolving head-related impulse response (HRIR, Head Related Impulse Response) data with a monaural white-noise signal. The source azimuth θ is defined as the angle between the projection of the direction vector onto the horizontal plane and the median plane, ranging over [-90°, 90°] in steps of 5°.
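The synthesis of a known-azimuth training signal in step 1.1) can be sketched as follows. The HRIR pair here is a toy stand-in (a pure delay plus attenuation for the far ear); in practice a measured HRIR database entry for the chosen azimuth would be convolved with the noise instead.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000
noise = rng.standard_normal(fs)           # 1 s of monaural white noise

# Toy HRIR pair standing in for a measured database entry: the right-ear
# response is delayed and attenuated relative to the left, mimicking the
# ITD/IID that a real HRIR for a lateral azimuth would encode.
delay = 8                                 # assumed interaural delay in samples
hrir_l = np.zeros(64); hrir_l[0] = 1.0
hrir_r = np.zeros(64); hrir_r[delay] = 0.7

x_l = np.convolve(noise, hrir_l)[:fs]     # left-ear white-noise signal
x_r = np.convolve(noise, hrir_r)[:fs]     # right-ear white-noise signal
binaural = np.stack([x_l, x_r])           # known-azimuth training signal
```

The resulting two-channel signal then enters the preprocessing of step 1.2).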
1.2) The binaural white-noise signals of known azimuth are preprocessed by amplitude normalization, framing with windowing, and endpoint detection.
Because the amplitudes of binaural speech signals differ greatly across environments, amplitude normalization is needed before feature extraction:
x_L = x_L / max(|x_L|, |x_R|)
x_R = x_R / max(|x_L|, |x_R|)
where x_L and x_R are the left-ear and right-ear signals and max(|x_L|, |x_R|) is the maximum amplitude of the two ear signals.
The binaural signals are framed with a frame length of 32 ms and a frame shift of 16 ms.
Each frame is windowed with a Hamming window:
w_H(m) = 0.54 − 0.46·cos(2πm/(N−1)), 0 ≤ m ≤ N−1
where N is the frame length. The sampling rate here is 16 kHz and the frame length 32 ms, so N = 512.
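The framing and windowing of step 1.2) can be sketched as below, a minimal version with the stated 32 ms frames and 16 ms shift (amplitude normalization and endpoint detection are handled separately):

```python
import numpy as np

fs = 16000
frame_len = int(0.032 * fs)   # 32 ms frame -> N = 512 samples
hop = int(0.016 * fs)         # 16 ms frame shift -> 256 samples

def frame_and_window(x):
    """Split a signal into 50%-overlapped frames and apply a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hamming(frame_len)           # 0.54 - 0.46*cos(2*pi*m/(N-1))
    return np.stack([x[t * hop : t * hop + frame_len] * win
                     for t in range(n_frames)])

x = np.random.default_rng(1).standard_normal(fs)   # 1 s test signal
frames = frame_and_window(x)                        # (n_frames, 512)
```

Each row of `frames` is one single-frame binaural channel ready for the ITD/IID feature computation.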
The invention performs endpoint detection with a dynamic double-threshold method: a high and a low threshold are set for both the short-time energy E_τ and the short-time zero-crossing rate Z_τ of the speech signal, where τ is the frame index. The right-ear signal serves as the basis for endpoint detection:
E_τ = Σ_m x_R(τ, m)²
Z_τ = (1/2) Σ_m |sgn(x_R(τ, m)) − sgn(x_R(τ, m−1))|
where x_R(τ, m) is the framed right-ear speech signal and sgn(·) is the sign function.
Dynamic double-threshold endpoint detection proceeds through four states: silence, transition, speech, and end.
Silence: after detection starts, the detector first watches for the signal to leave silence; when the short-time energy or zero-crossing rate of some frame exceeds the low threshold, that frame is marked as the tentative start of speech and the state moves to transition.
Transition: while in transition, the short-time energy and zero-crossing rate continue to be observed. If some frame's energy or zero-crossing rate falls below the low threshold, the state returns to silence; if three consecutive frames have energy or zero-crossing rate above the high threshold, the state moves to speech.
Speech: while the energy or zero-crossing rate stays above the low threshold the signal is in the speech state. When some frame's energy or zero-crossing rate drops below the low threshold, that frame is marked as a suspected endpoint and the detector begins testing whether speech has ended.
End: if the run of frames whose energy or zero-crossing rate is below the low threshold lasts longer than the maximum silence length, speech is judged to have ended at the previously marked endpoint. Otherwise, if a later frame's energy or zero-crossing rate again exceeds the low threshold, the endpoint mark is cancelled and the signal remains in the speech state while detection continues.
The minimum speech length above is the shortest duration a detected speech segment may have; the maximum silence length is the longest pause allowed between two adjacent words.
The high and low thresholds of the short-time energy and zero-crossing rate, denoted E_H, E_L, Z_H and Z_L, are computed from E_max and E_min, the maximum and minimum short-time energy of the original speech signal, and from μ_z and σ_z, the mean and variance of the short-time zero-crossing rate of the first 15 frames, with Z_c an empirical constant, usually taken as 25.
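A sketch of the short-time features and the double-threshold logic follows. It uses illustrative fixed thresholds and only the energy track, and compresses the four-state machine into "trigger on the high threshold, extend the segment while above the low threshold"; the ZCR thresholds, the three-frame confirmation, and the maximum-silence hangover of the full method are omitted.

```python
import numpy as np

def short_time_features(frames):
    """Short-time energy and zero-crossing rate per frame."""
    energy = np.sum(frames ** 2, axis=1)
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)
    return energy, zcr

def double_threshold(energy, e_hi, e_lo):
    """Simplified double-threshold endpoint detection: a frame is speech if
    it exceeds the high threshold, or connects to such a frame through a
    run of frames that stay above the low threshold."""
    speech = energy > e_hi
    for t in range(1, len(energy)):                 # extend runs forward
        speech[t] |= speech[t - 1] and energy[t] > e_lo
    for t in range(len(energy) - 2, -1, -1):        # and backward
        speech[t] |= speech[t + 1] and energy[t] > e_lo
    return speech

energy = np.array([0.1, 0.5, 5.0, 6.0, 0.5, 0.1])   # toy per-frame energies
flags = double_threshold(energy, e_hi=2.0, e_lo=0.3)
```

With the toy energies, the two loud frames trigger detection and the adjacent 0.5-energy frames are absorbed into the speech segment, while the 0.1-energy edges stay silent.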
1.3) For binaural signals of azimuth θ, the single-frame ITD is computed after framing and windowing. First the cross-correlation function R(τ, k) of the left-ear and right-ear signals is computed:
R(τ, k) = Σ_m x_L(τ, m) · x_R(τ, m + k)
where x_L(τ, m) and x_R(τ, m) are the left-ear and right-ear signals of the τ-th frame, m is the sample index, and k is the delay in samples, here ranging over [-16, 16].
Cubic-spline interpolation is applied to the sample-domain cross-correlation R(τ, k), refining the time scale of the lag variable k from 62.5 μs to 1 μs.
The cubic-spline interpolation proceeds as follows. Since R(τ, k) is a sequence of discrete values, on each interval [k_i, k_(i+1)] (with i indexing the coordinate interval of the polynomial fit) R(τ, k) is fitted with a cubic polynomial:
R(τ, x) = a_i·x³ + b_i·x² + c_i·x + d_i
where a_i, b_i, c_i and d_i are undetermined coefficients; under the conditions that the second derivative is continuous and vanishes at the boundaries, the three-moment method determines the coefficients.
Interpolating the sample-lag cross-correlation R(τ, k) in this way yields the cross-correlation function R(τ, μ) based on delay time, where μ denotes the delay in time units.
The single-frame ITD_τ is then defined as the delay at which R(τ, μ) is maximal:
ITD_τ = argmax_μ R(τ, μ)
Taking the expectation of ITD_τ over all frames of the azimuth's binaural signals gives the ITD training value of azimuth θ, denoted ITD(θ):
ITD(θ) = mean(ITD_τ)
Over the azimuth range [-90°, 90°], the relationship between ITD(θ) and the azimuth θ is shown in Fig. 2.
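The single-frame ITD computation of step 1.3) can be sketched with SciPy's natural cubic spline, whose zero-second-derivative boundary condition matches the three-moment solution described above:

```python
import numpy as np
from scipy.interpolate import CubicSpline

fs = 16000
max_lag = 16                                   # k ranges over [-16, 16] samples

def xcorr(xl, xr, k):
    """R(tau, k) = sum_m xL(m) * xR(m + k), over the overlapping samples."""
    if k >= 0:
        return float(np.dot(xl[:len(xl) - k], xr[k:])) if k > 0 else float(np.dot(xl, xr))
    return float(np.dot(xl[-k:], xr[:len(xr) + k]))

def frame_itd_us(xl, xr):
    """Single-frame ITD in microseconds: peak of the interaural
    cross-correlation, refined from the 62.5 us sample grid (1/16 kHz)
    to a 1 us grid by natural cubic-spline interpolation."""
    lags = np.arange(-max_lag, max_lag + 1)
    r = np.array([xcorr(xl, xr, int(k)) for k in lags])
    lag_us = lags * 1e6 / fs                   # lag axis in microseconds
    spline = CubicSpline(lag_us, r, bc_type='natural')
    fine = np.arange(lag_us[0], lag_us[-1] + 1.0, 1.0)   # 1 us grid
    return fine[np.argmax(spline(fine))]

rng = np.random.default_rng(4)
frame_l = rng.standard_normal(512)
frame_r = np.roll(frame_l, 5)                  # right ear lags by 5 samples
itd = frame_itd_us(frame_l, frame_r)           # expect roughly 5/fs = 312.5 us
```

Averaging `frame_itd_us` over all frames of one azimuth gives that azimuth's ITD(θ) training value.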
1.4) For binaural signals of azimuth θ, the single-frame IID training value is computed. The IID is computed in the frequency domain: the framed binaural signals are transformed by the short-time Fourier transform (STFT, Short-Time Fourier Transform):
X_L(τ, ω) = STFT{x_L(τ, m)}, X_R(τ, ω) = STFT{x_R(τ, m)}
where ω denotes the angular frequency, spanning [0, 2π) with spacing 2π/512; X_L(τ, ω) and X_R(τ, ω) are the spectra of the single-frame left-ear and right-ear signals.
The interaural intensity difference of the τ-th frame is defined from the ratio of the left- and right-ear magnitude spectra:
IID(τ, ω) = 20·log10(|X_L(τ, ω)| / |X_R(τ, ω)|)
giving a 256-dimensional IID vector per frame. Taking the expectation of IID(τ, ω) over all frames gives the IID training value of azimuth θ:
IID(θ, ω) = mean(IID(τ, ω))
This establishes the model relating the azimuth θ to the ITD and IID parameters. The extraction of the above feature parameters is shown in Fig. 3.
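The IID training of step 1.4) can be sketched as follows. The original text defines the IID from the magnitude-spectrum ratio; the dB scaling below is the usual convention and should be read as an assumption, and the toy frames simply attenuate the right ear by a fixed factor.

```python
import numpy as np

N = 512                                  # frame length; rfft keeps bins 0..256

def frame_iid_db(xl, xr, eps=1e-12):
    """Per-bin interaural intensity difference of one windowed frame as the
    dB ratio of the left/right magnitude spectra (dB convention assumed)."""
    mag_l = np.abs(np.fft.rfft(xl, N)) + eps
    mag_r = np.abs(np.fft.rfft(xr, N)) + eps
    return 20.0 * np.log10(mag_l / mag_r)

# IID training value of one azimuth: mean IID over all frames of that azimuth
rng = np.random.default_rng(2)
frames_l = rng.standard_normal((40, N))
frames_r = 0.5 * frames_l                # toy frames: right ear 6 dB quieter
iid_train = np.mean([frame_iid_db(l, r) for l, r in zip(frames_l, frames_r)],
                    axis=0)
```

For the toy attenuation of 0.5, every bin of the training vector sits at 20·log10(2) ≈ 6.02 dB, as expected.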
2. Initial localization stage of the iterative test stage
2.1) The test binaural mixed-speech signal in the test process contains multiple sources, each at a different azimuth. The test mixture is preprocessed by amplitude normalization, framing with windowing, and endpoint detection, exactly as described in step 1.2).
2.2) From the test mixture framed as in step 1.2), the ITD(τ) of each test frame is estimated by the method of step 1.3), and the frequency-domain signals X_L(τ, ω) and X_R(τ, ω) of the binaural mixture are computed as in step 1.4).
2.3) The ITD measurement ITD(τ) of each frame of the test mixture is compared against the training values ITD(θ) by a distance computation, and the azimuth estimate θ(τ) of the τ-th test frame is obtained by minimum absolute error:
θ(τ) = argmin_θ |ITD(τ) − ITD(θ)|
where ITD(τ) is the ITD measurement of the τ-th frame of the test mixture and ITD(θ) is the ITD training value of azimuth θ.
2.4) The azimuth estimates of all frames of the test mixture are compiled into a histogram, and a threshold based on the peak distribution selects the effective source azimuths. A peak is judged effective when the ratio of its frame count to the total frame count exceeds the threshold.
The number of effective peaks is the number of sources, and the azimuths of the effective peaks are the source azimuths contained in the test mixture, denoted L and δ(l) respectively, with l = 1, 2, ..., L.
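Steps 2.3)-2.4) can be sketched as below; the 10 % frame-ratio threshold is illustrative, since the patent leaves its value open.

```python
import numpy as np

def detect_sources(frame_azimuths, ratio=0.1):
    """Histogram the per-frame azimuth estimates on the 5-degree training
    grid and return the azimuths of the effective peaks: local maxima
    whose frame count exceeds `ratio` of the total frame count."""
    grid = np.arange(-90, 91, 5)
    counts = np.array([int(np.sum(frame_azimuths == a)) for a in grid])
    total = len(frame_azimuths)
    padded = np.concatenate(([-1], counts, [-1]))   # so edge bins can be peaks
    return [int(grid[i]) for i in range(len(grid))
            if padded[i + 1] > padded[i] and padded[i + 1] > padded[i + 2]
            and counts[i] / total > ratio]

# Toy frame estimates: two genuine sources plus a few stray frames
frames = np.array([-30] * 50 + [20] * 40 + [0] * 3 + [5] * 2)
sources = detect_sources(frames)
```

The stray frames at 0° and 5° fall under the frame-ratio threshold, so only the two genuine azimuths survive as effective peaks.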
3. Speech separation stage of the test process
3.1) Using the number of sources and the azimuth information obtained in step 2), the distance of every frequency bin of every frame of the test mixture to each source is computed, and the bin is assigned to a source:
J(τ, ω) = argmin over l of the (ITD, IID) distance between the bin and azimuth δ(l)
where J(τ, ω) is the index of the source to which the ω-th bin of the τ-th frame belongs; the numerator of the distance expresses, for a given azimuth δ(l), the deviation of the bin's binaural cues from that azimuth's IID and ITD values.
3.2) A binary mask is built for each source from the minimum distance:
M_l(τ, ω) = 1 if J(τ, ω) = l, and 0 otherwise.
The binaural signal at every frequency bin of every frame is classified according to the binary mask, giving the per-frame, per-bin data of each differently located source:
X_l(τ, ω) = M_l(τ, ω) · X(τ, ω)
where X_l(τ, ω) denotes the bin data of the l-th source in the τ-th frame. The separation of the mixture's time-frequency bins is shown in Fig. 4.
An inverse short-time Fourier transform (ISTFT, Inverse Short-Time Fourier Transform) applied to the separated frequency-domain signal of the l-th source gives its τ-th time-domain frame s_l(τ, m).
After conversion to the time domain, the window is removed; with w_H(m) the Hamming window given above, the de-windowed τ-th frame is obtained.
Overlap-adding the de-windowed frames synthesizes the separated speech signal s_l of the l-th source, realizing the separation of the differently located source signals.
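The masking and resynthesis of step 3.2) can be sketched as below. Rather than literally dividing each frame by the window (ill-conditioned near the window's small edge values), the sketch uses weighted overlap-add with a squared-window normalizer, which realizes the same un-windowing in a numerically safe way; treat that substitution as an assumption.

```python
import numpy as np

N, hop = 512, 256
win = np.hamming(N)

def resynthesize(X, masks):
    """Apply one binary mask per source to the mixture STFT `X`
    (shape: frames x 257), inverse-FFT each masked frame, and
    overlap-add with squared-window normalization."""
    n_frames = X.shape[0]
    out = []
    for M in masks:
        frames = np.fft.irfft(X * M, N)            # masked frames, time domain
        sig = np.zeros(hop * (n_frames - 1) + N)
        norm = np.zeros_like(sig)
        for t in range(n_frames):
            sig[t * hop : t * hop + N] += frames[t] * win
            norm[t * hop : t * hop + N] += win ** 2
        out.append(sig / np.maximum(norm, 1e-8))
    return out

rng = np.random.default_rng(3)
x = rng.standard_normal(4 * N)
T = 1 + (len(x) - N) // hop
X = np.fft.rfft(np.stack([x[t * hop : t * hop + N] * win for t in range(T)]),
                axis=1)
y = resynthesize(X, [np.ones((T, N // 2 + 1))])[0]   # all-pass mask
```

With an all-pass mask the pipeline reconstructs the input exactly, which verifies that only the masking, not the analysis/synthesis chain, alters the separated signal.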
4. Iteration stage:
4.1) For the per-source binaural speech signals obtained in step 3.2), the source azimuth information is re-estimated by step 2), giving corrected source azimuths.
4.2) With the corrected azimuths of step 4.1), the mixed binaural speech is separated again by step 3) using the spatial cues, giving the differently located source streams of the second separation.
4.3) Steps 4.1) and 4.2) are repeated, forming a localization + separation iterative structure; the number of iterations is set to 3. When the iterations end, the multi-source data streams are the final separation result of the test binaural mixture. In the present invention the iteration count of 3 was obtained from the results of the full simulation tests: after 3 iterations the localization results essentially stabilize.
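The localization + separation loop of step 4) has the following shape; the `locate` and `separate` callables stand for steps 2) and 3) and are placeholders supplied by the caller, not part of the patent's notation.

```python
def iterative_separation(mixture, locate, separate, n_iter=3):
    """Run the positioning+separation iteration: localize on the mixture
    once, then alternately re-localize on the separated streams and
    re-separate the original mixture, n_iter passes in total."""
    azimuths = locate([mixture])               # initial coarse localization
    streams = separate(mixture, azimuths)
    for _ in range(n_iter - 1):
        azimuths = locate(streams)             # corrected azimuths
        streams = separate(mixture, azimuths)  # separate the mixture again
    return streams

# Toy stand-ins that just record how often each step runs
calls = {"locate": 0, "separate": 0}
def toy_locate(signals):
    calls["locate"] += 1
    return [-30, 20]
def toy_separate(mixture, azimuths):
    calls["separate"] += 1
    return [f"stream@{a}" for a in azimuths]

result = iterative_separation("mixture", toy_locate, toy_separate)
```

With the default of three passes, localization and separation each run three times and the streams of the last pass are returned as the final result.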
The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.
Claims (5)
1. A binaural mixed-speech separation method based on an iterative structure, characterized by comprising the steps of:
1) Parameter training stage:
1.1) Training with directional binaural white-noise signals, each binaural white-noise signal being generated by convolving head-related impulse response (HRIR) data with a monaural white-noise signal; the sound-source azimuth angle θ is defined as the angle between the projection of the direction vector onto the horizontal plane and the median vertical plane, sampled in 5° steps over the range [-90°, 90°];
1.2) Pre-processing the binaural white-noise signals of 1.1) to obtain single-frame binaural signals after framing; the pre-processing comprises amplitude normalization, framing with windowing, and endpoint detection;
1.3) Computing the cross-correlation function of each single-frame binaural signal obtained in 1.2), interpolating the cross-correlation function with cubic spline functions, and calculating the ITD estimate of the single-frame binaural signal; the mean of the ITD estimates over all frames of the same azimuth is taken as the ITD training value of that azimuth, denoted ITD(θ);
1.4) Applying a short-time Fourier transform to each single-frame binaural signal obtained in 1.2) to transform it to the frequency domain, and computing the ratio of the left-ear and right-ear magnitude spectra at each frequency bin to obtain IID estimates; the mean of the IID estimates over all frames of the same azimuth is taken as the IID training value of that azimuth, denoted IID(ω, θ), where ω is the angular frequency;
2) Localization stage of the test process:
2.1) Pre-processing the test binaural mixed-speech signal to obtain single-frame binaural speech signals; the pre-processing comprises amplitude normalization, framing with windowing, and endpoint detection;
2.2) Calculating the ITD test value of each single-frame binaural speech signal obtained in 2.1) by the method of 1.3), computing the distance between each ITD test value and the ITD training value of each azimuth from step 1), and thereby obtaining the azimuth estimate of each frame of the binaural speech signal;
2.3) Performing histogram statistics on the azimuth information of all frames obtained in 2.2), and estimating the number and azimuths of the sound sources in the test binaural mixed-speech signal by detecting the peaks of the histogram;
3) Speech separation stage of the test process: using the per-azimuth ITD training values and the per-azimuth, per-frequency IID training values obtained in 1), computing the distance between each frequency bin of each frame of the test binaural mixed-speech signal and each sound source obtained in 2.3); establishing a binary mask for each frequency bin of each frame according to the minimum-distance principle, classifying the signal at each frequency bin of each frame according to the binary mask to obtain the frequency components corresponding to the sound sources at different azimuths, and reconstructing the signal from all frames and all frequency bins belonging to the same sound source, thereby separating the test binaural mixed-speech signal into the sound sources at different azimuths;
4) Iteration stage:
4.1) Re-estimating, by stage 2), the sound-source azimuth information of the separated signals obtained in 3), yielding corrected sound-source azimuth information;
4.2) Separating the test binaural mixed speech again, by stage 3), according to the corrected sound-source azimuth information obtained in 4.1), yielding re-separated data streams for the sound sources at different azimuths;
4.3) Repeating 4.1) and 4.2) iteratively; after the iteration, the multi-source data streams are the final separation result of the test binaural mixed-speech signal.
2. The binaural mixed-speech separation method according to claim 1, characterized in that the process of obtaining the ITD estimate of a single-frame binaural signal by cubic-spline interpolation in 1.3) is as follows:
On the interval [k_i, k_(i+1)], R(τ, k) is fitted with a cubic polynomial, namely:
R(τ, k) = a_i k^3 + b_i k^2 + c_i k + d_i
where a_i, b_i, c_i and d_i are undetermined coefficients, k is the delay in sampling points, and i indexes the coordinate interval of the i-th polynomial piece;
Under the boundary conditions that the second derivative is continuous and vanishes at the end points, the three-moment method is used to solve for the cross-correlation function R(τ, μ) as a function of delay time, where μ denotes the delay in sampling time and τ denotes the τ-th frame;
The ITD_τ of the single-frame binaural signal is then defined as the delay at which the cross-correlation function R(τ, μ) attains its maximum:
ITD_τ = argmax_μ R(τ, μ)
Taking the expectation of ITD_τ over all frames of the binaural signals at the same azimuth gives the ITD training value of azimuth θ, denoted ITD(θ), namely:
ITD(θ) = mean(ITD_τ).
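Claim 2's boundary conditions (continuous second derivative, zero at the end points) are exactly those of a natural cubic spline, which the three-moment method solves. The pure-NumPy sketch below is a hedged illustration, not the patent's code: it solves the three-moment tridiagonal system for the spline through the discrete cross-correlation samples and searches a fine delay grid for the maximum that defines ITD_τ (returned here in sample units; the function name and the `oversample` parameter are assumptions).

```python
import numpy as np

def natural_spline_max(lags, corr, oversample=100):
    """Three-moment natural cubic spline through (lags, corr); returns the
    (possibly fractional) delay at which the interpolated R is maximal."""
    n = len(lags) - 1
    h = lags[1] - lags[0]                     # assumes uniform lag spacing
    # Tridiagonal system for the moments M_i (second derivatives); M_0 = M_n = 0
    A = np.zeros((n - 1, n - 1))
    np.fill_diagonal(A, 4.0)
    np.fill_diagonal(A[1:], 1.0)              # sub-diagonal
    np.fill_diagonal(A[:, 1:], 1.0)           # super-diagonal
    rhs = 6.0 * (corr[:-2] - 2 * corr[1:-1] + corr[2:]) / h**2
    M = np.zeros(n + 1)
    M[1:-1] = np.linalg.solve(A, rhs)
    # Evaluate each cubic piece on a fine grid and track the global maximum
    best_x, best_y = lags[0], corr[0]
    for i in range(n):
        x = np.linspace(lags[i], lags[i + 1], oversample)
        t0, t1 = lags[i + 1] - x, x - lags[i]
        y = (M[i] * t0**3 + M[i + 1] * t1**3) / (6 * h) \
            + (corr[i] - M[i] * h**2 / 6) * t0 / h \
            + (corr[i + 1] - M[i + 1] * h**2 / 6) * t1 / h
        j = np.argmax(y)
        if y[j] > best_y:
            best_x, best_y = x[j], y[j]
    return best_x
```

Dividing the returned lag by the sampling rate converts it to ITD_τ in seconds.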
3. The binaural mixed-speech separation method according to claim 1, characterized in that the detailed process of step 3) is as follows:
3.1) For each frequency bin of each frame of the test binaural mixed-speech signal, compute its distance to each of the sound sources obtained in 2.3) so as to classify the bin by sound source, where J(τ, ω) denotes the index of the sound source to which the ω-th frequency bin of the τ-th frame belongs, L is the number of sound sources, δ(l) is the azimuth of the l-th sound source, and X_L(τ, ω) and X_R(τ, ω) are the spectra of the single-frame left-ear and right-ear signals, respectively;
3.2) Establish a binary mask for each sound source according to the minimum distance values, and classify the binaural speech signal at each frequency bin of each frame according to the binary masks, obtaining the per-frame, per-bin data corresponding to the sound source at each azimuth, i.e. the frequency-bin data of the τ-th frame of the l-th sound source;
Apply the inverse short-time Fourier transform (ISTFT) to the separated frequency-domain signal of the l-th sound source to obtain the time-domain signal of the τ-th frame of source l, m denoting the sample index within the frame;
Then remove the analysis window from each frame, where w_H(m) is the Hamming window and N is the frame length;
Finally, overlap-add the de-windowed frames; the synthesis yields the separated speech signal s_l of the l-th sound source.
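The mask-and-resynthesize steps of claim 3 might be sketched as follows. `binary_masks` implements the minimum-distance assignment J(τ, ω) of step 3.1), given a precomputed distance array; `overlap_add` is a standard weighted-overlap-add stand-in for the claim's window removal and frame summation. Function names, the distance-array layout, and the normalization scheme are illustrative assumptions, not the patent's exact procedure.

```python
import numpy as np

def binary_masks(dist):
    """dist[l, t, f]: distance of time-frequency bin (t, f) to source l.
    The minimum-distance source claims the bin (J(tau, omega) in step 3.1)."""
    owner = np.argmin(dist, axis=0)
    return [(owner == l).astype(float) for l in range(dist.shape[0])]

def overlap_add(frames, hop, win):
    """Sum frames at hop-sized offsets and divide by the summed window to
    undo it -- a common stand-in for 'remove window, then overlap-add'."""
    n = len(win)
    out = np.zeros(hop * (len(frames) - 1) + n)
    norm = np.zeros_like(out)
    for t, frame in enumerate(frames):
        out[t * hop:t * hop + n] += frame
        norm[t * hop:t * hop + n] += win      # accumulated window weight
    return out / np.maximum(norm, 1e-8)       # avoid division by zero at edges
```

Multiplying each mask by the mixture STFT and overlap-adding the inverse-transformed frames of one source yields that source's separated signal s_l.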
4. The binaural mixed-speech separation method according to claim 1, characterized in that in 2.3) the sound sources in the mixed speech are estimated from the effective peaks of the histogram; a peak is judged effective when the ratio of its frame count to the total number of frames is greater than a threshold.
5. The binaural mixed-speech separation method according to claim 1, characterized in that the number of iterations in step 4) is 3.
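The fixed-count refinement of the iteration stage (step 4 of claim 1, with claim 5 fixing three iterations) reduces to a short loop. `localize` and `separate` stand in for stages 2) and 3) and are hypothetical caller-supplied interfaces, not the patent's API:

```python
def iterative_separation(mixture, localize, separate, n_iter=3):
    """Alternate localization (stage 2) and separation (stage 3):
    4.1) re-estimate each stream's azimuth, 4.2) re-separate the mixture
    with the corrected azimuths, 4.3) repeat n_iter times (claim 5: 3)."""
    azimuths = localize(mixture)              # initial localization
    streams = separate(mixture, azimuths)     # initial separation
    for _ in range(n_iter):
        corrected = [localize(s)[0] for s in streams]   # step 4.1
        streams = separate(mixture, corrected)          # step 4.2
    return streams                            # final separation result
```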
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610824648.XA CN106373589B (en) | 2016-09-14 | 2016-09-14 | A kind of ears mixing voice separation method based on iteration structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106373589A CN106373589A (en) | 2017-02-01 |
CN106373589B true CN106373589B (en) | 2019-07-26 |
Family
ID=57896703
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610824648.XA Active CN106373589B (en) | 2016-09-14 | 2016-09-14 | A kind of ears mixing voice separation method based on iteration structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106373589B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107942290B (en) * | 2017-11-16 | 2019-10-11 | 东南大学 | Binaural sound sources localization method based on BP neural network |
CN108091345B (en) * | 2017-12-27 | 2020-11-20 | 东南大学 | Double-ear voice separation method based on support vector machine |
CN108647556A (en) * | 2018-03-02 | 2018-10-12 | 重庆邮电大学 | Sound localization method based on frequency dividing and deep neural network |
CN109949821B (en) * | 2019-03-15 | 2020-12-08 | 慧言科技(天津)有限公司 | Method for removing reverberation of far-field voice by using U-NET structure of CNN |
CN110491410B (en) * | 2019-04-12 | 2020-11-20 | 腾讯科技(深圳)有限公司 | Voice separation method, voice recognition method and related equipment |
CN110275138B (en) * | 2019-07-16 | 2021-03-23 | 北京工业大学 | Multi-sound-source positioning method using dominant sound source component removal |
CN110491412B (en) * | 2019-08-23 | 2022-02-25 | 北京市商汤科技开发有限公司 | Sound separation method and device and electronic equipment |
CN111539449B (en) * | 2020-03-23 | 2023-08-18 | 广东省智能制造研究所 | Sound source separation and positioning method based on second-order fusion attention network model |
CN113079452B (en) * | 2021-03-30 | 2022-11-15 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio processing method, audio direction information generating method, electronic device, and medium |
CN113782047B (en) * | 2021-09-06 | 2024-03-08 | 云知声智能科技股份有限公司 | Voice separation method, device, equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102438189A (en) * | 2011-08-30 | 2012-05-02 | Southeast University | Dual-channel acoustic signal-based sound source localization method |
CN102565759A (en) * | 2011-12-29 | 2012-07-11 | Southeast University | Binaural sound source localization method based on sub-band signal to noise ratio estimation |
CN103983946A (en) * | 2014-05-23 | 2014-08-13 | Beijing Shenzhou Puhui Technology Co., Ltd. | Method for processing signals of multiple measuring channels in sound source localization process |
CN104464750A (en) * | 2014-10-24 | 2015-03-25 | Southeast University | Voice separation method based on binaural sound source localization |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6253226B2 (en) * | 2012-10-29 | 2017-12-27 | Mitsubishi Electric Corporation | Sound source separation device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||