CN107346664A - A kind of ears speech separating method based on critical band - Google Patents
- Publication number: CN107346664A
- Application number: CN201710479139.2A
- Authority: CN (China)
- Prior art date: 2017-06-22
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
Abstract
The invention discloses a speech separation method based on critical bands and binaural signals. Using trained parameters and the azimuth information of the sound sources, the binaural signals are classified by source within each critical band, yielding a data stream for each source; each separated source signal is then reconstructed, achieving speech separation. The invention builds on the scale-reduction processing mechanism of the human auditory system combined with the auditory masking effect of the ear, separating the mixed speech within each critical band according to the azimuth information of the different sources. Localization and separation results under different noise and reverberation conditions show that binaural speech separation based on critical bands achieves an effective performance improvement.
Description
Technical field
The present invention relates to the fields of sound-source localization and speech separation, and in particular to a binaural speech separation method based on critical bands.
Background technology
Speech localization and separation techniques form the front end of a speech signal processing system, and their performance strongly affects the whole system. Since the start of the digital communication era, speech processing techniques such as speech coding, speech localization, speech separation, and speech enhancement have all developed rapidly; in the current internet wave in particular, voice assistants have pushed speech signal processing to a new height.
The development of multi-modal human-computer interaction, human-machine dialogue, and speech recognition is inseparable from research and development in speech signal processing. As the front end of a speech processing system, speech separation technology therefore directly determines the performance and effectiveness of the whole speech system.
The content of the invention
Object of the invention: In order to overcome the deficiencies of the prior art, the present invention provides a binaural speech separation method based on critical bands. It uses the scale-reduction processing mechanism of the human auditory system, combined with the auditory masking effect of the ear, to simulate auditory characteristics: based on a critical-band division, each frame of the signal is divided into sub-bands so that an accurate mixing matrix is obtained and speech separation is performed, remedying the deficiencies of the prior art.
Technical scheme: A binaural speech separation method based on critical bands, characterized in that the method comprises the following steps:
1) the parameter training stage:
1.1) Training is performed using directional binaural white-noise signals. Each binaural white-noise signal is the binaural signal generated by convolving head-related impulse response (HRIR) data with a monaural white-noise signal of known azimuth. The sound-source azimuth θ is defined as the angle between the projection of the direction vector onto the horizontal plane and the median vertical plane, in the range [-90°, 90°] at intervals of 5°;
1.2) The binaural white-noise signals of known azimuth are preprocessed. The preprocessing comprises amplitude normalization and framing with windowing, yielding the single-frame binaural signals after framing;
The amplitude normalization method is:
x_L = x_L / maxvalue
x_R = x_R / maxvalue
where x_L and x_R denote the left-ear and right-ear signals, respectively, and maxvalue = max(|x_L|, |x_R|) is the maximum amplitude over the two channels.
Framing with windowing applies a Hamming window to each frame of the speech signal. The τ-th windowed frame can be expressed as:
x_L(τ, n) = w_H(n) x_L(τN + n), 0 ≤ n < N
x_R(τ, n) = w_H(n) x_R(τN + n), 0 ≤ n < N
where x_L(τ, n) and x_R(τ, n) denote the left- and right-ear signals of frame τ, and N is the number of samples per frame.
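As a concrete illustration, the preprocessing of step 1.2) can be sketched in Python/NumPy as follows. This is a minimal sketch: the function name `preprocess` is an assumption, and while the formula above writes x(τN + n) (non-overlapping frames), the embodiment later specifies a 16 ms frame shift at a 32 ms frame length, so a `hop` parameter covers both cases.

```python
import numpy as np

def preprocess(xL, xR, N=512, hop=256):
    """Step 1.2): amplitude normalization followed by Hamming-windowed
    framing. N = 512 and hop = 256 correspond to the 32 ms frame length
    and 16 ms frame shift at 16 kHz given in the embodiment."""
    maxvalue = max(np.max(np.abs(xL)), np.max(np.abs(xR)))
    xL, xR = xL / maxvalue, xR / maxvalue
    w = np.hamming(N)
    n_frames = (len(xL) - N) // hop + 1
    framesL = np.stack([w * xL[t * hop:t * hop + N] for t in range(n_frames)])
    framesR = np.stack([w * xR[t * hop:t * hop + N] for t in range(n_frames)])
    return framesL, framesR
```

Both channels are divided by the same maximum so that the interaural level relationship is preserved by the normalization.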
1.3) A cross-correlation operation is applied to the single-frame binaural speech signals obtained in step 1.2), and the cross-correlation function is used to compute the interaural time difference (ITD) estimate of each frame. The average of the ITD estimates over all frames of the same azimuth serves as the trained ITD value of that azimuth, denoted δ(θ).
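The per-frame cross-correlation search can be sketched as follows (an illustrative sketch assuming NumPy; `itd_frames` and `train_itd` are hypothetical helper names). Note that NumPy's lag convention requires correlating the right channel against the left to match the sum over x_L(n)·x_R(n + k).

```python
import numpy as np

def itd_frames(framesL, framesR):
    """Per-frame ITD estimate: the lag k maximizing the cross-correlation
    sum_n x_L(tau, n) * x_R(tau, n + k) over -N+1 <= k <= N-1."""
    N = framesL.shape[1]
    itds = []
    for xl, xr in zip(framesL, framesR):
        # np.correlate(xr, xl, "full")[k + N - 1] = sum_n x_L(n) x_R(n + k)
        corr = np.correlate(xr, xl, mode="full")
        itds.append(int(np.argmax(corr)) - (N - 1))
    return np.array(itds)

def train_itd(framesL, framesR):
    """delta(theta): average ITD over all frames of one training azimuth."""
    return itd_frames(framesL, framesR).mean()
```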
The method of establishing the ITD model of azimuth θ is as follows.
The ITD value of the τ-th frame is:
ITD(τ) = arg max_k Σ_{n=0}^{N-|k|-1} x_L(τ, n) x_R(τ, n + k),  -N+1 ≤ k ≤ N-1
The ITD(τ) values of all frames of the binaural white-noise signal at azimuth θ are averaged to obtain δ(θ), the trained ITD parameter of azimuth θ:
δ(θ) = Σ_τ ITD(τ) / frameNum
where frameNum denotes the total number of frames of the azimuth-θ binaural white-noise signal after framing.
This establishes the model between the azimuth θ and the trained ITD parameter.
1.4) A short-time Fourier transform is applied to the single-frame binaural speech signals obtained in step 1.2), transforming them to the frequency domain, and the ratio of the left-ear to right-ear magnitude spectra in each frequency bin, i.e. the interaural intensity difference (IID) vector, is computed. The average of the IID estimates over all frames of the same azimuth serves as the trained IID value of that azimuth, denoted α(θ, ω), where ω denotes the frequency of the Fourier transform.
The method of establishing the IID model of azimuth θ is as follows.
The IID value of the τ-th frame is:
IID(τ, ω) = 20 log (|X_L(τ, ω)| / |X_R(τ, ω)|)
where X_L(τ, ω) and X_R(τ, ω) are the frequency-domain representations, i.e. short-time Fourier transforms, of x_L(τ, n) and x_R(τ, n), respectively:
X(τ, ω) = Σ_{n=0}^{N-1} x(τ, n) e^{-jωn}
where x(τ, n) denotes the frame-τ signal (the transform is applied to the left-ear and right-ear signals separately), and ω denotes the angular-frequency vector, with range [0, 2π] at intervals of 2π/512;
The IID(τ, ω) values of all frames of the binaural white-noise signal at azimuth θ are averaged to obtain α(θ, ω), the trained IID parameter of azimuth θ:
α(θ, ω) = Σ_τ IID(τ, ω) / frameNum
where frameNum denotes the total number of frames of the azimuth-θ binaural white-noise signal after framing.
This establishes the model between the azimuth θ and the trained IID parameter.
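The IID computation of step 1.4) can be sketched as follows (an illustrative sketch assuming NumPy; the small `eps` guard against log of zero is an addition not present in the patent's formula, and the helper names are hypothetical):

```python
import numpy as np

def iid_frames(framesL, framesR, eps=1e-12):
    """Step 1.4): per-frame IID vector, 20*log10 of the ratio of the
    left/right magnitude spectra in each bin. eps guards against log(0)."""
    XL = np.fft.fft(framesL, axis=-1)
    XR = np.fft.fft(framesR, axis=-1)
    return 20.0 * np.log10((np.abs(XL) + eps) / (np.abs(XR) + eps))

def train_iid(framesL, framesR):
    """alpha(theta, omega): frame-average IID for one training azimuth."""
    return iid_frames(framesL, framesR).mean(axis=0)
```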
2) Binaural mixed-speech separation stage based on critical bands and azimuth information:
2.1) The binaural mixed-speech signal in the testing process contains multiple sound sources, each corresponding to a different azimuth. The binaural mixed-speech signal is preprocessed, including amplitude normalization and framing with windowing;
2.2) A Fourier transform is applied to the framed binaural mixed signal, and the frequency domain is divided into sub-bands according to the frequency ranges of the critical bands, yielding the framed sub-band signals;
The sub-band division method is as follows:
A short-time Fourier transform is applied frame by frame to the multi-frame signal obtained in step 2.1), transforming it to the time-frequency domain and yielding the framed time-frequency signals X_L(τ, ω) and X_R(τ, ω) of the binaural signals.
The frequency bins are then divided into sub-bands according to the critical-band partition, where C denotes the number of critical bands, and ω_c_low and ω_c_high denote the low- and high-frequency limits of the c-th critical band.
2.3) According to the number of sources contained in the mixed signal and their azimuth information, together with the azimuth ITD and IID parameters established in steps 1.3) and 1.4), the sources are classified in each critical band of each frame obtained in step 2.2), based on the similarity between the left- and right-ear signals;
2.4) The critical-band classification results obtained in step 2.3) are multiplied with the framed time-frequency signals obtained in step 2.1), yielding the time-frequency-domain signal corresponding to each source;
2.5) An inverse Fourier transform is applied to the time-frequency-domain signal of each source obtained in step 2.4), converting it back to a time-domain signal; windowing is applied and the separated speech of each source is synthesized.
Beneficial effects: Compared with existing frequency-based speech separation techniques, the present invention builds on the scale-reduction processing mechanism of the human auditory system combined with the auditory masking effect of the ear. After the localization stage accurately obtains the source azimuths, the different sub-bands of each frame are separated, combining sound-source localization with critical-band separation. For the separation of multiple speakers, the separation performance, measured by SNR (Source to Noise Ratio), SDR (Source to Distortion Ratio), SAR (Sources to Artifacts Ratio), and PESQ (Perceptual Evaluation of Speech Quality), is effectively improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of the planar space of sound-source localization and speech separation in the present invention;
Fig. 2 is a system block diagram of the present invention.
Embodiment
The present invention is further described below in conjunction with the accompanying drawings.
The present invention first performs data training: for each azimuth, the averages of the interaural time difference ITD (Interaural Time Difference) and the interaural intensity difference IID (Interaural Intensity Difference) serve as the localization feature cues of the source azimuth, establishing an azimuth mapping model. During actual sound-source localization, the final number of sources and their azimuths are estimated from the histogram of the per-frame azimuths of the binaural mixed signal. In the source separation stage, the binaural mixed signal is first divided into sub-bands based on the critical bands; then, using the azimuth information obtained from localization, the frequency-domain signal is classified within each critical band, and finally the time-frequency points of each source are transformed back to the time domain by an inverse Fourier transform.
Fig. 1 is a schematic diagram of the planar space of sound-source localization and speech separation in the present invention, taking 2 sources as an example. The 2 microphones are located at the ears. In the present invention, the spatial position of a source is represented by its azimuth θ, -180° ≤ θ ≤ 180°, defined as the angle between the projection of the direction vector onto the horizontal plane and the median vertical plane. In the horizontal plane, θ = 0° denotes straight ahead and, proceeding clockwise, θ = 90°, 180°, and -90° denote directly right, directly behind, and directly left, respectively. Fig. 1 takes 2 sources as an example (in this embodiment the sources are the voices of speakers), with azimuths of -30° and 30°, respectively.
Fig. 2 is a system block diagram of the present invention. The method of the invention comprises model training, time-frequency transformation, critical-band division, and per-sub-band source classification. The embodiment of the technical solution of the present invention is described in detail below with reference to the accompanying drawings:
Step 1) Data training:
1.1) Fig. 2 gives the overall system diagram. In the training stage, the head-related transfer function HRTF (Head Related Transfer Function), or its time-domain counterpart the head-related impulse response HRIR (Head Related Impulse Response), is used to generate binaural signals of specific azimuths. The present invention uses the HRIR data measured by the MIT Media Lab, and generates the binaural signals of the corresponding azimuths by convolving white noise with the HRIR data for θ = -90° to 90° (5° intervals).
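This generation step amounts to a pair of convolutions (a minimal sketch assuming NumPy; `make_binaural` is a hypothetical helper name, and the HRIR pair of the desired azimuth is supplied by the caller):

```python
import numpy as np

def make_binaural(mono, hrir_l, hrir_r):
    """Step 1.1): generate a directional binaural pair by convolving a
    monaural signal (e.g. white noise) with the left/right HRIR of one
    azimuth."""
    return np.convolve(mono, hrir_l), np.convolve(mono, hrir_r)
```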
1.2) The binaural white-noise signal of azimuth θ is preprocessed. The preprocessing of this method comprises amplitude normalization, framing, and windowing.
The amplitude normalization method is:
x_L = x_L / maxvalue
x_R = x_R / maxvalue
where x_L and x_R denote the left-ear and right-ear signals, respectively, and maxvalue = max(|x_L|, |x_R|) is the maximum amplitude over the two channels.
In this embodiment a Hamming window is applied to each frame of the speech signal. The τ-th windowed frame can be expressed as:
x_L(τ, n) = w_H(n) x_L(τN + n), 0 ≤ n < N
x_R(τ, n) = w_H(n) x_R(τN + n), 0 ≤ n < N
where x_L(τ, n) and x_R(τ, n) denote the left- and right-ear signals of frame τ, and N is the number of samples per frame. In this embodiment the speech sampling rate is 16 kHz, the frame length is 32 ms, and the frame shift is 16 ms, so N = 512; w_H(n) is the Hamming window function:
w_H(n) = 0.54 - 0.46 cos(2πn / (N - 1)), 0 ≤ n < N
1.3) The ITD model of azimuth θ is established.
The ITD value of the τ-th frame is:
ITD(τ) = arg max_k Σ_{n=0}^{N-|k|-1} x_L(τ, n) x_R(τ, n + k),  -N+1 ≤ k ≤ N-1
The ITD(τ) values of all frames of the binaural white-noise signal at azimuth θ are averaged to obtain δ(θ), the trained ITD parameter of azimuth θ:
δ(θ) = Σ_τ ITD(τ) / frameNum
where frameNum denotes the total number of frames of the azimuth-θ binaural white-noise signal after framing.
This establishes the model between the azimuth θ and the trained ITD parameter.
1.4) The IID model of azimuth θ is established:
The IID value of the τ-th frame is:
IID(τ, ω) = 20 log (|X_L(τ, ω)| / |X_R(τ, ω)|)
where X_L(τ, ω) and X_R(τ, ω) are the frequency-domain representations, i.e. short-time Fourier transforms, of x_L(τ, n) and x_R(τ, n), respectively:
X(τ, ω) = Σ_{n=0}^{N-1} x(τ, n) e^{-jωn}
where x(τ, n) denotes the frame-τ signal (the transform is applied to the left-ear and right-ear signals separately), and ω denotes the angular-frequency vector, with range [0, 2π] at intervals of 2π/512.
The IID(τ, ω) values of all frames of the binaural white-noise signal at azimuth θ are averaged to obtain α(θ, ω), the trained IID parameter of azimuth θ:
α(θ, ω) = Σ_τ IID(τ, ω) / frameNum
where frameNum denotes the total number of frames of the azimuth-θ binaural white-noise signal after framing.
This establishes the model between the azimuth θ and the trained IID parameter.
Step 2) Binaural mixed-speech separation stage based on critical bands and azimuth information.
2.1) Corresponding to the preprocessing module in Fig. 2, the binaural mixed signal containing multiple sources of different azimuths is preprocessed in the same way as in step 1.2) above, including amplitude normalization, framing, and windowing, with a frame length of 32 ms, a frame shift of 16 ms, and a Hamming window.
2.2) Corresponding to the frequency-domain transform module in Fig. 2, a short-time Fourier transform is applied frame by frame to the multi-frame signal obtained in step 2.1), transforming it to the time-frequency domain and yielding the framed time-frequency signals X_L(τ, ω) and X_R(τ, ω) of the binaural signals.
The frequency bins are then divided into sub-bands according to the critical-band partition, where C denotes the number of critical bands, and ω_c_low and ω_c_high denote the low- and high-frequency limits of the c-th critical band.
The division of the critical bands, i.e. the low-frequency limit, high-frequency limit, and bandwidth of each critical band, is shown in the following table:
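The table itself is an image in the source and is not reproduced here. As an illustration, the conventional Zwicker/Bark critical-band edges can be mapped to FFT-bin ranges as follows (a sketch under the assumption that the patent uses the standard Bark bands; only bands below the 8 kHz Nyquist frequency of a 16 kHz signal are kept):

```python
# Conventional Zwicker/Bark critical-band edges in Hz (an assumption --
# the patent's own table is an image and is not reproduced in the source).
BARK_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700]  # edges above 8 kHz exceed fs/2 at 16 kHz

def band_bins(fs=16000, N=512):
    """Map each critical band c to its FFT-bin range
    (omega_c_low, omega_c_high) for an N-point transform."""
    hz_per_bin = fs / N
    return [(int(lo / hz_per_bin), int(hi / hz_per_bin))
            for lo, hi in zip(BARK_EDGES_HZ[:-1], BARK_EDGES_HZ[1:])]
```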
2.3) Corresponding to the azimuth-based sub-band classification module in Fig. 2. Here we assume that the number of sources contained in the binaural mixed signal and their azimuths are known. Many existing algorithms can estimate the number and azimuths of sources from binaural signals; sound-source localization is not described here, and likewise no particular localization algorithm is prescribed. We only discuss how, after localization, the mixture is separated according to the azimuth information of the different sources.
According to the masking effect of the human auditory system, within a given critical band of a given frame, usually only one source signal is dominant. In azimuth-based speech separation, the interaural time difference ITD and the interaural intensity difference IID therefore serve as spatial cues: the mask function is computed from the maximum similarity between the two channels, and the sources are classified within each critical band. Here we assume the binaural mixed signal contains L sources, with the l-th source at azimuth θ_l (1 ≤ l ≤ L):
where X_L(τ, ω) and X_R(τ, ω) are the left- and right-ear frequency-domain signals of frame τ, ω_c denotes the frequency range of the c-th critical band, θ_l is the azimuth of the l-th source, α(θ_l, ω) is the IID parameter of the l-th source at azimuth θ_l and frequency ω, and δ(θ_l) is the ITD parameter of the l-th source's azimuth.
J(τ, c) thus classifies the sources in each critical band using the azimuth information.
Next, a binary mask is marked on the critical bands corresponding to each source, so that M_l(τ, ω) denotes the binary mask of the l-th source in the c-th critical band.
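The classification and masking of steps 2.3)-2.4) can be sketched as follows. This is an illustrative sketch assuming NumPy: the patent's exact similarity measure for J(τ, c) is an equation image not reproduced in the source, so a simple squared-error score against the trained α and δ parameters is assumed, and all function names are hypothetical.

```python
import numpy as np

def classify_bands(iid_obs, itd_obs, alpha, delta, bands):
    """J(tau, c): for each frame tau and critical band c, pick the source
    l whose trained parameters alpha(theta_l, omega) and delta(theta_l)
    best match the observed IID/ITD (squared-error score -- an assumption).
    iid_obs: (frames, bins); itd_obs: (frames,);
    alpha: (L, bins); delta: (L,); bands: list of (lo, hi) bin ranges."""
    J = np.zeros((iid_obs.shape[0], len(bands)), dtype=int)
    for t in range(iid_obs.shape[0]):
        for c, (lo, hi) in enumerate(bands):
            score = [np.sum((iid_obs[t, lo:hi] - alpha[l, lo:hi]) ** 2)
                     + (itd_obs[t] - delta[l]) ** 2
                     for l in range(alpha.shape[0])]
            J[t, c] = int(np.argmin(score))
    return J

def binary_masks(J, bands, n_bins, L):
    """M_l(tau, omega): 1 where band c of frame tau is assigned to source l."""
    masks = np.zeros((L, J.shape[0], n_bins))
    for t in range(J.shape[0]):
        for c, (lo, hi) in enumerate(bands):
            masks[J[t, c], t, lo:hi] = 1.0
    return masks
```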
2.4) According to the binary masks, the binaural signals of every frame and every frequency bin are classified, yielding the time-frequency signal corresponding to the l-th source, i.e. the frequency-domain data of the τ-th frame of the l-th source.
Here the mask is multiplied with the left-ear signal to obtain the time-frequency data of each source; the right-ear signal could in fact equally be used.
2.5) Corresponding to the inverse time-frequency transform module in Fig. 2, a short-time inverse Fourier transform is applied to the separated frequency-domain signal of the l-th source, yielding the τ-th-frame time-domain signal of source l.
After conversion to the time domain, windowing is applied; the τ-th frame after this de-windowing step is expressed in terms of w_H(m), the Hamming window defined above.
The de-windowed frames of each source are then overlap-added, yielding the l-th source signal s_l of the separated mixture, thereby achieving the separation of the source signals of different azimuths.
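Step 2.5) can be sketched as follows. This is an illustrative sketch assuming NumPy; applying the analysis window a second time at synthesis and normalizing by the summed squared window is a standard overlap-add scheme assumed here, since the patent's de-windowing equations are images not reproduced in the source.

```python
import numpy as np

def reconstruct(frames_freq, hop=256):
    """Step 2.5): inverse FFT per frame, apply the Hamming window again
    (synthesis windowing), then overlap-add; dividing by the summed
    squared window makes the analysis/synthesis chain exactly invertible."""
    n_frames, N = frames_freq.shape
    w = np.hamming(N)
    out = np.zeros((n_frames - 1) * hop + N)
    wsum = np.zeros_like(out)
    for t, X in enumerate(frames_freq):
        frame = np.real(np.fft.ifft(X)) * w
        out[t * hop:t * hop + N] += frame
        wsum[t * hop:t * hop + N] += w * w
    return out / np.maximum(wsum, 1e-12)
```

With an unmodified spectrum this chain reproduces the input signal, which is a useful sanity check before applying the binary masks.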
The above is merely a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can also be made without departing from the principles of the invention, and such improvements and modifications should likewise be regarded as falling within the protection scope of the present invention.
Claims (7)
1. A binaural speech separation method based on critical bands, characterized in that the method comprises the following steps:
1) Parameter training stage:
1.1) Training is performed using directional binaural white-noise signals, each being the binaural signal generated by convolving head-related impulse response (HRIR) data with a monaural white-noise signal of known azimuth; the sound-source azimuth θ of the binaural white-noise signal is defined as the angle between the projection of the direction vector onto the horizontal plane and the median vertical plane, in the range [-90°, 90°];
1.2) The binaural white-noise signals of known azimuth are preprocessed, the preprocessing comprising amplitude normalization and framing with windowing, yielding the single-frame binaural signals after framing;
1.3) A cross-correlation operation is applied to the single-frame binaural speech signals obtained in step 1.2), and the cross-correlation function is used to compute the interaural time difference (ITD) estimate of each frame; the average of the ITD estimates over all frames of the same azimuth serves as the trained ITD value of that azimuth, establishing the ITD model of azimuth θ, denoted δ(θ);
1.4) A short-time Fourier transform is applied to the single-frame binaural speech signals obtained in step 1.2), transforming them to the frequency domain, and the ratio of the left-ear to right-ear magnitude spectra in each frequency bin, i.e. the interaural intensity difference (IID) vector, is computed; the average of the IID estimates over all frames of the same azimuth serves as the trained IID value of that azimuth, establishing the IID model of azimuth θ, denoted α(θ, ω), where ω denotes the frequency of the Fourier transform;
2) Binaural mixed-speech separation stage based on critical bands and azimuth information:
2.1) The binaural mixed-speech signal in the testing process contains multiple sound sources, each corresponding to a different azimuth; the binaural mixed-speech signal is preprocessed in the same way as in step 1.2), including amplitude normalization and framing with windowing;
2.2) A Fourier transform is applied to the framed binaural mixed signal, and the frequency domain is divided into sub-bands according to the frequency ranges of the critical bands, yielding the framed sub-band signals;
2.3) According to the number of sources contained in the mixed signal and their azimuth information, together with the azimuth ITD and IID parameters established in steps 1.3) and 1.4), the sources are classified in each critical band of each frame obtained in step 2.2), based on the similarity between the left- and right-ear signals;
2.4) The critical-band classification results obtained in step 2.3) are multiplied with the framed time-frequency signals obtained in step 2.1), yielding the time-frequency-domain signal corresponding to each source;
2.5) An inverse Fourier transform is applied to the time-frequency-domain signal of each source obtained in step 2.4), converting it to a time-domain signal; windowing is applied and the separated speech of each source is synthesized.
2. The binaural speech separation method based on critical bands according to claim 1, characterized in that the sound-source azimuth angles θ are taken at intervals of 5°.
3. The binaural speech separation method based on critical bands according to claim 1, characterized in that the amplitude normalization method in step 1.2) is:
x_L = x_L / maxvalue
x_R = x_R / maxvalue
where x_L and x_R denote the left-ear and right-ear signals, respectively, and maxvalue = max(|x_L|, |x_R|) is the maximum amplitude over the two channels.
4. The binaural speech separation method based on critical bands according to claim 1, characterized in that the framing with windowing in step 1.2) applies a Hamming window to each frame of the speech signal; the τ-th windowed frame can be expressed as:
x_L(τ, n) = w_H(n) x_L(τN + n), 0 ≤ n < N
x_R(τ, n) = w_H(n) x_R(τN + n), 0 ≤ n < N
where x_L(τ, n) and x_R(τ, n) denote the left- and right-ear signals of frame τ, and N is the number of samples per frame.
5. The binaural speech separation method based on critical bands according to claim 1, characterized in that the method of establishing the ITD model of azimuth θ in step 1.3) is as follows:
The ITD value of the τ-th frame is:
ITD(τ) = arg max_k Σ_{n=0}^{N-|k|-1} x_L(τ, n) x_R(τ, n + k),  -N+1 ≤ k ≤ N-1
The ITD(τ) values of all frames of the binaural white-noise signal at azimuth θ are averaged to obtain δ(θ), the trained ITD parameter of azimuth θ:
δ(θ) = Σ_τ ITD(τ) / frameNum
where frameNum denotes the total number of frames of the azimuth-θ binaural white-noise signal after framing.
This establishes the model between the azimuth θ and the trained ITD parameter.
6. The binaural speech separation method based on critical bands according to claim 1, characterized in that the method of establishing the IID model of azimuth θ in step 1.4) is as follows:
The IID value of the τ-th frame is:
IID(τ, ω) = 20 log (|X_L(τ, ω)| / |X_R(τ, ω)|)
where X_L(τ, ω) and X_R(τ, ω) are the frequency-domain representations, i.e. short-time Fourier transforms, of x_L(τ, n) and x_R(τ, n), respectively:
X(τ, ω) = Σ_{n=0}^{N-1} x(τ, n) e^{-jωn}
where x(τ, n) denotes the frame-τ signal (the transform is applied to the left-ear and right-ear signals separately), and ω denotes the angular-frequency vector, with range [0, 2π] at intervals of 2π/512;
The IID(τ, ω) values of all frames of the binaural white-noise signal at azimuth θ are averaged to obtain α(θ, ω), the trained IID parameter of azimuth θ:
α(θ, ω) = Σ_τ IID(τ, ω) / frameNum
where frameNum denotes the total number of frames of the azimuth-θ binaural white-noise signal after framing.
This establishes the model between the azimuth θ and the trained IID parameter.
7. The critical-band-based binaural speech separation method according to claim 1, wherein the sub-band division method in step 2.2) is as follows:
A frame-by-frame short-time Fourier transform is applied to the multi-frame signals obtained in step 2.1), transforming them into the time-frequency domain and yielding the framed time-frequency-domain binaural signals X_L(τ, ω) and X_R(τ, ω).
The frequency axis is then divided into sub-bands according to the critical-band division method:
where C denotes the number of critical bands, and ω_c_low and ω_c_high denote the lower and upper frequency limits of the c-th critical band, respectively.
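A sketch of such a sub-band division in NumPy. The patent does not list its exact band edges, so the standard Zwicker critical-band (Bark) edges, truncated at 8 kHz for an assumed 16 kHz sampling rate and 512-point FFT, are used here as an assumption:

```python
import numpy as np

# Standard Zwicker critical-band (Bark) edges in Hz, truncated at 8 kHz.
# These exact values are an assumption; the patent only states that the
# frequency axis is partitioned into C critical bands.
BARK_EDGES_HZ = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 8000]

def subband_bins(fs=16000, nfft=512):
    """Map each critical band c to the FFT bins whose centre frequencies
    fall in [omega_c_low, omega_c_high)."""
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)  # bin centre frequencies in Hz
    bands = []
    for lo, hi in zip(BARK_EDGES_HZ[:-1], BARK_EDGES_HZ[1:]):
        bins = np.where((freqs >= lo) & (freqs < hi))[0]
        if bins.size:
            bands.append((lo, hi, bins))
    return bands

bands = subband_bins()
print(len(bands))  # C, the number of critical bands below the Nyquist frequency
```

Each time-frequency unit of X_L(τ, ω) and X_R(τ, ω) can then be assigned to the critical band whose bin range contains its frequency index.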
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710479139.2A CN107346664A (en) | 2017-06-22 | 2017-06-22 | A kind of ears speech separating method based on critical band |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107346664A true CN107346664A (en) | 2017-11-14 |
Family
ID=60253298
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710479139.2A Pending CN107346664A (en) | 2017-06-22 | 2017-06-22 | A kind of ears speech separating method based on critical band |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107346664A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105900457A (en) * | 2014-01-03 | 2016-08-24 | 杜比实验室特许公司 | Methods and systems for designing and applying numerically optimized binaural room impulse responses |
CN104464750A (en) * | 2014-10-24 | 2015-03-25 | 东南大学 | Voice separation method based on binaural sound source localization |
CN105575403A (en) * | 2015-12-25 | 2016-05-11 | 重庆邮电大学 | Cross-correlation sound source positioning method with combination of auditory masking and double-ear signal frames |
Non-Patent Citations (7)
Title |
---|
B.C.J. Moore: "An Introduction to the Psychology of Hearing", 30 December 1997 *
Roman, N. et al.: "Speech Segregation Based on Sound Localization", Journal of the Acoustical Society of America *
Liao Qipeng: "Research on Speech Dereverberation Based on the Gammatone Auditory Filter Bank and Complex-Cepstrum Blind Deconvolution", China Master's Theses Full-text Database (Information Science and Technology) *
Li Kun: "Measurement and Analysis of Interaural Intensity Difference Perception Characteristics", China Master's Theses Full-text Database (Information Science and Technology) *
Li Xiaoxiong: "Research on Speech Separation Based on Binaural Spatial Information", China Master's Theses Full-text Database (Information Science and Technology) *
Wang Jing: "Mixed-Speech Separation Based on Computational Auditory Scene Analysis", China Master's Theses Full-text Database (Information Science and Technology) *
Xie Zhiwen: "Research on the Psychoacoustic Masking Effect", China Doctoral Dissertations Full-text Database (Information Science and Technology) *
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107942290B (en) * | 2017-11-16 | 2019-10-11 | 东南大学 | Binaural sound sources localization method based on BP neural network |
CN107942290A (en) * | 2017-11-16 | 2018-04-20 | 东南大学 | Binaural sound sources localization method based on BP neural network |
CN108091345A (en) * | 2017-12-27 | 2018-05-29 | 东南大学 | A kind of ears speech separating method based on support vector machines |
CN108091345B (en) * | 2017-12-27 | 2020-11-20 | 东南大学 | Double-ear voice separation method based on support vector machine |
CN108647556A (en) * | 2018-03-02 | 2018-10-12 | 重庆邮电大学 | Sound localization method based on frequency dividing and deep neural network |
CN108615536A (en) * | 2018-04-09 | 2018-10-02 | 华南理工大学 | Time-frequency combination feature musical instrument assessment of acoustics system and method based on microphone array |
CN110446142B (en) * | 2018-05-03 | 2021-10-15 | 阿里巴巴集团控股有限公司 | Audio information processing method, server, device, storage medium and client |
CN110446142A (en) * | 2018-05-03 | 2019-11-12 | 阿里巴巴集团控股有限公司 | Audio-frequency information processing method, server, equipment, storage medium and client |
CN110364175B (en) * | 2019-08-20 | 2022-02-18 | 北京凌声芯语音科技有限公司 | Voice enhancement method and system and communication equipment |
CN110364175A (en) * | 2019-08-20 | 2019-10-22 | 北京凌声芯语音科技有限公司 | Sound enhancement method and system, verbal system |
CN112731289A (en) * | 2020-12-10 | 2021-04-30 | 深港产学研基地(北京大学香港科技大学深圳研修院) | Binaural sound source positioning method and device based on weighted template matching |
CN112731289B (en) * | 2020-12-10 | 2024-05-07 | 深港产学研基地(北京大学香港科技大学深圳研修院) | Binaural sound source positioning method and device based on weighted template matching |
US11328702B1 (en) | 2021-04-25 | 2022-05-10 | Shenzhen Shokz Co., Ltd. | Acoustic devices |
WO2022226696A1 (en) * | 2021-04-25 | 2022-11-03 | 深圳市韶音科技有限公司 | Open earphone |
US11715451B2 (en) | 2021-04-25 | 2023-08-01 | Shenzhen Shokz Co., Ltd. | Acoustic devices |
CN113476041A (en) * | 2021-06-21 | 2021-10-08 | 苏州大学附属第一医院 | Speech perception capability test method and system for children using artificial cochlea |
CN113476041B (en) * | 2021-06-21 | 2023-09-19 | 苏州大学附属第一医院 | Speech perception capability test method and system for artificial cochlea using children |
CN113782047A (en) * | 2021-09-06 | 2021-12-10 | 云知声智能科技股份有限公司 | Voice separation method, device, equipment and storage medium |
CN113782047B (en) * | 2021-09-06 | 2024-03-08 | 云知声智能科技股份有限公司 | Voice separation method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107346664A (en) | A kind of ears speech separating method based on critical band | |
CN104464750B (en) | A kind of speech separating method based on binaural sound sources positioning | |
CN109830245B (en) | Multi-speaker voice separation method and system based on beam forming | |
CN106373589B (en) | A kind of ears mixing voice separation method based on iteration structure | |
CN109584903B (en) | Multi-user voice separation method based on deep learning | |
CN102565759B (en) | Binaural sound source localization method based on sub-band signal to noise ratio estimation | |
CN106504763A (en) | Based on blind source separating and the microphone array multiple target sound enhancement method of spectrum-subtraction | |
CN106782565A (en) | A kind of vocal print feature recognition methods and system | |
CN110728989B (en) | Binaural speech separation method based on long-time and short-time memory network L STM | |
CN102438189A (en) | Dual-channel acoustic signal-based sound source localization method | |
CN108091345B (en) | Double-ear voice separation method based on support vector machine | |
CN108520756B (en) | Method and device for separating speaker voice | |
Cai et al. | Multi-Channel Training for End-to-End Speaker Recognition Under Reverberant and Noisy Environment. | |
CN110619887A (en) | Multi-speaker voice separation method based on convolutional neural network | |
CN111986695A (en) | Non-overlapping sub-band division fast independent vector analysis voice blind separation method and system | |
CN103901400A (en) | Binaural sound source positioning method based on delay compensation and binaural coincidence | |
Wang et al. | Pseudo-determined blind source separation for ad-hoc microphone networks | |
Spille et al. | Combining binaural and cortical features for robust speech recognition | |
Talagala et al. | Binaural localization of speech sources in the median plane using cepstral HRTF extraction | |
CN112216301B (en) | Deep clustering voice separation method based on logarithmic magnitude spectrum and interaural phase difference | |
CN113345421B (en) | Multi-channel far-field target voice recognition method based on angle spectrum characteristics | |
CN114189781A (en) | Noise reduction method and system for double-microphone neural network noise reduction earphone | |
CN112731291A (en) | Binaural sound source positioning method and system for collaborative two-channel time-frequency mask estimation task learning | |
Guo et al. | Enhanced Neural Beamformer with Spatial Information for Target Speech Extraction | |
Meutzner et al. | Binaural signal processing for enhanced speech recognition robustness in complex listening environments |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20171114 |
|