CN107346664A - A kind of ears speech separating method based on critical band - Google Patents

A kind of ears speech separating method based on critical band

Info

Publication number
CN107346664A
CN107346664A (application CN201710479139.2A)
Authority
CN
China
Prior art keywords
signal
ears
critical band
orientation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710479139.2A
Other languages
Chinese (zh)
Inventor
谈雅文
汤彬
汤一彬
陈秉岩
高远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Campus of Hohai University
Original Assignee
Changzhou Campus of Hohai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Campus of Hohai University filed Critical Changzhou Campus of Hohai University
Priority to CN201710479139.2A priority Critical patent/CN107346664A/en
Publication of CN107346664A publication Critical patent/CN107346664A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)

Abstract

The invention discloses a speech separation method based on critical bands and binaural signals. Through data training and the azimuth information of the sound sources, the binaural signals are classified by sound source within each critical band, so that the data stream of each source is obtained; each separated source signal is then reconstructed, realizing speech separation. The invention is based on the sub-band processing mechanism of the human auditory system and combines the auditory masking effect of the human ear: according to the azimuth information of the different sources, the mixed speech is separated within each critical band. Localization and separation results under different noise and reverberation conditions show that the performance of critical-band-based binaural speech separation is effectively improved.

Description

A binaural speech separation method based on critical bands
Technical field
The present invention relates to the fields of sound source localization and speech separation, and in particular to a binaural speech separation method based on critical bands.
Background technology
Speech localization and separation techniques form the front end of a speech signal processing system, and their performance has a large influence on the whole system. Since the beginning of the digital communication era, speech processing technologies such as speech coding and decoding, speech localization, speech separation and speech enhancement have all developed rapidly; in the current Internet wave in particular, voice assistants have pushed speech signal processing to a new height.
The development of future multi-modal human-computer interaction, human-computer dialogue and speech recognition cannot proceed without research and development in speech signal processing; as the front end of a speech processing system, speech separation technology therefore directly determines the performance and effectiveness of the whole speech system.
Summary of the invention
Object of the invention: In order to overcome the deficiencies of the prior art, the present invention provides a binaural speech separation method based on critical bands. Using the sub-band processing mechanism of the human auditory system, combined with the auditory masking effect of the human ear, the method simulates the auditory characteristics of the ear: based on critical-band division, each frame signal is divided into different sub-bands so that an accurate mixing matrix is obtained and speech separation is performed, remedying the deficiencies of the prior art.
Technical scheme: A binaural speech separation method based on critical bands, characterized in that the method comprises the following steps:
1) Parameter training stage:
1.1) Training is performed using directional binaural white-noise signals. The binaural white-noise signals are binaural signals generated by convolving a monophonic white-noise signal with head-related impulse response (HRIR) data of known azimuth. The sound bearing angle θ is defined as the angle between the projection of the direction vector onto the horizontal plane and the median plane, in the range [-90°, 90°], at intervals of 5°.
1.2) The binaural white-noise signals of known azimuth information are pre-processed. The pre-processing includes amplitude normalization and framing with windowing, yielding the framed single-frame binaural signals.
Amplitude normalization method is:
xL=xL/maxvalue
xR=xR/maxvalue
Wherein xL and xR represent the left-ear and right-ear acoustic signals respectively; maxvalue = max(|xL|, |xR|) represents the maximum amplitude of the left-ear and right-ear signals.
For framing and windowing, a Hamming window is applied to the framed speech signal; the τ-th frame after windowing can be expressed as:
xL(τ, n)=wH(n)xL(τ N+n) 0≤n < N
xR(τ, n)=wH(n)xR(τ N+n) 0≤n < N
Wherein xL(τ, n) and xR(τ, n) represent the left-ear and right-ear signals of the τ-th frame respectively; N is the number of samples in one frame.
1.3) A cross-correlation operation is performed on the single-frame binaural speech signals obtained in step 1.2), and the interaural time difference (ITD) estimate of each frame is calculated from the cross-correlation function. The mean of all frame ITD estimates of the same azimuth is taken as the ITD training value of that azimuth, denoted δ(θ).
The method of establishing the ITD model of the azimuth angle θ is as follows:
The ITD value of the τ-th frame signal is:
ITD(τ) = argmax_k ( Σ_{n=0}^{N-|k|-1} x_L(τ, n)·x_R(τ, n+k) ), -N+1 ≤ k ≤ N-1
The ITD(τ) of all frames of the binaural white-noise signal of azimuth θ are averaged to obtain δ(θ), which is used as the training ITD parameter of azimuth θ:
δ(θ) = ( Σ_τ ITD(τ) ) / frameNum
Wherein frameNum represents the total number of frames after framing the binaural white-noise signal of azimuth θ.
In this way the model between the azimuth angle θ and the trained ITD parameter is established.
1.4) A short-time Fourier transform is applied to the single-frame binaural speech signals obtained in step 1.2), transforming them to the frequency domain, and the ratio of the magnitude spectra of the left-ear and right-ear signals at each frequency bin, i.e. the interaural intensity difference (IID) vector, is calculated. The mean of all frame IID estimates of the same azimuth is taken as the IID training value of that azimuth, denoted α(θ, ω), where ω represents the frequency of the Fourier transform.
The method of establishing the IID model of the azimuth angle θ is as follows:
The IID value of the τ-th frame signal is:
IID(τ, ω) = 20·log( |X_L(τ, ω)| / |X_R(τ, ω)| )
Wherein X_L(τ, ω) and X_R(τ, ω) are respectively the frequency-domain representations, i.e. short-time Fourier transforms, of x_L(τ, n) and x_R(τ, n):
X(τ, ω) = Σ_{n=0}^{N-1} x(τ, n)·e^(-jωn)
Wherein x(τ, n) represents the τ-th frame signal, and the Fourier transform is applied to the left-ear and right-ear signals separately; ω represents the angular-frequency vector, with range [0, 2π] at intervals of 2π/512.
The IID(τ, ω) of all frames of the binaural white-noise signal of azimuth θ are averaged to obtain α(θ, ω), which is used as the training IID parameter of azimuth θ:
α(θ, ω) = ( Σ_τ IID(τ, ω) ) / frameNum
Wherein frameNum represents the total number of frames after framing the binaural white-noise signal of azimuth θ.
In this way the model between the azimuth angle θ and the trained IID parameter is established.
2) Binaural mixed-speech signal separation stage based on critical bands and azimuth information:
2.1) The binaural mixed speech signal in the test process contains multiple sound sources, each sound source corresponding to a different azimuth. The binaural mixed speech signal is pre-processed, including amplitude normalization and framing with windowing;
2.2) A Fourier transform is applied to the framed binaural mixed signal; based on the frequency ranges of the critical bands, the frequency domain is divided into sub-bands, yielding the framed sub-band signals;
The method of sub-band division is as follows:
A short-time Fourier transform is applied frame by frame to the multi-frame signals obtained in step 2.1), transforming them to the time-frequency domain and yielding the framed time-frequency-domain binaural signals X_L(τ, ω) and X_R(τ, ω).
At the same time, according to the critical-band division, the frequency axis is divided into sub-bands:
Wherein C represents the number of critical bands, and ω_c_low and ω_c_high represent respectively the lower and upper frequency limits of the c-th critical band.
2.3) According to the number of sound sources and the azimuth information contained in the mixed signal, and the azimuth ITD and IID parameters established in steps 1.3) and 1.4), the sound sources are classified in every frame and every critical band obtained in step 2.2), based on the similarity of the left-ear and right-ear signals;
2.4) The critical-band classification results obtained in step 2.3) are multiplied with the framed time-frequency signals obtained in step 2.1), obtaining the time-frequency-domain signal corresponding to each sound source;
2.5) An inverse Fourier transform is applied to the time-frequency-domain signal of each sound source obtained in step 2.4), converting it to a time-domain signal; de-windowing is then performed and the separated speech of each sound source is synthesized.
Beneficial effects: Compared with existing frequency-based speech separation techniques, the present invention is based on the sub-band processing mechanism of the human auditory system and combines the auditory masking effect of the human ear. After the localization stage has accurately obtained the sound-source azimuths, the different sub-bands of each frame are separated, combining sound-source localization with critical-band separation. For the separation of multiple speakers, the separation performance metrics SNR (Source to Noise Ratio), SDR (Source to Distortion Ratio), SAR (Sources to Artifacts Ratio) and PESQ (Perceptual Evaluation of Speech Quality) are all effectively improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of the planar space of sound-source localization and speech separation of the present invention;
Fig. 2 is the system block diagram of the present invention.
Embodiment
The present invention is further described below in conjunction with the accompanying drawings.
The present invention first performs data training: the means of the interaural time difference ITD (Interaural Time Difference) and the interaural intensity difference IID (Interaural Intensity Difference) of each azimuth are used as the localization feature cues of the sound-source azimuth, and an azimuth mapping model is established. During actual sound-source localization, the final number of sources and their azimuths are estimated from the histogram of the azimuths of all frames of the binaural mixed signal. In the source separation stage, the binaural mixed signal is first divided into sub-bands based on the critical bands; combined with the azimuth information obtained from localization, the frequency-domain signal is classified within each critical band, and finally the time-frequency points of each sound source are transformed back to the time domain by inverse Fourier transform.
Fig. 1 is a schematic diagram of the planar space of sound-source localization and speech separation of the present invention, taking 2 sound sources as an example. The 2 microphones are located at the two ears. In the present invention, the spatial position of a sound source is represented by its azimuth angle θ, -180° ≤ θ ≤ 180°, defined as the angle between the projection of the direction vector onto the horizontal plane and the median plane. On the horizontal plane, θ = 0° represents the front; going clockwise, θ = 90°, 180° and -90° represent the right, the rear and the left respectively. Fig. 1 takes 2 sound sources as an example (the sound sources of this embodiment are the voices of speakers), with azimuths of -30° and 30° respectively.
Fig. 2 is the system block diagram of the present invention. The method of the invention comprises model training, time-frequency transformation, critical-band division and sub-band sound-source classification. The embodiment of the technical solution of the present invention is described in detail below in conjunction with the accompanying drawings:
Step 1) Data training:
1.1) Fig. 2 gives the overall system diagram. In the training stage, the head-related transfer function HRTF (Head Related Transfer Function), whose time-domain counterpart is the head-related impulse response HRIR (Head Related Impulse Response), is used to generate binaural signals of a particular azimuth. The present invention uses the HRIR data measured by the MIT Media Lab; the HRIR data at θ = -90° to 90° (5° intervals) are convolved with white noise to generate binaural signals of the corresponding azimuths.
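As an illustration of how such directional training signals can be produced, the following Python sketch convolves a monophonic white-noise signal with a left/right HRIR pair; the arrays hrir_l and hrir_r are hypothetical placeholders standing in for the measured HRIR data of one azimuth.

```python
import numpy as np

def make_binaural_white_noise(hrir_left, hrir_right, duration_s=1.0, fs=16000, seed=0):
    """Convolve mono white noise with a left/right HRIR pair to obtain the
    binaural white-noise signal of one azimuth (step 1.1)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(int(duration_s * fs))   # monophonic white noise
    x_left = np.convolve(noise, hrir_left)              # left-ear signal
    x_right = np.convolve(noise, hrir_right)            # right-ear signal
    return x_left, x_right

# Hypothetical placeholder HRIRs; in practice the measured HRIR data of azimuth theta are used.
hrir_l = np.array([0.0, 1.0, 0.5, 0.1])
hrir_r = np.array([1.0, 0.5, 0.1, 0.0])
xL, xR = make_binaural_white_noise(hrir_l, hrir_r)
```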
1.2) The binaural white-noise signal of azimuth θ is pre-processed. The pre-processing of this method includes amplitude normalization, framing and windowing.
Amplitude normalization method is:
xL=xL/maxvalue
xR=xR/maxvalue
Wherein xL and xR represent the left-ear and right-ear acoustic signals respectively; maxvalue = max(|xL|, |xR|) represents the maximum amplitude of the left-ear and right-ear signals.
In this embodiment a Hamming window is applied to the framed speech signal; the τ-th frame after windowing can be expressed as:
xL(τ, n)=wH(n)xL(τ N+n) 0≤n < N
xR(τ, n)=wH(n)xR(τ N+n) 0≤n < N
Wherein xL(τ, n) and xR(τ, n) represent the left-ear and right-ear signals of the τ-th frame respectively; N is the number of samples in one frame. In this embodiment the speech sampling rate is 16 kHz, the frame length is 32 ms and the frame shift is 16 ms, so N = 512; wH(n) is the Hamming window function, whose expression is:
wH(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n < N
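A minimal sketch of the pre-processing of step 1.2), using the frame length and frame shift stated above (32 ms and 16 ms at 16 kHz); function and variable names are illustrative only.

```python
import numpy as np

def normalize_and_frame(x_left, x_right, frame_len=512, hop=256):
    """Amplitude normalization, framing and Hamming windowing (step 1.2).
    frame_len = 512 and hop = 256 correspond to 32 ms / 16 ms at 16 kHz."""
    # Joint amplitude normalization by the larger of the two channel maxima.
    maxvalue = max(np.max(np.abs(x_left)), np.max(np.abs(x_right)))
    x_left, x_right = x_left / maxvalue, x_right / maxvalue

    w = np.hamming(frame_len)                            # Hamming window w_H(n)
    n_frames = (len(x_left) - frame_len) // hop + 1
    frames_L = np.stack([w * x_left[t * hop : t * hop + frame_len] for t in range(n_frames)])
    frames_R = np.stack([w * x_right[t * hop : t * hop + frame_len] for t in range(n_frames)])
    return frames_L, frames_R                            # shape: (n_frames, frame_len)
```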
1.3) The ITD model of the azimuth angle θ is established.
The ITD value of the τ-th frame signal is:
ITD(τ) = argmax_k ( Σ_{n=0}^{N-|k|-1} x_L(τ, n)·x_R(τ, n+k) ), -N+1 ≤ k ≤ N-1
The ITD(τ) of all frames of the binaural white-noise signal of azimuth θ are averaged to obtain δ(θ), which is used as the training ITD parameter of azimuth θ:
δ(θ) = ( Σ_τ ITD(τ) ) / frameNum
Wherein frameNum represents the total number of frames after framing the binaural white-noise signal of azimuth θ.
In this way the model between the azimuth angle θ and the trained ITD parameter is established.
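A sketch of the ITD training of step 1.3): the lag maximizing the cross-correlation between the windowed left- and right-ear frames is taken as the frame ITD, and the frame values of one azimuth are averaged. The direct search over lags mirrors the formula above and is kept for clarity.

```python
import numpy as np

def frame_itd(frame_left, frame_right):
    """ITD(tau) = argmax_k sum_n x_L(tau, n) * x_R(tau, n + k),  -N+1 <= k <= N-1."""
    N = len(frame_left)
    best_k, best_val = 0, -np.inf
    for k in range(-N + 1, N):
        if k >= 0:
            val = np.dot(frame_left[:N - k], frame_right[k:])
        else:
            val = np.dot(frame_left[-k:], frame_right[:N + k])
        if val > best_val:
            best_k, best_val = k, val
    return best_k                                  # ITD in samples

def train_itd(frames_L, frames_R):
    """delta(theta): mean ITD over all frames of one training azimuth."""
    return np.mean([frame_itd(l, r) for l, r in zip(frames_L, frames_R)])
```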
1.4) The IID model of the azimuth angle θ is established:
The IID value of the τ-th frame signal is:
IID(τ, ω) = 20·log( |X_L(τ, ω)| / |X_R(τ, ω)| )
Wherein X_L(τ, ω) and X_R(τ, ω) are respectively the frequency-domain representations, i.e. short-time Fourier transforms, of x_L(τ, n) and x_R(τ, n):
X(τ, ω) = Σ_{n=0}^{N-1} x(τ, n)·e^(-jωn)
Wherein x(τ, n) represents the τ-th frame signal, and the Fourier transform is applied to the left-ear and right-ear signals separately; ω represents the angular-frequency vector, with range [0, 2π] at intervals of 2π/512.
The IID(τ, ω) of all frames of the binaural white-noise signal of azimuth θ are averaged to obtain α(θ, ω), which is used as the training IID parameter of azimuth θ:
α(θ, ω) = ( Σ_τ IID(τ, ω) ) / frameNum
Wherein frameNum represents the total number of frames after framing the binaural white-noise signal of azimuth θ.
In this way the model between the azimuth angle θ and the trained IID parameter is established.
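A sketch of the IID training of step 1.4). A one-sided spectrum is used, which for real-valued frames carries the same information as the 512-point transform described above; the small eps is an implementation detail added here for numerical safety, not part of the patent text.

```python
import numpy as np

def frame_iid(frame_left, frame_right, n_fft=512, eps=1e-12):
    """IID(tau, omega) = 20 * log10(|X_L(tau, omega)| / |X_R(tau, omega)|) per bin."""
    XL = np.fft.rfft(frame_left, n_fft)
    XR = np.fft.rfft(frame_right, n_fft)
    return 20.0 * np.log10((np.abs(XL) + eps) / (np.abs(XR) + eps))

def train_iid(frames_L, frames_R, n_fft=512):
    """alpha(theta, omega): mean IID over all frames of one training azimuth."""
    return np.mean([frame_iid(l, r, n_fft) for l, r in zip(frames_L, frames_R)], axis=0)
```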
Step 2) Binaural mixed-speech signal separation stage based on critical bands and azimuth information.
2.1) Corresponding to the pre-processing module in Fig. 1, the binaural mixed signal containing multiple sound sources of different azimuths is pre-processed in the same way as in step 1.2) above, including amplitude normalization, framing and windowing; the frame length is taken as 32 ms, the frame shift as 16 ms, and a Hamming window is applied.
2.2) Corresponding to the frequency-domain transform in Fig. 1, a short-time Fourier transform is applied frame by frame to the multi-frame signals obtained in step 2.1), transforming them to the time-frequency domain and yielding the framed time-frequency-domain binaural signals X_L(τ, ω) and X_R(τ, ω).
At the same time, according to the critical-band division, the frequency axis is divided into sub-bands:
Wherein C represents the number of critical bands, and ω_c_low and ω_c_high represent respectively the lower and upper frequency limits of the c-th critical band.
The division ranges of the critical bands, i.e. the lower frequency, upper frequency and bandwidth of each critical band, are shown in the following table:
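The band limits themselves are not reproduced in this text. The sketch below therefore assumes the commonly cited Bark-scale edges, clipped at the 8 kHz Nyquist frequency of a 16 kHz signal, and maps every FFT bin to the critical band containing it, as required for the sub-band division of step 2.2).

```python
import numpy as np

# Commonly cited Bark critical-band edges in Hz up to 8 kHz (an assumption; the
# patent's own table of band limits is not reproduced in this text).
BARK_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
                 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700]

def band_of_bins(n_fft=512, fs=16000, edges=BARK_EDGES_HZ):
    """Map each FFT bin (0 .. n_fft//2) to the index of the critical band containing it."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)       # bin centre frequencies in Hz
    edges = np.asarray(list(edges) + [fs / 2.0])     # close the last band at Nyquist
    idx = np.digitize(freqs, edges[1:])              # band index c for every bin
    return np.minimum(idx, len(edges) - 2)           # clamp the Nyquist bin into the last band
```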
2.3) Corresponding to the azimuth-based sub-band classification in Fig. 1. Here we assume that the number of sound sources contained in the binaural mixed signal and their corresponding azimuths are known. At present there are many algorithms for estimating the number of sources and their azimuth information from binaural signals; sound-source localization is not described here, and likewise no restriction is placed on the localization algorithm. We only discuss how, after sound-source localization, separation is performed according to the azimuth information of the different sources.
According to the masking effect of the human auditory system, in a given critical band of a given frame usually only one source signal is dominant. Therefore, in azimuth-based speech separation, the interaural time difference ITD and the interaural intensity difference IID are used as spatial cues; the mask function is calculated from the maximum similarity between the two channels, and the sound sources are classified within each critical band. Here we assume that the binaural mixed signal contains L sound sources, the azimuth of each source being θl (1 ≤ l ≤ L):
Wherein X_L(τ, ω) and X_R(τ, ω) are respectively the left-ear and right-ear frequency-domain signals of the τ-th frame, and ω_c represents the spectral range of the c-th critical band; θl is the azimuth corresponding to the l-th source; α(θl, ω) is the IID parameter of the l-th source at azimuth θl and frequency ω, and δ(θl) is the ITD parameter of the azimuth of the l-th source.
J(τ, c) in effect classifies the sound sources in each critical band using the azimuth information.
Then, a binary mask is assigned to the critical band corresponding to each sound source:
Thus M_l(τ, ω) represents the binary mask of the l-th sound source in the c-th critical band.
2.4) According to the binary masks, the binaural signal of every frame and every frequency is classified, and the time-frequency signal corresponding to the l-th sound source is obtained:
Wherein the result represents the frequency-domain data of the τ-th frame of the l-th sound source.
Here the mask is multiplied with the left-ear signal to obtain the time-frequency data of each source; the right-ear signal could in fact equally be used to obtain the time-frequency data of each source.
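A sketch of the per-band classification and masking of steps 2.3) and 2.4). The patent's exact expression for J(τ, c) is not reproduced in this text, so the matching criterion below, which compares the observed right-ear spectrum with the right-ear spectrum predicted from the left-ear spectrum using the trained IID and ITD of each candidate azimuth, is an assumption.

```python
import numpy as np

def classify_and_mask(XL, XR, band_idx, alphas, deltas, n_fft=512):
    """Assign each critical band of each frame to one source and build binary masks.
    XL, XR: (n_frames, n_bins) one-sided STFTs of the left/right mixture;
    band_idx: critical-band index of each bin; alphas[l]: trained IID curve in dB
    (one value per bin) of source l; deltas[l]: trained ITD in samples of source l."""
    n_frames, n_bins = XL.shape
    n_src = len(alphas)
    omega = 2.0 * np.pi * np.arange(n_bins) / n_fft    # rad/sample for each bin
    masks = np.zeros((n_src, n_frames, n_bins))

    for tau in range(n_frames):
        for c in np.unique(band_idx):
            bins = np.where(band_idx == c)[0]
            errs = []
            for l in range(n_src):
                # Predict the right-ear spectrum from the left-ear spectrum using
                # the trained IID (level) and ITD (delay) of candidate source l.
                pred_R = (XL[tau, bins]
                          * 10.0 ** (-alphas[l][bins] / 20.0)
                          * np.exp(-1j * omega[bins] * deltas[l]))
                errs.append(np.sum(np.abs(XR[tau, bins] - pred_R) ** 2))
            masks[int(np.argmin(errs)), tau, bins] = 1.0   # whole band goes to the best source
    return masks                                           # binary masks M_l(tau, omega)
```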
2.5) Corresponding to the time-frequency-domain inverse transform in Fig. 1, a short-time inverse Fourier transform is applied to the separated frequency-domain signal of the l-th source, yielding the τ-th frame time-domain signal of source l.
After conversion to the time domain, de-windowing is performed; the τ-th frame signal after de-windowing can be expressed as:
Wherein wH(m) is the Hamming window described above.
The de-windowed frames are overlap-added, so that the l-th source signal s_l after separation of the mixed signal is obtained, thereby realizing the separation of sound-source signals of different azimuths.
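A sketch of the reconstruction of step 2.5): the binary mask of one source is applied to the left-ear STFT, each frame is inverse-transformed, and the frames are overlap-added. Window compensation is done here with a standard summed-squared-window normalization, an implementation choice standing in for the de-windowing described above.

```python
import numpy as np

def reconstruct_source(mask_l, XL, frame_len=512, hop=256):
    """Masked inverse STFT and overlap-add for one source (step 2.5).
    mask_l, XL: (n_frames, n_bins) binary mask and one-sided left-ear STFT."""
    w = np.hamming(frame_len)
    n_frames = XL.shape[0]
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    norm = np.zeros_like(out)
    for tau in range(n_frames):
        frame = np.fft.irfft(mask_l[tau] * XL[tau], frame_len)  # masked frame back to time
        out[tau * hop : tau * hop + frame_len] += w * frame      # windowed overlap-add
        norm[tau * hop : tau * hop + frame_len] += w ** 2
    return out / np.maximum(norm, 1e-8)                          # separated source s_l
```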
The above is only a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications may also be made without departing from the principles of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (7)

1. A binaural speech separation method based on critical bands, characterized in that the method comprises the following steps:
1) Parameter training stage:
1.1) Training is performed using directional binaural white-noise signals; the binaural white-noise signals are binaural signals generated by convolving a monophonic white-noise signal with head-related impulse response (HRIR) data of known azimuth; the sound bearing angle θ of the binaural white-noise signal is defined as the angle between the projection of the direction vector onto the horizontal plane and the median plane, in the range [-90°, 90°];
1.2) The binaural white-noise signals of known azimuth information are pre-processed; the pre-processing includes amplitude normalization and framing with windowing, yielding the framed single-frame binaural signals;
1.3) A cross-correlation operation is performed on the single-frame binaural speech signals obtained in step 1.2); the interaural time difference (ITD) estimate of each frame is calculated from the cross-correlation function; the mean of all frame ITD estimates of the same azimuth is used as the ITD training value of that azimuth, and the ITD model of the azimuth angle θ is established, denoted δ(θ);
1.4) A short-time Fourier transform is applied to the single-frame binaural speech signals obtained in step 1.2), transforming them to the frequency domain; the ratio of the magnitude spectra of the left-ear and right-ear signals at each frequency bin, i.e. the interaural intensity difference (IID) vector, is calculated; the mean of all frame IID estimates of the same azimuth is used as the IID training value of that azimuth, and the IID model of the azimuth angle θ is established, denoted α(θ, ω), where ω represents the frequency of the Fourier transform;
2) Binaural mixed-speech signal separation stage based on critical bands and azimuth information:
2.1) The binaural mixed speech signal in the test process contains multiple sound sources, each sound source corresponding to a different azimuth; the binaural mixed speech signal is pre-processed by the same pre-processing method as in step 1.2), including amplitude normalization and framing with windowing;
2.2) A Fourier transform is applied to the framed binaural mixed signal; based on the frequency ranges of the critical bands, the frequency domain is divided into sub-bands, yielding the framed sub-band signals;
2.3) According to the number of sound sources and the azimuth information contained in the mixed signal, and the azimuth ITD and IID parameters established in steps 1.3) and 1.4), the sound sources are classified in every frame and every critical band obtained in step 2.2), based on the similarity of the left-ear and right-ear signals;
2.4) The critical-band classification results obtained in step 2.3) are multiplied with the framed time-frequency signals obtained in step 2.1), obtaining the time-frequency-domain signal corresponding to each sound source;
2.5) An inverse Fourier transform is applied to the time-frequency-domain signal of each sound source obtained in step 2.4), converting it to a time-domain signal; de-windowing is performed and the separated speech of each sound source is synthesized.
2. The binaural speech separation method based on critical bands according to claim 1, characterized in that the sound bearing angle θ in step 1.1) is taken at intervals of 5°.
3. The binaural speech separation method based on critical bands according to claim 1, characterized in that the amplitude normalization method in step 1.2) is:
xL = xL/maxvalue
xR = xR/maxvalue
Wherein xL and xR represent the left-ear and right-ear acoustic signals respectively; maxvalue = max(|xL|, |xR|) represents the maximum amplitude of the left-ear and right-ear signals.
4. The binaural speech separation method based on critical bands according to claim 1, characterized in that in the framing and windowing of step 1.2) a Hamming window is applied to the framed speech signal, and the τ-th frame after windowing can be expressed as:
xL(τ, n) = wH(n)xL(τN + n), 0 ≤ n < N
xR(τ, n) = wH(n)xR(τN + n), 0 ≤ n < N
Wherein xL(τ, n) and xR(τ, n) represent the left-ear and right-ear signals of the τ-th frame respectively; N is the number of samples in one frame.
5. The binaural speech separation method based on critical bands according to claim 1, characterized in that the method of establishing the ITD model of the azimuth angle θ in step 1.3) is as follows:
The ITD value of the τ-th frame signal is:
ITD(τ) = argmax_k ( Σ_{n=0}^{N-|k|-1} x_L(τ, n)·x_R(τ, n+k) ), -N+1 ≤ k ≤ N-1
The ITD(τ) of all frames of the binaural white-noise signal of azimuth θ are averaged to obtain δ(θ), which is used as the training ITD parameter of azimuth θ:
δ(θ) = ( Σ_τ ITD(τ) ) / frameNum
Wherein frameNum represents the total number of frames after framing the binaural white-noise signal of azimuth θ.
In this way the model between the azimuth angle θ and the trained ITD parameter is established.
6. The binaural speech separation method based on critical bands according to claim 1, characterized in that the method of establishing the IID model of the azimuth angle θ in step 1.4) is as follows:
The IID value of the τ-th frame signal is:
IID(τ, ω) = 20·log( |X_L(τ, ω)| / |X_R(τ, ω)| )
Wherein X_L(τ, ω) and X_R(τ, ω) are respectively the frequency-domain representations, i.e. short-time Fourier transforms, of x_L(τ, n) and x_R(τ, n):
X(τ, ω) = Σ_{n=0}^{N-1} x(τ, n)·e^(-jωn)
Wherein x(τ, n) represents the τ-th frame signal, and the Fourier transform is applied to the left-ear and right-ear signals separately; ω represents the angular-frequency vector, with range [0, 2π] at intervals of 2π/512;
The IID(τ, ω) of all frames of the binaural white-noise signal of azimuth θ are averaged to obtain α(θ, ω), which is used as the training IID parameter of azimuth θ:
α(θ, ω) = ( Σ_τ IID(τ, ω) ) / frameNum
Wherein frameNum represents the total number of frames after framing the binaural white-noise signal of azimuth θ.
In this way the model between the azimuth angle θ and the trained IID parameter is established.
7. The binaural speech separation method based on critical bands according to claim 1, characterized in that the method of sub-band division in step 2.2) is as follows:
A short-time Fourier transform is applied frame by frame to the multi-frame signals obtained in step 2.1), transforming them to the time-frequency domain and yielding the framed time-frequency-domain binaural signals X_L(τ, ω) and X_R(τ, ω).
At the same time, according to the critical-band division, the frequency axis is divided into sub-bands:
Wherein C represents the number of critical bands, and ω_c_low and ω_c_high represent respectively the lower and upper frequency limits of the c-th critical band.
CN201710479139.2A 2017-06-22 2017-06-22 A kind of ears speech separating method based on critical band Pending CN107346664A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710479139.2A CN107346664A (en) 2017-06-22 2017-06-22 A kind of ears speech separating method based on critical band

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710479139.2A CN107346664A (en) 2017-06-22 2017-06-22 A kind of ears speech separating method based on critical band

Publications (1)

Publication Number Publication Date
CN107346664A true CN107346664A (en) 2017-11-14

Family

ID=60253298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710479139.2A Pending CN107346664A (en) 2017-06-22 2017-06-22 A kind of ears speech separating method based on critical band

Country Status (1)

Country Link
CN (1) CN107346664A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107942290A (en) * 2017-11-16 2018-04-20 东南大学 Binaural sound sources localization method based on BP neural network
CN108091345A (en) * 2017-12-27 2018-05-29 东南大学 A kind of ears speech separating method based on support vector machines
CN108615536A (en) * 2018-04-09 2018-10-02 华南理工大学 Time-frequency combination feature musical instrument assessment of acoustics system and method based on microphone array
CN108647556A (en) * 2018-03-02 2018-10-12 重庆邮电大学 Sound localization method based on frequency dividing and deep neural network
CN110364175A (en) * 2019-08-20 2019-10-22 北京凌声芯语音科技有限公司 Sound enhancement method and system, verbal system
CN110446142A (en) * 2018-05-03 2019-11-12 阿里巴巴集团控股有限公司 Audio-frequency information processing method, server, equipment, storage medium and client
CN112731289A (en) * 2020-12-10 2021-04-30 深港产学研基地(北京大学香港科技大学深圳研修院) Binaural sound source positioning method and device based on weighted template matching
CN113476041A (en) * 2021-06-21 2021-10-08 苏州大学附属第一医院 Speech perception capability test method and system for children using artificial cochlea
CN113782047A (en) * 2021-09-06 2021-12-10 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium
US11328702B1 (en) 2021-04-25 2022-05-10 Shenzhen Shokz Co., Ltd. Acoustic devices
WO2022226696A1 (en) * 2021-04-25 2022-11-03 深圳市韶音科技有限公司 Open earphone

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104464750A (en) * 2014-10-24 2015-03-25 东南大学 Voice separation method based on binaural sound source localization
CN105575403A (en) * 2015-12-25 2016-05-11 重庆邮电大学 Cross-correlation sound source positioning method with combination of auditory masking and double-ear signal frames
CN105900457A (en) * 2014-01-03 2016-08-24 杜比实验室特许公司 Methods and systems for designing and applying numerically optimized binaural room impulse responses

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105900457A (en) * 2014-01-03 2016-08-24 杜比实验室特许公司 Methods and systems for designing and applying numerically optimized binaural room impulse responses
CN104464750A (en) * 2014-10-24 2015-03-25 东南大学 Voice separation method based on binaural sound source localization
CN105575403A (en) * 2015-12-25 2016-05-11 重庆邮电大学 Cross-correlation sound source positioning method with combination of auditory masking and double-ear signal frames

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
B.C.J.MOORE: "《An Introduction to Psychology of Hearing》", 30 December 1997 *
ROMAN N.等: ""Speech Segregation Based on Sound Localization"", 《JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA》 *
廖启鹏: ""基于Gammatone听觉滤波器组和复倒谱盲解卷积的语音去混响研究"", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 *
李坤: ""双耳强度差感知特性测量与分析"", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
李枭雄: ""基于双耳空间信息的语音分离研究"", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
王菁: ""基于计算听觉场景分析的混合语音分离"", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 *
谢志文: ""心理声学掩蔽效应的研究"", 《中国博士学位论文全文数据库(信息科技辑)》 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107942290B (en) * 2017-11-16 2019-10-11 东南大学 Binaural sound sources localization method based on BP neural network
CN107942290A (en) * 2017-11-16 2018-04-20 东南大学 Binaural sound sources localization method based on BP neural network
CN108091345A (en) * 2017-12-27 2018-05-29 东南大学 A kind of ears speech separating method based on support vector machines
CN108091345B (en) * 2017-12-27 2020-11-20 东南大学 Double-ear voice separation method based on support vector machine
CN108647556A (en) * 2018-03-02 2018-10-12 重庆邮电大学 Sound localization method based on frequency dividing and deep neural network
CN108615536A (en) * 2018-04-09 2018-10-02 华南理工大学 Time-frequency combination feature musical instrument assessment of acoustics system and method based on microphone array
CN110446142B (en) * 2018-05-03 2021-10-15 阿里巴巴集团控股有限公司 Audio information processing method, server, device, storage medium and client
CN110446142A (en) * 2018-05-03 2019-11-12 阿里巴巴集团控股有限公司 Audio-frequency information processing method, server, equipment, storage medium and client
CN110364175B (en) * 2019-08-20 2022-02-18 北京凌声芯语音科技有限公司 Voice enhancement method and system and communication equipment
CN110364175A (en) * 2019-08-20 2019-10-22 北京凌声芯语音科技有限公司 Sound enhancement method and system, verbal system
CN112731289A (en) * 2020-12-10 2021-04-30 深港产学研基地(北京大学香港科技大学深圳研修院) Binaural sound source positioning method and device based on weighted template matching
CN112731289B (en) * 2020-12-10 2024-05-07 深港产学研基地(北京大学香港科技大学深圳研修院) Binaural sound source positioning method and device based on weighted template matching
US11328702B1 (en) 2021-04-25 2022-05-10 Shenzhen Shokz Co., Ltd. Acoustic devices
WO2022226696A1 (en) * 2021-04-25 2022-11-03 深圳市韶音科技有限公司 Open earphone
US11715451B2 (en) 2021-04-25 2023-08-01 Shenzhen Shokz Co., Ltd. Acoustic devices
CN113476041A (en) * 2021-06-21 2021-10-08 苏州大学附属第一医院 Speech perception capability test method and system for children using artificial cochlea
CN113476041B (en) * 2021-06-21 2023-09-19 苏州大学附属第一医院 Speech perception capability test method and system for artificial cochlea using children
CN113782047A (en) * 2021-09-06 2021-12-10 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium
CN113782047B (en) * 2021-09-06 2024-03-08 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107346664A (en) A kind of ears speech separating method based on critical band
CN104464750B (en) A kind of speech separating method based on binaural sound sources positioning
CN109830245B (en) Multi-speaker voice separation method and system based on beam forming
CN106373589B (en) A kind of ears mixing voice separation method based on iteration structure
CN109584903B (en) Multi-user voice separation method based on deep learning
CN102565759B (en) Binaural sound source localization method based on sub-band signal to noise ratio estimation
CN106504763A (en) Based on blind source separating and the microphone array multiple target sound enhancement method of spectrum-subtraction
CN106782565A (en) A kind of vocal print feature recognition methods and system
CN110728989B Binaural speech separation method based on long-time and short-time memory network LSTM
CN102438189A (en) Dual-channel acoustic signal-based sound source localization method
CN108091345B (en) Double-ear voice separation method based on support vector machine
CN108520756B (en) Method and device for separating speaker voice
Cai et al. Multi-Channel Training for End-to-End Speaker Recognition Under Reverberant and Noisy Environment.
CN110619887A (en) Multi-speaker voice separation method based on convolutional neural network
CN111986695A (en) Non-overlapping sub-band division fast independent vector analysis voice blind separation method and system
CN103901400A (en) Binaural sound source positioning method based on delay compensation and binaural coincidence
Wang et al. Pseudo-determined blind source separation for ad-hoc microphone networks
Spille et al. Combining binaural and cortical features for robust speech recognition
Talagala et al. Binaural localization of speech sources in the median plane using cepstral HRTF extraction
CN112216301B (en) Deep clustering voice separation method based on logarithmic magnitude spectrum and interaural phase difference
CN113345421B (en) Multi-channel far-field target voice recognition method based on angle spectrum characteristics
CN114189781A (en) Noise reduction method and system for double-microphone neural network noise reduction earphone
CN112731291A (en) Binaural sound source positioning method and system for collaborative two-channel time-frequency mask estimation task learning
Guo et al. Enhanced Neural Beamformer with Spatial Information for Target Speech Extraction
Meutzner et al. Binaural signal processing for enhanced speech recognition robustness in complex listening environments

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171114

RJ01 Rejection of invention patent application after publication