CN107346664A - A kind of ears speech separating method based on critical band - Google Patents
- Publication number: CN107346664A
- Application number: CN201710479139.2A
- Authority: CN (China)
- Prior art date: 2017-06-22
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
Abstract
The invention discloses a speech separation method based on critical bands and binaural signals. Using trained parameters and the azimuth information of the sound sources, the binaural signals are classified by source within each critical band, yielding a data stream for each source; each separated source signal is then reconstructed, achieving speech separation. The invention builds on the scale-reduction processing mechanism of the human auditory system combined with the auditory masking effect of the ear, separating the mixed speech within each critical band according to the azimuth information of the different sources. Localization and separation results under different noise and reverberation conditions show that binaural speech separation based on critical bands achieves an effective performance improvement.
Description
Technical field
The present invention relates to the fields of sound-source localization and speech separation, and in particular to a binaural speech separation method based on critical bands.
Background technology
Speech localization and separation techniques form the front end of a speech signal processing system, and their performance strongly affects the whole system. Since the start of the digital communication era, speech processing techniques such as speech coding, speech localization, speech separation, and speech enhancement have all developed rapidly; in the current internet wave in particular, voice assistants have pushed speech signal processing to a new height.
The development of multi-modal human-computer interaction, human-machine dialogue, and speech recognition is inseparable from research and development in speech signal processing. As the front end of a speech processing system, speech separation technology therefore directly determines the performance and effectiveness of the whole speech system.
The content of the invention
Object of the invention: In order to overcome the deficiencies of the prior art, the present invention provides a binaural speech separation method based on critical bands. It uses the scale-reduction processing mechanism of the human auditory system, combined with the auditory masking effect of the ear, to simulate auditory characteristics: based on a critical-band division, each frame of the signal is divided into sub-bands so that an accurate mixing matrix is obtained and speech separation is performed, remedying the deficiencies of the prior art.
Technical scheme: A binaural speech separation method based on critical bands, characterized in that the method comprises the following steps:
1) the parameter training stage:
1.1) Training is performed using directional binaural white-noise signals. Each binaural white-noise signal is the binaural signal generated by convolving head-related impulse response (HRIR) data with a monaural white-noise signal of known azimuth. The sound-source azimuth θ is defined as the angle between the projection of the direction vector onto the horizontal plane and the median vertical plane, in the range [-90°, 90°] at intervals of 5°;
1.2) The binaural white-noise signals of known azimuth are preprocessed. The preprocessing comprises amplitude normalization and framing with windowing, yielding the single-frame binaural signals after framing;
The amplitude normalization method is:
x_L = x_L / maxvalue
x_R = x_R / maxvalue
where x_L and x_R denote the left-ear and right-ear signals, respectively, and maxvalue = max(|x_L|, |x_R|) is the maximum amplitude over the two channels.
Framing with windowing applies a Hamming window to each frame of the speech signal. The τ-th windowed frame can be expressed as:
x_L(τ, n) = w_H(n) x_L(τN + n), 0 ≤ n < N
x_R(τ, n) = w_H(n) x_R(τN + n), 0 ≤ n < N
where x_L(τ, n) and x_R(τ, n) denote the left- and right-ear signals of frame τ, and N is the number of samples per frame.
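As a concrete illustration, the preprocessing of step 1.2) can be sketched in Python/NumPy as follows. This is a minimal sketch: the function name `preprocess` is an assumption, and while the formula above writes x(τN + n) (non-overlapping frames), the embodiment later specifies a 16 ms frame shift at a 32 ms frame length, so a `hop` parameter covers both cases.

```python
import numpy as np

def preprocess(xL, xR, N=512, hop=256):
    """Step 1.2): amplitude normalization followed by Hamming-windowed
    framing. N = 512 and hop = 256 correspond to the 32 ms frame length
    and 16 ms frame shift at 16 kHz given in the embodiment."""
    maxvalue = max(np.max(np.abs(xL)), np.max(np.abs(xR)))
    xL, xR = xL / maxvalue, xR / maxvalue
    w = np.hamming(N)
    n_frames = (len(xL) - N) // hop + 1
    framesL = np.stack([w * xL[t * hop:t * hop + N] for t in range(n_frames)])
    framesR = np.stack([w * xR[t * hop:t * hop + N] for t in range(n_frames)])
    return framesL, framesR
```

Both channels are divided by the same maximum so that the interaural level relationship is preserved by the normalization.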
1.3) A cross-correlation operation is applied to the single-frame binaural speech signals obtained in step 1.2), and the cross-correlation function is used to compute the interaural time difference (ITD) estimate of each frame. The average of the ITD estimates over all frames of the same azimuth serves as the trained ITD value of that azimuth, denoted δ(θ).
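The per-frame cross-correlation search can be sketched as follows (an illustrative sketch assuming NumPy; `itd_frames` and `train_itd` are hypothetical helper names). Note that NumPy's lag convention requires correlating the right channel against the left to match the sum over x_L(n)·x_R(n + k).

```python
import numpy as np

def itd_frames(framesL, framesR):
    """Per-frame ITD estimate: the lag k maximizing the cross-correlation
    sum_n x_L(tau, n) * x_R(tau, n + k) over -N+1 <= k <= N-1."""
    N = framesL.shape[1]
    itds = []
    for xl, xr in zip(framesL, framesR):
        # np.correlate(xr, xl, "full")[k + N - 1] = sum_n x_L(n) x_R(n + k)
        corr = np.correlate(xr, xl, mode="full")
        itds.append(int(np.argmax(corr)) - (N - 1))
    return np.array(itds)

def train_itd(framesL, framesR):
    """delta(theta): average ITD over all frames of one training azimuth."""
    return itd_frames(framesL, framesR).mean()
```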
The method of establishing the ITD model of azimuth θ is as follows.
The ITD value of the τ-th frame is:
ITD(τ) = arg max_k Σ_{n=0}^{N-|k|-1} x_L(τ, n) x_R(τ, n + k),  -N+1 ≤ k ≤ N-1
The ITD(τ) values of all frames of the binaural white-noise signal at azimuth θ are averaged to obtain δ(θ), the trained ITD parameter of azimuth θ:
δ(θ) = Σ_τ ITD(τ) / frameNum
where frameNum denotes the total number of frames of the azimuth-θ binaural white-noise signal after framing.
This establishes the model between the azimuth θ and the trained ITD parameter.
1.4) A short-time Fourier transform is applied to the single-frame binaural speech signals obtained in step 1.2), transforming them to the frequency domain, and the ratio of the left-ear to right-ear magnitude spectra in each frequency bin, i.e. the interaural intensity difference (IID) vector, is computed. The average of the IID estimates over all frames of the same azimuth serves as the trained IID value of that azimuth, denoted α(θ, ω), where ω denotes the frequency of the Fourier transform.
The method of establishing the IID model of azimuth θ is as follows.
The IID value of the τ-th frame is:
IID(τ, ω) = 20 log (|X_L(τ, ω)| / |X_R(τ, ω)|)
where X_L(τ, ω) and X_R(τ, ω) are the frequency-domain representations, i.e. short-time Fourier transforms, of x_L(τ, n) and x_R(τ, n), respectively:
X(τ, ω) = Σ_{n=0}^{N-1} x(τ, n) e^{-jωn}
where x(τ, n) denotes the frame-τ signal (the transform is applied to the left-ear and right-ear signals separately), and ω denotes the angular-frequency vector, with range [0, 2π] at intervals of 2π/512;
The IID(τ, ω) values of all frames of the binaural white-noise signal at azimuth θ are averaged to obtain α(θ, ω), the trained IID parameter of azimuth θ:
α(θ, ω) = Σ_τ IID(τ, ω) / frameNum
where frameNum denotes the total number of frames of the azimuth-θ binaural white-noise signal after framing.
This establishes the model between the azimuth θ and the trained IID parameter.
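The IID computation of step 1.4) can be sketched as follows (an illustrative sketch assuming NumPy; the small `eps` guard against log of zero is an addition not present in the patent's formula, and the helper names are hypothetical):

```python
import numpy as np

def iid_frames(framesL, framesR, eps=1e-12):
    """Step 1.4): per-frame IID vector, 20*log10 of the ratio of the
    left/right magnitude spectra in each bin. eps guards against log(0)."""
    XL = np.fft.fft(framesL, axis=-1)
    XR = np.fft.fft(framesR, axis=-1)
    return 20.0 * np.log10((np.abs(XL) + eps) / (np.abs(XR) + eps))

def train_iid(framesL, framesR):
    """alpha(theta, omega): frame-average IID for one training azimuth."""
    return iid_frames(framesL, framesR).mean(axis=0)
```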
2) Binaural mixed-speech separation stage based on critical bands and azimuth information:
2.1) The binaural mixed-speech signal in the testing process contains multiple sound sources, each corresponding to a different azimuth. The binaural mixed-speech signal is preprocessed, including amplitude normalization and framing with windowing;
2.2) A Fourier transform is applied to the framed binaural mixed signal, and the frequency domain is divided into sub-bands according to the frequency ranges of the critical bands, yielding the framed sub-band signals;
The sub-band division method is as follows:
A short-time Fourier transform is applied frame by frame to the multi-frame signal obtained in step 2.1), transforming it to the time-frequency domain and yielding the framed time-frequency signals X_L(τ, ω) and X_R(τ, ω) of the binaural signals.
The frequency bins are then divided into sub-bands according to the critical-band partition, where C denotes the number of critical bands, and ω_c_low and ω_c_high denote the low- and high-frequency limits of the c-th critical band.
2.3) According to the number of sources contained in the mixed signal and their azimuth information, together with the azimuth ITD and IID parameters established in steps 1.3) and 1.4), the sources are classified in each critical band of each frame obtained in step 2.2), based on the similarity between the left- and right-ear signals;
2.4) The critical-band classification results obtained in step 2.3) are multiplied with the framed time-frequency signals obtained in step 2.1), yielding the time-frequency-domain signal corresponding to each source;
2.5) An inverse Fourier transform is applied to the time-frequency-domain signal of each source obtained in step 2.4), converting it back to a time-domain signal; windowing is applied and the separated speech of each source is synthesized.
Beneficial effects: Compared with existing frequency-based speech separation techniques, the present invention builds on the scale-reduction processing mechanism of the human auditory system combined with the auditory masking effect of the ear. After the localization stage accurately obtains the source azimuths, the different sub-bands of each frame are separated, combining sound-source localization with critical-band separation. For the separation of multiple speakers, the separation performance, measured by SNR (Source to Noise Ratio), SDR (Source to Distortion Ratio), SAR (Sources to Artifacts Ratio), and PESQ (Perceptual Evaluation of Speech Quality), is effectively improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of the planar space of sound-source localization and speech separation in the present invention;
Fig. 2 is a system block diagram of the present invention.
Embodiment
The present invention is further described below in conjunction with the accompanying drawings.
The present invention first performs data training: for each azimuth, the averages of the interaural time difference ITD (Interaural Time Difference) and the interaural intensity difference IID (Interaural Intensity Difference) serve as the localization feature cues of the source azimuth, establishing an azimuth mapping model. During actual sound-source localization, the final number of sources and their azimuths are estimated from the histogram of the per-frame azimuths of the binaural mixed signal. In the source separation stage, the binaural mixed signal is first divided into sub-bands based on the critical bands; then, using the azimuth information obtained from localization, the frequency-domain signal is classified within each critical band, and finally the time-frequency points of each source are transformed back to the time domain by an inverse Fourier transform.
Fig. 1 is a schematic diagram of the planar space of sound-source localization and speech separation in the present invention, taking 2 sources as an example. The 2 microphones are located at the ears. In the present invention, the spatial position of a source is represented by its azimuth θ, -180° ≤ θ ≤ 180°, defined as the angle between the projection of the direction vector onto the horizontal plane and the median vertical plane. In the horizontal plane, θ = 0° denotes straight ahead and, proceeding clockwise, θ = 90°, 180°, and -90° denote directly right, directly behind, and directly left, respectively. Fig. 1 takes 2 sources as an example (in this embodiment the sources are the voices of speakers), with azimuths of -30° and 30°, respectively.
Fig. 2 is a system block diagram of the present invention. The method of the invention comprises model training, time-frequency transformation, critical-band division, and per-sub-band source classification. The embodiment of the technical solution of the present invention is described in detail below with reference to the accompanying drawings:
Step 1) Data training:
1.1) Fig. 2 gives the overall system diagram. In the training stage, the head-related transfer function HRTF (Head Related Transfer Function), or its time-domain counterpart the head-related impulse response HRIR (Head Related Impulse Response), is used to generate binaural signals of specific azimuths. The present invention uses the HRIR data measured by the MIT Media Lab, and generates the binaural signals of the corresponding azimuths by convolving white noise with the HRIR data for θ = -90° to 90° (5° intervals).
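This generation step amounts to a pair of convolutions (a minimal sketch assuming NumPy; `make_binaural` is a hypothetical helper name, and the HRIR pair of the desired azimuth is supplied by the caller):

```python
import numpy as np

def make_binaural(mono, hrir_l, hrir_r):
    """Step 1.1): generate a directional binaural pair by convolving a
    monaural signal (e.g. white noise) with the left/right HRIR of one
    azimuth."""
    return np.convolve(mono, hrir_l), np.convolve(mono, hrir_r)
```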
1.2) The binaural white-noise signal of azimuth θ is preprocessed. The preprocessing of this method comprises amplitude normalization, framing, and windowing.
The amplitude normalization method is:
x_L = x_L / maxvalue
x_R = x_R / maxvalue
where x_L and x_R denote the left-ear and right-ear signals, respectively, and maxvalue = max(|x_L|, |x_R|) is the maximum amplitude over the two channels.
In this embodiment a Hamming window is applied to each frame of the speech signal. The τ-th windowed frame can be expressed as:
x_L(τ, n) = w_H(n) x_L(τN + n), 0 ≤ n < N
x_R(τ, n) = w_H(n) x_R(τN + n), 0 ≤ n < N
where x_L(τ, n) and x_R(τ, n) denote the left- and right-ear signals of frame τ, and N is the number of samples per frame. In this embodiment the speech sampling rate is 16 kHz, the frame length is 32 ms, and the frame shift is 16 ms, so N = 512; w_H(n) is the Hamming window function:
w_H(n) = 0.54 - 0.46 cos(2πn / (N - 1)), 0 ≤ n < N
1.3) The ITD model of azimuth θ is established.
The ITD value of the τ-th frame is:
ITD(τ) = arg max_k Σ_{n=0}^{N-|k|-1} x_L(τ, n) x_R(τ, n + k),  -N+1 ≤ k ≤ N-1
The ITD(τ) values of all frames of the binaural white-noise signal at azimuth θ are averaged to obtain δ(θ), the trained ITD parameter of azimuth θ:
δ(θ) = Σ_τ ITD(τ) / frameNum
where frameNum denotes the total number of frames of the azimuth-θ binaural white-noise signal after framing.
This establishes the model between the azimuth θ and the trained ITD parameter.
1.4) The IID model of azimuth θ is established:
The IID value of the τ-th frame is:
IID(τ, ω) = 20 log (|X_L(τ, ω)| / |X_R(τ, ω)|)
where X_L(τ, ω) and X_R(τ, ω) are the frequency-domain representations, i.e. short-time Fourier transforms, of x_L(τ, n) and x_R(τ, n), respectively:
X(τ, ω) = Σ_{n=0}^{N-1} x(τ, n) e^{-jωn}
where x(τ, n) denotes the frame-τ signal (the transform is applied to the left-ear and right-ear signals separately), and ω denotes the angular-frequency vector, with range [0, 2π] at intervals of 2π/512.
The IID(τ, ω) values of all frames of the binaural white-noise signal at azimuth θ are averaged to obtain α(θ, ω), the trained IID parameter of azimuth θ:
α(θ, ω) = Σ_τ IID(τ, ω) / frameNum
where frameNum denotes the total number of frames of the azimuth-θ binaural white-noise signal after framing.
This establishes the model between the azimuth θ and the trained IID parameter.
Step 2) Binaural mixed-speech separation stage based on critical bands and azimuth information.
2.1) Corresponding to the preprocessing module in Fig. 2, the binaural mixed signal containing multiple sources of different azimuths is preprocessed in the same way as in step 1.2) above, including amplitude normalization, framing, and windowing, with a frame length of 32 ms, a frame shift of 16 ms, and a Hamming window.
2.2) Corresponding to the frequency-domain transform module in Fig. 2, a short-time Fourier transform is applied frame by frame to the multi-frame signal obtained in step 2.1), transforming it to the time-frequency domain and yielding the framed time-frequency signals X_L(τ, ω) and X_R(τ, ω) of the binaural signals.
The frequency bins are then divided into sub-bands according to the critical-band partition, where C denotes the number of critical bands, and ω_c_low and ω_c_high denote the low- and high-frequency limits of the c-th critical band.
The division of the critical bands, i.e. the low-frequency limit, high-frequency limit, and bandwidth of each critical band, is shown in the following table:
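The table itself is an image in the source and is not reproduced here. As an illustration, the conventional Zwicker/Bark critical-band edges can be mapped to FFT-bin ranges as follows (a sketch under the assumption that the patent uses the standard Bark bands; only bands below the 8 kHz Nyquist frequency of a 16 kHz signal are kept):

```python
# Conventional Zwicker/Bark critical-band edges in Hz (an assumption --
# the patent's own table is an image and is not reproduced in the source).
BARK_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700]  # edges above 8 kHz exceed fs/2 at 16 kHz

def band_bins(fs=16000, N=512):
    """Map each critical band c to its FFT-bin range
    (omega_c_low, omega_c_high) for an N-point transform."""
    hz_per_bin = fs / N
    return [(int(lo / hz_per_bin), int(hi / hz_per_bin))
            for lo, hi in zip(BARK_EDGES_HZ[:-1], BARK_EDGES_HZ[1:])]
```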
2.3) Corresponding to the azimuth-based sub-band classification module in Fig. 2. Here we assume that the number of sources contained in the binaural mixed signal and their azimuths are known. Many existing algorithms can estimate the number and azimuths of sources from binaural signals; sound-source localization is not described here, and likewise no particular localization algorithm is prescribed. We only discuss how, after localization, the mixture is separated according to the azimuth information of the different sources.
According to the masking effect of the human auditory system, within a given critical band of a given frame, usually only one source signal is dominant. In azimuth-based speech separation, the interaural time difference ITD and the interaural intensity difference IID therefore serve as spatial cues: the mask function is computed from the maximum similarity between the two channels, and the sources are classified within each critical band. Here we assume the binaural mixed signal contains L sources, with the l-th source at azimuth θ_l (1 ≤ l ≤ L):
where X_L(τ, ω) and X_R(τ, ω) are the left- and right-ear frequency-domain signals of frame τ, ω_c denotes the frequency range of the c-th critical band, θ_l is the azimuth of the l-th source, α(θ_l, ω) is the IID parameter of the l-th source at azimuth θ_l and frequency ω, and δ(θ_l) is the ITD parameter of the l-th source's azimuth.
J(τ, c) thus classifies the sources in each critical band using the azimuth information.
Next, a binary mask is marked on the critical bands corresponding to each source, so that M_l(τ, ω) denotes the binary mask of the l-th source in the c-th critical band.
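The classification and masking of steps 2.3)-2.4) can be sketched as follows. This is an illustrative sketch assuming NumPy: the patent's exact similarity measure for J(τ, c) is an equation image not reproduced in the source, so a simple squared-error score against the trained α and δ parameters is assumed, and all function names are hypothetical.

```python
import numpy as np

def classify_bands(iid_obs, itd_obs, alpha, delta, bands):
    """J(tau, c): for each frame tau and critical band c, pick the source
    l whose trained parameters alpha(theta_l, omega) and delta(theta_l)
    best match the observed IID/ITD (squared-error score -- an assumption).
    iid_obs: (frames, bins); itd_obs: (frames,);
    alpha: (L, bins); delta: (L,); bands: list of (lo, hi) bin ranges."""
    J = np.zeros((iid_obs.shape[0], len(bands)), dtype=int)
    for t in range(iid_obs.shape[0]):
        for c, (lo, hi) in enumerate(bands):
            score = [np.sum((iid_obs[t, lo:hi] - alpha[l, lo:hi]) ** 2)
                     + (itd_obs[t] - delta[l]) ** 2
                     for l in range(alpha.shape[0])]
            J[t, c] = int(np.argmin(score))
    return J

def binary_masks(J, bands, n_bins, L):
    """M_l(tau, omega): 1 where band c of frame tau is assigned to source l."""
    masks = np.zeros((L, J.shape[0], n_bins))
    for t in range(J.shape[0]):
        for c, (lo, hi) in enumerate(bands):
            masks[J[t, c], t, lo:hi] = 1.0
    return masks
```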
2.4) According to the binary masks, the binaural signals of every frame and every frequency bin are classified, yielding the time-frequency signal corresponding to the l-th source, i.e. the frequency-domain data of the τ-th frame of the l-th source.
Here the mask is multiplied with the left-ear signal to obtain the time-frequency data of each source; the right-ear signal could in fact equally be used.
2.5) Corresponding to the inverse time-frequency transform module in Fig. 2, a short-time inverse Fourier transform is applied to the separated frequency-domain signal of the l-th source, yielding the τ-th-frame time-domain signal of source l.
After conversion to the time domain, windowing is applied; the τ-th frame after this de-windowing step is expressed in terms of w_H(m), the Hamming window defined above.
The de-windowed frames of each source are then overlap-added, yielding the l-th source signal s_l of the separated mixture, thereby achieving the separation of the source signals of different azimuths.
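Step 2.5) can be sketched as follows. This is an illustrative sketch assuming NumPy; applying the analysis window a second time at synthesis and normalizing by the summed squared window is a standard overlap-add scheme assumed here, since the patent's de-windowing equations are images not reproduced in the source.

```python
import numpy as np

def reconstruct(frames_freq, hop=256):
    """Step 2.5): inverse FFT per frame, apply the Hamming window again
    (synthesis windowing), then overlap-add; dividing by the summed
    squared window makes the analysis/synthesis chain exactly invertible."""
    n_frames, N = frames_freq.shape
    w = np.hamming(N)
    out = np.zeros((n_frames - 1) * hop + N)
    wsum = np.zeros_like(out)
    for t, X in enumerate(frames_freq):
        frame = np.real(np.fft.ifft(X)) * w
        out[t * hop:t * hop + N] += frame
        wsum[t * hop:t * hop + N] += w * w
    return out / np.maximum(wsum, 1e-12)
```

With an unmodified spectrum this chain reproduces the input signal, which is a useful sanity check before applying the binary masks.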
The above is merely a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can also be made without departing from the principles of the invention, and such improvements and modifications should likewise be regarded as falling within the protection scope of the present invention.
Claims (7)
1. A binaural speech separation method based on critical bands, characterized in that the method comprises the following steps:
1) Parameter training stage:
1.1) Training is performed using directional binaural white-noise signals, each being the binaural signal generated by convolving head-related impulse response (HRIR) data with a monaural white-noise signal of known azimuth; the sound-source azimuth θ of the binaural white-noise signal is defined as the angle between the projection of the direction vector onto the horizontal plane and the median vertical plane, in the range [-90°, 90°];
1.2) The binaural white-noise signals of known azimuth are preprocessed, the preprocessing comprising amplitude normalization and framing with windowing, yielding the single-frame binaural signals after framing;
1.3) A cross-correlation operation is applied to the single-frame binaural speech signals obtained in step 1.2), and the cross-correlation function is used to compute the interaural time difference (ITD) estimate of each frame; the average of the ITD estimates over all frames of the same azimuth serves as the trained ITD value of that azimuth, establishing the ITD model of azimuth θ, denoted δ(θ);
1.4) A short-time Fourier transform is applied to the single-frame binaural speech signals obtained in step 1.2), transforming them to the frequency domain, and the ratio of the left-ear to right-ear magnitude spectra in each frequency bin, i.e. the interaural intensity difference (IID) vector, is computed; the average of the IID estimates over all frames of the same azimuth serves as the trained IID value of that azimuth, establishing the IID model of azimuth θ, denoted α(θ, ω), where ω denotes the frequency of the Fourier transform;
2) Binaural mixed-speech separation stage based on critical bands and azimuth information:
2.1) The binaural mixed-speech signal in the testing process contains multiple sound sources, each corresponding to a different azimuth; the binaural mixed-speech signal is preprocessed in the same way as in step 1.2), including amplitude normalization and framing with windowing;
2.2) A Fourier transform is applied to the framed binaural mixed signal, and the frequency domain is divided into sub-bands according to the frequency ranges of the critical bands, yielding the framed sub-band signals;
2.3) According to the number of sources contained in the mixed signal and their azimuth information, together with the azimuth ITD and IID parameters established in steps 1.3) and 1.4), the sources are classified in each critical band of each frame obtained in step 2.2), based on the similarity between the left- and right-ear signals;
2.4) The critical-band classification results obtained in step 2.3) are multiplied with the framed time-frequency signals obtained in step 2.1), yielding the time-frequency-domain signal corresponding to each source;
2.5) An inverse Fourier transform is applied to the time-frequency-domain signal of each source obtained in step 2.4), converting it to a time-domain signal; windowing is applied and the separated speech of each source is synthesized.
2. The binaural speech separation method based on critical bands according to claim 1, characterized in that the sound-source azimuth angles θ are taken at intervals of 5°.
3. The binaural speech separation method based on critical bands according to claim 1, characterized in that the amplitude normalization method in step 1.2) is:
x_L = x_L / maxvalue
x_R = x_R / maxvalue
where x_L and x_R denote the left-ear and right-ear signals, respectively, and maxvalue = max(|x_L|, |x_R|) is the maximum amplitude over the two channels.
4. The binaural speech separation method based on critical bands according to claim 1, characterized in that the framing with windowing in step 1.2) applies a Hamming window to each frame of the speech signal; the τ-th windowed frame can be expressed as:
x_L(τ, n) = w_H(n) x_L(τN + n), 0 ≤ n < N
x_R(τ, n) = w_H(n) x_R(τN + n), 0 ≤ n < N
where x_L(τ, n) and x_R(τ, n) denote the left- and right-ear signals of frame τ, and N is the number of samples per frame.
5. The binaural speech separation method based on critical bands according to claim 1, characterized in that the method of establishing the ITD model of azimuth θ in step 1.3) is as follows:
The ITD value of the τ-th frame is:
ITD(τ) = arg max_k Σ_{n=0}^{N-|k|-1} x_L(τ, n) x_R(τ, n + k),  -N+1 ≤ k ≤ N-1
The ITD(τ) values of all frames of the binaural white-noise signal at azimuth θ are averaged to obtain δ(θ), the trained ITD parameter of azimuth θ:
δ(θ) = Σ_τ ITD(τ) / frameNum
where frameNum denotes the total number of frames of the azimuth-θ binaural white-noise signal after framing.
This establishes the model between the azimuth θ and the trained ITD parameter.
6. The binaural speech separation method based on critical bands according to claim 1, characterized in that the method of establishing the IID model of azimuth θ in step 1.4) is as follows:
The IID value of the τ-th frame is:
IID(τ, ω) = 20 log (|X_L(τ, ω)| / |X_R(τ, ω)|)
where X_L(τ, ω) and X_R(τ, ω) are the frequency-domain representations, i.e. short-time Fourier transforms, of x_L(τ, n) and x_R(τ, n), respectively:
X(τ, ω) = Σ_{n=0}^{N-1} x(τ, n) e^{-jωn}
where x(τ, n) denotes the frame-τ signal (the transform is applied to the left-ear and right-ear signals separately), and ω denotes the angular-frequency vector, with range [0, 2π] at intervals of 2π/512;
The IID(τ, ω) values of all frames of the binaural white-noise signal at azimuth θ are averaged to obtain α(θ, ω), the trained IID parameter of azimuth θ:
α(θ, ω) = Σ_τ IID(τ, ω) / frameNum
where frameNum denotes the total number of frames of the azimuth-θ binaural white-noise signal after framing.
This establishes the model between the azimuth θ and the trained IID parameter.
7. The critical-band-based binaural speech separation method according to claim 1, wherein the sub-band division method in step 2.2) is as follows:
A frame-by-frame short-time Fourier transform is applied to the multi-frame signals obtained in step 2.1), transforming them into the time-frequency domain and yielding the framed time-frequency-domain binaural signals X_L(τ, ω) and X_R(τ, ω).
The frequency axis is then divided into sub-bands according to the critical-band division method:
where C denotes the number of critical bands, and ω_c_low and ω_c_high denote the lower and upper frequency limits of the c-th critical band, respectively.
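A sketch of such a sub-band division in NumPy. The patent does not list its exact band edges, so the standard Zwicker critical-band (Bark) edges, truncated at 8 kHz for an assumed 16 kHz sampling rate and 512-point FFT, are used here as an assumption:

```python
import numpy as np

# Standard Zwicker critical-band (Bark) edges in Hz, truncated at 8 kHz.
# These exact values are an assumption; the patent only states that the
# frequency axis is partitioned into C critical bands.
BARK_EDGES_HZ = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 8000]

def subband_bins(fs=16000, nfft=512):
    """Map each critical band c to the FFT bins whose centre frequencies
    fall in [omega_c_low, omega_c_high)."""
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)  # bin centre frequencies in Hz
    bands = []
    for lo, hi in zip(BARK_EDGES_HZ[:-1], BARK_EDGES_HZ[1:]):
        bins = np.where((freqs >= lo) & (freqs < hi))[0]
        if bins.size:
            bands.append((lo, hi, bins))
    return bands

bands = subband_bins()
print(len(bands))  # C, the number of critical bands below the Nyquist frequency
```

Each time-frequency unit of X_L(τ, ω) and X_R(τ, ω) can then be assigned to the critical band whose bin range contains its frequency index.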
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710479139.2A CN107346664A (en) | 2017-06-22 | 2017-06-22 | A kind of ears speech separating method based on critical band |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107346664A true CN107346664A (en) | 2017-11-14 |
Family
ID=60253298
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710479139.2A Pending CN107346664A (en) | 2017-06-22 | 2017-06-22 | A kind of ears speech separating method based on critical band |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107346664A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105900457A (en) * | 2014-01-03 | 2016-08-24 | 杜比实验室特许公司 | Methods and systems for designing and applying numerically optimized binaural room impulse responses |
CN104464750A (en) * | 2014-10-24 | 2015-03-25 | 东南大学 | Voice separation method based on binaural sound source localization |
CN105575403A (en) * | 2015-12-25 | 2016-05-11 | 重庆邮电大学 | Cross-correlation sound source positioning method with combination of auditory masking and double-ear signal frames |
Non-Patent Citations (7)
Title |
---|
B.C.J. Moore: "An Introduction to the Psychology of Hearing", 30 December 1997 *
Roman, N. et al.: "Speech Segregation Based on Sound Localization", Journal of the Acoustical Society of America *
Liao Qipeng: "Research on Speech Dereverberation Based on the Gammatone Auditory Filter Bank and Complex-Cepstrum Blind Deconvolution", China Master's Theses Full-text Database (Information Science and Technology) *
Li Kun: "Measurement and Analysis of Interaural Intensity Difference Perception Characteristics", China Master's Theses Full-text Database (Information Science and Technology) *
Li Xiaoxiong: "Research on Speech Separation Based on Binaural Spatial Information", China Master's Theses Full-text Database (Information Science and Technology) *
Wang Jing: "Mixed-Speech Separation Based on Computational Auditory Scene Analysis", China Master's Theses Full-text Database (Information Science and Technology) *
Xie Zhiwen: "Research on the Psychoacoustic Masking Effect", China Doctoral Dissertations Full-text Database (Information Science and Technology) *
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107942290B (en) * | 2017-11-16 | 2019-10-11 | 东南大学 | Binaural sound sources localization method based on BP neural network |
CN107942290A (en) * | 2017-11-16 | 2018-04-20 | 东南大学 | Binaural sound sources localization method based on BP neural network |
CN108091345A (en) * | 2017-12-27 | 2018-05-29 | 东南大学 | A kind of ears speech separating method based on support vector machines |
CN108091345B (en) * | 2017-12-27 | 2020-11-20 | 东南大学 | Double-ear voice separation method based on support vector machine |
CN108647556A (en) * | 2018-03-02 | 2018-10-12 | 重庆邮电大学 | Sound localization method based on frequency dividing and deep neural network |
CN108615536A (en) * | 2018-04-09 | 2018-10-02 | 华南理工大学 | Time-frequency combination feature musical instrument assessment of acoustics system and method based on microphone array |
CN110446142B (en) * | 2018-05-03 | 2021-10-15 | 阿里巴巴集团控股有限公司 | Audio information processing method, server, device, storage medium and client |
CN110446142A (en) * | 2018-05-03 | 2019-11-12 | 阿里巴巴集团控股有限公司 | Audio-frequency information processing method, server, equipment, storage medium and client |
CN110364175B (en) * | 2019-08-20 | 2022-02-18 | 北京凌声芯语音科技有限公司 | Voice enhancement method and system and communication equipment |
CN110364175A (en) * | 2019-08-20 | 2019-10-22 | 北京凌声芯语音科技有限公司 | Sound enhancement method and system, verbal system |
CN112731289A (en) * | 2020-12-10 | 2021-04-30 | 深港产学研基地(北京大学香港科技大学深圳研修院) | Binaural sound source positioning method and device based on weighted template matching |
CN112731289B (en) * | 2020-12-10 | 2024-05-07 | 深港产学研基地(北京大学香港科技大学深圳研修院) | Binaural sound source positioning method and device based on weighted template matching |
US11328702B1 (en) | 2021-04-25 | 2022-05-10 | Shenzhen Shokz Co., Ltd. | Acoustic devices |
WO2022226696A1 (en) * | 2021-04-25 | 2022-11-03 | 深圳市韶音科技有限公司 | Open earphone |
US11715451B2 (en) | 2021-04-25 | 2023-08-01 | Shenzhen Shokz Co., Ltd. | Acoustic devices |
CN113476041A (en) * | 2021-06-21 | 2021-10-08 | 苏州大学附属第一医院 | Speech perception capability test method and system for children using artificial cochlea |
CN113476041B (en) * | 2021-06-21 | 2023-09-19 | 苏州大学附属第一医院 | Speech perception capability test method and system for artificial cochlea using children |
CN113782047A (en) * | 2021-09-06 | 2021-12-10 | 云知声智能科技股份有限公司 | Voice separation method, device, equipment and storage medium |
CN113782047B (en) * | 2021-09-06 | 2024-03-08 | 云知声智能科技股份有限公司 | Voice separation method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107346664A (en) | A kind of ears speech separating method based on critical band | |
CN104464750B (en) | A kind of speech separating method based on binaural sound sources positioning | |
CN109830245B (en) | Multi-speaker voice separation method and system based on beam forming | |
CN106373589B (en) | A kind of ears mixing voice separation method based on iteration structure | |
CN109584903B (en) | Multi-user voice separation method based on deep learning | |
CN102565759B (en) | Binaural sound source localization method based on sub-band signal to noise ratio estimation | |
CN106504763A (en) | Based on blind source separating and the microphone array multiple target sound enhancement method of spectrum-subtraction | |
CN106782565A (en) | A kind of vocal print feature recognition methods and system | |
CN110728989B (en) | Binaural speech separation method based on long-time and short-time memory network L STM | |
CN102438189A (en) | Dual-channel acoustic signal-based sound source localization method | |
CN108091345B (en) | Double-ear voice separation method based on support vector machine | |
CN108520756B (en) | Method and device for separating speaker voice | |
Cai et al. | Multi-Channel Training for End-to-End Speaker Recognition Under Reverberant and Noisy Environment. | |
CN110619887A (en) | Multi-speaker voice separation method based on convolutional neural network | |
CN111986695A (en) | Non-overlapping sub-band division fast independent vector analysis voice blind separation method and system | |
CN103901400A (en) | Binaural sound source positioning method based on delay compensation and binaural coincidence | |
Wang et al. | Pseudo-determined blind source separation for ad-hoc microphone networks | |
Spille et al. | Combining binaural and cortical features for robust speech recognition | |
Talagala et al. | Binaural localization of speech sources in the median plane using cepstral HRTF extraction | |
CN112216301B (en) | Deep clustering voice separation method based on logarithmic magnitude spectrum and interaural phase difference | |
CN113345421B (en) | Multi-channel far-field target voice recognition method based on angle spectrum characteristics | |
CN114189781A (en) | Noise reduction method and system for double-microphone neural network noise reduction earphone | |
CN112731291A (en) | Binaural sound source positioning method and system for collaborative two-channel time-frequency mask estimation task learning | |
Guo et al. | Enhanced Neural Beamformer with Spatial Information for Target Speech Extraction | |
Meutzner et al. | Binaural signal processing for enhanced speech recognition robustness in complex listening environments |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20171114 |
|