CN110728989B - Binaural speech separation method based on long short-term memory network LSTM - Google Patents

Binaural speech separation method based on long short-term memory network LSTM

Info

Publication number
CN110728989B
CN110728989B (application CN201910930176.XA)
Authority
CN
China
Prior art keywords
time
stm
binaural
training
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910930176.XA
Other languages
Chinese (zh)
Other versions
CN110728989A (en)
Inventor
周琳
陆思源
钟秋月
庄琰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201910930176.XA priority Critical patent/CN110728989B/en
Publication of CN110728989A publication Critical patent/CN110728989A/en
Application granted granted Critical
Publication of CN110728989B publication Critical patent/CN110728989B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a binaural speech separation method based on a long short-term memory (LSTM) network. The interaural time difference, interaural intensity difference and interaural cross-correlation function of each time-frequency unit of the training binaural speech signal are extracted as spatial features for separation. The spatial features of the current frame, together with those of the time-frequency units of the 5 preceding and 5 following frames in the same subband, are used as input parameters to train a bidirectional LSTM network and obtain an LSTM-based separation model. In the test stage, the spatial features of the current frame and of the 5 preceding and following frames of time-frequency units of the test binaural speech signal are fed as input parameters to the trained bidirectional LSTM network, which estimates the masking value of the target speech for the current time-frequency unit, and speech separation is then carried out according to this masking value.

Description

Binaural speech separation method based on long short-term memory network LSTM
Technical Field
The invention relates to a speech separation algorithm, in particular to a binaural speech separation method based on a long short-term memory network (LSTM).
Background
Speech separation is an important research direction in speech signal processing and has a wide range of applications. In a teleconference system, speech separation can extract the sound sources of interest from several speakers and thus improve the efficiency of the conference. As a preprocessing step for speech recognition, it can improve speech quality and help raise recognition accuracy. Applied in a hearing-aid device, it can make the target sound source more prominent for a hearing-impaired listener and provide effective speech information.
Speech separation techniques draw on a wide variety of fields, including but not limited to acoustics, digital signal processing, information and communication, and auditory psychology and physiology. Binaural speech separation exploits the differences between the two ear signals to analyse and estimate the sound source direction; current separation algorithms can be divided into the following categories according to the separation parameters they use:
1. Separation based on interaural differences
Lord Rayleigh, under the assumption of a spherical human head, first proposed the theory of separation based on interaural cue differences in 1907: because the sound source lies at different positions relative to the two ears, the speech signals received at the two ears differ in time and in intensity, namely the interaural time difference (ITD) and the interaural intensity difference (IID), and these cues are the basis of binaural speech separation.
2. Separation based on head-related transfer function
ITD information can distinguish sound sources to the left and right, but it cannot decide whether a sound comes from the front or from the rear, nor can it resolve elevation. Methods based on the head-related transfer function (HRTF) are no longer limited to horizontal, frontal speech: an inverse filter is designed from an HRTF database, and cross-correlation values are computed from the inverse-filtered binaural signals to separate the speech. This approach solves three-dimensional speech separation, but its computational complexity is very high and HRTFs are strongly individual; with a different listener or a different surrounding environment (i.e. different noise or reverberation), the actual transfer function no longer matches the one used in the separation model, which degrades the separation performance.
3. Separation based on deep neural networks (DNN)
This class of methods applies the ideal masking ratio (IRM) to the multi-speaker separation problem and models by azimuth, extracting improved IRM values for sound sources at 19 frontal azimuths and for environmental noise as the training targets of a neural network. In the training stage, the binaural speech signals are preprocessed: the mixed speech passes through a gammatone filter bank and is framed and windowed to obtain the time-frequency units, whose spatial features are extracted and fed into a DNN for training. In the test stage, the extracted time-frequency spatial features of the mixed speech are sent to the trained DNN, and the DNN output is the estimated masking ratio (ERM). This separation method is highly robust and clearly improves on traditional algorithms in various speech evaluation indices, but it does not exploit the temporal correlation of the speech signal feature parameters.
Disclosure of Invention
The invention aims to solve the problem that the performance of existing binaural speech separation algorithms drops sharply under high noise and strong reverberation, and provides a binaural speech separation method based on a long short-term memory network (LSTM). The method trains on feature parameters collected under multiple acoustic environments with an LSTM network, and simulation results show that the separation performance of the binaural speech separation algorithm based on the long short-term memory network LSTM is significantly improved.
The binaural speech separation method based on the long short-term memory network LSTM comprises the following steps:
(1) convolving two different training monaural speech signals with head-related impulse response functions (HRIR) of two different azimuths to generate two training single-source binaural speech signals with different azimuths;
(2) mixing the two training single-source binaural speech signals of different azimuths to obtain a mixed training binaural speech signal containing the two sound sources, and adding noise at different signal-to-noise ratios to obtain noisy mixed training binaural speech signals containing two sound sources of different azimuths under different acoustic environments;
(3) performing subband filtering, framing and windowing on the noisy mixed training binaural speech signal obtained in step (2) to obtain the framed training binaural speech signal of each subband, i.e. the time-frequency units of the training binaural speech signal;
(4) calculating the interaural cross-correlation function CCF, the interaural time difference ITD and the interaural intensity difference ILD of each time-frequency unit of the training binaural speech signal obtained in step (3) as the spatial features of each time-frequency unit;
(5) taking the spatial feature parameters of each time-frequency unit obtained in step (4), together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, as the input of a long short-term memory network LSTM, taking the ideal masking ratio IRM of the time-frequency unit as the target value of the LSTM, and training the LSTM network;
(6) processing a mixed test binaural speech signal containing two sound sources of different azimuths under different acoustic environments according to steps (3) and (4) to obtain the spatial features of each time-frequency unit of the test binaural speech signal;
(7) inputting the spatial features of each time-frequency unit obtained in step (6), together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, into the trained LSTM network to obtain the estimated masking ratio ERM of each time-frequency unit;
(8) separating the mixed test binaural speech signal according to the estimated masking ratio ERM obtained in step (7) to obtain the time-domain speech signal corresponding to each single sound source.
Further, the two single-source binaural speech signals of different azimuths in step (1) are computed as:
s1,L(n) = s1(n)*h1,L,  s2,L(n) = s2(n)*h2,L
s1,R(n) = s1(n)*h1,R,  s2,R(n) = s2(n)*h2,R
where s1(n), s2(n) are the two different monaural source speech signals; s1,L(n), s1,R(n) are the single-source left- and right-ear speech signals for azimuth 1; h1,L, h1,R are the left-ear and right-ear HRIRs for azimuth 1; s2,L(n), s2,R(n) are the single-source left- and right-ear speech signals for azimuth 2; h2,L, h2,R are the left-ear and right-ear HRIRs for azimuth 2; * denotes convolution; and n is the sample index.
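As an illustration of step (1), the following Python sketch renders a mono source to a binaural pair by convolution with an HRIR pair. The function and array names (render_binaural, s1, h1_L, ...) are illustrative and not taken from the patent, and the HRIRs are assumed to come from a measured database.

    import numpy as np

    def render_binaural(mono, hrir_left, hrir_right):
        # Convolve one mono source with a left/right HRIR pair to obtain
        # the single-source left-ear and right-ear signals.
        return np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)

    # Assumed usage with two sources at two azimuths:
    # s1_L, s1_R = render_binaural(s1, h1_L, h1_R)   # source 1, azimuth 1
    # s2_L, s2_R = render_binaural(s2, h2_L, h2_R)   # source 2, azimuth 2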
Further, the mixed training binaural speech signal containing the two sound sources in step (2) is computed as:
sleft(n) = s1,L(n) + s2,L(n)
sright(n) = s1,R(n) + s2,R(n)
where sleft(n), sright(n) are the left- and right-ear signals of the mixed training binaural speech signal containing the two sound sources of different azimuths, s1,L(n), s1,R(n) are the single-source left- and right-ear speech signals for azimuth 1, and s2,L(n), s2,R(n) are the single-source left- and right-ear speech signals for azimuth 2;
the noisy mixed training binaural speech signal is computed as:
xleft(n) = sleft(n) + vL(n)
xright(n) = sright(n) + vR(n)
where xleft(n), xright(n) are the noisy mixed training left- and right-ear speech signals containing the two sound sources of different azimuths, and vL(n), vR(n) are the left- and right-ear noise signals at the chosen signal-to-noise ratio; vL(n) and vR(n) are uncorrelated.
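A minimal sketch of step (2). The patent only states that uncorrelated noise is added at several signal-to-noise ratios; the scaling rule below, which adjusts the noise power relative to the two-source mixture, is an assumption for illustration.

    import numpy as np

    def mix_with_noise(s_left, s_right, v_left, v_right, snr_db):
        # Two-source mixture per ear, then add uncorrelated noise scaled to snr_db.
        def scale(clean, noise):
            noise = noise[:len(clean)]
            p_clean = np.mean(clean ** 2)
            p_noise = np.mean(noise ** 2) + 1e-12
            g = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
            return clean + g * noise
        return scale(s_left, v_left), scale(s_right, v_right)

    # x_left, x_right = mix_with_noise(s1_L + s2_L, s1_R + s2_R, v_L, v_R, snr_db=10)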
Further, the subband filtering in step (3) is computed as:
xL(i,n) = xleft(n)*gi(n)
xR(i,n) = xright(n)*gi(n)
where xleft(n), xright(n) are the noisy mixed training left- and right-ear speech signals containing the two sound sources of different azimuths, xL(i,n), xR(i,n) are the time-domain signals of the i-th subband obtained after the subband filter, and gi(n) is the impulse response of the i-th subband filter;
the framing is performed as follows: with a preset frame length and frame shift, the signals xL(i,n), xR(i,n) are divided into single-frame signals xL(i, k·N/2+m), xR(i, k·N/2+m), where k is the frame index, m = 0, 1, …, N-1 is the sample index within a frame, N is the frame length, and the frame shift is half a frame;
the windowing is performed as:
xL(i,k,m) = wH(m)xL(i, k·N/2+m)
xR(i,k,m) = wH(m)xR(i, k·N/2+m)
where wH(m) is the window function, and xL(i,k,m), xR(i,k,m) are the left- and right-ear speech signals of the i-th subband and k-th frame, used as the time-frequency units of the training binaural speech signal.
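A short sketch of the framing and windowing above for one subband signal. The frame length of 512 and shift of 256 follow the embodiment described below, and the helper name frame_and_window is not from the patent.

    import numpy as np

    def frame_and_window(subband, frame_len=512, hop=256):
        # Split one subband signal into half-overlapping, Hamming-windowed frames;
        # each row of the result is one time-frequency unit x(i, k, m).
        # Assumes len(subband) >= frame_len.
        win = np.hamming(frame_len)
        n_frames = 1 + (len(subband) - frame_len) // hop
        return np.stack([subband[k * hop:k * hop + frame_len] * win
                         for k in range(n_frames)])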
Further, the cross-correlation function CCF in step (4) is calculated as:
CCF(i,k,d) = Σm xL(i,k,m)·xR(i,k,m+d) / sqrt( Σm xL(i,k,m)² · Σm xR(i,k,m)² ),  -L ≤ d ≤ L (sums over m = 0, …, N-1)
where xL(i,k,m), xR(i,k,m) are the time-frequency units of the i-th subband and k-th frame of the training binaural speech signal, CCF(i,k,d) is the cross-correlation function of the corresponding time-frequency unit, d is the delay in samples, L is the maximum delay in samples, and N is the frame length;
the interaural time difference ITD is computed as:
ITD(i,k) = arg max over d of CCF(i,k,d),  -L ≤ d ≤ L
the interaural intensity difference I L D was calculated as:
ILD(i,k) = 10·log10( Σm xL(i,k,m)² / Σm xR(i,k,m)² )
the time-frequency unit space characteristic F (I, k) is composed of CCF, ITD and I L D:
F(i,k)=[CCF(i,k,-L) CCF(i,k,-L+1) ··· CCF(i,k,L) ITD(i,k) ILD(i,k)]
where F(i,k) is the spatial feature vector of the time-frequency unit of the i-th subband and k-th frame of the binaural speech signal.
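A sketch of the spatial-feature computation of step (4) for one time-frequency unit. The lag convention and CCF normalization shown here are assumptions consistent with the formulas above; with max_lag = 16 the vector contains 33 CCF values plus ITD and ILD.

    import numpy as np

    def spatial_features(xl, xr, max_lag=16):
        # xl, xr: windowed left/right frames of one subband (one time-frequency unit).
        n = len(xl)
        norm = np.sqrt(np.sum(xl ** 2) * np.sum(xr ** 2)) + 1e-12

        def ccf(d):  # normalized cross-correlation at integer lag d
            return (np.sum(xl[:n - d] * xr[d:]) if d >= 0
                    else np.sum(xl[-d:] * xr[:n + d])) / norm

        lags = np.arange(-max_lag, max_lag + 1)
        ccf_vec = np.array([ccf(d) for d in lags])
        itd = float(lags[np.argmax(ccf_vec)])                     # best-matching lag
        ild = 10.0 * np.log10((np.sum(xl ** 2) + 1e-12) /
                              (np.sum(xr ** 2) + 1e-12))          # dB energy ratio
        return np.concatenate([ccf_vec, [itd, ild]])              # F(i, k)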
Further, the training process of the LSTM network in step (5) specifically comprises:
(5-1) constructing the LSTM network: the LSTM network consists of an input layer, an LSTM network layer and an output layer; the input layer contains the input at each time step, the output layer contains the output at each time step, and the LSTM network layer contains an LSTM unit for each time step; the number of LSTM time steps is set to 11, i.e. the current frame plus the 5 preceding and 5 following frames, and each LSTM unit in the LSTM network layer is bidirectionally connected to the LSTM units of the preceding and following time steps;
(5-2) randomly initializing the weights of the LSTM network layer;
(5-3) calculating the ideal masking ratio IRM of the current time-frequency unit as the target value of the LSTM network, the ideal masking ratio IRM being computed as follows:
taking the left channel as the separation channel, the single-source left-ear speech signal s1,L(n) of azimuth 1 corresponding to the current time-frequency unit is subband-filtered, framed and windowed to obtain the left-ear speech signal sL(i,k,m) of the i-th subband and k-th frame, and the IRM is computed from sL(i,k,m):
IRM(i,k) = Σm sL(i,k,m)² / Σm xL(i,k,m)²
(5-4) inputting the spatial features of each time-frequency unit, together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, into the input layer of the LSTM network, i.e. [F(i,k-5), F(i,k-4), …, F(i,k), …, F(i,k+4), F(i,k+5)];
(5-5) obtaining the LSTM network output, namely the estimated masking ratio ERM, by forward propagation, and computing the loss function from the difference between the ERM and the ideal masking ratio IRM:
J = E[ ||ERM(i,k) - IRM(i,k)||₂² ]
where E[·] denotes expectation and ||·||₂ denotes the L2 norm;
(5-6) computing the partial derivatives of the loss function J with respect to the network weights by back propagation, and updating the network weights;
(5-7) if the current iteration number is less than the preset total number of iterations, returning to step (5-4) and continuing with the spatial features of the next batch of time-frequency units; once the preset number of iterations is reached, the iteration ends and the LSTM network training is finished.
Compared with the prior art, the method extracts the cross-correlation function, interaural time difference and interaural intensity difference of the training binaural speech signal in each time-frequency unit, forms spatial features from them as training samples, and trains the LSTM network to obtain an LSTM separator. During testing, the multidimensional feature parameters of the test binaural speech signal are extracted, and the trained LSTM separator estimates the masking ratio ERM for each frame of the binaural speech signal. Experimental results under different acoustic environments show that the binaural speech separation method based on the long short-term memory network LSTM clearly improves the separation performance under high noise and strong reverberation and has good robustness.
Drawings
FIG. 1 is a schematic flow diagram of one embodiment of the present invention;
FIG. 2 is a diagram of CCF functions for subbands of a speech signal;
fig. 3 is a schematic diagram of the LSTM network structure in the present invention.
Detailed Description
As shown in fig. 1, the binaural speech separation method based on the LSTM network provided by the present embodiment comprises the following steps:
Step one: convolve two different monaural speech signals from the training speech with head-related impulse response functions (HRIR) of different azimuths to generate two training single-source binaural speech signals of different azimuths; each source is computed as:
s1,L(n) = s1(n)*h1,L,  s2,L(n) = s2(n)*h2,L
s1,R(n) = s1(n)*h1,R,  s2,R(n) = s2(n)*h2,R
where s1(n), s2(n) are the two different monaural source speech signals; s1,L(n), s1,R(n) are the single-source left- and right-ear speech signals for azimuth 1; h1,L, h1,R are the left-ear and right-ear HRIRs for azimuth 1; s2,L(n), s2,R(n) are the single-source left- and right-ear speech signals for azimuth 2; h2,L, h2,R are the left-ear and right-ear HRIRs for azimuth 2; * denotes convolution; and n is the sample index.
The azimuth angles θ of the head-related impulse responses (HRIR) lie in the range [-90°, 90°] with a spacing of 5°, giving 37 azimuths in total; each azimuth θ corresponds to a pair of head-related impulse responses, a left-ear HRIR and a right-ear HRIR.
Step two: mix the training single-source binaural speech signals of the two different azimuths obtained in step one to obtain a mixed training binaural speech signal containing the two sound sources, and add noise at different signal-to-noise ratios to obtain noisy mixed training binaural speech signals containing two sound sources of different azimuths under different acoustic environments. The calculations are as follows:
the method for calculating the mixed training binaural voice signal containing two sound sources comprises the following steps:
sleft(n)=s1,L(n)+s2,L(n)
sright(n)=s1,R(n)+s2,R(n)
where sleft(n), sright(n) are the left- and right-ear signals of the mixed training binaural speech signal containing the two sound sources of different azimuths.
The noisy mixed training binaural speech signal is computed as:
xleft(n)=sleft(n)+vL(n)
xright(n)=sright(n)+vR(n)
where xleft(n), xright(n) are the noisy mixed training left- and right-ear speech signals containing the two sound sources of different azimuths, and vL(n), vR(n) are the left- and right-ear noise signals at the chosen signal-to-noise ratio; vL(n) and vR(n) are uncorrelated.
Binaural speech signals in noisy environments are generated so that the LSTM network can learn the distribution of the spatial feature parameters of binaural speech under noise. The signal-to-noise ratio is set to 0, 5, 10, 15 and 20 dB, yielding binaural speech signals for the two different azimuths under different acoustic environments; for each azimuth pair, binaural speech signals at SNRs of 0, 5, 10, 15 and 20 dB without reverberation are obtained.
Step three: perform subband filtering, framing and windowing on the noisy mixed training binaural speech signal obtained in step two to obtain the framed training binaural speech signal of each subband.
The subband filtering may use a gammatone filter bank; the time-domain impulse response of a gammatone filter is:
gi(n) = A·n³·e^(-2π·bi·n)·cos(2π·fi·n)·u(n)
where i is the filter index, A is the filter gain, fi is the center frequency of the filter, bi is the attenuation factor of the filter and determines how fast the impulse response decays, and u(n) is the unit step function. The gammatone filter bank used in this embodiment contains 33 filters with center frequencies in the range [50 Hz, 8000 Hz].
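A sketch of one gammatone impulse response following the formula above. The bandwidth values use the common Glasberg-Moore ERB rule, which the patent does not specify, and the gain A is set to 1; both are assumptions for illustration.

    import numpy as np

    def gammatone_ir(fc, fs=16000, n_taps=1024, order=4):
        # Discrete-time gammatone impulse response g_i(n) with center frequency fc (Hz).
        t = np.arange(n_taps) / fs
        erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)      # assumed ERB bandwidth (Hz)
        b = 1.019 * erb                               # assumed decay factor
        return t ** (order - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t)

    # x_L_i = np.convolve(x_left, gammatone_ir(500.0), mode='same')  # one subband signal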
The calculation formula of the subband filtering is as follows:
xL(i,n)=xleft(n)*gi(n)
xR(i,n)=xright(n)*gi(n)
where xL(i,n), xR(i,n) are the left-ear and right-ear speech signals of the i-th filtered subband, with 1 ≤ i ≤ 33. The speech signal of each channel is subband-filtered to obtain 33 subband speech signals.
The subband filter of the present invention is not limited to the filter structure of this embodiment; any filter that realizes subband filtering of the speech signal may be used.
Framing: with a speech sampling rate of 16 kHz, the preset frame length is 512 samples and the frame shift is 256 samples; the left- and right-ear speech signals of each subband are divided into multi-frame signals, the framed left- and right-ear speech signals being xL(i, k·N/2+m) and xR(i, k·N/2+m), respectively.
The windowing formula is:
xL(i,k,m)=wH(m)xL(i,k·N/2+m)
xR(i,k,m)=wH(m)xR(i,k·N/2+m)
where xL(i,k,m), xR(i,k,m) are the left- and right-ear speech signals of the i-th subband and k-th frame, with 1 ≤ i ≤ 33, and N is the frame length, set to 512.
The window function is the Hamming window:
wH(m) = 0.54 - 0.46·cos(2πm/(N-1)),  m = 0, 1, …, N-1
and step four, extracting the spatial characteristics of the binaural speech signal of each frame, namely the cross-correlation function CCF, the interaural time difference ITD and the interaural intensity difference I L D, of the training binaural speech signal after each sub-band is framed obtained in the step three.
The cross-correlation function CCF is calculated as:
CCF(i,k,d) = Σm xL(i,k,m)·xR(i,k,m+d) / sqrt( Σm xL(i,k,m)² · Σm xR(i,k,m)² ),  -L ≤ d ≤ L (sums over m = 0, …, N-1)
where CCF(i,k,d) is the cross-correlation function of the binaural speech signal of the i-th subband and k-th frame, d is the delay in samples, and L is the maximum delay in samples.
Considering the speed of sound and the size of the human head, the cross-correlation delay is generally limited to [-1 ms, 1 ms]; since the speech sampling rate in the present invention is 16 kHz, this embodiment takes L = 16, so 33 CCF points are computed for each frame of the training binaural speech signal.
The calculation formula of the interaural time difference ITD is as follows:
ITD(i,k) = arg max over d of CCF(i,k,d),  -L ≤ d ≤ L
the interaural intensity difference I L D is calculated as:
ILD(i,k) = 10·log10( Σm xL(i,k,m)² / Σm xR(i,k,m)² )
the spatial characteristics corresponding to each time-frequency unit are the combination form of the parameters:
F(i,k) = [CCF(i,k,-L) CCF(i,k,-L+1) ··· CCF(i,k,L) ITD(i,k) ILD(i,k)]
where F(i,k) is the spatial feature vector of the time-frequency unit of the i-th subband and k-th frame of the binaural speech signal.
Fig. 2 shows the CCF of time-frequency units of a speech signal; in the low-frequency subbands the CCF has a relatively simple relationship with the delay, while in the high-frequency subbands it exhibits multiple peaks.
Step five: take the spatial feature parameters of each time-frequency unit obtained in step four, together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, as the input features of the long short-term memory network LSTM, take the ideal masking ratio IRM of the time-frequency unit as the target value of the LSTM, and train the LSTM with forward and back propagation; the input feature format is [F(i,k-5), F(i,k-4), …, F(i,k), …, F(i,k+5)];
the L STM network structure of this embodiment is given below, indeed, the structure of the L STM network of the present invention is not limited to the network structure of this embodiment.
As shown in fig. 3, the LSTM network used in this embodiment comprises an input layer, an LSTM network layer and an output layer. The input layer contains the input at each time step and the output layer contains the output at each time step; the LSTM network layer contains an LSTM unit for each time step, and each LSTM unit is bidirectionally connected to the LSTM units of the preceding and following time steps. The input of the input layer is a 37 × 11-dimensional sample, where 37 is the number of spatial features of a time-frequency unit and 11 is the number of LSTM time steps (the 5 preceding frames, the 5 following frames and the current frame). Each LSTM unit of the LSTM network layer contains 256 neurons, and the output layer contains 20 neurons.
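A minimal PyTorch sketch of the network described above. The sizes (37 input features, 11 time steps, 256 hidden units, 20 outputs) follow this embodiment; taking the centre time step and applying a sigmoid to the output are assumptions, since the patent does not state how the per-time-step outputs are combined.

    import torch
    import torch.nn as nn

    class MaskBLSTM(nn.Module):
        def __init__(self, feat_dim=37, hidden=256, out_dim=20):
            super().__init__()
            self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, out_dim)   # 20-unit output layer

        def forward(self, x):                   # x: (batch, 11, 37) feature sequences
            h, _ = self.blstm(x)                # (batch, 11, 2 * hidden)
            centre = h[:, x.shape[1] // 2, :]   # current frame = centre of the 11 steps
            return torch.sigmoid(self.out(centre))  # estimated mask values in [0, 1]

    # model = MaskBLSTM()
    # erm = model(torch.randn(8, 11, 37))       # (8, 20) mask estimates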
In this embodiment, based on simulation experiments, the learning rate is set to 0.0001 and the total number of iterations to 400: a learning rate of 0.0001 avoids excessive oscillation of the error function, and after 400 iterations the network model is close to convergence.
With these parameters set, step five specifically comprises the following sub-steps (a sketch of the resulting training loop is given after the sub-steps):
(5-1) randomly initialize the weights of the LSTM network layer;
(5-2) calculate the ideal masking ratio IRM of the current time-frequency unit as the target value of the LSTM network, the ideal masking ratio IRM being computed as follows:
taking the left channel as the separation channel, the single-source left-ear speech signal s1,L(n) of azimuth 1 corresponding to the current time-frequency unit is subband-filtered, framed and windowed to obtain the left-ear speech signal sL(i,k,m) of the i-th subband and k-th frame, and the IRM is computed from sL(i,k,m):
IRM(i,k) = Σm sL(i,k,m)² / Σm xL(i,k,m)²
(5-3) input the spatial features of each time-frequency unit, together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, into the input layer of the LSTM network, i.e. [F(i,k-5), F(i,k-4), …, F(i,k), …, F(i,k+4), F(i,k+5)];
(5-4) obtain the LSTM network output, namely the estimated masking ratio ERM, by forward propagation, and compute the loss function from the difference between the ERM and the ideal masking ratio IRM:
J = E[ ||ERM(i,k) - IRM(i,k)||₂² ]
where E[·] denotes expectation and ||·||₂ denotes the L2 norm.
(5-5) compute the partial derivatives of the loss function J with respect to the network weights by back propagation, and update the network weights;
(5-6) if the current iteration number is less than the preset total number of iterations, return to step (5-3) and continue with the spatial features of the next batch of time-frequency units; once the preset number of iterations is reached, the iteration ends and the LSTM network training is finished.
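A hedged sketch of sub-steps (5-3) to (5-6) as a training loop. The learning rate of 0.0001 and 400 iterations follow the embodiment; the optimizer (Adam), the mini-batching via a data loader, and the MSE form of the loss are assumptions for illustration.

    import torch

    def train_lstm(model, loader, n_iter=400, lr=1e-4):
        # loader yields (features, irm): features (batch, 11, 37), irm (batch, 20).
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        mse = torch.nn.MSELoss()
        for _ in range(n_iter):
            for feats, irm in loader:
                erm = model(feats)          # forward pass -> estimated masking ratio
                loss = mse(erm, irm)        # squared error between ERM and IRM
                opt.zero_grad()
                loss.backward()             # back propagation
                opt.step()                  # weight update
        return model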
Step six: process the mixed test binaural speech signals containing two sound sources of different azimuths under different acoustic environments according to steps three and four to obtain the spatial features of each time-frequency unit of the test binaural speech signal.
Step seven: input the spatial features of each time-frequency unit obtained in step six, together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, into the trained LSTM network to obtain the estimated masking ratio ERM of each time-frequency unit.
Step eight: separate the mixed test binaural speech signal according to the estimated masking ratio ERM obtained in step seven to obtain the time-domain speech signal corresponding to each single sound source.
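A sketch of step eight under the assumption that each windowed time-frequency unit is weighted by its estimated mask and the frames are recombined by overlap-add, with the subband signals then summed to give the separated time-domain signal. The patent does not spell out the resynthesis, so this is one plausible reconstruction rather than the patent's exact procedure.

    import numpy as np

    def resynthesize_subband(frames, mask, hop=256):
        # frames: (n_frames, frame_len) windowed units of one subband;
        # mask:   (n_frames,) estimated masking ratios for that subband.
        n_frames, frame_len = frames.shape
        out = np.zeros(hop * (n_frames - 1) + frame_len)
        for k in range(n_frames):
            out[k * hop:k * hop + frame_len] += mask[k] * frames[k]
        return out

    # separated = sum(resynthesize_subband(units[i], erm[:, i]) for i in range(n_subbands))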
The three other separation algorithms used for comparison are a speech separation algorithm based on the degenerate unmixing estimation technique (DUET), a DNN speech separation algorithm based on the ideal binary mask (IBM), and a DNN speech separation algorithm based on the ideal ratio mask (IRM); the proposed method is denoted IRM-LSTM, and the first two comparison methods belong to the traditional separation methods.
The PESQ values for the four methods are shown in table 1:
TABLE 1 comparison of PESQ values for the four methods
SNR(dB) DUET IBM-DNN IRM-DNN IRM-LSTM
0 1.403 1.467 1.946 1.874
5 1.57 1.656 2.121 2.140
10 1.754 1.834 2.258 2.355
15 1.923 1.982 2.386 2.528
20 2.102 2.119 2.510 2.654
Noiseless 2.628 2.355 2.765 2.795
According to the results in Table 1, the PESQ of the algorithm based on the LSTM network is much higher than that of the traditional algorithms and, except at a signal-to-noise ratio of 0 dB, higher than that of the algorithm using the DNN network.
The algorithm of this patent is also tested at signal-to-noise ratios of -3, 3, 6, 9 and 12 dB, and the resulting indices are shown in Table 2.
TABLE 2 PESQ values over all tested signal-to-noise ratios
SNR(dB) PESQ
-3 1.867
0 1.874
3 2.161
5 2.140
6 2.322
9 2.452
10 2.355
12 2.552
15 2.528
20 2.654
As can be seen from the table, at signal-to-noise ratios not used in training the PESQ values remain good and are similar to those at the neighbouring trained signal-to-noise ratios, so the algorithm based on the LSTM network generalizes well across noise levels and has good robustness.
The PESQ index is also tested under reverberant conditions, using reverberation times of 200 ms and 600 ms; the results are shown in Tables 3 and 4:
TABLE 3 Comparison of PESQ values of the four methods in a 200 ms reverberation environment
SNR(dB) DUET IBM-DNN IRM-DNN IRM-LSTM
0 1.335 1.413 1.717 1.710
5 1.468 1.593 1.971 2.004
10 1.597 1.758 2.139 2.151
15 1.678 1.865 2.262 2.359
20 1.734 1.932 2.345 2.380
TABLE 4 Comparison of PESQ values of the four methods in a 600 ms reverberation environment
SNR(dB) DUET IBM-DNN IRM-DNN IRM-LSTM
0 1.322 1.410 1.664 1.645
5 1.429 1.570 1.913 2.024
10 1.524 1.713 2.069 2.120
15 1.579 1.800 2.180 2.253
20 1.617 1.857 2.252 2.298
According to Tables 3 and 4, in reverberant environments the PESQ of the binaural speech separation algorithm based on the LSTM network is higher than that of the DNN-based algorithm and significantly higher than those of the other two traditional algorithms, so the algorithm generalizes well to reverberant environments and has good robustness.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (5)

1. A binaural speech separation method based on a long short-term memory network LSTM, characterized by comprising the following steps:
(1) convolving two different training monaural speech signals with head-related impulse response functions (HRIR) of two different azimuths to generate two training single-source binaural speech signals with different azimuths;
(2) mixing the two training single-source binaural speech signals of different azimuths to obtain a mixed training binaural speech signal containing the two sound sources, and adding noise at different signal-to-noise ratios to obtain noisy mixed training binaural speech signals containing two sound sources of different azimuths under different acoustic environments;
(3) performing subband filtering, framing and windowing on the noisy mixed training binaural speech signal obtained in step (2) to obtain the framed training binaural speech signal of each subband, i.e. the time-frequency units of the training binaural speech signal;
(4) calculating the interaural cross-correlation function CCF, the interaural time difference ITD and the interaural intensity difference ILD of each time-frequency unit of the training binaural speech signal obtained in step (3) as the spatial features of each time-frequency unit;
(5) taking the spatial features of each time-frequency unit obtained in step (4), together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, as the input of a long short-term memory network LSTM, and taking the ideal masking ratio IRM of the time-frequency unit as the target value of the LSTM network to train the LSTM network, wherein the training process of the LSTM network specifically comprises the following steps:
(5-1) constructing the LSTM network: the LSTM network consists of an input layer, an LSTM network layer and an output layer; the input layer contains the input at each time step, the output layer contains the output at each time step, and the LSTM network layer contains an LSTM unit for each time step; the number of LSTM time steps is set to 11, i.e. the current frame plus the 5 preceding and 5 following frames, and each LSTM unit in the LSTM network layer is bidirectionally connected to the LSTM units of the preceding and following time steps;
(5-2) randomly initializing the weights of the LSTM network layer;
(5-3) calculating the ideal masking ratio IRM of the current time-frequency unit as the target value of the LSTM network, the ideal masking ratio IRM being computed as follows:
taking the left channel as the separation channel, the single-source left-ear speech signal s1,L(n) of azimuth 1 corresponding to the current time-frequency unit is subband-filtered, framed and windowed to obtain the left-ear speech signal sL(i,k,m) of the i-th subband and k-th frame, and the IRM is computed from sL(i,k,m):
IRM(i,k) = Σm sL(i,k,m)² / Σm xL(i,k,m)²
(5-4) inputting the spatial features of each time-frequency unit, together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, into the input layer of the LSTM network, i.e. [F(i,k-5), F(i,k-4), …, F(i,k), …, F(i,k+4), F(i,k+5)];
(5-5) obtaining the LSTM network output, namely the estimated masking ratio ERM, by forward propagation, and computing the loss function from the difference between the ERM and the ideal masking ratio IRM:
J = E[ ||ERM(i,k) - IRM(i,k)||₂² ]
where E[·] denotes expectation and ||·||₂ denotes the L2 norm;
(5-6) computing the partial derivatives of the loss function J with respect to the network weights by back propagation, and updating the network weights;
(5-7) if the current iteration number is less than the preset total number of iterations, returning to step (5-4) and continuing with the spatial features of the next batch of time-frequency units; once the preset number of iterations is reached, the iteration ends and the LSTM network training is finished;
(6) processing a mixed test binaural speech signal containing two sound sources of different azimuths under different acoustic environments according to steps (3) and (4) to obtain the spatial features of each time-frequency unit of the test binaural speech signal;
(7) inputting the spatial features of each time-frequency unit obtained in step (6), together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, into the trained LSTM network to obtain the estimated masking ratio ERM of each time-frequency unit;
(8) separating the mixed test binaural speech signal according to the estimated masking ratio ERM obtained in step (7) to obtain the time-domain speech signal corresponding to each single sound source.
2. The binaural speech separation method based on the long short-term memory network LSTM according to claim 1, wherein the two single-source binaural speech signals of different azimuths in step (1) are computed as:
s1,L(n) = s1(n)*h1,L,  s2,L(n) = s2(n)*h2,L
s1,R(n) = s1(n)*h1,R,  s2,R(n) = s2(n)*h2,R
where s1(n), s2(n) are the two different monaural source speech signals; s1,L(n), s1,R(n) are the single-source left- and right-ear speech signals for azimuth 1; h1,L, h1,R are the left-ear and right-ear HRIRs for azimuth 1; s2,L(n), s2,R(n) are the single-source left- and right-ear speech signals for azimuth 2; h2,L, h2,R are the left-ear and right-ear HRIRs for azimuth 2; * denotes convolution; and n is the sample index.
3. The binaural speech separation method based on the long short-term memory network LSTM according to claim 1, wherein the mixed training binaural speech signal containing the two sound sources in step (2) is computed as:
sleft(n)=s1,L(n)+s2,L(n)
sright(n)=s1,R(n)+s2,R(n)
where sleft(n), sright(n) are the left- and right-ear signals of the mixed training binaural speech signal containing the two sound sources of different azimuths, s1,L(n), s1,R(n) are the single-source left- and right-ear speech signals for azimuth 1, and s2,L(n), s2,R(n) are the single-source left- and right-ear speech signals for azimuth 2;
the noisy mixed training binaural speech signal is computed as:
xleft(n)=sleft(n)+vL(n)
xright(n)=sright(n)+vR(n)
where xleft(n), xright(n) are the noisy mixed training left- and right-ear speech signals containing the two sound sources of different azimuths, and vL(n), vR(n) are the left- and right-ear noise signals at the chosen signal-to-noise ratio; vL(n) and vR(n) are uncorrelated.
4. The binaural speech separation method based on the long short-term memory network LSTM according to claim 1, wherein the subband filtering in step (3) is computed as:
xL(i,n)=xleft(n)*gi(n)
xR(i,n)=xright(n)*gi(n)
where xleft(n), xright(n) are the noisy mixed training left- and right-ear speech signals containing the two sound sources of different azimuths, xL(i,n), xR(i,n) are the time-domain signals of the i-th subband obtained after the subband filter, and gi(n) is the impulse response of the i-th subband filter;
the framing is performed as follows: with a preset frame length and frame shift, the signals xL(i,n), xR(i,n) are divided into single-frame signals xL(i, k·N/2+m), xR(i, k·N/2+m), where k is the frame index, m = 0, 1, …, N-1 is the sample index within a frame, N is the frame length, and the frame shift is half a frame;
the windowing is performed as:
xL(i,k,m)=wH(m)xL(i,k·N/2+m)
xR(i,k,m)=wH(m)xR(i,k·N/2+m)
where wH(m) is the window function, and xL(i,k,m), xR(i,k,m) are the left- and right-ear speech signals of the i-th subband and k-th frame, used as the time-frequency units of the training binaural speech signal.
5. The binaural speech separation method based on the long short-term memory network LSTM according to claim 1, wherein the cross-correlation function CCF in step (4) is calculated as:
CCF(i,k,d) = Σm xL(i,k,m)·xR(i,k,m+d) / sqrt( Σm xL(i,k,m)² · Σm xR(i,k,m)² ),  -L ≤ d ≤ L (sums over m = 0, …, N-1)
where xL(i,k,m), xR(i,k,m) are the time-frequency units of the i-th subband and k-th frame of the training binaural speech signal, CCF(i,k,d) is the cross-correlation function of the corresponding time-frequency unit, d is the delay in samples, L is the maximum delay in samples, and N is the frame length;
the interaural time difference ITD is computed as:
ITD(i,k) = arg max over d of CCF(i,k,d),  -L ≤ d ≤ L
the interaural intensity difference I L D was calculated as:
ILD(i,k) = 10·log10( Σm xL(i,k,m)² / Σm xR(i,k,m)² )
the time-frequency unit space characteristic F (I, k) is composed of CCF, ITD and I L D:
f (I, k) — [ CCF (I, k, -L) CCF (I, k, -L +1) · · CCF (I, k, L) ITD (I, k) I L D (I, k) ] where F (I, k) represents the spatial feature vector corresponding to the I-th subband, k-th frame binaural speech signal time-frequency unit.
CN201910930176.XA 2019-09-29 2019-09-29 Binaural speech separation method based on long short-term memory network LSTM Active CN110728989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910930176.XA CN110728989B (en) 2019-09-29 2019-09-29 Binaural speech separation method based on long short-term memory network LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910930176.XA CN110728989B (en) 2019-09-29 2019-09-29 Binaural speech separation method based on long short-term memory network LSTM

Publications (2)

Publication Number Publication Date
CN110728989A CN110728989A (en) 2020-01-24
CN110728989B true CN110728989B (en) 2020-07-14

Family

ID=69219570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910930176.XA Active CN110728989B (en) 2019-09-29 2019-09-29 Binaural speech separation method based on long short-term memory network LSTM

Country Status (1)

Country Link
CN (1) CN110728989B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111707990B (en) * 2020-08-19 2021-05-14 东南大学 Binaural sound source positioning method based on dense convolutional network
CN112216301B (en) * 2020-11-17 2022-04-29 东南大学 Deep clustering voice separation method based on logarithmic magnitude spectrum and interaural phase difference
CN113079452B (en) * 2021-03-30 2022-11-15 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, audio direction information generating method, electronic device, and medium
CN113327624B (en) * 2021-05-25 2023-06-23 西北工业大学 Method for intelligent monitoring of environmental noise by adopting end-to-end time domain sound source separation system
CN113936681B (en) * 2021-10-13 2024-04-09 东南大学 Speech enhancement method based on mask mapping and mixed cavity convolution network
CN113823309B (en) * 2021-11-22 2022-02-08 成都启英泰伦科技有限公司 Noise reduction model construction and noise reduction processing method
CN114446316B (en) * 2022-01-27 2024-03-12 腾讯科技(深圳)有限公司 Audio separation method, training method, device and equipment of audio separation model
CN115862676A (en) * 2023-02-22 2023-03-28 南方电网数字电网研究院有限公司 Voice superposition detection method and device based on deep learning and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108091345A (en) * 2017-12-27 2018-05-29 东南大学 A kind of ears speech separating method based on support vector machines
CN108899047A (en) * 2018-08-20 2018-11-27 百度在线网络技术(北京)有限公司 The masking threshold estimation method, apparatus and storage medium of audio signal
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sound source direction estimation method based on time-frequency masking and deep neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160111107A1 (en) * 2014-10-21 2016-04-21 Mitsubishi Electric Research Laboratories, Inc. Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108091345A (en) * 2017-12-27 2018-05-29 东南大学 A kind of ears speech separating method based on support vector machines
CN108899047A (en) * 2018-08-20 2018-11-27 百度在线网络技术(北京)有限公司 The masking threshold estimation method, apparatus and storage medium of audio signal
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sound source direction estimation method based on time-frequency masking and deep neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Weninger F., Erdogan H., Watanabe S. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. Proceedings of the 12th International Conference on Latent Variable Analysis and Signal Separation, 2015, pp. 91-99 *
Li Lujun. Research on speech enhancement technology based on deep learning. Master's thesis, PLA Information Engineering University, June 2018, full text *

Also Published As

Publication number Publication date
CN110728989A (en) 2020-01-24

Similar Documents

Publication Publication Date Title
CN110728989B (en) Binaural speech separation method based on long short-term memory network LSTM
CN109164415B (en) Binaural sound source positioning method based on convolutional neural network
Vecchiotti et al. End-to-end binaural sound localisation from the raw waveform
CN110517705B (en) Binaural sound source positioning method and system based on deep neural network and convolutional neural network
CN109410976B (en) Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid
CN107942290B (en) Binaural sound sources localization method based on BP neural network
CN112151059A (en) Microphone array-oriented channel attention weighted speech enhancement method
Mosayyebpour et al. Single-microphone LP residual skewness-based inverse filtering of the room impulse response
Ren et al. A Causal U-Net Based Neural Beamforming Network for Real-Time Multi-Channel Speech Enhancement.
Wang et al. Mask weighted STFT ratios for relative transfer function estimation and its application to robust ASR
CN108986832A (en) Ears speech dereverberation method and device based on voice probability of occurrence and consistency
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
Barros et al. Estimation of speech embedded in a reverberant and noisy environment by independent component analysis and wavelets
CN111948609B (en) Binaural sound source positioning method based on Soft-argmax regression device
CN112201276B (en) TC-ResNet network-based microphone array voice separation method
CN111707990B (en) Binaural sound source positioning method based on dense convolutional network
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
CN114613384B (en) Deep learning-based multi-input voice signal beam forming information complementation method
CN112731291B (en) Binaural sound source localization method and system for collaborative two-channel time-frequency mask estimation task learning
Wu et al. Microphone array speech separation algorithm based on dnn
Youssef et al. From monaural to binaural speaker recognition for humanoid robots
CN114189781A (en) Noise reduction method and system for double-microphone neural network noise reduction earphone
Youssef et al. Binaural speaker recognition for humanoid robots
Shen et al. Multichannel Speech Enhancement in Vehicle Environment Based on Interchannel Attention Mechanism
CN112216301B (en) Deep clustering voice separation method based on logarithmic magnitude spectrum and interaural phase difference

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant