CN110728989B - Binaural speech separation method based on long short-term memory network LSTM - Google Patents

Binaural speech separation method based on long short-term memory network LSTM

Info

Publication number
CN110728989B
CN110728989B (application CN201910930176.XA)
Authority
CN
China
Prior art keywords
time
stm
binaural
training
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910930176.XA
Other languages
Chinese (zh)
Other versions
CN110728989A (en)
Inventor
周琳
陆思源
钟秋月
庄琰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201910930176.XA priority Critical patent/CN110728989B/en
Publication of CN110728989A publication Critical patent/CN110728989A/en
Application granted granted Critical
Publication of CN110728989B publication Critical patent/CN110728989B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a binaural speech separation method based on a long short-term memory (LSTM) network. The interaural time difference, interaural intensity difference and interaural cross-correlation function of each time-frequency unit of the training binaural speech signal are extracted as spatial features for separation. The spatial features of the current frame, together with those of the time-frequency units of the 5 preceding and 5 following frames in the same subband, are used as input parameters to train a bidirectional LSTM network and obtain an LSTM-based separation model. In the test stage, the spatial features of the current frame and of the 5 preceding and following frames of time-frequency units of the test binaural speech signal are fed as input parameters to the trained bidirectional LSTM network, which estimates the masking value of the target speech for the current time-frequency unit, and speech separation is then carried out according to this masking value.

Description

Binaural speech separation method based on long short-term memory network LSTM
Technical Field
The invention relates to a speech separation algorithm, in particular to a binaural speech separation method based on a long short-term memory network (LSTM).
Background
Speech separation is an important research direction in speech signal processing and has a wide range of applications. In a teleconference system, speech separation can extract the sound sources of interest from several speakers and thus improve the efficiency of the conference. As a preprocessing step for speech recognition, it can improve speech quality and help raise recognition accuracy. Applied in a hearing-aid device, it can make the target sound source more prominent for a hearing-impaired listener and provide effective speech information.
Speech separation techniques draw on a wide variety of fields, including but not limited to acoustics, digital signal processing, information and communication, and auditory psychology and physiology. Binaural speech separation exploits the differences between the two ear signals to analyse and estimate the sound source direction; current separation algorithms can be divided into the following categories according to the separation parameters they use:
1. Separation based on interaural differences
Lord Rayleigh, under the assumption of a spherical human head, first proposed the theory of separation based on interaural cue differences in 1907: because the sound source lies at different positions relative to the two ears, the speech signals received at the two ears differ in time and in intensity, namely the interaural time difference (ITD) and the interaural intensity difference (IID), and these cues are the basis of binaural speech separation.
2. Separation based on head-related transfer function
ITD information can distinguish sound sources to the left and right, but it cannot decide whether a sound comes from the front or from the rear, nor can it resolve elevation. Methods based on the head-related transfer function (HRTF) are no longer limited to horizontal, frontal speech: an inverse filter is designed from an HRTF database, and cross-correlation values are computed from the inverse-filtered binaural signals to separate the speech. This approach solves three-dimensional speech separation, but its computational complexity is very high and HRTFs are strongly individual; with a different listener or a different surrounding environment (i.e. different noise or reverberation), the actual transfer function no longer matches the one used in the separation model, which degrades the separation performance.
3. Separation based on deep neural networks (DNN)
This class of methods applies the ideal masking ratio (IRM) to the multi-speaker separation problem and models by azimuth, extracting improved IRM values for sound sources at 19 frontal azimuths and for environmental noise as the training targets of a neural network. In the training stage, the binaural speech signals are preprocessed: the mixed speech passes through a gammatone filter bank and is framed and windowed to obtain the time-frequency units, whose spatial features are extracted and fed into a DNN for training. In the test stage, the extracted time-frequency spatial features of the mixed speech are sent to the trained DNN, and the DNN output is the estimated masking ratio (ERM). This separation method is highly robust and clearly improves on traditional algorithms in various speech evaluation indices, but it does not exploit the temporal correlation of the speech signal feature parameters.
Disclosure of Invention
The invention aims to solve the problem that the performance of existing binaural speech separation algorithms drops sharply under high noise and strong reverberation, and provides a binaural speech separation method based on a long short-term memory network (LSTM). The method trains on feature parameters collected under multiple acoustic environments with an LSTM network, and simulation results show that the separation performance of the binaural speech separation algorithm based on the long short-term memory network LSTM is significantly improved.
The binaural speech separation method based on the long short-term memory network LSTM comprises the following steps:
(1) convolving two different training monaural speech signals with head-related impulse response functions (HRIR) of two different azimuths to generate two training single-source binaural speech signals with different azimuths;
(2) mixing the two training single-source binaural speech signals of different azimuths to obtain a mixed training binaural speech signal containing the two sound sources, and adding noise at different signal-to-noise ratios to obtain noisy mixed training binaural speech signals containing two sound sources of different azimuths under different acoustic environments;
(3) performing subband filtering, framing and windowing on the noisy mixed training binaural speech signal obtained in step (2) to obtain the framed training binaural speech signal of each subband, i.e. the time-frequency units of the training binaural speech signal;
(4) calculating the interaural cross-correlation function CCF, the interaural time difference ITD and the interaural intensity difference ILD of each time-frequency unit of the training binaural speech signal obtained in step (3) as the spatial features of each time-frequency unit;
(5) taking the spatial feature parameters of each time-frequency unit obtained in step (4), together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, as the input of a long short-term memory network LSTM, taking the ideal masking ratio IRM of the time-frequency unit as the target value of the LSTM, and training the LSTM network;
(6) processing a mixed test binaural speech signal containing two sound sources of different azimuths under different acoustic environments according to steps (3) and (4) to obtain the spatial features of each time-frequency unit of the test binaural speech signal;
(7) inputting the spatial features of each time-frequency unit obtained in step (6), together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, into the trained LSTM network to obtain the estimated masking ratio ERM of each time-frequency unit;
(8) separating the mixed test binaural speech signal according to the estimated masking ratio ERM obtained in step (7) to obtain the time-domain speech signal corresponding to each single sound source.
Further, the two single-source binaural speech signals of different azimuths in step (1) are computed as:
s1,L(n) = s1(n)*h1,L,  s2,L(n) = s2(n)*h2,L
s1,R(n) = s1(n)*h1,R,  s2,R(n) = s2(n)*h2,R
where s1(n), s2(n) are the two different monaural source speech signals; s1,L(n), s1,R(n) are the single-source left- and right-ear speech signals for azimuth 1; h1,L, h1,R are the left-ear and right-ear HRIRs for azimuth 1; s2,L(n), s2,R(n) are the single-source left- and right-ear speech signals for azimuth 2; h2,L, h2,R are the left-ear and right-ear HRIRs for azimuth 2; * denotes convolution; and n is the sample index.
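As an illustration of step (1), the following Python sketch renders a mono source to a binaural pair by convolution with an HRIR pair. The function and array names (render_binaural, s1, h1_L, ...) are illustrative and not taken from the patent, and the HRIRs are assumed to come from a measured database.

    import numpy as np

    def render_binaural(mono, hrir_left, hrir_right):
        # Convolve one mono source with a left/right HRIR pair to obtain
        # the single-source left-ear and right-ear signals.
        return np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)

    # Assumed usage with two sources at two azimuths:
    # s1_L, s1_R = render_binaural(s1, h1_L, h1_R)   # source 1, azimuth 1
    # s2_L, s2_R = render_binaural(s2, h2_L, h2_R)   # source 2, azimuth 2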
Further, the mixed training binaural speech signal containing the two sound sources in step (2) is computed as:
sleft(n) = s1,L(n) + s2,L(n)
sright(n) = s1,R(n) + s2,R(n)
where sleft(n), sright(n) are the left- and right-ear signals of the mixed training binaural speech signal containing the two sound sources of different azimuths, s1,L(n), s1,R(n) are the single-source left- and right-ear speech signals for azimuth 1, and s2,L(n), s2,R(n) are the single-source left- and right-ear speech signals for azimuth 2;
the noisy mixed training binaural speech signal is computed as:
xleft(n) = sleft(n) + vL(n)
xright(n) = sright(n) + vR(n)
where xleft(n), xright(n) are the noisy mixed training left- and right-ear speech signals containing the two sound sources of different azimuths, and vL(n), vR(n) are the left- and right-ear noise signals at the chosen signal-to-noise ratio; vL(n) and vR(n) are uncorrelated.
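A minimal sketch of step (2). The patent only states that uncorrelated noise is added at several signal-to-noise ratios; the scaling rule below, which adjusts the noise power relative to the two-source mixture, is an assumption for illustration.

    import numpy as np

    def mix_with_noise(s_left, s_right, v_left, v_right, snr_db):
        # Two-source mixture per ear, then add uncorrelated noise scaled to snr_db.
        def scale(clean, noise):
            noise = noise[:len(clean)]
            p_clean = np.mean(clean ** 2)
            p_noise = np.mean(noise ** 2) + 1e-12
            g = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
            return clean + g * noise
        return scale(s_left, v_left), scale(s_right, v_right)

    # x_left, x_right = mix_with_noise(s1_L + s2_L, s1_R + s2_R, v_L, v_R, snr_db=10)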
Further, the subband filtering in step (3) is computed as:
xL(i,n) = xleft(n)*gi(n)
xR(i,n) = xright(n)*gi(n)
where xleft(n), xright(n) are the noisy mixed training left- and right-ear speech signals containing the two sound sources of different azimuths, xL(i,n), xR(i,n) are the time-domain signals of the i-th subband obtained after the subband filter, and gi(n) is the impulse response of the i-th subband filter;
the framing is performed as follows: with a preset frame length and frame shift, the signals xL(i,n), xR(i,n) are divided into single-frame signals xL(i, k·N/2+m), xR(i, k·N/2+m), where k is the frame index, m = 0, 1, …, N-1 is the sample index within a frame, N is the frame length, and the frame shift is half a frame;
the windowing is performed as:
xL(i,k,m) = wH(m)xL(i, k·N/2+m)
xR(i,k,m) = wH(m)xR(i, k·N/2+m)
where wH(m) is the window function, and xL(i,k,m), xR(i,k,m) are the left- and right-ear speech signals of the i-th subband and k-th frame, used as the time-frequency units of the training binaural speech signal.
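A short sketch of the framing and windowing above for one subband signal. The frame length of 512 and shift of 256 follow the embodiment described below, and the helper name frame_and_window is not from the patent.

    import numpy as np

    def frame_and_window(subband, frame_len=512, hop=256):
        # Split one subband signal into half-overlapping, Hamming-windowed frames;
        # each row of the result is one time-frequency unit x(i, k, m).
        # Assumes len(subband) >= frame_len.
        win = np.hamming(frame_len)
        n_frames = 1 + (len(subband) - frame_len) // hop
        return np.stack([subband[k * hop:k * hop + frame_len] * win
                         for k in range(n_frames)])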
Further, the cross-correlation function CCF in step (4) is calculated as:
CCF(i,k,d) = Σm xL(i,k,m)·xR(i,k,m+d) / sqrt( Σm xL(i,k,m)² · Σm xR(i,k,m)² ),  -L ≤ d ≤ L (sums over m = 0, …, N-1)
where xL(i,k,m), xR(i,k,m) are the time-frequency units of the i-th subband and k-th frame of the training binaural speech signal, CCF(i,k,d) is the cross-correlation function of the corresponding time-frequency unit, d is the delay in samples, L is the maximum delay in samples, and N is the frame length;
the interaural time difference ITD is computed as:
ITD(i,k) = arg max over d of CCF(i,k,d),  -L ≤ d ≤ L
the interaural intensity difference I L D was calculated as:
ILD(i,k) = 10·log10( Σm xL(i,k,m)² / Σm xR(i,k,m)² )
the time-frequency unit space characteristic F (I, k) is composed of CCF, ITD and I L D:
F(i,k)=[CCF(i,k,-L) CCF(i,k,-L+1) ··· CCF(i,k,L) ITD(i,k) ILD(i,k)]
where F(i,k) is the spatial feature vector of the time-frequency unit of the i-th subband and k-th frame of the binaural speech signal.
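A sketch of the spatial-feature computation of step (4) for one time-frequency unit. The lag convention and CCF normalization shown here are assumptions consistent with the formulas above; with max_lag = 16 the vector contains 33 CCF values plus ITD and ILD.

    import numpy as np

    def spatial_features(xl, xr, max_lag=16):
        # xl, xr: windowed left/right frames of one subband (one time-frequency unit).
        n = len(xl)
        norm = np.sqrt(np.sum(xl ** 2) * np.sum(xr ** 2)) + 1e-12

        def ccf(d):  # normalized cross-correlation at integer lag d
            return (np.sum(xl[:n - d] * xr[d:]) if d >= 0
                    else np.sum(xl[-d:] * xr[:n + d])) / norm

        lags = np.arange(-max_lag, max_lag + 1)
        ccf_vec = np.array([ccf(d) for d in lags])
        itd = float(lags[np.argmax(ccf_vec)])                     # best-matching lag
        ild = 10.0 * np.log10((np.sum(xl ** 2) + 1e-12) /
                              (np.sum(xr ** 2) + 1e-12))          # dB energy ratio
        return np.concatenate([ccf_vec, [itd, ild]])              # F(i, k)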
Further, the training process of the LSTM network in step (5) specifically comprises:
(5-1) constructing the LSTM network: the LSTM network consists of an input layer, an LSTM network layer and an output layer; the input layer contains the input at each time step, the output layer contains the output at each time step, and the LSTM network layer contains an LSTM unit for each time step; the number of LSTM time steps is set to 11, i.e. the current frame plus the 5 preceding and 5 following frames, and each LSTM unit in the LSTM network layer is bidirectionally connected to the LSTM units of the preceding and following time steps;
(5-2) randomly initializing the weights of the LSTM network layer;
(5-3) calculating the ideal masking ratio IRM of the current time-frequency unit as the target value of the LSTM network, the ideal masking ratio IRM being computed as follows:
taking the left channel as the separation channel, the single-source left-ear speech signal s1,L(n) of azimuth 1 corresponding to the current time-frequency unit is subband-filtered, framed and windowed to obtain the left-ear speech signal sL(i,k,m) of the i-th subband and k-th frame, and the IRM is computed from sL(i,k,m):
IRM(i,k) = Σm sL(i,k,m)² / Σm xL(i,k,m)²
(5-4) inputting the spatial features of each time-frequency unit, together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, into the input layer of the LSTM network, i.e. [F(i,k-5), F(i,k-4), …, F(i,k), …, F(i,k+4), F(i,k+5)];
(5-5) obtaining the LSTM network output, namely the estimated masking ratio ERM, by forward propagation, and computing the loss function from the difference between the ERM and the ideal masking ratio IRM:
J = E[ ||ERM(i,k) - IRM(i,k)||₂² ]
where E[·] denotes expectation and ||·||₂ denotes the L2 norm;
(5-6) computing the partial derivatives of the loss function J with respect to the network weights by back propagation, and updating the network weights;
(5-7) if the current iteration number is less than the preset total number of iterations, returning to step (5-4) and continuing with the spatial features of the next batch of time-frequency units; once the preset number of iterations is reached, the iteration ends and the LSTM network training is finished.
Compared with the prior art, the method extracts the cross-correlation function, interaural time difference and interaural intensity difference of the training binaural speech signal in each time-frequency unit, forms spatial features from them as training samples, and trains the LSTM network to obtain an LSTM separator. During testing, the multidimensional feature parameters of the test binaural speech signal are extracted, and the trained LSTM separator estimates the masking ratio ERM for each frame of the binaural speech signal. Experimental results under different acoustic environments show that the binaural speech separation method based on the long short-term memory network LSTM clearly improves the separation performance under high noise and strong reverberation and has good robustness.
Drawings
FIG. 1 is a schematic flow diagram of one embodiment of the present invention;
FIG. 2 is a diagram of CCF functions for subbands of a speech signal;
fig. 3 is a schematic diagram of the LSTM network structure in the present invention.
Detailed Description
As shown in fig. 1, the binaural speech separation method based on the LSTM network provided by the present embodiment comprises the following steps:
Step one: convolve two different monaural speech signals from the training speech with head-related impulse response functions (HRIR) of different azimuths to generate two training single-source binaural speech signals of different azimuths; each source is computed as:
s1,L(n) = s1(n)*h1,L,  s2,L(n) = s2(n)*h2,L
s1,R(n) = s1(n)*h1,R,  s2,R(n) = s2(n)*h2,R
where s1(n), s2(n) are the two different monaural source speech signals; s1,L(n), s1,R(n) are the single-source left- and right-ear speech signals for azimuth 1; h1,L, h1,R are the left-ear and right-ear HRIRs for azimuth 1; s2,L(n), s2,R(n) are the single-source left- and right-ear speech signals for azimuth 2; h2,L, h2,R are the left-ear and right-ear HRIRs for azimuth 2; * denotes convolution; and n is the sample index.
The azimuth angles θ of the head-related impulse responses (HRIR) lie in the range [-90°, 90°] with a spacing of 5°, giving 37 azimuths in total; each azimuth θ corresponds to a pair of head-related impulse responses, a left-ear HRIR and a right-ear HRIR.
Step two: mix the training single-source binaural speech signals of the two different azimuths obtained in step one to obtain a mixed training binaural speech signal containing the two sound sources, and add noise at different signal-to-noise ratios to obtain noisy mixed training binaural speech signals containing two sound sources of different azimuths under different acoustic environments. The calculations are as follows:
the method for calculating the mixed training binaural voice signal containing two sound sources comprises the following steps:
sleft(n)=s1,L(n)+s2,L(n)
sright(n)=s1,R(n)+s2,R(n)
where sleft(n), sright(n) are the left- and right-ear signals of the mixed training binaural speech signal containing the two sound sources of different azimuths.
The noisy mixed training binaural speech signal is computed as:
xleft(n)=sleft(n)+vL(n)
xright(n)=sright(n)+vR(n)
where xleft(n), xright(n) are the noisy mixed training left- and right-ear speech signals containing the two sound sources of different azimuths, and vL(n), vR(n) are the left- and right-ear noise signals at the chosen signal-to-noise ratio; vL(n) and vR(n) are uncorrelated.
Binaural speech signals in noisy environments are generated so that the LSTM network can learn the distribution of the spatial feature parameters of binaural speech under noise. The signal-to-noise ratio is set to 0, 5, 10, 15 and 20 dB, yielding binaural speech signals for the two different azimuths under different acoustic environments; for each azimuth pair, binaural speech signals at SNRs of 0, 5, 10, 15 and 20 dB without reverberation are obtained.
Step three: perform subband filtering, framing and windowing on the noisy mixed training binaural speech signal obtained in step two to obtain the framed training binaural speech signal of each subband.
The subband filtering may use a gammatone filter bank; the time-domain impulse response of a gammatone filter is:
gi(n) = A·n³·e^(-2π·bi·n)·cos(2π·fi·n)·u(n)
where i is the filter index, A is the filter gain, fi is the center frequency of the filter, bi is the attenuation factor of the filter and determines how fast the impulse response decays, and u(n) is the unit step function. The gammatone filter bank used in this embodiment contains 33 filters with center frequencies in the range [50 Hz, 8000 Hz].
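A sketch of one gammatone impulse response following the formula above. The bandwidth values use the common Glasberg-Moore ERB rule, which the patent does not specify, and the gain A is set to 1; both are assumptions for illustration.

    import numpy as np

    def gammatone_ir(fc, fs=16000, n_taps=1024, order=4):
        # Discrete-time gammatone impulse response g_i(n) with center frequency fc (Hz).
        t = np.arange(n_taps) / fs
        erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)      # assumed ERB bandwidth (Hz)
        b = 1.019 * erb                               # assumed decay factor
        return t ** (order - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t)

    # x_L_i = np.convolve(x_left, gammatone_ir(500.0), mode='same')  # one subband signal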
The calculation formula of the subband filtering is as follows:
xL(i,n)=xleft(n)*gi(n)
xR(i,n)=xright(n)*gi(n)
where xL(i,n), xR(i,n) are the left-ear and right-ear speech signals of the i-th filtered subband, with 1 ≤ i ≤ 33. The speech signal of each channel is subband-filtered to obtain 33 subband speech signals.
The subband filter of the present invention is not limited to the filter structure of this embodiment; any filter that realizes subband filtering of the speech signal may be used.
Framing: with a speech sampling rate of 16 kHz, the preset frame length is 512 samples and the frame shift is 256 samples; the left- and right-ear speech signals of each subband are divided into multi-frame signals, the framed left- and right-ear speech signals being xL(i, k·N/2+m) and xR(i, k·N/2+m), respectively.
The windowing formula is:
xL(i,k,m)=wH(m)xL(i,k·N/2+m)
xR(i,k,m)=wH(m)xR(i,k·N/2+m)
where xL(i,k,m), xR(i,k,m) are the left- and right-ear speech signals of the i-th subband and k-th frame, with 1 ≤ i ≤ 33, and N is the frame length, set to 512.
The window function is the Hamming window:
wH(m) = 0.54 - 0.46·cos(2πm/(N-1)),  m = 0, 1, …, N-1
and step four, extracting the spatial characteristics of the binaural speech signal of each frame, namely the cross-correlation function CCF, the interaural time difference ITD and the interaural intensity difference I L D, of the training binaural speech signal after each sub-band is framed obtained in the step three.
The cross-correlation function CCF is calculated as:
CCF(i,k,d) = Σm xL(i,k,m)·xR(i,k,m+d) / sqrt( Σm xL(i,k,m)² · Σm xR(i,k,m)² ),  -L ≤ d ≤ L (sums over m = 0, …, N-1)
where CCF(i,k,d) is the cross-correlation function of the binaural speech signal of the i-th subband and k-th frame, d is the delay in samples, and L is the maximum delay in samples.
Considering the speed of sound and the size of the human head, the cross-correlation delay is generally limited to [-1 ms, 1 ms]; since the speech sampling rate in the present invention is 16 kHz, this embodiment takes L = 16, so 33 CCF points are computed for each frame of the training binaural speech signal.
The calculation formula of the interaural time difference ITD is as follows:
ITD(i,k) = arg max over d of CCF(i,k,d),  -L ≤ d ≤ L
the interaural intensity difference I L D is calculated as:
ILD(i,k) = 10·log10( Σm xL(i,k,m)² / Σm xR(i,k,m)² )
the spatial characteristics corresponding to each time-frequency unit are the combination form of the parameters:
F(i,k) = [CCF(i,k,-L) CCF(i,k,-L+1) ··· CCF(i,k,L) ITD(i,k) ILD(i,k)]
where F(i,k) is the spatial feature vector of the time-frequency unit of the i-th subband and k-th frame of the binaural speech signal.
Fig. 2 shows the CCF of time-frequency units of a speech signal; in the low-frequency subbands the CCF has a relatively simple relationship with the delay, while in the high-frequency subbands it exhibits multiple peaks.
Step five: take the spatial feature parameters of each time-frequency unit obtained in step four, together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, as the input features of the long short-term memory network LSTM, take the ideal masking ratio IRM of the time-frequency unit as the target value of the LSTM, and train the LSTM with forward and back propagation; the input feature format is [F(i,k-5), F(i,k-4), …, F(i,k), …, F(i,k+5)];
the L STM network structure of this embodiment is given below, indeed, the structure of the L STM network of the present invention is not limited to the network structure of this embodiment.
As shown in fig. 3, the LSTM network used in this embodiment comprises an input layer, an LSTM network layer and an output layer. The input layer contains the input at each time step and the output layer contains the output at each time step; the LSTM network layer contains an LSTM unit for each time step, and each LSTM unit is bidirectionally connected to the LSTM units of the preceding and following time steps. The input of the input layer is a 37 × 11-dimensional sample, where 37 is the number of spatial features of a time-frequency unit and 11 is the number of LSTM time steps (the 5 preceding frames, the 5 following frames and the current frame). Each LSTM unit of the LSTM network layer contains 256 neurons, and the output layer contains 20 neurons.
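A minimal PyTorch sketch of the network described above. The sizes (37 input features, 11 time steps, 256 hidden units, 20 outputs) follow this embodiment; taking the centre time step and applying a sigmoid to the output are assumptions, since the patent does not state how the per-time-step outputs are combined.

    import torch
    import torch.nn as nn

    class MaskBLSTM(nn.Module):
        def __init__(self, feat_dim=37, hidden=256, out_dim=20):
            super().__init__()
            self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, out_dim)   # 20-unit output layer

        def forward(self, x):                   # x: (batch, 11, 37) feature sequences
            h, _ = self.blstm(x)                # (batch, 11, 2 * hidden)
            centre = h[:, x.shape[1] // 2, :]   # current frame = centre of the 11 steps
            return torch.sigmoid(self.out(centre))  # estimated mask values in [0, 1]

    # model = MaskBLSTM()
    # erm = model(torch.randn(8, 11, 37))       # (8, 20) mask estimates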
In this embodiment, based on simulation experiments, the learning rate is set to 0.0001 and the total number of iterations to 400: a learning rate of 0.0001 avoids excessive oscillation of the error function, and after 400 iterations the network model is close to convergence.
With these parameters set, step five specifically comprises the following sub-steps (a sketch of the resulting training loop is given after the sub-steps):
(5-1) randomly initialize the weights of the LSTM network layer;
(5-2) calculate the ideal masking ratio IRM of the current time-frequency unit as the target value of the LSTM network, the ideal masking ratio IRM being computed as follows:
taking the left channel as the separation channel, the single-source left-ear speech signal s1,L(n) of azimuth 1 corresponding to the current time-frequency unit is subband-filtered, framed and windowed to obtain the left-ear speech signal sL(i,k,m) of the i-th subband and k-th frame, and the IRM is computed from sL(i,k,m):
IRM(i,k) = Σm sL(i,k,m)² / Σm xL(i,k,m)²
(5-3) input the spatial features of each time-frequency unit, together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, into the input layer of the LSTM network, i.e. [F(i,k-5), F(i,k-4), …, F(i,k), …, F(i,k+4), F(i,k+5)];
(5-4) obtain the LSTM network output, namely the estimated masking ratio ERM, by forward propagation, and compute the loss function from the difference between the ERM and the ideal masking ratio IRM:
J = E[ ||ERM(i,k) - IRM(i,k)||₂² ]
where E[·] denotes expectation and ||·||₂ denotes the L2 norm.
(5-5) compute the partial derivatives of the loss function J with respect to the network weights by back propagation, and update the network weights;
(5-6) if the current iteration number is less than the preset total number of iterations, return to step (5-3) and continue with the spatial features of the next batch of time-frequency units; once the preset number of iterations is reached, the iteration ends and the LSTM network training is finished.
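A hedged sketch of sub-steps (5-3) to (5-6) as a training loop. The learning rate of 0.0001 and 400 iterations follow the embodiment; the optimizer (Adam), the mini-batching via a data loader, and the MSE form of the loss are assumptions for illustration.

    import torch

    def train_lstm(model, loader, n_iter=400, lr=1e-4):
        # loader yields (features, irm): features (batch, 11, 37), irm (batch, 20).
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        mse = torch.nn.MSELoss()
        for _ in range(n_iter):
            for feats, irm in loader:
                erm = model(feats)          # forward pass -> estimated masking ratio
                loss = mse(erm, irm)        # squared error between ERM and IRM
                opt.zero_grad()
                loss.backward()             # back propagation
                opt.step()                  # weight update
        return model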
Step six: process the mixed test binaural speech signals containing two sound sources of different azimuths under different acoustic environments according to steps three and four to obtain the spatial features of each time-frequency unit of the test binaural speech signal.
Step seven: input the spatial features of each time-frequency unit obtained in step six, together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, into the trained LSTM network to obtain the estimated masking ratio ERM of each time-frequency unit.
Step eight: separate the mixed test binaural speech signal according to the estimated masking ratio ERM obtained in step seven to obtain the time-domain speech signal corresponding to each single sound source.
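A sketch of step eight under the assumption that each windowed time-frequency unit is weighted by its estimated mask and the frames are recombined by overlap-add, with the subband signals then summed to give the separated time-domain signal. The patent does not spell out the resynthesis, so this is one plausible reconstruction rather than the patent's exact procedure.

    import numpy as np

    def resynthesize_subband(frames, mask, hop=256):
        # frames: (n_frames, frame_len) windowed units of one subband;
        # mask:   (n_frames,) estimated masking ratios for that subband.
        n_frames, frame_len = frames.shape
        out = np.zeros(hop * (n_frames - 1) + frame_len)
        for k in range(n_frames):
            out[k * hop:k * hop + frame_len] += mask[k] * frames[k]
        return out

    # separated = sum(resynthesize_subband(units[i], erm[:, i]) for i in range(n_subbands))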
The three other separation algorithms used for comparison are a speech separation algorithm based on the degenerate unmixing estimation technique (DUET), a DNN speech separation algorithm based on the ideal binary mask (IBM), and a DNN speech separation algorithm based on the ideal ratio mask (IRM); the proposed method is denoted IRM-LSTM, and the first two comparison methods belong to the traditional separation methods.
The PESQ values for the four methods are shown in table 1:
TABLE 1 comparison of PESQ values for the four methods
SNR(dB) DUET IBM-DNN IRM-DNN IRM-LSTM
0 1.403 1.467 1.946 1.874
5 1.57 1.656 2.121 2.140
10 1.754 1.834 2.258 2.355
15 1.923 1.982 2.386 2.528
20 2.102 2.119 2.510 2.654
Noiseless 2.628 2.355 2.765 2.795
According to the results in Table 1, the PESQ of the algorithm based on the LSTM network is much higher than that of the traditional algorithms and, except at a signal-to-noise ratio of 0 dB, higher than that of the algorithm using the DNN network.
The algorithm of this patent is also tested at signal-to-noise ratios of -3, 3, 6, 9 and 12 dB, and the resulting indices are shown in Table 2.
TABLE 2 PESQ values over all tested signal-to-noise ratios
SNR(dB) PESQ
-3 1.867
0 1.874
3 2.161
5 2.140
6 2.322
9 2.452
10 2.355
12 2.552
15 2.528
20 2.654
As can be seen from the table, at signal-to-noise ratios not used in training the PESQ values remain good and are similar to those at the neighbouring trained signal-to-noise ratios, so the algorithm based on the LSTM network generalizes well across noise levels and has good robustness.
The PESQ index is also tested under reverberant conditions, using reverberation times of 200 ms and 600 ms; the results are shown in Tables 3 and 4:
TABLE 3 Comparison of PESQ values of the four methods in a 200 ms reverberation environment
SNR(dB) DUET IBM-DNN IRM-DNN IRM-LSTM
0 1.335 1.413 1.717 1.710
5 1.468 1.593 1.971 2.004
10 1.597 1.758 2.139 2.151
15 1.678 1.865 2.262 2.359
20 1.734 1.932 2.345 2.380
TABLE 4 Comparison of PESQ values of the four methods in a 600 ms reverberation environment
SNR(dB) DUET IBM-DNN IRM-DNN IRM-LSTM
0 1.322 1.410 1.664 1.645
5 1.429 1.570 1.913 2.024
10 1.524 1.713 2.069 2.120
15 1.579 1.800 2.180 2.253
20 1.617 1.857 2.252 2.298
According to Tables 3 and 4, in reverberant environments the PESQ of the binaural speech separation algorithm based on the LSTM network is higher than that of the DNN-based algorithm and significantly higher than those of the other two traditional algorithms, so the algorithm generalizes well to reverberant environments and has good robustness.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (5)

1. A binaural speech separation method based on a long short-term memory network LSTM, characterized by comprising the following steps:
(1) convolving two different training monaural speech signals with head-related impulse response functions (HRIR) of two different azimuths to generate two training single-source binaural speech signals with different azimuths;
(2) mixing the two training single-source binaural speech signals of different azimuths to obtain a mixed training binaural speech signal containing the two sound sources, and adding noise at different signal-to-noise ratios to obtain noisy mixed training binaural speech signals containing two sound sources of different azimuths under different acoustic environments;
(3) performing subband filtering, framing and windowing on the noisy mixed training binaural speech signal obtained in step (2) to obtain the framed training binaural speech signal of each subband, i.e. the time-frequency units of the training binaural speech signal;
(4) calculating the interaural cross-correlation function CCF, the interaural time difference ITD and the interaural intensity difference ILD of each time-frequency unit of the training binaural speech signal obtained in step (3) as the spatial features of each time-frequency unit;
(5) taking the spatial features of each time-frequency unit obtained in step (4), together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, as the input of a long short-term memory network LSTM, and taking the ideal masking ratio IRM of the time-frequency unit as the target value of the LSTM network to train the LSTM network, wherein the training process of the LSTM network specifically comprises the following steps:
(5-1) constructing the LSTM network: the LSTM network consists of an input layer, an LSTM network layer and an output layer; the input layer contains the input at each time step, the output layer contains the output at each time step, and the LSTM network layer contains an LSTM unit for each time step; the number of LSTM time steps is set to 11, i.e. the current frame plus the 5 preceding and 5 following frames, and each LSTM unit in the LSTM network layer is bidirectionally connected to the LSTM units of the preceding and following time steps;
(5-2) randomly initializing the weights of the LSTM network layer;
(5-3) calculating the ideal masking ratio IRM of the current time-frequency unit as the target value of the LSTM network, the ideal masking ratio IRM being computed as follows:
taking the left channel as the separation channel, the single-source left-ear speech signal s1,L(n) of azimuth 1 corresponding to the current time-frequency unit is subband-filtered, framed and windowed to obtain the left-ear speech signal sL(i,k,m) of the i-th subband and k-th frame, and the IRM is computed from sL(i,k,m):
IRM(i,k) = Σm sL(i,k,m)² / Σm xL(i,k,m)²
(5-4) inputting the spatial features of each time-frequency unit, together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, into the input layer of the LSTM network, i.e. [F(i,k-5), F(i,k-4), …, F(i,k), …, F(i,k+4), F(i,k+5)];
(5-5) obtaining the LSTM network output, namely the estimated masking ratio ERM, by forward propagation, and computing the loss function from the difference between the ERM and the ideal masking ratio IRM:
J = E[ ||ERM(i,k) - IRM(i,k)||₂² ]
where E[·] denotes expectation and ||·||₂ denotes the L2 norm;
(5-6) computing the partial derivatives of the loss function J with respect to the network weights by back propagation, and updating the network weights;
(5-7) if the current iteration number is less than the preset total number of iterations, returning to step (5-4) and continuing with the spatial features of the next batch of time-frequency units; once the preset number of iterations is reached, the iteration ends and the LSTM network training is finished;
(6) processing a mixed test binaural speech signal containing two sound sources of different azimuths under different acoustic environments according to steps (3) and (4) to obtain the spatial features of each time-frequency unit of the test binaural speech signal;
(7) inputting the spatial features of each time-frequency unit obtained in step (6), together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, into the trained LSTM network to obtain the estimated masking ratio ERM of each time-frequency unit;
(8) separating the mixed test binaural speech signal according to the estimated masking ratio ERM obtained in step (7) to obtain the time-domain speech signal corresponding to each single sound source.
2. The binaural speech separation method based on the long short-term memory network LSTM according to claim 1, wherein the two single-source binaural speech signals of different azimuths in step (1) are computed as:
s1,L(n) = s1(n)*h1,L,  s2,L(n) = s2(n)*h2,L
s1,R(n) = s1(n)*h1,R,  s2,R(n) = s2(n)*h2,R
where s1(n), s2(n) are the two different monaural source speech signals; s1,L(n), s1,R(n) are the single-source left- and right-ear speech signals for azimuth 1; h1,L, h1,R are the left-ear and right-ear HRIRs for azimuth 1; s2,L(n), s2,R(n) are the single-source left- and right-ear speech signals for azimuth 2; h2,L, h2,R are the left-ear and right-ear HRIRs for azimuth 2; * denotes convolution; and n is the sample index.
3. The binaural speech separation method based on the long short-term memory network LSTM according to claim 1, wherein the mixed training binaural speech signal containing the two sound sources in step (2) is computed as:
sleft(n)=s1,L(n)+s2,L(n)
sright(n)=s1,R(n)+s2,R(n)
where sleft(n), sright(n) are the left- and right-ear signals of the mixed training binaural speech signal containing the two sound sources of different azimuths, s1,L(n), s1,R(n) are the single-source left- and right-ear speech signals for azimuth 1, and s2,L(n), s2,R(n) are the single-source left- and right-ear speech signals for azimuth 2;
the noisy mixed training binaural speech signal is computed as:
xleft(n)=sleft(n)+vL(n)
xright(n)=sright(n)+vR(n)
where xleft(n), xright(n) are the noisy mixed training left- and right-ear speech signals containing the two sound sources of different azimuths, and vL(n), vR(n) are the left- and right-ear noise signals at the chosen signal-to-noise ratio; vL(n) and vR(n) are uncorrelated.
4. The binaural speech separation method based on the long short-term memory network LSTM according to claim 1, wherein the subband filtering in step (3) is computed as:
xL(i,n)=xleft(n)*gi(n)
xR(i,n)=xright(n)*gi(n)
where xleft(n), xright(n) are the noisy mixed training left- and right-ear speech signals containing the two sound sources of different azimuths, xL(i,n), xR(i,n) are the time-domain signals of the i-th subband obtained after the subband filter, and gi(n) is the impulse response of the i-th subband filter;
the framing is performed as follows: with a preset frame length and frame shift, the signals xL(i,n), xR(i,n) are divided into single-frame signals xL(i, k·N/2+m), xR(i, k·N/2+m), where k is the frame index, m = 0, 1, …, N-1 is the sample index within a frame, N is the frame length, and the frame shift is half a frame;
the windowing is performed as:
xL(i,k,m)=wH(m)xL(i,k·N/2+m)
xR(i,k,m)=wH(m)xR(i,k·N/2+m)
where wH(m) is the window function, and xL(i,k,m), xR(i,k,m) are the left- and right-ear speech signals of the i-th subband and k-th frame, used as the time-frequency units of the training binaural speech signal.
5. The binaural speech separation method based on the long short-term memory network LSTM according to claim 1, wherein the cross-correlation function CCF in step (4) is calculated as:
CCF(i,k,d) = Σm xL(i,k,m)·xR(i,k,m+d) / sqrt( Σm xL(i,k,m)² · Σm xR(i,k,m)² ),  -L ≤ d ≤ L (sums over m = 0, …, N-1)
where xL(i,k,m), xR(i,k,m) are the time-frequency units of the i-th subband and k-th frame of the training binaural speech signal, CCF(i,k,d) is the cross-correlation function of the corresponding time-frequency unit, d is the delay in samples, L is the maximum delay in samples, and N is the frame length;
the interaural time difference ITD is computed as:
ITD(i,k) = arg max over d of CCF(i,k,d),  -L ≤ d ≤ L
the interaural intensity difference I L D was calculated as:
ILD(i,k) = 10·log10( Σm xL(i,k,m)² / Σm xR(i,k,m)² )
the time-frequency unit space characteristic F (I, k) is composed of CCF, ITD and I L D:
f (I, k) — [ CCF (I, k, -L) CCF (I, k, -L +1) · · CCF (I, k, L) ITD (I, k) I L D (I, k) ] where F (I, k) represents the spatial feature vector corresponding to the I-th subband, k-th frame binaural speech signal time-frequency unit.
CN201910930176.XA 2019-09-29 2019-09-29 Binaural speech separation method based on long short-term memory network LSTM Active CN110728989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910930176.XA CN110728989B (en) 2019-09-29 2019-09-29 Binaural speech separation method based on long short-term memory network LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910930176.XA CN110728989B (en) 2019-09-29 2019-09-29 Binaural speech separation method based on long short-term memory network LSTM

Publications (2)

Publication Number Publication Date
CN110728989A CN110728989A (en) 2020-01-24
CN110728989B true CN110728989B (en) 2020-07-14

Family

ID=69219570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910930176.XA Active CN110728989B (en) 2019-09-29 2019-09-29 Binaural speech separation method based on long short-term memory network LSTM

Country Status (1)

Country Link
CN (1) CN110728989B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111707990B (en) * 2020-08-19 2021-05-14 东南大学 Binaural sound source positioning method based on dense convolutional network
CN112216301B (en) * 2020-11-17 2022-04-29 东南大学 Deep clustering voice separation method based on logarithmic magnitude spectrum and interaural phase difference
CN113079452B (en) * 2021-03-30 2022-11-15 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, audio direction information generating method, electronic device, and medium
CN113327624B (en) * 2021-05-25 2023-06-23 西北工业大学 Method for intelligent monitoring of environmental noise by adopting end-to-end time domain sound source separation system
CN113936681B (en) * 2021-10-13 2024-04-09 东南大学 Speech enhancement method based on mask mapping and mixed cavity convolution network
CN113823309B (en) * 2021-11-22 2022-02-08 成都启英泰伦科技有限公司 Noise reduction model construction and noise reduction processing method
CN114446316B (en) * 2022-01-27 2024-03-12 腾讯科技(深圳)有限公司 Audio separation method, training method, device and equipment of audio separation model
CN115862676A (en) * 2023-02-22 2023-03-28 南方电网数字电网研究院有限公司 Voice superposition detection method and device based on deep learning and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108091345A (en) * 2017-12-27 2018-05-29 东南大学 A kind of ears speech separating method based on support vector machines
CN108899047A (en) * 2018-08-20 2018-11-27 百度在线网络技术(北京)有限公司 The masking threshold estimation method, apparatus and storage medium of audio signal
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sound source direction estimation method based on time-frequency masking and deep neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160111107A1 (en) * 2014-10-21 2016-04-21 Mitsubishi Electric Research Laboratories, Inc. Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108091345A (en) * 2017-12-27 2018-05-29 东南大学 A kind of ears speech separating method based on support vector machines
CN108899047A (en) * 2018-08-20 2018-11-27 百度在线网络技术(北京)有限公司 The masking threshold estimation method, apparatus and storage medium of audio signal
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sound source direction estimation method based on time-frequency masking and deep neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Weninger F., Erdogan H., Watanabe S. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. Proceedings of the 12th International Conference on Latent Variable Analysis and Signal Separation, 2015, pp. 91-99 *
Li Lujun. Research on speech enhancement technology based on deep learning. Master's thesis, PLA Information Engineering University, June 2018, full text *

Also Published As

Publication number Publication date
CN110728989A (en) 2020-01-24

Similar Documents

Publication Publication Date Title
CN110728989B (en) Binaural speech separation method based on long short-term memory network LSTM
CN109164415B (en) Binaural sound source positioning method based on convolutional neural network
Vecchiotti et al. End-to-end binaural sound localisation from the raw waveform
CN110517705B (en) Binaural sound source positioning method and system based on deep neural network and convolutional neural network
CN109410976B (en) Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid
CN107942290B (en) Binaural sound sources localization method based on BP neural network
CN112151059A (en) Microphone array-oriented channel attention weighted speech enhancement method
Mosayyebpour et al. Single-microphone LP residual skewness-based inverse filtering of the room impulse response
Ren et al. A Causal U-Net Based Neural Beamforming Network for Real-Time Multi-Channel Speech Enhancement.
Wang et al. Mask weighted STFT ratios for relative transfer function estimation and its application to robust ASR
CN108986832A (en) Ears speech dereverberation method and device based on voice probability of occurrence and consistency
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
Barros et al. Estimation of speech embedded in a reverberant and noisy environment by independent component analysis and wavelets
CN111948609B (en) Binaural sound source positioning method based on Soft-argmax regression device
CN112201276B (en) TC-ResNet network-based microphone array voice separation method
CN111707990B (en) Binaural sound source positioning method based on dense convolutional network
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
CN114613384B (en) Deep learning-based multi-input voice signal beam forming information complementation method
CN112731291B (en) Binaural sound source localization method and system for collaborative two-channel time-frequency mask estimation task learning
Wu et al. Microphone array speech separation algorithm based on dnn
Youssef et al. From monaural to binaural speaker recognition for humanoid robots
CN114189781A (en) Noise reduction method and system for double-microphone neural network noise reduction earphone
Youssef et al. Binaural speaker recognition for humanoid robots
Shen et al. Multichannel Speech Enhancement in Vehicle Environment Based on Interchannel Attention Mechanism
CN112216301B (en) Deep clustering voice separation method based on logarithmic magnitude spectrum and interaural phase difference

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant