CN110728989A - Binaural speech separation method based on long short-term memory (LSTM) network - Google Patents

Info

Publication number: CN110728989A (application CN201910930176.XA); granted as CN110728989B
Original language: Chinese (zh)
Inventors: 周琳, 陆思源, 钟秋月, 庄琰
Applicant and assignee: Southeast University
Legal status: Active, granted


Classifications

    • G10L 21/0272 — Voice signal separating (speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L 25/30 — Speech or voice analysis techniques characterised by the use of neural networks


Abstract

The invention discloses a binaural speech separation method based on the long short-term memory (LSTM) network. The method extracts the interaural time difference, interaural intensity difference and interaural cross-correlation function of each time-frequency unit of a training binaural speech signal as spatial features for separation, and trains a bidirectional LSTM network whose input is the spatial features of the current frame's time-frequency unit together with those of the 5 frames before and after it in the same subband, yielding an LSTM-based separation model. In the testing stage, the spatial features of the current frame of the test binaural speech signal and of the 5 frames before and after it are fed into the trained bidirectional LSTM network, which estimates the masking value of the target speech for the current time-frequency unit; speech separation is then performed according to this masking value. The separation results show that, compared with a method based on a deep neural network, the proposed LSTM-based binaural separation method markedly improves the subjective evaluation indices and generalizes well.

Description

Binaural speech separation method based on long short-term memory (LSTM) network
Technical Field
The invention relates to speech separation algorithms, and in particular to a binaural speech separation method based on the long short-term memory (LSTM) network.
Background
Speech separation is an important research direction in speech signal processing, with wide-ranging applications. In a teleconference system, speech separation can extract the sound source of interest from among several speakers, improving the efficiency of the conference; as a preprocessing step for speech recognition, it can improve speech quality and thereby recognition accuracy; in hearing-aid devices, it can make the target sound source more prominent for a hearing-impaired listener and deliver effective speech information.
Speech separation draws on many fields, including but not limited to acoustics, digital signal processing, information communication, and auditory psychology and physiology. Binaural speech separation analyzes and estimates the sound source orientation from the differences between the two ear signals. Current separation algorithms can be grouped by the parameters they use for separation:
1. Separation based on interaural differences
In 1907, Lord Rayleigh, assuming a spherical head, first proposed a separation theory based on interaural cue differences: because the sound source sits at different distances and angles relative to the listener's two ears, the speech signals received at the ears differ in time and in intensity, giving the interaural time difference (ITD) and the interaural intensity difference (IID). These two differences are the basis of binaural separation. The cross-correlation function (CCF) of the binaural speech signals, which is related to the ITD and IID, is also an interaural-difference parameter; in real environments, however, interference from reverberation and noise degrades the separation performance.
2. Separation based on head-related transfer function
ITD information can distinguish sources to the left and right, but cannot tell whether a sound comes from the front or the rear, and cannot resolve elevation. Methods based on the head-related transfer function (HRTF) are no longer limited to horizontal, frontal sources: they design an inverse filter from an HRTF database and separate speech by computing cross-correlation values from the inverse-filtered binaural signal. This solves three-dimensional speech separation, but its computational complexity is excessive and HRTFs are strongly individual; when the individual or the surrounding environment differs (i.e., different noise or reverberation is present), the actual transfer function no longer matches the one used in the separation model, degrading the separation.
3. Deep neural network DNN based separation
This approach applies the ideal masking ratio (IRM) to multi-speaker separation, modeling by azimuth: improved IRM values are extracted for sound sources at 19 frontal azimuths plus environmental noise, and serve as the training targets of a neural network. In the training stage, the binaural speech signals are preprocessed: the mixed speech passes through a gammatone filter bank and is framed and windowed to obtain the time-frequency units, whose spatial features are extracted and fed to a DNN for training. In the testing stage, the time-frequency spatial features extracted from the mixed speech are sent to the trained DNN, whose output is the estimated masking ratio (ERM). This separation method is robust and clearly improves on traditional algorithms across speech evaluation indices, but it does not exploit the temporal correlation of the speech feature parameters.
Disclosure of Invention
The purpose of the invention is as follows: conventional binaural speech separation algorithms degrade sharply under high noise and strong reverberation. The invention therefore provides a binaural speech separation method based on the long short-term memory (LSTM) network, which trains the feature parameters collected under multiple acoustic environments with an LSTM network. Simulation results show that the separation performance of the proposed LSTM-based binaural separation method improves markedly.
The technical scheme is as follows: the binaural speech separation method based on the long short-term memory (LSTM) network of the invention comprises the following steps:
(1) convolving two different training mono speech signals with head-related impulse responses (HRIR) at different azimuth angles to generate two training single-source binaural speech signals at different azimuths;
(2) mixing the two training single sound source binaural voice signals with different azimuth angles to obtain a mixed training binaural voice signal containing the two sound sources, and simultaneously adding noises with different signal-to-noise ratios to obtain a noise-containing mixed training binaural voice signal containing the two sound sources with different azimuth angles in different acoustic environments;
(3) performing subband filtering, framing and windowing on the noise-containing mixed training binaural voice signal obtained in the step (2) to obtain a training binaural voice signal after each subband is framed, namely each time-frequency unit of the training binaural voice signal;
(4) calculating an interaural cross-correlation function CCF, an interaural time difference ITD and an interaural intensity difference ILD of each time-frequency unit of the training binaural speech signal obtained in the step (3) to serve as spatial features of each time-frequency unit of the training binaural speech signal;
(5) taking the spatial feature parameters of each time-frequency unit obtained in step (4), together with the spatial features of the corresponding time-frequency units of the 5 frames before and after it in the same subband, as the input of the LSTM network, taking the ideal masking ratio IRM of the time-frequency unit as the target value of the LSTM network, and training the LSTM network;
(6) processing a mixed test binaural voice signal containing two sound sources with different azimuth angles under different acoustic environments according to the step (3) and the step (4) to obtain spatial characteristics of each time-frequency unit of the test binaural voice signal;
(7) inputting the spatial features of each time-frequency unit obtained in step (6), together with the spatial features of the corresponding time-frequency units of the 5 frames before and after it in the same subband, into the trained LSTM network to obtain the estimated masking ratio ERM of each time-frequency unit;
(8) separating the mixed test binaural speech signal according to the estimated masking ratio ERM obtained in step (7) to obtain the time-domain speech signal corresponding to a single source.
Further, the two single-source binaural speech signals at different azimuth angles in step (1) are computed as:

s_{1,L}(n) = s_1(n) * h_{1,L},  s_{2,L}(n) = s_2(n) * h_{2,L}
s_{1,R}(n) = s_1(n) * h_{1,R},  s_{2,R}(n) = s_2(n) * h_{2,R}

where s_1(n), s_2(n) are the two different mono source speech signals; s_{1,L}(n), s_{1,R}(n) are the left- and right-ear signals of the single source at azimuth 1; h_{1,L}, h_{1,R} are the left- and right-ear HRIRs for azimuth 1; s_{2,L}(n), s_{2,R}(n) and h_{2,L}, h_{2,R} are the corresponding signals and HRIRs for azimuth 2; * denotes convolution and n is the sample index.
Further, the mixed binaural speech signal containing the two sources in step (2) is computed as:

s_left(n) = s_{1,L}(n) + s_{2,L}(n)
s_right(n) = s_{1,R}(n) + s_{2,R}(n)

where s_left(n), s_right(n) are the left- and right-ear signals of the mixed training binaural speech signal containing the two sources at different azimuths, and s_{1,L}(n), s_{1,R}(n), s_{2,L}(n), s_{2,R}(n) are the left- and right-ear signals of the single sources at azimuths 1 and 2.

The noisy mixed training binaural speech signal is computed as:

x_left(n) = s_left(n) + v_L(n)
x_right(n) = s_right(n) + v_R(n)

where x_left(n), x_right(n) are the noisy mixed training left- and right-ear speech signals containing the two sources at different azimuths, and v_L(n), v_R(n) are the left- and right-ear noise signals at a given signal-to-noise ratio; v_L(n) and v_R(n) are uncorrelated.
Further, the subband filtering in step (3) is computed as:

x_L(i,n) = x_left(n) * g_i(n)
x_R(i,n) = x_right(n) * g_i(n)

where x_left(n), x_right(n) are the noisy mixed training left- and right-ear signals, x_L(i,n), x_R(i,n) are the time-domain signals of the i-th subband after the subband filter, and g_i(n) is the impulse response of the i-th subband filter.

Framing: with a preset frame length and frame shift, the signals x_L(i,n), x_R(i,n) are split into single-frame signals x_L(i, k·N/2 + m), x_R(i, k·N/2 + m), where k is the frame index, m = 0, 1, ..., N-1 is the sample index within a frame, N is the frame length, and the frame shift is half a frame.

Windowing:

x_L(i,k,m) = w_H(m) x_L(i, k·N/2 + m)
x_R(i,k,m) = w_H(m) x_R(i, k·N/2 + m)

where w_H(m) is the window function and x_L(i,k,m), x_R(i,k,m) are the left- and right-ear speech signals of the i-th subband and k-th frame, which serve as the time-frequency units of the training binaural speech signal.
Further, the cross-correlation function CCF in step (4) is computed as:

CCF(i,k,d) = [ Σ_m x_L(i,k,m)·x_R(i,k,m-d) ] / sqrt( [ Σ_m x_L²(i,k,m) ]·[ Σ_m x_R²(i,k,m) ] ),  -L ≤ d ≤ L

where x_L(i,k,m), x_R(i,k,m) are the time-frequency units of the i-th subband and k-th frame of the training binaural speech signal, CCF(i,k,d) is the cross-correlation of that time-frequency unit at lag d, d is the lag in samples, L is the maximum lag in samples, and N is the frame length.

The interaural time difference ITD is computed as:

ITD(i,k) = argmax_d CCF(i,k,d)

The interaural intensity difference ILD is computed as:

ILD(i,k) = 10·log10( Σ_m x_L²(i,k,m) / Σ_m x_R²(i,k,m) )

The spatial feature F(i,k) of a time-frequency unit is formed from the CCF, ITD and ILD:

F(i,k) = [CCF(i,k,-L), CCF(i,k,-L+1), ..., CCF(i,k,L), ITD(i,k), ILD(i,k)]

where F(i,k) is the spatial feature vector of the time-frequency unit of the i-th subband and k-th frame of the binaural speech signal.
Further, the training process of the LSTM network in step (5) specifically includes:
(5-1) constructing the LSTM network, which consists of an input layer, an LSTM network layer and an output layer; the input layer holds the input at each time step and the output layer the output at each time step; the LSTM network layer holds the LSTM units at each time step, with the number of time steps set to 11 (the current frame plus the 5 frames before and after it), and each LSTM unit is bidirectionally connected to the units at the preceding and following time steps;
(5-2) randomly initializing the weight of the LSTM network layer;
(5-3) calculating the ideal masking ratio IRM of the current time-frequency unit as the target value of the LSTM network; the IRM is computed as follows:

Taking the left channel as the separation channel, the single-source left-ear speech signal s_{1,L}(n) at azimuth 1 corresponding to the current time-frequency unit is subband-filtered, framed and windowed to give the left-ear target signal s_L(i,k,m) of the i-th subband and k-th frame, and the IRM is computed from s_L(i,k,m):

IRM(i,k) = sqrt( Σ_m s_L²(i,k,m) / Σ_m x_L²(i,k,m) )
(5-4) inputting the spatial characteristics of each time-frequency unit and the spatial characteristics of the time-frequency units of the front and rear 5 frames in the same sub-band into an input layer of the LSTM network, namely [ F (i, k-5), F (i, k-4), …, F (i, k), …, F (i, k +4), F (i, k +5) ];
(5-5) obtaining the output of the LSTM network, namely the estimated masking ratio ERM, by the forward-propagation algorithm, and computing the loss from the difference between the ERM and the ideal masking ratio IRM:

J = E[ ‖ERM(i,k) - IRM(i,k)‖₂² ]

where E[·] denotes the expectation and ‖·‖₂ the L2 norm;
(5-6) calculating the partial derivative of the loss function J to the network weight by using a back propagation algorithm, and correcting the network weight;
(5-7) if the current iteration count is less than the preset total number of iterations, return to step (5-4) and continue feeding the spatial features of the next time-frequency units; once the preset number of iterations is reached, the iteration stops and the LSTM network training ends.
Beneficial effects: compared with the prior art, the invention extracts, for each time-frequency unit of the training binaural speech signal, the cross-correlation function, interaural time difference and interaural intensity difference to form spatial features as training samples, and trains the LSTM network to obtain an LSTM separation model. In testing, the multi-dimensional feature parameters of the test binaural speech signal are extracted, and the trained LSTM model estimates the masking ratio ERM for each frame of the binaural speech signal. Experimental results in different acoustic environments show that the proposed binaural speech separation method based on the long short-term memory (LSTM) network clearly improves separation under high noise and strong reverberation, with good robustness.
Drawings
FIG. 1 is a schematic flow diagram of one embodiment of the present invention;
FIG. 2 is a diagram of CCF functions for subbands of a speech signal;
fig. 3 is a schematic diagram of the LSTM network structure in the present invention.
Detailed Description
As shown in fig. 1, the method for binaural speech separation based on LSTM network provided by this embodiment includes the following steps:
the method comprises the following steps of firstly, convolving two different single-sound-channel voice signals in training voice with head-related impulse response functions HRIR (high-resolution infrared) with different azimuth angles to generate two training single-sound-source double-ear voice signals with different azimuth angles, wherein a calculation formula of each sound source is as follows:
s1,L(n)=s1(n)*h1,Ls2,L(n)=s2(n)*h2,L
s1,R(n)=s1(n)*h1,R,s2,R(n)=s2(n)*h2,R
wherein s is1(n)、s2(n) two different monaural source speech signals, s1,L(n)、s1,R(n) represents the single sound source left and right ear voice signals corresponding to azimuth 1, h1,L、h1,RThe left ear HRIR and the right ear HRIR, s corresponding to the azimuth 12,L(n)、s2,R(n) represents the single sound source left and right ear voice signals corresponding to the azimuth angle 2, h2,L、h2,RThe left ear HRIR and the right ear HRIR corresponding to the azimuth 2 are represented by convolution operation, and n is a sampling number.
The mono source signals are female and male voice recordings from the SOLO set of the CHAINS Speech Corpus. The azimuth angle θ of the head-related impulse response (HRIR) ranges over [-90°, 90°] in 5° steps, for 37 azimuths in total. Each azimuth θ corresponds to a pair of head-related impulse responses: a left-ear HRIR and a right-ear HRIR.
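Step one amounts to two linear convolutions per source. A minimal pure-Python sketch follows; the 2-tap HRIRs are hypothetical placeholders (real HRIRs come from a measured database such as the one referenced above):

```python
def convolve(x, h):
    """Full linear convolution of two sequences (pure-Python sketch)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def spatialize(mono, hrir_l, hrir_r):
    """Render a mono source as a (left, right) binaural pair via HRIR convolution."""
    return convolve(mono, hrir_l), convolve(mono, hrir_r)

# hypothetical 2-tap HRIRs standing in for a measured pair at one azimuth
s1 = [1.0, 0.5, -0.25]
left, right = spatialize(s1, [1.0, 0.0], [0.5, 0.5])
```

In practice the same `spatialize` call would be repeated for the second source with the HRIR pair of its own azimuth.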
Step two: mix the two training single-source binaural speech signals at different azimuths obtained in step one into a mixed training binaural speech signal containing the two sources, and add noise at various signal-to-noise ratios to obtain noisy mixed training binaural speech signals containing two sources at different azimuths in different acoustic environments. The mixed training binaural speech signal containing the two sources is computed as:
s_left(n) = s_{1,L}(n) + s_{2,L}(n)
s_right(n) = s_{1,R}(n) + s_{2,R}(n)

where s_left(n), s_right(n) are the left- and right-ear signals of the training mixed binaural speech signal containing the two sources at different azimuths.

The noisy mixed training binaural speech signal is computed as:

x_left(n) = s_left(n) + v_L(n)
x_right(n) = s_right(n) + v_R(n)

where x_left(n), x_right(n) are the noisy mixed training left- and right-ear speech signals containing the two sources at different azimuths, and v_L(n), v_R(n) are the left- and right-ear noise signals at a given signal-to-noise ratio; v_L(n) and v_R(n) are uncorrelated.
Binaural speech signals in noisy environments are generated so that the LSTM network can learn the distribution of the spatial feature parameters under noise. The signal-to-noise ratio is set to 0, 5, 10, 15 and 20 dB, producing binaural speech signals in different acoustic environments for each pair of azimuths; thus, for every azimuth pair, reverberation-free binaural speech signals at SNRs of 0, 5, 10, 15 and 20 dB are obtained.
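Adding noise at a prescribed SNR means scaling the noise so that the signal-to-noise power ratio hits the target before summing. A sketch under that standard convention (the sinusoidal "clean" channel and the white Gaussian noise are stand-ins, not the patent's corpus data):

```python
import math
import random

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so the signal-to-noise ratio equals `snr_db` dB, then mix it in."""
    p_sig = sum(s * s for s in signal) / len(signal)
    p_noise = sum(v * v for v in noise) / len(noise)
    scale = math.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * v for s, v in zip(signal, noise)]

random.seed(0)
clean = [math.sin(0.1 * n) for n in range(1600)]   # stand-in mixed binaural channel
noise = [random.gauss(0.0, 1.0) for _ in range(1600)]
noisy = add_noise_at_snr(clean, noise, snr_db=10.0)
```

Repeating this for snr_db in {0, 5, 10, 15, 20} and for each ear (with uncorrelated noise realizations) reproduces the training-condition grid described above.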
Step three: perform subband filtering, framing and windowing on the noisy mixed training binaural speech signal obtained in step two, obtaining the framed training binaural speech signal of each subband.
The subband filtering may use a gammatone filter bank, whose time-domain impulse response is:

g_i(n) = A·n³·e^{-2π·b_i·n}·cos(2π·f_i·n)·u(n)

where i is the filter index, A the filter gain, f_i the center frequency of the filter, b_i the attenuation factor of the filter (which determines the decay speed of the impulse response), and u(n) the step function. The gammatone filter bank adopted in this embodiment has 33 filters with center frequencies in [50 Hz, 8000 Hz].

The subband filtering is computed as:

x_L(i,n) = x_left(n) * g_i(n)
x_R(i,n) = x_right(n) * g_i(n)

where x_L(i,n), x_R(i,n) are the filtered left- and right-ear speech signals of the i-th subband, 1 ≤ i ≤ 33. After subband filtering, each channel's speech signal yields 33 subband signals.
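A sketch of the 4th-order gammatone impulse response described above, sampled in discrete time. The bandwidth parameter and the channel spacing are assumptions for illustration (an ERB-based bandwidth and logarithmic center-frequency spacing), not the patent's exact filter design:

```python
import math

def gammatone_ir(fc, fs, duration=0.02, gain=1.0):
    """4th-order gammatone impulse response: g(t) = A * t^3 * e^{-2 pi b t} * cos(2 pi fc t).

    The bandwidth b is tied to the equivalent rectangular bandwidth
    ERB(fc) = 24.7 + fc / 9.265 (Glasberg & Moore) -- a common choice,
    assumed here rather than taken from the patent.
    """
    b = 1.019 * (24.7 + fc / 9.265)
    ir = []
    for n in range(int(duration * fs)):
        t = n / fs
        ir.append(gain * t ** 3 * math.exp(-2.0 * math.pi * b * t)
                  * math.cos(2.0 * math.pi * fc * t))
    return ir

# 33 channels spanning [50 Hz, 8000 Hz] as in the embodiment; the logarithmic
# spacing below is an assumption, not the patent's center-frequency layout
fs = 16000
centers = [50.0 * (8000.0 / 50.0) ** (i / 32) for i in range(33)]
bank = [gammatone_ir(fc, fs) for fc in centers]
```

Filtering a channel then means convolving it with each of the 33 impulse responses to obtain the 33 subband signals.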
In fact, the subband filter of the invention is not limited to the filter structure of this embodiment; any filter that realizes subband filtering of the speech signal may be adopted.
Framing: with a speech sampling rate of 16 kHz, the preset frame length is 512 samples and the frame shift 256; the left- and right-ear speech signals of each subband are split into multi-frame signals, the framed left- and right-ear signals being x_L(i, k·N/2 + m) and x_R(i, k·N/2 + m) respectively.
The windowing formula is:

x_L(i,k,m) = w_H(m) x_L(i, k·N/2 + m)
x_R(i,k,m) = w_H(m) x_R(i, k·N/2 + m)

where x_L(i,k,m), x_R(i,k,m) are the left- and right-ear speech signals of the i-th subband and k-th frame, 1 ≤ i ≤ 33, and N = 512 is the frame length.

The window function is the Hamming window:

w_H(m) = 0.54 - 0.46·cos( 2πm / (N-1) ),  m = 0, 1, ..., N-1
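The framing and windowing step above can be sketched directly with the embodiment's parameters (N = 512, shift 256, Hamming window); the sinusoidal input is a stand-in for one subband signal:

```python
import math

N = 512       # frame length at a 16 kHz sampling rate
HOP = N // 2  # frame shift: half a frame

def hamming(n):
    """Hamming window w_H(m) = 0.54 - 0.46*cos(2*pi*m/(n-1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (n - 1)) for m in range(n)]

def frame_and_window(x, n=N, hop=HOP):
    """Split a subband signal into half-overlapping frames and apply the window."""
    w = hamming(n)
    return [[w[m] * x[start + m] for m in range(n)]
            for start in range(0, len(x) - n + 1, hop)]

x = [math.sin(0.05 * i) for i in range(4096)]  # stand-in subband signal
frames = frame_and_window(x)
```

Each inner list is one time-frequency unit x_L(i,k,·) for the subband the input came from.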
and step four, extracting the spatial characteristics of the binaural speech signal of each frame, namely the cross-correlation function CCF, the interaural time difference ITD and the interaural intensity difference ILD, of the training binaural speech signal after each sub-band is framed obtained in the step three.
The cross-correlation function CCF is calculated as:
Figure BDA0002219981360000072
wherein CCF (i, k, d) represents a cross-correlation function corresponding to the i-th subband and the k-th frame binaural speech signal, d is the number of delayed sample points, and L is the maximum number of delayed sample points.
The length of the cross-correlation function is typically taken to be a value between-1 ms, in combination with the speed of sound propagation and the size of the human head. In the present invention, the sampling rate of the speech signal is 16kHz, so that the present embodiment takes L to 16, and thus the number of CCF points calculated by training the binaural speech signal per frame is 33 points.
The calculation formula of the interaural time difference ITD is as follows:
Figure BDA0002219981360000082
the calculation formula of the interaural intensity difference ILD is:
the spatial characteristics corresponding to each time-frequency unit are the combination form of the parameters:
F(i,k)=[CCF(i,k,-L)CCF(i,k,-L+1)···CCF(i,k,L)ITD(i,k)ILD(i,k)]
wherein, F (i, k) represents a spatial feature vector corresponding to the i-th subband and the k-th frame binaural speech signal time-frequency unit.
Fig. 2 is a CCF function of a time-frequency unit of a speech signal, where CCF has a relatively simple relationship with delay in a low frequency band and generates a plurality of peaks in a high frequency band.
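The CCF/ITD/ILD extraction for one time-frequency unit can be sketched as below, with L = 16 as in the embodiment. The lag sign convention and the exact CCF normalization are assumptions; in this convention a right channel delayed by 4 samples yields ITD = -4:

```python
import math

L_MAX = 16  # maximum lag in samples: about 1 ms at a 16 kHz sampling rate

def ccf(xl, xr, d):
    """Normalized cross-correlation of one time-frequency unit at integer lag d."""
    n = len(xl)
    num = sum(xl[m] * xr[m - d] for m in range(max(0, d), min(n, n + d)))
    den = math.sqrt(sum(v * v for v in xl) * sum(v * v for v in xr))
    return num / den if den > 0.0 else 0.0

def spatial_features(xl, xr, max_lag=L_MAX):
    """CCF over lags -L..L plus ITD (argmax lag) and ILD (energy ratio in dB)."""
    ccf_vec = [ccf(xl, xr, d) for d in range(-max_lag, max_lag + 1)]
    itd = max(range(-max_lag, max_lag + 1), key=lambda d: ccf_vec[d + max_lag])
    e_l = sum(v * v for v in xl)
    e_r = sum(v * v for v in xr)
    ild = 10.0 * math.log10(e_l / e_r)
    return ccf_vec + [itd, ild]

# right channel delayed by 4 samples relative to the left
left = [math.sin(0.3 * m) for m in range(512)]
right = [0.0] * 4 + left[:-4]
feats = spatial_features(left, right)
```

With L = 16 this gives 33 CCF values plus ITD and ILD per time-frequency unit.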
Step five: take the spatial feature parameters of each time-frequency unit obtained in step four, together with the spatial features of the corresponding time-frequency units of the 5 frames before and after it in the same subband, as the input features of the LSTM network; take the ideal masking ratio IRM of the time-frequency unit as the target value of the LSTM network; and train the LSTM network with the forward- and back-propagation algorithms. The input feature format is: [F(i,k-5), F(i,k-4), …, F(i,k), …, F(i,k+5)];
the LSTM network structure of the present embodiment is given below. In fact, the structure of the LSTM network of the present invention is not limited to the network structure of this embodiment.
As shown in fig. 3, the LSTM network used in this embodiment comprises an input layer, an LSTM network layer and an output layer. The input layer holds the input at each time step, the output layer the output at each time step, and the LSTM network layer the LSTM units at each time step; each LSTM unit is bidirectionally connected to the units at the preceding and following time steps. The input is a 37 × 11-dimensional sample, where 37 is the number of spatial features of a time-frequency unit and 11 is the LSTM time-step count (the current frame plus the 5 frames before and after it). Each LSTM unit in the LSTM network layer contains 256 neurons, and the output layer contains 20 neurons. In the training stage, the training speech is preprocessed, the spatial feature parameters of each time-frequency unit are extracted and sent to the LSTM network; the ERM output at the last time step is compared with the true IRM under a mean-square-error loss, and the trained network is obtained by cyclic iteration.
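A pure-Python forward-pass sketch of the bidirectional LSTM over the 11-step context window follows. It is illustrative only: the hidden size is shrunk from the embodiment's 256 to 8, the weights are random and untrained, the output layer is omitted, and the 35-dimensional toy feature vectors are placeholders for F(i,k-5)…F(i,k+5):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class LSTMCell:
    """Forward pass of one LSTM cell: input/forget/output gates plus candidate state."""
    def __init__(self, in_dim, hid_dim, seed=0):
        rng = random.Random(seed)
        cols = in_dim + hid_dim
        self.w = {k: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                      for _ in range(hid_dim)] for k in "ifoc"}
        self.b = {k: [0.0] * hid_dim for k in "ifoc"}
        self.hid = hid_dim

    def step(self, x, h, c):
        z = x + h  # concatenate input and previous hidden state
        def gate(name, act):
            return [act(sum(w * v for w, v in zip(row, z)) + self.b[name][j])
                    for j, row in enumerate(self.w[name])]
        i = gate("i", sigmoid)
        f = gate("f", sigmoid)
        o = gate("o", sigmoid)
        cand = gate("c", math.tanh)
        c_new = [f[j] * c[j] + i[j] * cand[j] for j in range(self.hid)]
        h_new = [o[j] * math.tanh(c_new[j]) for j in range(self.hid)]
        return h_new, c_new

def bilstm_forward(seq, fwd, bwd):
    """Run the forward and backward cells over the sequence; concatenate final states."""
    h, c = [0.0] * fwd.hid, [0.0] * fwd.hid
    for x in seq:
        h, c = fwd.step(x, h, c)
    hb, cb = [0.0] * bwd.hid, [0.0] * bwd.hid
    for x in reversed(seq):
        hb, cb = bwd.step(x, hb, cb)
    return h + hb

# 11 time steps: the current frame plus the 5 frames before and after it
seq = [[0.01 * t] * 35 for t in range(11)]   # 35-dim toy feature vectors
fwd, bwd = LSTMCell(35, 8, seed=1), LSTMCell(35, 8, seed=2)
out = bilstm_forward(seq, fwd, bwd)
```

In the actual method, a dense output layer maps this concatenated state to the ERM, and the weights are learned by back-propagation against the IRM target.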
In this embodiment, based on simulation experiments, the learning rate is set to 0.0001 and the total number of iterations to 400: a learning rate of 0.0001 avoids excessive oscillation of the loss, and at 400 iterations the network model is close to convergence.
Based on the set parameters, the fifth step specifically comprises the following steps:
(5-1) randomly initializing the weight of the LSTM network layer;
(5-2) calculating the ideal masking ratio IRM of the current time-frequency unit as the target value of the LSTM network; the IRM is computed as follows:

Taking the left channel as the separation channel, the single-source left-ear speech signal s_{1,L}(n) at azimuth 1 corresponding to the current time-frequency unit is subband-filtered, framed and windowed to give the left-ear target signal s_L(i,k,m) of the i-th subband and k-th frame, and the IRM is computed from s_L(i,k,m):

IRM(i,k) = sqrt( Σ_m s_L²(i,k,m) / Σ_m x_L²(i,k,m) )
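The IRM for one time-frequency unit reduces to an energy ratio between the target frame and the mixture frame. A sketch assuming the square-root energy-ratio form given above (the sinusoidal target and interferer are stand-ins):

```python
import math

def ideal_ratio_mask(target_frame, mixture_frame):
    """IRM for one time-frequency unit: sqrt(target energy / mixture energy)."""
    e_t = sum(v * v for v in target_frame)
    e_x = sum(v * v for v in mixture_frame)
    return math.sqrt(e_t / e_x) if e_x > 0.0 else 0.0

target = [math.sin(0.2 * m) for m in range(512)]
interferer = [0.5 * math.sin(0.37 * m) for m in range(512)]
mixture = [a + b for a, b in zip(target, interferer)]
irm = ideal_ratio_mask(target, mixture)
```

When the frame is pure target the mask is 1; the more interference energy the frame contains, the closer the mask moves toward 0.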
(5-3) inputting the spatial characteristics of each time-frequency unit and the spatial characteristics of the time-frequency units of the front and rear 5 frames in the same sub-band into an input layer of the LSTM network, namely [ F (i, k-5), F (i, k-4), …, F (i, k), …, F (i, k +4), F (i, k +5) ];
(5-4) obtaining the output of the LSTM network, namely the estimated masking ratio ERM, by the forward-propagation algorithm, and computing the loss from the difference between the ERM and the ideal masking ratio IRM:

J = E[ ‖ERM(i,k) - IRM(i,k)‖₂² ]

where E[·] denotes the expectation and ‖·‖₂ the L2 norm.
(5-5) calculating the partial derivative of the loss function J to the network weight by using a back propagation algorithm, and correcting the network weight;
(5-6) if the current iteration count is less than the preset total number of iterations, return to step (5-3) and continue feeding the spatial features of the next time-frequency units; once the preset number of iterations is reached, the iteration stops and the LSTM network training ends.
Step six: the mixed test binaural speech signals containing two sound sources with different azimuths under different acoustic environments are processed according to step three and step four to obtain the spatial features of each time-frequency unit of the test binaural speech signal.
Step seven: the spatial features of each time-frequency unit obtained in step six, together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, are input into the trained LSTM network to obtain the estimated masking ratio ERM of each time-frequency unit.
Step eight: the mixed test binaural speech signal is separated according to the estimated masking ratio ERM obtained in step seven to obtain the time-domain speech signal corresponding to a single sound source.
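For one subband, step eight amounts to scaling every windowed frame by its estimated mask and overlap-adding with the half-frame shift of step three; summing the resynthesized subbands then gives the separated time-domain signal. A sketch (window-gain compensation is omitted here):

```python
import numpy as np

def apply_mask_and_overlap_add(units, erm, frame_len=320):
    # units: (n_frames, frame_len) windowed left-ear frames of one subband;
    # erm:   (n_frames,) estimated ratio mask per time-frequency unit.
    # Each frame is scaled by its mask and overlap-added with 50% hop.
    hop = frame_len // 2
    n_frames = units.shape[0]
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    for k in range(n_frames):
        out[k * hop : k * hop + frame_len] += erm[k] * units[k]
    return out
```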
The method was verified by simulation, with performance evaluated using the Perceptual Evaluation of Speech Quality (PESQ) of the separated speech. Three other separation algorithms were compared: a speech separation algorithm based on the degenerate unmixing estimation technique DUET, a DNN-based speech separation algorithm using the ideal binary mask IBM, and a DNN-based speech separation algorithm using the ideal ratio mask IRM; the proposed method is denoted IRM-LSTM, and DUET is the traditional separation method among the baselines.
The PESQ values for the four methods are shown in table 1:
TABLE 1 comparison of PESQ values for the four methods
SNR(dB) DUET IBM-DNN IRM-DNN IRM-LSTM
0 1.403 1.467 1.946 1.874
5 1.57 1.656 2.121 2.140
10 1.754 1.834 2.258 2.355
15 1.923 1.982 2.386 2.528
20 2.102 2.119 2.510 2.654
Noiseless 2.628 2.355 2.765 2.795
According to the results of Table 1, the PESQ value of the LSTM-based algorithm is far higher than that of the traditional algorithm and, except at 0 dB SNR, higher than that of the DNN-based algorithms. On this subjective evaluation index the speech separated by the proposed algorithm is clearly superior to the DNN methods, improving speech fullness and subjective listening quality.
The proposed algorithm was also tested at signal-to-noise ratios of -3, 3, 6, 9 and 12 dB; the indexes are shown in Table 2.
TABLE 2 PESQ values at all tested signal-to-noise ratios
SNR(dB) PESQ
-3 1.867
0 1.874
3 2.161
5 2.140
6 2.322
9 2.452
10 2.355
12 2.552
15 2.528
20 2.654
As can be seen from the table, for signal-to-noise ratios not included in training, the PESQ values remain good and close to those at the adjacent trained SNRs, so the LSTM-based algorithm generalizes well to unseen noise SNRs and is robust.
The PESQ index was also tested under reverberation, with reverberation times of 200 ms and 600 ms; the results are shown in Tables 3 and 4:
TABLE 3 comparison of PESQ values for the four methods in a 200 ms reverberation environment
SNR(dB) DUET IBM-DNN IRM-DNN IRM-LSTM
0 1.335 1.413 1.717 1.710
5 1.468 1.593 1.971 2.004
10 1.597 1.758 2.139 2.151
15 1.678 1.865 2.262 2.359
20 1.734 1.932 2.345 2.380
TABLE 4 comparison of PESQ values for the four methods in a 600 ms reverberation environment
SNR(dB) DUET IBM-DNN IRM-DNN IRM-LSTM
0 1.322 1.410 1.664 1.645
5 1.429 1.570 1.913 2.024
10 1.524 1.713 2.069 2.120
15 1.579 1.800 2.180 2.253
20 1.617 1.857 2.252 2.298
According to Tables 3 and 4, the PESQ value of the LSTM-based binaural speech separation algorithm in reverberant environments is higher than that of the IRM-DNN algorithm and significantly higher than those of the other two algorithms, so the algorithm generalizes well to reverberant environments and is robust.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (6)

1. A binaural speech separation method based on a long-time and short-time memory network LSTM, characterized by comprising the following steps:
(1) convolving two different training monaural speech signals with head-related impulse response (HRIR) functions of different azimuths to generate two training single-sound-source binaural speech signals with different azimuths;
(2) mixing the two training single sound source binaural voice signals with different azimuth angles to obtain a mixed training binaural voice signal containing the two sound sources, and simultaneously adding noises with different signal-to-noise ratios to obtain a noise-containing mixed training binaural voice signal containing the two sound sources with different azimuth angles in different acoustic environments;
(3) performing subband filtering, framing and windowing on the noise-containing mixed training binaural voice signal obtained in the step (2) to obtain a training binaural voice signal after each subband is framed, namely each time-frequency unit of the training binaural voice signal;
(4) calculating an interaural cross-correlation function CCF, an interaural time difference ITD and an interaural intensity difference ILD of each time-frequency unit of the training binaural speech signal obtained in the step (3) to serve as spatial features of each time-frequency unit of the training binaural speech signal;
(5) taking the spatial features of each time-frequency unit obtained in step (4), together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, as the input of the long-time and short-time memory network LSTM, taking the ideal masking ratio IRM of the time-frequency unit as the target value of the LSTM network, and training the LSTM network;
(6) processing a mixed test binaural voice signal containing two sound sources with different azimuth angles under different acoustic environments according to the step (3) and the step (4) to obtain spatial characteristics of each time-frequency unit of the test binaural voice signal;
(7) inputting the spatial characteristics of each time-frequency unit obtained in the step (6) and the spatial characteristics of the time-frequency units corresponding to the front and rear 5 frames in the sub-band into a trained LSTM network to obtain an estimated masking ratio ERM of each time-frequency unit;
(8) separating the mixed test binaural speech signal according to the estimated masking ratio ERM obtained in step (7) to obtain the time-domain speech signal corresponding to a single sound source.
2. The long-and-short-term memory network (LSTM) -based binaural speech separation method according to claim 1, characterized by: in the step (1), the calculation formula of the two single-sound-source binaural speech signals with different azimuth angles is as follows:
s1,L(n) = s1(n) * h1,L(n)
s1,R(n) = s1(n) * h1,R(n)
s2,L(n) = s2(n) * h2,L(n)
s2,R(n) = s2(n) * h2,R(n)

wherein s1(n), s2(n) are two different monaural source speech signals, s1,L(n), s1,R(n) represent the single-sound-source left- and right-ear speech signals corresponding to azimuth 1, h1,L, h1,R represent the left- and right-ear HRIRs of azimuth 1, s2,L(n), s2,R(n) represent the single-sound-source left- and right-ear speech signals corresponding to azimuth 2, h2,L, h2,R represent the left- and right-ear HRIRs of azimuth 2, * denotes the convolution operation, and n is the sampling index.
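A toy sketch of this rendering step: each monaural source is convolved with the left- and right-ear HRIRs of its azimuth. The impulse responses below are hypothetical pure delays for illustration, not measured HRIRs:

```python
import numpy as np

def spatialize(s, hrir_L, hrir_R):
    # Render a mono source at one azimuth by convolving it with the
    # left- and right-ear head-related impulse responses.
    return np.convolve(s, hrir_L), np.convolve(s, hrir_R)

# Toy HRIRs (pure delays, an assumption): the right ear receives
# the source 10 samples later than the left ear.
h_L = np.zeros(128); h_L[0] = 1.0
h_R = np.zeros(128); h_R[10] = 1.0
```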
3. The long-and-short-term memory network (LSTM) -based binaural speech separation method according to claim 1, characterized by: the method for calculating the mixed training binaural speech signal including the two sound sources in the step (2) comprises the following steps:
sleft(n)=s1,L(n)+s2,L(n)
sright(n)=s1,R(n)+s2,R(n)
wherein sleft(n), sright(n) are the left- and right-ear signals of the mixed training binaural speech signal containing two sound sources with different azimuths, s1,L(n), s1,R(n) represent the single-sound-source left- and right-ear speech signals corresponding to azimuth 1, and s2,L(n), s2,R(n) represent the single-sound-source left- and right-ear speech signals corresponding to azimuth 2;
the computing method of the noise-containing mixed training binaural speech signal is as follows:
xleft(n)=sleft(n)+vL(n)
xright(n)=sright(n)+vR(n)
wherein xleft(n), xright(n) respectively represent the noisy mixed training left- and right-ear speech signals containing two sound sources with different azimuths, vL(n), vR(n) represent the left- and right-ear noise signals at different signal-to-noise ratios, and vL(n), vR(n) are uncorrelated.
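A sketch of the noise-mixing step; the claim only states that noises at different signal-to-noise ratios are added, so the SNR-based gain rule below is an assumption. Uncorrelated noise realizations would be drawn separately for vL(n) and vR(n):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Scale the noise so that the ratio of clean power to scaled-noise
    # power equals the requested SNR in dB, then add it to the signal.
    gain = np.sqrt(np.mean(clean ** 2) /
                   (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```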
4. The long-and-short-term memory network (LSTM) -based binaural speech separation method according to claim 1, characterized by: the subband filtering calculation method in the step (3) comprises the following steps:
xL(i,n)=xleft(n)*gi(n)
xR(i,n)=xright(n)*gi(n)
wherein xleft(n), xright(n) respectively represent the noisy mixed training left- and right-ear speech signals containing two sound sources with different azimuths, xL(i,n), xR(i,n) represent the time-domain signals of the i-th subband obtained after the subband filter, and gi(n) is the impulse response function of the i-th subband filter;
the framing method comprises: using a predetermined frame length and frame shift, the signals xL(i,n), xR(i,n) are divided into a plurality of single-frame signals xL(i, k·N/2+m), xR(i, k·N/2+m), where k is the frame index, m = 0, 1, …, N-1 is the sample index within a frame, N is the frame length, and the frame shift is half the frame length;
the windowing method comprises the following steps:
xL(i,k,m)=wH(m)xL(i,k·N/2+m)
xR(i,k,m)=wH(m)xR(i,k·N/2+m)
wherein wH(m) represents the window function, and xL(i,k,m), xR(i,k,m) respectively represent the left- and right-ear speech signals of the i-th subband and k-th frame, which serve as the time-frequency units of the training binaural speech signal.
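The framing and windowing of claim 4 for one subband signal can be sketched as follows; the claim leaves wH(m) unspecified, so the Hann window here is an assumption:

```python
import numpy as np

def frame_and_window(x, N=320):
    # Split one subband signal into frames of length N with a
    # half-frame shift and apply the analysis window wH(m).
    hop = N // 2
    w = np.hanning(N)
    n_frames = (len(x) - N) // hop + 1
    return np.stack([w * x[k * hop : k * hop + N] for k in range(n_frames)])
```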
5. The long-and-short-term memory network (LSTM) -based binaural speech separation method according to claim 1, characterized by: the formula for calculating the cross-correlation function CCF in the step (4) is as follows:
CCF(i,k,d) = Σ_{m=0}^{N-1} xL(i,k,m)·xR(i,k,m+d) / sqrt( Σ_{m=0}^{N-1} xL^2(i,k,m) · Σ_{m=0}^{N-1} xR^2(i,k,m) ),  -L ≤ d ≤ L
wherein xL(i,k,m), xR(i,k,m) represent the time-frequency units of the i-th subband and k-th frame of the training binaural speech signal, CCF(i,k,d) represents the cross-correlation function of the time-frequency unit of the i-th subband and k-th frame of the binaural speech signal, d is the delay in sampling points, L is the maximum delay in sampling points, and N is the frame length;
the interaural time difference ITD is calculated as:

ITD(i,k) = argmax_d CCF(i,k,d),  -L ≤ d ≤ L
the interaural intensity difference ILD is calculated as:

ILD(i,k) = 10·log10( Σ_{m=0}^{N-1} xL^2(i,k,m) / Σ_{m=0}^{N-1} xR^2(i,k,m) )
the time-frequency unit space characteristic F (i, k) is composed of CCF, ITD and ILD:
F(i,k) = [CCF(i,k,-L), CCF(i,k,-L+1), …, CCF(i,k,L), ITD(i,k), ILD(i,k)]
and F (i, k) represents a spatial feature vector corresponding to the time-frequency unit of the i-th subband and the k-th frame binaural speech signal.
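A sketch of the claim-5 feature extraction for one time-frequency unit. The normalized CCF, arg-max ITD and log-energy-ratio ILD follow the formulas above; the circular shift used for lagging is an approximation, and L = 16 is an assumed maximum lag:

```python
import numpy as np

def spatial_features(x_L, x_R, L=16, eps=1e-12):
    # Normalized cross-correlation over lags -L..L (circular shift as
    # an approximation of zero-padded lagging), ITD as the lag that
    # maximizes the CCF, ILD as the log energy ratio of the two ears.
    norm = np.sqrt(np.sum(x_L ** 2) * np.sum(x_R ** 2)) + eps
    ccf = np.array([np.sum(x_L * np.roll(x_R, d))
                    for d in range(-L, L + 1)]) / norm
    itd = float(np.argmax(ccf) - L)
    ild = 10 * np.log10((np.sum(x_L ** 2) + eps) / (np.sum(x_R ** 2) + eps))
    return np.concatenate([ccf, [itd, ild]])  # 2L+3 = 35 features for L=16
```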
6. The long-and-short-term memory network (LSTM) -based binaural speech separation method according to claim 1, characterized by: the LSTM network training process of step (5) specifically includes:
(5-1) constructing the LSTM network, which consists of an input layer, an LSTM layer and an output layer; the input layer contains the input at each time step and the output layer contains the output at each time step; the LSTM layer contains an LSTM unit at each time step; the time step of the LSTM is set to 11, namely the current frame together with the 5 preceding and 5 following frames, and each LSTM unit in the LSTM layer is bidirectionally connected to the LSTM units of the preceding and following time steps;
(5-2) randomly initializing the weight of the LSTM network layer;
(5-3) calculating the ideal masking ratio (IRM) of the current time-frequency unit as the target value of the LSTM network; the ideal masking ratio IRM is calculated according to the following formula:
taking the left channel as the separation channel, the single-sound-source left-ear speech signal s1,L(n) at azimuth 1 corresponding to the current time-frequency unit is subband-filtered, framed and windowed to obtain the left-ear speech signal sL(i,k,m) of the i-th subband and k-th frame, and the IRM is calculated from sL(i,k,m):

IRM(i,k) = sqrt( Σ_{m=0}^{N-1} sL^2(i,k,m) / Σ_{m=0}^{N-1} xL^2(i,k,m) )
(5-4) inputting the spatial features of the current time-frequency unit, together with the spatial features of the time-frequency units of the 5 preceding and 5 following frames in the same subband, into the input layer of the LSTM network, namely [F(i,k-5), F(i,k-4), …, F(i,k), …, F(i,k+4), F(i,k+5)];
(5-5) obtaining the output of the LSTM network, namely the estimated masking ratio ERM, according to the forward-propagation algorithm, and calculating the loss function from the difference between the ERM and the ideal masking ratio IRM:

J = E[ ||ERM(i,k) - IRM(i,k)||_2^2 ]

wherein E[·] represents the expectation operation and ||·||_2 represents the L2 norm;
(5-6) calculating the partial derivatives of the loss function J with respect to the network weights by the back-propagation algorithm, and updating the network weights;
and (5-7) if the current iteration count is less than the preset total number of iterations, returning to step (5-4) and continuing to input the spatial features of subsequent time-frequency units; once the preset number of iterations is reached, the iteration ends and the LSTM network training is complete.
CN201910930176.XA 2019-09-29 2019-09-29 Binaural speech separation method based on long-time and short-time memory network L STM Active CN110728989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910930176.XA CN110728989B (en) 2019-09-29 2019-09-29 Binaural speech separation method based on long-time and short-time memory network L STM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910930176.XA CN110728989B (en) 2019-09-29 2019-09-29 Binaural speech separation method based on long-time and short-time memory network L STM

Publications (2)

Publication Number Publication Date
CN110728989A true CN110728989A (en) 2020-01-24
CN110728989B CN110728989B (en) 2020-07-14

Family

ID=69219570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910930176.XA Active CN110728989B (en) 2019-09-29 2019-09-29 Binaural speech separation method based on long-time and short-time memory network L STM

Country Status (1)

Country Link
CN (1) CN110728989B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160111108A1 (en) * 2014-10-21 2016-04-21 Mitsubishi Electric Research Laboratories, Inc. Method for Enhancing Audio Signal using Phase Information
CN108091345A (en) * 2017-12-27 2018-05-29 东南大学 A kind of ears speech separating method based on support vector machines
CN108899047A (en) * 2018-08-20 2018-11-27 百度在线网络技术(北京)有限公司 The masking threshold estimation method, apparatus and storage medium of audio signal
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sounnd source direction estimation method based on time-frequency masking and deep neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WENINGER F, ERDOGAN H, WATANABE S: "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR", Proceedings of the 12th International Conference on Latent Variable Analysis and Signal Separation *
LI Lujun: "Research on speech enhancement technology based on deep learning", Master's thesis, PLA Information Engineering University *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111707990A (en) * 2020-08-19 2020-09-25 东南大学 Binaural sound source positioning method based on dense convolutional network
CN111707990B (en) * 2020-08-19 2021-05-14 东南大学 Binaural sound source positioning method based on dense convolutional network
CN112216301A (en) * 2020-11-17 2021-01-12 东南大学 Deep clustering voice separation method based on logarithmic magnitude spectrum and interaural phase difference
CN113079452A (en) * 2021-03-30 2021-07-06 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, audio direction information generating method, electronic device, and medium
CN113327624A (en) * 2021-05-25 2021-08-31 西北工业大学 Method for intelligently monitoring environmental noise by adopting end-to-end time domain sound source separation system
CN113327624B (en) * 2021-05-25 2023-06-23 西北工业大学 Method for intelligent monitoring of environmental noise by adopting end-to-end time domain sound source separation system
CN113936681A (en) * 2021-10-13 2022-01-14 东南大学 Voice enhancement method based on mask mapping and mixed hole convolution network
CN113936681B (en) * 2021-10-13 2024-04-09 东南大学 Speech enhancement method based on mask mapping and mixed cavity convolution network
CN113823309B (en) * 2021-11-22 2022-02-08 成都启英泰伦科技有限公司 Noise reduction model construction and noise reduction processing method
CN113823309A (en) * 2021-11-22 2021-12-21 成都启英泰伦科技有限公司 Noise reduction model construction and noise reduction processing method
CN114446316A (en) * 2022-01-27 2022-05-06 腾讯科技(深圳)有限公司 Audio separation method, and training method, device and equipment of audio separation model
CN114446316B (en) * 2022-01-27 2024-03-12 腾讯科技(深圳)有限公司 Audio separation method, training method, device and equipment of audio separation model
CN115862676A (en) * 2023-02-22 2023-03-28 南方电网数字电网研究院有限公司 Voice superposition detection method and device based on deep learning and computer equipment

Also Published As

Publication number Publication date
CN110728989B (en) 2020-07-14

Similar Documents

Publication Publication Date Title
CN110728989B (en) Binaural speech separation method based on long-time and short-time memory network L STM
CN109164415B (en) Binaural sound source positioning method based on convolutional neural network
Vecchiotti et al. End-to-end binaural sound localisation from the raw waveform
CN109410976B (en) Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aid
CN107942290B (en) Binaural sound sources localization method based on BP neural network
Roman et al. Speech segregation based on sound localization
CN112151059A (en) Microphone array-oriented channel attention weighted speech enhancement method
CN110517705A (en) A kind of binaural sound sources localization method and system based on deep neural network and convolutional neural networks
CN105869651A (en) Two-channel beam forming speech enhancement method based on noise mixed coherence
CN108122559A (en) Binaural sound sources localization method based on deep learning in a kind of digital deaf-aid
Wang et al. Mask weighted STFT ratios for relative transfer function estimation and its application to robust ASR
Dadvar et al. Robust binaural speech separation in adverse conditions based on deep neural network with modified spatial features and training target
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
CN111948609B (en) Binaural sound source positioning method based on Soft-argmax regression device
CN111707990B (en) Binaural sound source positioning method based on dense convolutional network
Liu et al. Head‐related transfer function–reserved time‐frequency masking for robust binaural sound source localization
CN112201276B (en) TC-ResNet network-based microphone array voice separation method
CN112216301B (en) Deep clustering voice separation method based on logarithmic magnitude spectrum and interaural phase difference
CN112731291B (en) Binaural sound source localization method and system for collaborative two-channel time-frequency mask estimation task learning
Talagala et al. Binaural localization of speech sources in the median plane using cepstral HRTF extraction
Youssef et al. Binaural speaker recognition for humanoid robots
Feng et al. Preservation Of Interaural Level Difference Cue In A Deep Learning-Based Speech Separation System For Bilateral And Bimodal Cochlear Implants Users
CN112346013B (en) Binaural sound source positioning method based on deep learning
Jiang et al. A DNN parameter mask for the binaural reverberant speech segregation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant