CN112259120B - Single-channel human voice and background voice separation method based on convolution cyclic neural network - Google Patents

Single-channel human voice and background voice separation method based on convolution cyclic neural network

Info

Publication number
CN112259120B
Authority
CN
China
Prior art keywords
voice
neural network
time
background
human voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011119804.5A
Other languages
Chinese (zh)
Other versions
CN112259120A (en)
Inventor
孙超 (Sun Chao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Guiji Intelligent Technology Co ltd
Original Assignee
Nanjing Guiji Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Guiji Intelligent Technology Co ltd filed Critical Nanjing Guiji Intelligent Technology Co ltd
Priority to CN202011119804.5A priority Critical patent/CN112259120B/en
Publication of CN112259120A publication Critical patent/CN112259120A/en
Application granted granted Critical
Publication of CN112259120B publication Critical patent/CN112259120B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L21/0308: Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Filters That Use Time-Delay Elements (AREA)

Abstract

The invention discloses a method for separating single-channel human voice from background voice based on a convolution cyclic neural network, which comprises the following steps: S1, acquiring an original mixed voice signal; S2, obtaining the amplitude spectrum and the phase spectrum of the original mixed signal; S3, inputting the original mixed signal amplitude spectrum into a convolutional neural network; S4, inputting the resulting low-resolution feature map together with the original mixed signal amplitude spectrum into a recurrent neural network and applying a time-frequency mask to obtain the predicted value of the human voice after the time-frequency mask and the predicted value of the background voice after the time-frequency mask; and S5, combining the predicted value of the human voice after the time-frequency mask and the predicted value of the background voice after the time-frequency mask, respectively, with the phase spectrum of the original mixed signal to obtain a predicted human voice signal and a predicted background voice signal. Compared with the prior art, the separation method provided by the invention captures both the time-domain and frequency-domain information of the voice and generates multi-scale features to separate the human voice signal and the background voice signal of the mixed voice.

Description

Single-channel human voice and background voice separation method based on convolution cyclic neural network
Technical Field
The invention relates to human voice and background voice separation, in particular to a single-channel human voice and background voice separation method based on a convolution cyclic neural network.
Background
The purpose of voice separation is to separate the target voice from background interference. Since the voice collected by a microphone may contain noise, other speakers' voices, background music and other interference, performing recognition directly without separation may reduce recognition accuracy. Source separation therefore has important value in speech signal processing and automatic speech recognition, and the single-channel separation of human voice from background music is a basic and important branch of voice separation.
In recent years, with improvements in software and hardware performance and the popularization of machine learning algorithms, deep learning has gradually achieved very strong results in fields such as natural language processing and image processing. Deep-learning-based speech separation learns the characteristics of speech, speakers and noise from training data and constructs an overall neural network to achieve the goal of speech separation. Voice information is embodied simultaneously in the time domain and the frequency domain, and both the time-domain and frequency-domain information of voice are valuable feature information. For voice separation, however, most deep learning methods rely on a single convolutional neural network or a single recurrent neural network, there is no unified general framework, the time-domain and frequency-domain information in the mixed voice cannot be accurately extracted, and the separation of the human voice and the background voice of the mixed voice is poor.
Disclosure of Invention
The invention aims to overcome the defects that the prior art cannot accurately extract the time-domain and frequency-domain information in voice and that the separation of human voice and background sound in mixed voice is poor, and provides a single-channel human voice and background voice separation method based on a convolution cyclic neural network.
The purpose of the invention is mainly realized by the following technical scheme:
a single-channel human voice and background voice separation method based on a convolution cyclic neural network comprises the following steps:
s1, acquiring an original mixed voice signal, wherein the original mixed voice signal is a single-channel mixed signal of human voice and background voice;
s2, performing framing windowing and time-frequency conversion on the obtained original mixed voice signal to obtain an original mixed signal amplitude spectrum and an original mixed signal phase spectrum;
s3, inputting the original mixed signal amplitude spectrum into a convolutional neural network, wherein the convolutional neural network comprises a convolutional layer and a pooling layer which are sequentially arranged; the convolution layer obtains local characteristics of an original mixed signal amplitude spectrum, and the pooling layer reduces the dimension of the characteristics, converts the characteristics into a low-resolution characteristic diagram and outputs the characteristic diagram; the convolutional layer comprises two layers, and convolutional kernels in the two layers of convolutional layers are different in size;
s4, inputting the low-resolution feature map and the original mixed signal amplitude spectrum into a recurrent neural network, and combining a time-frequency mask to obtain a predicted value of human voice after the human voice passes through the time-frequency mask and a predicted value of background voice after the background voice passes through the time-frequency mask;
s5, respectively combining the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background voice after passing through the time-frequency mask with the phase spectrum of the original mixed signal, and respectively obtaining a predicted human voice signal and a predicted background voice signal through inverse Fourier transform;
and the convolutional neural network and the cyclic neural network are both provided with original mixed signal amplitude spectrum channels.
In the prior art, when deep learning is adopted for voice separation, the time-domain and frequency-domain information of the voice in the mixed voice cannot be accurately extracted, and the mixed voice separation effect is poor. To solve this technical problem, the technical scheme provides a voice separation method: a convolutional neural network serves as the front end and receives the single-channel mixed signal of human voice and background voice, its purpose being to reduce the dimensionality of the spectrogram and extract its local features; the extracted features are then combined with the original spectrogram into multi-scale features and sent together into a back-end recurrent neural network; finally, a time-frequency mask is used to obtain the predicted spectrograms of the human voice and the background music. According to the characteristics of audio signals, two convolution kernels of different sizes are designed in the convolutional neural network in order to capture the context information of the time domain and the frequency domain from the input spectrogram, so that the human voice signal and the background voice signal of the mixed voice can be accurately separated. When the convolution operation is performed on the original mixed signal amplitude spectrum, the first convolution acts on the time domain and the frequency domain of the spectrogram separately, and the two resulting features are fused immediately after the first convolution is completed to facilitate the subsequent convolution operations. As the number of convolutional layers increases, the network becomes deeper, so the fused time-domain and frequency-domain features are further expressed at a deeper level. Specifically, on the basis of the obtained audio-signal features, the technical scheme adds a direct connection channel and an original mixed signal amplitude spectrum channel, and sends the "multi-scale information" formed by combining the extracted features with the original spectrogram into the back-end recurrent neural network. Here, scale refers to the resolution of the image: the resolution of the original spectrogram is 513 × 64, and the image after the convolution and pooling operations becomes 256 × 64, i.e., unchanged in the time dimension and halved in the frequency dimension. The two are combined by aligning the time dimension and concatenating along the frequency dimension, so the combined dimension is (513 + 256) × 64. This increases the input of lower-resolution features while maintaining the integrity of the original overall feature information, reflecting the complementarity of the two. In the design of the convolutional neural network, in order to compress the amount of data and the number of parameters, a pooling layer is used after the convolutional layers for feature dimensionality reduction, which reduces overfitting of the network and improves the generalization of the model. The neural network model of the technical scheme does not change the phase of the original voice spectrogram; instead, the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background voice after passing through the time-frequency mask are respectively combined with the phase spectrum of the original mixed signal, the predicted human voice signal and the predicted background voice signal are obtained through the inverse Fourier transform, and the time-frequency mask adds the constraint that the sum of the prediction results equals the original mixed signal. In summary, the technical scheme designs two filters of different shapes in the convolutional neural network to capture the time-domain and frequency-domain information of the voice, uses a pooling layer to reduce the feature dimensionality and extract local features, and combines these features with the original mixed signal amplitude spectrum into multi-scale features that are input into the recurrent neural network, so that the human voice signal and the background voice signal of the mixed voice can be accurately separated.
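The following is a minimal NumPy sketch of the multi-scale combination step just described, assuming arrays shaped (frequency, time); the variable names are illustrative only and are not taken from the patent.

import numpy as np

# Shapes follow the text: the original magnitude spectrogram is 513 x 64
# (frequency x time) and the convolution + pooling output is 256 x 64.
original_spec = np.random.rand(513, 64).astype(np.float32)    # original amplitude spectrum
pooled_features = np.random.rand(256, 64).astype(np.float32)  # low-resolution feature map

# The time axes are already aligned (both have 64 frames), so the two inputs
# are stacked along the frequency axis, giving a (513 + 256) x 64 = 769 x 64
# input for the recurrent back end.
multi_scale = np.concatenate([original_spec, pooled_features], axis=0)
assert multi_scale.shape == (769, 64)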
It should be noted that in the technical scheme the time-frequency conversion adopts the short-time Fourier transform, and the structure adopted by the recurrent neural network is the GRU; the GRU model is simpler than the standard LSTM model, has fewer parameters and is less prone to overfitting. The convolution cyclic neural network is a neural network that jointly employs a convolutional neural network and a recurrent neural network.
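A minimal PyTorch sketch of such a GRU back end is given below; the hidden size, the number of layers and the two linear output heads are illustrative assumptions, since the text only states that a GRU is used and that the network produces outputs for the human voice and the background sound.

import torch
import torch.nn as nn

class RecurrentBackEnd(nn.Module):
    def __init__(self, input_dim=769, hidden_dim=512, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, num_layers, batch_first=True)
        # Two linear heads: one produces the network output for the human
        # voice, the other for the background sound (513 frequency bins each).
        self.voice_head = nn.Linear(hidden_dim, 513)
        self.background_head = nn.Linear(hidden_dim, 513)

    def forward(self, x):                  # x: (batch, time, 769) multi-scale features
        h, _ = self.gru(x)                 # h: (batch, time, hidden_dim)
        return self.voice_head(h), self.background_head(h)

# Example: a batch of 8 utterances, 64 frames of 769-dimensional features.
z1, z2 = RecurrentBackEnd()(torch.randn(8, 64, 769))
print(z1.shape, z2.shape)                  # torch.Size([8, 64, 513]) for both outputs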
Furthermore, the convolution kernels in the two convolutional layers are rectangular strip-shaped filters. Because an ordinary square convolution kernel cannot make good use of the time-frequency feature information of audio, the convolution kernels in the technical scheme are two rectangular strip-shaped filters.
Further, the convolution kernel size of the first convolutional layer is 2 × 10, and the convolution kernel size of the second convolutional layer is 10 × 2. When processing sequential data such as voice, a common convolution kernel such as 3 × 3 cannot make full and effective use of the characteristics of voice: a 3 × 3 kernel is intended for ordinary images, whose horizontal and vertical axes have no specific physical meaning, whereas in the amplitude spectrum of the original mixed signal the horizontal coordinate represents time and the vertical coordinate represents frequency. The technical scheme therefore uses the 2 × 10 convolution kernel to extract the frequency-domain features of the voice signal and the 10 × 2 convolution kernel to extract the time-domain features.
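A minimal PyTorch sketch of the two rectangular kernels is given below; the number of output channels and the fusion by channel concatenation are assumptions, since the text fixes only the kernel sizes.

import torch
import torch.nn as nn

# Two rectangular strip-shaped convolution kernels, 2 x 10 and 10 x 2.
freq_conv = nn.Conv2d(1, 16, kernel_size=(2, 10), padding=(1, 5))
time_conv = nn.Conv2d(1, 16, kernel_size=(10, 2), padding=(5, 1))

spec = torch.randn(1, 1, 513, 64)          # (batch, channel, frequency, time)
f_freq = freq_conv(spec)
f_time = time_conv(spec)

# Crop back to the input size (one extra row/column comes from the even
# kernel sizes) and fuse the two feature maps before the next layer.
fused = torch.cat([f_freq[..., :513, :64], f_time[..., :513, :64]], dim=1)
print(fused.shape)                          # torch.Size([1, 32, 513, 64])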
Further, a batch normalization layer is arranged after the two convolutional layers, and the batch normalization layer uses a Leaky-ReLU activation function, whose formula is as follows:

f(x) = x when x ≥ 0, and f(x) = x/a when x < 0,

where x is the independent variable and a is a fixed parameter in the interval (1, +∞). Setting a batch normalization layer shortens the training time of the model, so that the model converges more quickly.
Further, the convolution kernel size of the pooling layer is 2 × 1. In the technical scheme, setting the pooling kernel to 2 × 1 means that after the features pass through the pooling layer the time dimension is unchanged and the frequency dimension is halved. This reduces the data volume and continuously shrinks the spatial size of the data, thereby reducing the number of parameters and the amount of computation and controlling overfitting to a certain extent.
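The following sketch chains the batch normalization layer, the Leaky-ReLU activation and the 2 × 1 max pooling; the channel count and the value a = 5 are illustrative assumptions, since the text only requires a to lie in (1, +∞).

import torch
import torch.nn as nn

a = 5.0
stage = nn.Sequential(
    nn.BatchNorm2d(32),                       # 32 channels assumed from the fused conv output
    nn.LeakyReLU(negative_slope=1.0 / a),     # f(x) = x for x >= 0, x / a for x < 0
    nn.MaxPool2d(kernel_size=(2, 1)),         # halves the frequency axis, keeps the time axis
)

x = torch.randn(1, 32, 513, 64)               # (batch, channels, frequency, time)
y = stage(x)
print(y.shape)                                # torch.Size([1, 32, 256, 64])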
Further, the spectrogram size of the input of the convolutional neural network in S3 is 10 × 513, where 513=1024/2+1 and the sampling rate is 16000 Hz.
The original mixed speech signal read from the audio file is a one-dimensional array whose length is determined by the audio duration and the sampling rate. After framing and windowing, a number of frames are obtained; each frame undergoes a fast Fourier transform, converting the time-domain signal into a frequency-domain signal, and the frequency-domain results of all frames are stacked in time to obtain the spectrogram. The Fourier transform has N frequency points; due to its symmetry, N/2+1 points are taken when N is even and (N+1)/2 points when N is odd. In the technical scheme, 10-frame inputs are used to model and train the convolution cyclic neural network, the Fourier transform window length n_fft is set to 1024 points, and a 50% overlap is used to extract the spectral representation. The input to the neural network is therefore a spectrogram of size 10 × 513, where 513 = 1024/2 + 1, and the sampling rate is 16000 Hz.
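A minimal sketch of this time-frequency conversion (and of the phase-preserving reconstruction used in S5) is shown below using librosa; any STFT implementation with the same settings could be substituted, and the random array only stands in for a real mixture.

import numpy as np
import librosa

sr = 16000
mixture = np.random.randn(sr * 4).astype(np.float32)          # stand-in for a 4-second mixed signal

# n_fft = 1024 with 50% overlap (hop length 512) gives 1024/2 + 1 = 513 bins per frame.
stft = librosa.stft(mixture, n_fft=1024, hop_length=512, win_length=1024)
magnitude, phase = np.abs(stft), np.angle(stft)
print(magnitude.shape)                                          # (513, number_of_frames)

# S5: a predicted magnitude spectrum is recombined with the ORIGINAL phase and
# inverted; here the unmodified magnitude is used as a placeholder prediction.
predicted_magnitude = magnitude
reconstructed = librosa.istft(predicted_magnitude * np.exp(1j * phase),
                              hop_length=512, win_length=1024)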
Furthermore, an attention layer is arranged between the convolutional layers and the pooling layer of the convolutional neural network. The attention layer automatically learns the importance of each feature channel, increases the weight of useful feature channels according to this importance, and reduces the weight of feature channels that are of little use to the current task.
In the prior art, when a neural network is used for separation, network performance is generally improved along the spatial dimension, i.e., by combining multi-scale feature information or feature maps of different resolutions, while the relationship among feature channels receives little attention. Through research on mixed voice separation, the inventor found that the number of output channels of the network can be increased by setting the number of convolution kernels; however, not all channels are equally important, and too many redundant feature channels degrade the expressive capability of the network.
Further, the attention layer performs global pooling using the maximum pooling method.
In the technical scheme, the attention layer adopts maximum pooling for global pooling so that the weights corresponding to different feature channels are distinguishable. The convolutional recurrent neural network structure provided by the technical scheme can be divided into convolutional layers, an attention layer, a pooling layer and a recurrent layer, where the attention layer adopts global maximum pooling, the pooling layer adopts maximum pooling only, and the recurrent layer is the recurrent neural network.
Further, the convolutional neural network and the cyclic neural network use a mean square error loss function, as follows:

J_MSE = ‖ŷ₁(t) − y₁(t)‖² + ‖ŷ₂(t) − y₂(t)‖²

where ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, and y₁(t) and y₂(t) respectively represent the real values of the human voice and the background sound; or a loss function combining the mean square error and the source-to-interference ratio is adopted, as follows:

J = ‖ŷ₁(t) − y₁(t)‖² + ‖ŷ₂(t) − y₂(t)‖² − γ(‖ŷ₁(t) − y₂(t)‖² + ‖ŷ₂(t) − y₁(t)‖²)

where γ is a hyper-parameter, ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, and y₁(t) and y₂(t) respectively represent the real values of the human voice and the background sound. Here time t refers to the t-th frame.
Preferably, the technical scheme adopts the loss function combining the mean square error and the source-to-interference ratio, which not only makes the predicted human voice signal closer to the real human voice signal but also makes the predicted human voice signal contain less background signal; preferably, γ is 0.05.
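The following PyTorch sketch implements the two loss functions written above with γ = 0.05; the tensor shapes are illustrative assumptions, and the cross-term form of the combined loss follows the standard discriminative objective described here.

import torch

def mse_loss(y1_hat, y2_hat, y1, y2):
    # Mean square error between masked predictions and the real sources.
    return ((y1_hat - y1) ** 2).sum() + ((y2_hat - y2) ** 2).sum()

def mse_sir_loss(y1_hat, y2_hat, y1, y2, gamma=0.05):
    # Combined loss: same-source error minus gamma times the cross-source error.
    same = ((y1_hat - y1) ** 2).sum() + ((y2_hat - y2) ** 2).sum()
    cross = ((y1_hat - y2) ** 2).sum() + ((y2_hat - y1) ** 2).sum()
    return same - gamma * cross

y1_hat, y2_hat = torch.rand(8, 64, 513), torch.rand(8, 64, 513)
y1, y2 = torch.rand(8, 64, 513), torch.rand(8, 64, 513)
print(mse_loss(y1_hat, y2_hat, y1, y2), mse_sir_loss(y1_hat, y2_hat, y1, y2))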
Further, in S4, the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background sound after passing through the time-frequency mask are calculated as follows:

ŷ₁(t) = ( |z₁(t)| / (|z₁(t)| + |z₂(t)|) ) ⊙ X(t)

ŷ₂(t) = ( |z₂(t)| / (|z₁(t)| + |z₂(t)|) ) ⊙ X(t)

where ⊙ is defined as element-wise multiplication, ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, X(t) is the amplitude spectrum of the original mixed signal, z₁(t) is the output for the human voice at time t predicted by the convolution cyclic neural network, z₂(t) is the output of the convolution cyclic neural network for the background sound at time t, |z₁(t)|/(|z₁(t)| + |z₂(t)|) is the time-frequency mask of the predicted value of the human voice, and |z₂(t)|/(|z₁(t)| + |z₂(t)|) is the time-frequency mask of the predicted value of the background sound. In the technical scheme, the time-frequency masking technique is used to further smooth the source separation result, so that the sum of the prediction results satisfies the constraint of being equal to the original mixture.
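A minimal PyTorch sketch of this masking step follows; z1 and z2 stand for the network outputs for the human voice and the background sound, and the small epsilon term is an implementation detail added here for numerical stability rather than part of the formula above.

import torch

def apply_masks(z1, z2, X, eps=1e-8):
    denom = z1.abs() + z2.abs() + eps
    m1 = z1.abs() / denom                   # time-frequency mask for the human voice
    m2 = z2.abs() / denom                   # time-frequency mask for the background sound
    y1_hat = m1 * X                         # element-wise multiplication with the mixture spectrum
    y2_hat = m2 * X
    return y1_hat, y2_hat                   # y1_hat + y2_hat equals X up to eps, by construction

z1, z2 = torch.randn(8, 64, 513), torch.randn(8, 64, 513)
X = torch.rand(8, 64, 513)
y1_hat, y2_hat = apply_masks(z1, z2, X)
print(torch.allclose(y1_hat + y2_hat, X, atol=1e-5))   # True (up to eps)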
In conclusion, compared with the prior art, the invention has the following beneficial effects:
1. The invention provides a voice separation method in which a convolutional neural network serves as the front end and receives the single-channel mixed signal of human voice and background voice, its purpose being to reduce the dimensionality of the spectrogram and extract its local features; the extracted features are then combined with the original spectrogram into multi-scale features and sent together into a back-end recurrent neural network; finally, the predicted spectrograms of the human voice and the background music are obtained using a time-frequency mask. The invention designs two convolution kernels of different shapes in the convolutional neural network according to the characteristics of audio signals, in order to capture the context information of the time domain and the frequency domain from the input spectrogram, so that the human voice signal and the background voice signal of the mixed voice can be accurately separated.
2. By adding the attention layer, the method automatically learns the importance of each feature channel, then promotes useful features according to this importance and suppresses features that are of little use to the current task, so that the feature channels after convolution have different weights and the weights corresponding to redundant feature channels are correspondingly reduced, thereby improving the expressive capability of the network.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flow chart of a single-channel human voice and background voice separation method based on a convolution cyclic neural network;
fig. 2 is a schematic diagram of an attention layer of a single-channel human voice and background voice separation method based on a convolution cyclic neural network.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1:
a single-channel human voice and background voice separation method based on a convolution cyclic neural network comprises the following steps:
s1, acquiring an original mixed voice signal, wherein the original mixed voice signal is a single-channel mixed signal of human voice and background voice;
s2, performing framing windowing and time-frequency conversion on the obtained original mixed voice signal to obtain an original mixed signal amplitude spectrum and an original mixed signal phase spectrum;
s3, inputting the original mixed signal amplitude spectrum into a convolutional neural network, wherein the convolutional neural network comprises a convolutional layer and a pooling layer which are sequentially arranged; the convolution layer obtains local characteristics of an original mixed signal amplitude spectrum, and the pooling layer reduces the dimension of the characteristics, converts the characteristics into a low-resolution characteristic diagram and outputs the characteristic diagram; the convolutional layer comprises two layers, and convolutional kernels in the two layers of convolutional layers are different in size;
s4, inputting the low-resolution feature map and the original mixed signal amplitude spectrum into a recurrent neural network, and combining a time-frequency mask to obtain a predicted value of human voice after the human voice passes through the time-frequency mask and a predicted value of background voice after the background voice passes through the time-frequency mask;
s5, respectively combining the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background voice after passing through the time-frequency mask with the phase spectrum of the original mixed signal, and respectively obtaining a predicted human voice signal and a predicted background voice signal through inverse Fourier transform;
and the convolutional neural network and the cyclic neural network are both provided with original mixed signal amplitude spectrum channels.
Preferably, the convolution kernels in the two convolution layers are both rectangular strip-shaped filters.
Preferably, the convolution kernel size of the first layer of convolutional layers is 2 × 10, and the convolution kernel size of the second layer of convolutional layers is 10 × 2.
Preferably, a batch normalization layer is arranged after the two convolutional layers, and the batch normalization layer uses a Leaky-ReLU activation function, whose formula is as follows:

f(x) = x when x ≥ 0, and f(x) = x/a when x < 0,

where x is the independent variable and a is a fixed parameter in the interval (1, +∞).
Preferably, the pooled layer convolution kernel size is 2 × 1.
Preferably, the spectrogram size of the input of the convolutional neural network in S3 is 10 × 513, where 513=1024/2+1 and the sampling rate is 16000 Hz.
Preferably, the convolutional neural network and the cyclic neural network use a mean square error loss function, as follows:

J_MSE = ‖ŷ₁(t) − y₁(t)‖² + ‖ŷ₂(t) − y₂(t)‖²

where ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, and y₁(t) and y₂(t) respectively represent the real values of the human voice and the background sound; or a loss function combining the mean square error and the source-to-interference ratio is adopted, as follows:

J = ‖ŷ₁(t) − y₁(t)‖² + ‖ŷ₂(t) − y₂(t)‖² − γ(‖ŷ₁(t) − y₂(t)‖² + ‖ŷ₂(t) − y₁(t)‖²)

where γ is a hyper-parameter, ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, and y₁(t) and y₂(t) respectively represent the real values of the human voice and the background sound. Preferably, the loss function combining the mean square error and the source-to-interference ratio is adopted, which not only makes the predicted human voice signal closer to the real human voice signal but also makes the predicted human voice signal contain less background signal; preferably, γ is 0.05.
Preferably, in S4, the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background sound after passing through the time-frequency mask are calculated as follows:

ŷ₁(t) = ( |z₁(t)| / (|z₁(t)| + |z₂(t)|) ) ⊙ X(t)

ŷ₂(t) = ( |z₂(t)| / (|z₁(t)| + |z₂(t)|) ) ⊙ X(t)

where ⊙ is defined as element-wise multiplication, ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, X(t) is the amplitude spectrum of the original mixed signal, z₁(t) is the output for the human voice at time t predicted by the convolution cyclic neural network, z₂(t) is the output of the convolution cyclic neural network for the background sound at time t, |z₁(t)|/(|z₁(t)| + |z₂(t)|) is the time-frequency mask of the predicted value of the human voice, and |z₂(t)|/(|z₁(t)| + |z₂(t)|) is the time-frequency mask of the predicted value of the background sound.
The voice separation method provided by this embodiment uses a convolutional neural network as the front end and inputs the single-channel mixed signal of human voice and background voice, with the purpose of reducing the dimensionality of the spectrogram and extracting its local features; the extracted features are then combined with the original spectrogram into multi-scale features and sent into the back-end recurrent neural network; finally, the predicted spectrograms of the human voice and the background music are obtained using a time-frequency mask. The number of convolution kernels can be set to increase the output channels of the network, and in this embodiment two convolution kernels of different sizes are designed in the convolutional neural network according to the characteristics of the audio signal, in order to capture the context information of the time domain and the frequency domain from the input spectrogram and accurately separate the human voice signal and the background voice signal of the mixed voice. Specifically, this embodiment selects two rectangular strip-shaped filters as convolution kernels, sets their number to increase the output channels of the network, and improves the expressive capability of the network through the choice of kernel sizes. On the basis of the obtained audio-signal features, a direct connection channel and an original mixed signal amplitude spectrum channel are added, and the multi-scale information formed by combining the extracted features with the original spectrogram is used, so that the input of lower-resolution features is increased while the integrity of the original overall feature information is maintained, reflecting the complementarity of the two. In the design of the convolutional neural network, in order to compress the amount of data and the number of parameters, a pooling layer is used after the convolutional layers for feature dimensionality reduction, which reduces overfitting of the network and improves the generalization of the model, and a batch normalization layer is set to shorten the training time of the model so that it converges more quickly. The neural network model of this embodiment does not change the phase of the original speech spectrogram; instead, the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background voice after passing through the time-frequency mask are respectively combined with the phase spectrum of the original mixed signal, the predicted human voice signal and the predicted background voice signal are obtained through the inverse Fourier transform, and the time-frequency mask is used to add the constraint that the sum of the prediction results equals the original mixed signal. In this embodiment, two filters of different shapes are designed in the convolutional neural network to capture the time-domain and frequency-domain information of the voice, a pooling layer is used to reduce the feature dimensionality and extract local features, and these features are combined with the original mixed signal amplitude spectrum into multi-scale features that are input into the recurrent neural network, so that the human voice signal and the background voice signal of the mixed voice can be accurately separated. A complete forward pass assembling these stages is sketched below.
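The sketch below assembles the stages of this embodiment into a single forward pass; the channel counts, the hidden size, the value a = 5 and the way the pooled feature channels are collapsed before concatenation are illustrative assumptions, while the kernel and pooling sizes, the 513-bin spectra and the masking constraint follow the text.

import torch
import torch.nn as nn

class ConvRecurrentSeparator(nn.Module):
    def __init__(self, freq_bins=513, channels=16, hidden=512, a=5.0):
        super().__init__()
        self.freq_conv = nn.Conv2d(1, channels, (2, 10), padding=(1, 5))   # frequency-oriented kernel
        self.time_conv = nn.Conv2d(1, channels, (10, 2), padding=(5, 1))   # time-oriented kernel
        self.post = nn.Sequential(
            nn.BatchNorm2d(2 * channels),
            nn.LeakyReLU(1.0 / a),
            nn.MaxPool2d((2, 1)),                        # frequency halved, time unchanged
        )
        self.gru = nn.GRU(freq_bins + freq_bins // 2, hidden, batch_first=True)
        self.voice_head = nn.Linear(hidden, freq_bins)
        self.background_head = nn.Linear(hidden, freq_bins)

    def forward(self, X):                                # X: (batch, 513, time) amplitude spectrum
        b, f, t = X.shape
        x = X.unsqueeze(1)
        feats = torch.cat([self.freq_conv(x)[..., :f, :t],
                           self.time_conv(x)[..., :f, :t]], dim=1)
        feats = self.post(feats)                         # (batch, 2*channels, 256, time)
        low_res = feats.mean(dim=1)                      # collapse channels to one 256 x time map (assumption)
        multi_scale = torch.cat([X, low_res], dim=1)     # (batch, 769, time) multi-scale features
        h, _ = self.gru(multi_scale.transpose(1, 2))     # (batch, time, hidden)
        z1 = self.voice_head(h)                          # network output for the human voice
        z2 = self.background_head(h)                     # network output for the background sound
        denom = z1.abs() + z2.abs() + 1e-8
        Xt = X.transpose(1, 2)
        return (z1.abs() / denom) * Xt, (z2.abs() / denom) * Xt   # masked predictions

model = ConvRecurrentSeparator()
voice, background = model(torch.rand(2, 513, 64))
print(voice.shape, background.shape)                      # torch.Size([2, 64, 513]) each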
Example 2:
As shown in fig. 1 and 2, the present embodiment further includes, on the basis of embodiment 1: an attention layer arranged between the convolutional layers and the pooling layer of the convolutional neural network. The attention layer automatically learns the importance of each feature channel, increases the weight of useful feature channels according to this importance, and suppresses feature channels that are of little use to the current task. The attention layer is shown in fig. 2.
Preferably, the attention layer is globally pooled using a max-pooling method.
Fig. 2 provides a schematic diagram of the attention layer in this embodiment. Given an input x with c_1 feature channels, a feature with c_2 channels is obtained through a series of general transformations such as convolution. Unlike a conventional CNN, this embodiment then re-calibrates the obtained features through three operations. The first is the squeeze operation, which compresses the features along the spatial dimension and turns each two-dimensional feature channel into a single real number. This real number has, to some extent, a global receptive field, and the output dimension matches the number of input feature channels; it characterizes the global distribution of responses over the feature channels and allows layers close to the input to obtain a global receptive field, which is very useful in many tasks. The second is the excitation operation, a mechanism similar to the gates in a recurrent neural network, which generates a weight for each feature channel through a learned parameter w that explicitly models the correlation between feature channels. The final step is re-weighting: the output weights of the excitation operation are regarded as the importance of each feature channel after feature selection, and the original features are recalibrated in the channel dimension by weighting the previous features channel by channel through multiplication.
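A minimal PyTorch sketch of such a channel-attention layer with global max pooling follows; the two fully connected layers with reduction ratio r follow the common squeeze-and-excitation design and are an assumption here, since the text fixes only the three operations and the use of max pooling.

import torch
import torch.nn as nn

class MaxPoolChannelAttention(nn.Module):
    def __init__(self, channels, r=64):
        super().__init__()
        self.squeeze = nn.AdaptiveMaxPool2d(1)            # global max pooling per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, max(channels // r, 1)),
            nn.ReLU(inplace=True),
            nn.Linear(max(channels // r, 1), channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                                  # x: (batch, channels, freq, time)
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c))        # per-channel importance weights
        return x * w.view(b, c, 1, 1)                      # re-weight (recalibrate) each channel

attn = MaxPoolChannelAttention(channels=32, r=4)            # r = 4 only for this toy example
y = attn(torch.randn(2, 32, 513, 64))
print(y.shape)                                              # torch.Size([2, 32, 513, 64])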
As can be seen from fig. 2, this embodiment takes into account that different channels may have different importance, which previous networks did not consider, treating all channels as equally important instead. The importance of the different channels is scaled by a learned set of weights, which is equivalent to re-calibrating the original features after the weights are applied.
The combined use of the two convolutional layers in embodiment 1 already improves on a single convolution kernel, and after studying mixed speech separation the inventor found that the number of output channels of the network can be increased by setting the number of convolution kernels; however, not all channels are equally important, and too many redundant feature channels affect the expressive capability of the network. In this embodiment, after the attention module is added on the basis of embodiment 1, the performance is further improved. In addition, the inventor adds the attention module after the last convolutional layer of the convolutional recurrent neural network and uses a maximum pooling function for global pooling, so that the weights corresponding to different channels are distinguishable, whereas using average pooling may weaken the distinction of importance between different channels.
1. Verification and comparative test: to verify the separation effect of the method of embodiment 2, the inventor used the MIR-1K data set. The audio of the MIR-1K data set is organized in two folders: UndividedWavfile, containing 1000 audio clips of 4 to 13 seconds, and Wavfile, containing 110 audio clips. These clips were extracted from 110 Chinese karaoke songs sung by both male and female singers. The inventor used a specific male singer and a specific female singer as the training set, containing 175 clips in total; the remaining 825 clips served as the test set. The sampling rate is 16000 Hz with 16-bit samples.
(1) Performance evaluation indexes: bss_eval_sources in the mir_eval package is used for evaluation, and the separation effect is assessed with the following four indexes.

Source-to-distortion ratio (SDR):

SDR = 10·log₁₀( ‖s_target‖² / ‖e_interf + e_noise + e_artif‖² )

Source-to-interference ratio (SIR):

SIR = 10·log₁₀( ‖s_target‖² / ‖e_interf‖² )

Source-to-noise ratio (SNR):

SNR = 10·log₁₀( ‖s_target + e_interf‖² / ‖e_noise‖² )

Source-to-artifact ratio (SAR):

SAR = 10·log₁₀( ‖s_target + e_interf + e_noise‖² / ‖e_artif‖² )

where s_target is the target (prediction) signal component, e_interf is the interference signal, e_noise is the noise signal, and e_artif are the artifacts introduced by the algorithm. SDR evaluates the separation effect of the algorithm from a relatively comprehensive angle, SIR analyzes it from the angle of interference, SNR from the angle of noise, and SAR from the angle of artifacts; the larger the values of SDR, SIR, SNR and SAR, the better the separation of the human voice and the background music. Global NSDR (GNSDR), global SIR (GSIR) and global SAR (GSAR) are weighted averages of NSDR, SIR and SAR respectively, weighted by source length. The normalized SDR (NSDR) is defined as:

NSDR(T_e, T_o, T_m) = SDR(T_e, T_o) − SDR(T_m, T_o)

where T_e is the estimated human voice/background music produced by the model, T_o is the pure human voice/background music in the original signal, and T_m is the original mixed signal.
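A minimal sketch of this evaluation using the mir_eval package follows; the random arrays are placeholders for the MIR-1K reference sources, the model estimates and the mixture.

import numpy as np
from mir_eval.separation import bss_eval_sources

rng = np.random.RandomState(0)
references = rng.randn(2, 16000 * 4)                     # [human voice, background music]
estimates = references + 0.1 * rng.randn(2, 16000 * 4)   # stand-in for the model output
mixture = references.sum(axis=0)

sdr, sir, sar, _ = bss_eval_sources(references, estimates)

# NSDR: improvement of the estimate's SDR over the SDR of the raw mixture.
mix_sdr, _, _, _ = bss_eval_sources(references, np.stack([mixture, mixture]))
nsdr = sdr - mix_sdr
print(sdr, sir, sar, nsdr)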
Table: algorithm comparison under loss function of mean square error and source-to-interference ratio combination
Figure 20091DEST_PATH_IMAGE040
In the above table, methods 1 to 8 are conventional mixed speech separation methods; method 9 replaces the two convolutional layers of embodiment 1 with a single convolutional layer; method 10 is the method of embodiment 1; method 11 is the method of embodiment 2; method 12, on the basis of embodiment 2, replaces the maximum pooling adopted by the attention layer with average pooling and replaces the two convolutional layers with a single convolutional layer; method 13, on the basis of embodiment 1, replaces the two convolutional layers with a single convolutional layer and replaces the maximum pooling adopted by the attention layer with average pooling; method 14, on the basis of embodiment 2, replaces the maximum pooling adopted by the attention layer with average pooling.
As can be seen from the table, the mixed speech separation results obtained by the method of embodiment 1 (method 10) and the method of embodiment 2 (method 11) are all comparable to those obtained by the prior-art methods 1 to 8; from methods 9 and 10, the effect of combining the two convolutional layers is improved compared with a single convolution kernel, and after the attention module is added in method 11 the performance is further improved. In addition, the inventor compared the two pooling methods, average pooling and maximum pooling, in the above table: comparing the mixed voice separation results of method 11 and method 14, the effect with average pooling is not as good as that with maximum pooling, because the attention module added after the last convolutional layer of the convolutional recurrent neural network uses a maximum pooling function for global pooling, which makes the weights corresponding to different channels distinguishable, whereas average pooling may weaken the distinction of importance between different channels. It can be seen that the mixed voice separation effect is better when the attention layer of embodiment 2 with maximum pooling is used.
The inventor also found during the research that the reduction ratio r is an important hyper-parameter in the attention layer: when the value of r is small, it does not help performance, whereas setting r = 64 improves performance on the three indexes GNSDR, GSIR and GSAR.
In addition, after studying different values of γ in the combined mean square error and source-to-interference ratio loss function for the convolutional neural network and the recurrent neural network, the inventor found that a value of γ of 0.05 gives a balance among GNSDR, GSIR and GSAR; the larger these three evaluation indexes, the higher the signal-to-noise ratio and the better the separation. However, the three indexes do not all increase or decrease together as the hyper-parameter changes, and 0.05 is selected in order to balance the three indexes, i.e., to make all of them relatively large, rather than optimizing only one of them.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A single-channel human voice and background voice separation method based on a convolution cyclic neural network is characterized by comprising the following steps:
s1, acquiring an original mixed voice signal, wherein the original mixed voice signal is a single-channel mixed signal of human voice and background voice;
s2, performing framing windowing and time-frequency conversion on the obtained original mixed voice signal to obtain an original mixed signal amplitude spectrum and an original mixed signal phase spectrum;
s3, inputting the original mixed signal amplitude spectrum into a convolutional neural network, wherein the convolutional neural network comprises a convolutional layer and a pooling layer which are sequentially arranged; the convolution layer obtains local characteristics of an original mixed signal amplitude spectrum, and the pooling layer reduces the dimension of the characteristics, converts the characteristics into a low-resolution characteristic diagram and outputs the characteristic diagram; the convolutional layer comprises two layers, and convolutional kernels in the two layers of convolutional layers are different in size;
s4, inputting the low-resolution feature map and the original mixed signal amplitude spectrum into a recurrent neural network, and combining a time-frequency mask to obtain a predicted value of human voice after the human voice passes through the time-frequency mask and a predicted value of background voice after the background voice passes through the time-frequency mask;
s5, respectively combining the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background voice after passing through the time-frequency mask with the phase spectrum of the original mixed signal, and respectively obtaining a predicted human voice signal and a predicted background voice signal through inverse Fourier transform;
the convolutional neural network and the cyclic neural network are both provided with an original mixed signal amplitude spectrum channel, an attention layer is arranged between a convolutional layer and a pooling layer of the convolutional neural network, the attention layer automatically acquires the importance degree of each characteristic channel in a learning mode, the weight of the useful characteristic channel is improved according to the importance degree, and the weight of the characteristic channel which is not used for the current task is reduced.
2. The method of claim 1, wherein the convolution kernels in the two convolutional layers are both rectangular strip filters.
3. The method of claim 2, wherein the convolution kernel size of the first convolutional layer is 2 x 10, and the convolution kernel size of the second convolutional layer is 10 x 2.
4. The method for separating the single-channel human voice from the background voice based on the convolutional recurrent neural network as claimed in claim 2, wherein a batch normalization layer is arranged after the two convolutional layers, the batch normalization layer uses a Leaky-relu activation function, and the formula of the Leaky-relu activation function is as follows:

f(x) = x when x ≥ 0, and f(x) = x/a when x < 0,

wherein x is an independent variable and a is a fixed parameter in the interval (1, +∞).
5. The convolution-cyclic neural network-based method for separating single-channel human voice from background sound according to claim 1, wherein the pooling layer convolution kernel size is 2 x 1.
6. The method for separating a single-channel human voice from a background voice based on a convolutional circular neural network as claimed in claim 1, wherein the spectrogram size of the input of the convolutional neural network in S3 is 10 × 513, wherein 513=1024/2+1 and the sampling rate is 16000 Hz.
7. The convolution-cyclic neural network-based single-channel human voice and background voice separation method of claim 1, wherein the attention layer is globally pooled by a max-pooling method.
8. The method for separating the single-channel human voice from the background voice based on the convolutional recurrent neural network as claimed in claim 1, wherein the convolutional neural network and the recurrent neural network use a mean square error loss function as follows:

J_MSE = ‖ŷ₁(t) − y₁(t)‖² + ‖ŷ₂(t) − y₂(t)‖²

wherein ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, and y₁(t) and y₂(t) respectively represent the real values of the human voice and the background sound;

or a loss function combining the mean square error and the source-to-interference ratio is adopted, as follows:

J = ‖ŷ₁(t) − y₁(t)‖² + ‖ŷ₂(t) − y₂(t)‖² − γ(‖ŷ₁(t) − y₂(t)‖² + ‖ŷ₂(t) − y₁(t)‖²)

wherein γ is a hyper-parameter, ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, and y₁(t) and y₂(t) respectively represent the real values of the human voice and the background sound.
9. The method for separating the single-channel human voice from the background voice based on the convolutional recurrent neural network as claimed in claim 1, wherein in S4 the predicted value of the human voice after passing through the time-frequency mask and the predicted value of the background sound after passing through the time-frequency mask are calculated as follows:

ŷ₁(t) = ( |z₁(t)| / (|z₁(t)| + |z₂(t)|) ) ⊙ X(t)

ŷ₂(t) = ( |z₂(t)| / (|z₁(t)| + |z₂(t)|) ) ⊙ X(t)

wherein ⊙ is defined as element-wise multiplication, ŷ₁(t) is the predicted value of the human voice after passing through the time-frequency mask, ŷ₂(t) is the predicted value of the background sound after passing through the time-frequency mask, X(t) is the amplitude spectrum of the original mixed signal, z₁(t) is the output for the human voice at time t predicted by the convolution cyclic neural network, z₂(t) is the output of the convolution cyclic neural network for the background sound at time t, |z₁(t)|/(|z₁(t)| + |z₂(t)|) is the time-frequency mask of the predicted value of the human voice, and |z₂(t)|/(|z₁(t)| + |z₂(t)|) is the time-frequency mask of the predicted value of the background sound.
CN202011119804.5A 2020-10-19 2020-10-19 Single-channel human voice and background voice separation method based on convolution cyclic neural network Active CN112259120B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011119804.5A CN112259120B (en) 2020-10-19 2020-10-19 Single-channel human voice and background voice separation method based on convolution cyclic neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011119804.5A CN112259120B (en) 2020-10-19 2020-10-19 Single-channel human voice and background voice separation method based on convolution cyclic neural network

Publications (2)

Publication Number Publication Date
CN112259120A CN112259120A (en) 2021-01-22
CN112259120B (en) 2021-06-29

Family

ID=74243874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011119804.5A Active CN112259120B (en) 2020-10-19 2020-10-19 Single-channel human voice and background voice separation method based on convolution cyclic neural network

Country Status (1)

Country Link
CN (1) CN112259120B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808607A (en) * 2021-03-05 2021-12-17 北京沃东天骏信息技术有限公司 Voice enhancement method and device based on neural network and electronic equipment
CN112802484B (en) * 2021-04-12 2021-06-18 四川大学 Panda sound event detection method and system under mixed audio frequency
CN113113041B (en) * 2021-04-29 2022-10-11 电子科技大学 Voice separation method based on time-frequency cross-domain feature selection
CN113257267B (en) * 2021-05-31 2021-10-15 北京达佳互联信息技术有限公司 Method for training interference signal elimination model and method and equipment for eliminating interference signal
CN113241092A (en) * 2021-06-15 2021-08-10 新疆大学 Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network
CN113990033A (en) * 2021-09-10 2022-01-28 南京融才交通科技研究院有限公司 Vehicle traffic accident remote take-over rescue method and system based on 5G internet of vehicles
CN113903355B (en) * 2021-12-09 2022-03-01 北京世纪好未来教育科技有限公司 Voice acquisition method and device, electronic equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106653048A (en) * 2016-12-28 2017-05-10 上海语知义信息技术有限公司 Method for separating sound of single channels on basis of human sound models
CN106847302A (en) * 2017-02-17 2017-06-13 大连理工大学 Single channel mixing voice time-domain seperation method based on convolutional neural networks
CN110097049A (en) * 2019-04-03 2019-08-06 中国科学院计算技术研究所 A kind of natural scene Method for text detection and system
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment
US10452902B1 (en) * 2018-12-21 2019-10-22 Capital One Services, Llc Patent application image generation systems
US10540961B2 (en) * 2017-03-13 2020-01-21 Baidu Usa Llc Convolutional recurrent neural networks for small-footprint keyword spotting
CN110992987A (en) * 2019-10-23 2020-04-10 大连东软信息学院 Parallel feature extraction system and method for general specific voice in voice signal
CN111179957A (en) * 2020-01-07 2020-05-19 腾讯科技(深圳)有限公司 Voice call processing method and related device
CN111275046A (en) * 2020-01-10 2020-06-12 中科鼎富(北京)科技发展有限公司 Character image recognition method and device, electronic equipment and storage medium
CN111667819A (en) * 2019-03-08 2020-09-15 北京京东尚科信息技术有限公司 CRNN-based speech recognition method, system, storage medium and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10361673B1 (en) * 2018-07-24 2019-07-23 Sony Interactive Entertainment Inc. Ambient sound activated headphone
US11645745B2 (en) * 2019-02-15 2023-05-09 Surgical Safety Technologies Inc. System and method for adverse event detection or severity estimation from surgical data

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106653048A (en) * 2016-12-28 2017-05-10 上海语知义信息技术有限公司 Method for separating sound of single channels on basis of human sound models
CN106847302A (en) * 2017-02-17 2017-06-13 大连理工大学 Single channel mixing voice time-domain seperation method based on convolutional neural networks
US10540961B2 (en) * 2017-03-13 2020-01-21 Baidu Usa Llc Convolutional recurrent neural networks for small-footprint keyword spotting
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment
US10452902B1 (en) * 2018-12-21 2019-10-22 Capital One Services, Llc Patent application image generation systems
CN111667819A (en) * 2019-03-08 2020-09-15 北京京东尚科信息技术有限公司 CRNN-based speech recognition method, system, storage medium and electronic equipment
CN110097049A (en) * 2019-04-03 2019-08-06 中国科学院计算技术研究所 A kind of natural scene Method for text detection and system
CN110992987A (en) * 2019-10-23 2020-04-10 大连东软信息学院 Parallel feature extraction system and method for general specific voice in voice signal
CN111179957A (en) * 2020-01-07 2020-05-19 腾讯科技(深圳)有限公司 Voice call processing method and related device
CN111275046A (en) * 2020-01-10 2020-06-12 中科鼎富(北京)科技发展有限公司 Character image recognition method and device, electronic equipment and storage medium

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
"Crnn-Ctc Based Mandarin Keywords Spotting";Haikang Yan;《ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing》;20200504;全文 *
"Hybrid Approach to Combining Conventional and Deep Learning Techniques for Single-Channel Speech Enhancement and Recognition";Yan-Hui Tu;《2018 IEEE International Conference on Acoustics, Speech and Signal Processing 》;20180420;全文 *
"Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks";Jingdong Li;《Proceedings of APSIPA Annual Summit and Conference》;20191221;全文 *
"SOUND EVENT LOCALIZATION AND DETECTION USING CONVOLUTIONAL RECURRENT NEURAL NETWORK";Wen Jie Jee1;《Detection and Classification of Acoustic Scenes and Events 2019》;20191231;全文 *
"Sound Event Localization Based on Sound Intensity Vector Refined by Dnn-Based Denoising and Source Separation";Masahiro Yasuda;《ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing》;20200508;全文 *
"基于深度神经网络的单通道语音增强方法回顾";鲍长春;《信号处理》;20191231;第35卷(第12期);全文 *
"结合深度卷积循环网络和时频注意力机制的单通道语音增强算法";闫昭宇;《信号处理》;20200630;第36卷(第6期);全文 *

Also Published As

Publication number Publication date
CN112259120A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN112259120B (en) Single-channel human voice and background voice separation method based on convolution cyclic neural network
CN109890043B (en) Wireless signal noise reduction method based on generative countermeasure network
CN112712812B (en) Audio signal generation method, device, equipment and storage medium
Wang et al. Specaugment++: A hidden space data augmentation method for acoustic scene classification
CN110246510B (en) End-to-end voice enhancement method based on RefineNet
CN110197665B (en) Voice separation and tracking method for public security criminal investigation monitoring
CN108091345B (en) Double-ear voice separation method based on support vector machine
CN112349297A (en) Depression detection method based on microphone array
CN113470671B (en) Audio-visual voice enhancement method and system fully utilizing vision and voice connection
CN112259119B (en) Music source separation method based on stacked hourglass network
CN111986661A (en) Deep neural network speech recognition method based on speech enhancement in complex environment
Fan et al. Utterance-level permutation invariant training with discriminative learning for single channel speech separation
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN111724806A (en) Double-visual-angle single-channel voice separation method based on deep neural network
CN111968669B (en) Multi-element mixed sound signal separation method and device
Şimşekli et al. Non-negative tensor factorization models for Bayesian audio processing
Abdulatif et al. Investigating cross-domain losses for speech enhancement
Sun Digital audio scene recognition method based on machine learning technology
CN110136741A (en) A kind of single-channel voice Enhancement Method based on multiple dimensioned context
CN113229842B (en) Heart and lung sound automatic separation method based on complex deep neural network
CN114898773A (en) Synthetic speech detection method based on deep self-attention neural network classifier
Ali et al. Enhancing Embeddings for Speech Classification in Noisy Conditions.
TWI749547B (en) Speech enhancement system based on deep learning
Xie et al. Cross-corpus open set bird species recognition by vocalization
Dou et al. Cochleagram-based identification of electronic disguised voice with pitch scaling in the noisy environment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210204

Address after: No. 1418, 14th floor, building 1, No. 1166, Tianfu 3rd Street, Chengdu hi tech Zone, China (Sichuan) pilot Free Trade Zone, Chengdu, Sichuan 610000

Applicant after: Chengdu Yuejian Technology Co.,Ltd.

Address before: 610000 Chengdu, Sichuan, Shuangliu District, Dongsheng Street, long bridge 6, 129, 1 units, 9 level 902.

Applicant before: CHENGDU MINGJIE TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210611

Address after: 210012 4th floor, building C, Wanbo Science Park, 20 Fengxin Road, Yuhuatai District, Nanjing City, Jiangsu Province

Applicant after: NANJING GUIJI INTELLIGENT TECHNOLOGY Co.,Ltd.

Address before: No. 1418, 14th floor, building 1, No. 1166, Tianfu 3rd Street, Chengdu hi tech Zone, China (Sichuan) pilot Free Trade Zone, Chengdu, Sichuan 610000

Applicant before: Chengdu Yuejian Technology Co.,Ltd.

GR01 Patent grant
GR01 Patent grant