CN113053407A

CN113053407A - Single-channel voice separation method and system for multiple speakers

Info

Publication number: CN113053407A
Application number: CN202110173700.0A
Authority: CN
Inventors: 史慧宇; 欧阳鹏
Original assignee: Nanjing Yunzhi Technology Co ltd
Current assignee: Nanjing Yunzhi Technology Co ltd
Priority date: 2021-02-06
Filing date: 2021-02-06
Publication date: 2021-06-29

Abstract

The invention provides a single-channel voice separation method for multiple speakers, which comprises the following steps: constructing a voice separation deep neural network; acquiring a first code, a second code and a third code; decoding the first code and the third code by the deconvolution layer to obtain voice separation signals of a plurality of output channels; carrying out supervised training on the voice separation deep neural network to obtain a trained voice separation deep neural network; and inputting the voice sample to be tested into the trained voice separation deep neural network to obtain a plurality of voice separation signals in the voice sample to be tested. The invention can increase the basis of the network for separating the voice signals by separating the phoneme additional input and the attention mechanism of the network, and can effectively improve the accuracy of voice output, reduce the distortion rate of the voice and improve the intelligibility compared with the original method. Meanwhile, the invention also provides a single-channel voice separation system aiming at multiple speakers.

Description

Single-channel voice separation method and system for multiple speakers

Technical Field

The invention relates to the field of voice signal processing, in particular to a single-channel voice separation method and a single-channel voice separation system for multiple speakers.

Background

With the rapid development of high-end intelligent devices such as intelligent earphones, hearing aids, conference recorders and the like, voice interaction is being widely studied as the most convenient way for human-computer interaction. In the field of voice signal processing, the voice separation technology is used as a tie for connecting the front end and the rear end, so that the interference of factors such as noise can be filtered, and key information required by technologies such as voice recognition can be extracted, thereby playing a vital role. However, in the current algorithm, when the voice to be separated contains large noise or accompanied reverberation, the separation effect is greatly reduced. The most widely studied and applied speech separation algorithm is the single-channel speech separation technique. The single-channel voice separation technology mainly utilizes signals collected by a single microphone, modeling is carried out by means of time-frequency domain acoustics and statistical characteristic differences between target voice and interference signals, and compared with a multi-channel voice separation task, the single-channel voice separation technology is low in hardware requirement and cost, small in calculation amount and high in difficulty.

In recent years, neural networks and deep learning techniques have been rapidly developed, and a speech separation algorithm has adopted a deep learning calculation mode. The basic idea of the deep learning-based voice separation method is as follows: establishing a voice separation model, extracting characteristic parameters from mixed voice, searching a mapping relation between the characteristic parameters and the characteristic parameters of each target voice signal through network training, and outputting signals of each target voice through the trained model by any input mixed signal, thereby realizing voice separation. In the early years, researchers mostly research separation algorithms in the frequency domain, and recently, end-to-end separation algorithms in the time domain can avoid the problem of phase estimation errors in the frequency domain, so that the end-to-end separation algorithms in the time domain are widely researched. The current time domain speech separation algorithm mainly comprises: Cov-TasNet, BLSTM-TasNet, and FurcaNeXt, among others. These algorithms are often applied to data that is purely speech-mixed, but the effect is somewhat reduced when interference such as noise and reverberation is mixed in the data. One reason for this is that most of these algorithms only input the mixed signal directly into the network model for training, and thus the trained network is of limited help to improve the separation accuracy.

Disclosure of Invention

The invention aims to provide a single-channel voice separation method for multiple speakers, which can increase the basis of the network for separating voice signals by separating the phoneme additional input and the attention mechanism of the network, effectively improve the accuracy of voice output, reduce the distortion rate of voice and improve the intelligibility compared with the original method.

The invention provides a single-channel voice separation system for multiple speakers, which can increase the basis of the network for separating the voice signals by separating the phoneme additional input and the attention mechanism of the network, effectively improve the accuracy of voice output, reduce the distortion rate of voice and improve the intelligibility compared with the original system.

In a first aspect of the present invention, a single-channel speech separation method for multiple speakers is provided, which includes:

and step S101, constructing a voice separation deep neural network. The voice separation deep neural network comprises: the device comprises an input layer, an output layer and a plurality of output channels. The number of output channels corresponds to the number of speakers in the mixed audio. The voice separation network includes: a hybrid audio signal encoder, a phoneme information encoder, an attention mechanism module, and a comprehensive decoder.

Step S102, inputting the mixed audio sample signal to the input end of the mixed audio signal encoder, and encoding the mixed audio sample signal through a two-layer time-delay convolution network to obtain a first code.

And serially inputting the voice phoneme of each target speaker to the input end of a phoneme coder, coding the voice phoneme of each target speaker through a two-layer time-delay convolutional network, and extracting high-dimensional characteristics to obtain a second code.

The attention mechanism module obtains scores of the first code and the second code through an internal scoring mechanism, and obtains an attention weight value through the scores of the first code and the second code. The weighted first code is obtained by the attention weight value. And acquiring a third code through the weighted first code and the second code.

And the comprehensive decoder is used for decoding the first code and the third code through the deconvolution layer to obtain voice separation signals of a plurality of output channels.

And step S103, taking the clean audio of each target speaker as a training target of the voice separation deep neural network. And training the voice separation deep neural network. And (4) reversely propagating the updated weight and the bias by a gradient descent method by using a loss function, and carrying out supervised training on the voice separation deep neural network to obtain the trained voice separation deep neural network.

And step S104, inputting the voice sample to be tested into the trained voice separation deep neural network, and obtaining a plurality of voice separation signals in the voice sample to be tested from a plurality of output channels through the processing of the voice separation deep neural network. And taking a plurality of voice separation signals in the voice sample to be tested as voice separation result signals of each target speaker.

In one embodiment of the single-channel speech separation method for multiple speakers according to the present invention, the hybrid audio signal encoder and the phoneme information encoder each include two hidden layers. The decoder comprises 2 hidden layers. The attention mechanism module includes a hidden layer.

In another embodiment of the single-channel speech separation method for multiple speakers according to the present invention, step S102 further includes:

the voice sample signals in the voice sample database are resampled under 8kHz, and random audio mixing is carried out on a plurality of target speakers, noise and reverberation data between signal to noise ratio-2.5 dB and 2.5dB to obtain a plurality of mixed audio sample signals. Each mixed audio sample signal is 4s in length.

And acquiring clean audio and voice phonemes of the target speaker corresponding to each mixed audio sample signal.

In another embodiment of the single-channel speech separation method for multiple speakers according to the present invention, step S103 further includes randomly initializing parameters of a speech separation deep neural network.

In another embodiment of the single-channel speech separation method for multiple speakers according to the present invention, updating weights and biases by back propagation using a loss function through a gradient descent method, and the step of training supervised by a speech separation deep neural network further includes: a forward propagation phase.

The forward propagation phase comprises: weights and biases between neuron nodes in a speech-separating deep neural network are initialized. Forward propagating speech separates deep neural networks. In the forward propagation process of the speech separation deep neural network, the nonlinear relation between layers is increased through an activation function, so that a nonlinear mapping between input and output results can be generated.

In another embodiment of the single-channel speech separation method for multiple speakers according to the present invention, in step S103, updating weights and biases by back propagation using a loss function through a gradient descent method, and the step of supervised training for a speech separation deep neural network includes:

and step S1031, calculating the gradient of the loss function of the output layer of the voice separation deep neural network. The loss function is equation 1:

wherein s is_targetIs the target of speech extraction. e.g. of the type_noiseIs the estimated noise, and is found from the difference between the estimated speech and the mixed speech.

Step S1032 obtains a gradient corresponding to each network layer when the number of network layers L is L-1, L-2, …,2 in the speech separation deep neural network.

And step S1033, updating the weight and the bias of the whole network according to the gradient of the loss function of the output layer and the gradient corresponding to each layer.

In a second aspect of the present invention, a single-channel speech separation system for multiple speakers is provided, which includes:

a building network unit configured to build a speech separation deep neural network. The voice separation deep neural network comprises: the device comprises an input layer, an output layer and a plurality of output channels. The number of output channels corresponds to the number of speakers in the mixed audio. The voice separation network includes: a hybrid audio signal encoder, a phoneme information encoder, an attention mechanism module, and a comprehensive decoder.

And the network configuration unit is configured to input the mixed audio sample signal to the input end of the mixed audio signal encoder, encode the mixed audio sample signal through the two layers of time-delay convolutional networks and obtain a first code.

A training unit configured to use the clean audio of each target speaker as a training target for the speech separation deep neural network. And training the voice separation deep neural network. And (4) reversely propagating the updated weight and the bias by a gradient descent method by using a loss function, and carrying out supervised training on the voice separation deep neural network to obtain the trained voice separation deep neural network. And

and the separation unit is configured to input the voice sample to be tested into the trained voice separation deep neural network, process the voice separation deep neural network and acquire a plurality of voice separation signals in the voice sample to be tested from a plurality of output channels. And taking a plurality of voice separation signals in the voice sample to be tested as voice separation result signals of each target speaker.

In an embodiment of the single-channel speech separation system for multiple speakers, the network configuration unit is further configured to:

In another embodiment of the single-channel speech separation system for multiple speakers of the present invention, the training unit is further configured to randomly initialize parameters of the speech separation deep neural network.

In another embodiment of the single-channel speech separation system for multiple speakers according to the present invention, the training unit uses a loss function to update the weights and the bias by back propagation through a gradient descent method, and the step of training the speech separation deep neural network with supervision further comprises: a forward propagation phase.

The features, technical features, advantages and implementations of the method and system for single-channel speech separation for multiple speakers will be further described in an unambiguous and understandable manner with reference to the accompanying drawings.

Drawings

FIG. 1 is a flow diagram illustrating a single-channel speech separation method for multiple speakers in one embodiment of the present invention.

Fig. 2 is a schematic diagram for illustrating the components of the voice separation network in one embodiment of the present invention.

Fig. 3 is a schematic diagram illustrating the components of a single-channel speech separation system for multiple speakers in an embodiment of the present invention.

FIG. 4 is a schematic diagram illustrating the components of the deep neural network for speech separation according to an embodiment of the present invention.

Fig. 5a is a schematic diagram for illustrating a partial layer structure of a hybrid audio signal encoder according to an embodiment of the present invention.

Fig. 5b is a diagram illustrating a partial layer structure of a phoneme information encoder in an embodiment of the present invention.

FIG. 6 is a schematic diagram illustrating a partial layer structure of an attention mechanism module according to an embodiment of the present invention

Fig. 7 is a schematic diagram for explaining a partial layer structure of an integrated decoder according to still another embodiment of the present invention.

Detailed Description

In order to more clearly understand the technical features, objects and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings, in which the same reference numerals indicate the same or structurally similar but functionally identical elements.

"exemplary" means "serving as an example, instance, or illustration" herein, and any illustration, embodiment, or steps described as "exemplary" herein should not be construed as a preferred or advantageous alternative. For the sake of simplicity, the drawings only schematically show the parts relevant to the present exemplary embodiment, and they do not represent the actual structure and the true scale of the product.

In a first aspect of the invention, a single-channel speech separation method for multiple speakers is provided, as shown in fig. 1, which includes:

and step S101, constructing a voice separation deep neural network.

In this step, a speech separation deep neural network is constructed. The voice separation deep neural network comprises: the device comprises an input layer, an output layer and a plurality of output channels. The number of output channels corresponds to the number of speakers in the mixed audio. As shown in fig. 2, the voice separation network includes: a hybrid audio signal encoder 101, a phoneme information encoder 201, an attention mechanism module 301, and an integrated decoder 401.

Step S102, voice separation signals of a plurality of output channels are obtained.

In this step, the mixed audio sample signal is input to the input end of the mixed audio signal encoder 101, and the mixed audio sample signal is encoded by the two-layer delay convolutional network to obtain the first code.

The attention mechanism module 301 obtains scores of the first code and the second code through an internal scoring mechanism, and obtains an attention weight value through the scores of the first code and the second code. The weighted first code is obtained by the attention weight value. And acquiring a third code through the weighted first code and the second code.

The integrated decoder 401 decodes the first code and the third code by the deconvolution layer to obtain the speech separation signals of the plurality of output channels.

And step S103, obtaining the trained voice separation deep neural network.

In this step, the clean audio of each target speaker is used as a training target of the speech separation deep neural network. And training the voice separation deep neural network. And (4) reversely propagating the updated weight and the bias by a gradient descent method by using a loss function, and carrying out supervised training on the voice separation deep neural network to obtain the trained voice separation deep neural network.

Step S104, a plurality of voice separation signals are obtained.

In the step, the voice sample to be tested is input into the trained voice separation deep neural network, and a plurality of voice separation signals in the voice sample to be tested are obtained from a plurality of output channels after the voice separation deep neural network processing. And taking a plurality of voice separation signals in the voice sample to be tested as voice separation result signals of each target speaker.

Compared with the existing single-channel voice separation algorithm, the single-channel voice separation method for multiple speakers can further extract the voice characteristics of each speaker and effectively remove signals except the corresponding target speaker, so that the accuracy of algorithm separation is improved, the distortion rate of voice is reduced, and the intelligibility of each voice after separation is effectively improved. The method can increase the basis of the network for separating the voice signals by separating the phoneme additional input and the attention mechanism of the network, and can effectively improve the accuracy of voice output, reduce the distortion rate of the voice and improve the intelligibility compared with the original method.

In one embodiment of the single-channel speech separation method for multiple speakers according to the present invention, the hybrid audio signal encoder 101 and the phoneme information encoder 201 each include two hidden layers. The decoder comprises 2 hidden layers. Note that force mechanism module 301 includes a hidden layer.

In a second aspect of the present invention, there is provided a single-channel speech separation system for multiple speakers, as shown in fig. 3, the single-channel speech separation system for multiple speakers comprises: a building network unit 10, a network configuration unit 20, a training unit 30 and a separation unit 40.

A building network unit 10 configured to build a speech separation deep neural network. The voice separation deep neural network comprises: the device comprises an input layer, an output layer and a plurality of output channels. The number of output channels corresponds to the number of speakers in the mixed audio. The voice separation network includes: a hybrid audio signal encoder 101, a phoneme information encoder 201, an attention mechanism module 301, and an integrated decoder 401.

A network configuration unit 20 configured to input the mixed audio sample signal to an input of the mixed audio signal encoder 101, encode the mixed audio sample signal via a two-layer delayed convolutional network, and obtain a first encoding.

A training unit 30 configured to use clean audio of each target speaker as a training target for the speech separation deep neural network. And training the voice separation deep neural network. And (4) reversely propagating the updated weight and the bias by a gradient descent method by using a loss function, and carrying out supervised training on the voice separation deep neural network to obtain the trained voice separation deep neural network.

And the separation unit 40 is configured to input the voice sample to be tested into the trained voice separation deep neural network, and obtain a plurality of voice separation signals in the voice sample to be tested from a plurality of output channels after the voice separation deep neural network processes the voice sample to be tested. And taking a plurality of voice separation signals in the voice sample to be tested as voice separation result signals of each target speaker.

In one embodiment of the single-channel speech separation system for multiple speakers according to the present invention, the network configuration unit 20 is further configured to:

In another embodiment of the present invention directed to a multi-speaker single-channel speech separation system, the training unit 30 is further configured to randomly initialize parameters of the speech separation deep neural network.

In another embodiment of the single-channel speech separation system for multiple speakers according to the present invention, the training unit 30 updates the weights and the bias by back-propagating through gradient descent using a loss function, and the step of training the speech separation deep neural network with supervision further comprises: a forward propagation phase.

In another preferred embodiment of the single-channel multi-speaker speech separation method of the present invention, the present invention is directed to a phoneme time-domain convolution speech separation method for a single-channel multi-speaker speech separation technique. Compared with the existing single-channel voice separation algorithm, the algorithm can further extract the voice characteristics of each speaker and effectively remove signals except the corresponding target speaker, so that the accuracy of algorithm separation is improved, the distortion rate of voice is reduced, and the intelligibility of each voice after separation is effectively improved.

The algorithm provided by the invention consists of a mixed audio signal encoder, a phoneme information encoder, a comprehensive decoder and an attention mechanism module.

The specific operation flow comprises the following steps:

a first part: preprocessing a training voice sample and inputting the training voice sample to a network input end;

a second part: training the deep neural network by using a loss function to obtain a deep neural network model;

and a third part: and inputting the voice sample to be tested into the trained network model for voice separation to obtain a test result.

The first part specifically comprises:

1.1, resampling the database sample signals at 8kHz, carrying out random audio mixing on a plurality of target speakers, noise and reverberation data between signal-to-noise ratio-2.5 dB to 2.5dB, and simultaneously storing clean audio and voice phonemes of the target speakers corresponding to each mixed audio. The length of each sample was 4 s.

1.2, inputting the mixed audio signal into an input end of a mixed audio signal coder, serially inputting phonemes corresponding to each target voice into an input end of a phoneme coder, and taking clean voice audio corresponding to a speaker as a training target of a neural network.

The second part includes:

2.1, randomly initializing the parameters of the deep neural network;

and 2.2, carrying out supervised training on the deep neural network according to the initialized parameters of 2.1, namely reversely propagating and updating the weight and the bias by using a loss function through a gradient descent method to obtain a deep neural network model.

The 2.1 includes a forward propagation stage and a backward propagation stage.

The forward propagation phase comprises: initializing weights and biases among the network neuron nodes; the deep neural network performs forward propagation.

The neural network can use the activation function to increase the nonlinear relation between networks in the forward propagation process, and finally can generate the nonlinear mapping between the input result and the output result.

The back propagation phase comprises:

<1> calculating a loss function of the deep neural network;

<2> updating parameters of the deep neural network by a gradient descent method.

The loss function of the entire network is equation 2:

wherein s is_targetIs the target of speech extraction; e.g. of the type_noiseIs the estimated noise, and is found from the difference between the estimated speech and the mixed speech.

The network will use the gradient descent method to update the parameters alternately:

a. and constructing a voice separation network. It is a multi-output network, and the number of output channels is related to the number of speakers in the mixed audio. The entire network consists of four modules, including: a hybrid audio signal encoder, a phoneme information encoder, a comprehensive decoder, and an attention mechanism module. Besides the input layer and the output layer, the mixed audio signal encoder and the phoneme information encoder respectively comprise two hidden layers, and the decoder comprises 2 hidden layers. The attention mechanism contains a hidden layer. .

b. Calculating the gradient of a loss function of a network output layer;

c. calculating the gradient corresponding to each layer when the network layer number is L-1, L-2, …, 2;

d. the weights and biases for the entire network are updated.

A mixed audio signal encoder section: the mixed audio y is input to a network input end, and then the signal is encoded through a two-layer delay convolution network to obtain G-0, … and gN-1. N is the output length corresponding to the second layer network of the encoder.

A phoneme information encoder section: the phonemes corresponding to the target speech of each mixed audio are connected in series to form p and input into the input end of the module, then the two layers of delay convolutional network coding is carried out, and the high-dimensional characteristics are extracted to obtain H-H-0, … and hM-1. And M is the corresponding output length of the second layer network of the encoder.

Attention mechanism module part: and simultaneously receiving outputs G and H of the mixed audio signal encoder and the phoneme information encoder, and calculating all hm and gn through an internal scoring mechanism to obtain a formula 3:

thus the attention weight α_n，mEquation 4 can be obtained by a softmax operator:

the integrated decoder section: multiplying the output of the hybrid audio signal encoder by the weighted output of the attention mechanism module_nInput to the decoder, equation 5:

then decoding through a deconvolution layer of the network to obtain a corresponding multi-channel voice separation signal.

The voice test operation in the third section is: the voice sample to be tested is input into the trained network model, and the estimation signal corresponding to each path of voice, namely the voice separation result of each speaker, can be obtained through calculation.

The method can increase the basis of the network for separating the voice signals by separating the phoneme additional input and the attention mechanism of the network, and can effectively improve the accuracy of voice output, reduce the distortion rate of the voice and improve the intelligibility compared with the original method.

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

As shown in FIG. 4, the present invention provides a phoneme time domain convolution speech separation algorithm, here taking the speech separation of two speakers as an example. The method mainly comprises the following steps:

Each of the portions will be described in detail below.

Wherein the first part specifically comprises:

1-1, resampling the database sample signal at 8kHz, carrying out random audio mixing on a plurality of target speakers, noise and reverberation data between signal-to-noise ratio-2.5 dB to obtain y, and simultaneously storing clean audio xi (i is 1,2, …, N) and voice phoneme pi of the target speaker corresponding to each mixed audio. The length of each sample was 4 s.

1-2, taking the mixed audio signal y as an input end of a mixed audio signal encoder, serially connecting phonemes corresponding to each target voice into p, and inputting the p into an input end of a phoneme encoder, wherein clean voice audio xi corresponding to a speaker is taken as a training target of a neural network.

The second part specifically comprises:

(1) randomly initializing the parameters of the deep neural network;

(2) and (2) carrying out supervised training on the deep neural network according to the initialized parameters in the step (1), namely reversely propagating and updating the weight and the bias by a gradient descent method by using a loss function to obtain a deep neural network model.

The above (2) includes a forward propagation stage and a backward propagation stage.

The back propagation phase comprises:

<1> calculating a loss function of the deep neural network;

The loss function of the entire network is equation 6:

a. and constructing a voice separation network. It is a multi-output network, and the number of output channels is related to the number of speakers in the mixed audio. The entire network consists of four modules (fig. 4) including: a hybrid audio signal encoder 101 (fig. 5a), a phoneme information encoder 201 (fig. 5b), an integrated decoder 401 (fig. 7), and an attention mechanism module 301 (fig. 6). Except for an input layer and an output layer, the network comprises a mixed audio signal encoder and a phoneme information encoder which respectively comprise two hidden layers, and a comprehensive decoder comprises 2 hidden layers. The attention mechanism contains a hidden layer. The synthesis decoder 401 outputs a first separated speech separation signal and a second separated speech separation signal.

b. Calculating the gradient of a loss function of a network output layer;

d. the weights and biases for the entire network are updated.

Attention mechanism module part: and simultaneously receiving outputs G and H of the mixed audio signal encoder and the phoneme information encoder, and calculating all hm and gn through an internal scoring mechanism to obtain a formula 7:

thus the attention weight α_n，mEquation 8 can be obtained by a softmax operator:

the integrated decoder section: multiplying the output of the hybrid audio signal encoder by the weighted output of the attention mechanism module_nInput to the decoder, equation 9:

then decoding is carried out through a deconvolution layer of the network, and corresponding multiple channel voice separation signals can be obtained.

It should be understood that although the present description is described in terms of various embodiments, not every embodiment includes only a single embodiment, and such description is for clarity purposes only, and those skilled in the art will recognize that the embodiments described herein as a whole may be suitably combined to form other embodiments as will be appreciated by those skilled in the art.

The above-listed detailed description is only a specific description of a possible embodiment of the present invention, and they are not intended to limit the scope of the present invention, and equivalent embodiments or modifications made without departing from the technical spirit of the present invention should be included in the scope of the present invention.

Claims

1. A method for single-channel speech separation for multiple speakers, comprising:

step S101, constructing a voice separation deep neural network; the speech separation deep neural network includes: the device comprises an input layer, an output layer and a plurality of output channels; the number of the output channels corresponds to the number of speakers in the mixed audio; the voice separation network includes: a mixed audio signal encoder, a phoneme information encoder, an attention mechanism module and a comprehensive decoder;

step S102, inputting a mixed audio sample signal to an input end of a mixed audio signal encoder, and encoding the mixed audio sample signal through a two-layer time-delay convolution network to obtain a first code;

serially inputting each target speaker voice phoneme to an input end of a phoneme coder, coding each target speaker voice phoneme through a two-layer time-delay convolutional network, and extracting high-dimensional characteristics to obtain a second code;

the attention mechanism module acquires scores of the first code and the second code through an internal scoring mechanism, and acquires an attention weight value through the scores of the first code and the second code; obtaining the weighted first code according to the attention weight value; acquiring a third code through the weighted first code and the weighted second code;

the comprehensive decoder decodes the first code and the third code through a deconvolution layer to obtain voice separation signals of the plurality of output channels;

step S103, using the clean audio of each target speaker as a training target of the voice separation deep neural network; training the voice separation deep neural network; propagating the updated weight and the bias in a reverse direction by a gradient descent method by using a loss function, and carrying out supervised training on the voice separation deep neural network to obtain a trained voice separation deep neural network;

step S104, inputting a voice sample to be tested into the trained voice separation deep neural network, and obtaining a plurality of voice separation signals in the voice sample to be tested from the plurality of output channels through the processing of the voice separation deep neural network; and taking a plurality of voice separation signals in the voice sample to be tested as voice separation result signals of each target speaker.

2. The method of claim 1, wherein the hybrid audio signal encoder and the phoneme information encoder each comprise two hidden layers; the decoder comprises 2 hidden layers; the attention mechanism module includes a hidden layer.

3. The method for separating single-channel speech for multiple speakers according to claim 1, wherein the step S102 further comprises:

resampling voice sample signals in a voice sample database at 8kHz, and carrying out random audio mixing on a plurality of target speakers, noise and reverberation data between signal-to-noise ratio-2.5 dB and signal-to-noise ratio-2.5 dB to obtain a plurality of mixed audio sample signals; each mixed audio sample signal is 4s in length;

4. The method for separating single-channel speech for multiple speakers according to claim 1, wherein the step S103 further comprises initializing parameters of the deep speech separation neural network randomly.

5. The single-channel speech separation method for multiple speakers according to claim 1 or 4, wherein the updating weights and biases are back-propagated by gradient descent using a loss function, and the step of supervised training of the speech separation deep neural network further comprises: a forward propagation stage;

the forward propagation phase comprises: initializing weights and biases between neuron nodes in a speech separation deep neural network; forward propagating the voice separation deep neural network; in the forward propagation process of the voice separation deep neural network, the nonlinear relation between layers is increased through an activation function, so that a nonlinear mapping between input and output results can be generated.

6. The single-channel speech separation method for multiple speakers according to claim 1 or 4, wherein in step S103, the updating weights and biases are back-propagated by gradient descent method using a loss function, and the step of supervised training of the speech separation deep neural network comprises:

step S1031, calculating the gradient of the loss function of the output layer of the voice separation deep neural network; the loss function is equation 1:

wherein s is_targetIs the target of speech extraction; e.g. of the type_noiseIs an estimated noise, derived from the difference between the estimated speech and the mixed speech;

step S1032, obtaining a gradient corresponding to each layer when the number of network layers L is L-1, L-2, …,2 in the speech separation deep neural network;

7. A single channel speech separation system for multiple speakers, comprising:

a constructing network unit configured to construct a voice separation deep neural network; the speech separation deep neural network includes: the device comprises an input layer, an output layer and a plurality of output channels; the number of the output channels corresponds to the number of speakers in the mixed audio; the voice separation network includes: a mixed audio signal encoder, a phoneme information encoder, an attention mechanism module and a comprehensive decoder; a network configuration unit configured to input a mixed audio sample signal to an input of a mixed audio signal encoder, encode the mixed audio sample signal via a two-layer delayed convolutional network, and obtain a first code;

a training unit configured to use clean audio of each target speaker as a training target of the speech separation deep neural network; training the voice separation deep neural network; propagating the updated weight and the bias in a reverse direction by a gradient descent method by using a loss function, and carrying out supervised training on the voice separation deep neural network to obtain a trained voice separation deep neural network; and

the separation unit is configured to input a voice sample to be tested into the trained voice separation deep neural network, and obtain a plurality of voice separation signals in the voice sample to be tested from the plurality of output channels after the voice sample to be tested is processed by the voice separation deep neural network; and taking a plurality of voice separation signals in the voice sample to be tested as voice separation result signals of each target speaker.

8. The single channel voice separation system for multiple speakers as claimed in claim 7, wherein the network configuration unit is further configured to:

9. The single channel speech separation system for multiple speakers according to claim 7, wherein the training unit is further configured to randomly initialize parameters of the speech separation deep neural network.

10. The single channel speech separation system for multiple speakers according to claim 7 or 9, wherein the updating weights and biases are back-propagated by gradient descent using a loss function in the training unit, and the step of supervised training of the speech separation deep neural network further comprises: a forward propagation stage;