CN113053407A - Single-channel voice separation method and system for multiple speakers - Google Patents

Single-channel voice separation method and system for multiple speakers

Info

Publication number
CN113053407A
Authority
CN
China
Prior art keywords
voice
neural network
deep neural
separation
code
Prior art date
Legal status
Pending
Application number
CN202110173700.0A
Other languages
Chinese (zh)
Inventor
Shi Huiyu (史慧宇)
Ouyang Peng (欧阳鹏)
Current Assignee
Nanjing Yunzhi Technology Co ltd
Original Assignee
Nanjing Yunzhi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Yunzhi Technology Co., Ltd.
Priority to CN202110173700.0A
Publication of CN113053407A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G10L21/0308 - Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Stereophonic System (AREA)

Abstract

The invention provides a single-channel voice separation method for multiple speakers, which comprises the following steps: constructing a voice separation deep neural network; acquiring a first code, a second code and a third code; decoding the first code and the third code through a deconvolution layer to obtain voice separation signals of a plurality of output channels; carrying out supervised training on the voice separation deep neural network to obtain a trained voice separation deep neural network; and inputting a voice sample to be tested into the trained voice separation deep neural network to obtain a plurality of voice separation signals from the voice sample to be tested. By taking separated phoneme information as an additional network input and applying an attention mechanism, the invention gives the network additional evidence for separating the voice signals; compared with prior methods, it effectively improves the accuracy of the voice output, reduces voice distortion and improves intelligibility. The invention also provides a corresponding single-channel voice separation system for multiple speakers.

Description

Single-channel voice separation method and system for multiple speakers
Technical Field
The invention relates to the field of voice signal processing, in particular to a single-channel voice separation method and a single-channel voice separation system for multiple speakers.
Background
With the rapid development of high-end intelligent devices such as smart earphones, hearing aids and conference recorders, voice interaction, the most convenient mode of human-computer interaction, is being widely studied. In the field of voice signal processing, voice separation serves as the bridge between the front end and the back end: it filters out interference such as noise and extracts the key information required by technologies such as speech recognition, and therefore plays a vital role. With current algorithms, however, when the voice to be separated contains strong noise or is accompanied by reverberation, the separation quality degrades substantially. The most widely studied and applied branch of voice separation is the single-channel voice separation technique, which uses the signal collected by a single microphone and models the time-frequency acoustic and statistical differences between the target voice and the interfering signals. Compared with the multi-channel voice separation task, it has lower hardware requirements and cost and a smaller computational load, but is more difficult.
In recent years, neural networks and deep learning have developed rapidly, and voice separation algorithms have adopted deep learning methods. The basic idea of deep-learning-based voice separation is to establish a separation model, extract characteristic parameters from the mixed voice, and learn through network training a mapping between those parameters and the characteristic parameters of each target voice signal; the trained model can then output each target voice for any input mixture, thereby realizing voice separation. In earlier years researchers mostly studied separation algorithms in the frequency domain; more recently, end-to-end algorithms in the time domain, which avoid the phase estimation errors of the frequency domain, have been widely studied. Current time-domain speech separation algorithms mainly include Conv-TasNet, BLSTM-TasNet and FurcaNeXt, among others. These algorithms are usually applied to data mixed from clean speech only, and their performance drops when noise and reverberation are mixed into the data. One reason is that most of these algorithms feed only the mixed signal directly into the network model for training, so the trained network offers limited help in improving separation accuracy.
Disclosure of Invention
The invention aims to provide a single-channel voice separation method for multiple speakers. By taking separated phoneme information as an additional network input and applying an attention mechanism, the method gives the network additional evidence for separating the voice signals, and compared with prior methods it effectively improves the accuracy of the voice output, reduces voice distortion and improves intelligibility.
The invention also provides a single-channel voice separation system for multiple speakers which, by the same phoneme input and attention mechanism, gives the network additional evidence for separating the voice signals, and compared with prior systems effectively improves the accuracy of the voice output, reduces voice distortion and improves intelligibility.
In a first aspect of the present invention, a single-channel speech separation method for multiple speakers is provided, which includes:
Step S101, constructing a voice separation deep neural network. The voice separation deep neural network comprises an input layer, an output layer and a plurality of output channels. The number of output channels corresponds to the number of speakers in the mixed audio. The voice separation network includes: a mixed audio signal encoder, a phoneme information encoder, an attention mechanism module, and a comprehensive decoder.
Step S102, inputting the mixed audio sample signal to the input end of the mixed audio signal encoder, and encoding the mixed audio sample signal through a two-layer time-delay convolution network to obtain a first code.
And serially inputting the voice phoneme of each target speaker to the input end of a phoneme coder, coding the voice phoneme of each target speaker through a two-layer time-delay convolutional network, and extracting high-dimensional characteristics to obtain a second code.
The attention mechanism module obtains scores of the first code and the second code through an internal scoring mechanism, and obtains an attention weight value through the scores of the first code and the second code. The weighted first code is obtained by the attention weight value. And acquiring a third code through the weighted first code and the second code.
And the comprehensive decoder is used for decoding the first code and the third code through the deconvolution layer to obtain voice separation signals of a plurality of output channels.
Step S103, taking the clean audio of each target speaker as the training target of the voice separation deep neural network and training the network: the weights and biases are updated by back propagation through a gradient descent method using a loss function, and the voice separation deep neural network is trained with supervision to obtain the trained voice separation deep neural network.
And step S104, inputting the voice sample to be tested into the trained voice separation deep neural network, and obtaining a plurality of voice separation signals in the voice sample to be tested from a plurality of output channels through the processing of the voice separation deep neural network. And taking a plurality of voice separation signals in the voice sample to be tested as voice separation result signals of each target speaker.
In one embodiment of the single-channel speech separation method for multiple speakers according to the present invention, the mixed audio signal encoder and the phoneme information encoder each include two hidden layers. The comprehensive decoder comprises two hidden layers. The attention mechanism module includes one hidden layer.
In another embodiment of the single-channel speech separation method for multiple speakers according to the present invention, step S102 further includes:
the voice sample signals in the voice sample database are resampled under 8kHz, and random audio mixing is carried out on a plurality of target speakers, noise and reverberation data between signal to noise ratio-2.5 dB and 2.5dB to obtain a plurality of mixed audio sample signals. Each mixed audio sample signal is 4s in length.
And acquiring clean audio and voice phonemes of the target speaker corresponding to each mixed audio sample signal.
In another embodiment of the single-channel speech separation method for multiple speakers according to the present invention, step S103 further includes randomly initializing parameters of a speech separation deep neural network.
In another embodiment of the single-channel speech separation method for multiple speakers according to the present invention, the step of updating the weights and biases by back propagation through a gradient descent method using a loss function and training the speech separation deep neural network with supervision further includes: a forward propagation phase.
The forward propagation phase comprises: initializing the weights and biases between neuron nodes in the speech separation deep neural network, then propagating forward through the network. During forward propagation, activation functions introduce nonlinear relations between layers, so that a nonlinear mapping between the input and the output results can be produced.
In another embodiment of the single-channel speech separation method for multiple speakers according to the present invention, the step in step S103 of updating the weights and biases by back propagation through a gradient descent method using a loss function and training the speech separation deep neural network with supervision includes:
and step S1031, calculating the gradient of the loss function of the output layer of the voice separation deep neural network. The loss function is equation 1:
$\text{loss} = -10\log_{10}\frac{\|s_{target}\|^{2}}{\|e_{noise}\|^{2}}$ (Equation 1)

wherein $s_{target}$ is the target speech to be extracted, and $e_{noise}$ is the estimated noise, obtained from the difference between the estimated speech and the mixed speech.
Step S1032, for each layer l = L-1, L-2, …, 2 of the speech separation deep neural network, obtaining the gradient corresponding to that layer.
And step S1033, updating the weight and the bias of the whole network according to the gradient of the loss function of the output layer and the gradient corresponding to each layer.
In a second aspect of the present invention, a single-channel speech separation system for multiple speakers is provided, which includes:
a building network unit configured to build a speech separation deep neural network. The voice separation deep neural network comprises: the device comprises an input layer, an output layer and a plurality of output channels. The number of output channels corresponds to the number of speakers in the mixed audio. The voice separation network includes: a hybrid audio signal encoder, a phoneme information encoder, an attention mechanism module, and a comprehensive decoder.
And the network configuration unit is configured to input the mixed audio sample signal to the input end of the mixed audio signal encoder, encode the mixed audio sample signal through the two layers of time-delay convolutional networks and obtain a first code.
And serially inputting the voice phoneme of each target speaker to the input end of a phoneme coder, coding the voice phoneme of each target speaker through a two-layer time-delay convolutional network, and extracting high-dimensional characteristics to obtain a second code.
The attention mechanism module obtains scores of the first code and the second code through an internal scoring mechanism, and obtains an attention weight value through the scores of the first code and the second code. The weighted first code is obtained by the attention weight value. And acquiring a third code through the weighted first code and the second code.
And the comprehensive decoder is used for decoding the first code and the third code through the deconvolution layer to obtain voice separation signals of a plurality of output channels.
A training unit configured to use the clean audio of each target speaker as the training target of the voice separation deep neural network and to train the network: the weights and biases are updated by back propagation through a gradient descent method using a loss function, and the voice separation deep neural network is trained with supervision to obtain the trained voice separation deep neural network. And
and the separation unit is configured to input the voice sample to be tested into the trained voice separation deep neural network, process the voice separation deep neural network and acquire a plurality of voice separation signals in the voice sample to be tested from a plurality of output channels. And taking a plurality of voice separation signals in the voice sample to be tested as voice separation result signals of each target speaker.
In an embodiment of the single-channel speech separation system for multiple speakers, the network configuration unit is further configured to:
the voice sample signals in the voice sample database are resampled under 8kHz, and random audio mixing is carried out on a plurality of target speakers, noise and reverberation data between signal to noise ratio-2.5 dB and 2.5dB to obtain a plurality of mixed audio sample signals. Each mixed audio sample signal is 4s in length.
And acquiring clean audio and voice phonemes of the target speaker corresponding to each mixed audio sample signal.
In another embodiment of the single-channel speech separation system for multiple speakers of the present invention, the training unit is further configured to randomly initialize parameters of the speech separation deep neural network.
In another embodiment of the single-channel speech separation system for multiple speakers according to the present invention, the step in which the training unit updates the weights and biases by back propagation through a gradient descent method using a loss function and trains the speech separation deep neural network with supervision further comprises: a forward propagation phase.
The forward propagation phase comprises: initializing the weights and biases between neuron nodes in the speech separation deep neural network, then propagating forward through the network. During forward propagation, activation functions introduce nonlinear relations between layers, so that a nonlinear mapping between the input and the output results can be produced.
The features, technical features, advantages and implementations of the method and system for single-channel speech separation for multiple speakers will be further described in an unambiguous and understandable manner with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow diagram illustrating a single-channel speech separation method for multiple speakers in one embodiment of the present invention.
Fig. 2 is a schematic diagram for illustrating the components of the voice separation network in one embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating the components of a single-channel speech separation system for multiple speakers in an embodiment of the present invention.
FIG. 4 is a schematic diagram illustrating the components of the deep neural network for speech separation according to an embodiment of the present invention.
Fig. 5a is a schematic diagram for illustrating a partial layer structure of a hybrid audio signal encoder according to an embodiment of the present invention.
Fig. 5b is a diagram illustrating a partial layer structure of a phoneme information encoder in an embodiment of the present invention.
FIG. 6 is a schematic diagram illustrating a partial layer structure of an attention mechanism module according to an embodiment of the present invention.
Fig. 7 is a schematic diagram for explaining a partial layer structure of an integrated decoder according to still another embodiment of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings, in which the same reference numerals indicate the same or structurally similar but functionally identical elements.
"exemplary" means "serving as an example, instance, or illustration" herein, and any illustration, embodiment, or steps described as "exemplary" herein should not be construed as a preferred or advantageous alternative. For the sake of simplicity, the drawings only schematically show the parts relevant to the present exemplary embodiment, and they do not represent the actual structure and the true scale of the product.
In a first aspect of the invention, a single-channel speech separation method for multiple speakers is provided, as shown in fig. 1, which includes:
and step S101, constructing a voice separation deep neural network.
In this step, a speech separation deep neural network is constructed. The voice separation deep neural network comprises an input layer, an output layer and a plurality of output channels. The number of output channels corresponds to the number of speakers in the mixed audio. As shown in fig. 2, the voice separation network includes: a mixed audio signal encoder 101, a phoneme information encoder 201, an attention mechanism module 301, and an integrated decoder 401.
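For illustration, the four modules can be sketched in PyTorch as follows. The patent text does not disclose layer widths, kernel sizes, dilation factors or the exact scoring function, so every concrete number and name in the sketch (TwoLayerDilatedEncoder, PhonemeAttention, channel counts, the dot-product score) is an assumption rather than the disclosed implementation.

```python
# A minimal PyTorch sketch of the four-module network described above.
# All layer sizes, kernel widths and dilation factors are illustrative
# assumptions; the patent does not disclose them.
import torch
import torch.nn as nn

class TwoLayerDilatedEncoder(nn.Module):
    """Two-layer time-delay (dilated) 1-D convolutional encoder."""
    def __init__(self, in_ch, hid_ch=64, out_ch=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, hid_ch, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(hid_ch, out_ch, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
        )

    def forward(self, x):        # x: (batch, in_ch, time)
        return self.net(x)       # -> (batch, out_ch, frames)

class PhonemeAttention(nn.Module):
    """Scores mixture frames g_n against phoneme frames h_m (dot product
    assumed), softmaxes over m, and re-weights the mixture code."""
    def forward(self, G, H):     # G: (B, C, N), H: (B, C, M)
        scores = torch.einsum('bcn,bcm->bnm', G, H)       # e_{n,m}
        alpha = torch.softmax(scores, dim=-1)             # attention weights
        context = torch.einsum('bnm,bcm->bcn', alpha, H)  # c_n
        return G * context                                # weighted first code

class SeparationNet(nn.Module):
    def __init__(self, n_speakers=2, ch=128, n_phonemes=40):
        super().__init__()
        self.mix_enc = TwoLayerDilatedEncoder(1, out_ch=ch)           # first code
        self.pho_enc = TwoLayerDilatedEncoder(n_phonemes, out_ch=ch)  # second code
        self.attention = PhonemeAttention()
        # integrated decoder: deconvolution layers, one output channel
        # per target speaker in the mixture
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(2 * ch, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(64, n_speakers, kernel_size=3, padding=1),
        )

    def forward(self, mixture, phonemes):
        G = self.mix_enc(mixture)    # first code
        H = self.pho_enc(phonemes)   # second code
        D = self.attention(G, H)     # third code
        # the decoder consumes the first and third codes together
        return self.decoder(torch.cat([G, D], dim=1))
```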
Step S102, voice separation signals of a plurality of output channels are obtained.
In this step, the mixed audio sample signal is input to the input end of the mixed audio signal encoder 101, and the mixed audio sample signal is encoded by the two-layer time-delay convolutional network to obtain the first code.
And serially inputting the voice phoneme of each target speaker to the input end of a phoneme coder, coding the voice phoneme of each target speaker through a two-layer time-delay convolutional network, and extracting high-dimensional characteristics to obtain a second code.
The attention mechanism module 301 obtains scores of the first code and the second code through an internal scoring mechanism, and obtains an attention weight value through the scores of the first code and the second code. The weighted first code is obtained by the attention weight value. And acquiring a third code through the weighted first code and the second code.
The integrated decoder 401 decodes the first code and the third code by the deconvolution layer to obtain the speech separation signals of the plurality of output channels.
And step S103, obtaining the trained voice separation deep neural network.
In this step, the clean audio of each target speaker is used as the training target of the voice separation deep neural network, and the network is trained: the weights and biases are updated by back propagation through a gradient descent method using a loss function, and the voice separation deep neural network is trained with supervision to obtain the trained voice separation deep neural network.
Step S104, a plurality of voice separation signals are obtained.
In the step, the voice sample to be tested is input into the trained voice separation deep neural network, and a plurality of voice separation signals in the voice sample to be tested are obtained from a plurality of output channels after the voice separation deep neural network processing. And taking a plurality of voice separation signals in the voice sample to be tested as voice separation result signals of each target speaker.
Compared with existing single-channel voice separation algorithms, the single-channel voice separation method for multiple speakers can further extract the voice characteristics of each speaker and effectively remove signals other than those of the corresponding target speaker, thereby improving separation accuracy, reducing voice distortion, and effectively improving the intelligibility of each separated voice. By taking separated phoneme information as an additional network input and applying an attention mechanism, the method gives the network additional evidence for separating the voice signals, and compared with prior methods it effectively improves the accuracy of the voice output, reduces voice distortion and improves intelligibility.
In one embodiment of the single-channel speech separation method for multiple speakers according to the present invention, the mixed audio signal encoder 101 and the phoneme information encoder 201 each include two hidden layers. The integrated decoder comprises two hidden layers. The attention mechanism module 301 includes one hidden layer.
In another embodiment of the single-channel speech separation method for multiple speakers according to the present invention, step S102 further includes:
the voice sample signals in the voice sample database are resampled under 8kHz, and random audio mixing is carried out on a plurality of target speakers, noise and reverberation data between signal to noise ratio-2.5 dB and 2.5dB to obtain a plurality of mixed audio sample signals. Each mixed audio sample signal is 4s in length.
And acquiring clean audio and voice phonemes of the target speaker corresponding to each mixed audio sample signal.
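As a concrete illustration of this preprocessing, the following sketch resamples two sources and a noise recording to 8 kHz, draws a random signal-to-noise ratio in [-2.5, 2.5] dB, and cuts 4-second segments. It is a minimal sketch assuming librosa and numpy; the file paths and the convention of scaling the second source against the first are placeholders, since the patent does not fix them.

```python
# Minimal mixing sketch: 8 kHz resampling, random SNR in [-2.5, 2.5] dB,
# 4 s segments. Paths and the exact mixing convention are assumptions.
import numpy as np
import librosa

SR = 8000            # 8 kHz sampling rate
SEG = 4 * SR         # 4-second segments

def mix_pair(path_a, path_b, noise_path, rng=np.random.default_rng()):
    s1, _ = librosa.load(path_a, sr=SR)
    s2, _ = librosa.load(path_b, sr=SR)
    noise, _ = librosa.load(noise_path, sr=SR)
    s1, s2, noise = (x[:SEG] for x in (s1, s2, noise))
    # scale the second source to a random SNR relative to the first
    snr_db = rng.uniform(-2.5, 2.5)
    gain = np.linalg.norm(s1) / (np.linalg.norm(s2) * 10 ** (snr_db / 20) + 1e-8)
    mixture = s1 + gain * s2 + noise
    # return the mixture plus the clean targets used for supervision
    return mixture, (s1, gain * s2)
```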
In another embodiment of the single-channel speech separation method for multiple speakers according to the present invention, step S103 further includes randomly initializing parameters of a speech separation deep neural network.
In another embodiment of the single-channel speech separation method for multiple speakers according to the present invention, the step of updating the weights and biases by back propagation through a gradient descent method using a loss function and training the speech separation deep neural network with supervision further includes: a forward propagation phase.
The forward propagation phase comprises: initializing the weights and biases between neuron nodes in the speech separation deep neural network, then propagating forward through the network. During forward propagation, activation functions introduce nonlinear relations between layers, so that a nonlinear mapping between the input and the output results can be produced.
In another embodiment of the single-channel speech separation method for multiple speakers according to the present invention, the step in step S103 of updating the weights and biases by back propagation through a gradient descent method using a loss function and training the speech separation deep neural network with supervision includes:
and step S1031, calculating the gradient of the loss function of the output layer of the voice separation deep neural network. The loss function is equation 1:
$\text{loss} = -10\log_{10}\frac{\|s_{target}\|^{2}}{\|e_{noise}\|^{2}}$ (Equation 1)

wherein $s_{target}$ is the target speech to be extracted, and $e_{noise}$ is the estimated noise, obtained from the difference between the estimated speech and the mixed speech.
Step S1032, for each layer l = L-1, L-2, …, 2 of the speech separation deep neural network, obtaining the gradient corresponding to that layer.
And step S1033, updating the weight and the bias of the whole network according to the gradient of the loss function of the output layer and the gradient corresponding to each layer.
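The image for Equation 1 is not reproduced in the text above; given the definitions of s_target and e_noise, the objective reads as an SI-SNR-style power ratio, which the sketch below assumes (negated so that minimizing the loss maximizes the ratio). In an automatic-differentiation framework, the per-layer gradients of steps S1031 to S1033 come out of a single backward pass, so one supervised update can be sketched as follows; the optimizer choice and learning rate are assumptions, and SeparationNet refers to the sketch given earlier.

```python
# SI-SNR-style reading of Equation 1 (an assumption) plus one supervised
# update; backward() yields the per-layer gradients of steps S1032-S1033.
import torch

def si_snr_loss(est, ref, eps=1e-8):
    est = est - est.mean(dim=-1, keepdim=True)   # zero-mean along time
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # project the estimate onto the reference to obtain s_target
    dot = torch.sum(est * ref, dim=-1, keepdim=True)
    energy = torch.sum(ref * ref, dim=-1, keepdim=True) + eps
    s_target = dot / energy * ref
    e_noise = est - s_target                     # estimated noise component
    ratio = torch.sum(s_target ** 2, -1) / (torch.sum(e_noise ** 2, -1) + eps)
    return -10 * torch.log10(ratio + eps).mean()

model = SeparationNet(n_speakers=2)              # sketched above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(mixture, phonemes, clean_targets):
    estimates = model(mixture, phonemes)          # (B, n_speakers, T)
    loss = si_snr_loss(estimates, clean_targets)  # output-layer loss (Equation 1)
    optimizer.zero_grad()
    loss.backward()       # gradients for layers L-1, L-2, ..., 2
    optimizer.step()      # update the weights and biases of the whole network
    return loss.item()
```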
In a second aspect of the present invention, there is provided a single-channel speech separation system for multiple speakers, as shown in fig. 3, the single-channel speech separation system for multiple speakers comprises: a building network unit 10, a network configuration unit 20, a training unit 30 and a separation unit 40.
A building network unit 10 configured to build a speech separation deep neural network. The voice separation deep neural network comprises an input layer, an output layer and a plurality of output channels. The number of output channels corresponds to the number of speakers in the mixed audio. The voice separation network includes: a mixed audio signal encoder 101, a phoneme information encoder 201, an attention mechanism module 301, and an integrated decoder 401.
A network configuration unit 20 configured to input the mixed audio sample signal to the input end of the mixed audio signal encoder 101, encode the mixed audio sample signal via a two-layer time-delay convolutional network, and obtain a first code.
And serially inputting the voice phoneme of each target speaker to the input end of a phoneme coder, coding the voice phoneme of each target speaker through a two-layer time-delay convolutional network, and extracting high-dimensional characteristics to obtain a second code.
The attention mechanism module 301 obtains scores of the first code and the second code through an internal scoring mechanism, and obtains an attention weight value through the scores of the first code and the second code. The weighted first code is obtained by the attention weight value. And acquiring a third code through the weighted first code and the second code.
The integrated decoder 401 decodes the first code and the third code by the deconvolution layer to obtain the speech separation signals of the plurality of output channels.
A training unit 30 configured to use the clean audio of each target speaker as the training target of the voice separation deep neural network and to train the network: the weights and biases are updated by back propagation through a gradient descent method using a loss function, and the voice separation deep neural network is trained with supervision to obtain the trained voice separation deep neural network.
And the separation unit 40 is configured to input the voice sample to be tested into the trained voice separation deep neural network, and obtain a plurality of voice separation signals in the voice sample to be tested from a plurality of output channels after the voice separation deep neural network processes the voice sample to be tested. And taking a plurality of voice separation signals in the voice sample to be tested as voice separation result signals of each target speaker.
In one embodiment of the single-channel speech separation system for multiple speakers according to the present invention, the network configuration unit 20 is further configured to:
the voice sample signals in the voice sample database are resampled under 8kHz, and random audio mixing is carried out on a plurality of target speakers, noise and reverberation data between signal to noise ratio-2.5 dB and 2.5dB to obtain a plurality of mixed audio sample signals. Each mixed audio sample signal is 4s in length.
And acquiring clean audio and voice phonemes of the target speaker corresponding to each mixed audio sample signal.
In another embodiment of the present invention directed to a multi-speaker single-channel speech separation system, the training unit 30 is further configured to randomly initialize parameters of the speech separation deep neural network.
In another embodiment of the single-channel speech separation system for multiple speakers according to the present invention, the step in which the training unit 30 updates the weights and biases by back propagation through a gradient descent method using a loss function and trains the speech separation deep neural network with supervision further comprises: a forward propagation phase.
The forward propagation phase comprises: initializing the weights and biases between neuron nodes in the speech separation deep neural network, then propagating forward through the network. During forward propagation, activation functions introduce nonlinear relations between layers, so that a nonlinear mapping between the input and the output results can be produced.
In another preferred embodiment, the present invention provides a phoneme time-domain convolutional speech separation method for the single-channel multi-speaker speech separation task. Compared with existing single-channel voice separation algorithms, this algorithm can further extract the voice characteristics of each speaker and effectively remove signals other than those of the corresponding target speaker, thereby improving separation accuracy, reducing voice distortion, and effectively improving the intelligibility of each separated voice.
The algorithm provided by the invention consists of a mixed audio signal encoder, a phoneme information encoder, an integrated decoder and an attention mechanism module.
The specific operation flow comprises the following steps:
a first part: preprocessing a training voice sample and inputting the training voice sample to a network input end;
a second part: training the deep neural network by using a loss function to obtain a deep neural network model;
and a third part: and inputting the voice sample to be tested into the trained network model for voice separation to obtain a test result.
The first part specifically comprises:
1.1, resampling the database sample signals at 8 kHz, carrying out random audio mixing on a plurality of target speakers, noise and reverberation data at signal-to-noise ratios between -2.5 dB and 2.5 dB, and simultaneously storing the clean audio and voice phonemes of the target speakers corresponding to each mixed audio. Each sample is 4 s long.
1.2, inputting the mixed audio signal into an input end of a mixed audio signal coder, serially inputting phonemes corresponding to each target voice into an input end of a phoneme coder, and taking clean voice audio corresponding to a speaker as a training target of a neural network.
The second part includes:
2.1, randomly initializing the parameters of the deep neural network;
and 2.2, carrying out supervised training on the deep neural network according to the initialized parameters of 2.1, namely reversely propagating and updating the weight and the bias by using a loss function through a gradient descent method to obtain a deep neural network model.
Step 2.2 includes a forward propagation stage and a backward propagation stage.
The forward propagation phase comprises: initializing weights and biases among the network neuron nodes; the deep neural network performs forward propagation.
During forward propagation, the neural network uses activation functions to introduce nonlinear relations between layers, finally producing a nonlinear mapping between the input and the output results.
The back propagation phase comprises:
<1> calculating a loss function of the deep neural network;
<2> updating parameters of the deep neural network by a gradient descent method.
The loss function of the entire network is equation 2:
$\text{loss} = -10\log_{10}\frac{\|s_{target}\|^{2}}{\|e_{noise}\|^{2}}$ (Equation 2)

wherein $s_{target}$ is the target speech to be extracted; $e_{noise}$ is the estimated noise, obtained from the difference between the estimated speech and the mixed speech.
The network will use the gradient descent method to update the parameters alternately:
a. Constructing the voice separation network. It is a multi-output network, and the number of output channels is related to the number of speakers in the mixed audio. The entire network consists of four modules: a mixed audio signal encoder, a phoneme information encoder, an integrated decoder, and an attention mechanism module. Besides the input layer and the output layer, the mixed audio signal encoder and the phoneme information encoder each comprise two hidden layers, the integrated decoder comprises two hidden layers, and the attention mechanism contains one hidden layer.
b. Calculating the gradient of a loss function of a network output layer;
c. calculating the gradient corresponding to each layer l = L-1, L-2, …, 2;
d. the weights and biases for the entire network are updated.
Mixed audio signal encoder section: the mixed audio y is input to the network input end, and the signal is then encoded through a two-layer time-delay convolutional network to obtain $G = \{g_0, \ldots, g_{N-1}\}$, where N is the output length of the second layer of the encoder.
Phoneme information encoder section: the phonemes corresponding to the target voices of each mixed audio are concatenated into p and input to this module, then encoded through a two-layer time-delay convolutional network, and high-dimensional features are extracted to obtain $H = \{h_0, \ldots, h_{M-1}\}$, where M is the output length of the second layer of the encoder.
Attention mechanism module section: this module simultaneously receives the outputs G and H of the mixed audio signal encoder and the phoneme information encoder, and scores every pair $h_m$, $g_n$ through an internal scoring mechanism to obtain Equation 3:

$e_{n,m} = \text{score}(g_n, h_m), \quad n = 0, \ldots, N-1, \; m = 0, \ldots, M-1$
The attention weight $\alpha_{n,m}$ can then be obtained by a softmax operator, Equation 4:

$\alpha_{n,m} = \frac{\exp(e_{n,m})}{\sum_{m'=0}^{M-1} \exp(e_{n,m'})}$
the integrated decoder section: multiplying the output of the hybrid audio signal encoder by the weighted output of the attention mechanism modulenInput to the decoder, equation 5:
Figure BDA0002939690460000103
then decoding through a deconvolution layer of the network to obtain a corresponding multi-channel voice separation signal.
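A tiny numeric walk-through of Equations 3 to 5 follows; it assumes dot-product scoring, since the text only speaks of an "internal scoring mechanism" without fixing its form, and the dimensions are arbitrary toy values.

```python
# Toy walk-through of Equations 3-5 with dot-product scoring (an assumption).
import numpy as np

C, N, M = 4, 3, 2                  # feature dim, mixture frames, phoneme frames
rng = np.random.default_rng(0)
G = rng.standard_normal((C, N))    # mixture encoder output g_0 ... g_{N-1}
H = rng.standard_normal((C, M))    # phoneme encoder output h_0 ... h_{M-1}

E = G.T @ H                                               # Eq. 3: e_{n,m} = score(g_n, h_m)
alpha = np.exp(E) / np.exp(E).sum(axis=1, keepdims=True)  # Eq. 4: softmax over m
ctx = H @ alpha.T                                         # c_n = sum_m alpha_{n,m} h_m
D = G * ctx                                               # Eq. 5: weighted code for the decoder
print(D.shape)                     # (4, 3): one weighted feature vector per frame
```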
The voice test operation in the third part is: the voice sample to be tested is input into the trained network model, and the estimated signal corresponding to each voice channel, namely the voice separation result of each speaker, is obtained through calculation.
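With the sketches above, this test operation reduces to a single forward pass. Here `mixture` is a 1-D numpy waveform and `phoneme_features` a hypothetical precomputed tensor; the patent does not specify the phoneme feature format.

```python
# Inference sketch with the trained model (reuses torch and `model` from the
# sketches above); `mixture` and `phoneme_features` are assumed placeholders.
model.eval()
with torch.no_grad():
    mix = torch.from_numpy(mixture).float().view(1, 1, -1)  # (1, 1, T)
    separated = model(mix, phoneme_features)                # (1, n_speakers, T)
    speaker1, speaker2 = separated[0]                       # one waveform per channel
```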
By taking separated phoneme information as an additional network input and applying an attention mechanism, the method gives the network additional evidence for separating the voice signals, and compared with prior methods it effectively improves the accuracy of the voice output, reduces voice distortion and improves intelligibility.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 4, the present invention provides a phoneme time domain convolution speech separation algorithm, here taking the speech separation of two speakers as an example. The method mainly comprises the following steps:
a first part: preprocessing a training voice sample and inputting the training voice sample to a network input end;
a second part: training the deep neural network by using a loss function to obtain a deep neural network model;
and a third part: and inputting the voice sample to be tested into the trained network model for voice separation to obtain a test result.
Each of the portions will be described in detail below.
Wherein the first part specifically comprises:
1-1, resampling the database sample signals at 8 kHz and carrying out random audio mixing on a plurality of target speakers, noise and reverberation data at signal-to-noise ratios between -2.5 dB and 2.5 dB to obtain y, while simultaneously storing the clean audio $x_i$ (i = 1, 2, …, N) and voice phonemes $p_i$ of the target speaker corresponding to each mixed audio. Each sample is 4 s long.
1-2, taking the mixed audio signal y as the input of the mixed audio signal encoder, concatenating the phonemes corresponding to each target voice into p and inputting it to the phoneme encoder, with the clean voice audio $x_i$ of each speaker used as the training target of the neural network.
The second part specifically comprises:
(1) randomly initializing the parameters of the deep neural network;
(2) carrying out supervised training on the deep neural network according to the parameters initialized in step (1), that is, updating the weights and biases by back propagation through a gradient descent method using a loss function, to obtain a deep neural network model.
The above (2) includes a forward propagation stage and a backward propagation stage.
The forward propagation phase comprises: initializing weights and biases among the network neuron nodes; the deep neural network performs forward propagation.
During forward propagation, the neural network uses activation functions to introduce nonlinear relations between layers, finally producing a nonlinear mapping between the input and the output results.
The back propagation phase comprises:
<1> calculating a loss function of the deep neural network;
<2> updating parameters of the deep neural network by a gradient descent method.
The loss function of the entire network is equation 6:
$\text{loss} = -10\log_{10}\frac{\|s_{target}\|^{2}}{\|e_{noise}\|^{2}}$ (Equation 6)

wherein $s_{target}$ is the target speech to be extracted; $e_{noise}$ is the estimated noise, obtained from the difference between the estimated speech and the mixed speech.
The network will use the gradient descent method to update the parameters alternately:
a. Constructing the voice separation network. It is a multi-output network, and the number of output channels is related to the number of speakers in the mixed audio. The entire network consists of four modules (fig. 4): a mixed audio signal encoder 101 (fig. 5a), a phoneme information encoder 201 (fig. 5b), an integrated decoder 401 (fig. 7), and an attention mechanism module 301 (fig. 6). Besides the input layer and the output layer, the mixed audio signal encoder and the phoneme information encoder each comprise two hidden layers, the integrated decoder comprises two hidden layers, and the attention mechanism contains one hidden layer. For two speakers, the integrated decoder 401 outputs a first and a second separated voice signal.
b. Calculating the gradient of a loss function of a network output layer;
c. calculating the gradient corresponding to each layer l = L-1, L-2, …, 2;
d. the weights and biases for the entire network are updated.
Mixed audio signal encoder section: the mixed audio y is input to the network input end, and the signal is then encoded through a two-layer time-delay convolutional network to obtain $G = \{g_0, \ldots, g_{N-1}\}$, where N is the output length of the second layer of the encoder.
Phoneme information encoder section: the phonemes corresponding to the target voices of each mixed audio are concatenated into p and input to this module, then encoded through a two-layer time-delay convolutional network, and high-dimensional features are extracted to obtain $H = \{h_0, \ldots, h_{M-1}\}$, where M is the output length of the second layer of the encoder.
Attention mechanism module section: this module simultaneously receives the outputs G and H of the mixed audio signal encoder and the phoneme information encoder, and scores every pair $h_m$, $g_n$ through an internal scoring mechanism to obtain Equation 7:

$e_{n,m} = \text{score}(g_n, h_m), \quad n = 0, \ldots, N-1, \; m = 0, \ldots, M-1$
The attention weight $\alpha_{n,m}$ can then be obtained by a softmax operator, Equation 8:

$\alpha_{n,m} = \frac{\exp(e_{n,m})}{\sum_{m'=0}^{M-1} \exp(e_{n,m'})}$
the integrated decoder section: multiplying the output of the hybrid audio signal encoder by the weighted output of the attention mechanism modulenInput to the decoder, equation 9:
Figure BDA0002939690460000123
then decoding is carried out through a deconvolution layer of the network, and corresponding multiple channel voice separation signals can be obtained.
The voice test operation in the third part is: the voice sample to be tested is input into the trained network model, and the estimated signal corresponding to each voice channel, namely the voice separation result of each speaker, is obtained through calculation.
It should be understood that although this description is presented in terms of embodiments, not every embodiment contains only a single independent technical solution; the description is written this way merely for clarity, and those skilled in the art should treat it as a whole, since the embodiments may be suitably combined to form other implementations.
The above-listed detailed description is only a specific description of a possible embodiment of the present invention, and they are not intended to limit the scope of the present invention, and equivalent embodiments or modifications made without departing from the technical spirit of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for single-channel speech separation for multiple speakers, comprising:
step S101, constructing a voice separation deep neural network; the voice separation deep neural network comprises an input layer, an output layer and a plurality of output channels; the number of the output channels corresponds to the number of speakers in the mixed audio; the voice separation network includes: a mixed audio signal encoder, a phoneme information encoder, an attention mechanism module and a comprehensive decoder;
step S102, inputting a mixed audio sample signal to an input end of a mixed audio signal encoder, and encoding the mixed audio sample signal through a two-layer time-delay convolution network to obtain a first code;
serially inputting each target speaker voice phoneme to an input end of a phoneme coder, coding each target speaker voice phoneme through a two-layer time-delay convolutional network, and extracting high-dimensional characteristics to obtain a second code;
the attention mechanism module acquires scores of the first code and the second code through an internal scoring mechanism, and acquires an attention weight value through the scores of the first code and the second code; obtaining the weighted first code according to the attention weight value; acquiring a third code through the weighted first code and the weighted second code;
the comprehensive decoder decodes the first code and the third code through a deconvolution layer to obtain voice separation signals of the plurality of output channels;
step S103, using the clean audio of each target speaker as a training target of the voice separation deep neural network; training the voice separation deep neural network; updating the weights and biases by back propagation through a gradient descent method using a loss function, and carrying out supervised training on the voice separation deep neural network to obtain a trained voice separation deep neural network;
step S104, inputting a voice sample to be tested into the trained voice separation deep neural network, and obtaining a plurality of voice separation signals in the voice sample to be tested from the plurality of output channels through the processing of the voice separation deep neural network; and taking a plurality of voice separation signals in the voice sample to be tested as voice separation result signals of each target speaker.
2. The method of claim 1, wherein the mixed audio signal encoder and the phoneme information encoder each comprise two hidden layers; the comprehensive decoder comprises two hidden layers; and the attention mechanism module includes one hidden layer.
3. The method for separating single-channel speech for multiple speakers according to claim 1, wherein the step S102 further comprises:
resampling voice sample signals in a voice sample database at 8 kHz, and carrying out random audio mixing on a plurality of target speakers, noise and reverberation data at signal-to-noise ratios between -2.5 dB and 2.5 dB to obtain a plurality of mixed audio sample signals; each mixed audio sample signal is 4 s in length;
and acquiring clean audio and voice phonemes of the target speaker corresponding to each mixed audio sample signal.
4. The method for separating single-channel speech for multiple speakers according to claim 1, wherein the step S103 further comprises randomly initializing the parameters of the speech separation deep neural network.
5. The single-channel speech separation method for multiple speakers according to claim 1 or 4, wherein the step of updating the weights and biases by back propagation through a gradient descent method using a loss function and carrying out supervised training on the speech separation deep neural network further comprises: a forward propagation stage;
the forward propagation phase comprises: initializing weights and biases between neuron nodes in a speech separation deep neural network; forward propagating the voice separation deep neural network; in the forward propagation process of the voice separation deep neural network, the nonlinear relation between layers is increased through an activation function, so that a nonlinear mapping between input and output results can be generated.
6. The single-channel speech separation method for multiple speakers according to claim 1 or 4, wherein in step S103, the step of updating the weights and biases by back propagation through a gradient descent method using a loss function and carrying out supervised training on the speech separation deep neural network comprises:
step S1031, calculating the gradient of the loss function of the output layer of the voice separation deep neural network; the loss function is equation 1:
$\text{loss} = -10\log_{10}\frac{\|s_{target}\|^{2}}{\|e_{noise}\|^{2}}$ (Equation 1)

wherein $s_{target}$ is the target speech to be extracted; $e_{noise}$ is the estimated noise, derived from the difference between the estimated speech and the mixed speech;
step S1032, obtaining the gradient corresponding to each layer l = L-1, L-2, …, 2 of the speech separation deep neural network;
and step S1033, updating the weight and the bias of the whole network according to the gradient of the loss function of the output layer and the gradient corresponding to each layer.
7. A single channel speech separation system for multiple speakers, comprising:
a constructing network unit configured to construct a voice separation deep neural network; the voice separation deep neural network comprises an input layer, an output layer and a plurality of output channels; the number of the output channels corresponds to the number of speakers in the mixed audio; the voice separation network includes: a mixed audio signal encoder, a phoneme information encoder, an attention mechanism module and a comprehensive decoder; a network configuration unit configured to input a mixed audio sample signal to an input of the mixed audio signal encoder, encode the mixed audio sample signal via a two-layer time-delay convolutional network, and obtain a first code;
serially inputting each target speaker voice phoneme to an input end of a phoneme coder, coding each target speaker voice phoneme through a two-layer time-delay convolutional network, and extracting high-dimensional characteristics to obtain a second code;
the attention mechanism module acquires scores of the first code and the second code through an internal scoring mechanism, and acquires an attention weight value through the scores of the first code and the second code; obtaining the weighted first code according to the attention weight value; acquiring a third code through the weighted first code and the weighted second code;
the comprehensive decoder decodes the first code and the third code through a deconvolution layer to obtain voice separation signals of the plurality of output channels;
a training unit configured to use clean audio of each target speaker as a training target of the speech separation deep neural network; training the voice separation deep neural network; propagating the updated weight and the bias in a reverse direction by a gradient descent method by using a loss function, and carrying out supervised training on the voice separation deep neural network to obtain a trained voice separation deep neural network; and
the separation unit is configured to input a voice sample to be tested into the trained voice separation deep neural network, and obtain a plurality of voice separation signals in the voice sample to be tested from the plurality of output channels after the voice sample to be tested is processed by the voice separation deep neural network; and taking a plurality of voice separation signals in the voice sample to be tested as voice separation result signals of each target speaker.
8. The single channel voice separation system for multiple speakers as claimed in claim 7, wherein the network configuration unit is further configured to:
resampling voice sample signals in a voice sample database at 8 kHz, and carrying out random audio mixing on a plurality of target speakers, noise and reverberation data at signal-to-noise ratios between -2.5 dB and 2.5 dB to obtain a plurality of mixed audio sample signals; each mixed audio sample signal is 4 s in length;
and acquiring clean audio and voice phonemes of the target speaker corresponding to each mixed audio sample signal.
9. The single channel speech separation system for multiple speakers according to claim 7, wherein the training unit is further configured to randomly initialize parameters of the speech separation deep neural network.
10. The single channel speech separation system for multiple speakers according to claim 7 or 9, wherein the step in which the training unit updates the weights and biases by back propagation through a gradient descent method using a loss function and carries out supervised training on the speech separation deep neural network further comprises: a forward propagation stage;
the forward propagation phase comprises: initializing weights and biases between neuron nodes in a speech separation deep neural network; forward propagating the voice separation deep neural network; in the forward propagation process of the voice separation deep neural network, the nonlinear relation between layers is increased through an activation function, so that a nonlinear mapping between input and output results can be generated.
CN202110173700.0A 2021-02-06 2021-02-06 Single-channel voice separation method and system for multiple speakers Pending CN113053407A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110173700.0A CN113053407A (en) 2021-02-06 2021-02-06 Single-channel voice separation method and system for multiple speakers


Publications (1)

Publication Number Publication Date
CN113053407A 2021-06-29

Family

ID=76508902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110173700.0A Pending CN113053407A (en) 2021-02-06 2021-02-06 Single-channel voice separation method and system for multiple speakers

Country Status (1)

Country Link
CN (1) CN113053407A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102066264B1 (en) * 2018-07-05 2020-01-14 서울대학교산학협력단 Speech recognition method and system using deep neural network
CN110309343A (en) * 2019-06-28 2019-10-08 南京大学 A kind of vocal print search method based on depth Hash
CN110634502A (en) * 2019-09-06 2019-12-31 南京邮电大学 Single-channel voice separation algorithm based on deep neural network
CN111899757A (en) * 2020-09-29 2020-11-06 南京蕴智科技有限公司 Single-channel voice separation method and system for target speaker extraction
CN112331218A (en) * 2020-09-29 2021-02-05 北京清微智能科技有限公司 Single-channel voice separation method and device for multiple speakers

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Tao; Cao Hui; Guo Lele: "Speech deep feature extraction method based on deep neural networks" (深度神经网络的语音深度特征提取方法), Technical Acoustics (声学技术), no. 04 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744753A (en) * 2021-08-11 2021-12-03 清华大学苏州汽车研究院(相城) Multi-person voice separation method and training method of voice separation model
CN113744753B (en) * 2021-08-11 2023-09-08 清华大学苏州汽车研究院(相城) Multi-person voice separation method and training method of voice separation model
CN113782045A (en) * 2021-08-30 2021-12-10 江苏大学 Single-channel voice separation method for multi-scale time delay sampling
CN113782045B (en) * 2021-08-30 2024-01-05 江苏大学 Single-channel voice separation method for multi-scale time delay sampling
CN113744719A (en) * 2021-09-03 2021-12-03 清华大学 Voice extraction method, device and equipment
CN113782006A (en) * 2021-09-03 2021-12-10 清华大学 Voice extraction method, device and equipment
CN115116448A (en) * 2022-08-29 2022-09-27 四川启睿克科技有限公司 Voice extraction method, neural network model training method, device and storage medium
CN115116448B (en) * 2022-08-29 2022-11-15 四川启睿克科技有限公司 Voice extraction method, neural network model training method, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination