CN115691541B - Voice separation method, device and storage medium - Google Patents

Voice separation method, device and storage medium

Info

Publication number
CN115691541B
Authority
CN
China
Prior art keywords
module
spectrogram
domain
output result
submodule
Prior art date
Legal status
Active
Application number
CN202211680551.8A
Other languages
Chinese (zh)
Other versions
CN115691541A (en)
Inventor
康世胤
吴志勇
童玮男
朱佳旭
陈鋆
Current Assignee
Shenzhen Yuanxiang Information Technology Co ltd
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen Yuanxiang Information Technology Co ltd
Shenzhen International Graduate School of Tsinghua University
Priority date
Filing date
Publication date
Application filed by Shenzhen Yuanxiang Information Technology Co ltd and Shenzhen International Graduate School of Tsinghua University
Priority to CN202211680551.8A
Publication of CN115691541A
Application granted
Publication of CN115691541B


Abstract

The application discloses a voice separation method, device, and storage medium, wherein the method comprises the following steps: acquiring a first spectrogram and a plurality of second spectrograms, wherein the first spectrogram is a spectrogram of an original voice signal, and the plurality of second spectrograms are spectrograms of a plurality of original separated voice signals separated from the original voice signal; correcting the original phases and original amplitudes of the plurality of second spectrograms based on the first spectrogram by using a correction model to obtain corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms, wherein the correction model comprises a two-dimensional convolution module; obtaining a plurality of corrected second spectrograms according to the original phases and original amplitudes of the plurality of second spectrograms and the corresponding corrected phases and corrected amplitudes; and obtaining a plurality of corrected separated voice signals according to the plurality of corrected second spectrograms. In this way, the present application can reduce the difference between the separated speech signal and the true separated source speech signal.

Description

Voice separation method, device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a voice separation method, a voice separation apparatus, and a storage medium.
Background
The purpose of speech separation is to separate the source signal of each speaker from mixed speech. In recent years, methods based on deep learning have achieved significant success in the field of speech separation. At present, the mainstream speech separation method is based on a time domain model: the input of the time domain model is the waveform of the speech, and the waveform of each separated source is obtained through neural network prediction. However, the spectrogram of a separated speech signal predicted by a time domain model often exhibits obvious amplitude and phase errors, so that the separated speech signal differs considerably from the true separated source speech signal.
Disclosure of Invention
Based on this, embodiments of the present application provide a speech separation method, a speech separation apparatus, and a storage medium, which can reduce the difference between a separated speech signal and a true separated source speech signal.
In a first aspect, the present application provides a speech separation method, including:
acquiring a first spectrogram and a plurality of second spectrograms, wherein the first spectrogram is a spectrogram of an original voice signal, and the plurality of second spectrograms are spectrograms of a plurality of original separated voice signals separated from the original voice signal;
correcting the original phases and original amplitudes of the plurality of second spectrograms based on the first spectrogram by using a correction model to obtain corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms, wherein the correction model comprises a two-dimensional convolution module;
obtaining a plurality of corrected second spectrograms according to the original phases and original amplitudes of the plurality of second spectrograms and the corresponding corrected phases and corrected amplitudes;
and obtaining a plurality of corrected separated voice signals according to the plurality of corrected second spectrograms.
In a second aspect, the present application provides a speech separation apparatus comprising a memory and a processor; the memory is used for storing a computer program; the processor is adapted to execute the computer program and to implement the speech separation method as described above when executing the computer program.
In a third aspect, the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the speech separation method as described above.
The embodiments of the present application provide a voice separation method, a voice separation apparatus, and a storage medium. Because the original phases and original amplitudes of the second spectrograms of the plurality of separated original separated voice signals are corrected, based on the first spectrogram of the original voice signal, by using a correction model comprising a two-dimensional convolution module, and a plurality of corrected second spectrograms are obtained according to the original phases and original amplitudes of the plurality of second spectrograms and the corresponding corrected phases and corrected amplitudes, the plurality of corrected separated voice signals obtained in this way differ less from the true separated source voice signals.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a speech separation method according to the present application;
FIG. 2 is a diagram of an embodiment of a calibration model in the speech separation method of the present application;
FIG. 3 is a schematic flow chart diagram illustrating another embodiment of a speech separation method according to the present application;
FIG. 4 is a schematic diagram of another embodiment of a calibration model in the speech separation method of the present application;
FIG. 5 is a schematic flow chart diagram illustrating a voice separation method according to another embodiment of the present application;
FIG. 6 is a schematic diagram of another embodiment of a calibration model in the speech separation method of the present application;
FIG. 7 is a schematic flow chart diagram illustrating a voice separation method according to another embodiment of the present application;
FIG. 8 is a schematic diagram of another embodiment of a calibration model in the speech separation method of the present application;
FIG. 9 is a schematic flow chart diagram illustrating a voice separation method according to another embodiment of the present application;
FIG. 10 is a schematic flow chart diagram illustrating a voice separation method according to another embodiment of the present application;
fig. 11 is a block diagram of an embodiment of a speech separation apparatus according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
At present, the mainstream speech separation method is based on a time domain model: the input of the time domain model is the waveform of the speech, and the waveform of each separated source is obtained through neural network prediction. However, the spectrogram of a separated speech signal predicted by a time domain model often exhibits obvious amplitude and phase errors, so that the separated speech signal differs considerably from the true separated source speech signal.
To solve this technical problem, the present application corrects, based on the first spectrogram of the original speech signal, the original phases and original amplitudes of the second spectrograms of the plurality of separated original separated speech signals by using a correction model comprising a two-dimensional convolution module, obtains a plurality of corrected second spectrograms according to the original phases and original amplitudes of the plurality of second spectrograms and the corresponding corrected phases and corrected amplitudes, and then obtains a plurality of corrected separated speech signals, thereby reducing the difference between the separated speech signals and the true separated source speech signals.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments and features of the embodiments described below can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flowchart of an embodiment of a speech separation method according to the present application, where the method includes: step S101, step S102, step S103, and step S104.
Step S101: the method comprises the steps of obtaining a first spectrogram and a plurality of second spectrograms, wherein the first spectrogram is a spectrogram of an original voice signal, and the plurality of second spectrogram is spectrogram of a plurality of original separated voice signals separated from the original voice signal.
The voice signal may refer to the waveform signal of speech; the original voice signal may be a mixed, unseparated voice signal, and an original separated voice signal is a voice signal separated from the original voice signal. A spectrogram is a time-frequency representation of a voice signal: its abscissa is time, its ordinate is frequency, and the value at each coordinate point is the speech energy. A spectrogram thus expresses three-dimensional information on a two-dimensional plane, with the energy value represented by color; the darker the color, the stronger the speech energy at that point. The first spectrogram is the spectrogram of the original voice signal, and a second spectrogram is the spectrogram of an original separated voice signal.
Since the phase and the amplitude need to be corrected, and the spectrogram includes frequency domain information, it is first necessary to obtain a first spectrogram of an original voice signal and obtain a second spectrogram of a plurality of original separated voice signals separated from the original voice signal.
Step S102: and correcting the original phases and the original amplitudes of the plurality of second spectrogram based on the first spectrogram by using a correction model to obtain corrected phases and corrected amplitudes corresponding to the plurality of second spectrogram, wherein the correction model comprises a two-dimensional convolution module.
The two-dimensional convolution module may refer to a two-dimensional convolution layer in which a two-dimensional input array and a two-dimensional kernel array produce a two-dimensional output array through a cross-correlation operation. Since the result of a cross-correlation operation reflects the similarity between two signals, the correction model comprising the two-dimensional convolution module can use the first spectrogram as a reference to correct the original phases and original amplitudes of the plurality of second spectrograms, so as to obtain the corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms. The corrected phase and corrected amplitude are to be understood as the deviations between the original phase and original amplitude and the true phase and true amplitude. Any segment of a speech waveform becomes a complex matrix after an STFT (Short-Time Fourier Transform), which can be expressed either as a real part and an imaginary part, or as an amplitude and a phase. In this embodiment, the input of the two-dimensional convolution module is either the original phase and original amplitude of the first spectrogram together with the original phases and original amplitudes of the plurality of second spectrograms, or the imaginary and real parts of the first spectrogram together with the imaginary and real parts of the plurality of second spectrograms. The relation between phase and amplitude and the real and imaginary parts is as follows: with phase a, amplitude b, real part c, and imaginary part d, the real part is c = b·cos(a) and the imaginary part is d = b·sin(a). The first spectrogram, being the spectrogram of the original voice signal, implicitly contains the true phases and true amplitudes of the spectrograms of the plurality of true source voice signals, which is why the first spectrogram is used as the reference when the correction model corrects the original phases and original amplitudes of the plurality of second spectrograms.
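As an illustrative sketch of this relation between the two representations (a minimal example in PyTorch; the STFT parameters and tensor sizes are assumptions for illustration, not values from this application):

```python
import torch

waveform = torch.randn(16000)                       # 1 s of speech at 16 kHz (toy input)
spec = torch.stft(waveform, n_fft=512, hop_length=128,
                  window=torch.hann_window(512), return_complex=True)

amplitude = spec.abs()                              # amplitude b
phase = spec.angle()                                # phase a
real = amplitude * torch.cos(phase)                 # real part c = b*cos(a)
imag = amplitude * torch.sin(phase)                 # imaginary part d = b*sin(a)
assert torch.allclose(torch.complex(real, imag), spec, atol=1e-4)
```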
Step S103: and obtaining a plurality of corrected second spectrogram according to the original phases and the original amplitudes of the plurality of second spectrogram and the corresponding corrected phases and corrected amplitudes.
After the corrected phases and corrected amplitudes are obtained, they can be combined with the original phases and original amplitudes: the sum of each corrected phase and the corresponding original phase and the sum of each corrected amplitude and the corresponding original amplitude yield the plurality of corrected second spectrograms.
Step S104: and obtaining a plurality of corrected separated voice signals according to the plurality of corrected second spectrogram.
After the corrected second spectrograms are obtained, the plurality of corrected separated voice signals are obtained through an inverse Fourier transform.
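A minimal sketch of steps S103 and S104, assuming the additive combination of original and corrected quantities described above (function and parameter names are illustrative, not from this application):

```python
import torch

def correct_and_reconstruct(sep_spec, d_amp, d_phase, n_fft=512, hop=128):
    """Apply predicted amplitude/phase corrections to the complex spectrogram
    of one separated source (step S103) and recover the corrected waveform
    through the inverse short-time Fourier transform (step S104)."""
    amp = sep_spec.abs() + d_amp                    # original amplitude + corrected amplitude
    phase = sep_spec.angle() + d_phase              # original phase + corrected phase
    corrected_spec = torch.polar(amp, phase)        # complex spectrogram b * exp(i*a)
    return torch.istft(corrected_spec, n_fft=n_fft, hop_length=hop,
                       window=torch.hann_window(n_fft))
```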
In the embodiment of the present application, a correction model comprising a two-dimensional convolution module corrects, based on the first spectrogram of the original speech signal, the original phases and original amplitudes of the second spectrograms of the plurality of separated original separated speech signals, and a plurality of corrected second spectrograms are obtained according to the original phases and original amplitudes of the plurality of second spectrograms and the corresponding corrected phases and corrected amplitudes; the plurality of corrected separated speech signals obtained in this way therefore differ less from the true separated source speech signals.
In some embodiments, the correction model further comprises a time-domain frequency-domain correction module, the time-domain frequency-domain correction module being configured to determine the dependency relationship between the time direction and the frequency direction of the second spectrograms. In this case, in step S102, the correcting the original phases and original amplitudes of the plurality of second spectrograms based on the first spectrogram by using the correction model to obtain the corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms may further include: correcting the original phases and original amplitudes of the plurality of second spectrograms based on the first spectrogram by using the two-dimensional convolution module and the time-domain frequency-domain correction module to obtain the corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms.
In this embodiment of the application, the correction model further includes a time domain and frequency domain correction module besides the two-dimensional convolution module, the time domain and frequency domain correction module is configured to determine a dependency relationship between a time direction and a frequency direction of the second spectrogram, where the dependency relationship includes but is not limited to: frequency dependency within a frame, time dependency within the same frequency band, etc.; the time-domain frequency-domain correction module may comprise any module capable of determining a dependency relationship between a time direction and a frequency direction of the second spectrogram, for example: RNN (Recurrent Neural Network), and the like, and specifically LSTM (Long-Short Term Memory), BLSTM (Bi-directional Long-Short Term Memory), and the like. The original phase and the original amplitude can be corrected in more detailed directions according to the dependency relationship.
In some embodiments, the correction model further includes a dense connection expansion convolution module. In this case, in step S102, the correcting, by using the two-dimensional convolution module and the time-domain frequency-domain correction module, the original phases and original amplitudes of the plurality of second spectrograms based on the first spectrogram to obtain the corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms further includes: correcting the original phases and original amplitudes of the plurality of second spectrograms based on the first spectrogram by using the two-dimensional convolution module, the dense connection expansion convolution module, and the time-domain frequency-domain correction module to obtain the corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms.
In the embodiment of the present application, the correction model further includes a dense connection expansion convolution module in addition to the two-dimensional convolution module and the time-domain frequency-domain correction module. The dense connection expansion convolution module is used to enlarge the receptive field (its corresponding parameter is the expansion rate, i.e., the dilation rate). The receptive field is the size of the region in the original input features onto which a single element of the feature map output by a layer of the convolutional neural network maps back; the larger the receptive field, the larger the area of the original input that is observed, and the more global the information. The dense connection expansion convolution module is based on a dense block and keeps features propagating effectively; expansion convolutions with different expansion rates are added into the dense block step by step, and multi-scale spatial context information is aggregated through these densely connected expansion convolutions, so that the network can enlarge its receptive field without increasing the parameter count or losing spatial resolution, while avoiding the grid artifacts brought by expansion convolution.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an embodiment of a correction model 10, where the correction model 10 includes two-dimensional convolution modules, dense connection expansion convolution modules, and a time-domain frequency-domain correction module 103. In some embodiments, the two-dimensional convolution modules include a first two-dimensional convolution module 101a and a second two-dimensional convolution module 101b, the dense connection expansion convolution modules include a first dense connection expansion convolution module 102a and a second dense connection expansion convolution module 102b, and the first dense connection expansion convolution module 102a and the second dense connection expansion convolution module 102b each include four two-dimensional expansion convolution sub-modules. In the embodiment of the present application, the number of two-dimensional convolution modules is two or more (two are illustrated in the figure), and the number of dense connection expansion convolution modules is two or more (two are illustrated in the figure); the more two-dimensional convolution modules and dense connection expansion convolution modules there are, the more computing resources and computing time are required; if the user wishes to shorten the computation time, two two-dimensional convolution modules and two dense connection expansion convolution modules can generally be selected. The two-dimensional convolution module and the dense connection expansion convolution module arranged before the time-domain frequency-domain correction module 103 may be referred to as the first two-dimensional convolution module 101a and the first dense connection expansion convolution module 102a, and those arranged after it as the second two-dimensional convolution module 101b and the second dense connection expansion convolution module 102b. The dense blocks of the first dense connection expansion convolution module 102a and the second dense connection expansion convolution module 102b each include four two-dimensional expansion convolution sub-modules, that is, 4 layers; each of the first 3 layers receives the information of all previous layers, and its output serves as part of the input of the next layer. From front to back, the expansion rate of the 4 convolution layers increases gradually. For example: the first layer has an expansion rate of 1, the second layer 2, the third layer 4, and the fourth layer 8.
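A hedged sketch of one such densely connected dilated-convolution block with the expansion (dilation) rates 1, 2, 4, 8 named above; the channel count and kernel size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DenseDilatedBlock(nn.Module):
    """Four 2-D dilated convolution layers with dilation rates 1, 2, 4, 8;
    each layer receives the concatenation of the block input and all earlier
    layer outputs, as in a dense block."""
    def __init__(self, channels=64):
        super().__init__()
        self.layers = nn.ModuleList()
        for i, dilation in enumerate((1, 2, 4, 8)):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels * (i + 1), channels, kernel_size=3,
                          dilation=dilation, padding=dilation),  # keeps height/width
                nn.PReLU(),
            ))

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return feats[-1]

block = DenseDilatedBlock(channels=64)
out = block(torch.randn(2, 64, 257, 100))            # (batch, channel, frequency, time)
```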
At this time, in step S102, the correcting, by using the two-dimensional convolution module, the dense connection expansion convolution module, and the time-domain frequency-domain correction module, the original phases and original amplitudes of the plurality of second spectrograms based on the first spectrogram to obtain the corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms may further include: sub-step S1021, sub-step S1022, sub-step S1023, sub-step S1024, sub-step S1025, sub-step S1026, and sub-step S1027, as shown in fig. 3.
Substep S1021: and splicing the first spectrogram and the plurality of second spectrograms to obtain a spliced spectrogram.
Substep S1022: and inputting the spliced spectrogram into the first two-dimensional convolution module.
Substep S1023: and inputting the output result of the first two-dimensional convolution module into the first dense connection expansion convolution module.
Substep S1024: and inputting the output result of the first dense connection expansion convolution module into the time domain and frequency domain correction module.
Substep S1025: and inputting the output result of the time domain frequency domain correction module into the second dense connection expansion convolution module.
Substep S1026: and inputting the output result of the second dense joint expansion convolution module into the second two-dimensional convolution module.
Substep S1027: and obtaining correction phases and correction amplitudes corresponding to the second spectrogram according to an output result of the second two-dimensional convolution module.
Referring to fig. 4, in some embodiments, the time-domain frequency-domain correction module 103 includes a first structure reshaping submodule 1031, a bidirectional long-short term memory submodule 1032, a second structure reshaping submodule 1033, a multi-head self-attention submodule 1034, and a third structure reshaping submodule 1035. The first structure reshaping submodule 1031 receives the output result of the first dense connection expansion convolution module 102a, and its output result is passed to the bidirectional long-short term memory submodule 1032; the second structure reshaping submodule 1033 receives the output result of the bidirectional long-short term memory submodule 1032, and its output result is passed to the multi-head self-attention submodule 1034; the third structure reshaping submodule 1035 receives the output result of the multi-head self-attention submodule 1034 and passes its output result to the second dense connection expansion convolution module 102b. The three structure reshaping submodules in the embodiment of the present application mainly reshape the structure of the data input from the previous layer so as to meet the data requirements of the next layer.
The substep S1024, inputting the output result of the first dense connection expansion convolution module into the time-domain frequency-domain correction module, may include: substep S10241, substep S10242, substep S10243, substep S10244, and substep S10245, as shown in fig. 5.
Substep S10241: and inputting the output result of the first intensive connection expansion convolution module into the first structure reshaping submodule so as to perform first structure reshaping on the output result of the first intensive connection expansion convolution module.
The output result of the first dense connection expansion convolution module is a feature R^{B×F×T×C}, where B denotes the batch-size (batch) dimension, C the channel dimension, T the time dimension, and F the frequency dimension. The first structure reshaping submodule reshapes the feature R^{B×F×T×C} into R^{(B·T)×F×C}, so that the bidirectional long-short term memory submodule can capture the frequency dependency within a frame.
Substep S10242: and inputting the output result of the first structure reshaping submodule into the bidirectional long-short term memory submodule.
The bidirectional long-short term memory submodule comprises a BLSTM (Bi-directional Long-Short Term Memory) neural network, which consists of two independent LSTM (Long-Short Term Memory) networks. LSTM, as a variant of the RNN (Recurrent Neural Network), can learn long-term dependencies in data and copes better with the vanishing- and exploding-gradient problems than a conventional RNN; compared with LSTM, BLSTM can extract and represent features better and achieves a better effect. In the embodiment of the present application, the bidirectional long-short term memory submodule captures the frequency dependency within a frame.
Substep S10243: and inputting the output result of the bidirectional long-short term memory submodule into the second structure remodeling submodule so as to perform second structure remodeling on the output result of the bidirectional long-short term memory submodule.
The output result of the bidirectional long-short term memory submodule is a feature R^{(B·T)×F×C}; the second structure reshaping submodule reshapes the feature R^{(B·T)×F×C} into R^{(B·F)×T×C}, so that the multi-head self-attention submodule can capture the time dependency between different frames.
Substep S10244: and inputting the output result of the second structure reshaping submodule into the multi-head self-attention submodule.
The multi-head self-attention submodule includes an MHSA (Multi-Head Self-Attention) mechanism. The multi-head self-attention mechanism can be considered better than single-head self-attention at learning relations, namely the relations between the current token and the other tokens in a sequence. In the embodiment of the present application, the multi-head self-attention submodule captures the time dependency between different frames.
Substep S10245: and inputting the output result of the multi-head self-attention submodule into the third structure reshaping submodule so as to reshape the third structure of the output result of the multi-head self-attention submodule.
The output result of the multi-head self-attention submodule is a feature R^{(B·F)×T×C}; the third structure reshaping submodule reshapes the feature R^{(B·F)×T×C} back into R^{B×F×T×C}, so that it can be input into the second dense connection expansion convolution module.
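Putting sub-steps S10241 to S10245 together, the reshaping, BLSTM, and multi-head self-attention chain might be sketched as follows (the shapes follow the description above; the concrete sizes and head count are illustrative assumptions):

```python
import torch
import torch.nn as nn

B, F, T, C = 2, 257, 100, 64
feat = torch.randn(B, F, T, C)                   # output of the first dense expansion module

# S10241: first structure reshaping, R^{B×F×T×C} -> R^{(B·T)×F×C}
x = feat.permute(0, 2, 1, 3).reshape(B * T, F, C)

# S10242: BLSTM along the frequency axis captures intra-frame frequency dependency
blstm = nn.LSTM(C, C // 2, bidirectional=True, batch_first=True)
x, _ = blstm(x)                                  # (B*T, F, C) since 2*(C//2) = C

# S10243: second structure reshaping, R^{(B·T)×F×C} -> R^{(B·F)×T×C}
x = x.reshape(B, T, F, C).permute(0, 2, 1, 3).reshape(B * F, T, C)

# S10244: multi-head self-attention along the time axis captures inter-frame dependency
mhsa = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
x, _ = mhsa(x, x, x)

# S10245: third structure reshaping, back to R^{B×F×T×C}
x = x.reshape(B, F, T, C)
```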
At this time, the sub-step S1025 of inputting the output result of the time-domain frequency-domain correction module into the second dense connection expansion convolution module may include: inputting the output result of the third structure reshaping submodule into the second dense connection expansion convolution module.
Referring also to fig. 6, in some embodiments, the time-domain frequency-domain correction module 103 further includes a first residual connection submodule 1036, a first layer normalization submodule 1037, a second residual connection submodule 1038, and a second layer normalization submodule 1039. In this embodiment, the time-domain frequency-domain correction module 103 thus contains two residual connection submodules and two layer normalization submodules. The output result of the first structure reshaping submodule 1031 is copied into two parts: one copy is input into the bidirectional long-short term memory submodule 1032, and the first residual connection submodule 1036 receives the output result of the bidirectional long-short term memory submodule 1032 together with the other copy of the output result of the first structure reshaping submodule 1031; the output result of the first residual connection submodule 1036 is passed to the first layer normalization submodule 1037, whose output result is passed to the second structure reshaping submodule 1033. Likewise, the output result of the second structure reshaping submodule 1033 is copied into two parts: one copy is input into the multi-head self-attention submodule 1034, and the second residual connection submodule 1038 receives the output result of the multi-head self-attention submodule 1034 together with the other copy of the output result of the second structure reshaping submodule 1033; the output result of the second residual connection submodule 1038 is passed to the second layer normalization submodule 1039, whose output result is passed to the third structure reshaping submodule 1035.
The effects of the residual connection mainly include the following. For some layers it is not certain whether their effect is positive; after a residual connection is added, the output result of the previous layer is copied into two parts that take two paths: one copy is passed into the layer, and the layer's output result is added to the other copy of the previous layer's output result, with the sum input into the next layer. In this way the performance of the model can be effectively improved; on the one hand the model complexity is reduced, which mitigates overfitting, and on the other hand the gradient explosion and vanishing problems are prevented. The role of layer normalization mainly includes accelerating the training process of the model and making it converge faster by normalizing the activation values of each layer.
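As a generic illustration of this residual-plus-layer-normalization pattern (a sketch, not the exact submodules of the correction model):

```python
import torch
import torch.nn as nn

def residual_layer_norm(x, sublayer, norm):
    """Copy the previous layer's output, pass one copy through the sublayer,
    add the other copy back, and layer-normalize the sum."""
    return norm(x + sublayer(x))

x = torch.randn(8, 100, 64)
norm = nn.LayerNorm(64)
sublayer = nn.Linear(64, 64)   # stand-in for the BLSTM or self-attention submodule
y = residual_layer_norm(x, sublayer, norm)
```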
As shown in fig. 7, the method may further include: step SA, copying the output result of the first structure reshaping submodule into two parts. The substep S10242, inputting the output result of the first structure reshaping submodule into the bidirectional long-short term memory submodule, may further include: inputting one copy of the output result of the first structure reshaping submodule into the bidirectional long-short term memory submodule. The substep S10243, inputting the output result of the bidirectional long-short term memory submodule into the second structure reshaping submodule, may further include: substep S102431, substep S102432, and substep S102433.
Substep S102431: and inputting the output result of the bidirectional long-short term memory submodule and the output result of the other first structure reshaping submodule into the first residual connecting submodule.
Sub-step S102432: and inputting the output result of the first residual error connection sub-module into the first-layer normalization sub-module.
Substep S102433: and inputting the output result of the first layer of normalization sub-module into the second structure reshaping sub-module.
Referring to fig. 7, the method further includes: step SB, copying the output result of the second structure reshaping submodule into two parts. In this case, in sub-step S10244, inputting the output result of the second structure reshaping submodule into the multi-head self-attention submodule may further include: inputting one copy of the output result of the second structure reshaping submodule into the multi-head self-attention submodule.
The substep S10245, inputting the output result of the multi-head self-attention submodule to the third structure reshaping submodule, may further include: substep S102451, substep S102452, substep S102453.
Sub-step S102451: and inputting the output result of the multi-head self-attention sub-module and the output result of the other second structure reshaping sub-module into the second residual error connecting sub-module.
Substep S102452: and inputting the output result of the second residual error connection sub-module into the second-layer normalization sub-module.
Substep S102453: and inputting the output result of the second layer of normalization sub-module into the third structure reshaping sub-module.
Referring to fig. 8, in some embodiments, the time-domain frequency-domain correction module 103 includes a first time-domain frequency-domain correction module 103a and a second time-domain frequency-domain correction module 103b, and a finer-grained dependency relationship between a frequency direction and a time direction can be extracted by stacking a plurality of time-domain frequency-domain correction modules.
In some embodiments, the correction model 10 further comprises a frequency domain-time domain feature conversion module 104 and a time domain-frequency domain feature conversion module 105. The frequency domain-time domain feature conversion module 104 performs frequency domain-to-time domain conversion, and the time domain-frequency domain feature conversion module 105 performs the inverse, time domain-to-frequency domain conversion. Repeatedly learning only frequency features tends to limit further improvement; since one frame of frequency-domain information corresponds to the waveform information within one time window, the two can be converted into each other by an FFT (Fast Fourier Transform). Through such a simple and effective conversion, the correction model can learn the dependency relationship between the time features and the frequency features from the two dimensions of the waveform and the spectrogram. Therefore, the correction model adds two modules: the frequency domain-time domain feature conversion module 104 and the time domain-frequency domain feature conversion module 105.
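The application only states that the two conversions are mutually inverse FFT-based operations; a hedged feature-level sketch (the axis and tensor sizes chosen here are assumptions) might be:

```python
import torch

time_feat = torch.randn(8, 512, 100)             # (batch, window samples, time) feature
freq_feat = torch.fft.rfft(time_feat, dim=1)     # time-domain -> frequency-domain feature
back = torch.fft.irfft(freq_feat, n=512, dim=1)  # frequency-domain -> time-domain (inverse)
assert torch.allclose(back, time_feat, atol=1e-4)
```

In the correction model the frequency-domain-to-time-domain conversion comes first; the sketch simply demonstrates that the pair of conversions is mutually inverse.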
In this embodiment of the application, the number of time domain and frequency domain correction modules is two or more. The time domain and frequency domain correction module arranged before the frequency domain-time domain feature conversion module 104 may be referred to as the first time domain and frequency domain correction module 103a, and the one arranged before the time domain-frequency domain feature conversion module 105 as the second time domain and frequency domain correction module 103b. The connection relationship may include n basic connection units connected in sequence, where a basic connection unit is: a first time domain and frequency domain correction module 103a, a frequency domain-time domain feature conversion module 104, a second time domain and frequency domain correction module 103b, and a time domain-frequency domain feature conversion module 105. The figure illustrates two time domain and frequency domain correction modules as an example: the first time domain and frequency domain correction module 103a receives the output result of the first dense connection expansion convolution module 102a, and its output result is passed to the frequency domain-time domain feature conversion module 104; the second time domain and frequency domain correction module 103b receives the output result of the frequency domain-time domain feature conversion module 104, its output result is passed to the time domain-frequency domain feature conversion module 105, and the output result of the time domain-frequency domain feature conversion module 105 is passed to the second dense connection expansion convolution module 102b.
Referring to fig. 9, in this case, in sub-step S1024, the inputting the output result of the first dense connection expansion convolution module into the time-domain frequency-domain correction module may include: inputting the output result of the first dense connection expansion convolution module into the first time domain and frequency domain correction module.
The method may further comprise: step SC1 and step SC2.
And step SC1, inputting the output result of the first time domain and frequency domain correction module into the frequency domain-time domain feature conversion module.
And step SC2, inputting the output result of the frequency domain-time domain feature conversion module into the second time domain and frequency domain correction module.
The sub-step S1025 of inputting the output result of the time-domain frequency-domain correction module into the second dense connection expansion convolution module may include: substep S10251 and substep S10252.
Substep S10251: And inputting the output result of the second time domain and frequency domain correction module into the time domain-frequency domain feature conversion module.
Substep S10252: And inputting the output result of the time domain-frequency domain feature conversion module into the second dense connection expansion convolution module.
It should be noted that, with respect to the above embodiments of the correction model that include the dense connection expansion convolution module, the dense connection expansion convolution module can be omitted when there is no need to enlarge the receptive field. This is described in detail as follows:
in some embodiments, the correction model includes a two-dimensional convolution module and a time-domain frequency-domain correction module. In some embodiments, the two-dimensional convolution module includes a first two-dimensional convolution module and a second two-dimensional convolution module, and the step S102 of correcting, by using the two-dimensional convolution module and the time-domain frequency-domain correction module, the original phases and original amplitudes of the plurality of second spectrograms based on the first spectrogram to obtain the corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms may further include: splicing the first spectrogram and the plurality of second spectrograms to obtain a spliced spectrogram; inputting the spliced spectrogram into the first two-dimensional convolution module; inputting the output result of the first two-dimensional convolution module into the time-domain frequency-domain correction module; inputting the output result of the time-domain frequency-domain correction module into the second two-dimensional convolution module; and obtaining the corrected phases and corrected amplitudes corresponding to the second spectrograms according to the output result of the second two-dimensional convolution module.
In some embodiments, the time-domain frequency-domain correction module comprises a first structure reshaping submodule, a bidirectional long-short-term memory submodule, a second structure reshaping submodule, a multi-head self-attention submodule and a third structure reshaping submodule, wherein the first structure reshaping submodule receives an output result of the first two-dimensional convolution module, and the output result is transmitted to the bidirectional long-short-term memory submodule; the second structure reshaping submodule receives an output result of the bidirectional long-short term memory submodule, and the output result is transmitted to the multi-head self-attention submodule; and the third structure reshaping submodule receives an output result of the multi-head self-attention submodule, and the output result is transmitted to the second two-dimensional convolution module.
In some embodiments, the time-domain frequency-domain correction module further comprises a first residual connection submodule, a first layer normalization submodule, a second residual connection submodule, and a second layer normalization submodule. The output result of the first structure reshaping submodule is copied into two parts: one copy is input into the bidirectional long-short term memory submodule, and the first residual connection submodule receives the output result of the bidirectional long-short term memory submodule together with the other copy of the output result of the first structure reshaping submodule; the output result of the first residual connection submodule is passed to the first layer normalization submodule, and the output result of the first layer normalization submodule is passed to the second structure reshaping submodule. Likewise, the output result of the second structure reshaping submodule is copied into two parts: one copy is input into the multi-head self-attention submodule, and the second residual connection submodule receives the output result of the multi-head self-attention submodule together with the other copy of the output result of the second structure reshaping submodule; the output result of the second residual connection submodule is passed to the second layer normalization submodule, and the output result of the second layer normalization submodule is passed to the third structure reshaping submodule.
In some embodiments, the time-domain frequency-domain correction module comprises a first time-domain frequency-domain correction module and a second time-domain frequency-domain correction module.
In some embodiments, the correction model further comprises: the device comprises a frequency domain-time domain characteristic conversion module and a time domain-frequency domain characteristic conversion module. The frequency domain-time domain feature conversion module is used for carrying out frequency domain-time domain conversion, and the time domain-frequency domain feature conversion module is used for carrying out time domain-frequency domain conversion, and is the inverse process of the frequency domain-time domain conversion.
In this embodiment of the present application, the number of time domain and frequency domain correction modules is two or more. The time domain and frequency domain correction module arranged before the frequency domain-time domain feature conversion module may be referred to as the first time domain and frequency domain correction module, and the one arranged before the time domain-frequency domain feature conversion module as the second time domain and frequency domain correction module. The connection relationship may include n basic connection units connected in sequence, where a basic connection unit is: a first time domain and frequency domain correction module, a frequency domain-time domain feature conversion module, a second time domain and frequency domain correction module, and a time domain-frequency domain feature conversion module. Taking two time domain and frequency domain correction modules as an example: the first time domain and frequency domain correction module receives the output result of the first two-dimensional convolution module, and its output result is passed to the frequency domain-time domain feature conversion module; the second time domain and frequency domain correction module receives the output result of the frequency domain-time domain feature conversion module, its output result is passed to the time domain-frequency domain feature conversion module, and the output result of the time domain-frequency domain feature conversion module is passed to the second two-dimensional convolution module.
In some embodiments, the step S101 of acquiring the first spectrogram and the plurality of second spectrograms may include sub-step S1011 and sub-step S1012, as shown in fig. 10.
Substep S1011: and separating the original voice signals by utilizing a time domain model to obtain a plurality of original separated voice signals.
The input of the time domain model is the waveform of the speech, and the waveform of each separated source is obtained through neural network prediction; such models include, but are not limited to, DPRNN (Dual-Path Recurrent Neural Network), DPTNet (Dual-Path Transformer Network), TasNet (Time-domain Audio Separation Network), and so on.
Sub-step S1012: and respectively carrying out short-time Fourier transform on the original voice signal and the original separated voice signals to obtain a first spectrogram of the original voice signal and a second spectrogram of the original separated voice signals.
In some embodiments, the step S104 of obtaining a plurality of corrected separated speech signals according to the plurality of corrected second spectrograms may include: performing an inverse short-time Fourier transform on the plurality of corrected second spectrograms to obtain the plurality of corrected separated speech signals.
In the embodiment of the present application, the correction model is added after the separation step of the time domain model, so that the real and imaginary information, or the phase and amplitude, in the frequency domain can be corrected. In one embodiment, the correction model adopted is as follows: two two-dimensional convolution modules (namely a first two-dimensional convolution module and a second two-dimensional convolution module); two dense connection expansion convolution modules (namely a first dense connection expansion convolution module and a second dense connection expansion convolution module), each including four two-dimensional expansion convolution sub-modules; and eight time-domain frequency-domain correction modules (namely four first time-domain frequency-domain correction modules and four second time-domain frequency-domain correction modules), each comprising: a first structure reshaping submodule, a bidirectional long-short term memory submodule, a first residual connection submodule, a first layer normalization submodule, a second structure reshaping submodule, a multi-head self-attention submodule, a second residual connection submodule, a second layer normalization submodule, and a third structure reshaping submodule. When speech separation is performed with this correction model, the SI-SDR (Scale-Invariant Source-to-Distortion Ratio) on the WSJ0-2mix dataset is 22.2 dB and the SI-SDR on the Libri2Mix dataset is 19.4 dB, reaching the current state-of-the-art performance.
Referring to fig. 11, fig. 11 is a block diagram of an embodiment of the speech separation apparatus of the present application, it should be noted that the speech separation apparatus of the present application can implement the speech separation method, and please refer to the above method section for detailed description of related contents, which is not described herein again.
The apparatus 100 comprises a memory 1 and a processor 2; the memory 1 is used for storing a computer program; the processor 2 is adapted to execute the computer program and to implement the speech separation method as described in any of the above when executing the computer program.
The processor 2 may be a micro-control unit, a central processing unit, a digital signal processor, or the like. The memory 1 may be a Flash chip, a read-only memory, a magnetic disk, an optical disk, a USB disk, a removable hard disk, or the like.
The present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement a speech separation method as described in any one of the above.
The computer-readable storage medium may be an internal storage unit of the above apparatus, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device equipped on the apparatus, such as a plug-in hard drive, a smart memory card, a secure digital card, or a flash memory card.
It is to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The above description is only for the specific embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A method of speech separation, the method comprising:
acquiring a first spectrogram and a plurality of second spectrograms, wherein the first spectrogram is a spectrogram of an original voice signal, and the plurality of second spectrograms are spectrograms of a plurality of original separated voice signals separated from the original voice signal;
correcting the original phases and the original amplitudes of the plurality of second spectrograms based on the first spectrogram by using a correction model to obtain corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms, wherein the correction model comprises a two-dimensional convolution module;
obtaining a plurality of corrected second spectrograms according to the original phases and the original amplitudes of the plurality of second spectrograms and the corresponding corrected phases and corrected amplitudes;
obtaining a plurality of corrected separated voice signals according to the plurality of corrected second spectrograms;
the correction model further comprises a time-domain frequency-domain correction module, and the time-domain frequency-domain correction module is used for determining the dependency relationship between the time direction and the frequency direction of the second spectrograms;
the correcting the original phases and the original amplitudes of the plurality of second spectrograms based on the first spectrogram by using the correction model to obtain the corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms includes:
correcting the original phases and the original amplitudes of the plurality of second spectrograms based on the first spectrogram by using the two-dimensional convolution module and the time-domain frequency-domain correction module to obtain the corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms;
wherein the correction model further comprises a dense connection dilation convolution module;
the correcting, by using the two-dimensional convolution module and the time-domain frequency-domain correction module, the original phases and the original amplitudes of the plurality of second spectrograms based on the first spectrogram to obtain the corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms includes:
correcting the original phases and the original amplitudes of the plurality of second spectrograms based on the first spectrogram by using the two-dimensional convolution module, the dense connection dilation convolution module, and the time-domain frequency-domain correction module to obtain the corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms.
2. The method of claim 1, wherein the two-dimensional convolution module comprises a first two-dimensional convolution module and a second two-dimensional convolution module, the dense connection dilation convolution module comprises a first dense connection dilation convolution module and a second dense connection dilation convolution module, and the first dense connection dilation convolution module and the second dense connection dilation convolution module each comprise four two-dimensional dilation convolution sub-modules;
the correcting the original phases and the original amplitudes of the plurality of second spectrograms by using the two-dimensional convolution module, the dense connection dilation convolution module, and the time-domain frequency-domain correction module based on the first spectrogram to obtain the corrected phases and corrected amplitudes corresponding to the plurality of second spectrograms comprises:
splicing the first spectrogram and the plurality of second spectrograms to obtain a spliced spectrogram;
inputting the spliced spectrogram into the first two-dimensional convolution module;
inputting the output result of the first two-dimensional convolution module into the first dense connection expansion convolution module;
inputting the output result of the first dense connection expansion convolution module into the time domain and frequency domain correction module;
inputting the output result of the time domain frequency domain correction module into the second dense connection expansion convolution module;
inputting the output result of the second dense joint expansion convolution module into the second two-dimensional convolution module;
and obtaining correction phases and correction amplitudes corresponding to the second spectrogram according to an output result of the second two-dimensional convolution module.
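A minimal PyTorch sketch of the module chain recited in claim 2 is given below. The channel count, kernel sizes, and dilation rates (1, 2, 4, 8) are illustrative assumptions not fixed by the claim, and the input is assumed to stack real and imaginary parts of the spliced spectrograms as channels; `TimeFreqCorrection` is the submodule sketched under claim 3 below.

```python
import torch
import torch.nn as nn

class DenseDilatedConv(nn.Module):
    """Four 2-D dilated convolution submodules with dense (concatenative)
    skip connections. Dilating along the time axis with rates 1/2/4/8 is an
    assumed design choice."""
    def __init__(self, ch):
        super().__init__()
        self.blocks = nn.ModuleList()
        for i, d in enumerate([1, 2, 4, 8]):
            self.blocks.append(nn.Sequential(
                nn.Conv2d(ch * (i + 1), ch, kernel_size=3,
                          dilation=(d, 1), padding=(d, 1)),
                nn.PReLU()))

    def forward(self, x):
        feats = [x]
        for block in self.blocks:
            feats.append(block(torch.cat(feats, dim=1)))
        return feats[-1]

class CorrectionModel(nn.Module):
    def __init__(self, n_src, ch=64):
        super().__init__()
        # Input: mixture + n_src separated spectrograms, real/imag channels.
        self.conv_in = nn.Conv2d(2 * (n_src + 1), ch, kernel_size=1)
        self.ddc1 = DenseDilatedConv(ch)
        self.tf_correct = TimeFreqCorrection(ch)  # see claim 3 sketch below
        self.ddc2 = DenseDilatedConv(ch)
        # Output: corrected phase/amplitude (as real/imag) per source.
        self.conv_out = nn.Conv2d(ch, 2 * n_src, kernel_size=1)

    def forward(self, spliced):  # spliced: (B, 2*(n_src+1), T, F)
        x = self.conv_in(spliced)
        x = self.ddc1(x)
        x = self.tf_correct(x)
        x = self.ddc2(x)
        return self.conv_out(x)
```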
3. The method of claim 2, wherein the time-domain and frequency-domain correction module comprises a first structure reshaping submodule, a bidirectional long short-term memory submodule, a second structure reshaping submodule, a multi-head self-attention submodule, and a third structure reshaping submodule;
the inputting the output result of the first densely-connected dilated convolution module into the time-domain and frequency-domain correction module comprises:
inputting the output result of the first densely-connected dilated convolution module into the first structure reshaping submodule to perform first structure reshaping on the output result of the first densely-connected dilated convolution module;
inputting the output result of the first structure reshaping submodule into the bidirectional long short-term memory submodule;
inputting the output result of the bidirectional long short-term memory submodule into the second structure reshaping submodule to perform second structure reshaping on the output result of the bidirectional long short-term memory submodule;
inputting the output result of the second structure reshaping submodule into the multi-head self-attention submodule;
inputting the output result of the multi-head self-attention submodule into the third structure reshaping submodule to perform third structure reshaping on the output result of the multi-head self-attention submodule;
the inputting the output result of the time-domain and frequency-domain correction module into the second densely-connected dilated convolution module comprises:
and inputting the output result of the third structure reshaping submodule into the second densely-connected dilated convolution module.
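A sketch of the submodule chain of claim 3. The claim fixes only the order reshape, BLSTM, reshape, self-attention, reshape; running the BLSTM over the frequency axis and the self-attention over the time axis, as well as the hidden size and head count, are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

class TimeFreqCorrection(nn.Module):
    """Reshape -> BLSTM -> reshape -> multi-head self-attention -> reshape.
    Axis assignment (BLSTM over frequency, attention over time) is assumed."""
    def __init__(self, ch, n_heads=4):
        super().__init__()
        self.blstm = nn.LSTM(ch, ch // 2, bidirectional=True, batch_first=True)
        self.attn = nn.MultiheadAttention(ch, n_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, T, F)
        b, c, t, f = x.shape
        # First structure reshaping: fold time into batch, scan frequency.
        y = x.permute(0, 2, 3, 1).reshape(b * t, f, c)
        y, _ = self.blstm(y)
        # Second structure reshaping: fold frequency into batch, attend over time.
        y = y.reshape(b, t, f, c).permute(0, 2, 1, 3).reshape(b * f, t, c)
        y, _ = self.attn(y, y, y)
        # Third structure reshaping: restore (B, C, T, F).
        return y.reshape(b, f, t, c).permute(0, 3, 2, 1)
```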
4. The method of claim 3, wherein the time-domain and frequency-domain correction module further comprises a first residual connection submodule, a first layer-normalization submodule, a second residual connection submodule, and a second layer-normalization submodule;
the method further comprises:
copying the output result of the first structure reshaping submodule to obtain two copies;
the inputting the output result of the first structure reshaping submodule into the bidirectional long short-term memory submodule comprises:
inputting one copy of the output result of the first structure reshaping submodule into the bidirectional long short-term memory submodule;
the inputting the output result of the bidirectional long short-term memory submodule into the second structure reshaping submodule comprises:
inputting the output result of the bidirectional long short-term memory submodule and the other copy of the output result of the first structure reshaping submodule into the first residual connection submodule;
inputting the output result of the first residual connection submodule into the first layer-normalization submodule;
inputting the output result of the first layer-normalization submodule into the second structure reshaping submodule;
the method further comprises:
copying the output result of the second structure reshaping submodule to obtain two copies;
the inputting the output result of the second structure reshaping submodule into the multi-head self-attention submodule comprises:
inputting one copy of the output result of the second structure reshaping submodule into the multi-head self-attention submodule;
the inputting the output result of the multi-head self-attention submodule into the third structure reshaping submodule comprises:
inputting the output result of the multi-head self-attention submodule and the other copy of the output result of the second structure reshaping submodule into the second residual connection submodule;
inputting the output result of the second residual connection submodule into the second layer-normalization submodule;
and inputting the output result of the second layer-normalization submodule into the third structure reshaping submodule.
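Claim 4 wraps the BLSTM and the self-attention in residual connections followed by layer normalization. A sketch extending the claim-3 module; applying the normalization over the channel dimension is an assumption.

```python
import torch.nn as nn

class TimeFreqCorrectionResidual(TimeFreqCorrection):
    """As in the claim-3 sketch, but each copied input bypasses its submodule
    and is added back (residual connection), then layer-normalized."""
    def __init__(self, ch, n_heads=4):
        super().__init__(ch, n_heads)
        self.norm1 = nn.LayerNorm(ch)
        self.norm2 = nn.LayerNorm(ch)

    def forward(self, x):  # x: (B, C, T, F)
        b, c, t, f = x.shape
        y = x.permute(0, 2, 3, 1).reshape(b * t, f, c)    # first reshaping
        y = self.norm1(self.blstm(y)[0] + y)              # residual + layer norm
        y = y.reshape(b, t, f, c).permute(0, 2, 1, 3).reshape(b * f, t, c)
        y = self.norm2(self.attn(y, y, y)[0] + y)         # residual + layer norm
        return y.reshape(b, f, t, c).permute(0, 3, 2, 1)  # third reshaping
```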
5. The method of claim 2, wherein the time-domain and frequency-domain correction module comprises a first time-domain and frequency-domain correction module and a second time-domain and frequency-domain correction module, and the correction model further comprises a frequency-domain-to-time-domain feature conversion module and a time-domain-to-frequency-domain feature conversion module;
the inputting the output result of the first densely-connected dilated convolution module into the time-domain and frequency-domain correction module comprises:
inputting the output result of the first densely-connected dilated convolution module into the first time-domain and frequency-domain correction module;
the method further comprises:
inputting the output result of the first time-domain and frequency-domain correction module into the frequency-domain-to-time-domain feature conversion module;
inputting the output result of the frequency-domain-to-time-domain feature conversion module into the second time-domain and frequency-domain correction module;
the inputting the output result of the time-domain and frequency-domain correction module into the second densely-connected dilated convolution module comprises:
inputting the output result of the second time-domain and frequency-domain correction module into the time-domain-to-frequency-domain feature conversion module;
and inputting the output result of the time-domain-to-frequency-domain feature conversion module into the second densely-connected dilated convolution module.
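A sketch of the dual-module arrangement of claim 5. Modeling each feature conversion module as a swap of the time and frequency axes is purely an assumption for illustration; the claim names the conversion modules without specifying their internals.

```python
import torch.nn as nn

class DualTimeFreqCorrection(nn.Module):
    """Two time-domain and frequency-domain correction modules bridged by
    feature conversions, here assumed to be time/frequency axis swaps."""
    def __init__(self, ch):
        super().__init__()
        self.tf1 = TimeFreqCorrection(ch)  # from the claim 3 sketch
        self.tf2 = TimeFreqCorrection(ch)

    def forward(self, x):          # x: (B, C, T, F)
        x = self.tf1(x)
        x = x.transpose(2, 3)      # frequency-domain-to-time-domain conversion
        x = self.tf2(x)
        return x.transpose(2, 3)   # time-domain-to-frequency-domain conversion
```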
6. The method of claim 1, wherein the acquiring a first spectrogram and a plurality of second spectrograms comprises:
separating the original voice signal by using a time-domain model to obtain the plurality of original separated voice signals;
and performing short-time Fourier transform on the original voice signal and on the plurality of original separated voice signals, respectively, to obtain the first spectrogram of the original voice signal and the second spectrograms of the plurality of original separated voice signals.
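A sketch of this step using `torch.stft`; the FFT size, hop length, and window are illustrative choices, and the time-domain separation model itself is outside the scope of this snippet.

```python
import torch

def to_spectrograms(mixture, separated, n_fft=512, hop=128):
    """mixture: (B, L) original voice signal; separated: (B, n_src, L)
    original separated voice signals. Returns complex spectrograms."""
    win = torch.hann_window(n_fft)
    stft = lambda sig: torch.stft(sig, n_fft, hop_length=hop,
                                  window=win, return_complex=True)
    first = stft(mixture)                                   # (B, F, T)
    second = torch.stack([stft(separated[:, i])
                          for i in range(separated.shape[1])], dim=1)
    return first, second                                    # (B, F, T), (B, n_src, F, T)
```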
7. The method of claim 1, wherein the obtaining a plurality of corrected separated voice signals according to the plurality of corrected second spectrograms comprises:
performing inverse short-time Fourier transform on the plurality of corrected second spectrograms to obtain the plurality of corrected separated voice signals.
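The corresponding inverse transform via `torch.istft`, with the same assumed FFT size and hop length as the sketch under claim 6:

```python
import torch

def to_waveforms(corrected, n_fft=512, hop=128, length=None):
    """corrected: (B, n_src, F, T) complex corrected second spectrograms.
    Returns one corrected separated waveform per source."""
    win = torch.hann_window(n_fft)
    b, n_src, f, t = corrected.shape
    flat = corrected.reshape(b * n_src, f, t)
    wav = torch.istft(flat, n_fft, hop_length=hop, window=win, length=length)
    return wav.reshape(b, n_src, -1)
```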
8. A voice separation apparatus, comprising a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program and, when executing the computer program, implement the voice separation method according to any one of claims 1-7.
9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the voice separation method according to any one of claims 1-7.
CN202211680551.8A 2022-12-27 2022-12-27 Voice separation method, device and storage medium Active CN115691541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211680551.8A CN115691541B (en) 2022-12-27 2022-12-27 Voice separation method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211680551.8A CN115691541B (en) 2022-12-27 2022-12-27 Voice separation method, device and storage medium

Publications (2)

Publication Number Publication Date
CN115691541A (en) 2023-02-03
CN115691541B (en) 2023-03-21

Family

ID=85056374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211680551.8A Active CN115691541B (en) 2022-12-27 2022-12-27 Voice separation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115691541B (en)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101735313B1 (en) * 2013-08-05 2017-05-16 한국전자통신연구원 Phase corrected real-time blind source separation device
KR102125410B1 (en) * 2015-02-26 2020-06-22 프라운호퍼 게젤샤프트 쭈르 푀르데룽 데어 안겐반텐 포르슝 에. 베. Apparatus and method for processing audio signal to obtain processed audio signal using target time domain envelope
CN106531181A (en) * 2016-11-25 2017-03-22 天津大学 Harmonic-extraction-based blind separation method for underdetermined voice and blind separation apparatus thereof
CN109830245B (en) * 2019-01-02 2021-03-12 北京大学 Multi-speaker voice separation method and system based on beam forming
KR20210047614A (en) * 2019-10-22 2021-04-30 한국전자통신연구원 Apparatus and method for separating audio source based on convolutional neural network
CN110807524B (en) * 2019-11-13 2023-11-21 大连民族大学 Single-channel signal blind source separation amplitude correction method
CN110808063A (en) * 2019-11-29 2020-02-18 北京搜狗科技发展有限公司 Voice processing method and device for processing voice
CN114360562A (en) * 2021-12-17 2022-04-15 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and storage medium
CN114743561A (en) * 2022-05-06 2022-07-12 广州思信电子科技有限公司 Voice separation device and method, storage medium and computer equipment
CN115359771B (en) * 2022-07-22 2023-07-07 中国人民解放军国防科技大学 Underwater sound signal noise reduction method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN115691541A (en) 2023-02-03

Similar Documents

Publication Publication Date Title
US10621971B2 (en) Method and device for extracting speech feature based on artificial intelligence
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
US11195093B2 (en) Apparatus and method for student-teacher transfer learning network using knowledge bridge
Yen et al. Cold diffusion for speech enhancement
US20180061397A1 (en) Speech recognition method and apparatus
CN110634499A (en) Neural network for speech denoising with deep feature loss training
CN116030792B (en) Method, apparatus, electronic device and readable medium for converting voice tone
CN110390942A (en) Mood detection method and its device based on vagitus
WO2023005386A1 (en) Model training method and apparatus
WO2022141868A1 (en) Method and apparatus for extracting speech features, terminal, and storage medium
CN112509600A (en) Model training method and device, voice conversion method and device and storage medium
CN113555032B (en) Multi-speaker scene recognition and network training method and device
US20230298611A1 (en) Speech enhancement
CN116959465A (en) Voice conversion model training method, voice conversion method, device and medium
WO2020134547A1 (en) Fixed-point acceleration method and apparatus for data, electronic device and storage medium
Liu et al. Golden gemini is all you need: Finding the sweet spots for speaker verification
CN114155875A (en) Method and device for identifying voice scene tampering, electronic equipment and storage medium
CN115691541B (en) Voice separation method, device and storage medium
CN115691475A (en) Method for training a speech recognition model and speech recognition method
CN116844573A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
CN115905613A (en) Audio and video multitask learning and evaluation method, computer equipment and medium
US12020154B2 (en) Data processing method, electronic device and computer-readable medium
CN111048065B (en) Text error correction data generation method and related device
CN110992966B (en) Human voice separation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant