CN110992966B - Human voice separation method and system - Google Patents

Human voice separation method and system

Info

Publication number
CN110992966B
CN110992966B (application CN201911360803.7A)
Authority
CN
China
Prior art keywords
data
characteristic
voice
neural network
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911360803.7A
Other languages
Chinese (zh)
Other versions
CN110992966A (en)
Inventor
黄明飞
姚宏贵
郝瀚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Open Intelligent Machine Shanghai Co ltd
Original Assignee
Open Intelligent Machine Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Open Intelligent Machine Shanghai Co ltd filed Critical Open Intelligent Machine Shanghai Co ltd
Priority to CN201911360803.7A priority Critical patent/CN110992966B/en
Publication of CN110992966A publication Critical patent/CN110992966A/en
Application granted granted Critical
Publication of CN110992966B publication Critical patent/CN110992966B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 Speech or audio signals analysis-synthesis techniques using spectral analysis, using orthogonal transformation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a human voice separation method and system, belonging to the technical field of separating voice from noise. The method comprises: step S1, obtaining externally input original audio data to be separated; step S2, extracting features from the original audio data with a feature extraction model; step S3, importing the feature coefficients into a recurrent neural network model for processing; and step S4, performing feature restoration on each processing result with a feature restoration model. The system comprises an acquisition module, a feature extraction module, a neural network module and a feature restoration module. The beneficial effects are that the method does not rely on any assumption and has strong anti-interference capability; human voice separation can be achieved simply by preparing, in advance, several pieces of clean human voice data and of the noise data to be separated as training data and training a recurrent neural network model on them; and the method can separate not only the human voice but also the noise sources mixed into it.

Description

Human voice separation method and system
Technical Field
The invention relates to the technical field of separating human voice from noise, and in particular to a human voice separation method and system.
Background
Human voice separation refers to processing a mixed speech signal so as to separate the voice of a target speaker from a complex noise environment. Traditional human voice separation mainly relies on classical algorithms, for example least mean squares (LMS) and least squares (LS). These algorithms depend on many assumptions, such as the source signals being mutually independent, and therefore have significant limitations. Because real application scenarios are complex, these assumptions are rarely satisfied simultaneously, so a traditional human voice separation algorithm may only be effective in a specific application scenario and generally has poor anti-interference capability.
Disclosure of Invention
In view of the above problems in the prior art, a human voice separation method and system are provided. The method is based on artificial-intelligence deep learning and does not rely on any assumption: human voice separation can be achieved simply by preparing, in advance, several pieces of clean human voice data and of the noise data to be separated as training data, and training a recurrent neural network model on them.
The technical solution specifically comprises the following:
A human voice separation method, wherein pre-mixed speech data are used as training data to train and generate a recurrent neural network model, the mixed speech data comprising multiple paths of audio data, the audio data comprising at least one path of human voice data and at least one path of noise data, and the recurrent neural network model being used to respectively identify the human voice data and the noise data; the method further comprises the following steps:
step S1, obtaining externally input original audio data to be separated, wherein at least one path of human voice data and at least one path of noise data are mixed in the original audio data;
step S2, performing feature extraction on the original audio data with a feature extraction model to obtain feature coefficients, the feature coefficients being 22-dimensional BFCC coefficients;
step S3, importing the feature coefficients into the recurrent neural network model for processing to obtain a plurality of processing results in one-to-one correspondence with the human voice data and each path of the noise data;
step S4, performing feature restoration on each processing result with a feature restoration model to obtain the separated human voice data and each path of noise data.
Preferably, the input data in the training data are the feature coefficients of the mixed speech data, and the expected output data in the training data are the clean human voice data and the clean noise data before mixing.
Preferably, step S2 further comprises:
step S21, dividing the original audio data into a plurality of short-time audio frames using overlapping windows;
step S22, performing a Fourier transform and Bark frequency conversion on each short-time audio frame to obtain the feature coefficients.
Preferably, the recurrent neural network model comprises a first gated recurrent unit, a plurality of second gated recurrent units, and a plurality of fully connected units, the second gated recurrent units and the fully connected units being in one-to-one correspondence, and each second gated recurrent unit uniquely corresponding to one path of audio data;
the input end of the first gated recurrent unit serves as the input end of the recurrent neural network model;
the input end of each second gated recurrent unit is connected to the output end of the first gated recurrent unit, the output end of each second gated recurrent unit is connected to the input end of the corresponding fully connected unit, and the output end of each fully connected unit serves as an output end of the recurrent neural network model;
step S3 further comprises:
step S31, the first gated recurrent unit processing the feature coefficients of the input original audio data to obtain 22-dimensional first feature data;
step S32, splicing the first feature data output by the first gated recurrent unit with the feature coefficients to obtain second feature data, and then inputting the second feature data into each of the different second gated recurrent units for processing;
step S33, each second gated recurrent unit processing the second feature data and outputting 44-dimensional third feature data to the corresponding fully connected unit;
step S34, each fully connected unit obtaining and outputting a corresponding processing result from the third feature data, after which the method proceeds to step S4, all processing results being 22-dimensional.
Preferably, step S4 further comprises:
step S41, performing an inverse Fourier transform on each processing result to obtain a corresponding intermediate result;
step S42, restoring each intermediate result by overlapping windowing, thereby separately recovering each path of audio data.
A human voice separation system, wherein pre-mixed speech data are used as training data to train and generate a recurrent neural network model, the mixed speech data comprising multiple paths of audio data, the audio data comprising at least one path of human voice data and at least one path of noise data, and the recurrent neural network model being used to respectively identify the human voice data and the noise data; the human voice separation system specifically comprises:
an acquisition module for acquiring the original audio data to be separated, wherein at least one path of human voice data and at least one path of noise data are mixed in the original audio data;
a feature extraction module, connected to the acquisition module, for performing feature extraction on the original audio data to obtain feature coefficients, the feature coefficients being 22-dimensional BFCC coefficients;
a neural network module, connected to the feature extraction module, in which the recurrent neural network model is preset, the recurrent neural network model being used to process the feature coefficients to obtain a plurality of processing results in one-to-one correspondence with the human voice data and each path of the noise data;
a feature restoration module, connected to the neural network module, for performing feature restoration on each processing result to obtain the separated human voice data and each path of noise data.
Preferably, the input data in the training data are the feature coefficients of the mixed speech data, and the expected output data in the training data are the clean human voice data and the clean noise data before mixing.
Preferably, the feature extraction module further comprises:
a segmentation unit for dividing the original audio data into a plurality of short-time audio frames using overlapping windows;
a first processing unit, connected to the segmentation unit, for performing a Fourier transform and Bark frequency conversion on each short-time audio frame to obtain the feature coefficients.
Preferably, the neural network module further comprises:
a first gated recurrent unit for computing over the feature coefficients to obtain 22-dimensional first feature data;
a splicing unit, connected to the first gated recurrent unit, for splicing the first feature data with the feature coefficients to obtain second feature data;
a plurality of second gated recurrent units, each connected to the splicing unit, each uniquely corresponding to one path of audio data and processing the second feature data, each second gated recurrent unit outputting 44-dimensional third feature data;
a plurality of fully connected units, connected to the second gated recurrent units in one-to-one correspondence, for processing the third feature data to obtain and output corresponding processing results, each processing result being 22-dimensional.
Preferably, the feature restoration module further comprises:
a second processing unit for performing an inverse Fourier transform on each processing result to obtain a corresponding intermediate result;
a restoration unit, connected to the second processing unit, for restoring each intermediate result by overlapping windowing, thereby separately recovering each path of audio data.
The beneficial effects of the above technical solution are:
the method is based on artificial-intelligence deep learning, does not rely on any assumption, and has strong anti-interference capability; human voice separation can be achieved simply by preparing, in advance, several pieces of clean human voice data and of the noise data to be separated as training data and training a recurrent neural network model on them; and the method can separate not only the human voice but also the noise sources mixed into it.
Drawings
FIG. 1 is a flow chart of the steps of a method for separating human voice according to the preferred embodiment of the present invention;
FIG. 2 is a flowchart illustrating the sub-steps of step S2 according to the preferred embodiment of the present invention;
FIG. 3 is a flowchart illustrating the sub-steps of step S3 according to the preferred embodiment of the present invention;
FIG. 4 is a flowchart illustrating the sub-steps of step S4 according to the preferred embodiment of the present invention;
FIG. 5 is a schematic diagram of a human voice separating system according to a preferred embodiment of the present invention;
FIG. 6 is a schematic diagram of the internal structure of the feature extraction module according to the preferred embodiment of the invention;
FIG. 7 is a schematic diagram of the internal structure of a neural network module according to a preferred embodiment of the present invention;
FIG. 8 is a schematic diagram of an internal structure of a feature reduction module according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive efforts based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
A human voice separation method, wherein pre-mixed speech data are used as training data to train and generate a recurrent neural network model, the mixed speech data comprising multiple paths of audio data, the audio data comprising at least one path of human voice data and at least one path of noise data, and the recurrent neural network model being used to respectively identify the human voice data and the noise data. As shown in FIG. 1, the method further comprises the following steps:
step S1, obtaining externally input original audio data to be separated, wherein at least one path of human voice data and at least one path of noise data are mixed in the original audio data;
step S2, performing feature extraction on the original audio data with a feature extraction model to obtain feature coefficients, the feature coefficients being 22-dimensional BFCC coefficients;
step S3, importing the feature coefficients into the recurrent neural network model for processing to obtain a plurality of processing results in one-to-one correspondence with the human voice data and each path of the noise data;
step S4, performing feature restoration on each processing result with a feature restoration model to obtain the separated human voice data and each path of noise data.
As a preferred embodiment, when preparing training data for the recurrent neural network model, the training data are divided into human voice data and noise data in a unified format of an 8000 Hz sampling rate and 16-bit sampling precision; both the human voice data and the noise data may comprise one or several paths. During machine learning and model training, the input data are obtained by mixing at least one path of human voice data with at least one path of noise data, the ground-truth label of each path is the corresponding pure data before mixing, and the loss function is the sum of the per-path differences. This procedure fixes the number of sound sources to be separated: if the human voice and one noise source are to be separated, one path of noise data is mixed into the human voice data when forming the input, and so on.
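By way of illustration only, the following Python sketch assembles one such training pair under the stated format (8000 Hz, 16-bit); the soundfile library, the file names, and the three-second clip length are assumptions introduced for the example and are not specified in the description.

```python
# Illustrative sketch (not part of the patented method): build one training pair
# by mixing one path of clean human voice with one path of noise at 8000 Hz / 16-bit.
import numpy as np
import soundfile as sf

def load_8k_16bit(path, length):
    """Load a mono track, assumed to already be 8000 Hz / 16-bit PCM, then trim or zero-pad."""
    audio, sr = sf.read(path, dtype="int16")
    assert sr == 8000, "training data are unified to an 8000 Hz sampling rate"
    audio = audio.astype(np.float32) / 32768.0        # scale 16-bit samples to [-1, 1)
    out = np.zeros(length, dtype=np.float32)
    n = min(length, len(audio))
    out[:n] = audio[:n]
    return out

clip_len = 8000 * 3                                   # assumed 3-second training clips
voice = load_8k_16bit("clean_voice.wav", clip_len)    # one path of pure human voice data
noise = load_8k_16bit("noise_source.wav", clip_len)   # one path of the noise to be separated

mixture = voice + noise                               # network input (features are extracted from this)
targets = np.stack([voice, noise])                    # per-path labels: the pure data before mixing
# Training minimises the sum, over paths, of the differences between each
# network output and its corresponding pure-path target.
```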
In a preferred embodiment of the present invention, the input data in the training data are the feature coefficients of the mixed speech data, and the expected output data in the training data are the clean human voice data and the clean noise data before mixing.
In a preferred embodiment of the present invention, as shown in FIG. 2, step S2 further comprises:
step S21, dividing the original audio data into a plurality of short-time audio frames using overlapping windows;
step S22, performing a Fourier transform and Bark frequency conversion on each short-time audio frame to obtain the feature coefficients.
In a preferred embodiment of the present invention, the recurrent neural network model comprises a first gated recurrent unit 30, a plurality of second gated recurrent units 32, and a plurality of fully connected units 33, the second gated recurrent units 32 and the fully connected units 33 being in one-to-one correspondence, and each second gated recurrent unit 32 uniquely corresponding to one path of audio data;
the input end of the first gated recurrent unit 30 serves as the input end of the recurrent neural network model;
the input end of each second gated recurrent unit 32 is connected to the output end of the first gated recurrent unit 30, the output end of each second gated recurrent unit 32 is connected to the input end of the corresponding fully connected unit 33, and the output end of each fully connected unit 33 serves as an output end of the recurrent neural network model.
As shown in FIG. 3, step S3 further comprises:
step S31, the first gated recurrent unit 30 processing the feature coefficients of the input original audio data to obtain 22-dimensional first feature data;
step S32, splicing the first feature data output by the first gated recurrent unit 30 with the feature coefficients to obtain second feature data, and then inputting the second feature data into each of the different second gated recurrent units 32 for processing;
step S33, each second gated recurrent unit 32 processing the second feature data and outputting 44-dimensional third feature data to the corresponding fully connected unit 33;
step S34, each fully connected unit 33 obtaining and outputting a corresponding processing result from the third feature data, after which the method proceeds to step S4, all processing results being 22-dimensional.
Specifically, in this embodiment, feature extraction is performed on the original audio data to be separated: 20 ms short-time audio frames are obtained with overlapping windows, each frame is then subjected to a Fourier transform and Bark frequency conversion to obtain a 22-dimensional feature, and the 22-dimensional features are imported into the recurrent neural network for processing, yielding n 22-dimensional results, where n is the number of audio paths to be separated (for example, n is 2 when the human voice and one noise source are to be separated).
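The feature-extraction step can be sketched as follows. The sketch assumes 20 ms frames (160 samples at 8000 Hz), a 10 ms hop between overlapping windows, a Hann window, and an approximate 22-band Bark layout; the patent does not fix the window shape, hop, or band edges, and the log Bark-band energies computed here stand in for the 22-dimensional BFCC coefficients (a full BFCC would additionally apply a DCT across the bands).

```python
# Illustrative sketch of overlapping windows -> FFT -> 22 Bark-band features per frame.
import numpy as np

SR = 8000
FRAME = 160            # 20 ms at 8000 Hz
HOP = 80               # 10 ms hop, i.e. 50 % overlap between successive windows (assumption)
N_BANDS = 22

def bark_band_edges(n_bands=N_BANDS, sr=SR, n_fft=FRAME):
    """Approximate Bark-scale band edges mapped to rFFT bin indices."""
    hz = np.linspace(0, sr / 2, 1000)
    bark = 13 * np.arctan(0.00076 * hz) + 3.5 * np.arctan((hz / 7500.0) ** 2)
    edges_hz = np.interp(np.linspace(0, bark[-1], n_bands + 1), bark, hz)
    return np.round(edges_hz / (sr / 2) * (n_fft // 2)).astype(int)

def extract_features(audio, edges=None):
    """Return an (n_frames, 22) array of log Bark-band energies for one audio path."""
    edges = bark_band_edges() if edges is None else edges
    window = np.hanning(FRAME)
    feats = []
    for start in range(0, len(audio) - FRAME + 1, HOP):
        spec = np.fft.rfft(audio[start:start + FRAME] * window)       # Fourier transform
        power = np.abs(spec) ** 2
        bands = np.array([power[edges[b]:max(edges[b + 1], edges[b] + 1)].sum()
                          for b in range(N_BANDS)])                   # Bark frequency conversion
        feats.append(np.log(bands + 1e-9))                            # 22-dimensional frame feature
    return np.stack(feats)
```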
In a preferred embodiment of the present invention, as shown in FIG. 4, step S4 further comprises:
step S41, performing an inverse Fourier transform on each processing result to obtain a corresponding intermediate result;
step S42, restoring each intermediate result by overlapping windowing, thereby separately recovering each path of audio data.
Specifically, in this embodiment, the n 22-dimensional features obtained above are each restored by the feature restoration model: an inverse Fourier transform (IFFT) is applied first, and the result is then restored to 20 ms audio frames by overlapping windowing.
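The restoration step can be sketched as below, reusing the frame length, hop, and band edges assumed in the extraction sketch. Because the description does not spell out how a 22-dimensional result is mapped back to a full spectrum, the sketch treats each result as per-Bark-band values that are spread over the FFT bins of the corresponding mixture frame before the inverse transform; that mapping is an assumption of the example.

```python
# Illustrative sketch of the restoration step: per-frame inverse FFT followed by
# overlap-add with the same 20 ms window and 10 ms hop assumed for extraction.
import numpy as np

def restore_audio(band_values, mixture_spectra, edges, frame=160, hop=80):
    """band_values: (n_frames, 22) results for one path, treated as per-band gains;
    mixture_spectra: (n_frames, frame // 2 + 1) complex rFFT frames of the mixture."""
    n_frames = len(band_values)
    n_samples = (n_frames - 1) * hop + frame
    window = np.hanning(frame)
    audio = np.zeros(n_samples)
    norm = np.zeros(n_samples)
    for f in range(n_frames):
        # Spread the 22 per-band values across the FFT bins of each Bark band.
        gain_per_bin = np.ones(mixture_spectra.shape[1])
        for b in range(band_values.shape[1]):
            gain_per_bin[edges[b]:max(edges[b + 1], edges[b] + 1)] = band_values[f, b]
        frame_audio = np.fft.irfft(mixture_spectra[f] * gain_per_bin, n=frame)  # inverse Fourier transform
        start = f * hop
        audio[start:start + frame] += frame_audio * window                      # overlapping windowing
        norm[start:start + frame] += window ** 2
    return audio / np.maximum(norm, 1e-8)                                       # one restored audio path
```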
A human voice separation system, as shown in FIG. 5, wherein pre-mixed speech data are used as training data to train and generate a recurrent neural network model, the mixed speech data comprising multiple paths of audio data, the audio data comprising at least one path of human voice data and at least one path of noise data, and the recurrent neural network model being used to respectively identify the human voice data and the noise data; the human voice separation system specifically comprises:
an acquisition module 1 for acquiring the original audio data to be separated, wherein at least one path of human voice data and at least one path of noise data are mixed in the original audio data;
a feature extraction module 2, connected to the acquisition module 1, for performing feature extraction on the original audio data to obtain feature coefficients, the feature coefficients being 22-dimensional BFCC coefficients;
a neural network module 3, connected to the feature extraction module 2, in which the recurrent neural network model is preset, the recurrent neural network model being used to process the feature coefficients to obtain a plurality of processing results in one-to-one correspondence with the human voice data and each path of the noise data;
a feature restoration module 4, connected to the neural network module 3, for performing feature restoration on each processing result to obtain the separated human voice data and each path of noise data.
In a preferred embodiment of the present invention, the input data in the training data are the feature coefficients of the mixed speech data, and the expected output data in the training data are the clean human voice data and the clean noise data before mixing.
In a preferred embodiment of the present invention, as shown in FIG. 6, the feature extraction module 2 further comprises:
a segmentation unit 20 for dividing the original audio data into a plurality of short-time audio frames using overlapping windows;
a first processing unit 21, connected to the segmentation unit 20, for performing a Fourier transform and Bark frequency conversion on each short-time audio frame to obtain the feature coefficients.
In a preferred embodiment of the present invention, as shown in FIG. 7, the neural network module 3 further comprises:
a first gated recurrent unit 30 for computing over the feature coefficients to obtain 22-dimensional first feature data;
a splicing unit 31, connected to the first gated recurrent unit 30, for splicing the first feature data with the feature coefficients to obtain second feature data;
a plurality of second gated recurrent units 32, each connected to the splicing unit 31, each uniquely corresponding to one path of audio data and processing the second feature data, each second gated recurrent unit 32 outputting 44-dimensional third feature data;
a plurality of fully connected units 33, connected to the second gated recurrent units 32 in one-to-one correspondence, for processing the third feature data to obtain and output corresponding processing results, each processing result being 22-dimensional.
Specifically, in this embodiment, the 22-dimensional feature coefficients are fed into the first gated recurrent unit 30 (GRU) for processing; the output dimension remains 22 and the activation function is ReLU. Then, according to the number n of audio paths to be separated, the 22-dimensional first feature data output by the first gated recurrent unit 30 are spliced (concatenated) with the initially input 22-dimensional feature coefficients and fed into n different second gated recurrent units 32 for the second processing step; the output dimension of each second gated recurrent unit 32 is 44 and the activation function is again ReLU. Finally, the n paths of 44-dimensional third feature data output by the second gated recurrent units 32 are fed into the n corresponding fully connected units 33, each with an output dimension of 22.
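The following PyTorch sketch reproduces this structure (a shared 22-unit GRU with ReLU, concatenation with the 22-dimensional input to give 44 dimensions, then one 44-unit GRU and one 22-dimensional fully connected layer per source); batching, initialisation, and the absence of an output activation on the dense layers are assumptions of the example, not details given in the description.

```python
# Illustrative PyTorch sketch of the recurrent neural network model described above.
import torch
import torch.nn as nn

class VoiceSeparationRNN(nn.Module):
    def __init__(self, n_sources=2, n_feats=22):
        super().__init__()
        self.first_gru = nn.GRU(n_feats, n_feats, batch_first=True)            # 22 -> 22
        self.second_grus = nn.ModuleList(
            [nn.GRU(2 * n_feats, 2 * n_feats, batch_first=True)                # 44 -> 44
             for _ in range(n_sources)])
        self.dense = nn.ModuleList(
            [nn.Linear(2 * n_feats, n_feats) for _ in range(n_sources)])       # 44 -> 22

    def forward(self, feats):                     # feats: (batch, frames, 22)
        h1, _ = self.first_gru(feats)
        h1 = torch.relu(h1)                       # 22-dimensional first feature data, ReLU activation
        h2 = torch.cat([h1, feats], dim=-1)       # feature splicing -> 44-dimensional second feature data
        outputs = []
        for gru, fc in zip(self.second_grus, self.dense):
            h3, _ = gru(h2)
            h3 = torch.relu(h3)                   # 44-dimensional third feature data per source
            outputs.append(fc(h3))                # 22-dimensional processing result per source
        return torch.stack(outputs, dim=1)        # (batch, n_sources, frames, 22)

# Example: separating the human voice and one noise source (n = 2).
model = VoiceSeparationRNN(n_sources=2)
print(model(torch.randn(1, 100, 22)).shape)       # torch.Size([1, 2, 100, 22])
```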
In a preferred embodiment of the present invention, as shown in FIG. 8, the feature restoration module 4 further comprises:
a second processing unit 40 for performing an inverse Fourier transform on each processing result to obtain a corresponding intermediate result;
a restoration unit 41, connected to the second processing unit 40, for restoring each intermediate result by overlapping windowing, thereby separately recovering each path of audio data.
The beneficial effects of the above technical solution are:
the method is based on artificial-intelligence deep learning, does not rely on any assumption, and has strong anti-interference capability; human voice separation can be achieved simply by preparing, in advance, several pieces of clean human voice data and of the noise data to be separated as training data and training a recurrent neural network model on them; and the method can separate not only the human voice but also the noise sources mixed into it.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made without departing from the spirit and scope of the invention.

Claims (8)

1. A human voice separation method, characterized in that pre-mixed speech data are used as training data to train and generate a recurrent neural network model, the mixed speech data comprising multiple paths of audio data, the audio data comprising at least one path of human voice data and at least one path of noise data, and the recurrent neural network model being used to respectively identify the human voice data and the noise data, the method further comprising the following steps:
step S1, obtaining externally input original audio data to be separated, wherein at least one path of human voice data and at least one path of noise data are mixed in the original audio data;
step S2, performing feature extraction on the original audio data with a feature extraction model to obtain feature coefficients, the feature coefficients being 22-dimensional BFCC coefficients;
step S3, importing the feature coefficients into the recurrent neural network model for processing to obtain a plurality of processing results in one-to-one correspondence with the human voice data and each path of the noise data;
step S4, performing feature restoration on each processing result with a feature restoration model to obtain the separated human voice data and each path of noise data;
wherein the recurrent neural network model comprises a first gated recurrent unit, a plurality of second gated recurrent units, and a plurality of fully connected units, the second gated recurrent units and the fully connected units being in one-to-one correspondence, and each second gated recurrent unit uniquely corresponding to one path of audio data;
the input end of the first gated recurrent unit serves as the input end of the recurrent neural network model;
the input end of each second gated recurrent unit is connected to the output end of the first gated recurrent unit, the output end of each second gated recurrent unit is connected to the input end of the corresponding fully connected unit, and the output end of each fully connected unit serves as an output end of the recurrent neural network model;
step S3 further comprises:
step S31, the first gated recurrent unit processing the feature coefficients of the input original audio data to obtain 22-dimensional first feature data;
step S32, splicing the first feature data output by the first gated recurrent unit with the feature coefficients to obtain second feature data, and then inputting the second feature data into each of the different second gated recurrent units for processing;
step S33, each second gated recurrent unit processing the second feature data and outputting 44-dimensional third feature data to the corresponding fully connected unit;
step S34, each fully connected unit obtaining and outputting a corresponding processing result from the third feature data, after which the method proceeds to step S4, all processing results being 22-dimensional.
2. The human voice separation method according to claim 1, wherein the input data in the training data are the feature coefficients of the mixed speech data, and the expected output data in the training data are the clean human voice data and the clean noise data before mixing.
3. The human voice separation method according to claim 1, wherein step S2 further comprises:
step S21, dividing the original audio data into a plurality of short-time audio frames using overlapping windows;
step S22, performing a Fourier transform and Bark frequency conversion on each short-time audio frame to obtain the feature coefficients.
4. The human voice separation method according to claim 1, wherein step S4 further comprises:
step S41, performing an inverse Fourier transform on each processing result to obtain a corresponding intermediate result;
step S42, restoring each intermediate result by overlapping windowing, thereby separately recovering each path of audio data.
5. A human voice separation system, characterized in that pre-mixed speech data are used as training data to train and generate a recurrent neural network model, the mixed speech data comprising multiple paths of audio data, the audio data comprising at least one path of human voice data and at least one path of noise data, and the recurrent neural network model being used to respectively identify the human voice data and the noise data, the human voice separation system specifically comprising:
an acquisition module for acquiring the original audio data to be separated, wherein at least one path of human voice data and at least one path of noise data are mixed in the original audio data;
a feature extraction module, connected to the acquisition module, for performing feature extraction on the original audio data to obtain feature coefficients, the feature coefficients being 22-dimensional BFCC coefficients;
a neural network module, connected to the feature extraction module, in which the recurrent neural network model is preset, the recurrent neural network model being used to process the feature coefficients to obtain a plurality of processing results in one-to-one correspondence with the human voice data and each path of the noise data;
a feature restoration module, connected to the neural network module, for performing feature restoration on each processing result to obtain the separated human voice data and each path of noise data;
wherein the neural network module further comprises:
a first gated recurrent unit for computing over the feature coefficients to obtain 22-dimensional first feature data;
a splicing unit, connected to the first gated recurrent unit, for splicing the first feature data with the feature coefficients to obtain second feature data;
a plurality of second gated recurrent units, each connected to the splicing unit, each uniquely corresponding to one path of audio data and processing the second feature data, each second gated recurrent unit outputting 44-dimensional third feature data;
a plurality of fully connected units, connected to the second gated recurrent units in one-to-one correspondence, for processing the third feature data to obtain and output corresponding processing results, each processing result being 22-dimensional.
6. The human voice separation system according to claim 5, wherein the input data in the training data are the feature coefficients of the mixed speech data, and the expected output data in the training data are the clean human voice data and the clean noise data before mixing.
7. The human voice separation system according to claim 5, wherein the feature extraction module further comprises:
a segmentation unit for dividing the original audio data into a plurality of short-time audio frames using overlapping windows;
a first processing unit, connected to the segmentation unit, for performing a Fourier transform and Bark frequency conversion on each short-time audio frame to obtain the feature coefficients.
8. The human voice separation system according to claim 5, wherein the feature restoration module further comprises:
a second processing unit for performing an inverse Fourier transform on each processing result to obtain a corresponding intermediate result;
a restoration unit, connected to the second processing unit, for restoring each intermediate result by overlapping windowing, thereby separately recovering each path of audio data.
CN201911360803.7A 2019-12-25 2019-12-25 Human voice separation method and system Active CN110992966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911360803.7A CN110992966B (en) 2019-12-25 2019-12-25 Human voice separation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911360803.7A CN110992966B (en) 2019-12-25 2019-12-25 Human voice separation method and system

Publications (2)

Publication Number Publication Date
CN110992966A (en) 2020-04-10
CN110992966B (en) 2022-07-01

Family

ID=70076980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911360803.7A Active CN110992966B (en) 2019-12-25 2019-12-25 Human voice separation method and system

Country Status (1)

Country Link
CN (1) CN110992966B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185357A (en) * 2020-12-02 2021-01-05 成都启英泰伦科技有限公司 Device and method for simultaneously recognizing human voice and non-human voice

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103941715B (en) * 2014-05-13 2016-02-17 郝瀚 One key control system of home electric and method
US20160189730A1 (en) * 2014-12-30 2016-06-30 Iflytek Co., Ltd. Speech separation method and system
US9818431B2 (en) * 2015-12-21 2017-11-14 Microsoft Technoloogy Licensing, LLC Multi-speaker speech separation
US11373672B2 (en) * 2016-06-14 2022-06-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
US20190206417A1 (en) * 2017-12-28 2019-07-04 Knowles Electronics, Llc Content-based audio stream separation
CN109801644B (en) * 2018-12-20 2021-03-09 北京达佳互联信息技术有限公司 Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN110459238B (en) * 2019-04-12 2020-11-20 腾讯科技(深圳)有限公司 Voice separation method, voice recognition method and related equipment

Also Published As

Publication number Publication date
CN110992966A (en) 2020-04-10

Similar Documents

Publication Publication Date Title
CN110600018B (en) Voice recognition method and device and neural network training method and device
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
Grais et al. Raw multi-channel audio source separation using multi-resolution convolutional auto-encoders
CN108899047B (en) The masking threshold estimation method, apparatus and storage medium of audio signal
CN112037809A (en) Residual echo suppression method based on multi-feature flow structure deep neural network
CN110544482A (en) single-channel voice separation system
US20200380943A1 (en) Apparatuses and methods for creating noise environment noisy data and eliminating noise
CN110992966B (en) Human voice separation method and system
CN108962276B (en) Voice separation method and device
Jin et al. Speech separation and emotion recognition for multi-speaker scenarios
CN113035225B (en) Visual voiceprint assisted voice separation method and device
Chen et al. On Synthesis for Supervised Monaural Speech Separation in Time Domain.
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
CN113345465B (en) Voice separation method, device, equipment and computer readable storage medium
CN114189781A (en) Noise reduction method and system for double-microphone neural network noise reduction earphone
CN114283832A (en) Processing method and device for multi-channel audio signal
CN111028857A (en) Method and system for reducing noise of multi-channel audio and video conference based on deep learning
WO2020250220A1 (en) Sound analysis for determination of sound sources and sound isolation
JP2003271168A (en) Method, device and program for extracting signal, and recording medium recorded with the program
CN114495974B (en) Audio signal processing method
CN111833897B (en) Voice enhancement method for interactive education
CN117238277B (en) Intention recognition method, device, storage medium and computer equipment
CN111009259B (en) Audio processing method and device
Bao et al. Lightweight Dual-channel Target Speaker Separation for Mobile Voice Communication
Zhang et al. Multiple Sound Sources Separation Using Two-stage Network Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant