CN110992966B - Human voice separation method and system - Google Patents

Human voice separation method and system

Info

Publication number
CN110992966B
CN110992966B (application CN201911360803.7A)
Authority
CN
China
Prior art keywords
data
characteristic
voice
neural network
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911360803.7A
Other languages
Chinese (zh)
Other versions
CN110992966A (en)
Inventor
黄明飞
姚宏贵
郝瀚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Open Intelligent Machine Shanghai Co ltd
Original Assignee
Open Intelligent Machine Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Open Intelligent Machine Shanghai Co ltd filed Critical Open Intelligent Machine Shanghai Co ltd
Priority to CN201911360803.7A priority Critical patent/CN110992966B/en
Publication of CN110992966A publication Critical patent/CN110992966A/en
Application granted granted Critical
Publication of CN110992966B publication Critical patent/CN110992966B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 Speech or audio signals analysis-synthesis techniques using spectral analysis, using orthogonal transformation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a human voice separation method and system, belonging to the technical field of separating voice from noise. The method comprises: step S1, obtaining externally input original audio data to be separated; step S2, extracting features from the original audio data with a feature extraction model; step S3, importing the feature coefficients into a recurrent neural network model for processing; and step S4, performing feature restoration on each processing result with a feature restoration model. The system comprises an acquisition module, a feature extraction module, a neural network module and a feature restoration module. The beneficial effects are that the method does not rely on any assumption and has strong anti-interference capability; human voice separation can be achieved simply by preparing, in advance, several pieces of clean human voice data and of the noise data to be separated as training data and training a recurrent neural network model on them; and the method can separate not only the human voice but also the noise sources mixed into it.

Description

Human voice separation method and system
Technical Field
The invention relates to the technical field of separating human voice from noise, and in particular to a human voice separation method and system.
Background
Human voice separation refers to processing a mixed speech signal so as to separate the voice of a target speaker from a complex noise environment. Traditional human voice separation mainly relies on classical algorithms, for example least mean squares (LMS) and least squares (LS). These algorithms depend on many assumptions, such as the source signals being mutually independent, and therefore have significant limitations. Because real application scenarios are complex, these assumptions are rarely satisfied simultaneously, so a traditional human voice separation algorithm may only be effective in a specific application scenario and generally has poor anti-interference capability.
Disclosure of Invention
In view of the above problems in the prior art, a human voice separation method and system are provided. The method is based on artificial-intelligence deep learning and does not rely on any assumption: human voice separation can be achieved simply by preparing, in advance, several pieces of clean human voice data and of the noise data to be separated as training data, and training a recurrent neural network model on them.
The technical solution specifically comprises the following:
A human voice separation method, wherein pre-mixed speech data are used as training data to train and generate a recurrent neural network model, the mixed speech data comprising multiple paths of audio data, the audio data comprising at least one path of human voice data and at least one path of noise data, and the recurrent neural network model being used to respectively identify the human voice data and the noise data; the method further comprises the following steps:
step S1, obtaining externally input original audio data to be separated, wherein at least one path of human voice data and at least one path of noise data are mixed in the original audio data;
step S2, performing feature extraction on the original audio data with a feature extraction model to obtain feature coefficients, the feature coefficients being 22-dimensional BFCC coefficients;
step S3, importing the feature coefficients into the recurrent neural network model for processing to obtain a plurality of processing results in one-to-one correspondence with the human voice data and each path of the noise data;
step S4, performing feature restoration on each processing result with a feature restoration model to obtain the separated human voice data and each path of noise data.
Preferably, the input data in the training data are the feature coefficients of the mixed speech data, and the expected output data in the training data are the clean human voice data and the clean noise data before mixing.
Preferably, step S2 further comprises:
step S21, dividing the original audio data into a plurality of short-time audio frames using overlapping windows;
step S22, performing a Fourier transform and Bark frequency conversion on each short-time audio frame to obtain the feature coefficients.
Preferably, the recurrent neural network model comprises a first gated recurrent unit, a plurality of second gated recurrent units, and a plurality of fully connected units, the second gated recurrent units and the fully connected units being in one-to-one correspondence, and each second gated recurrent unit uniquely corresponding to one path of audio data;
the input end of the first gated recurrent unit serves as the input end of the recurrent neural network model;
the input end of each second gated recurrent unit is connected to the output end of the first gated recurrent unit, the output end of each second gated recurrent unit is connected to the input end of the corresponding fully connected unit, and the output end of each fully connected unit serves as an output end of the recurrent neural network model;
step S3 further comprises:
step S31, the first gated recurrent unit processing the feature coefficients of the input original audio data to obtain 22-dimensional first feature data;
step S32, splicing the first feature data output by the first gated recurrent unit with the feature coefficients to obtain second feature data, and then inputting the second feature data into each of the different second gated recurrent units for processing;
step S33, each second gated recurrent unit processing the second feature data and outputting 44-dimensional third feature data to the corresponding fully connected unit;
step S34, each fully connected unit obtaining and outputting a corresponding processing result from the third feature data, after which the method proceeds to step S4, all processing results being 22-dimensional.
Preferably, step S4 further comprises:
step S41, performing an inverse Fourier transform on each processing result to obtain a corresponding intermediate result;
step S42, restoring each intermediate result by overlapping windowing, thereby separately recovering each path of audio data.
A human voice separation system, wherein pre-mixed speech data are used as training data to train and generate a recurrent neural network model, the mixed speech data comprising multiple paths of audio data, the audio data comprising at least one path of human voice data and at least one path of noise data, and the recurrent neural network model being used to respectively identify the human voice data and the noise data; the human voice separation system specifically comprises:
an acquisition module for acquiring the original audio data to be separated, wherein at least one path of human voice data and at least one path of noise data are mixed in the original audio data;
a feature extraction module, connected to the acquisition module, for performing feature extraction on the original audio data to obtain feature coefficients, the feature coefficients being 22-dimensional BFCC coefficients;
a neural network module, connected to the feature extraction module, in which the recurrent neural network model is preset, the recurrent neural network model being used to process the feature coefficients to obtain a plurality of processing results in one-to-one correspondence with the human voice data and each path of the noise data;
a feature restoration module, connected to the neural network module, for performing feature restoration on each processing result to obtain the separated human voice data and each path of noise data.
Preferably, the input data in the training data are the feature coefficients of the mixed speech data, and the expected output data in the training data are the clean human voice data and the clean noise data before mixing.
Preferably, the feature extraction module further comprises:
a segmentation unit for dividing the original audio data into a plurality of short-time audio frames using overlapping windows;
a first processing unit, connected to the segmentation unit, for performing a Fourier transform and Bark frequency conversion on each short-time audio frame to obtain the feature coefficients.
Preferably, the neural network module further comprises:
a first gated recurrent unit for computing over the feature coefficients to obtain 22-dimensional first feature data;
a splicing unit, connected to the first gated recurrent unit, for splicing the first feature data with the feature coefficients to obtain second feature data;
a plurality of second gated recurrent units, each connected to the splicing unit, each uniquely corresponding to one path of audio data and processing the second feature data, each second gated recurrent unit outputting 44-dimensional third feature data;
a plurality of fully connected units, connected to the second gated recurrent units in one-to-one correspondence, for processing the third feature data to obtain and output corresponding processing results, each processing result being 22-dimensional.
Preferably, the feature restoration module further comprises:
a second processing unit for performing an inverse Fourier transform on each processing result to obtain a corresponding intermediate result;
a restoration unit, connected to the second processing unit, for restoring each intermediate result by overlapping windowing, thereby separately recovering each path of audio data.
The beneficial effects of the above technical solution are:
the method is based on artificial-intelligence deep learning, does not rely on any assumption, and has strong anti-interference capability; human voice separation can be achieved simply by preparing, in advance, several pieces of clean human voice data and of the noise data to be separated as training data and training a recurrent neural network model on them; and the method can separate not only the human voice but also the noise sources mixed into it.
Drawings
FIG. 1 is a flow chart of the steps of a method for separating human voice according to the preferred embodiment of the present invention;
FIG. 2 is a flowchart illustrating the sub-steps of step S2 according to the preferred embodiment of the present invention;
FIG. 3 is a flowchart illustrating the sub-steps of step S3 according to the preferred embodiment of the present invention;
FIG. 4 is a flowchart illustrating the sub-steps of step S4 according to the preferred embodiment of the present invention;
FIG. 5 is a schematic diagram of a human voice separating system according to a preferred embodiment of the present invention;
FIG. 6 is a schematic diagram of the internal structure of the feature extraction module according to the preferred embodiment of the invention;
FIG. 7 is a schematic diagram of the internal structure of a neural network module according to a preferred embodiment of the present invention;
FIG. 8 is a schematic diagram of an internal structure of a feature reduction module according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive efforts based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
A human voice separation method, wherein pre-mixed speech data are used as training data to train and generate a recurrent neural network model, the mixed speech data comprising multiple paths of audio data, the audio data comprising at least one path of human voice data and at least one path of noise data, and the recurrent neural network model being used to respectively identify the human voice data and the noise data. As shown in FIG. 1, the method further comprises the following steps:
step S1, obtaining externally input original audio data to be separated, wherein at least one path of human voice data and at least one path of noise data are mixed in the original audio data;
step S2, performing feature extraction on the original audio data with a feature extraction model to obtain feature coefficients, the feature coefficients being 22-dimensional BFCC coefficients;
step S3, importing the feature coefficients into the recurrent neural network model for processing to obtain a plurality of processing results in one-to-one correspondence with the human voice data and each path of the noise data;
step S4, performing feature restoration on each processing result with a feature restoration model to obtain the separated human voice data and each path of noise data.
As a preferred embodiment, when preparing training data for the recurrent neural network model, the training data are divided into human voice data and noise data in a unified format of an 8000 Hz sampling rate and 16-bit sampling precision; both the human voice data and the noise data may comprise one or several paths. During machine learning and model training, the input data are obtained by mixing at least one path of human voice data with at least one path of noise data, the ground-truth label of each path is the corresponding pure data before mixing, and the loss function is the sum of the per-path differences. This procedure fixes the number of sound sources to be separated: if the human voice and one noise source are to be separated, one path of noise data is mixed into the human voice data when forming the input, and so on.
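By way of illustration only, the following Python sketch assembles one such training pair under the stated format (8000 Hz, 16-bit); the soundfile library, the file names, and the three-second clip length are assumptions introduced for the example and are not specified in the description.

```python
# Illustrative sketch (not part of the patented method): build one training pair
# by mixing one path of clean human voice with one path of noise at 8000 Hz / 16-bit.
import numpy as np
import soundfile as sf

def load_8k_16bit(path, length):
    """Load a mono track, assumed to already be 8000 Hz / 16-bit PCM, then trim or zero-pad."""
    audio, sr = sf.read(path, dtype="int16")
    assert sr == 8000, "training data are unified to an 8000 Hz sampling rate"
    audio = audio.astype(np.float32) / 32768.0        # scale 16-bit samples to [-1, 1)
    out = np.zeros(length, dtype=np.float32)
    n = min(length, len(audio))
    out[:n] = audio[:n]
    return out

clip_len = 8000 * 3                                   # assumed 3-second training clips
voice = load_8k_16bit("clean_voice.wav", clip_len)    # one path of pure human voice data
noise = load_8k_16bit("noise_source.wav", clip_len)   # one path of the noise to be separated

mixture = voice + noise                               # network input (features are extracted from this)
targets = np.stack([voice, noise])                    # per-path labels: the pure data before mixing
# Training minimises the sum, over paths, of the differences between each
# network output and its corresponding pure-path target.
```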
In a preferred embodiment of the present invention, the input data in the training data are the feature coefficients of the mixed speech data, and the expected output data in the training data are the clean human voice data and the clean noise data before mixing.
In a preferred embodiment of the present invention, as shown in FIG. 2, step S2 further comprises:
step S21, dividing the original audio data into a plurality of short-time audio frames using overlapping windows;
step S22, performing a Fourier transform and Bark frequency conversion on each short-time audio frame to obtain the feature coefficients.
In a preferred embodiment of the present invention, the recurrent neural network model comprises a first gated recurrent unit 30, a plurality of second gated recurrent units 32, and a plurality of fully connected units 33, the second gated recurrent units 32 and the fully connected units 33 being in one-to-one correspondence, and each second gated recurrent unit 32 uniquely corresponding to one path of audio data;
the input end of the first gated recurrent unit 30 serves as the input end of the recurrent neural network model;
the input end of each second gated recurrent unit 32 is connected to the output end of the first gated recurrent unit 30, the output end of each second gated recurrent unit 32 is connected to the input end of the corresponding fully connected unit 33, and the output end of each fully connected unit 33 serves as an output end of the recurrent neural network model.
As shown in FIG. 3, step S3 further comprises:
step S31, the first gated recurrent unit 30 processing the feature coefficients of the input original audio data to obtain 22-dimensional first feature data;
step S32, splicing the first feature data output by the first gated recurrent unit 30 with the feature coefficients to obtain second feature data, and then inputting the second feature data into each of the different second gated recurrent units 32 for processing;
step S33, each second gated recurrent unit 32 processing the second feature data and outputting 44-dimensional third feature data to the corresponding fully connected unit 33;
step S34, each fully connected unit 33 obtaining and outputting a corresponding processing result from the third feature data, after which the method proceeds to step S4, all processing results being 22-dimensional.
Specifically, in this embodiment, feature extraction is performed on the original audio data to be separated: 20 ms short-time audio frames are obtained with overlapping windows, each frame is then subjected to a Fourier transform and Bark frequency conversion to obtain a 22-dimensional feature, and the 22-dimensional features are imported into the recurrent neural network for processing, yielding n 22-dimensional results, where n is the number of audio paths to be separated (for example, n is 2 when the human voice and one noise source are to be separated).
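The feature-extraction step can be sketched as follows. The sketch assumes 20 ms frames (160 samples at 8000 Hz), a 10 ms hop between overlapping windows, a Hann window, and an approximate 22-band Bark layout; the patent does not fix the window shape, hop, or band edges, and the log Bark-band energies computed here stand in for the 22-dimensional BFCC coefficients (a full BFCC would additionally apply a DCT across the bands).

```python
# Illustrative sketch of overlapping windows -> FFT -> 22 Bark-band features per frame.
import numpy as np

SR = 8000
FRAME = 160            # 20 ms at 8000 Hz
HOP = 80               # 10 ms hop, i.e. 50 % overlap between successive windows (assumption)
N_BANDS = 22

def bark_band_edges(n_bands=N_BANDS, sr=SR, n_fft=FRAME):
    """Approximate Bark-scale band edges mapped to rFFT bin indices."""
    hz = np.linspace(0, sr / 2, 1000)
    bark = 13 * np.arctan(0.00076 * hz) + 3.5 * np.arctan((hz / 7500.0) ** 2)
    edges_hz = np.interp(np.linspace(0, bark[-1], n_bands + 1), bark, hz)
    return np.round(edges_hz / (sr / 2) * (n_fft // 2)).astype(int)

def extract_features(audio, edges=None):
    """Return an (n_frames, 22) array of log Bark-band energies for one audio path."""
    edges = bark_band_edges() if edges is None else edges
    window = np.hanning(FRAME)
    feats = []
    for start in range(0, len(audio) - FRAME + 1, HOP):
        spec = np.fft.rfft(audio[start:start + FRAME] * window)       # Fourier transform
        power = np.abs(spec) ** 2
        bands = np.array([power[edges[b]:max(edges[b + 1], edges[b] + 1)].sum()
                          for b in range(N_BANDS)])                   # Bark frequency conversion
        feats.append(np.log(bands + 1e-9))                            # 22-dimensional frame feature
    return np.stack(feats)
```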
In a preferred embodiment of the present invention, as shown in FIG. 4, step S4 further comprises:
step S41, performing an inverse Fourier transform on each processing result to obtain a corresponding intermediate result;
step S42, restoring each intermediate result by overlapping windowing, thereby separately recovering each path of audio data.
Specifically, in this embodiment, the n 22-dimensional features obtained above are each restored by the feature restoration model: an inverse Fourier transform (IFFT) is applied first, and the result is then restored to 20 ms audio frames by overlapping windowing.
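The restoration step can be sketched as below, reusing the frame length, hop, and band edges assumed in the extraction sketch. Because the description does not spell out how a 22-dimensional result is mapped back to a full spectrum, the sketch treats each result as per-Bark-band values that are spread over the FFT bins of the corresponding mixture frame before the inverse transform; that mapping is an assumption of the example.

```python
# Illustrative sketch of the restoration step: per-frame inverse FFT followed by
# overlap-add with the same 20 ms window and 10 ms hop assumed for extraction.
import numpy as np

def restore_audio(band_values, mixture_spectra, edges, frame=160, hop=80):
    """band_values: (n_frames, 22) results for one path, treated as per-band gains;
    mixture_spectra: (n_frames, frame // 2 + 1) complex rFFT frames of the mixture."""
    n_frames = len(band_values)
    n_samples = (n_frames - 1) * hop + frame
    window = np.hanning(frame)
    audio = np.zeros(n_samples)
    norm = np.zeros(n_samples)
    for f in range(n_frames):
        # Spread the 22 per-band values across the FFT bins of each Bark band.
        gain_per_bin = np.ones(mixture_spectra.shape[1])
        for b in range(band_values.shape[1]):
            gain_per_bin[edges[b]:max(edges[b + 1], edges[b] + 1)] = band_values[f, b]
        frame_audio = np.fft.irfft(mixture_spectra[f] * gain_per_bin, n=frame)  # inverse Fourier transform
        start = f * hop
        audio[start:start + frame] += frame_audio * window                      # overlapping windowing
        norm[start:start + frame] += window ** 2
    return audio / np.maximum(norm, 1e-8)                                       # one restored audio path
```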
A human voice separation system, as shown in FIG. 5, wherein pre-mixed speech data are used as training data to train and generate a recurrent neural network model, the mixed speech data comprising multiple paths of audio data, the audio data comprising at least one path of human voice data and at least one path of noise data, and the recurrent neural network model being used to respectively identify the human voice data and the noise data; the human voice separation system specifically comprises:
an acquisition module 1 for acquiring the original audio data to be separated, wherein at least one path of human voice data and at least one path of noise data are mixed in the original audio data;
a feature extraction module 2, connected to the acquisition module 1, for performing feature extraction on the original audio data to obtain feature coefficients, the feature coefficients being 22-dimensional BFCC coefficients;
a neural network module 3, connected to the feature extraction module 2, in which the recurrent neural network model is preset, the recurrent neural network model being used to process the feature coefficients to obtain a plurality of processing results in one-to-one correspondence with the human voice data and each path of the noise data;
a feature restoration module 4, connected to the neural network module 3, for performing feature restoration on each processing result to obtain the separated human voice data and each path of noise data.
In a preferred embodiment of the present invention, the input data in the training data are the feature coefficients of the mixed speech data, and the expected output data in the training data are the clean human voice data and the clean noise data before mixing.
In a preferred embodiment of the present invention, as shown in FIG. 6, the feature extraction module 2 further comprises:
a segmentation unit 20 for dividing the original audio data into a plurality of short-time audio frames using overlapping windows;
a first processing unit 21, connected to the segmentation unit 20, for performing a Fourier transform and Bark frequency conversion on each short-time audio frame to obtain the feature coefficients.
In a preferred embodiment of the present invention, as shown in FIG. 7, the neural network module 3 further comprises:
a first gated recurrent unit 30 for computing over the feature coefficients to obtain 22-dimensional first feature data;
a splicing unit 31, connected to the first gated recurrent unit 30, for splicing the first feature data with the feature coefficients to obtain second feature data;
a plurality of second gated recurrent units 32, each connected to the splicing unit 31, each uniquely corresponding to one path of audio data and processing the second feature data, each second gated recurrent unit 32 outputting 44-dimensional third feature data;
a plurality of fully connected units 33, connected to the second gated recurrent units 32 in one-to-one correspondence, for processing the third feature data to obtain and output corresponding processing results, each processing result being 22-dimensional.
Specifically, in this embodiment, the 22-dimensional feature coefficients are fed into the first gated recurrent unit 30 (GRU) for processing; the output dimension remains 22 and the activation function is ReLU. Then, according to the number n of audio paths to be separated, the 22-dimensional first feature data output by the first gated recurrent unit 30 are spliced (concatenated) with the initially input 22-dimensional feature coefficients and fed into n different second gated recurrent units 32 for the second processing step; the output dimension of each second gated recurrent unit 32 is 44 and the activation function is again ReLU. Finally, the n paths of 44-dimensional third feature data output by the second gated recurrent units 32 are fed into the n corresponding fully connected units 33, each with an output dimension of 22.
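The following PyTorch sketch reproduces this structure (a shared 22-unit GRU with ReLU, concatenation with the 22-dimensional input to give 44 dimensions, then one 44-unit GRU and one 22-dimensional fully connected layer per source); batching, initialisation, and the absence of an output activation on the dense layers are assumptions of the example, not details given in the description.

```python
# Illustrative PyTorch sketch of the recurrent neural network model described above.
import torch
import torch.nn as nn

class VoiceSeparationRNN(nn.Module):
    def __init__(self, n_sources=2, n_feats=22):
        super().__init__()
        self.first_gru = nn.GRU(n_feats, n_feats, batch_first=True)            # 22 -> 22
        self.second_grus = nn.ModuleList(
            [nn.GRU(2 * n_feats, 2 * n_feats, batch_first=True)                # 44 -> 44
             for _ in range(n_sources)])
        self.dense = nn.ModuleList(
            [nn.Linear(2 * n_feats, n_feats) for _ in range(n_sources)])       # 44 -> 22

    def forward(self, feats):                     # feats: (batch, frames, 22)
        h1, _ = self.first_gru(feats)
        h1 = torch.relu(h1)                       # 22-dimensional first feature data, ReLU activation
        h2 = torch.cat([h1, feats], dim=-1)       # feature splicing -> 44-dimensional second feature data
        outputs = []
        for gru, fc in zip(self.second_grus, self.dense):
            h3, _ = gru(h2)
            h3 = torch.relu(h3)                   # 44-dimensional third feature data per source
            outputs.append(fc(h3))                # 22-dimensional processing result per source
        return torch.stack(outputs, dim=1)        # (batch, n_sources, frames, 22)

# Example: separating the human voice and one noise source (n = 2).
model = VoiceSeparationRNN(n_sources=2)
print(model(torch.randn(1, 100, 22)).shape)       # torch.Size([1, 2, 100, 22])
```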
In a preferred embodiment of the present invention, as shown in FIG. 8, the feature restoration module 4 further comprises:
a second processing unit 40 for performing an inverse Fourier transform on each processing result to obtain a corresponding intermediate result;
a restoration unit 41, connected to the second processing unit 40, for restoring each intermediate result by overlapping windowing, thereby separately recovering each path of audio data.
The beneficial effects of the above technical solution are:
the method is based on artificial-intelligence deep learning, does not rely on any assumption, and has strong anti-interference capability; human voice separation can be achieved simply by preparing, in advance, several pieces of clean human voice data and of the noise data to be separated as training data and training a recurrent neural network model on them; and the method can separate not only the human voice but also the noise sources mixed into it.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made without departing from the spirit and scope of the invention.

Claims (8)

1. A human voice separation method, characterized in that pre-mixed speech data are used as training data to train and generate a recurrent neural network model, the mixed speech data comprising multiple paths of audio data, the audio data comprising at least one path of human voice data and at least one path of noise data, and the recurrent neural network model being used to respectively identify the human voice data and the noise data, the method further comprising the following steps:
step S1, obtaining externally input original audio data to be separated, wherein at least one path of human voice data and at least one path of noise data are mixed in the original audio data;
step S2, performing feature extraction on the original audio data with a feature extraction model to obtain feature coefficients, the feature coefficients being 22-dimensional BFCC coefficients;
step S3, importing the feature coefficients into the recurrent neural network model for processing to obtain a plurality of processing results in one-to-one correspondence with the human voice data and each path of the noise data;
step S4, performing feature restoration on each processing result with a feature restoration model to obtain the separated human voice data and each path of noise data;
wherein the recurrent neural network model comprises a first gated recurrent unit, a plurality of second gated recurrent units, and a plurality of fully connected units, the second gated recurrent units and the fully connected units being in one-to-one correspondence, and each second gated recurrent unit uniquely corresponding to one path of audio data;
the input end of the first gated recurrent unit serves as the input end of the recurrent neural network model;
the input end of each second gated recurrent unit is connected to the output end of the first gated recurrent unit, the output end of each second gated recurrent unit is connected to the input end of the corresponding fully connected unit, and the output end of each fully connected unit serves as an output end of the recurrent neural network model;
step S3 further comprises:
step S31, the first gated recurrent unit processing the feature coefficients of the input original audio data to obtain 22-dimensional first feature data;
step S32, splicing the first feature data output by the first gated recurrent unit with the feature coefficients to obtain second feature data, and then inputting the second feature data into each of the different second gated recurrent units for processing;
step S33, each second gated recurrent unit processing the second feature data and outputting 44-dimensional third feature data to the corresponding fully connected unit;
step S34, each fully connected unit obtaining and outputting a corresponding processing result from the third feature data, after which the method proceeds to step S4, all processing results being 22-dimensional.
2. The human voice separation method according to claim 1, wherein the input data in the training data are the feature coefficients of the mixed speech data, and the expected output data in the training data are the clean human voice data and the clean noise data before mixing.
3. The human voice separation method according to claim 1, wherein step S2 further comprises:
step S21, dividing the original audio data into a plurality of short-time audio frames using overlapping windows;
step S22, performing a Fourier transform and Bark frequency conversion on each short-time audio frame to obtain the feature coefficients.
4. The human voice separation method according to claim 1, wherein step S4 further comprises:
step S41, performing an inverse Fourier transform on each processing result to obtain a corresponding intermediate result;
step S42, restoring each intermediate result by overlapping windowing, thereby separately recovering each path of audio data.
5. A human voice separation system, characterized in that pre-mixed speech data are used as training data to train and generate a recurrent neural network model, the mixed speech data comprising multiple paths of audio data, the audio data comprising at least one path of human voice data and at least one path of noise data, and the recurrent neural network model being used to respectively identify the human voice data and the noise data, the human voice separation system specifically comprising:
an acquisition module for acquiring the original audio data to be separated, wherein at least one path of human voice data and at least one path of noise data are mixed in the original audio data;
a feature extraction module, connected to the acquisition module, for performing feature extraction on the original audio data to obtain feature coefficients, the feature coefficients being 22-dimensional BFCC coefficients;
a neural network module, connected to the feature extraction module, in which the recurrent neural network model is preset, the recurrent neural network model being used to process the feature coefficients to obtain a plurality of processing results in one-to-one correspondence with the human voice data and each path of the noise data;
a feature restoration module, connected to the neural network module, for performing feature restoration on each processing result to obtain the separated human voice data and each path of noise data;
wherein the neural network module further comprises:
a first gated recurrent unit for computing over the feature coefficients to obtain 22-dimensional first feature data;
a splicing unit, connected to the first gated recurrent unit, for splicing the first feature data with the feature coefficients to obtain second feature data;
a plurality of second gated recurrent units, each connected to the splicing unit, each uniquely corresponding to one path of audio data and processing the second feature data, each second gated recurrent unit outputting 44-dimensional third feature data;
a plurality of fully connected units, connected to the second gated recurrent units in one-to-one correspondence, for processing the third feature data to obtain and output corresponding processing results, each processing result being 22-dimensional.
6. The human voice separation system according to claim 5, wherein the input data in the training data are the feature coefficients of the mixed speech data, and the expected output data in the training data are the clean human voice data and the clean noise data before mixing.
7. The human voice separation system according to claim 5, wherein the feature extraction module further comprises:
a segmentation unit for dividing the original audio data into a plurality of short-time audio frames using overlapping windows;
a first processing unit, connected to the segmentation unit, for performing a Fourier transform and Bark frequency conversion on each short-time audio frame to obtain the feature coefficients.
8. The human voice separation system according to claim 5, wherein the feature restoration module further comprises:
a second processing unit for performing an inverse Fourier transform on each processing result to obtain a corresponding intermediate result;
a restoration unit, connected to the second processing unit, for restoring each intermediate result by overlapping windowing, thereby separately recovering each path of audio data.
CN201911360803.7A 2019-12-25 2019-12-25 Human voice separation method and system Active CN110992966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911360803.7A CN110992966B (en) 2019-12-25 2019-12-25 Human voice separation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911360803.7A CN110992966B (en) 2019-12-25 2019-12-25 Human voice separation method and system

Publications (2)

Publication Number Publication Date
CN110992966A (en) 2020-04-10
CN110992966B (en) 2022-07-01

Family

ID=70076980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911360803.7A Active CN110992966B (en) 2019-12-25 2019-12-25 Human voice separation method and system

Country Status (1)

Country Link
CN (1) CN110992966B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185357A (en) * 2020-12-02 2021-01-05 成都启英泰伦科技有限公司 Device and method for simultaneously recognizing human voice and non-human voice

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103941715B (en) * 2014-05-13 2016-02-17 郝瀚 One key control system of home electric and method
US20160189730A1 (en) * 2014-12-30 2016-06-30 Iflytek Co., Ltd. Speech separation method and system
US9818431B2 (en) * 2015-12-21 2017-11-14 Microsoft Technoloogy Licensing, LLC Multi-speaker speech separation
US11373672B2 (en) * 2016-06-14 2022-06-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
US20190206417A1 (en) * 2017-12-28 2019-07-04 Knowles Electronics, Llc Content-based audio stream separation
CN109801644B (en) * 2018-12-20 2021-03-09 北京达佳互联信息技术有限公司 Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN110459238B (en) * 2019-04-12 2020-11-20 腾讯科技(深圳)有限公司 Voice separation method, voice recognition method and related equipment

Also Published As

Publication number Publication date
CN110992966A (en) 2020-04-10

Similar Documents

Publication Publication Date Title
CN110600018B (en) Voice recognition method and device and neural network training method and device
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
Grais et al. Raw multi-channel audio source separation using multi-resolution convolutional auto-encoders
CN108899047B (en) The masking threshold estimation method, apparatus and storage medium of audio signal
CN112037809A (en) Residual echo suppression method based on multi-feature flow structure deep neural network
CN110544482A (en) single-channel voice separation system
US20200380943A1 (en) Apparatuses and methods for creating noise environment noisy data and eliminating noise
CN110992966B (en) Human voice separation method and system
CN108962276B (en) Voice separation method and device
Jin et al. Speech separation and emotion recognition for multi-speaker scenarios
CN113035225B (en) Visual voiceprint assisted voice separation method and device
Chen et al. On Synthesis for Supervised Monaural Speech Separation in Time Domain.
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
CN113345465B (en) Voice separation method, device, equipment and computer readable storage medium
CN114189781A (en) Noise reduction method and system for double-microphone neural network noise reduction earphone
CN114283832A (en) Processing method and device for multi-channel audio signal
CN111028857A (en) Method and system for reducing noise of multi-channel audio and video conference based on deep learning
WO2020250220A1 (en) Sound analysis for determination of sound sources and sound isolation
JP2003271168A (en) Method, device and program for extracting signal, and recording medium recorded with the program
CN114495974B (en) Audio signal processing method
CN111833897B (en) Voice enhancement method for interactive education
CN117238277B (en) Intention recognition method, device, storage medium and computer equipment
CN111009259B (en) Audio processing method and device
Bao et al. Lightweight Dual-channel Target Speaker Separation for Mobile Voice Communication
Zhang et al. Multiple Sound Sources Separation Using Two-stage Network Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant