WO2020029906A1 - Method and apparatus for separating multi-person speech - Google Patents
Method and apparatus for separating multi-person speech (一种多人语音的分离方法和装置)
- Publication number: WO2020029906A1
- Application: PCT/CN2019/099216 (CN2019099216W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- network model
- mixed
- speech
- sample
- voice
- Prior art date
Classifications
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06N3/088 — Computing arrangements based on biological models; neural networks; learning methods; non-supervised learning, e.g. competitive learning
- G10L21/0272 — Speech or voice signal processing techniques to modify quality or intelligibility; speech enhancement, e.g. noise reduction or echo cancellation; voice signal separating
- G10L25/27 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
- G10L25/51 — Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
- G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks
Definitions
- the present application relates to the technical field of signal processing, and in particular, to a method and a device for separating multi-person speech.
- Speech noise reduction schemes provided by related technologies are mainly applicable to the separation of speech and noise. Because the characteristics of speech and noise are very different, related speech noise reduction schemes can already complete the task of speech noise reduction. And because the speech characteristics of different speakers are very close, the technical difficulty of speech separation is obviously greater than speech noise reduction. How to separate speech from speech is still an unsolved problem.
- the embodiments of the present application provide a method and a device for separating multi-person voices, which are used to implement the separation between voices in a multi-person voice scene.
- an embodiment of the present application provides a method for separating multi-person voices, including:
- the terminal extracts a mixed voice feature from the mixed voice signal to be separated, and the mixed voice signal is mixed with N kinds of human voices, where N is a positive integer greater than or equal to 2;
- the terminal uses a generative adversarial network (GAN) model to extract masking coefficients from the mixed voice features to obtain masking matrices corresponding to the N kinds of human voices;
- the terminal uses the generative adversarial network model to perform voice separation on the masking matrices corresponding to the N kinds of human voices and the mixed voice signal, and outputs N separated voice signals corresponding to the N kinds of human voices.
- the embodiment of the present application further provides a device for separating multiple voices, which is installed in the terminal and includes:
- a feature extraction module configured to extract a mixed voice feature from a mixed voice signal to be separated, wherein the mixed voice signal is mixed with N kinds of human voices, and N is a positive integer greater than or equal to 2;
- a masking matrix generation module configured to use a generative adversarial network model to perform masking coefficient extraction on the mixed speech features to obtain masking matrices corresponding to N kinds of human voices;
- a voice separation module configured to use the generative adversarial network model to perform voice separation on the masking matrices corresponding to the N kinds of human voices and the mixed voice signal, and output N separated voice signals corresponding to the N kinds of human voices.
- the constituent modules of the multi-person voice separation device may also perform the steps described in the foregoing aspect and its various possible implementations; for details, refer to the foregoing description of that aspect and those implementations.
- an embodiment of the present application provides a multi-person voice separation device.
- the multi-person voice separation device includes: a processor and a memory; the memory is used to store instructions; the processor is used to execute the instructions in the memory, so that the multi-person voice separation device performs the method of any of the foregoing aspects.
- the embodiment of the present application further provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute the methods described in the foregoing aspects.
- in the embodiments of the present application, the terminal first extracts the mixed speech features from the mixed speech signal to be separated, where the mixed speech signal contains N kinds of human voices; it then uses the generative adversarial network model to extract masking coefficients from the mixed speech features to obtain masking matrices corresponding to the N kinds of human voices; the terminal uses the generative adversarial network model to perform voice separation on the masking matrices corresponding to the N kinds of human voices and the mixed voice signal, and outputs N separated voice signals corresponding to the N kinds of human voices.
- because the generative adversarial network model can extract the masking matrices corresponding to the N kinds of human voices, it can accurately identify the speech signal corresponding to each human voice. A speech separation network framework based on the generative adversarial network model therefore realizes the separation between voices in a multi-person speech scene and improves the performance of speech separation.
- FIG. 1 is a schematic flowchart of a method for separating multi-person voice according to an embodiment of the present application;
- FIG. 2 is a schematic flowchart of the training process of the generative adversarial network model according to an embodiment of the present application;
- FIG. 3 is a schematic diagram of the model architecture of the generative adversarial network model according to an embodiment of the present application;
- FIG. 4-a is a schematic structural diagram of a multi-person voice separation device according to an embodiment of the present application;
- FIG. 4-b is a schematic structural diagram of a multi-person voice separation device according to an embodiment of the present application;
- FIG. 4-c is a schematic structural diagram of a model training module according to an embodiment of the present application;
- FIG. 4-d is a schematic structural diagram of a generative network training unit according to an embodiment of the present application;
- FIG. 4-e is a schematic structural diagram of a discriminative network training unit according to an embodiment of the present application;
- FIG. 5 is a schematic structural diagram of a terminal to which the method for separating multi-person voices is applied, according to an embodiment of the present application;
- FIG. 6 is a schematic structural diagram of a server to which the method for separating multi-person voices is applied, according to an embodiment of the present application.
- the embodiments of the present application provide a method and a device for separating multi-person voices, which are used to implement the separation between voices in a multi-person voice scene.
- the embodiments of the present application mainly provide a method for separating multi-person voices.
- the embodiments of the present application can complete the separation between voices in a multi-person scene through a neural network, and can be applied to voice interaction in complex acoustic scenarios, such as voice recognition for smart speakers and smart TVs.
- the embodiment of the present application further provides a multi-person voice separation device.
- the multi-person voice separation device may be deployed in a terminal in the form of audio processing software.
- the multi-person voice separation device may also be a server that stores audio.
- the voice separation task performed on the mixed voice signal in the embodiment of the present application is completely different from the noise reduction in the related art.
- Voice noise reduction refers to removing the noise signal contained in the input audio and retaining the voice.
- Speech separation refers to the separation of speech belonging to different speakers in the input audio.
- when the input audio contains noise and multi-person speech, the output of voice noise reduction is the multi-person mixed speech with the noise removed, whereas the output of speech separation is the individual speech of each speaker.
- whether the noise is output separately or simply removed depends on the design of the particular speech separation algorithm. From the perspective of processing difficulty, because the characteristics of speech and noise are very different, related noise reduction schemes can already complete the noise reduction task well; but because the speech characteristics of different speakers are very close, the technical difficulty of speech separation is clearly greater than that of speech noise reduction.
- in the embodiments of the present application, a machine learning method is used to train a generative adversarial network (GAN) model, which may also be referred to as an adversarial generative network model.
- the network models in the GAN may be implemented by neural network models, which may specifically include: Deep Neural Networks (DNN), Long Short-Term Memory (LSTM) networks, or Convolutional Neural Networks (CNN).
- the masking matrix is obtained, for example, by extracting a masking coefficient on each frequency channel of the input mixed speech, frame by frame; these coefficients form the masking matrix.
- the generative adversarial network model is used to perform speech separation on the masking matrices corresponding to the N kinds of human voices and the mixed speech signal, and to output multiple separated speech signals.
- the generative adversarial network model used in the embodiments of the present application can effectively extract the masking matrices corresponding to N kinds of human voices for voice processing, so as to automatically separate the voice signal of each individual human voice from a segment of mixed speech and realize intelligent recognition of the N kinds of human voices.
- a method for separating multi-person voice may include the following steps:
- the terminal extracts a mixed voice feature from the mixed voice signal to be separated.
- the mixed voice signal is mixed with N types of human voices, and N is a positive integer greater than or equal to 2.
- the number of sound sources is represented by the letter N.
- the number of sound sources N is greater than or equal to 2, that is, a plurality of human voices can be included in a mixed voice signal.
- the generative adversarial network model provided by this embodiment can separate the voice signals of the N kinds of human voices.
- the terminal first obtains a segment of the mixed voice signal to be separated and extracts the features corresponding to the mixed voice signal, that is, the mixed voice features.
- the mixed voice features are the input features of the generative adversarial network model. In practical applications, there can be multiple ways to obtain the mixed speech features.
- step 101 the terminal extracts the mixed voice feature from the mixed voice signal to be separated, including:
- the terminal extracts the time domain feature or frequency domain feature of the single-channel voice signal from the mixed voice signal; or,
- the terminal extracts the time domain feature or frequency domain feature of the multi-channel voice signal from the mixed voice signal; or
- the terminal extracts a single-channel voice feature from the mixed voice signal; or,
- the terminal extracts correlation features between multiple channels from the mixed voice signal.
- the mixed voice signals to be separated in the embodiments of the present application may be acquired from a single channel or multiple channels.
- the mixed speech feature may include one or more of the following features, for example, it may include: the time domain feature or the frequency domain feature of the original single-channel / multi-channel voice signal.
- the mixed speech feature may be a single-channel speech feature, such as the logarithmic energy spectrum, Mel-Frequency Cepstral Coefficients (MFCC), subband energy, and the like.
- the mixed speech feature may also include correlation features between multiple channels, such as generalized cross-correlation (GCC) features, phase difference features, and the like.
- the type and content of the extracted features can be determined in combination with the specific scene.
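- As a concrete illustration of one such feature, the following is a minimal sketch of computing the log-power spectrum of a single-channel signal; the frame length, hop size, and Hann window are illustrative assumptions and are not prescribed by the patent.

```python
import numpy as np

def log_power_spectrum(x, frame_len=512, hop=256, eps=1e-8):
    """Frame a single-channel waveform and return its log-power spectrum.

    x: 1-D numpy array of audio samples (assumed len(x) >= frame_len).
    The window, frame length and hop size are illustrative choices.
    """
    assert len(x) >= frame_len
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])   # shape (T, frame_len)
    spec = np.fft.rfft(frames, axis=-1)              # complex spectrum, shape (T, F)
    return np.log(np.abs(spec) ** 2 + eps)           # log-power feature, shape (T, F)
```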
- the terminal uses the generative adversarial network model to extract masking coefficients from the mixed speech features and obtains the masking matrices corresponding to the N kinds of human voices.
- the terminal may use the generative adversarial network model to separate the voice signal of each individual human voice in the mixed voice signal: after obtaining the mixed voice features corresponding to the mixed voice signal, the terminal inputs the mixed voice features into the neural network of the generative adversarial network model, which extracts the masking coefficients corresponding to each human voice.
- the masking coefficients are obtained on each frequency channel of the input mixed speech, frame by frame, to form a masking matrix.
- in this way, masking matrices corresponding to the N kinds of human voices can be generated, and these masking matrices can be used for voice separation of the multiple human voices in the mixed voice signal.
- the generative adversarial network model used in the embodiments of the present application can be obtained by training with mixed voice samples and clean voice samples.
- the generative adversarial network used in the embodiments of the present application is an effective unsupervised learning method.
- the terminal uses the generative adversarial network model to perform voice separation on the masking matrices corresponding to the N kinds of human voices and the mixed voice signal, and outputs N separated voice signals corresponding to the N kinds of human voices.
- after the terminal extracts the masking matrices corresponding to the N kinds of human voices with the generative adversarial network model, it uses the model to perform speech separation on the masking matrices and the mixed voice signal, so that separate speech signals belonging to different sound sources are identified from the mixed speech signal; this solves the problem that related technologies cannot distinguish multiple human voices.
- in summary, the terminal first extracts the mixed voice features from the mixed voice signal to be separated, where the mixed voice signal contains N kinds of human voices; it then uses the generative adversarial network model to extract masking coefficients from the mixed voice features to obtain the masking matrices corresponding to the N kinds of human voices, performs voice separation on these masking matrices and the mixed voice signal, and outputs N separated voice signals corresponding to the N kinds of human voices.
- because the generative adversarial network model can extract the masking matrices corresponding to the N kinds of human voices, it can accurately identify the speech signal corresponding to each human voice; a speech separation network framework based on the generative adversarial network model realizes the separation between voices in a multi-person speech scene and improves the performance of speech separation.
- the generative adversarial network model in the embodiments of the present application includes at least two network models: a generative network model and a discriminative network model.
- the generative network model may also be referred to as a generator, and the discriminative network model may also be referred to as a discriminator.
- the method for separating multi-person voice before the terminal extracts the mixed voice feature from the mixed voice signal to be separated, further includes:
- the terminal obtains mixed voice samples and clean voice samples from the sample database.
- the terminal extracts the features of the mixed speech samples from the mixed speech samples.
- the terminal performs masking coefficient extraction on the features of the mixed voice samples using the generative network model to obtain sample masking matrices corresponding to the N kinds of human voices;
- the terminal uses the generative network model to perform voice separation on the sample masking matrices and the mixed voice samples, and outputs separated voice samples;
- the terminal uses the separated speech samples, the mixed speech samples, and the clean speech samples to alternately train the generative network model and the discriminative network model.
- a sample database may be provided for model training and discrimination.
- mixed voice samples are used for model training.
- the "mixed voice samples" here are different from the mixed voice signal in step 101: the voice samples are sample voices stored in the sample database.
- clean voice samples are also provided in the sample database; during the training process, the mixed voice samples are obtained by superimposing multiple clean voices.
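- The following is a minimal sketch of how such mixed training samples could be constructed by superimposing clean utterances; the random per-source gain is an illustrative assumption, since the text only states that the mixed voice samples are obtained by superimposing multiple clean voices.

```python
import numpy as np

def make_mixture(clean_sources, gain_db_range=(-2.5, 2.5), rng=np.random):
    """Superimpose N equal-length clean utterances into one mixed voice sample.

    clean_sources: list of 1-D numpy arrays (the clean voice samples).
    Returns the mixture Y and the scaled clean references used to build it.
    """
    scaled = []
    for s in clean_sources:
        gain_db = rng.uniform(*gain_db_range)         # illustrative relative level
        scaled.append(s * (10.0 ** (gain_db / 20.0)))
    mixture = np.sum(scaled, axis=0)                  # mixed voice sample
    return mixture, scaled
```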
- the feature extraction of the mixed speech samples is the same as the feature extraction in step 101.
- the sample masking matrices are generated in a manner similar to the masking matrix generation in the foregoing step 102; the difference is that the sample masking matrices here are generated from the features of the mixed speech samples.
- the generative network model is then used to perform speech separation on the sample masking matrices and the mixed speech samples, and to output the separated speech samples.
- the number of sound sources that can be used in the model training process is 2 or more. The number is not limited here.
- after the generative network model outputs the separated speech samples, the discriminative network model is used to determine, based on the separated speech samples, the mixed speech samples, and the clean speech samples, whether the output separated speech samples are consistent with the clean speech samples.
- the discriminative network model is used to introduce an adversarial loss function, so that the generative network model and the discriminative network model are trained repeatedly against each other; this better ensures that the separated speech samples become closer to the real clean speech samples.
- in some embodiments, the terminal using the separated speech samples, the mixed speech samples, and the clean speech samples to alternately train the generative network model and the discriminative network model includes:
- when training the discriminative network model in the current round, the terminal fixes the generative network model;
- the terminal obtains a loss function of the discriminative network model by using the separated voice samples, the mixed voice samples, and the clean voice samples;
- the terminal optimizes the discriminative network model by minimizing the loss function of the discriminative network model;
- when training the generative network model in the next round, the terminal fixes the discriminative network model;
- the terminal obtains a loss function of the generative network model by using the separated voice samples, the mixed voice samples, and the clean voice samples;
- the terminal optimizes the generative network model by minimizing the loss function of the generative network model.
- the terminal's voice separation training process based on the generative adversarial network model mainly consists of alternately training the generative network model and the discriminative network model.
- the generative network model is labeled G, and the discriminative network model is labeled D.
- first, the generative network model G and the discriminative network model D are initialized.
- the training of the discriminative network model in one round is completed through the above steps 201 to 203, and the training of the generative network model in the next round is completed through the above steps 204 to 206.
- the model training process of steps 201 to 203 and the model training process of steps 204 to 206 are iterated until the generative adversarial network model converges.
- the embodiment of the present application thus proposes a voice separation network framework based on a generative adversarial network, in which the training processes of the generative network and the discriminative network iterate against each other to improve speech separation performance.
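- The alternation described in steps 201 to 206 can be sketched as follows; this assumes PyTorch, a generator and discriminator such as those sketched after the discussion of FIG. 3 below, pre-built optimizers, and the least-squares loss forms reconstructed later in this description, none of which are mandated by the patent.

```python
import torch

def train_round(generator, discriminator, opt_g, opt_d, batch, lam=0.5):
    """One round of alternating training (steps 201-206), as a sketch.

    batch: (feats, Y, X1, X2) -- mixed-speech features, mixed spectrum Y,
    and clean spectra X1, X2. lam is an illustrative weighting factor.
    """
    feats, Y, X1, X2 = batch

    # Steps 201-203: fix G, optimize the discriminative model D by minimizing L_D.
    with torch.no_grad():
        Z1, Z2 = generator(feats, Y)                       # separated speech samples
    d_sep = discriminator(torch.cat([Z1, Z2, Y], dim=-1))  # should move toward 0 (false)
    d_real = discriminator(torch.cat([X1, X2, Y], dim=-1)) # should move toward 1 (true)
    loss_d = ((d_sep - 0.0) ** 2).mean() + ((d_real - 1.0) ** 2).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Steps 204-206: fix D (do not step opt_d), optimize the generative model G.
    Z1, Z2 = generator(feats, Y)
    d_sep = discriminator(torch.cat([Z1, Z2, Y], dim=-1))
    adv = ((d_sep - 1.0) ** 2).mean()                      # push D([Z1, Z2, Y]) toward "true"
    # Permutation-invariant spectral distortion J_ss for the two-source case.
    e_a = ((Z1 - X1) ** 2).mean() + ((Z2 - X2) ** 2).mean()
    e_b = ((Z1 - X2) ** 2).mean() + ((Z2 - X1) ** 2).mean()
    j_ss = torch.min(e_a, e_b)
    loss_g = adv + lam * j_ss
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```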
- in steps 201 to 203, the generative network model G is fixed, and the parameters of the discriminative network model are optimized by minimizing the loss function L_D of the discriminative network model.
- in the above step 202, the terminal uses the separated speech samples, the mixed speech samples, and the clean speech samples to obtain the loss function of the discriminative network model, which includes:
- the terminal determines a first signal sample combination according to the separated voice samples and the mixed voice sample, and determines a second signal sample combination according to the clean voice samples and the mixed voice sample;
- the terminal uses the discriminative network model to discriminate the first signal sample combination to obtain a first discriminant output result, and obtains a first distortion metric between the first discriminant output result and the first target output of the discriminative network model;
- the terminal uses the discriminative network model to discriminate the second signal sample combination to obtain a second discriminant output result, and obtains a second distortion metric between the second discriminant output result and the second target output of the discriminative network model;
- the terminal obtains the loss function of the discriminative network model according to the first distortion metric and the second distortion metric.
- here, the case of two sound sources is taken as an example.
- the separated voice samples are represented by Z_1 and Z_2, the mixed voice sample is represented by Y, and the clean voice samples are represented by X_1 and X_2.
- the separated voice samples are combined with the mixed voice sample to obtain the first signal sample combination, represented by [Z_1, Z_2, Y]; the clean voice samples are combined with the mixed voice sample to obtain the second signal sample combination, represented by [X_1, X_2, Y].
- the discriminative network model is denoted D. The discriminative network model is used to discriminate the first signal sample combination to obtain a first discriminant output result, denoted D([Z_1, Z_2, Y]). The first target output of the discriminative network model is the target output 0 (false). A first distortion metric between the first discriminant output result and this first target output is then calculated.
- the first distortion metric can be calculated by the following formula:
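- The formula itself appears as an image in the published document; a plausible reconstruction from the surrounding description, assuming a least-squares form that the text does not state explicitly, is: L_{separated \to false} = \| D([Z_1, Z_2, Y]) - 0 \|^2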
- L_{separated→false} represents the distortion metric between the first discriminant output of the discriminative network model D for the combination [Z_1, Z_2, Y] of the separated voice samples and the mixed voice sample, and the first target output.
- the discrimination output method in step 2023 is similar to the aforementioned step 2022.
- the terminal uses the discriminative network model to discriminate the second signal sample combination to obtain a second discriminant output result, denoted D([X_1, X_2, Y]).
- the second target output of the discriminative network model is the target output 1 (true).
- a second distortion metric between the second discriminant output result and this second target output is then obtained.
- the second distortion metric can be calculated by the following formula:
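- This formula also appears as an image in the published document; a plausible reconstruction under the same least-squares assumption is: L_{real \to true} = \| D([X_1, X_2, Y]) - 1 \|^2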
- L_{real→true} represents the distortion metric between the second discriminant output of the discriminative network model D for the combination [X_1, X_2, Y] of the clean speech samples and the mixed speech sample, and the second target output.
- in step 2024, after obtaining the first distortion metric and the second distortion metric through the foregoing steps, the terminal can obtain the loss function of the discriminative network model from the first distortion metric and the second distortion metric.
- the corresponding loss function for optimizing the discriminative network model can be defined as:
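- The definition is rendered as an image in the published document; based on the description of its two terms below, a plausible reconstruction is the sum of the two distortion metrics: L_D = L_{separated \to false} + L_{real \to true}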
- L_D represents the loss function of the discriminative network model; L_{separated→false} represents the distortion metric between the first discriminant output of the discriminative network model D for the combination [Z_1, Z_2, Y] of the separated speech samples and the mixed speech sample and the first target output; and L_{real→true} represents the distortion metric between the second discriminant output of the discriminative network model D for the combination [X_1, X_2, Y] of the clean speech samples and the mixed speech sample and the second target output.
- in steps 204 to 206, the discriminative network model D is fixed, and the parameters of the generative network model are optimized by minimizing the loss function L_G of the generative network model.
- in the above step 205, the terminal uses the separated speech samples, the mixed speech samples, and the clean speech samples to obtain the loss function of the generative network model, which includes:
- the terminal determines a first signal sample combination according to the separated voice sample and the mixed voice sample.
- the terminal uses the discriminant network model to discriminate and output the first signal sample combination to obtain a first discriminant output result, and obtains a third distortion metric between the first discriminant output result and the second target output of the discriminant network model;
- the terminal obtains a fourth distortion metric between the separated voice samples and the clean voice samples;
- the terminal obtains the loss function of the generative network model according to the third distortion metric and the fourth distortion metric.
- in step 2051, the case of two sound sources is again taken as an example.
- the separated voice samples are represented by Z_1 and Z_2, and the mixed voice sample is represented by Y.
- the separated voice samples are combined with the mixed voice sample to obtain the first signal sample combination, represented by [Z_1, Z_2, Y].
- the discriminative network model is labeled D.
- the terminal uses the discriminative network model to discriminate the first signal sample combination to obtain a first discriminant output result, denoted D([Z_1, Z_2, Y]). Here the relevant target is the second target output of the discriminative network model, namely target output 1 (true). A third distortion metric between the first discriminant output result and this second target output is then calculated.
- the third distortion metric can be calculated by the following formula:
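- As with the earlier distortion terms, the formula appears as an image; a plausible reconstruction under the least-squares assumption is: L_{separated \to true} = \| D([Z_1, Z_2, Y]) - 1 \|^2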
- L_{separated→true} represents the distortion metric between the first discriminant output of the discriminative network model D for the combination [Z_1, Z_2, Y] of the separated speech samples and the mixed speech sample, and the second target output.
- in step 2053, the terminal obtains a fourth distortion metric between the separated voice samples and the clean voice samples.
- the fourth distortion metric is a spectral distortion term, i.e., a distortion metric between the separated voice samples and the clean voice samples.
- in some embodiments of the present application, in step 2053, the terminal obtaining the fourth distortion metric between the separated voice samples and the clean voice samples includes:
- the terminal performs permutation invariance calculation on the separated speech samples and the clean speech samples, and obtains the result of the correspondence between the separated speech samples and the clean speech samples;
- the terminal obtains a fourth distortion metric according to the result of the correspondence between the separated speech samples and the clean speech samples.
- arg min f(x) refers to the set of all arguments x for which the function f(x) attains its minimum value.
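- The permutation-invariant distortion itself is rendered as an image in the published document; a plausible reconstruction from the description above (with P denoting the set of permutations of the N sources) is: J_{ss} = \min_{\pi \in P} \sum_{i=1}^{N} \| Z_i - X_{\pi(i)} \|^2
- For the two-source example this reduces to the smaller of \| Z_1 - X_1 \|^2 + \| Z_2 - X_2 \|^2 and \| Z_1 - X_2 \|^2 + \| Z_2 - X_1 \|^2.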
- in step 2054, after obtaining the third distortion metric and the fourth distortion metric through the foregoing steps, the terminal can obtain the loss function of the generative network model from the third distortion metric and the fourth distortion metric.
- the corresponding loss function for optimizing the generative network model can be defined as:
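- The definition appears as an image in the published document; based on the description of its terms below, a plausible reconstruction is: L_G = L_{separated \to true} + \lambda \, J_{ss}
- The weighting factor is written here as \lambda, an assumed symbol, since the corresponding character is garbled in the available text.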
- L_G represents the loss function of the generative network model; L_{separated→true} represents the distortion metric between the first discriminant output of the discriminative network model D for the combination [Z_1, Z_2, Y] of the separated speech samples and the mixed speech sample and the second target output; J_ss represents the fourth distortion metric; and λ is a weighting factor.
- the embodiment of the present application thus proposes a speech separation network framework based on a generative adversarial network, in which the training processes of the generative network and the discriminative network iterate against each other to improve speech separation performance.
- FIG. 3 is a schematic diagram of a model architecture for generating an adversarial network model according to an embodiment of the present application.
- as shown in FIG. 3, the input of the generative network model G is the mixed speech feature corresponding to the mixed speech signal. A neural network (DNN, LSTM, CNN, etc.) produces the time-frequency masking matrices M_1 and M_2 (mask1, mask2) corresponding to the separated speech signals; then, by multiplying each masking matrix with the spectrum Y of the mixed speech signal, the spectra Z_1 and Z_2 corresponding to the separated speech signals are obtained, that is, the following calculation formula:
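- The formula is rendered as an image in the published document; from the surrounding description it can be reconstructed as the time-frequency point-wise (element-wise) product of each masking matrix with the mixed spectrum: Z_1 = M_1 \odot Y, \quad Z_2 = M_2 \odot Y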
- the input of the discriminative network model is either the combination [Z_1, Z_2, Y] of the separated speech signals and the mixed speech signal, or the combination [X_1, X_2, Y] of the clean speech signals and the mixed speech signal, and its output is 0 or 1.
- the mixed speech signal is obtained by superimposing multiple clean speech signals, so the spectra X_1 and X_2 corresponding to the clean speech are known.
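- The generator/discriminator pair of FIG. 3 can be sketched as follows; the use of an LSTM, the layer sizes, and the sigmoid output are illustrative assumptions, since the patent only requires some neural network (DNN, LSTM, CNN, etc.) and a discriminator that outputs 0 or 1.

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Generator G: mixed-speech features -> masks M1, M2 -> separated spectra Z1, Z2."""
    def __init__(self, feat_dim, freq_bins, hidden=256, n_src=2):
        super().__init__()
        self.n_src = n_src
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Linear(hidden, freq_bins * n_src)

    def forward(self, feats, mix_spec):
        # feats: (batch, T, feat_dim); mix_spec Y: (batch, T, freq_bins)
        h, _ = self.rnn(feats)
        m = torch.sigmoid(self.mask(h))                        # masking coefficients in [0, 1]
        masks = m.view(m.shape[0], m.shape[1], self.n_src, -1)
        # Z_i = M_i * Y: time-frequency point-wise multiplication with the mixed spectrum
        return [masks[:, :, i, :] * mix_spec for i in range(self.n_src)]


class Discriminator(nn.Module):
    """Discriminator D: [Z1, Z2, Y] or [X1, X2, Y] -> score near 0 (false) or 1 (true)."""
    def __init__(self, freq_bins, n_src=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(freq_bins * (n_src + 1), hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, spec_combination):
        # spec_combination: (batch, T, freq_bins * (n_src + 1)), e.g. torch.cat([Z1, Z2, Y], dim=-1)
        return self.net(spec_combination).mean(dim=1)          # one score per utterance
```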
- a related device for implementing the foregoing solution is also provided below, and the device is installed in a terminal.
- a multi-person voice separation device 400 may include: a feature extraction module 401, a masking matrix generation module 402, and a voice separation module 403.
- the feature extraction module 401 is configured to extract a mixed voice feature from a mixed voice signal to be separated, wherein the mixed voice signal is mixed with N kinds of human voices, where N is a positive integer greater than or equal to 2;
- a masking matrix generation module 402 is configured to use a generative adversarial network model to perform masking coefficient extraction on the mixed speech features to obtain masking matrices corresponding to N kinds of human voices;
- the voice separation module 403 is configured to use the generative adversarial network model to perform voice separation on the masking matrices corresponding to the N kinds of human voices and the mixed voice signal, and output N separated voice signals corresponding to the N kinds of human voices.
- in some embodiments, the generative adversarial network model includes a generative network model and a discriminative network model; as shown in FIG. 4-b, the multi-person voice separation device 400 further includes a model training module 404, where:
- the feature extraction module 401 is further configured to obtain the mixed speech samples and the clean speech samples from a sample database before extracting the mixed speech features from the mixed speech signal to be separated, and to extract the features of the mixed speech samples from the mixed speech samples;
- the masking matrix generation module 402 is further configured to perform masking coefficient extraction on the features of the mixed voice samples by using the generative network model to obtain sample masking matrices corresponding to the N kinds of human voices;
- the voice separation module 403 is further configured to use the generative network model to perform voice separation on the sample masking matrices and the mixed voice samples, and output separated voice samples;
- the model training module 404 is configured to use the separated speech samples, the mixed speech samples, and the clean speech samples to alternately train the generative network model and the discriminative network model.
- the model training module 404 includes:
- the generative network training unit 4041 is configured to: fix the generative network model when the discriminative network model is trained in the current round; obtain the loss function of the discriminative network model by using the separated voice samples, the mixed voice samples, and the clean voice samples; and optimize the discriminative network model by minimizing the loss function of the discriminative network model;
- the discriminative network training unit 4042 is configured to: fix the discriminative network model when the generative network model is trained in the next round; obtain the loss function of the generative network model by using the separated voice samples, the mixed voice samples, and the clean voice samples; and optimize the generative network model by minimizing the loss function of the generative network model.
- in some embodiments, the generative network training unit 4041 includes:
- a first speech combination subunit 40411 configured to determine a first signal sample combination according to the separated speech sample and the mixed speech sample, and determine a second signal sample combination according to the clean speech sample and the mixed speech sample;
- the first discrimination output subunit 40412 is configured to use the discriminative network model to discriminate the first signal sample combination to obtain a first discriminant output result and obtain a first distortion metric between the first discriminant output result and the first target output of the discriminative network model, and to use the discriminative network model to discriminate the second signal sample combination to obtain a second discriminant output result and obtain a second distortion metric between the second discriminant output result and the second target output of the discriminative network model;
- a first loss function obtaining subunit 40413 is configured to obtain a loss function of the discriminant network model according to the first distortion metric and the second distortion metric.
- in some embodiments, the discriminative network training unit 4042 includes:
- a second speech combination subunit 40421 configured to determine a first signal sample combination according to the separated speech samples and the mixed speech samples
- the second discrimination output subunit 40422 is configured to use the discriminative network model to discriminate the first signal sample combination to obtain a first discriminant output result, and to obtain a third distortion metric between the first discriminant output result and the second target output of the discriminative network model;
- a distortion metric acquisition subunit 40423 is configured to acquire a fourth distortion metric between the separated speech samples and the clean speech samples;
- a second loss function obtaining subunit 40424 is configured to obtain the loss function of the generative network model according to the third distortion metric and the fourth distortion metric.
- in some embodiments, the distortion metric acquisition subunit 40423 is specifically configured to perform a permutation-invariance calculation on the separated speech samples and the clean speech samples to obtain the correspondence between the separated speech samples and the clean speech samples, and to obtain the fourth distortion metric according to that correspondence.
- the feature extraction module 401 is specifically configured to extract a time-domain feature or a frequency-domain feature of a single-channel voice signal from the mixed voice signal; or from the mixed voice signal Extracting time-domain features or frequency-domain features of a multi-channel voice signal; or extracting a single-channel voice feature from the mixed voice signal; or extracting multi-channel correlation features from the mixed voice signal.
- in summary, the device extracts the mixed speech features from the mixed speech signal to be separated, where the mixed speech signal contains N kinds of human voices; it then performs masking coefficient extraction on the mixed speech features using the generative adversarial network model to obtain the masking matrices corresponding to the N kinds of human voices, performs voice separation on these masking matrices and the mixed voice signal, and outputs N separated voice signals corresponding to the N kinds of human voices.
- because the generative adversarial network model can extract the masking matrices corresponding to the N kinds of human voices, it can accurately identify the speech signal corresponding to each human voice; a speech separation network framework based on the generative adversarial network model realizes the separation between voices in a multi-person speech scene and improves the performance of speech separation.
- the terminal can be any terminal device including a mobile phone, a tablet, a PDA (Personal Digital Assistant), a POS (Point of Sales), and a vehicle-mounted computer. Taking the terminal as a mobile phone as an example:
- FIG. 5 is a block diagram showing a partial structure of a mobile phone related to a terminal provided in an embodiment of the present application.
- the mobile phone includes: a radio frequency (RF) circuit 1010, a memory 1020, an input unit 1030, a display unit 1040, a sensor 1050, an audio circuit 1060, a wireless fidelity (WiFi) module 1070, a processor 1080, a power supply 1090, and other components.
- the RF circuit 1010 can be used for receiving and transmitting signals during information transmission and reception or during a call.
- in particular, downlink information from the base station is received and handed to the processor 1080 for processing; in addition, uplink data is sent to the base station.
- the RF circuit 1010 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like.
- the RF circuit 1010 can also communicate with a network and other devices through wireless communication.
- the above wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (Code Division Multiple Access) Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), E-mail, Short Messaging Service (SMS), etc.
- the memory 1020 may be used to store software programs and modules.
- the processor 1080 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1020.
- the memory 1020 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required by at least one function (such as a sound playback function, an image playback function, etc.), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data, a phone book, etc.), and the like.
- the memory 1020 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
- the input unit 1030 can be used to receive inputted numeric or character information, and generate key signal inputs related to user settings and function control of the mobile phone.
- the input unit 1030 may include a touch panel 1031 and other input devices 1032.
- the touch panel 1031, also known as a touch screen, can collect the user's touch operations on or near it (such as operations performed by the user with a finger, a stylus, or any other suitable object or accessory on or near the touch panel 1031), and drive the corresponding connection device according to a preset program.
- the touch panel 1031 may include two parts, a touch detection device and a touch controller.
- the touch detection device detects the user's touch position, detects the signal caused by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 1080, and can receive and execute commands sent by the processor 1080.
- the touch panel 1031 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave.
- the input unit 1030 may include other input devices 1032.
- Other input devices 1032 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
- the display unit 1040 may be used to display information input by the user or information provided to the user and various menus of the mobile phone.
- the display unit 1040 may include a display panel 1041, and optionally, the display panel 1041 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
- further, the touch panel 1031 may cover the display panel 1041. When the touch panel 1031 detects a touch operation on or near it, it transmits the touch operation to the processor 1080 to determine the type of the touch event, and the processor 1080 then provides a corresponding visual output on the display panel 1041 according to the type of the touch event.
- although in FIG. 5 the touch panel 1031 and the display panel 1041 are implemented as two independent components to implement the input and output functions of the mobile phone, in some embodiments the touch panel 1031 and the display panel 1041 may be integrated to implement the input and output functions of the mobile phone.
- the mobile phone may further include at least one sensor 1050, such as a light sensor, a motion sensor, and other sensors.
- the light sensor may include an ambient light sensor and a proximity sensor.
- the ambient light sensor may adjust the brightness of the display panel 1041 according to the brightness of the ambient light.
- the proximity sensor may turn off the display panel 1041 and/or the backlight when the mobile phone is moved to the ear.
- as a kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three axes), and can detect the magnitude and direction of gravity when stationary; it can be used for applications that recognize the attitude of the mobile phone (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and for vibration-recognition related functions (such as a pedometer or tapping).
- the mobile phone may also be equipped with a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and other sensors, which are not described here again.
- the audio circuit 1060, the speaker 1061, and the microphone 1062 can provide an audio interface between the user and the mobile phone.
- the audio circuit 1060 can convert received audio data into an electrical signal and transmit it to the speaker 1061, and the speaker 1061 converts it into a sound signal for output.
- conversely, the microphone 1062 converts a collected sound signal into an electrical signal, which the audio circuit 1060 receives and converts into audio data; the audio data is then output to the processor 1080 for processing and sent, for example, to another mobile phone via the RF circuit 1010, or output to the memory 1020 for further processing.
- WiFi is a short-range wireless transmission technology.
- the mobile phone can help users send and receive emails, browse web pages, and access streaming media through the WiFi module 1070. It provides users with wireless broadband Internet access.
- although FIG. 5 shows the WiFi module 1070, it can be understood that it is not a necessary component of the mobile phone and can be omitted as needed without changing the essence of the application.
- the processor 1080 is the control center of the mobile phone; it uses various interfaces and lines to connect the various parts of the entire mobile phone.
- by running or executing the software programs and/or modules stored in the memory 1020 and calling the data stored in the memory 1020, the processor 1080 executes the various functions of the mobile phone and processes data, so as to monitor the mobile phone as a whole.
- optionally, the processor 1080 may include one or more processing units; optionally, the processor 1080 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may alternatively not be integrated into the processor 1080.
- the mobile phone also includes a power supply 1090 (such as a battery) for powering various components.
- the power supply can be logically connected to the processor 1080 through a power management system, thereby implementing functions such as managing charging, discharging, and power consumption management through the power management system.
- the mobile phone may further include a camera, a Bluetooth module, and the like, and details are not described herein again.
- in the embodiments of the present application, the processor 1080 included in the terminal also has the function of controlling the execution of the above method for separating multi-person voice performed by the terminal.
- FIG. 6 is a schematic diagram of a server structure provided by an embodiment of the present application.
- the server 1100 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 1122 (for example, one or more processors), memory 1132, and one or more storage media 1130 (for example, one or more mass storage devices) storing application programs 1142 or data 1144.
- the memory 1132 and the storage medium 1130 may be temporary storage or persistent storage.
- the program stored in the storage medium 1130 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server.
- the central processing unit 1122 may be configured to communicate with the storage medium 1130, and execute a series of instruction operations in the storage medium 1130 on the server 1100.
- the server 1100 may also include one or more power sources 1126, one or more wired or wireless network interfaces 1150, one or more input / output interfaces 1158, and / or, one or more operating systems 1141, such as Windows ServerTM, Mac OS. XTM, UnixTM, LinuxTM, FreeBSDTM and more.
- the steps of the method for separating multi-person voice performed by the server in the foregoing embodiment may be based on the server structure shown in FIG. 6.
- a storage medium is also provided.
- a computer program is stored in the storage medium, and the computer program is configured to execute the steps in any one of the foregoing method embodiments when running.
- the foregoing storage medium may be configured to store a computer program for performing the following steps:
- the terminal extracts a mixed voice feature from the mixed voice signal to be separated.
- the mixed voice signal is mixed with N kinds of human voices, and N is a positive integer greater than or equal to 2.
- the terminal uses a generative adversarial network model to extract masking coefficients from the mixed voice features to obtain masking matrices corresponding to N kinds of human voices.
- the terminal uses the generative adversarial network model to perform voice separation on the masking matrices corresponding to the N kinds of human voices and the mixed voice signal, and outputs N separated voice signals corresponding to the N kinds of human voices.
- the storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
- the device embodiments described above are only illustrative; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, i.e., they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objective of the solution of this embodiment.
- the connection relationship between the modules indicates that there is a communication connection between them, which can be specifically implemented as one or more communication buses or signal lines.
- the part of the technical solution of this application that is essential, or that contributes to the related technologies, can be embodied in the form of a software product, which is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.
- the mixed speech features are first extracted from the mixed speech signal to be separated, the mixed speech signal containing a mixture of N human voices; masking coefficients are then extracted from the mixed speech features using the generative adversarial network model to obtain the masking matrices corresponding to the N human voices; the generative adversarial network model then performs speech separation on the masking matrices corresponding to the N human voices and the mixed speech signal, and outputs N separated speech signals corresponding to the N human voices.
- because the generative adversarial network model in the embodiments of the present application can extract the masking matrices corresponding to the N human voices, it can accurately identify the speech signal of each voice, and a speech-separation network framework is implemented on the basis of this model.
- the framework achieves speech-to-speech separation in a multi-speaker scene and improves the performance of speech separation.
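- as a minimal sketch of the pipeline summarized above (not the exact implementation of this application), the separation can be read as: STFT of the mixture, feature extraction, mask estimation by a trained generator, mask-by-spectrum multiplication, and inverse STFT; the `generator` callable and the feature choice below are illustrative assumptions.

```python
import numpy as np
import librosa  # assumed available for STFT/iSTFT

def separate(mixture, generator, n_sources=2, n_fft=512, hop=128):
    """Split a mixed waveform into n_sources estimated waveforms using T-F masks."""
    Y = librosa.stft(mixture, n_fft=n_fft, hop_length=hop)   # spectrum of the mixed signal
    feats = np.log(np.abs(Y) ** 2 + 1e-8)                    # mixed voice feature (log power spectrum)
    masks = generator(feats)                                  # assumed shape: (n_sources, F, T), values in [0, 1]
    # apply each masking matrix to the mixture spectrum, then return to the time domain
    return [librosa.istft(masks[i] * Y, hop_length=hop, length=len(mixture))
            for i in range(n_sources)]
```

- for a two-speaker mixture, `separate(y, g)` would return two waveforms, one per estimated voice.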
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Quality & Reliability (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Telephonic Communication Services (AREA)
Abstract
一种多人语音的分离方法和装置,用于实现在多人语音场景下的语音与语音之间的分离。包括:终端从待分离的混合语音信号中提取出混合语音特征,混合语音信号中混合有N种人声,N为大于或等于2的正整数(101);终端使用生成对抗网络模型对混合语音特征进行掩蔽系数提取,得到N种人声对应的掩蔽矩阵(102);终端使用生成对抗网络模型对N种人声所对应的掩蔽矩阵和混合语音信号进行语音分离,输出与N种人声对应的N种分离语音信号(103)。
Description
本申请要求于2018年8月9日提交中国专利局、优先权号为2018109044889、发明名称为“一种多人语音的分离方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本申请涉及信号处理技术领域,尤其涉及一种多人语音的分离方法和装置。
在嘈杂的声学环境中,比如在鸡尾酒会中,往往同时存在着多个不同的人声以及其他杂音。在这种声学环境下,人类的听觉系统能一定程度地听清楚目标语言,而机器在这方面的能力还远不如人类。因此,如何在多个人声混杂的声学环境中分离出目标语音一直是语音信号处理领域的重要研究方向。
相关技术提供的语音降噪方案主要适用于语音和噪声的分离,由于语音和噪声的特性差别很大,相关语音降噪方案已经能很好地完成语音降噪任务。而由于不同说话人的语音特性非常接近,语音分离的技术难度明显大于语音降噪。如何将语音和语音进行分离,仍是未解决的问题。
发明内容
本申请实施例提供了一种多人语音的分离方法和装置,用于实现在多人语音场景下的语音与语音之间的分离。
本申请实施例提供以下技术方案:
一方面,本申请实施例提供一种多人语音的分离方法,包括:
终端从待分离的混合语音信号中提取出混合语音特征,所述混合语音信号中混合有N种人声,所述N为大于或等于2的正整数;
终端使用生成对抗网络模型对所述混合语音特征进行掩蔽系数提取,得到N种人声对应的掩蔽矩阵;
终端使用所述生成对抗网络模型对所述N种人声所对应的掩蔽矩阵和所述混合语音信号进行语音分离,输出与所述N种人声对应的N种分离语音信 号。
另一方面,本申请实施例还提供一种多人语音的分离装置,安装在终端中,包括:
特征提取模块,设置为从待分离的混合语音信号中提取出混合语音特征,所述混合语音信号中混合有N种人声,所述N为大于或等于2的正整数;
掩蔽矩阵生成模块,设置为使用生成对抗网络模型对所述混合语音特征进行掩蔽系数提取,得到N种人声对应的掩蔽矩阵;
语音分离模块,设置为使用所述生成对抗网络模型对所述N种人声所对应的掩蔽矩阵和所述混合语音信号进行语音分离,输出与所述N种人声对应的N种分离语音信号。
在前述方面中,多人语音的分离装置的组成模块还可以执行前述一方面以及各种可能的实现方式中所描述的步骤,详见前述对前述一方面以及各种可能的实现方式中的说明。
另一方面,本申请实施例提供一种多人语音的分离装置,该多人语音的分离装置包括:处理器、存储器;存储器用于存储指令;处理器用于执行存储器中的指令,使得多人语音的分离装置执行如前述一方面中任一项的方法。
另一方面，本申请实施例提供了一种计算机可读存储介质，所述计算机可读存储介质中存储有指令，当其在计算机上运行时，使得计算机执行上述各方面所述的方法。
在本申请实施例中,终端首先从待分离的混合语音信号中提取出混合语音特征,混合语音信号中混合有N种人声,然后使用生成对抗网络模型对混合语音特征进行掩蔽系数提取,得到N种人声对应的掩蔽矩阵;终端使用生成对抗网络模型对N种人声所对应的掩蔽矩阵和混合语音信号进行语音分离,输出与N种人声对应的N种分离语音信号。由于本申请实施例中使用生成对抗网络模型可以提取到N种人声对应的掩蔽矩阵,该生成对抗网络模型可以精确识别多种人声对应的语音信号,基于该生成对抗网络模型实现语音分离网络框架,实现了在多人语音场景下的语音与语音之间的分离,提升了语音分离的性能。
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中 所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域的技术人员来讲,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的一种多人语音的分离方法的流程方框示意图;
图2为本申请实施例提供的生成对抗网络模型的训练过程的流程方框示意图；
图3为本申请实施例提供的一种生成对抗网络模型的模型架构示意图;
图4-a为本申请实施例提供的一种多人语音的分离装置的组成结构示意图;
图4-b为本申请实施例提供的一种多人语音的分离装置的组成结构示意图;
图4-c为本申请实施例提供的一种模型训练模块的组成结构示意图;
图4-d为本申请实施例提供的一种生成网络训练单元的组成结构示意图；
图4-e为本申请实施例提供的一种判别网络训练单元的组成结构示意图；
图5为本申请实施例提供的多人语音的分离方法应用于终端的组成结构示意图;
图6为本申请实施例提供的多人语音的分离方法应用于服务器的组成结构示意图。
本申请实施例提供了一种多人语音的分离方法和装置,用于实现在多人语音场景下的语音与语音之间的分离。
为使得本申请的申请目的、特征、优点能够更加的明显和易懂,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,下面所描述的实施例仅仅是本申请一部分实施例,而非全部实施例。基于本申请中的实施例,本领域的技术人员所获得的所有其他实施例,都属于本申请保护的范围。
本申请的说明书和权利要求书及上述附图中的术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元,而是可包括没有清楚地 列出的或对于这些过程、方法、产品或设备固有的其它单元。
以下分别进行详细说明。
本申请实施例主要提供一种多人语音的分离方法,本申请实施例通过神经网络可以完成对多人场景下的语音与语音之间的分离,应用于复杂声学场景下的语音交互中,例如智能音箱,智能电视(TV)等场景的语音识别。本申请实施例中还提供多人语音的分离装置,该多人语音的分离装置可以通过音频处理软件的方式部署在终端中,该多人语音的分离装置也可以是存储音频的服务器。
本申请实施例中对混合语音信号所进行的语音分离任务完全不同于相关技术中语音降噪。语音降噪是指去除输入音频中包含的噪声信号,保留语音。语音分离是指分离出输入音频中属于不同说话人的语音。当输入音频包含噪声以及多人语音时,对于语音降噪任务,输出是去除噪声后多人混合语音。对于语音分离任务,输出的是每个说话人单独的语音,至于噪声是单独输出或者直接被去除,取决于不同语音分离算法的设计。从音频特性的处理难度上来看,由于语音和噪声的特性差别很大,相关语音降噪方案已经能很好地完成语音降噪任务。而由于不同说话人的语音特性非常接近,语音分离的技术难度明显大于语音降噪。
本申请实施例提供的多人语音的分离中采用机器学习的方式来训练出生成对抗网络(Generative Adversarial Nets,GAN)模型,该生成对抗网络模型也可以称为生成式对抗网络模型,该生成对抗网络模型可以是通过神经网络模型来实现,例如本申请实施例中采用的神经网络模型具体可以包括:深度神经网络(Deep Neural Networks,DNN)、长短期记忆网络(Long Short-Term Memory,LSTM)、卷积神经网络(Convolutional Neural Network,CNN)。首先从待分离的混合语音信号中提取出混合语音特征,再将该混合语音特征输入到生成对抗网络模型中,使用生成对抗网络模型对混合语音特征进行掩蔽系数提取,得到N种人声对应的掩蔽矩阵,例如对输入的混合语音逐帧在各频率通道上求取掩蔽系数,即可形成掩蔽矩阵。最后使用生成对抗网络模型对N种人声所对应的掩蔽矩阵和混合语音信号进行语音分离,输出多个分离后的语音信号。本申请实施例采用的生成对抗网络模型能 够有效的提取N种人声对应的掩蔽矩阵,以进行语音处理,从而能够自动地对从一段混合语音中分离出单个人声的语音信号,实现了类人听觉的N种人声的智能识别。
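As one purely illustrative reading of the above paragraph (the BLSTM choice, the layer sizes and the sigmoid output are assumptions, not the prescribed architecture), a generator can map the mixed-speech features to one masking coefficient per frequency channel, per frame and per source:

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Sketch of a generator G: mixed-speech features in, per-source T-F masks out."""
    def __init__(self, n_freq=257, hidden=256, n_sources=2):
        super().__init__()
        self.n_freq, self.n_sources = n_freq, n_sources
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_freq * n_sources)

    def forward(self, feats):                     # feats: (batch, T, n_freq)
        h, _ = self.lstm(feats)
        m = torch.sigmoid(self.proj(h))           # masking coefficients in [0, 1]
        b, t, _ = m.shape
        return m.view(b, t, self.n_sources, self.n_freq)   # masking matrices, (batch, T, S, F)
```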
请参阅图1所示,本申请一个实施例提供的多人语音的分离方法,可以包括如下步骤:
101、终端从待分离的混合语音信号中提取出混合语音特征,混合语音信号中混合有N种人声,N为大于或等于2的正整数。
在本申请实施例中,音源的数量用字母N来表示,在语音分离任务中,音源的数量N大于或等于2,即在一段混合语音信号中可以包括多种人声,通过本申请后续实施例提供的生成对抗网络模型可以分离出N种人声的语音信号。
在本申请实施例中,终端首先获取到一段待分离的混合语音信号,先提取该混合语音信号对应的特征,即获取到混合语音特征,该混合语音特征是生成对抗网络模型的输入特征,在实际应用中,混合语音特征的获取方式可以多种。
在本申请的一些实施例中,步骤101终端从待分离的混合语音信号中提取出混合语音特征,包括:
终端从混合语音信号中提取出单通道语音信号的时域特征或者频域特征;或者,
终端从混合语音信号中提取出多通道语音信号的时域特征或者频域特征;或者,
终端从混合语音信号中提取出单通道语音特征;或者,
终端从混合语音信号中提取出多通道间的相关特征。
其中,本申请实施例中待分离的混合语音信号可以从单通道或者多个通道采集得到。混合语音特征可以包含以下一个或者多个特征,例如可以包括:原始单通道/多通道语音信号的时域特征或者频域特征。又如混合语音特征可以是单通道语音特征,如对数能量谱,梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC),子带能量等。又如混合语音特征可以包括:多通道间的相关特征,如广义互相关(generalized cross correlation,GCC) 特征,相位差特征等。对于混合音频信号的特征提取方式,可以结合具体场景来确定所提取的特征类型以及特征内容。
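For illustration only, the single-channel and cross-channel features mentioned above could be computed along the following lines; the frame sizes and the PHAT weighting in the cross-correlation are assumptions.

```python
import numpy as np
import librosa

def single_channel_features(y, sr, n_fft=512, hop=128):
    """Log power spectrum and MFCCs of one channel of the mixed signal."""
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    log_power = np.log(np.abs(S) ** 2 + 1e-8)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=n_fft, hop_length=hop)
    return log_power, mfcc

def gcc_phat(x1, x2, n_fft=1024):
    """Generalized cross-correlation (PHAT weighting) between two channels."""
    X1, X2 = np.fft.rfft(x1, n_fft), np.fft.rfft(x2, n_fft)
    cross = X1 * np.conj(X2)
    return np.fft.irfft(cross / (np.abs(cross) + 1e-8), n_fft)
```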
102、终端使用生成对抗网络模型对混合语音特征进行掩蔽系数提取,得到N种人声对应的掩蔽矩阵。
在本申请实施例中,终端可以使用生成对抗网络模型来用于混合语音信号中的单个人声的语音信号的分离,在获取到混合语音信号对应的混合语音特征之后,将混合语音特征输入到生成对抗网络模型中,使用生成对抗网络模型中的神经网络来提取各个人声对应的掩蔽系数,例如对输入的混合语音逐帧在各频率通道上求取掩蔽系数,即可形成掩蔽矩阵。
本申请实施例中通过生成对抗网络模型可以生成N种人声对应的掩蔽矩阵,该掩蔽矩阵可以用于混合语音信号中多种人声的语音分离。本申请实施例采用的生成对抗网络模型可以通过混合语音样本和干净语音样本进行训练得到,本申请实施例采用的生成对抗网络模型是有效的无监督学习方法。通过构造生成网络模型和判别网络模型,在训练过程中使两个模型互相博弈,最终使得生成网络能够以假乱真,生成出接近真实目标(如语音等)的结果。详见后续实施例中对生成对抗网络模型的训练过程的详细说明。
103、终端使用生成对抗网络模型对N种人声所对应的掩蔽矩阵和混合语音信号进行语音分离,输出与N种人声对应的N种分离语音信号。
在本申请实施例中,终端通过生成对抗网络模型提取到N种人声所对应的掩蔽矩阵之后,使用生成对抗网络模型对掩蔽矩阵和混合语音信号进行语音分离,从而通过掩蔽矩阵的分离作用从该混合语音信号中识别出分别属于不同音源的分离语音信号,解决了相关技术无法识别多个人声语音的问题。
通过以上实施例对本申请实施例的描述可知,终端首先从待分离的混合语音信号中提取出混合语音特征,混合语音信号中混合有N种人声,然后使用生成对抗网络模型对混合语音特征进行掩蔽系数提取,得到N种人声对应的掩蔽矩阵;终端使用生成对抗网络模型对N种人声所对应的掩蔽矩阵和混合语音信号进行语音分离,输出与N种人声对应的N种分离语音信号。由于本申请实施例中使用生成对抗网络模型可以提取到N种人声对应的掩蔽矩阵,该生成对抗网络模型可以精确识别多种人声对应的语音信号,基于该生成对 抗网络模型实现语音分离网络框架,实现了在多人语音场景下的语音与语音之间的分离,提升了语音分离的性能。
接下来对本申请实施例中生成对抗网络模型的训练过程进行举例说明。本申请实施例中生成对抗网络模型至少包括两个网络模型,其中一个是生成网络模型,另一个是判别网络模型,生成网络模型也可以称为生成器,判别网络模型也可以称为判别器。通过生成网络模型和判别网络模型的互相博弈学习,从而通过生成对抗网络模型产生相当好的输出。
在本申请的一些实施例中,终端从待分离的混合语音信号中提取出混合语音特征之前,本申请实施例提供的多人语音的分离方法还包括:
A1、终端从样本数据库中获取混合语音样本和干净语音样本;
A2、终端从混合语音样本中提取出混合语音样本特征;
A3、终端通过生成网络模型对混合语音样本特征进行掩蔽系数提取,得到N种人声对应的样本掩蔽矩阵;
A4、终端使用生成网络模型对样本掩蔽矩阵和混合语音样本进行语音分离,输出分离语音样本;
A5、终端使用分离语音样本、混合语音样本和干净语音样本,对生成网络模型和对抗网络模型进行交替训练。
其中，本申请实施例中可以设置样本数据库用于模型的训练与判别，例如采用一段混合语音信号用于模型训练，这里的"混合语音样本"有别于步骤101中的混合语音信号，该混合语音样本是样本数据库中的样本语音，为了判别生成网络模型的输出效果，在样本数据库中还提供干净语音样本，在训练过程中，混合语音样本是由多个干净语音叠加得到。
在前述的步骤A2至步骤A4中,混合语音样本特征的提取与步骤101中特征提取相同,样本掩蔽矩阵与前述步骤102中掩蔽矩阵的生成方式相类似,此处的样本掩蔽矩阵是指基于混合语音样本特征所生成的掩蔽矩阵,接下来使用生成网络模型对样本掩蔽矩阵和混合语音样本进行语音分离,输出分离语音样本,在模型训练过程中可以采用的音源数量为2,或者更多的音源数量,此处不做限定。
在生成网络模型输出分离语音样本之后,根据分离语音样本、混合语音 样本和干净语音样本,再使用判别网络模型来判别输出的分离语音样本是否与干净语音样本相同,使用判别网络模型,引入对抗损失函数,从而对生成网络模型和判别网络模型进行交替的多次训练,从而可以更好的保证分离语音样本更接近真实的干净语音样本。
在本申请的一些实施例中,请参阅图2所示,前述步骤A5终端使用分离语音样本、混合语音样本和干净语音样本,对生成网络模型和对抗网络模型进行交替训练,包括:
201、终端在本次训练判别网络模型时,固定生成网络模型。
202、终端使用分离语音样本、混合语音样本和干净语音样本获取判别网络模型的损失函数。
203、终端通过最小化判别网络模型的损失函数,优化判别网络模型。
204、终端在下一次训练生成网络模型时,固定判别网络模型。
205、终端使用分离语音样本、混合语音样本和干净语音样本获取生成网络模型的损失函数。
206、终端通过最小化生成网络模型的损失函数,优化生成网络模型。
在本申请实施例中,终端基于生成对抗网络模型的语音分离训练过程中主要包括对生成网络模型和对抗网络模型进行交替训练,生成网络模型标记为G,判别网络模型标记为D,首先初始化生成网络模型G和判别网络模型D。然后通过上述步骤201至步骤203完成一次训练过程中对判别网络模型的训练,再通过上述步骤204至步骤206完成一次训练过程中对生成网络模型的训练。迭代步骤201至步骤203的模型训练过程、步骤204至步骤206的模型训练过程,直到生成对抗网络模型收敛。本申请实施例提出基于生成式对抗网络的语音分离网络框架,利用生成网络和对抗网络互相迭代的训练过程,提升现有语音分离的性能。
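A schematic of one round of the alternating scheme of steps 201-206 is sketched below; the helper callables (`separate_fn`, `d_loss_fn`, `g_loss_fn`) and the weighting value are assumptions, with the loss helpers themselves sketched after the corresponding formulas later in this description.

```python
import torch

def alternating_step(G, D, opt_G, opt_D, Y, X_clean,
                     separate_fn, d_loss_fn, g_loss_fn, lam=0.05):
    """Train D once with G fixed (steps 201-203), then G once with D fixed (steps 204-206)."""
    # --- discriminator update: G is frozen by detaching its outputs ---
    with torch.no_grad():
        Z = separate_fn(G, Y)                    # separated speech samples
    opt_D.zero_grad()
    d_loss_fn(D, Z, X_clean, Y).backward()       # minimize the discriminator loss L_D
    opt_D.step()
    # --- generator update: only opt_G steps, so D's parameters stay fixed ---
    opt_G.zero_grad()
    Z = separate_fn(G, Y)
    g_loss_fn(D, Z, X_clean, Y, lam).backward()  # minimize the generator loss L_G
    opt_G.step()
```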
首先在上述步骤201至步骤203中，固定生成网络模型G，通过最小化判别网络模型的损失函数L_D，优化判别网络模型参数。
上述步骤202使用分离语音样本、混合语音样本和干净语音样本获取判别网络模型的损失函数,包括:
2021、终端根据分离语音样本和混合语音样本确定第一信号样本组合, 以及根据干净语音样本和混合语音样本确定第二信号样本组合;
2022、终端使用判别网络模型对第一信号样本组合进行判别输出,得到第一判别输出结果,以及获取第一判别输出结果与判别网络模型的第一目标输出之间的第一失真度量;
2023、终端使用判别网络模型对第二信号样本组合进行判别输出,得到第二判别输出结果,以及获取第二判别输出结果与判别网络模型的第二目标输出之间的第二失真度量;
2024、终端根据第一失真度量和第二失真度量获取判别网络模型的损失函数。
在步骤2021中，以音源数量为2示例说明，分离语音样本用Z_1、Z_2来表示，混合语音样本用Y来表示，分离语音样本与混合语音样本进行组合，得到第一信号样本组合，该第一信号样本组合用[Z_1, Z_2, Y]表示。同理，第二信号样本组合用[X_1, X_2, Y]来表示，干净语音样本用X_1、X_2来表示。
在步骤2022中，判别网络模型标记为D，使用判别网络模型对第一信号样本组合进行判别输出，得到第一判别输出结果，该第一判别输出结果用D([Z_1, Z_2, Y])来表示，判别网络模型的第一目标输出为目标输出0(false)，接下来计算第一判别输出结果与判别网络模型的第一目标输出之间的第一失真度量。
例如该第一失真度量可以通过如下公式计算：
L_separated->false = ||D([Z_1, Z_2, Y]) - 0||^2。
其中，L_separated->false表示分离语音样本与混合语音样本的组合[Z_1, Z_2, Y]经过判别网络模型D的第一判别输出结果与第一目标输出之间的失真度量。
在步骤2023中的判别输出方式与前述步骤2022相类似，终端使用判别网络模型对第二信号样本组合进行判别输出，得到第二判别输出结果，该第二判别输出结果用D([X_1, X_2, Y])来表示，判别网络模型的第二目标输出为目标输出1(true)，接下来获取第二判别输出结果与判别网络模型的第二目标输出之间的第二失真度量。
例如该第二失真度量可以通过如下公式计算：
L_real->true = ||D([X_1, X_2, Y]) - 1||^2。
其中，L_real->true表示干净语音样本与混合语音样本的组合[X_1, X_2, Y]经过判别网络模型D的第二判别输出结果与第二目标输出之间的失真度量。
在步骤2024中,终端通过前述步骤获取到第一失真度量和第二失真度量之后,通过第一失真度量和第二失真度量可以获取判别网络模型的损失函数。
举例说明,判别网络模型优化时对应的损失函数可定义为:
L_D = L_real->true + L_separated->false。
其中，L_D表示判别网络模型的损失函数，L_separated->false表示分离语音样本与混合语音样本的组合[Z_1, Z_2, Y]经过判别网络模型D的第一判别输出结果与第一目标输出之间的失真度量，L_real->true表示干净语音样本与混合语音样本的组合[X_1, X_2, Y]经过判别网络模型D的第二判别输出结果与第二目标输出之间的失真度量。
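A direct least-squares rendering of L_D for the two-source example is sketched below; the concatenation of the spectra along the last axis before they are fed to D is an assumed input layout.

```python
import torch

def discriminator_loss(D, Z, X, Y):
    """L_D = L_real->true + L_separated->false (two-source case)."""
    d_fake = D(torch.cat([Z[0], Z[1], Y], dim=-1))        # D([Z1, Z2, Y])
    d_real = D(torch.cat([X[0], X[1], Y], dim=-1))        # D([X1, X2, Y])
    l_separated_false = torch.mean((d_fake - 0.0) ** 2)   # ||D([Z1,Z2,Y]) - 0||^2
    l_real_true = torch.mean((d_real - 1.0) ** 2)         # ||D([X1,X2,Y]) - 1||^2
    return l_real_true + l_separated_false
```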
接下来在上述步骤204至步骤206中，固定判别网络模型D，通过最小化生成网络模型的损失函数L_G，优化生成网络模型参数。
上述步骤205使用分离语音样本、混合语音样本和干净语音样本获取生成网络模型的损失函数,包括:
2051、终端根据分离语音样本和混合语音样本确定第一信号样本组合;
2052、终端使用判别网络模型对第一信号样本组合进行判别输出,得到第一判别输出结果,以及获取第一判别输出结果与判别网络模型的第二目标输出之间的第三失真度量;
2053、终端获取分离语音样本和干净语音之间的第四失真度量;
2054、终端根据第三失真度量和第四失真度量获取生成网络模型的损失函数。
在步骤2051中，以音源数量为2示例说明，分离语音样本用Z_1、Z_2来表示，混合语音样本用Y来表示，分离语音样本与混合语音样本进行组合，得到第一信号样本组合，该第一信号样本组合用[Z_1, Z_2, Y]表示。
在步骤2052中，判别网络模型标记为D，终端使用判别网络模型对第一信号样本组合进行判别输出，得到第一判别输出结果，该第一判别输出结果用D([Z_1, Z_2, Y])来表示，判别网络模型的第二目标输出为目标输出1(true)，接下来计算第一判别输出结果与判别网络模型的第二目标输出之间的第三失真度量。
例如该第三失真度量可以通过如下公式计算:
L_separated->true = ||D([Z_1, Z_2, Y]) - 1||^2。
其中，L_separated->true表示分离语音样本与混合语音样本的组合[Z_1, Z_2, Y]经过判别网络模型D的第一判别输出结果与第二目标输出之间的失真度量。
在步骤2053中,终端获取分离语音样本和干净语音之间的第四失真度量,第四失真度量是频谱失真项,为分离语音样本与干净语音样本的失真度量。
假设T为时域的帧数，F为频域的点数，S为音源的个数，本申请实施例提供的第四失真度量J_ss即可表示为分离语音样本与干净语音样本在全部T×F×S个时频点上累积的频谱失真。
在本申请的一些实施例中,步骤2054终端获取分离语音样本和干净语音之间的第四失真度量,包括:
终端对分离语音样本和干净语音样本进行置换不变性计算,得到分离语音样本和干净语音样本之间的对应关系结果;
终端根据分离语音样本和干净语音样本之间的对应关系结果获取到第四失真度量。
其中，在语音分离任务中，由于音源数量大于或等于2，考虑到分离语音样本与干净语音样本的对应关系并不是唯一的，即有可能是Z_1对应X_1、Z_2对应X_2，也有可能是Z_1对应X_2、Z_2对应X_1。因此需要针对分离语音样本和干净语音样本进行置换不变性计算，即可以在J_ss的定义中引入与对应关系无关的训练准则(Permutation Invariant Training，PIT)。假设所有对应关系的组合形成一个集合P，则φ*表示集合P中取得最小失真时的对应关系，即φ* = arg min_{φ∈P} J_φ；PIT对应的频谱失真项J_φ*即为在该对应关系φ*下计算的频谱失真。
其中，arg min f(x)是指使得函数f(x)取得其最小值的所有自变量x的集合。
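The permutation-invariant distortion can be sketched by scoring every pairing of separated outputs with clean references and keeping the minimum; the mean-squared spectral error used below is an assumed instantiation of J_ss.

```python
from itertools import permutations
import torch

def pit_spectral_loss(Z, X):
    """Minimum spectral distortion over all source-to-reference assignments.

    Z, X: tensors of shape (S, T, F) holding separated and clean spectra.
    """
    S = Z.shape[0]
    best = None
    for perm in permutations(range(S)):
        # average squared spectral error under this correspondence
        j = sum(torch.mean((Z[i] - X[p]) ** 2) for i, p in enumerate(perm)) / S
        best = j if best is None else torch.minimum(best, j)
    return best
```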
在步骤2054中,终端通过前述步骤获取到第三失真度量和第四失真度量之后,通过第三失真度量和第四失真度量可以获取生成网络模型的损失函数。
举例说明,生成网络模型优化时对应的损失函数可定义为:
L_G = J_ss + λ×L_separated->true。
其中，L_G表示生成网络模型的损失函数，L_separated->true表示分离语音样本与混合语音样本的组合[Z_1, Z_2, Y]经过判别网络模型D的第一判别输出结果与第二目标输出之间的失真度量，J_ss表示第四失真度量，λ为加权因子。
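Combining the two terms of L_G, again with a least-squares adversarial target and the permutation-invariant distortion sketched above (the value of the weighting factor λ is an assumption):

```python
import torch

def generator_loss(D, Z, X, Y, pit_loss_fn, lam=0.05):
    """L_G = J_ss + lambda * L_separated->true (two-source case)."""
    j_ss = pit_loss_fn(Z, X)                              # permutation-invariant spectral distortion
    d_fake = D(torch.cat([Z[0], Z[1], Y], dim=-1))        # D([Z1, Z2, Y])
    l_separated_true = torch.mean((d_fake - 1.0) ** 2)    # ||D([Z1,Z2,Y]) - 1||^2
    return j_ss + lam * l_separated_true
```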
通过以上实施例对本申请实施例的描述可知,本申请实施例提出基于生成式对抗网络的语音分离网络框架,利用生成网络和对抗网络互相迭代的训练过程,提升了现有语音分离的性能。
为便于更好的理解和实施本申请实施例的上述方案,下面举例相应的应用场景来进行具体说明。
请参阅图3所示,为本申请实施例提供的一种生成对抗网络模型的模型架构示意图。接下来将详细介绍生成对抗网络模型的语音分离网络结构。
在本申请实施例提供的基于生成式对抗网络的语音分离网络结构中，生成网络模型G的输入为混合语音信号对应的混合语音特征，经过神经网络(DNN、LSTM、CNN等)，得到分离语音信号对应的时频点掩蔽矩阵M_1、M_2(mask1、mask2)，之后通过掩蔽矩阵与混合语音信号的频谱Y相乘，可得到分离语音信号对应的频谱Z_1、Z_2，即满足如下计算公式：
Z_i = M_i * Y，i = 1, 2。
对抗网络模型的输入为分离语音信号与混合语音信号的组合[Z_1, Z_2, Y]，或者是干净语音信号与混合语音信号的组合[X_1, X_2, Y]，输出为0或者1。在训练过程中，混合语音信号是由多个干净语音叠加得到，因此干净语音对应的频谱X_1、X_2是已知的。
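An illustrative discriminator for the combinations [Z1, Z2, Y] and [X1, X2, Y] could concatenate the three spectra along the frequency axis and map them to a single score in [0, 1]; the layer sizes and the frame-score pooling are assumptions.

```python
import torch
import torch.nn as nn

class SpectrumDiscriminator(nn.Module):
    """Sketch of D: scores a [·, ·, Y] spectral combination as real (1) or separated (0)."""
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * n_freq, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, T, 3 * n_freq)
        return self.net(x).mean(dim=(1, 2))    # average per-frame scores into one per utterance
```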
对于生成器和判别器的训练过程,详见前述实施例中的举例说明,此处不再赘述。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表 述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本申请所必须的。
为便于更好的实施本申请实施例的上述方案,下面还提供用于实施上述方案的相关装置,该装置安装在终端中。
请参阅图4-a所示,本申请实施例提供的一种多人语音的分离装置400,可以包括:特征提取模块401、掩蔽矩阵生成模块402、语音分离模块403,其中,
特征提取模块401,设置为从待分离的混合语音信号中提取出混合语音特征,所述混合语音信号中混合有N种人声,所述N为大于或等于2的正整数;
掩蔽矩阵生成模块402,设置为使用生成对抗网络模型对所述混合语音特征进行掩蔽系数提取,得到N种人声对应的掩蔽矩阵;
语音分离模块403,设置为使用所述生成对抗网络模型对所述N种人声所对应的掩蔽矩阵和所述混合语音信号进行语音分离,输出与所述N种人声对应的N种分离语音信号。
在本申请的一些实施例中,所述生成对抗网络模型,包括:生成网络模型和对抗网络模型;请参阅图4-b所示,所述多人语音的分离装置400,还包括:模型训练模块404,其中,
所述特征提取模块401,还设置为从待分离的混合语音信号中提取出混合语音特征之前,从样本数据库中获取所述混合语音样本和所述干净语音样本;从所述混合语音样本中提取出混合语音样本特征;
所述掩蔽矩阵生成模块402,还设置为通过所述生成网络模型对所述混合语音样本特征进行掩蔽系数提取,得到N种人声对应的样本掩蔽矩阵;
所述语音分离模块403,还设置为使用所述生成网络模型对所述样本掩蔽矩阵和所述混合语音样本进行语音分离,输出分离语音样本;
所述模型训练模块404,设置为使用所述分离语音样本、所述混合语音样本和所述干净语音样本,对所述生成网络模型和所述对抗网络模型进行交替训练。
在本申请的一些实施例中,请参阅图4-c所示,所述模型训练模块404,包括:
生成网络训练单元4041,设置为在本次训练所述判别网络模型时,固定所述生成网络模型;使用所述分离语音样本、所述混合语音样本和所述干净语音样本获取所述判别网络模型的损失函数;通过最小化所述判别网络模型的损失函数,优化所述判别网络模型;
判别网络训练单元4042,设置为在下一次训练所述生成网络模型时,固定所述判别网络模型;使用所述分离语音样本、所述混合语音样本和所述干净语音样本获取所述生成网络模型的损失函数;通过最小化所述生成网络模型的损失函数,优化所述生成网络模型。
在本申请的一些实施例中,请参阅图4-d所示,所述生成网络训练单元4041,包括:
第一语音组合子单元40411,设置为根据所述分离语音样本和所述混合语音样本确定第一信号样本组合,以及根据所述干净语音样本和所述混合语音样本确定第二信号样本组合;
第一判别输出子单元40412,设置为使用所述判别网络模型对所述第一信号样本组合进行判别输出,得到第一判别输出结果,以及获取所述第一判别输出结果与所述判别网络模型的第一目标输出之间的第一失真度量;使用所述判别网络模型对所述第二信号样本组合进行判别输出,得到第二判别输出结果,以及获取所述第二判别输出结果与所述判别网络模型的第二目标输出之间的第二失真度量;
第一损失函数获取子单元40413,设置为根据所述第一失真度量和所述第二失真度量获取所述判别网络模型的损失函数。
在本申请的一些实施例中,请参阅图4-e所示,所述判别网络训练单元4042,包括:
第二语音组合子单元40421,设置为根据所述分离语音样本和所述混合语音样本确定第一信号样本组合;
第二判别输出子单元40422,设置为使用所述判别网络模型对所述第一信号样本组合进行判别输出,得到第一判别输出结果,以及获取所述第一判别 输出结果与所述判别网络模型的第二目标输出之间的第三失真度量;
失真度量获取子单元40423,设置为获取所述分离语音样本和所述干净语音之间的第四失真度量;
第二损失函数获取子单元40424,设置为根据所述第三失真度量和所述第四失真度量获取所述生成网络模型的损失函数。
在本申请的一些实施例中,所述失真度量获取子单元40423,具体设置为对所述分离语音样本和所述干净语音样本进行置换不变性计算,得到所述分离语音样本和所述干净语音样本之间的对应关系结果;根据所述分离语音样本和所述干净语音样本之间的对应关系结果获取到所述第四失真度量。
在本申请的一些实施例中,所述特征提取模块401,具体设置为从所述混合语音信号中提取出单通道语音信号的时域特征或者频域特征;或者,从所述混合语音信号中提取出多通道语音信号的时域特征或者频域特征;或者,从所述混合语音信号中提取出单通道语音特征;或者,从所述混合语音信号中提取出多通道间的相关特征。
通过以上对本申请实施例的描述可知,首先从待分离的混合语音信号中提取出混合语音特征,混合语音信号中混合有N种人声,然后使用生成对抗网络模型对混合语音特征进行掩蔽系数提取,得到N种人声对应的掩蔽矩阵;使用生成对抗网络模型对N种人声所对应的掩蔽矩阵和混合语音信号进行语音分离,输出与N种人声对应的N种分离语音信号。由于本申请实施例中使用生成对抗网络模型可以提取到N种人声对应的掩蔽矩阵,该生成对抗网络模型可以精确识别多种人声对应的语音信号,基于该生成对抗网络模型实现语音分离网络框架,实现在多人语音场景下的语音与语音之间的分离,提升语音分离的性能。
本申请实施例还提供了另一种终端,如图5所示,为了便于说明,仅示出了与本申请实施例相关的部分,具体技术细节未揭示的,请参照本申请实施例方法部分。该终端可以为包括手机、平板电脑、PDA(Personal Digital Assistant,个人数字助理)、POS(Point of Sales,销售终端)、车载电脑等任意终端设备,以终端为手机为例:
图5示出的是与本申请实施例提供的终端相关的手机的部分结构的框图。 参考图5,手机包括:射频(Radio Frequency,RF)电路1010、存储器1020、输入单元1030、显示单元1040、传感器1050、音频电路1060、无线保真(wireless fidelity,WiFi)模块1070、处理器1080、以及电源1090等部件。本领域技术人员可以理解,图5中示出的手机结构并不构成对手机的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合图5对手机的各个构成部件进行具体的介绍:
RF电路1010可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,给处理器1080处理;另外,将设计上行的数据发送给基站。通常,RF电路1010包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器(Low Noise Amplifier,LNA)、双工器等。此外,RF电路1010还可以通过无线通信与网络和其他设备通信。上述无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯系统(Global System of Mobile communication,GSM)、通用分组无线服务(General Packet Radio Service,GPRS)、码分多址(Code Division Multiple Access,CDMA)、宽带码分多址(Wideband Code Division Multiple Access,WCDMA)、长期演进(Long Term Evolution,LTE)、电子邮件、短消息服务(Short Messaging Service,SMS)等。
存储器1020可用于存储软件程序以及模块,处理器1080通过运行存储在存储器1020的软件程序以及模块,从而执行手机的各种功能应用以及数据处理。存储器1020可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据手机的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器1020可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。
输入单元1030可用于接收输入的数字或字符信息,以及产生与手机的用户设置以及功能控制有关的键信号输入。输入单元1030可包括触控面板1031以及其他输入设备1032。触控面板1031,也称为触摸屏,可收集用户在其上 或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板1031上或在触控面板1031附近的操作),并根据预先设定的程式驱动相应的连接装置。可选的,触控面板1031可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器1080,并能接收处理器1080发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触控面板1031。除了触控面板1031,输入单元1030还可以包括其他输入设备1032。其他输入设备1032可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
显示单元1040可用于显示由用户输入的信息或提供给用户的信息以及手机的各种菜单。显示单元1040可包括显示面板1041,可选的,可以采用液晶显示器(Liquid Crystal Display,LCD)、有机发光二极管(Organic Light-Emitting Diode,OLED)等形式来配置显示面板1041。触控面板1031可覆盖显示面板1041,当触控面板1031检测到在其上或附近的触摸操作后,传送给处理器1080以确定触摸事件的类型,随后处理器1080根据触摸事件的类型在显示面板1041上提供相应的视觉输出。虽然在图5中,触控面板1031与显示面板1041是作为两个独立的部件来实现手机的输入和输入功能,但是在某些实施例中,可以将触控面板1031与显示面板1041集成而实现手机的输入和输出功能。
手机还可包括至少一种传感器1050,比如光传感器、运动传感器以及其他传感器。光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板1041的亮度,接近传感器可在手机移动到耳边时,关闭显示面板1041和/或背光。作为运动传感器的一种,加速计传感器可检测各个方向上(一般为三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别手机姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于手机还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传 感器,在此不再赘述。
音频电路1060、扬声器1061,传声器1062可提供用户与手机之间的音频接口。音频电路1060可将接收到的音频数据转换后的电信号,传输到扬声器1061,由扬声器1061转换为声音信号输出;另一方面,传声器1062将收集的声音信号转换为电信号,由音频电路1060接收后转换为音频数据,再将音频数据输出处理器1080处理后,经RF电路1010以发送给比如另一手机,或者将音频数据输出至存储器1020以便进一步处理。
WiFi属于短距离无线传输技术,手机通过WiFi模块1070可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图5示出了WiFi模块1070,但是可以理解的是,其并不属于手机的必须构成,完全可以根据需要在不改变申请的本质的范围内而省略。
处理器1080是手机的控制中心,利用各种接口和线路连接整个手机的各个部分,通过运行或执行存储在存储器1020内的软件程序和/或模块,以及调用存储在存储器1020内的数据,执行手机的各种功能和处理数据,从而对手机进行整体监控。可选的,处理器1080可包括一个或多个处理单元;可选的,处理器1080可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器1080中。
手机还包括给各个部件供电的电源1090(比如电池),可选的,电源可以通过电源管理系统与处理器1080逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。
尽管未示出,手机还可以包括摄像头、蓝牙模块等,在此不再赘述。
在本申请实施例中,该终端所包括的处理器1080还具有控制执行以上由终端执行的多人语音的分离方法流程。
图6是本申请实施例提供的一种服务器结构示意图,该服务器1100可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1122(例如,一个或一个以上处理器)和存储器1132,一个或一个以上存储应用程序1142或数据1144的存储介质1130(例如一个或一个以上海量存储设备)。其中,存储器1132和存储介质 1130可以是短暂存储或持久存储。存储在存储介质1130的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对服务器中的一系列指令操作。更进一步地,中央处理器1122可以设置为与存储介质1130通信,在服务器1100上执行存储介质1130中的一系列指令操作。
服务器1100还可以包括一个或一个以上电源1126,一个或一个以上有线或无线网络接口1150,一个或一个以上输入输出接口1158,和/或,一个或一个以上操作系统1141,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
上述实施例中由服务器所执行的多人语音的分离方法步骤可以基于该图6所示的服务器结构。
根据本申请实施例的又一方面,还提供了一种存储介质。该存储介质中存储有计算机程序,其中,该计算机程序被设置为运行时执行上述任一项方法实施例中的步骤。
可选地,在本实施例中,上述存储介质可以被设置为存储用于执行以下步骤的计算机程序:
S1,终端从待分离的混合语音信号中提取出混合语音特征,所述混合语音信号中混合有N种人声,所述N为大于或等于2的正整数;
S2,所述终端使用生成对抗网络模型对所述混合语音特征进行掩蔽系数提取,得到N种人声对应的掩蔽矩阵;
S3,所述终端使用所述生成对抗网络模型对所述N种人声所对应的掩蔽矩阵和所述混合语音信号进行语音分离,输出与所述N种人声对应的N种分离语音信号。
可选地,在本实施例中,本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令终端设备相关的硬件来完成,该程序可以存储于一计算机可读存储介质中,存储介质可以包括:闪存盘、只读存储器(Read-Only Memory,ROM)、随机存取器(Random Access Memory,RAM)、磁盘或光盘等。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。本领域普通技术人员在不付出创造性劳动的情况下,即可以理解并实施。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对相关技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。
综上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照上述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对上述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。
在本申请实施例中,首先从待分离的混合语音信号中提取出混合语音特征,混合语音信号中混合有N种人声,然后使用生成对抗网络模型对混合语音特征进行掩蔽系数提取,得到N种人声对应的掩蔽矩阵;使用生成对抗网 络模型对N种人声所对应的掩蔽矩阵和混合语音信号进行语音分离,输出与N种人声对应的N种分离语音信号。由于本申请实施例中使用生成对抗网络模型可以提取到N种人声对应的掩蔽矩阵,该生成对抗网络模型可以精确识别多种人声对应的语音信号,基于该生成对抗网络模型实现语音分离网络框架,实现了在多人语音场景下的语音与语音之间的分离,提升了语音分离的性能的效果。
Claims (15)
- 一种多人语音的分离方法,包括:终端从待分离的混合语音信号中提取出混合语音特征,所述混合语音信号中混合有N种人声,所述N为大于或等于2的正整数;所述终端使用生成对抗网络模型对所述混合语音特征进行掩蔽系数提取,得到N种人声对应的掩蔽矩阵;所述终端使用所述生成对抗网络模型对所述N种人声所对应的掩蔽矩阵和所述混合语音信号进行语音分离,输出与所述N种人声对应的N种分离语音信号。
- 根据权利要求1所述的方法,其中,所述生成对抗网络模型,包括:生成网络模型和对抗网络模型;所述终端从待分离的混合语音信号中提取出混合语音特征之前,所述方法还包括:所述终端从样本数据库中获取所述混合语音样本和所述干净语音样本;所述终端从所述混合语音样本中提取出混合语音样本特征;所述终端通过所述生成网络模型对所述混合语音样本特征进行掩蔽系数提取,得到N种人声对应的样本掩蔽矩阵;所述终端使用所述生成网络模型对所述样本掩蔽矩阵和所述混合语音样本进行语音分离,输出分离语音样本;所述终端使用所述分离语音样本、所述混合语音样本和所述干净语音样本,对所述生成网络模型和所述对抗网络模型进行交替训练。
- 根据权利要求2所述的方法,其中,所述终端使用所述分离语音样本、所述混合语音样本和所述干净语音样本,对所述生成网络模型和所述对抗网络模型进行交替训练,包括:所述终端在本次训练所述判别网络模型时,固定所述生成网络模型;所述终端使用所述分离语音样本、所述混合语音样本和所述干净语音样本获取所述判别网络模型的损失函数;所述终端通过最小化所述判别网络模型的损失函数,优化所述判别网络模型;所述终端在下一次训练所述生成网络模型时,固定所述判别网络模型;所述终端使用所述分离语音样本、所述混合语音样本和所述干净语音样本获取所述生成网络模型的损失函数;所述终端通过最小化所述生成网络模型的损失函数,优化所述生成网络模型。
- 根据权利要求3所述的方法,其中,所述终端使用所述分离语音样本、所述混合语音样本和所述干净语音样本获取所述判别网络模型的损失函数,包括:所述终端根据所述分离语音样本和所述混合语音样本确定第一信号样本组合,以及根据所述干净语音样本和所述混合语音样本确定第二信号样本组合;所述终端使用所述判别网络模型对所述第一信号样本组合进行判别输出,得到第一判别输出结果,以及获取所述第一判别输出结果与所述判别网络模型的第一目标输出之间的第一失真度量;所述终端使用所述判别网络模型对所述第二信号样本组合进行判别输出,得到第二判别输出结果,以及获取所述第二判别输出结果与所述判别网络模型的第二目标输出之间的第二失真度量;所述终端根据所述第一失真度量和所述第二失真度量获取所述判别网络模型的损失函数。
- 根据权利要求3所述的方法,其中,所述终端使用所述分离语音样本、所述混合语音样本和所述干净语音样本获取所述生成网络模型的损失函数,包括:所述终端根据所述分离语音样本和所述混合语音样本确定第一信号样本组合;所述终端使用所述判别网络模型对所述第一信号样本组合进行判别输出,得到第一判别输出结果,以及获取所述第一判别输出结果与所述判别网络模型的第二目标输出之间的第三失真度量;所述终端获取所述分离语音样本和所述干净语音之间的第四失真度量;所述终端根据所述第三失真度量和所述第四失真度量获取所述生成网络模型的损失函数。
- 根据权利要求5所述的方法,其中,所述终端获取所述分离语音样本和所述干净语音之间的第四失真度量,包括:所述终端对所述分离语音样本和所述干净语音样本进行置换不变性计算,得到所述分离语音样本和所述干净语音样本之间的对应关系结果;所述终端根据所述分离语音样本和所述干净语音样本之间的对应关系结果获取到所述第四失真度量。
- 根据权利要求1至6中任一项所述的方法,其中,所述终端从待分离的混合语音信号中提取出混合语音特征,包括:所述终端从所述混合语音信号中提取出单通道语音信号的时域特征或者频域特征;或者,所述终端从所述混合语音信号中提取出多通道语音信号的时域特征或者频域特征;或者,所述终端从所述混合语音信号中提取出单通道语音特征;或者,所述终端从所述混合语音信号中提取出多通道间的相关特征。
- 一种多人语音的分离装置,安装在终端中,包括:特征提取模块,设置为从待分离的混合语音信号中提取出混合语音特征,所述混合语音信号中混合有N种人声,所述N为大于或等于2的正整数;掩蔽矩阵生成模块,设置为使用生成对抗网络模型对所述混合语音特征进行掩蔽系数提取,得到N种人声对应的掩蔽矩阵;语音分离模块,设置为使用所述生成对抗网络模型对所述N种人声所对应的掩蔽矩阵和所述混合语音信号进行语音分离,输出与所述N种人声对应的N种分离语音信号。
- 根据权利要求8所述的装置,其中,所述生成对抗网络模型,包括:生成网络模型和对抗网络模型;所述多人语音的分离装置,还包括:模型训练模块,其中,所述特征提取模块,还设置为从待分离的混合语音信号中提取出混合语音特征之前,从样本数据库中获取所述混合语音样本和所述干净语音样本;从所述混合语音样本中提取出混合语音样本特征;所述掩蔽矩阵生成模块,还设置为通过所述生成网络模型对所述混合语 音样本特征进行掩蔽系数提取,得到N种人声对应的样本掩蔽矩阵;所述语音分离模块,还设置为使用所述生成网络模型对所述样本掩蔽矩阵和所述混合语音样本进行语音分离,输出分离语音样本;所述模型训练模块,设置为使用所述分离语音样本、所述混合语音样本和所述干净语音样本,对所述生成网络模型和所述对抗网络模型进行交替训练。
- 根据权利要求9所述的装置,其中,所述模型训练模块,包括:生成网络训练单元,设置为在本次训练所述判别网络模型时,固定所述生成网络模型;使用所述分离语音样本、所述混合语音样本和所述干净语音样本获取所述判别网络模型的损失函数;通过最小化所述判别网络模型的损失函数,优化所述判别网络模型;判别网络训练单元,设置为在下一次训练所述生成网络模型时,固定所述判别网络模型;使用所述分离语音样本、所述混合语音样本和所述干净语音样本获取所述生成网络模型的损失函数;通过最小化所述生成网络模型的损失函数,优化所述生成网络模型。
- 根据权利要求10所述的装置,其中,所述生成网络训练单元,包括:第一语音组合子单元,设置为根据所述分离语音样本和所述混合语音样本确定第一信号样本组合,以及根据所述干净语音样本和所述混合语音样本确定第二信号样本组合;第一判别输出子单元,设置为使用所述判别网络模型对所述第一信号样本组合进行判别输出,得到第一判别输出结果,以及获取所述第一判别输出结果与所述判别网络模型的第一目标输出之间的第一失真度量;使用所述判别网络模型对所述第二信号样本组合进行判别输出,得到第二判别输出结果,以及获取所述第二判别输出结果与所述判别网络模型的第二目标输出之间的第二失真度量;第一损失函数获取子单元,设置为根据所述第一失真度量和所述第二失真度量获取所述判别网络模型的损失函数。
- 根据权利要求10所述的装置,其中,所述判别网络训练单元,包括:第二语音组合子单元,设置为根据所述分离语音样本和所述混合语音样 本确定第一信号样本组合;第二判别输出子单元,设置为使用所述判别网络模型对所述第一信号样本组合进行判别输出,得到第一判别输出结果,以及获取所述第一判别输出结果与所述判别网络模型的第二目标输出之间的第三失真度量;失真度量获取子单元,设置为获取所述分离语音样本和所述干净语音之间的第四失真度量;第二损失函数获取子单元,设置为根据所述第三失真度量和所述第四失真度量获取所述生成网络模型的损失函数。
- 根据权利要求12所述的装置,其中,所述失真度量获取子单元,具体设置为对所述分离语音样本和所述干净语音样本进行置换不变性计算,得到所述分离语音样本和所述干净语音样本之间的对应关系结果;根据所述分离语音样本和所述干净语音样本之间的对应关系结果获取到所述第四失真度量。
- 根据权利要求8至13中任一项所述的装置,其中,所述特征提取模块,具体设置为从所述混合语音信号中提取出单通道语音信号的时域特征或者频域特征;或者,从所述混合语音信号中提取出多通道语音信号的时域特征或者频域特征;或者,从所述混合语音信号中提取出单通道语音特征;或者,从所述混合语音信号中提取出多通道间的相关特征。
- 一种多人语音的分离装置,所述多人语音的分离装置包括:处理器和存储器;所述存储器,用于存储指令;所述处理器,用于执行所述存储器中的所述指令,执行如权利要求1至7中任一项所述的方法。
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2020548932A JP2021516786A (ja) | 2018-08-09 | 2019-08-05 | 複数人の音声を分離する方法、装置、およびコンピュータプログラム |
EP19848216.8A EP3751569B1 (en) | 2018-08-09 | 2019-08-05 | Multi-person voice separation method and apparatus |
US17/023,829 US11450337B2 (en) | 2018-08-09 | 2020-09-17 | Multi-person speech separation method and apparatus using a generative adversarial network model |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810904488.9 | 2018-08-09 | ||
CN201810904488.9A CN110164469B (zh) | 2018-08-09 | 2018-08-09 | 一种多人语音的分离方法和装置 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/023,829 Continuation US11450337B2 (en) | 2018-08-09 | 2020-09-17 | Multi-person speech separation method and apparatus using a generative adversarial network model |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020029906A1 true WO2020029906A1 (zh) | 2020-02-13 |
Family
ID=67645182
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/099216 WO2020029906A1 (zh) | 2018-08-09 | 2019-08-05 | 一种多人语音的分离方法和装置 |
Country Status (5)
Country | Link |
---|---|
US (1) | US11450337B2 (zh) |
EP (1) | EP3751569B1 (zh) |
JP (1) | JP2021516786A (zh) |
CN (2) | CN110544488B (zh) |
WO (1) | WO2020029906A1 (zh) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111341304A (zh) * | 2020-02-28 | 2020-06-26 | 广州国音智能科技有限公司 | 一种基于gan的说话人语音特征训练方法、装置和设备 |
CN111640456A (zh) * | 2020-06-04 | 2020-09-08 | 合肥讯飞数码科技有限公司 | 叠音检测方法、装置和设备 |
CN112216300A (zh) * | 2020-09-25 | 2021-01-12 | 三一专用汽车有限责任公司 | 用于搅拌车驾驶室内声音的降噪方法、装置和搅拌车 |
CN114743561A (zh) * | 2022-05-06 | 2022-07-12 | 广州思信电子科技有限公司 | 语音分离装置及方法、存储介质、计算机设备 |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108847238B (zh) * | 2018-08-06 | 2022-09-16 | 东北大学 | 一种服务机器人语音识别方法 |
CN110544482B (zh) * | 2019-09-09 | 2021-11-12 | 北京中科智极科技有限公司 | 一种单通道语音分离系统 |
CN110795892B (zh) * | 2019-10-23 | 2021-10-01 | 北京邮电大学 | 一种基于生成对抗网络的信道模拟方法及装置 |
CN110827849B (zh) * | 2019-11-11 | 2022-07-26 | 广州国音智能科技有限公司 | 数据建库的人声分离方法、装置、终端及可读存储介质 |
CN111312270B (zh) * | 2020-02-10 | 2022-11-22 | 腾讯科技(深圳)有限公司 | 语音增强方法及装置、电子设备和计算机可读存储介质 |
ES2928295T3 (es) * | 2020-02-14 | 2022-11-16 | System One Noc & Dev Solutions S A | Método de mejora de las señales de voz telefónica basado en redes neuronales convolucionales |
CN113450823B (zh) * | 2020-03-24 | 2022-10-28 | 海信视像科技股份有限公司 | 基于音频的场景识别方法、装置、设备及存储介质 |
CN111477240B (zh) * | 2020-04-07 | 2023-04-07 | 浙江同花顺智能科技有限公司 | 音频处理方法、装置、设备和存储介质 |
CN111899758B (zh) * | 2020-09-07 | 2024-01-30 | 腾讯科技(深圳)有限公司 | 语音处理方法、装置、设备和存储介质 |
CN112071329B (zh) * | 2020-09-16 | 2022-09-16 | 腾讯科技(深圳)有限公司 | 一种多人的语音分离方法、装置、电子设备和存储介质 |
CN112331218B (zh) * | 2020-09-29 | 2023-05-05 | 北京清微智能科技有限公司 | 一种针对多说话人的单通道语音分离方法和装置 |
CN113223497A (zh) * | 2020-12-10 | 2021-08-06 | 上海雷盎云智能技术有限公司 | 智能语音识别处理方法及系统 |
CN112992174A (zh) * | 2021-02-03 | 2021-06-18 | 深圳壹秘科技有限公司 | 一种语音分析方法及其语音记录装置 |
CN113077812B (zh) * | 2021-03-19 | 2024-07-23 | 北京声智科技有限公司 | 语音信号生成模型训练方法、回声消除方法和装置及设备 |
CN113571084B (zh) * | 2021-07-08 | 2024-03-22 | 咪咕音乐有限公司 | 音频处理方法、装置、设备及存储介质 |
CN113689837B (zh) * | 2021-08-24 | 2023-08-29 | 北京百度网讯科技有限公司 | 音频数据处理方法、装置、设备以及存储介质 |
CN114446316B (zh) * | 2022-01-27 | 2024-03-12 | 腾讯科技(深圳)有限公司 | 音频分离方法、音频分离模型的训练方法、装置及设备 |
CN116168717A (zh) * | 2022-12-28 | 2023-05-26 | 阿里巴巴达摩院(杭州)科技有限公司 | 语音分离方法 |
CN116597828B (zh) * | 2023-07-06 | 2023-10-03 | 腾讯科技(深圳)有限公司 | 模型确定方法、模型应用方法和相关装置 |
CN118283015B (zh) * | 2024-05-30 | 2024-08-20 | 江西扬声电子有限公司 | 一种基于客舱以太网的多路音频传输方法和系统 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103903632A (zh) * | 2014-04-02 | 2014-07-02 | 重庆邮电大学 | 一种多声源环境下的基于听觉中枢系统的语音分离方法 |
US20150287406A1 (en) * | 2012-03-23 | 2015-10-08 | Google Inc. | Estimating Speech in the Presence of Noise |
CN105096961A (zh) * | 2014-05-06 | 2015-11-25 | 华为技术有限公司 | 语音分离方法和装置 |
CN107945811A (zh) * | 2017-10-23 | 2018-04-20 | 北京大学 | 一种面向频带扩展的生成式对抗网络训练方法及音频编码、解码方法 |
CN108109619A (zh) * | 2017-11-15 | 2018-06-01 | 中国科学院自动化研究所 | 基于记忆和注意力模型的听觉选择方法和装置 |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8947347B2 (en) * | 2003-08-27 | 2015-02-03 | Sony Computer Entertainment Inc. | Controlling actions in a video game unit |
US7225124B2 (en) * | 2002-12-10 | 2007-05-29 | International Business Machines Corporation | Methods and apparatus for multiple source signal separation |
JP5841986B2 (ja) * | 2013-09-26 | 2016-01-13 | 本田技研工業株式会社 | 音声処理装置、音声処理方法、及び音声処理プログラム |
CN106024005B (zh) * | 2016-07-01 | 2018-09-25 | 腾讯科技(深圳)有限公司 | 一种音频数据的处理方法及装置 |
WO2018045358A1 (en) * | 2016-09-05 | 2018-03-08 | Google Llc | Generating theme-based videos |
WO2018053340A1 (en) * | 2016-09-15 | 2018-03-22 | Twitter, Inc. | Super resolution using a generative adversarial network |
CN106847294B (zh) * | 2017-01-17 | 2018-11-30 | 百度在线网络技术(北京)有限公司 | 基于人工智能的音频处理方法和装置 |
CN107437077A (zh) * | 2017-08-04 | 2017-12-05 | 深圳市唯特视科技有限公司 | 一种基于生成对抗网络的旋转面部表示学习的方法 |
US10642846B2 (en) * | 2017-10-13 | 2020-05-05 | Microsoft Technology Licensing, Llc | Using a generative adversarial network for query-keyword matching |
US10839822B2 (en) * | 2017-11-06 | 2020-11-17 | Microsoft Technology Licensing, Llc | Multi-channel speech separation |
CN108198569B (zh) * | 2017-12-28 | 2021-07-16 | 北京搜狗科技发展有限公司 | 一种音频处理方法、装置、设备及可读存储介质 |
CN108346433A (zh) * | 2017-12-28 | 2018-07-31 | 北京搜狗科技发展有限公司 | 一种音频处理方法、装置、设备及可读存储介质 |
US10811000B2 (en) * | 2018-04-13 | 2020-10-20 | Mitsubishi Electric Research Laboratories, Inc. | Methods and systems for recognizing simultaneous speech by multiple speakers |
JP7243052B2 (ja) | 2018-06-25 | 2023-03-22 | カシオ計算機株式会社 | オーディオ抽出装置、オーディオ再生装置、オーディオ抽出方法、オーディオ再生方法、機械学習方法及びプログラム |
US11281976B2 (en) * | 2018-07-12 | 2022-03-22 | International Business Machines Corporation | Generative adversarial network based modeling of text for natural language processing |
-
2018
- 2018-08-09 CN CN201910745824.4A patent/CN110544488B/zh active Active
- 2018-08-09 CN CN201810904488.9A patent/CN110164469B/zh active Active
-
2019
- 2019-08-05 JP JP2020548932A patent/JP2021516786A/ja active Pending
- 2019-08-05 EP EP19848216.8A patent/EP3751569B1/en active Active
- 2019-08-05 WO PCT/CN2019/099216 patent/WO2020029906A1/zh unknown
-
2020
- 2020-09-17 US US17/023,829 patent/US11450337B2/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150287406A1 (en) * | 2012-03-23 | 2015-10-08 | Google Inc. | Estimating Speech in the Presence of Noise |
CN103903632A (zh) * | 2014-04-02 | 2014-07-02 | 重庆邮电大学 | 一种多声源环境下的基于听觉中枢系统的语音分离方法 |
CN105096961A (zh) * | 2014-05-06 | 2015-11-25 | 华为技术有限公司 | 语音分离方法和装置 |
CN107945811A (zh) * | 2017-10-23 | 2018-04-20 | 北京大学 | 一种面向频带扩展的生成式对抗网络训练方法及音频编码、解码方法 |
CN108109619A (zh) * | 2017-11-15 | 2018-06-01 | 中国科学院自动化研究所 | 基于记忆和注意力模型的听觉选择方法和装置 |
Non-Patent Citations (1)
Title |
---|
See also references of EP3751569A4 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111341304A (zh) * | 2020-02-28 | 2020-06-26 | 广州国音智能科技有限公司 | 一种基于gan的说话人语音特征训练方法、装置和设备 |
CN111640456A (zh) * | 2020-06-04 | 2020-09-08 | 合肥讯飞数码科技有限公司 | 叠音检测方法、装置和设备 |
CN111640456B (zh) * | 2020-06-04 | 2023-08-22 | 合肥讯飞数码科技有限公司 | 叠音检测方法、装置和设备 |
CN112216300A (zh) * | 2020-09-25 | 2021-01-12 | 三一专用汽车有限责任公司 | 用于搅拌车驾驶室内声音的降噪方法、装置和搅拌车 |
CN114743561A (zh) * | 2022-05-06 | 2022-07-12 | 广州思信电子科技有限公司 | 语音分离装置及方法、存储介质、计算机设备 |
Also Published As
Publication number | Publication date |
---|---|
EP3751569B1 (en) | 2024-10-23 |
US20210005216A1 (en) | 2021-01-07 |
CN110544488A (zh) | 2019-12-06 |
EP3751569A1 (en) | 2020-12-16 |
EP3751569A4 (en) | 2021-07-21 |
CN110164469A (zh) | 2019-08-23 |
CN110544488B (zh) | 2022-01-28 |
US11450337B2 (en) | 2022-09-20 |
CN110164469B (zh) | 2023-03-10 |
JP2021516786A (ja) | 2021-07-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020029906A1 (zh) | 一种多人语音的分离方法和装置 | |
US9685161B2 (en) | Method for updating voiceprint feature model and terminal | |
CN108346433A (zh) | 一种音频处理方法、装置、设备及可读存储介质 | |
CN110364145A (zh) | 一种语音识别的方法、语音断句的方法及装置 | |
CN110570840B (zh) | 一种基于人工智能的智能设备唤醒方法和装置 | |
US9500739B2 (en) | Estimating and tracking multiple attributes of multiple objects from multi-sensor data | |
WO2021114847A1 (zh) | 网络通话方法、装置、计算机设备及存储介质 | |
CN107993672B (zh) | 频带扩展方法及装置 | |
WO2018228167A1 (zh) | 导航方法及相关产品 | |
CN110364156A (zh) | 语音交互方法、系统、终端及可读存储介质 | |
CN110827826A (zh) | 语音转换文字方法、电子设备 | |
CN109302528B (zh) | 一种拍照方法、移动终端及计算机可读存储介质 | |
CN110931028B (zh) | 一种语音处理方法、装置和电子设备 | |
CN109686359B (zh) | 语音输出方法、终端及计算机可读存储介质 | |
CN117153186A (zh) | 声音信号处理方法、装置、电子设备和存储介质 | |
CN109167880B (zh) | 双面屏终端控制方法、双面屏终端及计算机可读存储介质 | |
CN112700783A (zh) | 通讯的变声方法、终端设备和存储介质 | |
CN109543193B (zh) | 一种翻译方法、装置及终端设备 | |
CN109453526B (zh) | 一种声音处理方法、终端及计算机可读存储介质 | |
CN106782614B (zh) | 音质检测方法及装置 | |
WO2017124876A1 (zh) | 一种语音播放方法和装置 | |
CN107645604B (zh) | 一种通话处理方法及移动终端 | |
CN109087661A (zh) | 语音处理方法、装置、系统及可读存储介质 | |
CN110364177B (zh) | 语音处理方法、移动终端及计算机可读存储介质 | |
CN116597828B (zh) | 模型确定方法、模型应用方法和相关装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19848216 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2019848216 Country of ref document: EP Effective date: 20200909 |
|
ENP | Entry into the national phase |
Ref document number: 2020548932 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |