CN111243579A - Time domain single-channel multi-speaker voice recognition method and system - Google Patents

Time domain single-channel multi-speaker voice recognition method and system

Info

Publication number
CN111243579A
Authority
CN
China
Prior art keywords
speaker
network
error
output
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010061565.6A
Other languages
Chinese (zh)
Other versions
CN111243579B (en)
Inventor
黄露
杨毅
孙甲松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202010061565.6A
Publication of CN111243579A
Application granted
Publication of CN111243579B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

A time-domain single-channel multi-speaker speech recognition method: the input is the raw waveform samples of the mixed speech signal; features are first extracted by a one-dimensional convolutional network and then fed into a separation network for speech separation. The separated outputs are fed into two fully connected layers, which output two acoustic state distribution vectors. A forced-alignment method is then used to obtain the corresponding labeling information from the existing target speech labels, and the smaller probability-distribution error of the acoustic modeling units under the two orderings, selected by cross scoring and threshold selection, is taken as the error for neural network back-propagation, so that a time-domain single-channel multi-speaker speech recognition model is constructed and multi-speaker speech recognition is realized with this model. During testing, the logarithms of the probability vectors output by the two neural network branches are sent to a speech recognition decoder to obtain the recognized texts of the two speakers.

Description

Time domain single-channel multi-speaker voice recognition method and system
Technical Field
The invention belongs to the technical field of audio, and particularly relates to a time domain single-channel multi-speaker voice recognition method and system.
Background
The cocktail party problem is a long-standing problem in computer speech recognition: current speech recognition technology can recognize the content spoken by a single person with high accuracy, but when two or more people speak at the same time, the recognition rate drops sharply. Solving this problem would greatly benefit a range of practical application scenarios, such as automatic transcription of multi-person conferences, multi-party human-computer interaction, and automatic audio/video annotation.
With the rise of neural networks and deep learning, many deep-learning-based speech separation algorithms have been proposed. They fall mainly into two categories: speech separation based on the time-frequency spectrum and speech separation based on the time-domain signal.
1. Speech separation methods based on the time-frequency spectrum:
1) deep Clustering (DPCL) method: firstly, mapping the time-frequency spectrum of a voice signal to a high-dimensional space through an artificial neural network, then dividing a high-dimensional space vector by using a clustering algorithm such as K-means clustering and the like, and dividing components belonging to the same speaker together. The method assumes that each time-frequency point belongs to only one of a plurality of speakers, and clustering in a high-dimensional space is not necessarily the optimal operation.
2) Deep Attractor Network (DAN): similar to DPCL, the time-frequency spectrum of the mixed speech signal is mapped into a high-dimensional space, in which a series of attractors is constructed; the attractors are used to group the time-frequency points belonging to each target speaker. However, DAN requires estimating the attractors, which not only adds computation but also requires a complex design process.
3) Permutation Invariant Training (PIT): in the case of a mixture of two speakers, the intuitive approach is to perform speech separation with an artificial neural network: the time-frequency spectrum (or other features) of the mixed speech is the input, and two outputs are designed, each corresponding to the time-frequency spectrum of one speaker. This, however, raises a problem: the ordering of the two outputs and of the target reference speech is not necessarily consistent. For example, the speaker ordering of the two network outputs may be "speaker 2, speaker 1" while the ordering of the reference speech is "speaker 1, speaker 2"; if the error between the outputs and the labels is computed directly in label order, serious errors occur, and the error has to be recomputed after reordering the references to "speaker 2, speaker 1". This is the label permutation problem in speech separation. PIT is currently the main method for handling it: all possible orderings of the reference speech are considered, and the ordering that minimizes the total error over all speakers is selected as the optimal one, which alleviates the label permutation problem. Fig. 1 shows a framework for single-channel multi-speaker speech separation using the PIT method.
The mathematical model of the standard PIT method is as follows. Suppose the input mixed speech signal contains two speakers, and let Y denote its time-frequency spectrum, a T × F matrix, where T is the number of time frames and F is the number of frequency points of the fast Fourier transform. Time and frequency indices are omitted for simplicity of presentation. The magnitude |Y| is fed into a separation network (usually a recurrent neural network, RNN), which estimates two speaker masks M_1 and M_2:
(M_1, M_2) = Separation(|Y|)    (1)
where Separation denotes the separation network.
The magnitude spectra of the two speakers are then estimated from the masks:
|X̂_1| = M_1 ⊙ |Y|    (2)
|X̂_2| = M_2 ⊙ |Y|    (3)
where |X̂_i| denotes the estimated magnitude spectrum of the i-th speaker.
Suppose X_1 and X_2 are the original clean speech of the target speakers. The error between the estimates and the clean speech is computed by the following error function:
L = min_{p ∈ P} (1/S) Σ_{s=1}^{S} ‖ |X̂_s| − |X_{p(s)}| ‖²    (4)
where S is the total number of speakers (S = 2 for two speakers), and P is the set of permutations of 1, 2, …, S, containing S! elements in total. The goal of this formula is to find the target-speaker ordering that best matches the ordering of the estimated speakers, and to use the minimum mean square error (MSE) under that ordering as the error for the gradient update of the neural network.
According to equation (4), in the two-speaker case the PIT method computes two errors LS_1 and LS_2:
LS_1 = LS_11 + LS_22    (5)
LS_2 = LS_12 + LS_21    (6)
where LS_ij (equation (7)) denotes the error between the i-th output of the separation network and the clean speech spectrum of the j-th target speaker. Specifically, LS_11 is the error between the 1st network output and the clean speech spectrum of the 1st speaker, and LS_22 is the error between the 2nd network output and the clean speech spectrum of the 2nd speaker (under the assumption that the 1st output of the separation network corresponds to the 1st speaker); LS_12 is the error between the 1st network output and the clean speech spectrum of the 2nd speaker, and LS_21 is the error between the 2nd network output and the clean speech spectrum of the 1st speaker (under the assumption that the 1st output corresponds to the 2nd speaker).
Finally, the smaller of LS_1 and LS_2 is selected as the error for the back-propagation update of the neural network. In this case, the computation of equation (7) has to be performed 4 times.
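As an illustration of equations (5)–(7) in the two-speaker case, the following NumPy sketch (function and variable names are illustrative, not taken from the patent) computes LS_1 and LS_2 and keeps the smaller one as the training error:

    import numpy as np

    def pit_mse_error(est_mags, ref_mags):
        """Permutation-invariant MSE for two speakers.

        est_mags: list of two (T, F) estimated magnitude spectra |X_hat_1|, |X_hat_2|
        ref_mags: list of two (T, F) clean reference magnitude spectra |X_1|, |X_2|
        Returns the smaller total error of the two output/reference orderings.
        """
        def ls(i, j):  # LS_ij: error between output i and reference j
            return np.mean((est_mags[i] - ref_mags[j]) ** 2)

        ls1 = ls(0, 0) + ls(1, 1)   # ordering "speaker 1, speaker 2", eq. (5)
        ls2 = ls(0, 1) + ls(1, 0)   # ordering "speaker 2, speaker 1", eq. (6)
        return min(ls1, ls2)        # error used for back-propagation

In this form all four LS_ij terms are evaluated, which is the cost the invention later reduces by threshold selection.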
2. Speech separation methods based on the time-domain signal:
The Time-domain Audio Separation Network (TasNet) borrows the idea of PIT to handle the ordering of the output ports, except that both the input and output of the neural network are speech waveform samples. Structurally, a one-dimensional convolution is first used as an encoder, encoding each frame of speech into an encoding vector; the encoding vector is fed into a separation network to obtain two masks; each mask is multiplied by the encoding vector of the mixed speech to obtain the encoding vector of the target speaker's speech for that frame; finally, a decoder, also a one-dimensional convolution, restores the encoding vector to a speech waveform. Recent work has shown that the separation quality of this method greatly exceeds that of the time-frequency-spectrum-based methods described above.
Specifically, since the mixed speech input y is in time-domain form, encoding and decoding operations are required to realize signal separation. The encoder convolves the input y with N convolution kernels of the same length as y:
e_i = y * w_i    (8)
where i = 1, …, N, N is the number of convolution kernels, w_i is the i-th convolution kernel, and the resulting e is the N-dimensional encoding vector. The encoding vectors are then fed into the separation network, whose outputs are the masks of the two speakers, and the encoding vector of each estimated speaker is the encoding vector of the mixed speech multiplied by its mask:
(m_1, m_2) = Separation(e)    (9)
d_i = m_i ⊙ e    (10)
Finally, the decoder recovers the original speech:
x̂_i = d_i W    (11)
where W is the learnable decoder matrix.
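The per-frame encode–mask–decode flow of equations (8)–(11) can be sketched as follows in NumPy; the separation network is replaced by a dummy mask function, and all sizes and names are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    L, N = 400, 256                       # frame length and number of kernels (illustrative)
    W_enc = rng.standard_normal((N, L))   # encoder kernels w_i, eq. (8)
    W_dec = rng.standard_normal((N, L))   # learnable decoder matrix W, eq. (11)

    def separate(e):
        """Stand-in for the separation network of eq. (9): returns two masks."""
        m1 = 1.0 / (1.0 + np.exp(-e))     # dummy sigmoid mask, for illustration only
        return m1, 1.0 - m1

    y = rng.standard_normal(L)            # one frame of mixed speech
    e = W_enc @ y                         # eq. (8): N-dimensional encoding vector
    m1, m2 = separate(e)                  # eq. (9): one mask per speaker
    d1, d2 = m1 * e, m2 * e               # eq. (10): masked codes per speaker
    x1_hat, x2_hat = d1 @ W_dec, d2 @ W_dec   # eq. (11): recovered waveform frames

In a real system the encoder, separator and decoder would all be trained jointly; the sketch only shows how the tensors of equations (8)–(11) fit together.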
The above method has a drawback: because PIT processes a short segment of speech at a time, the reference-speech ordering of the earlier and later segments within one utterance may be inconsistent. When the separation results of a given output port are stitched together across segments, serious speaker switching can occur, i.e., an output that should contain only speaker 2 also contains speech from speaker 1. Therefore, in practical applications a recurrent neural network (RNN) is usually adopted for utterance-level modeling, which helps keep the ordering of consecutive output frames continuous and stable.
In addition, both of the above methods still need to perform speech separation first and then run speech recognition for each speaker; that is, a truly end-to-end system is still not realized, and there remains some distance to the requirements of commercial application.
3. Single-channel multi-speaker speech recognition using PIT
The most intuitive way to perform single-channel multi-speaker speech recognition with PIT is to change the outputs of the neural network into acoustic modeling units and to replace the MSE error function used for speech separation with a cross-entropy (CE) error function, namely
L_CE = min_{p ∈ P} − Σ_t Σ_{i=1}^{S} log y_t^i( l_t^{p(i)} )
where y_t^i is the probability distribution over acoustic modeling units produced by the i-th output of the neural network at frame t, l_t^{p(i)} is the ground-truth label of the i-th speaker at frame t under permutation p (generally obtained by forced alignment), and y_t^i( l_t^{p(i)} ) is the probability that the i-th output assigns to that label at time t.
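The corresponding permutation-invariant cross-entropy selection for two speakers can be sketched as follows (a minimal NumPy illustration; array layouts and names are assumptions):

    import numpy as np

    def pit_ce_error(log_probs, labels):
        """Permutation-invariant cross-entropy for two speakers.

        log_probs: list of two (T, K) arrays of log-posteriors over K acoustic
                   modeling units, one array per network output.
        labels:    list of two length-T integer arrays of forced-alignment labels,
                   one array per reference speaker.
        """
        T = labels[0].shape[0]
        frames = np.arange(T)

        def ce(i, j):  # cross entropy between output i and reference labels j
            return -np.sum(log_probs[i][frames, labels[j]])

        lr1 = ce(0, 0) + ce(1, 1)   # ordering A
        lr2 = ce(0, 1) + ce(1, 0)   # ordering B
        return min(lr1, lr2)        # error used for back-propagation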
However, this method has the disadvantage that its input is the frequency-domain magnitude spectrum of the mixed signal, so the phase information of the mixed signal is not utilized.
Disclosure of Invention
In order to overcome the disadvantages of the prior art, the present invention provides a time-domain single-channel multi-speaker speech recognition method and system, which combines time-domain processing technology with single-channel multi-speaker speech separation technology and introduces time-domain processing of speech signals on the basis of permutation-invariant training, thereby reducing the error rate of multi-speaker speech recognition.
In order to achieve the above purpose, the invention adopts the following technical solution:
a time domain single channel multi-speaker voice recognition method comprises the following steps:
step 1, feeding the original waveform of the mixed speech into a one-dimensional convolutional network for preliminary feature extraction, then feeding the features into a separation network BSRU, and outputting the separated feature representations of the original waveform;
step 2, feeding the separated feature representations of the original waveform into two fully connected layers respectively, and outputting two acoustic state distribution vectors;
step 3, comparing the two state distribution vectors against the labeling information obtained by forced alignment, obtaining the smaller error under the two orderings by means of cross scoring and threshold selection, using it as the error for neural network back-propagation, and thereby constructing a time-domain single-channel multi-speaker speech recognition model;
step 4, realizing multi-speaker speech recognition with the time-domain single-channel multi-speaker speech recognition model.
In step 1, the one-dimensional convolutional network has one or more layers. For a multi-layer one-dimensional convolutional network, the parameters of each layer include the number of convolution kernels, the kernel length, the maximum pooling size and the stride; for a single-layer one-dimensional convolutional network, the kernel length is set to the number of sampling points of one frame of speech. The multi-layer network uses pooling operations, while the single-layer network does not. The output of each convolutional layer is normalized by batch normalization to improve generalization and training speed, and the vectors of all channels of the last layer are spliced together as the learned feature representation of the time-domain waveform.
In the step 1, the BSRU is a bidirectional SRU, and the calculation method of the SRU is as follows:
f_t = σ(W_f x_t + v_f ⊙ c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ (W x_t)
r_t = σ(W_r x_t + v_r ⊙ c_{t-1} + b_r)
h_t = r_t ⊙ c_t + (1 − r_t) ⊙ x_t
where W, W_r, W_f are weight matrices and v_f, b_f, v_r, b_r are parameter vectors; x_t and h_t are the current input and output; c_t is the cell state at time t, used to store history information, and c_{t-1} is the cell state at time t−1; f_t and r_t denote the forget gate and the reset gate respectively; σ is the sigmoid function; ⊙ denotes element-wise multiplication of two vectors.
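For concreteness, a NumPy sketch of a single SRU time step following the formulas above is given here; the bidirectional SRU runs such a recurrence in both time directions and concatenates the outputs. All shapes and names are assumptions, and the highway term assumes the input and hidden dimensions are equal.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sru_step(x_t, c_prev, W, W_f, W_r, v_f, v_r, b_f, b_r):
        """One SRU time step: returns output h_t and new cell state c_t."""
        f_t = sigmoid(W_f @ x_t + v_f * c_prev + b_f)   # forget gate
        c_t = f_t * c_prev + (1.0 - f_t) * (W @ x_t)    # cell-state update
        r_t = sigmoid(W_r @ x_t + v_r * c_prev + b_r)   # reset gate
        h_t = r_t * c_t + (1.0 - r_t) * x_t             # output with highway connection
        return h_t, c_t

Because W x_t, v_f ⊙ c_{t-1} and the gates involve no recurrence through a matrix multiplication, the heavy matrix products can be precomputed for all time steps, which is what makes the SRU fast.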
In the step 2, the obtained two state distribution vectors are the probability distribution of the acoustic modeling units of the two speakers.
In step 3, a forced-alignment method is first adopted to obtain the corresponding labeling information from the existing target speech labels; subsequently, in the case of two speakers, a multiple cross-scoring method is employed, i.e., the errors LR_1 and LR_2 under the two orderings are considered separately:
LR_1 = LR_11 + LR_22
LR_2 = LR_12 + LR_21
where LR_ij (i = 1, 2; j = 1, 2) denotes the cross-entropy error between the i-th output of the separation network and the forced-alignment label of the clean speech of the j-th target speaker.
LR_11 is calculated first. If LR_11 is less than a preset threshold, LR_22 is then calculated and LR_1 is taken as the smaller error under the two orderings; if LR_11 is greater than the threshold, LR_12 and LR_21 are calculated and LR_2 is taken as the smaller error under the two orderings.
The invention also provides a time domain single channel multi-speaker voice recognition system, comprising:
a mixed voice signal waveform sampling module 101 for sampling the waveform of the mixed voice signal;
the one-dimensional convolutional neural network module 102 takes the output of the mixed voice signal waveform sampling module 101 as input, and preliminarily extracts features;
a separation network BSRU103 which takes the output of the one-dimensional convolution neural network module 102 as input to obtain the characteristic representation after the original waveform is separated;
the two fully connected layers 104, which respectively take the two output paths of the separation network BSRU 103 as inputs and produce two state distribution vectors;
a multiple cross-scoring module 105, which cross-scores the outputs of the two fully connected layers 104 against the two target speech labels 106 using the multiple cross-scoring and error-threshold method, obtaining the smaller cross-entropy error 107 under the two orderings;
a minimum error module 108, which takes the smaller error of the two orderings as the error for the back-propagation update of the whole neural network (a minimal sketch of this module wiring is given below).
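For illustration only, the following PyTorch-style sketch shows how modules 102–104 might be wired together; the layer sizes, the class and parameter names, and the use of nn.GRU as a stand-in for the BSRU are assumptions, not values given by the patent.

    import torch
    import torch.nn as nn

    class TimeDomainMultiSpeakerASR(nn.Module):
        """Sketch of modules 102-104: 1-D conv front end -> bidirectional
        recurrent separation network -> two fully connected output heads."""

        def __init__(self, kernel_len=400, channels=256, hidden=512, num_units=3000):
            super().__init__()
            # Module 102: one-layer 1-D convolution over the raw waveform
            self.frontend = nn.Conv1d(1, channels, kernel_size=kernel_len,
                                      stride=kernel_len // 2)
            # Module 103: bidirectional recurrent separation network
            # (nn.GRU used here as a stand-in for the BSRU described in the patent)
            self.separator = nn.GRU(channels, hidden, num_layers=2,
                                    batch_first=True, bidirectional=True)
            # Module 104: two fully connected layers, one acoustic-state head per speaker
            self.head1 = nn.Linear(2 * hidden, num_units)
            self.head2 = nn.Linear(2 * hidden, num_units)

        def forward(self, waveform):                        # waveform: (batch, samples)
            feats = self.frontend(waveform.unsqueeze(1))    # (batch, channels, frames)
            feats = feats.transpose(1, 2)                   # (batch, frames, channels)
            sep, _ = self.separator(feats)                  # (batch, frames, 2*hidden)
            return self.head1(sep), self.head2(sep)         # two acoustic-state score vectors

During training, the two heads' outputs would be scored against the forced-alignment labels by modules 105–108 (the thresholded cross-scoring sketched in the detailed description below).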
The main principle of the invention is as follows: in the case of two speakers, the input is the raw waveform samples of the mixed speech signal; the waveform features are preliminarily learned by a one-dimensional convolutional network and then fed into the separation network for speech separation; the separated outputs are fed into two fully connected layers, which output two acoustic state distribution vectors; a forced-alignment method is then used to obtain the corresponding labeling information from the existing target speech labels, and the smaller probability-distribution error of the acoustic modeling units under the two orderings, selected by cross scoring and threshold selection, is used as the error for neural network back-propagation. To speed up the cross-scoring process, the invention also provides a scoring algorithm that saves about 1/4 to 1/2 of the error computation through threshold setting. During testing, the logarithms of the probability vectors output by the two neural network branches are sent to a speech recognition decoder to obtain the recognized texts of the two speakers.
Compared with the prior art, the main advantages of the invention are: by adopting a more flexible convolutional-network stacking scheme and a simplified cross-scoring error computation, the generalization ability of the model is improved and the performance of the multi-speaker speech recognition system is further improved. The method can be widely applied in various application fields related to speech separation and recognition.
Drawings
Fig. 1 is a block diagram of a prior art single-channel multi-speaker voice separation using the PIT method.
FIG. 2 is a flow chart of the time-domain single-channel multi-speaker speech recognition modeling of the present invention.
Fig. 3 is a schematic diagram of the SRU calculation method.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
The invention provides a time-domain single-channel multi-speaker speech recognition method and system which, in a multi-speaker speech recognition scenario, combine time-domain processing of speech signals with single-channel multi-speaker speech recognition on the basis of permutation-invariant training, in order to reduce the error rate of multi-speaker speech recognition. The method and system are not limited to multi-speaker speech recognition and may be applied to any method or system related to speech recognition.
FIG. 2 shows the time-domain single-channel multi-speaker speech recognition modeling process of the present invention, which includes:
step 1, feeding the original waveform into a one-dimensional convolutional network for preliminary feature extraction, then feeding the features into the separation network BSRU, and outputting the separated feature representations of the original waveform;
the system inputs original sampling waveforms of mixed voice, and the characteristics of the original sampling waveforms are preliminarily extracted through a one-dimensional convolution network. The one-dimensional convolution network can be one layer or multiple layers, and for the multi-layer one-dimensional convolution network, the parameters of each layer include the number of convolution kernels, the length of the convolution kernels, the maximum pooling size, the step size and the like. For a one-dimensional convolution network of one layer, the length of the convolution kernel is generally set to the number of sampling points of one frame of speech, for example, one frame of 25ms, and the sampling of 16kHz is 400 points. The multilayer one-dimensional convolution network has a pooling operation, and the one-dimensional convolution network of one layer has no pooling operation. The output of each layer of convolution is normalized through batch normalization to improve generalization and training speed. And finally, splicing vectors of all channels in the last layer together to be used as a feature representation of the learned time domain waveform. The feature representations of these waveforms are then fed into a separation network BSRU for separation, outputting the separated feature representations of the two speakers of the original blended waveform.
The BSRU (bidirectional SRU) of the separation network, with reference to FIG. 3, is calculated as follows:
f_t = σ(W_f x_t + v_f ⊙ c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ (W x_t)
r_t = σ(W_r x_t + v_r ⊙ c_{t-1} + b_r)
h_t = r_t ⊙ c_t + (1 − r_t) ⊙ x_t
where W, W_r, W_f are weight matrices and v_f, b_f, v_r, b_r are parameter vectors; x_t and h_t are the current input and output; c_t is the cell state, used to store history information; f_t and r_t denote the forget gate and the reset gate respectively; σ is the sigmoid function; ⊙ denotes element-wise multiplication of two vectors.
Step 2, the separated feature representations of the original waveform output in step 1 are fed into two fully connected layers respectively, and two acoustic state distribution vectors are output.
The output of the separation network passes through two independent fully connected layers, yielding the probability distributions over acoustic modeling units output by the two fully connected layers of the neural network.
Step 3, with reference to the labeling information obtained by forced alignment, the amount of computation is reduced by a method of multiple cross scoring with an error threshold.
With reference to the labeling information obtained by forced alignment, the smaller probability-distribution error of the acoustic modeling units under the two orderings is obtained by cross scoring and threshold selection and used as the error for neural network back-propagation.
Firstly, acquiring corresponding labeling information from the existing target voice label by adopting a forced alignment method;
subsequently, in the case of two speakers, a multiple cross-scoring method is employed, i.e., the errors LR_1 and LR_2 under the two orderings are considered separately:
LR_1 = LR_11 + LR_22    (3)
LR_2 = LR_12 + LR_21    (4)
where LR_ij (i = 1, 2; j = 1, 2) denotes the cross-entropy error between the i-th output of the separation network and the forced-alignment label of the clean speech of the j-th target speaker.
If the above formulas are evaluated in full, LR_ij has to be computed 4 times. The method of the invention is as follows: LR_11 is calculated first. If LR_11 is less than a preset threshold, LR_22 is then calculated and equation (3) is taken as the smaller error under the two orderings; if LR_11 is greater than the threshold, LR_12 and LR_21 are calculated and equation (4) is taken as the smaller error under the two orderings. In this way only 2 or 3 of the LR_ij terms need to be computed each time; compared with the original 4 computations, about 1/4 to 1/2 of the error computation can be saved. The setting of the threshold follows two principles: first, the training error is generally large at the beginning, so the threshold should be large at the initial stage of training and decrease as training progresses; second, the threshold should be related to the average of the LR_ij values in the current training.
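As an illustration of this selection rule, the following NumPy sketch (names, array layouts and the fixed threshold argument are assumptions) computes LR_11 first and evaluates only the LR_ij terms that the rule requires:

    import numpy as np

    def thresholded_pit_ce(log_probs, labels, threshold):
        """Multiple cross-scoring with a threshold, as described above.

        log_probs: list of two (T, K) arrays of log-posteriors over K acoustic units.
        labels:    list of two length-T integer arrays of forced-alignment labels.
        Returns the selected error and the number of LR_ij terms evaluated.
        """
        T = labels[0].shape[0]
        frames = np.arange(T)

        def lr(i, j):  # cross entropy between output i and forced-alignment labels j
            return -np.sum(log_probs[i][frames, labels[j]])

        lr11 = lr(0, 0)
        if lr11 < threshold:
            return lr11 + lr(1, 1), 2      # LR_1 = LR_11 + LR_22, 2 evaluations
        return lr(0, 1) + lr(1, 0), 3      # LR_2 = LR_12 + LR_21, 3 evaluations

The threshold itself could, for instance, be decayed over training and tied to a running average of recent LR_ij values, consistent with the two principles stated above; the patent does not fix a particular formula, so any such schedule is an assumption.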
With this multiple cross-scoring method, the error-computation time during training is 1/2 to 3/4 of that of the full cross-scoring mode.
The two orderings in the present invention refer to the following. In the case of two speakers, when the mixed speech of the two speakers is input, the contents spoken by each of the two speakers are recognized, i.e., one input and two outputs. Meanwhile, during training it is also known what the two speakers actually said, i.e., there are two references; but the correspondence between outputs and references is unknown, so there are two orderings:
A. output 1 corresponds to reference 1 and output 2 corresponds to reference 2
B. Output 2 corresponds to reference 1 and output 1 corresponds to reference 2
That is, there is no guarantee that the first output port always corresponds to the first speaker.
In the case of multiple speakers, the principle is the same.
In summary, the invention provides a time-domain single-channel multi-speaker speech recognition modeling method that can effectively further improve the recognition of multi-speaker speech. Applying this method to several multi-speaker continuous speech recognition datasets can achieve better performance than the time-frequency-spectrum-based PIT method.
Fig. 2 also shows the corresponding system, in which the mixed speech signal waveform sampling module 101 obtains a signal x = [x_1, …, x_T] (T is the time length of the signal); the output of the mixed speech signal waveform sampling module 101 serves as the input of the one-dimensional convolutional neural network module 102; the output of the one-dimensional convolutional neural network module 102 serves as the input of the separation network BSRU 103; the outputs of the separation network BSRU 103 are fed into the two fully connected layers 104 respectively; the outputs of the two fully connected layers 104, together with the two target speech labels 106, are sent to the multiple cross-scoring module 105; the multiple cross-scoring module 105 uses the multiple cross-scoring and error-threshold method to obtain the smaller cross-entropy error 107 under the two orderings, and the smaller error selected by the minimum error module 108 is the error for the back-propagation update of the whole neural network.
During testing, the logarithms of the probability vectors output by the neural network are sent to a speech recognition decoder, and the recognized texts of the two speakers are obtained. The main advantages of the method are: by adopting a more flexible convolutional-network stacking scheme and a simplified cross-scoring error computation, the generalization ability of the model is improved and the performance of the multi-speaker speech recognition system is further improved. The method can be widely applied in various application fields related to speech separation and recognition.

Claims (7)

1. A time domain single channel multi-speaker voice recognition method is characterized by comprising the following steps:
step 1, feeding the original waveform of the mixed speech into a one-dimensional convolutional network for preliminary feature extraction, then feeding the features into a separation network BSRU, and outputting the separated feature representations of the original waveform;
step 2, feeding the separated feature representations of the original waveform into two fully connected layers respectively, and outputting two acoustic state distribution vectors;
step 3, comparing the two state distribution vectors against the labeling information obtained by forced alignment, obtaining the smaller error under the two orderings by means of cross scoring and threshold selection, using it as the error for neural network back-propagation, and thereby constructing a time-domain single-channel multi-speaker speech recognition model;
step 4, realizing multi-speaker speech recognition with the time-domain single-channel multi-speaker speech recognition model.
2. The time-domain single-channel multi-speaker speech recognition method according to claim 1, wherein in step 1 the one-dimensional convolutional network has one or more layers; for a multi-layer one-dimensional convolutional network, the parameters of each layer include the number of convolution kernels, the kernel length, the maximum pooling size and the stride; for a single-layer one-dimensional convolutional network, the kernel length is set to the number of sampling points of one frame of speech; the multi-layer network uses pooling operations while the single-layer network does not; the output of each convolutional layer is normalized by batch normalization to improve generalization and training speed, and the vectors of all channels of the last layer are spliced together as the learned feature representation of the time-domain waveform.
3. The time-domain single-channel multi-speaker speech recognition method according to claim 1, wherein in step 1 the BSRU is a bidirectional SRU, and the SRU is calculated as follows:
f_t = σ(W_f x_t + v_f ⊙ c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ (W x_t)
r_t = σ(W_r x_t + v_r ⊙ c_{t-1} + b_r)
h_t = r_t ⊙ c_t + (1 − r_t) ⊙ x_t
where W, W_r, W_f are weight matrices and v_f, b_f, v_r, b_r are parameter vectors; x_t and h_t are the current input and output; c_t is the cell state at time t, used to store history information, and c_{t-1} is the cell state at time t−1; f_t and r_t denote the forget gate and the reset gate respectively; σ is the sigmoid function; ⊙ denotes element-wise multiplication of two vectors.
4. The time-domain single-channel multi-speaker speech recognition method according to claim 3, wherein the two state distribution vectors obtained in step 2 are acoustic modeling unit probability distributions of two speakers.
5. The time-domain single-channel multi-speaker speech recognition method according to claim 1, wherein in step 3 a forced-alignment method is first adopted to obtain the corresponding labeling information from the existing target speech labels; subsequently, in the case of two speakers, a multiple cross-scoring method is employed, i.e., the errors LR_1 and LR_2 under the two orderings are considered separately:
LR_1 = LR_11 + LR_22
LR_2 = LR_12 + LR_21
where LR_ij (i = 1, 2; j = 1, 2) denotes the cross-entropy error between the i-th output of the separation network and the forced-alignment label of the clean speech of the j-th target speaker.
6. The time-domain single-channel multi-speaker speech recognition method according to claim 5, wherein LR_11 is calculated first; if LR_11 is less than a preset threshold, LR_22 is then calculated and LR_1 is taken as the smaller error under the two orderings; if LR_11 is greater than the threshold, LR_12 and LR_21 are calculated and LR_2 is taken as the smaller error under the two orderings.
7. A time domain single channel multi-speaker speech recognition system, comprising:
a mixed voice signal waveform sampling module (101) for sampling the waveform of the mixed voice signal;
the one-dimensional convolution neural network module (102) takes the output of the mixed voice signal waveform sampling module (101) as input and preliminarily extracts features;
the separation network BSRU (103) takes the output of the one-dimensional convolution neural network module (102) as input to obtain the characteristic representation after the original waveform is separated;
the two fully connected layers (104), which respectively take the two output paths of the BSRU (103) as inputs to obtain two state distribution vectors;
a multiple cross-scoring module (105), which cross-scores the outputs of the two fully connected layers (104) against the two target speech labels (106) using the multiple cross-scoring and error-threshold method to obtain the smaller cross-entropy error (107) under the two orderings;
and a minimum error module (108), which takes the smaller error of the two orderings as the error for the back-propagation update of the whole neural network.
CN202010061565.6A 2020-01-19 2020-01-19 Time domain single-channel multi-speaker voice recognition method and system Active CN111243579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010061565.6A CN111243579B (en) 2020-01-19 2020-01-19 Time domain single-channel multi-speaker voice recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010061565.6A CN111243579B (en) 2020-01-19 2020-01-19 Time domain single-channel multi-speaker voice recognition method and system

Publications (2)

Publication Number Publication Date
CN111243579A true CN111243579A (en) 2020-06-05
CN111243579B CN111243579B (en) 2022-10-14

Family

ID=70872827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010061565.6A Active CN111243579B (en) 2020-01-19 2020-01-19 Time domain single-channel multi-speaker voice recognition method and system

Country Status (1)

Country Link
CN (1) CN111243579B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562712A (en) * 2020-12-24 2021-03-26 上海明略人工智能(集团)有限公司 Recording data processing method and system, electronic equipment and storage medium
CN113239809A (en) * 2021-05-14 2021-08-10 西北工业大学 Underwater sound target identification method based on multi-scale sparse SRU classification model
CN113362831A (en) * 2021-07-12 2021-09-07 科大讯飞股份有限公司 Speaker separation method and related equipment thereof
CN113436633A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Speaker recognition method, speaker recognition device, computer equipment and storage medium
CN113571085A (en) * 2021-07-24 2021-10-29 平安科技(深圳)有限公司 Voice separation method, system, device and storage medium
CN113782045A (en) * 2021-08-30 2021-12-10 江苏大学 Single-channel voice separation method for multi-scale time delay sampling
CN115116448A (en) * 2022-08-29 2022-09-27 四川启睿克科技有限公司 Voice extraction method, neural network model training method, device and storage medium
CN115440198A (en) * 2022-11-08 2022-12-06 南方电网数字电网研究院有限公司 Method and apparatus for converting mixed audio signal, computer device and storage medium


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5632002A (en) * 1992-12-28 1997-05-20 Kabushiki Kaisha Toshiba Speech recognition interface system suitable for window systems and speech mail systems
US20060028337A1 (en) * 2004-08-09 2006-02-09 Li Qi P Voice-operated remote control for TV and electronic systems
US20120092436A1 (en) * 2010-10-19 2012-04-19 Microsoft Corporation Optimized Telepresence Using Mobile Device Gestures
CN108694949A (en) * 2018-03-27 2018-10-23 佛山市顺德区中山大学研究院 Method for distinguishing speek person and its device based on reorder super vector and residual error network
US20190304437A1 (en) * 2018-03-29 2019-10-03 Tencent Technology (Shenzhen) Company Limited Knowledge transfer in permutation invariant training for single-channel multi-talker speech recognition
US20190318725A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers
CN108877782A (en) * 2018-07-04 2018-11-23 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN109637526A (en) * 2019-01-08 2019-04-16 西安电子科技大学 The adaptive approach of DNN acoustic model based on personal identification feature
CN110491415A (en) * 2019-09-23 2019-11-22 河南工业大学 A kind of speech-emotion recognition method based on convolutional neural networks and simple cycle unit

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TIAN TAN ET AL: "Knowledge Transfer in Permutation Invariant Training for Single-Channel Multi-Talker Speech Recognition", 《ICASSP 2018》 *
范存航等: "一种基于卷积神经网络的端到端语音分离方法", 《信号处理》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562712A (en) * 2020-12-24 2021-03-26 上海明略人工智能(集团)有限公司 Recording data processing method and system, electronic equipment and storage medium
CN113239809A (en) * 2021-05-14 2021-08-10 西北工业大学 Underwater sound target identification method based on multi-scale sparse SRU classification model
CN113239809B (en) * 2021-05-14 2023-09-15 西北工业大学 Underwater sound target identification method based on multi-scale sparse SRU classification model
CN113436633A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Speaker recognition method, speaker recognition device, computer equipment and storage medium
CN113436633B (en) * 2021-06-30 2024-03-12 平安科技(深圳)有限公司 Speaker recognition method, speaker recognition device, computer equipment and storage medium
CN113362831A (en) * 2021-07-12 2021-09-07 科大讯飞股份有限公司 Speaker separation method and related equipment thereof
CN113571085A (en) * 2021-07-24 2021-10-29 平安科技(深圳)有限公司 Voice separation method, system, device and storage medium
CN113571085B (en) * 2021-07-24 2023-09-22 平安科技(深圳)有限公司 Voice separation method, system, device and storage medium
CN113782045A (en) * 2021-08-30 2021-12-10 江苏大学 Single-channel voice separation method for multi-scale time delay sampling
CN113782045B (en) * 2021-08-30 2024-01-05 江苏大学 Single-channel voice separation method for multi-scale time delay sampling
CN115116448A (en) * 2022-08-29 2022-09-27 四川启睿克科技有限公司 Voice extraction method, neural network model training method, device and storage medium
CN115116448B (en) * 2022-08-29 2022-11-15 四川启睿克科技有限公司 Voice extraction method, neural network model training method, device and storage medium
CN115440198A (en) * 2022-11-08 2022-12-06 南方电网数字电网研究院有限公司 Method and apparatus for converting mixed audio signal, computer device and storage medium

Also Published As

Publication number Publication date
CN111243579B (en) 2022-10-14

Similar Documents

Publication Publication Date Title
CN111243579B (en) Time domain single-channel multi-speaker voice recognition method and system
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
CN109272988B (en) Voice recognition method based on multi-path convolution neural network
JP7407968B2 (en) Speech recognition method, device, equipment and storage medium
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
Kinoshita et al. Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds
WO2020024646A1 (en) Monaural multi-talker speech recognition with attention mechanism and gated convolutional networks
Nakkiran et al. Compressing deep neural networks using a rank-constrained topology.
Razak et al. Comparison between fuzzy and nn method for speech emotion recognition
CN113822125B (en) Processing method and device of lip language recognition model, computer equipment and storage medium
CN112233698A (en) Character emotion recognition method and device, terminal device and storage medium
CN113674732B (en) Voice confidence detection method and device, electronic equipment and storage medium
WO2021135457A1 (en) Recurrent neural network-based emotion recognition method, apparatus, and storage medium
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN111274412A (en) Information extraction method, information extraction model training device and storage medium
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN112488063B (en) Video statement positioning method based on multi-stage aggregation Transformer model
CN110569908B (en) Speaker counting method and system
CN111210815B (en) Deep neural network construction method for voice command word recognition, and recognition method and device
CN112766368A (en) Data classification method, equipment and readable storage medium
CN110717022A (en) Robot dialogue generation method and device, readable storage medium and robot
WO2020151017A1 (en) Scalable field human-machine dialogue system state tracking method and device
CN111563161A (en) Sentence recognition method, sentence recognition device and intelligent equipment
CN116312539A (en) Chinese dialogue round correction method and system based on large model
CN113889088B (en) Method and device for training speech recognition model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant