CN111243579A - Time domain single-channel multi-speaker voice recognition method and system - Google Patents
- Publication number
- CN111243579A (application CN202010061565.6A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- network
- error
- output
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
Abstract
A time-domain single-channel multi-speaker voice recognition method: the input is the raw waveform samples of a mixed speech signal; features are first extracted by a one-dimensional convolutional network and then sent into a separation network for speech separation. The separated outputs are fed into two fully connected layers, which output two acoustic state distribution vectors. A forced-alignment method then obtains the corresponding labeling information from the existing target speech labels, and, by means of cross scoring and threshold selection, the smaller probability-distribution error of the acoustic modeling units under the two orderings is used as the error for neural network back-propagation; a time-domain single-channel multi-speaker voice recognition model is thus constructed and used to realize multi-speaker voice recognition. During testing, the logarithms of the probability vectors output by the two neural networks are sent to a speech recognition decoder to obtain the recognized texts of the two speakers.
Description
Technical Field
The invention belongs to the technical field of audio, and particularly relates to a time domain single-channel multi-speaker voice recognition method and system.
Background
The cocktail party problem is a problem in the field of computer speech recognition: current speech recognition technology can recognize the content spoken by one person with high precision, but when two or more people speak simultaneously, the recognition rate drops sharply. Solving this problem would greatly benefit a series of practical application scenarios, such as automatic transcription of multi-person conferences, multi-party human-computer interaction, and automatic audio/video labeling.
With the rise of neural networks and deep learning, a plurality of voice separation algorithms based on deep learning are proposed, which can be mainly classified into two categories, one is voice separation based on time-frequency spectrum, and the other is voice separation based on time-domain signals.
1. Voice separation methods based on the time-frequency spectrum:
1) Deep Clustering (DPCL) method: the time-frequency spectrum of the speech signal is first mapped to a high-dimensional space by an artificial neural network; the high-dimensional vectors are then partitioned with a clustering algorithm such as K-means, grouping components that belong to the same speaker together. The method assumes that each time-frequency point belongs to only one of the speakers, and clustering in the high-dimensional space is not necessarily the optimal operation.
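As a hedged illustration of the clustering step only (the embedding network is assumed given, and all names here are illustrative rather than from the patent), a minimal two-speaker K-means sketch over per-time-frequency-bin embeddings might look like:

```python
import numpy as np

def dpcl_assign(emb, iters=10):
    """Assign each time-frequency-bin embedding to one of two speakers.

    emb: (T*F, D) array of embeddings produced by the network.
    Returns a 0/1 assignment vector of length T*F -- effectively a binary
    mask, reflecting the assumption that each T-F point belongs to exactly
    one speaker.
    """
    # deterministic init: the first point, plus the point farthest from it
    c0 = emb[0]
    c1 = emb[((emb - c0) ** 2).sum(axis=1).argmax()]
    centers = np.stack([c0, c1]).astype(float)
    for _ in range(iters):
        # squared distance of every embedding to each center
        dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        for k in (0, 1):
            if (assign == k).any():
                centers[k] = emb[assign == k].mean(axis=0)
    return assign
```

Separation then keeps, for each speaker, only the time-frequency points assigned to that speaker's cluster.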
2) Deep Attractor Network (DAN) method: similar to DPCL, the time-frequency spectrum of the mixed speech signal is mapped to a high-dimensional space, and then a series of attractors are constructed in the space, and the attractors are used to divide the time-frequency points belonging to the target person together. However, DAN requires an estimate of attractors, requiring not only an additional amount of computation, but also a complex design process.
3) Permutation Invariant Training (PIT) method: with the mixed speech signals of two speakers, the intuitive approach is to use an artificial neural network for speech separation — input the time-frequency spectrum (or other features) of the mixed speech and design two outputs, each corresponding to the time-frequency spectrum of one speaker. But this raises a problem: the ordering of the two outputs and of the target reference speech is not necessarily consistent. The speakers on the two network outputs may be ordered "speaker 2, speaker 1" while the reference speech is ordered "speaker 1, speaker 2"; if the error between the outputs and the labeled values is computed by forcing this pairing, serious errors occur, and the error must instead be recomputed after reordering the reference speech to "speaker 2, speaker 1". This is the Label Permutation problem in speech separation. PIT is currently the main method for handling it: all possible reference speech orderings are considered, and the ordering that minimizes the total error over all speakers is selected as the optimal one. Fig. 1 shows a framework for single-channel multi-speaker voice separation using the PIT method.
The mathematical model of the standard PIT method is as follows. Suppose the input mixed speech signal contains two speakers, and let Y denote its time-frequency spectrum, a T × F matrix where T is the number of time frames and F is the number of frequency points of the fast Fourier transform. Time and frequency indices are omitted for simplicity of presentation. The amplitude |Y| is fed into a separation network (usually a recurrent neural network, RNN), which estimates two speaker masks M1 and M2:
(M1,M2)=Separation(|Y|) (1)
Where Separation stands for Separation network.
Then, the amplitude spectra of the two speakers are estimated from the masks, as shown in the following formulas:

|X̂1| = M1 ⊙ |Y|   (2)

|X̂2| = M2 ⊙ |Y|   (3)
Suppose X1 and X2 are the original clean speech of the target speakers; the error between the estimates and the clean speech is calculated by the following error function:

L = min over p ∈ P of Σs ∥ |X̂s| − |Xp(s)| ∥²   (4)
where S is the total number of speakers (S = 2 for two speakers), and P is the set of permutations of 1, 2, …, S, of which there are S! in total. The goal of the above formula is to find the optimal arrangement of the target speakers that best matches the estimated speaker arrangement, and then to use the minimum mean square error (MSE) under that arrangement as the error for the neural network gradient update.
According to equation (4), in the two-speaker case the PIT method computes two errors LS1 and LS2:
LS1 = LS11 + LS22   (5)

LS2 = LS12 + LS21   (6)
where

LSij = ∥ |X̂i| − |Xj| ∥²   (7)

represents the error between the ith output of the separation network and the clean speech spectrum of the jth target speaker. Specifically, LS11 is the error between the network's 1st output and the 1st speaker's clean speech spectrum, and LS22 is the error between the 2nd output and the 2nd speaker's clean speech spectrum (on the premise that the 1st output of the separation network is assigned to the 1st speaker); LS12 is the error between the 1st output and the 2nd speaker's clean speech spectrum, and LS21 is the error between the 2nd output and the 1st speaker's clean speech spectrum (on the premise that the 1st output is assigned to the 2nd speaker).
Finally, the smaller of the two errors LS1 and LS2 is selected as the error for the back-propagation update of the neural network. In this case, the operation of equation (7) must be performed 4 times.
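The permutation search described above can be sketched in a few lines of numpy (a minimal sketch with illustrative names; the network producing the estimated spectra is assumed given):

```python
import numpy as np
from itertools import permutations

def pit_mse_loss(est_mags, ref_mags):
    """Permutation invariant MSE over magnitude spectra.

    est_mags, ref_mags: lists of S arrays, each of shape (T, F).
    Returns (min_error, best_permutation) -- the minimum total error over
    all S! pairings of outputs to reference speakers.
    """
    S = len(est_mags)
    best_err, best_perm = float("inf"), None
    for perm in permutations(range(S)):           # all S! candidate orderings
        # sum of the pairwise errors LS_ij for this ordering
        err = sum(np.mean((est_mags[i] - ref_mags[j]) ** 2)
                  for i, j in enumerate(perm))
        if err < best_err:
            best_err, best_perm = err, perm
    return best_err, best_perm
```

For S = 2 this evaluates equation (7) four times, matching the count stated above.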
2. The voice separation method based on the time domain signal comprises the following steps:
The Time-domain Audio Separation Network (TasNet) borrows the PIT idea to handle the ordering of the output ports, except that both the input and the output of the neural network are speech waveform samples. Structurally, a one-dimensional convolution first serves as an encoder, encoding one frame of speech into a coding vector; the coding vector is sent to a separation network to obtain two masks; each mask is multiplied by the coding vector of the mixed speech to obtain the coding vector of that frame of the target speaker's speech; finally, a decoder implemented as a one-dimensional convolution restores the coding vector to a speech waveform. Recent work has shown that the separation achieved by this method greatly exceeds the time-frequency-spectrum-based methods described above.
Specifically, since the current mixed speech input y is in time-domain form, encoding and decoding operations are needed to realize signal separation. The encoder convolves the input y with N convolution kernels of the same length as y:

ei = wi ⊛ y   (8)

where i = 1, …, N, N is the number of convolution kernels, wi is the ith convolution kernel, and the resulting e is the N-dimensional coding vector. The coding vector is then input into the separation network, whose output is a mask for each of the two speakers; the predicted coding vector for each speaker is the coding vector of the mixed speech multiplied by its mask:
(m1,m2)=Separation(e) (9)
di=mi⊙e (10)
Finally, the decoder recovers the original speech:

x̂i = W di   (11)

where W is the learnable decoder matrix.
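Under the assumption that the kernel length equals the frame length (so each convolution in equation (8) reduces to a dot product), the per-frame encode/mask/decode chain of equations (8)–(11) can be sketched as follows (illustrative names, not the patent's implementation):

```python
import numpy as np

def encode(y, kernels):
    """Eq. (8): e_i = w_i * y -- one frame y against N kernels of equal length."""
    return np.array([np.dot(w, y) for w in kernels])  # N-dimensional code e

def separate_frame(e, masks, W):
    """Eqs. (9)-(11): mask the code per speaker, then decode with matrix W."""
    outputs = []
    for m in masks:               # one N-dimensional mask per speaker
        d = m * e                 # eq. (10): element-wise masking
        outputs.append(W @ d)     # eq. (11): decode back to waveform samples
    return outputs
```

In the real system the masks come from the separation network (equation (9)) and W is learned jointly with the encoder.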
The drawback of the above methods is that PIT processes one short segment of speech at a time, so within one utterance the reference ordering of earlier and later segments may be inconsistent. When the separation results of one output are stitched together across segments, this causes serious speaker switching — that is, an output that should contain only speaker 2's speech ends up containing speech of speaker 1. In practical application, a recurrent neural network (RNN) is therefore usually adopted for sentence-level modeling, which ensures a certain continuity and stability of the output ordering across frames.
In addition, both of the above methods still need to perform voice separation first and then recognize each speaker's speech separately; that is, a truly end-to-end system is still not realized, and a certain distance remains to the requirements of commercial application.
3. Single-channel multi-speaker speech recognition using PIT
The most intuitive way to perform single-channel multi-speaker speech recognition with PIT is to change the outputs of the neural network to acoustic modeling units and to replace the MSE error function used in speech separation with a cross-entropy (CE) error function, namely

LR = min over p ∈ P of Σi Σt CE(ŷi,t, lp(i),t)

where ŷi,t is the acoustic-modeling-unit probability distribution of the ith output of the neural network at frame t, lp(i),t is the true label of the p(i)th speaker at frame t under permutation p, generally obtained by forced alignment, and CE(ŷi,t, lp(i),t) is the negative logarithm of the probability that the ith output assigns at time t to the label lp(i),t.
However, this method has the disadvantage that its input is the frequency-domain amplitude spectrum of the mixed signal, so the phase information of the mixed signal is not utilized.
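The cross-entropy variant of PIT just described can be sketched as follows (a hedged numpy sketch for two outputs with forced-alignment labels assumed given; names are illustrative):

```python
import numpy as np
from itertools import permutations

def pit_ce_loss(probs, labels):
    """Permutation invariant cross-entropy over acoustic modeling units.

    probs : list of S arrays, each (T, C) -- per-frame probability
            distributions over C acoustic modeling units, one per output.
    labels: list of S arrays, each (T,) -- forced-alignment unit labels.
    Returns the loss under the best label permutation.
    """
    T = probs[0].shape[0]
    best = float("inf")
    for perm in permutations(range(len(probs))):
        # sum over outputs and frames of -log p(label) under this pairing
        ce = sum(-np.log(probs[i][np.arange(T), labels[j]]).sum()
                 for i, j in enumerate(perm))
        best = min(best, ce)
    return best
```

The only change from the MSE version is the per-pair error term; the permutation search is identical.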
Disclosure of Invention
To overcome the shortcomings of the prior art, the present invention provides a time-domain single-channel multi-speaker voice recognition method and system, which combines time-domain processing technology with single-channel multi-speaker voice separation technology: on the basis of permutation invariant training, a time-domain processing method for speech signals is introduced, thereby reducing the error rate of multi-speaker voice recognition.
In order to achieve the purpose, the invention adopts the technical scheme that:
a time domain single channel multi-speaker voice recognition method comprises the following steps:
step 2, respectively sending the characteristic representations of the separated original waveforms into two full-connection layers, and outputting two acoustic state distribution vectors;
step 3, referring the two state distribution vectors to labeling information obtained by forced alignment, obtaining smaller errors under two sorts in a cross scoring and threshold value selection mode, and constructing a time domain single-channel multi-speaker voice recognition model as errors of neural network back propagation;
and 4, realizing multi-speaker voice recognition by using the single-channel multi-speaker voice recognition model with the time domain.
In the step 1, the one-dimensional convolution network is one or more layers, and for the multi-layer one-dimensional convolution network, the parameters of each layer comprise the number of convolution kernels, the length of the convolution kernels, the maximum pooling size and the step length; for a layer of one-dimensional convolution network, setting the length of a convolution kernel as the number of sampling points of a frame of voice; the multi-layer one-dimensional convolution network has pooling operation, and the one-dimensional convolution network of one layer has no pooling operation; the output of each layer of convolution is normalized through batch normalization to improve the generalization and the training speed, and the vectors of all channels of the last layer are spliced together and used as the feature representation of the learned time domain waveform.
In the step 1, the BSRU is a bidirectional SRU, and the calculation method of the SRU is as follows:
ft=σ(Wfxt+vf⊙ct-1+bf)
ct=ft⊙ct-1+(1-ft)⊙(Wxt)
rt=σ(Wrxt+vr⊙ct-1+br)
ht=rt⊙ct+(1-rt)⊙xt
where W, Wr, Wf are weight matrices and vf, bf, vr, br are parameter vectors; xt and ht are the current input and output; ct is the cell state at time t, used to store history information, and ct-1 is the cell state at time t-1; ft and rt denote the forget gate and the reset gate respectively; σ is the sigmoid function; ⊙ denotes element-wise multiplication of two vectors.
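A single SRU step following the four equations above can be sketched as follows (numpy; the parameter-dictionary layout is an illustrative assumption, not the patent's interface):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_step(x_t, c_prev, P):
    """One SRU step; P holds W, Wf, Wr (matrices) and vf, bf, vr, br (vectors)."""
    f_t = sigmoid(P["Wf"] @ x_t + P["vf"] * c_prev + P["bf"])  # forget gate
    c_t = f_t * c_prev + (1.0 - f_t) * (P["W"] @ x_t)          # cell state
    r_t = sigmoid(P["Wr"] @ x_t + P["vr"] * c_prev + P["br"])  # reset gate
    h_t = r_t * c_t + (1.0 - r_t) * x_t                        # highway output
    return h_t, c_t
```

Note that unlike an LSTM, the matrix products involve only the current input x_t, not the previous hidden state, which is what makes the SRU fast to parallelize across time.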
In the step 2, the obtained two state distribution vectors are the probability distribution of the acoustic modeling units of the two speakers.
In the step 3, firstly, a forced alignment method is adopted to obtain corresponding labeling information from the existing target voice labels; subsequently, in the case of two speakers, a multiple cross-scoring method is employed, i.e., the errors LR1 and LR2 under the two orderings are considered separately:
LR1=LR11+LR22
LR2=LR12+LR21
where LRij (i = 1, 2; j = 1, 2) denotes the cross-entropy error between the ith output of the separation network and the forced-alignment label of the jth target speaker's clean speech.
First compute LR11. If LR11 is less than a predetermined threshold, compute LR22 and take LR1 as the smaller error under the two orderings; if LR11 is greater than the threshold, compute LR12 and LR21 and take LR2 as the smaller error under the two orderings.
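The selection rule above can be sketched as follows (here `ce` is a stand-in for whatever routine computes one LRij term — an illustrative interface, not the patent's):

```python
def select_error(ce, threshold):
    """Threshold shortcut for two-speaker cross scoring.

    ce(i, j) returns LR_ij, the cross-entropy between output i and the
    forced-alignment label of target speaker j. Each call evaluates only
    2 or 3 LR_ij terms instead of all 4.
    """
    lr11 = ce(1, 1)
    if lr11 < threshold:
        return lr11 + ce(2, 2)      # LR1 = LR11 + LR22
    return ce(1, 2) + ce(2, 1)      # LR2 = LR12 + LR21
```

A small LR11 suggests output 1 already matches speaker 1, so the swapped ordering need not be scored at all.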
The invention also provides a time domain single channel multi-speaker voice recognition system, comprising:
a mixed voice signal waveform sampling module 101 for sampling the waveform of the mixed voice signal;
the one-dimensional convolutional neural network module 102 takes the output of the mixed voice signal waveform sampling module 101 as input, and preliminarily extracts features;
a separation network BSRU103 which takes the output of the one-dimensional convolution neural network module 102 as input to obtain the characteristic representation after the original waveform is separated;
the two full connection layers 104 respectively take two paths of outputs of the separation network BSRU103 as inputs to obtain two state distribution vectors;
a multiple cross scoring module 105, which cross-scores the outputs of the two fully connected layers 104 against the two target voice labels 106, using multiple cross scoring and an error threshold, to obtain the smaller cross-entropy error 107 of the two orderings;

a minimum error module 108, which takes the smaller error of the two orderings as the error for the back-propagation update of the whole neural network.
The main principle of the invention is as follows: in the two-speaker case, the input is the raw waveform samples of the mixed speech signal; the waveform features are preliminarily learned by a one-dimensional convolutional network and then sent into a separation network for speech separation. The separated outputs are fed into two fully connected layers, which output two acoustic state distribution vectors. A forced-alignment method then obtains the corresponding labeling information from the existing target speech labels, and the smaller probability-distribution error of the acoustic modeling units under the two orderings, obtained by cross scoring and threshold selection, is used as the error for neural network back-propagation. To accelerate the cross-scoring process, the invention also provides a scoring algorithm that saves about 1/4 to 1/2 of the error computation through threshold setting. During testing, the logarithms of the probability vectors output by the two neural networks are sent to a speech recognition decoder to obtain the recognized texts of the two speakers.
Compared with the prior art, the invention has the main advantages that: by adopting a more flexible convolution network stacking mode and simplifying a method for cross scoring calculation errors, the purpose of improving the generalization capability of the model is realized, and the performance of the voice recognition system of a plurality of speakers is further improved. The method can be widely applied to various application fields related to voice separation and recognition.
Drawings
Fig. 1 is a block diagram of a prior art single-channel multi-speaker voice separation using the PIT method.
FIG. 2 is a time domain single channel multiple speaker speech recognition modeling flow chart of the present invention.
Fig. 3 is a schematic diagram of the SRU calculation method.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
The invention provides a time-domain single-channel multi-speaker voice recognition method and system, which, in a multi-speaker voice recognition scenario and on the basis of permutation invariant training, combine time-domain processing of speech signals with single-channel multi-speaker voice recognition to reduce the error rate of multi-speaker voice recognition. The method and system are not limited to multi-speaker speech recognition and may be applied to any method or system related to speech recognition.
FIG. 2 is a time domain single channel multiple speaker speech recognition model modeling process of the present invention, which includes:
the system inputs original sampling waveforms of mixed voice, and the characteristics of the original sampling waveforms are preliminarily extracted through a one-dimensional convolution network. The one-dimensional convolution network can be one layer or multiple layers, and for the multi-layer one-dimensional convolution network, the parameters of each layer include the number of convolution kernels, the length of the convolution kernels, the maximum pooling size, the step size and the like. For a one-dimensional convolution network of one layer, the length of the convolution kernel is generally set to the number of sampling points of one frame of speech, for example, one frame of 25ms, and the sampling of 16kHz is 400 points. The multilayer one-dimensional convolution network has a pooling operation, and the one-dimensional convolution network of one layer has no pooling operation. The output of each layer of convolution is normalized through batch normalization to improve generalization and training speed. And finally, splicing vectors of all channels in the last layer together to be used as a feature representation of the learned time domain waveform. The feature representations of these waveforms are then fed into a separation network BSRU for separation, outputting the separated feature representations of the two speakers of the original blended waveform.
The separation network's BSRU (Bidirectional SRU) is calculated as follows (with reference to Fig. 3):
ft=σ(Wfxt+vf⊙ct-1+bf)
ct=ft⊙ct-1+(1-ft)⊙(Wxt)
rt=σ(Wrxt+vr⊙ct-1+br)
ht=rt⊙ct+(1-rt)⊙xt
where W, Wr, Wf are weight matrices and vf, bf, vr, br are parameter vectors; xt and ht are the current input and output; ct is the cell state, used to store history information; ft and rt denote the forget gate and the reset gate respectively; σ is the sigmoid function; ⊙ denotes element-wise multiplication of two vectors.
Step 2, the characteristic representation after the separation of the original waveform output in the step 1 is respectively sent into two full-connection layers, and two acoustic state distribution vectors are respectively output;
the output of the separation network passes through two independent full-connection layers to respectively obtain the probability distribution of the acoustic modeling units output by the two neural network full-connection layers.
And 3, referring to the marking information obtained by forced alignment, and reducing the calculated amount by adopting a method of cross scoring for multiple times and setting an error threshold value.
And referring to the labeling information obtained by forced alignment, and respectively obtaining a smaller probability distribution error of the acoustic modeling units under two sorts as an error of neural network back propagation in a cross scoring and threshold selection mode.
Firstly, acquiring corresponding labeling information from the existing target voice label by adopting a forced alignment method;
subsequently, in the case of two speakers, a multiple cross-scoring method is employed, i.e., the errors LR1 and LR2 under the two cases are considered separately:
LR1 = LR11 + LR22   (3)

LR2 = LR12 + LR21   (4)
where LRij (i = 1, 2; j = 1, 2) denotes the cross-entropy error between the ith output of the separation network and the forced-alignment label of the jth target speaker's clean speech.
If the above formulas are evaluated in sequence, LRij must be computed 4 times. The method of the invention is: first compute LR11. If LR11 is less than a predetermined threshold, compute LR22 and take formula (3) as the smaller error under the two orderings; if LR11 is greater than the threshold, compute LR12 and LR21 and take formula (4) as the smaller error under the two orderings. Thus only 2 or 3 LRij computations are needed each time, saving about 1/4 to 1/2 of the error computation compared with the previous 4. The setting of the threshold follows two principles: first, the initial training error is generally large, so the threshold should be large at the start of training and decrease as training progresses; second, the threshold should be related to the average value of LRij in the current training.
The error calculation time for training with this multiple cross-scoring method is 1/2 to 3/4 of that of the full (4-computation) cross-scoring mode.
The two orderings in the present invention refer to the following. In the case of two speakers, when a mixed voice of the two speakers is input, the contents that the two speakers respectively speak are recognized — one input, two outputs. During training, the contents the two persons actually said are also known — two references; but the correspondence between outputs and references is unknown, so there are two orderings:

A. output 1 corresponds to reference 1 and output 2 corresponds to reference 2

B. output 2 corresponds to reference 1 and output 1 corresponds to reference 2

That is, there is no guarantee that the output of the first port is always the first person.
In the case of multiple speakers, the principle is the same.
In summary, the invention provides a time-domain single-channel multi-speaker voice recognition modeling method that can effectively further improve the recognition of multi-speaker speech. Employing this method on multi-speaker continuous speech recognition datasets can achieve better performance than the time-frequency-spectrum-based PIT method.
Fig. 2 also shows the corresponding system, in which the mixed speech signal waveform sampling module 101 obtains a signal x = [x1, …, xT] (T is the time length of the signal); the output of module 101 is the input of the one-dimensional convolutional neural network module 102; the output of module 102 is the input of the separation network BSRU 103; the outputs of the separation network BSRU 103 are respectively sent to the two fully connected layers 104; the outputs of the two fully connected layers 104, together with the two target voice labels 106, are sent to the multiple cross scoring module 105; the multiple cross scoring module 105 uses multiple cross scoring and an error threshold to obtain the smaller cross-entropy error 107 of the two orderings, and the smaller error selected by the minimum error module 108 is the error for the back-propagation update of the whole neural network.
During testing, the logarithms of the probability vectors output by the neural networks are sent to a speech recognition decoder to obtain the recognized texts of the two speakers. The main advantages of the method are: by adopting a more flexible convolution network stacking mode and a simplified cross-scoring error calculation method, the generalization capability of the model is improved, and the performance of the multi-speaker speech recognition system is further improved. The method can be widely applied in various application fields related to voice separation and recognition.
Claims (7)
1. A time domain single channel multi-speaker voice recognition method is characterized by comprising the following steps:
step 1, sending an original waveform of mixed voice into a one-dimensional convolution network to preliminarily extract features, then sending the feature into a separation network BSRU, and outputting a feature representation after the original waveform is separated;
step 2, respectively sending the characteristic representations of the separated original waveforms into two full-connection layers, and outputting two acoustic state distribution vectors;
step 3, referring the two state distribution vectors to labeling information obtained by forced alignment, obtaining smaller errors under two sorts in a cross scoring and threshold value selection mode, and constructing a time domain single-channel multi-speaker voice recognition model as errors of neural network back propagation;
and 4, realizing multi-speaker voice recognition by using the single-channel multi-speaker voice recognition model with the time domain.
2. The time-domain single-channel multi-speaker voice recognition method according to claim 1, wherein in step 1 the one-dimensional convolutional network has one or more layers; for a multi-layer one-dimensional convolutional network, the parameters of each layer include the number of convolution kernels, the length of the convolution kernels, the maximum pooling size and the stride; for a single-layer one-dimensional convolutional network, the length of the convolution kernel is set to the number of sampling points in one frame of voice; the multi-layer one-dimensional convolutional network has a pooling operation, while the single-layer one-dimensional convolutional network has none; the output of each convolutional layer is normalized by batch normalization to improve generalization and training speed, and the vectors of all channels of the last layer are concatenated together as the learned feature representation of the time-domain waveform.
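One layer of the multi-layer variant described in claim 2 (convolution, then batch normalization, then max pooling) can be sketched as follows; the kernel count, kernel length and pooling size are assumed for illustration, and no learned batch-norm scale/shift is included.

```python
import numpy as np

def conv_bn_pool(x, kernels, pool=2, eps=1e-5):
    """Illustrative single layer: 1-D convolution -> batch norm -> max pooling."""
    k = kernels.shape[1]
    # valid 1-D convolution, stride 1: one row per position, one column per kernel
    out = np.stack([x[i:i + k] @ kernels.T for i in range(len(x) - k + 1)])
    # batch normalization over the time axis (no learned scale/shift here)
    out = (out - out.mean(axis=0)) / np.sqrt(out.var(axis=0) + eps)
    # non-overlapping max pooling of size `pool` along time
    usable = (out.shape[0] // pool) * pool
    return out[:usable].reshape(-1, pool, out.shape[1]).max(axis=1)

rng = np.random.default_rng(1)
x = rng.standard_normal(400)              # raw waveform samples
kernels = rng.standard_normal((8, 16))    # 8 kernels of length 16 (assumed)
y = conv_bn_pool(x, kernels, pool=2)
print(y.shape)
```

Stacking several such layers and concatenating the channel vectors of the last layer gives the feature representation the claim describes; the single-layer variant would skip the pooling step.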
3. The time-domain single-channel multi-speaker speech recognition method according to claim 1, wherein in the step 1, the BSRU is a bidirectional SRU, and the SRU is calculated as follows:
f_t = σ(W_f x_t + v_f ⊙ c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ (W x_t)
r_t = σ(W_r x_t + v_r ⊙ c_{t-1} + b_r)
h_t = r_t ⊙ c_t + (1 − r_t) ⊙ x_t
where W, W_r and W_f are weight matrices and v_f, b_f, v_r, b_r are parameter vectors; x_t and h_t are the current input and output; c_t is the cell state at time t, which stores history information, and c_{t-1} is the cell state at time t−1; f_t and r_t denote the forget gate and the reset gate respectively; σ is the sigmoid function; and ⊙ denotes element-wise multiplication of two vectors.
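The four recurrences above can be implemented directly; this minimal NumPy sketch of one SRU time step uses assumed dimensions and random initialization (note the output equation h_t = r_t ⊙ c_t + (1 − r_t) ⊙ x_t requires the hidden size to equal the input size).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_step(x_t, c_prev, W, W_f, W_r, v_f, v_r, b_f, b_r):
    """One SRU time step, following the recurrences in claim 3."""
    f_t = sigmoid(W_f @ x_t + v_f * c_prev + b_f)   # forget gate
    c_t = f_t * c_prev + (1 - f_t) * (W @ x_t)      # cell state update
    r_t = sigmoid(W_r @ x_t + v_r * c_prev + b_r)   # reset gate
    h_t = r_t * c_t + (1 - r_t) * x_t               # highway-style output
    return h_t, c_t

rng = np.random.default_rng(0)
d = 4                                   # hidden size == input size (assumed)
W, W_f, W_r = (rng.standard_normal((d, d)) for _ in range(3))
v_f, v_r, b_f, b_r = (rng.standard_normal(d) for _ in range(4))

c = np.zeros(d)
for x_t in rng.standard_normal((5, d)):  # run 5 time steps
    h, c = sru_step(x_t, c, W, W_f, W_r, v_f, v_r, b_f, b_r)
print(h.shape, c.shape)
```

A bidirectional SRU (the BSRU of claim 3) runs this recurrence once forward and once backward over the sequence and combines the two output sequences.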
4. The time-domain single-channel multi-speaker speech recognition method according to claim 3, wherein the two state distribution vectors obtained in step 2 are the acoustic modeling unit probability distributions of the two speakers.
5. The time-domain single-channel multi-speaker voice recognition method according to claim 1, wherein in step 3, a forced alignment method is first adopted to obtain the corresponding labeling information from the existing target voice labels; then, for the case of two speakers, a multiple cross-scoring method is employed, i.e., the errors LR1 and LR2 under the two orderings are considered separately:
LR1=LR11+LR22
LR2=LR12+LR21
where LRij denotes the cross-entropy error between the i-th output of the separation network and the forced-alignment label of the j-th target speaker's clean voice, with i = 1, 2 and j = 1, 2.
6. The time-domain single-channel multi-speaker speech recognition method according to claim 5, wherein LR11 is first calculated; if LR11 is less than a predetermined threshold, LR22 is calculated and LR1 is taken as the smaller error under the two orderings; if LR11 is greater than the threshold, LR12 and LR21 are calculated and LR2 is taken as the smaller error under the two orderings.
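The scoring shortcut of claims 5 and 6 can be sketched as follows; the cross-entropy helper, the random softmax outputs and the threshold value of 2.0 are illustrative assumptions, not values from the patent.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean cross-entropy between per-frame distributions and integer labels."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

def min_error(out1, out2, lab1, lab2, threshold):
    """Multiple cross scoring with an error threshold (claims 5-6):
    compute LR11 first; fall back to the swapped ordering only if it is large."""
    lr11 = cross_entropy(out1, lab1)
    if lr11 < threshold:
        return lr11 + cross_entropy(out2, lab2)               # LR1 = LR11 + LR22
    return cross_entropy(out1, lab2) + cross_entropy(out2, lab1)  # LR2 = LR12 + LR21

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
T, S = 20, 5                                   # frames, acoustic states (assumed)
out1 = softmax(rng.standard_normal((T, S)))    # network output for stream 1
out2 = softmax(rng.standard_normal((T, S)))    # network output for stream 2
lab1 = rng.integers(0, S, T)                   # forced-alignment labels, speaker 1
lab2 = rng.integers(0, S, T)                   # forced-alignment labels, speaker 2
err = min_error(out1, out2, lab1, lab2, threshold=2.0)
print(round(err, 3))
```

Compared with full permutation-invariant scoring, which always evaluates all four LRij terms, this shortcut evaluates at most three of them, which is the simplification the description credits for the reduced computation.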
7. A time domain single channel multi-speaker speech recognition system, comprising:
a mixed voice signal waveform sampling module (101) for sampling the waveform of the mixed voice signal;
a one-dimensional convolutional neural network module (102), which takes the output of the mixed voice signal waveform sampling module (101) as input and preliminarily extracts features;
a separation network BSRU (103), which takes the output of the one-dimensional convolutional neural network module (102) as input to obtain the feature representations of the separated original waveforms;
two fully connected layers (104), which respectively take the two outputs of the separation network BSRU (103) as inputs to obtain two state distribution vectors;
a multiple cross-scoring module (105), which cross-scores the outputs of the two fully connected layers (104) against the two target voice labels (106) using multiple cross scoring with a preset error threshold, obtaining the smaller of the two cross-entropy errors (107) under the two orderings;
and a minimum error module (108), which takes the smaller error under the two orderings as the error for the back-propagation update of the whole neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010061565.6A CN111243579B (en) | 2020-01-19 | 2020-01-19 | Time domain single-channel multi-speaker voice recognition method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111243579A true CN111243579A (en) | 2020-06-05 |
CN111243579B CN111243579B (en) | 2022-10-14 |
Family
ID=70872827
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010061565.6A Active CN111243579B (en) | 2020-01-19 | 2020-01-19 | Time domain single-channel multi-speaker voice recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111243579B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112562712A (en) * | 2020-12-24 | 2021-03-26 | 上海明略人工智能(集团)有限公司 | Recording data processing method and system, electronic equipment and storage medium |
CN113239809A (en) * | 2021-05-14 | 2021-08-10 | 西北工业大学 | Underwater sound target identification method based on multi-scale sparse SRU classification model |
CN113362831A (en) * | 2021-07-12 | 2021-09-07 | 科大讯飞股份有限公司 | Speaker separation method and related equipment thereof |
CN113436633A (en) * | 2021-06-30 | 2021-09-24 | 平安科技(深圳)有限公司 | Speaker recognition method, speaker recognition device, computer equipment and storage medium |
CN113571085A (en) * | 2021-07-24 | 2021-10-29 | 平安科技(深圳)有限公司 | Voice separation method, system, device and storage medium |
CN113782045A (en) * | 2021-08-30 | 2021-12-10 | 江苏大学 | Single-channel voice separation method for multi-scale time delay sampling |
CN115116448A (en) * | 2022-08-29 | 2022-09-27 | 四川启睿克科技有限公司 | Voice extraction method, neural network model training method, device and storage medium |
CN115440198A (en) * | 2022-11-08 | 2022-12-06 | 南方电网数字电网研究院有限公司 | Method and apparatus for converting mixed audio signal, computer device and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5632002A (en) * | 1992-12-28 | 1997-05-20 | Kabushiki Kaisha Toshiba | Speech recognition interface system suitable for window systems and speech mail systems |
US20060028337A1 (en) * | 2004-08-09 | 2006-02-09 | Li Qi P | Voice-operated remote control for TV and electronic systems |
US20120092436A1 (en) * | 2010-10-19 | 2012-04-19 | Microsoft Corporation | Optimized Telepresence Using Mobile Device Gestures |
CN108694949A (en) * | 2018-03-27 | 2018-10-23 | 佛山市顺德区中山大学研究院 | Method for distinguishing speek person and its device based on reorder super vector and residual error network |
CN108877782A (en) * | 2018-07-04 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device |
CN109637526A (en) * | 2019-01-08 | 2019-04-16 | 西安电子科技大学 | The adaptive approach of DNN acoustic model based on personal identification feature |
US20190304437A1 (en) * | 2018-03-29 | 2019-10-03 | Tencent Technology (Shenzhen) Company Limited | Knowledge transfer in permutation invariant training for single-channel multi-talker speech recognition |
US20190318725A1 (en) * | 2018-04-13 | 2019-10-17 | Mitsubishi Electric Research Laboratories, Inc. | Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers |
CN110491415A (en) * | 2019-09-23 | 2019-11-22 | 河南工业大学 | A kind of speech-emotion recognition method based on convolutional neural networks and simple cycle unit |
Non-Patent Citations (2)
Title |
---|
TIAN TAN ET AL: "Knowledge Transfer in Permutation Invariant Training for Single-Channel Multi-Talker Speech Recognition", ICASSP 2018 * |
FAN CUNHANG ET AL: "An End-to-End Speech Separation Method Based on Convolutional Neural Networks", Journal of Signal Processing * |
Also Published As
Publication number | Publication date |
---|---|
CN111243579B (en) | 2022-10-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111243579B (en) | Time domain single-channel multi-speaker voice recognition method and system | |
JP7337953B2 (en) | Speech recognition method and device, neural network training method and device, and computer program | |
CN109272988B (en) | Voice recognition method based on multi-path convolution neural network | |
JP7407968B2 (en) | Speech recognition method, device, equipment and storage medium | |
CN108597541B (en) | Speech emotion recognition method and system for enhancing anger and happiness recognition | |
Kinoshita et al. | Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds | |
WO2020024646A1 (en) | Monaural multi-talker speech recognition with attention mechanism and gated convolutional networks | |
Nakkiran et al. | Compressing deep neural networks using a rank-constrained topology. | |
Razak et al. | Comparison between fuzzy and nn method for speech emotion recognition | |
CN113822125B (en) | Processing method and device of lip language recognition model, computer equipment and storage medium | |
CN112233698A (en) | Character emotion recognition method and device, terminal device and storage medium | |
CN113674732B (en) | Voice confidence detection method and device, electronic equipment and storage medium | |
WO2021135457A1 (en) | Recurrent neural network-based emotion recognition method, apparatus, and storage medium | |
CN114694255B (en) | Sentence-level lip language recognition method based on channel attention and time convolution network | |
CN111274412A (en) | Information extraction method, information extraction model training device and storage medium | |
CN114282059A (en) | Video retrieval method, device, equipment and storage medium | |
CN112488063B (en) | Video statement positioning method based on multi-stage aggregation Transformer model | |
CN110569908B (en) | Speaker counting method and system | |
CN111210815B (en) | Deep neural network construction method for voice command word recognition, and recognition method and device | |
CN112766368A (en) | Data classification method, equipment and readable storage medium | |
CN110717022A (en) | Robot dialogue generation method and device, readable storage medium and robot | |
WO2020151017A1 (en) | Scalable field human-machine dialogue system state tracking method and device | |
CN111563161A (en) | Sentence recognition method, sentence recognition device and intelligent equipment | |
CN116312539A (en) | Chinese dialogue round correction method and system based on large model | |
CN113889088B (en) | Method and device for training speech recognition model, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||