CN111243579A - Time domain single-channel multi-speaker voice recognition method and system - Google Patents

Time domain single-channel multi-speaker voice recognition method and system

Info

Publication number
CN111243579A
Authority
CN
China
Prior art keywords
speaker
network
error
output
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010061565.6A
Other languages
Chinese (zh)
Other versions
CN111243579B (en)
Inventor
黄露
杨毅
孙甲松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202010061565.6A
Publication of CN111243579A
Application granted
Publication of CN111243579B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

A time-domain single-channel multi-speaker speech recognition method: the input is the raw waveform samples of the mixed speech signal; features are first extracted by a one-dimensional convolutional network and then fed into a separation network for speech separation. The separated outputs are fed into two fully connected layers, which output two acoustic state distribution vectors. A forced-alignment method is then used to obtain the corresponding labeling information from the existing target speech labels, and the smaller probability-distribution error of the acoustic modeling units under the two orderings, selected by cross scoring and threshold selection, is taken as the error for neural network back-propagation, so that a time-domain single-channel multi-speaker speech recognition model is constructed and multi-speaker speech recognition is realized with this model. During testing, the logarithms of the probability vectors output by the two neural network branches are sent to a speech recognition decoder to obtain the recognized texts of the two speakers.

Description

Time domain single-channel multi-speaker voice recognition method and system
Technical Field
The invention belongs to the technical field of audio, and particularly relates to a time domain single-channel multi-speaker voice recognition method and system.
Background
The cocktail party problem is a long-standing problem in computer speech recognition: current speech recognition technology can recognize the content spoken by a single person with high accuracy, but when two or more people speak at the same time, the recognition rate drops sharply. Solving this problem would greatly benefit a range of practical application scenarios, such as automatic transcription of multi-person conferences, multi-party human-computer interaction, and automatic audio/video annotation.
With the rise of neural networks and deep learning, many deep-learning-based speech separation algorithms have been proposed. They fall mainly into two categories: speech separation based on the time-frequency spectrum and speech separation based on the time-domain signal.
1. Speech separation methods based on the time-frequency spectrum:
1) deep Clustering (DPCL) method: firstly, mapping the time-frequency spectrum of a voice signal to a high-dimensional space through an artificial neural network, then dividing a high-dimensional space vector by using a clustering algorithm such as K-means clustering and the like, and dividing components belonging to the same speaker together. The method assumes that each time-frequency point belongs to only one of a plurality of speakers, and clustering in a high-dimensional space is not necessarily the optimal operation.
2) Deep Attractor Network (DAN): similar to DPCL, the time-frequency spectrum of the mixed speech signal is mapped into a high-dimensional space, in which a series of attractors is constructed; the attractors are used to group the time-frequency points belonging to each target speaker. However, DAN requires estimating the attractors, which not only adds computation but also requires a complex design process.
3) Permutation Invariant Training (PIT): in the case of a mixture of two speakers, the intuitive approach is to perform speech separation with an artificial neural network: the time-frequency spectrum (or other features) of the mixed speech is the input, and two outputs are designed, each corresponding to the time-frequency spectrum of one speaker. This, however, raises a problem: the ordering of the two outputs and of the target reference speech is not necessarily consistent. For example, the speaker ordering of the two network outputs may be "speaker 2, speaker 1" while the ordering of the reference speech is "speaker 1, speaker 2"; if the error between the outputs and the labels is computed directly in label order, serious errors occur, and the error has to be recomputed after reordering the references to "speaker 2, speaker 1". This is the label permutation problem in speech separation. PIT is currently the main method for handling it: all possible orderings of the reference speech are considered, and the ordering that minimizes the total error over all speakers is selected as the optimal one, which alleviates the label permutation problem. Fig. 1 shows a framework for single-channel multi-speaker speech separation using the PIT method.
The mathematical model of the standard PIT method is as follows. Suppose the input mixed speech signal contains two speakers, and let Y denote its time-frequency spectrum, a T × F matrix, where T is the number of time frames and F is the number of frequency points of the fast Fourier transform. Time and frequency indices are omitted for simplicity of presentation. The magnitude |Y| is fed into a separation network (usually a recurrent neural network, RNN), which estimates two speaker masks M_1 and M_2:
(M_1, M_2) = Separation(|Y|)    (1)
where Separation denotes the separation network.
The magnitude spectra of the two speakers are then estimated from the masks:
|X̂_1| = M_1 ⊙ |Y|    (2)
|X̂_2| = M_2 ⊙ |Y|    (3)
where |X̂_i| denotes the estimated magnitude spectrum of the i-th speaker.
Suppose X_1 and X_2 are the original clean speech of the target speakers. The error between the estimates and the clean speech is computed by the following error function:
L = min_{p ∈ P} (1/S) Σ_{s=1}^{S} ‖ |X̂_s| − |X_{p(s)}| ‖²    (4)
where S is the total number of speakers (S = 2 for two speakers), and P is the set of permutations of 1, 2, …, S, containing S! elements in total. The goal of this formula is to find the target-speaker ordering that best matches the ordering of the estimated speakers, and to use the minimum mean square error (MSE) under that ordering as the error for the gradient update of the neural network.
According to equation (4), in the two-speaker case the PIT method computes two errors LS_1 and LS_2:
LS_1 = LS_11 + LS_22    (5)
LS_2 = LS_12 + LS_21    (6)
where LS_ij (equation (7)) denotes the error between the i-th output of the separation network and the clean speech spectrum of the j-th target speaker. Specifically, LS_11 is the error between the 1st network output and the clean speech spectrum of the 1st speaker, and LS_22 is the error between the 2nd network output and the clean speech spectrum of the 2nd speaker (under the assumption that the 1st output of the separation network corresponds to the 1st speaker); LS_12 is the error between the 1st network output and the clean speech spectrum of the 2nd speaker, and LS_21 is the error between the 2nd network output and the clean speech spectrum of the 1st speaker (under the assumption that the 1st output corresponds to the 2nd speaker).
Finally, the smaller of LS_1 and LS_2 is selected as the error for the back-propagation update of the neural network. In this case, the computation of equation (7) has to be performed 4 times.
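As an illustration of equations (5)–(7) in the two-speaker case, the following NumPy sketch (function and variable names are illustrative, not taken from the patent) computes LS_1 and LS_2 and keeps the smaller one as the training error:

    import numpy as np

    def pit_mse_error(est_mags, ref_mags):
        """Permutation-invariant MSE for two speakers.

        est_mags: list of two (T, F) estimated magnitude spectra |X_hat_1|, |X_hat_2|
        ref_mags: list of two (T, F) clean reference magnitude spectra |X_1|, |X_2|
        Returns the smaller total error of the two output/reference orderings.
        """
        def ls(i, j):  # LS_ij: error between output i and reference j
            return np.mean((est_mags[i] - ref_mags[j]) ** 2)

        ls1 = ls(0, 0) + ls(1, 1)   # ordering "speaker 1, speaker 2", eq. (5)
        ls2 = ls(0, 1) + ls(1, 0)   # ordering "speaker 2, speaker 1", eq. (6)
        return min(ls1, ls2)        # error used for back-propagation

In this form all four LS_ij terms are evaluated, which is the cost the invention later reduces by threshold selection.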
2. Speech separation methods based on the time-domain signal:
The Time-domain Audio Separation Network (TasNet) borrows the idea of PIT to handle the ordering of the output ports, except that both the input and output of the neural network are speech waveform samples. Structurally, a one-dimensional convolution is first used as an encoder, encoding each frame of speech into an encoding vector; the encoding vector is fed into a separation network to obtain two masks; each mask is multiplied by the encoding vector of the mixed speech to obtain the encoding vector of the target speaker's speech for that frame; finally, a decoder, also a one-dimensional convolution, restores the encoding vector to a speech waveform. Recent work has shown that the separation quality of this method greatly exceeds that of the time-frequency-spectrum-based methods described above.
Specifically, since the mixed speech input y is in time-domain form, encoding and decoding operations are required to realize signal separation. The encoder convolves the input y with N convolution kernels of the same length as y:
e_i = y * w_i    (8)
where i = 1, …, N, N is the number of convolution kernels, w_i is the i-th convolution kernel, and the resulting e is the N-dimensional encoding vector. The encoding vectors are then fed into the separation network, whose outputs are the masks of the two speakers, and the encoding vector of each estimated speaker is the encoding vector of the mixed speech multiplied by its mask:
(m_1, m_2) = Separation(e)    (9)
d_i = m_i ⊙ e    (10)
Finally, the decoder recovers the original speech:
x̂_i = d_i W    (11)
where W is the learnable decoder matrix.
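The per-frame encode–mask–decode flow of equations (8)–(11) can be sketched as follows in NumPy; the separation network is replaced by a dummy mask function, and all sizes and names are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    L, N = 400, 256                       # frame length and number of kernels (illustrative)
    W_enc = rng.standard_normal((N, L))   # encoder kernels w_i, eq. (8)
    W_dec = rng.standard_normal((N, L))   # learnable decoder matrix W, eq. (11)

    def separate(e):
        """Stand-in for the separation network of eq. (9): returns two masks."""
        m1 = 1.0 / (1.0 + np.exp(-e))     # dummy sigmoid mask, for illustration only
        return m1, 1.0 - m1

    y = rng.standard_normal(L)            # one frame of mixed speech
    e = W_enc @ y                         # eq. (8): N-dimensional encoding vector
    m1, m2 = separate(e)                  # eq. (9): one mask per speaker
    d1, d2 = m1 * e, m2 * e               # eq. (10): masked codes per speaker
    x1_hat, x2_hat = d1 @ W_dec, d2 @ W_dec   # eq. (11): recovered waveform frames

In a real system the encoder, separator and decoder would all be trained jointly; the sketch only shows how the tensors of equations (8)–(11) fit together.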
The above method has a drawback: because PIT processes a short segment of speech at a time, the reference-speech ordering of the earlier and later segments within one utterance may be inconsistent. When the separation results of a given output port are stitched together across segments, serious speaker switching can occur, i.e., an output that should contain only speaker 2 also contains speech from speaker 1. Therefore, in practical applications a recurrent neural network (RNN) is usually adopted for utterance-level modeling, which helps keep the ordering of consecutive output frames continuous and stable.
In addition, both of the above methods still need to perform speech separation first and then run speech recognition for each speaker; that is, a truly end-to-end system is still not realized, and there remains some distance to the requirements of commercial application.
3. Single-channel multi-speaker speech recognition using PIT
The most intuitive way to perform single-channel multi-speaker speech recognition with PIT is to change the outputs of the neural network into acoustic modeling units and to replace the MSE error function used for speech separation with a cross-entropy (CE) error function, namely
L_CE = min_{p ∈ P} − Σ_t Σ_{i=1}^{S} log y_t^i( l_t^{p(i)} )
where y_t^i is the probability distribution over acoustic modeling units produced by the i-th output of the neural network at frame t, l_t^{p(i)} is the ground-truth label of the i-th speaker at frame t under permutation p (generally obtained by forced alignment), and y_t^i( l_t^{p(i)} ) is the probability that the i-th output assigns to that label at time t.
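The corresponding permutation-invariant cross-entropy selection for two speakers can be sketched as follows (a minimal NumPy illustration; array layouts and names are assumptions):

    import numpy as np

    def pit_ce_error(log_probs, labels):
        """Permutation-invariant cross-entropy for two speakers.

        log_probs: list of two (T, K) arrays of log-posteriors over K acoustic
                   modeling units, one array per network output.
        labels:    list of two length-T integer arrays of forced-alignment labels,
                   one array per reference speaker.
        """
        T = labels[0].shape[0]
        frames = np.arange(T)

        def ce(i, j):  # cross entropy between output i and reference labels j
            return -np.sum(log_probs[i][frames, labels[j]])

        lr1 = ce(0, 0) + ce(1, 1)   # ordering A
        lr2 = ce(0, 1) + ce(1, 0)   # ordering B
        return min(lr1, lr2)        # error used for back-propagation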
However, this method has the disadvantage that its input is the frequency-domain magnitude spectrum of the mixed signal, so the phase information of the mixed signal is not utilized.
Disclosure of Invention
In order to overcome the disadvantages of the prior art, the present invention provides a time-domain single-channel multi-speaker speech recognition method and system, which combines time-domain processing technology with single-channel multi-speaker speech separation technology and introduces time-domain processing of speech signals on the basis of permutation-invariant training, thereby reducing the error rate of multi-speaker speech recognition.
In order to achieve the above purpose, the invention adopts the following technical solution:
a time domain single channel multi-speaker voice recognition method comprises the following steps:
step 1, feeding the original waveform of the mixed speech into a one-dimensional convolutional network for preliminary feature extraction, then feeding the features into a separation network BSRU, and outputting the separated feature representations of the original waveform;
step 2, feeding the separated feature representations of the original waveform into two fully connected layers respectively, and outputting two acoustic state distribution vectors;
step 3, comparing the two state distribution vectors against the labeling information obtained by forced alignment, obtaining the smaller error under the two orderings by means of cross scoring and threshold selection, using it as the error for neural network back-propagation, and thereby constructing a time-domain single-channel multi-speaker speech recognition model;
step 4, realizing multi-speaker speech recognition with the time-domain single-channel multi-speaker speech recognition model.
In step 1, the one-dimensional convolutional network has one or more layers. For a multi-layer one-dimensional convolutional network, the parameters of each layer include the number of convolution kernels, the kernel length, the maximum pooling size and the stride; for a single-layer one-dimensional convolutional network, the kernel length is set to the number of sampling points of one frame of speech. The multi-layer network uses pooling operations, while the single-layer network does not. The output of each convolutional layer is normalized by batch normalization to improve generalization and training speed, and the vectors of all channels of the last layer are spliced together as the learned feature representation of the time-domain waveform.
In the step 1, the BSRU is a bidirectional SRU, and the calculation method of the SRU is as follows:
f_t = σ(W_f x_t + v_f ⊙ c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ (W x_t)
r_t = σ(W_r x_t + v_r ⊙ c_{t-1} + b_r)
h_t = r_t ⊙ c_t + (1 − r_t) ⊙ x_t
where W, W_r, W_f are weight matrices and v_f, b_f, v_r, b_r are parameter vectors; x_t and h_t are the current input and output; c_t is the cell state at time t, used to store history information, and c_{t-1} is the cell state at time t−1; f_t and r_t denote the forget gate and the reset gate respectively; σ is the sigmoid function; ⊙ denotes element-wise multiplication of two vectors.
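For concreteness, a NumPy sketch of a single SRU time step following the formulas above is given here; the bidirectional SRU runs such a recurrence in both time directions and concatenates the outputs. All shapes and names are assumptions, and the highway term assumes the input and hidden dimensions are equal.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sru_step(x_t, c_prev, W, W_f, W_r, v_f, v_r, b_f, b_r):
        """One SRU time step: returns output h_t and new cell state c_t."""
        f_t = sigmoid(W_f @ x_t + v_f * c_prev + b_f)   # forget gate
        c_t = f_t * c_prev + (1.0 - f_t) * (W @ x_t)    # cell-state update
        r_t = sigmoid(W_r @ x_t + v_r * c_prev + b_r)   # reset gate
        h_t = r_t * c_t + (1.0 - r_t) * x_t             # output with highway connection
        return h_t, c_t

Because W x_t, v_f ⊙ c_{t-1} and the gates involve no recurrence through a matrix multiplication, the heavy matrix products can be precomputed for all time steps, which is what makes the SRU fast.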
In the step 2, the obtained two state distribution vectors are the probability distribution of the acoustic modeling units of the two speakers.
In step 3, a forced-alignment method is first adopted to obtain the corresponding labeling information from the existing target speech labels; subsequently, in the case of two speakers, a multiple cross-scoring method is employed, i.e., the errors LR_1 and LR_2 under the two orderings are considered separately:
LR_1 = LR_11 + LR_22
LR_2 = LR_12 + LR_21
where LR_ij (i = 1, 2; j = 1, 2) denotes the cross-entropy error between the i-th output of the separation network and the forced-alignment label of the clean speech of the j-th target speaker.
LR_11 is calculated first. If LR_11 is less than a preset threshold, LR_22 is then calculated and LR_1 is taken as the smaller error under the two orderings; if LR_11 is greater than the threshold, LR_12 and LR_21 are calculated and LR_2 is taken as the smaller error under the two orderings.
The invention also provides a time domain single channel multi-speaker voice recognition system, comprising:
a mixed voice signal waveform sampling module 101 for sampling the waveform of the mixed voice signal;
the one-dimensional convolutional neural network module 102 takes the output of the mixed voice signal waveform sampling module 101 as input, and preliminarily extracts features;
a separation network BSRU103 which takes the output of the one-dimensional convolution neural network module 102 as input to obtain the characteristic representation after the original waveform is separated;
the two fully connected layers 104, which respectively take the two output paths of the separation network BSRU 103 as inputs and produce two state distribution vectors;
a multiple cross-scoring module 105, which cross-scores the outputs of the two fully connected layers 104 against the two target speech labels 106 using the multiple cross-scoring and error-threshold method, obtaining the smaller cross-entropy error 107 under the two orderings;
a minimum error module 108, which takes the smaller error of the two orderings as the error for the back-propagation update of the whole neural network (a minimal sketch of this module wiring is given below).
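For illustration only, the following PyTorch-style sketch shows how modules 102–104 might be wired together; the layer sizes, the class and parameter names, and the use of nn.GRU as a stand-in for the BSRU are assumptions, not values given by the patent.

    import torch
    import torch.nn as nn

    class TimeDomainMultiSpeakerASR(nn.Module):
        """Sketch of modules 102-104: 1-D conv front end -> bidirectional
        recurrent separation network -> two fully connected output heads."""

        def __init__(self, kernel_len=400, channels=256, hidden=512, num_units=3000):
            super().__init__()
            # Module 102: one-layer 1-D convolution over the raw waveform
            self.frontend = nn.Conv1d(1, channels, kernel_size=kernel_len,
                                      stride=kernel_len // 2)
            # Module 103: bidirectional recurrent separation network
            # (nn.GRU used here as a stand-in for the BSRU described in the patent)
            self.separator = nn.GRU(channels, hidden, num_layers=2,
                                    batch_first=True, bidirectional=True)
            # Module 104: two fully connected layers, one acoustic-state head per speaker
            self.head1 = nn.Linear(2 * hidden, num_units)
            self.head2 = nn.Linear(2 * hidden, num_units)

        def forward(self, waveform):                        # waveform: (batch, samples)
            feats = self.frontend(waveform.unsqueeze(1))    # (batch, channels, frames)
            feats = feats.transpose(1, 2)                   # (batch, frames, channels)
            sep, _ = self.separator(feats)                  # (batch, frames, 2*hidden)
            return self.head1(sep), self.head2(sep)         # two acoustic-state score vectors

During training, the two heads' outputs would be scored against the forced-alignment labels by modules 105–108 (the thresholded cross-scoring sketched in the detailed description below).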
The main principle of the invention is as follows: in the case of two speakers, the input is the raw waveform samples of the mixed speech signal; the waveform features are preliminarily learned by a one-dimensional convolutional network and then fed into the separation network for speech separation; the separated outputs are fed into two fully connected layers, which output two acoustic state distribution vectors; a forced-alignment method is then used to obtain the corresponding labeling information from the existing target speech labels, and the smaller probability-distribution error of the acoustic modeling units under the two orderings, selected by cross scoring and threshold selection, is used as the error for neural network back-propagation. To speed up the cross-scoring process, the invention also provides a scoring algorithm that saves about 1/4 to 1/2 of the error computation through threshold setting. During testing, the logarithms of the probability vectors output by the two neural network branches are sent to a speech recognition decoder to obtain the recognized texts of the two speakers.
Compared with the prior art, the main advantages of the invention are: by adopting a more flexible convolutional-network stacking scheme and a simplified cross-scoring error computation, the generalization ability of the model is improved and the performance of the multi-speaker speech recognition system is further improved. The method can be widely applied in various application fields related to speech separation and recognition.
Drawings
Fig. 1 is a block diagram of a prior art single-channel multi-speaker voice separation using the PIT method.
FIG. 2 is a flow chart of the time-domain single-channel multi-speaker speech recognition modeling of the present invention.
Fig. 3 is a schematic diagram of the SRU calculation method.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
The invention provides a time-domain single-channel multi-speaker speech recognition method and system which, in a multi-speaker speech recognition scenario, combine time-domain processing of speech signals with single-channel multi-speaker speech recognition on the basis of permutation-invariant training, in order to reduce the error rate of multi-speaker speech recognition. The method and system are not limited to multi-speaker speech recognition and may be applied to any method or system related to speech recognition.
FIG. 2 shows the time-domain single-channel multi-speaker speech recognition modeling process of the present invention, which includes:
step 1, feeding the original waveform into a one-dimensional convolutional network for preliminary feature extraction, then feeding the features into the separation network BSRU, and outputting the separated feature representations of the original waveform;
the system inputs original sampling waveforms of mixed voice, and the characteristics of the original sampling waveforms are preliminarily extracted through a one-dimensional convolution network. The one-dimensional convolution network can be one layer or multiple layers, and for the multi-layer one-dimensional convolution network, the parameters of each layer include the number of convolution kernels, the length of the convolution kernels, the maximum pooling size, the step size and the like. For a one-dimensional convolution network of one layer, the length of the convolution kernel is generally set to the number of sampling points of one frame of speech, for example, one frame of 25ms, and the sampling of 16kHz is 400 points. The multilayer one-dimensional convolution network has a pooling operation, and the one-dimensional convolution network of one layer has no pooling operation. The output of each layer of convolution is normalized through batch normalization to improve generalization and training speed. And finally, splicing vectors of all channels in the last layer together to be used as a feature representation of the learned time domain waveform. The feature representations of these waveforms are then fed into a separation network BSRU for separation, outputting the separated feature representations of the two speakers of the original blended waveform.
The BSRU (bidirectional SRU) of the separation network, with reference to FIG. 3, is calculated as follows:
f_t = σ(W_f x_t + v_f ⊙ c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ (W x_t)
r_t = σ(W_r x_t + v_r ⊙ c_{t-1} + b_r)
h_t = r_t ⊙ c_t + (1 − r_t) ⊙ x_t
where W, W_r, W_f are weight matrices and v_f, b_f, v_r, b_r are parameter vectors; x_t and h_t are the current input and output; c_t is the cell state, used to store history information; f_t and r_t denote the forget gate and the reset gate respectively; σ is the sigmoid function; ⊙ denotes element-wise multiplication of two vectors.
Step 2, the separated feature representations of the original waveform output in step 1 are fed into two fully connected layers respectively, and two acoustic state distribution vectors are output.
The output of the separation network passes through two independent fully connected layers, yielding the probability distributions over acoustic modeling units output by the two fully connected layers of the neural network.
Step 3, with reference to the labeling information obtained by forced alignment, the amount of computation is reduced by a method of multiple cross scoring with an error threshold.
With reference to the labeling information obtained by forced alignment, the smaller probability-distribution error of the acoustic modeling units under the two orderings is obtained by cross scoring and threshold selection and used as the error for neural network back-propagation.
Firstly, acquiring corresponding labeling information from the existing target voice label by adopting a forced alignment method;
subsequently, in the case of two speakers, a multiple cross-scoring method is employed, i.e., the errors LR_1 and LR_2 under the two orderings are considered separately:
LR_1 = LR_11 + LR_22    (3)
LR_2 = LR_12 + LR_21    (4)
where LR_ij (i = 1, 2; j = 1, 2) denotes the cross-entropy error between the i-th output of the separation network and the forced-alignment label of the clean speech of the j-th target speaker.
If the above formulas are evaluated in full, LR_ij has to be computed 4 times. The method of the invention is as follows: LR_11 is calculated first. If LR_11 is less than a preset threshold, LR_22 is then calculated and equation (3) is taken as the smaller error under the two orderings; if LR_11 is greater than the threshold, LR_12 and LR_21 are calculated and equation (4) is taken as the smaller error under the two orderings. In this way only 2 or 3 of the LR_ij terms need to be computed each time; compared with the original 4 computations, about 1/4 to 1/2 of the error computation can be saved. The setting of the threshold follows two principles: first, the training error is generally large at the beginning, so the threshold should be large at the initial stage of training and decrease as training progresses; second, the threshold should be related to the average of the LR_ij values in the current training.
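As an illustration of this selection rule, the following NumPy sketch (names, array layouts and the fixed threshold argument are assumptions) computes LR_11 first and evaluates only the LR_ij terms that the rule requires:

    import numpy as np

    def thresholded_pit_ce(log_probs, labels, threshold):
        """Multiple cross-scoring with a threshold, as described above.

        log_probs: list of two (T, K) arrays of log-posteriors over K acoustic units.
        labels:    list of two length-T integer arrays of forced-alignment labels.
        Returns the selected error and the number of LR_ij terms evaluated.
        """
        T = labels[0].shape[0]
        frames = np.arange(T)

        def lr(i, j):  # cross entropy between output i and forced-alignment labels j
            return -np.sum(log_probs[i][frames, labels[j]])

        lr11 = lr(0, 0)
        if lr11 < threshold:
            return lr11 + lr(1, 1), 2      # LR_1 = LR_11 + LR_22, 2 evaluations
        return lr(0, 1) + lr(1, 0), 3      # LR_2 = LR_12 + LR_21, 3 evaluations

The threshold itself could, for instance, be decayed over training and tied to a running average of recent LR_ij values, consistent with the two principles stated above; the patent does not fix a particular formula, so any such schedule is an assumption.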
With this multiple cross-scoring method, the error-computation time during training is 1/2 to 3/4 of that of the full cross-scoring mode.
The two orderings in the present invention refer to the following. In the case of two speakers, when the mixed speech of the two speakers is input, the contents spoken by each of the two speakers are recognized, i.e., one input and two outputs. Meanwhile, during training it is also known what the two speakers actually said, i.e., there are two references; but the correspondence between outputs and references is unknown, so there are two orderings:
A. output 1 corresponds to reference 1 and output 2 corresponds to reference 2
B. Output 2 corresponds to reference 1 and output 1 corresponds to reference 2
That is, there is no guarantee that the first output port always corresponds to the first speaker.
In the case of multiple speakers, the principle is the same.
In summary, the invention provides a time-domain single-channel multi-speaker speech recognition modeling method that can effectively further improve the recognition of multi-speaker speech. Applying this method to several multi-speaker continuous speech recognition datasets can achieve better performance than the time-frequency-spectrum-based PIT method.
Fig. 2 also shows the corresponding system, in which the mixed speech signal waveform sampling module 101 obtains a signal x = [x_1, …, x_T] (T is the time length of the signal); the output of the mixed speech signal waveform sampling module 101 serves as the input of the one-dimensional convolutional neural network module 102; the output of the one-dimensional convolutional neural network module 102 serves as the input of the separation network BSRU 103; the outputs of the separation network BSRU 103 are fed into the two fully connected layers 104 respectively; the outputs of the two fully connected layers 104, together with the two target speech labels 106, are sent to the multiple cross-scoring module 105; the multiple cross-scoring module 105 uses the multiple cross-scoring and error-threshold method to obtain the smaller cross-entropy error 107 under the two orderings, and the smaller error selected by the minimum error module 108 is the error for the back-propagation update of the whole neural network.
During testing, the logarithms of the probability vectors output by the neural network are sent to a speech recognition decoder, and the recognized texts of the two speakers are obtained. The main advantages of the method are: by adopting a more flexible convolutional-network stacking scheme and a simplified cross-scoring error computation, the generalization ability of the model is improved and the performance of the multi-speaker speech recognition system is further improved. The method can be widely applied in various application fields related to speech separation and recognition.

Claims (7)

1. A time domain single channel multi-speaker voice recognition method is characterized by comprising the following steps:
step 1, feeding the original waveform of the mixed speech into a one-dimensional convolutional network for preliminary feature extraction, then feeding the features into a separation network BSRU, and outputting the separated feature representations of the original waveform;
step 2, feeding the separated feature representations of the original waveform into two fully connected layers respectively, and outputting two acoustic state distribution vectors;
step 3, comparing the two state distribution vectors against the labeling information obtained by forced alignment, obtaining the smaller error under the two orderings by means of cross scoring and threshold selection, using it as the error for neural network back-propagation, and thereby constructing a time-domain single-channel multi-speaker speech recognition model;
step 4, realizing multi-speaker speech recognition with the time-domain single-channel multi-speaker speech recognition model.
2. The time-domain single-channel multi-speaker speech recognition method according to claim 1, wherein in step 1 the one-dimensional convolutional network has one or more layers; for a multi-layer one-dimensional convolutional network, the parameters of each layer include the number of convolution kernels, the kernel length, the maximum pooling size and the stride; for a single-layer one-dimensional convolutional network, the kernel length is set to the number of sampling points of one frame of speech; the multi-layer network uses pooling operations while the single-layer network does not; the output of each convolutional layer is normalized by batch normalization to improve generalization and training speed, and the vectors of all channels of the last layer are spliced together as the learned feature representation of the time-domain waveform.
3. The time-domain single-channel multi-speaker speech recognition method according to claim 1, wherein in step 1 the BSRU is a bidirectional SRU, and the SRU is calculated as follows:
f_t = σ(W_f x_t + v_f ⊙ c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ (W x_t)
r_t = σ(W_r x_t + v_r ⊙ c_{t-1} + b_r)
h_t = r_t ⊙ c_t + (1 − r_t) ⊙ x_t
where W, W_r, W_f are weight matrices and v_f, b_f, v_r, b_r are parameter vectors; x_t and h_t are the current input and output; c_t is the cell state at time t, used to store history information, and c_{t-1} is the cell state at time t−1; f_t and r_t denote the forget gate and the reset gate respectively; σ is the sigmoid function; ⊙ denotes element-wise multiplication of two vectors.
4. The time-domain single-channel multi-speaker speech recognition method according to claim 3, wherein the two state distribution vectors obtained in step 2 are acoustic modeling unit probability distributions of two speakers.
5. The time-domain single-channel multi-speaker speech recognition method according to claim 1, wherein in step 3 a forced-alignment method is first adopted to obtain the corresponding labeling information from the existing target speech labels; subsequently, in the case of two speakers, a multiple cross-scoring method is employed, i.e., the errors LR_1 and LR_2 under the two orderings are considered separately:
LR_1 = LR_11 + LR_22
LR_2 = LR_12 + LR_21
where LR_ij (i = 1, 2; j = 1, 2) denotes the cross-entropy error between the i-th output of the separation network and the forced-alignment label of the clean speech of the j-th target speaker.
6. The time-domain single-channel multi-speaker speech recognition method according to claim 5, wherein LR_11 is calculated first; if LR_11 is less than a preset threshold, LR_22 is then calculated and LR_1 is taken as the smaller error under the two orderings; if LR_11 is greater than the threshold, LR_12 and LR_21 are calculated and LR_2 is taken as the smaller error under the two orderings.
7. A time domain single channel multi-speaker speech recognition system, comprising:
a mixed voice signal waveform sampling module (101) for sampling the waveform of the mixed voice signal;
the one-dimensional convolution neural network module (102) takes the output of the mixed voice signal waveform sampling module (101) as input and preliminarily extracts features;
the separation network BSRU (103) takes the output of the one-dimensional convolution neural network module (102) as input to obtain the characteristic representation after the original waveform is separated;
the two fully connected layers (104), which respectively take the two output paths of the BSRU (103) as inputs to obtain two state distribution vectors;
a multiple cross-scoring module (105), which cross-scores the outputs of the two fully connected layers (104) against the two target speech labels (106) using the multiple cross-scoring and error-threshold method to obtain the smaller cross-entropy error (107) under the two orderings;
and a minimum error module (108), which takes the smaller error of the two orderings as the error for the back-propagation update of the whole neural network.
CN202010061565.6A 2020-01-19 2020-01-19 Time domain single-channel multi-speaker voice recognition method and system Active CN111243579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010061565.6A CN111243579B (en) 2020-01-19 2020-01-19 Time domain single-channel multi-speaker voice recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010061565.6A CN111243579B (en) 2020-01-19 2020-01-19 Time domain single-channel multi-speaker voice recognition method and system

Publications (2)

Publication Number Publication Date
CN111243579A true CN111243579A (en) 2020-06-05
CN111243579B CN111243579B (en) 2022-10-14

Family

ID=70872827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010061565.6A Active CN111243579B (en) 2020-01-19 2020-01-19 Time domain single-channel multi-speaker voice recognition method and system

Country Status (1)

Country Link
CN (1) CN111243579B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562712A (en) * 2020-12-24 2021-03-26 上海明略人工智能(集团)有限公司 Recording data processing method and system, electronic equipment and storage medium
CN113239809A (en) * 2021-05-14 2021-08-10 西北工业大学 Underwater sound target identification method based on multi-scale sparse SRU classification model
CN113362831A (en) * 2021-07-12 2021-09-07 科大讯飞股份有限公司 Speaker separation method and related equipment thereof
CN113436633A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Speaker recognition method, speaker recognition device, computer equipment and storage medium
CN113571085A (en) * 2021-07-24 2021-10-29 平安科技(深圳)有限公司 Voice separation method, system, device and storage medium
CN113782045A (en) * 2021-08-30 2021-12-10 江苏大学 Single-channel voice separation method for multi-scale time delay sampling
CN115116448A (en) * 2022-08-29 2022-09-27 四川启睿克科技有限公司 Voice extraction method, neural network model training method, device and storage medium
CN115440198A (en) * 2022-11-08 2022-12-06 南方电网数字电网研究院有限公司 Method and apparatus for converting mixed audio signal, computer device and storage medium


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5632002A (en) * 1992-12-28 1997-05-20 Kabushiki Kaisha Toshiba Speech recognition interface system suitable for window systems and speech mail systems
US20060028337A1 (en) * 2004-08-09 2006-02-09 Li Qi P Voice-operated remote control for TV and electronic systems
US20120092436A1 (en) * 2010-10-19 2012-04-19 Microsoft Corporation Optimized Telepresence Using Mobile Device Gestures
CN108694949A (en) * 2018-03-27 2018-10-23 佛山市顺德区中山大学研究院 Method for distinguishing speek person and its device based on reorder super vector and residual error network
US20190304437A1 (en) * 2018-03-29 2019-10-03 Tencent Technology (Shenzhen) Company Limited Knowledge transfer in permutation invariant training for single-channel multi-talker speech recognition
US20190318725A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers
CN108877782A (en) * 2018-07-04 2018-11-23 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN109637526A (en) * 2019-01-08 2019-04-16 西安电子科技大学 The adaptive approach of DNN acoustic model based on personal identification feature
CN110491415A (en) * 2019-09-23 2019-11-22 河南工业大学 A kind of speech-emotion recognition method based on convolutional neural networks and simple cycle unit

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TIAN TAN ET AL: "Knowledge Transfer in Permutation Invariant Training for Single-Channel Multi-Talker Speech Recognition", 《ICASSP 2018》 *
范存航等: "一种基于卷积神经网络的端到端语音分离方法", 《信号处理》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562712A (en) * 2020-12-24 2021-03-26 上海明略人工智能(集团)有限公司 Recording data processing method and system, electronic equipment and storage medium
CN113239809A (en) * 2021-05-14 2021-08-10 西北工业大学 Underwater sound target identification method based on multi-scale sparse SRU classification model
CN113239809B (en) * 2021-05-14 2023-09-15 西北工业大学 Underwater sound target identification method based on multi-scale sparse SRU classification model
CN113436633A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Speaker recognition method, speaker recognition device, computer equipment and storage medium
CN113436633B (en) * 2021-06-30 2024-03-12 平安科技(深圳)有限公司 Speaker recognition method, speaker recognition device, computer equipment and storage medium
CN113362831A (en) * 2021-07-12 2021-09-07 科大讯飞股份有限公司 Speaker separation method and related equipment thereof
CN113571085A (en) * 2021-07-24 2021-10-29 平安科技(深圳)有限公司 Voice separation method, system, device and storage medium
CN113571085B (en) * 2021-07-24 2023-09-22 平安科技(深圳)有限公司 Voice separation method, system, device and storage medium
CN113782045A (en) * 2021-08-30 2021-12-10 江苏大学 Single-channel voice separation method for multi-scale time delay sampling
CN113782045B (en) * 2021-08-30 2024-01-05 江苏大学 Single-channel voice separation method for multi-scale time delay sampling
CN115116448A (en) * 2022-08-29 2022-09-27 四川启睿克科技有限公司 Voice extraction method, neural network model training method, device and storage medium
CN115116448B (en) * 2022-08-29 2022-11-15 四川启睿克科技有限公司 Voice extraction method, neural network model training method, device and storage medium
CN115440198A (en) * 2022-11-08 2022-12-06 南方电网数字电网研究院有限公司 Method and apparatus for converting mixed audio signal, computer device and storage medium

Also Published As

Publication number Publication date
CN111243579B (en) 2022-10-14

Similar Documents

Publication Publication Date Title
CN111243579B (en) Time domain single-channel multi-speaker voice recognition method and system
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
CN109272988B (en) Voice recognition method based on multi-path convolution neural network
JP7407968B2 (en) Speech recognition method, device, equipment and storage medium
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
Kinoshita et al. Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds
WO2020024646A1 (en) Monaural multi-talker speech recognition with attention mechanism and gated convolutional networks
Nakkiran et al. Compressing deep neural networks using a rank-constrained topology.
Razak et al. Comparison between fuzzy and nn method for speech emotion recognition
CN113822125B (en) Processing method and device of lip language recognition model, computer equipment and storage medium
CN112233698A (en) Character emotion recognition method and device, terminal device and storage medium
CN113674732B (en) Voice confidence detection method and device, electronic equipment and storage medium
WO2021135457A1 (en) Recurrent neural network-based emotion recognition method, apparatus, and storage medium
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN111274412A (en) Information extraction method, information extraction model training device and storage medium
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN112488063B (en) Video statement positioning method based on multi-stage aggregation Transformer model
CN110569908B (en) Speaker counting method and system
CN111210815B (en) Deep neural network construction method for voice command word recognition, and recognition method and device
CN112766368A (en) Data classification method, equipment and readable storage medium
CN110717022A (en) Robot dialogue generation method and device, readable storage medium and robot
WO2020151017A1 (en) Scalable field human-machine dialogue system state tracking method and device
CN111563161A (en) Sentence recognition method, sentence recognition device and intelligent equipment
CN116312539A (en) Chinese dialogue round correction method and system based on large model
CN113889088B (en) Method and device for training speech recognition model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant