CN111667836A - Text-independent multi-label speaker identification method based on deep learning - Google Patents

Text-independent multi-label speaker identification method based on deep learning

Info

Publication number
CN111667836A
CN111667836A
Authority
CN
China
Prior art keywords
speaker
labels
voice
training
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010563201.8A
Other languages
Chinese (zh)
Other versions
CN111667836B (en)
Inventor
邓克琦
卢晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202010563201.8A
Publication of CN111667836A
Application granted
Publication of CN111667836B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a text-independent multi-label speaker recognition method based on deep learning. The method comprises the following steps: (1) evenly divide the speech of each speaker in the training data set into N parts and assign a different label to each part; (2) construct a corresponding neural network model, ensuring that the dimension of the output layer is consistent with the number of labels in the training data set; (3) input the training data into the neural network, compare the output-layer result with the label corresponding to the data, and compute the cross-entropy loss function, thereby performing training; (4) for the test set, preset N labels regarded as valid identifications for the speech data of each speaker according to the correspondence established in step (1), input the test data into the neural network, and compare the label predicted by the model with the N preset labels; the speaker is identified correctly as long as the predicted label matches one of them. The method can effectively improve the speaker recognition performance of the model in both clean and noisy environments.

Description

Text-independent multi-label speaker identification method based on deep learning
Technical Field
The invention relates to a text-independent multi-label speaker identification method based on deep learning.
Background
Speaker recognition, also known as voiceprint recognition, aims at identifying a speaker from his or her speech characteristics. Speaker recognition comprises two processes, speaker identification and speaker verification: speaker identification refers to determining, after processing and analysing the corresponding speech, whether a speaker belongs to an enrolled speaker set; speaker verification refers to the process of further confirming whether the speaker corresponding to an input utterance is the target speaker.
The i-vector method can be used to achieve speaker recognition (N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel and P. Ouellet, "Front-End Factor Analysis for Speaker Verification," in IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, May 2011.). The literature (D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, and Y. Carmiel, "Deep Neural Network Embeddings for Text-Independent Speaker Verification," in Interspeech, 2017, pp. 999-1003.) indicates that deep learning methods can already surpass the conventional i-vector method once large-scale data, and in particular data augmentation, are used. However, speaker recognition in noisy environments remains a challenging problem.
A denoising auto-encoder (DAE) may be used to generate enhanced speech from noisy speech, thereby improving speaker recognition performance in noisy scenes (O. Plchot, L. Burget, H. Aronowitz and P. Matějka, "Audio enhancement with DNN autoencoder for speaker recognition," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 2016, pp. 5090-5094.). However, since this method uses the L2 loss function to first enhance the speech and only then performs speaker recognition, the speech enhancement and the speaker recognition are mismatched, and a high recognition rate on noisy speech cannot be achieved. The literature (Suwon Shon, Hao Tang, and James Glass, "VoiceID Loss: Speech Enhancement for Speaker Verification," in Interspeech, 2019, pp. 2888-2892.) uses an end-to-end architecture for speech enhancement and speaker recognition, but the overall performance of the algorithm still has room for improvement.
When speaker recognition is performed using deep learning, large-scale data is required for training; however, as the amount of data increases further, the improvement in the discrimination capability of the model slows down. In existing methods, speakers are in one-to-one correspondence with the labels of their speech data, which is not the most efficient way to use the data.
Disclosure of Invention
In the existing deep-learning-based speaker recognition training strategy, speakers correspond one-to-one to the labels of their speech data, so the data cannot be exploited to the greatest extent. The invention provides a text-independent multi-label speaker recognition method based on deep learning, which further improves the discrimination capability of the model and further improves its recognition capability in both clean and noisy environments.
The technical scheme adopted by the invention is as follows:
the text irrelevant multi-label speaker identification method based on deep learning comprises the following steps:
step 1, dividing the voice of each speaker in a training data set into N parts on average, and marking different labels on each part of voice, so that the number of the labels of the whole training data set is N times of the number of the speakers, wherein N is more than or equal to 2;
step 2, constructing a neural network model, wherein the dimension of an output layer of the model is consistent with the total number of labels of the training data set;
step 3, inputting the training data set in the step 1 into the neural network model in the step 2, comparing the output result with the label corresponding to the voice data, and solving a cross entropy loss function so as to perform supervised training;
and 4, presetting N labels regarded as effective identification for the voice data of each speaker according to the corresponding relation between the voice of the training data set and the labels in the step 1 by the test set data, inputting the test data set data into the neural network model, comparing the labels predicted by the model with the N labels regarded as effective identification set before, and obtaining the speaker identification rate of the model as long as the predicted label is one of the N labels regarded as effective identification, namely correct identification.
Furthermore, the neural network model comprises a speech enhancement network and a speaker recognition network; the speech enhancement network is used to enhance noisy speech and improve the robustness of the neural network model. In step 3, the speaker recognition network is trained in advance and its parameters are locked after training converges; the complete neural network model comprising the speech enhancement network and the speaker recognition network is then trained end-to-end.
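A minimal PyTorch sketch of this two-stage scheme, assuming placeholder architectures (the module definitions, layer sizes and learning rate below are illustrative assumptions, not taken from the patent): pre-train the speaker recognition network, lock its parameters, then train the combined model end-to-end.

```python
import torch
import torch.nn as nn

N, C = 2, 1251  # parts per speaker and number of speakers (example values)

# Stand-in modules; the patent does not fix the exact architectures at this point.
recognizer = nn.Sequential(              # speaker recognition network (placeholder)
    nn.Conv1d(257, 64, kernel_size=5), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(64, N * C),
)
enhancer = nn.Sequential(                # speech enhancement network (placeholder)
    nn.Conv1d(257, 257, kernel_size=1), nn.Sigmoid(),
)

class Combined(nn.Module):
    """Enhancer produces a ratio mask; the masked spectrum goes to the recognizer."""
    def __init__(self, enh, rec):
        super().__init__()
        self.enh, self.rec = enh, rec
    def forward(self, x):                # x: (batch, 257, frames) noisy spectrum
        return self.rec(self.enh(x) * x)

# Stage 1 (assumed already done): recognizer trained on clean speech, then locked.
for p in recognizer.parameters():
    p.requires_grad = False

# Stage 2: end-to-end training on noisy speech; only the enhancer is updated.
model = Combined(enhancer, recognizer)
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```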
Further, in step 3, the specific formula of the cross-entropy loss function is:

$L = -\sum_{i=1}^{N \times C} p_i \log(q_i)$

where C is the total number of speakers; $p_i$ is the ground-truth indicator determined from the speech data according to its label $y_i$, i.e., exactly one of the N × C classification positions equals 1 and all other positions are zero; and $q_i$ is the probability predicted by the system at each classification position.
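For illustration, a minimal PyTorch computation of this loss over N × C classification positions (the batch size and label values below are arbitrary examples):

```python
import torch
import torch.nn.functional as F

N, C = 2, 1251                            # parts per speaker, number of speakers (example values)
logits = torch.randn(8, N * C)            # model outputs for a batch of 8 speech segments
labels = torch.randint(0, N * C, (8,))    # integer labels produced by the relabelling of step 1

# F.cross_entropy applies log-softmax internally, i.e. q_i = softmax(logits)_i,
# while p_i is the one-hot vector implied by each integer label.
loss = F.cross_entropy(logits, labels)
print(loss.item())
```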
The invention replaces the one-to-one correspondence between speaker speech and speaker labels used in previous deep-learning-based speaker recognition methods with a multi-label scheme. The method can further improve the speaker recognition performance of the model in both clean and noisy environments.
Drawings
FIG. 1 is a schematic diagram of the neural network in an embodiment of the present invention, where N is the number of parts into which the training data set is divided.
Fig. 2 is a specific structure of a speech enhancement network in an embodiment of the present invention.
FIG. 3 is a flow chart of the text-independent multi-label speaker recognition method based on deep learning according to the invention.
FIG. 4 is a flow chart of the specific algorithm of the method of the present invention, where n1 and n2 denote the numbers of utterances in the training set and the test set, respectively; D is the training set, T is the test set, and C is the number of speakers.
FIG. 5 is a comparison of a prior art deep learning based speaker recognition scheme with the method of the present invention for clean speech and for various types of noisy speech.
FIG. 6 is a line graph of the comparison of a prior art deep learning based speaker recognition scheme with the method of the present invention for clean speech and various types of noisy speech.
Detailed Description
The text-independent multi-label speaker identification method based on deep learning mainly comprises the following parts:
1. segmenting a training data set
1) Defining a training data set
$D = \{(x_1, y_1), \ldots, (x_{n_1}, y_{n_1})\}$,  (1)
where D is the training set, x and y are the speech and the corresponding original label, respectively, and $n_1$ is the number of utterances in the training set;
2) training set speech data labels
Evenly divide the speech of each speaker in the training data set into N parts and assign a different label to each part, so that the number of labels in the whole training data set is N times the number of speakers. This can be expressed as
$\tilde{y}_i = y_i + mC, \quad m = 0, 1, \ldots, N-1$,  (2)
where $y_i$ is the label of an utterance in the training set in the initial state, $\tilde{y}_i$ is its modified label, C is the total number of speakers, and m, taking values from 0 to N-1, indexes the part of that speaker's speech to which the utterance belongs.
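A short Python sketch of one way to carry out this relabelling, assuming the utterances of each speaker are simply split in order into N consecutive parts (the helper name and data layout are illustrative assumptions, not from the patent):

```python
from collections import defaultdict

def relabel(dataset, num_speakers, n_parts):
    """dataset: list of (utterance, speaker_label) pairs with labels in [0, num_speakers).
    Returns a new list whose labels lie in [0, n_parts * num_speakers), per formula (2)."""
    by_speaker = defaultdict(list)
    for utt, spk in dataset:
        by_speaker[spk].append(utt)

    relabeled = []
    for spk, utts in by_speaker.items():
        part_len = max(len(utts) // n_parts, 1)              # split each speaker's speech evenly
        for i, utt in enumerate(utts):
            m = min(i // part_len, n_parts - 1)              # part index m in [0, n_parts)
            relabeled.append((utt, spk + m * num_speakers))  # new label y + m*C
    return relabeled

# Toy usage: 2 speakers with 4 utterances each, N = 2 parts.
data = [(f"utt{spk}_{i}", spk) for spk in range(2) for i in range(4)]
print(relabel(data, num_speakers=2, n_parts=2))
```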
2. Constructing neural networks
Construct a corresponding neural network model comprising a speech enhancement network and a speaker recognition network, ensuring that the dimension of the output layer is consistent with the total number of labels in the training data set. The speech enhancement network enhances noisy speech, and the enhanced speech is then fed into the speaker recognition network for recognition. With this structure, the recognition performance of the model on clean speech, noisy speech and enhanced noisy speech can be compared, from which the recognition performance and robustness of the model are obtained.
3. Constructing noisy speech data set
Mix the data set (regarded here as a clean speech data set) with different types of noise data at a given signal-to-noise ratio to obtain a noisy speech data set.
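A minimal numpy sketch of this mixing step at a target signal-to-noise ratio (the helper name, the synthetic signals and the 16 kHz sampling rate are illustrative assumptions; the 10 dB value matches the embodiment described later):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    noise = np.resize(noise, clean.shape)              # repeat or trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Example: 1 s of a synthetic "clean" signal mixed with white noise at 10 dB SNR.
fs = 16000
clean = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)
noisy = mix_at_snr(clean, np.random.randn(fs), snr_db=10.0)
```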
4. Training
1) Pre-training
First, the speaker recognition network is trained with clean speech: the training data are input into the neural network, the output-layer result is compared with the label corresponding to the data, and the cross-entropy loss function is computed, so that supervised training is performed; the parameters of the speaker recognition network are locked after training converges. Because the number of categories distinguished by the constructed neural network is N times the number of speakers in the data set, the number of classification positions involved in computing the cross-entropy loss function correspondingly becomes N times that of the initial state, specifically:
$L = -\sum_{i=1}^{N \times C} p_i \log(q_i)$,  (3)
where N is the number of parts into which the training data set is divided, C is the total number of speakers, $p_i$ is the ground-truth indicator determined from the modified label $\tilde{y}_i$, i.e., exactly one of the N × C classification positions equals 1 and all other positions are zero, and $q_i$ is the probability predicted by the system at each classification position.
2) Training
A speech enhancement network is introduced, and a neural network model comprising the speech enhancement network and the parameter-locked speaker recognition network is trained end-to-end using noisy speech data. The speech enhancement network uses the sigmoid function to compute a ratio mask, which is multiplied element-wise by the input speech data to achieve the enhancement effect. The sigmoid function can be expressed as:
$\sigma(z) = \dfrac{1}{1 + e^{-z}}$,  (4)
where z is the value at any point of the tensor output by the speech enhancement network.
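A short PyTorch illustration of the masking in equation (4), assuming the enhancement network outputs one value per time-frequency point (the shapes follow the 257-bin, 298-frame inputs of the embodiment; the random tensors are placeholders):

```python
import torch

# Noisy magnitude spectrum: (batch, 257 frequency bins, 298 frames).
noisy_spec = torch.rand(4, 257, 298)

# Raw (pre-activation) output of the enhancement network, same shape as the input.
enhancer_out = torch.randn_like(noisy_spec)

ratio_mask = torch.sigmoid(enhancer_out)      # equation (4): values in (0, 1)
enhanced_spec = ratio_mask * noisy_spec       # element-wise masking of the noisy input
```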
5. Testing
1) Defining a test data set
$T = \{(x_1, y_1), \ldots, (x_{n_2}, y_{n_2})\}$,  (5)
where T is the test set, x and y are the speech and the corresponding original label, respectively, and $n_2$ is the number of utterances in the test set.
2) Presetting valid identification label
N labels regarded as valid identifications are preset for each speaker according to formula (2).
3) Clean speech recognition
Clean test speech is input directly into the speaker recognition network without passing through the speech enhancement network; the result produced by the network is compared with the preset valid identification labels, and the identification is correct if the result matches one of them and incorrect otherwise, from which the clean-speech recognition rate is obtained.
4) Noisy speech recognition
Noisy test speech is input directly into the speaker recognition network without passing through the speech enhancement network; the result produced by the network is compared with the preset valid identification labels, and the identification is correct if the result matches one of them and incorrect otherwise, from which the noisy-speech recognition rate is obtained.
5) Enhanced speech recognition
Noisy test speech is first input into the speech enhancement network, and the enhanced speech is then input into the speaker recognition network; the obtained result is compared with the preset valid identification labels, and the identification is correct if the result matches one of them and incorrect otherwise, from which the recognition rate on enhanced noisy speech is obtained.
The system identification performance under the three conditions 3), 4) and 5) is compared to check whether the results meet expectations: the recognition rate on clean speech should be the highest, the recognition rate on noisy speech the lowest, and the recognition rate on enhanced noisy speech in between.
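The decision rule shared by the three test conditions above can be sketched in Python as follows (the valid-label set follows formula (2); the function names and toy numbers are illustrative):

```python
def valid_labels(speaker_id, num_speakers, n_parts):
    """The N labels regarded as valid identifications for one speaker, per formula (2)."""
    return {speaker_id + m * num_speakers for m in range(n_parts)}

def accuracy(predictions, speaker_ids, num_speakers, n_parts):
    """Top-1 accuracy: a prediction is correct if it falls in the speaker's valid-label set."""
    correct = sum(
        pred in valid_labels(spk, num_speakers, n_parts)
        for pred, spk in zip(predictions, speaker_ids)
    )
    return correct / len(predictions)

# Toy usage with C = 3 speakers and N = 2 parts.
print(accuracy(predictions=[0, 4, 2], speaker_ids=[0, 1, 1], num_speakers=3, n_parts=2))
```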
In this way, the method builds on the existing deep-learning-based text-independent speaker identification method, replacing the previous one-to-one correspondence between speaker speech and speaker labels with a multi-label scheme.
Examples
The technical scheme in the embodiment of the invention is clearly and completely described below with reference to the accompanying drawings.
1. Test sample and objective evaluation criteria
The clean speech samples of this example are taken from the VoxCeleb1 data set, which was extracted from YouTube videos and contains 1251 speakers and approximately 150000 speech segments with an average length of 7.8 seconds. To test the robustness of the model, this embodiment uses the NOISEX-92 data set. White, Babble, Leopard, Volvo, Factory, Tank and Gun noises are selected from NOISEX-92 and mixed with the clean speech at a signal-to-noise ratio of 10 dB to obtain noisy speech for training and for testing the performance of the system in noisy environments. When mixing the noise data with the clean speech, the noise data are first divided into two parts, one mixed with the training-set data and the other with the test-set data, so that the training set and the test set never use the same noise data.
The invention adopts the Accuracy (Top-1 Accuracy) score as the objective evaluation criterion. When computing the Accuracy of the multi-label speaker identification method, an output is counted as a valid identification as long as it matches one of the preset valid identification labels, which satisfies the requirement of Top-1 Accuracy.
2. Parameter setting
1) Speaker recognition network
Referring to FIG. 1, the speaker recognition network used in this embodiment consists of four one-dimensional convolutional layers and a fully connected layer (dimension 1500). The kernel sizes of the one-dimensional convolutional layers are 5, 7, 1 and the strides are 1, 2, 1, respectively; a global average pooling layer is inserted between the last convolutional layer and the fully connected layer. The 257-dimensional spectrum, obtained with a window length of 25 ms and a frame shift of 10 ms, is used directly as the input. This example does not normalize the input data; instead, the magnitude spectrum is raised to the power of 0.3. Training uses fixed-length segments of 298 frames (257 frequency bins per frame) as the input for each piece of speech.
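A sketch of this feature extraction in PyTorch (the 16 kHz sampling rate, the Hann window and the synthetic signal are assumptions; the 25 ms window, 10 ms shift, 257-point magnitude spectrum, 0.3 exponent and 298-frame segment length come from the text):

```python
import torch

fs = 16000                                                  # assumed sampling rate
win, hop, n_fft = int(0.025 * fs), int(0.010 * fs), 512     # 25 ms window, 10 ms shift

signal = torch.randn(fs * 4)                                # stand-in for a 4 s utterance

spec = torch.stft(
    signal, n_fft=n_fft, hop_length=hop, win_length=win,
    window=torch.hann_window(win), return_complex=True,
)                                                           # (257, num_frames) complex spectrum

features = spec.abs() ** 0.3                                # magnitude spectrum raised to the power 0.3
segment = features[:, :298]                                 # fixed-length input of 298 frames
print(segment.shape)                                        # torch.Size([257, 298])
```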
2) Speech enhancement network
The enhancement model consists of 11 dilated convolutional layers; its specific structure is shown in FIG. 2. From the output of the final convolutional layer, a ratio mask of the same size is generated using the sigmoid function and multiplied by the original input to achieve speech enhancement. This embodiment uses ReLU as the non-linear activation. The final prediction result is obtained after the output layer passes through a softmax.
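A hedged PyTorch sketch of such a dilated-convolution mask estimator (the channel count, kernel size and dilation schedule are assumptions for illustration only; the 11-layer depth, ReLU activations and sigmoid ratio mask follow the text):

```python
import torch
import torch.nn as nn

class DilatedMaskNet(nn.Module):
    """11 dilated 1-D convolutions producing a sigmoid ratio mask (illustrative sketch)."""
    def __init__(self, bins=257, channels=128):
        super().__init__()
        layers, in_ch = [], bins
        for i in range(10):                       # 10 hidden dilated conv layers
            d = 2 ** (i % 5)                      # assumed dilation schedule: 1, 2, 4, 8, 16, ...
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3, dilation=d, padding=d),
                       nn.ReLU()]
            in_ch = channels
        layers += [nn.Conv1d(in_ch, bins, kernel_size=1)]   # 11th layer: back to 257 bins
        self.net = nn.Sequential(*layers)

    def forward(self, noisy_spec):                # (batch, 257, frames)
        mask = torch.sigmoid(self.net(noisy_spec))
        return mask * noisy_spec                  # enhanced spectrum

enhanced = DilatedMaskNet()(torch.rand(2, 257, 298))
print(enhanced.shape)                             # torch.Size([2, 257, 298])
```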
3) Data set composition
The original data set is partitioned into a training set and a test set; both sets contain the same speakers, and the proportion between training-set and test-set sizes is kept consistent across the embodiments. In this example, the test set and the training set are constructed at a ratio of 1:3. Before the algorithm starts, the training data are sorted by the corresponding speaker ID, i.e., utterances belonging to the same speaker are grouped together.
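An illustrative helper for such a per-speaker split at a 1:3 test-to-train ratio (the function and the toy data are assumptions, not part of the patent):

```python
from collections import defaultdict

def split_per_speaker(dataset, test_ratio=0.25):
    """Split (utterance, speaker) pairs so every speaker appears in both sets,
    with roughly test:train = 1:3 per speaker."""
    by_speaker = defaultdict(list)
    for utt, spk in dataset:
        by_speaker[spk].append(utt)              # group utterances by speaker ID

    train, test = [], []
    for spk, utts in sorted(by_speaker.items()):
        n_test = max(1, int(len(utts) * test_ratio))
        test += [(u, spk) for u in utts[:n_test]]
        train += [(u, spk) for u in utts[n_test:]]
    return train, test

data = [(f"utt{spk}_{i}", spk) for spk in range(3) for i in range(8)]
train_set, test_set = split_per_speaker(data)
```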
4) Setting of the value of N
On a data set of limited size, if N is too large, the data in each training subset become too sparse, which ultimately degrades the overall recognition performance of the system. Therefore, only the cases N = 2 and N = 3 are tested in this embodiment.
3. Concrete implementation process of method
1) Referring to FIG. 3 and FIG. 4, the algorithm is initialized according to equations (1) and (5) and the parameter settings described above. Training-data and test-data buffers are established for caching the data used in speaker recognition, and a speaker-label buffer is established for caching the labels used in training and testing. At any moment of model training, the following computations are performed: a new piece of speech data is acquired, windowed and transformed by the short-time Fourier transform to obtain a group of data containing 298 frames with 257 frequency points per frame; the magnitude spectrum is raised to the power of 0.3, and the data buffer is updated.
2) Referring to the labels newly set in step 4 of FIG. 4, the label corresponding to the speech data is obtained and the label buffer is updated.
3) The data are input into the neural network, the output result is compared with the labels in the label buffer, the cross-entropy loss function is computed, and the model parameters are optimized by back-propagation.
For any moment of the model test, the following calculation is performed:
1) A new piece of speech data is acquired, windowed and transformed by the short-time Fourier transform to obtain a group of data containing 298 frames with 257 frequency points per frame; the magnitude spectrum is raised to the power of 0.3, and the data buffer is updated.
2) The newly set label corresponding to the speech data is acquired, and the label buffer is updated.
3) The data are input into the neural network; referring to steps 11 to 13 in FIG. 4, the output result is compared with the labels in the label buffer to judge whether it matches one of them:
31) if yes, the identification is correct;
32) if no, the identification is incorrect;
4) Referring to steps 13 to 17 of FIG. 4, the speaker recognition rate of the model on the test set is obtained.
To demonstrate the speaker recognition performance in clean and noisy environments, this embodiment compares existing deep-learning-based speaker recognition with the method of the present invention. FIG. 5 shows the Accuracy scores of the existing deep-learning-based speaker recognition and of the method of the present invention in a clean environment and in various types of noisy environments before and after enhancement, and FIG. 6 shows a line graph of the same results.
In FIG. 5, D denotes training with clean speech and D_N denotes training with noisy speech. "Baseline" indicates the existing scheme, while "Proposed (N = 3)" and "Proposed (N = 2)" indicate the method of the invention with N = 3 and N = 2, respectively. The results show that the invention performs better than the existing scheme on clean speech and on all types of enhanced noisy speech. Moreover, the recognition rate on enhanced noisy speech is clearly higher than before enhancement, and the enhancement effect of the invention is more pronounced.
In FIG. 6, "Baseline (original)", "Proposed (N = 3) (original)" and "Proposed (N = 2) (original)" denote the speaker recognition rates on clean speech and on noisy speech without speech enhancement, while "Baseline (enhanced)", "Proposed (N = 3) (enhanced)" and "Proposed (N = 2) (enhanced)" denote the speaker recognition rates on noisy speech after speech enhancement. The results show that, for noisy speech without enhancement, the method of the invention performs almost identically for N = 2 and N = 3; for enhanced noisy speech and clean speech, the method performs better with N = 2 and is clearly superior to the existing scheme.
As can be seen from the results shown in FIG. 5 and FIG. 6, the deep-learning-based speaker recognition method of the invention can further improve the recognition performance of the model in both clean and noisy environments.

Claims (3)

1. A text-independent multi-label speaker recognition method based on deep learning, characterized by comprising the following steps:
Step 1: evenly divide the speech of each speaker in a training data set into N parts and assign a different label to each part, so that the number of labels in the whole training data set is N times the number of speakers, where N ≥ 2;
Step 2: construct a neural network model whose output-layer dimension is consistent with the total number of labels in the training data set;
Step 3: input the training data set of step 1 into the neural network model of step 2, compare the output result with the label corresponding to the speech data, and compute the cross-entropy loss function, thereby performing supervised training;
Step 4: for the test set, preset for the speech data of each speaker the N labels regarded as valid identifications, according to the correspondence between training-set speech and labels established in step 1; input the test data into the neural network model and compare the label predicted by the model with the N preset valid labels; the identification is correct as long as the predicted label is one of the N valid labels, and the speaker identification rate of the model is thereby obtained.
2. The text-independent multi-label speaker recognition method based on deep learning according to claim 1, wherein the neural network model comprises a speech enhancement network and a speaker recognition network, the speech enhancement network being used to enhance noisy speech and improve the robustness of the neural network model; and wherein, in step 3, the speaker recognition network is trained in advance and its parameters are locked after training converges, and the complete neural network model comprising the speech enhancement network and the speaker recognition network is then trained end-to-end.
3. The text-independent multi-label speaker recognition method based on deep learning according to claim 1, wherein, in step 3, the specific formula of the cross-entropy loss function is:
$L = -\sum_{i=1}^{N \times C} p_i \log(q_i)$
where C is the total number of speakers; $p_i$ is the ground-truth indicator determined from the speech data according to its label $y_i$, i.e., exactly one of the N × C classification positions equals 1 and all other positions are zero; and $q_i$ is the probability predicted by the system at each classification position.
CN202010563201.8A 2020-06-19 2020-06-19 Text-independent multi-label speaker recognition method based on deep learning Active CN111667836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010563201.8A CN111667836B (en) 2020-06-19 2020-06-19 Text-independent multi-label speaker recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010563201.8A CN111667836B (en) 2020-06-19 2020-06-19 Text-independent multi-label speaker recognition method based on deep learning

Publications (2)

Publication Number Publication Date
CN111667836A true CN111667836A (en) 2020-09-15
CN111667836B CN111667836B (en) 2023-05-05

Family

ID=72388943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010563201.8A Active CN111667836B (en) 2020-06-19 2020-06-19 Text-independent multi-label speaker recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN111667836B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066507A (en) * 2021-03-15 2021-07-02 上海明略人工智能(集团)有限公司 End-to-end speaker separation method, system and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049629A (en) * 2011-10-17 2013-04-17 阿里巴巴集团控股有限公司 Method and device for detecting noise data
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Speaker recognition method based on convolutional neural network and spectrogram
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Text-independent speaker recognition method and system based on three-dimensional convolutional neural network
US10210860B1 (en) * 2018-07-27 2019-02-19 Deepgram, Inc. Augmented generalized deep learning with special vocabulary
CN109599117A (en) * 2018-11-14 2019-04-09 厦门快商通信息技术有限公司 Audio data recognition method and human-voice anti-replay identification system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049629A (en) * 2011-10-17 2013-04-17 阿里巴巴集团控股有限公司 Method and device for detecting noise data
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Speaker recognition method based on convolutional neural network and spectrogram
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Text-independent speaker recognition method and system based on three-dimensional convolutional neural network
US10210860B1 (en) * 2018-07-27 2019-02-19 Deepgram, Inc. Augmented generalized deep learning with special vocabulary
CN109599117A (en) * 2018-11-14 2019-04-09 厦门快商通信息技术有限公司 Audio data recognition method and human-voice anti-replay identification system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
林舒都; 邵曦: "Speaker recognition based on i-vector and deep learning" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066507A (en) * 2021-03-15 2021-07-02 上海明略人工智能(集团)有限公司 End-to-end speaker separation method, system and equipment
CN113066507B (en) * 2021-03-15 2024-04-19 上海明略人工智能(集团)有限公司 End-to-end speaker separation method, system and equipment

Also Published As

Publication number Publication date
CN111667836B (en) 2023-05-05

Similar Documents

Publication Publication Date Title
Ding et al. Autospeech: Neural architecture search for speaker recognition
US9595257B2 (en) Downsampling schemes in a hierarchical neural network structure for phoneme recognition
CN103824557B (en) A kind of audio detection sorting technique with custom feature
Xue et al. Online end-to-end neural diarization with speaker-tracing buffer
Kim et al. Environmental noise embeddings for robust speech recognition
US11100932B2 (en) Robust start-end point detection algorithm using neural network
JP2000099080A (en) Voice recognizing method using evaluation of reliability scale
US7617101B2 (en) Method and system for utterance verification
CN109192200A (en) A kind of audio recognition method
Mun et al. The sound of my voice: Speaker representation loss for target voice separation
CN110751278A (en) Neural network bit quantization method and system
Fan et al. Utterance-level permutation invariant training with discriminative learning for single channel speech separation
CN111667836B (en) Text-independent multi-label speaker recognition method based on deep learning
WO1995034064A1 (en) Speech-recognition system utilizing neural networks and method of using same
KR102429656B1 (en) A speaker embedding extraction method and system for automatic speech recognition based pooling method for speaker recognition, and recording medium therefor
Zhang et al. Deep Template Matching for Small-Footprint and Configurable Keyword Spotting.
WO2016152132A1 (en) Speech processing device, speech processing system, speech processing method, and recording medium
Matsui et al. N-best-based unsupervised speaker adaptation for speech recognition
Yu et al. Bayesian adaptive inference and adaptive training
Li et al. Neural discriminant analysis for deep speaker embedding
Reshma et al. A survey on speech emotion recognition
Łopatka et al. State sequence pooling training of acoustic models for keyword spotting
US7912715B2 (en) Determining distortion measures in a pattern recognition process
Lin et al. Exploiting polynomial-fit histogram equalization and temporal average for robust speech recognition.
Sun et al. Combination of sparse classification and multilayer perceptron for noise-robust ASR

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant