CN111667836B - Text-independent multi-label speaker recognition method based on deep learning - Google Patents

Text-independent multi-label speaker recognition method based on deep learning

Info

Publication number
CN111667836B
Authority
CN
China
Prior art keywords
voice
marks
speaker
training
speaker recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010563201.8A
Other languages
Chinese (zh)
Other versions
CN111667836A (en)
Inventor
邓克琦
卢晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202010563201.8A
Publication of CN111667836A
Application granted
Publication of CN111667836B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/18: Artificial neural networks; Connectionist approaches
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Abstract

The invention discloses a text-independent multi-label speaker recognition method based on deep learning. The method comprises the following steps: (1) dividing the voice of each speaker in the training data set into N parts and assigning a different label to each part; (2) constructing a corresponding neural network model whose output-layer dimension equals the number of labels in the training data set; (3) inputting the training data into the neural network, comparing the output-layer result with the label corresponding to the data, and computing a cross entropy loss function for training; (4) presetting, for the voice data of each speaker, the N labels regarded as valid recognition according to the correspondence established in step (1), inputting the test data into the neural network, and comparing the label predicted by the model with the preset N labels; the prediction is judged correct as long as it matches any one of them. The method can effectively improve the speaker recognition performance of the model in both clean and noisy environments.

Description

Text-independent multi-label speaker recognition method based on deep learning
Technical Field
The invention relates to a text-independent multi-label speaker recognition method based on deep learning.
Background
Speaker recognition, also known as voiceprint recognition, aims to confirm the identity of a speaker from speech characteristics. Speaker recognition comprises two tasks, speaker identification and speaker verification: speaker identification determines, after processing and analyzing a speaker's voice, whether the speaker belongs to a recorded set of speakers; speaker verification further confirms whether the speaker corresponding to an input voice is the claimed target speaker.
The i-vector method may be used to implement speaker recognition (N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-End Factor Analysis for Speaker Verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, May 2011). The literature (D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, and Y. Carmiel, "Deep Neural Network Embeddings for Text Independent Speaker Verification," in Interspeech, 2017, pp. 999-1003) shows that deep learning methods can surpass the conventional i-vector method when large-scale data are used, particularly after data augmentation. Speaker recognition in noisy environments, however, remains a challenging problem.
A denoising autoencoder (DAE) may be used to generate enhanced speech from noisy speech to improve speaker recognition performance in noisy scenes (O. Plchot, L. Burget, H. Aronowitz, and P. Matějka, "Audio enhancing with DNN autoencoder for speaker recognition," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 2016, pp. 5090-5094). However, because this method uses an L2 loss function and performs speech enhancement first and speaker recognition afterwards, the enhancement and recognition stages are mismatched, and a high recognition rate on noisy speech cannot be obtained. The literature (Suwon Shon, Hao Tang, and James Glass, "VoiceID Loss: Speech Enhancement for Speaker Verification," in Interspeech, 2019, pp. 2888-2892) uses an end-to-end structure for speech enhancement and speaker recognition, but there is still room for improvement in the overall performance of the algorithm.
When deep learning is used for speaker recognition, large-scale data are required for training; however, as the amount of data grows further, the improvement in the model's discrimination ability slows down. The one-to-one correspondence between each speaker and the labels of the corresponding voice data is not the most efficient way to use the data.
Disclosure of Invention
In existing deep-learning-based speaker recognition training strategies, each speaker corresponds one-to-one to the labels of the corresponding voice data, which does not exploit the data to the greatest extent. The invention provides a text-independent multi-label speaker recognition method based on deep learning, which further improves the discrimination ability of the model and its recognition performance in both clean and noisy environments.
The technical scheme adopted by the invention is as follows:
the text irrelevant multi-label speaker recognition method based on deep learning comprises the following steps:
step 1, equally dividing the voice of each speaker in the training data set into N parts, and marking different marks on each part of voice so that the number of marks of the whole training data set is N times of the number of the speakers, wherein N is more than or equal to 2;
step 2, constructing a neural network model, wherein the dimension of an output layer of the model is consistent with the total number of the training data set labels;
step 3, inputting the training data set in the step 1 into the neural network model in the step 2, comparing the output result with the label corresponding to the voice data, and solving a cross entropy loss function so as to perform supervised training;
and 4, presetting N marks which are regarded as effective identification for the voice data of each speaker according to the corresponding relation between the voice of the training data set and the marks in the step 1 by the data of the test set, inputting the data of the test data set into the neural network model, comparing the predicted marks of the model with the N marks which are regarded as effective identification and are set before, and obtaining the speaker identification rate of the model as long as the predicted marks are one of the N marks which are regarded as effective identification, namely correct identification.
Further, the neural network model comprises a speech enhancement network and a speaker recognition network; the speech enhancement network enhances noisy speech and improves the robustness of the neural network model. In step 3, the speaker recognition network is pre-trained first, its parameters are locked after training converges, and then the complete neural network model, comprising the speech enhancement network and the speaker recognition network, is trained end to end.
Further, in step 3, the cross entropy loss function is:

Loss = -\sum_{i=1}^{N \times C} p_i \log q_i

where C is the total number of speakers; p_i is the ground-truth indicator determined by the voice data and its label y_i, i.e. exactly one of the N × C classification positions is 1 and all the others are zero; and q_i is the probability predicted by the system at each classification position.
The invention departs from previous deep-learning-based speaker recognition methods, in which each speaker's voice corresponds to a single speaker label, and instead adopts a multi-label scheme. The method can further improve the speaker recognition performance of the model in both clean and noisy environments.
Drawings
FIG. 1 is a schematic diagram of the neural network in an embodiment of the present invention, where N is the number of parts into which the training data set is divided.
FIG. 2 shows the specific structure of the speech enhancement network in an embodiment of the present invention.
FIG. 3 is a flow chart of the text-independent multi-label speaker recognition method based on deep learning of the present invention.
FIG. 4 is a flowchart of the specific algorithm of the method of the present invention, where n1 and n2 denote the total numbers of utterances in the training set and test set, respectively; D is the training set, T is the test set, and C is the number of speakers.
FIG. 5 is a comparison of a prior deep learning-based speaker recognition scheme with the method of the present invention for clean speech and various types of noisy speech.
FIG. 6 is a line graph of the comparison of the prior deep learning-based speaker recognition scheme with the inventive method for clean speech and various types of noisy speech.
Detailed Description
The text-independent multi-label speaker recognition method based on deep learning mainly comprises the following parts:
1. Segmenting the training data set
1) Define the training data set

D = \{(x_1, y_1), \ldots, (x_{n_1}, y_{n_1})\},  (1)

where D is the training set, x and y are the voice and its original label respectively, and n_1 is the total number of training utterances;
2) Relabel the training set voice data
The voice of each speaker in the training data set is divided into N parts, and a different label is assigned to each part, so that the number of labels of the whole training data set is N times the number of speakers. The new labels can be expressed as:

\tilde{y}_i = y_i + mC,  m = 0, 1, \ldots, N-1,  (2)

where y_i is the original label of a training utterance, \tilde{y}_i is its modified label, C is the total number of speakers, and m is an integer between 0 and N-1.
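For illustration, a minimal sketch of this relabelling step is given below, assuming the mapping \tilde{y}_i = y_i + mC of equation (2) with 0-indexed speaker labels; the function name and the equal-split strategy are illustrative, not taken from the patent.

```python
import random
from collections import defaultdict

def relabel_training_set(dataset, num_speakers, n_parts):
    """Split each speaker's utterances into n_parts groups and assign each group
    its own label: new_label = original_label + m * num_speakers.
    `dataset` is a list of (utterance, speaker_label) pairs with labels in [0, num_speakers)."""
    by_speaker = defaultdict(list)
    for utt, label in dataset:
        by_speaker[label].append(utt)

    relabelled = []
    for label, utts in by_speaker.items():
        random.shuffle(utts)                  # equal split of each speaker's data into N parts
        for idx, utt in enumerate(utts):
            m = idx % n_parts                 # m in {0, ..., N-1}
            relabelled.append((utt, label + m * num_speakers))
    return relabelled
```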
2. Constructing the neural network
A corresponding neural network model is constructed, comprising a speech enhancement network and a speaker recognition network, with the output-layer dimension equal to the total number of labels in the training data set. The speech enhancement network enhances noisy speech, so the recognition performance on enhanced noisy speech can be measured after passing the speech through it. With the whole network, the recognition performance of the model on clean speech, noisy speech, and enhanced noisy speech can be compared, giving both the recognition performance and the robustness of the model.
3. Constructing the noisy speech data set
The data set (regarded at this point as the clean speech data set) is mixed with different types of noise at a given signal-to-noise ratio to obtain the noisy speech data set.
4. Training
1) Pre-training
The speaker recognition network is first trained with clean speech: the training data are input into the neural network, the output-layer result is compared with the label corresponding to the data, and a cross entropy loss function is computed for supervised training; after convergence, the parameters of the speaker recognition network are locked. Because the number of classes distinguished by the constructed network is N times the number of speakers in the data set, the number of classification positions involved in computing the cross entropy loss changes accordingly to N times that of the initial state. The specific formula is:
Loss = -\sum_{i=1}^{N \times C} p_i \log q_i,  (3)

where N is the number of parts into which the training data set is divided, C is the total number of speakers, p_i is the ground-truth indicator determined by the modified label \tilde{y}_i described above, i.e. exactly one of the N × C classification positions is 1 and all the others are zero, and q_i is the probability predicted by the system at each classification position.
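For clarity, a short sketch of how this loss can be computed in practice is shown below. PyTorch is assumed purely for illustration (the patent does not name a framework); the network's output layer has N×C units and the relabelled index \tilde{y}_i plays the role of the one-hot target p.

```python
import torch
import torch.nn as nn

N, C = 2, 1251                      # number of splits and number of speakers (example values)
criterion = nn.CrossEntropyLoss()   # combines log-softmax and negative log-likelihood

logits = torch.randn(8, N * C, requires_grad=True)   # model output for a batch of 8 utterances
targets = torch.randint(0, N * C, (8,))               # relabelled targets: index of the "1" in p

loss = criterion(logits, targets)   # equals -sum_i p_i * log(q_i), averaged over the batch
loss.backward()
```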
2) End-to-end training
The speech enhancement network is introduced, and the neural network model comprising the speech enhancement network and the parameter-locked speaker recognition network is trained end to end with noisy speech data. The speech enhancement network computes a ratio mask using the sigmoid function, and the mask is multiplied element-wise with the input speech features to achieve enhancement. The sigmoid function can be expressed as:

\sigma(z) = \frac{1}{1 + e^{-z}},  (4)
where z is the value at any point in the tensor output by the speech enhancement network.
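The following sketch illustrates this training arrangement under stated assumptions: the speaker recognition network's parameters are locked and only the enhancement network is updated through the end-to-end loss. The module definitions, shapes, and data loader below are dummy stand-ins, not the patent's architecture.

```python
import torch
import torch.nn as nn

# Dummy stand-ins; names, shapes, and layer choices are illustrative only.
enhance_net = nn.Conv1d(257, 257, kernel_size=3, padding=1)   # outputs mask logits
speaker_net = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(257, 2 * 1251))
loader = [(torch.randn(4, 257, 298), torch.randint(0, 2 * 1251, (4,)))]  # one fake batch

for p in speaker_net.parameters():            # lock the pre-trained recognizer
    p.requires_grad = False

optimizer = torch.optim.Adam(enhance_net.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for noisy_spec, target in loader:             # noisy magnitude spectra and relabelled targets
    mask = torch.sigmoid(enhance_net(noisy_spec))   # ratio mask in (0, 1), equation (4)
    enhanced = mask * noisy_spec                     # element-wise enhancement
    logits = speaker_net(enhanced)
    loss = criterion(logits, target)
    optimizer.zero_grad()
    loss.backward()                                  # gradients reach only enhance_net
    optimizer.step()
```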
5. Testing
1) Define the test data set

T = \{(x_1, y_1), \ldots, (x_{n_2}, y_{n_2})\},  (5)

where T is the test set, x and y are the voice and its original label respectively, and n_2 is the number of utterances in the test set.
2) Preset the valid recognition labels
Each speaker is preset with the N labels regarded as valid recognition according to equation (2).
3) Clean speech recognition
Clean test speech is input into the speaker recognition network without the speech enhancement network, and the network's output is compared with the preset valid recognition labels; if it matches any one of them, the recognition is correct, otherwise it is incorrect. This yields the clean speech recognition rate.
4) Noisy speech recognition
Noisy test speech is input directly into the speaker recognition network without the speech enhancement network, and the network's output is compared with the preset valid recognition labels; if it matches any one of them, the recognition is correct, otherwise it is incorrect. This yields the noisy speech recognition rate.
5) Enhanced speech recognition
Noisy speech is input into the speech enhancement network, the enhanced speech is then input into the speaker recognition network, and the result is compared with the preset valid recognition labels; if it matches any one of them, the recognition is correct, otherwise it is incorrect. This yields the enhanced noisy speech recognition rate.
The recognition performance of the system under conditions 3), 4), and 5) is compared to check whether the result meets expectations: the clean speech recognition rate should be highest, the noisy speech recognition rate lowest, and the enhanced noisy speech recognition rate in between.
In summary, the method changes the existing deep-learning-based text-independent speaker recognition approach, replacing the one-to-one correspondence between a speaker's voice and the speaker label with a multi-label scheme.
Examples
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the accompanying drawings.
1. Test sample and objective evaluation criterion
The clean speech samples of this embodiment come from the VoxCeleb1 data set, which is extracted from YouTube videos and contains 1251 speakers and approximately 150000 speech segments, with an average segment length of 7.8 seconds. To test the robustness of the model, this embodiment uses the NOISEX-92 noise data set. Seven different types of noise, namely White, Babble, Leopard (military vehicle noise), Volvo (in-car noise), Factory, Tank, and Gun (gun noise), are selected from NOISEX-92 and mixed with the clean speech at a signal-to-noise ratio of 10 dB to obtain noisy speech, which is used to train and test the performance of the system in a noisy environment. When mixing the noise with the clean speech, each noise recording is first divided into two parts, one mixed with the training set data and the other with the test set data, so that the training and test sets never share the same noise data.
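A minimal sketch of mixing noise into clean speech at a target SNR follows; the 10 dB value comes from the embodiment, while the function itself is an illustrative implementation rather than the patent's exact procedure.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db=10.0):
    """Mix a noise signal into a clean signal at the given SNR (in dB).
    Both inputs are 1-D float arrays; the noise is tiled/cropped to the clean length."""
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + scale * noise
```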
The invention adopts Accuracy (Top-1 Accuracy) as the objective evaluation criterion. When computing Accuracy for the multi-label speaker recognition method, an output is counted as a valid recognition as long as it matches any one of the preset valid recognition labels, which still satisfies the Top-1 Accuracy requirement.
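A sketch of this evaluation rule is given below, assuming the label mapping of equation (2) with 0-indexed speaker IDs: a prediction counts as correct if it falls within the set of N labels assigned to the true speaker.

```python
def multi_label_accuracy(predictions, speaker_ids, num_speakers, n_parts):
    """predictions: predicted class indices in [0, N*C); speaker_ids: true speaker
    indices in [0, C). A prediction is correct if it matches any of the speaker's N labels."""
    correct = 0
    for pred, spk in zip(predictions, speaker_ids):
        valid = {spk + m * num_speakers for m in range(n_parts)}   # assumed mapping y + m*C
        correct += int(pred in valid)
    return correct / len(predictions)
```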
2. Parameter setting
1) Speaker recognition network
Referring to FIG. 1, the speaker recognition network used in this embodiment consists of four one-dimensional convolution layers and one fully connected layer (dimension 1500). The kernel sizes of the one-dimensional convolution layers are 5, 7, and 1, and the strides are 1, 2, and 1, respectively; a global average pooling layer is inserted between the final convolution layer and the fully connected layer. The 257-dimensional spectrum is taken directly as input, using a window length of 25 ms and a frame shift of 10 ms. This embodiment does not normalize the input data; it only raises the magnitude spectrum to the power of 0.3. A fixed length of 298 frames (257 frequency bins per frame) is used as the input for one utterance during training.
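A rough PyTorch sketch of such a recognizer is shown below for orientation; the channel widths, the assignment of kernel/stride values to layers, and the placement of the output layer are assumptions, since they are not fully specified above.

```python
import torch
import torch.nn as nn

class SpeakerNet(nn.Module):
    """Four 1-D convolution layers, global average pooling, a 1500-dim fully
    connected layer, and an N*C-dim output layer (channel widths are assumed)."""
    def __init__(self, num_classes, in_bins=257):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_bins, 512, kernel_size=5, stride=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1, stride=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1, stride=1), nn.ReLU(),
        )
        self.fc = nn.Linear(1500, 1500)
        self.out = nn.Linear(1500, num_classes)

    def forward(self, x):            # x: (batch, 257 frequency bins, 298 frames)
        h = self.convs(x)
        h = h.mean(dim=-1)           # global average pooling over time
        return self.out(torch.relu(self.fc(h)))
```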
2) Speech enhancement network
The enhancement model consists of 11 dilated convolution layers; see FIG. 2 for the specific structure. The sigmoid function is applied to the output of the final convolution layer to generate a ratio mask of the same size, which is multiplied with the original input to achieve speech enhancement. This embodiment uses ReLU as the nonlinear activation. The final output layer produces the predicted result through a softmax.
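A simplified sketch of such an enhancement front end is given below; the dilation pattern, channel width, and kernel size are assumed values, since only the layer count (11 dilated convolutions) and the sigmoid ratio-mask output are stated above.

```python
import torch
import torch.nn as nn

class EnhanceNet(nn.Module):
    """Stack of 11 dilated 1-D convolutions producing a ratio mask with the same
    size as the input spectrum (dilations and channel width are assumed values)."""
    def __init__(self, bins=257, channels=128):
        super().__init__()
        layers, in_ch = [], bins
        dilations = [1, 2, 4, 8, 16, 32, 1, 2, 4, 8, 1]   # 11 layers, assumed pattern
        for i, d in enumerate(dilations):
            out_ch = bins if i == len(dilations) - 1 else channels
            layers.append(nn.Conv1d(in_ch, out_ch, kernel_size=3, dilation=d, padding=d))
            if i < len(dilations) - 1:
                layers.append(nn.ReLU())
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, noisy_spec):                 # (batch, 257, frames)
        mask = torch.sigmoid(self.net(noisy_spec)) # ratio mask in (0, 1)
        return mask * noisy_spec                   # enhanced spectrum, same shape
```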
3) Data set construction
The original data set is first partitioned into a training set and a test set; both must contain the same set of speakers, and the ratio of training set size to test set size must be kept consistent across embodiments. In this embodiment, the test set and training set are constructed at a ratio of 1:3. Before the algorithm starts, the training data are sorted by speaker ID, i.e. utterances belonging to the same speaker are grouped together.
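A brief sketch of this partitioning is shown below, assuming each speaker's utterances are split 3:1 between training and test sets so that every speaker appears in both; the helper name and ratio argument are illustrative.

```python
from collections import defaultdict

def split_by_speaker(dataset, train_ratio=0.75):
    """dataset: list of (utterance, speaker_id) pairs. Every speaker appears in
    both subsets; utterances are grouped per speaker (sorted by ID) before splitting."""
    by_speaker = defaultdict(list)
    for utt, spk in dataset:
        by_speaker[spk].append(utt)

    train, test = [], []
    for spk in sorted(by_speaker):                    # sort by speaker ID
        utts = by_speaker[spk]
        cut = int(len(utts) * train_ratio)
        train += [(u, spk) for u in utts[:cut]]
        test += [(u, spk) for u in utts[cut:]]
    return train, test
```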
4) N value setting
On a data set of limited size, if N is too large, the data in each training subset become too sparse, which ultimately degrades the overall recognition performance of the system. Therefore, this embodiment only tests the cases N = 2 and N = 3.
3. Specific implementation flow of the method
1) Referring to FIG. 3 and FIG. 4, the algorithm is initialized according to formulas (1) and (5) and the parameter settings above; a training data buffer and a test data buffer are established to cache the data used in speaker recognition, and a speaker label buffer is established to cache the labels used in training and testing. At each step of model training, the following is computed: a new piece of voice data is obtained, a windowed short-time Fourier transform is applied to it to obtain a group of data with 298 frames and 257 frequency points per frame, the magnitude spectrum is raised to the power of 0.3, and the data buffer is updated (a feature-extraction sketch is given after this list).
2) Referring to the newly assigned labels in step 4 of FIG. 4, the label corresponding to the voice data is obtained, and the label buffer is updated.
3) The data are input into the neural network, the output result is compared with the labels in the label buffer, the cross entropy loss function is computed, and the model parameters are optimized by back-propagation.
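The feature-extraction step referenced above is sketched here under the stated settings (25 ms window, 10 ms shift, 257 frequency bins, 298 frames, magnitude raised to the power 0.3); librosa and a 16 kHz sampling rate are assumptions made for illustration.

```python
import numpy as np
import librosa

def extract_features(wav, sr=16000, n_frames=298):
    """Return a (257, 298) compressed magnitude spectrogram for one utterance.
    Window 25 ms, hop 10 ms, 512-point FFT -> 257 bins; magnitude raised to 0.3."""
    spec = librosa.stft(wav, n_fft=512,
                        win_length=int(0.025 * sr),
                        hop_length=int(0.010 * sr))
    mag = np.abs(spec) ** 0.3                     # power-law compression instead of normalization
    # crop or zero-pad to a fixed 298 frames
    if mag.shape[1] >= n_frames:
        mag = mag[:, :n_frames]
    else:
        mag = np.pad(mag, ((0, 0), (0, n_frames - mag.shape[1])))
    return mag
```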
At each step of model testing, the following is computed:
1) A new piece of voice data is obtained, a windowed short-time Fourier transform is applied to it to obtain a group of data with 298 frames and 257 frequency points per frame, the magnitude spectrum is raised to the power of 0.3, and the data buffer is updated.
2) The newly assigned label corresponding to the voice data is obtained, and the label buffer is updated.
3) The data are input into the neural network; referring to steps 11 to 13 of FIG. 4, the output result is compared with the labels in the label buffer to judge whether it matches any one of them:
31) if yes, the recognition is correct;
32) if no, the recognition is incorrect;
4) Referring to steps 13 to 17 of FIG. 4, the speaker recognition rate of the model on the test set is obtained.
To demonstrate the performance of the method of the invention for speaker recognition in clean and noisy environments, this embodiment compares existing deep-learning-based speaker recognition with the method of the invention. FIG. 5 shows the Accuracy scores of the existing deep-learning-based speaker recognition and of the method of the invention, before and after enhancement, in a clean environment and in various types of noisy environments, and FIG. 6 plots the results as a line graph.
In FIG. 5, D denotes training with clean speech and D_N denotes training with noisy speech. "Baseline" denotes the existing scheme, while "Proposed (N=3)" and "Proposed (N=2)" denote the method of the invention with N set to 3 and 2, respectively. The results show that the invention outperforms the existing scheme for both clean speech and enhanced noisy speech. The recognition rate on enhanced noisy speech is clearly higher than before enhancement, and the enhancement effect of the invention is more pronounced.
In FIG. 6, "baseline (original)", "proposed (N=3) (original)", and "proposed (N=2) (original)" denote the speaker recognition rates on clean speech and noisy speech without speech enhancement, while "baseline (enhanced)", "proposed (N=3) (enhanced)", and "proposed (N=2) (enhanced)" denote the speaker recognition rates on noisy speech after speech enhancement. The results show that, on the data set used in this embodiment, the method of the invention performs similarly for N = 2 and N = 3 on un-enhanced noisy speech; for enhanced noisy speech and clean speech, the method performs better with N = 2 and is clearly superior to the existing scheme.
As can be seen from the results in FIG. 5 and FIG. 6, the multi-label speaker recognition method based on deep learning can further improve the recognition performance of the model in both clean and noisy environments.

Claims (3)

1. A text-independent multi-label speaker recognition method based on deep learning, characterized by comprising the following steps:
step 1, equally dividing the voice of each speaker in the training data set into N parts and assigning a different label to each part, so that the number of labels of the whole training data set is N times the number of speakers, where N ≥ 2;
step 2, constructing a neural network model whose output-layer dimension equals the total number of labels in the training data set;
step 3, inputting the training data set of step 1 into the neural network model of step 2, comparing the output result with the label corresponding to the voice data, and computing a cross entropy loss function for supervised training;
step 4, for the test set, presetting for each speaker's voice data the N labels regarded as valid recognition according to the correspondence between voice and labels established in step 1, inputting the test data into the neural network model, and comparing the label predicted by the model with the preset N valid labels; the prediction is a correct recognition as long as it matches any one of the N valid labels, from which the speaker recognition rate of the model is obtained.
2. The text-independent multi-label speaker recognition method based on deep learning of claim 1, wherein the neural network model comprises a speech enhancement network and a speaker recognition network, the speech enhancement network enhancing noisy speech and improving the robustness of the neural network model; in step 3, the speaker recognition network is pre-trained first, its parameters are locked after training converges, and then the complete neural network model, comprising the speech enhancement network and the speaker recognition network, is trained end to end.
3. The text-independent multi-label speaker recognition method based on deep learning of claim 1, wherein in step 3 the cross entropy loss function is:

Loss = -\sum_{i=1}^{N \times C} p_i \log q_i

where C is the total number of speakers; p_i is the ground-truth indicator determined by the voice data and its label y_i, i.e. exactly one of the N × C classification positions is 1 and all the others are zero; and q_i is the probability predicted by the system at each classification position.
CN202010563201.8A 2020-06-19 2020-06-19 Text-independent multi-label speaker recognition method based on deep learning Active CN111667836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010563201.8A CN111667836B (en) Text-independent multi-label speaker recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010563201.8A CN111667836B (en) Text-independent multi-label speaker recognition method based on deep learning

Publications (2)

Publication Number Publication Date
CN111667836A CN111667836A (en) 2020-09-15
CN111667836B true CN111667836B (en) 2023-05-05

Family

ID=72388943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010563201.8A Active CN111667836B (en) Text-independent multi-label speaker recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN111667836B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066507B (en) * 2021-03-15 2024-04-19 上海明略人工智能(集团)有限公司 End-to-end speaker separation method, system and equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049629B (en) * 2011-10-17 2016-08-10 阿里巴巴集团控股有限公司 A kind of method and device detecting noise data
CN104732978B (en) * 2015-03-12 2018-05-08 上海交通大学 The relevant method for distinguishing speek person of text based on combined depth study
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Method for distinguishing speek person based on convolutional neural networks and spectrogram
CN107464568B (en) * 2017-09-25 2020-06-30 四川长虹电器股份有限公司 Speaker identification method and system based on three-dimensional convolution neural network text independence
US10380997B1 (en) * 2018-07-27 2019-08-13 Deepgram, Inc. Deep learning internal state index-based search and classification
CN109599117A (en) * 2018-11-14 2019-04-09 厦门快商通信息技术有限公司 A kind of audio data recognition methods and human voice anti-replay identifying system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lin Shudu; Shao Xi. Speaker recognition based on i-vector and deep learning. Computer Technology and Development, No. 06. *

Also Published As

Publication number Publication date
CN111667836A (en) 2020-09-15

Similar Documents

Publication Publication Date Title
CN103824557B (en) A kind of audio detection sorting technique with custom feature
US9595257B2 (en) Downsampling schemes in a hierarchical neural network structure for phoneme recognition
KR101807948B1 (en) Ensemble of Jointly Trained Deep Neural Network-based Acoustic Models for Reverberant Speech Recognition and Method for Recognizing Speech using the same
US8301578B2 (en) System and method for tagging signals of interest in time variant data
Akbacak et al. Environmental sniffing: noise knowledge estimation for robust speech systems
US11100932B2 (en) Robust start-end point detection algorithm using neural network
KR101618512B1 (en) Gaussian mixture model based speaker recognition system and the selection method of additional training utterance
CN112735456A (en) Speech enhancement method based on DNN-CLSTM network
Mallidi et al. Autoencoder based multi-stream combination for noise robust speech recognition
Mun et al. The sound of my voice: Speaker representation loss for target voice separation
Zou et al. Improved voice activity detection based on support vector machine with high separable speech feature vectors
KR102406512B1 (en) Method and apparatus for voice recognition
CN111667836B (en) Text-independent multi-label speaker recognition method based on deep learning
Kinoshita et al. Deep mixture density network for statistical model-based feature enhancement
WO2016152132A1 (en) Speech processing device, speech processing system, speech processing method, and recording medium
Nicolson et al. Sum-product networks for robust automatic speaker identification
Matsui et al. N-best-based unsupervised speaker adaptation for speech recognition
Wang et al. Robust speech recognition from ratio masks
Reshma et al. A survey on speech emotion recognition
US7912715B2 (en) Determining distortion measures in a pattern recognition process
Techini et al. Robust Front-End Based on MVA and HEQ Post-processing for Arabic Speech Recognition Using Hidden Markov Model Toolkit (HTK)
Gu et al. Gaussian speaker embedding learning for text-independent speaker verification
Janicki et al. Improving GMM-based speaker recognition using trained voice activity detection
Morales et al. Adding noise to improve noise robustness in speech recognition.
Soni et al. Comparing front-end enhancement techniques and multiconditioned training for robust automatic speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant