AU2019101222A4 - A Speaker Recognition System Based on Deep Learning - Google Patents

A Speaker Recognition System Based on Deep Learning

Info

Publication number
AU2019101222A4
AU2019101222A4
Authority
AU
Australia
Prior art keywords
data
layer
rate
layers
convolutional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2019101222A
Inventor
Yuyao Feng
Haowei LI
Haotian WANG
Zihan Yi
Sijie ZHANG
Chao Zhu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhang Sijie Miss
Original Assignee
Zhang Sijie Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhang Sijie Miss filed Critical Zhang Sijie Miss
Priority to AU2019101222A priority Critical patent/AU2019101222A4/en
Application granted granted Critical
Publication of AU2019101222A4 publication Critical patent/AU2019101222A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Abstract

This invention lies in the field of digital signal processing. It is a speaker recognition system that identifies different speakers based on deep learning. The invention consists of the following steps. Firstly, we collect voice data from different people. Secondly, the collected data is preprocessed by extracting its Mel Frequency Cepstral Coefficients (MFCC) and randomly divided into a training set and a test set. Thirdly, we cut the training set into batches and feed them into a convolutional neural network which consists of convolutional layers, max pooling layers and fully connected layers. After repeatedly adjusting the parameters of the network, such as the learning rate, dropout rate and decay rate, the model reaches its optimal performance. Finally, the test set is also cut into batches and fed into the trained neural network. The final recognition accuracy rate is 70.23%. In brief, this invention can automatically and efficiently recognize different speakers.

Description

A Speaker Recognition System Based on Deep Learning
FIELD OF THE INVENTION
This invention is in the field of digital signal processing and identifies different speakers by deep learning.
BACKGROUND
Speech carries important biometric information and is the most effective way of human communication. With the rapid development of information technology, computers are widely used in various aspects of human social life, and speech recognition can improve work efficiency in many situations such as telephone communication, mobile payments, machine control and airport security. Making the computer recognize the identity of the speaker during human-computer interaction is therefore of great significance to public safety, information security and property security.
In the 1950s, a speech recognition system that recognized the pronunciation of 10 English digits was built at Bell Labs, marking the beginning of modern speech recognition research. With the wide application of computer technology in the field of speech recognition, a series of important research results have been obtained. The concept of Mel Frequency Cepstral Coefficients (MFCC) was proposed in the 1980s and is still an important feature parameter in speech recognition. The introduction of the artificial neural network (ANN) has injected new vitality into speech recognition technology and promoted its development.
Deep learning is a branch of machine learning research which can be understood as the development of ANN. It is essentially a method of training deep structural models and it is also an algorithm for modeling complex relationships between data through multiple layers. With the development of high-performance computing platform and big data processing technique, deep learning has become a powerful method for speech recognition. In many popular models, such as VGG, more convolutional layers and pooling layers are needed, and the processing of data is more complicated.
In this invention, TensorFlow is used to implement the deep learning framework. We first collect voice clips of different people in the environment of daily life and use MFCC to preprocess the data. After that, we randomly feed the training data set into a simpler convolutional neural network in batches. Parameters including learning rate, dropout rate and decay rate are optimized by observing the model's average accuracy and variance of the testing set until the model achieves the best performance. In addition, the model can also be used in fields such as image recognition.
SUMMARY
The aim of this project is to recognize the identity of the speaker. The technical solution of this invention is implemented as follows:
We collect voice data from different people. All data is cut into a number of segments. In order to get sufficient data, each segment is cut into smaller parts, with adjacent parts overlapping slightly. Next, we extract Mel Frequency Cepstral Coefficients (MFCC) for all segments. The output of MFCC is reshaped and appended with labels. Then all data is saved as MATLAB (.mat) files and divided into a training set and a testing set.
Before training, the .mat files are loaded into memory to do some preprocessing work, such as one-hot encoding and normalization. Both methods allow the computer to process the characteristic values more efficiently.
Then we design our convolutional network framework. Our convolutional network consists of several layers. In total there are four convolutional layers, two max pooling layers and two fully connected layers. Figure 1 shows the order of these layers.
The function of the convolutional layers is to extract features from the input. After the convolutional layers, an output with a large dimension is obtained.
The function of the max pooling layer is to decrease the dimension of the data. It cuts the feature map into several regions, takes the maximum value in each region and combines them into a new feature with a smaller dimension.
We use the Rectified Linear Unit (ReLU) in the convolutional layers and in fully connected layer 1. Using the ReLU unit can effectively alleviate the issues of vanishing gradients and overfitting.
A fully connected layer connects every node in one layer to every node in the next layer. The input matrix goes through the fully connected layer, and the activations of its nodes can thus be computed. Eventually, the classification output is calculated.
Softmax is a function that can output the probability that each classification is taken. It maps the outputs of multiple neurons to the interval (0, 1).
For parameter optimization, we utilize L2 regularization to calculate the tensor loss value and improve the recognition. Then we use dropout to drop some nodes randomly so that the overfitting phenomenon is reduced. We considered three methods to update the loss and find the minimum: the steepest descent method, Momentum, and the Adam optimization algorithm (Adam). Compared to the other two methods, Adam is more suitable for our project. Therefore, we choose Adam for optimization.
To keep the learning rate moderate, we use a decay function to reduce the learning rate at a stable decay step. In the end, we optimize our framework so that the system can be operated more easily and concisely.
When all the settings for the data, network and optimization parameters are done, all the data is loaded from the .mat files, and then the testing and training data are divided into batches of an appropriate size. After being processed by the convolution and max pooling layers mentioned above, the shape of the data is converted into the corresponding form. At the same time, the data loss and average accuracy are calculated from the training and testing sessions respectively, and the parameters are modified in the program to get the best results.
Finally, we are able to recognize the speaker at an optimal accuracy rate.
DESCRIPTION OF DRAWINGS
Figure 1 shows the data flow of our convolutional neural network.
Figure 2 is the structure of the convolutional neural network.
Figure 3 shows the data flow of convolutional layer 1 and 2.
Figure 4 shows the data flow of max pooling layer 1.
Figure 5 shows the data flow of convolutional layer 3 and 4.
Figure 6 shows the data flow of max pooling layer 2.
Figure 7 is the structure of the fully connected layers.
Figure 8 shows the principle of dropout.
Figure 9 shows the procedure of our project.
DESCRIPTION OF PREFERRED EMBODIMENT
Data Preprocessing
In this project, we collect voice data from 5 people. All data is saved as WAV files and cut into 2000 segments of 7 seconds each. The total amount of recording for each person is more than 45 minutes. In order to get sufficient data, each 7-second fragment is cut into 21 two-second parts, with adjacent parts overlapping slightly. We thereby obtain 8400 recording files for each person. All files are renamed in the form X_Y, where X identifies the speaker and Y is the serial number. This step helps us load the data more efficiently.
Next, we extract Mel Frequency Cepstral Coefficients (MFCC) for all 42000 segments. There are mainly three steps to obtain MFCC from the WAV files.
(1) The spectrum of the signal is extracted by the short-time Fourier transform.
(2) The energy spectrum is obtained by squaring the spectrum, and the effective fragments are extracted by bandpass filtering.
(3) The logarithm of the filter outputs is taken and the inverse Fourier transform is applied to get the MFCC. A simplified form of the formula is as follows:
C_n = Σ_{m=1}^{M} log X(m) · cos[π n (m - 0.5) / M],  n = 1, 2, ..., L   (1)

where L is the number of cepstral coefficients and M is the number of triangular filters.
The final output of MFCC is a matrix of shape [Zx16], where Z is related to the bit rate of the source and its value should be more than 2000. However, we only need the first [1024x16] characteristic values to get a matrix of [32x32x16] for each fragment, so the surplus parts are cut off. We append a label corresponding to each person who recorded their voice. Finally, 80% of the whole data is randomly chosen as the training set while the remaining data composes the testing set. Both sets are saved as MATLAB (.mat) files for further use in the following steps.
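As a concrete illustration of this preprocessing pipeline, the sketch below extracts 16 MFCC coefficients per frame, keeps the first 1024 frames, reshapes each fragment into a [32x32x16] block, and saves the split sets as .mat files. The patent does not name a particular toolkit, so librosa and scipy.io are assumed here, and the file list, dictionary keys and frame settings are illustrative only.

```python
# A minimal sketch of the MFCC extraction and reshaping step described above.
# librosa/scipy.io, the file names and the .mat keys are assumptions.
import numpy as np
import librosa
from scipy.io import savemat

N_MFCC = 16        # 16 cepstral coefficients per frame, as described
N_FRAMES = 1024    # keep only the first 1024 frames (Z >= 1024 assumed)

def wav_to_feature(path):
    """Load a 2-second WAV fragment and return a [32, 32, 16] feature block."""
    signal, sr = librosa.load(path, sr=None)
    # MFCC matrix of shape [16, Z]; transpose to [Z, 16] as in the description.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC).T
    mfcc = mfcc[:N_FRAMES]              # cut off the surplus frames
    return mfcc.reshape(32, 32, 16)     # 1024 x 16 -> 32 x 32 x 16

# Example: build features and labels from a (hypothetical) file list,
# split 80/20 at random, and save both sets as MATLAB .mat files.
files = [("0_1.wav", 0), ("1_1.wav", 1)]   # (file name "X_Y", speaker label X)
feats = np.stack([wav_to_feature(f) for f, _ in files])
labels = np.array([lab for _, lab in files])

idx = np.random.permutation(len(files))
split = int(0.8 * len(files))
savemat("train.mat", {"x": feats[idx[:split]], "y": labels[idx[:split]]})
savemat("test.mat",  {"x": feats[idx[split:]], "y": labels[idx[split:]]})
```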
Before training, we load the .mat files into memory to do some preprocessing work.
(1) The sets are reshaped and the labels are transformed into one-hot encoding. For example, we transform the label of the first speaker into [1,0,0,0,0] and the label of the third speaker into [0,0,1,0,0].
(2) All characteristic values are normalized to the range between -1 and 1. The range of the characteristic values is unknown at the beginning. We obtain the minimum and maximum of the characteristic values, which are -112 and 85 respectively. So we normalize all characteristic values by:
N = (C + 13.5) / 98.5   (2)

where N is the normalized value and C is the original characteristic value.
Both methods allow the computer to process the characteristic values more efficiently.
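For clarity, a minimal sketch of these two preprocessing steps is given below. It assumes the .mat files store the features under the key "x" and the integer speaker labels under the key "y"; the key names are assumptions, and the constants follow equation (2).

```python
# One-hot encoding of the five speaker labels and normalization to [-1, 1]
# using the observed extremes -112 and 85; .mat keys "x"/"y" are assumptions.
import numpy as np
from scipy.io import loadmat

NUM_SPEAKERS = 5

train = loadmat("train.mat")
features, labels = train["x"], train["y"].ravel().astype(int)

# One-hot encoding: speaker 0 -> [1,0,0,0,0], speaker 2 -> [0,0,1,0,0], ...
one_hot = np.eye(NUM_SPEAKERS)[labels]

# Normalization by equation (2): maps [-112, 85] onto [-1, 1].
features = (features + 13.5) / 98.5
```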
Network Design
Figure 1 shows the structure of our neural network. The network has in total four convolutional layers, two max pooling layers and two fully connected layers. The input passes in turn through convolutional layer 1, convolutional layer 2, max pooling layer 1, convolutional layer 3, convolutional layer 4, max pooling layer 2, fully connected layer 1 and fully connected layer 2.
The function of the convolutional layers is to extract diverse features from the input. The first convolutional layer can only extract elementary features such as edges, lines and corners. More layers of the network can extract more complex features from these low-level features through iteration.
Some parameters related to the calculation of the convolutional layer need to be declared:
(1) Filter: the convolution kernel in convolutional neural network.
(2) Depth: it controls the depth of the output unit, which is equal to the number of filters.
(3) Stride: the step size of the convolution kernel during convolution.
(4) Receptive Field: the width (height) of the convolution kernel.
(5) Zero-padding: the value of zero-padding has two situations: “Valid” means there is no padding, while “Same” means the output image is the same size as the input image. In this program we use “Same”.
The width(height) of the output data matrix can be calculated by the formula below:
W' = (W - F + 2P)/S + 1   (3)

where W is the width (height) of the input, F is the width (height) of the convolution kernel, P is the value of zero-padding and S is the step size.
After the convolutional layers, an output with a large dimension is obtained. The function of the max pooling layer is to cut the matrix into several regions, take the maximum value in each region and combine them into a new feature with a smaller dimension.
A fully connected layer connects every node in one layer to every node in the next layer. The input matrix goes through the fully connected layer, and the activations of its nodes can thus be computed. Eventually, the classification output is calculated.
We use the Rectified Linear Unit (ReLU) in the convolutional layers and in fully connected layer 1. The function of ReLU is:

ReLU(x) = max(0, x) = { x, x > 0; 0, x ≤ 0 }   (4)

Using the ReLU unit can effectively alleviate the issues of vanishing gradients and overfitting.
The specific situation of each layer in this project is introduced below.
(1) Convolutional Layer 1
The input data of convolutional layer 1 has the shape [32x32x1]. It is convoluted by a [3x3x1] convolution kernel. The convolution kernel moves through the horizontal and vertical directions of the input data. Its stride is 1 and its zero-padding is 1. Therefore, the width (height) of the output data is

(32 + 2x1 - 3)/1 + 1 = 32

The number of filters is 16, namely the depth of the output data is 16. As a result, the shape of the output data is [32x32x16].
After convolution, we input the result into the ReLU unit. After this procedure, the size of the data is still [32x32x16].

(2) Convolutional Layer 2

The input data of convolutional layer 2 has the shape [32x32x16]. It is convoluted by a [3x3x16] convolution kernel. The convolution kernel's stride is 1 and its zero-padding is 1. Therefore, same as the calculation of convolutional layer 1, the width (height) of the output data is 32. The number of filters is also 16, namely the depth of the output data is 16. As a result, the shape of the output data is [32x32x16].

After convolution, we input the result into the ReLU unit. Still, after this procedure, the size of the data is [32x32x16].

(3) Max Pooling Layer 1
In this invention, we set max pooling layer 1 with a filter of size [2x2] and a stride of 2, which downsamples the input by 2 along both width and height. Each time the max operation is performed, a 2x2 region is taken and its maximum value substitutes for the region. The depth dimension remains the same. Therefore, in this project the initial input volume has the size [32x32x16], and it is pooled with a filter of size 2 and stride 2. An output of size [16x16x16] is finally produced.
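The three layers described so far can be sketched as follows in TensorFlow 1.x style (TensorFlow is the framework named in this invention); the variable names, the weight initialization and the omission of biases are simplifications for illustration, not the exact implementation.

```python
# Convolutional layers 1-2 and max pooling layer 1 as specified above:
# 3x3 kernels, 16 filters, stride 1, "SAME" zero-padding, 2x2 pooling, stride 2.
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 32, 32, 1])   # input batch

w1 = tf.Variable(tf.truncated_normal([3, 3, 1, 16], stddev=0.1))
conv1 = tf.nn.relu(tf.nn.conv2d(x, w1, strides=[1, 1, 1, 1], padding='SAME'))
# conv1: [batch, 32, 32, 16]

w2 = tf.Variable(tf.truncated_normal([3, 3, 16, 16], stddev=0.1))
conv2 = tf.nn.relu(tf.nn.conv2d(conv1, w2, strides=[1, 1, 1, 1], padding='SAME'))
# conv2: [batch, 32, 32, 16]

pool1 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],
                       padding='SAME')
# pool1: [batch, 16, 16, 16]
```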
(4) Convolutional Layer 3
The input data of convolutional layer 3 has the shape [16x16x16]. It is convoluted by a [3x3x16] convolution kernel. The convolution kernel's stride is 1 and its zero-padding is 1. Therefore, the width (height) of the output data is

(16 + 2x1 - 3)/1 + 1 = 16

The number of filters is 16, namely the depth of the output data is 16. As a result, the shape of the output data is [16x16x16].
After convolution, we input the result into the ReLU unit. After this procedure, the size of the data is [16x16x16].

(5) Convolutional Layer 4

The input data of convolutional layer 4 has the shape [16x16x16]. It is convoluted by a [3x3x16] convolution kernel. The convolution kernel's stride is 1 and its zero-padding is 1. Therefore, same as the calculation of convolutional layer 3, the width (height) of the output data is 16. The number of filters is also 16, namely the depth of the output data is 16. As a result, the shape of the output data is [16x16x16].
After convolution, we input the result into the ReLU unit. The size of the output data is still [16x16x16].

(6) Max Pooling Layer 2

Same as max pooling layer 1, max pooling layer 2 has a filter of size [2x2] with a stride of 2. Each time the max operation is performed, it substitutes a 2x2 region with the maximum of its four values. The depth dimension remains the same. Therefore it reduces the input volume of size [16x16x16] with a filter of size 2 and stride 2, and the output size is [8x8x16].

(7) Fully Connected Layer 1
After finishing the convolutions, we rearrange our data. We reshape the data matrix from the shape [batch, 8, 8, 16] to [batch, 1024]. We take the batch size of the training set as 64 and that of the test set as 400. Therefore, for the training data the input matrix has the shape [64, 1024] and for the test data it is [400, 1024]. Accordingly, the input of fully connected layer 1 has 1024 nodes, each of which is connected to all nodes of the next layer.
Then we define the values of the weights of fully connected layer 1. We give each weight a value drawn from a normal distribution. The values of the weights are stored in a matrix with the shape [size x size x channels, number of hidden nodes], which in this program is [1024, 128]. We also define the corresponding bias. Its value is 0.1 with shape [number of hidden nodes], namely [128].

To calculate the output of fully connected layer 1, we use the formula:

y = w^T x + b   (5)

in which y is the output of fully connected layer 1, w is the weight matrix, x is the input matrix, and b is the bias. Here we choose ReLU as the activation function. From this procedure we get the output of fully connected layer 1. Its shape is [64, 128].

(8) Fully Connected Layer 2
We put the result above into fully connected layer 2 as its input. Similarly, we define the values of the weights of fully connected layer 2 according to a normal distribution. The values of the weights are stored in a matrix with the shape [number of hidden nodes, number of classes], which in this layer is [128, 5]. We also define the corresponding bias. Its value is 0.1 with shape [number of classes], namely [5].
In this layer, we use the Softmax function to calculate the output. Softmax is a function that can output the probability that each classification is taken. Its formula is:

Softmax(x_i) = exp(x_i) / Σ_{j=1}^{n} exp(x_j)   (6)

where x_i is the value of the i-th element. The output gives the probability of each label, namely the probability that the data is judged as label 0 to 4.
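A minimal sketch of the two fully connected layers and the Softmax output, in the same TensorFlow 1.x style, is shown below. It starts from a placeholder standing in for the output of max pooling layer 2, and the weight initialization is an assumption.

```python
# Fully connected layers 1-2 and Softmax, continuing from a [batch, 8, 8, 16]
# feature map; variable names and truncated-normal initialization are assumptions.
import tensorflow as tf

pool2 = tf.placeholder(tf.float32, [None, 8, 8, 16])  # output of max pooling layer 2
flat = tf.reshape(pool2, [-1, 8 * 8 * 16])            # [batch, 1024]

# Fully connected layer 1: 1024 -> 128, normally distributed weights, bias 0.1, ReLU.
w_fc1 = tf.Variable(tf.truncated_normal([1024, 128], stddev=0.1))
b_fc1 = tf.Variable(tf.constant(0.1, shape=[128]))
fc1 = tf.nn.relu(tf.matmul(flat, w_fc1) + b_fc1)      # equation (5) followed by ReLU

# Fully connected layer 2: 128 -> 5 (one logit per speaker).
w_fc2 = tf.Variable(tf.truncated_normal([128, 5], stddev=0.1))
b_fc2 = tf.Variable(tf.constant(0.1, shape=[5]))
logits = tf.matmul(fc1, w_fc2) + b_fc2

# Softmax, equation (6): probability that the input belongs to labels 0..4.
probs = tf.nn.softmax(logits)
```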
To adjust the weights to their optimal values, we use back propagation. Back propagation is achieved by calculating the gradient of the loss at each node, which for a node y = w^T x is:

∂l/∂x = w · ∂l/∂y,  ∂l/∂w = x · (∂l/∂y)^T   (7)

where y is the output of the node, x is the input of the node, w is the weight of the node, and l is the loss.
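In practice, TensorFlow computes these gradients by automatic differentiation rather than by hand-written back propagation. The short sketch below illustrates this for a single weighted node; the placeholder shapes and the cross-entropy loss are illustrative assumptions.

```python
# Gradients of the loss with respect to a node's input and weight, obtained
# automatically in TensorFlow 1.x; shapes and the loss choice are assumptions.
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1024])     # input of the node
w = tf.Variable(tf.truncated_normal([1024, 5]))  # weight of the node
y = tf.matmul(x, w)                              # output of the node
y_true = tf.placeholder(tf.float32, [None, 5])   # one-hot speaker labels
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_true, logits=y))
grad_x, grad_w = tf.gradients(loss, [x, w])      # dl/dx and dl/dw, as in equation (7)
```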
Optimization
During the implementation of our project, we found that overfitting occurred while running the program. Overfitting means that the accuracy on the training set is very high while the accuracy on the test set is low. We use the following methods to solve this problem.
(1) Regularization
Regularization has two forms: L1 loss and L2 loss. L2 loss has a unique solution, while L1 loss does not have this feature. We finally use L2 loss to optimize because it can better distinguish which result is superior. The formulas for L1 loss and L2 loss are:

L1: S = Σ_{i=0}^{n} |Y_i - h(x_i)|   (8)

L2: S = Σ_{i=0}^{n} (Y_i - h(x_i))^2   (9)
L2 minimizes the sum of the squares of the differences between the target values and the estimated values. We use TensorFlow's function to make the L2 loss work. The function utilizes the L2 norm to calculate the tensor loss value.
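A minimal sketch of how such an L2 penalty can be added with TensorFlow's tf.nn.l2_loss is given below; the stand-in weight matrices and data loss, and the scale factor corresponding to the λ column of Table 1, are illustrative assumptions.

```python
# L2 regularization added to the data loss in TensorFlow 1.x style.
import tensorflow as tf

# Stand-ins for the fully connected layers' weights and the cross-entropy loss.
w_fc1 = tf.Variable(tf.truncated_normal([1024, 128], stddev=0.1))
w_fc2 = tf.Variable(tf.truncated_normal([128, 5], stddev=0.1))
data_loss = tf.placeholder(tf.float32)

lam = 5e-4                                                     # λ, as in Table 1
l2_penalty = lam * (tf.nn.l2_loss(w_fc1) + tf.nn.l2_loss(w_fc2))
total_loss = data_loss + l2_penalty
```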
(2) Dropout
Dropout randomly drops some nodes with a fixed probability at the fully connected layer during training. Dropping some nodes randomly helps avoid overfitting. We use TensorFlow's dropout function to drop nodes. The function takes a tensor x as input and outputs its value after applying the drop ratio.
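A minimal sketch with tf.nn.dropout is given below; the dropout rate of 0.95 from Table 1 is interpreted here as the keep probability, which is an assumption, and the placeholder stands in for the output of fully connected layer 1.

```python
# Dropout applied to the output of fully connected layer 1 in TensorFlow 1.x style.
import tensorflow as tf

fc1 = tf.placeholder(tf.float32, [None, 128])   # stands in for the FC layer 1 output
keep_prob = tf.placeholder(tf.float32)          # e.g. 0.95 for training, 1.0 for testing
fc1_drop = tf.nn.dropout(fc1, keep_prob)        # randomly zeroes nodes of fc1
```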
(3) Update
We considered three methods to update the loss and find the minimum.
1) The steepest descent method. We need to find a minimum value, so this method searches iteratively in the direction opposite to the gradient at the current point with a fixed step size.
2) Momentum. Momentum gives a direction that is combined with the steepest descent direction by the parallelogram law, and the final update direction follows that combination.
3) Adam. Adam is one of the fastest of these algorithms and can find the minimum effectively.
Finally, we use Adam to update the loss. Adam is more effective and more widely applicable than the other methods. For our thousands of data samples, Adam can optimize the problem directly.
Adam first calculates the gradient at time step t:

g_t = ∇_θ J(θ_{t-1})   (10)

Then it calculates the exponential moving average of the gradient:

m_t = β_1 · m_{t-1} + (1 - β_1) · g_t   (11)

Next, it calculates the exponential moving average of the square of the gradient:

v_t = β_2 · v_{t-1} + (1 - β_2) · g_t^2   (12)

Since m_0 and v_0 are initialized to 0, the moving average of the gradient and the moving average of the square of the gradient need to be bias-corrected:

m̂_t = m_t / (1 - β_1^t)   (13)

v̂_t = v_t / (1 - β_2^t)   (14)
Finally, Adam updates the parameter.
(4) Learning rate optimization
We hope the learning rate is neither too high nor too low. TensorFlow's decay function can control the learning rate. We use this function to make the learning rate rise first and then decay, so that it remains moderate. The learning rate decays once by the decay rate at a stable decay step. Our team uses the staircase decay mode, because its scale is larger and it is easier to reduce overfitting.
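A minimal sketch combining the staircase learning-rate decay with the Adam update in TensorFlow 1.x style is shown below; the base learning rate and decay rate follow Table 1, while the decay step of 1000 and the stand-in loss are illustrative assumptions.

```python
# Staircase exponential decay of the learning rate feeding an Adam update.
import tensorflow as tf

w = tf.Variable(tf.truncated_normal([1024, 5]))   # stands in for the trainable weights
loss = tf.nn.l2_loss(w)                           # stands in for the total loss

global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(
    learning_rate=0.001,    # base learning rate (Table 1)
    global_step=global_step,
    decay_steps=1000,       # assumed interval between decays
    decay_rate=0.3,         # decay rate (Table 1)
    staircase=True)         # staircase mode, as described above

train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)
```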
(5) Framework optimization
In the file 'dp_refines.py', the models are saved after training so that we can keep a record. We also use an API to simplify the code so that we can change or modify the parameters more easily.
Training and Testing
Firstly, before training, the files are converted into the .mat format with a ratio of training to testing data of 4:1. The amount of collected data is 8400x5x16 and the initial size of each data sample is defined as [32x32x1]. Then, to keep the program from running out of memory, all the data cannot be processed simultaneously, so we take the approach of batching the data: we define a variable 'batch size' as the size of each batch, and the computer only processes one batch of data at a time. We use 400 as the batch size of the testing set and 64 as the batch size of the training set. Therefore, the testing shape is [400x32x32x1] and the training shape is [64x32x32x1]. After being processed by the convolutional layers and max pooling layers described above, the size of the output data is changed to [8x8x16]. The loss generated during training is what we use to judge the average accuracy rate of training, and the value of the loss depends on the parameters such as the base learning rate (initial value 0.001). At the end of training, we adopt the Adam method to optimize the result and reduce the loss. Besides, the Softmax function gives the probability that each data sample belongs to each label. The process of testing is basically the same as training except for the data set. The accuracy of testing is also presented in the confusion matrix, which shows the accuracy rate of each label.
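The batched training and testing procedure described above can be sketched as follows. The placeholders and ops (x, y_true, keep_prob, total_loss, train_op, probs) are assumed to come from a single graph assembled from the previous sketches, and next_batch() is a hypothetical helper that returns one batch of features and one-hot labels; none of these names are prescribed by the invention.

```python
# Batched training (batch size 64) and testing (batch size 400) in TF 1.x style.
# next_batch(), train_x, train_y, test_x and test_y are hypothetical helpers/arrays.
import tensorflow as tf

accuracy = tf.reduce_mean(tf.cast(
    tf.equal(tf.argmax(probs, 1), tf.argmax(y_true, 1)), tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(20000):                         # iteration steps from Table 1
        bx, by = next_batch(train_x, train_y, 64)     # training batch [64, 32, 32, 1]
        sess.run(train_op, {x: bx, y_true: by, keep_prob: 0.95})

        if step % 1000 == 0:                          # periodic evaluation on the test set
            tx, ty = next_batch(test_x, test_y, 400)  # testing batch [400, 32, 32, 1]
            loss_val, acc = sess.run([total_loss, accuracy],
                                     {x: tx, y_true: ty, keep_prob: 1.0})
            print(step, loss_val, acc)
```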
Table 1 shows the results when we take different values of the parameters. We achieve the optimal recognition accuracy of 70.23% when the dropout rate is 0.95, the base learning rate is 0.001, the decay rate is 0.3, the number of iteration steps is 20000, and λ is 5.0E-04.
Table 1 Recognition Result
Dropout rate  Base learning rate  Decay rate  Iteration steps  λ  Average accuracy (%)  Standard deviation
0.95 0.001 0.3 23000 5.00E-04 67.97 2.23
0.98 0.001 0.3 23000 5.00E-04 66.84 2.3
0.92 0.001 0.3 23000 5.00E-04 66.47 2.26
0.95 0.0018 0.3 23000 5.00E-04 67.35 2.26
0.95 0.0002 0.3 23000 5.00E-04 64.37 2.31
0.95 0.001 0.36 23000 5.00E-04 66.36 2.29
0.95 0.001 0.26 23000 5.00E-04 66.54 2.43
0.95 0.001 0.2 23000 5.00E-04 69.17 2.38
0.95 0.001 0.18 23000 5.00E-04 67.49 2.28
0.95 0.001 0.3 26000 5.00E-04 66.86 2.41
0.95 0.001 0.3 22000 5.00E-04 65.1 2.27
0.95 0.001 0.3 20000 5.00E-04 70.23 2.13
0.95 0.001 0.3 19000 5.00E-04 64.87 2.41
0.95 0.001 0.3 23000 5.00E-04 67.97 2.23
0.95 0.001 0.3 23000 6.00E-04 67.39 1.8

Claims (3)

1. A speaker recognition system based on deep learning, wherein Mel Frequency Cepstral Coefficients are used to process the audio data into the shape 32x32x1; this approach makes the data processing in the subsequent convolutional network more efficient and faster.
2. The system according to claim 1, wherein the original data is randomly divided into a training set and a testing set proportionally, which ensures the rigor and reliability of the experimental results.
3. The system according to claim 1, wherein regularization and dropout are introduced to avoid overfitting; the regularization makes the neural network tend to learn smaller weights, and the dropout lessens the dependence between neurons, which makes the network more robust; by optimizing these parameters, the invention can effectively identify the speakers.

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019101222A AU2019101222A4 (en) 2019-10-05 2019-10-05 A Speaker Recognition System Based on Deep Learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2019101222A AU2019101222A4 (en) 2019-10-05 2019-10-05 A Speaker Recognition System Based on Deep Learning

Publications (1)

Publication Number Publication Date
AU2019101222A4 true AU2019101222A4 (en) 2020-01-16

Family

ID=69146726

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019101222A Ceased AU2019101222A4 (en) 2019-10-05 2019-10-05 A Speaker Recognition System Based on Deep Learning

Country Status (1)

Country Link
AU (1) AU2019101222A4 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113053398A (en) * 2021-03-11 2021-06-29 东风汽车集团股份有限公司 Speaker recognition system and method based on MFCC (Mel frequency cepstrum coefficient) and BP (Back propagation) neural network
CN113053398B (en) * 2021-03-11 2022-09-27 东风汽车集团股份有限公司 Speaker recognition system and method based on MFCC (Mel frequency cepstrum coefficient) and BP (Back propagation) neural network
CN113642385A (en) * 2021-07-01 2021-11-12 山东师范大学 Deep learning-based facial nevus identification method and system
CN113642385B (en) * 2021-07-01 2024-03-15 山东师范大学 Facial nevus recognition method and system based on deep learning


Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry